

Google Summer of Code 2018

Weeks 7 and 8

Monday, 9th July 2018.

  NNPACK is an acceleration package for neural network computations on multi-core CPUs. Over these two weeks, I mostly worked on improving the performance of Flux on the CPU using NNPACK, aiming to make ReLU, softmax, convolution, and maxpool faster. Fast CPU-based computation is something machine learning developers expect from a library. Often, a GPU is not available, either because it is costly or because the computation needs to run on a low-end device such as a phone. In such cases, being able to compute quickly on a CPU is extremely helpful.

  The approach was to build NNPACK as a shared object file and then call its functions from Julia using ccall. The first barrier was figuring out how to create the shared object file. Initially, I planned to write my own Makefile rather than modify the one provided by NNPACK, which I found too complicated. That approach was wrong for two reasons. First, I had only looked at a few scalar functions that were easy to compile; although these were faster than their Julia counterparts in NNlib, they were not using multiple cores, and additional dependencies had to be built in to enable that. Second, building those extra dependencies together would have been more complicated than studying and modifying the existing Makefile. Fortunately, my mentors found an easy way forward: a tweak to the cmake command line generated the shared object file.
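  For anyone trying to reproduce this, the gist is that once a libnnpack shared object exists, Julia can load and poke at it directly. Below is a minimal, hedged sketch: the library path is a placeholder, and the only entry point I am assuming is the standard nnp_initialize from nnpack.h.

```julia
using Libdl  # standard library for working with shared objects

# Placeholder path to the compiled NNPACK shared object
const libnnpack = "/path/to/libnnpack.so"

# dlopen fails loudly if the library or its dependencies are missing,
# which makes it a quick sanity check right after the build
Libdl.dlopen(libnnpack)

# nnp_initialize() must run once before any other NNPACK call;
# it returns an nnp_status code, where 0 means success
status = ccall((:nnp_initialize, libnnpack), Cint, ())
status == 0 || error("NNPACK initialisation failed with status $status")
```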

  Next, I called NNPACK's ReLU manually via ccall and benchmarked the results. The initial hurdle was mimicking a pointer to the C struct pthreadpool_t in Julia so that the ccall would succeed, which I found difficult. My mentor, Mike Innes, made an excellent suggestion: use a pointer to Void instead. Since all pointers occupy the same amount of memory, the pointee type doesn't matter, and this worked. NNPACK turned out to be extremely fast compared to NNlib, so I went ahead and integrated it with NNlib so that Flux would use it by default, and opened a pull request. Unfortunately, even after multiple attempts, two tests are still failing; the details are in the pull request. They are type-inference issues: NNPACK works only on Linux and Mac, and because a separate function checks at runtime whether the system supports NNPACK, the return type sometimes becomes ambiguous and Julia cannot infer it properly, making some tests fail.
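  To make the pointer trick concrete, here is roughly what the manual call looks like. This is a sketch rather than the code from my pull request: the wrapper name is mine, nnp_relu_output's signature comes from nnpack.h, and passing C_NULL for the threadpool runs NNPACK single-threaded.

```julia
# Hedged sketch of calling NNPACK's ReLU from Julia.
# C signature (nnpack.h):
#   nnp_status nnp_relu_output(size_t batch_size, size_t channels,
#                              const float* input, float* output,
#                              float negative_slope,
#                              pthreadpool_t threadpool);
function nnpack_relu(x::Matrix{Float32})  # (channels, batch) layout
    channels, batch = size(x)
    y = similar(x)
    # pthreadpool_t is an opaque C pointer, so Ptr{Cvoid}
    # (Ptr{Void} on Julia 0.6) stands in for it: every pointer
    # has the same size, and the struct layout never needs to
    # be mirrored in Julia. C_NULL means "no thread pool".
    status = ccall((:nnp_relu_output, libnnpack), Cint,
                   (Csize_t, Csize_t, Ptr{Float32}, Ptr{Float32},
                    Cfloat, Ptr{Cvoid}),
                   batch, channels, x, y, 0f0, C_NULL)
    status == 0 || error("nnp_relu_output failed with status $status")
    return y
end
```

  To actually exercise multiple cores, one would pass a real pool created with pthreadpool_create (from NNPACK's bundled pthreadpool dependency) instead of C_NULL.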

  Next, I worked on integrating convolution. Unfortunately, I could not even get it running manually, so I could not benchmark it. I got an error saying the input padding was not less than the kernel size, even though I had set the padding to zero everywhere, which left me confused. I moved on to maxpool, which likewise errored, claiming all the stride lengths were zero even though I had set them to 3 in each dimension. I still don't understand what went wrong. There are two things I can try next: first, instead of using ccall, I can call the shared object's functions from C and check whether that produces a similar error; otherwise, I can file an issue at NNPACK and ask for assistance.
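  For reference, the convolution and pooling entry points differ from ReLU in one important way: they take small structs (nnp_size, nnp_padding) by value. Below is a sketch of how maxpool might be wired up; the struct layouts are copied from nnpack.h, the wrapper itself is hypothetical, and it reuses the libnnpack constant from earlier. One plausible culprit for the confusing errors above is argument marshalling: if the Julia side does not mirror these structs exactly, NNPACK could read shifted garbage, which would look exactly like "padding too large" or "stride is zero".

```julia
# Tiny C structs passed *by value*, with layouts as in nnpack.h
struct NNPSize
    width::Csize_t
    height::Csize_t
end

struct NNPPadding
    top::Csize_t
    right::Csize_t
    bottom::Csize_t
    left::Csize_t
end

# Hypothetical wrapper; x is WHCN (width, height, channels, batch)
function nnpack_maxpool(x::Array{Float32,4}; k = (3, 3), stride = (3, 3))
    w, h, c, n = size(x)
    # zero padding, so the usual floor((in - k) / stride) + 1 output size
    ow, oh = div(w - k[1], stride[1]) + 1, div(h - k[2], stride[2]) + 1
    y = Array{Float32}(undef, ow, oh, c, n)
    status = ccall((:nnp_max_pooling_output, libnnpack), Cint,
                   (Csize_t, Csize_t, NNPSize, NNPPadding, NNPSize,
                    NNPSize, Ptr{Float32}, Ptr{Float32}, Ptr{Cvoid}),
                   n, c, NNPSize(w, h), NNPPadding(0, 0, 0, 0),
                   NNPSize(k...), NNPSize(stride...), x, y, C_NULL)
    status == 0 || error("nnp_max_pooling_output failed with status $status")
    return y
end
```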

  I also integrated softmax. I first benchmarked it manually and then integrated it with NNlib. This attempt was successful: all the tests passed, and I have sent a pull request.
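  The softmax call follows the same pattern as ReLU, and benchmarking it against NNlib's pure-Julia version takes one line each with BenchmarkTools. Again a hedged sketch, not the exact code from the pull request; nnp_softmax_output's signature comes from nnpack.h and the wrapper reuses the libnnpack constant from above.

```julia
using BenchmarkTools, NNlib

# Hypothetical wrapper around nnp_softmax_output
function nnpack_softmax(x::Matrix{Float32})  # (channels, batch) layout
    channels, batch = size(x)
    y = similar(x)
    status = ccall((:nnp_softmax_output, libnnpack), Cint,
                   (Csize_t, Csize_t, Ptr{Float32}, Ptr{Float32}, Ptr{Cvoid}),
                   batch, channels, x, y, C_NULL)
    status == 0 || error("nnp_softmax_output failed with status $status")
    return y
end

x = rand(Float32, 1000, 128)   # 1000 classes, batch of 128
@btime NNlib.softmax($x)       # pure-Julia baseline
@btime nnpack_softmax($x)      # NNPACK via ccall
```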

  To ship the shared object file, we plan to use BinaryBuilder for all supported platforms. Though NNPACK works on both Linux and Mac, BinaryBuilder works on Linux only, so we are attempting to support all Linux x86_64 systems that can run the Julia binary. We ran into another issue here: BinaryBuilder was unable to download and install the Python package "six" while running cmake. The issue is specific to BinaryBuilder, as it does not occur when cmake is run manually. An issue has been filed about this.
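  For context, a BinaryBuilder recipe is itself a Julia script. Below is a rough sketch of the shape such a build_tarballs.jl could take; the version number, commit hash, and the NNPACK_LIBRARY_TYPE cmake option are my assumptions here, not the exact script we ran, and the API shown is the modern BinaryBuilder one.

```julia
using BinaryBuilder

name = "NNPACK"
version = v"2018.7.1"  # placeholder snapshot version

# Placeholder commit hash; a real recipe pins an exact NNPACK revision
sources = [
    GitSource("https://github.com/Maratyszcza/NNPACK.git",
              "0000000000000000000000000000000000000000"),
]

# Build script executed inside BinaryBuilder's sandbox
script = raw"""
cd $WORKSPACE/srcdir/NNPACK
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=$prefix \
      -DCMAKE_TOOLCHAIN_FILE=${CMAKE_TARGET_TOOLCHAIN} \
      -DNNPACK_LIBRARY_TYPE=shared ..
make -j${nproc}
make install
"""

# Only Linux x86_64 targets that can run the Julia binary
platforms = [Platform("x86_64", "linux")]

products = [LibraryProduct("libnnpack", :libnnpack)]
dependencies = []

build_tarballs(ARGS, name, version, sources, script,
               platforms, products, dependencies)
```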

  Along with this, I drafted a blog post on the problems one might encounter with GPU computing via GPUArrays in Julia, together with some debugging techniques. I'll publish it after getting it reviewed. These two weeks, I took a break from my GPU-based tasks and worked on making CPU-based computations faster. It was a nice experience, and I hope my work will be beneficial to many.