Weeks 5 and 6
  These two weeks weren't too great if measured by the amount of progress made on the proposal. I ran into some issues optimising the GPU kernel for the gradient of repeat. GPUArrays features a special array type called JLArray. GPU kernels are pretty tough to debug because the error messages they generate are extremely generic. Moreover, since they are compiled, the errors cannot point to the exact line causing the issue; most of the time I was getting stuck at an LLVM IR error. This is where JLArray comes to the rescue: it mimics the kernel on the CPU and attempts to execute it there. The errors are no longer cryptic. Just like when debugging code on the CPU, one can see exactly where the error is occurring, along with a proper stacktrace. JLArray is supposed to be able to mimic about 90% of GPU behaviour. But trust me, the remaining 10% can take so long to debug that at times I wondered whether perhaps 90% of the work was still left to be done.
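  To make this concrete, here is a minimal sketch of the workflow, assuming a GPUArrays version that ships JLArray (in newer versions it lives in the separate JLArrays package). The operation itself is just a placeholder; the point is that the same code path can be exercised on the CPU to get a readable stacktrace before trying it on a real GPU.

```julia
using GPUArrays: JLArray    # CPU-backed array type that emulates GPU kernel execution
# using CuArrays            # swap in CuArray on a machine with an NVIDIA GPU

# Anything exercised here goes through the same GPUArrays kernel machinery,
# but runs on the CPU, so a failure produces an ordinary Julia stacktrace
# pointing at the offending line instead of a cryptic LLVM IR error.
x = JLArray(rand(Float32, 4, 4))
y = x .+ 1f0                # broadcast compiles down to an (emulated) GPU kernel
Array(y)                    # copy back to a plain Array to inspect the result
```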
  The feedforward repeat kernel initially worked with JLArrays but failed with CuArrays, the array type meant for NVIDIA GPUs (the name stands for CUDA Arrays). There was a type inference issue. Hence, I separated the dimensions out in such a manner that there was no longer any need for register memory to store a list. Though this overcame the problem, it restricted the input to a fixed maximum number of dimensions. Upon discussion with other members of the community, we decided to stick to a four-dimensional implementation, as sketched below.
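  The following is a rough sketch of the idea behind that restriction, not the actual kernel code; pad_to_4d is a hypothetical helper name I am using for illustration. Padding the repetition counts out to a fixed length keeps everything in an NTuple, which stays in registers and is type-stable, whereas a variable-length Vector would not be.

```julia
# A variable-length Vector defeats type inference on the GPU, while a
# fixed-size NTuple stays in registers. Padding with 1s makes every input
# look four-dimensional without changing the result of repeat.
pad_to_4d(counts::NTuple{N,Int}) where {N} =
    ntuple(i -> i <= N ? counts[i] : 1, Val(4))

pad_to_4d((2, 3))    # (2, 3, 1, 1)
```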
  But the trouble began with the kernel to be used for backpropagation, the one which was to compute the gradients. Again, it didn't take me much time to come up with a kernel that worked with JLArray. But, just like before, it refused to work with CuArray. This time it was not an LLVM issue: the output was simply incorrect. I tried a lot of debugging techniques. Upon consulting my mentor, he suggested coming up with a kernel in which no two threads would access the same memory location. I tried incorporating the suggestion, but the memory accesses were still getting corrupted: the same location was being written to again and again when running on the GPU, so the result remained erroneous. I haven't managed to get it working yet.
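  To show where the conflict comes from, here is a rough CPU-side sketch of what the gradient of repeat (with outer repetition counts) has to compute; the function name is mine, purely for illustration, and this is not the actual kernel. Every element of the incoming gradient folds back onto the source element it was copied from, so a naive GPU layout with one thread per output element has several threads writing to the same destination, which is exactly the corruption described above.

```julia
# Gradient of repeat with outer counts: accumulate each element of the output
# gradient Δ back into the source position it was copied from. On the GPU,
# one thread per element of Δ means several threads hit the same dx location,
# hence the need for atomic adds or a one-thread-per-input-element layout.
function ∇repeat_outer(Δ::AbstractArray, input_size::Dims)
    dx = zeros(eltype(Δ), input_size)
    for I in CartesianIndices(Δ)
        src = CartesianIndex(mod1.(Tuple(I), input_size))  # fold back into the input
        dx[src] += Δ[I]
    end
    return dx
end

Δ = ones(4, 6)               # gradient w.r.t. repeat(x, 2, 3) for x of size (2, 2)
∇repeat_outer(Δ, (2, 2))     # every entry is 6.0, one contribution per copy
```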
  Right now, I am working on moving the kernel to Julia 0.7. An initial attempt produced the old LLVM error again, but the errors are less cryptic on 0.7 and give hints as to what could have gone wrong. I am yet to look into this and find out exactly what is failing. I learnt a lot these two weeks, including the fact that debugging on the GPU can be really confusing at times.