Consider the various reduction optimizations taught in class, from reduction1 up to reduction5.

Since reduction5 performs only two global loads per thread, call it
reduction5.1. Let reduction5.2 be the variant that performs three global loads
per thread; in general, reduction5.n is the variant with (n+1) global loads
per thread.
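
To make the naming concrete, here is a minimal CPU-side sketch in Python of how the work per block grows with n. It assumes that each thread reads elements spaced one block width apart before the in-block reduction starts; that indexing, and the helper names `loads_per_thread`, `grid_size` and `block_partial_sums`, are illustrative assumptions and not taken from the class code.

```python
def loads_per_thread(n):
    """reduction5.n performs n + 1 global loads per thread (reduction5.1 -> 2)."""
    if n < 1:
        raise ValueError("n starts at 1 (reduction5.1)")
    return n + 1


def grid_size(num_elements, block_size, n):
    """Blocks needed when every block covers block_size * (n + 1) elements."""
    per_block = block_size * loads_per_thread(n)
    return (num_elements + per_block - 1) // per_block  # ceiling division


def block_partial_sums(data, block_size, n):
    """Emulate on the CPU the partial sum each block would produce.

    Assumed indexing: thread tid of a block reads
    base + tid + j * block_size for j = 0 .. n.
    """
    k = loads_per_thread(n)
    partials = []
    for block in range(grid_size(len(data), block_size, n)):
        base = block * block_size * k
        total = 0
        for tid in range(block_size):
            for j in range(k):
                idx = base + tid + j * block_size
                if idx < len(data):  # guard the tail of the input
                    total += data[idx]
        partials.append(total)
    return partials
```

Under this assumption, raising n shrinks the grid (fewer, busier blocks), which is the trade-off the reduction5.n variants are meant to expose.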

With the input data size varying from 2^16 to 2^32, measure the execution time
of the reduction operation for each reduction code from reduction0 to
reduction5.n (for n = 1 to 8).
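
The sweep described above can be enumerated with a short Python sketch. It assumes that only powers of two are sampled, that the variants are named reduction0 to reduction4 followed by reduction5.1 to reduction5.8, and that elements are 4 bytes wide; all three are assumptions, as are the helper names.

```python
def variant_names(max_n=8):
    """Assumed naming: reduction0 .. reduction4, then reduction5.1 .. reduction5.max_n."""
    base = [f"reduction{i}" for i in range(5)]
    multi_load = [f"reduction5.{n}" for n in range(1, max_n + 1)]
    return base + multi_load


def size_exponents(lo=16, hi=32):
    """Exponents e such that the input holds 2**e elements."""
    return list(range(lo, hi + 1))


def input_bytes(exponent, bytes_per_element=4):
    """Size of the input array in bytes, assuming 4-byte elements by default."""
    return (2 ** exponent) * bytes_per_element
```

With 4-byte elements, `input_bytes(32)` is 16 GiB, so the largest sizes may not fit in device memory on many GPUs. An element count of 2^32 also does not fit in a 32-bit unsigned integer, so sizes and indices need a 64-bit type at the top of the range.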

Submit an archive with:
1. All individual reduction codes (host + device in each case).
2. A script that automatically runs them and generates a graph similar to the
one on slide 45.
3. A Word document with interesting observations on the profiling data:
operation mix, memory vs. compute, the effect of the optimization at each step,
etc., in each case.
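
One possible shape for the script in item 2 is sketched below in Python. Everything about the interface is an assumption: `run_one` is a hypothetical callback that launches one reduction binary for a given input size and returns its time in milliseconds, and the plot axes are a guess, since slide 45 is not reproduced here. The plotting step assumes matplotlib is installed.

```python
import csv
import io


def collect_timings(variants, exponents, run_one):
    """Time every (variant, size) pair.

    run_one(variant, num_elements) -> elapsed milliseconds (hypothetical callback,
    e.g. a wrapper that runs the compiled binary and parses its printed time).
    """
    rows = []
    for variant in variants:
        for exp in exponents:
            elapsed_ms = run_one(variant, 2 ** exp)
            rows.append({"variant": variant, "log2_size": exp, "time_ms": elapsed_ms})
    return rows


def to_csv(rows):
    """Serialize the timings so the graph can be regenerated without rerunning."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["variant", "log2_size", "time_ms"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def plot_timings(rows, out_path="reduction_times.png"):
    """Draw one line per variant: log2(input size) against time on a log axis."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for variant in sorted({r["variant"] for r in rows}):
        series = sorted(
            (r["log2_size"], r["time_ms"]) for r in rows if r["variant"] == variant
        )
        ax.plot([x for x, _ in series], [y for _, y in series], marker="o", label=variant)
    ax.set_xlabel("log2(input size)")
    ax.set_ylabel("execution time (ms)")
    ax.set_yscale("log")
    ax.legend()
    fig.savefig(out_path)
```

Keeping measurement, storage and plotting separate means the slow GPU runs happen once, and the graph can be restyled from the saved CSV.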

Do not copy or use GPT (we shall check for both plagiarism and AI content).

Also consider reduction operations other than addition: min, max, average, and,
or, etc.
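
As a reference for checking GPU results with other operators, the sketch below shows one way to express them on the CPU in Python. Each pairwise operator is paired with an identity element, which is what out-of-range threads would contribute. Average is not itself a pairwise reduction, so it is computed here as a sum followed by a division. The table `REDUCTIONS`, the helper names, and the 32-bit mask used as the identity for "and" are assumptions made for illustration.

```python
from functools import reduce
import operator

# operator name -> (combine function, identity element)
REDUCTIONS = {
    "sum": (operator.add, 0),
    "min": (min, float("inf")),
    "max": (max, float("-inf")),
    "and": (operator.and_, 0xFFFFFFFF),  # assumes 32-bit unsigned inputs
    "or": (operator.or_, 0),
}


def reduce_with(name, values):
    """Fold values with the named operator, starting from its identity."""
    combine, identity = REDUCTIONS[name]
    return reduce(combine, values, identity)


def average(values):
    """Average as a sum reduction followed by one division."""
    values = list(values)
    if not values:
        raise ValueError("average of an empty input is undefined")
    return reduce_with("sum", values) / len(values)
```

For example, `reduce_with("min", [3, 1, 2])` folds to 1 and `average([1, 2, 3, 4])` gives 2.5; a GPU variant for each operator can be validated against these results on small inputs.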