Consider the reduction optimizations taught in class, from reduction1 up to reduction5. Since reduction5 performs only two global loads per thread, call it reduction5.1. Let reduction5.2 be the variant with three global loads per thread; in general, reduction5.n is the variant with (n+1) global loads per thread. With the input data size varying from 2^16 to 2^32, measure the execution time of the reduction operation for each reduction code from reduction0 to reduction5.n (for n = 1 to 8).

Submit an archive with:
1. All individual reduction codes (host + device in each case).
2. A script that automatically runs them and generates a graph similar to the one on slide 45.
3. A Word document with interesting observations on the profiling data: operation mix, memory vs. compute, the effect of the optimization at each step, etc., in each case.

Do not copy or use GPT (we shall check for both plagiarism and AI content).

Also consider reduction operations other than addition: min, max, average, and, or, etc.
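To make the reduction5.n definition concrete, here is a minimal host-side sketch in Python that models the access pattern rather than implementing a GPU kernel. It assumes one plausible reading of the assignment: each thread performs its (n+1) global loads at a stride of the block size, combines them with the chosen operator, and the block then finishes with a tree reduction over its per-thread values. The names `OPS`, `block_reduce`, `reduce_5n`, and `average`, and the default block size of 8, are hypothetical and chosen only for illustration; the block size is assumed to be a power of two.

```python
import operator

# Hypothetical table: operator name -> (binary function, identity element).
OPS = {
    "add": (operator.add, 0),
    "min": (min, float("inf")),
    "max": (max, float("-inf")),
    "and": (operator.and_, ~0),  # all bits set is the identity for bitwise AND
    "or": (operator.or_, 0),
}


def block_reduce(data, base, block_dim, n, op, identity):
    """Model one thread block: (n+1) strided loads per thread, then a tree reduction."""
    shared = []
    for tid in range(block_dim):
        acc = identity
        # Each modeled thread reads n+1 elements, block_dim apart.
        for k in range(n + 1):
            idx = base + tid + k * block_dim
            if idx < len(data):  # guard against reading past the input
                acc = op(acc, data[idx])
        shared.append(acc)

    # Tree reduction with sequential addressing over the per-thread values.
    stride = block_dim // 2
    while stride > 0:
        for tid in range(stride):
            shared[tid] = op(shared[tid], shared[tid + stride])
        stride //= 2
    return shared[0]


def reduce_5n(data, n, op_name, block_dim=8):
    """Reduce `data` with the modeled reduction5.n scheme; block_dim must be a power of two."""
    op, identity = OPS[op_name]
    per_block = block_dim * (n + 1)  # elements covered by one block
    partials = [
        block_reduce(data, base, block_dim, n, op, identity)
        for base in range(0, len(data), per_block)
    ]
    # Combine the per-block partial results on the host.
    result = identity
    for partial in partials:
        result = op(result, partial)
    return result


def average(data, n, block_dim=8):
    """Average is modeled as a sum reduction followed by one division."""
    return reduce_5n(data, n, "add", block_dim) / len(data)
```

A model like this can serve as a reference when checking the output of each GPU variant, since the result should not depend on n or on the block size. Note that every operator needs its own identity element for padding out-of-range loads, and that average is not itself associative, so it is expressed here as a sum followed by a division.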