Scalable Data Mining (CS60021)
Instructor: Sourangshu Bhattacharya
Teaching Assistants: TBA
Class Schedule: Monday (8:00-9:55), Tuesday (12:00-12:55), Saturday (9:00-10:30)
Classroom: Not applicable
Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs600212019a.html
First Meeting: Tuesday, 16 July 2019, 12:00 pm
Content:
Syllabus:
In this course, we discuss algorithmic techniques
as well as software paradigms that allow one to
write scalable algorithms for common data mining tasks.
Software paradigms:
Big Data Processing: Motivation and Fundamentals.
MapReduce framework. Functional programming and Scala.
Programming using the MapReduce paradigm. Case studies: Finding
similar items, PageRank, Matrix factorization.
TensorFlow / PyTorch: Motivation, Computation
graphs, Tensors, Example programs.
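As a rough illustration of the MapReduce programming model listed above, the following single-machine sketch counts words using explicit map, shuffle, and reduce steps. It is not Hadoop or Spark code; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data mining"])
print(counts)
```

In a real framework the map and reduce calls run in parallel on different machines and the shuffle moves data across the network; the sketch only mirrors the data flow.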
Algorithmic techniques:
Dimensionality reduction: Random
projections, Johnson-Lindenstrauss lemma, JL transforms,
sparse JL transform.
Finding similar items: Shingles, Minhashing, Locality
Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom
filtering, Count-based sketches: FM sketch, AMS
sketch. Hash-based sketches: Count sketch.
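To make the "Finding similar items" topic above concrete, here is a minimal sketch of shingling and minhashing that estimates the Jaccard similarity of two sets. The seeded hash built on BLAKE2b, the helper names, and the choice of 256 hash functions are assumptions made for this illustration, not part of the course material.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def seeded_hash(item, seed):
    """Hypothetical seeded hash: a 64-bit integer from BLAKE2b."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """One minimum hash value per seeded hash function."""
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = shingles("scalable data mining")
doc_b = shingles("scalable data streaming")
exact = len(doc_a & doc_b) / len(doc_a | doc_b)
approx = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"exact Jaccard = {exact:.3f}, minhash estimate = {approx:.3f}")
```

The estimate is unbiased, and its error shrinks roughly as one over the square root of the number of hash functions, which is why short signatures can stand in for large sets.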
Optimization and Machine learning algorithms:
Optimization algorithms: Stochastic gradient descent,
Variance reduction, Momentum algorithms, ADAM.
Algorithms for distributed optimization: Distributed
stochastic gradient descent and related methods. ADMM and
decomposition methods.
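As a small illustration of stochastic gradient descent from the list above, the following sketch fits a one-dimensional linear model by updating on one sample at a time. The synthetic data (y = 2x + 1 without noise), the learning rate, and the function name sgd_linear_regression are assumptions for this example only.

```python
import random


def sgd_linear_regression(data, lr=0.1, epochs=50, seed=0):
    """Fit y ~ w * x + b with plain SGD on the squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for x, y in data:
            error = (w * x + b) - y  # derivative of 0.5 * error**2 w.r.t. the prediction
            w -= lr * error * x      # gradient step for the slope
            b -= lr * error          # gradient step for the intercept
    return w, b


# Synthetic, noise-free data generated from the assumed model y = 2x + 1.
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
samples = [(x, 2.0 * x + 1.0) for x in xs]

w, b = sgd_linear_regression(samples)
print(f"w = {w:.4f}, b = {b:.4f}")
```

Variance reduction, momentum, and ADAM modify how this per-sample gradient is turned into an update; the loop structure stays the same.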
Textbooks:
Mining of Massive Datasets. 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
Recent literature.
Course Material:
Date | Day | Topics | Video / Slides / Notes | Practice Problems
01/09/20 | Tuesday | Introduction | Slides |
07/09/20 | Monday | Hadoop, MapReduce | Slides | Practice Problems, Quiz Solutions
08/09/20 | Tuesday | HDFS, Hadoop system | |
14/09/20 | Monday | Spark, RDD, Transformations, Actions | Slides | Practice Problems
15/09/20 | Tuesday | Spark runtime system | |
21/09/20 | Monday | DL framework: PyTorch/TF, devices, Variables, differentiation | Slides |
22/09/20 | Tuesday | Discussion on Frameworks | |
28/09/20 | Monday | Streaming Algorithms, Reservoir Sampling, Bloom filters | Slides |
29/09/20 | Tuesday | Test on Spark and DL framework | |
05/10/20 | Monday | Count distinct, FM sketch | Slides |
06/10/20 | Tuesday | KMV sketch | |
12/10/20 | Monday | Frequency Count, Misra-Gries, Space saving | Slides |
13/10/20 | Tuesday | Count-min sketch, Count sketch | Slides |
19/10/20 | Monday | Locality sensitive hashing: Shingles, Minhash | Slides |
20/10/20 | Tuesday | Test on Streaming Algorithms | |
26/10/20 | Monday | | |
27/10/20 | Tuesday | | |
02/11/20 | Monday | Generalization: Gap LSH, Multiprobe LSH | Slides: LSH Generalization; Slides: Multiprobe LSH |
03/11/20 | Tuesday | Large Scale ML, Stochastic Optimization, SGD | Slides |
09/11/20 | Monday | Distributed Optimization, ADMM | Slides |
10/11/20 | Tuesday | Batch Normalization | Slides |
16/11/20 | Monday | Test on LSH and LSML | |
17/11/20 | Tuesday | Wrap-up and discussion | |
Assignments: