Sourangshu Bhattacharya's Homepage

Scalable Data Mining (CS60021)

Instructor: Sourangshu Bhattacharya

Teaching Assistants: Soumi Das, Kiran Purohit

Class Schedule: Monday (8:00 - 9:55), Tuesday (12:00 - 12:55)

Classroom: MS Teams class "Scalable Data Mining 2021"

Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2021a.html

First Meeting: Tuesday, 10 August 2021, 12:00 pm

Term Project:

Term project topics: Topics

Term project allocation to the teams: Teamwise Allocation

Date of first Term Project Presentation: 25th September, 9 a.m. onwards.

Date of second Term Project Presentation: ~~6th~~ 13th November, 9 a.m. onwards.

Date of Third test changed to 15th November.

Lecture Schedule:

Week	Dates	Topic / Activity	Links / Material
Week 1	10/8	Introduction	Slides
Week 2	16/8, 17/8	Hadoop: Mapreduce, HDFS, Mapreduce system	Slides
Week 3	23/8, 24/8	Spark: Scala, RDDs, Programming, System	Slides Assignment 1
Week 4	30/8, 31/8	Deep learning frameworks: Tensorflow, Pytorch, ONNX.	Slides Assignment 2
Week 5	6/9, 7/9	Optimization in ML. Test 1 on 7/9
Week 6	13/9, 14/9	Stochastic Gradient Descent, Acceleration methods: Momentum, Nesterov, Adagrad, ADAM.	Slides Article on Accelerated SGD Review article with SAGA, SVRG
Week 7	20/9, 21/9	Distributed ML: Distributed GD, ADMM	Slides Review Article
Week 8	27/9, 28/9	Large Scale ML wrap up	Practice Problems
Week 9	4/10, 5/10	Streaming Algorithms, reservoir sampling, Bloom Filter, Cuckoo Filter. Test 2 on 5/10	Slides
Week 10	11/10	Counting distinct items: Flajolet Martin Sketch, k-min value sketch	Slides
Week 11	18/10, 19/10	Counting frequency of items: Misra Gries, Space saving, Count-min sketch	Slides
Week 12	25/10, 26/10	Count-sketch. Locality sensitive hashing: shingles, Minhash.	Slides (LSH)
Week 13	1/11, 2/11	LSH: Gap LSH formulation, Multi-probe LSH	Slides (multi-probe LSH)
Week 14	8/11, 9/11	~~Test 3 on 9 / 11~~ LSH Discussion.	Slides (Learning LSH) Practice Questions
Week 15	15/11, 16/11	Test 3 on 15/11. Discussion and Wrap-up

Syllabus:

In this course, we discuss algorithmic techniques and software paradigms that allow one to write scalable algorithms for Machine Learning and Data Mining tasks.

Software paradigms:
Big Data Processing: Motivation and Fundamentals. Map-reduce framework. Functional programming and Scala. Programming using map-reduce paradigm. Case studies: Finding similar items, Page rank, Matrix factorization.
Deep Learning Frameworks (Tensorflow / Pytorch): Motivation, Computation graphs, Tensors, Example programs.
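
As an illustration of the map-reduce paradigm listed above, the following is a minimal single-machine sketch of word counting in plain Python. It does not use Hadoop or Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical and chosen only for this example. In a real framework the shuffle step is done by the system across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as a framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one input split.
docs = ["big data needs big clusters", "data mining at scale"]
counts = word_count(docs)
print(counts)
```

The mapper and reducer never share state, which is what lets a framework run many copies of each in parallel.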

Optimization and Machine learning algorithms:
Optimization algorithms: Stochastic gradient descent, Variance reduction, Momentum algorithms, ADAM.
Algorithms for distributed optimization: Distributed stochastic gradient descent and related methods. ADMM and decomposition methods.
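
To make the optimization topics above concrete, here is a small sketch of stochastic gradient descent with heavy-ball momentum, fitting a line to noise-free data in plain Python. The function name sgd_momentum, the step size, the momentum coefficient and the synthetic data are all assumptions made for this example, not values prescribed by the course.

```python
import random


def sgd_momentum(data, lr=0.02, beta=0.9, epochs=200, seed=0):
    """Fit y = w * x + b by SGD with heavy-ball momentum on squared loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0    # model parameters
    vw, vb = 0.0, 0.0  # momentum (velocity) terms
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the samples in a fresh random order
        for x, y in samples:
            err = (w * x + b) - y      # residual on a single sample
            gw, gb = err * x, err      # gradient of 0.5 * err**2
            vw = beta * vw + gw        # accumulate past gradients
            vb = beta * vb + gb
            w -= lr * vw
            b -= lr * vb
    return w, b


# Synthetic data from the line y = 3x + 1 (an assumption for illustration).
xs = [-1.0 + 2.0 * i / 49 for i in range(50)]
data = [(x, 3.0 * x + 1.0) for x in xs]
w, b = sgd_momentum(data)
print(w, b)
```

Setting beta to 0 recovers plain SGD, which makes it easy to compare the two on the same data.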

Algorithmic techniques:
Dimensionality reduction: Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count-based sketches: FM sketch, AMS sketch. Hash-based sketches: count sketch.
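
As one example from the stream processing topics above, the following is a minimal Bloom filter sketch in Python. The class name BloomFilter, the default sizes, and the use of salted SHA-256 digests as hash functions are assumptions made for illustration; practical implementations usually pack bits tightly and use faster hashes.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        """Yield num_hashes bit positions derived from salted SHA-256."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position that the item hashes to."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        """Report membership only if all of the item's bits are set."""
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for word in ["hadoop", "spark", "minhash"]:
    bf.add(word)

print("spark" in bf)  # True: an inserted item is never reported missing
```

A query for an item that was never added usually returns False, but it can return True when other items happen to have set the same bits; more bits per stored item lowers that false positive rate.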

References:

Mining of Massive Datasets, 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
Recent literature.