Scalable Data Mining (CS60021)

Instructors: Sourangshu Bhattacharya, Pabitra Mitra

Teaching Assistants: Kiran Purohit, Shubhadip Nag

Class Schedule: Monday (10:00 - 10:55), Wednesday (8:00 - 9:55)

Classroom: CSE - 119

Last year's course website: https://panuragreddy.github.io/SDM_2023/

Announcements:

A Moodle course has been created. Name: CS60021: Scalable Data Mining, URL: https://moodlecse.iitkgp.ac.in/moodle/. The student registration key has been shared by email.

Course Schedule:

Week	Dates	Topic / Activity	Links / Material
Week 1	22/7, 24/7	Introduction to ML and DL, Stochastic Gradient Descent	Slides
Week 2	29/7, 31/7	SGD convergence, Accelerated SGD	Slides - SGD Convergence, Accelerated SGD
Week 3	5/8, 7/8	Convergence rate SGD, Linear-rate SGD methods, Batch-normalization	Slides - SGD Convergence rate, Slides - Batch-normalization
Week 4	12/8, 14/8	ADMM for distributed loss minimization	Slides - ADMM
Week 5+6	19/8, 21/8, 26/8, 29/8	Hadoop + Spark	Slides - Hadoop, Spark
Week 7	2/9, 4/9	DL frameworks	Slides - PyTorch
Week 8	9/9, 11/9	Subset Selection	Slides - Submodular Functions, Sparse Approximation, Convex Online
Week 9	14/10, 16/11	Nearest Neighbor Search	Slides - LSH, HNSW

Syllabus:

Optimization and Machine learning algorithms:

Optimization algorithms: Stochastic gradient descent, Variance reduction, Momentum algorithms, ADAM.
Algorithms for distributed optimization: Distributed stochastic gradient descent and related methods. ADMM and decomposition methods.
(New) Federated Learning.

Software paradigms:

Big Data Processing: Motivation and Fundamentals, Map-reduce framework, Functional programming and Scala programming using the map-reduce paradigm, Example programs.
Deep Learning Frameworks (PyTorch): Motivation, Computation graphs, Tensors, Autograd, Modules, Example programs.

Algorithmic techniques:

Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families, FAISS.
Stream processing: Motivation, Sampling, Bloom filtering, Count-based sketches (FM sketch, AMS sketch), Hash-based sketches (count sketch).
Subset Selection Methods: Submodular Optimization, Sparse Approximation, Convex Optimization.

References:

Mining of Massive Datasets. 2nd edition. - Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
Recent literature.