Scalable Data Mining (CS60021)
Instructor: Sourangshu Bhattacharya
Teaching Assistants: Soumi Das, Kiran Purohit
Class Schedule: Monday (8:00 - 9:55), Tuesday (12:00 - 12:55)
Classroom: MS Teams class "Scalable Data Mining 2021"
Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2021a.html
First Meeting: Tuesday, 10 August 2021, 12:00 pm
Content:
Term Project:
Term project topics: Topics
Term project allocation to the teams: Teamwise Allocation
Date of first Term Project Presentation: 25th Sept., 9 a.m. onwards.
Date of second Term Project Presentation:
Date of Third test changed to 15th November.
Lecture Schedule:
Week | Dates | Topic / Activity | Links / Material |
Week 1 | 10/8 | Introduction | Slides |
Week 2 | 16/8, 17/8 | Hadoop: MapReduce, HDFS, MapReduce system | Slides |
Week 3 | 23/8, 24/8 | Spark: Scala, RDDs, Programming, System | Slides, Assignment 1 |
Week 4 | 30/8, 31/8 | Deep learning frameworks: TensorFlow, PyTorch, ONNX | Slides, Assignment 2 |
Week 5 | 6/9, 7/9 | Optimization in ML; Test 1 on 7/9 | |
Week 6 | 13/9, 14/9 | Stochastic Gradient Descent; Acceleration methods: Momentum, Nesterov, Adagrad, ADAM | Slides, Article on Accelerated SGD, Review article with SAGA, SVRG |
Week 7 | 20/9, 21/9 | Distributed ML: Distributed GD, ADMM | Slides, Review Article |
Week 8 | 27/9, 28/9 | Large Scale ML wrap up | Practice Problems |
Week 9 | 4/10, 5/10 | Streaming Algorithms, reservoir sampling, Bloom Filter, Cuckoo Filter; Test 2 on 5/10 | Slides |
Week 10 | 11/10 | Counting distinct items: Flajolet-Martin Sketch, k-min value sketch | Slides |
Week 11 | 18/10, 19/10 | Counting frequency of items: Misra-Gries, Space saving, Count-min sketch | Slides |
Week 12 | 25/10, 26/10 | Count-sketch; Locality sensitive hashing: shingles, Minhash | Slides (LSH) |
Week 13 | 1/11, 2/11 | LSH: Gap LSH formulation, Multi-probe LSH | Slides (multi-probe LSH) |
Week 14 | 8/11, 9/11 | Slides (Learning LSH), Practice Questions | |
Week 15 | 15/11, 16/11 | Test 3 on 15/11; Discussion and Wrap-up | |
Syllabus:
In this course, we discuss algorithmic techniques
as well as software paradigms that allow one to
write scalable algorithms for machine learning and data
mining tasks.
Software paradigms:
Big Data Processing: Motivation and Fundamentals.
Map-reduce framework. Functional programming and Scala.
Programming using the map-reduce paradigm. Case studies: Finding
similar items, PageRank, Matrix factorization.
Deep Learning Frameworks (TensorFlow / PyTorch): Motivation,
Computation graphs, Tensors, Example programs.
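As an informal illustration of the map-reduce paradigm listed above, here is a minimal single-machine sketch of word counting in plain Python. It does not use the Hadoop or Spark APIs; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big ideas", "data mining at scale"])
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; here all three steps are ordinary function calls, which is enough to show the shape of the computation.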
Optimization and Machine learning algorithms:
Optimization algorithms: Stochastic gradient descent,
Variance reduction, Momentum algorithms, ADAM.
Algorithms for distributed optimization: Distributed
stochastic gradient descent and related methods. ADMM and
decomposition methods.
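To make the optimization topics above more concrete, the following is a minimal sketch of stochastic gradient descent with (heavy-ball) momentum, fitting a one-dimensional linear model in plain Python. The toy data, the function name sgd_momentum, and the hyperparameter values are assumptions made for illustration, not course material.

```python
import random


def sgd_momentum(data, lr=0.01, beta=0.9, epochs=200, seed=0):
    """Fit y = w * x + b by SGD with momentum on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # model parameters
    vw, vb = 0.0, 0.0    # velocity (running sum of past gradients)
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)           # visit the samples in random order
        for x, y in samples:
            err = (w * x + b) - y      # residual on a single sample
            gw, gb = err * x, err      # gradient of 0.5 * err**2
            vw = beta * vw + gw        # accumulate velocity
            vb = beta * vb + gb
            w -= lr * vw               # step along the velocity
            b -= lr * vb
    return w, b


# Toy noise-free data generated from y = 3x + 1 (an assumption for the demo).
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(50)]
data = [(x, 3.0 * x + 1.0) for x in xs]
w, b = sgd_momentum(data)
```

Setting beta to 0 recovers plain SGD; larger values of beta average gradients over more past steps, which is the basic idea the accelerated variants build on.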
Algorithmic techniques:
Dimensionality reduction: Random
projections, Johnson-Lindenstrauss lemma, JL transforms,
sparse JL-transform.
Finding similar items: Shingles, Minhashing, Locality
Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom
filtering, Count-based sketches: FM sketch, AMS
sketch. Hash-based sketches: count sketch.
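Of the stream-processing techniques listed above, Bloom filtering is perhaps the easiest to sketch in a few lines. The version below is a minimal illustration in plain Python; the class name, the default sizes, and the choice of salted SHA-256 as the hash family are assumptions for this example only.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        """Derive num_hashes bit positions by salting one hash function."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position associated with the item."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """Return False only if the item was definitely never added."""
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for url in ["a.example/1", "a.example/2"]:
    seen.add(url)
```

A query for an item that was added always returns True, while a query for an unseen item returns True only with small probability, which depends on the number of bits, the number of hashes, and how many items have been inserted.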
References:
- Mining of Massive Datasets. 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
- TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
- Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
- Recent literature.