Scalable Data Mining (CS60021)
Instructor: Sourangshu Bhattacharya
Teaching Assistants: TBA
Class Schedule: Monday (8:00 - 9:55), Tuesday (12:00 - 12:55), Saturday (9:00 - 10:30)
Classroom: Not applicable
Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2019a.html
First Meeting: Tuesday, 16 July 2019, 12:00 pm
Content:
Syllabus:
In this course, we discuss algorithmic techniques
as well as software paradigms that allow one to
write scalable algorithms for common data mining tasks.
Software paradigms:
Big Data Processing: Motivation and Fundamentals.
Map-reduce framework. Functional programming and Scala.
Programming using the map-reduce paradigm. Case studies: Finding
similar items, PageRank, Matrix factorization.
TensorFlow / PyTorch: Motivation, Computation
graphs, Tensors, Example programs.
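As an illustration of the map-reduce paradigm listed above, here is a minimal single-machine sketch of word counting in plain Python. It is not taken from the course material: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the in-memory shuffle only stands in for the grouping step that a framework such as Hadoop performs across machines.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: combine all counts emitted for one word."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = word_count(["big data big ideas", "data mining"])
print(counts)
```

Because the mapper and reducer are pure functions of their inputs, each can in principle be run independently on different partitions of the data, which is what makes the paradigm scale.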
Algorithmic techniques:
Dimensionality reduction: Random
projections, Johnson-Lindenstrauss lemma, JL transforms,
sparse JL-transform.
Finding similar items: Shingles, Minhashing, Locality
Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom
filtering, Count based sketches: FM sketch, AMS
sketch. Hash based sketches: count sketch.
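To make the "Finding similar items" topic concrete, the following is a minimal sketch of shingling and minhashing in Python. It assumes character 3-shingles and simulates a family of hash functions by salting SHA-1 with a seed; both choices, and all function names, are illustrative assumptions rather than the course's own implementation. The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying shingle sets.

```python
import hashlib

def shingles(text, k=3):
    """Set of character k-shingles; the text must have at least k characters."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def seeded_hash(shingle, seed):
    """One member of a simulated hash family: SHA-1 salted with a seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=200):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(seeded_hash(s, seed) for s in items)
            for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)

set_a = shingles("scalable data mining")
set_b = shingles("scalable data mining course")
sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
print(exact_jaccard(set_a, set_b), estimate_jaccard(sig_a, sig_b))
```

The signature length (200 here) trades memory for accuracy: the estimate's standard error shrinks roughly as one over the square root of the number of hash functions.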
Optimization and Machine learning algorithms:
Optimization algorithms: Stochastic gradient descent,
Variance reduction, Momentum algorithms, ADAM.
Algorithms for distributed optimization: Distributed
stochastic gradient descent and related methods. ADMM and
decomposition methods.
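As a small illustration of the optimization topics above, here is a sketch of stochastic gradient descent with (heavy-ball) momentum on a one-dimensional least-squares problem. The synthetic data (y = 3x without noise), the step size, the momentum coefficient and the function name sgd_momentum are all assumptions chosen for the example, not values from the course.

```python
import random

def sgd_momentum(data, lr=0.05, beta=0.9, steps=5000, seed=0):
    """Minimise the mean of (w*x - y)^2 over data with SGD plus momentum."""
    rng = random.Random(seed)
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)                # sample one training point
        grad = 2.0 * (w * x - y) * x           # gradient of (w*x - y)^2 w.r.t. w
        velocity = beta * velocity + grad      # accumulate past gradients
        w -= lr * velocity                     # step along the accumulated direction
    return w

# Synthetic data generated from the (assumed) true parameter w = 3.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 3.0 * x) for x in xs]

w_hat = sgd_momentum(data)
print(w_hat)
```

Setting beta to 0 recovers plain SGD; each update uses a single sampled point, so its cost does not depend on the size of the data set.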
Textbooks:
- Mining of Massive Datasets. 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
- TensorFlow for Machine Intelligence: A Hands-On
Introduction to Learning Algorithms. Sam Abrahams et al.
Bleeding Edge Press.
- Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
- Recent literature.
Course Material:
Date | Day | Topics | Video / Slides / Notes | Practice Problems
01/09/20 | Tuesday | Introduction | Slides |
07/09/20 | Monday | Hadoop, Map-reduce | Slides | Practice Problems, Quiz Solutions
08/09/20 | Tuesday | HDFS, Hadoop system | |
14/09/20 | Monday | Spark, RDD, Transformations, Actions | Slides | Practice Problems
15/09/20 | Tuesday | Spark runtime system | |
21/09/20 | Monday | DL framework: Pytorch/TF, devices, Variables, differentiation | Slides |
22/09/20 | Tuesday | Discussion on Frameworks | |
28/09/20 | Monday | Streaming Algorithms, Reservoir Sampling, Bloom filters | Slides |
29/09/20 | Tuesday | Test on Spark and DL framework | |
05/10/20 | Monday | Count distinct, FM sketch | Slides |
06/10/20 | Tuesday | K-MV sketch | |
12/10/20 | Monday | Frequency Count, Misra-Gries, Space saving | Slides |
13/10/20 | Tuesday | Count-min sketch, Count sketch | Slides |
19/10/20 | Monday | Locality sensitive hashing: Shingles, Minhash | Slides |
20/10/20 | Tuesday | Test on Streaming Algorithms | |
26/10/20 | Monday | | |
27/10/20 | Tuesday | | |
02/11/20 | Monday | Generalization: Gap LSH, Multi-probe LSH | Slides: LSH Generalization, Slides: multi-probe LSH |
03/11/20 | Tuesday | Large Scale ML, Stochastic Optimization, SGD | Slides |
09/11/20 | Monday | Distributed Optimization, ADMM | Slides |
10/11/20 | Tuesday | Batch Normalization | Slides |
16/11/20 | Monday | Test on LSH and LSML | |
17/11/20 | Tuesday | Wrap-up and discussion | |
Assignments: