Sourangshu Bhattacharya's Homepage

Scalable Data Mining (CS60021)

Instructor: Sourangshu Bhattacharya

Teaching Assistants: TBA

Class Schedule: Monday (8:00 - 9:55), Tuesday (12:00 - 12:55), Saturday (9:00 - 10:30)

Classroom: Not applicable

Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2019a.html

First Meeting: Tuesday, 16 July 2019, 12:00 pm

Syllabus:

In this course, we discuss algorithmic techniques as well as software paradigms that allow one to write scalable algorithms for common data mining tasks.

Software paradigms:
Big Data Processing: Motivation and Fundamentals. Map-reduce framework. Functional programming and Scala. Programming using the map-reduce paradigm. Case studies: Finding similar items, PageRank, Matrix factorization.
TensorFlow / PyTorch: Motivation, Computation graphs, Tensors, Example programs.
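
As an illustration of the map-reduce paradigm listed above, the following is a minimal single-machine sketch of word count in plain Python. It is not course material and does not use Hadoop or Spark; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example. A real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, mimicking the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all counts emitted for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, word_count(["a b a", "b c"]) counts "a" twice, "b" twice, and "c" once.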

Algorithmic techniques:
Dimensionality reduction: Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count-based sketches: FM sketch, AMS sketch. Hash-based sketches: Count sketch.
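
To make one of the stream-processing topics above concrete, here is a small sketch of a Bloom filter in Python. It is an illustrative assumption, not the course's reference implementation: the class name, the default sizes (1024 bits, 3 hash functions), and the use of seeded SHA-256 digests as hash functions are all choices made for this example. The filter can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Approximate set membership with a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        """Derive num_hashes array positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every position that the item hashes to."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """Return False if the item was definitely never added, else True."""
        return all(self.bits[pos] for pos in self._positions(item))
```

After bf = BloomFilter() and bf.add("spark"), the call bf.might_contain("spark") returns True, while an empty filter returns False for any item.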

Optimization and Machine learning algorithms:
Optimization algorithms: Stochastic gradient descent, Variance reduction, Momentum algorithms, Adam.
Algorithms for distributed optimization: Distributed stochastic gradient descent and related methods. ADMM and decomposition methods.
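
As a small illustration of stochastic gradient descent from the list above, the sketch below fits a one-dimensional linear model y ≈ w·x + b by taking one gradient step per training example. The function name, learning rate, epoch count, and seed are assumptions made for this example. It shows plain single-machine SGD only, without momentum, variance reduction, or distribution.

```python
import random


def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by SGD on squared error, one example per step."""
    rng = random.Random(seed)
    examples = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)  # visit the examples in a fresh random order
        for x, y in examples:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * x
            b -= lr * error
    return w, b
```

On noise-free data generated from y = 2x + 1, for instance [(x, 2 * x + 1) for x in range(5)], the returned (w, b) should approach (2, 1).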

Textbooks:

Mining of Massive Datasets, 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
Recent literature.

Course Material:

Date	Day	Topics	Video / Slides / Notes	Practice Problems
01/09/20	Tuesday	Introduction	Slides
07/09/20	Monday	Hadoop, Map-reduce	Slides	Practice Problems, Quiz Solutions
08/09/20	Tuesday	HDFS, Hadoop system
14/09/20	Monday	Spark, RDD, Transformations, Action	Slides	Practice Problems
15/09/20	Tuesday	Spark runtime system
21/09/20	Monday	DL framework: PyTorch/TF, devices, Variables, differentiation	Slides
22/09/20	Tuesday	Discussion on Frameworks
28/09/20	Monday	Streaming Algorithms, Reservoir Sampling, Bloom filters	Slides
29/09/20	Tuesday	Test on Spark and DL framework
05/10/20	Monday	Count distinct, FM sketch	Slides
06/10/20	Tuesday	K-MV sketch
12/10/20	Monday	Frequency Count, Misra-Gries, Space saving	Slides
13/10/20	Tuesday	Count-min sketch, Count sketch	Slides
19/10/20	Monday	Locality sensitive hashing: Shingles, Minhash	Slides
20/10/20	Tuesday	Test on Streaming Algorithms
26/10/20	Monday
27/10/20	Tuesday
02/11/20	Monday	Generalization: Gap LSH, Multi-probe LSH	Slides: LSH Generalization; Slides: Multi-probe LSH
03/11/20	Tuesday	Large Scale ML, Stochastic Optimization, SGD	Slides
09/11/20	Monday	Distributed Optimization, ADMM	Slides
10/11/20	Tuesday	Batch Normalization	Slides
16/11/20	Monday	Test on LSH and LSML
17/11/20	Tuesday	Wrapup and discussion
