Scalable Data Mining (CS60021)

Instructor: Sourangshu Bhattacharya

Teaching Assistants: TBA

Class Schedule: Monday (8:00 - 9:55), Tuesday (12:00 - 12:55), Saturday (9:00 - 10:30)

Classroom: Not applicable

Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2019a.html

First Meeting: Tuesday, 16 July 2019, 12:00 pm


Content:


Syllabus:

This course covers algorithmic techniques and software paradigms that allow one to write scalable algorithms for common data mining tasks.

Software paradigms:
Big Data Processing: Motivation and fundamentals. Map-reduce framework. Functional programming and Scala. Programming using the map-reduce paradigm. Case studies: Finding similar items, PageRank, Matrix factorization.
TensorFlow / PyTorch: Motivation, Computation graphs, Tensors, Example programs.
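As an illustration of the map-reduce paradigm listed above, the following is a minimal single-machine sketch of word counting split into map, shuffle, and reduce phases. It is not course material and does not use Hadoop or Spark; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence on an in-memory list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three tiny "documents".
counts = word_count(["big data mining", "scalable data mining", "Data streams"])
```

In a real cluster framework the map and reduce calls would run in parallel on different machines and the shuffle would move data over the network; here all three run in one process to show the data flow only.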

Algorithmic techniques:
Dimensionality reduction: Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count-based sketches: FM sketch, AMS sketch. Hash-based sketches: count sketch.
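To give a flavour of the stream-processing topics listed above, here is a minimal Bloom filter sketch. It is an illustration under stated assumptions, not the course's reference implementation: the class name BloomFilter, the parameters num_bits and num_hashes, and the choice of seeded SHA-256 digests as hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        """Derive num_hashes bit positions by prefixing the item with a seed."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """True if all positions are set; False means definitely not seen."""
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical stream of items.
seen = BloomFilter()
stream = ["alice", "bob", "carol"]
for name in stream:
    seen.add(name)
```

Every item that was added is always reported as possibly present. An item that was never added is usually rejected, but may be accepted with a probability that grows as more bits are set.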

Optimization and Machine learning algorithms:
Optimization algorithms: Stochastic gradient descent, Variance reduction, Momentum algorithms, Adam.
Algorithms for distributed optimization: Distributed stochastic gradient descent and related methods. ADMM and decomposition methods.
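As a small illustration of the optimization topics listed above, the following sketch runs stochastic gradient descent with (heavy-ball) momentum on a one-dimensional least-squares problem. It is only a sketch: the function name sgd_momentum, the hyperparameter values, and the noise-free synthetic data are assumptions made for this example, and it does not cover variance reduction, Adam, or the distributed methods.

```python
import random


def sgd_momentum(data, lr=0.1, momentum=0.5, epochs=200, seed=0):
    """Fit y ~ w * x + b by SGD with momentum, one sample per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # model parameters
    vw, vb = 0.0, 0.0    # velocity terms, one per parameter
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit samples in a fresh random order
        for x, y in samples:
            error = (w * x + b) - y
            # Gradient of the per-sample loss 0.5 * error**2.
            grad_w, grad_b = error * x, error
            vw = momentum * vw - lr * grad_w
            vb = momentum * vb - lr * grad_b
            w += vw
            b += vb
    return w, b


# Hypothetical noise-free data drawn from y = 3x + 1.
points = [(x / 2.0, 3.0 * (x / 2.0) + 1.0) for x in range(-2, 3)]
w, b = sgd_momentum(points)
```

Because the data are noise-free, every per-sample gradient vanishes at the true parameters, so the iterates settle close to w = 3 and b = 1; with noisy data a decaying step size would normally be needed.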


Textbooks:

  • Mining of Massive Datasets. 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
  • TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
  • Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
  • Recent literature.

Course Material:

| Date | Day | Topics | Video / Slides / Notes | Practice Problems |
|---|---|---|---|---|
| 01/09/20 | Tuesday | Introduction | Slides | |
| 07/09/20 | Monday | Hadoop, Map-reduce | Slides | Practice Problems, Quiz Solutions |
| 08/09/20 | Tuesday | HDFS, Hadoop system | | |
| 14/09/20 | Monday | Spark, RDD, Transformations, Actions | Slides | Practice Problems |
| 15/09/20 | Tuesday | Spark runtime system | | |
| 21/09/20 | Monday | DL framework: PyTorch/TF, devices, Variables, differentiation | Slides | |
| 22/09/20 | Tuesday | Discussion on Frameworks | | |
| 28/09/20 | Monday | Streaming Algorithms, Reservoir Sampling, Bloom filters | Slides | |
| 29/09/20 | Tuesday | Test on Spark and DL framework | | |
| 05/10/20 | Monday | Count distinct, FM sketch | Slides | |
| 06/10/20 | Tuesday | K-MV sketch | | |
| 12/10/20 | Monday | Frequency Count, Misra-Gries, Space saving | Slides | |
| 13/10/20 | Tuesday | Count-min sketch, Count sketch | Slides | |
| 19/10/20 | Monday | Locality sensitive hashing: Shingles, Minhash | Slides | |
| 20/10/20 | Tuesday | Test on Streaming Algorithms | | |
| 26/10/20 | Monday | | | |
| 27/10/20 | Tuesday | | | |
| 02/11/20 | Monday | Generalization: Gap LSH, Multi-probe LSH | Slides: LSH Generalization; Slides: Multi-probe LSH | |
| 03/11/20 | Tuesday | Large Scale ML, Stochastic Optimization, SGD | Slides | |
| 09/11/20 | Monday | Distributed Optimization, ADMM | Slides | |
| 10/11/20 | Tuesday | Batch Normalization | Slides | |
| 16/11/20 | Monday | Test on LSH and LSML | | |
| 17/11/20 | Tuesday | Wrap-up and discussion | | |

Assignments: