
Scalable Data Mining (CS60021)

Instructor: Sourangshu Bhattacharya

Teaching Assistants: Soumi Das, Kiran Purohit

Class Schedule: Monday (8:00 - 9:55), Tuesday (12:00 - 12:55)

Classroom: MS Teams class "Scalable Data Mining 2021"

Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2021a.html

First Meeting: Tuesday, 10 August 2021, 12:00 pm



Term Project:

Term project topics: Topics

Term project allocation to the teams: Teamwise Allocation

Date of first Term Project Presentation: 25th September, 9 a.m. onwards.

Date of second Term Project Presentation: 13th November (changed from 6th), 9 a.m. onwards.

Date of third test changed to 15th November.

Lecture Schedule:

Week | Dates | Topic / Activity | Links / Material
Week 1 | 10/8 | Introduction | Slides
Week 2 | 16/8, 17/8 | Hadoop: MapReduce, HDFS, MapReduce system | Slides
Week 3 | 23/8, 24/8 | Spark: Scala, RDDs, Programming, System | Slides; Assignment 1
Week 4 | 30/8, 31/8 | Deep learning frameworks: TensorFlow, PyTorch, ONNX | Slides; Assignment 2
Week 5 | 6/9, 7/9 | Optimization in ML; Test 1 on 7/9 |
Week 6 | 13/9, 14/9 | Stochastic Gradient Descent, Acceleration methods: Momentum, Nesterov, Adagrad, ADAM | Slides; Article on Accelerated SGD; Review article with SAGA, SVRG
Week 7 | 20/9, 21/9 | Distributed ML: Distributed GD, ADMM | Slides; Review Article
Week 8 | 27/9, 28/9 | Large Scale ML wrap up | Practice Problems
Week 9 | 4/10, 5/10 | Streaming Algorithms, reservoir sampling, Bloom Filter, Cuckoo Filter; Test 2 on 5/10 | Slides
Week 10 | 11/10 | Counting distinct items: Flajolet-Martin sketch, k-min value sketch | Slides
Week 11 | 18/10, 19/10 | Counting frequency of items: Misra-Gries, Space saving, Count-min sketch | Slides
Week 12 | 25/10, 26/10 | Count-sketch; Locality sensitive hashing: shingles, Minhash | Slides (LSH)
Week 13 | 1/11, 2/11 | LSH: Gap LSH formulation, Multi-probe LSH | Slides (multi-probe LSH)
Week 14 | 8/11, 9/11 | LSH discussion (Test 3, originally on 9/11, moved to 15/11) | Slides (Learning LSH); Practice Questions
Week 15 | 15/11, 16/11 | Test 3 on 15/11; Discussion and wrap-up |




Syllabus:

In this course, we discuss algorithmic techniques as well as software paradigms that allow one to write scalable algorithms for machine learning and data mining tasks.

Software paradigms:
Big Data Processing: Motivation and Fundamentals. Map-reduce framework. Functional programming and Scala. Programming using the map-reduce paradigm. Case studies: Finding similar items, PageRank, Matrix factorization.
Deep Learning Frameworks (TensorFlow / PyTorch): Motivation, Computation graphs, Tensors, Example programs.
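
As a rough illustration of the map-reduce paradigm listed above, the following is a minimal single-machine sketch of word count in plain Python. It is not Hadoop or Spark code: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are hypothetical, chosen only to mirror the three stages a real framework would run across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input documents, for illustration only.
docs = ["big data big ideas", "data mining at scale"]
counts = word_count(docs)
```

In a distributed setting the mapper and reducer stay essentially the same; only the shuffle step and data placement are taken over by the system.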

Optimization and Machine learning algorithms:
Optimization algorithms:
Stochastic gradient descent, Variance reduction, Momentum algorithms, ADAM.
Algorithms for distributed optimization: Distributed stochastic gradient descent and related methods. ADMM and decomposition methods.
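
To make the optimization topics above concrete, here is a small sketch of stochastic gradient descent with heavy-ball momentum, fitting a line to noise-free synthetic data. The function name sgd_momentum, the step size, the momentum coefficient, and the data are all assumptions picked for illustration; it is one common formulation of momentum, not necessarily the one used in the lectures.

```python
import random


def sgd_momentum(data, lr=0.01, beta=0.9, epochs=500, seed=0):
    """Fit y ~ w*x + b by SGD with heavy-ball momentum on squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0      # model parameters
    vw, vb = 0.0, 0.0    # velocity (running combination of past gradients)
    for _ in range(epochs):
        rng.shuffle(data)            # visit the samples in random order
        for x, y in data:
            err = (w * x + b) - y
            gw, gb = err * x, err    # gradient of 0.5 * err**2 for one sample
            vw = beta * vw + gw
            vb = beta * vb + gb
            w -= lr * vw
            b -= lr * vb
    return w, b


# Synthetic data from the line y = 3x + 1 (an assumption for this sketch).
points = [(x / 10.0, 3.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
w, b = sgd_momentum(points)
```

Setting beta to 0 reduces the update to plain SGD, which is a convenient way to compare the two on the same data.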

Algorithmic techniques:
Dimensionality reduction: Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count based sketches: FM sketch, AMS sketch. Hash based sketches: count sketch.
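
As one example of the stream-processing techniques listed above, the following is a minimal Bloom filter sketch. The class name, the default sizes, and the use of salted SHA-256 to derive hash positions are assumptions made for this illustration; practical implementations typically use faster hash functions and a packed bit array.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True only if every position is set; may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical stream of user names, for illustration only.
seen = BloomFilter()
stream = ["alice", "bob", "carol", "bob", "alice"]
for user in stream:
    seen.add(user)
```

Every item that was added is always reported as present; an item that was never added is usually, but not always, reported as absent.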


References:

  • Mining of Massive Datasets. 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
  • TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
  • Hadoop: The Definitive Guide. Tom White. O'Reilly Media.
  • Recent literature.