... somewhere something incredible is waiting to be known.

Scalable Data Mining (CS60021)

Instructor: Sourangshu Bhattacharya

Teaching Assistants: Soumi Das, Kiran Purohit

Class Schedule: Monday (8:00 - 9:55), Tuesday (12:00 - 12:55)

Classroom: CSE - 119

Last year's course website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2021a.html

First Meeting: Tuesday, 2 August 2022, 12:00 pm


Course Schedule:

Week   | Dates | Topic / Activity | Links / Material
Week 1 | 2/8   | Introduction | Slides
Week 2 | 8/8   | Hadoop, Map-reduce, HDFS and the Hadoop system. Ref: Hadoop: The Definitive Guide, Tom White, O'Reilly. | Slides

Syllabus:

In this course, we discuss algorithmic techniques and software paradigms that allow one to write scalable algorithms for machine learning and data mining tasks.

Software paradigms:
Big Data Processing: Motivation and fundamentals. Map-reduce framework. Functional programming and Scala. Programming using the map-reduce paradigm. Example programs.
Deep Learning Frameworks (PyTorch): Motivation, computation graphs, tensors, autograd, modules, example programs.
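
As a rough illustration of the map-reduce paradigm listed above, the following sketch counts words in plain Python, without Hadoop. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and chosen only for this example; a real framework would run each phase in parallel across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Chain the three phases over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data big ideas", "data mining"])
```

Here each document is mapped independently, which is the property that lets the map phase be distributed.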

Optimization and Machine learning algorithms:
Optimization algorithms:
Stochastic gradient descent, Variance reduction, Momentum algorithms, ADAM.
Algorithms for distributed optimization: Distributed stochastic gradient descent and related methods. ADMM and decomposition methods.
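
To make the first item above concrete, here is a minimal sketch of stochastic gradient descent with (heavy-ball) momentum, fitting a line to noise-free data. The toy data, the step size, the momentum coefficient, and the function name sgd_momentum are all assumptions made for illustration, not course-prescribed settings.

```python
import random


def sgd_momentum(data, lr=0.02, beta=0.9, steps=2000, seed=0):
    """Fit y = w*x + b by SGD with momentum on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # model parameters
    vw, vb = 0.0, 0.0    # momentum (velocity) terms
    for _ in range(steps):
        x, y = rng.choice(data)      # sample one training point
        err = (w * x + b) - y        # residual on that point
        gw, gb = err * x, err        # gradient of 0.5 * err**2
        vw = beta * vw + gw          # accumulate past gradients
        vb = beta * vb + gb
        w -= lr * vw                 # step along the velocity
        b -= lr * vb
    return w, b


# Toy data generated from y = 2x + 1 on x in [-1, 1].
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_momentum(data)
```

Setting beta to 0 in this sketch recovers plain stochastic gradient descent.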

Algorithmic techniques:
Dimensionality reduction: Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count-based sketches: FM sketch, AMS sketch. Hash-based sketches: count sketch.
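
As a small illustration of the "Finding similar items" topic above, the sketch below builds character shingles and uses minhash signatures to estimate Jaccard similarity. The use of salted BLAKE2b as the hash family, the shingle length, the number of hash functions, and all function names are assumptions for this example only.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of character k-shingles of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(item, seed):
    """One member of a hash family: BLAKE2b salted with the seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes=200):
    """Signature: the minimum hash of the (non-empty) set under each hash."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison."""
    return len(a & b) / len(a | b)


doc_a = shingles("scalable data mining with map-reduce")
doc_b = shingles("scalable data mining with pytorch")
sig_a, sig_b = minhash_signature(doc_a), minhash_signature(doc_b)
estimate = estimate_jaccard(sig_a, sig_b)
exact = jaccard(doc_a, doc_b)
```

The estimate approaches the exact value as num_hashes grows; locality sensitive hashing would then band these signatures to find candidate pairs without comparing all pairs.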


References:

  • Mining of Massive Datasets, 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
  • TensorFlow for Machine Intelligence: A Hands-On Introduction to Learning Algorithms. Sam Abrahams et al. Bleeding Edge Press.
  • Hadoop: The Definitive Guide. Tom White. O'Reilly.
  • Recent literature.