Sourangshu Bhattacharya's Homepage

Scalable Data Mining (CS60021)

Instructor: Sourangshu Bhattacharya

Teaching Assistants: Soumi Das and Ayan Kumar Bhowmick

Class Schedule: Wednesday (11:00 - 11:55), Thursday (12:00 - 12:55), Friday (8:00 - 8:55)

Classroom: CSE-107

Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2018a.html

First Meeting: 18 July 2018, 11:00 pm

Motivation

Consider the following problems:

One is interested in computing summary statistics (word count distributions) for a set of words which occur in the same document in entire wikipedia collection (5 million documents). Naive techniques, will run out of main memory on most computers.
One needs an approximate count of the number of distinct IP addresses, for packets passing through a router. The algorithm that maintains a list of all distinct IP addresses will consume too much memory.
One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. One would run out of main memory, if they store uncompressed model parameters in main memory.

In all the above situations, a simple data mining / machine learning task has been made more complicated due to large scale of input data, output results or both. In this course, we discuss algorithmic techniques as well as software paradigms which allow one to write scalable algorithms for the common data mining tasks.

Syllabus:

Big Data Processing: Motivation and Fundamentals. Map-reduce framework. Functional programming and Scala. Programming using map-reduce paradigm. Case studies: Finding similar items, Page rank, Matrix factorisation.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count-distinct using FM sketch, Estimating moments using AMS sketch.
Dimensionality reduction: Linear dimensionality reduction, PCA, SVD. Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform. Random hashing, Clarkson-Woodruff algorithm.
Algorithms for distributed machine learning: Distributed supervised learning, distributed clustering, distributed recommendation. distributed optimization on Big data platforms: Gradient descent, spectral gradient descent, Stochastic gradient descent and related methods. ADMM and decomposition methods.

Textbooks:

Mining of Massive Datasets. 2nd edition. - Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Morgan and Claypool. http://lintool.github.io/MapReduceAlgorithms/index.html
Hadoop: The definitive Guide. Tom White. Oreilly Press.
Distributed optimization and statistical learning via the alternating direction method of multipliers. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, 2011.
Recent literature.

Course Material:

Mapreduce - introduction Slides, system Slides and programming Slides.
Spark - Scala, Spark (Slides) and data mining applications. Slides
Tensorflow (Reference).

Reference: MMDS Book, Hadoop book (see above)
Reference: Learning Spark. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. Oreilly Press.
Sample questions: here

Clustering, k-means++, and scalable k-means++. Slides
Similarity search - Shingling, Minhashing, Locality sensitivehashing Slides, generalization Slides

References are: MMDS book, chapter 3 and "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions" (by Alexandr Andoni and Piotr Indyk)
Sample questions: See MMDS book, assignment.

Sourangshu Bhattacharya