Scalable Data Mining (CS60021)
Instructor: Sourangshu Bhattacharya
Teaching Assistants: Soumi Das and Ayan Kumar Bhowmick
Class Schedule: Wednesday (11:00 - 11:55), Thursday (12:00 - 12:55), Friday (8:00 - 8:55)
Classroom: CSE-107
Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2018a.html
First Meeting: 18 July 2018, 11:00 am
Content:
Motivation
Consider the following problems:
- One is interested in computing summary statistics (word count distributions) for a set of words occurring in the same document, over the entire Wikipedia collection (5 million documents). Naive techniques will run out of main memory on most computers.
- One needs an approximate count of the number of distinct IP addresses in packets passing through a router. An algorithm that maintains a list of all distinct IP addresses would consume too much memory.
- One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. One would run out of main memory if the uncompressed model parameters were stored there.
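The second problem above is addressed later in the course by the Flajolet-Martin (FM) sketch (see Stream processing in the syllabus): instead of storing every distinct address, one hashes each item and remembers only the largest number of trailing zero bits seen. Below is a minimal illustrative sketch in Python; the choice of MD5 as the hash and the averaging over salted copies are illustrative assumptions, not part of the course material.

```python
import hashlib


def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in x (returns 0 for x == 0, by convention)."""
    if x == 0:
        return 0
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n


def fm_estimate(items, salt=""):
    """Flajolet-Martin estimate of the number of distinct items in a stream.

    Hashes each item, tracks the maximum number of trailing zero bits R
    seen over the whole stream, and returns 2**R as a rough estimate of
    the distinct count. Memory use is O(1) regardless of stream length.
    """
    R = 0
    for item in items:
        h = int(hashlib.md5((salt + str(item)).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R


def fm_mean(items, k=10):
    """Average k independent (salted) FM estimates to reduce variance."""
    items = list(items)
    return sum(fm_estimate(items, salt=str(i)) for i in range(k)) / k
```

A single FM estimate is always a power of two and has high variance; averaging (or taking medians of means over) several independently hashed copies, as `fm_mean` hints at, is the standard remedy.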
Syllabus:
Big Data Processing: Motivation and Fundamentals. Map-reduce framework. Functional programming and Scala. Programming using map-reduce paradigm. Case studies: Finding similar items, Page rank, Matrix factorisation.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count-distinct using FM sketch, Estimating moments using AMS sketch.
Dimensionality reduction: Linear dimensionality reduction, PCA, SVD. Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform. Random hashing, Clarkson-Woodruff algorithm.
Algorithms for distributed machine learning: Distributed supervised learning, distributed clustering, distributed recommendation. Distributed optimization on Big Data platforms: Gradient descent, spectral gradient descent, stochastic gradient descent and related methods. ADMM and decomposition methods.
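The map-reduce programming paradigm listed above can be illustrated with the classic word-count example from the motivation section. The following is a toy, single-machine Python sketch of the map / shuffle / reduce phases; it only mimics what a framework like Hadoop or Spark does in a distributed fashion, and all function names here are illustrative.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group mapper output by key, as the framework's shuffle stage would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the partial counts for one word."""
    return (key, sum(values))


def word_count(docs):
    """Run the three phases over a dict of {doc_id: text}."""
    pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real deployment the mappers and reducers run on different machines and the shuffle moves data over the network; the point of the paradigm is that the programmer only writes the map and reduce functions.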
Textbooks:
- Mining of Massive Datasets, 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
- Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Morgan and Claypool. http://lintool.github.io/MapReduceAlgorithms/index.html
- Hadoop: The Definitive Guide. Tom White. O'Reilly Press.
- Distributed optimization and statistical learning via the alternating direction method of multipliers. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, 2011.
- Recent literature.
Course Material:
- MapReduce - introduction Slides, system Slides and programming Slides.
- Spark - Scala, Spark (Slides) and data mining applications (Slides).
- Tensorflow (Reference).
- Reference: MMDS Book, Hadoop book (see above)
- Reference: Learning Spark. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. O'Reilly Press.
- Sample questions: here
- Clustering, k-means++, and scalable k-means++. Slides
- Similarity search - Shingling, Minhashing, Locality Sensitive Hashing Slides, generalization Slides
- References: MMDS book, Chapter 3, and "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions" by Alexandr Andoni and Piotr Indyk.
- Sample questions: See MMDS book, assignment.
Assignments: