Sourangshu Bhattacharya's Homepage

Scalable Data Mining (CS60021)

Instructor: Sourangshu Bhattacharya

Teaching Assistants: Surjodoy Ghosh Dastider, Chandan Misra, Sanga Chaki.

Class Schedule: MON(17:00-17:55) , THURS(17:00-17:55) , FRI(17:00-17:55)

Classroom: CSE-108

Website: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2017a.html

First Meeting: 17 July 2017, 5:00 pm

Motivation

Consider the following problems:

One is interested in computing summary statistics (word count distributions) for a set of words which occur in the same document in entire wikipedia collection (5 million documents). Naive techniques, will run out of main memory on most computers.
One needs an approximate count of the number of distinct IP addresses, for packets passing through a router. The algorithm that maintains a list of all distinct IP addresses will consume too much memory.
One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. One would run out of main memory, if they store uncompressed model parameters in main memory.

In all the above situations, a simple data mining / machine learning task has been made more complicated due to large scale of input data, output results or both. In this course, we discuss algorithmic techniques as well as software paradigms which allow one to write scalable algorithms for the common data mining tasks.

Syllabus:

Big Data Processing: Motivation and Fundamentals. Map-reduce framework. Functional programming and Scala. Programming using map-reduce paradigm. Case studies: Finding similar items, Page rank, Matrix factorisation.
Finding similar items: Shingles, Minhashing, Locality Sensitive Hashing families.
Stream processing: Motivation, Sampling, Bloom filtering, Count-distinct using FM sketch, Estimating moments using AMS sketch.
Dimensionality reduction: Linear dimensionality reduction, PCA, SVD. Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform. Random hashing, Clarkson-Woodruff algorithm.
Algorithms for distributed machine learning: Distributed supervised learning, distributed clustering, distributed recommendation. distributed optimization on Big data platforms: Gradient descent, spectral gradient descent, Stochastic gradient descent and related methods. ADMM and decomposition methods.

Textbooks:

Mining of Massive Datasets. 2nd edition. - Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Morgan and Claypool. http://lintool.github.io/MapReduceAlgorithms/index.html
Hadoop: The definitive Guide. Tom White. Oreilly Press.
Distributed optimization and statistical learning via the alternating direction method of multipliers. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, 2011.
Recent literature.

Course Material:

Mapreduce - introduction, system and programming. Slides
Mapreduce - Scala, Spark and data mining applications. Slides

Reference: MMDS Book, Hadoop book (see above)
Reference: Learning Spark. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. Oreilly Press.
Sample questions: here

Similarity search - Shingling, Minhashing, Locality sensitivehashing Slides

References are: MMDS book, chapter 3 and "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions" (by Alexandr Andoni and Piotr Indyk)
Sample questions: See MMDS book.

Stream computing - Sampling using hashing, Reservoir sampling, Counting 1s (DGIM)) Slides

Reference: MMDS book, chapter 4
Sample questions: See MMDS book.

Stream computing - Bloom filter, Count distinct, Computing Moments, AMS algorithm Slides

Reference: MMDS book, chapter 4
Reference: Data Stream Management. Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi. Springer. (Chapter on Join Sizes, Frequency Moments, and Applications, Graham Cormode and Minos Garofalakis. ). See moodle for PDF.
Sample questions: See MMDS book.

Dimensionality Reduction - SVD, CUR decomposition Slides

Reference: MMDS book, chapter 11
Reference: CUR matrix decompositions for improved data analysis. Micheal Mahoney and Petros Drineas. PNAS 2009 (This has the non-scalable algorithm)
Reference: Less is More: Compact matrix decompositions for large sparse graphs. Jimeng Sun, Yinglian Xie Hui Zhang CHristos Faloutsos. SDM 2007 (This have scalable algorithm, but no bound)
Reference: Drineas, Petros, Michael W. Mahoney, and S. Muthukrishnan. "Relative-error CUR matrix decompositions." SIAM Journal on Matrix Analysis and Applications 30.2 (2008): 844-881.
Sample questions: TBA

Dimensionality Reduction - Random Projection

Reference: Dasgupta, Sanjoy, and Anupam Gupta. "An elementary proof of the Johnson-Lindenstrauss lemma." International Computer Science Institute, Technical Report (1999): 99-006.
Reference: Achlioptas, Dimitris. "Database-friendly random projections." Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2001.
Sample questions: TBA

Large scale machine learning - Gradient Descent, Stochastic Gradient descent. slides

Reference: MMDS book, chapter 12
Reference: SGD proof - Nesterov, Yu, and J-Ph Vial. "Confidence level solutions for stochastic programming." Automatica 44, no. 6 (2008): 1559-1568.
Sample questions: See below.

Large scale machine learning - ADMM slides

Reference: Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. "Distributed optimization and statistical learning via the alternating direction method of multipliers." Foundations and Trends® in Machine Learning 3, no. 1 (2011): 1-122.
Link: http://stanford.edu/~boyd/papers/admm_distr_stats.html
Sample questions: pdf

Assignments:

Mapreduce - Mapreduce. Assignment 1
Mapreduce - Spark. Assignment 2
Locality sensitive hashing. Assignment 3
Stream Computing. Assignment 4 and Supporting file