## Scalable Data Mining (CS60021)

**Instructor**: Sourangshu Bhattacharya

**Teaching Assistants**: Surjodoy Ghosh Dastider, Chandan Misra, Sanga Chaki.

**Class Schedule**: MON (17:00-17:55), THURS (17:00-17:55), FRI (17:00-17:55)

**Classroom**: CSE-108

**Website**: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2017a.html

**First Meeting**: 17 July 2017, 5:00 pm

**Content:**

### Motivation

Consider the following problems:

- One is interested in computing summary statistics (word count distributions) for a set of words that occur in the same document, over the entire Wikipedia collection (5 million documents). Naive techniques will run out of main memory on most computers.
- One needs an approximate count of the number of distinct IP addresses in the packets passing through a router. An algorithm that maintains a list of all distinct IP addresses will consume too much memory.
- One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. One would run out of main memory if the uncompressed model parameters were stored in main memory.
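The count-distinct problem above is a good first taste of sketching: instead of storing every address, hash each one and remember only the longest run of trailing zero bits seen. A minimal Flajolet-Martin-style sketch in Python (single hash function, so high variance; the MD5-based hash and the sample addresses are illustrative choices, not part of any router):

```python
import hashlib

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in x (0 maps to the register width)."""
    if x == 0:
        return 32
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

class FMSketch:
    """Flajolet-Martin style count-distinct sketch using one hash function."""

    def __init__(self):
        self.max_tz = 0  # largest trailing-zero run seen so far

    def add(self, item: str) -> None:
        # Hash to a 32-bit value; duplicates hash identically, so they
        # cannot increase the estimate -- only distinct items matter.
        h = int(hashlib.md5(item.encode()).hexdigest(), 16) & 0xFFFFFFFF
        self.max_tz = max(self.max_tz, trailing_zeros(h))

    def estimate(self) -> int:
        # A trailing-zero run of length R appears once among ~2^R distinct
        # hashes, so 2^R estimates the distinct count up to a constant factor.
        return 2 ** self.max_tz

sketch = FMSketch()
for i in range(1000):
    sketch.add(f"10.0.{i % 256}.{i % 100}")
estimate = sketch.estimate()
```

In practice many independent hash functions are combined (e.g. by taking medians of means) to control the variance; the course covers the FM sketch in the stream-processing unit.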

### Syllabus:

**Big Data Processing**: Motivation and fundamentals. Map-reduce framework. Functional programming and Scala. Programming using the map-reduce paradigm. Case studies: finding similar items, PageRank, matrix factorisation.
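The map-reduce paradigm can be previewed without any cluster: word count reduces to a map phase emitting `(word, 1)` pairs, a shuffle phase grouping by key, and a reduce phase summing each group. A pure-Python simulation of the three phases (not Hadoop or Spark API code; the sample documents are made up):

```python
from collections import defaultdict
from itertools import chain

def mapper(doc: str):
    # Map phase: emit a (word, 1) pair for every word in one document.
    for word in doc.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: group all emitted values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: combine the values for one key into a single count.
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map(mapper, docs))
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
```

In a real framework the mappers and reducers run in parallel on partitions of the data, and the shuffle moves pairs across the network; the control flow above is the same.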

**Finding similar items**: Shingles, Minhashing, Locality Sensitive Hashing families.
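The key fact behind minhashing is that, for a random hash function, the probability two sets share the same minimum hash equals their Jaccard similarity. A small Python sketch of signature construction and similarity estimation (the random linear hash family modulo a Mersenne prime and the signature length of 200 are illustrative assumptions):

```python
import random
import zlib

def minhash_signature(shingle_set, hash_funcs):
    """Signature = minimum hash value of the set under each hash function."""
    return [min(h(s) for s in shingle_set) for h in hash_funcs]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

P = 2_147_483_647  # Mersenne prime used as the hash range
rng = random.Random(0)
hash_funcs = [
    # h(s) = (a * crc32(s) + b) mod P, with random a, b per function.
    (lambda a, b: (lambda s: (a * zlib.crc32(s.encode()) + b) % P))(
        rng.randrange(1, P), rng.randrange(P)
    )
    for _ in range(200)
]

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox sleeps near the lazy dog".split())
true_jaccard = len(A & B) / len(A | B)  # 6 shared words / 10 total = 0.6
est_jaccard = estimated_jaccard(minhash_signature(A, hash_funcs),
                                minhash_signature(B, hash_funcs))
```

Locality sensitive hashing then bands the signature so that only pairs agreeing on some band become candidates, avoiding all-pairs comparison.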

**Stream processing**: Motivation, Sampling, Bloom filtering, Count-distinct using FM sketch, Estimating moments using AMS sketch.
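Bloom filtering, the second topic above, answers approximate membership queries with a fixed-size bit array: a "no" is always correct, a "yes" may be a false positive. A minimal sketch in Python (deriving the k hash functions by salting CRC32, and storing one byte per bit for clarity, are illustrative simplifications):

```python
import zlib

class BloomFilter:
    """Bit-array Bloom filter with k hash functions derived by salting CRC32."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _positions(self, item: str):
        # k pseudo-independent positions, one per salt value.
        for i in range(self.k):
            yield zlib.crc32(f"{i}:{item}".encode()) % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=1024, k=3)
for name in ["alice", "bob", "carol"]:
    bf.add(name)
```

With m bits, k hash functions and n inserted items, the false-positive rate is roughly (1 - e^(-kn/m))^k, which guides the choice of m and k.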

**Dimensionality reduction**: Linear dimensionality reduction, PCA, SVD. Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform. Random hashing, Clarkson-Woodruff algorithm.
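The Johnson-Lindenstrauss lemma says a random k x d projection preserves pairwise distances up to a small distortion when k is on the order of log(n)/eps^2. A dense-Gaussian sketch in Python checking this on one pair of points (the dimensions d = 1000, k = 200 are illustrative choices):

```python
import math
import random

rng = random.Random(1)

def random_projection_matrix(d: int, k: int):
    """k x d matrix of i.i.d. Gaussians scaled by 1/sqrt(k) (dense JL transform)."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
            for _ in range(k)]

def project(R, x):
    # Matrix-vector product: each output coordinate is one row's dot product.
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in R]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

d, k = 1000, 200
R = random_projection_matrix(d, k)
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(d)]
diff = [a - b for a, b in zip(x, y)]

# Ratio of the projected distance to the original distance.
ratio = norm(project(R, diff)) / norm(diff)
```

Sparse JL transforms and the Clarkson-Woodruff sketch covered later achieve the same guarantee with far fewer nonzero entries, making the projection cheap to apply.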

**Algorithms for distributed machine learning**: Distributed supervised learning, distributed clustering, distributed recommendation. Distributed optimization on Big Data platforms: gradient descent, spectral gradient descent, stochastic gradient descent and related methods. ADMM and decomposition methods.
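Stochastic gradient descent, the workhorse of the unit above, updates the model from one example at a time instead of a full-data gradient. A toy run on one-dimensional least squares in Python (the synthetic line y = 2x + 1, the learning rate and the epoch count are illustrative assumptions):

```python
import random

rng = random.Random(42)

# Synthetic data drawn from y = 2x + 1 plus Gaussian noise.
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 0.1))
        for x in [rng.uniform(-1, 1) for _ in range(500)]]

w, b = 0.0, 0.0
lr = 0.1
for epoch in range(50):
    rng.shuffle(data)               # fresh pass order each epoch
    for x, y in data:
        pred = w * x + b
        grad = pred - y             # gradient of (pred - y)^2 / 2 w.r.t. pred
        w -= lr * grad * x          # stochastic gradient step on the slope
        b -= lr * grad              # and on the intercept
```

Each step uses a single noisy gradient, which is what makes the method cheap enough to distribute: workers can compute gradients on data shards and combine updates, the setting analysed in the SGD and ADMM lectures.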

### Textbooks:

- Mining of Massive Datasets, 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
- Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Morgan and Claypool. http://lintool.github.io/MapReduceAlgorithms/index.html
- Hadoop: The Definitive Guide. Tom White. O'Reilly Press.
- Distributed optimization and statistical learning via the alternating direction method of multipliers. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, 2011.
- Recent literature.

### Course Material:

- Mapreduce - introduction, system and programming. Slides
- Mapreduce - Scala, Spark and data mining applications. Slides
- Reference: MMDS Book, Hadoop book (see above)
- Reference: Learning Spark. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. O'Reilly Press.
- Sample questions: here
- Similarity search - Shingling, Minhashing, Locality sensitive hashing. Slides
- References: MMDS book, chapter 3; and "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions" by Alexandr Andoni and Piotr Indyk
- Sample questions: See MMDS book.
- Stream computing - Sampling using hashing, Reservoir sampling, Counting 1s (DGIM). Slides
- Reference: MMDS book, chapter 4
- Sample questions: See MMDS book.
- Stream computing - Bloom filter, Count distinct, Computing Moments, AMS algorithm Slides
- Reference: MMDS book, chapter 4
- Reference: Data Stream Management. Minos Garofalakis, Johannes Gehrke, Rajeev Rastogi. Springer. (Chapter "Join Sizes, Frequency Moments, and Applications" by Graham Cormode and Minos Garofalakis.) See moodle for PDF.
- Sample questions: See MMDS book.
- Dimensionality Reduction - SVD, CUR decomposition Slides
- Reference: MMDS book, chapter 11
- Reference: CUR matrix decompositions for improved data analysis. Michael Mahoney and Petros Drineas. PNAS 2009. (This presents the non-scalable algorithm.)
- Reference: Less is More: Compact matrix decompositions for large sparse graphs. Jimeng Sun, Yinglian Xie, Hui Zhang, Christos Faloutsos. SDM 2007. (This presents a scalable algorithm, but without an error bound.)
- Reference: Drineas, Petros, Michael W. Mahoney, and S. Muthukrishnan. "Relative-error CUR matrix decompositions." SIAM Journal on Matrix Analysis and Applications 30.2 (2008): 844-881.
- Sample questions: TBA
- Dimensionality Reduction - Random Projection
- Reference: Dasgupta, Sanjoy, and Anupam Gupta. "An elementary proof of the Johnson-Lindenstrauss lemma." International Computer Science Institute, Technical Report (1999): 99-006.
- Reference: Achlioptas, Dimitris. "Database-friendly random projections." Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2001.
- Sample questions: TBA
- Large scale machine learning - Gradient descent, Stochastic gradient descent. Slides
- Reference: MMDS book, chapter 12
- Reference: SGD proof - Nesterov, Yu, and J-Ph Vial. "Confidence level solutions for stochastic programming." Automatica 44, no. 6 (2008): 1559-1568.
- Sample questions: See below.
- Large scale machine learning - ADMM. Slides
- Reference: Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. "Distributed optimization and statistical learning via the alternating direction method of multipliers." Foundations and Trends® in Machine Learning 3, no. 1 (2011): 1-122.
- Link: http://stanford.edu/~boyd/papers/admm_distr_stats.html
- Sample questions: pdf
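Reservoir sampling, from the stream-computing lectures above, keeps a uniform random sample of k items from a stream of unknown length in O(k) memory. A minimal Python sketch (the stream of integers and the seed are illustrative):

```python
import random

def reservoir_sample(stream, k: int, rng: random.Random):
    """Uniform sample of k items from a stream whose length is unknown."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            # Item i is kept with probability k/(i+1); if kept, it evicts
            # a uniformly random current occupant. This keeps every prefix
            # item in the reservoir with equal probability k/(i+1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

rng = random.Random(7)
sample = reservoir_sample(range(10_000), 5, rng)
```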

### Assignments:

- Mapreduce - Mapreduce. Assignment 1
- Mapreduce - Spark. Assignment 2
- Locality sensitive hashing. Assignment 3
- Stream Computing. Assignment 4 and Supporting file