## Scalable Data Mining (CS60021)

**Instructor**: Sourangshu Bhattacharya

**Teaching Assistants**: Soumi Das and Ayan Kumar Bhowmick

**Class Schedule**: Wednesday (11:00 - 11:55), Thursday (12:00 - 12:55), Friday (8:00 - 8:55)

**Classroom**: CSE-107

**Website**: http://cse.iitkgp.ac.in/~sourangshu/coursefiles/cs60021-2018a.html

**First Meeting**: 18 July 2018, 11:00 am

**Content:**

### Motivation

Consider the following problems:

- One is interested in computing summary statistics (word count distributions) for sets of words that occur in the same document across the entire Wikipedia collection (5 million documents). Naive techniques will run out of main memory on most computers.
- One needs an approximate count of the number of distinct IP addresses for packets passing through a router. An algorithm that maintains a list of all distinct IP addresses will consume too much memory.
- One needs to train an SVM classifier for text categorization, with unigram features (typically ~10 million) for hundreds of classes. One would run out of main memory if the uncompressed model parameters were stored in main memory.

### Syllabus:

**Big Data Processing**: Motivation and Fundamentals. Map-reduce framework. Functional programming and Scala. Programming using map-reduce paradigm. Case studies: Finding similar items, Page rank, Matrix factorisation.
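As a toy illustration of the map-reduce paradigm, the classic word-count job can be simulated in plain Python (a minimal sketch only; in a real framework the map and reduce phases run in parallel across machines, and the shuffle is handled by the system):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Mapper: emit a (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reducer: sum the partial counts for one word.
    return (word, sum(counts))

docs = {1: "big data big ideas", 2: "data mining at scale"}
pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
result = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
# result["big"] == 2, result["data"] == 2
```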

**Finding similar items**: Shingles, Minhashing, Locality Sensitive Hashing families.
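A minimal sketch of minhashing (toy helper names; a production implementation would use a fixed hash family rather than Python's built-in `hash`): the fraction of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import random

def shingles(text, k=3):
    # k-shingles: the set of all length-k substrings of the text.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def make_hashes(n, prime=2147483647, seed=0):
    # Random hash functions h(x) = (a*hash(x) + b) mod p.
    # Python's built-in hash is salted per process, which is fine here
    # because both signatures are computed in the same process.
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]
    return [lambda s, a=a, b=b: (a * hash(s) + b) % prime for a, b in params]

def minhash_signature(sset, hash_funcs):
    # Signature entry i = minimum of hash_i over the set's members.
    return [min(h(s) for s in sset) for h in hash_funcs]

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1, s2 = shingles("scalable data mining"), shingles("scalable data miner")
hs = make_hashes(200)
sig1 = minhash_signature(s1, hs)
sig2 = minhash_signature(s2, hs)
estimate = sum(x == y for x, y in zip(sig1, sig2)) / len(hs)
# estimate approximates jaccard(s1, s2)
```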

**Stream processing**: Motivation, Sampling, Bloom filtering, Count-distinct using FM sketch, Estimating moments using AMS sketch.
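The count-distinct problem can be illustrated with a simplified Flajolet-Martin-style sketch (a toy version for intuition: it averages the trailing-zero maxima across independent hash functions, whereas practical systems use refinements along the lines of LogLog/HyperLogLog):

```python
import hashlib

def trailing_zeros(x, width=32):
    # Number of trailing zero bits in x (width if x == 0).
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_count_distinct(stream, num_hashes=64):
    # For each hash function, track R = max trailing zeros seen so far.
    # A stream with n distinct items drives R toward about log2(n),
    # regardless of how often each item repeats.
    maxima = [0] * num_hashes
    for item in stream:
        for i in range(num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            h = int.from_bytes(digest[:4], "big")
            maxima[i] = max(maxima[i], trailing_zeros(h))
    # Average R across hash functions, then invert with the FM bias
    # correction constant phi ~= 0.77351.
    avg_r = sum(maxima) / num_hashes
    return (2 ** avg_r) / 0.77351

stream = [f"10.0.0.{i % 50}" for i in range(1000)]  # 50 distinct addresses
estimate = fm_count_distinct(stream)
# estimate is within a small constant factor of 50
```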

**Dimensionality reduction**: Linear dimensionality reduction, PCA, SVD. Random projections, Johnson-Lindenstrauss lemma, JL transforms, sparse JL-transform. Random hashing, Clarkson-Woodruff algorithm.
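A dense Gaussian random projection, one simple JL transform, can be sketched in plain Python; the Johnson-Lindenstrauss lemma guarantees that pairwise distances are preserved up to a small distortion with high probability:

```python
import math
import random

def random_projection(vectors, d_out, seed=0):
    # Dense Gaussian JL transform: multiply by a d_out x d_in matrix of
    # N(0, 1) entries, scaled by 1/sqrt(d_out) so that squared norms are
    # preserved in expectation.
    rng = random.Random(seed)
    d_in = len(vectors[0])
    R = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
    scale = 1 / math.sqrt(d_out)
    return [[scale * sum(r[j] * v[j] for j in range(d_in)) for r in R]
            for v in vectors]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [rng.gauss(0, 1) for _ in range(1000)]
px, py = random_projection([x, y], 200)
# dist(px, py) is close to dist(x, y) despite the 5x dimension reduction
```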

**Algorithms for distributed machine learning**: Distributed supervised learning, distributed clustering, distributed recommendation. Distributed optimization on Big Data platforms: gradient descent, spectral gradient descent, stochastic gradient descent and related methods. ADMM and decomposition methods.
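The flavour of distributed optimization can be conveyed with synchronous gradient descent on a toy least-squares problem (a sketch only: the shards stand in for worker partitions, and the "parallel" gradient computation is simulated with a list comprehension on one machine):

```python
def shard_gradient(shard, w):
    # Each worker computes the average gradient of (w*x - y)^2 on its shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def distributed_gd(shards, lr=0.5, steps=100):
    # Synchronous distributed gradient descent: workers compute shard
    # gradients (in parallel, in a real system); the driver averages
    # them and applies the update.
    w = 0.0
    for _ in range(steps):
        grads = [shard_gradient(s, w) for s in shards]
        w -= lr * sum(grads) / len(grads)
    return w

# Noise-free data from y = 3x, split across two equal-size "workers";
# gradient descent should recover w close to 3.
data = [(x / 10, 3 * (x / 10)) for x in range(1, 11)]
shards = [data[:5], data[5:]]
w = distributed_gd(shards)
```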

### Textbooks:

- Mining of Massive Datasets, 2nd edition. Jure Leskovec, Anand Rajaraman, Jeff Ullman. Cambridge University Press. http://www.mmds.org/
- Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. Morgan & Claypool. http://lintool.github.io/MapReduceAlgorithms/index.html
- Hadoop: The Definitive Guide. Tom White. O'Reilly Press.
- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, 2011.
- Recent literature.

### Course Material:

- MapReduce: introduction slides, systems slides, and programming slides.
- Spark: Scala slides, Spark slides, and data mining applications slides.
- TensorFlow (reference).
- Reference: MMDS book and Hadoop book (see above).
- Reference: Learning Spark. Holden Karau, Andy Konwinski, Patrick Wendell & Matei Zaharia. O'Reilly Press.
- Sample questions: here
- Clustering: k-means++ and scalable k-means++ slides.
- Similarity search: shingling, minhashing, and locality sensitive hashing slides; generalization slides.
- References: MMDS book, chapter 3, and "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions" by Alexandr Andoni and Piotr Indyk.
- Sample questions: See MMDS book, assignment.
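The k-means++ seeding step covered in the clustering material above can be sketched as follows (a 1-D toy for clarity; the D²-weighted sampling of later centers is what distinguishes it from uniform seeding):

```python
import random

def kmeans_pp_seeds(points, k, seed=0):
    # k-means++ seeding: pick the first center uniformly at random,
    # then sample each later center with probability proportional to
    # D(x)^2, the squared distance to the nearest chosen center.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r, acc = rng.uniform(0, sum(d2)), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

# Two well-separated 1-D clusters: the D^2 weighting makes it very
# likely that the two seeds land in different clusters.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
centers = kmeans_pp_seeds(points, 2)
```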

### Assignments: