CS698V: Data Mining

(Semester II, 2003-2004)

Instructor: Pabitra Mitra (Room No. 224, CSE Dept., EMail: pmitra@iitk.ac.in)

Lecture Timings:

Day
Time
Room
Monday
10:00-11:00
CS102
Wednesday
10:00-11:00
CS102
Thursday
10:00-11:00
CS102
 

Teaching Assistants:
G. Sushma and Saurabh Jindal

Registered Students


Course Contents:

Data Mining is an information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, credit risk analysis, web, and bioinformatics.

The course is intended to cover currently popular data mining tasks and algorithms and provide insight into research issues and challenges involved.  The content is listed below:

Overview of the Data Mining and Knowledge Discovery from Databases Process,

Data Warehousing and OLAP,

Data Preprocessing:  Summary Data Structures, Dimensionality Reduction

Association Rule Mining: Apriori Algorithm, Modifications of Apriori Algorithm

Clustering: Partitional, Hierarchical, Density Based, Grid Based

Classification: Decision Trees, Instance Based, Support Vector Machines, Computational Learning Theory

Sequence Mining

Complex Data Mining

Web Mining: Information Retrieval, Link Analysis, Search Engines, Usage Analysis

Data Mining Applications to Bioinformatics


Books and References:

1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann/Elsevier India, 2001.

2. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining, MIT Press, 2001.

3. Recent literature from ACM SIGMOD, VLDB, IEEE Trans. Knowledge & Data Engg., Data Mining and Knowledge Discovery, ACM SIGKDD, IEEE ICDM, SIAM Data Mining, ICML

Course Policies:
Assignments:

Assignment 1 (Due date: January 22, 2004 Thursday)

Assignment 2 (Due date: March 2, 2004, Tuesday)

Assignment 3 (Due date: April 2, 2004, Friday)

Assignment 4 (Due date: April 15, 2004, Thursday)

Midsemester (February 17, 2004)

Question             Solution guideline for Question 6

Endsemester (May 1, 2004)

Question

Projects

Project Proposals

List of Groups and Projects

Course Material:

Lecture No. & Topic
Reading Material
1. Overview
The KDD process for extracting useful knowledge from volumes of data
Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Communications of the ACM, Vol. 39(11), Nov. 1996, Pages: 27 - 34

2. Data Warehousing & OLAP
An overview of data warehousing and OLAP technology
Surajit Chaudhuri, Umeshwar Dayal, ACM SIGMOD RecordVol. 26, Mar. 1997.

Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. In Proc. 12th ICDE, pages 152--159, New Orleans, March 1996.
3. Data Preprocessing: Summary Data Structures, AD-Tree, KD-Tree
Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets Andrew Moore and Mary Soon Lee, Journal of Artificial Intelligence Research, Vol. 8, Mar. 1998, pp. 67-91.
 

A tutorial on kd-trees, Andrew Moore, University of Cambridge Computer Laboratory Technical Report No. 209, 1991.
4. Data Preprocessing: Attribute Selection
Pattern Classification, Duda, Hart & Stork, Prentice Hall India, 2002,  
Call No.
006.4 D86P2 (Chapters on Feature Selection and Extraction)
5. Association Rules, Apriori Algorithm
Mining association rules between sets of items in large databases, Rakesh Agrawal, Tomasz Imieliński, Arun Swami, ACM SIGMOD Record , Proceedings of the 1993 ACM SIGMOD international conference on Management of data,  Vol. 22(2)

Fast Algorithms for Mining Association Rules,
R. Agrawal, R. Srikant, Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994.

Chapter 6 of Han & Kamber book.

6. Association Rules, FP-growth algorithm, Generalized Association Rules Mining Frequent Patterns without Candidate Generation, (Slides), J. Han, J. Pei, and Y. Yin, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00),  200

Mining Generalized Association Rules, R. Srikant, R. Agrawal, Proc. of the 21st Int'l Conference on Very Large Databases, 1995

Chapter 6 of Han & Kamber book.

7. Clustering: Overview
Data clustering: a review, A. K. Jain, M. N. Murty, P. J. Flynn, ACM Computing Surveys, Vol. 31, Issue 3, Pages: 264 - 323, 1999

Chapter 8 of Han & Kamber book.
8. Partitional Clustering: ScaleKM, CLARANS
Scaling Clustering Algorithms to Large Databases,  P.S. Bradley, U. Fayyad and C. Reina, Knowledge Discovery and Data Mining, 1998, pp. 9-15.

Efficient and Effective Clustering Methods for Spatial Data Mining, R. T. Ng and J. Han, Proceedings of VLDB, pages 144-155, September 1994.
9. Hierarchical Clustering: BIRCH, CURE
BIRCH: an efficient data clustering method for very large databases , Tian Zhang, Raghu Ramakrishnan, Miron Livny, ACM SIGMOD, Vol. 25(2), Pages 103-114, 1996

CURE: An efficient clustering algorithm for large databases,
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, ACM SIGMOD Record , Proceedings of the 1998 ACM SIGMOD international conference on Management of data Volume 27 Issue 2                                                                                                                                               
10. Density based clustering: DBSCAN, OPTICS
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Ester M., Kriegel H.-P., Sander J., Xu X., Proc. 2nd int. Conf. on Knowledge Discovery and Data Mining (KDD ‘96), Portland, Oregon, 1996, AAAI Press, 1996.

OPTICS: Ordering Points To Identify the Clustering Structure , Ankerst M., Breunig M. M., Kriegel H.-P., Sander J., Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA, 1999, pp. 49-60.

11. Grid based clustering: STING, CLIQUE
STING: A Statistical Information Grid Approach to Spatial Data Mining, W. Wang, J. Yang, and R.R. Muntz, Proceedings of the 23rd VLDB Conference, Athens, Greece, Aug 1997.

Automatic subspace clustering of high dimensional data for data mining applications, Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, WA, Pages: 94 - 105
12. Model based clustering: Expectation Maximization,
Clustering categorical data: ROCK
Scaling EM algorithm to large database, P.S. Bradley, U. Fayyad, C. Reina, Microsoft Technical Report, 1999.

ROCK: A robust clustering algorithm for categorical attributes, S. Guha, R. Rastogi and K. Shim, Information Systems, 2002

13. Outlier Detection: Statistical, Distance based, exception based, discovery driven exploration of data cubes.
Chapter 8 of Han and Kamber book

Distance-based outliers: algorithms and applications
,
Edwin M. Knorr, Raymond T. Ng, Vladimir Tucakov, The International Journal on Very Large Data BasesVolume 8 Issue 3-4February 2000.
 
A Linear Method for Deviation Detection in Large Databases, A. Arning, R. Agrawal, P. Raghavan Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August, 1996.

Discovery-driven exploration of OLAP data cubes, S. Sarawagi, R. Agrawal, N. Megiddo, Proc. of the Sixth Int'l Conference on Extending Database Technulogy (EDBT), Valencia, Spain, March 1998.

14. Classification: Overview, Decision trees
Chapter 7 of Han and Kamber book
15. Scalable Decision Trees: SLIQ
SLIQ: A Fast Scalable Classifier for Data Mining, M. Mehta, R. Agrawal and J. Rissanen, Proc. of the Fifth Int'l Conference on Extending Database Technulogy, Avignon, France, March 1996.
16. Scalable Decision Trees: SPRINT, RainForest
SPRINT: A Scalable Parallel Classifier for Data Mining, J.C. Shafer, R. Agrawal, M. Mehta, Proc. of the 22th Int'l Conference on Very Large Databases, Mumbai (Bombay), India, Sept. 1996.

RAINFOREST - A Framework for Fast Decision Tree Construction of Large Datasets. J. E. Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti, Proceedings of the Twenty-fourth International Conference on Very Large Data Bases, New York, New York, 1998.
17. Support Vector Machines
A Tutorial on Support Vector Machines for Pattern Recognition, C. J. C. Burges. Knowledge Discovery and Data Mining, 2(2), 1998.
18. Bayes Classifier, k-nearest neighbor rule
Pattern Classification, Duda, Hart & Stork, Prentice Hall India, 2002,  
Call No.
006.4 D86P2 (Chapters on Bayes Decision Rule and k-nearest neighbor rule)
19. Metaclassifiers: Boosting, Error Correcting Output Codes, Query by Committee
Ensemble Learning. T. G. Dietterich, In The Handbook of Brain Theory and Neural Networks, Second edition, (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. Postscript Preprint.

A short introduction to boosting, Y. Freund and R.E. Schapire.Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999.

Solving Multiclass Learning Problems via Error-Correcting Output Codes. Dietterich, T. G., Bakiri, G.  Journal of Artificial Intelligence Research 2: 263-286, 1995. Postscript file.

Query by committee, H. S. Seung, M. Opper, H. Sompolinsky, Proceedings of the fifth annual workshop on Computational learning theory Pages: 295 - 302
20. Overview of Statistical and Computational Learning Theory
An overview of statistical learning theory, Vapnik, V.N. IEEE Trans. Neural Networks, Vol. 10, No. 5, Page(s): 988-999, 1999, PDF Full-Text

A theory of the learnable, L. G. Valiant, Communications of the ACM,  Volume 27, Issue 11, Pages: 1134 - 1142, 1984.
21. Sequence Mining: Sequential Pattern Mining
Mining Sequential Patterns, R. Agrawal, R. Srikant, Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
22. Motif discovery, Clustering, Similarity Search in Time Series Data
Finding Motifs in Time Series, J. Lin, E. Keogh, P. Patel, and S. Lonardi, In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th  ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada. July 23-26, 2002.

Efficient Similarity Search in Sequence Databases, R. Agrawal, C. Faloutsos, A. Swami, Proc. of the 4th Int'l Conference on Foundations of Data Organization and Algorithms, Chicago, Oct. 1993, Also in Lecture Notes in Computer Science 730, Springer Verlag, 1993, 69-84.

An enhanced representation of time series which allows fast and accurate classification clustering and relevance feedback. Keogh, E. and Pazzani, M.,  In 4th International Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27-31. 1998, pp 239-243. [pdf]

23. Streaming Data Mining: Computing model, Sampling, Histogram
Querying and Mining Data Streams: You Only Get One Look, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi, Tutorial at  2002 ACM SIGMOD International Conference on Management of Data (SIGMOD'2002), Madison, Wisconsin, June 2002.
24. Sketches
Processing Complex Aggregate Queries over Data Streams. Alin Dobra, Minos Garofalakis, J. E. Gehrke, and Rajeev Rastogi. In Proceedings of the 2002 ACM Sigmod International Conference on Management of Data, Madison, Wisconsin, June 2002.
25. Clustering Data Streams
Clustering Data Streams: Theory and Practice, Sudipto Guha, Nina Mishra, Adam Meyerson, Rajeev Motwani and Liadan O'Callaghan. IEEE Transactions on Knowledge and Data Engineering, 15(3): 515-528, May/June 2003.
26. Knowledge evaluation: Interestingness
What makes patterns interesting in knowledge discovery systems, Silberschatz, A.; Tuzhilin, A.; Knowledge and Data Engineering, IEEE Transactions on  ,Vol: 8(6) , 1996  Pages:970 - 974,  [PDF Full-Text (708KB)]  
27. Visualization: Data Visualization
Multidimensional Information Visualization: A Survey, P. E. Hoffman and G. G. Grinstein, In Information Visualization in Data Mining and Knowledge Discovery, U Fayyad, G. G. Grinstein and A. Wierse (Eds.) Morgan Kauffman, 2001,
Link: http://ivpr.cs.uml.edu/~phoffman/aviz/program.htm#url1

28. Model Visualization
Visualizing Data Mining Models, Kurt Thearling, Barry Becker, Dennis DeCoste, Bill Mawby, Michel Pilote, and Dan Sommerfield, In Information Visualization in Data Mining and Knowledge Discovery, U Fayyad, G. G. Grinstein and A. Wierse (Eds.) Morgan Kauffman, 2001,
Link: http://www.thearling.com/text/dmviz/modelviz.htm
29. Web Mining: Overview
Web mining research: a survey, Raymond Kosala, Hendrik Blockeel, ACM SIGKDD Explorations, Vol. 2(1), pp. 1-15, June 2000.
30. Document Models, Latent Semantic Indexing, Hyperlink analysis: PageRank
Indexing by latent semantic analysis, Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A., Journal of the Society for Information Science, 41(6), 391-407, 1990, PDF version.

The PageRank Citation Ranking: Bringing Order to the Web, Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry, Proceedings of the SIGCHI conference on Human factors in computing systems, The Hague, Netherlands, pp. 145-152, 2000pdf
31. HITS, Search Engine Architecture
Authoritative sources in a hyperlinked environment. J. Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999).

The Anatomy of a Large-Scale Hypertextual Web Search Engine.
S. Brin and L. Page,WWW7 / Computer Networks 30(1-7): 107-117 (1998), http://www-db.stanford.edu/~backrub/google.html
32. Web Page Classification, Web Usage Mining
Mining the Web, Soumen Chakrabarty, Morgan Kauffman, 2003

Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, Jaideep Srivastava,Robert Cooley, Mukund Deshpande,Pang-Ning Tan, SIGKDD Explorations, Vol. 1, Issue 2, 2000. (Postscript)
Conclusion


Further Readings:

Data Warehouse and OLAP:

1. The Data Warehousing Information Center

2. On Computing the Data Cube, S. Sarawagi, R. Agrawal, A. Gupta, Research Report 10026, IBM Almaden Research Center, San Jose, California, 1996.

3. Implementing Data Cubes Efficiently, J. Ullman, V. Harinarayan and A. Rajaraman, ACM SIGMOD 1996.

Data Summaries:

1. Synopsis Data Structures for Massive Data Sets  P. B. Gibbons and Y. Matias. Tech. Report, Bell Laboratories, Murray Hill, NJ, September 1998. In External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 50, 1999, Powerpoint presentation.

2. The New Jersey Data Reduction Report, D. Barbara, C. Faloutsos, J. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. Ross, and K.C. SevcikIEEE Data Engineering Bulletin, 20(4):3--45, 1997.
 
Attribute Selection:

1. A Taxonomy of Feature Selection Methods, H. Liu et al.

2. Toward optimal feature selection, D. Koller and M. Sahami. Proceedings of the 13th International Conference on Machine Learning (ICML), Bari, Italy, July 1996, pages 284--292.

Association Rules:

1. Query flocks: a generalization of association-rule mining, Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, Arnon Rosenthal, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, 1998, Pages: 1 - 12.

2. Integrating association rule mining with databases: alternatives and implications, S. Sarawagi, S. Thomas and R. Agrawal. Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Seattle, Washington, June 1998. PDF format.

Clustering:

1. An Impossibility Theorem for Clustering. J. Kleinberg. Advances in Neural Information Processing Systems (NIPS) 15, 2002. (In PDF.)

2. On spectral clustering: Analysis and an algorithm. A. Y. Ng, M. I. Jordan, and Y. Weiss. In T. Dietterich, S. Becker and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems (NIPS) 14, 2002.

3.  A Unified Framework for Model-based Clustering, S. Zhong and J. Ghosh, JMLR, 2003, pg: 1001-1037, pdf.



Useful Links:

KDNuggets

UCI KDD Dataset Archive

IBM Intelligent Miner

Weka 3 - Data Mining with Open Source Machine Learning Software in Java

Data Mining: Concepts and Techniques by J. Han and M. Kamber (book page)