data mining

CS698V: Data Mining

(Semester II, 2003-2004)

Instructor: Pabitra Mitra (Room No. 224, CSE Dept., EMail: pmitra@iitk.ac.in)

Lecture Timings:

Day	Time	Room
Monday	10:00-11:00	CS102
Wednesday	10:00-11:00	CS102
Thursday	10:00-11:00	CS102

Teaching Assistants: G. Sushma and Saurabh Jindal

Registered Students

Course Contents:

Data Mining is an information extraction activity whose goal is to discover hidden facts contained in databases. Using a combination of machine learning, statistical analysis, modeling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, credit risk analysis, web, and bioinformatics.

The course is intended to cover currently popular data mining tasks and algorithms and provide insight into research issues and challenges involved. The content is listed below:

Overview of the Data Mining and Knowledge Discovery from Databases Process,

Data Warehousing and OLAP,

Data Preprocessing: Summary Data Structures, Dimensionality Reduction

Association Rule Mining: Apriori Algorithm, Modifications of Apriori Algorithm

Clustering: Partitional, Hierarchical, Density Based, Grid Based

Classification: Decision Trees, Instance Based, Support Vector Machines, Computational Learning Theory

Sequence Mining

Complex Data Mining

Web Mining: Information Retrieval, Link Analysis, Search Engines, Usage Analysis

Data Mining Applications to Bioinformatics

Books and References:

1. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann/Elsevier India, 2001.

2. D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining, MIT Press, 2001.

3. Recent literature from ACM SIGMOD, VLDB, IEEE Trans. Knowledge & Data Engg., Data Mining and Knowledge Discovery, ACM SIGKDD, IEEE ICDM, SIAM Data Mining, ICML

Course Policies:

End semester exam: 40%
One project: 25%
Four assignments: 20%
One Mid semester: 15%

Assignments:

Assignment 1 (Due date: January 22, 2004 Thursday)

Assignment 2 (Due date: March 2, 2004, Tuesday)

Assignment 3 (Due date: April 2, 2004, Friday)

Assignment 4 (Due date: April 15, 2004, Thursday)

Midsemester (February 17, 2004)

Question Solution guideline for Question 6

Endsemester (May 1, 2004)

Question

Projects

Project Proposals

List of Groups and Projects

Course Material:

Lecture No. & Topic	Reading Material
1. Overview	The KDD process for extracting useful knowledge from volumes of data Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Communications of the ACM, Vol. 39(11), Nov. 1996, Pages: 27 - 34
2. Data Warehousing & OLAP	An overview of data warehousing and OLAP technology Surajit Chaudhuri, Umeshwar Dayal, ACM SIGMOD Record, Vol. 26, Mar. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. In Proc. 12th ICDE, pages 152--159, New Orleans, March 1996.
3. Data Preprocessing: Summary Data Structures, AD-Tree, KD-Tree	Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets Andrew Moore and Mary Soon Lee, Journal of Artificial Intelligence Research, Vol. 8, Mar. 1998, pp. 67-91. A tutorial on kd-trees, Andrew Moore, University of Cambridge Computer Laboratory Technical Report No. 209, 1991.
4. Data Preprocessing: Attribute Selection	Pattern Classification, Duda, Hart & Stork, Prentice Hall India, 2002, Call No. 006.4 D86P2 (Chapters on Feature Selection and Extraction)
5. Association Rules, Apriori Algorithm	Mining association rules between sets of items in large databases, Rakesh Agrawal, Tomasz Imieliński, Arun Swami, ACM SIGMOD Record , Proceedings of the 1993 ACM SIGMOD international conference on Management of data, Vol. 22(2) Fast Algorithms for Mining Association Rules, *R. Agrawal, R. Srikant, Proc. of the 20th Int'l Conference on Very Large Databases,* Santiago, Chile, Sept. 1994. Chapter 6 of Han & Kamber book.
6. Association Rules, FP-growth algorithm, Generalized Association Rules	Mining Frequent Patterns without Candidate Generation, (Slides), J. Han, J. Pei, and Y. Yin, Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), 200 Mining Generalized Association Rules, R. Srikant, R. Agrawal, Proc. of the 21st Int'l Conference on Very Large Databases, 1995 Chapter 6 of Han & Kamber book.
7. Clustering: Overview	Data clustering: a review, A. K. Jain, M. N. Murty, P. J. Flynn, ACM Computing Surveys, Vol. 31, Issue 3, Pages: 264 - 323, 1999 Chapter 8 of Han & Kamber book.
8. Partitional Clustering: ScaleKM, CLARANS	Scaling Clustering Algorithms to Large Databases, P.S. Bradley, U. Fayyad and C. Reina, Knowledge Discovery and Data Mining, 1998, pp. 9-15. Efficient and Effective Clustering Methods for Spatial Data Mining, R. T. Ng and J. Han, Proceedings of VLDB, pages 144-155, September 1994.
9. Hierarchical Clustering: BIRCH, CURE	BIRCH: an efficient data clustering method for very large databases , Tian Zhang, Raghu Ramakrishnan, Miron Livny, ACM SIGMOD, Vol. 25(2), Pages 103-114, 1996 CURE: An efficient clustering algorithm for large databases, Sudipto Guha, Rajeev Rastogi, Kyuseok Shim, ACM SIGMOD Record , Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Volume 27 Issue 2
10. Density based clustering: DBSCAN, OPTICS	A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Ester M., Kriegel H.-P., Sander J., Xu X., Proc. 2nd int. Conf. on Knowledge Discovery and Data Mining (KDD ‘96), Portland, Oregon, 1996, AAAI Press, 1996. OPTICS: Ordering Points To Identify the Clustering Structure , Ankerst M., Breunig M. M., Kriegel H.-P., Sander J., Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA, 1999, pp. 49-60.
11. Grid based clustering: STING, CLIQUE	STING: A Statistical Information Grid Approach to Spatial Data Mining, W. Wang, J. Yang, and R.R. Muntz, Proceedings of the 23rd VLDB Conference, Athens, Greece, Aug 1997. Automatic subspace clustering of high dimensional data for data mining applications, Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, WA, Pages: 94 - 105
12. Model based clustering: Expectation Maximization, Clustering categorical data: ROCK	Scaling EM algorithm to large database, P.S. Bradley, U. Fayyad, C. Reina, Microsoft Technical Report, 1999. ROCK: A robust clustering algorithm for categorical attributes, S. Guha, R. Rastogi and K. Shim, Information Systems, 2002
13. Outlier Detection: Statistical, Distance based, exception based, discovery driven exploration of data cubes.	Chapter 8 of Han and Kamber book Distance-based outliers: algorithms and applications, Edwin M. Knorr, Raymond T. Ng, Vladimir Tucakov, The International Journal on Very Large Data Bases, Volume 8 Issue 3-4, February 2000. A Linear Method for Deviation Detection in Large Databases, A. Arning, R. Agrawal, P. Raghavan Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August, 1996. Discovery-driven exploration of OLAP data cubes, S. Sarawagi, R. Agrawal, N. Megiddo, Proc. of the Sixth Int'l Conference on Extending Database Technulogy (EDBT), Valencia, Spain, March 1998.
14. Classification: Overview, Decision trees	Chapter 7 of Han and Kamber book
15. Scalable Decision Trees: SLIQ	SLIQ: A Fast Scalable Classifier for Data Mining, M. Mehta, R. Agrawal and J. Rissanen, Proc. of the Fifth Int'l Conference on Extending Database Technulogy, Avignon, France, March 1996.
16. Scalable Decision Trees: SPRINT, RainForest	SPRINT: A Scalable Parallel Classifier for Data Mining, J.C. Shafer, R. Agrawal, M. Mehta, Proc. of the 22th Int'l Conference on Very Large Databases, Mumbai (Bombay), India, Sept. 1996. RAINFOREST - A Framework for Fast Decision Tree Construction of Large Datasets. J. E. Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti, Proceedings of the Twenty-fourth International Conference on Very Large Data Bases, New York, New York, 1998.
17. Support Vector Machines	A Tutorial on Support Vector Machines for Pattern Recognition, C. J. C. Burges. Knowledge Discovery and Data Mining, 2(2), 1998.
18. Bayes Classifier, k-nearest neighbor rule	Pattern Classification, Duda, Hart & Stork, Prentice Hall India, 2002, Call No. 006.4 D86P2 (Chapters on Bayes Decision Rule and k-nearest neighbor rule)
19. Metaclassifiers: Boosting, Error Correcting Output Codes, Query by Committee	Ensemble Learning. T. G. Dietterich, In The Handbook of Brain Theory and Neural Networks, Second edition, (M.A. Arbib, Ed.), Cambridge, MA: The MIT Press, 2002. Postscript Preprint. A short introduction to boosting, Y. Freund and R.E. Schapire.Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999. Solving Multiclass Learning Problems via Error-Correcting Output Codes. Dietterich, T. G., Bakiri, G. Journal of Artificial Intelligence Research 2: 263-286, 1995. Postscript file. Query by committee, H. S. Seung, M. Opper, H. Sompolinsky, Proceedings of the fifth annual workshop on Computational learning theory, Pages: 295 - 302
20. Overview of Statistical and Computational Learning Theory	An overview of statistical learning theory, Vapnik, V.N. IEEE Trans. Neural Networks, Vol. 10, No. 5, Page(s): 988-999, 1999, PDF Full-Text A theory of the learnable, L. G. Valiant, Communications of the ACM, Volume 27, Issue 11, Pages: 1134 - 1142, 1984.
21. Sequence Mining: Sequential Pattern Mining	Mining Sequential Patterns, R. Agrawal, R. Srikant, Proc. of the Int'l Conference on Data Engineering (ICDE), Taipei, Taiwan, March 1995.
22. Motif discovery, Clustering, Similarity Search in Time Series Data	Finding Motifs in Time Series, J. Lin, E. Keogh, P. Patel, and S. Lonardi, In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada. July 23-26, 2002. Efficient Similarity Search in Sequence Databases, R. Agrawal, C. Faloutsos, A. Swami, Proc. of the 4th Int'l Conference on Foundations of Data Organization and Algorithms, Chicago, Oct. 1993, Also in Lecture Notes in Computer Science 730, Springer Verlag, 1993, 69-84. An enhanced representation of time series which allows fast and accurate classification clustering and relevance feedback. Keogh, E. and Pazzani, M., In 4th International Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27-31. 1998, pp 239-243. [pdf]
23. Streaming Data Mining: Computing model, Sampling, Histogram	Querying and Mining Data Streams: You Only Get One Look, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi, Tutorial at 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD'2002), Madison, Wisconsin, June 2002.
24. Sketches	Processing Complex Aggregate Queries over Data Streams. Alin Dobra, Minos Garofalakis, J. E. Gehrke, and Rajeev Rastogi. In Proceedings of the 2002 ACM Sigmod International Conference on Management of Data, Madison, Wisconsin, June 2002.
25. Clustering Data Streams	Clustering Data Streams: Theory and Practice, Sudipto Guha, Nina Mishra, Adam Meyerson, Rajeev Motwani and Liadan O'Callaghan. IEEE Transactions on Knowledge and Data Engineering, 15(3): 515-528, May/June 2003.
26. Knowledge evaluation: Interestingness	What makes patterns interesting in knowledge discovery systems, Silberschatz, A.; Tuzhilin, A.; Knowledge and Data Engineering, IEEE Transactions on ,Vol: 8(6) , 1996 Pages:970 - 974, [PDF Full-Text (708KB)]
27. Visualization: Data Visualization	Multidimensional Information Visualization: A Survey, P. E. Hoffman and G. G. Grinstein, In Information Visualization in Data Mining and Knowledge Discovery, U Fayyad, G. G. Grinstein and A. Wierse (Eds.) Morgan Kauffman, 2001, Link: http://ivpr.cs.uml.edu/~phoffman/aviz/program.htm#url1
28. Model Visualization	Visualizing Data Mining Models, Kurt Thearling, Barry Becker, Dennis DeCoste, Bill Mawby, Michel Pilote, and Dan Sommerfield, In Information Visualization in Data Mining and Knowledge Discovery, U Fayyad, G. G. Grinstein and A. Wierse (Eds.) Morgan Kauffman, 2001, Link: http://www.thearling.com/text/dmviz/modelviz.htm
29. Web Mining: Overview	Web mining research: a survey, Raymond Kosala, Hendrik Blockeel, ACM SIGKDD Explorations, Vol. 2(1), pp. 1-15, June 2000.
30. Document Models, Latent Semantic Indexing, Hyperlink analysis: PageRank	Indexing by latent semantic analysis, Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A., Journal of the Society for Information Science, 41(6), 391-407, 1990, PDF version. The PageRank Citation Ranking: Bringing Order to the Web, Page, Lawrence; Brin, Sergey; Motwani, Rajeev; Winograd, Terry, Proceedings of the SIGCHI conference on Human factors in computing systems, The Hague, Netherlands, pp. 145-152, 2000. pdf
31. HITS, Search Engine Architecture	Authoritative sources in a hyperlinked environment. J. Kleinberg, Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). The Anatomy of a Large-Scale Hypertextual Web Search Engine. S. Brin and L. Page,WWW7 / Computer Networks 30(1-7): 107-117 (1998), http://www-db.stanford.edu/~backrub/google.html
32. Web Page Classification, Web Usage Mining	Mining the Web, Soumen Chakrabarty, Morgan Kauffman, 2003 Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, Jaideep Srivastava,Robert Cooley, Mukund Deshpande,Pang-Ning Tan, SIGKDD Explorations, Vol. 1, Issue 2, 2000. (Postscript)
Conclusion

Further Readings:

Data Warehouse and OLAP:

1. The Data Warehousing Information Center

2. On Computing the Data Cube, S. Sarawagi, R. Agrawal, A. Gupta, Research Report 10026, IBM Almaden Research Center, San Jose, California, 1996.

3. Implementing Data Cubes Efficiently, J. Ullman, V. Harinarayan and A. Rajaraman, ACM SIGMOD 1996.

Data Summaries:

1. Synopsis Data Structures for Massive Data Sets P. B. Gibbons and Y. Matias. Tech. Report, Bell Laboratories, Murray Hill, NJ, September 1998. In External Memory Algorithms, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 50, 1999, Powerpoint presentation.

2. The New Jersey Data Reduction Report, D. Barbara, C. Faloutsos, J. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. Ross, and K.C. Sevcik, IEEE Data Engineering Bulletin, 20(4):3--45, 1997.

Attribute Selection:

1. A Taxonomy of Feature Selection Methods, H. Liu et al.

2. Toward optimal feature selection, D. Koller and M. Sahami. Proceedings of the 13th International Conference on Machine Learning (ICML), Bari, Italy, July 1996, pages 284--292.

Association Rules:

1. Query flocks: a generalization of association-rule mining, Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, Arnon Rosenthal, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, 1998, Pages: 1 - 12.

2. Integrating association rule mining with databases: alternatives and implications, S. Sarawagi, S. Thomas and R. Agrawal. Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Seattle, Washington, June 1998. PDF format.

Clustering:

1. An Impossibility Theorem for Clustering. J. Kleinberg. Advances in Neural Information Processing Systems (NIPS) 15, 2002. (In PDF.)

2. On spectral clustering: Analysis and an algorithm. A. Y. Ng, M. I. Jordan, and Y. Weiss. In T. Dietterich, S. Becker and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems (NIPS) 14, 2002.

3. A Unified Framework for Model-based Clustering, S. Zhong and J. Ghosh, JMLR, 2003, pg: 1001-1037, pdf.

Useful Links:

KDNuggets

UCI KDD Dataset Archive

IBM Intelligent Miner

Weka 3 - Data Mining with Open Source Machine Learning Software in Java

Data Mining: Concepts and Techniques by J. Han and M. Kamber (book page)