Data Analytics

(CS61061)

Ever wondered what a data scientist does? How does one unravel mysteries from a large volume of data, not only by discovering present trends but also by gaining insights about the future? The Data Analytics course aims to address some of the many problems faced by data scientists today.

Timings: Thursday 15:00-16:55, Friday 14:00-15:55, at NC 244, Nalanda Complex

Announcements


Date Message
11 November 2023 Announcement of projects. Last date of submission: 01.12.2023. Project Allocation.
03 October 2023 The attendance record till 27.09.2023 is available here.
18 September 2023 The Mid-Semester test of the course will be held on 26.09.2023 (FN). All students should check their allotted rooms and timing on the ERP.
13 September 2023 The class test of the course will be held on 15.09.2023 in NR 221 from 16:30 hours. All students should occupy their seats by 16:15 hours at the latest.
16 August 2023 The class of the course on 17.08.2023 stands canceled.
10 August 2023 The class of the course on 10.08.2023 stands canceled.
04 August 2023 The course calendar is announced here.
02 August 2023 The first class of the course will be held on 04.08.2023.

Course Overview & Objectives


This course will cover fundamental algorithms and techniques used in Data Analytics. The statistical foundations will be covered first, followed by various machine learning and data mining algorithms. Technological aspects like data management (Hadoop), scalable computation (MapReduce) and visualization will also be covered. In summary, this course will provide exposure to theory as well as practical systems and software used in data analytics.

After completing this course, you will be able to:

  • Find a meaningful pattern in data
  • Graphically interpret data
  • Implement the analytic algorithms
  • Handle large-scale analytics projects from various domains
  • Develop intelligent decision support systems

Prerequisites


To get the most out of the course, the following prerequisites are a must.

  • A strong mathematical background in Probability and Statistics
  • Proficiency with algorithms
  • Programming skills in C, Python, R, etc.
  • Critical thinking and problem-solving skills
  • A minimum of 7.0 CGPA
  • Only 3rd and 4th year students can attend

Syllabus


An outline of the course is as follows. You can also download the syllabus for your reference.

Data Definitions and Analysis Techniques

  • Elements, Variables, and Data categorization
  • Levels of Measurement
  • Data management and indexing
  • Introduction to statistical learning and R-Programming

Descriptive Statistics

  • Measures of central tendency
  • Measures of location and dispersion
  • Practice and analysis with R
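
As a rough illustration of the measures listed above, here is a minimal sketch. The course practice sessions use R; Python's standard `statistics` module is used here only for illustration, and the `scores` data and the `describe` helper are hypothetical, invented for this example.

```python
import statistics

# Hypothetical sample: exam scores invented for illustration only.
scores = [52, 61, 61, 70, 74, 78, 85, 93]


def describe(data):
    """Return common measures of central tendency and dispersion."""
    # Quartiles via the module's default (exclusive) method.
    q1, _, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "stdev": statistics.stdev(data),        # sample standard deviation
        "iqr": q3 - q1,                         # interquartile range
    }


summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The mean, median, and mode describe where the data is centred, while the range, variance, standard deviation, and interquartile range describe how widely it is spread.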

Basic Analysis Techniques

  • Basic analysis techniques
  • Statistical hypothesis generation and testing
  • Chi-Square test
  • t-Test
  • Analysis of variance
  • Correlation analysis
  • Maximum likelihood test
  • Practice and analysis with R
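
To make the t-Test item above concrete, the following is a minimal sketch of the test statistic for two independent samples, assuming the unequal-variance (Welch) form. The function name `welch_t` and the two samples are hypothetical; the sketch stops at the statistic and degrees of freedom, since turning them into a p-value needs the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)

    # Squared standard error of each sample mean.
    se2_a, se2_b = var_a / n_a, var_b / n_b

    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)

    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Hypothetical samples chosen so the arithmetic is easy to follow.
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]

t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

The statistic is compared against a critical value of the t distribution with `df` degrees of freedom to decide whether to reject the null hypothesis of equal means.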

Data Analysis Techniques

  • Regression analysis
  • Classification techniques
  • Clustering
  • Association rules analysis
  • Practice and analysis with R
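
As one small example from the list above, regression analysis in its simplest form fits a straight line by ordinary least squares. The sketch below assumes a single predictor; the function name `fit_line` and the data points are hypothetical and not taken from the course material.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

With real data the points will not lie exactly on a line, and the fitted slope and intercept are the values that minimise the sum of squared residuals.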

Case Studies and Projects

  • Understanding business scenarios
  • Feature engineering and visualization
  • Scalable and parallel computing with Hadoop and MapReduce
  • Sensitivity Analysis
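
The MapReduce model mentioned above can be sketched on a single machine to show its three stages: map, shuffle, and reduce. This is only an in-memory illustration of the programming model, not Hadoop code; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three stages in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input documents.
docs = ["big data needs big tools", "data tools scale"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; here they run sequentially so the data flow is easy to trace.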

Resources & References


The following text and reference books may be referred to for this course.

  • Probability & Statistics for Engineers & Scientists (9th Edn.), Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers and Keying Ye, Prentice Hall Inc.
  • The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edn.), Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, 2014
  • An Introduction to Statistical Learning: with Applications in R, G. James, D. Witten, T. Hastie and R. Tibshirani, Springer, 2013
  • Software for Data Analysis: Programming with R (Statistics and Computing), John M. Chambers, Springer
  • Mining of Massive Datasets, A. Rajaraman and J. Ullman, Cambridge University Press, 2012
  • Advances in Complex Data Modeling and Computational Methods in Statistics, Anna Maria Paganoni and Piercesare Secchi, Springer, 2013
  • Data Mining and Analysis, Mohammed J. Zaki, Wagner Meira, Cambridge, 2012
  • Hadoop: The Definitive Guide (2nd Edn.), Tom White, O'Reilly, 2014
  • MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems, Donald Miner, Adam Shook, O'Reilly, 2014
  • Beginning R: The Statistical Programming Language, Mark Gardener, Wiley, 2013

Additionally, you may look at the following materials.

Course Coordinator: Prof. Debasis Samanta


Prof. Debasis Samanta is a Professor in the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur. For details, please see http://cse.iitkgp.ac.in/~dsamanta/

For any query, you can reach Dr. Samanta at:

+91-3222-282334 (Office)

+91-3222-282335 (Residence)

+91-7797137576 (Only SMS)

d...@iitkgp.ac.in

d...@gmail.com

Teaching Assistants


  • Subrata Pain (subratabankata@gmail.com)
  • Soham Bandyopadhyay (sohamban@gmail.com)
  • Nilanjan Sinhababu (nilanjan.thecseboy@gmail.com)

Moodle


Moodle, an online course management system, will be used extensively in this course. You should sign up for the course Moodle at the earliest. Once you click on the link, you will be redirected to the CSE home page, where you will find a sign-up link at the bottom of the page.

If you have any doubts about the subject matter or the topics covered in class, you are welcome to post your query in the Discussion Forum. We will get back to you with a response as soon as possible. Moreover, your friends can help answer your query too! Discussions in the forum may be moderated.