PESIT - Bangalore South Campus
Hosur Road (1 km before Electronic City), Bangalore 560 100
Department of MCA

COURSE INFORMATION SHEET
Data Warehousing and Data Mining (17MCA442)

1. GENERAL INFORMATION:

Academic Year: 2018
Semester: IV

Title: Data Warehousing and Data Mining
Code: 17MCA442
Duration (hrs): Lectures 48 Hrs, Seminars 4 Hrs, Total 52 Hrs

2. PRE REQUIREMENT STATEMENT:

Data warehousing and data mining are two major areas of exploration for knowledge discovery in databases. These topics gained great relevance in the 1990s and early 2000s, with web data growing at an exponential rate. As businesses and scientific institutions collect more data, knowledge exploration techniques are needed to gain useful business intelligence. This course covers a wide spectrum of industry-standard techniques, using widely available database and tool packages for knowledge discovery. Data mining deals with relatively unstructured data, for which more sophisticated techniques are needed. The course aims to cover powerful data mining techniques, including clustering, association rules, and classification. It then teaches high-volume data processing mechanisms by building warehouse schemas such as snowflake and star. OLAP query retrieval techniques are also introduced.

Students should be familiar with statistical concepts. Some background in calculus, linear algebra, and computer science may also be helpful.

3. COURSE DESCRIPTION:

This course gives an introduction to the methods and theory for developing data warehouses and for analyzing data using data mining. It covers:
- Data quality, and methods and techniques for preprocessing of data.
- Modeling and design of data warehouses.
- Algorithms for classification, clustering and association rule analysis.
- Practical use of software for data analysis.

4. LEARNING OUTCOMES

After completing the subject Data Warehousing and Data Mining, the student will be able to:
- Identify the techniques of classification and clustering, and calculate distances using centroids.
- Demonstrate knowledge of data preprocessing and data quality.
- Design data warehouses.
- Apply the acquired knowledge to understand data and select suitable methods for data analysis.

5. FACULTY DETAILS:

Faculty Name: Mrs. Jayanthi R.
Department: MCA
Room Number: 504
Phone Number: 8951112398
Mail-id: jayanthir@pes.edu
Contact Hours: College Hours
Consultation Time: By E-Mail

6. VENUE AND HOURS/WEEK:

All lectures will normally be held in Rooms 500, 501 and 506, 5th Floor.
Lecture Hours/week: 4 Hrs
All the laboratory sessions will be held in Rooms 500 and 506, 5th Floor.

7. MODULE MAP:

Each block below lists one chapter with its reference literature, the percentage of portions covered (for the chapter and cumulative), and the topic to be covered in each class (numbered by Class #).

Data Warehousing
Reference Literature: R2: Chapter 3
% of portions covered: Chapter 11.54, Cumulative 11.54
  1. Introduction
  2. Operational data stores, ETL
  3. Data warehouses
  4. Design issues
  5. Guidelines for data warehousing implementation
  6. Data warehouse metadata

Online Analytical Processing (OLAP)
Reference Literature: R2: Chapter 4
% of portions covered: Chapter 11.54, Cumulative 23.08
  7. Introduction
  8. Characteristics of OLAP system
  9. Multidimensional view and data cube
  10. Data cube implementations
  11. Data cube operations
  12. Implementation of OLAP and overview of OLAP software

Data Mining
Reference Literature: T1: Chapter 1, 2
% of portions covered: Chapter 11.54, Cumulative 34.62
  13. Introduction
  14. Challenges, data mining tasks
  15. Types of data, data preprocessing
  16. Measures of similarity and dissimilarity
  17. Measures of similarity and dissimilarity (contd.)
  18. Data mining applications

Association Analysis: Basic Concepts and Algorithms
Reference Literature: T1: Chapter 6
% of portions covered: Chapter 15.38, Cumulative 50.00
  19. Frequent itemset generation
  20. Rule generation
  21. Compact representation of frequent itemsets
  22. Alternative methods for generating frequent itemsets
  23. Alternative methods for generating frequent itemsets (contd.)
  24. FP-growth algorithm
  25. FP-growth algorithm (contd.)
  26. Evaluation of association patterns

Classification
Reference Literature: T1: Chapter 4, 5 (5.1-5.3)
% of portions covered: Chapter 23.08, Cumulative 73.08
  27. Basics, general approach to solving a classification problem
  28. Decision tree
  29. Decision tree
  30. Rule-based classifier
  31. Rule-based classifier (contd.)
  32. Rule-based classifier (contd.)
  33. Nearest-neighbor classifier
  34. Bayesian classifiers
  35. Estimating predictive accuracy of classification methods
  36. Improving accuracy of classification methods
  37. Evaluation criteria for classification methods
  38. Multiclass problem

Clustering Techniques
Reference Literature: T1: Chapter 8, 9; R2: Chapter 7
% of portions covered: Chapter 15.38, Cumulative 88.46
  39. Overview, features of cluster analysis
  40. Types of data and computing distance
  41. Types of data and computing distance (contd.)
  42. Types of cluster analysis methods
  43. Partitional methods
  44. Hierarchical methods
  45. Density-based methods
  46. Quality and validity of cluster analysis

Web Mining
Reference Literature: R2: Chapter 10
% of portions covered: Chapter 11.54, Cumulative 100
  47. Introduction
  48. Web content mining
  49. Text mining
  50. Unstructured text, text clustering
  51. Mining spatial and temporal databases
  52. Mining spatial and temporal databases (contd.)

8. RECOMMENDED BOOKS/JOURNALS/WEBSITES

Text Books:
1. Jiawei Han and Micheline Kamber: Data Mining - Concepts and Techniques, 2nd Edition, Morgan Kaufmann Publisher, 2006.
2. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2005.

Reference Books:
1. Arun K Pujari: Data Mining Techniques, 2nd Edition, University Press, 2009.
2. G. K. Gupta: Introduction to Data Mining with Case Studies, 3rd Edition, PHI, New Delhi, 2009.
3. Alex Berson and Stephen J. Smith: Data Warehousing, Data Mining, and OLAP, McGraw-Hill Publisher, 1997.

9. ASSIGNMENTS

Students have to submit 2 assignments: the first before the first internal exam and the second before the second internal exam.

Assignment questions:

1. Define: association analysis, itemset, transaction width, association rule.

2. For the following market basket transactions, compute the support and confidence for the rule {Milk, Diapers} -> {Beer}.

   Tid   Itemsets
   1     {Bread, Milk}
   2     {Bread, Diapers, Beer, Eggs}
   3     {Milk, Diapers, Beer, Cola}
   4     {Bread, Milk, Diapers, Beer}
   5     {Bread, Milk, Diapers, Cola}

3. Explain the Apriori principle and illustrate it with an itemset lattice.

4. Write the Apriori algorithm for finding frequent itemsets. Find the frequent patterns generated using Apriori for the following set of transactions.

   Tid   List of item_id's
   100   L1, L2, L5
   200   L2, L4
   300   L2, L3
   400   L1, L2, L4
   500   L1, L3
   600   L2, L3
   700   L1, L3
   800   L1, L2, L3, L5
   900   L1, L2, L3

5. Define maximal frequent itemsets and closed frequent itemsets.

6. Explain alternative methods for generating frequent itemsets.

7. Draw the FP-tree for the following transactions.

   Tid   List of Items
   100   {M, O, N, K, E, Y}
   200   {D, O, N, K, E, Y}
   300   {M, A, K, E}
   400   {M, U, C, K, Y}
   500   {C, O, O, K, I, E}

8. Explain frequent itemset generation in the FP-growth algorithm.

10. THEORY ASSESSMENT

WRITTEN EXAMINATION

Paper Structure:
  No. of questions: 8 main questions
  No. of questions to be answered: 5
  Exam date:
  Paper duration: 3 Hrs
  Total marks: 100
  Pass marks: 40

CONTINUOUS ASSESSMENT

  Parameters       Weighting
  Test(s):         15 Marks
  Assignment(s):   3 Marks
  Attendance(s):   2 Marks
  Total Marks:     20 Marks

11. QUESTION BANK

Each question is followed by its marks.

1. What is data mining? (5 marks)
2. Mention the data mining functionalities: classification, prediction, clustering and evolution analysis. (5 marks)
3. What are the challenges in the methodology of data mining technology? (5 marks)
4. Discuss the issues to consider during data mining. (5 marks)
5. What defines a data mining task? Explain at least 5 primitives. (5 marks)
6. What is knowledge discovery? (5 marks)
7. Explain the motivating challenges in the development of data mining. (5 marks)
8. Explain the data mining tasks with examples. (5 marks)

1. What is data? What do you mean by quality of data? (4 marks)
2. What is a data set? Explain the various types of data sets. (10 marks)
3. What is data preprocessing?
4. Explain the following, giving examples (5 marks each):
   i. Aggregation
   ii. Sampling
   iii. Dimensionality reduction
   iv. Feature subset selection
   v. Feature creation
   vi. Discretization and binarization
   vii. Variable transformation
5. Explain the similarity and dissimilarity between 2 objects. (6 marks)
6. What is Euclidean distance? Write the generalized Minkowski distance metric for various values of r. (8 marks)
7. Explain the properties of Euclidean distance. (6 marks)
8. What are the simple matching coefficient and the Jaccard coefficient? Explain with examples. (8 marks)
9. What is meant by cosine similarity? Explain with an example. (6 marks)
10. What is Bregman divergence? (5 marks)
11. What are the issues related to proximity measures? (10 marks)
12. Discuss the selection of the right proximity measure. (7 marks)
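
As a study aid for questions 6 to 9 above, the proximity measures they name can be sketched in a few lines of Python. This is only an illustrative sketch using the standard textbook definitions. The function names are hypothetical, and returning 0.0 for the Jaccard coefficient of two all-zero vectors is an assumption made here, since that case is otherwise undefined.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length numeric vectors.

    r = 1 gives the city-block distance, r = 2 the Euclidean distance,
    and r = infinity the supremum (maximum coordinate difference) distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if math.isinf(r):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def simple_matching(x, y):
    """Simple matching coefficient: matching attributes / all attributes."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Jaccard coefficient for binary vectors: 1-1 matches / attributes not both 0."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_both_zero = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: treat two all-zero vectors as having similarity 0.0.
    return both_one / not_both_zero if not_both_zero else 0.0


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    p, q = (0, 0), (3, 4)
    for r in (1, 2, math.inf):
        print("Minkowski r =", r, ":", minkowski(p, q, r))
    a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    b = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
    print("SMC:", simple_matching(a, b), "Jaccard:", jaccard(a, b))
```

The two binary vectors in the demonstration show why the two coefficients can disagree: the simple matching coefficient counts shared zeros as agreement, while the Jaccard coefficient ignores them.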

1. Define classification. Explain the purposes of using a classification model. (6 marks)
2. Explain the general approach for building a classification model. (10 marks)
3. What is a decision tree? How does a decision tree work? (10 marks)
4. Explain Hunt's algorithm for inducing decision trees. (10 marks)
5. What are the various methods for expressing attribute test conditions? Explain with examples. (12 marks)
6. Explain the measures that can be used to determine the best way to split the records. (12 marks)
7. Explain the decision tree induction algorithm. (10 marks)
8. What are the various characteristics of decision tree induction? (12 marks)
9. Explain the rule-based classifier with an example. (5 marks)
10. Explain how a rule-based classifier works, with a suitable example. (6 marks)
11. Discuss the rule-based ordering scheme and the class-based ordering scheme. (10 marks)
12. Explain the direct methods of extracting classification rules. (8 marks)
13. Explain the indirect methods for rule extraction. (8 marks)
14. What are the characteristics of rule-based classifiers? (10 marks)
15. Explain the nearest-neighbor classifier. (6 marks)
16. Discuss the k-nearest neighbor classification algorithm. (8 marks)
17. Explain the characteristics of nearest-neighbor classifiers. (8 marks)
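
As a study aid for questions 15 to 17 above, the k-nearest neighbor idea can be sketched as follows. This is a minimal illustration and not the textbook's pseudocode. It assumes numeric feature vectors, Euclidean distance and simple majority voting, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest records.

    train: list of (feature_vector, label) pairs with numeric features.
    query: feature vector with the same number of features.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training records")
    # Sort training records by Euclidean distance to the query (nearest first).
    by_distance = sorted(train, key=lambda rec: math.dist(rec[0], query))
    # Count the class labels among the k nearest records.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-class training set in two dimensions.
    training = [
        ((1.0, 1.0), "A"),
        ((1.5, 2.0), "A"),
        ((1.2, 0.8), "A"),
        ((5.0, 5.0), "B"),
        ((6.0, 5.5), "B"),
    ]
    print(knn_classify(training, (1.1, 1.0), k=3))
    print(knn_classify(training, (5.5, 5.0), k=3))
```

Because the classifier keeps all training records and defers the work to prediction time, each query costs a full pass over the training set in this sketch. The choice of k and of the distance measure both affect the result.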