Course on Data Mining ( )
|
|
- Kimberly Atkins
- 5 years ago
- Views:
Transcription
1 Course on Data Mining ( ) Intro/Ass. Rules 24./ Episodes Home Exam Clustering KDD Process Text Mining Appl./Summary Data mining: KDD Process 1 Course on Data Mining ( ) Today Today's subject: o KDD Process Next week's program: o Lecture: Data mining applications, future, summary o Exercise: KDD Process o Seminar: KDD Process Data mining: KDD Process 2 1
2 KDD process Overview Overview Preprocessing Post-processing Summary Data mining: KDD Process 3 What is KDD? A process! Aim: the selection and processing of data for o the identification of novel, accurate, and useful patterns, and o the modeling of real-world phenomena Data mining is a major component of the KDD process Data mining: KDD Process 4 2
3 Typical KDD process Target data set Selection Operational Database Raw data Input data Preprocessing Cleaned Verified Focused Data mining Postprocessing Selection Results Utilization Selected usable patterns Data mining: KDD Process 5 Phases of the KDD process (1) Learning the domain Eval. of interestingness Preprocessing Creating a target data set Data cleaning, integration and transformation Data reduction and projection Choosing the DM task Data mining: KDD Process 6 3
4 Phases of the KDD process (2) Choosing the DM algorithm(s) Data mining: search Postprocessing Pattern evaluation and interpretation Knowledge presentation Use of discovered knowledge Data mining: KDD Process 7 Preprocessing - overview Preprocessing Why data preprocessing? Data cleaning Data integration and transformation Data reduction Data mining: KDD Process 8 4
5 Why data preprocessing? Aim: to select the data relevant with respect to the task in hand to be mined Data in the real world is dirty o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data o noisy: containing errors or outliers o inconsistent: containing discrepancies in codes or names No quality data, no quality mining results! Data mining: KDD Process 9 Measures of data quality o accuracy o completeness o consistency o timeliness o believability o value added o interpretability o accessibility Data mining: KDD Process 10 5
6 Preprocessing tasks (1) Data cleaning o fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration o integration of multiple databases, files, etc. Data transformation o normalization and aggregation Data mining: KDD Process 11 Preprocessing tasks (2) Data reduction (including discretization) o obtains reduced representation in volume, but produces the same or similar analytical results o data discretization is part of data reduction, but with particular importance, especially for numerical data Data mining: KDD Process 12 6
7 Preprocessing tasks (3) Data Cleaning Data Integration Data Transformation Data Reduction Data mining: KDD Process 13 Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Data mining: KDD Process 14 7
8 Missing Data Data is not always available Missing data may be due to o equipment malfunction o inconsistent with other recorded data, and thus deleted o data not entered due to misunderstanding o certain data may not be considered important at the time of entry o not register history or changes of the data Missing data may need to be inferred Data mining: KDD Process 15 How to Handle Missing Data? (1) Ignore the tuple o usually done when the class label is missing o not effective, when the percentage of missing values per attribute varies considerably Fill in the missing value manually o tedious + infeasible? Use a global constant to fill in the missing value o e.g., unknown, a new class?! Data mining: KDD Process 16 8
9 How to Handle Missing Data? (2) Use the attribute mean to fill in the missing value Use the attribute mean for all samples belonging to the same class to fill in the missing value o smarter solution than using the general attribute mean Use the most probable value to fill in the missing value o inference-based tools such as decision tree induction or a Bayesian formalism o regression Data mining: KDD Process 17 Noisy Data Noise: random error or variance in a measured variable Incorrect attribute values may due to o faulty data collection instruments o data entry problems o data transmission problems o technology limitation o inconsistency in naming convention Data mining: KDD Process 18 9
10 How to Handle Noisy Data? Binning o smooth a sorted data value by looking at the values around it Clustering o detect and remove outliers Combined computer and human inspection o detect suspicious values and check by human Regression o smooth by fitting the data into regression functions Data mining: KDD Process 19 Binning methods (1) Equal-depth (frequency) partitioning o sort data and partition into bins, N intervals, each containing approximately same number of samples o smooth by bin means, bin median, bin boundaries, etc. o good data scaling o managing categorical attributes can be tricky Data mining: KDD Process 20 10
11 Binning methods (2) Equal-width (distance) partitioning o divide the range into N intervals of equal size: uniform grid o if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. o the most straightforward o outliers may dominate presentation o skewed data is not handled well Data mining: KDD Process 21 Equal-depth binning - Example Sorted data for price (in dollars): o 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equal-depth) bins: o Bin 1: 4, 8, 9, 15 o Bin 2: 21, 21, 24, 25 o Bin 3: 26, 28, 29, 34 Smoothing by bin means: o Bin 1: 9, 9, 9, 9 o Bin 2: 23, 23, 23, 23 o Bin 3: 29, 29, 29, 29 by bin boundaries: o Bin 1: 4, 4, 4, 15 o Bin 2: 21, 21, 25, 25 o Bin 3: 26, 26, 26, Data mining: KDD Process 22 11
12 Data Integration (1) Data integration o combines data from multiple sources into a coherent store Schema integration o integrate metadata from different sources o entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-# Data mining: KDD Process 23 Data Integration (2) Detecting and resolving data value conflicts o for the same real world entity, attribute values from different sources are different o possible reasons: different representations, different scales, e.g., metric vs. British units Data mining: KDD Process 24 12
13 Handling Redundant Data Redundant data occur often, when multiple databases are integrated o the same attribute may have different names in different databases o one attribute may be a derived attribute in another table, e.g., annual revenue Redundant data may be detected by correlation analysis Careful integration of data from multiple sources may o help to reduce/avoid redundancies and inconsistencies o improve mining speed and quality Data mining: KDD Process 25 Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range, e.g., o min-max normalization o normalization by decimal scaling Attribute/feature construction o new attributes constructed from the given ones Data mining: KDD Process 26 13
14 Data Reduction Data reduction o obtains a reduced representation of the data set that is much smaller in volume o produces the same (or almost the same) analytical results as the original data Data reduction strategies o dimensionality reduction o numerosity reduction o discretization and concept hierarchy generation Data mining: KDD Process 27 Dimensionality Reduction Feature selection (i.e., attribute subset selection): o select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features o reduce the number of patterns in the patterns, easier to understand Heuristic methods (due to exponential # of choices): o step-wise forward selection o step-wise backward elimination o combining forward selection and backward elimination Data mining: KDD Process 28 14
15 Dimensionality Reduction - Example Initial attribute set: {A1, A2, A3, A4, A5, A6} A4? A1? A6? Class 1 Class 2 Class 1 Class 2 > Reduced attribute set: {A1, A4, A6} Data mining: KDD Process 29 Numerosity Reduction Parametric methods o assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) o e.g., regression analysis, log-linear models Non-parametric methods o do not assume models o e.g., histograms, clustering, sampling Data mining: KDD Process 30 15
16 Discretization Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Some classification algorithms only accept categorical attributes Data mining: KDD Process 31 Concept Hierarchies Reduce the data by collecting and replacing low level concepts by higher level concepts For example, replace numeric values for the attribute age by more general values young, middle-aged, or senior Data mining: KDD Process 32 16
17 Discretization and concept hierarchy generation for numeric data Binning Histogram analysis Clustering analysis Entropy-based discretization Segmentation by natural partitioning Data mining: KDD Process 33 Concept hierarchy generation for categorical data Specification of a partial ordering of attributes explicitly at the schema level by users or experts Specification of a portion of a hierarchy by explicit data grouping Specification of a set of attributes, but not of their partial ordering Specification of only a partial set of attributes Data mining: KDD Process 34 17
18 Specification of a set of attributes Concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy. country province_or_ state city street 15 distinct values 65 distinct values 3567 distinct values distinct values Data mining: KDD Process 35 Post-processing - overview Post-processing Why data postprocessing? Interestingness Visualization Utilization Data mining: KDD Process 36 18
19 Why data post-processing? (1) Aim: to show the results, or more precisely the most interesting findings, of the data mining phase to a user/users in an understandable way A possible post-processing methodology: o find all potentially interesting patterns according to some rather loose criteria o provide flexible methods for iteratively and interactively creating different views of the discovered patterns Other more restrictive or focused methodologies possible as well Data mining: KDD Process 37 Why data post-processing? (2) A post-processing methodology is useful, if o the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns) o there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete) o the time requirement for discovering all potentially interesting patterns is not considerably longer than, if the discovery was focused to a small subset of potentially interesting patterns Data mining: KDD Process 38 19
20 Are all the discovered pattern interesting? A data mining system/query may generate thousands of patterns, but are they all interesting? Usually NOT! How could we then choose the interesting patterns? => Interestingness Data mining: KDD Process 39 Interestingness criteria (1) Some possible criteria for interestingness: o evidence: statistical significance of finding? o redundancy: similarity between findings? o usefulness: meeting the user's needs/goals? o novelty: already prior knowledge? o simplicity: syntactical complexity? o generality: how many examples covered? Data mining: KDD Process 40 20
21 Interestingness criteria(2) One division of interestingness criteria: o objective measures that are based on statistics and structures of patterns, e.g., J-measure: statistical significance certainty factor: support or frequency strength: confidence o subjective measures that are based on user s beliefs in the data, e.g., unexpectedness: is the found pattern surprising?" actionability: can I do something with it?" Data mining: KDD Process 41 Criticism: Support & Confidence Example: (Aggarwal & Yu, PODS98) o among 5000 students 3000 play basketball, 3750 eat cereal 2000 both play basket ball and eat cereal o the rule play basketball eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7% o the rule play basketball not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence Data mining: KDD Process 42 21
22 Interest Yet another objective measure for interestingness is interest that is defined as P( A B) P( A) P( B) Properties of this measure: o takes both P(A) and P(B) in consideration: o P(A^B)=P(B)*P(A), if A and B are independent events o A and B negatively correlated, if the value is less than 1; otherwise A and B positively correlated Data mining: KDD Process 43 Also J-measure J-measure J measure (1 conf ( A = conf ( A) ( conf ( A B) ) 1 conf ( A B) B) log( 1 conf ( B) conf ( A B) log + conf ( B) )) is an objective measure for interestingness Properties of J-measure: o again, takes both P(A) and P(B) in consideration o value is always between 0 and 1 o can be computed using pre-calculated values Data mining: KDD Process 44 22
23 Support/Frequency/J-measure Rules ,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 Threshold Dataset 1 Dataset 2 Dataset 3 Dataset 4 Dataset 5 Dataset Data mining: KDD Process 45 Confidence Rules ,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 Confidence threshold Dataset 1 Dataset 2 Dataset 3 Dataset 4 Dataset 5 Dataset Data mining: KDD Process 46 23
24 Example Selection of Interesting Association Rules For reducing the number of association rules that have to be considered, we could, for example, use one of the following selection criteria: o frequency and confidence o J-measure or interest o maximum rule size (whole rule, left-hand side, right-hand side) o rule attributes (e.g., templates) Data mining: KDD Process 47 Example Problems with selection of rules A rule can correspond to prior knowledge or expectations o how to encode the background knowledge into the system? A rule can refer to uninteresting attributes or attribute combinations o could this be avoided by enhancing the preprocessing phase? Rules can be redundant o redundancy elimination by rule covers etc Data mining: KDD Process 48 24
25 Interpretation and evaluation of the results of data mining Evaluation o statistical validation and significance testing o qualitative review by experts in the field o pilot surveys to evaluate model accuracy Interpretation o tree and rule models can be read directly o clustering results can be graphed and tabled o code can be automatically generated by some systems Data mining: KDD Process 49 Visualization of Discovered Patterns (1) In some cases, visualization of the results of data mining (rules, clusters, networks ) can be very helpful Visualization is actually already important in the preprocessing phase in selecting the appropriate data or in looking at the data Visualization requires training and practice Data mining: KDD Process 50 25
26 Visualization of Discovered Patterns (2) Different backgrounds/usages may require different forms of representation o e.g., rules, tables, cross-tabulations, or pie/bar chart Concept hierarchy is also important o discovered knowledge might be more understandable when represented at high level of abstraction o interactive drill up/down, pivoting, slicing and dicing provide different perspective to data Different kinds of knowledge require different kinds of representation o association, classification, clustering, etc Data mining: KDD Process 51 Visualization Data mining: KDD Process 52 26
27 Data mining: KDD Process 53 Utilization of the results Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP DBA Data mining: KDD Process 54 27
28 Summary Data mining: semi-automatic discovery of interesting patterns from large data sets Knowledge discovery is a process: o preprocessing o data mining o post-processing o using and utilizing the knowledge Data mining: KDD Process 55 Summary Preprocessing is important in order to get useful results! If a loosely defined mining methodology is used, postprocessing is needed in order to find the interesting results! Visualization is useful in preand post-processing! One has to be able to utilize the found knowledge! Data mining: KDD Process 56 28
29 References KDD Process P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, R.J. Brachman, T. Anand. The process of knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of ACM, 42:73-78, M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8: , U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December D. Keim, Visual techniques for exploring databases. Tutorial notes in KDD 97, Newport Beach, CA, USA, D. Keim, Visual data mining. Tutorial notes in VLDB 97, Athens, Greece, D. Keim, and H.P. Krieger, Visual techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), Data mining: KDD Process 57 References KDD Process M. Klemettinen, A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A , M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM 94, Gaithersburg, Maryland, Nov G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, T. Redman. Data Quality: Management and Technology. Bantam Books, New York, A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8: , Dec D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June Data mining: KDD Process 58 29
30 References KDD Process Y. Wand and R. Wang. Anchoring data quality dimensions ontological foundations. Communications of ACM, 39:86-95, R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7: , Data mining: KDD Process 59 Reminder: Course Organization Course Evaluation Passing the course: min 30 points o home exam: min 13 points (max 30 points) o exercises/experiments: min 8 points (max 20 points) at least 3 returned and reported experiments o group presentation: min 4 points (max 10 points) Remember also the other requirements: o attending the lectures (5/7) o attending the seminars (4/5) o attending the exercises (4/5) Data mining: KDD Process 60 30
31 Seminar Presentations/Groups 9-10 Visualization and data mining D. Keim, H.P., Kriegel, T. Seidl: Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE Data mining: KDD Process 61 Seminar Presentations/Groups 9-10 Interestingness G. Piatetsky-Shapiro, C.J. Matheus: The Interestingness of Deviations, KDD Data mining: KDD Process 62 31
32 KDD process Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides Data mining: KDD Process 63 32
Data Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationPreprocessing Short Lecture Notes cse352. Professor Anita Wasilewska
Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept
More informationData Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality
Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing
More informationUNIT 2 Data Preprocessing
UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and
More informationData Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha
Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights
More informationData Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA
Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data
More informationData Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1395
Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 15 Table of contents 1 Introduction 2 Data preprocessing
More informationData Mining. Data preprocessing. Hamid Beigy. Sharif University of Technology. Fall 1394
Data Mining Data preprocessing Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 15 Table of contents 1 Introduction 2 Data preprocessing
More informationUNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES
UNIT 2. DATA PREPROCESSING AND ASSOCIATION RULES Data Pre-processing-Data Cleaning, Integration, Transformation, Reduction, Discretization Concept Hierarchies-Concept Description: Data Generalization And
More information2. Data Preprocessing
2. Data Preprocessing Contents of this Chapter 2.1 Introduction 2.2 Data cleaning 2.3 Data integration 2.4 Data transformation 2.5 Data reduction Reference: [Han and Kamber 2006, Chapter 2] SFU, CMPT 459
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES 2: Data Pre-Processing Instructor: Yizhou Sun yzsun@ccs.neu.edu September 10, 2013 2: Data Pre-Processing Getting to know your data Basic Statistical Descriptions of Data
More information3. Data Preprocessing. 3.1 Introduction
3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation
More informationSummary of Last Chapter. Course Content. Chapter 3 Objectives. Chapter 3: Data Preprocessing. Dr. Osmar R. Zaïane. University of Alberta 4
Principles of Knowledge Discovery in Data Fall 2004 Chapter 3: Data Preprocessing Dr. Osmar R. Zaïane University of Alberta Summary of Last Chapter What is a data warehouse and what is it for? What is
More informationData Preprocessing. Data Mining 1
Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogenous sources.
More informationBy Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad
By Mahesh R. Sanghavi Associate professor, SNJB s KBJ CoE, Chandwad Data Analytics life cycle Discovery Data preparation Preprocessing requirements data cleaning, data integration, data reduction, data
More informationECT7110. Data Preprocessing. Prof. Wai Lam. ECT7110 Data Preprocessing 1
ECT7110 Data Preprocessing Prof. Wai Lam ECT7110 Data Preprocessing 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest,
More informationCS 521 Data Mining Techniques Instructor: Abdullah Mueen
CS 521 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 2: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No. 02 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationData Preprocessing in Python. Prof.Sushila Aghav
Data Preprocessing in Python Prof.Sushila Aghav Sushila.aghav@mitcoe.edu.in Content Why preprocess the data? Descriptive data summarization Data cleaning Data integration and transformation April 24, 2018
More informationJarek Szlichta
Jarek Szlichta http://data.science.uoit.ca/ Open data Business Data Web Data Available at different formats 2 Data Scientist: The Sexiest Job of the 21 st Century Harvard Business Review Oct. 2012 (c)
More informationData Mining: Concepts and Techniques
Data Mining: Concepts and Techniques Slides for Textbook Chapter 4 October 17, 2006 Data Mining: Concepts and Techniques 1 Chapter 4: Data Mining Primitives, Languages, and System Architectures Data mining
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.
More informationData Preprocessing. Outline. Motivation. How did this happen?
Outline Data Preprocessing Motivation Data cleaning Data integration and transformation Data reduction Discretization and hierarchy generation Summary CS 5331 by Rattikorn Hewett Texas Tech University
More informationData Mining: Concepts and Techniques. Chapter 2
Data Mining: Concepts and Techniques Chapter 2 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 2006 Jiawei Han and Micheline Kamber, All rights
More informationK236: Basis of Data Science
Schedule of K236 K236: Basis of Data Science Lecture 6: Data Preprocessing Lecturer: Tu Bao Ho and Hieu Chi Dam TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai 1. Introduction to data science
More informationData Preprocessing. Komate AMPHAWAN
Data Preprocessing Komate AMPHAWAN 1 Data cleaning (data cleansing) Attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. 2 Missing value
More informationcse634 Data Mining Preprocessing Lecture Notes Chapter 2 Professor Anita Wasilewska
cse634 Data Mining Preprocessing Lecture Notes Chapter 2 Professor Anita Wasilewska Chapter 2: Data Preprocessing (book slide) Why preprocess the data? Descriptive data summarization Data cleaning Data
More informationData Preprocessing. Erwin M. Bakker & Stefan Manegold. https://homepages.cwi.nl/~manegold/dbdm/
Data Preprocessing Erwin M. Bakker & Stefan Manegold https://homepages.cwi.nl/~manegold/dbdm/ http://liacs.leidenuniv.nl/~bakkerem2/dbdm/ s.manegold@liacs.leidenuniv.nl e.m.bakker@liacs.leidenuniv.nl 9/26/17
More informationData Mining: Concepts and Techniques
Data Mining: Concepts and Techniques Chapter 2 Original Slides: Jiawei Han and Micheline Kamber Modification: Li Xiong Data Mining: Concepts and Techniques 1 Chapter 2: Data Preprocessing Why preprocess
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights
More informationData Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation
Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization
More informationData preprocessing Functional Programming and Intelligent Algorithms
Data preprocessing Functional Programming and Intelligent Algorithms Que Tran Høgskolen i Ålesund 20th March 2017 1 Why data preprocessing? Real-world data tend to be dirty incomplete: lacking attribute
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2013 Han, Kamber & Pei. All rights
More informationData Collection, Preprocessing and Implementation
Chapter 6 Data Collection, Preprocessing and Implementation 6.1 Introduction Data collection is the loosely controlled method of gathering the data. Such data are mostly out of range, impossible data combinations,
More informationDatabase and Knowledge-Base Systems: Data Mining. Martin Ester
Database and Knowledge-Base Systems: Data Mining Martin Ester Simon Fraser University School of Computing Science Graduate Course Spring 2006 CMPT 843, SFU, Martin Ester, 1-06 1 Introduction [Fayyad, Piatetsky-Shapiro
More informationA Survey on Data Preprocessing Techniques for Bioinformatics and Web Usage Mining
Volume 117 No. 20 2017, 785-794 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu A Survey on Data Preprocessing Techniques for Bioinformatics and Web
More informationData Mining and Analytics. Introduction
Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data
More informationData Mining: Concepts and Techniques. Chapter 2
Data Mining: Concepts and Techniques Chapter 2 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 2006 Jiawei Han and Micheline Kamber, All rights
More informationInformation Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 03 : 13/10/2015 Data Mining: Concepts and Techniques (3 rd ed.) Chapter
More informationChapter 2 Data Preprocessing
Chapter 2 Data Preprocessing CISC4631 1 Outline General data characteristics Data cleaning Data integration and transformation Data reduction Summary CISC4631 2 1 Types of Data Sets Record Relational records
More informationGUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV
GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATIONS (MCA) Semester: IV Subject Name: Elective I Data Warehousing & Data Mining (DWDM) Subject Code: 2640005 Learning Objectives: To understand
More informationRoad Map. Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary
2. Data preprocessing Road Map Data types Measuring data Data cleaning Data integration Data transformation Data reduction Data discretization Summary 2 Data types Categorical vs. Numerical Scale types
More informationDATA WAREHOUING UNIT I
BHARATHIDASAN ENGINEERING COLLEGE NATTRAMAPALLI DEPARTMENT OF COMPUTER SCIENCE SUB CODE & NAME: IT6702/DWDM DEPT: IT Staff Name : N.RAMESH DATA WAREHOUING UNIT I 1. Define data warehouse? NOV/DEC 2009
More informationChapter 1, Introduction
CSI 4352, Introduction to Data Mining Chapter 1, Introduction Young-Rae Cho Associate Professor Department of Computer Science Baylor University What is Data Mining? Definition Knowledge Discovery from
More informationDATA PREPROCESSING. Tzompanaki Katerina
DATA PREPROCESSING Tzompanaki Katerina Background: Data storage formats Data in DBMS ODBC, JDBC protocols Data in flat files Fixed-width format (each column has a specific number of characters, filled
More informationData Mining Course Overview
Data Mining Course Overview 1 Data Mining Overview Understanding Data Classification: Decision Trees and Bayesian classifiers, ANN, SVM Association Rules Mining: APriori, FP-growth Clustering: Hierarchical
More informationECLT 5810 Data Preprocessing. Prof. Wai Lam
ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
More informationDynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering
Dynamic Optimization of Generalized SQL Queries with Horizontal Aggregations Using K-Means Clustering Abstract Mrs. C. Poongodi 1, Ms. R. Kalaivani 2 1 PG Student, 2 Assistant Professor, Department of
More informationThis tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.
About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationInterestingness Measurements
Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected
More informationData Preparation. Data Preparation. (Data pre-processing) Why Prepare Data? Why Prepare Data? Some data preparation is needed for all mining tools
Data Preparation Data Preparation (Data pre-processing) Why prepare the data? Discretization Data cleaning Data integration and transformation Data reduction, Feature selection 2 Why Prepare Data? Why
More informationData Mining. Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University
Data Mining Yi-Cheng Chen ( 陳以錚 ) Dept. of Computer Science & Information Engineering, Tamkang University Why Mine Data? Commercial Viewpoint Lots of data is being collected and warehoused Web data, e-commerce
More informationData Exploration and Preparation Data Mining and Text Mining (UIC Politecnico di Milano)
Data Exploration and Preparation Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining, : Concepts and Techniques", The Morgan Kaufmann
More informationData Mining: Concepts and Techniques
Data Mining: Concepts and Techniques Slides for Textbook Chapter 1 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser University, Canada
More informationTable Of Contents: xix Foreword to Second Edition
Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data
More informationCse634 DATA MINING TEST REVIEW. Professor Anita Wasilewska Computer Science Department Stony Brook University
Cse634 DATA MINING TEST REVIEW Professor Anita Wasilewska Computer Science Department Stony Brook University Preprocessing stage Preprocessing: includes all the operations that have to be performed before
More informationNew Orleans, Louisiana, February/March Knowledge Discovery from Telecommunication. Network Alarm Databases. K. Hatonen M. Klemettinen H.
To appear in the 12th International Conference on Data Engineering (ICDE'96), New Orleans, Louisiana, February/March 1996. Knowledge Discovery from Telecommunication Network Alarm Databases K. Hatonen
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.
More informationInterestingness Measurements
Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected
More informationAssociation Rule Selection in a Data Mining Environment
Association Rule Selection in a Data Mining Environment Mika Klemettinen 1, Heikki Mannila 2, and A. Inkeri Verkamo 1 1 University of Helsinki, Department of Computer Science P.O. Box 26, FIN 00014 University
More informationData Mining: Exploring Data. Lecture Notes for Data Exploration Chapter. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Data Exploration Chapter Introduction to Data Mining by Tan, Steinbach, Karpatne, Kumar 02/03/2018 Introduction to Data Mining 1 What is data exploration?
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3
Data Mining: Exploring Data Lecture Notes for Chapter 3 1 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include
More informationSCHEME OF COURSE WORK. Data Warehousing and Data mining
SCHEME OF COURSE WORK Course Details: Course Title Course Code Program: Specialization: Semester Prerequisites Department of Information Technology Data Warehousing and Data mining : 15CT1132 : B.TECH
More informationHandling Missing Values via Decomposition of the Conditioned Set
Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,
More informationDiscovering interesting rules from financial data
Discovering interesting rules from financial data Przemysław Sołdacki Institute of Computer Science Warsaw University of Technology Ul. Andersa 13, 00-159 Warszawa Tel: +48 609129896 email: psoldack@ii.pw.edu.pl
More informationWhat is Data Mining?
Introduction What is Data Mining? Data Mining: Concepts and Techniques Slides for Course Data Mining Chapter 1 Jiawei Han 1 Necessity Is the Mother of Invention Data explosion problem Automated data collection
More informationTo Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set
To Enhance Scalability of Item Transactions by Parallel and Partition using Dynamic Data Set Priyanka Soni, Research Scholar (CSE), MTRI, Bhopal, priyanka.soni379@gmail.com Dhirendra Kumar Jha, MTRI, Bhopal,
More informationCS570 Introduction to Data Mining
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and attributes Data exploration Data pre-processing 2 10 What is Data?
More informationCT75 (ALCCS) DATA WAREHOUSING AND DATA MINING JUN
Q.1 a. Define a Data warehouse. Compare OLTP and OLAP systems. Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and 2 Non volatile collection of data in support of management
More informationQuestion Bank. 4) It is the source of information later delivered to data marts.
Question Bank Year: 2016-2017 Subject Dept: CS Semester: First Subject Name: Data Mining. Q1) What is data warehouse? ANS. A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile
More informationINSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad
INSTITUTE OF AERONAUTICAL ENGINEERING (Autonomous) Dundigal, Hyderabad - 500 043 INFORMATION TECHNOLOGY DEFINITIONS AND TERMINOLOGY Course Name : DATA WAREHOUSING AND DATA MINING Course Code : AIT006 Program
More informationReducing Redundancy in Characteristic Rule Discovery by Using IP-Techniques
Reducing Redundancy in Characteristic Rule Discovery by Using IP-Techniques Tom Brijs, Koen Vanhoof and Geert Wets Limburg University Centre, Faculty of Applied Economic Sciences, B-3590 Diepenbeek, Belgium
More information9. Conclusions. 9.1 Definition KDD
9. Conclusions Contents of this Chapter 9.1 Course review 9.2 State-of-the-art in KDD 9.3 KDD challenges SFU, CMPT 740, 03-3, Martin Ester 419 9.1 Definition KDD [Fayyad, Piatetsky-Shapiro & Smyth 96]
More informationDynamic Data in terms of Data Mining Streams
International Journal of Computer Science and Software Engineering Volume 1, Number 1 (2015), pp. 25-31 International Research Publication House http://www.irphouse.com Dynamic Data in terms of Data Mining
More informationDta Mining and Data Warehousing
CSCI645 Fall 23 Dta Mining and Data Warehousing Instructor: Qigang Gao, Office: CS219, Tel:494-3356, Email: qggao@cs.dal.ca Teaching Assistant: Christopher Jordan, Email: cjordan@cs.dal.ca Office Hours:
More informationData Mining. Chapter 1: Introduction. Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei
Data Mining Chapter 1: Introduction Adapted from materials by Jiawei Han, Micheline Kamber, and Jian Pei 1 Any Question? Just Ask 3 Chapter 1. Introduction Why Data Mining? What Is Data Mining? A Multi-Dimensional
More informationCS377: Database Systems Data Warehouse and Data Mining. Li Xiong Department of Mathematics and Computer Science Emory University
CS377: Database Systems Data Warehouse and Data Mining Li Xiong Department of Mathematics and Computer Science Emory University 1 1960s: Evolution of Database Technology Data collection, database creation,
More informationContents. Foreword to Second Edition. Acknowledgments About the Authors
Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1
More informationAssociation Rules Mining:References
Association Rules Mining:References Zhou Shuigeng March 26, 2006 AR Mining References 1 References: Frequent-pattern Mining Methods R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm
More informationData Mining. Introduction. Piotr Paszek. (Piotr Paszek) Data Mining DM KDD 1 / 44
Data Mining Piotr Paszek piotr.paszek@us.edu.pl Introduction (Piotr Paszek) Data Mining DM KDD 1 / 44 Plan of the lecture 1 Data Mining (DM) 2 Knowledge Discovery in Databases (KDD) 3 CRISP-DM 4 DM software
More informationPost-Processing of Discovered Association Rules Using Ontologies
Post-Processing of Discovered Association Rules Using Ontologies Claudia Marinica, Fabrice Guillet and Henri Briand LINA Ecole polytechnique de l'université de Nantes, France {claudia.marinica, fabrice.guillet,
More informationSponsored by AIAT.or.th and KINDML, SIIT
CC: BY NC ND Table of Contents Chapter 2. Data Preprocessing... 31 2.1. Basic Representation for Data: Database Viewpoint... 31 2.2. Data Preprocessing in the Database Point of View... 33 2.3. Data Cleaning...
More informationData Preprocessing. Chapter Why Preprocess the Data?
Contents 2 Data Preprocessing 3 2.1 Why Preprocess the Data?........................................ 3 2.2 Descriptive Data Summarization..................................... 6 2.2.1 Measuring the Central
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler, Sanjay Ranka
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler, Sanjay Ranka Topics What is data? Definitions, terminology Types of data and datasets Data preprocessing Data Cleaning Data integration
More informationKeywords Fuzzy, Set Theory, KDD, Data Base, Transformed Database.
Volume 6, Issue 5, May 016 ISSN: 77 18X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Fuzzy Logic in Online
More informationReduce convention for Large Data Base Using Mathematical Progression
Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 4 (2016), pp. 3577-3584 Research India Publications http://www.ripublication.com/gjpam.htm Reduce convention for Large Data
More informationDiscovery of Association Rules in Temporal Databases 1
Discovery of Association Rules in Temporal Databases 1 Abdullah Uz Tansel 2 and Necip Fazil Ayan Department of Computer Engineering and Information Science Bilkent University 06533, Ankara, Turkey {atansel,
More informationCse352 Artifficial Intelligence Short Review for Midterm. Professor Anita Wasilewska Computer Science Department Stony Brook University
Cse352 Artifficial Intelligence Short Review for Midterm Professor Anita Wasilewska Computer Science Department Stony Brook University Midterm Midterm INCLUDES CLASSIFICATION CLASSIFOCATION by Decision
More informationOverview. Data-mining. Commercial & Scientific Applications. Ongoing Research Activities. From Research to Technology Transfer
Data Mining George Karypis Department of Computer Science Digital Technology Center University of Minnesota, Minneapolis, USA. http://www.cs.umn.edu/~karypis karypis@cs.umn.edu Overview Data-mining What
More informationPESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore
Data Warehousing Data Mining (17MCA442) 1. GENERAL INFORMATION: PESIT- Bangalore South Campus Hosur Road (1km Before Electronic city) Bangalore 560 100 Department of MCA COURSE INFORMATION SHEET Academic
More informationData Preprocessing UE 141 Spring 2013
Data Preprocessing UE 141 Spring 2013 Jing Gao SUNY Buffalo 1 Outline Data Data Preprocessing Improve data quality Prepare data for analysis Exploring Data Statistics Visualization 2 Document Data Each
More informationData warehouse and Data Mining
Data warehouse and Data Mining Lecture No. 14 Data Mining and its techniques Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology
More informationData Preprocessing. Data Mining: Concepts and Techniques. c 2012 Elsevier Inc. All rights reserved.
3 Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size (often several gigabytes or more) and their likely origin
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SHRI ANGALAMMAN COLLEGE OF ENGINEERING & TECHNOLOGY (An ISO 9001:2008 Certified Institution) SIRUGANOOR,TRICHY-621105. DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year / Semester: IV/VII CS1011-DATA
More informationAT78 DATA MINING & WAREHOUSING JUN 2015
Q.2 a. Where is data mining is used? (4) Applications: 1) Military uses 2) Medical field 3) Business Intelligence 4) Intelligence agencies (security purpose in communication and other fields) 5) Data Retrieval
More informationInformation Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 05(b) : 23/10/2012 Data Mining: Concepts and Techniques (3 rd ed.) Chapter
More informationKnowledge Modelling and Management. Part B (9)
Knowledge Modelling and Management Part B (9) Yun-Heh Chen-Burger http://www.aiai.ed.ac.uk/~jessicac/project/kmm 1 A Brief Introduction to Business Intelligence 2 What is Business Intelligence? Business
More information