Course on Data Mining (581550-4)


Course on Data Mining (581550-4)
o Intro/Ass. Rules 24./26.10.
o Episodes 30.10. 7.11.
o Home Exam
o Clustering 14.11.
o KDD Process 21.11.
o Text Mining 28.11.
o Appl./Summary

Course on Data Mining (581550-4) - Today 22.11.2001
Today's subject:
o KDD Process
Next week's program:
o Lecture: Data mining applications, future, summary
o Exercise: KDD Process
o Seminar: KDD Process

KDD process - Overview
o Overview
o Preprocessing
o Post-processing
o Summary

What is KDD?
A process! Aim: the selection and processing of data for
o the identification of novel, accurate, and useful patterns, and
o the modeling of real-world phenomena
Data mining is a major component of the KDD process.

Typical KDD process (diagram): Operational database -> Selection -> Target data set (raw data) -> Preprocessing -> Input data (cleaned, verified, focused) -> Data mining -> Results -> Postprocessing / selection -> Selected usable patterns -> Utilization

Phases of the KDD process (1)
o Learning the domain
o Eval. of interestingness
o Preprocessing
o Creating a target data set
o Data cleaning, integration and transformation
o Data reduction and projection
o Choosing the DM task

Phases of the KDD process (2)
o Choosing the DM algorithm(s)
o Data mining: search
o Postprocessing
o Pattern evaluation and interpretation
o Knowledge presentation
o Use of discovered knowledge

Preprocessing - overview
o Why data preprocessing?
o Data cleaning
o Data integration and transformation
o Data reduction

Why data preprocessing?
Aim: to select the data relevant to the task at hand to be mined.
Data in the real world is dirty:
o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
o noisy: containing errors or outliers
o inconsistent: containing discrepancies in codes or names
No quality data, no quality mining results!

Measures of data quality
o accuracy
o completeness
o consistency
o timeliness
o believability
o value added
o interpretability
o accessibility

Preprocessing tasks (1)
Data cleaning
o fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
o integration of multiple databases, files, etc.
Data transformation
o normalization and aggregation

Preprocessing tasks (2)
Data reduction (including discretization)
o obtains a reduced representation of the data that is much smaller in volume, but produces the same or similar analytical results
o data discretization is part of data reduction, but of particular importance, especially for numerical data

Preprocessing tasks (3)
o Data cleaning
o Data integration
o Data transformation
o Data reduction

Data cleaning tasks
o Fill in missing values
o Identify outliers and smooth out noisy data
o Correct inconsistent data

Missing Data
Data is not always available. Missing data may be due to:
o equipment malfunction
o data that was inconsistent with other recorded data, and thus deleted
o data not entered due to misunderstanding
o certain data not being considered important at the time of entry
o history or changes of the data not being registered
Missing data may need to be inferred.

How to Handle Missing Data? (1)
Ignore the tuple
o usually done when the class label is missing
o not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually
o tedious + infeasible?
Use a global constant to fill in the missing value
o e.g., "unknown", a new class?!

How to Handle Missing Data? (2)
Use the attribute mean to fill in the missing value
Use the attribute mean of all samples belonging to the same class to fill in the missing value
o a smarter solution than using the overall attribute mean
Use the most probable value to fill in the missing value
o inference-based tools such as decision tree induction or a Bayesian formalism
o regression

Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitations
o inconsistency in naming conventions
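The fill-in strategies on the two "How to Handle Missing Data?" slides are easy to try out; a minimal sketch, assuming pandas is available and using a hypothetical income/class table:

```python
import pandas as pd

# Hypothetical toy data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# Use a global constant ("unknown") to fill in the missing value.
const_filled = df["income"].fillna("unknown")

# Use the attribute mean to fill in the missing value.
mean_filled = df["income"].fillna(df["income"].mean())

# Use the attribute mean of all samples belonging to the same class.
class_mean_filled = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))

print(mean_filled.tolist())        # [50.0, 40.0, 30.0, 40.0, 40.0]
print(class_mean_filled.tolist())  # [50.0, 50.0, 30.0, 35.0, 40.0]
```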

How to Handle Noisy Data?
Binning
o smooth a sorted data value by looking at the values around it
Clustering
o detect and remove outliers
Combined computer and human inspection
o detect suspicious values and have them checked by a human
Regression
o smooth by fitting the data to regression functions

Binning methods (1)
Equal-depth (frequency) partitioning
o sort the data and partition it into N intervals (bins), each containing approximately the same number of samples
o smooth by bin means, bin medians, bin boundaries, etc.
o good data scaling
o managing categorical attributes can be tricky
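Equal-depth partitioning is what pandas' qcut computes; a small sketch, assuming pandas and using the price values from the example on the next slide (here unsorted):

```python
import pandas as pd

# Hypothetical unsorted price measurements (same values as in the example below).
prices = pd.Series([28, 4, 21, 9, 25, 15, 34, 8, 26, 21, 29, 24])

# Equal-depth (frequency) partitioning into N = 3 bins:
# each bin gets approximately the same number of samples.
bins = pd.qcut(prices, q=3)
print(bins.value_counts().sort_index())   # 4 samples per bin
```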

Binning methods (2)
Equal-width (distance) partitioning
o divide the range into N intervals of equal size (uniform grid)
o if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
o the most straightforward method
o outliers may dominate the presentation
o skewed data is not handled well

Equal-depth binning - Example
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-depth) bins:
o Bin 1: 4, 8, 9, 15
o Bin 2: 21, 21, 24, 25
o Bin 3: 26, 28, 29, 34
Smoothing by bin means:
o Bin 1: 9, 9, 9, 9
o Bin 2: 23, 23, 23, 23
o Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
o Bin 1: 4, 4, 4, 15
o Bin 2: 21, 21, 25, 25
o Bin 3: 26, 26, 26, 34
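Both the equal-width formula W = (B - A)/N and the smoothing example above can be checked in a few lines of plain Python; the tie-breaking rule for boundaries (closest boundary, ties to the lower one) is an assumption the slide does not spell out:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
N = 3

# Equal-width partitioning: interval width W = (B - A) / N.
A, B = min(prices), max(prices)
W = (B - A) / N
print(W)  # 10.0

# Equal-depth bins from the slide, and the two smoothing variants.
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```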

Data Integration (1)
Data integration
o combines data from multiple sources into a coherent store
Schema integration
o integrate metadata from different sources
o entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id and B.cust-# refer to the same attribute

Data Integration (2)
Detecting and resolving data value conflicts
o for the same real-world entity, attribute values from different sources differ
o possible reasons: different representations, different scales, e.g., metric vs. British units
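A sketch of both integration steps (schema integration and value-conflict resolution), assuming two hypothetical sources where B calls the key cust-# and stores weights in pounds:

```python
import pandas as pd

# Hypothetical source A (metric units) and source B (different schema, British units).
a = pd.DataFrame({"cust_id": [1, 2], "weight_kg": [70.0, 82.5]})
b = pd.DataFrame({"cust-#": [3, 4], "weight_lb": [154.0, 200.0]})

# Entity identification: A.cust-id and B.cust-# refer to the same attribute.
b = b.rename(columns={"cust-#": "cust_id"})

# Data value conflict: convert pounds to kilograms before integrating.
b["weight_kg"] = b["weight_lb"] / 2.20462
b = b.drop(columns=["weight_lb"])

integrated = pd.concat([a, b], ignore_index=True)
print(integrated)
```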

Handling Redundant Data
Redundant data occur often when multiple databases are integrated
o the same attribute may have different names in different databases
o one attribute may be derived from an attribute in another table, e.g., annual revenue
Redundant data may be detected by correlation analysis
Careful integration of data from multiple sources may
o help to reduce/avoid redundancies and inconsistencies
o improve mining speed and quality

Data Transformation
Smoothing: remove noise from the data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: scale values to fall within a small, specified range, e.g.,
o min-max normalization
o normalization by decimal scaling
Attribute/feature construction
o new attributes constructed from the given ones
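The correlation-based redundancy check and the two normalizations named above each take only a few lines; the attribute names below are hypothetical:

```python
import math
import pandas as pd

df = pd.DataFrame({
    "monthly_revenue": [10.0, 20.0, 30.0, 40.0],
    "annual_revenue":  [120.0, 240.0, 360.0, 480.0],   # derived attribute -> redundant
    "age":             [23.0, 35.0, 47.0, 61.0],
})

# Redundant data may be detected by correlation analysis.
print(df.corr().loc["monthly_revenue", "annual_revenue"])   # ~1.0 -> candidate for removal

# Min-max normalization: scale 'age' to the range [0, 1].
age = df["age"]
age_minmax = (age - age.min()) / (age.max() - age.min())

# Normalization by decimal scaling: divide by 10**j so that max(|v'|) < 1.
j = math.ceil(math.log10(age.abs().max()))
age_decimal = age / 10 ** j

print(age_minmax.round(3).tolist())   # [0.0, 0.316, 0.632, 1.0]
print(age_decimal.tolist())           # [0.23, 0.35, 0.47, 0.61]
```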

Data Reduction
Data reduction
o obtains a reduced representation of the data set that is much smaller in volume
o produces the same (or almost the same) analytical results as the original data
Data reduction strategies
o dimensionality reduction
o numerosity reduction
o discretization and concept hierarchy generation

Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
o select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the distribution given the values of all features
o fewer attributes also mean fewer patterns, which are easier to understand
Heuristic methods (due to the exponential number of choices):
o step-wise forward selection
o step-wise backward elimination
o combining forward selection and backward elimination
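Step-wise forward selection can be sketched as a greedy loop; the scoring function below (cross-validated accuracy of a decision tree on the iris data) is just one possible choice, assuming scikit-learn is available:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining = list(range(X.shape[1]))
selected = []
best_score = 0.0

# Greedy step-wise forward selection: add the attribute that helps most,
# stop when no remaining attribute improves the score any more.
while remaining:
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:
        break
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)

print("Reduced attribute set:", selected, "score:", round(best_score, 3))
```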

Dimensionality Reduction - Example
Initial attribute set: {A1, A2, A3, A4, A5, A6}
(Decision tree induction keeps only the attributes actually used for splitting: A4 at the root, then A1 and A6, with leaves labelled Class 1 / Class 2.)
=> Reduced attribute set: {A1, A4, A6}

Numerosity Reduction
Parametric methods
o assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
o e.g., regression analysis, log-linear models
Non-parametric methods
o do not assume a model
o e.g., histograms, clustering, sampling
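A sketch contrasting the two families of numerosity reduction on synthetic data: the parametric variant keeps only two regression coefficients, the non-parametric ones keep a histogram or a 1% sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=10_000)
y = 3.0 * x + 5.0 + rng.normal(0.0, 2.0, size=10_000)

# Parametric: assume a linear model, store only its two parameters, discard the data.
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))      # close to 3.0 and 5.0

# Non-parametric: store a 20-bucket histogram, or a 1% random sample.
counts, edges = np.histogram(y, bins=20)
sample = rng.choice(y, size=len(y) // 100, replace=False)
print(counts.sum(), sample.shape)                # 10000 (100,)
```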

Discretization
Reduce the number of values of a given continuous attribute by dividing the range of the attribute into intervals
Interval labels can then be used to replace the actual data values
Some classification algorithms only accept categorical attributes

Concept Hierarchies
Reduce the data by collecting low-level concepts and replacing them with higher-level concepts
For example, replace numeric values of the attribute age by the more general values young, middle-aged, or senior
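The age example can be written directly with pandas' cut; the interval boundaries below are hypothetical, since the slide does not fix them:

```python
import pandas as pd

ages = pd.Series([13, 22, 35, 48, 67, 81])

# Replace numeric ages by higher-level concepts (boundaries chosen only for illustration).
labels = pd.cut(ages, bins=[0, 29, 59, 120], labels=["young", "middle-aged", "senior"])
print(labels.tolist())   # ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```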

Discretization and concept hierarchy generation for numeric data
o Binning
o Histogram analysis
o Clustering analysis
o Entropy-based discretization
o Segmentation by natural partitioning

Concept hierarchy generation for categorical data
o Specification of a partial ordering of attributes explicitly at the schema level by users or experts
o Specification of a portion of a hierarchy by explicit data grouping
o Specification of a set of attributes, but not of their partial ordering
o Specification of only a partial set of attributes

Specification of a set of attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy (see the sketch below):
o country: 15 distinct values
o province_or_state: 65 distinct values
o city: 3,567 distinct values
o street: 674,339 distinct values

Post-processing - overview
Post-processing
o Why data post-processing?
o Interestingness
o Visualization
o Utilization
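The distinct-values heuristic amounts to sorting the attributes by their number of distinct values; a sketch on a small hypothetical location table:

```python
import pandas as pd

# Hypothetical location data with the four attributes from the slide.
addresses = pd.DataFrame({
    "country":           ["FI", "FI", "FI", "SE", "SE"],
    "province_or_state": ["Uusimaa", "Uusimaa", "Pirkanmaa", "Skane", "Skane"],
    "city":              ["Helsinki", "Espoo", "Tampere", "Malmo", "Malmo"],
    "street":            ["Main St 1", "Park Rd 2", "Central 3", "Harbor 4", "Harbor 5"],
})

# Fewest distinct values -> highest level of the generated concept hierarchy.
hierarchy = addresses.nunique().sort_values().index.tolist()
print(hierarchy)   # ['country', 'province_or_state', 'city', 'street']
```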

Why data post-processing? (1)
Aim: to show the results, or more precisely the most interesting findings, of the data mining phase to the user(s) in an understandable way
A possible post-processing methodology:
o find all potentially interesting patterns according to some rather loose criteria
o provide flexible methods for iteratively and interactively creating different views of the discovered patterns
Other, more restrictive or focused methodologies are possible as well

Why data post-processing? (2)
A post-processing methodology is useful if
o the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns)
o there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete)
o the time needed to discover all potentially interesting patterns is not considerably longer than it would be if the discovery were focused on a small subset of potentially interesting patterns

Are all the discovered patterns interesting?
A data mining system/query may generate thousands of patterns, but are they all interesting? Usually NOT!
How could we then choose the interesting patterns? => Interestingness

Interestingness criteria (1)
Some possible criteria for interestingness:
o evidence: statistical significance of the finding?
o redundancy: similarity between findings?
o usefulness: does it meet the user's needs/goals?
o novelty: is it already prior knowledge?
o simplicity: syntactical complexity?
o generality: how many examples are covered?

Interestingness criteria (2)
One division of interestingness criteria:
o objective measures, based on statistics and the structure of patterns, e.g.,
  J-measure: statistical significance
  certainty factor: support or frequency
  strength: confidence
o subjective measures, based on the user's beliefs about the data, e.g.,
  unexpectedness: is the found pattern surprising?
  actionability: can I do something with it?

Criticism: Support & Confidence
Example (Aggarwal & Yu, PODS98):
o among 5000 students, 3000 play basketball and 3750 eat cereal; 2000 both play basketball and eat cereal
o the rule "play basketball => eat cereal" [support 40%, confidence 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
o the rule "play basketball => not eat cereal" [support 20%, confidence 33.3%] is far more accurate, although it has lower support and confidence
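The numbers on the criticism slide are easy to verify; the "interest" (lift) value introduced on the next slide drops below 1, which is what exposes the misleading rule:

```python
n_students = 5000
basketball = 3000
cereal     = 3750
both       = 2000

support    = both / n_students          # 0.40
confidence = both / basketball          # ~0.667
p_cereal   = cereal / n_students        # 0.75
lift       = support / ((basketball / n_students) * p_cereal)

print(support, round(confidence, 3), p_cereal, round(lift, 3))
# confidence (0.667) < P(eat cereal) (0.75) and lift (~0.889) < 1:
# playing basketball and eating cereal are negatively correlated.
```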

Interest
Yet another objective measure of interestingness is interest, defined as

    interest(A, B) = P(A^B) / ( P(A) * P(B) )

Properties of this measure:
o takes both P(A) and P(B) into consideration
o P(A^B) = P(A) * P(B), if A and B are independent events
o A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

J-measure
The J-measure is another objective measure of interestingness:

    J(A => B) = conf(A) * [ conf(A => B) * log( conf(A => B) / conf(B) )
                            + (1 - conf(A => B)) * log( (1 - conf(A => B)) / (1 - conf(B)) ) ]

Properties of the J-measure:
o again, takes both P(A) and P(B) into consideration
o its value is always between 0 and 1
o can be computed from pre-calculated values
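With the reconstructed formulas, both measures fit in a few lines of Python; here conf(A) and conf(B) are read as the frequencies P(A) and P(B), conf(A => B) as P(B|A), and base-2 logarithms are an assumption (the slide does not fix the base):

```python
from math import log2

def interest(p_a: float, p_b: float, p_ab: float) -> float:
    # P(A^B) / (P(A) * P(B)); 1 means independence, < 1 negative correlation.
    return p_ab / (p_a * p_b)

def j_measure(p_a: float, p_b: float, p_ab: float) -> float:
    # J(A => B) = P(A) * [ P(B|A) log(P(B|A)/P(B)) + (1-P(B|A)) log((1-P(B|A))/(1-P(B))) ]
    conf = p_ab / p_a                      # conf(A => B) = P(B|A)
    return p_a * (conf * log2(conf / p_b)
                  + (1 - conf) * log2((1 - conf) / (1 - p_b)))

# Basketball/cereal example from the previous slide: P(A)=0.6, P(B)=0.75, P(A^B)=0.4.
print(round(interest(0.6, 0.75, 0.4), 3))   # 0.889 -> negatively correlated
print(round(j_measure(0.6, 0.75, 0.4), 4))  # small positive value
```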

[Chart: Support/Frequency/J-measure - number of rules (0-3000) as a function of the threshold (0-1), shown for Datasets 1-6.]

[Chart: Confidence - number of rules (0-3000) as a function of the confidence threshold (0-1), shown for Datasets 1-6.]

Example - Selection of interesting association rules
To reduce the number of association rules that have to be considered, we could, for example, use one of the following selection criteria:
o frequency and confidence
o J-measure or interest
o maximum rule size (whole rule, left-hand side, right-hand side)
o rule attributes (e.g., templates)

Example - Problems with the selection of rules
A rule can correspond to prior knowledge or expectations
o how do we encode the background knowledge into the system?
A rule can refer to uninteresting attributes or attribute combinations
o could this be avoided by enhancing the preprocessing phase?
Rules can be redundant
o redundancy elimination by rule covers, etc.
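A sketch of how such selection criteria can be combined into one filter over discovered rules; the rule representation (item sets plus statistics) and the template semantics are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    lhs: frozenset
    rhs: frozenset
    support: float
    confidence: float

def keep(rule: Rule, min_sup=0.1, min_conf=0.7, max_size=4, template=None) -> bool:
    # Frequency and confidence thresholds.
    if rule.support < min_sup or rule.confidence < min_conf:
        return False
    # Maximum rule size (whole rule).
    if len(rule.lhs) + len(rule.rhs) > max_size:
        return False
    # Rule attributes (template): require the given attributes on the right-hand side.
    if template is not None and not template <= rule.rhs:
        return False
    return True

rules = [
    Rule(frozenset({"basketball"}), frozenset({"cereal"}), 0.40, 0.67),
    Rule(frozenset({"bread", "butter"}), frozenset({"milk"}), 0.15, 0.80),
]
# Only the bread & butter -> milk rule survives the filter.
print([r for r in rules if keep(r, template=frozenset({"milk"}))])
```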

Interpretation and evaluation of the results of data mining
Evaluation
o statistical validation and significance testing
o qualitative review by experts in the field
o pilot surveys to evaluate model accuracy
Interpretation
o tree and rule models can be read directly
o clustering results can be graphed and tabled
o code can be generated automatically by some systems

Visualization of Discovered Patterns (1)
In some cases, visualization of the results of data mining (rules, clusters, networks, ...) can be very helpful
Visualization is already important in the preprocessing phase, for selecting the appropriate data or looking at the data
Visualization requires training and practice

Visualization of Discovered Patterns (2)
Different backgrounds/usages may require different forms of representation
o e.g., rules, tables, cross-tabulations, or pie/bar charts
Concept hierarchies are also important
o discovered knowledge might be more understandable when represented at a high level of abstraction
o interactive drill-up/drill-down, pivoting, slicing and dicing provide different perspectives on the data
Different kinds of knowledge require different kinds of representation
o association, classification, clustering, etc.

Visualization
(figures)

Utilization of the results
(Pyramid: increasing potential to support business decisions, from bottom to top, with the typical user of each level.)
o Data Sources: paper, files, information providers, database systems, OLTP (DBA)
o Data Warehouses / Data Marts: OLAP, MDA
o Data Exploration: statistical analysis, querying and reporting (Data Analyst)
o Data Mining: information discovery (Data Analyst)
o Data Presentation: visualization techniques (Business Analyst)
o Making Decisions (End User)

Summary
Data mining: semi-automatic discovery of interesting patterns from large data sets
Knowledge discovery is a process:
o preprocessing
o data mining
o post-processing
o using and utilizing the knowledge

Summary
Preprocessing is important in order to get useful results!
If a loosely defined mining methodology is used, post-processing is needed in order to find the interesting results!
Visualization is useful in pre- and post-processing!
One has to be able to utilize the found knowledge!

References - KDD Process
o P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
o R. J. Brachman and T. Anand. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
o D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
o M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
o U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
o T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
o Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
o D. Keim. Visual techniques for exploring databases. Tutorial notes, KDD'97, Newport Beach, CA, USA, 1997.
o D. Keim. Visual data mining. Tutorial notes, VLDB'97, Athens, Greece, 1997.
o D. Keim and H.-P. Kriegel. Visual techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.

References - KDD Process
o M. Klemettinen. A Knowledge Discovery Methodology for Telecommunication Network Alarm Databases. Ph.D. thesis, University of Helsinki, Report A-1999-1, 1999.
o M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
o G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
o G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
o D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
o T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
o A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
o D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.

References - KDD Process
o Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
o R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

Reminder: Course Organization - Course Evaluation
Passing the course: min 30 points
o home exam: min 13 points (max 30 points)
o exercises/experiments: min 8 points (max 20 points); at least 3 returned and reported experiments
o group presentation: min 4 points (max 10 points)
Remember also the other requirements:
o attending the lectures (5/7)
o attending the seminars (4/5)
o attending the exercises (4/5)

Seminar Presentations / Groups 9-10
Visualization and data mining
o D. Keim, H.-P. Kriegel, T. Seidl: "Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE'94.

Seminar Presentations / Groups 9-10
Interestingness
o G. Piatetsky-Shapiro, C. J. Matheus: "The Interestingness of Deviations", KDD'94.

KDD process
Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides, which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.