Course on Data Mining ( )

Size: px

Start display at page:

Download "Course on Data Mining ( )"

Kimberly Atkins
5 years ago
Views:

1 Course on Data Mining ( ) Intro/Ass. Rules 24./ Episodes Home Exam Clustering KDD Process Text Mining Appl./Summary Data mining: KDD Process 1 Course on Data Mining ( ) Today Today's subject: o KDD Process Next week's program: o Lecture: Data mining applications, future, summary o Exercise: KDD Process o Seminar: KDD Process Data mining: KDD Process 2 1

2 KDD process Overview Overview Preprocessing Post-processing Summary Data mining: KDD Process 3 What is KDD? A process! Aim: the selection and processing of data for o the identification of novel, accurate, and useful patterns, and o the modeling of real-world phenomena Data mining is a major component of the KDD process Data mining: KDD Process 4 2

3 Typical KDD process Target data set Selection Operational Database Raw data Input data Preprocessing Cleaned Verified Focused Data mining Postprocessing Selection Results Utilization Selected usable patterns Data mining: KDD Process 5 Phases of the KDD process (1) Learning the domain Eval. of interestingness Preprocessing Creating a target data set Data cleaning, integration and transformation Data reduction and projection Choosing the DM task Data mining: KDD Process 6 3

4 Phases of the KDD process (2) Choosing the DM algorithm(s) Data mining: search Postprocessing Pattern evaluation and interpretation Knowledge presentation Use of discovered knowledge Data mining: KDD Process 7 Preprocessing - overview Preprocessing Why data preprocessing? Data cleaning Data integration and transformation Data reduction Data mining: KDD Process 8 4

5 Why data preprocessing? Aim: to select the data relevant with respect to the task in hand to be mined Data in the real world is dirty o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data o noisy: containing errors or outliers o inconsistent: containing discrepancies in codes or names No quality data, no quality mining results! Data mining: KDD Process 9 Measures of data quality o accuracy o completeness o consistency o timeliness o believability o value added o interpretability o accessibility Data mining: KDD Process 10 5

6 Preprocessing tasks (1) Data cleaning o fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration o integration of multiple databases, files, etc. Data transformation o normalization and aggregation Data mining: KDD Process 11 Preprocessing tasks (2) Data reduction (including discretization) o obtains reduced representation in volume, but produces the same or similar analytical results o data discretization is part of data reduction, but with particular importance, especially for numerical data Data mining: KDD Process 12 6

7 Preprocessing tasks (3) Data Cleaning Data Integration Data Transformation Data Reduction Data mining: KDD Process 13 Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data Data mining: KDD Process 14 7

8 Missing Data Data is not always available Missing data may be due to o equipment malfunction o inconsistent with other recorded data, and thus deleted o data not entered due to misunderstanding o certain data may not be considered important at the time of entry o not register history or changes of the data Missing data may need to be inferred Data mining: KDD Process 15 How to Handle Missing Data? (1) Ignore the tuple o usually done when the class label is missing o not effective, when the percentage of missing values per attribute varies considerably Fill in the missing value manually o tedious + infeasible? Use a global constant to fill in the missing value o e.g., unknown, a new class?! Data mining: KDD Process 16 8

9 How to Handle Missing Data? (2) Use the attribute mean to fill in the missing value Use the attribute mean for all samples belonging to the same class to fill in the missing value o smarter solution than using the general attribute mean Use the most probable value to fill in the missing value o inference-based tools such as decision tree induction or a Bayesian formalism o regression Data mining: KDD Process 17 Noisy Data Noise: random error or variance in a measured variable Incorrect attribute values may due to o faulty data collection instruments o data entry problems o data transmission problems o technology limitation o inconsistency in naming convention Data mining: KDD Process 18 9

10 How to Handle Noisy Data? Binning o smooth a sorted data value by looking at the values around it Clustering o detect and remove outliers Combined computer and human inspection o detect suspicious values and check by human Regression o smooth by fitting the data into regression functions Data mining: KDD Process 19 Binning methods (1) Equal-depth (frequency) partitioning o sort data and partition into bins, N intervals, each containing approximately same number of samples o smooth by bin means, bin median, bin boundaries, etc. o good data scaling o managing categorical attributes can be tricky Data mining: KDD Process 20 10

11 Binning methods (2) Equal-width (distance) partitioning o divide the range into N intervals of equal size: uniform grid o if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. o the most straightforward o outliers may dominate presentation o skewed data is not handled well Data mining: KDD Process 21 Equal-depth binning - Example Sorted data for price (in dollars): o 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 Partition into (equal-depth) bins: o Bin 1: 4, 8, 9, 15 o Bin 2: 21, 21, 24, 25 o Bin 3: 26, 28, 29, 34 Smoothing by bin means: o Bin 1: 9, 9, 9, 9 o Bin 2: 23, 23, 23, 23 o Bin 3: 29, 29, 29, 29 by bin boundaries: o Bin 1: 4, 4, 4, 15 o Bin 2: 21, 21, 25, 25 o Bin 3: 26, 26, 26, Data mining: KDD Process 22 11

12 Data Integration (1) Data integration o combines data from multiple sources into a coherent store Schema integration o integrate metadata from different sources o entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id B.cust-# Data mining: KDD Process 23 Data Integration (2) Detecting and resolving data value conflicts o for the same real world entity, attribute values from different sources are different o possible reasons: different representations, different scales, e.g., metric vs. British units Data mining: KDD Process 24 12

13 Handling Redundant Data Redundant data occur often, when multiple databases are integrated o the same attribute may have different names in different databases o one attribute may be a derived attribute in another table, e.g., annual revenue Redundant data may be detected by correlation analysis Careful integration of data from multiple sources may o help to reduce/avoid redundancies and inconsistencies o improve mining speed and quality Data mining: KDD Process 25 Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing Normalization: scaled to fall within a small, specified range, e.g., o min-max normalization o normalization by decimal scaling Attribute/feature construction o new attributes constructed from the given ones Data mining: KDD Process 26 13

14 Data Reduction Data reduction o obtains a reduced representation of the data set that is much smaller in volume o produces the same (or almost the same) analytical results as the original data Data reduction strategies o dimensionality reduction o numerosity reduction o discretization and concept hierarchy generation Data mining: KDD Process 27 Dimensionality Reduction Feature selection (i.e., attribute subset selection): o select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features o reduce the number of patterns in the patterns, easier to understand Heuristic methods (due to exponential # of choices): o step-wise forward selection o step-wise backward elimination o combining forward selection and backward elimination Data mining: KDD Process 28 14

15 Dimensionality Reduction - Example Initial attribute set: {A1, A2, A3, A4, A5, A6} A4? A1? A6? Class 1 Class 2 Class 1 Class 2 > Reduced attribute set: {A1, A4, A6} Data mining: KDD Process 29 Numerosity Reduction Parametric methods o assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers) o e.g., regression analysis, log-linear models Non-parametric methods o do not assume models o e.g., histograms, clustering, sampling Data mining: KDD Process 30 15

16 Discretization Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Some classification algorithms only accept categorical attributes Data mining: KDD Process 31 Concept Hierarchies Reduce the data by collecting and replacing low level concepts by higher level concepts For example, replace numeric values for the attribute age by more general values young, middle-aged, or senior Data mining: KDD Process 32 16

17 Discretization and concept hierarchy generation for numeric data Binning Histogram analysis Clustering analysis Entropy-based discretization Segmentation by natural partitioning Data mining: KDD Process 33 Concept hierarchy generation for categorical data Specification of a partial ordering of attributes explicitly at the schema level by users or experts Specification of a portion of a hierarchy by explicit data grouping Specification of a set of attributes, but not of their partial ordering Specification of only a partial set of attributes Data mining: KDD Process 34 17

18 Specification of a set of attributes Concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy. country province_or_ state city street 15 distinct values 65 distinct values 3567 distinct values distinct values Data mining: KDD Process 35 Post-processing - overview Post-processing Why data postprocessing? Interestingness Visualization Utilization Data mining: KDD Process 36 18

19 Why data post-processing? (1) Aim: to show the results, or more precisely the most interesting findings, of the data mining phase to a user/users in an understandable way A possible post-processing methodology: o find all potentially interesting patterns according to some rather loose criteria o provide flexible methods for iteratively and interactively creating different views of the discovered patterns Other more restrictive or focused methodologies possible as well Data mining: KDD Process 37 Why data post-processing? (2) A post-processing methodology is useful, if o the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns) o there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete) o the time requirement for discovering all potentially interesting patterns is not considerably longer than, if the discovery was focused to a small subset of potentially interesting patterns Data mining: KDD Process 38 19

20 Are all the discovered pattern interesting? A data mining system/query may generate thousands of patterns, but are they all interesting? Usually NOT! How could we then choose the interesting patterns? => Interestingness Data mining: KDD Process 39 Interestingness criteria (1) Some possible criteria for interestingness: o evidence: statistical significance of finding? o redundancy: similarity between findings? o usefulness: meeting the user's needs/goals? o novelty: already prior knowledge? o simplicity: syntactical complexity? o generality: how many examples covered? Data mining: KDD Process 40 20

21 Interestingness criteria(2) One division of interestingness criteria: o objective measures that are based on statistics and structures of patterns, e.g., J-measure: statistical significance certainty factor: support or frequency strength: confidence o subjective measures that are based on user s beliefs in the data, e.g., unexpectedness: is the found pattern surprising?" actionability: can I do something with it?" Data mining: KDD Process 41 Criticism: Support & Confidence Example: (Aggarwal & Yu, PODS98) o among 5000 students 3000 play basketball, 3750 eat cereal 2000 both play basket ball and eat cereal o the rule play basketball eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7% o the rule play basketball not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence Data mining: KDD Process 42 21

22 Interest Yet another objective measure for interestingness is interest that is defined as P( A B) P( A) P( B) Properties of this measure: o takes both P(A) and P(B) in consideration: o P(A^B)=P(B)*P(A), if A and B are independent events o A and B negatively correlated, if the value is less than 1; otherwise A and B positively correlated Data mining: KDD Process 43 Also J-measure J-measure J measure (1 conf ( A = conf ( A) ( conf ( A B) ) 1 conf ( A B) B) log( 1 conf ( B) conf ( A B) log + conf ( B) )) is an objective measure for interestingness Properties of J-measure: o again, takes both P(A) and P(B) in consideration o value is always between 0 and 1 o can be computed using pre-calculated values Data mining: KDD Process 44 22

23 Support/Frequency/J-measure Rules ,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 Threshold Dataset 1 Dataset 2 Dataset 3 Dataset 4 Dataset 5 Dataset Data mining: KDD Process 45 Confidence Rules ,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 1 Confidence threshold Dataset 1 Dataset 2 Dataset 3 Dataset 4 Dataset 5 Dataset Data mining: KDD Process 46 23

24 Example Selection of Interesting Association Rules For reducing the number of association rules that have to be considered, we could, for example, use one of the following selection criteria: o frequency and confidence o J-measure or interest o maximum rule size (whole rule, left-hand side, right-hand side) o rule attributes (e.g., templates) Data mining: KDD Process 47 Example Problems with selection of rules A rule can correspond to prior knowledge or expectations o how to encode the background knowledge into the system? A rule can refer to uninteresting attributes or attribute combinations o could this be avoided by enhancing the preprocessing phase? Rules can be redundant o redundancy elimination by rule covers etc Data mining: KDD Process 48 24

25 Interpretation and evaluation of the results of data mining Evaluation o statistical validation and significance testing o qualitative review by experts in the field o pilot surveys to evaluate model accuracy Interpretation o tree and rule models can be read directly o clustering results can be graphed and tabled o code can be automatically generated by some systems Data mining: KDD Process 49 Visualization of Discovered Patterns (1) In some cases, visualization of the results of data mining (rules, clusters, networks ) can be very helpful Visualization is actually already important in the preprocessing phase in selecting the appropriate data or in looking at the data Visualization requires training and practice Data mining: KDD Process 50 25

26 Visualization of Discovered Patterns (2) Different backgrounds/usages may require different forms of representation o e.g., rules, tables, cross-tabulations, or pie/bar chart Concept hierarchy is also important o discovered knowledge might be more understandable when represented at high level of abstraction o interactive drill up/down, pivoting, slicing and dicing provide different perspective to data Different kinds of knowledge require different kinds of representation o association, classification, clustering, etc Data mining: KDD Process 51 Visualization Data mining: KDD Process 52 26

27 Data mining: KDD Process 53 Utilization of the results Increasing potential to support business decisions Making Decisions Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst Data Exploration Statistical Analysis, Querying and Reporting Data Warehouses / Data Marts OLAP, MDA Data Sources Paper, Files, Information Providers, Database Systems, OLTP DBA Data mining: KDD Process 54 27

28 Summary Data mining: semi-automatic discovery of interesting patterns from large data sets Knowledge discovery is a process: o preprocessing o data mining o post-processing o using and utilizing the knowledge Data mining: KDD Process 55 Summary Preprocessing is important in order to get useful results! If a loosely defined mining methodology is used, postprocessing is needed in order to find the interesting results! Visualization is useful in preand post-processing! One has to be able to utilize the found knowledge! Data mining: KDD Process 56 28

29 References KDD Process P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, R.J. Brachman, T. Anand. The process of knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of ACM, 42:73-78, M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8: , U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, Jagadish et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December D. Keim, Visual techniques for exploring databases. Tutorial notes in KDD 97, Newport Beach, CA, USA, D. Keim, Visual data mining. Tutorial notes in VLDB 97, Athens, Greece, D. Keim, and H.P. Krieger, Visual techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), Data mining: KDD Process 57 References KDD Process M. Klemettinen, A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A , M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM 94, Gaithersburg, Maryland, Nov G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, T. Redman. Data Quality: Management and Technology. Bantam Books, New York, A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8: , Dec D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June Data mining: KDD Process 58 29

References KDD Process Y. Wand and R. Wang. Anchoring data quality dimensions ontological foundations. Communications of ACM, 39:86-95, 1996. R. Wang, V. Storey, and C. Firth.

30 References KDD Process Y. Wand and R. Wang. Anchoring data quality dimensions ontological foundations. Communications of ACM, 39:86-95, R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7: , Data mining: KDD Process 59 Reminder: Course Organization Course Evaluation Passing the course: min 30 points o home exam: min 13 points (max 30 points) o exercises/experiments: min 8 points (max 20 points) at least 3 returned and reported experiments o group presentation: min 4 points (max 10 points) Remember also the other requirements: o attending the lectures (5/7) o attending the seminars (4/5) o attending the exercises (4/5) Data mining: KDD Process 60 30

31 Seminar Presentations/Groups 9-10 Visualization and data mining D. Keim, H.P., Kriegel, T. Seidl: Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE Data mining: KDD Process 61 Seminar Presentations/Groups 9-10 Interestingness G. Piatetsky-Shapiro, C.J. Matheus: The Interestingness of Deviations, KDD Data mining: KDD Process 62 31

32 KDD process Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides Data mining: KDD Process 63 32

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data