Data Warehousing and Machine Learning Introduction Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 47
What is Data Mining?
What is Data Mining? Definitions

Knowledge Discovery is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [Frawley, Piatetsky-Shapiro, Matheus 1991].

Data Mining is a step in the KDD process consisting of applying computational techniques that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data [Fayyad, Piatetsky-Shapiro, Smyth 1996].

Data Mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [Hand, Mannila, Smyth 2001].

The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience [Mitchell 1997].

Data Mining vs. Machine Learning
- Different roots: information extraction vs. intelligent machines
- Today very large overlap of techniques and applications
- Some remaining differences: emphasis on large datasets (DM), theoretical analysis of learnability (ML), ...
- For this course: Data Mining ≈ Machine Learning
What is Data Mining? Data Mining in practice

[Diagram: real-life data is preprocessed and fed to an off-the-shelf algorithm, which is adapted to the task; the result is evaluated and the process iterated. Preprocessing and adaptation are data/domain-specific operations; the algorithms themselves are general algorithmic methods.]
What is Data Mining? Background

CRISP-DM was developed by a four-member consortium in an EU project. Members of the consortium: Teradata (NCR), SPSS (statistical software), DaimlerChrysler, OHRA (insurance and banking). The consortium was supported by a special interest group composed of over 300 organizations involved in data mining projects.

Aim, from http://www.crisp-dm.org/: The CRISP-DM project has developed an industry- and tool-neutral Data Mining process model. [...] this project defined and validated a data mining process that is applicable in diverse industry sectors. This will make large data mining projects faster, cheaper, more reliable and more manageable. Even small scale data mining investigations will benefit from using CRISP-DM.
What is Data Mining? Phases of the CRISP-DM Process Model (illustration from www.crisp-dm.org)
What is Data Mining? Business/Data understanding

Vision: Data Mining extracts whatever interesting hidden information there is in the data.
Reality: Data Mining techniques solve several types of well-defined tasks.
Reality: The data used must support the task at hand.
Reality: The data miner must understand the background of the data in order to select an appropriate data mining technique.
What is Data Mining? Our Focus
What is Data Mining? Selecting the Modeling Technique

[Figure: the universe of techniques (defined by the tool) is narrowed down by the techniques appropriate for the problem, by political requirements (management, understandability), and by constraints (time, data characteristics, staff training/knowledge), yielding the tool(s) selected.]
Types of Tasks and Models

Prediction (Supervised Learning)
Task: predict some (unobserved) target variable based on observed values of attribute variables; regression if the target is continuous, classification if the target is discrete.
Models e.g.: decision trees, neural networks, Bayesian (classification) networks, ...

Clustering
Task: identify coherent subgroups in data.
Models e.g.: k-means, hierarchical clustering, ...

Association analysis
Task: identify patterns of co-occurrence of attribute values.
Models: Apriori and extensions.

Visualization (Exploratory Data Analysis)
Task: find intelligible visualizations of relevant data properties.
Models: graphs, plots, ...
Example: Regression Nutritional rating of cereals

Data: nutritional information and ratings for 77 cereals.
Task: find the best linear approximation of the dependency of rating on sugars.
Example: Classification Text Categorization

The Association for Computing Machinery (ACM) maintains a subject classification scheme for computer science research papers. Part of the subject hierarchy (1998 version):

I. Computing Methodologies
  I.2 Artificial Intelligence
    I.2.6 Learning
      - Analogies
      - Concept learning
      - Connectionism and neural nets
      - Induction
      - Knowledge acquisition
      - Language acquisition
      - Parameter learning

Papers are manually classified by authors or editors.
Data: a collection of classified papers (full text or abstracts).
Task: build a classifier that automatically assigns a subject index to new, unclassified papers.
Example: Classification Spam Filtering

Spam filtering in Mozilla: the user trains the mail reader to recognize spam by manually labeling incoming mails as spam/no spam.
Data: a collection of user-classified emails (full text).
Task: build a classifier that automatically categorizes an incoming email as spam/no spam.
Example: Classification Character Recognition

An example of a pattern recognition problem (pattern recognition is an older discipline than data mining, but can now also be seen as a sub-area of data mining):
Data: a collection of handwritten characters, correctly labeled.
Task: build a classifier that identifies new handwritten characters.
Example: Classification Credit Rating

From existing customer data, predict whether a person applying for a new loan will repay or default on the loan.
Data: existing customer records with attributes like age, employment type, income, ... and information on payback history.
Task: build a classifier that predicts whether a new customer will repay the loan.
Examples: Clustering Web Mining

Automatically detect similarity between web pages (e.g. to support search engines or the automatic construction of internet directories).
Data: the WWW.
Task: construct a (similarity) model for pages on the WWW.
Examples: Clustering Bioinformatics: Phylogenetic Trees

From biological data, construct a model of evolution.

[Figure: phylogenetic tree with leaves Lactococcus lactis, Caulobacter crescentus, Bacillus halodurans, Bacillus subtilis, Rattus norvegicus, Pan troglodytes, Homo sapiens.]

Data: e.g. genome sequences of different animal species.
Task: construct a hierarchical model of similarity between the species.
Examples: Association Analysis Association Rules

Data: transaction data. Task: infer association rules.

Transaction | Items bought
1 | Beer, Soap, Milk, Butter
2 | Beer, Chips, Butter
3 | Milk, Spaghetti, Butter, Tomatoes
... | ...

Example rules: {Beer} → {Chips}, {Spaghetti, Tomatoes} → {Wine}, ...
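The levelwise idea behind Apriori (covered later in the course) can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation; the transactions and the min_support threshold below are toy values modeled on the table above:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Apriori-style levelwise search for frequent itemsets: a k-itemset
    can only be frequent if every one of its (k-1)-subsets is frequent."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # join step: combine frequent (k-1)-itemsets into k-candidates
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # prune step: drop candidates with an infrequent subset, then count
        frequent = {c for c in candidates
                    if all(frozenset(s) in result
                           for s in combinations(c, k - 1))
                    and support(c) >= min_support}
        result |= frequent
        k += 1
    return result

# Toy transactions modeled on the slide's table
transactions = [{"Beer", "Soap", "Milk", "Butter"},
                {"Beer", "Chips", "Butter"},
                {"Milk", "Spaghetti", "Butter", "Tomatoes"}]
sets = frequent_itemsets(transactions, min_support=2/3)
```

With a support threshold of 2/3, the frequent itemsets are {Beer}, {Milk}, {Butter}, {Beer, Butter} and {Milk, Butter}; {Beer, Milk} is infrequent, so the triple {Beer, Milk, Butter} is pruned without ever counting its support.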
Tools

WEKA: free open-source Java toolbox (www.cs.waikato.ac.nz/ml/weka/); many methods, good interface.
Clementine: commercial system, Windows only; many methods, good interface, integrated use of MS SQL Server.

For all toolboxes: easy use of the methods can be dangerous; correct interpretation of the results requires an understanding of the methods. Documentation is essential (and often a weak point...)!
Data Warehousing and Machine Learning Decision trees Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008
Classification A high-level view

[Figure: a classifier maps an input to a class label. Spam filtering: attributes such as SubAllCap (yes/no), TrustSend (yes/no), InvRet (yes/no), Body-adult (yes/no), Body-zambia (yes/no) are mapped to Spam (yes/no). Character recognition: attributes Cell-1, ..., Cell-324 (each with values 1..64) are mapped to Symbol (A..Z, 0..9).]
Classification Labeled Data

Instances (cases, examples) are described by attributes (features, predictor variables) and a class variable (target variable):

SubAllCap | TrustSend | InvRet | ... | B-zambia | Spam
y | n | n | ... | n | y
n | n | n | ... | n | n
n | y | n | ... | n | y
n | n | n | ... | n | n
... | ... | ... | ... | ... | ...

Cell-1 | Cell-2 | Cell-3 | ... | Cell-324 | Symbol
1 | 1 | 4 | ... | 12 | B
1 | 1 | 1 | ... | 3 | 1
34 | 37 | 43 | ... | 22 | Z
1 | 1 | 1 | ... | 7 | 0
... | ... | ... | ... | ... | ...

(In principle, any attribute can become the designated class variable.)
Classification Attribute Types

Each attribute A (including the class variable) has an associated set of possible values or states, e.g.
States(A) = {yes, no}
States(A) = {red, blue, green}
States(A) = {010100, 020100, ..., 311299}
States(A) = R

If States(A) is finite, A is called discrete.
If States(A) = R, A is called continuous or numeric.
If States(A) = N, A can be interpreted as continuous (N ⊆ R), or made discrete by replacing N e.g. with {1, 2, ..., 100, > 100} (few data mining methods are specifically adapted to integer-valued attributes).
Classification Complete/Incomplete Data

Complete data:

Name | Gender | DoB | Income | Customer since | Last Purchase
Thomas Jensen | m | 050367 | 190000 | 010397 | 250504
Jens Nielsen | m | 171072 | 250000 | 051103 | 040204
Lene Hansen | f | 021159 | 140000 | 300300 | 250105
Ulla Sørensen | f | 220879 | 210000 | 180998 | 031099
... | ... | ... | ... | ... | ...

Incomplete data:

Name | Gender | DoB | Income | Customer since | Last Purchase
Thomas Jensen | m | 050367 | 190000 | 010397 | 250504
Jens Nielsen | m | ? | ? | 051103 | 040204
Lene Hansen | f | 021159 | ? | 300300 | 250105
Ulla Sørensen | f | ? | ? | 180998 | 031099
... | ... | ... | ... | ... | ...
Classification Classification

Classification data in general:
Attributes: variables A1, A2, ..., An (discrete or continuous).
Class variable: variable C, always discrete: States(C) = {c1, ..., cl} (the set of class labels).

A (complete data) classifier is a mapping C : States(A1, ..., An) → States(C).
A classifier able to handle incomplete data provides mappings C : States(A_i1, ..., A_ik) → States(C) for subsets {A_i1, ..., A_ik} of {A1, ..., An}.

A classifier partitions the attribute-value space (also: instance space) into subsets labelled with class labels.
Classification Iris dataset

[Figure: an iris flower with sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW) marked.]

Measurements of petal width/length and sepal width/length for 150 flowers of 3 different species of Iris. First reported in: Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7 (1936).

SL | SW | PL | PW | Species
5.1 | 3.5 | 1.4 | 0.2 | Setosa
4.9 | 3.0 | 1.4 | 0.2 | Setosa
6.3 | 2.9 | 6.0 | 2.1 | Virginica
6.3 | 2.5 | 4.9 | 1.5 | Versicolor
... | ... | ... | ... | ...
Classification Labeled data in instance space

[Figure: the labeled Iris data plotted in instance space, together with a partition defined by a classifier into regions labelled Virginica, Versicolor and Setosa.]
Classification Decision Regions

Axis-parallel linear: e.g. decision trees.
Piecewise linear: e.g. naive Bayes.
Nonlinear: e.g. neural networks.
Classification Classifiers differ in...

- Model space: the types of partitions and their representation.
- How they compute the class label corresponding to a point in instance space (the actual classification task).
- How they are learned from data.

Some important types of classifiers: decision trees, the naive Bayes classifier, other probabilistic classifiers (TAN, ...), neural networks, k-nearest neighbors.
Decision Trees Example

Attributes: height ∈ [0, 2.5], sex ∈ {m, f}. Class labels: {tall, short}.

[Figure: partition of the instance space and its representation by a decision tree: the root tests sex; for sex = m, height < 1.8 gives short and height ≥ 1.8 gives tall; for sex = f, height < 1.7 gives short and height ≥ 1.7 gives tall.]
Decision Trees

A decision tree is a tree
- whose internal nodes are labeled with attributes,
- whose leaves are labeled with class labels,
- whose edges going out from a node labeled with attribute A are labeled with subsets of States(A), such that all labels combined form a partition of States(A).

Possible partitions e.g.:
States(A) = R: [−∞, 2.3[, [2.3, ∞] or [−∞, 1.9[, [1.9, 3.5[, [3.5, ∞]
States(A) = {a, b, c}: {a}, {b}, {c} or {a, b}, {c}
Decision Trees Decision tree classification

Each point in the instance space is sorted into a leaf by the decision tree, and is classified according to the class label at that leaf.

Example: the instance [m, 1.85] is sorted down the sex = m branch and then down the height ≥ 1.8 branch, so C([m, 1.85]) = tall.
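For this particular tree, sorting an instance to a leaf can be written out directly. A minimal sketch of the height/sex example, with the thresholds taken from the slide:

```python
def classify(sex, height):
    """Sort an instance through the example tree: the root tests sex,
    the second level tests height (threshold 1.8 for m, 1.7 for f)."""
    if sex == "m":
        return "tall" if height >= 1.8 else "short"
    else:  # sex == "f"
        return "tall" if height >= 1.7 else "short"
```

For instance, classify("m", 1.85) returns "tall", matching C([m, 1.85]) = tall.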
Decision Trees How to learn a decision tree?

Given a dataset:

Id. | Savings | Assets | Income ($1000s) | Credit Risk
1 | Medium | High | 75 | Good
2 | Low | Low | 50 | Bad
3 | High | Medium | 25 | Bad
4 | Medium | Medium | 50 | Good
5 | Low | Medium | 100 | Good
6 | High | High | 25 | Good
7 | Low | Low | 25 | Bad
8 | Medium | Medium | 75 | Good

We want to build a decision tree that is small and has high classification accuracy.
Decision Trees Some simple candidate trees (selecting a root):

Savings (L / M / H): L: cases 2,5,7 (G:1, B:2); M: cases 1,4,8 (G:3, B:0); H: cases 3,6 (G:1, B:1)
Assets (L / M / H): L: cases 2,7 (G:0, B:2); M: cases 3,4,5,8 (G:3, B:1); H: cases 1,6 (G:2, B:0)
Income (≤50 / >50): ≤50: cases 2,3,4,6,7 (G:2, B:3); >50: cases 1,5,8 (G:3, B:0)
Income (≤25 / >25): ≤25: cases 3,6,7 (G:1, B:2); >25: cases 1,2,4,5,8 (G:4, B:1)
Decision Trees How accurate are these trees?

Accurate trees have pure class label distributions at the leaves:
pure: (2,0) (0,2) (3,0)
impure: (1,2) (3,1) (2,3) (2,2) (1,1)

Entropy, a measure of impurity: for S = (x_1, x_2, ..., x_n) with x = x_1 + ... + x_n:

Entropy(S) = − Σ_{i=1..n} (x_i / x) · log2(x_i / x)

(with the convention 0 · log2(0) = 0). For example:

Entropy(2,0) = Entropy(0,2) = Entropy(3,0) = −(1 · log2(1) + 0 · log2(0)) = 0 + 0 = 0
Entropy(3,1) = −(0.75 · log2(0.75) + 0.25 · log2(0.25)) = 0.311 + 0.5 = 0.811
Entropy(2,3) = −(0.4 · log2(0.4) + 0.6 · log2(0.6)) = 0.529 + 0.442 = 0.971
Entropy(2,2) = Entropy(1,1) = −(0.5 · log2(0.5) + 0.5 · log2(0.5)) = 0.5 + 0.5 = 1.0
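The entropy computations above are easy to check mechanically. A small helper, as a sketch in Python:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-label distribution given as a
    tuple of counts, e.g. (3, 1) for 3 Good and 1 Bad."""
    total = sum(counts)
    # by the convention 0*log2(0) = 0, zero counts are simply skipped
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)
```

entropy((3, 0)) gives 0.0 and entropy((1, 1)) gives 1.0; the impure distributions land strictly in between.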
Decision Trees Information Gain

Attribute A (true / false) splits the data into 8,2 and 1,1; attribute B (L / M / H) splits it into 2,0 / 5,1 / 2,2. The corresponding entropies are 0.722, 1.0 and 0.0, 0.65, 1.0.

Expected entropy:
A: (10/12) · 0.722 + (2/12) · 1.0 = 0.768
B: (2/12) · 0.0 + (6/12) · 0.65 + (4/12) · 1.0 = 0.658

Data entropy: Entropy(9,3) = 0.811

Information gain:
A: 0.811 − 0.768 = 0.043
B: 0.811 − 0.658 = 0.153
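The expected-entropy and gain computations can be sketched the same way (self-contained; the count tuples below are the A/B example from the slide):

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-label distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Gain = entropy of the parent distribution minus the expected
    entropy after the split; children is a list of count tuples."""
    total = sum(parent)
    expected = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - expected

gain_A = information_gain((9, 3), [(8, 2), (1, 1)])          # split on A
gain_B = information_gain((9, 3), [(2, 0), (5, 1), (2, 2)])  # split on B
```

gain_A is about 0.043 and gain_B about 0.153, so B would be preferred as the splitting attribute.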
Decision Trees Expected entropies:

Savings (L / M / H: 1,2 / 3,0 / 1,1): (3/8) · 0.918 + (3/8) · 0.0 + (2/8) · 1.0 = 0.594
Assets (L / M / H: 0,2 / 3,1 / 2,0): (2/8) · 0.0 + (4/8) · 0.811 + (2/8) · 0.0 = 0.405
Income (≤50 / >50: 2,3 / 3,0): (5/8) · 0.97 + (3/8) · 0.0 = 0.606
Income (≤25 / >25: 1,2 / 4,1): (3/8) · 0.918 + (5/8) · 0.722 = 0.795

Information gains are Entropy(5,3) = 0.954 minus the expected entropies.
Decision Trees After the second (and final) ID3 iteration:

Assets: L → cases 2,7 (G:0, B:2): bad; H → cases 1,6 (G:2, B:0): good; M → Savings: L → case 5 (G:0, B:1): bad; M → cases 4,8 (G:2, B:0): good; H → case 3 (G:0, B:1): bad.
Decision Trees Splitting continuous attributes

Sort the continuous values in increasing order. Candidate split points are the midpoints between adjacent values. Define a new attribute based on the candidate split point with the highest gain.

Example
Income: 25 25 25 50 50 75 75 100
Class: Bad Good Bad Good Bad Good Good Good
Split: 12.5 37.5 62.5 87.5 112.5
Gain: 0 0.1589 0.3476 0.0923 0

Entropy(S) = −(3/8) · log2(3/8) − (5/8) · log2(5/8) = 0.9544

Gain(S, I = 12.5) = 0.9544 − [(0/8) · Entropy(S, I ≤ 12.5) + (8/8) · Entropy(S, I > 12.5)] = 0
Gain(S, I = 37.5) = 0.9544 − [(3/8) · Entropy(S, I ≤ 37.5) + (5/8) · Entropy(S, I > 37.5)] = 0.1589
Gain(S, I = 62.5) = 0.9544 − [(5/8) · Entropy(S, I ≤ 62.5) + (3/8) · Entropy(S, I > 62.5)] = 0.3476
Gain(S, I = 87.5) = 0.9544 − [(7/8) · Entropy(S, I ≤ 87.5) + (1/8) · Entropy(S, I > 87.5)] = 0.0923
Gain(S, I = 112.5) = 0.9544 − [(8/8) · Entropy(S, I ≤ 112.5) + (0/8) · Entropy(S, I > 112.5)] = 0

Thus, we get an attribute with states ≤ 62.5 and > 62.5.
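The split-point search can be sketched as follows, as a self-contained illustration using the Income column of the example dataset. Unlike the table above, it only tries midpoints between adjacent distinct values, since splits outside the data range always have gain 0:

```python
import math

def entropy(counts):
    """Entropy (in bits) of a class-label distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def best_split(values, labels):
    """Return the split point with the highest information gain for a
    continuous attribute, trying midpoints between adjacent distinct values."""
    classes = sorted(set(labels))

    def dist(pairs):  # class-label counts of a subset of (value, label) pairs
        return [sum(1 for _, l in pairs if l == c) for c in classes]

    data = list(zip(values, labels))
    base = entropy(dist(data))
    best_point, best_gain = None, -1.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        point = (lo + hi) / 2
        left = [p for p in data if p[0] <= point]
        right = [p for p in data if p[0] > point]
        expected = (len(left) / len(data) * entropy(dist(left))
                    + len(right) / len(data) * entropy(dist(right)))
        if base - expected > best_gain:
            best_point, best_gain = point, base - expected
    return best_point, best_gain

income = [75, 50, 25, 50, 100, 25, 25, 75]
risk = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]
point, gain = best_split(income, risk)
```

point is 62.5 with gain ≈ 0.3476, matching the table.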
Decision Trees ID3 algorithm for decision tree learning

Determine the attribute A with the highest information gain (for continuous attributes: also determine the split value).
Construct a decision tree with root A and one leaf for each value of A (two leaves if A is continuous).
For a non-pure leaf L: determine the attribute B with the highest information gain for the data sorted into L. Replace L with a subtree consisting of root B and one leaf for each value of B (two leaves if B is continuous).
Continue until all leaves are pure, or some other termination condition applies (e.g. all possible information gains are below a given threshold).
Label each leaf with the class label that is most frequent among the data sorted into the leaf.
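The loop above can be condensed into a recursive sketch. This is an illustration, not the original ID3 code; for brevity it handles discrete attributes only, so only Savings and Assets from the credit example are used:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting on the discrete attribute attr."""
    expected = 0.0
    for value in set(r[attr] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[attr] == value]
        expected += len(sub) / len(labels) * entropy(sub)
    return entropy(labels) - expected

def id3(rows, labels, attrs):
    """Return a class label (leaf) or a pair (attribute, {value: subtree})."""
    if len(set(labels)) == 1:          # pure leaf
        return labels[0]
    if not attrs:                      # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = id3([rows[i] for i in idx],
                              [labels[i] for i in idx],
                              [a for a in attrs if a != best])
    return (best, branches)

rows = [{"Savings": "Medium", "Assets": "High"},
        {"Savings": "Low", "Assets": "Low"},
        {"Savings": "High", "Assets": "Medium"},
        {"Savings": "Medium", "Assets": "Medium"},
        {"Savings": "Low", "Assets": "Medium"},
        {"Savings": "High", "Assets": "High"},
        {"Savings": "Low", "Assets": "Low"},
        {"Savings": "Medium", "Assets": "Medium"}]
labels = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]
tree = id3(rows, labels, ["Savings", "Assets"])
```

As on the slides, Assets is chosen as the root (its expected entropy 0.405 beats 0.594 for Savings), the Low and High branches become pure leaves, and the Medium branch is split further on Savings.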
Decision Trees Pros and Cons

+ Easy to interpret.
+ Efficient learning methods.
- Difficulties with handling missing data.
Overfitting The problem

Learned tree: Assets: L → bad, H → good, M → Savings (L → bad, M → good, H → bad).

Predictions made by the learned model:
Assets=M, Savings=M → Risk=good
Assets=M, Savings=H → Risk=bad

The training data contained a single case with Assets=M, Savings=H. This case had the (uncharacteristic?) class label Risk=bad. The model is overfitted to the training data: with the prediction Assets=M, Savings=H → Risk=good we will likely obtain a higher accuracy on future cases.
Overfitting The general problem

Complex models represent the properties of the training data very precisely. The training data may, however, contain peculiar properties that are not representative of the domain, and the model will then not perform optimally when classifying future instances.

[Figure: classification error as a function of model size; the error on the training data decreases steadily with model size, while the error on future data eventually increases again.]
Overfitting Decision Tree Pruning

To prevent overfitting, extensions of ID3 (C4.5, C5.0) add a pruning step after the tree construction:
The data is split into a training set and a test set.
The decision tree is learned using the training data only.
Pruning: for each internal node A, replace the subtree rooted at A with a leaf if this reduces the classification error on the test set.
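Reduced-error pruning on a tree in the (attribute, {value: subtree}) representation can be sketched as follows. This is an illustration under the convention that a subtree is replaced by the majority label of the test cases reaching it whenever that strictly improves test accuracy; the tree and test cases are the Assets/Savings example:

```python
from collections import Counter

def classify(tree, row):
    """Sort a row through a (attribute, {value: subtree}) tree."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

def accuracy(tree, rows, labels):
    return sum(classify(tree, r) == l
               for r, l in zip(rows, labels)) / len(labels)

def prune(tree, rows, labels):
    """Bottom-up reduced-error pruning using (rows, labels) as test set."""
    if not isinstance(tree, tuple) or not rows:
        return tree                    # leaf, or no test cases reach here
    attr, branches = tree
    new_branches = {}
    for value, sub in branches.items():
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        new_branches[value] = prune(sub, [rows[i] for i in idx],
                                    [labels[i] for i in idx])
    candidate = (attr, new_branches)
    leaf = Counter(labels).most_common(1)[0][0]
    # replace the subtree by a majority leaf if that reduces the error
    if accuracy(leaf, rows, labels) > accuracy(candidate, rows, labels):
        return leaf
    return candidate

full = ("Assets", {"Low": "Bad", "High": "Good",
                   "Medium": ("Savings", {"Low": "Bad",
                                          "Medium": "Good",
                                          "High": "Bad"})})
test_rows = [{"Savings": "High", "Assets": "Medium"},
             {"Savings": "Low", "Assets": "Medium"},
             {"Savings": "High", "Assets": "Medium"},
             {"Savings": "Medium", "Assets": "Medium"}]
test_labels = ["Good", "Bad", "Good", "Good"]
pruned = prune(full, test_rows, test_labels)
```

On this test set the Savings subtree (50% accuracy) is replaced by the leaf Good (75% accuracy), while the Assets root survives.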
Overfitting Example

Full tree: Assets: L → bad, H → good, M → Savings (L → bad, M → good, H → bad).
Pruned tree: Assets: L → bad, M → good, H → good.

Test data (showing only cases with Assets=M):

Id. | Savings | Assets | Income | Risk
9 | High | Medium | 50 | Good
10 | Low | Medium | 50 | Bad
11 | High | Medium | 75 | Good
12 | Medium | Medium | 50 | Good

Accuracy of the full tree on the test data: 50%.
Accuracy of the pruned tree on the test data: 75%.
→ prune the Savings node.
Overfitting Model Tuning with Test Set

[Figure: the data is split into a training set and a test set; a model is learned on the training set, applied to the test set, and tuned; the final model is then learned from all the data with the chosen tuning-parameter setting.]

Models can be adjusted or tuned (e.g. by pruning subtrees or setting model parameters).
Tuning can be an iterative process that requires repeated evaluations on the test set.
A final model is learned using all the data.
Problem: part of the data is wasted as a test set.
Overfitting Cross Validation

Partition the data into n subsets or folds (typically n = 10).
For each setting of the tuning parameter:
    for i = 1 to n:
        learn a model using folds 1, ..., i-1, i+1, ..., n as training data
        measure the performance on fold i
    model performance = average performance on the n test sets
Choose the parameter setting with the best performance.
Learn the final model with the chosen parameter setting using the whole available data.
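The inner loop can be sketched generically. This is a minimal illustration: learn and evaluate are assumed to be supplied by the caller, and the folds are formed by simple striding rather than by random shuffling:

```python
def cross_validate(data, n_folds, learn, evaluate):
    """n-fold cross validation: returns the average performance of
    models learned on n-1 folds and evaluated on the held-out fold."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(learn(train), test))
    return sum(scores) / n_folds

# Toy usage: "learn" the mean of the training folds, "evaluate" by
# mean absolute error on the held-out fold
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
score = cross_validate(
    data, 3,
    learn=lambda train: sum(train) / len(train),
    evaluate=lambda model, test: sum(abs(x - model) for x in test) / len(test))
```

With these toy choices, each held-out fold is off from the training mean by 1.5 on average, so score is 1.5.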
Cross validation is also used for the final evaluation of a learned model.