Data Mining
What Is Data Mining?
- Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

CMPT 354: Database I -- Data Mining
The KDD Process
- Data -> (Selection) -> Target data -> (Preprocessing) -> Preprocessed data -> (Transformation) -> Transformed data -> (Data mining) -> Patterns -> (Interpretation/Evaluation) -> Knowledge
What Kind of Patterns?
- Association rules and sequential patterns
- Classification
- Clusters
- Many others
Frequent Patterns and Association Rules
- Example rule: (Time in {Fri, Sat}) AND buy(x, diaper) -> buy(x, beer)
  - "Dads taking care of babies on weekends drink beer"
- Itemsets should be frequent: the rule can be applied extensively
- Rules should be strong, i.e., confident: with strong prediction capability
Sequential Patterns
- Frequent patterns in sequence databases
- E.g., within 3 months: buy computer -> buy CD-ROM -> buy digital camera
- The (temporal) order is important
Utilizations
- Find regularities in data
  - What products were often purchased together?
  - What are the subsequent purchases after buying a PC?
  - Can we automatically classify web documents?
  - What kinds of patients are sensitive to this new drug?
- Basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, ...
Classification

A decision tree for PlayTennis, built from the following training data:

Day | Outlook   | Temp | Humid  | Wind   | PlayTennis
D1  | Sunny     | Hot  | High   | Weak   | No
D2  | Sunny     | Hot  | High   | Strong | No
D3  | Overcast  | Hot  | High   | Weak   | Yes
D4  | Rain      | Mild | High   | Weak   | Yes
D5  | Rain      | Cool | Normal | Weak   | Yes
D6  | Rain      | Cool | Normal | Strong | No
D7  | Overcast  | Cool | Normal | Strong | Yes
D8  | Sunny     | Mild | High   | Weak   | No
D9  | Sunny     | Cool | Normal | Weak   | Yes
D10 | Rain      | Mild | Normal | Weak   | Yes
D11 | Sunny     | Mild | Normal | Strong | Yes
D12 | Overcast  | Mild | High   | Strong | Yes
D13 | Overcast  | Hot  | Normal | Weak   | Yes
D14 | Rain      | Mild | High   | Strong | No

The resulting tree:
- Outlook = Sunny -> test Humidity (High -> No, Normal -> Yes)
- Outlook = Overcast -> Yes
- Outlook = Rain -> test Wind (Strong -> No, Weak -> Yes)
Utilizations
- Understanding the key features of large data sets
- Predictions
  - Credit card approval
  - Fraud detection
  - Intrusion detection
Clusters and Outliers
- Clustering groups objects so as to maximize the intra-class similarity and minimize the inter-class similarity
- Outliers: objects that do not belong to any cluster
- (Figure: two clusters, Cluster 1 and Cluster 2, plus outliers)
Utilizations
- Data summarization
- Market/customer segmentation
- Pattern recognition
- Data preprocessing and compression
- Exception detection
- Fraud detection
Frequent Patterns: Basics

Transaction database TDB:

TID | Items bought
100 | f, a, c, d, g, i, m, p
200 | a, b, c, f, l, m, o
300 | b, f, h, j, o
400 | b, c, k, s, p
500 | a, f, c, e, l, p, m, n

- Itemset: a set of items, e.g., acm = {a, c, m}
- Support of an itemset: the number of transactions containing it, e.g., Sup(acm) = 3
- Given min_sup = 3, acm is a frequent pattern
- Frequent pattern mining: find all frequent patterns in a database
- Association rule c -> am: Sup(c -> am) = 3, Conf(c -> am) = Sup(acm) / Sup(c) = 3/4 = 75%
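The support and confidence definitions above can be sketched directly in code; this is an illustrative snippet (names like `support` and `confidence` are my own), run on the TDB from the slide.

```python
# Illustrative sketch: support and confidence over the example TDB.
TDB = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for items in TDB.values() if itemset <= items)

def confidence(antecedent, consequent):
    """Conf(antecedent -> consequent) = Sup(A u C) / Sup(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"a", "c", "m"}))       # Sup(acm) = 3
print(confidence({"c"}, {"a", "m"}))  # Conf(c -> am) = 0.75
```

With min_sup = 3, acm qualifies as frequent, matching the slide.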
A Naïve Attempt
- Generate all possible itemsets and test their supports against the database
- With 100 items, there are 2^100 - 1 possible itemsets
- How can we test the supports of such a huge number of itemsets against a large database, say one containing 1 million transactions?
How to Get an Efficient Method?
- Reduce the number of itemsets that need to be checked
- Check the supports of selected itemsets efficiently
Apriori: Anti-monotonic Property
- Any subset of a frequent itemset must also be frequent (an anti-monotone property)
  - A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  - If {beer, diaper, nuts} is frequent, {beer, diaper} must also be frequent
- Contrapositive: any superset of an infrequent itemset must also be infrequent
  - No superset of any infrequent itemset needs to be generated or tested
  - Many item combinations can be pruned!
Apriori-based Mining
- Candidate-generation-and-test:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
  - Test the candidates against the DB
Apriori Algorithm

A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994). Example with min_sup = 2:

Database D:
TID | Items
10  | a, c, d
20  | b, c, e
30  | a, b, c, e
40  | b, e

Scan D to count the 1-candidates: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3 (d is pruned)

2-candidates: ab, ac, ae, bc, be, ce
Scan D to count them: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2

3-candidates: bce
Scan D to count: bce:2
Frequent 3-itemsets: bce:2
The Apriori Algorithm

C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

L_1 = {frequent items};
for (k = 1; L_k != empty set; k++) do
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    L_{k+1} = candidates in C_{k+1} with support >= min_support;
return union over k of L_k;
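The pseudocode above can be sketched as a short Python function; this is a minimal illustration (not the authors' original implementation), run on the four-transaction database D from the worked example with min_sup = 2.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise candidate-generation-and-test, as in the pseudocode."""
    # L_1: frequent single items
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {c for c in items
            if sum(c <= t for t in transactions) >= min_sup}
    all_freq = set(freq)
    k = 1
    while freq:
        # C_{k+1}: join frequent k-itemsets, prune by infrequent subsets
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count candidates against the database; keep those meeting min_sup
        freq = {c for c in cands
                if sum(c <= t for t in transactions) >= min_sup}
        all_freq |= freq
        k += 1
    return all_freq

D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(D, 2)
print(sorted("".join(sorted(s)) for s in result))
# ['a', 'ac', 'b', 'bc', 'bce', 'be', 'c', 'ce', 'e']
```

The output matches the worked example: d drops out at level 1, ab/ae at level 2, and bce is the only frequent 3-itemset.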
Classification and Prediction
- Classification: predict categorical class labels
  - Build a model for a set of classes/concepts
  - E.g., classify bank loan applications (safe/risky)
- Prediction: model continuous-valued functions
  - E.g., predict the economic growth in 2004
A Two-step Process
- Model construction: describe a set of predetermined classes
  - Training dataset: tuples for model construction; each tuple/sample belongs to a predefined class
  - The model may take the form of classification rules, decision trees, or math formulae
- Model application: classify unseen objects
  - Estimate the accuracy of the model using an independent test set
  - If the accuracy is acceptable, apply the model to classify tuples with unknown class labels
Model Construction

Training data:

Name | Rank       | Years | Tenured
Mike | Ass. Prof  | 3     | No
Mary | Ass. Prof  | 7     | Yes
Bill | Prof       | 2     | Yes
Jim  | Asso. Prof | 7     | Yes
Dave | Ass. Prof  | 6     | No
Anne | Asso. Prof | 3     | No

A classification algorithm produces the classifier (model):
IF rank = professor OR years > 6 THEN tenured = yes
Model Application

Apply the classifier to the testing data:

Name    | Rank       | Years | Tenured
Tom     | Ass. Prof  | 2     | No
Merlisa | Asso. Prof | 7     | No
George  | Prof       | 5     | Yes
Joseph  | Ass. Prof  | 7     | Yes

Then classify unseen data, e.g., (Jeff, Professor, 4): Tenured?
Decision Tree
- A node in the tree: a test of some attribute
- A branch: a possible value of the attribute
- Classification:
  - Start at the root
  - Test the attribute at the current node
  - Move down the branch matching the attribute's value, until a leaf is reached

The PlayTennis tree:
- Outlook = Sunny -> test Humidity (High -> No, Normal -> Yes)
- Outlook = Overcast -> Yes
- Outlook = Rain -> test Wind (Strong -> No, Weak -> Yes)
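The root-to-leaf walk can be sketched with the tree encoded as a nested dict (an assumed representation, chosen for illustration):

```python
# The PlayTennis tree as a nested dict: {attribute: {value: subtree-or-label}}
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(tree, example):
    """Start at the root; test one attribute per node; follow the branch
    for the example's value until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))           # attribute tested at this node
        tree = tree[attr][example[attr]]  # branch for the example's value
    return tree

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High"}))  # No
print(classify(tree, {"Outlook": "Overcast"}))                   # Yes
```

Note how the Overcast branch classifies without ever looking at Humidity or Wind, exactly as in the tree.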
Training Dataset

Outlook  | Temp | Humid  | Wind   | PlayTennis
Sunny    | Hot  | High   | Weak   | No
Sunny    | Hot  | High   | Strong | No
Overcast | Hot  | High   | Weak   | Yes
Rain     | Mild | High   | Weak   | Yes
Rain     | Cool | Normal | Weak   | Yes
Rain     | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny    | Mild | High   | Weak   | No
Sunny    | Cool | Normal | Weak   | Yes
Rain     | Mild | Normal | Weak   | Yes
Sunny    | Mild | Normal | Strong | Yes
Overcast | Mild | High   | Strong | Yes
Overcast | Hot  | Normal | Weak   | Yes
Rain     | Mild | High   | Strong | No
Basic Algorithm ID3
- Construct a tree in a top-down, recursive, divide-and-conquer manner
  - Which attribute is the best at the current node?
  - Create a node for each possible value of that attribute
  - Partition the training data into the descendant nodes
- Conditions for stopping the recursion:
  - All samples at a given node belong to the same class
  - No attribute remains for further partitioning (majority voting is employed to label the leaf)
  - There is no sample at the node
Which Attribute Is the Best?
- The attribute most useful for classifying examples
- Information gain and the Gini index: statistical properties that measure how well an attribute separates the training examples
Entropy
- Measures the homogeneity of examples:

  Entropy(S) = - sum_{i=1}^{c} p_i log2 p_i

  where S is the training data set, c is the number of classes, and p_i is the proportion of S belonging to class i
- The smaller the entropy, the purer the data set
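The formula translates directly into code; this is a small sketch (the function name is my own) checked against the PlayTennis data, which has 9 Yes and 5 No examples.

```python
from math import log2

def entropy(class_counts):
    """Entropy(S) = -sum_i p_i log2 p_i, from the count of each class.
    Empty classes contribute nothing (0 log 0 is taken as 0)."""
    total = sum(class_counts)
    return -sum((n / total) * log2(n / total) for n in class_counts if n)

print(round(entropy([9, 5]), 2))  # 0.94  (the PlayTennis data set)
print(entropy([7, 7]))            # 1.0   (maximally impure 50/50 split)
print(entropy([14, 0]))           # 0.0   (pure: all one class)
```

The two extremes illustrate the slide's point: a pure node has entropy 0, an even split has entropy 1.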
Information Gain
- The expected reduction in entropy caused by partitioning the examples according to an attribute:

  Gain(S, A) = Entropy(S) - sum_{v in Values(A)} (|S_v| / |S|) Entropy(S_v)

  where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v
Example

Using the PlayTennis training data above (9 Yes, 5 No):

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.94

For Wind, there are 8 Weak examples (entropy 0.811) and 6 Strong examples (entropy 1.00):

Gain(S, Wind) = Entropy(S) - sum_{v in {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
             = 0.94 - (8/14) x 0.811 - (6/14) x 1.00
             = 0.048
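The worked example can be verified in a few lines; this sketch (function names are my own) uses only the Wind and PlayTennis columns of the training table.

```python
from math import log2

# Wind and PlayTennis columns, in the order of the training table (D1..D14)
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

def entropy(labels):
    """Entropy of a list of class labels."""
    counts = [labels.count(c) for c in set(labels)]
    total = sum(counts)
    return -sum(n / total * log2(n / total) for n in counts if n)

def gain(attr_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    total = len(labels)
    remainder = sum(
        len([l for a, l in zip(attr_values, labels) if a == v]) / total
        * entropy([l for a, l in zip(attr_values, labels) if a == v])
        for v in set(attr_values))
    return entropy(labels) - remainder

print(round(entropy(play), 2))    # 0.94
print(round(gain(wind, play), 3)) # 0.048
```

Both numbers agree with the hand calculation above.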
Extracting Classification Rules
- Each path from the root to a leaf gives an IF-THEN rule
- Each attribute-value pair along a path forms a conjunct in the rule's condition
- The leaf node holds the class prediction
- E.g., IF age <= 30 AND student = no THEN buys_computer = no
- Rules are easy for people to understand
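Rule extraction is a simple tree walk that accumulates one conjunct per edge; a sketch (using the same assumed nested-dict encoding as before, applied to the PlayTennis tree):

```python
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def extract_rules(tree, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path; each edge on the
    path contributes one attribute = value conjunct."""
    if not isinstance(tree, dict):  # leaf: the class prediction
        yield "IF " + " AND ".join(conditions) + f" THEN PlayTennis = {tree}"
        return
    attr = next(iter(tree))
    for value, subtree in tree[attr].items():
        yield from extract_rules(subtree, conditions + (f"{attr} = {value}",))

for rule in extract_rules(tree):
    print(rule)
# e.g. IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No
```

The five leaves of the PlayTennis tree yield five rules, one per path.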
Summary
- Mining patterns: frequent patterns, classification, clustering
- Frequent patterns and the Apriori algorithm
- Decision trees and the ID3 algorithm