CS 6093 Lecture 7 Spring 2011 Basic Data Mining. Cong Yu 03/21/2011

1 CS 6093 Lecture 7 Spring 2011 Basic Data Mining Cong Yu 03/21/2011

2 Announcements. No regular office hour next Monday (March 28th); office hours next week will be Tuesday through Thursday, by appointment only. I will be out of town from April 1st to April 16th with sporadic access, so please plan accordingly if you need to discuss your projects with me. Midterm reports will be graded soon; we are aiming for the end of the week. The quiz next week will be based on today's lecture, and it will be closed notes. Any questions on the projects?

3 Today's Outline: Overview of Data Mining (What, Why, How); Classic Studies; Association Rule Mining; Data Cube Analysis; Rule Interestingness

4 What is Data Mining? You are familiar with data/information retrieval: querying a database using SQL, or searching the Web via keyword queries. Data mining is NOT data retrieval. Data mining = discovering hidden and useful knowledge from large amounts of data. Hidden: you can't easily write a query to fetch what you want, because you don't even know what you want. Useful (interesting): not every piece of hidden knowledge is useful, and trivial discoveries can overwhelm the user. Large amounts: simple techniques are no longer sufficient; efficient and scalable techniques are needed.

5 Examples of Data Mining Results. Rules and patterns: customers who buy Harry Potter books often buy Twilight books; users in NYC tend to search for expensive restaurants on Valentine's Day. Clusters and classification: TV viewers who watch 2+ hours of cable news every day can be divided into three groups (CNN, MSNBC, and Fox); given a viewer, predict which group he or she falls into (for advertising purposes).

6 Why Study Data Mining? Lots of Data Amazon, Walmart, Citibank, etc. Opportunities for: Purchase recommendation Credit card fraud detection Challenging for: Hidden information detection beyond human eyes Efficiency and scalability

7 How Data Mining Became a Field. Started within the database systems community: OLAP instead of OLTP. OLTP (online transaction processing): ATM transactions, shopping transactions, etc. OLAP (online analytical processing): business intelligence, business reporting. Heavily influenced by machine learning and statistics. More recently: information retrieval and recommendation systems.

8 How is Data Mining Done? Descriptive (closer to databases): the classic topics are association rule mining, frequent pattern mining, and data cube analysis; clustering (group similar data points and separate dissimilar ones); anomaly detection (detect data points that deviate significantly from the others). Predictive (closer to statistics and machine learning): classification (predict which label to assign to a data point based on its features); regression analysis (predict the value of a dependent variable, e.g., sales, from the underlying variables, e.g., time and location).

9 Today's Outline: Overview of Data Mining (What, Why, How); Classic Studies; Association Rule Mining; Data Cube Analysis; Rule Interestingness

10 Association Rule Mining. Classic example: {diaper, milk} → beer. Intuitive definition: the RHS often appears when the LHS appears, i.e., the LHS implies the RHS, but the relationship is NOT causal! Applications: inventory management (determine which items to put together on the shelf to increase sales); shopping recommendation (you bought this book, so you might want to check out these other books); many others.

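11 Formal Definition. I = set of items and D = database of transactions, where each transaction is a subset of I. A rule $X \rightarrow Y$ (with $X, Y \subseteq I$ and $X \cap Y = \emptyset$) is significant if both of the following hold. Support: $\mathrm{supp}(X \rightarrow Y) = |\{T \in D : X \cup Y \subseteq T\}| \,/\, |D| \ge \mathit{minsup}$. Confidence: $\mathrm{conf}(X \rightarrow Y) = \mathrm{supp}(X \cup Y) \,/\, \mathrm{supp}(X) \ge \mathit{minconf}$.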
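
The two thresholds can be checked by simply counting transactions. The following is a minimal Python sketch of the standard support and confidence computations; the toy database `TRANSACTIONS` and the function names are illustrative assumptions, not taken from the lecture.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy database of market-basket transactions (illustrative only).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diaper", "beer", "eggs"}),
    frozenset({"milk", "diaper", "beer", "coke"}),
    frozenset({"bread", "milk", "diaper", "beer"}),
    frozenset({"bread", "milk", "diaper", "coke"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """supp(lhs ∪ rhs) / supp(lhs); returns 0.0 if lhs never occurs."""
    base = support(lhs, transactions)
    if base == 0:
        return 0.0
    return support(frozenset(lhs) | frozenset(rhs), transactions) / base


if __name__ == "__main__":
    print(support({"diaper", "milk", "beer"}, TRANSACTIONS))
    print(confidence({"diaper", "milk"}, {"beer"}, TRANSACTIONS))
```

A rule would then be reported only if both numbers clear the chosen thresholds.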

12 Example

13 Brute Force Algorithm. Compute the support and confidence of all possible rules. The number of rules is exponential in the number of unique items in the database, so this is computationally infeasible.
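
To get a feel for the blow-up, the sketch below counts the rules a brute-force miner would have to score for d items: it enumerates every itemset of size at least 2 and counts every split into a non-empty LHS and a non-empty RHS. The function name is hypothetical; the closed form $3^d - 2^{d+1} + 1$ printed next to it is a standard counting result, not something stated on the slide.

```python
from itertools import combinations


def count_candidate_rules(d: int) -> int:
    """Count rules X -> Y with X, Y non-empty and disjoint over d items."""
    total = 0
    for k in range(2, d + 1):
        for _itemset in combinations(range(d), k):
            # Each k-itemset splits into (LHS, RHS) in 2^k - 2 ways:
            # every subset can be the LHS except the empty set and the full set.
            total += 2 ** k - 2
    return total


if __name__ == "__main__":
    for d in (3, 6, 10):
        print(d, count_candidate_rules(d), 3 ** d - 2 ** (d + 1) + 1)
```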

14 The Space of All Itemsets ($2^d$ itemsets for d items)

15 Apriori Algorithm. First reading assignment: "Fast Algorithms for Mining Association Rules", by Rakesh Agrawal and Ramakrishnan Srikant, VLDB 1994, Santiago, Chile. A simple idea, but the impact is very high: it is one of the classic papers on the topic, cited numerous times (~8000 according to the latest Google Scholar results).

16 Observation. For a rule $X \rightarrow Y$ to be significant, $X \cup Y$ must be frequent. Once a frequent itemset is discovered, rules can be easily generated from it. Therefore, we have two subproblems: frequent itemset generation from the database, and association rule generation from the frequent itemsets.

17 Apriori Principle. If an itemset S is not frequent, then no superset T of S is frequent, because $\mathrm{supp}(T) \le \mathrm{supp}(S)$ whenever $S \subseteq T$. Therefore, an itemset can be pruned from consideration if any of its subsets is found to be infrequent.

18 Applying Apriori Principle [Figure: itemset lattice traversed in processing order; once an itemset is found to be infrequent, its supersets can be pruned]

19 Example (minimum support = 3). Items (1-itemsets). Pairs (2-itemsets): combinations among Bread, Milk, Beer, Diaper only. Triplets (3-itemsets): ignore any 3-itemset containing {Coke}, {Eggs}, {Bread, Beer}, or {Milk, Beer}.

20 Candidate Itemsets Generation. Start with the frequent 1-itemsets. At iteration k, generate candidate (k+1)-itemsets from the frequent k-itemsets. Items in all candidate itemsets are maintained in the same order (i.e., by item_id). Merge two itemsets $s_1$ and $s_2$ if and only if they share all items except the last one and $s_1$.last_item_id < $s_2$.last_item_id. Correctness proof.
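
The merge rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming itemsets are represented as sorted tuples (so string order plays the role of item_id); the function name `generate_candidates` is hypothetical. It also applies the subset-based pruning from the Apriori Principle, using a plain set lookup instead of the hash tree discussed later.

```python
from typing import List, Tuple

Itemset = Tuple[str, ...]


def generate_candidates(frequent_k: List[Itemset]) -> List[Itemset]:
    """Build candidate (k+1)-itemsets from frequent k-itemsets (sorted tuples)."""
    prev = sorted(frequent_k)
    prev_set = set(prev)
    candidates: List[Itemset] = []
    for i, s1 in enumerate(prev):
        for s2 in prev[i + 1:]:
            # Because `prev` is sorted, itemsets sharing a prefix are contiguous.
            if s1[:-1] != s2[:-1]:
                break
            cand = s1 + (s2[-1],)  # s1.last < s2.last holds by sort order
            # Prune: every k-subset of the candidate must itself be frequent.
            if all(cand[:j] + cand[j + 1:] in prev_set for j in range(len(cand))):
                candidates.append(cand)
    return candidates


if __name__ == "__main__":
    frequent_pairs = [("beer", "diaper"), ("bread", "diaper"),
                      ("bread", "milk"), ("diaper", "milk")]
    print(generate_candidates(frequent_pairs))
```

Each candidate is produced exactly once, because only the pair of itemsets that agree on the first k-1 items can generate it.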

21 Two Technical Challenges Efficiently eliminating candidates containing at least one infrequent subset Efficiently determining the support of the candidates Both leverage the hash tree structure

22 Hash Tree Construction. For k-itemsets, construct a hash tree of depth k. At the i-th level, hash on the i-th item and follow the branch. If there are few itemsets remaining, stop and store the itemsets at the node. [Figure: hash function with three branches: 1,4,7 / 2,5,8 / 3,6,…]

23 Match a 5-itemset Against a 3-itemsets Hash Tree [Figure: hash tree using the same three-branch hash function. Annotations: matched transaction against 3 out of 15 candidates; compared against 9 out of 15 candidates.]

24 Using the Hash Tree Eliminating candidate k-itemsets containing at least one infrequent subset Construct hash tree H of frequent (k-1)-itemsets Check each candidate k-itemset against H Prune away a candidate if at least one branch leads to an empty match Determining the support of the candidate k-itemsets Construct hash tree H of candidate k-itemsets Check each transaction against H Increment count for each matching candidate

25 The Algorithm Apriori
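
The level-wise loop can be sketched as follows. This is a simplified illustration rather than the paper's implementation: it assumes an absolute minimum support count, represents itemsets as sorted tuples, and counts support with a linear scan over the candidates instead of a hash tree. The function name `apriori` and its parameters are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Itemset = Tuple[str, ...]


def apriori(transactions: Iterable[Iterable[str]], min_count: int) -> Dict[Itemset, int]:
    """Return every frequent itemset (as a sorted tuple) with its support count."""
    db = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts: Dict[Itemset, int] = defaultdict(int)
    for t in db:
        for item in t:
            counts[(item,)] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        prev = sorted(frequent)
        prev_set = set(prev)
        # Candidate generation: merge k-itemsets that share their first k-1 items.
        candidates: List[Itemset] = []
        for i, s1 in enumerate(prev):
            for s2 in prev[i + 1:]:
                if s1[:-1] != s2[:-1]:
                    break
                cand = s1 + (s2[-1],)
                # Apriori pruning: all k-subsets must be frequent.
                if all(sub in prev_set for sub in combinations(cand, k)):
                    candidates.append(cand)
        # Support counting (linear scan; the lecture uses a hash tree here).
        level_counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if t.issuperset(c):
                    level_counts[c] += 1
        frequent = {s: n for s, n in level_counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    toy_db = [  # hypothetical toy data
        {"bread", "milk"},
        {"bread", "diaper", "beer", "eggs"},
        {"milk", "diaper", "beer", "coke"},
        {"bread", "milk", "diaper", "beer"},
        {"bread", "milk", "diaper", "coke"},
    ]
    for itemset, count in sorted(apriori(toy_db, min_count=3).items()):
        print(itemset, count)
```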

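26 Next: Association Rules Generation. Naïve algorithm: for each frequent itemset $S = \{i_1, i_2, \ldots, i_m\}$ and for each non-empty proper subset T of S, compute the confidence as $\mathrm{conf}((S - T) \rightarrow T) = \mathrm{supp}(S) \,/\, \mathrm{supp}(S - T)$, and output $(S - T) \rightarrow T$ if the confidence is high enough. The same Apriori Principle can be applied! The confidence of $(S - T_1) \rightarrow T_1$ is always at least as high as that of $(S - T_2) \rightarrow T_2$ if $T_1$ is a subset of $T_2$.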
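
The naïve rule generation step can be sketched as below. It assumes a dictionary `support_count` that maps every frequent itemset (as a sorted tuple) to its support count, which is what a frequent itemset miner would produce; the function name and the sample counts are hypothetical. The confidence-based pruning mentioned on the slide is deliberately left out to keep the sketch short.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Itemset = Tuple[str, ...]
Rule = Tuple[Itemset, Itemset, float]  # (LHS, RHS, confidence)


def rules_from_itemset(itemset: Itemset,
                       support_count: Dict[Itemset, int],
                       min_conf: float) -> List[Rule]:
    """Enumerate rules (S - T) -> T for every non-empty proper subset T of S."""
    items = tuple(sorted(itemset))
    rules: List[Rule] = []
    for r in range(1, len(items)):
        for rhs in combinations(items, r):
            lhs = tuple(i for i in items if i not in rhs)
            # conf((S - T) -> T) = supp(S) / supp(S - T)
            conf = support_count[items] / support_count[lhs]
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
    return rules


if __name__ == "__main__":
    counts = {("beer",): 3, ("diaper",): 4, ("beer", "diaper"): 3}  # hypothetical
    for lhs, rhs, conf in rules_from_itemset(("beer", "diaper"), counts, 0.8):
        print(lhs, "->", rhs, round(conf, 2))
```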

27 Property: Closed-ness. An itemset S is closed if none of its proper supersets has the same support. {Beer} is not closed because {Beer, Diaper} has the same support; {Beer, Diaper} is closed.

28 Property: Maximality. An itemset S is maximally frequent if S is frequent and none of its proper supersets is frequent.

29 A Comparison An itemset S can be closed, but not maximally frequent The superset T of S may have lower support, but still above the minimum support threshold
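
Both properties can be checked mechanically once all frequent itemsets and their support counts are known. The sketch below is one simple way to do it; the function name and the sample counts are assumptions for illustration. It relies on the fact that an infrequent superset can never have the same support as a frequent itemset, so only frequent supersets need to be inspected.

```python
from typing import Dict, FrozenSet, Set, Tuple

Itemset = FrozenSet[str]


def closed_and_maximal(freq: Dict[Itemset, int]) -> Tuple[Set[Itemset], Set[Itemset]]:
    """Split frequent itemsets into the closed ones and the maximal ones.

    `freq` must contain every frequent itemset with its support count.
    """
    closed: Set[Itemset] = set()
    maximal: Set[Itemset] = set()
    for s, count in freq.items():
        supersets = [t for t in freq if s < t]  # proper frequent supersets
        if not supersets:
            maximal.add(s)
        if all(freq[t] < count for t in supersets):
            closed.add(s)
    return closed, maximal


if __name__ == "__main__":
    freq = {  # hypothetical counts, minimum support = 3
        frozenset({"beer"}): 3, frozenset({"bread"}): 4,
        frozenset({"diaper"}): 4, frozenset({"milk"}): 4,
        frozenset({"beer", "diaper"}): 3, frozenset({"bread", "diaper"}): 3,
        frozenset({"bread", "milk"}): 3, frozenset({"diaper", "milk"}): 3,
    }
    closed, maximal = closed_and_maximal(freq)
    print(len(closed), len(maximal))
```

Every maximal itemset is also closed, while the reverse does not hold, which is the point of the comparison above.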

30 Why Study Those Properties Closed Frequent Itemsets are interesting Their support is a proper aggregation of all their supersets instead of coming from just one Maximally Frequent Itemsets are the most interesting They are long/large enough to convey non-trivial knowledge

31 Mining Maximally Frequent Itemsets Second Reading Assignment Efficiently Mining Long Patterns from Databases, by Roberto Bayardo, SIGMOD 1998, Seattle, Washington

32 Inefficiency in Apriori Algorithm Processing order

33 Goal: discover maximally frequent itemsets before examining their subsets. How? Given a candidate itemset, check not only its own support but also the support of the largest itemset that can be derived from it. If that largest itemset is frequent, then we can ignore all of its subsets.

34 Max-Miner: The Key Concept Candidate group (g) Head: H(g), the current candidate frequent itemset being checked Tail: T(g), all subsequent items that can be added to the group H(g) + T(g): the largest itemset that can be derived from H(g) Prune based not only on the support of H(g), as in Apriori, but also on the support of H(g) + T(g)

35 Max-Miner Processing Tree (Breadth First) [Figure: set-enumeration tree over the items a, b, c, d, expanded level by level from the single items up to {abcd}.] Frequent itemsets: {a}, {b}, {c}, {d}, {bc}, {bd}, {cd}, {abd}, {bcd}. Annotations: {bcd} is frequent; {cd} is a subset of {bcd}; {abc} is infrequent; {ac} is infrequent. Pruned candidates: Apriori: 4; Max-Miner: 8.

36 Key Optimization. Goal: encounter maximally frequent itemsets as often as possible so the pruning can achieve its maximum potential. Trick: reorder items according to their support, placing the most frequent items last because they then appear in more candidate groups (e.g., {d} in the previous example). This often results in orders-of-magnitude improvements in performance! (Section 6.3)

37 The Algorithm Max-Miner
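
The head/tail lookahead can be illustrated with a much simplified sketch. Note the assumptions: this version explores candidate groups recursively (depth first) rather than breadth first as Max-Miner does, it recounts support with a full scan for every itemset, and it omits the paper's other optimizations. The function name `max_miner_sketch` is hypothetical. What it does keep is the key idea: test H(g) + T(g) first, and skip the whole group if that itemset is frequent or already covered by a known maximal itemset.

```python
from typing import FrozenSet, Iterable, List


def max_miner_sketch(transactions: Iterable[Iterable[str]],
                     min_count: int) -> List[FrozenSet[str]]:
    """Return the maximal frequent itemsets using head/tail lookahead pruning."""
    db = [frozenset(t) for t in transactions]
    maximal: List[FrozenSet[str]] = []

    def count(itemset: FrozenSet[str]) -> int:
        return sum(1 for t in db if itemset <= t)

    def covered(itemset: FrozenSet[str]) -> bool:
        return any(itemset <= m for m in maximal)

    def add_maximal(itemset: FrozenSet[str]) -> None:
        maximal[:] = [m for m in maximal if not m <= itemset]
        maximal.append(itemset)

    def expand(head: FrozenSet[str], tail: List[str]) -> None:
        full = head | frozenset(tail)  # H(g) + T(g)
        if covered(full):
            return  # the whole group lies inside a known maximal itemset
        if count(full) >= min_count:
            add_maximal(full)  # lookahead succeeded: skip all subsets
            return
        # Keep only tail items that extend the head to a frequent itemset.
        new_tail = [i for i in tail if count(head | {i}) >= min_count]
        if not new_tail:
            if head and not covered(head):
                add_maximal(head)
            return
        for idx, item in enumerate(new_tail):
            expand(head | {item}, new_tail[idx + 1:])

    items = sorted({i for t in db for i in t})
    frequent = [i for i in items if count(frozenset([i])) >= min_count]
    if not frequent:
        return []
    # Item reordering: least frequent first, so the most frequent items come
    # last and therefore appear in the most tails.
    frequent.sort(key=lambda i: count(frozenset([i])))
    expand(frozenset(), frequent)
    return maximal


if __name__ == "__main__":
    toy_db = [  # hypothetical toy data
        {"bread", "milk"},
        {"bread", "diaper", "beer", "eggs"},
        {"milk", "diaper", "beer", "coke"},
        {"bread", "milk", "diaper", "beer"},
        {"bread", "milk", "diaper", "coke"},
    ]
    for m in max_miner_sketch(toy_db, min_count=3):
        print(sorted(m))
```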

38 Today's Outline: Overview of Data Mining (What, Why, How); Classic Studies; Association Rule Mining; Data Cube Analysis; Rule Interestingness

39 Data Warehouse / OLAP. A decision support system that enables users to perform aggregate analysis of historical data to discover hidden patterns and trends. Separated from the operational system (OLTP), which records the day-to-day transactions and activities. Gartner Dataquest estimates the size of the data warehouse market at $30B and growing, and that's even before businesses start analyzing Web data.

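40 Data Cube Analysis [Figure: three-dimensional data cube with dimensions Product (TV, PC, DVD), Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), and Country (U.S.A, Canada, Mexico), annotated with the aggregations: group by product, date; group by country, date; group by country, product; group by country; group by date; group by product; group by none (ALL).]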

41 OLAP Operators. Roll-up: go up the hierarchy or remove a dimension from the group-by. Drill-down: move down the hierarchy or add a dimension to the group-by. Pivot: pick k dimensions, group by all other dimensions. Slice and dice: point/range selection over k dimensions.

42 Cube Operator. An operator designed specifically to compute all those aggregations at once. Not a fundamental operator like select or project, but a very useful one. Challenge: how to cube efficiently? [Figure: Cube By product, date, country produces: group by product, date; group by country, date; group by country, product; group by country; group by date; group by product; group by none.]
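
A naïve way to realize the cube operator is to run one group-by per subset of the dimensions, which makes the $2^n$ cost visible. The sketch below does exactly that for a SUM measure; the `SALES` rows, the column names, and the function name `cube` are made up for illustration and say nothing about how an efficient implementation would share work across group-bys.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]

# Hypothetical fact table (illustrative values only).
SALES: List[Row] = [
    {"product": "TV", "country": "USA", "date": "1Qtr", "sales": 10.0},
    {"product": "TV", "country": "Canada", "date": "1Qtr", "sales": 5.0},
    {"product": "PC", "country": "USA", "date": "2Qtr", "sales": 7.0},
    {"product": "DVD", "country": "Mexico", "date": "2Qtr", "sales": 3.0},
]


def cube(rows: List[Row], dims: Sequence[str],
         measure: str) -> Dict[Tuple[str, ...], Dict[Tuple[object, ...], float]]:
    """Compute SUM(measure) for every subset of `dims` (2^n group-bys)."""
    result: Dict[Tuple[str, ...], Dict[Tuple[object, ...], float]] = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals: Dict[Tuple[object, ...], float] = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += float(row[measure])
            result[group] = dict(totals)
    return result


if __name__ == "__main__":
    full = cube(SALES, ("product", "country", "date"), "sales")
    print(full[()])              # group by none (ALL)
    print(full[("product",)])    # group by product
```

Roll-up and drill-down then correspond to moving between entries of the returned dictionary whose keys differ by one dimension.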

43 10 min Break

44 Efficient Cube Analysis Bottom-Up Computation of Sparse and Iceberg CUBEs, by Kevin Beyer and Raghu Ramakrishnan, SIGMOD 1999, Philadelphia, PA

45 Observation. Not all aggregations are interesting: some aggregations may be computed over zero or very few actual tuples. E.g., if there are few or no sales of DVDs in 1Qtr, then computing the aggregate over country for (DVD, 1Qtr) does not provide much information. Concept: Iceberg Cube, i.e., CUBE BY $d_1, d_2, \ldots$ HAVING COUNT(*) > s. Familiar? Yes, it is basically support!

46 Key Concept: Measure. The aggregate value computed from the tuples within a group-by, e.g., total sales or number of transactions. Two very important properties. Let $S_1, S_2, \ldots, S_m$ be a complete set of disjoint subsets of T. A measure F is algebraic if $F(T) = H(G(S_1), G(S_2), \ldots, G(S_m))$, where the function G returns an M-tuple as its result and M is constant regardless of $S_i$ and T. A measure is monotonic (increasing or decreasing) if it is always true that $F(T) \ge F(S_i)$, or always true that $F(T) \le F(S_i)$.

47 Example Measures (Algebraic? Monotonic?). Summation: yes and yes (monotonic when all values are non-negative). Max and Min: yes and yes. Average: yes and no. Count Unique: no and yes. [Annotation: measures that are being handled]
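
Average is a useful test case for the two definitions. The sketch below shows it is algebraic by taking G to return the 2-tuple (sum, count) for each partition and H to combine those tuples; the function names are assumptions for this example. The same data also shows why average is not monotonic: one partition has a higher average than the whole and another has a lower one.

```python
from typing import List, Sequence, Tuple


def partial(values: Sequence[float]) -> Tuple[float, int]:
    """G: summarize one partition as a fixed-size tuple (sum, count)."""
    return (float(sum(values)), len(values))


def combine(partials: Sequence[Tuple[float, int]]) -> float:
    """H: combine the per-partition tuples into the overall average."""
    total = sum(s for s, _ in partials)
    n = sum(c for _, c in partials)
    return total / n


def average(partitions: List[Sequence[float]]) -> float:
    """F(T) = H(G(S_1), ..., G(S_m)) for disjoint partitions S_i of T."""
    return combine([partial(p) for p in partitions])


if __name__ == "__main__":
    parts = [[1.0, 2.0, 3.0], [4.0, 5.0]]
    print(average(parts))                       # average over the whole set
    print([average([p]) for p in parts])        # averages of each partition
```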

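48 Processing Tree (Depth First) [Figure: BUC processing tree over dimensions a, b, c, d, from "all" through a, b, c, d; ab, ac, ad, bc, bd, cd; abc, abd, acd, bcd; down to abcd. Partitions $a_1, a_2, \ldots, a_k$ are refined into $a_1 b_1, a_1 b_2, \ldots, a_1 b_k$. Annotations: {$b_k$} has low support; {$a_i c_j$} has low support.]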

49 Sorting and Partitioning BUC stops here early when support is below threshold

50 More Optimizations. Three sorting algorithms are leveraged at different times: counting sort, when there are more tuples than the dimension cardinality; quick sort, when the number of tuples is much smaller than the dimension cardinality; insertion sort, when there are very few tuples. Dimension reordering: decreasing cardinality (higher cardinality → smaller partitions → BUC can stop early); increasing skew (less skew → higher effective cardinality).

51 High Level Algorithm

52 Drawbacks of BUC. If the cube is dense, BUC is not efficient: most of the bottom partitions will satisfy the minimum support. BUC also does not leverage the algebraic property: supersets are always processed before the subsets. BUC also does not deal with non-monotonic measures as the pruning condition.

53 So Far. Basic concepts: association rules and frequent itemsets in a transaction database; data warehouse and data cube analysis. Algorithms: Apriori / hash tree; Max-Miner / candidate reordering; BUC / dynamic selection of sorting algorithms. Common theme: good idea supported by good engineering.

54 Today's Outline: Overview of Data Mining (What, Why, How); Classic Studies; Association Rule Mining; Data Cube Analysis; Rule Interestingness

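55 Is Confidence a Good Measure? [Contingency table of Tea / No_Tea against Coffee / No_Coffee. As fractions of all transactions, the confidence and lift figures on these slides imply: Tea and Coffee 0.15, Tea and No_Coffee 0.05, No_Tea and Coffee 0.75, No_Tea and No_Coffee 0.05.] Association rule: Tea → Coffee. Confidence = #(Tea, Coffee) / #(Tea) = 0.75. But the confidence of No_Tea → Coffee = 0.75 / 0.80 ≈ 0.94!! Although the confidence of Tea → Coffee is high, the rule is misleading.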

56 Alternative Measures: Lift. Association rule: Tea → Coffee [same Tea / Coffee contingency table as on slide 55]. Lift = P(Tea, Coffee) / (P(Tea) P(Coffee)) = 0.15 / 0.18 ≈ 0.83 < 1. Negative correlation!

57 Alternative Measures: Leverage (P-S). Association rule: Tea → Coffee [same Tea / Coffee contingency table as on slide 55]. Leverage = P(Tea, Coffee) − P(Tea) P(Coffee) = 0.15 − 0.18 = −0.03. Less support than expected!
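
The three measures for the Tea → Coffee rule can be recomputed from three probabilities. The sketch below assumes P(Tea, Coffee) = 0.15, P(Tea) = 0.20, and P(Coffee) = 0.90; only 0.15 and the product 0.18 appear on the slides, so the two marginals are inferred from the stated confidence of 0.75. The function and constant names are hypothetical.

```python
def confidence(p_xy: float, p_x: float) -> float:
    """conf(X -> Y) = P(X, Y) / P(X)."""
    return p_xy / p_x


def lift(p_xy: float, p_x: float, p_y: float) -> float:
    """lift(X -> Y) = P(X, Y) / (P(X) P(Y)); 1 means independence."""
    return p_xy / (p_x * p_y)


def leverage(p_xy: float, p_x: float, p_y: float) -> float:
    """leverage(X -> Y) = P(X, Y) - P(X) P(Y); 0 means independence."""
    return p_xy - p_x * p_y


# Assumed probabilities, inferred from the figures quoted on the slides.
P_TEA_COFFEE = 0.15
P_TEA = 0.20
P_COFFEE = 0.90

if __name__ == "__main__":
    print("confidence(Tea -> Coffee):", confidence(P_TEA_COFFEE, P_TEA))
    print("confidence(No_Tea -> Coffee):",
          confidence(P_COFFEE - P_TEA_COFFEE, 1 - P_TEA))
    print("lift:", lift(P_TEA_COFFEE, P_TEA, P_COFFEE))
    print("leverage:", leverage(P_TEA_COFFEE, P_TEA, P_COFFEE))
```

A lift below 1 and a negative leverage both say the same thing here: tea drinkers buy coffee less often than the overall coffee rate would predict.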

58 Lift/Leverage is not Perfect Either. Lift is too sensitive to support; leverage is insensitive to support. [Two contingency tables over X and Y; cell values lost.] First table: Lift = 0.1 / ((0.1)(0.1)) = 10. Second table: Lift = 0.9 / ((0.9)(0.9)) ≈ 1.11.

59 More Measures

60 Property of Rules. Three properties of good measures: M(A → B) = 0 or 1 if A and B are statistically independent; M(A → B) increases with P(AB) if P(A) and P(B) remain constant; M(A → B) decreases with P(A) if P(AB) and P(B) remain constant. Symmetric and asymmetric measures: Lift and Leverage are symmetric; Confidence is asymmetric. Total/partial ordering of rules.

61 Type-1 Error: False Positives. In exploratory analysis, when the amount of data is large enough, some seemingly significant rules will appear purely by chance! Similar to the birthday paradox: in a room with 23 people, the probability that at least two of them share a birthday is greater than 50%. Statistics to the rescue!
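
The birthday figure is easy to verify numerically. The sketch below computes the probability that at least two of n people share a birthday, assuming 365 equally likely birthdays and independent people; the function name is hypothetical.

```python
def shared_birthday_probability(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday), uniform birthdays assumed."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 22, 23, 50):
        print(n, round(shared_birthday_probability(n), 4))
```

The analogy to rule mining is that the number of candidate rules, like the number of pairs of people, grows much faster than intuition suggests, so chance coincidences become likely.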

62 Holdout Evaluation

63 Subjective Interestingness. Applicable especially to cube analysis: sales of (iPod, NYC) is $200M. Is it interesting? Unexpectedness: interesting if the average sales of (iPod, City) is $20M. Actionable: interesting if knowing the sales can trigger some actions, such as adjusting logistics and advertising.

64 Identifying Best Rules: "Mining the Most Interesting Rules", by Roberto Bayardo and Rakesh Agrawal, SIGKDD. Optimized rule mining problem in the presence of partial ordering.

65 Optimized Rule Mining. If there is a total order $\le_t$ over all the rules, identify the optimal rule A as follows: A satisfies the conditions, and there is no rule B s.t. B satisfies the conditions and $A <_t B$. E.g., rules can be ordered according to their confidence, and we pick the rule with the highest confidence. However, often only a partial order is available, and selecting one optimal rule is not reasonable!

66 Optimized Rule Mining with Partial Order. Given a partial order $\le_p$ over the rules, identify a set of rules S s.t. each rule $s \in S$ is optimal and there is a rule for each equivalence class in the partial order. If two rules can be compared according to the partial order, they are in the same equivalence class. Only one rule from each equivalence class can be selected. E.g., if $r_1 >_p r_2$ and $r_1 >_p r_3$, while $r_2$ and $r_3$ are not comparable, then only $r_1$ can be selected.

67 SC-Optimal Rule Mining. Partial order $\le_{sc}$: defined based on support and confidence; a rule $r_2$ is greater than another rule $r_1$ if and only if $r_2$ has greater or equal support and greater or equal confidence. Partial order $\le_{s\neg c}$: defined based on support and confidence; a rule $r_2$ is greater than another rule $r_1$ if and only if $r_2$ has greater or equal support and lesser or equal confidence.
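
Under the first partial order, the optimal rules are those that no other rule beats on both support and confidence, i.e., the upper frontier of the (support, confidence) plane. The sketch below filters a list of already mined rules down to that frontier; it is a quadratic post-processing illustration with hypothetical names and numbers, not the search algorithm from the paper. It treats one rule as strictly greater than another when it is at least as good on both measures and strictly better on at least one.

```python
from typing import List, Tuple

ScoredRule = Tuple[str, float, float]  # (label, support, confidence)


def dominates(a: ScoredRule, b: ScoredRule) -> bool:
    """True if rule a is strictly greater than rule b in the sc order."""
    return (a[1] >= b[1] and a[2] >= b[2]) and (a[1] > b[1] or a[2] > b[2])


def sc_optimal(rules: List[ScoredRule]) -> List[ScoredRule]:
    """Keep the rules that are not dominated by any other rule."""
    return [r for r in rules if not any(dominates(o, r) for o in rules)]


if __name__ == "__main__":
    mined = [  # hypothetical (label, support, confidence) triples
        ("r1", 0.5, 0.6),
        ("r2", 0.4, 0.9),
        ("r3", 0.3, 0.8),
        ("r4", 0.5, 0.5),
    ]
    print([label for label, _, _ in sc_optimal(mined)])
```

Flipping the comparison on confidence in `dominates` would give the frontier for the second partial order instead.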

68 Theoretical Implication. Many total orders are implied by the two SC partial orders. A total order is implied by a partial order if $r_1 <_p r_2 \Rightarrow r_1 <_t r_2$ and $r_1 =_p r_2 \Rightarrow r_1 =_t r_2$. Total orders implied by $\le_{sc}$: monotonically increasing with support when confidence remains constant; monotonically increasing with confidence when support remains constant. Exactly two of the properties of good measures!

69 PC-Optimal Rule Mining. Partial order $\le_{pc}$: defined based on population, i.e., the set of records characterized by the rule. A rule $r_2$ is greater than another rule $r_1$ if and only if the population of $r_2$ is a superset of that of $r_1$ and $r_2$ has greater or equal confidence. The PC partial order contains many more equivalence classes: two rules that are PC comparable must be SC comparable, but two rules may be SC comparable without being PC comparable.

70 Summary. Association rule mining: Apriori & Max-Miner. Data cube analysis: BUC. Rule interestingness: finding optimal rules according to a partial ordering. Next week: advanced data mining topics.

71 Questions?
