Introduction to Data Mining


1 Introduction to Data Mining 1

2 Large-scale data is everywhere! There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies. New mantra: Gather whatever data you can, whenever and wherever possible. Expectations: Gathered data will have value either for the purpose collected or for a purpose not envisioned. Example domains: homeland security, geo-spatial data, business data, sensor networks, computational simulations. 2

3 Why data mining? Commercial viewpoint: Lots of data is being collected and warehoused. Web data: Yahoo has petabytes of web data. Facebook has ~2B active users. Purchases at department/grocery stores, e-commerce: Amazon records 1.1B orders a year. Bank/Credit Card transactions. Computers have become cheaper and more powerful. Competitive pressure is strong. Provide better, customized services for an edge (e.g. in Customer Relationship Management). 3

4 Why data mining? Scientific viewpoint: Data is collected and stored at enormous speeds. Remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of earth science data per year. Telescopes scanning the skies: sky survey data. High-throughput biological data. Scientific simulations: terabytes of data generated in a few hours. Data mining helps scientists in automated analysis of massive datasets and in hypothesis formation. 4

5 What is data mining? Many definitions: Non-trivial extraction of implicit, previously unknown and potentially useful information from data. Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns. 5

6 Origins of data mining Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable due to data that is: Large-scale High dimensional Heterogeneous Complex Distributed Key Distinction: Data driven vs. Hypothesis driven 6

7 Data mining tasks Prediction task: Use some variables to predict unknown or future values of other variables. Description task: Find human-interpretable patterns that describe the data. From Fayyad et al., Advances in Knowledge Discovery and Data Mining.

8 Data mining methods (Figure: example data sets, including a relational table with attributes Tid, Refund, Marital Status, Taxable Income, and Cheat, and a market-basket example.) 8

9 Predictive modeling: Classification. Find a model for the class attribute as a function of the values of the other attributes. Tid Employed Level of Education # years at present address Class Credit Worthy 1 Yes Graduate 5 Yes 2 Yes High School 2 No 3 No Undergrad 1 No 4 Yes High School 10 Yes (Figure: a decision-tree model for predicting credit worthiness that tests Employed, Education {High school, Undergrad}, and Number of years.) 9

10 Examples of classification Predicting tumor cells as benign or malignant. Classifying credit card transactions as legitimate or fraudulent. Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil. Categorizing news stories as finance, weather, entertainment, sports, etc. Identifying intruders in cyberspace. 10

11 Clustering Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized Inter-cluster distances are maximized 11

12 Applications of clustering Understanding: custom profiling for targeted marketing; group related documents for browsing; group genes and proteins that have similar functionality; group stocks with similar price fluctuations. Summarization: reduce the size of large data sets. (Figure courtesy of Michael Eisen; example: use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres, plotted by latitude and longitude.) 12

13 Association rule discovery Given a set of records each of which contain some number of items from a given collection. Produce dependency rules which will predict occurrence of an item based on occurrences of other items. TID Items 1 Bread, Coke, Milk 2 Beer, Bread 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Coke, Diaper, Milk Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer} 13

14 Association analysis: Applications Market-basket analysis. Rules are used for sales promotion, shelf management, and inventory management. Telecommunication alarm diagnosis. Rules are used to find combination of alarms that occur together frequently in the same time period. Medical Informatics. Rules are used to find combination of patient symptoms and test results associated with certain diseases. 14

15 Motivating challenges Scalability. High dimensionality. Heterogeneous and complex data. Data ownership and distribution. Non-traditional analysis. 15

16 The 4 Vs of Big Data 16

17 Pattern Mining

18 ASSOCIATION RULES

19 Association Rule Mining Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. Market-basket transactions: TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Examples of association rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. Implication means co-occurrence, not causality!

20 Definition: Frequent Itemset Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper}. k-itemset: an itemset that contains k items. Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2. Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5. Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold. TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

21 Definition: Association Rule Association Rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer}. Rule Evaluation Metrics: Support (s): fraction of transactions that contain both X and Y. Confidence (c): measures how often items in Y appear in transactions that contain X; it is nothing more than P(Y | X). TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example: {Milk, Diaper} → {Beer}: s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4; c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67.
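To make the two metrics concrete, here is a minimal Python sketch (not part of the original slides) that computes the support count, support, and confidence on the five example transactions; the function names are my own.

```python
# Market-basket transactions from the slides, as Python sets.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X) = P(Y | X)."""
    return (support_count(set(lhs) | set(rhs), transactions)
            / support_count(lhs, transactions))

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))   # 0.666...
```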

22 Association Rule Mining Task Given a set of transactions T, the goal of association rule mining is to find all rules having: 1) support ≥ minsup threshold, and 2) confidence ≥ minconf threshold. TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67) {Milk, Beer} → {Diaper} (s=0.4, c=1.0) {Diaper, Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk, Diaper} (s=0.4, c=0.67) {Diaper} → {Milk, Beer} (s=0.4, c=0.5) {Milk} → {Diaper, Beer} (s=0.4, c=0.5)

23 An approach. 1. List all possible association rules. 2. Compute the support and confidence for each rule. 3. Prune rules that fail the minsup and minconf thresholds.

24 Computational Complexity Given d unique items: Total number of itemsets = 2^d. Total number of possible association rules: R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^{d+1} + 1. If d = 6, R = 602 rules.

25 Mining Association Rules TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67) {Milk, Beer} → {Diaper} (s=0.4, c=1.0) {Diaper, Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk, Diaper} (s=0.4, c=0.67) {Diaper} → {Milk, Beer} (s=0.4, c=0.5) {Milk} → {Diaper, Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.

26 Mining association rules Two-step approach: 1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup. 2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is still computationally expensive.

27 Frequent itemset generation strategies Reduce the number of candidates (M): complete search considers M = 2^d candidates; use pruning techniques to reduce M. Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms. Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions; no need to match every candidate against every transaction.

28 Pattern Lattice null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE ABCDE Given d items, there are 2^d possible candidate itemsets.

29 Reducing the number of candidates Observation: if an itemset is frequent, then all of its subsets must also be frequent. This holds due to the following property of the support measure: ∀ X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y). The support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.

30 Illustrating support's anti-monotonicity null A B C D E AB AC AD AE BC BD BE CD CE DE Found to be Infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE Pruned supersets ABCDE

31 Illustrating support's anti-monotonicity TID Items 1 Bread, Milk 2 Beer, Bread, Diaper, Eggs 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Bread, Coke, Diaper, Milk Minimum Support = 3

32 Illustrating support's anti-monotonicity TID Items 1 Bread, Milk 2 Beer, Bread, Diaper, Eggs 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Bread, Coke, Diaper, Milk Items (1-itemsets): Item Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1 Minimum Support = 3

34 Illustrating support's anti-monotonicity Items (1-itemsets): Item Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1 Minimum Support = 3 Pairs (2-itemsets): Itemset {Bread, Milk} {Bread, Beer} {Bread, Diaper} {Beer, Milk} {Diaper, Milk} {Beer, Diaper} (No need to generate candidates involving Coke or Eggs)

35 Illustrating support's anti-monotonicity Items (1-itemsets): Item Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1 Minimum Support = 3 Pairs (2-itemsets): Itemset Count {Bread, Milk} 3 {Beer, Bread} 2 {Bread, Diaper} 3 {Beer, Milk} 2 {Diaper, Milk} 3 {Beer, Diaper} 3 (No need to generate candidates involving Coke or Eggs)

36 Illustrating support's anti-monotonicity Items (1-itemsets): Item Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1 Minimum Support = 3 Pairs (2-itemsets): Itemset Count {Bread, Milk} 3 {Bread, Beer} 2 {Bread, Diaper} 3 {Milk, Beer} 2 {Milk, Diaper} 3 {Beer, Diaper} 3 (No need to generate candidates involving Coke or Eggs) Triplets (3-itemsets): Itemset {Beer, Diaper, Milk} {Beer, Bread, Diaper} {Bread, Diaper, Milk} {Beer, Bread, Milk}

37 Illustrating support's anti-monotonicity Items (1-itemsets): Item Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1 Minimum Support = 3 Pairs (2-itemsets): Itemset Count {Bread, Milk} 3 {Bread, Beer} 2 {Bread, Diaper} 3 {Milk, Beer} 2 {Milk, Diaper} 3 {Beer, Diaper} 3 (No need to generate candidates involving Coke or Eggs) Triplets (3-itemsets): {Beer, Diaper, Milk} {Beer, Bread, Diaper} {Bread, Diaper, Milk} {Beer, Bread, Milk} If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates. With support-based pruning: 6 + 6 + 4 = 16.

38 Illustrating support's anti-monotonicity Items (1-itemsets): Item Count Bread 4 Coke 2 Milk 4 Beer 3 Diaper 4 Eggs 1 Minimum Support = 3 Pairs (2-itemsets): Itemset Count {Bread, Milk} 3 {Bread, Beer} 2 {Bread, Diaper} 3 {Milk, Beer} 2 {Milk, Diaper} 3 {Beer, Diaper} 3 (No need to generate candidates involving Coke or Eggs) If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates. Using also the frequent pairs to prune candidate triplets: 6 + 6 + 1 = 13, since {Bread, Diaper, Milk} is the only triplet all of whose 2-item subsets are frequent.

39 APRIORI

40 Apriori algorithm Notation: F_k = frequent k-itemsets, L_k = candidate k-itemsets. Algorithm: Let k = 1. Generate F_1 = {frequent 1-itemsets}. Repeat until F_k is empty: 1. Candidate Generation: generate L_{k+1} from F_k. 2. Candidate Pruning: prune candidate itemsets in L_{k+1} containing subsets of length k that are infrequent. 3. Support Counting: count the support of each candidate in L_{k+1} by scanning the DB. 4. Candidate Elimination: eliminate candidates in L_{k+1} that are infrequent, leaving only those that are frequent, yielding F_{k+1}. (Figure: the itemset lattice from null to ABCDE.) This is a level-by-level traversal of the lattice.
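A compact Python sketch of this level-by-level loop, under the assumption that transactions are given as sets of items and that minsup is an absolute support count; the function and variable names are my own, not from the slides.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori sketch.
    transactions: list of sets of items; minsup_count: absolute support threshold.
    Returns a dict mapping each frequent itemset (frozenset) to its support count."""
    # F_1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # 1. Candidate generation: merge frequent k-itemsets sharing their first k-1 items
        prev = sorted(tuple(sorted(s)) for s in frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:k - 1] == b[:k - 1]:
                    cand = frozenset(a) | frozenset(b)
                    # 2. Candidate pruning: every k-subset must itself be frequent
                    if all(frozenset(sub) in frequent
                           for sub in combinations(sorted(cand), k)):
                        candidates.add(cand)
        # 3. Support counting (one scan of the DB) and 4. candidate elimination
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        frequent = {c: n for c, n in cand_counts.items() if n >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

Called on the five market-basket transactions with minsup_count = 3, this returns the frequent 1- and 2-itemsets illustrated on the preceding slides.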

41 Candidate Generation: the F_{k−1} × F_{k−1} method Merge two frequent (k−1)-itemsets if their first (k−2) items are identical. F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}: Merge(ABC, ABD) = ABCD; Merge(ABC, ABE) = ABCE; Merge(ABD, ABE) = ABDE. Do not merge (ABD, ACD) because they share only a prefix of length 1 instead of length 2.

42 Candidate pruning Let F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets. L_4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated (from the previous slide). Candidate pruning: prune ABCE because ACE and BCE are infrequent; prune ABDE because ADE is infrequent. After candidate pruning: L_4 = {ABCD}.

43 Support counting of candidate itemsets Scan the database of transactions to determine the support of each candidate itemset. Must match every candidate itemset against every transaction, which is an expensive operation. TID Items 1 Bread, Milk 2 Beer, Bread, Diaper, Eggs 3 Beer, Coke, Diaper, Milk 4 Beer, Bread, Diaper, Milk 5 Bread, Coke, Diaper, Milk Itemset { Beer, Diaper, Milk} { Beer,Bread,Diaper} {Bread, Diaper, Milk} { Beer, Bread, Milk} Q: How should we perform this operation?

44 Support counting of candidate itemsets To reduce the number of comparisons, store the candidate itemsets in a hash structure. Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets. (Figure: N transactions on the left, a hash structure with k buckets of candidate itemsets on the right.) TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke
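As a rough illustration of the bucketing idea (a much simpler stand-in for the hash tree described on the next slides), the sketch below hashes each candidate by its smallest item, so a transaction is only compared against candidates whose smallest item it actually contains; the names and the particular hash choice are my own assumptions.

```python
from collections import defaultdict

def count_support(candidates, transactions):
    """Bucket each candidate by its smallest item, then match every transaction
    only against the buckets of items it contains."""
    buckets = defaultdict(list)
    for cand in candidates:
        buckets[min(cand)].append(frozenset(cand))

    counts = defaultdict(int)
    for t in transactions:
        for item in t:                          # only buckets reachable from this transaction
            for cand in buckets.get(item, []):
                if cand <= t:                   # candidate fully contained in the transaction
                    counts[cand] += 1
    return counts
```

Because each candidate lives in exactly one bucket, it is compared against a transaction at most once, but transactions never touch buckets for items they do not contain.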

45 Support counting: An example Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}. How many of these itemsets are supported by the transaction t = (1, 2, 3, 5, 6)? (Figure: a full n-ary enumeration tree, where n is the number of items, listing the 3-item subsets of t level by level.) Q: Can we reduce storage requirements?

46 Support counting using a hash tree Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}. You need: a hash function; a max leaf size, i.e., the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node). (Figure: a hash tree whose hash function sends items 1,4,7, items 2,5,8, and items 3,6,9 to three different branches.)

47 Factors affecting the complexity of Apriori

48 MAXIMAL & CLOSED ITEMSETS

49 Maximal frequent itemset An itemset is maximal frequent if it is frequent and none of its immediate supersets are frequent. (Figure: the itemset lattice from null to ABCDE, with a border separating the frequent itemsets from the infrequent ones; the maximal itemsets lie just inside the border.)

50 Closed itemsets An itemset X is closed if all of its immediate supersets have a lower support than X. Itemset X is not closed if at least one of its immediate supersets has the same support as X. TID Items 1 {A,B} 2 {B,C,D} 3 {A,B,C,D} 4 {A,B,D} 5 {A,B,C,D} Itemset Support {A} 4 {B} 5 {C} 3 {D} 4 {A,B} 4 {A,C} 2 {A,D} 3 {B,C} 3 {B,D} 4 {C,D} 3 Itemset Support {A,B,C} 2 {A,B,D} 3 {A,C,D} 2 {B,C,D} 2 {A,B,C,D} 2
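A small sketch of how maximal and closed frequent itemsets could be identified once all frequent itemsets and their support counts are known; the dictionary-based representation is an assumption on my part, not something prescribed by the slides.

```python
def maximal_and_closed(frequent):
    """frequent: dict mapping frozenset -> support count of every frequent itemset.
    Returns the maximal frequent itemsets and the closed frequent itemsets."""
    maximal, closed = [], []
    for itemset, sup in frequent.items():
        # Frequent immediate supersets (one extra item). Infrequent supersets
        # necessarily have strictly lower support, so they cannot violate closedness.
        supersets = [s for s in frequent
                     if itemset < s and len(s) == len(itemset) + 1]
        if not supersets:                              # no frequent immediate superset
            maximal.append(itemset)
        if all(frequent[s] < sup for s in supersets):  # no superset with equal support
            closed.append(itemset)
    return maximal, closed
```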

51 Maximal vs closed frequent itemsets Minimum support = 2 null Closed but not maximal A B C D E Closed and maximal AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE TID Items 1 ABC 2 ABCD 3 BCE 4 ACDE 5 DE 2 4 ABCD ABCE ABDE ACDE BCDE ABCDE # Closed = 9 # Maximal = 4

52 Frequent, maximal, and closed itemsets Frequent Itemsets Closed Frequent Itemsets Maximal Frequent Itemsets

53 Frequent, maximal, and closed itemsets Frequent Itemsets Closed Frequent Itemsets Maximal Frequent Itemsets Q1: What if instead of finding the frequent itemsets, we find the maximal frequent itemsets or the closed frequent itemsets? Q2: Will knowledge of just the maximal frequent itemsets allow me to generate all required association rules? Q3: Will knowledge of just the closed frequent itemsets allow me to generate all required association rules?

54 BEYOND LEVEL-BY-LEVEL EXPLORATION

55 Traversing the pattern lattice Patterns starting with B. (patterns that contain B and any other item except A) Patterns ending with C. (patterns that contain C and any other item except D) Patterns starting with A. (patterns that contain A and any other item) null null A B C D A B C D Patterns ending with D. (patterns that contain D and any other item) AB AC AD BC BD CD AB AC BC AD BD CD ABC ABD ACD BCD ABC ABD ACD BCD ABCD ABCD (a) Prefix tree (b) Suffix tree

56 Breadth-first vs depth-first Pluses and minuses? (Figure: the same lattice over items a, b, c, d, from the empty set to abcd, explored breadth-first versus depth-first.)

57 PROJECTION METHODS

58 Projection-based methods (Figure: the itemset lattice from null to ABCDE.)

59 Projection-based methods Initial database: TID Items 1 {A,B} 2 {B,C,D} 3 {A,C,D,E} 4 {A,D,E} 5 {A,B,C} 6 {A,B,C,D} 7 {B,C} 8 {A,B,C} 9 {A,B,D} 10 {B,C,E}. A projected DB on prefix pattern X is obtained as follows: eliminate any transactions that do not contain X; from the transactions that are left, retain only the items that are lexicographically greater than the items in X. Projected DB associated with node A: TID Items 1 {B} 2 {} 3 {C,D,E} 4 {D,E} 5 {B,C} 6 {B,C,D} 7 {} 8 {B,C} 9 {B,D} 10 {}. Projected DB associated with node C: TID Items 1 {} 2 {D} 3 {D,E} 4 {} 5 {} 6 {D} 7 {} 8 {} 9 {} 10 {E}.

60 Projection-based method Items are listed in lexicographic order. Let P and DB(P) be a node's pattern and its associated projected database. Mining is performed by recursively calling this function, TP(P, DB(P)): 1. Determine the frequent items in DB(P), and denote them by E(P). 2. Eliminate from DB(P) any items not in E(P). 3. For each item x in E(P), call TP(Px, DB(Px)).
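A minimal Python sketch of this recursive projection scheme on the ten-transaction database from the previous slide; the function names (project, mine) are my own, and the sketch skips the explicit elimination step 2, which only affects efficiency, not the result.

```python
# The ten-transaction database from the slide, with items kept in lexicographic order.
db = [sorted(t) for t in [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]]

def project(db, item):
    """Projected DB for the prefix extended by `item`: keep transactions containing
    the item, then keep only items that are lexicographically greater than it."""
    return [[x for x in t if x > item] for t in db if item in t]

def mine(prefix, db, minsup, results):
    """Recursive pattern growth over the prefix tree (the TP function of the slide)."""
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    for item, cnt in sorted(counts.items()):
        if cnt >= minsup:                     # frequent extension of the current prefix
            pattern = prefix + [item]
            results[tuple(pattern)] = cnt
            mine(pattern, project(db, item), minsup, results)

results = {}
mine([], db, minsup=3, results=results)
print(results)   # every frequent pattern with its support count
```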

61 BEYOND TRANSACTIONS

62 Beyond transaction datasets The concept of frequent patterns and association rules has been generalized to different types of datasets: Sequential datasets: Sequence of purchasing transactions, web-pages visited, articles read, biological sequences, event logs, etc. Relational/Graph datasets: Social networks, chemical compounds, web-graphs, information networks, etc. There is an extensive set of approaches and algorithms for them, many of which follow similar ideas to those developed for transaction datasets.

63 Clustering (Unsupervised learning)

64 What is cluster analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Intra-cluster distances are minimized Inter-cluster distances are maximized

65 Notion of a cluster can be ambiguous How many clusters?

66 Notion of a cluster can be ambiguous How many clusters? Six Clusters Two Clusters Four Clusters

67 Clustering formulations A number of clustering formulations have been developed: 1. We need to find a fixed number of clusters. Well-suited for compression-like applications. 2. We need to find clusters of fixed size. Well-suited for neighborhood discovery (e.g., recommendation engines). 3. We need to find the smallest number of clusters that satisfy certain quality criteria. Well-suited for applications in which cluster quality is important. 4. We need to find the natural number of clusters. This is clustering's holy grail! Extremely hard, problem dependent, and in practice it often requires human supervision.

68 Types of clusterings A clustering is a set of clusters. Important distinction between hierarchical and partitional sets of clusters. Partitional clustering A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering A set of nested clusters organized as a hierarchical tree.

69 Partitional clustering Original Points A Partitional Clustering

70 Hierarchical clustering p1 p2 p3 p4 p1 p2 p3 p4 Hierarchical clustering Dendrogram

71 Other distinctions between sets of clusters Exclusive versus non-exclusive In non-exclusive clusterings, points may belong to multiple clusters. Can represent multiple classes or border points. Fuzzy versus non-fuzzy In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1. Weights must sum to 1. Probabilistic clustering has similar characteristics. Partial versus complete In some cases, we only want to cluster some of the data. Heterogeneous versus homogeneous Clusters of widely different sizes, shapes, and densities.

72 Types of clusters Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or conceptual Described by an objective function

73 Types of clusters: Well-separated Well-separated clusters: A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. Three well-separated clusters

74 Types of clusters: Center-based Center-based: a cluster is a set of objects such that an object in a cluster is closer (more similar) to the center of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most representative point of a cluster. Four center-based clusters

75 Types of clusters: Contiguity-based Contiguous cluster (nearest neighbor or transitive) A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. Eight contiguous clusters

76 Types of clusters: Density-based Density-based: a cluster is a dense region of points that is separated from other regions of high density by low-density regions. Used when the clusters are irregular or intertwined, and when noise and outliers are present. Six density-based clusters

77 Types of clusters: Conceptual clusters Shared Property or Conceptual Clusters Finds clusters that share some common property or represent a particular concept. Two overlapping circles

78 Types of clusters: Objective function Clusters defined by an objective function: find clusters that minimize or maximize an objective function. Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters by using the given objective function (NP-hard). Can have global or local objectives. Hierarchical clustering algorithms typically have local objectives. Partitional algorithms typically have global objectives. A variation of the global objective function approach is to fit the data to a parameterized model; the parameters for the model are determined from the data. Mixture models assume that the data is a "mixture" of a number of statistical distributions.

79 Clustering requirements The fundamental requirement for clustering is the availability of a function to determine the similarity or distance between objects in the database. The user must be able to answer some of the following questions: 1. When should two objects belong to the same cluster? 2. What should the clusters look like (i.e., what types of objects should they contain)? 3. What are the object-related characteristics of good clusters?

80 Data characteristics & clustering Type of proximity or density measure Central to clustering. Depends on data and application. Data characteristics that affect proximity and/or density are Dimensionality Sparseness Attribute type Special relationships in the data For example, autocorrelation Distribution of the data Noise and outliers Often interfere with the operation of the clustering algorithm

81 1. K-means 2. Hierarchical clustering 3. Density-based clustering BASIC CLUSTERING ALGORITHMS

82 K-means clustering Partitional clustering approach. Number of clusters, K, must be specified. Each cluster is associated with a centroid (center point/object). Each point is assigned to the cluster with the closest centroid. The basic algorithm is very simple.

83 Example of K-means clustering (Figure: six snapshots of a K-means run, iterations 1 through 6, showing the points and centroids as the clustering converges.)

84 K-means clustering Details Initial centroids are often chosen randomly; the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. Closeness is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations. Often the stopping condition is changed to "until relatively few points change clusters". Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.

85 K-means clustering Objective Let o_1, ..., o_n be the set of objects to be clustered, k be the number of desired clusters, p be the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and c_i be the centroid of the i-th cluster. In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem: minimize_p f(p) = Σ_{i=1}^{n} ||o_i − c_{p_i}||_2^2. Function f() is the objective or clustering criterion function of K-means.

86 K-means clustering Objective Let o_1, ..., o_n be the set of objects to be clustered, k be the number of desired clusters, p be the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and r_i be a vector associated with the i-th cluster. In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem: minimize_{p, r_1, ..., r_k} g(p, r_1, ..., r_k) = Σ_{i=1}^{n} ||o_i − r_{p_i}||_2^2. Note that p and r_1, ..., r_k are the variables of the optimization problem that need to be estimated such that the value of g() is minimized.

87 K-means clustering Objective The solution to minimize_p f(p) = Σ_{i=1}^{n} ||o_i − c_{p_i}||_2^2 is the same as the solution to minimize_{p, r_1, ..., r_k} g(p, r_1, ..., r_k) = Σ_{i=1}^{n} ||o_i − r_{p_i}||_2^2, with ∀i, r_i = c_i.

88 K-means clustering Objective The solution to minimize_p f(p) = Σ_{i=1}^{n} ||o_i − c_{p_i}||_2^2 is the same as the solution to minimize_{p, r_1, ..., r_k} g(p, r_1, ..., r_k) = Σ_{i=1}^{n} ||o_i − r_{p_i}||_2^2, with ∀i, r_i = c_i. The r_i vectors can be thought of as representatives of the objects that are assigned to the i-th cluster; they represent a compressed view of the data.

89 K-means clustering Objective minimize_p Σ_{i=1}^{n} ||o_i − c_{p_i}||_2^2 and minimize_{p, r_1, ..., r_k} Σ_{i=1}^{n} ||o_i − r_{p_i}||_2^2 are non-convex optimization problems. The K-means clustering algorithm is a way of solving the optimization problem. It uses an iterative alternating least-squares optimization strategy: (a) optimize the cluster assignments p, given r_i for i = 1, ..., k; (b) optimize r_i for i = 1, ..., k, given the cluster assignments p. It guarantees convergence to a local minimum; however, due to the non-convexity of the problem, this may not be the global minimum. Run K-means multiple times with different initial centroids and return the solution that has the best objective value.
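A minimal NumPy sketch of this alternating strategy (Lloyd's algorithm); the initialization, stopping test, toy data, and names are my own choices, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate steps (a) and (b) until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # (a) assign each object to its closest representative
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # (b) recompute each representative as the centroid of its cluster
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    assign = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = ((X - centroids[assign]) ** 2).sum()      # value of the objective f(p)
    return assign, centroids, sse

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])
assign, centroids, sse = kmeans(X, k=3)
```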

90 Two different K-means clusterings (Figure: the same set of original points clustered two ways: an optimal clustering and a sub-optimal clustering.)

91 Limitations of K-means Def. problem: when the clustering solution that you get is not the best, natural, insightful, etc. K-means has problems when clusters are of differing Sizes Densities Non-globular shapes K-means has problems when the data contains outliers.

92 Limitations of K-means: Differing sizes Original Points K-means (3 Clusters)

93 Limitations of K-means: Differing density Original Points K-means (3 Clusters)

94 Limitations of K-means: Non-globular shapes Original Points K-means (2 Clusters)

95 Overcoming K-means Limitations Original Points K-means Clusters One solution is to use many clusters. Finds parts of clusters, and we may need to put them back together.

96 Overcoming K-means limitations Original Points K-means Clusters

97 Importance of choosing initial centroids (Figure: snapshots of a K-means run across iterations 1 through 6, showing how the clustering evolves from one choice of initial centroids.)

98 Importance of choosing initial centroids (Figure: snapshots of another K-means run, iterations 1 through 5, starting from a different choice of initial centroids and converging to a different final clustering.)

99 Solutions to initial centroids problem Multiple runs Helps, but probability is not on your side. Sample and use hierarchical clustering to determine initial centroids. Select more than k initial centroids and then select among these initial centroids. Select most widely separated. Generate a larger number of clusters and then perform a hierarchical clustering. Bisecting K-means Not as susceptible to initialization issues.
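For reference, scikit-learn's KMeans already combines two of these remedies: k-means++ seeding (widely separated initial centroids) and multiple restarts, keeping the run with the lowest SSE. The sketch below assumes X is an (n, d) array of points, and the parameter values are arbitrary.

```python
from sklearn.cluster import KMeans

# X: an (n, d) array of points, e.g. the toy data from the earlier K-means sketch.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)     # SSE of the best of the 10 differently-initialized runs
labels = km.labels_
```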

100 Outliers A principled way of dealing with outliers is to do so directly during the optimization process. Robust K-means algorithms, as part of the optimization process, in addition to determining the clustering solution also identify a set of outlier objects that are not clustered by the algorithm. The non-clustered objects are treated as a penalty component of the objective function (in supervised learning, such penalty components are often called regularizers), e.g.: minimize_p Σ_{i: p_i ≠ −1} ||o_i − c_{p_i}||_2^2 + λ Σ_{i: p_i = −1} q(i), where p_i = −1 indicates that the i-th object is left unclustered, λ is a user-specified parameter that controls the penalty associated with not clustering an object, and q(i) is a cost function associated with the i-th object. A simple such cost function is q() = 1.

101 K-Means and the Curse of dimensionality When dimensionality increases, data becomes increasingly sparse in the space that it occupies. Definitions of density and distance between points, which is critical for clustering and outlier detection, become less meaningful. Randomly generate 500 points. Compute difference between max and min distance between any pair of points.

102 Asymmetric attributes If we met a friend in the grocery store, would we ever say the following? "I see our purchases are very similar, since we didn't buy most of the same things."

103 Spherical K-means clustering Let d_1, ..., d_n be the unit-length vectors of the set of objects to be clustered, k be the number of desired clusters, p be the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and c_i be the centroid of the i-th cluster. The spherical K-means clustering algorithm solves the following optimization problem: maximize_p Σ_{i=1}^{n} cos(d_i, c_{p_i}).

104 Spherical K-means & Text In high-dimensional data, clusters exist in lower-dimensional sub-spaces.

105 HIERARCHICAL CLUSTERING

106 Hierarchical clustering Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.

107 Advantages of hierarchical clustering Do not have to assume any particular number of clusters. Any desired number of clusters can be obtained by cutting the dendrogram at the proper level. The clusterings may correspond to meaningful taxonomies. Examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, etc.).

108 Hierarchical clustering Two main ways of obtaining hierarchical clusterings: Agglomerative: Start with the points as individual clusters. At each step, merge the closest pair of clusters until only one cluster (or k clusters) left. Divisive: Start with one, all-inclusive cluster. At each step, split a cluster until each cluster contains a point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix. Merge or split one cluster at a time.

109 Agglomerative clustering algorithm More popular hierarchical clustering technique Basic algorithm is straightforward 1. Compute the proximity matrix. 2. Let each data point be a cluster. 3. Repeat: 4. Merge the two closest clusters. 5. Update the proximity matrix. 6. Until only a single cluster remains (or k clusters remain). Key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms
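A short SciPy sketch of this agglomerative procedure on toy data; the linkage method string selects how the proximity of two clusters is defined (see the following slides), and the data and parameter choices here are only placeholders.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.rand(20, 2)                    # toy data; any (n, d) array works
D = pdist(X)                                 # condensed pairwise proximity (distance) matrix
Z = linkage(D, method="average")             # also: "single", "complete", "ward", ...
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
# dendrogram(Z) draws the merge tree when a matplotlib backend is available
```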

110 Starting situation Start with clusters of individual points and a proximity matrix. (Figure: points p1, p2, ..., p12 and the full proximity matrix.)

111 Intermediate situation After some merging steps, we have some clusters. (Figure: clusters C1 through C5 and the corresponding proximity matrix.)

112 Intermediate situation We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. (Figure: clusters C1 through C5 and the proximity matrix before the merge.)

113 After merging How do we update the proximity matrix? (Figure: the merged cluster C2 U C5 and the proximity matrix with its rows and columns marked '?'.)

114 Defining inter-cluster proximity How do we define the proximity of two clusters? Minimum distance, maximum distance, average distance, distance between centroids, objective-driven selection, etc. (Figure: two clusters of points and the proximity matrix.)

115 Defining inter-cluster proximity Using minimum distance (single link): the proximity of two clusters is the distance between their two closest points.

116 Defining inter-cluster proximity Using maximum distance (complete link): the proximity of two clusters is the distance between their two farthest points.

117 Defining inter-cluster proximity Using average distance (group average): the proximity of two clusters is the average pairwise distance between points in the two clusters.

118 Defining inter-cluster proximity Using the distance between centroids.

119 Strength of minimum distance Original Points Six Clusters Can handle non-elliptical shapes.

120 Limitations of minimum distance Two Clusters Original Points Sensitive to noise and outliers. Three Clusters

121 Strength of maximum distance Original Points Two Clusters Less susceptible to noise and outliers.

122 Limitations of maximum distance Original Points Two Clusters Tends to break large clusters. Biased towards globular clusters.

123 Group average Compromise between single and complete link. Strengths: Less susceptible to noise and outliers. Limitations: Biased towards globular clusters.

124 Hierarchical clustering: Time and space requirements O(N^2) space, since it uses the proximity matrix (N is the number of points). O(N^3) time in many cases: there are N steps, and at each step the proximity matrix, which has O(N^2) entries, must be updated and searched. Complexity can be reduced to O(N^2 log N) time with some cleverness.

125 Hierarchical clustering: Problems and limitations Once a decision is made to combine two clusters, it cannot be undone. Objective function is optimized only locally. Different schemes have problems with one or more of the following: Sensitivity to noise and outliers. Difficulty handling different sized clusters and convex shapes. Breaking large clusters.

126 DENSITY-BASED CLUSTERING

127 DBSCAN DBSCAN is a density-based algorithm. The density is the number of points within a specified radius (Eps) A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is not a core point or a border point.

128 DBSCAN: core, border, and noise points

129 DBSCAN algorithm Algorithm DBSCAN(Data: D, Radius: Eps, Density: τ) begin Determine the core, border, and noise points of D at level (Eps, τ); Create a graph in which core points are connected if they are within Eps of one another; Determine the connected components of the graph; Assign each border point to the connected component with which it is best connected; Return the points in each connected component as a cluster; end
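For comparison, a minimal scikit-learn sketch of DBSCAN on toy data; Eps and MinPts map to the eps and min_samples parameters, and the values used here are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                       # toy data; any (n, d) array works
model = DBSCAN(eps=0.1, min_samples=4).fit(X)    # eps ~ Eps, min_samples ~ MinPts
labels = model.labels_                           # cluster ids; -1 marks noise points
core = np.zeros(len(X), dtype=bool)
core[model.core_sample_indices_] = True          # core points; clustered non-core points are border points
```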

130 DBSCAN: core, border and noise points Original Points Point types: core, border and noise Eps = 10, MinPts = 4

131 DBSCAN clustering Clusters

132 DBSCAN clustering These are also clusters. They are usually eliminated by putting a minimum cluster size threshold. Clusters

133 DBSCAN clustering Original Points Clusters Resistant to (some) noise. Can handle clusters of different shapes and sizes.

134 DBSCAN: How much noise?

135 When DBSCAN does not work well: varying densities; high-dimensional data. (Figure: original points and DBSCAN results with MinPts = 4 and Eps = 9.75, and with MinPts = 4 and Eps = 9.92.)

136 DBSCAN: Determining Eps and MinPts The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance, while noise points have their k-th nearest neighbor at a farther distance. So, plot the sorted distance of every point to its k-th nearest neighbor.
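A small sketch of this k-distance heuristic using scikit-learn's nearest-neighbor search; X is assumed to be an (n, d) array, and k plays the role of MinPts.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 4                                              # plays the role of MinPts
X = np.random.rand(300, 2)                         # toy data; any (n, d) array works
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1: each point is its own nearest neighbor
dists, _ = nn.kneighbors(X)
kth = np.sort(dists[:, -1])                        # each point's distance to its k-th neighbor, sorted
# Plot `kth`; a reasonable Eps is near the "knee" where the curve bends sharply upward.
```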

137 CLUSTER VALIDITY

138 Different aspects of cluster validation Determining the clustering tendency of a set of data: Is there a non-random structure in the data? Comparing the results of a cluster analysis to externally known results. Do the clusters contain objects of mostly a single class label? Evaluating how well the results of a cluster analysis fit the data without reference to external information. Look at various intra- and inter-cluster data-derived properties. Comparing the results of two different sets of cluster analyses to determine which is better. The evaluation can be done for the entire clustering solution or just for selected clusters.

139 Measures of cluster validity Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types. Internal Index (II): used to measure the goodness of a clustering structure without respect to external information. Sum of Squared Error (SSE) (or any other of the objective functions that we discussed). External Index (EI): used to measure the extent to which cluster labels match externally supplied class labels. Entropy, purity, F-score, etc. Relative Index (RI): used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropy.

140 II: Measuring cluster validity via correlation Two matrices: Proximity (distance) matrix of the data (e.g., pair-wise cosine similarity (Euclidean distance)). Ideal proximity matrix that is implied by the clustering solution. One row and one column for each data point. An entry is 1 if the associated pair of points belong to the same cluster. An entry is 0 if the associated pair of points belongs to different clusters. Compute the correlation between the two matrices. i.e., the correlation between the vectorized matrices. (make sure that the ordering of the data points is the same in both matrices) High (low) correlation indicates that points that belong to the same cluster are close to each other. Not a good measure for some density or contiguity based clusters.
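A minimal NumPy/SciPy sketch of this correlation computation; note that with a distance (rather than similarity) matrix, a good clustering yields a strongly negative correlation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def clustering_correlation(X, labels):
    """Correlation between the proximity (distance) matrix of X and the ideal
    cluster-membership matrix implied by `labels` (a 1-D integer array using
    the same point ordering as X)."""
    proximity = squareform(pdist(X))                            # pairwise distances
    ideal = (labels[:, None] == labels[None, :]).astype(float)  # 1 if same cluster, else 0
    iu = np.triu_indices(len(X), k=1)                           # each pair counted once
    return np.corrcoef(proximity[iu], ideal[iu])[0, 1]
# With distances, tight well-separated clusters give a strongly negative correlation;
# using a similarity matrix instead flips the sign.
```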

141 II: Measuring cluster validity via correlation Correlation of the ideal similarity and proximity matrices for the K-means clusterings of two data sets. (Figure: a well-separated data set and a more diffuse data set, each shown with the correlation value obtained for its clustering.)

142 II: Using similarity matrix for cluster validation Order the similarity matrix with respect to cluster labels and inspect visually. (Figure: a scatter plot of well-separated clusters and the corresponding reordered similarity matrix.)

143 Clusters found in random data (Figure: a set of random points and the clusterings imposed on them by DBSCAN, K-means, and complete link.)

144 II: Using similarity matrix for cluster validation Clusters in random data are not so crisp. (Figure: reordered similarity matrix for the DBSCAN clustering of the random points.)

145 II: Using similarity matrix for cluster validation Clusters in random data are not so crisp. (Figure: reordered similarity matrix for the K-means clustering of the random points.)

146 II: Using similarity matrix for cluster validation Clusters in random data are not so crisp. (Figure: reordered similarity matrix for the complete-link clustering of the random points.)

147 II: Using similarity matrix for cluster validation DBSCAN

148 II: Framework for cluster validity Need a framework to interpret any measure. For example, if our measure of evaluation has a value of 10, is that good, fair, or poor? Statistics provide a framework for cluster validity. The more atypical a clustering result is, the more likely it represents valid structure in the data. Can compare the values of an index that result from random data or clusterings to those of a clustering result. If the value of the index is unlikely, then the cluster results are valid. These approaches are more complicated and harder to understand. For comparing the results of two different sets of cluster analyses, a framework is less necessary. However, there is the question of whether the difference between two index values is significant.

149 II: Statistical framework for SSE Example: compare the SSE of the clustering of the data set below against the SSE of three clusters in random data. The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the same range of x and y values. (Figure: the data set and the histogram of SSE values.)

150 II: Statistical framework for correlation Correlation of the ideal similarity and proximity matrices for the K-means clusterings of two data sets. (Figure: the same two data sets as before, each shown with the correlation value obtained for its clustering.)

151 Final comment on cluster validity "The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." Algorithms for Clustering Data, Jain and Dubes

152 Classification (Supervised learning)

153 BASIC CONCEPTS

154 Classification: Definition We are given a collection of records (training set). Each record is characterized by a tuple (x, y), where x is a set of attributes and y is the class label. x: set of attributes, predictors, independent variables, inputs. y: class, response, dependent variable, or output. Task: learn a model that maps each set of attributes x into one of the predefined class labels y.

155 Examples of classification tasks Task Attribute set, x Class label, y Categorizing messages Features extracted from message header and content spam or non-spam Identifying tumor cells Features extracted from MRI scans malignant or benign cells Cataloging galaxies Features extracted from telescope images Elliptical, spiral, or irregular-shaped galaxies

156 10 10 Building and using a classification model Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Training Set Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K? 12 Yes Medium 80K? 13 Yes Large 110K? 14 No Small 95K? 15 No Large 67K? Test Set Induction Deduction Learning algorithm Learn Model Apply Model Model
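A minimal scikit-learn sketch of this induction/deduction loop on synthetic data; the classifier choice (a decision tree, introduced in the next slides) and all parameter values are placeholders, not part of the original slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=4, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # induction: learn the model
y_pred = model.predict(X_test)                                        # deduction: apply the model
print(accuracy_score(y_test, y_pred))
```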

157 Classification techniques Base classifiers Decision tree-based methods. Rule-based methods. Nearest-neighbor. Neural networks. Naïve Bayes and Bayesian belief networks. Support vector machines. and others. Ensemble classifiers Boosting, bagging, random forests, etc.

158 We will use this method to illustrate various concepts and issues associated with the classification task. DECISION TREES

159 Example of a decision tree Training data: ID Home Owner Marital Status Annual Income Defaulted Borrower 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes. Model: a decision tree whose splitting attributes are Home Owner at the root (Yes → NO), then MarSt (Married → NO; Single, Divorced → test Income), and Income (< 80K → NO, > 80K → YES).

160 Example of decision tree For the same training data, an alternative model puts MarSt at the root (Married → NO; Single, Divorced → test Home Owner: Yes → NO; No → test Income: < 80K → NO, > 80K → YES). There could be more than one tree that fits the same data!

161 10 10 Decision tree classification task Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Training Set Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K? 12 Yes Medium 80K? 13 Yes Large 110K? 14 No Small 95K? 15 No Large 67K? Test Set Induction Deduction Tree Induction algorithm Learn Model Apply Model Model Decision Tree

162 Apply model to test data Start from the root of the tree. Test record: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ? Following the tree (Home Owner = No → MarSt = Married → NO), assign Defaulted to No.

163 10 10 Decision tree classification task Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Training Set Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K? 12 Yes Medium 80K? 13 Yes Large 110K? 14 No Small 95K? 15 No Large 67K? Test Set Induction Deduction Tree Induction algorithm Learn Model Apply Model Model Decision Tree

164 Building the decision tree Tree induction: let D_t be the set of training records that reach a node t. General procedure: if D_t contains records that belong to the same class y_t, then t is a leaf node labeled as y_t. If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset. (Training data: the ten loan records with attributes Home Owner, Marital Status, Annual Income, and class Defaulted Borrower.)

165 Building the decision tree: Example Step (a): start with a single leaf predicting Defaulted = No, covering all ten records with class counts (7 No, 3 Yes).

166 Building the decision tree: Example Step (b): split on Home Owner. Home Owner = Yes → Defaulted = No (3,0); Home Owner = No → Defaulted = No (4,3).

167 Building the decision tree: Example Step (c): split the Home Owner = No branch on Marital Status. Married → Defaulted = No (3,0); Single, Divorced → Defaulted = Yes (1,3).

168 Building the decision tree: Example Step (d): split the Single/Divorced branch on Annual Income. Income < 80K → Defaulted = No (1,0); Income >= 80K → Defaulted = Yes (0,3).

169 Design issues of decision tree induction How should the training records be split? Method for specifying test condition. This depends on the attribute types. Method for selecting which attribute and split condition to choose. Need a measure for evaluating the goodness of a test condition. When should the splitting procedure stop? Stop splitting if all the records belong to the same class or have identical attribute values. Early termination.

170 Methods for expressing test conditions Depends on attribute types: Binary Nominal Ordinal Continuous Depends on number of ways to split: 2-way split Multi-way split

171 Test condition for nominal attributes Multi-way split: Use as many partitions as distinct values: Marital Status Single Divorced Married Binary split: Divide values into two subsets: Marital Status Marital Status Marital Status OR OR {Married} {Single, Divorced} {Single} {Married, Divorced} {Single, Married} {Divorced}

172 Test condition for ordinal attributes Multi-way split: Use as many partitions as distinct values. Shirt Size Small Medium Large Extra Large Binary split: Divides values into two subsets. Preserve order property among attribute values. {Small, Medium} Shirt Size Shirt Size {Large, Extra Large} {Small} Shirt Size {Medium, Large, Extra Large} This grouping violates order property. {Small, Large} {Medium, Extra Large}

173 Test condition for continuous attributes (i) Binary split: Annual Income > 80K? Yes / No. (ii) Multi-way split: Annual Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K.

174 How to determine the best split? Before splitting: 10 records of class 0 and 10 records of class 1. Candidate test conditions (Figure): split on Gender (child class distributions C0:6/C1:4 and C0:4/C1:6), on Car Type (Family C0:1/C1:3, Sports C0:8/C1:0, Luxury C0:1/C1:7), or on Customer ID (one nearly pure child per customer). Which test condition is the best?

175 How to determine the best split? Greedy approach: Nodes with purer class distribution are preferred. Need a measure of node purity/impurity: C0: 5 C1: 5 High degree of impurity C0: 9 C1: 1 Low degree of impurity

176 Measures of node impurity Gini index: Gini(t) = 1 − Σ_j [p(j | t)]^2. Entropy: Entropy(t) = − Σ_j p(j | t) log p(j | t). Misclassification error: Error(t) = 1 − max_i p(i | t). Here p(j | t) is the fraction of records at node t that belong to class j. (Figure: the three measures compared for a two-class problem.)

177 Finding the best split 1. Compute the impurity measure (P) before splitting. 2. Compute the impurity measure (M) after splitting: compute the impurity measure of each child node; M is the size-weighted impurity of the children. 3. Choose the attribute test condition that produces the highest gain, Gain = P − M, or equivalently, the lowest impurity measure after splitting (M).
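A small Python sketch of these impurity measures and of the gain computation from the class counts at a node; the example split at the end (a 10/10 parent) mirrors the kind of comparison shown on the earlier "best split" slide, but the numbers are illustrative only.

```python
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def error(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.max(p)

def gain(parent_counts, children_counts, impurity=gini):
    """Gain = P - M, where M is the size-weighted impurity of the child nodes."""
    n = np.sum(parent_counts)
    M = sum(np.sum(c) / n * impurity(c) for c in children_counts)
    return impurity(parent_counts) - M

# A parent node with 10 class-0 and 10 class-1 records, split two different ways:
print(gain([10, 10], [[6, 4], [4, 6]]))   # ~0.02: children barely purer than the parent
print(gain([10, 10], [[9, 1], [1, 9]]))   # ~0.32: much purer children, higher gain
```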

178 Decision tree based classification Advantages: Inexpensive to construct. Extremely fast at classifying unknown records. Easy to interpret for small-sized trees. Robust to noise (especially when methods to avoid overfitting are employed). Can easily handle redundant or irrelevant attributes (unless the attributes are interacting). Disadvantages: Space of possible decision trees is exponentially large. Greedy approaches are often unable to find the best tree. Does not take into account interactions between attributes. Each decision boundary involves only a single attribute.

179 OVERFITTING

180 Classification errors Training errors (apparent errors): Errors committed on the training set. Test errors: Errors committed on the test set. Generalization errors: Expected error of a model in a randomly selected subset of records from the same distribution.

181 Example data set Two class problem: + : 5400 instances 5000 instances generated from a Gaussian centered at (10,10) 400 noisy instances added o : 5400 instances Generated from a uniform distribution 10 % of the data used for training and 90% of the data used for testing

182 Increasing number of nodes in the decision tree

183 Decision tree with 4 nodes Decision tree Decision boundaries on training data

184 Decision tree with 50 nodes Decision Tree Decision boundaries on training data

185 Increasing number of nodes in decision trees Decision Tree with 4 nodes Which tree is better? Decision Tree with 50 nodes

186 Model overfitting Underfitting: when model is too simple, both training and test errors are large. Overfitting: when model is too complex, training error is small but test error is large.

187 Model overfitting Using twice the number of data instances: if the training data is under-representative, testing errors increase and training errors decrease as the number of nodes increases. Increasing the size of the training data reduces the difference between training and testing errors at a given number of nodes.

188 Reasons for model overfitting Presence of noise. Lack of representative samples. Multiple comparison procedure.

189 Effect of multiple comparison procedure Consider the task of predicting whether the stock market will rise or fall in the next 10 trading days. Random guessing: P(correct) = 0.5. Make 10 random guesses in a row: P(# correct ≥ 8) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = 0.0547. Day 1 Up, Day 2 Down, Day 3 Down, Day 4 Up, Day 5 Down, Day 6 Down, Day 7 Up, Day 8 Up, Day 9 Up, Day 10 Down.

190 Effect of multiple comparison procedure Approach: Get 50 analysts. Each analyst makes 10 random guesses. Choose the analyst that makes the most correct predictions. Probability that at least one analyst makes at least 8 correct predictions: P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 ≈ 0.94.

191 Effect of multiple comparison procedure Many algorithms employ the following greedy strategy: initial model M; alternative model M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree). Keep M' if the improvement Δ(M, M') > α. Often, γ is chosen from a set of alternative components as γ = best(γ_1, γ_2, ..., γ_k). If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting.

192 Effect of multiple comparison: Example Use additional 100 noisy variables generated from a uniform distribution along with X and Y as attributes. Use 30% of the data for training and 70% of the data for testing. Using only X and Y as attributes

193 Notes on overfitting Overfitting results in decision trees that are more complex than necessary. Training error does not provide a good estimate of how well the tree will perform on previously unseen records. Need ways for estimating generalization errors.

194 Handling overfitting in decision trees Pre-pruning (early stopping rule): stop the algorithm before it becomes a fully grown tree. Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same. More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using a χ2 test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain); stop if the estimated generalization error falls below a certain threshold.
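In scikit-learn terms, several of these pre-pruning conditions correspond to decision-tree hyperparameters; the sketch below is only an analogy, and the specific threshold values are arbitrary. The ccp_alpha line, shown for contrast, enables cost-complexity pruning, a form of the post-pruning discussed on the next slide.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: thresholds that stop tree growth early, analogous to the conditions above.
pre_pruned = DecisionTreeClassifier(
    max_depth=5,                  # cap the depth of the tree
    min_samples_split=20,         # do not split nodes with too few instances
    min_impurity_decrease=0.01,   # do not split if impurity barely improves
)

# Post-pruning: grow fully, then prune back by cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```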

195 Handling overfitting in decision trees Post-pruning: Grow decision tree to its entirety. Subtree replacement: Trim the nodes of the decision tree in a bottom-up fashion. If generalization error improves after trimming, replace sub-tree by a leaf node. Class label of leaf node is determined from majority class of instances in the sub-tree. Subtree raising: Replace subtree with most frequently used branch.

196 Examples of post-pruning
Decision Tree:
depth = 1 :
|  breadth > 7 : class 1
|  breadth <= 7 :
|  |  breadth <= 3 :
|  |  |  ImagePages > : class 0
|  |  |  ImagePages <= :
|  |  |  |  totalpages <= 6 : class 1
|  |  |  |  totalpages > 6 :
|  |  |  |  |  breadth <= 1 : class 1
|  |  |  |  |  breadth > 1 : class 0
|  |  width > 3 :
|  |  |  MultiIP = 0:
|  |  |  |  ImagePages <= : class 1
|  |  |  |  ImagePages > :
|  |  |  |  |  breadth <= 6 : class 0
|  |  |  |  |  breadth > 6 : class 1
|  |  |  MultiIP = 1:
|  |  |  |  TotalTime <= 361 : class 0
|  |  |  |  TotalTime > 361 : class 1
depth > 1 :
|  MultiAgent = 0:
|  |  depth > 2 : class 0
|  |  depth <= 2 :
|  |  |  MultiIP = 1: class 0
|  |  |  MultiIP = 0:
|  |  |  |  breadth <= 6 : class 0
|  |  |  |  breadth > 6 :
|  |  |  |  |  RepeatedAccess <= : class 0
|  |  |  |  |  RepeatedAccess > : class 1
|  MultiAgent = 1:
|  |  totalpages <= 81 : class 0
|  |  totalpages > 81 : class 1
Applying subtree raising and subtree replacement yields the simplified decision tree:
Simplified Decision Tree:
depth = 1 :
|  ImagePages <= : class 1
|  ImagePages > :
|  |  breadth <= 6 : class 0
|  |  breadth > 6 : class 1
depth > 1 :
|  MultiAgent = 0: class 0
|  MultiAgent = 1:
|  |  totalpages <= 81 : class 0
|  |  totalpages > 81 : class 1

197 ENSEMBLE METHODS

198 Ensemble methods Construct a set of classifiers from the training data. Predict class label of test records by combining the predictions made by multiple classifiers.

199 Why do ensemble methods work? Suppose there are 25 base classifiers, each with error rate ε = 0.35, and assume the errors made by the classifiers are uncorrelated. The ensemble (majority vote) makes a wrong prediction only if 13 or more of the 25 base classifiers are wrong: P(# wrong ≥ 13) = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06.
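
The ensemble error follows directly from the binomial distribution; a quick check in Python (assuming ε = 0.35 as reconstructed above):

```python
from math import comb

eps, n = 0.35, 25
# Majority vote is wrong when 13 or more of the 25 base classifiers err
p_ensemble_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                       for i in range(13, n + 1))
print(p_ensemble_wrong)   # ~0.06
```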

200 General approach (figure): Step 1: from the original training data D, create multiple data sets D1, D2, ..., Dt-1, Dt. Step 2: build multiple classifiers C1, C2, ..., Ct-1, Ct, one per data set. Step 3: combine the classifiers into C*.

201 Types of ensemble methods Manipulate the data distribution: resampling methods such as bagging and boosting. Manipulate the input features: feature subset selection; random forests randomly select feature subsets and build decision trees on them. Manipulate the class labels: randomly partition the classes into two subsets, treat them as +ve and -ve, and learn a binary classifier; repeat many times, and at classification time apply all binary classifiers and give credit to the constituent classes. Use different models, e.g., different ANN topologies.
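
As one concrete instance of manipulating the input features, a random forest sketch in scikit-learn (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree is grown on a bootstrap sample and, at every split,
# considers only a random subset of the features (max_features).
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=0).fit(X, y)
print(forest.score(X, y))
```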

202 Bagging Sampling with replacement (table: the original data and the bootstrap samples drawn in Rounds 1-3). Build a classifier on each bootstrap sample. Use majority voting for prediction: predict an unlabeled instance with all classifiers and return the most frequently predicted class.
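
A minimal sketch of bagging, bootstrap sampling plus majority voting over decision trees (the dataset and the number of rounds are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
n_rounds, n = 25, len(X)

# Build one classifier per bootstrap sample (sampling with replacement)
classifiers = []
for _ in range(n_rounds):
    idx = rng.integers(0, n, size=n)          # bootstrap indices
    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    classifiers.append(clf)

# Majority vote: predict with every classifier, return the most frequent class
votes = np.array([clf.predict(X) for clf in classifiers])   # shape (rounds, samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())
```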

203 Boosting An iterative procedure to adaptively change the distribution of training data by focusing more on previously misclassified records. Initially, all N records are assigned equal weights. Unlike bagging, weights may change at the end of each boosting round. The weights can be used to create a weighted-loss function or bias the selection of the sample.

204 Boosting Records that are wrongly classified will have their weights increased; records that are classified correctly will have their weights decreased (table: the records sampled in Rounds 1-3). Example 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds.
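
The slides describe boosting generically; as one concrete instance of the weight-update idea, here is a sketch of a single AdaBoost-style round (the exponential update and the α formula are specific to AdaBoost, and the toy labels are illustrative):

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One boosting round: upweight misclassified records, downweight the rest."""
    miss = (y_true != y_pred)
    err = np.sum(weights[miss]) / np.sum(weights)   # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)           # importance of this classifier
    new_w = weights * np.exp(np.where(miss, alpha, -alpha))
    return new_w / new_w.sum(), alpha               # renormalize the weights

w = np.full(5, 1 / 5)                               # all N records start equal
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, -1, 1, -1, -1])               # record 4 is misclassified
w, alpha = adaboost_round(w, y_true, y_pred)
print(w)   # the misclassified record now carries more weight
```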

205 ARTIFICIAL NEURAL NETWORKS

206 Consider the following (figure: a black box with inputs X1, X2, X3 and output Y, together with its truth table). Output Y is 1 if at least two of the three inputs are equal to 1.

207 Consider the following (figure: input nodes X1, X2, X3 feeding a summation node with weights 0.3 and threshold t = 0.4, producing output node Y). Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4), where sign(x) = +1 if x ≥ 0 and −1 if x < 0.
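
A quick check that this weighted threshold unit reproduces the "at least two of three inputs are 1" rule (a minimal sketch; the −1 output of sign(·) is treated as class 0 to match the truth table):

```python
from itertools import product

def unit(x1, x2, x3):
    s = 0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4
    return 1 if s >= 0 else 0          # sign(.) mapped to {0, 1}

for x1, x2, x3 in product([0, 1], repeat=3):
    assert unit(x1, x2, x3) == (1 if x1 + x2 + x3 >= 2 else 0)
```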

208 Perceptron The model is an assembly of inter-connected nodes and weighted links (figure: input nodes X1, X2, X3 with weights w1, w2, w3 feeding an output node). The output node sums its inputs, weighted by its links, and compares the sum against a threshold t: Y = sign(Σ_{i=1}^{d} wi Xi − t) = sign(Σ_{i=0}^{d} wi Xi), where w0 = −t and X0 = 1.

209 Perceptron A single-layer network: contains only input and output nodes. Activation function: f(w, x) = sign(⟨w, x⟩). Applying the model is straightforward: Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4), where sign(x) = +1 if x ≥ 0 and −1 if x < 0. For X1 = 1, X2 = 0, X3 = 1: y = sign(0.3 + 0 + 0.3 − 0.4) = sign(0.2) = +1.

210 Perceptron learning rule Initialize the weights (w0, w1, ..., wd). Repeat: for each training example (xi, yi), compute f(w, xi) and update the weights: w(k+1) = w(k) + λ (yi − f(w(k), xi)) xi. Until a stopping condition is met. The above is an example of a stochastic gradient descent optimization method.
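
A minimal NumPy sketch of the perceptron learning rule as reconstructed above, w ← w + λ(y − f(w, x))x, on an illustrative linearly separable toy set:

```python
import numpy as np

def sign(v):
    return np.where(v >= 0, 1, -1)

# Toy linearly separable data; the last column of X is a constant 1 for the bias
X = np.array([[2.0, 1.0, 1], [1.0, 3.0, 1], [-1.0, -2.0, 1], [-2.0, -1.0, 1]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])
lam = 0.1                                   # learning rate
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        pred = sign(w @ xi)
        if pred != yi:
            w = w + lam * (yi - pred) * xi  # update only on mistakes
            errors += 1
    if errors == 0:                         # stopping condition: no mistakes
        break

print(w, sign(X @ w))
```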

211 Perceptron learning rule Weight update formula: w(k+1) = w(k) + λ (yi − f(w(k), xi)) xi, where λ is the learning rate. Intuition: update the weights based on the error e = yi − f(w(k), xi). If y = f(x, w), e = 0: no update needed. If y > f(x, w), e = 2: the weights must be increased so that f(x, w) increases. If y < f(x, w), e = −2: the weights must be decreased so that f(x, w) decreases.

212 Perceptron learning rule Since f(w,x) is a linear combination of input variables, decision boundary is linear. For nonlinearly separable problems, perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly.

213 Nonlinearly separable data XOR data: y = x1 ⊕ x2 (truth table of x1, x2, and y).

214 Multilayer artificial neural networks (ANN) (figure: inputs x1-x5 feed an input layer, a hidden layer, and an output layer; each neuron i computes a weighted sum Si of its inputs I1, I2, I3 using weights wi1, wi2, wi3 and applies an activation function g(Si) with threshold t to produce output Oi). Training an ANN means learning the weights of the neurons.

215 Artificial neural networks Various types of neural network topologies: single-layered network (perceptron) versus multi-layered network; feed-forward versus recurrent network. Various types of activation functions f: Y = f(Σi wi Xi).

216 Artificial neural networks A multi-layer neural network can solve classification tasks involving nonlinear decision surfaces (figure: a network with input nodes n1, n2, hidden nodes n3, n4, and output node n5, with weights w31, w41, w32, w42, w53, w54, solving the XOR data).
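
A hand-built two-layer network for XOR, as a minimal sketch of why a hidden layer helps; the weights implement XOR as "OR but not AND" and are chosen by hand here rather than learned:

```python
def step(v):
    return 1 if v >= 0 else 0

def xor_net(x1, x2):
    h_or  = step(1.0 * x1 + 1.0 * x2 - 0.5)      # hidden unit: OR(x1, x2)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)      # hidden unit: AND(x1, x2)
    return step(1.0 * h_or - 1.0 * h_and - 0.5)  # output: OR and not AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # matches y = x1 XOR x2
```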

217 Design issues of ANN Number of nodes in the input layer: one input node per binary/continuous attribute; k or log2(k) nodes for each categorical attribute with k values. Number of nodes in the output layer: one output node for a binary class problem; k or log2(k) nodes for a k-class problem. Number of nodes in the hidden layer. Initial weights and biases.

218 Characteristics of ANN Multilayer ANNs are universal function approximators but can suffer from overfitting if the network is too large. Gradient descent may converge to a local minimum. Model building can be very time consuming, but applying the model can be very fast. ANNs can handle redundant attributes because the weights are learned automatically. They are sensitive to noise in the training data and have difficulty handling missing attributes.

219 Recent noteworthy developments in ANN Use in deep learning and unsupervised feature learning, which seek to automatically learn a good representation of the input from unlabeled data. Google Brain project: learned the concept of a cat by looking at unlabeled pictures from YouTube, using a network with one billion connections.

220 Purpose-built neural networks Convolutional neural networks: deep networks designed to extract successively more complicated features from 1D, 2D, and 3D signals (e.g., audio, images, video).

221 Purpose-built neural networks Networks specifically designed to model arbitrary-length sequences and non-local dependencies: recurrent neural networks, bi-directional recurrent neural networks, long short-term memory. Good for language modeling and various biological applications.

222 SUPPORT VECTOR MACHINES

223 Separating hyperplanes Find a linear hyperplane (decision boundary) that separates the data.

224 Separating hyperplanes B 1 One possible solution.

225 Separating hyperplanes B 2 Another possible solution.

226 Separating hyperplanes B 2 Other possible solutions.

227 Separating hyperplanes B 1 B 2 Which one is better? B1 or B2? How do we define better?

228 Support Vector Machines (SVM) (figure: two hyperplanes B1 and B2 with their margin boundaries b11, b12 and b21, b22). Find the hyperplane that maximizes the margin: B1 is better than B2.

229 Support vector machines The vector w is normal to the separating hyperplane. Let x and y be two points on the hyperplane. Then wxᵀ + b = 0 and wyᵀ + b = 0, so w(x − y)ᵀ = 0, which indicates that w is orthogonal to the vector x − y, which lies on the hyperplane. Classification is performed as follows: f(x) = +1 if wxᵀ + b ≥ 0, −1 if wxᵀ + b < 0. (Figure: hyperplane B1 with wxᵀ + b = 0 and margin boundaries b11 and b12.)

230 Model estimation The goal is to find the parameters w and b (i.e., the model's parameters) such that the hyperplane separates the classes and maximizes the margin. We know how to measure classification accuracy, but how do we measure the margin? Let (w, b) be the parameters of a hyperplane that lies in the middle between the two classes. We can rescale (w, b) so that f(x) = +1 if wxᵀ + b ≥ +1, −1 if wxᵀ + b ≤ −1. Let x and y be two points such that wxᵀ + b = +1 and wyᵀ + b = −1, that is, the positive and negative instances closest to the hyperplane, respectively. Then w(x − y)ᵀ = 2, so ‖w‖ ‖x − y‖ cos(w, x − y) = 2, so ‖w‖ (margin) = 2, and therefore margin = 2 / ‖w‖.

231 Support Vector Machines (figure: hyperplane B1 defined by wxᵀ + b = 0, its margin boundary b11 where wxᵀ + b = +1, the opposite boundary b12, and the normal vector w).

232 Model estimation The optimization problem is formulated as follows: maximize_{w,b} 2/‖w‖ subject to wxiᵀ + b ≥ +1 if xi is +ve and wxiᵀ + b ≤ −1 if xi is -ve. If yi is +1 or −1 when xi is +ve or -ve, respectively, then the above can be concisely written in a standard minimization form: minimize_{w,b} ‖w‖²/2 subject to yi(wxiᵀ + b) ≥ 1 for all xi. This is a constrained quadratic optimization problem, which is convex and can be solved efficiently using Lagrange multipliers by minimizing the following function: Lp = ‖w‖²/2 − Σi λi (yi(wxiᵀ + b) − 1), where the λi ≥ 0 are called the Lagrange multipliers.

233 Model estimation The dual Lagrangian is used for solving this problem; it can be shown to be L_D = Σi λi − (1/2) Σ_{i,j} λi λj yi yj xi xjᵀ. Since this is the dual of the primal optimization problem, the problem now becomes a maximization problem. At the optimal solution of the primal/dual problem we have w = Σi λi yi xi. Most of the λi are 0, and the non-zero λi are those that define the vector w. They correspond to the training examples for which wxiᵀ + b is exactly +1 or −1, i.e., the examples that lie on the margin. These training examples are called the support vectors. A test instance z is classified as +ve or -ve based on f(z) = sign(wzᵀ + b) = sign(Σi λi yi xi zᵀ + b).
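
A short scikit-learn sketch showing where the quantities above (w, b, the support vectors, and the non-zero multipliers) surface in practice; the toy data is illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data
X = np.array([[2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -0.5], [-1.5, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

svm = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C ~ hard margin
print(svm.support_vectors_)        # the x_i with non-zero multipliers
print(svm.dual_coef_)              # lambda_i * y_i for the support vectors
print(svm.coef_, svm.intercept_)   # w and b; f(z) = sign(w z^T + b)
```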

234 Example of Linear SVM (figure: a small two-dimensional data set with columns x1, x2, y, and λ; the support vectors are highlighted).

235 Support vector machines What if the problem is not linearly separable?

236 Non-separable case Non-linearly separable cases are handled by introducing a slack variable ξi for each training instance and solving the following optimization problem: minimize_{w,b,ξ} ‖w‖²/2 + C Σi ξi subject to wxiᵀ + b ≥ +1 − ξi if xi is +ve, wxiᵀ + b ≤ −1 + ξi if xi is -ve, and ξi ≥ 0. Alternatively, they can be handled by using a non-linear hyperplane, or by doing both.
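
The constant C above controls how heavily slack is penalized: small values tolerate margin violations, large values fight them. A minimal illustration on toy data that contains one deliberately mislabeled point (all values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [1.8, 1.8]])   # last point is noisy
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.01, 100.0):
    svm = SVC(kernel='linear', C=C).fit(X, y)
    print(C, svm.n_support_)   # small C allows slack; large C resists the noise
```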

237 Nonlinear support vector machines What if decision boundary is not linear?

238 Nonlinear support vector machines Transform the data into a higher-dimensional space. Decision boundary: Φ(x)wᵀ + b = 0.

239 Nonlinear SVMs Mapping from the original space to a different space can make things separable.

240 Learning non-linear SVMs The dual Lagrangian L_D = Σi λi − (1/2) Σ_{i,j} λi λj yi yj xi xjᵀ now becomes L_D = Σi λi − (1/2) Σ_{i,j} λi λj yi yj Φ(xi) Φ(xj)ᵀ. A test instance z is classified as +ve or -ve based on f(z) = sign(Σi λi yi Φ(xi) Φ(z)ᵀ + b). The matrix K such that K(xi, xj) = Φ(xi) Φ(xj)ᵀ is called the kernel matrix. Non-linear SVMs only require such a kernel matrix. One can derive interesting kernel matrices, corresponding to extremely high-dimensional mappings, by operating directly in the original space. This is called the kernel trick.

241 Kernel trick Examples (figure: example kernel functions; one of them corresponds to an infinite-dimensional polynomial expansion).

242 Example of nonlinear SVM (figure: decision boundary of an SVM with a polynomial kernel of degree 2).
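
A sketch of a nonlinear SVM via the kernel trick in scikit-learn, using a degree-2 polynomial kernel as on this slide (the two-circles dataset and all parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

poly_svm = SVC(kernel='poly', degree=2, coef0=1.0, C=1.0).fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma='scale', C=1.0).fit(X, y)
print(poly_svm.score(X, y), rbf_svm.score(X, y))   # training accuracy of each kernel
```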
