Association Rule Mining and Clustering

Lecture Outline:
- Classification vs. Association Rule Mining vs. Clustering
- Association Rule Mining
- Clustering
  - Types of Clusters
  - Clustering Algorithms
    - Hierarchical: agglomerative, divisive
    - Non-hierarchical: k-means

Reading:
- Chapters 3.4, 3.9, 4.5, 4.8, 6.6, Witten and Frank, 2nd ed.
- Chapter 14, Foundations of Statistical Natural Language Processing, C.D. Manning & H. Schütze, MIT Press, 1999
Classification vs. Association Rule Mining vs. Clustering

So far we have primarily focused on classification:
- Given: a set of training examples represented as pairs of attribute-value vectors (instance representations) + a designated target class
- Learn: how to predict the target class of an unseen instance
- Example: learn to distinguish edible/poisonous mushrooms, or credit-worthy loan applicants

This works well if we understand which attributes are likely to predict others and/or we have a clear-cut classification task in mind. However, in other cases there may be no distinguished class attribute.

We may want to learn association rules capturing regularities underlying a dataset:
- Given: a set of training examples represented as attribute-value vectors
- Learn: if-then rules expressing significant associations between attributes
- Example: learn associations between items consumers buy at the supermarket

We may want to discover clusters in our data, either to understand the data or to train classifiers:
- Given: a set of training examples + a similarity measure
- Learn: a set of clusters capturing significant groupings amongst instances
- Example: cluster documents returned by a search engine
Association Rule Mining

We could use the rule learning methods studied earlier:
- consider each possible attribute-value pair, and each possible combination of attribute-value pairs, as a potential consequent (RHS) of an if-then rule
- run a rule induction process to induce rules for each such consequent
- then prune the resulting association rules by:
  - coverage: the number of instances the rule correctly predicts (also called support); and
  - accuracy: the proportion of instances to which the rule applies that it correctly predicts (also called confidence)

However, given the combinatorics, such an approach is computationally infeasible...
Association Rule Mining (cont)

Instead, assume we are only interested in rules with some minimum coverage:
- Look for combinations of attribute-value pairs with a pre-specified minimum coverage, called item sets, where an item is an attribute-value pair (terminology borrowed from market basket analysis, where associations are sought between items customers buy)
- This approach is followed by the Apriori association rule miner in Weka (Agrawal et al.)
- Sequentially generate all 1-item, 2-item, 3-item, ..., n-item sets that have minimum coverage
- This can be done efficiently by observing that an n-item set can achieve minimum coverage only if all of the (n-1)-item sets which are subsets of the n-item set have minimum coverage (see the sketch below)
- Example: in the PlayTennis dataset the 3-item set {humidity=normal, windy=false, play=yes} has coverage 4 (i.e. these three attribute-value pairs are true of 4 instances).
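As a rough illustration (not Weka's Apriori implementation), the sketch below generates all item sets with minimum coverage. It assumes instances are represented as Python dicts mapping attributes to values; coverage() is a local helper defined here, not a library function.

# A rough sketch of Apriori-style generation of all item sets with minimum
# coverage.  Instances are dicts mapping attribute -> value; an item is an
# (attribute, value) pair.
from itertools import combinations

def coverage(item_set, instances):
    # number of instances of which every attribute-value pair in item_set is true
    return sum(all(inst.get(a) == v for a, v in item_set) for inst in instances)

def frequent_item_sets(instances, min_coverage):
    items = {(a, v) for inst in instances for a, v in inst.items()}
    current = [frozenset([it]) for it in items
               if coverage([it], instances) >= min_coverage]
    frequent = list(current)
    n = 2
    while current:
        # candidate n-item sets are unions of minimum-coverage (n-1)-item sets ...
        candidates = {a | b for a in current for b in current if len(a | b) == n}
        # ... kept only if every (n-1)-item subset also has minimum coverage
        # (the pruning property described above) and the candidate itself does
        frequent_prev = set(current)
        current = [c for c in candidates
                   if all(frozenset(s) in frequent_prev for s in combinations(c, n - 1))
                   and coverage(c, instances) >= min_coverage]
        frequent.extend(current)
        n += 1
    return frequent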
Association Rule Mining (cont)

Next, form rules by considering, for each minimum-coverage item set, all possible rules containing zero or more attribute-value pairs from the item set in the antecedent and one or more attribute-value pairs from the item set in the consequent.

From the 3-item set {humidity=normal, windy=false, play=yes} generate 7 rules:

  Association Rule                                           Accuracy
  IF humidity=normal AND windy=false THEN play=yes           4/4
  IF humidity=normal AND play=yes THEN windy=false           4/6
  IF windy=false AND play=yes THEN humidity=normal           4/6
  IF humidity=normal THEN windy=false AND play=yes            4/7
  IF windy=false THEN humidity=normal AND play=yes            4/8
  IF play=yes THEN humidity=normal AND windy=false            4/9
  IF -- THEN humidity=normal AND windy=false AND play=yes     4/12

Keep only those rules that meet a pre-specified desired accuracy, e.g. in this example only the first rule is kept if an accuracy of 100% is specified.

For the PlayTennis dataset there are:
- 3 rules with coverage 4 and accuracy 100%
- 5 rules with coverage 3 and accuracy 100%
- 50 rules with coverage 2 and accuracy 100%
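A companion sketch, under the same assumptions as above, of forming candidate rules from one minimum-coverage item set and keeping those that reach a minimum accuracy (confidence); it reuses the hypothetical coverage() helper from the previous sketch.

# Each non-empty subset of the item set is tried as the consequent; the
# remaining items (possibly none) form the antecedent.
from itertools import combinations

def rules_from_item_set(item_set, instances, min_accuracy):
    items = list(item_set)
    kept = []
    for r in range(1, len(items) + 1):
        for consequent in combinations(items, r):
            antecedent = [it for it in items if it not in consequent]
            # instances the rule applies to, and instances it predicts correctly
            applies = coverage(antecedent, instances) if antecedent else len(instances)
            correct = coverage(items, instances)
            if applies and correct / applies >= min_accuracy:
                kept.append((antecedent, list(consequent), correct, applies))
    return kept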
Types of Clusters

Approaches to clustering can be characterised in various ways. One characterisation is by the type of clusters produced; clusters may be:
1. partitions of the instance space: each instance is assigned to exactly one cluster
2. overlapping subsets of the instance space: instances may belong to more than one cluster
3. probabilities of cluster membership associated with each instance: each instance has some probability of belonging to each cluster
4. hierarchical structures: any given cluster may consist of subclusters or instances or both

[Figures (1)-(4): example clusterings of instances a-k illustrating each cluster type: a partition, overlapping clusters, a table of per-instance cluster membership probabilities, and a hierarchy]
Clustering Algorithms: A Taxonomy

Hierarchical clustering
- Agglomerative (bottom up): start with individual instances and group the most similar
- Divisive (top down): start with all instances in a single cluster and divide into groups so as to maximize within-group similarity
- Mixed: start with individual instances and either add to an existing cluster or form a new cluster, possibly merging or splitting instances in existing clusters (CobWeb)

Non-hierarchical ("flat") clustering
- Partitioning approaches: hypothesise k clusters, randomly pick cluster centres, and iteratively assign instances to centres and recompute centres until stable
- Probabilistic approaches: hypothesise k clusters, each with an associated (initially guessed) probability distribution of attribute values for instances in the cluster, then iteratively compute cluster probabilities for each instance and recompute cluster parameters until stability

Incremental vs batch clustering: are clusters computed dynamically as instances become available (CobWeb), or statically on the presumption that the whole instance set is available?
Hierarchical Clustering: Agglomerative Clustering

Given: a set X = {x_1, ..., x_n} of instances + a similarity function sim: 2^X × 2^X → R

  for i := 1 to n
      c_i := {x_i}
  end for
  C := {c_1, ..., c_n}
  j := n + 1
  while |C| > 1
      (c_n1, c_n2) := argmax_{(c_u, c_v) ∈ C × C, u ≠ v} sim(c_u, c_v)
      c_j := c_n1 ∪ c_n2
      C := (C \ {c_n1, c_n2}) ∪ {c_j}
      j := j + 1
  end while
  return C

(Manning & Schütze, p. 502)

- Start with a separate cluster for each instance
- Repeatedly determine the two most similar clusters and merge them together
- Terminate when a single cluster containing all instances has been formed
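A minimal runnable sketch of the agglomerative loop above. It assumes a precomputed n × n instance-similarity matrix and uses single link as the cluster-level similarity (one possible choice); it records the merge order (the bottom-up dendrogram) rather than returning the final single cluster.

from itertools import combinations

def agglomerative(sim, n):
    clusters = [(i,) for i in range(n)]          # one singleton cluster per instance
    history = []
    while len(clusters) > 1:
        # find the two most similar clusters (single link: best pair of members)
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: max(sim[i][j] for i in pair[0] for j in pair[1]))
        clusters.remove(a)
        clusters.remove(b)
        merged = a + b
        clusters.append(merged)
        history.append(merged)                   # record the merge order
    return history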
Hierarchical Clustering: Divisive Clustering

Given: a set X = {x_1, ..., x_n} of instances + a coherence function coh: 2^X → R + a splitting function split: 2^X → 2^X × 2^X

  C := {X}  (= {c_1})
  j := 1
  while ∃ c_i ∈ C s.t. |c_i| > 1
      c_u := argmin_{c_v ∈ C} coh(c_v)
      (c_{j+1}, c_{j+2}) := split(c_u)
      C := (C \ {c_u}) ∪ {c_{j+1}, c_{j+2}}
      j := j + 2
  end while
  return C

(Manning & Schütze, p. 502)

- Start with a single cluster containing all instances
- Repeatedly determine the least coherent cluster and split it into two subclusters
- Terminate when no cluster contains more than one instance
Similarity Functions used in Clustering (1)

- Single link: similarity between two clusters = similarity of the two most similar members
- Complete link: similarity between two clusters = similarity of the two least similar members
- Group average: similarity between two clusters = average similarity between members

A sketch of these three functions is given below.

[Figure: eight points a, b, c, d (upper row) and e, f, g, h (lower row) laid out on a 2-D grid with horizontal spacings d, 3/2 d and 2d; this is the running example on the following slides (Manning & Schütze, pp. 504-505)]
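The three cluster-similarity functions as a small sketch, again assuming a pairwise instance-similarity matrix sim, with clusters given as collections of instance indices.

def single_link(sim, c1, c2):
    # similarity of the two most similar members
    return max(sim[i][j] for i in c1 for j in c2)

def complete_link(sim, c1, c2):
    # similarity of the two least similar members
    return min(sim[i][j] for i in c1 for j in c2)

def group_average(sim, c1, c2):
    # average similarity between members of the two clusters
    return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))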
Similarity Functions used in Clustering (2)

The best initial move is to merge a/b, c/d, e/f and g/h, since the similarities between these objects are greatest (assume similarity is reciprocally related to distance).

[Figure: the eight-point grid from the previous slide (Manning & Schütze, pp. 504-505)]
Similarity Functions used in Clustering (3)

Using single link clustering, the clusters {a,b} and {c,d}, and also {e,f} and {g,h}, are merged next, since the pairs b/c and f/g are closer than other pairs not in the same cluster (e.g. than b/f or c/g).

Single link clustering results in elongated clusters (the "chaining effect") that are locally coherent, in that close objects are in the same cluster.

However, it may have poor global quality: a is closer to e than to d, but a and d are in the same cluster while a and e are not.

[Figure: the eight-point grid with the elongated single link clusters {a,b,c,d} and {e,f,g,h} marked (Manning & Schütze, pp. 504-505)]
Similarity Functions used in Clustering (4)

Complete link clustering avoids this problem by focusing on global rather than local quality: the similarity of two clusters is the similarity of their two most dissimilar members.

This results in tighter clusters in the example than single link similarity: the minimally similar pairs for the complete link clusters (a/f or b/e) are closer than the minimally similar pair for the single link clusters (a/d).

[Figure: the eight-point grid with the tighter complete link clusters marked (Manning & Schütze, pp. 504-505)]
Similarity Functions used in Clustering (5)

Unfortunately complete link clustering has time complexity O(n^3), whereas single link clustering is O(n^2).

Group average clustering is a compromise that is O(n^2) but avoids the elongated clusters of single link clustering.

The average similarity between vectors in a cluster c is defined as:

  S(c) = \frac{1}{|c|(|c|-1)} \sum_{\vec{x} \in c} \sum_{\vec{y} \in c, \vec{y} \neq \vec{x}} \mathrm{sim}(\vec{x}, \vec{y})

At each iteration the clustering algorithm picks the two clusters c_u and c_v that maximize S(c_u ∪ c_v).

To carry out group average clustering efficiently, care must be taken to avoid recomputing average similarities from scratch after each merging step. This can be avoided by representing instances as length-normalised vectors in m-dimensional real-valued space and using the cosine similarity measure; given this approach, the average similarity of a cluster can be computed in constant time from the average similarity of its two children (see Manning and Schütze for details, and the sketch below).
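A small sketch of the constant-time bookkeeping alluded to above, under the assumption that instances are length-normalised vectors compared with cosine similarity: the sum of all pairwise similarities within a cluster c then equals s(c)·s(c) − |c|, where s(c) is the sum of the member vectors, so the average similarity needs only the cluster's sum vector and size, both of which follow directly from the two children when clusters are merged.

import numpy as np

def average_similarity(sum_vec, size):
    # S(c) = (s(c).s(c) - |c|) / (|c| * (|c| - 1)), assuming size > 1
    return (np.dot(sum_vec, sum_vec) - size) / (size * (size - 1))

def merge(sum_u, size_u, sum_v, size_v):
    # sum vector and size of the merged cluster, without revisiting instances
    return sum_u + sum_v, size_u + size_v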
Similarity Functions used in Clustering (6)

In top down (divisive) hierarchical clustering a measure is needed for cluster coherence, and an operation to split clusters must be defined.

The similarity measures already defined for bottom up clustering can be used for these tasks. Coherence can be defined as:
- the smallest similarity in the minimum spanning tree for the cluster (the tree connecting all instances the sum of whose edge lengths is minimal), according to the single link similarity measure
- the smallest similarity between any two instances in the cluster, according to the complete link measure
- the average similarity between objects in the cluster, according to the group average measure

Once the least coherent cluster is identified it needs to be split. Splitting can be seen as a clustering task: find two subclusters of a given cluster. Any clustering algorithm can be used for this task.
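For instance, the group-average coherence of a single cluster could be sketched as follows (same similarity-matrix assumption as before); a divisive algorithm would pick the cluster minimising this value and pass it to whatever split operation is used.

def coherence(sim, c):
    # average pairwise similarity of the cluster's members; assumes |c| > 1,
    # which is the only case a divisive algorithm needs to split
    pairs = [(i, j) for i in c for j in c if i != j]
    return sum(sim[i][j] for i, j in pairs) / len(pairs)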
Non-hierarchical Clustering

Non-hierarchical clustering algorithms typically start with a partition based on randomly selected seeds and then iteratively refine this partition by reallocating instances to their current best cluster. Contrast this with hierarchical algorithms, which typically require only one pass.

Termination occurs when, according to some measure of goodness, the clusters are no longer improving. Measures of goodness include: group average similarity; mutual information between adjacent clusters; likelihood of the data given the clustering model.

How many clusters?
- we may have some prior knowledge about the right number of clusters
- we can try various cluster numbers n and see how measures of cluster goodness compare, or whether there is a reduction in the rate of increase of goodness for some n
- we can use Minimum Description Length to minimize the sum of the lengths of the encodings of instances in terms of distance from clusters + the encodings of the clusters

Hierarchical clustering does not require the number of clusters to be determined; however, full hierarchical clusterings are rarely usable and the tree must be cut at some point, in effect specifying a number of clusters.
K-means Clustering

Given: a set X = {x_1, ..., x_n} ⊆ R^m + a distance measure d: R^m × R^m → R + a function for computing the mean µ: P(R^m) → R^m

  select k initial centres f_1, ..., f_k
  while stopping criterion is not true
      for all clusters c_j
          c_j := {x_i | ∀ f_l: d(x_i, f_j) ≤ d(x_i, f_l)}
      end for
      for all means f_j
          f_j := µ(c_j)
      end for
  end while

(Manning & Schütze, p. 516)

The algorithm picks k initial centres and forms clusters by allocating each instance to its nearest centre.

Centres for each cluster are recomputed as the centroid or mean of the cluster's members:

  µ(c_j) = \frac{1}{|c_j|} \sum_{\vec{x} \in c_j} \vec{x}

and instances are once again allocated to their nearest centre.

The algorithm iterates until stability or some measure of goodness is attained.
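A minimal NumPy sketch of the loop above, using Euclidean distance and taking the first k instances as initial centres (an assumption; random seeding is more usual), stopping once the centres no longer move.

import numpy as np

def kmeans(X, k, max_iter=100):
    # X: (n, m) array of instances
    centres = X[:k].astype(float).copy()
    for _ in range(max_iter):
        # allocate each instance to its nearest centre (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centre as the mean of its cluster's members
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):    # stable: stop
            break
        centres = new_centres
    return centres, labels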
K-means Clustering: Movement of Cluster Centres

[Figure: successive positions of the cluster centres over k-means iterations; http://www.cs.umd.edu/~mount/projects/kmeans/images/centers.gif]
Probability-based Clustering and The EM Algorithm

In probability-based clustering an instance is not placed categorically in a single cluster, but rather is assigned a probability of belonging to every cluster.

The basis of statistical clustering is the finite mixture model:
- a mixture of k probability distributions representing k clusters
- each distribution gives the probability that an instance would have a certain set of attribute values if it were known to be a member of that cluster

The clustering problem is to take a set of instances and a pre-specified number of clusters and work out each cluster's mean and variance and the population distribution between the clusters.

EM (Expectation-Maximisation) is an algorithm for doing this:
- Like k-means, start with a guess for the parameters governing the clusters
- Use these parameters to calculate cluster probabilities for each instance (expectation of the class values)
- Use these cluster probabilities for each instance to re-estimate the cluster parameters (maximisation of the likelihood of the distributions given the data)
- Terminate when some goodness measure is met, usually when the increase in the log likelihood that the data came from the finite mixture model is negligible between iterations
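As an illustration only, a sketch of EM for a one-dimensional mixture of k Gaussians (a simple instance of the finite mixture model above). Here the loop runs a fixed number of iterations rather than monitoring the change in log likelihood, and the initial parameter guesses are deliberately crude.

import numpy as np

def em_mixture(x, k, n_iter=100):
    # x: 1-D float array of instances
    n = len(x)
    means = x[np.random.choice(n, k, replace=False)].astype(float)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)            # population distribution over clusters
    for _ in range(n_iter):
        # E-step: probability of each cluster for each instance
        dens = np.array([weights[j] / np.sqrt(2 * np.pi * variances[j]) *
                         np.exp(-(x - means[j]) ** 2 / (2 * variances[j]))
                         for j in range(k)])
        resp = dens / dens.sum(axis=0)       # cluster probabilities per instance
        # M-step: re-estimate each cluster's mean, variance and weight
        for j in range(k):
            total = resp[j].sum()
            means[j] = (resp[j] * x).sum() / total
            variances[j] = (resp[j] * (x - means[j]) ** 2).sum() / total
            weights[j] = total / n
    return means, variances, weights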
Summary

While classification (learning to predict an instance's class given a set of attribute values) is central to machine learning, it is not the only task of interest or value.

When there is no clearly distinguished class attribute we may want to:
- learn association rules reflecting regularities underlying the data
- discover clusters in the data

Association rules can be learned by a procedure which:
- identifies sets of attribute-value pairs which occur together sufficiently often to be of interest
- proposes rules relating these attribute-value pairs whose accuracy over the data set is sufficiently high as to be useful

Clusters, which can be hard or soft, hierarchical or non-hierarchical, can be discovered using a variety of algorithms, including:
- for hierarchical clusters: agglomerative or divisive clustering
- for non-hierarchical clusters: k-means or EM