Association Rule Mining and Clustering


Association Rule Mining and Clustering

Lecture Outline:
- Classification vs. Association Rule Mining vs. Clustering
- Association Rule Mining
- Clustering
  - Types of Clusters
  - Clustering Algorithms
    - Hierarchical: agglomerative, divisive
    - Non-hierarchical: k-means

Reading: Chapters 3.4, 3.9, 4.5, 4.8 and 6.6 of Witten and Frank, 2nd ed.; Chapter 14 of Foundations of Statistical Natural Language Processing, C.D. Manning & H. Schütze, MIT Press, 1999.

Classification vs. Association Rule Mining vs. Clustering

So far we have primarily focused on classification:
- Given: a set of training examples, each represented as a pair: an attribute-value vector (the instance representation) and a designated target class
- Learn: how to predict the target class of an unseen instance
- Example: learn to distinguish edible from poisonous mushrooms, or to identify credit-worthy loan applicants
- This works well if we understand which attributes are likely to predict others and/or we have a clear-cut classification task in mind. In other cases, however, there may be no distinguished class attribute.

We may want to learn association rules capturing regularities underlying a dataset:
- Given: a set of training examples represented as attribute-value vectors
- Learn: if-then rules expressing significant associations between attributes
- Example: learn associations between items consumers buy at the supermarket

We may want to discover clusters in our data, either to understand the data or to train classifiers:
- Given: a set of training examples plus a similarity measure
- Learn: a set of clusters capturing significant groupings amongst instances
- Example: cluster documents returned by a search engine

Association Rule Mining

We could use the rule learning methods studied earlier:
- consider each possible attribute-value pair, and each possible combination of attribute-value pairs, as a potential consequent (RHS) of an if-then rule
- run a rule induction process to induce rules for each such consequent
- then prune the resulting association rules by:
  - coverage: the number of instances the rule correctly predicts (also called support); and
  - accuracy: the proportion of instances to which the rule applies that it correctly predicts (also called confidence)

However, given the combinatorics, such an approach is computationally infeasible...

Association Rule Mining (cont.)

Instead, assume we are only interested in rules with some minimum coverage:
- Look for combinations of attribute-value pairs with a pre-specified minimum coverage, called item sets, where an item is an attribute-value pair (terminology borrowed from market basket analysis, where associations are sought between the items customers buy).
- This is the approach followed by the Apriori association rule miner in Weka (Agrawal et al.).
- Sequentially generate all 1-item, 2-item, 3-item, ..., n-item sets that have minimum coverage. This can be done efficiently by observing that an n-item set can achieve minimum coverage only if all of the (n-1)-item sets which are its subsets also have minimum coverage (see the sketch below).
- Example: in the PlayTennis dataset the 3-item set {humidity=normal, windy=false, play=yes} has coverage 4 (i.e. these three attribute-value pairs are true of 4 instances).
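To make the levelwise item-set search concrete, here is a minimal Python sketch. It is an illustration, not Weka's Apriori implementation; the toy instances, the min_coverage threshold and the helper names are invented for the example.

from itertools import combinations

# Toy instances: each is a set of attribute=value "items" (made up for illustration).
instances = [
    frozenset({"outlook=sunny", "humidity=high", "play=no"}),
    frozenset({"outlook=overcast", "humidity=normal", "play=yes"}),
    frozenset({"outlook=rainy", "humidity=normal", "windy=false", "play=yes"}),
    frozenset({"outlook=sunny", "humidity=normal", "windy=false", "play=yes"}),
]

def coverage(item_set, instances):
    """Number of instances containing every item in item_set (a.k.a. support)."""
    return sum(1 for inst in instances if item_set <= inst)

def frequent_item_sets(instances, min_coverage):
    """Levelwise generation: an n-item set can reach minimum coverage only if
    every (n-1)-item subset of it does, so each level is built from the last."""
    items = sorted({item for inst in instances for item in inst})
    current = [frozenset({i}) for i in items
               if coverage(frozenset({i}), instances) >= min_coverage]
    frequent = list(current)
    n = 2
    while current:
        # Join pairs of frequent (n-1)-item sets to form n-item candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == n}
        prev = set(current)
        # Prune candidates with an infrequent (n-1)-subset, then check coverage.
        current = [c for c in candidates
                   if all(c - {i} in prev for i in c)
                   and coverage(c, instances) >= min_coverage]
        frequent.extend(current)
        n += 1
    return frequent

for s in frequent_item_sets(instances, min_coverage=2):
    print(sorted(s))

Run on the real PlayTennis data, this levelwise search would recover, among others, the 3-item set {humidity=normal, windy=false, play=yes} with coverage 4.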

Association Rule Mining (cont.)

Next, form rules by considering, for each minimum-coverage item set, all possible rules containing zero or more attribute-value pairs from the item set in the antecedent and one or more attribute-value pairs from the item set in the consequent.

From the 3-item set {humidity=normal, windy=false, play=yes} we generate 7 rules:

  Association Rule                                              Accuracy
  IF humidity=normal AND windy=false THEN play=yes                 4/4
  IF humidity=normal AND play=yes THEN windy=false                 4/6
  IF windy=false AND play=yes THEN humidity=normal                 4/6
  IF humidity=normal THEN windy=false AND play=yes                 4/7
  IF windy=false THEN humidity=normal AND play=yes                 4/8
  IF play=yes THEN humidity=normal AND windy=false                 4/9
  IF -- THEN humidity=normal AND windy=false AND play=yes          4/12

Keep only those rules that meet the pre-specified desired accuracy; in this example only the first rule is kept if an accuracy of 100% is specified.

For the PlayTennis dataset there are:
- 3 rules with coverage 4 and accuracy 100%
- 5 rules with coverage 3 and accuracy 100%
- 50 rules with coverage 2 and accuracy 100%
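The rule-generation step can be sketched in the same style: for one minimum-coverage item set, enumerate every antecedent/consequent split and keep the rules whose accuracy (confidence) meets a threshold. The function below is illustrative only and expects instances represented as sets of attribute=value items, as in the previous sketch.

from itertools import combinations

def rules_from_item_set(item_set, instances, min_accuracy=1.0):
    """Enumerate rules 'IF antecedent THEN consequent' from one item set and
    keep those whose accuracy (confidence) reaches min_accuracy."""
    item_set = frozenset(item_set)
    set_coverage = sum(1 for inst in instances if item_set <= inst)
    kept = []
    items = sorted(item_set)
    # The antecedent may contain 0..n-1 items; the consequent gets the rest.
    for r in range(len(items)):
        for antecedent in map(frozenset, combinations(items, r)):
            applies = sum(1 for inst in instances if antecedent <= inst)
            accuracy = set_coverage / applies if applies else 0.0
            if accuracy >= min_accuracy:
                kept.append((sorted(antecedent), sorted(item_set - antecedent), accuracy))
    return kept

Applied to {humidity=normal, windy=false, play=yes} over the full PlayTennis data, this enumerates the seven rules in the table above; with min_accuracy=1.0 only the first rule survives.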

Types of Clusters

Approaches to clustering can be characterised in various ways. One characterisation is by the type of clusters produced; clusters may be:
1. partitions of the instance space: each instance is assigned to exactly one cluster
2. overlapping subsets of the instance space: instances may belong to more than one cluster
3. probabilities of cluster membership associated with each instance: each instance has some probability of belonging to each cluster
4. hierarchical structures: any given cluster may consist of subclusters or instances or both

[Figure: (1) a partition of instances a-k; (2) overlapping clusters over the same instances; (4) a hierarchical clustering of the instances. (3) is the table of cluster-membership probabilities below.]

  Instance   Cluster 1   Cluster 2   Cluster 3
  a            0.4         0.1         0.5
  b            0.1         0.8         0.1
  c            0.3         0.3         0.4
  d            0.1         0.1         0.8
  e            0.4         0.2         0.4
  f            0.1         0.4         0.5
  g            0.7         0.2         0.1
  h            0.5         0.4         0.1

Clustering Algorithms: A Taxonomy

Hierarchical clustering:
- Agglomerative (bottom-up): start with individual instances and repeatedly group the most similar
- Divisive (top-down): start with all instances in a single cluster and divide into groups so as to maximise within-group similarity
- Mixed: start with individual instances and either add each to an existing cluster or form a new cluster, possibly merging or splitting existing clusters (CobWeb)

Non-hierarchical ("flat") clustering:
- Partitioning approaches: hypothesise k clusters, randomly pick cluster centres, and iteratively assign instances to centres and recompute centres until stable
- Probabilistic approaches: hypothesise k clusters, each with an associated (initially guessed) probability distribution of attribute values for instances in the cluster, then iteratively compute cluster probabilities for each instance and recompute cluster parameters until stability

Incremental vs. batch clustering: are clusters computed dynamically as instances become available (CobWeb), or statically on the presumption that the whole instance set is available?

Hierarchical Clustering: Agglomerative Clustering

Given: a set X = {x_1, ..., x_n} of instances and a similarity function sim: 2^X × 2^X → R

  for i := 1 to n do
      c_i := {x_i}
  C := {c_1, ..., c_n}
  j := n + 1
  while |C| > 1 do
      (c_n1, c_n2) := argmax_{(c_u, c_v) ∈ C × C, c_u ≠ c_v} sim(c_u, c_v)
      c_j := c_n1 ∪ c_n2
      C := (C \ {c_n1, c_n2}) ∪ {c_j}
      j := j + 1
  return C

(Manning & Schütze, p. 502)

- Start with a separate cluster for each instance
- Repeatedly determine the two most similar clusters and merge them together
- Terminate when a single cluster containing all instances has been formed
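A direct, naive transcription of this pseudocode into Python might look as follows. The instance representation (2-D points) and the single-link similarity built on negative Euclidean distance are assumptions made purely to give a runnable example.

from itertools import combinations
import math

def point_sim(x, y):
    """Assumed similarity between two points: negative Euclidean distance."""
    return -math.dist(x, y)

def single_link(cu, cv):
    """Similarity of two clusters = similarity of their two most similar members."""
    return max(point_sim(x, y) for x in cu for y in cv)

def agglomerative(points, sim=single_link, k=1):
    """Repeatedly merge the two most similar clusters until only k remain."""
    clusters = [[p] for p in points]                      # one cluster per instance
    while len(clusters) > k:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: sim(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

print(agglomerative([(1, 1), (1, 2), (5, 5), (6, 5)], k=2))

Stopping at k > 1 clusters rather than a single all-inclusive cluster is a convenience; recording each merge instead would yield the full hierarchy produced by the pseudocode.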

Hierarchical Clustering: Divisive Clustering

Given: a set X = {x_1, ..., x_n} of instances, a coherence function coh: 2^X → R, and a splitting function split: 2^X → 2^X × 2^X

  C := {X} (= {c_1})
  j := 1
  while ∃ c_i ∈ C s.t. |c_i| > 1 do
      c_u := argmin_{c_v ∈ C} coh(c_v)
      (c_{j+1}, c_{j+2}) := split(c_u)
      C := (C \ {c_u}) ∪ {c_{j+1}, c_{j+2}}
      j := j + 2
  return C

(Manning & Schütze, p. 502)

- Start with a single cluster containing all instances
- Repeatedly determine the least coherent cluster and split it into two subclusters
- Terminate when no cluster contains more than one instance
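A matching Python sketch of the divisive procedure, again purely illustrative: coherence is taken here to be the negative diameter of a cluster, and split() seeds two subclusters with the most distant pair of points. Both choices are assumptions; as a later slide notes, any coherence measure and any clustering algorithm could be used for the split.

from itertools import combinations
import math

def coherence(cluster):
    """Assumed coherence: negative diameter (tight clusters score higher)."""
    if len(cluster) < 2:
        return 0.0
    return -max(math.dist(x, y) for x, y in combinations(cluster, 2))

def split(cluster):
    """Assumed split: seed two subclusters with the most distant pair of points,
    then assign every point to the nearer seed."""
    a, b = max(combinations(cluster, 2), key=lambda pair: math.dist(*pair))
    left = [x for x in cluster if math.dist(x, a) <= math.dist(x, b)]
    right = [x for x in cluster if math.dist(x, a) > math.dist(x, b)]
    return left, right

def divisive(points):
    """Repeatedly split the least coherent multi-instance cluster."""
    clusters = [list(points)]
    while any(len(c) > 1 for c in clusters):
        worst = min((c for c in clusters if len(c) > 1), key=coherence)
        clusters.remove(worst)
        clusters.extend(split(worst))
    return clusters

print(divisive([(1, 1), (1, 2), (5, 5), (6, 5)]))

As in the pseudocode, this version runs until every cluster is a single instance; recording the splits along the way would give the hierarchy.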

Similarity Functions used in Clustering (1)

- Single link: similarity between two clusters = similarity of the two most similar members
- Complete link: similarity between two clusters = similarity of the two least similar members
- Group average: similarity between two clusters = average similarity between members

[Figure: two rows of points, a-d above e-h, with inter-point distances marked as d, 3/2 d and 2d; used on the following slides to compare the three criteria (Manning & Schütze, pp. 504-505).]
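These three criteria can be written down directly as functions of a point-level similarity. The point similarity used here (negative Euclidean distance) is an assumption for illustration; the group-average variant shown averages over all pairs in the merged cluster, matching the S(c) definition given a few slides below.

import math

def point_sim(x, y):
    """Assumed point-level similarity: negative Euclidean distance."""
    return -math.dist(x, y)

def single_link(cu, cv):
    """Similarity of the two most similar members."""
    return max(point_sim(x, y) for x in cu for y in cv)

def complete_link(cu, cv):
    """Similarity of the two least similar members."""
    return min(point_sim(x, y) for x in cu for y in cv)

def group_average(cu, cv):
    """Average similarity over all ordered pairs in the merged cluster."""
    c = list(cu) + list(cv)
    n = len(c)
    total = sum(point_sim(c[i], c[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

Any of these can be passed as the sim argument of the agglomerative sketch above; complete link and group average tend to produce the tighter clusters discussed on the next slides.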

Similarity Functions used in Clustering (2)

The best initial move is to merge a/b, c/d, e/f and g/h, since the similarities between these objects are greatest (assuming similarity is inversely related to distance).

[Figure as on the previous slide (Manning & Schütze, pp. 504-505).]

Similarity Functions used in Clustering (3)

Using single-link clustering, the clusters {a,b} and {c,d}, and likewise {e,f} and {g,h}, are merged next, since the pairs b/c and f/g are closer than any other pairs not in the same cluster (e.g. closer than b/f or c/g).

Single-link clustering results in elongated clusters (the "chaining effect") that are locally coherent, in that close objects are in the same cluster.

However, it may have poor global quality: a is closer to e than to d, yet a and d end up in the same cluster while a and e do not.

[Figure as on the previous slides (Manning & Schütze, pp. 504-505).]

Similarity Functions used in Clustering (4)

Complete-link clustering avoids this problem by focusing on global rather than local quality: the similarity of two clusters is the similarity of their two most dissimilar members.

In the example this results in tighter clusters than single link: the minimally similar pairs for the complete-link clusters (a/f or b/e) are closer than the minimally similar pair for the single-link clusters (a/d).

[Figure as on the previous slides (Manning & Schütze, pp. 504-505).]

Similarity Functions used in Clustering (5)

Unfortunately, complete-link clustering has time complexity O(n^3), whereas single-link clustering is O(n^2).

Group-average clustering is a compromise that is O(n^2) but avoids the elongated clusters of single-link clustering.

The average similarity between vectors in a cluster c is defined as:

  S(c) = (1 / (|c| (|c| - 1))) · Σ_{x ∈ c} Σ_{y ∈ c, y ≠ x} sim(x, y)

At each iteration the clustering algorithm picks the two clusters c_u and c_v that maximise S(c_u ∪ c_v).

To carry out group-average clustering efficiently, care must be taken to avoid recomputing average similarities from scratch after each merging step:
- this can be avoided by representing instances as length-normalised vectors in m-dimensional real-valued space and using the cosine similarity measure
- given this representation, the average similarity of a cluster can be computed in constant time from the average similarity of its two children (see Manning & Schütze for details, and the sketch below)
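The constant-time trick can be made concrete. For length-normalised vectors under cosine similarity, the pairwise sum inside S(c) collapses to a function of the cluster's sum vector s(c) = Σ_{x ∈ c} x, since |s(c)|^2 counts every ordered pair plus the |c| unit self-similarities. A small numpy check (the random vectors are made up for the example):

import numpy as np

def s_naive(vectors):
    """S(c): average cosine similarity over all ordered pairs x != y."""
    n = len(vectors)
    total = sum(float(vectors[i] @ vectors[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def s_from_sum(sum_vec, n):
    """Same quantity from the sum vector alone: (|s(c)|^2 - n) / (n(n-1)).
    Valid because each vector is length-normalised, so x . x = 1."""
    return (float(sum_vec @ sum_vec) - n) / (n * (n - 1))

rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 4))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)      # length-normalise rows

print(s_naive(vecs), s_from_sum(vecs.sum(axis=0), len(vecs)))   # agree up to rounding

Since the sum vector of a merged cluster is just the sum of its children's sum vectors, S(c_u ∪ c_v) can be evaluated in constant time per candidate merge, which is what keeps group-average clustering at O(n^2).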

Similarity Functions used in Clustering (6)

In top-down (divisive) hierarchical clustering, a measure of cluster coherence is needed, and an operation to split clusters must be defined.

The similarity measures already defined for bottom-up clustering can be used for these tasks. Coherence can be defined as:
- the smallest similarity in the minimum spanning tree for the cluster (the tree connecting all instances the sum of whose edge lengths is minimal), according to the single-link similarity measure
- the smallest similarity between any two instances in the cluster, according to the complete-link measure
- the average similarity between objects in the cluster, according to the group-average measure

Once the least coherent cluster is identified, it needs to be split. Splitting can itself be seen as a clustering task (find two subclusters of a given cluster), so any clustering algorithm can be used for this task.

Non-hierarchical Clustering

Non-hierarchical clustering algorithms typically start with a partition based on randomly selected seeds and then iteratively refine this partition by reallocating instances to the current best cluster. Contrast this with hierarchical algorithms, which typically require only one pass.

Termination occurs when, according to some measure of goodness, the clusters are no longer improving. Measures of goodness include: group-average similarity; mutual information between adjacent clusters; the likelihood of the data given the clustering model.

How many clusters?
- we may have some prior knowledge about the right number of clusters
- we can try various numbers of clusters n and see how the measures of cluster goodness compare, or whether there is a reduction in the rate of increase of goodness for some n
- we can use the Minimum Description Length principle to minimise the sum of the lengths of the encodings of the instances in terms of their distances from clusters, plus the encodings of the clusters themselves

Hierarchical clustering does not require the number of clusters to be determined; however, full hierarchical clusterings are rarely usable, and the tree must be cut at some point, which amounts to specifying a number of clusters.

K-means Clustering

Given: a set X = {x_1, ..., x_n} ⊆ R^m, a distance measure d: R^m × R^m → R, and a function for computing the mean μ: P(R^m) → R^m

  select k initial centres f_1, ..., f_k
  while stopping criterion is not true do
      for all clusters c_j do
          c_j := {x_i | ∀ f_l: d(x_i, f_j) ≤ d(x_i, f_l)}
      for all means f_j do
          f_j := μ(c_j)

(Manning & Schütze, p. 516)

- The algorithm picks k initial centres and forms clusters by allocating each instance to its nearest centre.
- The centre of each cluster is then recomputed as the centroid or mean of the cluster's members, μ = (1/|c_j|) Σ_{x ∈ c_j} x, and instances are once again allocated to their nearest centre.
- The algorithm iterates until stability or some measure of goodness is attained.
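The same algorithm in minimal pure Python (illustrative only; initialising the centres by sampling k instances, the squared-distance measure and the iteration cap are assumptions):

import random

def squared_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, iterations=100, seed=0):
    """Assign each instance to its nearest centre, recompute centres as cluster
    means, and repeat until the centres stop moving."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)                       # k initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in points:                                  # allocation step
            j = min(range(k), key=lambda j: squared_dist(x, centres[j]))
            clusters[j].append(x)
        new_centres = [tuple(sum(col) / len(c) for col in zip(*c)) if c else centres[j]
                       for j, c in enumerate(clusters)]
        if new_centres == centres:                        # stable: stop
            break
        centres = new_centres                             # recompute means
    return centres, clusters

centres, clusters = kmeans([(1, 1), (1, 2), (5, 5), (6, 5), (6, 6)], k=2)
print(centres, clusters)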

K-means Clustering: Movement of Cluster Centres

[Figure: animation of cluster centres moving across k-means iterations, from http://www.cs.umd.edu/~mount/projects/kmeans/images/centers.gif]

Probability-based Clustering and the EM Algorithm

In probability-based clustering an instance is not placed categorically in a single cluster, but rather is assigned a probability of belonging to every cluster.

The basis of statistical clustering is the finite mixture model: a mixture of k probability distributions representing k clusters, where each distribution gives the probability that an instance would have a certain set of attribute values if it were known to be a member of that cluster.

The clustering problem is to take a set of instances and a pre-specified number of clusters and work out each cluster's mean and variance and the population distribution between the clusters.

EM (Expectation-Maximisation) is an algorithm for doing this:
- As with k-means, start with a guess for the parameters governing the clusters.
- Use these parameters to calculate the cluster probabilities for each instance (expectation of the class values).
- Use these cluster probabilities for each instance to re-estimate the cluster parameters (maximisation of the likelihood of the distributions given the data).
- Terminate when some goodness measure is met, usually when the increase in the log likelihood that the data came from the finite mixture model is negligible between iterations.
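As a concrete illustration, here is a bare-bones EM for a one-dimensional mixture of k Gaussians, mirroring the steps above. The initialisation, the unit starting variances and the log-likelihood tolerance are assumptions made for the sketch, not part of the lecture.

import math
import random

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_mixture(data, k, iterations=200, tol=1e-6, seed=0):
    rng = random.Random(seed)
    n = len(data)
    means = rng.sample(data, k)               # initial guesses for the parameters
    variances = [1.0] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(iterations):
        # E-step: probability of each cluster for each instance (responsibilities).
        resp = []
        for x in data:
            dens = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj + 1e-9
        # Terminate when the log likelihood of the data barely improves.
        ll = sum(math.log(sum(w * gaussian(x, m, v)
                              for w, m, v in zip(weights, means, variances)))
                 for x in data)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, variances

print(em_mixture([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], k=2))

Each instance ends up with a probability of belonging to each cluster (the responsibilities) rather than a hard assignment as in k-means.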

Summary

While classification (learning to predict an instance's class given a set of attribute values) is central to machine learning, it is not the only task of interest or value.

When there is no clearly distinguished class attribute we may want to:
- learn association rules reflecting regularities underlying the data
- discover clusters in the data

Association rules can be learned by a procedure which:
- identifies sets of attribute-value pairs which occur together sufficiently often to be of interest
- proposes rules relating these attribute-value pairs whose accuracy over the data set is sufficiently high as to be useful

Clusters, which can be hard or soft, hierarchical or non-hierarchical, can be discovered using a variety of algorithms, including:
- for hierarchical clusters: agglomerative or divisive clustering
- for non-hierarchical clusters: k-means or EM