Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-2017

Outline: Background; Defining proximity; Clustering methods; Determining the number of clusters; Comparing two solutions.

Cluster analysis as unsupervised learning. Unsupervised learning (as opposed to supervised learning) seeks to identify structure in the usually high-dimensional data X without the help of a known label Y. Cluster analysis focuses on an informative partition of the observations into groups. Cluster analysis may lead to a simpler and sometimes more meaningful representation of the data.

What is a cluster? Objects within each cluster are closer to one another than to objects in different clusters. Internal cohesion: homogeneity. External isolation: separation.

Similarity, dissimilarity and distance. Cluster analysis starts with defining a quantitative measure of proximity between each pair of objects (observations) x_i and x_j. It can be a similarity measure s_ij (the larger, the closer) or a dissimilarity measure δ_ij (the smaller, the closer). Proximity measures can be given directly or calculated from the data. When a dissimilarity measure fulfills the metric (triangle) inequality δ_ij + δ_im ≥ δ_jm for all (i, j, m) and δ_ii = 0, it is a distance measure d_ij. Not all metric distances are Euclidean.

Proximity measures for continuous data. Dissimilarity measures between continuous data points are usually general distance measures or correlation-based measures. Variables can be weighted by nonnegative weights w_k. Dissimilarities derived from the Pearson correlation and the angular separation can also be metric. (Cluster Analysis, chapter 3)
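
As a concrete illustration (not from the slides), a small sketch of how such proximity measures might be computed in Python with SciPy; the data matrix X, the weight vector w, and the particular metrics are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # 5 observations, 3 variables
w = np.array([1.0, 2.0, 0.5])              # nonnegative variable weights w_k

d_euclid = squareform(pdist(X, metric="euclidean"))
d_manhattan = squareform(pdist(X, metric="cityblock"))
d_weighted = squareform(pdist(X, metric="minkowski", p=2, w=w))   # weighted Euclidean
d_corr = squareform(pdist(X, metric="correlation"))               # 1 - Pearson correlation
d_cos = squareform(pdist(X, metric="cosine"))                     # 1 - cosine (angular separation)

print(np.round(d_euclid, 2))
print(np.round(d_corr, 2))
```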

Proximity measures for binary data (see Cluster Analysis, chapter 3).

Proximity measures for mixed-type data. Gower's general similarity measure: s_{ij} = \sum_{k=1}^{p} w_{ijk} s_{ijk} / \sum_{k=1}^{p} w_{ijk}. For categorical variables, s_{ijk} = 1 if the two objects have the same value for variable k, otherwise s_{ijk} = 0. For continuous variables, s_{ijk} = 1 - |x_{ik} - x_{jk}| / R_k, where R_k is the range of the kth variable. The weights w_{ijk} are very flexible and can deal with missing values or other specific requirements.
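
A minimal sketch of Gower's coefficient on a toy two-variable example (a continuous age and a categorical smoker indicator); the function name and data are hypothetical, and a real analysis would typically use an existing implementation such as daisy in the R cluster package.

```python
import numpy as np

def gower_similarity(x_i, x_j, is_categorical, ranges, weights=None):
    """s_ij = sum_k w_ijk * s_ijk / sum_k w_ijk, with missing values given zero weight."""
    p = len(x_i)
    w = np.ones(p) if weights is None else np.asarray(weights, dtype=float)
    num, den = 0.0, 0.0
    for k in range(p):
        if x_i[k] is None or x_j[k] is None:                    # missing value -> w_ijk = 0
            continue
        if is_categorical[k]:
            s_k = 1.0 if x_i[k] == x_j[k] else 0.0              # match / mismatch
        else:
            s_k = 1.0 - abs(x_i[k] - x_j[k]) / ranges[k]        # s_ijk = 1 - |x_ik - x_jk| / R_k
        num += w[k] * s_k
        den += w[k]
    return num / den if den > 0 else np.nan

# Two hypothetical objects: (age, smoker)
a = [34.0, "yes"]
b = [51.0, "no"]
print(gower_similarity(a, b, is_categorical=[False, True], ranges=[40.0, None]))
```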

Clustering algorithms: hierarchical clustering, optimization clustering, model-based clustering.

Hierarchical clustering partitions the data in a series of steps. Agglomerative and divisive methods. Produces a tree-shaped dendrogram.

Agglomerative hierarchical clustering dendrogram (figure from ITSL, chapter 10).

Distance between clusters: linkage. Example with clusters {1, 2} and {3, 4, 5}: Single linkage (nearest neighbor): d_23. Complete linkage (furthest neighbor): d_15. Average linkage (mean): (d_13 + d_14 + d_15 + d_23 + d_24 + d_25) / 6. Centroid linkage: calculated from the original data X.
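
A hedged sketch of agglomerative clustering with SciPy, showing how the linkage choice changes the merge structure; the two-blob data are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(10, 2)),
               rng.normal(3, 0.3, size=(10, 2))])
d = pdist(X)                                  # condensed Euclidean distance matrix

for method in ["single", "complete", "average", "centroid"]:
    # centroid linkage is defined in Euclidean space, so pass the raw observations
    Z = linkage(X if method == "centroid" else d, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels[:5], labels[-5:])
```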

Linkage choice affects clustering results

Centroid linkage may lead to inversions (a later merge can occur at a lower height than an earlier one, so the dendrogram is no longer monotone).

Summary of different linkage options (see Cluster Analysis, chapter 4).

Summary of hierarchical clustering. The analyst has to choose the proximity measure, the linkage method, and the number of clusters (k). Able to show hierarchical structure. Unsatisfactory groupings made in early steps cannot be undone later. May not perform well for large datasets.

Optimization clustering. For any partition of the n data points into g groups, one can define an index c(n, g) that measures the quality of the partition and can be optimized. Usually c(n, g) is related to within-group homogeneity or between-group separation. Produces the best group assignment, but no hierarchical structure.

Optimization clustering. In theory, one could find the best clustering solution by enumerating all possible assignments C and choosing the one with the smallest within-cluster scatter W(C). In reality, the number of possible assignments grows extremely rapidly with n, so this direct approach is not feasible and the best C has to be approximated by an iterative algorithm.

K-means clustering. If all p variables are quantitative and the dissimilarity is the squared Euclidean distance, d(x_i, x_j) = \sum_{m=1}^{p} (x_{im} - x_{jm})^2, then W(C) is the sum of squared distances between each member of a cluster and the mean vector of that cluster: W(C) = \frac{1}{2} \sum_{g=1}^{k} \sum_{C(i)=g} \sum_{C(j)=g} d(x_i, x_j) = \sum_{g=1}^{k} n_g \sum_{C(i)=g} \| x_i - \bar{x}_g \|^2.

K-means clustering. W(C) can be iteratively improved by the standard algorithm (randomly assign initial cluster labels, then alternate between computing the cluster means and reassigning each point to its closest mean; see the sketch below), but the algorithm may converge to a local minimum, and different random initial assignments may produce different results. (ITSL, chapter 10)
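
A minimal sketch, assuming squared Euclidean dissimilarity, of the iterative relocation algorithm described above; in practice one would use an established implementation such as sklearn.cluster.KMeans with several random restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))                 # 1. random initial assignment
    for _ in range(n_iter):
        # 2a. compute the mean of each cluster; re-seed a centroid if a cluster is empty
        centroids = np.array([X[labels == g].mean(axis=0) if np.any(labels == g)
                              else X[rng.integers(len(X))] for g in range(k)])
        # 2b. reassign every point to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):            # converged (possibly a local minimum)
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(np.round(centers, 2))
```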

K-medoids. In the k-means algorithm, the problem of minimizing the within-group scatter is transformed into minimizing the distance between each member and a cluster center (in this case the centroid). This approach can be extended to other dissimilarity measures and other choices of cluster centers. K-medoids is a robust extension of k-means in which a representative data point is chosen as the cluster center, so the optimization can be based directly on the dissimilarity matrix.
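
A rough sketch of a k-medoids style algorithm that works directly on a precomputed dissimilarity matrix D: it alternates between assigning objects to the nearest medoid and moving each medoid to the cluster member with the smallest total within-cluster dissimilarity. This is a simplified illustration rather than the full PAM algorithm; the R function cluster::pam or the scikit-learn-extra KMedoids class provide complete implementations.

```python
import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)        # initial medoids: random objects
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)             # assign each object to nearest medoid
        new_medoids = medoids.copy()
        for g in range(k):
            members = np.where(labels == g)[0]
            if len(members) == 0:
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[g] = members[costs.argmin()]      # most central member of the cluster
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)    # Manhattan dissimilarity matrix
labels, medoids = kmedoids(D, k=2)
print(medoids, labels)
```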

K-medoids algorithm (see ESL, chapter 14).

Summary of optimization clustering. Searches for the optimum of a clustering criterion, which depends on the dissimilarity measure and on within-cluster homogeneity or between-cluster separation. Methods like k-means and k-medoids are approximations, with no guarantee of finding the global optimum. Only produces cluster assignments. Initial values of the cluster centers affect the final results.

Model-based clustering. A parametric solution (probability density estimation). Assume that the data points are sampled from a finite mixture of probability density functions of the form f(x; p, \theta) = \sum_{g=1}^{k} p_g f_g(x; \theta_g). Parameters are estimated by maximum likelihood or Bayesian methods. Cluster assignment is based on the estimated posterior probability P(\text{cluster } g \mid x_i) = \hat{p}_g f_g(x_i; \hat{\theta}_g) / f(x_i; \hat{p}, \hat{\theta}).

Gaussian mixture models example
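
Since the example figure is not reproduced here, a small hedged sketch of fitting a Gaussian mixture with scikit-learn on synthetic two-component data; the posterior probabilities play the role of P(cluster g | x_i) above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
posterior = gmm.predict_proba(X)      # estimated P(cluster g | x_i)
labels = posterior.argmax(axis=1)     # assign each point to its most probable component
print(np.round(gmm.means_, 2))
```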

How to determine k? Sometimes the number of clusters k is known a priori. Otherwise there is no universal solution, but heuristics are available. Many indices may help evaluate the quality of clusterings obtained with different k (the NbClust package in R includes 30 of them).

Determining k for a finite mixture model is equivalent to a model selection problem. Akaike's information criterion (AIC) or the Bayesian information criterion (BIC) are usually used; both are penalized likelihoods: AIC = -2\,LL(x, \hat{\theta}) + 2p and BIC = -2\,LL(x, \hat{\theta}) + p \ln(n), where p is the number of free parameters and n the sample size.
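
A short sketch of using BIC to choose the number of mixture components with scikit-learn, whose bic method follows the -2 LL + p ln(n) convention so that lower is better; the synthetic data and the candidate range of k are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# Fit mixtures with k = 1..6 components and keep the BIC of each fit
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
print(bics, "best k:", min(bics, key=bics.get))
```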

Silhouette analysis. The silhouette measures the tightness and separation of a clustering solution. For object i in cluster A: a(i) = d(i, A), the average dissimilarity of i to the other members of its own cluster A; b(i) = \min_{C \ne A} d(i, C), the average dissimilarity to the nearest other cluster; and s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.
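
A brief sketch of silhouette analysis with scikit-learn: silhouette_score averages s(i) over all objects, and values close to 1 indicate tight, well-separated clusters; the data and candidate values of k are made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # average silhouette width for this k
```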

The Gap statistic. Normalize the within-cluster scatter criterion against a reference distribution: W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, j \in C_r} d_{ij} and \mathrm{Gap}_n(k) = E_n^*\{\log(W_k)\} - \log(W_k). The reference distribution is obtained by Monte Carlo sampling from a uniform distribution over a box aligned with the principal components of the data.
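
A simplified gap-statistic sketch: here the reference data are drawn uniformly over the range of each variable rather than over a PCA-aligned box, the dissimilarity is squared Euclidean (so the k-means inertia equals W_k), and the number of Monte Carlo replicates is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def log_wk(X, k, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)                 # log within-cluster scatter (squared Euclidean)

def gap_statistic(X, k, n_ref=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)      # bounding box of the data, variable by variable
    ref_logs = [log_wk(rng.uniform(lo, hi, size=X.shape), k, seed) for _ in range(n_ref)]
    return np.mean(ref_logs) - log_wk(X, k, seed)

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
for k in range(1, 5):
    print(k, round(gap_statistic(X, k), 3))    # larger gap suggests a better-supported k
```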

Comparing two clustering solutions. Comparing a dendrogram with the original proximities: cophenetic correlation. Comparing two dendrograms: Goodman and Kruskal's γ. Comparing two cluster assignments: adjusted Rand index.

Cophenetic correlation. The cophenetic distance between two objects is defined as the height of the dendrogram at which they are first merged into one cluster. The correlation coefficient between all cophenetic distances h_ij and the corresponding distance measures d_ij shows how well the dendrogram preserves the pairwise distances: c = \frac{\sum_{i<j} (d_{ij} - \bar{d})(h_{ij} - \bar{h})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \sum_{i<j} (h_{ij} - \bar{h})^2}}.
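
A short sketch of computing the cophenetic correlation with SciPy; cophenet returns the correlation between the original pairwise distances and the cophenetic distances read off the dendrogram. The data and the set of linkage methods compared are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 4))
d = pdist(X)                                   # original pairwise distances d_ij

for method in ["single", "complete", "average"]:
    c, _ = cophenet(linkage(d, method=method), d)   # cophenetic correlation for this linkage
    print(method, round(c, 3))
```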

Goodman and Kruskal's γ for dendrograms. A nonparametric correlation between the two rankings of dendrogram heights: \gamma = \frac{S_+ - S_-}{S_+ + S_-}, where S_+ and S_- are the counts of concordant and discordant pairs of pairwise cophenetic distances.

Adjusted Rand index. For two clustering solutions C1 and C2, a c1 × c2 contingency table can be defined, where n_ij is the number of objects in group i of C1 and group j of C2. The ARI measures the agreement between the two partitions, taking agreement by chance into account (bounded above by 1): ARI = (index - expected index) / (max index - expected index), where index = \sum_{ij} \binom{n_{ij}}{2}, expected index = \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} / \binom{n}{2}, and max index = \frac{1}{2}\left[\sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2}\right].
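
A quick sketch of comparing two cluster assignments with the adjusted Rand index in scikit-learn; the two label vectors are made up, and relabeling the clusters does not change the score.

```python
from sklearn.metrics import adjusted_rand_score

c1 = [0, 0, 0, 1, 1, 1, 2, 2, 2]
c2 = [2, 2, 2, 0, 0, 1, 1, 1, 1]   # same structure with permuted labels and one disagreement
print(adjusted_rand_score(c1, c2))
```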

General remarks. The proximity measure has to be defined with caution. The choice of clustering method depends on the proximity measure (e.g., k-means only for squared Euclidean distance). Determining the number of clusters k is a hard problem, but it can be guided by heuristics.

Recommended readings. An Introduction to Statistical Learning: With Applications in R. James G, Witten D, Hastie T, Tibshirani R. Springer, 2013. Chapter 10. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie T, Tibshirani R, Friedman J. Springer, 2011. Chapter 14. Cluster Analysis (5th ed.). Everitt BS, Landau S, Leese M, Stahl D. Wiley, 2011.