Foundations of Machine Learning CentraleSupélec Fall 2017 12. Clustering Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe-agathe.azencott@mines-paristech.fr

Learning objectives Explain what clustering algorithms can be used for. Explain and implement three different ways to evaluate clustering algorithms. Implement hierarchical clustering, discuss its various flavors. Implement k-means clustering, discuss its advantages and drawbacks. Sketch out a density-based clustering algorithm. 2

Goals of clustering Group objects that are similar into clusters: classes that are unknown beforehand. 3

Goals of clustering Group objects that are similar into clusters: classes that are unknown beforehand. E.g. group genes that are similarly affected by a disease group patients whose genes respond similarly to a disease group pixels in an image that belong to the same object (image segmentation). 5

Applications of clustering Understand general characteristics of the data Visualize the data Infer some properties of a data point based on how it relates to other data points E.g. find subtypes of diseases visualize protein families find categories among images find patterns in financial transactions detect communities in social networks 6

Distances and similarities 7

Distances & similarities Assess how close / far: data points are from each other; a data point is from a cluster; two clusters are from each other. Distance metric 8

Distances & similarities Assess how close / far: data points are from each other; a data point is from a cluster; two clusters are from each other. Distance metric: d(x, x') ≥ 0, with d(x, x') = 0 iff x = x'; symmetry: d(x, x') = d(x', x); triangle inequality: d(x, x'') ≤ d(x, x') + d(x', x''). E.g. Lq distances: d_q(x, x') = ( Σ_j |x_j − x'_j|^q )^(1/q). 9
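
As an illustration, the Lq (Minkowski) distance takes a few lines of numpy. This is a minimal sketch; the function name `lq_distance` is mine, not from the slides.

```python
import numpy as np

def lq_distance(x, y, q=2):
    """Lq (Minkowski) distance: (sum_j |x_j - y_j|^q)^(1/q)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

# q=2 recovers the Euclidean distance, q=1 the Manhattan distance.
```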

Distance & similarities How do we get similarities? 10

Distance & similarities Transform distances into similarities? Kernels define similarities: for a given mapping φ from the space of objects X to some Hilbert space H, the kernel between two objects x and x' is the inner product of their images in the feature space: k(x, x') = ⟨φ(x), φ(x')⟩_H. 11

Pearson's correlation Measure of the linear correlation between two variables: ρ(x, y) = Σ_j (x_j − x̄)(y_j − ȳ) / ( √(Σ_j (x_j − x̄)²) √(Σ_j (y_j − ȳ)²) ). If the features are centered:? 12

Pearson's correlation Measure of the linear correlation between two variables. If the features are centered (x̄ = ȳ = 0), it reduces to the normalized dot product, i.e. the cosine similarity: ρ(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖). 13
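
A small numpy check of this equivalence (a sketch; the function names are mine): Pearson's correlation of two vectors equals the cosine similarity of the same vectors once centered.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation = cosine similarity of the centered vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

def cosine(x, y):
    """Normalized dot product."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```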

Pearson vs. Euclidean distance Pearson's coefficient: profiles of similar shapes will be close to each other, even if they differ in magnitude. Euclidean distance: magnitude is taken into account. 14

Pearson vs. Euclidean distance 15

Evaluating clusters 16

Evaluating clusters Clustering is unsupervised. There is no ground truth. How do we evaluate the quality of a clustering algorithm? 17

Evaluating clusters Clustering is unsupervised. There is no ground truth. How do we evaluate the quality of a clustering algorithm? 1) Based on the shape of the clusters: Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters. Based on the stability of the clusters: We should get the same results if we remove some data points, add noise, etc. Based on domain knowledge: The clusters should make sense. 18

Centroids and medoids Centroid: mean of the points in the cluster. Medoid: most central actual point of the cluster, e.g. the point in the cluster that is closest to the centroid. 20
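
These two definitions translate directly into numpy (a sketch with my own function names; the medoid here follows the slide's closest-to-the-centroid definition):

```python
import numpy as np

def centroid(points):
    """Mean of the points in the cluster."""
    return points.mean(axis=0)

def medoid(points):
    """Point of the cluster closest to the centroid (slide's definition)."""
    c = centroid(points)
    return points[np.argmin(np.linalg.norm(points - c, axis=1))]
```

Unlike the centroid, the medoid is always one of the data points, which makes it meaningful even when averaging objects does not make sense.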

Cluster shape: Tightness The tightness (homogeneity) Tk of cluster k measures how close together its points are, e.g. the average distance of the points of cluster k to its centroid μk: Tk = (1/|Ck|) Σ_{x ∈ Ck} ‖x − μk‖. 21-22

Cluster shape: Separability The separability Skl measures how far apart clusters k and l are, e.g. the distance between their centroids: Skl = ‖μk − μl‖. 23-24

Cluster shape: Davies-Bouldin Cluster tightness (homogeneity) Tk and cluster separation Skl combine into the Davies-Bouldin index: DB = (1/K) Σ_k max_{l ≠ k} (Tk + Tl) / Skl. The lower the index, the tighter and the better separated the clusters. 25
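
Under one common choice of definitions (tightness Tk = mean distance of the cluster's points to its centroid, separation Skl = distance between centroids), the Davies-Bouldin index can be sketched as follows; the function name is mine, and scikit-learn offers a ready-made `davies_bouldin_score`.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: average worst-case (T_k + T_l) / S_kl.
    Lower is better (tight, well-separated clusters)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # tightness: mean distance of each cluster's points to its centroid
    T = np.array([np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
                  for i, k in enumerate(ks)])
    db = 0.0
    for i in range(len(ks)):
        # separation S_kl: distance between centroids
        ratios = [(T[i] + T[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)
    return db / len(ks)
```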

Cluster shape: Silhouette coefficient a(x): how well x fits in its cluster = average distance from x to the other points of its cluster. b(x): how well x would fit in another cluster = smallest average distance from x to the points of another cluster. s(x) = (b(x) − a(x)) / max(a(x), b(x)). If x is very close to the other points of its cluster: s(x) ≈ 1. If x is very close to the points in another cluster: s(x) ≈ −1. 26
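
The silhouette coefficient of a single point can be computed directly from these definitions (a sketch; the function name is mine, singleton clusters are not handled, and scikit-learn provides `silhouette_samples` for the general case):

```python
import numpy as np

def silhouette(X, labels, x_idx):
    """Silhouette s(x) = (b - a) / max(a, b) for the point X[x_idx]."""
    x, k = X[x_idx], labels[x_idx]
    same = (labels == k)
    same[x_idx] = False                 # exclude x itself
    d = np.linalg.norm(X - x, axis=1)   # distances from x to every point
    a = d[same].mean()                  # mean distance within own cluster
    b = min(d[labels == l].mean()       # nearest other cluster
            for l in np.unique(labels) if l != k)
    return (b - a) / max(a, b)
```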

Evaluating clusters Clustering is unsupervised. There is no ground truth. How do we evaluate the quality of a clustering algorithm? 1) Based on the shape of the clusters: Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters. 2) Based on the stability of the clusters: We should get the same results if we remove some data points, add noise, etc. Based on domain knowledge: The clusters should make sense. 27

Cluster stability How many clusters? 28

Cluster stability K=2 K=3 29

Evaluating clusters Clustering is unsupervised. There is no ground truth. How do we evaluate the quality of a clustering algorithm? 1) Based on the shape of the clusters: Points within the same cluster should be nearby/similar and points far from each other should belong to different clusters. 2) Based on the stability of the clusters: We should get the same results if we remove some data points, add noise, etc. 3) Based on domain knowledge: The clusters should make sense. 31

Domain knowledge Do the clusters match natural categories? Check with human expertise 32

Ontology enrichment analysis Ontology: entities may be grouped, related within a hierarchy, and subdivided according to similarities and differences. Built by human experts. E.g.: The Gene Ontology http://geneontology.org/ describes genes with a common vocabulary, organized in categories. E.g. cellular process > cell death > programmed cell death > apoptotic process > execution phase of apoptosis 33

Ontology enrichment analysis Enrichment analysis: Are there more data points from ontology category G in cluster C than expected by chance? TANGO [Tanay et al., 2003] Assume data points sampled from a hypergeometric distribution The probability for the intersection of G and C to contain at least t points is: 34

Ontology enrichment analysis Enrichment analysis: Are there more data points from ontology category G in cluster C than expected by chance? TANGO [Tanay et al., 2003] Assume data points sampled from a hypergeometric distribution. The probability for the intersection of G and C to contain at least t points is: p = Σ_{i=t}^{min(|G|,|C|)} C(|G|, i) · C(n − |G|, |C| − i) / C(n, |C|), where each term is the probability of getting i points from G when drawing |C| points from a total of n samples. 35
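
This hypergeometric tail probability can be evaluated exactly with `math.comb` (a sketch with my own naming: n is the total number of points, n_G = |G|, n_C = |C|):

```python
from math import comb

def enrichment_pvalue(n, n_G, n_C, t):
    """P(|G ∩ C| >= t) when drawing n_C points out of n,
    of which n_G belong to category G (hypergeometric tail)."""
    total = comb(n, n_C)
    return sum(comb(n_G, i) * comb(n - n_G, n_C - i)
               for i in range(t, min(n_G, n_C) + 1)) / total
```

A small p-value means the overlap between the cluster and the ontology category is unlikely under random sampling, i.e. the cluster is enriched for that category.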

Hierarchical clustering 36

Hierarchical clustering Group data over a variety of possible scales, in a multi-level hierarchy. 37

Construction Agglomerative approach (bottom-up): start with each element in its own cluster; iteratively join neighboring clusters. Divisive approach (top-down): start with all elements in the same cluster; iteratively separate into smaller clusters. 38

Dendrogram The results of a hierarchical clustering algorithm are presented in a dendrogram. Branch length = cluster distance. 39

Dendrogram The results of a hierarchical clustering algorithm are presented in a dendrogram. U height = distance. How many clusters? 40

Dendrogram The results of a hierarchical clustering algorithm are presented in a dendrogram. U height = distance. Cutting the dendrogram at a chosen height yields a partition into clusters (here, clusters 1, 2, 3 and 4). 41

Linkage: connecting two clusters Single linkage 42

Linkage: connecting two clusters Complete linkage 43

Linkage: connecting two clusters Average linkage 44

Linkage: connecting two clusters Centroid linkage 45

Linkage: connecting two clusters Ward Join the two clusters whose merge minimizes the increase in within-cluster variance 46
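
The first three linkage functions can be illustrated with a naive O(n³) bottom-up procedure (a sketch, not how scipy's `linkage` is implemented; centroid and Ward linkage need cluster-update formulas and are omitted here):

```python
import numpy as np

def agglomerative(X, n_clusters, linkage="single"):
    """Naive bottom-up clustering: repeatedly merge the two closest
    clusters under the chosen linkage, until n_clusters remain."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    # single = min, complete = max, average = mean of inter-cluster distances
    agg = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    while len(clusters) > n_clusters:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = agg(D[np.ix_(clusters[a], clusters[b])])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)     # merge the two closest clusters
    return clusters
```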

Example: Gene expression clustering Breast cancer survival signature [Bergamashi et al. 2011]: heatmap of genes (rows) by patients (columns), hierarchically clustered along both axes into groups 1 and 2. 47

Hierarchical clustering Advantages No need to pre-define the number of clusters Interpretability Drawbacks Computational complexity? 48

Hierarchical clustering Advantages: no need to pre-define the number of clusters; interpretability. Drawbacks: computational complexity, e.g. single/complete linkage (naive) takes at least O(pn²) to compute all pairwise distances; must decide at which level of the hierarchy to split; lack of robustness (unstable). 49

K-means 50

K-means clustering Minimize the intra-cluster variance What will this partition of the space look like? 51

K-means clustering Minimize the intra-cluster variance: Σ_{k=1}^{K} Σ_{x ∈ Ck} ‖x − μk‖², where μk is the centroid of cluster Ck. For each cluster, the points in that cluster are those that are closer to its centroid than to any other centroid 52

K-means clustering Minimize the intra-cluster variance Voronoi tessellation 53

Lloyd's algorithm The k-means objective cannot easily be optimized exactly, so we adopt a greedy strategy: 1) Partition the data into K clusters at random. 2) Compute the centroid of each cluster. 3) Assign each point to the cluster whose centroid it is closest to. 4) Repeat from 2) until cluster membership converges. 54
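
The steps above can be sketched in numpy. This is a minimal illustration rather than a reference implementation: the function name and the simple re-seeding rule for empty clusters are my own additions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Lloyd's algorithm: random initial partition, then alternate
    centroid computation and re-assignment until membership converges."""
    rng = np.random.default_rng(rng)
    labels = rng.integers(k, size=len(X))          # step 1: random partition
    for _ in range(n_iter):
        # step 2: centroid of each cluster (re-seed empty clusters at random)
        centroids = np.array([X[labels == j].mean(axis=0)
                              if np.any(labels == j)
                              else X[rng.integers(len(X))]
                              for j in range(k)])
        # step 3: assign each point to its closest centroid
        new = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2),
                        axis=1)
        if np.array_equal(new, labels):            # step 4: converged
            break
        labels = new
    return labels, centroids
```

At convergence each point is assigned to its nearest centroid, which is exactly the Voronoi partition of the previous slide.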

K-means Advantages What is the computational time of k-means? 55

K-means Advantages What is the computational time of k-means? Each iteration computes k·n distances in p dimensions, i.e. O(knp), times the number of iterations, which can be small if there's indeed a cluster structure in the data 56

K-means Advantages Computational time is linear Easily implementable Drawbacks Need to set K ahead of time What happens when there are outliers? 57

K-means Advantages Computational time is linear Easily implementable Drawbacks Need to set K ahead of time Sensitive to noise and outliers Stochastic (different solutions with each restart) The clusters are forced to have convex shapes 58

K-means variants K-means++: seeding algorithm to initialize clusters with centroids spread out throughout the data. Deterministic K-medoids. Kernel k-means: find clusters in feature space. 59
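
The k-means++ seeding rule (pick the first centroid at random, then pick each new centroid with probability proportional to its squared distance to the nearest centroid already chosen) can be sketched as follows; the function name is mine.

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding: spread the initial centroids out over the data."""
    rng = np.random.default_rng(rng)
    centroids = [X[rng.integers(len(X))]]          # first centroid: uniform
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # sample the next centroid proportionally to d2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

These seeds then initialize Lloyd's algorithm, which tends to need fewer iterations and to avoid poor local optima compared to purely random initialization.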

Density-based clustering 60

Hierarchical clustering with scikit-learn: cluster.AgglomerativeClustering(linkage='average', n_clusters=3) 62

k-means clustering with scikit-learn: cluster.KMeans(n_clusters=3) 63

DBSCAN Density-based clustering: clusters are made of dense neighborhoods of points 64

DBSCAN ε-neighborhood of x: the set of points at distance at most ε from x. Core points: points whose ε-neighborhood contains at least a minimum number of points (minPts). x and z are density-connected: there is a chain of core points linking x to z, each lying in the ε-neighborhood of the previous one. Clusters are maximal sets of density-connected points; remaining points are labeled as noise. 65
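
Putting these definitions together, a naive DBSCAN might look like the sketch below (label −1 = noise; the function name and structure are mine, and real implementations use spatial indexes instead of the full distance matrix):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Grow clusters from core points by density-connectivity; -1 = noise."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]   # incl. self
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from this unvisited core point
        stack = [i]
        labels[i] = cluster
        while stack:
            j = stack.pop()
            if not core[j]:
                continue          # border points do not expand the cluster
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster
                    stack.append(nb)
        cluster += 1
    return labels
```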

Summary Clustering: unsupervised approach to group similar data points together. Evaluate clustering algorithms based on: the shape of the clusters; the stability of the results; the consistency with domain knowledge. Hierarchical clustering: top-down / bottom-up; various linkage functions. K-means clustering: tries to minimize intra-cluster variance. Density-based clustering: clusters dense neighborhoods together. 66

References Introduction to Data Mining. P.-N. Tan, M. Steinbach, V. Kumar. Chap. 8: Cluster Analysis. https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf 67