
Clustering is an unsupervised machine learning problem: grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It requires a distance metric (or distance function) to quantify similarity.
- Hard clustering: each object is assigned to one and only one cluster. Hierarchical clustering is usually hard.
- Soft (fuzzy) clustering: allows degrees of membership and membership in multiple clusters. Centroid-based clustering can be either hard or soft.
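As a rough, made-up illustration of the hard/soft distinction (not part of the lecture), assume we already have two cluster centers and use Euclidean distance to quantify similarity:

```python
# Minimal sketch: hard vs. soft assignment of one object to hypothetical clusters.
# The centers, the point, and the soft-weighting scheme are all illustrative choices.
import numpy as np

centers = np.array([[0.0, 0.0], [4.0, 4.0]])   # made-up cluster centers
x = np.array([1.0, 1.5])                       # one object to assign

dists = np.linalg.norm(centers - x, axis=1)    # Euclidean distance to each center

hard_label = dists.argmin()                    # hard: exactly one cluster
weights = np.exp(-dists)                       # soft: degrees of membership
soft_membership = weights / weights.sum()      # (one simple way to normalize)

print(hard_label, soft_membership)
```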

Connectivity-based (hierarchical) clustering:
- Based on the idea that objects are more related to nearby objects than to objects farther away; clusters are formed by connecting objects based on their distance.
- A cluster can be described largely by the maximum distance needed to connect its parts. At different distances, different clusters will form, which can be represented using a hierarchical dendrogram.
- Does not provide a single partitioning of the data set, but instead an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don't mix.
- Connectivity-based clustering is a whole family of methods that differ in the way distances and the linkage criterion are computed. The linkage criterion is the method used to compute the distance (link) to a cluster that consists of multiple objects. Common linkage criteria:
  - Single-linkage clustering: the minimum of object distances
  - Complete-linkage clustering: the maximum of object distances
  - UPGMA ("Unweighted Pair Group Method with Arithmetic Mean", also known as average-linkage clustering)
- Does not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters.
- Not robust to outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the "chaining phenomenon", common with single-linkage clustering).
- Recognized as the theoretical foundation of cluster analysis, but often considered obsolete.
- Hierarchical clustering can be top-down or bottom-up:
  - Top-down (divisive) clustering starts with one group (all objects belong to one cluster) and divides it into groups so as to maximize within-group similarity.
  - Bottom-up (agglomerative) clustering starts with a separate cluster for each object; in each step the two most similar clusters are determined and merged into a new cluster.
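As a sketch of how these linkage criteria might be used in practice (my own illustration, not the lecture's code; the data and cut-off distance are invented), SciPy's hierarchical clustering routines accept exactly these options:

```python
# Minimal sketch of agglomerative clustering with different linkage criteria.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(10, 2)),
               rng.normal(loc=3.0, scale=0.3, size=(10, 2))])

D = pdist(X, metric="euclidean")        # condensed pairwise distance matrix

# Linkage criteria: 'single' (minimum), 'complete' (maximum), 'average' (UPGMA)
Z = linkage(D, method="average")

# The hierarchy gives no single partition; cut it at a chosen distance instead
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)
```

Plotting the hierarchy with scipy.cluster.hierarchy.dendrogram would show the merge distances on the y-axis, as described above.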

[Figure: clustering of the expression of 5 genes (g1-g5); a scatter plot in the coordinates p1 and p2 (patient 1 and patient 2 gene expression) and the corresponding dendrogram, whose vertical axis shows the merge distance.]

Pairwise distances between the 5 genes:

      g1    g2    g3    g4
g2   0.30
g3   2.24  2.12
g4   2.44  2.28  0.40
g5   5.66  5.45  3.61  3.28

A metric d on a set X is a function d : X × X → R (where R is the set of real numbers). For all x, y, z in X, this function is required to satisfy the following conditions:
1. d(x, y) ≥ 0 (non-negativity, or separation axiom)
2. d(x, y) = 0 if and only if x = y (identity of indiscernibles, or coincidence axiom)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) ≤ d(x, y) + d(y, z) (subadditivity / triangle inequality)

Conditions 1 and 2 together produce positive definiteness; the first condition is implied by the others.

A metric is an ultrametric if it satisfies the following stronger version of the triangle inequality, where points can never fall 'between' other points: for all x, y, z in X, d(x, z) ≤ max(d(x, y), d(y, z)).

A metric d on X is called intrinsic if any two points x and y in X can be joined by a curve with length arbitrarily close to d(x, y).
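A distance table like the one above is simply the matrix of pairwise Euclidean distances between the expression points. A minimal sketch of computing one, using hypothetical coordinates rather than the exact values behind the figure:

```python
# Minimal sketch: pairwise distance matrix for 5 points and a numeric check
# of the metric axioms. The coordinates are made up for illustration.
import numpy as np
from scipy.spatial.distance import pdist, squareform

genes = ["g1", "g2", "g3", "g4", "g5"]
X = np.array([[1.0, 5.0],   # hypothetical (p1, p2) expression for g1
              [1.2, 5.2],
              [3.0, 4.0],
              [3.2, 3.7],
              [5.5, 1.5]])

D = squareform(pdist(X, metric="euclidean"))   # symmetric 5x5 distance matrix

# Check symmetry and the triangle inequality numerically
assert np.allclose(D, D.T)
n = len(genes)
for i in range(n):
    for j in range(n):
        for k in range(n):
            assert D[i, k] <= D[i, j] + D[j, k] + 1e-12

print(np.round(D, 2))
```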

Two common metrics are the Euclidean metric, d(x, y) = sqrt(Σ_i (x_i − y_i)²), and the Manhattan (taxicab) metric, d(x, y) = Σ_i |x_i − y_i|.

Centroid-based clustering:
- Clusters are represented by a central vector, which may not necessarily be a member of the data set.
- When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign each object to the nearest cluster center, such that the squared distances from the cluster centers are minimized.
- Given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S_1, S_2, ..., S_k}, so as to minimize the within-cluster sum of squares (WCSS):

    WCSS = Σ_{i=1..k} Σ_{x ∈ S_i} ||x − μ_i||²

  where μ_i is the centroid (mean of all points) of S_i.
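To make the objective concrete, here is a small illustrative sketch (not from the lecture) of the two metrics and the WCSS computation; the points, labels, and cluster count are made up:

```python
# Minimal sketch: Euclidean and Manhattan distances, and the WCSS objective
# that k-means minimizes, for a tiny made-up data set.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: for each cluster, sum the squared
    Euclidean distances from its points to its centroid, then total them."""
    total = 0.0
    for i, mu in enumerate(centroids):
        members = X[labels == i]
        total += np.sum((members - mu) ** 2)
    return total

X = np.array([[0.0, 0.0], [0.1, 0.2], [4.0, 4.0], [4.2, 3.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == i].mean(axis=0) for i in range(2)])
print(wcss(X, labels, centroids))   # small value: points sit close to their centroids
```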

- The optimization problem itself is known to be NP-hard, so the common approach is to search only for approximate solutions.
- A common method is Lloyd's algorithm, often referred to simply as "the k-means algorithm". It finds a local optimum and is commonly run multiple times with different random initializations.
- Variations of k-means often include different optimizations:
  - Restricting the centroids to members of the data set (k-medoids)
  - Choosing medians (k-medians clustering)
  - Choosing the initial centers less randomly (k-means++)
  - Allowing a fuzzy cluster assignment (fuzzy c-means)
- Most k-means-type algorithms require the number of clusters, k, to be specified in advance, which is considered one of the biggest drawbacks of these algorithms.
- The algorithm prefers clusters of approximately similar size, as it always assigns an object to the nearest centroid. This often leads to incorrectly cut borders between clusters.
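For illustration only, a compact NumPy sketch of Lloyd's two alternating steps; the initialization and stopping rule are deliberately simplistic, and the data are made up:

```python
# Minimal sketch of Lloyd's algorithm: alternate assignment and centroid updates
# until the centroids stop moving (a local optimum).
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
labels, centroids = lloyd_kmeans(X, k=2)
print(centroids)
```

In practice one would usually rely on a library implementation such as scikit-learn's KMeans, which also provides the k-means++ initialization (init="k-means++") and multiple random restarts (n_init).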

Distribution-based clustering:
- Clusters are defined as objects that most likely belong to the same distribution.
- A nice property of this approach is that it models how many biological data sets are generated: by sampling random objects from a distribution.
- A drawback is that it can suffer from overfitting unless constraints are put on the model complexity; a more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult.
- One common method is the Gaussian mixture model: the data set is modelled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to better fit the data set, typically using the expectation-maximization (EM) algorithm.
- EM converges to a local optimum, so multiple runs may produce different results.
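A hedged sketch of how such a model might be fit with scikit-learn's GaussianMixture (EM under the hood); the data and settings are invented:

```python
# Minimal sketch: Gaussian mixture clustering with a fixed number of components
# and several EM restarts, yielding both hard labels and soft memberships.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])

# n_components fixes the model complexity; n_init reruns EM from several
# random initializations because each run only reaches a local optimum
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

hard_labels = gmm.predict(X)              # most likely component per point
soft_memberships = gmm.predict_proba(X)   # degrees of membership (soft clustering)
print(soft_memberships[:3].round(3))
```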
