Machine Learning. Unsupervised Learning. Manfred Huber

Machine Learning: Unsupervised Learning
Manfred Huber, 2015

Unsupervised Learning
- In supervised learning, the training data provides the desired target output for learning.
- In unsupervised learning, the training data does not contain a target value.
  - The training data does not specify what to learn; the learning algorithm has to specify what to learn, e.g.:
    - Find potential hidden categories
    - Find patterns in the data (data mining)
    - Learn feature representations for the data
- The performance function has to be defined entirely in terms of the input data.

Clustering
- Clustering is the task of dividing the data into a set of groups (clusters).
- Clustering can be thought of as dividing the data into hidden/unknown classes (clusters).
  - Akin to classification, but without knowledge of any class labels or any definition of what the classes are.
- To allow this, the clustering algorithm has to define internally what makes good clusters.
  - Similarity of instances is the most common criterion.

Clustering
- Classes are not given, so the number of clusters has to be determined differently:
  - Fixed number of clusters set beforehand
  - Maximum dissimilarity allowed within a cluster
  - Ratio of intra-cluster to inter-cluster variance
- Clustering is based on a measure of similarity, e.g.:
  - Euclidean distance between feature vectors
  - Graph similarity (traversal statistics, modification distance)
  - Dynamic Time Warping (DTW) distance
  - Distribution similarity (KL divergence)
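
As a small illustration (my own sketch, not from the slides), two of these similarity measures, Euclidean distance and a symmetrized KL divergence between discrete distributions, could be computed as follows:

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two feature vectors."""
    return np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

print(euclidean_distance([0, 0], [3, 4]))    # 5.0
print(symmetric_kl([0.7, 0.3], [0.4, 0.6]))  # > 0; zero only for identical distributions
```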

K-Means Clustering
- K-Means clustering is one of the simplest clustering algorithms:
  - Randomly place k points in the feature space; these are the first cluster centers.
  - For each cluster center, form a set of data points by assigning each data point to the set with the closest center (Euclidean distance in feature space).
  - Compute new cluster centers as the mean of the points in each set, and repeat until the point sets do not change.
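
A minimal NumPy sketch of this loop (illustrative only; the initialization, variable names, and convergence test here are my own choices):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic K-Means clustering. X is an (n, d) array of feature vectors."""
    rng = np.random.default_rng(seed)
    # Initial cluster centers: k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # point sets did not change -> converged
        assignment = new_assignment
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            if np.any(assignment == j):
                centers[j] = X[assignment == j].mean(axis=0)
    return centers, assignment
```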

K-Means Example
[Figure: successive iterations of K-Means (panels g, h, j); not reproduced in the transcription.]

K-Means Clustering
- K-Means clustering requires the use of particular representation and performance functions:
  - The similarity function has to be Cartesian (Euclidean) distance.
  - The performance function is the average squared distance from the cluster mean.
- If these assumptions are violated, the mean of the data set for the cluster might no longer be the right pick for the new cluster center.
- K-Means performs iterative local optimization of the average squared (Cartesian) distance.

K-Means Clustering
- If the similarity metric is not Cartesian distance, the algorithm has to be adapted:
  - The mean calculation for the new cluster center is no longer valid.
  - Often it is not even defined (e.g. for certain types of graphs, or for metrics such as DTW).
- One option: use the central element of the set as the cluster prototype instead of the mean (see the sketch below).
  - Maintains the property of iteratively optimizing the average distance within each cluster (now under the chosen metric).
  - But: this is no longer K-Means clustering!
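
A rough sketch of this prototype-based variant (essentially a K-Medoids-style update with a pluggable distance function; the structure and names are my own):

```python
import numpy as np

def k_prototypes(X, k, dist, max_iter=100, seed=0):
    """K-Means-like clustering where the prototype is the central element
    (medoid) of each cluster under an arbitrary distance function `dist`."""
    rng = np.random.default_rng(seed)
    proto_idx = rng.choice(len(X), size=k, replace=False)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest prototype under `dist`.
        d = np.array([[dist(x, X[p]) for p in proto_idx] for x in X])
        new_assignment = d.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # New prototype: the member minimizing the total distance to its cluster.
        for j in range(k):
            members = np.where(assignment == j)[0]
            if len(members) == 0:
                continue
            costs = [sum(dist(X[a], X[b]) for b in members) for a in members]
            proto_idx[j] = members[int(np.argmin(costs))]
    return proto_idx, assignment

# Example usage with Euclidean distance (any metric can be plugged in):
# prototypes, labels = k_prototypes(X, 3, dist=lambda a, b: np.linalg.norm(a - b))
```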

K-Means Clustering
- K-Means and related techniques are very simple and frequently used clustering techniques.
- Drawbacks:
  - Sensitive to the starting point
  - High complexity due to the iterative optimization
  - Prone to local extrema
  - Requires the number of clusters to be determined beforehand
    - Can repeat the clustering with different numbers of clusters and check whether the performance still improves significantly (see the sketch below).
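
One way to do this (an illustrative sketch, not from the slides; it reuses the k_means function from the earlier sketch) is to run the clustering for several values of k and watch how the average squared distance to the cluster centers changes:

```python
import numpy as np

def average_squared_distance(X, centers, assignment):
    """Average squared distance of each point to its assigned cluster center."""
    return float(np.mean(np.sum((X - centers[assignment]) ** 2, axis=1)))

# Made-up data: three well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 7):
    centers, assignment = k_means(X, k)  # assumes k_means() from the earlier sketch
    print(k, round(average_squared_distance(X, centers, assignment), 3))
# The performance typically stops improving significantly once k reaches the
# "true" number of clusters (here 3).
```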

Hierarchical Clustering
- Hierarchical clustering forms a hierarchy of clusters (i.e. a different number of clusters at each level of the hierarchy).
- Divisive clustering starts with a single cluster and divides clusters recursively according to a metric.
  - Generally exponential complexity to find the best cluster split.
- Agglomerative clustering starts with one cluster per data point and recursively merges clusters according to a cluster similarity measure.
  - Complexity varies based on the similarity metric used.

Agglomerative Clustering
- Agglomerative clustering incrementally merges clusters based on cluster similarity:
  - Start with one cluster for each data point.
  - Find the two most similar clusters and merge them; repeat until either only one cluster is left or the clusters become too dissimilar (see the sketch below).
- Need to define which two clusters are most similar to each other:
  - Similarity of two data sets (not data points)
  - Often called linkage
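
A naive sketch of this merge loop (purely illustrative and inefficient, roughly O(n^3) per run; the function names are my own):

```python
import numpy as np

def agglomerative(X, linkage_fn, max_dissimilarity=np.inf):
    """Naive agglomerative clustering sketch.
    `linkage_fn(A, B)` returns the dissimilarity between two clusters
    (each given as an array of points)."""
    clusters = [[i] for i in range(len(X))]   # start: one cluster per data point
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage (most similar).
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage_fn(X[clusters[ij[0]]], X[clusters[ij[1]]]),
        )
        if linkage_fn(X[clusters[a]], X[clusters[b]]) > max_dissimilarity:
            break                              # remaining clusters are too dissimilar
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Example linkage: single linkage = distance of the two closest points.
def single_linkage(A, B):
    return min(np.linalg.norm(a - b) for a in A for b in B)
```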

Linkage Criteria
- Agglomerative clustering can use different linkage (cluster similarity) measures:
  - Average (mean) linkage: average similarity over all pairs of items from the two clusters
  - Single (minimum) linkage: similarity of the two most similar items in the two clusters
  - Complete (maximum) linkage: similarity of the two most dissimilar items in the two clusters
- The linkage influences both the complexity and the resulting clusters.

Linkage Criteria
- Average linkage forms the most uniform clusters.
  - Requires recomputation of the linkage after every merge using O(n^2) time; clustering is O(n^3).
- Complete linkage forms the most compact clusters.
  - Recomputation uses MAX (O(n)); clustering is O(n^2).
- Single linkage forms the most connected clusters.
  - Recomputation uses MIN (O(n)); clustering is O(n^2).
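
For experimentation, SciPy's hierarchical clustering routines implement these linkage criteria directly; a small sketch with made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ([0, 0], [4, 4], [0, 4])])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the hierarchy into 3 clusters
    print(method, np.bincount(labels)[1:])            # resulting cluster sizes
```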

Example
[Figure (from Hastie et al.): dendrograms produced by average, farthest (complete), and nearest (single) linkage; not reproduced in the transcription.]

Hierarchical Clustering
- Hierarchical clustering (agglomerative or divisive) forms a hierarchy of clusters.
- Need to decide which clusters in the hierarchy to use, e.g. based on:
  - Change in the intra-cluster variances
  - Maximum linkage allowed for merging
  - Ratio of inter-cluster to intra-cluster variance (see the sketch below)
- Hierarchical clustering has a fixed maximum run time that depends on the size of the data set.
  - The agglomerative linkage criterion influences the run time:
    - The complexity of computing the linkage is higher for average linkage.
    - The average number of merge operations is different.
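
As a rough illustration of the last criterion (my own sketch; the exact normalization of this ratio varies in the literature), the ratio of inter-cluster to intra-cluster variance for a given clustering could be computed as:

```python
import numpy as np

def variance_ratio(X, assignment):
    """Ratio of inter-cluster to intra-cluster variance for a clustering.
    Higher values indicate better-separated, tighter clusters."""
    overall_mean = X.mean(axis=0)
    intra, inter = 0.0, 0.0
    for j in np.unique(assignment):
        members = X[assignment == j]
        center = members.mean(axis=0)
        intra += np.sum((members - center) ** 2)
        inter += len(members) * np.sum((center - overall_mean) ** 2)
    return inter / intra
```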

Probabilistic (Soft) Clustering
- So far, each data point had to belong to exactly one cluster.
  - For a fixed number of clusters this can result in cluster boundaries that cut through dense data, which does not look like natural clusters.
  - This also makes the clusters sensitive to noise.
  - Does not allow overlapping clusters.
- Probabilistic (soft) clustering removes this assumption by using the probability that a point belongs to a cluster, P(z | x^(i)).

Probabilistic Clustering
- Like Bayesian (probabilistic) classification, probabilistic clustering takes a generative view.
- But cluster labels are not given, so they need to be predicted based on a cluster purity criterion:
  - Clusters should have a particular shape and reflect the data as cleanly as possible.

  $P(z \mid x) = \frac{p(x \mid z)\, P(z)}{p(x)}$

- Maximize the marginal likelihood given a particular type of cluster distribution:

  $\hat{\theta} = \arg\max_\theta \prod_i p(x^{(i)}) = \arg\max_\theta \prod_i \sum_{c_j} p(x^{(i)}, z = c_j)$

Gaussian Mixture Models
- In Gaussian Mixture Models, each cluster is represented by a multivariate Gaussian distribution:

  $p(x \mid z_i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \, e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}$

- Cluster assignments are then made probabilistically:

  $P(z \mid x) = \frac{p(x \mid z)\, P(z)}{p(x)}$
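
A small sketch (my own, using SciPy's multivariate normal density and made-up parameters) of evaluating these cluster-membership probabilities for a two-component mixture:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed toy parameters for a 2-component Gaussian mixture (illustrative only).
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = np.array([0.6, 0.4])                                 # P(z)

def responsibilities(x):
    """P(z | x) = p(x | z) P(z) / p(x) for each mixture component."""
    likelihoods = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                            for m, c in zip(means, covs)])    # p(x | z)
    joint = likelihoods * priors                              # p(x, z)
    return joint / joint.sum()                                # normalize by p(x)

print(responsibilities(np.array([1.0, 1.0])))
```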

Gaussian Mixture Models
[Figure: example Gaussian mixture clusterings; not reproduced in the transcription.]

Gaussian Mixture Model
- Learning is performed by maximizing the marginal likelihood:

  $\hat{\theta} = \arg\max_\theta \prod_i p(x^{(i)}) = \arg\max_\theta \prod_i \sum_{c_j} p(x^{(i)}, z = c_j)$
  $\qquad = \arg\max_\theta \prod_i \sum_{c_j} p(x^{(i)} \mid z = c_j)\, P(z = c_j)$
  $\qquad = \arg\max_\theta \prod_i \sum_{c_j} \frac{1}{(2\pi)^{m/2} |\Sigma_j|^{1/2}} \, e^{-\frac{1}{2}(x^{(i)}-\mu_j)^T \Sigma_j^{-1} (x^{(i)}-\mu_j)}\, P(z = c_j)$

- Requires learning the Gaussian parameters for p(x | z) as well as the prior cluster probabilities P(z).
- Difficult to solve directly, with many local extrema.

Expectation Maximization
- Expectation Maximization (EM) is a general algorithm to maximize the marginal likelihood:

  $\hat{\theta} = \arg\max_\theta \prod_i \sum_{c_j} p(x^{(i)}, z = c_j \mid \theta) = \arg\max_\theta \sum_i \log \sum_{c_j} p(x^{(i)}, z = c_j \mid \theta)$

- Performs the optimization by iterating two steps:
  - Expectation step: computes the expectations for the hidden/missing variables given the current parameters.
  - Maximization step: optimizes the parameters with a weighted estimate (optimizing a lower bound).
- Similar to coordinate ascent.
  - In the GMM case it is similar to the way K-Means does its optimization.

Expectation Maximization
- Expectation step: compute the expected distribution for the hidden variables:

  $P(z_j \mid x^{(i)}, \theta_t) \propto p(x^{(i)} \mid z_j, \theta_t)\, P(z_j \mid \theta_t)$

- Maximization step: recompute the optimal parameters for a lower bound on the likelihood:

  $\theta_{t+1} = \arg\max_\theta \sum_i \sum_j P(z_j \mid x^{(i)}, \theta_t) \log p(z_j, x^{(i)} \mid \theta) = \arg\max_\theta \sum_i E_z\!\left[\log p(z, x^{(i)} \mid \theta)\right]$

Expectation Maximization
- Expectation maximization effectively performs coordinate ascent on the lower bound of the log likelihood.
  - Still maximizes the original likelihood: the expectation step tightens the bound, ensuring that the bound eventually has the same extremum as the original function.
- Significantly simpler than the original optimization.
  - Often both steps are analytically solvable.
- Guaranteed convergence.
  - But: only to a local optimum, making the starting point important.

EM for GMM
- Applying EM to a GMM, the expectation step computes the expected cluster memberships for the data points:

  $P(z_j \mid x^{(i)}, \theta_t) \propto p(x^{(i)} \mid z_j, \theta_t)\, P(z_j \mid \theta_t)$

- The maximization step re-computes the means and (co)variances of the Gaussian distributions for each cluster, as well as the cluster priors:

  $P(z_j \mid \theta_{t+1}) = \pi_j(t+1) = \frac{\sum_i P(z_j \mid x^{(i)}, \theta_t)}{n}$

  $\mu_j(t+1) = \frac{\sum_i P(z_j \mid x^{(i)}, \theta_t)\, x^{(i)}}{\sum_i P(z_j \mid x^{(i)}, \theta_t)}$

  $\Sigma_j(t+1) = \frac{\sum_i P(z_j \mid x^{(i)}, \theta_t)\, (x^{(i)} - \mu_j(t+1))(x^{(i)} - \mu_j(t+1))^T}{\sum_i P(z_j \mid x^{(i)}, \theta_t)}$
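
Putting the two steps together, a compact NumPy sketch of these updates (the initialization, regularization term, and fixed iteration count are my own choices, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a Gaussian Mixture Model on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialization: random means from the data, shared covariance, uniform priors.
    mu = X[rng.choice(n, size=k, replace=False)].astype(float)
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: responsibilities P(z_j | x_i, theta_t).
        resp = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=sigma[j])
            for j in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate priors, means, and covariances.
        Nj = resp.sum(axis=0)                      # effective cluster sizes
        pi = Nj / n
        mu = (resp.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (resp[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return pi, mu, sigma
```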

GMM Example
- Start with 3 clusters.
[Figures: the initial configuration and the mixture after the first, second, third, and 20th EM iterations; not reproduced in the transcription.]

Clustering
- There is a wide range of other clustering methods:
  - Clustering methods for other similarity metrics, e.g. spectral clustering (graph similarity measure)
  - Clustering using neural networks, e.g. self-organizing maps for topological clustering
  - Clustering using sampling methods, e.g. ant colony optimization-based clustering

Clustering
- Clustering is an unsupervised learning problem aimed at dividing data into groups.
  - Like classification, but without known classes.
- What makes good clusters has to be designed into the algorithm in the form of a similarity measure.
- Deterministic clustering assigns each data point to exactly one cluster:
  - K-Means clustering uses a fixed number of clusters.
  - Hierarchical clustering builds a hierarchy of clusters.

Clustering
- Clustering can be used for a number of purposes:
  - Identify different types within the data.
    - The type can be used as a feature for subsequent tasks.
    - In probabilistic clustering, the probability vector over types can be used as a continuous feature vector.
  - Identify different causes for the data.
    - Can be used to identify whether the data generation process was uniform.
  - Approximately compress the data set.