
Machine learning: Unsupervised learning
David Kauchak, CS160, Fall 2009
http://www.youtube.com/watch?v=or_-y-eilqo
Adapted from: http://www.stanford.edu/class/cs276/handouts/lecture17-clustering.ppt

Machine learning code
- Weka (Java based): tons of approaches; good for playing with different approaches, but faster/better implementations of individual approaches can be found elsewhere. http://www.cs.waikato.ac.nz/ml/weka/
- SVMs:
  - SVMLight (C-based, fast): http://www.cs.cornell.edu/people/tj/svm_light/
  - LIBSVM (Java): http://www.csie.ntu.edu.tw/~cjlin/libsvm/
- Others: many others out there, search for them, e.g. PyML (in Python, but I've never used it)

Administrative
- Keep up with the written homeworks
- Project proposals due this Friday
- Assignment 5 due next Wednesday!
- Will have assignment 4 (and literature reviews) back to you soon

Supervised learning
- Training or learning phase: raw data -> extract features -> labeled feature vectors (label, f1, f2, f3, ..., fn) -> train a predictive model -> classifier
- User supervision: we're given the labels (classes)

Unsupervised learning
- Given some examples without labels, do something!

Unsupervised learning: clustering
- Raw data -> extract features -> feature vectors (f1, f2, f3, ..., fn) -> group into classes/clusters -> clusters
- No supervision: we're only given data and want to find natural groupings

Unsupervised learning: modeling
- Most frequently, when people think of unsupervised learning they think clustering
- Another category: learning probabilities/parameters for models without supervision
  - Learn a translation dictionary
  - Learn an HMM from just sequences of data (i.e. not given the states)
  - Learn a grammar for a language
  - Learn the social graph

Clustering
- Clustering: the process of grouping a set of objects into classes of similar objects
- Applications?

Gene expression data
[Figure: clustered gene expression data, from Garber et al., PNAS (98), 2001]

Face clustering
[Figure slides: clustering face images]

Applications of clustering in IR
- Vary search results over clusters/topics to improve recall

Google News
- Automatic clustering gives an effective news presentation metaphor

Clustering in search advertising
- Find clusters of advertisers and keywords
- Keyword suggestion
- Performance estimation
[Figure: bipartite advertiser / bidded-keyword graph, ~10M nodes]

Clustering applications
- Find clusters of users: targeted advertising, exploratory analysis
  [Figure: who-messages-whom IM graph]
- Data visualization: Wise et al., "Visualizing the non-visual", PNNL ThemeScapes, Cartia [mountain height = cluster size]
- Clusters of the web graph: distributed PageRank computation, ~100M nodes

A data set with clear cluster structure
- What are some of the issues for clustering?
- What clustering algorithms have you seen/used?

Issues for clustering
- Representation for clustering
  - How do we represent an example: features, etc.
  - Similarity/distance between examples
- Flat clustering or hierarchical
- Number of clusters
  - Fixed a priori
  - Data driven?

Clustering algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning and refine it iteratively
  - K-means clustering
  - Model-based clustering
  - Spectral clustering
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive

Hard vs. soft clustering
- Hard clustering: each example belongs to exactly one cluster
- Soft clustering: an example can belong to more than one cluster (probabilistic)
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes

K-Means
- Most well-known and popular clustering algorithm
- Start with some initial cluster centers
- Iterate:
  - Assign/cluster each example to the closest center
  - Recalculate each center as the mean of the points in its cluster c: mu(c) = (1/|c|) * sum of x over x in c

K-means: an example
[Figure slides: initialize centers randomly, then assign points to the nearest center]
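The loop above is short enough to implement directly. Here is a minimal NumPy sketch (my own illustration, not the course's reference code; the seeding, convergence test, and empty-cluster handling are simple default choices):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means on an (n, m) data matrix X with k clusters."""
    rng = np.random.default_rng(seed)
    # Start with some initial cluster centers: k randomly chosen examples.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each example to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recalculate each center as the mean of the points in its cluster.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # centers don't change: done
            break
        centers = new_centers
    return centers, assign

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, assign = kmeans(X, k=2)
```

The later sketches below reuse this kmeans function.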

[Figure slides: K-means iterations, alternating "readjust centers" and "assign points to nearest center"]

[Figure slides: readjust centers, assign points to nearest center; no changes: done]

K-means variations/parameters
- Initial (seed) cluster centers
- Convergence criteria
  - A fixed number of iterations
  - Partitions unchanged
  - Cluster centers don't change
- K

K-means: initialize centers randomly
[Figure: an unlucky random initialization] What would happen here? Seed selection ideas?

Seed Choice
- Results can vary drastically based on random seed selection
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings
- Common heuristics
  - Random centers in the space
  - Randomly pick examples
  - Points least similar to any existing center
  - Try out multiple starting points
  - Initialize with the results of another clustering method

How Many Clusters?
- Number of clusters K must be provided
- Somewhat application dependent
- How should we determine the number of clusters? (too many vs. too few) Ideas?

One approach
- Assume the data is Gaussian (i.e. spherical) and test for this
  - Testing in high dimensions doesn't work well
  - Testing in lower dimensions does work well
- For each cluster, project down to one dimension and use a statistical test to see if the data is Gaussian
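Two of the seed-choice heuristics above are easy to sketch: picking points least similar to any existing center (furthest-point seeding) and trying out multiple starting points, keeping the run with the lowest sum of squared error. Both sketches reuse the kmeans function from earlier and are illustrative defaults, not the only reasonable choices:

```python
import numpy as np

def furthest_point_seeds(X, k, seed=0):
    """Pick a random first center, then repeatedly add the point that is
    least similar (furthest, in Euclidean distance) to all centers so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(X[d.min(axis=1).argmax()])
    return np.array(centers)

def best_of_restarts(X, k, restarts=10):
    """Run k-means from several random starting points, keep the lowest-SSE run."""
    best = None
    for s in range(restarts):
        centers, assign = kmeans(X, k, seed=s)
        sse = ((X - centers[assign]) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, centers, assign)
    return best[1], best[2]
```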

Project to one dimension and check
- For each cluster, project down to one dimension
- Use a statistical test to see if the data is Gaussian
- The dimension of the projection is based on the data
[Figure slides: examples of the per-cluster projection and test]

On synthetic data
[Figure: results on synthetic data; some clusters are split too far]
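A sketch of the projection-and-test idea, following the G-means approach of Hamerly & Elkan (the paper linked on the next slide): split the cluster in two with k-means, project the points onto the direction between the two child centers, and run an Anderson-Darling normality test on that one-dimensional projection. The 5% critical value is an illustrative choice, and kmeans is the sketch from earlier:

```python
import numpy as np
from scipy import stats

def looks_gaussian(cluster_points, seed=0):
    """Project a cluster down to one dimension and test whether it looks Gaussian."""
    # The projection direction is based on the data: the vector between
    # the two centers of a k=2 split of the cluster.
    (c1, c2), _ = kmeans(cluster_points, k=2, seed=seed)
    v = c1 - c2
    projected = cluster_points @ v / np.linalg.norm(v)
    res = stats.anderson(projected, dist='norm')
    # critical_values correspond to [15%, 10%, 5%, 2.5%, 1%]; index 2 is 5%.
    return res.statistic < res.critical_values[2]
```

In G-means, a cluster that fails the test is split and the procedure recurses; a cluster that passes is kept.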

Compared to other approaches
- http://cs.baylor.edu/~hamerly/papers/nips_03.pdf

K-Means time complexity
- Variables: K clusters, n data points, m features/dimensions, I iterations
- What is the runtime complexity?
  - Computing the distance between two points is O(m), where m is the dimensionality of the vectors
  - Reassigning clusters: O(Kn) distance computations, or O(Knm)
  - Computing new centroids: each point gets added once to some centroid: O(nm)
  - Assume these two steps are each done once for I iterations: O(IKnm)

Problems with K-means
- Determining K is challenging
- Spherical assumption about the data (distance to cluster center)
- Hard clustering isn't always right
- Greedy approach
- In practice, K-means converges quickly and is fairly fast

Problems with K-means
- Assumes spherical clusters
[Figure: non-spherical data] What would K-means give us here?

EM clustering: mixtures of Gaussians
- Assume the data came from a mixture of Gaussians (elliptical data); assign data to clusters with a certain probability
[Figure slides: k-means vs. EM on the same data]

EM is a general framework
- Create an initial model, θ
  - Arbitrarily, randomly, or with a small set of training examples
- Use the model θ to obtain another model θ' such that Σ_i log P_θ'(data_i) > Σ_i log P_θ(data_i), i.e. θ' better models the data (increased log likelihood)
- Let θ = θ' and repeat the above step until reaching a local maximum
  - Guaranteed to find a better model after each iteration
- Where else have you seen EM?

EM shows up all over the place
- Training HMMs (Baum-Welch algorithm)
- Learning probabilities for Bayesian networks
- EM-clustering
- Learning word alignments for language translation
- Learning Twitter friend network
- Genetics
- Finance
- Anytime you have a model and unlabeled data!

E and M steps: creating a better model
- Expectation: given the current model, figure out the expected probabilities of the data points to each cluster, p(x | θ_c)
  - What is the probability of each point belonging to each cluster?
- Maximization: given the probabilistic assignment of all the points, estimate a new model, θ_c
  - Just like NB maximum likelihood estimation, except we use fractional counts instead of whole counts

Similar to K-Means
- Iterate: assign/cluster each point to the closest center <-> Expectation: figure out the expected probabilities of the points to each cluster, p(x | θ_c)
- Recalculate centers as the mean of the points in a cluster <-> Maximization: given the probabilistic assignment of all the points, estimate a new model, θ_c
- Iterate: each iteration increases the likelihood of the data and is guaranteed to converge (though to a local optimum)!
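A compact sketch of these E and M steps for a mixture of Gaussians, one Gaussian per cluster (my own illustrative code: a real implementation would work in log space, check convergence of the log likelihood rather than running a fixed number of iterations, and guard more carefully against collapsing covariances):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    n, m = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(m) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: expected (fractional) assignment of each point to each cluster.
        resp = np.column_stack([
            weights[c] * multivariate_normal.pdf(X, means[c], covs[c])
            for c in range(k)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the model from the fractional counts.
        Nc = resp.sum(axis=0)              # soft count of points per cluster
        weights = Nc / n
        means = (resp.T @ X) / Nc[:, None]
        for c in range(k):
            diff = X - means[c]
            covs[c] = (resp[:, c, None] * diff).T @ diff / Nc[c] + 1e-6 * np.eye(m)
    return weights, means, covs, resp
```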

Model: mixture of Gaussians
- Covariance determines the shape of these contours
- Fit these Gaussian densities to the data, one per cluster

EM example
[Figure slides from Chris Bishop: EM fitting a mixture of Gaussians to data]

EM
- EM is a general purpose approach for training a model when you don't have labels
- Not just for clustering! K-means is just for clustering
- One of the most general purpose unsupervised approaches
- Can be hard to get right!

Finding Word Alignments
- Example sentence pairs: la maison / the house; la maison bleue / the blue house; la fleur / the flower
- In machine translation, we train from pairs of translated sentences
- Often useful to know how the words align in the sentences
- Use EM! Learn a model of P(french-word | english-word)
- Initially: all word alignments equally likely; all P(french-word | english-word) equally likely
- After an iteration: "la" and "the" are observed to co-occur frequently, so P(la | the) is increased. "house" co-occurs with both "la" and "maison", but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of "the" (pigeonhole principle)

Finding Word Alignments
- Settling down after another iteration
- Inherent hidden structure revealed by EM training!
- For details, see:
  - "A Statistical MT Tutorial Workbook" (Knight, 1999): 37 easy sections, final section promises a free beer
  - "The Mathematics of Statistical Machine Translation" (Brown et al., 1993)
  - Software: GIZA++

Statistical Machine Translation
- P(maison | house) = 0.411
- P(maison | building) = 0.027
- P(maison | manson) = 0.020
- Estimating the model from training data

Discussion
- How does an infant learning to understand and speak language fit into our model?
- How about learning to play tennis?
- How are these different/similar?
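A tiny sketch of this EM procedure on exactly the three sentence pairs above. It is essentially IBM Model 1 without NULL words; the Knight tutorial cited above walks through the full derivation:

```python
from collections import defaultdict

pairs = [("la maison", "the house"),
         ("la maison bleue", "the blue house"),
         ("la fleur", "the flower")]
pairs = [(f.split(), e.split()) for f, e in pairs]

# Initially all P(french-word | english-word) are equally likely.
french_vocab = {f for fs, _ in pairs for f in fs}
t = defaultdict(lambda: 1.0 / len(french_vocab))     # t[(f, e)] = P(f | e)

for _ in range(10):                                   # a few EM iterations
    count = defaultdict(float)                        # expected (fractional) counts
    total = defaultdict(float)
    for fs, es in pairs:
        for f in fs:
            # E-step: spread one fractional count for f across the English words.
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                count[(f, e)] += t[(f, e)] / norm
                total[e] += t[(f, e)] / norm
    # M-step: re-estimate the translation probabilities from fractional counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))   # climbs toward 1.0
print(round(t[("la", "house")], 3))       # stays limited, because of "the"
```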

Other clustering algorithms
- K-means and EM-clustering are by far the most popular clustering algorithms
- However, they can't handle all clustering tasks
- What types of clustering problems can't they handle?

Non-Gaussian data
[Figure: data with non-convex cluster shapes] What is the problem?
- Similar to classification: global decision vs. local decision
- Spectral clustering

Similarity Graph
- Represent the dataset as a weighted graph G(V, E)
  - Data points: V = {x_i}, a set of n vertices representing the points
  - E = {w_ij}, a set of weighted edges indicating pair-wise similarity between points
[Figure: example graph with 6 vertices and edge weights 0.1, 0.2, 0.6, 0.7]

Graph Partitioning
- What does clustering represent? Clustering can be viewed as partitioning a similarity graph
- Bi-partitioning task: divide the vertices into two disjoint groups (A, B)
[Figure: the example graph split into groups A and B]
- What would define a good partition?
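Building such a similarity graph from a set of points amounts to filling in a matrix of pairwise edge weights w_ij. A common (though not the only) choice is a Gaussian/RBF function of the distance; the bandwidth sigma below is a free parameter of my sketch:

```python
import numpy as np

def similarity_graph(X, sigma=1.0):
    """Weighted graph G(V, E): vertices are the points, and the edge weight
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) is the pairwise similarity."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # no self-edges
    return W
```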

Clustering Objectives
- Traditional definition of a good clustering:
  1. Points within a cluster should be highly similar
  2. Points in different clusters should be highly dissimilar
- Apply these objectives to our graph representation:
  1. Maximise the weight of within-group connections
  2. Minimise the weight of between-group connections

Graph Cuts
- Express the partitioning objectives as a function of the edge cut of the partition
- Cut: the set of edges with only one vertex in a group
[Figure: example partition of the graph with cut(A,B) = 0.3]
- Can we use the minimum cut?

Graph Cut Criteria
[Figure: the optimal cut vs. the minimum cut]
- Criterion: Normalised-cut (Shi & Malik, 97)
- Consider the connectivity between groups relative to the density of each group: normalize the association between groups by volume
- Vol(A): the total weight of the edges originating from group A
- Why does this work?

Normalized cut
- Vol(A): the total weight of the edges originating from group A
- Ncut(A,B) = cut(A,B)/Vol(A) + cut(A,B)/Vol(B): minimize between-group connections, maximize within-group connections
[Figure: the example graph partitioned into groups A and B]

Spectral clustering examples
[Figure slides: results from Ng et al., "On Spectral Clustering: Analysis and an Algorithm"]
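Putting the pieces together: relaxing the normalized-cut objective leads to the standard spectral clustering recipe (Shi & Malik; Ng, Jordan & Weiss): form the normalized graph Laplacian, take the eigenvectors for the smallest eigenvalues, and run k-means on the embedded points. A sketch that reuses the similarity_graph and kmeans functions above; the exact normalization steps differ slightly between published variants:

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0):
    W = similarity_graph(X, sigma)                       # weighted adjacency matrix
    d = W.sum(axis=1)                                    # degrees (volumes)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt     # normalized Laplacian
    # Eigenvectors of the k smallest eigenvalues give the relaxed partition.
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)     # normalize rows
    _, assign = kmeans(U, k)                             # cluster the embedding
    return assign
```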

Hierarchical clustering
- We didn't cover this, but if you're interested, take a look through these slides for the common approaches

Hierarchical Clustering
- Recursive partitioning/merging of a data set
[Figure: dendrogram over five points, showing the 1-clustering through 5-clustering levels]

Dendrogram
- Represents all partitionings of the data
- We can get a K-clustering by looking at the connected components at any given level
- Frequently binary dendrograms, but n-ary dendrograms are generally easy to obtain with minor changes to the algorithms

Advantages of hierarchical clustering
- Don't need to specify the number of clusters
- Good for data visualization
  - See how the data points interact at many levels
  - Can view the data at multiple levels of granularity
  - Understand how all points interact
- Specifies all of the K clusterings/partitions

Hierarchical Clustering
- Common in many domains
  - Biologists and social scientists
  - Gene expression data
  - Document/web page organization (DMOZ, Yahoo directories)
[Figure: taxonomy example; animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean)]
- Two main approaches

Divisive hierarchical clustering
- Finding the best partitioning of the data is generally exponential in time
- Common approach:
  - Let C be a set of clusters
  - Initialize C to be the one-clustering of the data
  - While there exists a cluster c in C to split:
    - remove c from C
    - partition c into 2 clusters, c1 and c2, using a flat clustering algorithm
    - add c1 and c2 to C
- Bisecting k-means: the above with k-means as the flat clustering algorithm (see the sketch below)

Divisive clustering
[Figure: start with one cluster]
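A sketch of the divisive loop above as bisecting k-means, reusing the kmeans function from earlier. The slide leaves the stopping rule and the choice of which cluster to split open; stopping at a target number of clusters and always splitting the largest cluster are just illustrative defaults:

```python
import numpy as np

def bisecting_kmeans(X, target_clusters):
    C = [np.arange(len(X))]                     # initialize C to the one-clustering
    while len(C) < target_clusters:
        # Pick a cluster c to split (here: the largest one) and remove it from C.
        idx = max(range(len(C)), key=lambda i: len(C[i]))
        c = C.pop(idx)
        # Partition c into 2 clusters using a flat clustering algorithm (k-means).
        _, assign = kmeans(X[c], k=2)
        # Add c1 and c2 to C.
        C.append(c[assign == 0])
        C.append(c[assign == 1])
    return C                                    # list of index arrays, one per cluster
```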

Divisive clustering
[Figure slides: repeatedly split clusters using flat clustering]

Divisive clustering
[Figure: the final divisive clustering]

Hierarchical Agglomerative Clustering (HAC)
- Let C be a set of clusters
- Initialize C to be all points/docs as separate clusters
- While C contains more than one cluster:
  - find c1 and c2 in C that are closest together
  - remove c1 and c2 from C
  - merge c1 and c2 and add the resulting cluster to C
- The history of merging forms a binary tree or hierarchy
- Note, there is a temporal component not seen here
- How do we measure the distance between clusters?

Distance between clusters
- Single-link: similarity of the most similar pair, sim(L, R) = max over l in L, r in R of sim(l, r)
- Complete-link: similarity of the furthest points, the least similar pair, sim(L, R) = min over l in L, r in R of sim(l, r)
- Why are these local methods used? Efficiency
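The HAC loop above translates almost line for line into code. The sketch below works with distances rather than similarities, so single-link takes the minimum pairwise distance (the most similar pair) and complete-link the maximum; it is a naive, roughly O(n^3) version for illustration only:

```python
import numpy as np

def hac(X, link="single"):
    """Naive agglomerative clustering; returns the merge history."""
    C = [[i] for i in range(len(X))]          # all points start as separate clusters
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    link_fn = np.min if link == "single" else np.max
    history = []
    while len(C) > 1:
        # Find the two clusters in C that are closest together under the linkage.
        i, j = min(((a, b) for a in range(len(C)) for b in range(a + 1, len(C))),
                   key=lambda ab: link_fn(D[np.ix_(C[ab[0]], C[ab[1]])]))
        # Merge them and record the merge (the history forms the hierarchy).
        history.append((C[i], C[j]))
        merged = C[i] + C[j]
        C = [c for idx, c in enumerate(C) if idx not in (i, j)] + [merged]
    return history
```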

Distance between clusters
- Centroid: clusters whose centroids (centers of gravity) are the most similar
- Ward's method: what does this do? Encourages similar sized clusters
- Average-link: average similarity between all pairs of elements
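All of these cluster-distance definitions are available off the shelf. SciPy's hierarchical clustering selects the criterion by name, which makes it easy to compare single, complete, average, centroid, and Ward linkage on the same data (the two-blob data here is just a toy example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])

for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                      # full merge history
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])             # cluster sizes
```

scipy.cluster.hierarchy.dendrogram(Z) will plot the corresponding dendrogram.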

Single link / complete link examples
[Figure slides: the same data clustered with single linkage and with complete linkage]

Computational Complexity
- For m dimensions and n documents/points
- How many iterations? n-1 iterations
- First iteration: need to compute the similarity of all pairs of n points/documents: O(n^2 m)
- Remaining n-2 iterations: compute the distance between the most recently created cluster and all other existing clusters: O(nm)
  - Does depend on the cluster similarity approach
- Overall run-time: O(n^2 m), generally slower than flat clustering!

Problems with hierarchical clustering
- Locally greedy: once a merge decision has been made it cannot be changed
- Single-linkage: chaining effect
[Figure: points 1, 2, 3, ..., j, j+1, ..., n with one gap of 1 - jδ, illustrating how single-link chains them into one long cluster]