Unsupervised Learning
Unsupervised learning
Until now, we have assumed that our training samples are labeled with their category membership. Methods that use labeled samples are said to be supervised. However, there are problems where the definition of the classes, and even the number of classes, may be unknown. Machine learning methods that deal with such data are said to be unsupervised. Two questions arise: Why would one even be interested in learning from unlabeled samples? And is it even possible, in principle, to learn anything of value from unlabeled samples?
Why unsupervised learning (in no particular order)
1. To limit the cost of collecting and labeling a large set of sample patterns, a process that is often surprisingly expensive. E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time consuming.
2. To obtain a larger training set than the one available, using semi-supervised learning: train a classifier on a small set of labeled samples, then tune it to run without supervision on a large, unlabeled set. Or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found.
3. To detect a gradual change of patterns over time.
4. To find features that will be useful for categorization.
5. To gain insight into the nature or structure of the data during the early stages of an investigation.
Unsupervised learning: clustering
In practice, unsupervised learning methods implement what is usually referred to as data clustering. Qualitatively and generally, the problem of data clustering can be defined as the grouping of objects into meaningful categories: given a representation of N objects, find k clusters based on a similarity measure.
Data clustering
The problem can be tackled from several points of view.
Statistics: represent the density function of the whole data set as a mixture of a number of different distributions, $\sum_i w_i \, p(\mathbf{y} \mid \omega_i)$, and fit the set of weights $w_i$ and the component densities $p(\mathbf{y} \mid \omega_i)$ to the given data.
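As an illustration of this statistical view, the sketch below fits a two-component Gaussian mixture with scikit-learn; the library, the synthetic data, and the choice of Gaussian components are assumptions made for the example, not part of the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Unlabeled data: 300 two-dimensional points (synthetic, for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(150, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(150, 2))])

# Fit a mixture of two Gaussians: the weights w_i and the component
# densities p(y | w_i) are both estimated from the data (via EM).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.weights_)      # estimated mixture weights w_i
print(gmm.means_)        # estimated component means
labels = gmm.predict(X)  # each sample assigned to its most likely component
```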
Data clustering
The problem can be tackled from several points of view.
Geometry/topology: partition the pattern space such that the data belonging to each partition are highly homogeneous (i.e., similar to one another).
More directly related to classification: group (label) the data such that the average intra-group distance is minimized and the average inter-group distance is maximized (yet another optimization problem!).
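As a concrete reading of this optimization view, the sketch below computes the average intra-group and inter-group distances for a given labeling; the helper function and data layout are illustrative, not part of the original formulation.

```python
import numpy as np
from itertools import combinations

def intra_inter_distances(X, labels):
    """Average within-cluster and between-cluster Euclidean distances."""
    intra, inter = [], []
    for i, j in combinations(range(len(X)), 2):
        d = np.linalg.norm(X[i] - X[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return np.mean(intra), np.mean(inter)

# A good clustering keeps the first value small and the second large.
```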
Data clustering
Why data clustering?
Natural classification: degree of similarity among species.
Data exploration: discover underlying structure, generate hypotheses, detect anomalies.
Data compression: for organizing/indexing/storing/broadcasting data.
Applications: useful to any scientific field that collects data!
Relevance: 2340 papers about data clustering indexed in Scopus in 2014!
Data clustering: examples
800,000 scientific papers clustered into 776 topics based on how often the papers were cited together by the authors of other papers.
Data clustering
Given a set of N unlabeled examples $D = \{x_1, x_2, \ldots, x_N\}$ in a d-dimensional feature space, D is partitioned into a number of disjoint subsets $D_j$:
$D = \bigcup_{j=1}^{k} D_j$, with $D_i \cap D_j = \emptyset$ for $i \neq j$,
where the points in each subset are similar to each other according to a given criterion.
Data clustering
A partition is denoted by $p = (D_1, D_2, \ldots, D_k)$ and the problem of data clustering is thus formulated as
$p^* = \arg\min_{p} f(p)$
where $f(\cdot)$ is formulated according to the chosen similarity criterion.
Data clustering
A general optimization (minimization) algorithm for a classification function J(Y, W) (Y being the data set and W the ordered set of labels assigned to each sample) can be described as follows:
Choose an initial classification $W_0$
repeat
  (step i) Change the classification such that J decreases
until the classification is the same as in the previous step
If the variables were continuous, a gradient method could be used. However, here we are dealing with discrete variables.
Data clustering
A reasonable algorithm (based on the simplifying assumption that the optimization problem is separable, i.e., that the minimum of an n-dimensional function can be found by minimizing it along each dimension separately) would assign to each sample the label that yields the largest decrease $\Delta J$.
NB: since the problem is not separable, there is no guarantee that J decreases as the sum of the individual $\Delta J$'s; it may even increase!
A better but slower solution, which guarantees monotonicity, is to apply, in each step, only the single label change that yields the greatest decrease in J (see the sketch below).
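A minimal sketch of the slower, monotone variant: at each step only the single label change that most decreases J is applied. Here J is taken to be the within-cluster sum of squared distances (anticipating the k-means criterion), and the function names are illustrative; the brute-force search is for clarity, not efficiency.

```python
import numpy as np

def J(X, labels, k):
    """Within-cluster sum of squared distances to the cluster means."""
    total = 0.0
    for c in range(k):
        pts = X[labels == c]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def greedy_descent(X, labels, k):
    """Change one label at a time, always picking the move that lowers J most."""
    labels = labels.copy()
    while True:
        current = J(X, labels, k)
        best_delta, best_move = 0.0, None
        for i in range(len(X)):
            for c in range(k):
                if c == labels[i]:
                    continue
                trial = labels.copy()
                trial[i] = c
                delta = J(X, trial, k) - current
                if delta < best_delta:
                    best_delta, best_move = delta, (i, c)
        if best_move is None:      # no single change decreases J: stop
            return labels
        labels[best_move[0]] = best_move[1]
```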
K-means clustering
K-means clustering is obtained by choosing the Euclidean distance as the similarity criterion and
$J = \sum_{k=1}^{N_c} \sum_{y^{(i)} \sim \omega_k} \lVert y^{(i)} - m_k \rVert^2$
as the function to be optimized. $y^{(i)}$ is the i-th sample and $m_k$ is the centroid of the k-th cluster; $y^{(i)} \sim \omega_k$ refers to all samples $y^{(i)}$ assigned to cluster k.
Then $M = \{m_1, m_2, \ldots, m_{N_c}\}$ is the set of reference vectors, each of which represents the prototype for a class. J is minimized by choosing $m_k$ as the sample mean of the data having label $\omega_k$.
Unsupervised learning: K-means clustering
In practice, the algorithm partitions the input space S into a predefined number of subspaces, induced by the Euclidean distance. Each subspace $s_i$ of S is defined as:
$s_i = \{ x_j \in S : d(x_j, m_i) = \min_t d(x_j, m_t) \}$
This induces a so-called Voronoi tessellation of the input space (example limited to 2D patterns).
K-means clustering
Randomly initialize $M = \{m_1, m_2, \ldots, m_{N_c}\}$
repeat
  Classify the N samples according to the nearest $m_i$
  Recompute each $m_i$ (as the mean of the patterns assigned to cluster i)
until there is no change in M
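A compact NumPy sketch of the algorithm above. Initializing the reference vectors with randomly chosen samples is one common choice; the slides only say "randomly initialize".

```python
import numpy as np

def k_means(X, n_clusters, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly initialize the reference vectors m_1..m_Nc with distinct samples.
    m = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    while True:
        # Classify each sample according to the nearest m_i (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each m_i as the mean of the patterns assigned to cluster i
        # (an empty cluster keeps its previous center).
        new_m = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else m[i]
                          for i in range(n_clusters)])
        if np.allclose(new_m, m):  # stop when the centroids no longer change
            return labels, m
        m = new_m
```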
Improving on k-means
The main problem with k-means is the need to set the number of desired clusters a priori. A large number of algorithms have been proposed that overcome this limitation by determining an optimal number of clusters at runtime. The basic idea behind these algorithms is splitting and merging:
A cluster is split into two clusters when a measure of homogeneity falls below a certain threshold.
Two clusters are merged into one when a separation measure falls below a certain threshold.
Isodata
$N_D$ = approximate (desired) number of clusters
T = threshold for the (minimum) number of samples in a cluster
Set $N_c = N_D$
1. Cluster the data into $N_c$ clusters, eliminating data and clusters with fewer than T members and decreasing $N_c$ accordingly. Exit if the classification has not changed.
2. If $N_c \le N_D/2$ or ($N_c < 2 N_D$ and the iteration is odd):
   a. Split clusters whose samples are sufficiently spread out and increase $N_c$ accordingly.
   b. If any cluster has been split, go to 1.
3. Merge any pair of clusters whose samples are sufficiently close and/or overlapping and decrease $N_c$ accordingly.
4. Go to step 1.
Isodata (cluster computation)
1. Cluster the data into $N_c$ clusters, eliminating data and clusters with fewer than T members and decreasing $N_c$ accordingly. Exit if the classification has not changed.
For each cluster k the following quantities are computed ($N_k$ = number of samples in cluster k):
$d_k = \frac{1}{N_k} \sum_{y^{(i)} \sim \omega_k} \lVert y^{(i)} - m_k \rVert$ : average distance of the samples from the mean (centroid) of cluster k
$\sigma_k^2 = \max_j \left\{ \frac{1}{N_k} \sum_{y^{(i)} \sim \omega_k} \left( y^{(i)}_j - m_{kj} \right)^2 \right\}$ : largest variance along the coordinate axes
$\bar{d} = \frac{1}{N} \sum_{k=1}^{N_c} N_k d_k$ : overall average distance of the samples
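The same quantities written out in NumPy (a sketch; the variable names mirror the notation above, and empty clusters are assumed to have been eliminated already).

```python
import numpy as np

def cluster_statistics(X, labels, n_clusters):
    means = [X[labels == k].mean(axis=0) for k in range(n_clusters)]
    N_k = [int(np.sum(labels == k)) for k in range(n_clusters)]
    # d_k: average distance of the samples of cluster k from its centroid
    d_k = [np.linalg.norm(X[labels == k] - means[k], axis=1).mean()
           for k in range(n_clusters)]
    # sigma_k^2: largest per-coordinate variance within cluster k
    sigma2_k = [((X[labels == k] - means[k]) ** 2).mean(axis=0).max()
                for k in range(n_clusters)]
    # d_bar: overall average distance, weighted by cluster size
    d_bar = sum(N_k[k] * d_k[k] for k in range(n_clusters)) / len(X)
    return means, N_k, d_k, sigma2_k, d_bar
```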
Isodata (split)
$\sigma_s^2$ = spread threshold for splitting (no splitting below $\sigma_s^2$)
For k = 1, ..., $N_c$:
  If $\sigma_k^2 > \sigma_s^2$ and ($d_k > \bar{d}$ and ($N_k > 2T + 1$ or $N_c \le N_D/2$ or ($N_c < 2 N_D$ and the iteration is odd))): split cluster k and increase $N_c$ accordingly.
Splitting means replacing the original center with two new centers displaced slightly (usually by a fraction of $\sigma_m$) in opposite directions along the axis m of largest variance.
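A sketch of the split test and the center displacement for a single cluster k. The displacement fraction (here 0.5 of the standard deviation along the chosen axis) and the function name are assumptions for the example; the slides only say "a fraction of σ_m".

```python
import numpy as np

def maybe_split(center_k, X_k, sigma2_k, d_k, d_bar, N_k, Nc, N_D, T,
                sigma_s2, iteration, frac=0.5):
    """Return the new center(s) for cluster k: two if it is split, one otherwise."""
    too_few_clusters = Nc <= N_D / 2 or (Nc < 2 * N_D and iteration % 2 == 1)
    if sigma2_k > sigma_s2 and d_k > d_bar and (N_k > 2 * T + 1 or too_few_clusters):
        # Axis of largest variance within cluster k.
        axis = ((X_k - center_k) ** 2).mean(axis=0).argmax()
        delta = np.zeros_like(center_k, dtype=float)
        delta[axis] = frac * np.sqrt(sigma2_k)
        # Replace the original center with two centers displaced in
        # opposite directions along that axis.
        return [center_k + delta, center_k - delta]
    return [center_k]
```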
Isodata (merge)
$D_m$ = maximum center-to-center distance for merging
$N_{max}$ = maximum number of cluster pairs that can be merged (per iteration)
i. Compute $d_{ij} = \lVert m_i - m_j \rVert$ for all pairs of clusters i, j
ii. Sort the distances $d_{ij} < D_m$ in ascending order
For all sorted $d_{ij}$, as long as the number of merges is less than $N_{max}$: if neither cluster i nor cluster j has already been merged, merge clusters i and j and decrease $N_c$ accordingly.
The new center m of the merged cluster is computed as:
$m = \frac{N_i m_i + N_j m_j}{N_i + N_j}$
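A sketch of the merge step under the same conventions (cluster sizes are used as weights; each cluster may take part in at most one merge per iteration). The function name and data layout are illustrative.

```python
import numpy as np

def merge_step(means, counts, D_m, N_max):
    """Merge close cluster pairs; return the new lists of centers and sizes."""
    Nc = len(means)
    # All pairwise center distances below the merge threshold, sorted ascending.
    pairs = sorted((np.linalg.norm(means[i] - means[j]), i, j)
                   for i in range(Nc) for j in range(i + 1, Nc)
                   if np.linalg.norm(means[i] - means[j]) < D_m)
    merged, n_merges = set(), 0
    new_means, new_counts = list(means), list(counts)
    for _, i, j in pairs:
        if n_merges >= N_max:
            break
        if i in merged or j in merged:
            continue
        # Weighted mean of the two centers replaces cluster i; cluster j is dropped.
        new_means[i] = (counts[i] * means[i] + counts[j] * means[j]) / (counts[i] + counts[j])
        new_counts[i] = counts[i] + counts[j]
        new_means[j], new_counts[j] = None, 0
        merged.update((i, j))
        n_merges += 1
    keep = [k for k in range(Nc) if new_means[k] is not None]
    return [new_means[k] for k in keep], [new_counts[k] for k in keep]
```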