Unsupervised Learning


1 Unsupervised Learning

2 Unsupervised learning Until now, we have assumed that our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However, there are problems where the definition of the classes, and even their number, is unknown. Machine learning methods that deal with such data are said to be unsupervised. Questions: Why would one even be interested in learning from unlabeled samples? Is it even possible, in principle, to learn anything of value from unlabeled samples?

3 Why unsupervised learning 1. Collecting and labeling a large set of sample patterns can be surprisingly costly. E.g., videos are virtually free, but accurately labeling the video pixels is expensive and time-consuming. 2. To extend a small training set by semi-supervised learning: train a classifier on a small set of labeled samples, then tune it to run without supervision on a large, unlabeled set. Or, in the reverse direction, let a large set of unlabeled data group automatically, then label the groupings found. 3. To detect the gradual change of patterns over time. 4. To find features that will then be useful for categorization. 5. To gain insight into the nature or structure of the data during the early stages of an investigation.

4 Unsupervised learning: clustering In practice, unsupervised learning methods implement what is usually referred to as data clustering. Qualitatively and generally, the problem of data clustering can be defined as: Grouping of objects into meaningful categories Given a representation of N objects, find k clusters based on a measure of similarity.

5 Data clustering The problem can be tackled from several points of view: Statistics: represent the density function of all data as a mixture of a number of different distributions, p(y) = Σ_i w_i p(y | ω_i), and fit the set of weights w_i and component densities p(y | ω_i) to the given data.
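As a minimal illustration of this statistical view (not part of the original slides; the number of components and the synthetic data are assumptions for the example), a Gaussian mixture can be fit to unlabeled data with scikit-learn:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-D data drawn from three clouds (an assumption for the example)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

gm = GaussianMixture(n_components=3, random_state=0).fit(X)
print(gm.weights_)      # fitted mixture weights w_i
print(gm.means_)        # fitted component means
labels = gm.predict(X)  # hard assignment of each sample to a component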

6 Data clustering The problem can be tackled from several points of view: Geometry/topology: partition the pattern space such that data belonging to each partition are highly homogeneous (i.e., they are similar to one another). More directly related with classification: group (label) data such that the average intra-group distance is minimized and the average inter-group distance is maximized (yet another optimization problem!)

7 Data clustering Why data clustering? Natural Classification: degree of similarity among forms. Data exploration: discover underlying structure, generate hypotheses, detect anomalies. Compression: for organizing/indexing/storing/broadcasting data. Applications: can be used by any scientific field that collects data! Relevance: 2340 papers about data clustering indexed in Scopus in 2014!

8 Data clustering: examples

9 Data clustering Given a set of N unlabeled examples D = {x_1, x_2, ..., x_N} in a d-dimensional feature space, D is partitioned into a number of disjoint subsets D_j: D = ∪_{j=1..k} D_j, with D_i ∩ D_j = ∅ for i ≠ j, where the points in each subset are similar to each other according to a given criterion.

10 Data clustering A partition is denoted by π = (D_1, D_2, ..., D_k) and the problem of data clustering is thus formulated as π* = argmin_π f(π), where the criterion function f(·) is formulated according to the chosen measure of similarity.

11 Data clustering A general optimization (minimization) algorithm for a classification function J(Y, W) (Y being the dataset and W the ordered set of labels assigned to each sample) can be described as follows: choose an initial classification W_0; repeat (step i): change the classification such that J decreases; until the classification is the same as in the previous step. If the variables were continuous, a gradient method could be used.

12 Data clustering A reasonable algorithm (based on the simplifying assumption that the optimization problem is separable, i.e., that the minimum of an n-dimensional function can be found by minimizing it along each dimension separately) would assign to each sample the label that causes the largest decrease ΔJ. NB: since the problem is not separable, there is no guarantee that J decreases as the sum of the individual ΔJ's; it may even increase! A better but slower solution, which guarantees monotonicity, is to change, in each step, only the single label that causes the greatest decrease ΔJ (see the sketch below).
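A hedged sketch of the slower, monotonic variant: at each step, apply only the single label change that yields the largest decrease of J. Here J is taken to be the k-means-style sum of squared distances to the cluster means; the function names and the choice of J are assumptions for the illustration, and the brute-force re-evaluation of J is for clarity only, not efficiency.

import numpy as np

def J(X, labels, k):
    # Sum of squared distances of each sample to its cluster mean
    total = 0.0
    for c in range(k):
        members = X[labels == c]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def greedy_relabel(X, labels, k):
    # Apply the single best label change per step; stop at a local minimum.
    labels = labels.copy()
    while True:
        base = J(X, labels, k)
        best_delta, best_move = 0.0, None
        for i in range(len(X)):
            old = labels[i]
            for c in range(k):
                if c == old:
                    continue
                labels[i] = c                    # tentatively relabel sample i
                delta = J(X, labels, k) - base
                if delta < best_delta:           # strictly decreases J
                    best_delta, best_move = delta, (i, c)
            labels[i] = old                      # undo the tentative change
        if best_move is None:                    # no move decreases J
            return labels
        labels[best_move[0]] = best_move[1]      # commit the best move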

13 K-means clustering K-means clustering is obtained by choosing the Euclidean distance as the similarity criterion and J = Σ_{k=1..Nc} Σ_{y^(i)~ω_k} ||y^(i) − m_k||² as the function to be optimized, where y^(i) is the i-th sample, m_k is the center of cluster k, and y^(i) ~ ω_k denotes the samples y^(i) assigned to cluster k. J is minimized by choosing m_k as the sample mean of the data having label ω_k.

14 K-means clustering Randomly initialize the means {m_i}; repeat: classify the N samples according to the nearest m_i; recompute each m_i as the mean of the samples assigned to it; until there is no change in {m_i}.
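A minimal NumPy sketch of the loop above; k, the initialization scheme and the iteration cap are assumptions for the example, not part of the slides:

import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
    # Randomly initialize the means with k distinct samples
    m = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Classify each sample according to the nearest mean m_i
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each m_i as the mean of the samples assigned to it
        new_m = np.array([X[labels == c].mean(axis=0) if (labels == c).any()
                          else m[c] for c in range(k)])
        if np.allclose(new_m, m):   # no change in {m_i}: converged
            return new_m, labels
        m = new_m
    return m, labels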


24 Improving on k-means The main problem with k-means is the need to set the number of desired clusters a priori. A large number of algorithms have been proposed that overcome this limitation by determining an optimal number of clusters at runtime. The basic idea behind these algorithms is splitting and merging: a cluster is split into two clusters when a measure of homogeneity falls below a certain threshold; two clusters are merged into one when a separation measure falls below a certain threshold.

25 Isodata N_D = approximate (desired) number of clusters; T = threshold on the number of samples in a cluster. Set Nc = N_D. 1. Cluster the data into Nc clusters, eliminating data and clusters with fewer than T members and decreasing Nc accordingly. Exit if the classification has not changed. 2. If Nc ≤ N_D/2 or (Nc < 2 N_D and the iteration is odd): a. Split clusters whose samples are sufficiently spread out and increase Nc accordingly. b. If any clusters have been split, go to step 1. 3. Merge any pair of clusters whose samples are sufficiently close and/or overlapping and decrease Nc accordingly. 4. Go to step 1.

26 Isodata (cluster computation) 1. Cluster the data into Nc clusters, eliminating data and clusters with fewer than T members and decreasing Nc accordingly. Exit if the classification has not changed. For each cluster the following quantities are computed: d_k = (1/N_k) Σ_{y^(i)~ω_k} ||y^(i) − m_k|| (average distance of samples from the mean for cluster k); σ_k² = max_j (1/N_k) Σ_{y^(i)~ω_k} (y_j^(i) − m_kj)² (largest variance along the coordinate axes); d̄ = (1/N) Σ_{k=1..Nc} N_k d_k (overall average distance of samples); N_k = number of samples in cluster k.

27 Isodata (split) σ_s² = maximum spread parameter for splitting. For k = 1, ..., Nc: if σ_k² > σ_s² and (d_k > d̄ and (N_k > 2T + 1 or Nc ≤ N_D/2 or (Nc < 2 N_D and the iteration is odd))): split cluster k and increase Nc accordingly. Splitting means replacing the original center with two new centers displaced slightly (usually by a fraction of σ_m) in opposite directions along the axis m of largest variance.

28 Isodata (merge) D_m = maximum distance separation for merging; N_max = maximum number of cluster pairs that can be merged. For all pairs i, j = 1, ..., Nc, i ≠ j: i. compute d_ij = ||m_i − m_j||; ii. sort the d_ij < D_m in ascending order. For all sorted d_ij, if neither cluster i nor cluster j has already been merged, and while the number of merges < N_max: merge clusters i and j and decrease Nc accordingly. The center m of the merged cluster is computed as m = (N_i m_i + N_j m_j)/(N_i + N_j).
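A hedged sketch of this merge step (the function name, the list-based data layout and the bookkeeping are assumptions for the illustration, not the original ISODATA code):

import numpy as np
from itertools import combinations

def isodata_merge(m, N, D_m, N_max):
    # m: list of cluster centers (1-D NumPy arrays); N: list of sample counts
    m, N = list(m), list(N)
    # i. Compute all pairwise center distances d_ij
    pairs = [(np.linalg.norm(m[i] - m[j]), i, j)
             for i, j in combinations(range(len(m)), 2)]
    # ii. Keep only d_ij < D_m, sorted in ascending order
    pairs = sorted(p for p in pairs if p[0] < D_m)
    touched, dead, merges = set(), set(), 0
    for _, i, j in pairs:
        if merges >= N_max:
            break
        if i in touched or j in touched:   # neither cluster may merge twice
            continue
        # New center: count-weighted mean of the two old centers
        m[i] = (N[i] * m[i] + N[j] * m[j]) / (N[i] + N[j])
        N[i] = N[i] + N[j]
        touched |= {i, j}
        dead.add(j)
        merges += 1
    # Drop the absorbed clusters, decreasing Nc accordingly
    m = [m[k] for k in range(len(m)) if k not in dead]
    N = [N[k] for k in range(len(N)) if k not in dead]
    return m, N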

29 Data clustering questions 1. What is a cluster? 2. How to define pair-wise similarity? 3. Which features and normalization scheme? 4. How many clusters? 5. Which clustering method? 6. Are the discovered clusters and partition valid? 7. Does the data have any clustering tendency?

30 Similarity Compact Clusters Within-cluster distance < between-cluster distance Connected Clusters Within-cluster connectivity > between-cluster connectivity Ideal cluster: compact and isolated.

31 Representation There's no universal representation; representation is domain dependent.

32 Representation quality A good representation leads to compact and isolated clusters.

33 Feature relevance Two different meaningful groupings produced by different weighting schemes.

34 Number of clusters? The samples are generated by 6 independent classes, yet different choices of K produce different partitions (figure: ground truth, K=2, K=5, K=6).

35 Meaning/validity of clusters Clustering algorithms find clusters even if there are no natural clusters in the data (figure: uniformly distributed data points clustered by k-means with k=3).

36 Clustering methods: which is the best?

37 There is no best clustering algorithm! Each algorithm imposes a structure on the data; a good fit between model and data leads to success (figure: GMM with K=3, GMM with K=2, spectral clustering with K=3, spectral clustering with K=2).

38 References C. W. Therrien, Decision, Estimation, and Classification. J. Corso and A. Chen, Clustering / Unsupervised Methods. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988. Map of Science, Nature, 2006. D. Arthur and S. Vassilvitskii, k-means++: The Advantages of Careful Seeding. R. Dubes and A. K. Jain, Clustering Techniques: The User's Dilemma, Pattern Recognition, 1976.

39 Unsupervised Learning with Neural Networks

40 Clustering and Neural Networks Neural networks trained by an unsupervised learning algorithm can also be used for tasks such as: clustering; feature extraction; compression (pattern-space dimensionality reduction, vector quantization); etc.

41 Clustering and Neural Networks Regarding clustering, the Self-Organizing Maps theorized by Teuvo Kohonen in the late '80s, despite being modeled after some properties of the cerebral cortex, can be shown to be equivalent, under certain conditions, to the k-means algorithm. A SOM can also be seen as an implementation of vector quantization, and can lead to the definition of a supervised learning algorithm which is rather fast and extremely easy to implement.

42 Unsupervised learning Kohonen's Self-Organizing Map (SOM) Homunculus Biological Model Mappings (projections) of sensory stimuli onto specific nets of cortical neurons can be observed in the cerebral cortex. Sensory-motor neurons form a distorted map of the human body surface, as the extension of each region is proportional to the sensitivity of the corresponding body area, not to its size. However, adjacent parts of the cortex correspond to body areas which are also adjacent.

43 Unsupervised learning Kohonen's Self-Organizing Map (SOM) Lateral interactions between neurons: short-range excitatory action (of the order of mm); inhibitory action at intermediate distances (up to a few mm); weak long-range excitatory action (up to a few cm). The resulting interaction profile can be approximated by the Mexican hat function.

44 Unsupervised learning Kohonen's Self-Organizing Map (SOM) Kohonen SOMs are sensory maps made up of a single layer of neurons, each of which becomes specialized in responding to specific stimuli, such that: different types of inputs (stimuli) activate different neurons; neighboring neurons respond to similar stimuli.

45 Unsupervised learning Kohonen's Self-Organizing Map (SOM) A single layer of w·h neurons n_i, i = 1, ..., w·h (w = width, h = height). Each input X = {x_j, j = 1, ..., N} is connected to all neurons (therefore each neuron has N connections). Each connection is associated with a weight w_ij (i = neuron, j = input dimension). Each neuron's weight set is isomorphic with the input patterns. Activation function: f_i ∝ 1/d(W_i, X), where d is a distance. Lateral interactions between neurons exist, whose strength depends on the distance between neurons according to the Mexican-hat function.

46 Unsupervised learning Kohonen's Self-Organizing Map (SOM) When a pattern is input, each neuron's weights are modified: in an excitatory way, by an amount proportional to the value of its own and its neighbors' activation functions, and inversely proportional to their distance; in an inhibitory way, by an amount proportional to the value of the activation function of the neurons outside its neighborhood, and inversely proportional to their distance. This means that when, after the weight update, the same input is given to the net: strongly activated neurons and their neighbors will be activated even more intensely; weakly activated ones will be activated even less.

47 Unsupervised learning Kohonen's Self-Organizing Map (SOM) Remember that a neuron's activation is inversely proportional to the distance between its weights and the input pattern, and that the weight sets associated with neurons can be seen as points in the pattern space, being isomorphic with it. This implies that weights are updated such that, each time a pattern is input, weight sets that are close to the input pattern move even closer to it, and weight sets that are far from it move even farther away. If data well distributed over the input space are input in succession, each neuron's weight set tends to move to areas where data are more densely present, which means that the corresponding neuron specializes and tends to be activated by data belonging to a specific partition of the pattern space.

48 Unsupervised learning Kohonen's Self-Organizing Map (SOM) If we look at this process in the pattern space, this means that the weight set of each neuron tends to become the center of a data cluster. Also, due to the excitatory effect of the lateral connections, neighboring neurons tend to be activated by similar inputs. Thus, weight sets of neurons which are physically close in the map tend to be close in the pattern space. Because of this, the Self-Organizing Map (SOM) is also frequently called a Topology Preserving Map (TPM).

49 Unsupervised learning Kohonen's Self-Organizing Map (SOM) Once a SOM is trained, it can act as a classifier according to a minimum-distance criterion: L = argmin_i ||x − w_i||, L being a label (neuron index), x the input pattern and w_i the weight vector of neuron i. Thus, each input pattern is associated with the coordinates of the neuron onto which it is projected, i.e., the coordinates of the neuron with the highest activation (activation i being inversely proportional to the distance ||x − w_i||). In practice, the input space is projected onto the neuron layer, causing a dimensionality reduction of the data from N (input size) to m (size of the map), but the topological relationships among data are preserved.

50 Unsupervised learning Kohonen's Self-Organizing Map (SOM) In practice, a trained SOM partitions the input space S into as many subspaces as the number of neurons (as k-means does). Each subspace s_i of S is defined as: s_i = {x_j ∈ S : d(x_j, w_i) = min_t d(x_j, w_t)}. This induces a so-called Voronoi tessellation of the input space (example limited to 2D patterns); a minimal sketch of this assignment follows.
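An illustrative sketch (not from the slides) of assigning patterns to the Voronoi cell of the nearest weight vector; X and W are assumed to be NumPy arrays of shape (n_patterns, d) and (n_neurons, d) respectively:

import numpy as np

def voronoi_assign(X, W):
    # Distance of every pattern to every neuron's weight vector
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    # Each pattern falls in the cell s_i of the closest w_i
    return d.argmin(axis=1)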

51 Unsupervised learning Kohonen's Self-Organizing Map (SOM) For a more efficient computer implementation, the model is simplified: 1. Only the weights of the neighbors of the most activated neuron (a.k.a. the winning neuron; such algorithms are usually referred to as competitive learning algorithms) are updated, all by the same rule, since only the excitatory lateral interactions within a small neighborhood of the winning neuron are taken into account. 2. The update rule w_j(t+1) = w_j(t) + α(x_i − w_j(t)) does not depend on the distance from the winning neuron. NB: modifying weights in an excitatory way means making them more similar to the input (activation increases); modifying them in an inhibitory way means making them less similar to the input pattern (activation decreases).

52 Unsupervised learning Kohonen's Self-Organizing Map (SOM) The Mexican-hat function which models lateral interactions between neurons is approximated by a box function (a). The inhibitory long-range connections may also be modeled, applying the rule w_j(t+1) = w_j(t) − α(x_i − w_j(t)) within an external neighborhood (b). The neighborhood is usually square, but it can be of any shape (hexagonal ones are also common).

53 Unsupervised learning Kohonen's Self-Organizing Map (SOM) α = C (α = learning rate, C a small positive constant << 1). Given a training set X = {x_i, x_i = (x_i1, x_i2, ..., x_im), i = 1, ..., N}: initialize the weights with values compatible with the data (or randomly); then, for each training pattern x_i: 1. Find the winning neuron n_w. 2. Modify the weights of the winning neuron, as well as the weights of the neurons located within its neighborhood I(n_w) in the map, as follows: w_j(t+1) = w_j(t) + α(x_i − w_j(t)). 3. Update the learning rate: α(t+1) = α(t)(1 − γ) [γ a small positive constant << 1]. Repeat until the weights reach stable values or α = 0.
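A hedged NumPy sketch of this simplified training loop. The map size, the fixed square neighborhood radius, the epoch count and the decay constants C and gamma are all assumptions for the example:

import numpy as np

def train_som(X, width=10, height=10, C=0.5, gamma=1e-3, n_epochs=20,
              radius=1, rng=np.random.default_rng(0)):
    n_features = X.shape[1]
    # One weight vector per neuron, initialized within the data range
    W = rng.uniform(X.min(0), X.max(0), size=(height, width, n_features))
    alpha = C
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            # 1. Winning neuron: smallest distance between x and the weights
            d = np.linalg.norm(W - x, axis=2)
            wi, wj = np.unravel_index(d.argmin(), d.shape)
            # 2. Update the winner and its square neighborhood I(n_w)
            i0, i1 = max(0, wi - radius), min(height, wi + radius + 1)
            j0, j1 = max(0, wj - radius), min(width, wj + radius + 1)
            W[i0:i1, j0:j1] += alpha * (x - W[i0:i1, j0:j1])
            # 3. Decay the learning rate: alpha(t+1) = alpha(t) * (1 - gamma)
            alpha *= (1 - gamma)
            if alpha <= 0:
                return W
    return W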

54 Unsupervised learning Kohonen's Self-Organizing Map (SOM) For the sake of simplicity the neighborhood is usually square, but it can be of any shape (hexagonal ones are also common). Sometimes the inhibitory long-range connections are also modeled, applying the rule w_j(t+1) = w_j(t) − α(x_i − w_j(t)) within an external neighborhood.

55 Unsupervised learning Kohonen's Self-Organizing Map (SOM): summary Kohonen SOMs cluster data, i.e., they identify clouds of data which are densely distributed in specific regions, according to which the input space can be partitioned. As in the k-means algorithm, each partition can be represented by the centroid of the cloud: in a SOM it corresponds to the weights associated with a neuron of the map. Conversely, each weight vector associated with a neuron of a trained SOM is a centroid of a data cloud. It is possible to classify data a posteriori based on the partition of the input space to which they belong. Supposing we have a labeled data set, each partition induced by the SOM can be labeled after the class for which the corresponding neuron is most frequently the winning neuron.

56 Unsupervised learning Kohonen's Self-Organizing Map (SOM): summary In practice, to label the neurons of a SOM a posteriori according to a labeled data set X (stacked generalization/majority vote): label(i) = argmax(histogram(L, winner(X, w_i))), where X is the dataset represented as a matrix (each row x_i is a pattern); W are the SOM weights represented as a matrix (row w_i is the weight set of the i-th neuron); histogram(L, v) is the frequency histogram of vector v, whose elements may take L different values (from 1 to L); winner(X, w_i) returns a vector J having the same number of elements as the rows of X (i.e., the number of data in the dataset), where element J_i is the label of the winning neuron when x_i is input.
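A hedged sketch of this majority-vote labeling. W is assumed to be a (n_neurons, d) weight matrix (one row per neuron, flattened map), X a (n_samples, d) data matrix and y the non-negative integer class labels of X; all names are assumptions for the example:

import numpy as np

def label_neurons(W, X, y, n_classes):
    # winner(X, W): index of the winning (nearest) neuron for every sample
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    winners = d.argmin(axis=1)
    labels = np.zeros(len(W), dtype=int)
    for i in range(len(W)):
        # Frequency histogram of the classes that win on neuron i
        hist = np.bincount(y[winners == i], minlength=n_classes)
        labels[i] = hist.argmax()   # majority class (0 if neuron never wins)
    return labels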

57 Unsupervised learning Kohonen's Self-Organizing Map (SOM): observations With respect to k-means, a SOM adds an ordering of the centroids which preserves the topological properties of the input space, of which the SOM can be considered a lower-dimensional projection. The fact that neighboring neurons respond to similar (neighboring) patterns creates a lattice whose nodes are located where data are actually found and whose arcs never cross, as if a net were cast over the pattern space and its nodes were orderly anchored in the middle of regions where data are densely present. In a way, it creates a distorted re-projection of the data onto a lower-dimensional space where, however, data are (not strictly, but surely more) uniformly distributed.

58 Unsupervised learning Kohonen's Self-Organizing Map (SOM): observations It can be demonstrated (Bishop) that the algorithm for training a SOM with a neighborhood radius decreasing with time is equivalent to the k-means algorithm. The weight update equation can be derived as the (gradient-based) solution of the vector quantization problem: given K, find a set of K centroids (a codebook, in telecommunications terms; each centroid is then termed a codebook vector) that minimizes the squared error (actually, this holds for any exponent) made when each sample in a (large) dataset is replaced by the closest centroid.
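An illustrative sketch (not from the slides) of the quantization error that a codebook of K centroids incurs when each sample is replaced by the closest codebook vector; the function name and array shapes are assumptions:

import numpy as np

def quantization_error(X, codebook):
    # Distance of each sample to every codebook vector
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    # Mean squared error of replacing each sample with its nearest centroid
    return (d.min(axis=1) ** 2).mean()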
