Department of Computer Science, University of Helsinki
Autumn 2009, second term
Session 8, November 27th, 2009
Multiplicative Updates for L1-Regularized Linear and Logistic Regression

Last time I gave you the link http://cseweb.ucsd.edu/~saul/papers/ida07_mult.pdf to a paper presenting a new method for optimizing the weights in a regression model. I have learned that one of the authors has given a talk about this paper, which can be watched at http://videolectures.net/ida07_saul_mufl1/ Let's watch it and hope for some further insight into the proposed method and regression in general!
Finite Mixtures as a Graphical Model

This is what Finite Mixture Modelling looks like as a graphical network (cf. Bayesian Network). The associated probability of a vector x factorizes as
$$P_{FM}(X = x) = \sum_k P(z = k)\, p(X = x \mid z = k) = \sum_k P(k) \prod_i P(x_i \mid k).$$
The probability of a vector x belonging to cluster k becomes
$$P(k \mid x) = \frac{P(k) \prod_i P(x_i \mid k)}{\sum_{k'} P(k') \prod_i P(x_i \mid k')} \propto P(k) \prod_i P(x_i \mid k).$$
Gaussian Finite Mixtures

The root (clustering) variable is multinomial, taking values in, e.g., $\{1, \ldots, K\}$. The model contains a probability distribution over these values: $P(k) = P(z = k \mid \Theta) = \theta_k$. In the case of Gaussian Finite Mixtures, the conditional probability density of a feature $X_i$ given cluster k is modelled as a Gaussian $\mathcal{N}(\mu_{ki}, \sigma^2_{ki})$, and the data probability becomes
$$P(x) = \sum_k P(k)\, p(x \mid k, \Theta) = \sum_k P(k) \prod_i p(x_i \mid k, \mu_{ki}, \sigma^2_{ki}) = \sum_k \theta_k \prod_i \frac{1}{\sigma_{ki}\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_{ki})^2}{2\sigma^2_{ki}}\right).$$
Gaussian Finite Mixtures

Accordingly, the cluster membership probability can be written as
$$P(k \mid x) = \frac{P(k) \prod_i p(x_i \mid k)}{\sum_{k'} P(k') \prod_i p(x_i \mid k')} \propto \theta_k \prod_i \frac{1}{\sigma_{ki}\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu_{ki})^2}{2\sigma^2_{ki}}\right).$$
[Actually it is smarter to model all features simultaneously as a multivariate Gaussian, but that gets uglier. Let's assume the features to be independent of each other, which can be achieved by decoupling them first (whitening / Principal Component Analysis).]
Gaussian Finite Mixtures

The sufficient statistics (for the complete data) of the Gaussian FM are the cluster counts
$$h_k = \sum_j z_{jk},$$
where $z_{jk}$ is the indicator function of $z_j$ being k when $z_j$ is known and a measure of belief otherwise, as well as the per-cluster feature means
$$\bar{x}_{ki} = \frac{1}{h_k} \sum_j z_{jk} x_{ji}$$
and the sums of squared deviations from the mean,
$$S_{x_i x_i} = \sum_j z_{jk} (x_{ji} - \bar{x}_{ki})^2,$$
at the features with given cluster.
EM Algorithm

Recall the EM algorithm:
1 Initialization: set each $z_j$ to a random distribution over $\{1, \ldots, K\}$ and calculate the corresponding sufficient statistics
2 M-step: calculate the model parameters from the sufficient statistics
3 E-step: calculate the new, expected sufficient statistics given these parameters
4 Alternate M- and E-steps until convergence.
Under mild assumptions the (complete) data likelihood improves in each step, and thus convergence to a local optimum is guaranteed.
M-Step for Gaussian FM

In the M-step (maximization) we recalculate the model parameters:
1 $\theta_k \leftarrow \frac{h_k + 1}{N + K}$ (here: maximum a posteriori parameters with a uniform prior or, equivalently, one-step-look-ahead)
2 $\mu_{ki} \leftarrow \bar{x}_{ki}$
3 $\sigma^2_{ki} \leftarrow \frac{S_{x_i x_i}}{h_k - 1}$ (this is the unbiased estimate. We need this to avoid singular distributions; clusters of size 1 must be eliminated!)
E-Step for Gaussian FM

In the E-step (expectation) we recalculate the expected sufficient statistics:
1 $z_{jk} \leftarrow P(k \mid x_j, \Theta)$
2 $h_k$, $\bar{x}_{ki}$ and $S_{x_i x_i}$ as functions thereof
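The M- and E-steps above can be sketched in Python roughly as follows. This is my own minimal illustration, not course code: the function name, the Dirichlet initialization of the $z_j$, and the clamping of tiny variances (a crude stand-in for eliminating size-1 clusters outright) are all my assumptions.

```python
import numpy as np

def em_gaussian_fm(X, K, n_iter=100, seed=0):
    """EM for a finite mixture of axis-aligned Gaussians (diagonal covariances)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: each z_j is a random distribution over the K clusters
    Z = rng.dirichlet(np.ones(K), size=N)              # N x K responsibilities
    for _ in range(n_iter):
        # M-step: model parameters from the sufficient statistics
        h = Z.sum(axis=0)                              # cluster counts h_k
        theta = (h + 1) / (N + K)                      # MAP weights, uniform prior
        mu = (Z.T @ X) / h[:, None]                    # cluster means
        S = np.stack([(Z[:, k:k + 1] * (X - mu[k]) ** 2).sum(axis=0)
                      for k in range(K)])              # squared deviations
        sigma2 = S / np.maximum(h - 1, 1e-9)[:, None]  # "unbiased" estimate
        sigma2 = np.maximum(sigma2, 1e-9)              # guard against singularities
        # E-step: responsibilities z_jk = P(k | x_j, Theta)
        ll = np.empty((N, K))
        for k in range(K):
            ll[:, k] = np.log(theta[k]) + (
                -0.5 * np.log(2 * np.pi * sigma2[k])
                - (X - mu[k]) ** 2 / (2 * sigma2[k])).sum(axis=1)
        Z = np.exp(ll - ll.max(axis=1, keepdims=True))
        Z /= Z.sum(axis=1, keepdims=True)
    return theta, mu, sigma2, Z
```

Note that the variance clamp only papers over the singularity problem; the slides' actual advice is to eliminate clusters of size 1.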
Bayesian (Multinomial) Finite Mixtures

The Bayesian FM model again contains a probability distribution over the values of the clustering variable, $P(k) = P(z = k \mid \Theta) = \theta_k$, and in addition a cpt (conditional probability table) for the feature values given the cluster, $P(x_i = l \mid z = k, \Theta) = \theta_{kil}$. The probability of a vector x becomes
$$P(x) = \sum_k P(k) P(x \mid k, \Theta) = \sum_k P(k) \prod_i P(x_i \mid k, \Theta) = \sum_k \theta_k \prod_i \theta_{k i x_i}.$$
Bayesian Finite Mixtures

The cluster membership probability is then
$$P(k \mid x) = \frac{P(k) \prod_i P(x_i \mid k)}{\sum_{k'} P(k') \prod_i P(x_i \mid k')} \propto \theta_k \prod_i \theta_{k i x_i}.$$
The sufficient statistics are again the cluster counts
$$h_k = \sum_j z_{jk}$$
and now the expected value counts at the $X_i$ given k:
$$f_{kil} = \sum_{j:\, x_{ji} = l} z_{jk}.$$
M-Step for Bayesian FM

In the M-step (maximization) we recalculate the model parameters:
1 $\theta_k \leftarrow \frac{h_k + 1}{N + K}$ (MAP parameters, uniform prior)
2 $\theta_{kil} \leftarrow \frac{f_{kil} + 1}{h_k + K_i}$ (likewise; $K_i$ is the number of values of $X_i$)
E-Step for Bayesian FM

In the E-step (expectation) we recalculate the expected sufficient statistics:
1 $z_{jk} \leftarrow P(k \mid x_j, \Theta)$
2 $h_k$ and $f_{kil}$ as functions thereof
Choice of K

So how many clusters should we choose for the model? In some applications (e.g. visualization, batch learning) K may be predefined. If it is not, one good criterion is the complete data likelihood $P_{FM}(X^N, z^N \mid \Theta)$. Sometimes also: a constraint on the data homogeneity within each cluster, a manual sanity check, not more than what I want to find on my desk in the morning, etc.
Bayesian Batch FM

In batch learning, when Z is actually the class variable but only some of the labels are known, simply fix the $z_{jk}$ corresponding to known labels to the corresponding 0/1-distribution and learn only the rest. K is given by the number of classes. More elaborately: P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, Batch Classifications with Discrete Finite Mixtures. http://cosco.hiit.fi/articles/ecml98batch.ps.gz
Sometimes we have data that are not explicitly stated (in some convenient matrix format). We still want to cluster such data, i.e. split it into subgroups of items close to each other. For example:
1 generic data files
2 words in the English language
3 points on a manifold
4 probability densities
and so on
Example: Normalized Compression Distance

Take the example of files on your hard drive. Clearly, it is hard to describe them in matrix format. But we can define a distance measure between them, the Normalized Compression Distance (NCD). Let $C(x)$ denote the length of file x after it has been compressed by some (fixed) compressor (zip, gzip, bzip2, etc.). For a pair $(x, y)$ define
$$\mathrm{ncd}(x, y) := \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$
where $xy$ denotes the concatenation of x and y.
Example: Normalized Compression Distance

$$\mathrm{ncd}(x, y) := \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))} \;\in\; \left]0,\, 1 + \varepsilon\right]$$
NCD is motivated by (the uncomputable) Kolmogorov Complexity. It is not actually a metric, but it returns small values for similar documents and values close to one for very different files.
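As a quick sketch (my own, with zlib standing in for the fixed compressor C):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with zlib as the fixed compressor C."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Keep in mind that zlib only looks back 32 KB, so for large files a stronger compressor (bzip2, or a domain-specific one) gives more meaningful distances.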
Sammon Mapping

We can now map our items (e.g. files on disk) into the real space using Sammon Mapping (aka Sammon's Projection). Given distances $d(x_i, x_j)$ for each pair of items, minimize (traditionally via gradient descent)
$$\frac{1}{\sum_{i<j} d(x_i, x_j)} \sum_{i<j} \frac{\left(d(x_i, x_j) - \lVert pr(x_i) - pr(x_j) \rVert_2\right)^2}{d(x_i, x_j)}$$
where $pr(x)$ denotes the projection of x into the real space. We thus mimic the original metric with the Euclidean metric in the projection space by minimizing the sum of squares of the relative errors.
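A rough Python sketch of that gradient descent (my own illustration; the fixed step size and random initialization are arbitrary choices, and it assumes strictly positive pairwise distances; real implementations use Sammon's step-size heuristic instead):

```python
import numpy as np

def sammon(D, n_iter=400, lr=0.1, seed=0):
    """Sammon mapping of a distance matrix D into the plane by gradient descent."""
    D = np.asarray(D, float)
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(size=(n, 2))                     # random initial projection
    iu = np.triu_indices(n, 1)
    c = D[iu].sum()                                 # normalizer sum_{i<j} d_ij
    stresses = []
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]        # y_i - y_j
        delta = np.sqrt((diff ** 2).sum(axis=2))    # projected distances
        stresses.append(((D[iu] - delta[iu]) ** 2 / D[iu]).sum() / c)
        np.fill_diagonal(delta, 1.0)                # avoid division by zero
        Dsafe = D.copy()
        np.fill_diagonal(Dsafe, 1.0)
        # gradient of the stress; the diagonal is harmless since diff is 0 there
        W = (D - delta) / (Dsafe * delta)
        grad = (-2.0 / c) * (W[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y, stresses
```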
Example: points on a manifold

If we Sammon-projected strategic locations in Scandinavia into the real plane using their air-line distances from each other, coloured the mapping nicely and annotated it a little, we would get something like this:
Example: BayMiner/HS Vaalikone

The data mining company BayMiner has visualized the Helsingin Sanomat Vaalikone (election candidate selector) in the following way. For each question $X_i$ build a Naive Bayes model predicting the answer based on all other answers. For each candidate j you get 20 probability distributions $P(x_i \mid x_{-i})$. Define a distance metric
$$d(j, j') := \sum_i d_{KL}\!\left(P(x_{ji} \mid x_{j,-i}) \,\middle\|\, P(x_{j'i} \mid x_{j',-i})\right) + \sum_i d_{KL}\!\left(P(x_{j'i} \mid x_{j',-i}) \,\middle\|\, P(x_{ji} \mid x_{j,-i})\right)$$
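As a small illustration of the building block (my own sketch; it assumes strictly positive discrete distributions, since $d_{KL}$ is undefined when q has zeros where p does not):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between two discrete distributions.
    Assumes strictly positive probabilities."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float((p * np.log(p / q)).sum())

def sym_kl(p, q):
    """Symmetrized KL divergence, the per-question term in the distance above."""
    return kl(p, q) + kl(q, p)
```

Summing `sym_kl` over the 20 per-question distributions of two candidates gives their distance $d(j, j')$.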
Example: BayMiner/HS Vaalikone

Then Sammon map into the two-dimensional real space and rotate it such that the red guys (SDP, Vasemmistoliitto, SKP) are on the left and the blue (Kokoomus) and orange (RKP) are on the right. The larger, white cross represents (online) your own answers. (You're green!)
Self-Organizing Maps

Self-Organizing Maps, or Kohonen Maps (after their inventor Teuvo Kohonen), are another type of distance-based projection. We lay out one neuron for each data vector in a two- or three-dimensional grid. Iteratively, neurons adapt to their neighbours and the data points are re-allocated as a distribution according to their fit at each neuron. The algorithm converges to something where items (partially) belonging to neighbouring cells are similar.
Self-Organizing Maps: Example

Self-Organizing Map of colours based on their Euclidean distance in RGB space. Image: Timo Honkela.

We can now feed in a new colour and find out that it is probably kind of greenish. We would never have guessed. SOMs were hot some time before the wars. I forgot which wars, probably the clone wars.
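A toy SOM in the spirit of the colour example might look like this (my own sketch; the grid size, the linear decay schedules and the Gaussian neighbourhood are arbitrary choices, and it trains on one random data vector at a time rather than re-allocating whole distributions):

```python
import numpy as np

def train_som(data, grid=(8, 8), n_iter=3000, lr0=0.5, sigma0=2.5, seed=0):
    """A minimal Self-Organizing Map on a rectangular 2-D grid."""
    rng = np.random.default_rng(seed)
    gx, gy = grid
    W = rng.random((gx, gy, data.shape[1]))        # one weight vector per neuron
    coords = np.stack(np.meshgrid(np.arange(gx), np.arange(gy),
                                  indexing='ij'), axis=-1)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]          # pick a random data vector
        # best-matching unit: the neuron currently closest to x
        bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(axis=2)), (gx, gy))
        frac = 1.0 - t / n_iter                    # decay schedule
        lr = lr0 * frac
        sigma = sigma0 * frac + 0.3
        d2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
        h = np.exp(-d2 / (2 * sigma ** 2))         # Gaussian neighbourhood
        W += lr * h[:, :, None] * (x - W)          # pull neighbourhood towards x
    return W
```

Feeding RGB colours to `train_som` and rendering `W` as an image reproduces the kind of smooth colour map shown above.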
Hierarchical Clustering

We might also be interested in the relations of the clusters among each other. Why not group clusters of similar objects into groups of similar groups? This is called hierarchical clustering. We can obviously rerun any clustering algorithm on, say, the averages of each cluster. Most often (especially in distance-based clustering) we will simply group the two closest objects together, calculate common statistics that our distance measure can handle, and iterate.
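The merge-the-two-closest loop can be sketched as follows (my own illustration; I use average linkage as the "common statistics", which is one choice among several):

```python
import numpy as np

def agglomerate(D, labels):
    """Greedy agglomerative clustering: repeatedly merge the two closest
    clusters, measuring cluster distance by average linkage.
    Returns the resulting tree as nested tuples of labels."""
    D = np.asarray(D, float)
    clusters = [(lab, [i]) for i, lab in enumerate(labels)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage between cluster a and cluster b
                d = np.mean([D[i, j] for i in clusters[a][1]
                             for j in clusters[b][1]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]
```

On four points on a line at 0, 1, 10 and 11, this first merges the two close pairs and then joins them at the top, as one would expect of a dendrogram.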
Example: family tree of animals

This tree has been built using NCD on the DNA sequences of different animals, with a special DNA compressor. The two closest genomes were always linked together and replaced by their concatenation. Postprocessing: the animals were ordered (within the constraints of the tree) to make the picture look nice.
Example: numbers and colours as search terms

This quad tree was generated using the Normalized Google Distance, a special case of NCD that uses the number of hits Google returns to estimate a probability distribution (and thus a code length). The items here actually represent search terms. Image: Rudi Cilibrasi
To be continued...