Introduction to Machine Learning


Department of Computer Science, University of Helsinki. Autumn 2009, second term. Session 8, November 27th, 2009.


Multiplicative Updates for L1-Regularized Linear and Logistic Regression. Last time I gave you the link http://cseweb.ucsd.edu/~saul/papers/ida07_mult.pdf to a paper presenting a new method for optimizing the weights in a regression model. I have learned that one of the authors has given a talk about this paper, which can be watched at http://videolectures.net/ida07_saul_mufl1/ Let's see it and hope for some further insight into the proposed method and regression in general!


Finite Mixtures as a Graphical Model. This is what finite mixture modelling looks like as a graphical network (cf. Bayesian network). The associated probability of a vector $x$ breaks up into
$$P_{FM}(X = x) = \sum_k P(z = k)\, P(X = x \mid z = k) = \sum_k P(k) \prod_i P(x_i \mid k).$$
The probability of a vector $x$ belonging to cluster $k$ becomes
$$P(k \mid x) = \frac{P(k) \prod_i P(x_i \mid k)}{\sum_{k'} P(k') \prod_i P(x_i \mid k')} \propto P(k) \prod_i P(x_i \mid k).$$

Gaussian Finite Mixtures. The root (clustering) variable is multinomial, taking values in, e.g., $\{1, \ldots, K\}$. The model contains a probability distribution over these values: $P(k) = P(z = k \mid \Theta) = \theta_k$. In the case of Gaussian finite mixtures, the conditional probability density of a feature $X_i$ given cluster $k$ is modelled as a Gaussian $\mathcal{N}(\mu_{ki}, \sigma^2_{ki})$, and the data probability becomes
$$P(x) = \sum_k P(k)\, P(x \mid k, \Theta) = \sum_k P(k) \prod_i P(x_i \mid k, \mu_{ki}, \sigma^2_{ki}) = \sum_k \theta_k \prod_i \frac{1}{\sigma_{ki}\sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu_{ki})^2}{2\sigma^2_{ki}}\right).$$

Gaussian Finite Mixtures. Accordingly, the cluster membership probability can be written as
$$P(k \mid x) = \frac{P(k) \prod_i P(x_i \mid k)}{\sum_{k'} P(k') \prod_i P(x_i \mid k')} \propto \theta_k \prod_i \frac{1}{\sigma_{ki}\sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu_{ki})^2}{2\sigma^2_{ki}}\right).$$
[Actually it is smarter to model all features simultaneously as a multivariate Gaussian, but that gets uglier. Let's assume the features to be independent of each other, which can be achieved by decoupling them first (whitening / Principal Component Analysis).]
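To make the proportionality concrete, here is a minimal sketch (not from the note) that evaluates these membership probabilities for a diagonal Gaussian mixture; theta, mu and sigma2 are illustrative parameter names, and the computation is done in log space to avoid underflow.

```python
import numpy as np

def responsibilities(x, theta, sigma2, mu):
    """Cluster membership probabilities P(k | x) for one data vector x.

    x      : (d,)   feature vector
    theta  : (K,)   mixture weights theta_k
    mu     : (K, d) per-cluster, per-feature means mu_ki
    sigma2 : (K, d) per-cluster, per-feature variances sigma^2_ki
    """
    # log of theta_k * prod_i N(x_i | mu_ki, sigma^2_ki), one value per cluster
    log_joint = (np.log(theta)
                 - 0.5 * np.sum(np.log(2 * np.pi * sigma2)
                                + (x - mu) ** 2 / sigma2, axis=1))
    # normalize in log space: subtract the maximum before exponentiating
    log_joint -= log_joint.max()
    p = np.exp(log_joint)
    return p / p.sum()

# toy usage with K = 2 clusters and d = 2 independent features
print(responsibilities(np.array([0.1, -0.2]),
                       theta=np.array([0.6, 0.4]),
                       mu=np.array([[0.0, 0.0], [3.0, 3.0]]),
                       sigma2=np.array([[1.0, 1.0], [1.0, 1.0]])))
```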

Gaussian Finite Mixtures. The sufficient statistics (for the complete data) of a Gaussian FM are the cluster counts
$$h_k = \sum_j z_{jk},$$
where $z_{jk}$ is the indicator function of $z_j$ being $k$ when $z_j$ is known, and a measure of belief otherwise, as well as the mean
$$\bar{x}_{ki} = \frac{1}{h_k} \sum_j z_{jk}\, x_{ji}$$
and the sum of squared deviations from the mean
$$S_{x_i x_i} = \sum_j z_{jk} (x_{ji} - \bar{x}_{ki})^2$$
of the features within a given cluster.

EM Algorithm. Recall the EM algorithm:
1. Initialization: set each $z_j$ to a random distribution over $\{1, \ldots, K\}$ and calculate the corresponding sufficient statistics.
2. M-step: calculate the model parameters from the sufficient statistics.
3. E-step: calculate the new, expected sufficient statistics given these parameters.
4. Alternate M- and E-steps until convergence.
Under mild assumptions the (complete) data likelihood will improve in each step, and thus convergence to a local optimum is guaranteed.

M-Step for Gaussian FM. In the M-step (maximization) we recalculate the model parameters.
1. $\theta_k \leftarrow \frac{h_k + 1}{N + K}$ (here: maximum a posteriori parameters with uniform prior or, equivalently, one-step-look-ahead)
2. $\mu_{ki} \leftarrow \bar{x}_{ki}$
3. $\sigma^2_{ki} \leftarrow \frac{S_{x_i x_i}}{h_k - 1}$ (this is the unbiased estimate. We need this to avoid singular distributions. Clusters of size 1 must be eliminated!)

E-Step for Gaussian FM. In the E-step (expectation) we recalculate the expected sufficient statistics.
1. $z_{jk} \leftarrow P(k \mid x_j, \Theta)$
2. $h_k$, $\bar{x}_{ki}$ and $S_{x_i x_i}$ as functions thereof
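Putting the two steps together, here is a compact sketch (my own, not from the note) of the EM loop for this diagonal Gaussian mixture, with the MAP update for the mixture weights and the unbiased variance estimate from the M-step above; instead of eliminating clusters of size 1 it simply floors the variances, and X, K and the helper names are illustrative.

```python
import numpy as np

def em_gaussian_fm(X, K, n_iter=100, seed=0, var_floor=1e-6):
    """EM for a finite mixture of axis-aligned (diagonal) Gaussians."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: random soft assignments z_jk (rows sum to 1)
    Z = rng.random((N, K))
    Z /= Z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: parameters from the (expected) sufficient statistics
        h = Z.sum(axis=0)                                    # cluster counts h_k
        theta = (h + 1) / (N + K)                            # MAP with uniform prior
        mu = (Z.T @ X) / np.maximum(h, 1e-12)[:, None]       # means mu_ki
        S = np.einsum('jk,jki->ki', Z, (X[:, None, :] - mu[None]) ** 2)
        sigma2 = S / np.maximum(h - 1, 1e-12)[:, None]       # "unbiased" variances
        sigma2 = np.maximum(sigma2, var_floor)               # guard against collapse

        # E-step: responsibilities P(k | x_j, Theta), computed in log space
        log_r = (np.log(theta)[None, :]
                 - 0.5 * np.sum(np.log(2 * np.pi * sigma2)[None, :, :]
                                + (X[:, None, :] - mu[None]) ** 2 / sigma2[None],
                                axis=2))
        log_r -= log_r.max(axis=1, keepdims=True)
        Z = np.exp(log_r)
        Z /= Z.sum(axis=1, keepdims=True)

    return theta, mu, sigma2, Z
```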

Bayesian (Multinomial) Finite Mixtures. The Bayesian FM model again contains a probability distribution over the values of the clustering variable, $P(k) = P(z = k \mid \Theta) = \theta_k$, and in addition a CPT (conditional probability table) for the feature values given the cluster, $P(x_i = l \mid z = k) = P(x_i = l \mid z = k, \Theta) = \theta_{kil}$. The probability of a vector $x$ becomes
$$P(x) = \sum_k P(k)\, P(x \mid k, \Theta) = \sum_k P(k) \prod_i P(x_i \mid k, \Theta) = \sum_k \theta_k \prod_i \theta_{ki x_i}.$$

Bayesian Finite Mixtures. The cluster membership probability is then
$$P(k \mid x) = \frac{P(k) \prod_i P(x_i \mid k)}{\sum_{k'} P(k') \prod_i P(x_i \mid k')} \propto \theta_k \prod_i \theta_{ki x_i}.$$
The sufficient statistics are again the cluster counts
$$h_k = \sum_j z_{jk}$$
and now the expected value counts for $X_i$ given $k$,
$$f_{kil} = \sum_{j : x_{ji} = l} z_{jk}.$$

M-Step for Bayesian FM. In the M-step (maximization) we recalculate the model parameters.
1. $\theta_k \leftarrow \frac{h_k + 1}{N + K}$ (MAP parameters, uniform prior)
2. $\theta_{kil} \leftarrow \frac{f_{kil} + 1}{h_k + K_i}$ (likewise; here $K_i$ is the number of values $X_i$ can take)

E-Step for Bayesian FM. In the E-step (expectation) we recalculate the expected sufficient statistics.
1. $z_{jk} \leftarrow P(k \mid x_j, \Theta)$
2. $h_k$ and $f_{kil}$ as functions thereof
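Analogously to the Gaussian case, here is a compact sketch (mine, not from the note) of one M-step plus E-step for the multinomial mixture, using the MAP smoothing above; X is assumed to hold integer-coded categorical features, and all names are illustrative.

```python
import numpy as np

def em_step_multinomial_fm(X, Z, n_values):
    """One M-step + E-step for a Bayesian (multinomial) finite mixture.

    X        : (N, d) integer matrix, X[j, i] in {0, ..., n_values[i]-1}
    Z        : (N, K) current soft assignments z_jk
    n_values : list of length d, number of values K_i per feature
    """
    N, d = X.shape
    K = Z.shape[1]

    # M-step: MAP parameters with uniform prior
    h = Z.sum(axis=0)                                  # cluster counts h_k
    theta = (h + 1) / (N + K)                          # theta_k
    theta_kil = []                                     # one (K, K_i) table per feature
    for i in range(d):
        f = np.zeros((K, n_values[i]))                 # f_kil = sum over {j : x_ji = l} of z_jk
        np.add.at(f.T, X[:, i], Z)                     # accumulate expected value counts
        theta_kil.append((f + 1) / (h[:, None] + n_values[i]))

    # E-step: responsibilities proportional to theta_k * prod_i theta_{k,i,x_i}
    log_r = np.tile(np.log(theta), (N, 1))
    for i in range(d):
        log_r += np.log(theta_kil[i][:, X[:, i]]).T    # shape (N, K)
    log_r -= log_r.max(axis=1, keepdims=True)
    Z_new = np.exp(log_r)
    Z_new /= Z_new.sum(axis=1, keepdims=True)
    return theta, theta_kil, Z_new
```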

Choice of K. So how many clusters should we choose for the model? In some applications (e.g. visualization, batch learning) K may be predefined. If it is not, one good criterion is the complete data likelihood $P_{FM}(X^N, z^N \mid \Theta)$. Sometimes also: a constraint on the data homogeneity within each cluster, a manual sanity check, not more than what I want to find on my desk in the morning, etc.

Bayesian Batch FM. In batch learning, when Z is actually the class variable but only some of the labels are known, simply fix the $z_{jk}$ corresponding to known labels to the corresponding 0/1 distribution and learn only the rest. K is given by the number of classes. More elaborately: P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, Batch Classifications with Discrete Finite Mixtures. http://cosco.hiit.fi/articles/ecml98batch.ps.gz
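As a small illustration (my wording, hypothetical helper name), the only change to the EM loop is to overwrite, after every E-step, the rows of the soft-assignment matrix that belong to labelled examples:

```python
import numpy as np

def clamp_known_labels(Z, labels):
    """Fix z_jk for labelled rows to a 0/1 distribution; labels[j] = class index, or -1 if unknown."""
    Z = Z.copy()
    known = labels >= 0
    Z[known] = 0.0
    Z[known, labels[known]] = 1.0      # hard assignment for labelled examples
    return Z

# call this after every E-step, e.g.:
# Z = clamp_known_labels(Z_new, labels)
```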


Sometimes we have data that are not explicitly stated in some convenient matrix format. We still want to cluster these data, i.e. split them into subgroups of items close to each other. For example:
1. generic data files
2. words in the English language
3. points on a manifold
4. probability densities
and so on.

Example: Normalized Compression Distance. Take the example of files on your hard drive. Clearly, it is hard to describe them in matrix format. But we can define a distance measure between them, the Normalized Compression Distance (NCD). Let $C(x)$ denote the length of file $x$ after it has been compressed by some (fixed) compressor (zip, gzip, bzip2, etc.). For a pair $(x, y)$ define
$$\mathrm{ncd}(x, y) := \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}$$
where $xy$ denotes the concatenation of $x$ and $y$.

Example: Normalized Compression Distance.
$$\mathrm{ncd}(x, y) := \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))} \in \;]0,\, 1 + \varepsilon]$$
NCD is motivated by (the uncomputable) Kolmogorov complexity. It is not actually a metric, but it returns small values for similar documents and values close to one for very different files.
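A minimal sketch of this distance with Python's zlib as the fixed compressor (my choice; the note mentions zip, gzip and bzip2 as examples):

```python
import zlib

def C(data: bytes) -> int:
    """Length of data after compression by a fixed compressor (here: zlib)."""
    return len(zlib.compress(data, level=9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# similar strings get a small distance, unrelated ones a value close to 1
print(ncd(b"the quick brown fox" * 50, b"the quick brown fox!" * 50))
print(ncd(b"the quick brown fox" * 50, bytes(range(256)) * 4))
```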

Sammon Projection. We can now map our items (e.g. files on disk) into the real space using multidimensional scaling (a.k.a. Sammon's projection). Given distances $d(x_i, x_j)$ for each pair of items, minimize (traditionally via gradient descent)
$$\frac{1}{\sum_{i<j} d(x_i, x_j)} \sum_{i<j} \frac{\bigl(d(x_i, x_j) - \|\mathrm{pr}(x_i) - \mathrm{pr}(x_j)\|_2\bigr)^2}{d(x_i, x_j)}$$
where $\mathrm{pr}(x)$ denotes the projection of $x$ into the real space. We thus mimic the original metric with the Euclidean metric in the projection space by minimizing the sum of squares of the relative error.
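To make the procedure concrete, here is a minimal numpy sketch (mine, not from the note) of this minimization with plain gradient descent on the projected coordinates, using the analytic gradient of the stress; the function name and step-size parameters are illustrative.

```python
import numpy as np

def sammon(D, dim=2, n_iter=500, lr=0.1, seed=0):
    """Project items into R^dim by gradient descent on Sammon's stress.

    D : (n, n) symmetric matrix of pairwise distances d(x_i, x_j), zero diagonal.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    Y = rng.normal(scale=1e-2, size=(n, dim))      # pr(x_i), initialized randomly
    c = D[np.triu_indices(n, 1)].sum()             # normalizer sum_{i<j} d(x_i, x_j)
    Dz = D.copy()
    np.fill_diagonal(Dz, 1.0)                      # avoid division by zero on the diagonal

    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]       # pr(x_i) - pr(x_j)
        delta = np.sqrt((diff ** 2).sum(-1))       # projected distances
        np.fill_diagonal(delta, 1.0)
        # gradient of the stress with respect to each projected point
        w = (D - delta) / (Dz * delta)
        np.fill_diagonal(w, 0.0)
        grad = -(2.0 / c) * (w[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```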

Example: points on a manifold. If we Sammon-projected strategic locations in Scandinavia onto the real plane using their air-line distances from each other, coloured the mapping nicely and annotated it a little bit, we would get something like this:

Example: BayMiner/HS Vaalikone. The data mining company BayMiner has visualized the Helsingin Sanomat Vaalikone (election questionnaire) in the following way. For each question $X_i$ build a Naive Bayes model predicting the answer based on all other answers. For each candidate $j$ you get 20 probability distributions $P(x_i \mid x_{-i})$. Define a distance metric
$$d(j, j') := \sum_i d_{KL}\bigl(P(x_{ji} \mid x_{j,-i}) \,\|\, P(x_{j'i} \mid x_{j',-i})\bigr) + \sum_i d_{KL}\bigl(P(x_{j'i} \mid x_{j',-i}) \,\|\, P(x_{ji} \mid x_{j,-i})\bigr).$$
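The distance above is just a symmetrized KL divergence summed over the questions; a small sketch with illustrative names, treating each predictive distribution as a probability vector:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL divergence d_KL(p || q) for probability vectors p, q."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    return float(np.sum(p * np.log(p / q)))

def candidate_distance(P_j, P_jp):
    """Symmetrized, question-wise summed KL distance between two candidates.

    P_j, P_jp : lists of predictive distributions, one per question X_i.
    """
    return sum(kl(p, q) + kl(q, p) for p, q in zip(P_j, P_jp))
```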

Example: BayMiner/HS Vaalikone. Then Sammon-map into the two-dimensional real space and rotate so that the red guys (SDP, Vasemmistoliitto, SKP) are on the left and the blue (Kokoomus) and orange (RKP) are on the right. The larger, white cross represents (online) your own answers. (You're green!)

Self-Organizing Maps (SOM), or Kohonen maps (after their inventor Teuvo Kohonen), are another type of distance-based projection. We lay out one neuron for each data vector in a two- or three-dimensional grid. Iteratively, neurons adapt to their neighbours and the data points are re-allocated as a distribution according to their fit at each neuron. The algorithm converges to something where items (partially) belonging to neighbouring cells are similar.

Self-Organizing Maps: Example. We can now feed in a new colour and find out that it is probably kind of greenish. We would never have guessed. Self-organizing maps were hot some time before the wars. I forgot which wars, probably the clone wars. [Figure: Self-Organizing Map of colours based on their Euclidean distance in RGB space. Image: Timo Honkela]
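For reference, a minimal sketch of a classic online Kohonen SOM trained on random RGB colours, in the spirit of the colour map above; this is the standard hard best-matching-unit update rather than the soft re-allocation variant described on the previous slide, and all names are illustrative.

```python
import numpy as np

def train_color_som(colors, grid=(20, 20), n_iter=5000, lr0=0.5, sigma0=5.0, seed=0):
    """Train a 2-D Kohonen SOM on RGB vectors; returns a (gx, gy, 3) map of weights."""
    rng = np.random.default_rng(seed)
    gx, gy = grid
    W = rng.random((gx, gy, 3))                       # neuron weight vectors in RGB space
    # grid coordinates of each neuron, used for the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(gx), np.arange(gy), indexing="ij"), axis=-1)

    for t in range(n_iter):
        lr = lr0 * np.exp(-t / n_iter)                # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)          # shrinking neighbourhood radius
        x = colors[rng.integers(len(colors))]         # pick a random training colour
        # best-matching unit: the neuron whose weights are closest to x
        bmu = np.unravel_index(np.argmin(((W - x) ** 2).sum(-1)), (gx, gy))
        # Gaussian neighbourhood around the BMU on the grid
        d2 = ((coords - np.array(bmu)) ** 2).sum(-1)
        h = np.exp(-d2 / (2 * sigma ** 2))
        W += lr * h[..., None] * (x - W)              # pull the neighbourhood towards x
    return W

# usage: a map trained on random colours; nearby cells end up with similar colours
som = train_color_som(np.random.default_rng(1).random((500, 3)))
```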

We might also be interested in the relations of the clusters among each other. Why not group clusters of similar objects into groups of similar groups? This is called hierarchical clustering. We can obviously rerun any clustering algorithm on, say, the averages of each cluster. Most often (especially in distance-based clustering) we will simply group the two closest objects together, calculate common statistics that our distance measure can handle, and iterate.
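A minimal sketch (my own, with illustrative names) of this bottom-up procedure, summarising each merged cluster by its size-weighted centroid so that Euclidean distance between the summaries still makes sense:

```python
import numpy as np

def agglomerate(X, n_clusters):
    """Naive bottom-up clustering: merge the two closest clusters until n_clusters remain.

    Each cluster is summarised by (centroid, size, members); distance = Euclidean between centroids.
    """
    clusters = [(x.astype(float), 1, [j]) for j, x in enumerate(X)]
    merges = []
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest centroid distance
        best, bi, bj = np.inf, -1, -1
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(clusters[a][0] - clusters[b][0])
                if d < best:
                    best, bi, bj = d, a, b
        (ca, na, ma), (cb, nb, mb) = clusters[bi], clusters[bj]
        merged = ((na * ca + nb * cb) / (na + nb), na + nb, ma + mb)  # size-weighted centroid
        merges.append((ma, mb, best))
        clusters = [c for k, c in enumerate(clusters) if k not in (bi, bj)] + [merged]
    return clusters, merges
```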

Example: family tree of animals. This tree has been built using NCD on the DNA sequences of different animals, using a special DNA compressor. The two closest genomes were always linked together and replaced by their concatenation. Postprocessing: the animals were ordered (within the constraints of the tree) to make the picture look nice.

Example: numbers and colours as search terms. This quad tree was generated using the Normalized Google Distance, a special case of NCD that translates the number of hits Google returns into a probability distribution (and thus a code length). The items here actually represent search terms. Image: Rudi Cilibrasi

To be continued...