
Assignment 2: Unsupervised & Probabilistic Learning
Maneesh Sahani
Due: Monday Nov 5, 2018

Note: Assignments are due at 11:00 AM (the start of lecture) on the date above. The usual College late assignments policy will apply. You must make a credible attempt at the main questions to receive credit for the bonus ones.

1. [65 marks + 5 BONUS] EM for Binary Data. Consider the data set of binary (black and white) images used in the previous assignment. Each image is arranged into a vector of pixels by concatenating the columns of pixels in the image. The data set has $N$ images $\{x^{(1)}, \ldots, x^{(N)}\}$ and each image has $D$ pixels, where $D$ is (number of rows $\times$ number of columns) in the image. For example, image $x^{(n)}$ is a vector $(x^{(n)}_1, \ldots, x^{(n)}_D)$, where $x^{(n)}_d \in \{0, 1\}$ for all $n \in \{1, \ldots, N\}$ and $d \in \{1, \ldots, D\}$.

(a) Write down the likelihood for a model consisting of a mixture of $K$ multivariate Bernoulli distributions. Use the parameters $\pi_1, \ldots, \pi_K$ to denote the mixing proportions ($0 \le \pi_k \le 1$; $\sum_k \pi_k = 1$) and arrange the $K$ Bernoulli parameter vectors into a matrix $P$ with elements $p_{kd}$ denoting the probability that pixel $d$ takes value 1 under mixture component $k$. Assume the images are iid under the model, and that the pixels are independent of each other within each component distribution. [5 marks]

Just as we can for a mixture of Gaussians, we can formulate this mixture as a latent variable model, introducing a discrete hidden variable $s^{(n)} \in \{1, \ldots, K\}$ where $P(s^{(n)} = k \mid \pi) = \pi_k$.

(b) Write down the expression for the responsibility of mixture component $k$ for data vector $x^{(n)}$, i.e. $r_{nk} \equiv P(s^{(n)} = k \mid x^{(n)}, \pi, P)$. This computation provides the E-step for an EM algorithm. [5 marks]

(c) Find the maximizing parameters for the expected log-joint

$$\underset{\pi, P}{\operatorname{argmax}} \; \sum_n \left\langle \log P(x^{(n)}, s^{(n)} \mid \pi, P) \right\rangle_{q(\{s^{(n)}\})}$$

thus obtaining an iterative update for the parameters $\pi$ and $P$ in the M-step of EM. [10 marks]

(d) Implement the EM algorithm for a mixture of $K$ multivariate Bernoullis. Your code should take as input the number $K$, a matrix $X$ containing the data set, and a maximum number of iterations to run. The algorithm should run for that number of iterations or until the log likelihood converges (does not increase by more than a very small amount). Hand in clearly commented code and a description of how the code works. Run your algorithm on the data set for values of $K$ in $\{2, 3, 4, 7, 10\}$. Report the log likelihoods obtained and display the parameters found. [30 marks]

[Hints: Although the implementation may seem simple enough given the equations, there are many numerical pitfalls (common in much of probabilistic learning). A few suggestions:

- Likelihoods can be very small; it is often better to work with log-likelihoods.
- You may still encounter numerical issues computing responsibilities. Consider scaling the numerator and denominator of the equation by a suitable constant while still in the log domain (see the short sketch after these hints).
- It may also help to introduce (weak) priors on $P$ and $\pi$ and use EM to find the MAP estimates, rather than ML.
- If working in MATLAB, consider vectorizing your code, avoiding for loops where possible. This often [though not always] makes the code more readable and lets it run faster.
- It is always useful to plot the log-likelihood after every iteration. If it ever decreases (when implementing EM for ML), you have a bug! The same is true for the log-joint when implementing MAP.]
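To illustrate the log-domain scaling hint above, here is a minimal NumPy sketch (not the required hand-in code) of a numerically stable responsibility computation for a Bernoulli mixture; the names X, log_pi and P are purely illustrative.

    import numpy as np

    def log_responsibilities(X, log_pi, P, eps=1e-12):
        # X: (N, D) binary data; log_pi: (K,) log mixing proportions; P: (K, D) Bernoulli parameters.
        # Returns (N, K) log responsibilities and the total log-likelihood, working entirely in the log domain.
        log_joint = (log_pi[None, :]
                     + X @ np.log(P + eps).T
                     + (1 - X) @ np.log(1 - P + eps).T)          # log [ pi_k * P(x^(n) | k) ]
        m = log_joint.max(axis=1, keepdims=True)                  # per-row shift (log-sum-exp trick)
        log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
        return log_joint - log_norm, log_norm.sum()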

(e) Express the log-likelihoods obtained in bits and relate these numbers to the length of the naive encoding of these binary data. How does your number compare to gzip (or another compression algorithm)? Why the difference? [5 marks]

(f) Run the algorithm a few times starting from randomly chosen initial conditions. Do you obtain the same solutions (up to permutation)? Does this depend on $K$? Comment on how well the algorithm works, whether it finds good clusters (look at the cluster means and responsibilities and try to interpret them), and how you might improve the model. [10 marks]

(g) [BONUS] Consider the total cost of encoding both the model parameters and the data given the model. How does this total cost compare to gzip (or similar)? How does it depend on $K$? What might this tell you? [5 marks]

2. [20 marks] Modelling Data.

(a) Download the data file called geyser.txt from the course web site. This is a sequence of 295 consecutive measurements of two variables at the Old Faithful geyser in Yellowstone National Park: the duration of the current eruption in minutes (rounded to the nearest second), and the waiting time since the last eruption in minutes (to the nearest minute). Examine the data by plotting the variables within one time step (plot(geyser(:,1),geyser(:,2),'o');) and between time steps (e.g. plot(geyser(1:end-n,1),geyser(n+1:end,1 or 2),'o'); for various n). Discuss and justify, based on your observations, what kind of model might be most appropriate for this data set. Consider each of the models we have encountered in the course through week 3: a multivariate normal, a mixture of Gaussians, a Markov chain, a hidden Markov model, an observed stochastic linear dynamical system and a linear-Gaussian state-space model. Can you guess how many discrete states or (continuous) latent dimensions the model might have? [10 marks]

(b) Consider a data set consisting of the following string of 160 symbols from the alphabet {A, B, C}:

    AABBBACABBBACAAAAAAAAABBBACAAAAABACAAAAAABBBBACAAAAAAAAAAAABACABACAABBACAAABBBBA
    CAAABACAAAABACAABACAAABBACAAAABBBBACABBACAAAAAABACABACAAABACAABBBACAAAABACABBACA

Study this string and suggest the structure of an HMM that may have generated it. Specify the number of hidden states in the HMM, the transition matrix with any constraints and estimates of the transition probabilities, the output or emission matrix probabilities, and the initial state probabilities. You need to provide some description/justification of how you arrived at these numbers. We do not expect you to implement the Baum-Welch algorithm; you should be able to answer this question just by examining the sequence carefully. [10 marks]
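As an aid to examining the symbol string in part (b) (purely illustrative, not a required part of the answer), one might tabulate symbol frequencies, bigram counts and the lengths of the A/B stretches between successive Cs with a few lines of Python; the string below is copied from the question.

    from collections import Counter

    seq = ("AABBBACABBBACAAAAAAAAABBBACAAAAABACAAAAAABBBBACAAAAAAAAAAAABACABACAABBACAAABBBBA"
           "CAAABACAAAABACAABACAAABBACAAAABBBBACABBACAAAAAABACABACAAABACAABBBACAAAABACABBACA")

    print(Counter(seq))                                 # overall symbol frequencies
    print(Counter(zip(seq, seq[1:])))                   # bigram (pairwise transition) counts
    print(Counter(len(run) for run in seq.split("C")))  # lengths of the stretches between Cs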
3. [15 marks] Zero-temperature EM. In the automatic speech recognition community, HMMs are sometimes trained using the Viterbi algorithm in place of the forward-backward algorithm. In other words, in the E-step of EM (Baum-Welch), instead of computing the expected sufficient statistics from the posterior distribution over hidden states, $p(s_{1:T} \mid x_{1:T}, \theta)$, the sufficient statistics are computed using the single most probable hidden state sequence:

$$s^{*}_{1:T} = \underset{s_{1:T}}{\operatorname{argmax}} \; p(s_{1:T} \mid x_{1:T}, \theta).$$

(a) Is this algorithm guaranteed to converge (in the sense that the free energy, or some other quantity, reaches an asymptote)? To answer this you might want to consider the discussion of the real EM algorithm and what happens if we constrain $q(s)$ to put all its mass on one setting of the hidden variables. Support your arguments. [10 marks]

(b) If it converges, will it converge to a maximum of the likelihood? If it does not converge, what will happen? Support your arguments. [5 marks]

(c) [Bonus (just for culture)] Why do you think this question is labelled "Zero-temperature EM"? Hint: think about where temperature would appear in the free energy. [no marks]

BONUS QUESTION: attempt the questions above before answering this one.

4. [Bonus: 55 marks] LGSSMs, EM and SSID. Download the data files ssm_spins.txt and ssm_spins_test.txt. Both have been generated by an LGSSM:

$$y_1 \sim \mathcal{N}(0, I), \qquad y_t \sim \mathcal{N}(A y_{t-1}, Q) \quad [t = 2 \ldots T], \qquad x_t \sim \mathcal{N}(C y_t, R) \quad [t = 1 \ldots T]$$

using the parameters

$$A = 0.99 \begin{bmatrix} \cos(2\pi/180) & -\sin(2\pi/180) & 0 & 0 \\ \sin(2\pi/180) & \cos(2\pi/180) & 0 & 0 \\ 0 & 0 & \cos(2\pi/90) & -\sin(2\pi/90) \\ 0 & 0 & \sin(2\pi/90) & \cos(2\pi/90) \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0.5 & 0.5 & 0.5 & 0.5 \end{bmatrix},$$

$$Q = I - AA^{\top}, \qquad R = I,$$

but different random seeds. We shall use the first as a training data set and the second as a test set.

(a) Run the function ssm_kalman.m or ssm_kalman.py that we have provided (or a re-implementation in your favourite language if you prefer) on the training data. Warning (for the MATLAB version): the function expects data vectors in columns; you will need to transpose the loaded matrices! Make the following plots (or equivalent):

    logdet = @(A)(2*sum(log(diag(chol(A)))));
    [Y,V,~,L] = ssm_kalman(X,Y0,Q0,A,Q,C,R,'filt');
    plot(Y'); plot(cellfun(logdet,V));
    [Y,V,Vj,L] = ssm_kalman(X,Y0,Q0,A,Q,C,R,'smooth');
    plot(Y'); plot(cellfun(logdet,V));

Explain the behaviour of Y and V in both cases (and the differences between them). [5 marks]

(b) Write a function to learn the parameters A, Q, C and R using EM (we will assume that the distribution on the first state is known a priori). The M-step update equations for A and C were derived in lecture. You should show that the updates for R and Q can be written:

$$R_{\text{new}} = \frac{1}{T}\left[\,\sum_{t=1}^{T} x_t x_t^{\top} - \left(\sum_{t=1}^{T} x_t \langle y_t \rangle^{\top}\right) C_{\text{new}}^{\top}\right]$$

$$Q_{\text{new}} = \frac{1}{T-1}\left[\,\sum_{t=2}^{T} \langle y_t y_t^{\top} \rangle - \left(\sum_{t=2}^{T} \langle y_t y_{t-1}^{\top} \rangle\right) A_{\text{new}}^{\top}\right]$$

where $C_{\text{new}}$ and $A_{\text{new}}$ are as in the lecture notes. Store the log-likelihood at every step (easy to compute from the fourth value returned by ssm_kalman) and check that it increases. [Hint: the MATLAB code cellsum=@(c)(sum(cat(3,c{:}),3)) defines an inline function cellsum() that sums the entries of a cell array of matrices.]

Run at least 50 iterations of EM starting from a number of different initial conditions: (1) the generating parameters above (why does EM not terminate immediately?) and (2) 10 random choices. Show how the likelihood increases over the EM iterations (hand in a plot showing likelihood vs iteration for each run, plotted on the same set of axes). Explain the features of this plot that you think are salient. [If your code and/or computer is fast enough, try running more EM iterations. What happens?] [25 marks]
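For part (b), the sketch below shows one way the R and Q updates above might be assembled in NumPy from smoother output. The variable names are assumptions made for illustration (they do not necessarily correspond to the provided ssm_kalman interface): Y holds the smoothed means $\langle y_t \rangle$, V and Vj the smoothed and pairwise covariances, and Anew, Cnew the M-step updates derived in lecture.

    import numpy as np

    def m_step_R_Q(X, Y, V, Vj, Anew, Cnew):
        # X: (T, D) observations; Y: (T, K) smoothed means <y_t>;
        # V: list of T (K, K) smoothed covariances; Vj: list of T-1 cross-covariances cov(y_t, y_{t-1}), t = 2..T;
        # Anew, Cnew: dynamics and output matrices already updated in this M-step.
        T = X.shape[0]
        Sxx = X.T @ X                              # sum_t x_t x_t^T
        Sxy = X.T @ Y                              # sum_t x_t <y_t>^T
        Syy_tail = sum(V[1:]) + Y[1:].T @ Y[1:]    # sum_{t=2}^T <y_t y_t^T>
        Syy_lag = sum(Vj) + Y[1:].T @ Y[:-1]       # sum_{t=2}^T <y_t y_{t-1}^T>
        Rnew = (Sxx - Sxy @ Cnew.T) / T
        Qnew = (Syy_tail - Syy_lag @ Anew.T) / (T - 1)
        return Rnew, Qnew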

(c) Write a function to learn A, Q, C and R using Ho-Kalman SSID. You should provide options to specify the maximum lag to include in the future-past Hankel matrix (H), and the latent dimensionality. You may like to plot the singular value spectrum of H to see how good a clue it provides to the dimensionality.

[Hints: if M is a cell array of the lagged correlation matrices $M_\tau$, you can build the Hankel matrix using:

    H = cell2mat(M(hankel([1:maxlag], [maxlag:2*maxlag-1])));

The matrices $\Xi$ (Xi) and $\Upsilon$ (Ups) are found using the SVD (assuming K latent dimensions):

    [Xi,SV,Ups] = svds(H, K); Xi = Xi*sqrt(SV); Ups = sqrt(SV)*Ups';

and these can be used to find $\hat{C}$, $\hat{A}$ and $\Pi$ by regression (/ or \ in MATLAB), e.g.:

    Ahat = Xi(1:end-DD,:)\Xi(DD+1:end,:);

(The code for C and $\Pi$ will take a little more thought!) To find Q and R, recall that, as $\Pi$ is the stationary (prior) latent covariance, $A \Pi A^{\top} + Q = \Pi$ and $C \Pi C^{\top} + R = \mathbb{E}_t[x_t x_t^{\top}]$. A rough NumPy analogue of these hints is sketched after part (d).]

Run your code on the training data using a maximum lag of 10 and the true latent dimensionality. Evaluate the likelihood of this model using ssm_kalman.m and show it on the likelihood figure. Now use the SSID parameters as initial values for EM and show the resulting likelihood curve. Comment on the advantages and disadvantages of SSID as compared to EM. (If you have time and inclination, you might like to explore how your results vary with different choices of lag and dimensionality.) [20 marks]

(d) Evaluate the likelihood of the test data under the true parameters, and under all of the parameters found above (EM initialised at the true parameters, at random parameters and at the SSID parameters, as well as the SSID parameters without EM). Show these numbers on or next to the training-data likelihoods plotted above. Comment on the results. [5 marks]
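Relating to the SSID hints in part (c), here is a rough NumPy analogue of the MATLAB snippets, assuming zero-mean data and illustrative variable names; recovering C, $\Pi$, Q and R is left exactly as in the hints above.

    import numpy as np

    def hokalman_sketch(X, max_lag, K):
        # X: (T, D) observations (assumed zero-mean); max_lag: number of Hankel block rows/columns;
        # K: assumed latent dimensionality.
        T, D = X.shape

        # Lagged covariance estimates M_tau = (1/T) sum_t x_{t+tau} x_t^T, for tau = 1 .. 2*max_lag - 1
        M = [X[tau:].T @ X[:T - tau] / T for tau in range(1, 2 * max_lag)]

        # Block Hankel matrix: zero-based block (i, j) holds the lag-(i+j+1) covariance M[i+j]
        H = np.block([[M[i + j] for j in range(max_lag)] for i in range(max_lag)])

        # Rank-K factorisation H ~= Xi @ Ups via the SVD
        U, s, Vt = np.linalg.svd(H)
        Xi = U[:, :K] * np.sqrt(s[:K])           # stacks C, CA, CA^2, ...
        Ups = np.sqrt(s[:K])[:, None] * Vt[:K]

        # Shift-invariance of Xi gives A: solve Xi[:-D] @ A ~= Xi[D:] in the least-squares sense
        Ahat, *_ = np.linalg.lstsq(Xi[:-D], Xi[D:], rcond=None)
        return H, s, Xi, Ups, Ahat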