Machine Learning A 708.064 11W 1st KU Exercises

Problems marked with * are optional.

1 Conditional Independence I [2 P]

a) [1 P] Give an example for a probability distribution P(A, B, C) that disproves

P(A, B) = P(A)P(B)  ⇒  P(A, B | C) = P(A | C)P(B | C).

Illustrate the phenomenon of explaining away.

b) [1 P] Give an example for a probability distribution P(A, B, C) that disproves

P(A, B | C) = P(A | C)P(B | C)  ⇒  P(A, B) = P(A)P(B).

2 Conditional Independence II [2+1* P]

a) [2 P] Consider the two networks: For each of them, determine whether there can be any other Bayesian network that encodes the same independence assertions.

b) [1* P] Give an example for a probability distribution P so that all independence assertions that hold in P cannot be represented by any (single) Bayesian network.
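A quick numerical check can be useful for Exercise 1: store a candidate joint distribution P(A, B, C) as a 2x2x2 MATLAB array and test both independence statements directly. The snippet below is only an illustrative sketch, not part of the required solution; the random table is a placeholder for your own candidate distribution, and the element-wise operations on differently shaped arrays require MATLAB R2016b or newer.

% Candidate joint table: dimension 1 = A, dimension 2 = B, dimension 3 = C (all binary).
P = rand(2, 2, 2);  P = P / sum(P(:));   % placeholder; insert your candidate table here
Pab = sum(P, 3);                         % P(A,B), 2x2
Pa  = sum(Pab, 2);                       % P(A),   2x1
Pb  = sum(Pab, 1);                       % P(B),   1x2
d1  = Pab - Pa .* Pb;                    % zero everywhere iff A and B are independent
Pc  = sum(sum(P, 1), 2);                 % P(C),   1x1x2
Pac = sum(P, 2);                         % P(A,C), 2x1x2
Pbc = sum(P, 1);                         % P(B,C), 1x2x2
d2  = P .* Pc - Pac .* Pbc;              % zero everywhere iff A and B are independent given C
fprintf('A indep B: %d,  A indep B given C: %d\n', ...
        max(abs(d1(:))) < 1e-10, max(abs(d2(:))) < 1e-10);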

3 Bayesian Networks [3 P]

Figure 1: Three possible networks for the telescope problem.

Two astronomers in different parts of the world make measurements M1 and M2 of the number of stars N in some small region of the sky, using their telescopes. Normally, there is a small possibility e of error by up to one star in each direction. Each telescope can also (with a smaller probability f) be badly out of focus (events F1 and F2), in which case the scientists will undercount by three or more stars (or, if N is less than 3, fail to detect any stars at all). Consider the three networks illustrated in Figure 1.

a) [1 P] Which of these Bayesian networks are correct (but not necessarily efficient) representations of the preceding information?

b) [1 P] Which is the best network? Why?

c) [1 P] Write out a conditional distribution for P(M1 | N), for the case N ∈ {1, 2, 3} and M1 ∈ {0, 1, 2, 3, 4}. Each entry in the conditional distribution should be expressed as a function of the parameters e and/or f.

4 d-separation [4+1* P]

a) Prove or disprove:

1. [1 P] (X ⊥ Y, W | Z)  ⇒  (X ⊥ Y | Z).
2. [1 P] (X ⊥ Y | Z) & (X, Y ⊥ W | Z)  ⇒  (X ⊥ W | Z).
3. [1 P] (X ⊥ Y, W | Z) & (Y ⊥ W | Z)  ⇒  (X, W ⊥ Y | Z).
4. [1* P] (X ⊥ Y | Z) & (X ⊥ Y | W)  ⇒  (X ⊥ Y | Z, W).
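For hunting counterexamples in Exercise 4a it can help to test candidate statements numerically on small discrete joint tables. The helper below is only a sketch: the function name cond_indep and its interface are made up for illustration, and the element-wise operations rely on implicit expansion (MATLAB R2016b or newer). Variables are identified with array dimensions, and X ⊥ Y | Z is checked entry-wise via P(X,Y,Z) P(Z) = P(X,Z) P(Y,Z).

function tf = cond_indep(P, X, Y, Z, tol)
% P:       joint probability table, one array dimension per variable
% X, Y, Z: row vectors of dimension indices (Z may be empty)
% tf:      true if X is (numerically) independent of Y given Z
if nargin < 5, tol = 1e-10; end
rest = setdiff(1:ndims(P), [X Y Z]);
Pxyz = P;    for d = rest, Pxyz = sum(Pxyz, d); end   % P(X, Y, Z)
Pxz  = Pxyz; for d = Y,    Pxz  = sum(Pxz,  d); end   % P(X, Z)
Pyz  = Pxyz; for d = X,    Pyz  = sum(Pyz,  d); end   % P(Y, Z)
Pz   = Pxz;  for d = X,    Pz   = sum(Pz,   d); end   % P(Z)
delta = Pxyz .* Pz - Pxz .* Pyz;   % zero everywhere iff the statement holds
tf = max(abs(delta(:))) < tol;
end

For example, with a four-variable joint table whose dimensions are ordered (X, Y, W, Z), the premise of statement 1 would be tested as cond_indep(P, 1, [2 3], 4) and its conclusion as cond_indep(P, 1, 2, 4).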

Figure 2: Bayesian network.

b) [1 P] Apply d-separation to determine whether the following conditional independence statements hold for the Bayesian network illustrated in Fig. 2.

1. (X3 ⊥ X7)
2. (X3 ⊥ X7 | X4)
3. (X3 ⊥ X5)
4. (X3 ⊥ X5 | X7)

5 Inference in Factor Graphs [2 P]

Consider a tree-structured factor graph in which a given subset of the variable nodes forms a connected subgraph (i.e., any variable node of the subset is connected to at least one of the other variable nodes via a single factor node). Show how the sum-product algorithm can be used to compute the marginal distribution over that subset.

6 Factor graphs: HMM model [5 P]

Implement the sum-product algorithm for factor graphs in MATLAB to infer the hidden states of an HMM for the following problem. An agent is located randomly in a 10 × 10 grid world. In each of T = 20 time steps it either stays at its current location with a probability of 25% or
moves up, down, left or right, where each of these four actions is chosen with a probability of 18.75%. Each cell in the grid world is painted with a certain color that is chosen randomly from a set of k possible colors, as shown in Fig. 3. This color is observed by the agent at each time step and reported to us. Only these color values are available to infer the initial location and the subsequent positions of the agent, resulting in the Hidden Markov model (HMM) illustrated in Fig. 4.

Figure 3: Grid world.

Figure 4: Hidden Markov model (HMM).

Modify the file hmmmessagepassingtask.m available for download on the course homepage (http://www.igi.tugraz.at/lehre/intern/mla WS1112 HW6.zip) and implement the sum-product algorithm to solve this inference problem. Investigate and discuss the dependence of the solution on the number of different colors in the grid world (k). Hand in figures of representative results that show the actual agent trajectories, the most likely trajectories and the probabilities of the agent positions at each time step as inferred by the sum-product algorithm. Present your results clearly, in a structured and legible form. Document them in such a way that anybody can reproduce them effortlessly. Hand in printouts of your MATLAB code.
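On a chain, the sum-product algorithm reduces to the forward-backward recursions for a discrete HMM. The following is only a rough sketch of these recursions under illustrative assumptions (uniform initial distribution, move probabilities renormalized at the walls, deterministic color emissions, made-up variable names); it does not follow the interface of hmmmessagepassingtask.m, and the broadcasting used at the end requires MATLAB R2016b or newer.

n = 10; S = n*n; K = 4; T = 20;
colors = randi(K, n, n);                         % random cell colors
% transition model: stay with 0.25, move N/S/E/W with 0.1875 each (renormalized at walls)
A = zeros(S, S);
for r = 1:n
  for c = 1:n
    s = sub2ind([n n], r, c);
    A(s, s) = 0.25;
    nbr = [r-1 c; r+1 c; r c-1; r c+1];
    for m = 1:4
      if all(nbr(m, :) >= 1) && all(nbr(m, :) <= n)
        A(s, sub2ind([n n], nbr(m, 1), nbr(m, 2))) = 0.1875;
      end
    end
    A(s, :) = A(s, :) / sum(A(s, :));            % assumed boundary handling
  end
end
B = zeros(S, K);                                 % emission: the observed color is the cell color
B(sub2ind([S K], (1:S)', colors(:))) = 1;
% simulate a trajectory and its color observations
x = zeros(1, T);  x(1) = randi(S);
for t = 2:T, x(t) = find(cumsum(A(x(t-1), :)) >= rand, 1); end
obs = colors(x);
% forward-backward with per-step normalization
alpha = zeros(S, T);  beta = ones(S, T);
alpha(:, 1) = B(:, obs(1)) / S;                  % uniform prior over the initial cell
alpha(:, 1) = alpha(:, 1) / sum(alpha(:, 1));
for t = 2:T
  alpha(:, t) = B(:, obs(t)) .* (A' * alpha(:, t-1));   % forward message
  alpha(:, t) = alpha(:, t) / sum(alpha(:, t));
end
for t = T-1:-1:1
  beta(:, t) = A * (B(:, obs(t+1)) .* beta(:, t+1));    % backward message
  beta(:, t) = beta(:, t) / sum(beta(:, t));
end
gamma = alpha .* beta;
gamma = gamma ./ sum(gamma, 1);                  % posterior over the position at each time step
[~, map_state] = max(gamma, [], 1);              % most probable cell at each time step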

7 Message passing in Gaussian Bayesian networks [3 P]

Consider the Bayesian network illustrated in Fig. 5 with the probabilities

P(x_0) = N(x_0 | a_0, Q_0)                            (1)
P(x_{t+1} | x_t) = N(x_{t+1} | A_t x_t + a_t, Q_t)    (2)
P(y_t | x_t) = N(y_t | D_t x_t + d_t, C_t)            (3)

where N(x | µ, Σ) denotes the multivariate Gaussian distribution with mean µ and covariance Σ.

Figure 5: Gaussian Bayesian network.

Prove that the forward messages again have the form of a Gaussian distribution

α(x_t) = N(x_t | S_t^{-1} s_t, S_t^{-1})              (4)

with mean S_t^{-1} s_t and covariance S_t^{-1} as defined on the slides for the exercise hour (slide 23). (Hint: Use the three properties of Gaussian distributions discussed in the exercise hour.)
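A possible starting point for the proof is the forward recursion on the chain (a sketch only; whether the observation factor is absorbed into the forward message should be checked against the convention on slide 23):

\[
\alpha(x_{t+1}) \;\propto\; P(y_{t+1} \mid x_{t+1}) \int \alpha(x_t)\, P(x_{t+1} \mid x_t)\, \mathrm{d}x_t .
\]

The inductive step then consists of showing that propagating a Gaussian message through the linear-Gaussian transition (2) and multiplying by the Gaussian observation factor (3) again yields a Gaussian in x_{t+1}, which is where the properties from the hint enter.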

8 Factor graphs: Robot Localization [5 P]

Implement the sum-product algorithm for factor graphs in MATLAB for the Gaussian Bayesian network illustrated in Fig. 5 with the probabilities given in Eqs. 1-3.

Figure 6: Robot control task.

Modify the files maintemplate.m, kalmanfilter.m and kalmansmoother.m available for download on the course homepage (http://www.igi.tugraz.at/lehre/intern/mla WS1112 HW8.zip) and implement a) the Kalman filter algorithm and b) the Kalman smoother algorithm. The task is to infer the trajectory of a robot that starts at position x_0 and moves according to the linear transition model

x_{t+1} = A x_t + B u_t + N(0, Q)                     (5)

for given actions u_t from the observations (green dots)

y_t = D x_t + d + N(0, C)                             (6)

as illustrated in Fig. 6. The environment and the parameters for the linear state transition and observation model are already provided in the file maintemplate.m. Investigate and discuss the results for both algorithms. Hand in figures of representative results that show the actual robot trajectories and the most likely robot trajectories for both algorithms as inferred by the sum-product algorithm. Present your results clearly, in a structured and legible form. Document them in such a way that anybody can reproduce them effortlessly. Hand in printouts of your MATLAB code.
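A minimal Kalman filter sketch for the model in Eqs. (5)-(6) is given below. All parameter values, the action sequence and the prior over the initial state are illustrative placeholders, not those provided in maintemplate.m, and the smoother (kalmansmoother.m) is not shown.

T = 50;
A = eye(2);  B = eye(2);  Q = 0.01 * eye(2);    % transition parameters (placeholders)
D = eye(2);  d = [0; 0];  C = 0.1 * eye(2);     % observation parameters (placeholders)
u = 0.1 * [cos(0.1 * (1:T)); sin(0.1 * (1:T))]; % assumed action sequence, 2 x T
x = zeros(2, T);  y = zeros(2, T);              % simulate the true trajectory and observations
for t = 1:T
  if t > 1
    x(:, t) = A * x(:, t-1) + B * u(:, t-1) + chol(Q)' * randn(2, 1);
  end
  y(:, t) = D * x(:, t) + d + chol(C)' * randn(2, 1);
end
mu = zeros(2, T);  Sig = zeros(2, 2, T);        % filtered means and covariances
mu_pred = zeros(2, 1);  Sig_pred = eye(2);      % assumed prior over x_1
for t = 1:T
  K = Sig_pred * D' / (D * Sig_pred * D' + C);  % Kalman gain (measurement update)
  mu(:, t) = mu_pred + K * (y(:, t) - D * mu_pred - d);
  Sig(:, :, t) = (eye(2) - K * D) * Sig_pred;
  if t < T                                      % prediction for the next time step
    mu_pred = A * mu(:, t) + B * u(:, t);
    Sig_pred = A * Sig(:, :, t) * A' + Q;
  end
end
plot(x(1, :), x(2, :), 'k-', y(1, :), y(2, :), 'g.', mu(1, :), mu(2, :), 'b-');
legend('true trajectory', 'observations', 'filtered mean');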

9 Beta distribution [2* P]

Show that the mean, variance, and mode of the beta distribution are given respectively by

E[µ] = a / (a + b)
var[µ] = ab / ((a + b)^2 (a + b + 1))
mode[µ] = (a - 1) / (a + b - 2).

10 EM Algorithm Applet [3 P]

Go to Olivier Michel's applet "Gaussian Mixture Model EM Algorithm" at http://lcn.epfl.ch/tutorial/english/gaussian/html/index.html and answer questions 1-6 [1 P]. Additionally do the following:

a) [1 P] Create points from two intersecting lines and run the algorithm with option LineMix and 2 clusters. Click repeatedly on EM 1 Step to run the algorithm. Repeat this experiment with different cutting angles of the lines and analyze whether this has an effect on the number of iterations until convergence.

b) [1 P] Create points in a ring around the center with the button RingPts. Try to fit a mixture of lines model (with EM Run) to this distribution and analyze what happens when you change the number of lines. Do the same experiment with mixtures of Gaussians and describe what you find.

11 EM Algorithm for Mixtures of Lines [3 P]

Assume that the training examples x_n ∈ R^2 with n = 1, ..., N were generated from a mixture of K lines

P(x_{n,2} | z_{n,k} = 1) = N(x_{n,2} | θ_{k,1} x_{n,1} + θ_{k,2}, σ_k)    (7)

where

N(x | µ, σ) = 1 / (√(2π) σ) · exp(−(x − µ)^2 / (2σ^2))    (8)

and the hidden variable z_{n,k} = 1 if x_n is generated from line k and 0 otherwise. Derive the update equations for the M-step of the EM algorithm for the variables θ_k and σ_k.

12 Clustering: Face recognition [3 P]

In this homework example you should cluster images derived from the Olivetti face database with the clustering algorithms k-means and k-medoids. You can download the dataset and its description from the course homepage (http://www.igi.tugraz.at/lehre/intern/mla WS1112 HW12.zip). Normalize the pixels in each 50 by 50 image (each column of the design matrix) to have mean 0 and variance 0.1. Apply k-means and k-medoids to the dataset. For k-means use the MATLAB function kmeans. For k-medoids you have to write your own MATLAB function. For both algorithms use the squared Euclidean distance. Analyze the dependence of the average squared error on the number of clusters. Analyze for k ∈ {2, 5, 10, 30} the cluster centers and the outliers (points with the largest squared errors) for both algorithms. Explain the differences in the results.

13 Clustering: Image compression [3 P]

Apply the k-means algorithm for lossy image compression by means of vector quantization. Download the 512 × 512 image mandrill.tif (http://www.igi.tugraz.at/lehre/intern/mla WS1112 HW13.zip). Each pixel represents a point in a three-dimensional (r, g, b) color space. Each color dimension encodes the corresponding intensity with an 8 bit integer. Cluster the pixels in color space using k ∈ {2, 4, 8, 16, 32, 64, 128} clusters and replace the original color values with the indices of the closest cluster centers. Determine the compression factor for each value of k and relate it to the quality of the image. Apply an appropriate quality measure of your choice.
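For Exercise 13 the compression step is plain vector quantization in color space. The sketch below is illustrative only: it assumes the Statistics Toolbox function kmeans, assumes that mandrill.tif decodes to an RGB array, and uses one possible bit-counting convention for the compression factor that is not prescribed by the exercise.

img = double(imread('mandrill.tif')) / 255;      % assumes the file decodes to an RGB array
X = reshape(img, [], 3);                         % one row per pixel in (r,g,b) space
k = 16;                                          % try k in {2, 4, 8, 16, 32, 64, 128}
[idx, centers] = kmeans(X, k, 'MaxIter', 200, 'Replicates', 3);
compressed = reshape(centers(idx, :), size(img));
% rough compression factor: 24 bits per pixel originally vs. log2(k) bits per
% pixel for the index image plus the codebook (k centers, 3 channels, 8 bits each)
n_pixels = size(X, 1);
factor = (24 * n_pixels) / (ceil(log2(k)) * n_pixels + k * 3 * 8);
fprintf('k = %d, approximate compression factor %.1f\n', k, factor);
figure;
subplot(1, 2, 1); image(img);        axis image off; title('original');
subplot(1, 2, 2); image(compressed); axis image off; title(sprintf('k = %d', k));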

14 Mixture distributions [2* P]

Consider a density model given by a mixture distribution

p(x) = Σ_{k=1}^{K} π_k p(x | k)    (9)

and suppose that we partition the vector x into two parts so that x = (x_a, x_b). Show that the conditional density p(x_b | x_a) is itself a mixture distribution and find expressions for the mixing coefficients and for the component densities.

15 Parameter Learning for Naive Bayes Classifiers [3* P]

Implement an algorithm for learning a naive Bayes classifier and apply it to a spam email data set. You are required to use MATLAB for this assignment. The spam dataset is available for download on the course homepage (http://www.igi.tugraz.at/lehre/intern/mla WS1112 HW15.zip).

a) [1/2* P] Write a function called nbayes learn.m that takes a training dataset for a binary classification task with binary attributes and returns the posterior Beta distributions of all model parameters (specified by variables a_i and b_i for the i-th model parameter) of a naive Bayes classifier, given a prior Beta distribution for each of the model parameters (specified by variables a_i and b_i for the i-th model parameter).

b) [1/2* P] Write a function called nbayes predict.m that takes a set of test data vectors and returns the most likely class label prediction for each input vector based on the posterior parameter distributions obtained in a).

c) [1* P] Use both functions to conduct the following experiment. For your assignment you will be working with a data set that was created a few years ago at the Hewlett Packard Research Labs as a testbed data set for testing different spam email classification algorithms.

1. Verify the naive Bayes assumption for all pairs of input attributes.

2. Train a naive Bayes model on the first 2500 samples (using Laplace/uniform prior distributions) and report the classification error of the trained model on a test data set consisting of the remaining examples that were not used for training.

3. Repeat the previous step, now training on the first {10, 50, 100, 200, ..., 500} samples, and again testing on the same test data as used in point 2 (samples 2501 through 4601). Report the classification error on the test dataset as a function of the number of training examples. Hand in a plot of this function.

4. Comment on how accurate the classifier would be if it randomly guessed a class label or always picked the most common label in the training data. Compare these performance values to the results obtained for the naive Bayes model.

d) [1* P] Train a feedforward neural network with one sigmoidal output unit and no hidden units with backpropagation (use the algorithm traingdx and initialize the network with small but nonzero weights) on the first {10, 50, 100, 200, ..., 500} samples and test on the same test data as used in point 2 (samples 2501 through 4601). Report the classification error on the test dataset as a function of the number of training examples and compare the results to those obtained for the naive Bayes classifier. Hand in a plot of this function.

Present your results clearly, in a structured and legible form. Document them in such a way that anybody can reproduce them effortlessly.
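The sketch below illustrates the Beta-Bernoulli bookkeeping behind parts a) and b) on toy data rather than the spam set. The function names, the interface and the simple empirical class prior are assumptions made for illustration; they do not match the required nbayes learn.m / nbayes predict.m signatures.

function demo_nbayes_sketch()
% toy data: N samples, D binary attributes, binary labels (stand-in for the spam set)
N = 200;  D = 5;
y = rand(N, 1) > 0.5;
X = double(rand(N, D) < repmat(0.3 + 0.4 * y, 1, D));   % attributes weakly tied to the label
[a, b, prior] = learn(X(1:150, :), y(1:150), 1, 1);     % Beta(1,1), i.e. Laplace/uniform prior
yhat = predict(X(151:end, :), a, b, prior);
fprintf('toy test error: %.3f\n', mean(yhat ~= y(151:end)));
end

function [a, b, prior] = learn(X, y, a0, b0)
% posterior Beta(a(c+1,j), b(c+1,j)) over P(x_j = 1 | class c), for c = 0, 1
a = zeros(2, size(X, 2));  b = zeros(2, size(X, 2));
for c = 0:1
  Xc = X(y == c, :);
  a(c+1, :) = a0 + sum(Xc, 1);                 % counts of x_j = 1 within class c
  b(c+1, :) = b0 + sum(1 - Xc, 1);             % counts of x_j = 0 within class c
end
prior = [mean(y == 0); mean(y == 1)];          % empirical class prior (a simplification)
end

function yhat = predict(X, a, b, prior)
theta = a ./ (a + b);                          % posterior-mean Bernoulli parameters
logp = zeros(size(X, 1), 2);
for c = 1:2
  logp(:, c) = log(prior(c)) + X * log(theta(c, :))' + (1 - X) * log(1 - theta(c, :))';
end
[~, cls] = max(logp, [], 2);
yhat = (cls == 2);                             % predicted label: class 1 vs. class 0
end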