15-381 Fall 09, Homework 5


Due: Wednesday, November 18th, at the beginning of class.

You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You only need to turn in one write-up per group, but you must include the Andrew IDs of the group members. Late homework is due by 3:00 P.M. on the day it is due. Please bring late homeworks to Susan Hrishenko in GHC 79 (slide it under the door if she is not there, and write the date and time it was turned in on the homework). Note that the GHC 7th floor is now locked after 6pm, so if you finish in the evening you'll probably need to wait until the next morning to turn it in. Email all questions to 15381-instructors@cs.cmu.edu.

1 More Decision Trees - 10 pts

1. (5 pts) Describe how one might use search techniques (genetic algorithms, simulated annealing, etc.) to find a good decision tree. Make sure you address the representation problem.

Answers were evaluated based on whether the approach was feasible, and whether the representation problem for the search techniques (states, neighbors, fitness function) was addressed. One possible solution was the following (a code sketch follows below): each state is a possible decision tree. A neighbor of some tree T is T with one split node pruned, or T with one of its leaves replaced by a different split. (This state space will then look like a tree whose root is the empty decision tree.) The fitness function is the tree's performance evaluated on the training set, with possibly an additional penalty term for large trees to help avoid overfitting. For GAs, one could generate decision trees, mutate by randomly choosing a neighbor, and cross over by, say, picking a random point in one of the trees, pruning it there, and attaching the other tree at that point (although I have no idea whether that would perform well). For hill climbing, one could start at random points in the state space and move around the state space as long as the fitness function kept increasing. One would want to do this multiple times, as this will probably get stuck at local optima very often. For simulated annealing, one could take the same approach as hill climbing, only with some probability 1/temp move to a random neighbor instead.

Another solution we accepted uses a simulated annealing-based modification to ID3. States, neighbors, and fitness are as in ID3, only at each split of the decision tree, with some probability 1/T, choose a random attribute to split on instead of the one with the highest information gain.
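A minimal sketch of the hill-climbing/simulated-annealing idea above; random_tree, random_neighbor, and fitness are hypothetical helpers standing in for the tree representation, the prune/re-split move, and the training-set fitness described in the answer.

```python
import math
import random

def anneal_decision_tree(train, n_steps=10_000, temp0=1.0):
    current = random_tree(train)              # a state = one decision tree (hypothetical helper)
    best = current
    for t in range(1, n_steps + 1):
        temp = temp0 / t                      # simple cooling schedule
        candidate = random_neighbor(current)  # prune one split, or turn one leaf into a split
        delta = fitness(candidate, train) - fitness(current, train)
        # Always accept improvements; accept worse trees with a probability
        # that shrinks as the temperature drops.
        if delta > 0 or random.random() < math.exp(delta / max(temp, 1e-9)):
            current = candidate
        if fitness(current, train) > fitness(best, train):
            best = current
    return best
```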

2. (5 pts) What are the pros and cons of doing that, compared with using the greedy information-gain approach?

Some pros: the key advantage is that this will result in a much more thorough search of the solution space, and possibly a better tree. If one uses a penalty term in the fitness function, it may avoid overfitting. Some cons: this takes a lot longer to run. You still may get caught at local optima, and most of your solutions will be pretty bad. It also only works for categorical data; it would be much more difficult to implement for real-valued data (a huge state space).

2 Artificial Neural Networks - 20 pts

1. (5 pts) In class we discussed that a single perceptron cannot compute XOR. Construct a network of multiple perceptrons that will accomplish this (draw a picture, and include weights).

Here's one possible solution. Assume a threshold activation function at each node, with threshold 0.

Figure 1: An XOR neural network.

2. (5 pts) Now construct a network that will compute the parity of four 0/1 inputs (that is, it will output 1 if the sum is odd, or 0 if the sum is even).

One solution is to take the XOR network from part 1 and stack copies of it, as in Figure 2.

Figure 2: A neural network for parity: the inputs x1, x2, x3, x4 feed into a cascade of three XOR sub-networks.
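As a small illustration, one way to wire threshold perceptrons for XOR and then stack them for 4-input parity; the weights here are my own choice and are not necessarily those shown in Figure 1.

```python
def step(z):
    return 1 if z > 0 else 0          # threshold activation, threshold 0

def xor(x1, x2):
    h_or  = step(x1 + x2 - 0.5)       # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)       # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)   # "OR and not AND" = XOR

def parity4(x1, x2, x3, x4):
    return xor(xor(x1, x2), xor(x3, x4))   # odd number of 1s -> 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))        # prints the XOR truth table
```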

3. (5 pts) Suppose instead of using the notion of squared error, as in lecture, we wanted to use E = |Err| = |y − h_W(x)|. Derive the perceptron learning rule. (Hint: you'll need to assume that the error never becomes zero.)

To derive the update rule, we first take the gradient of the error function E = |Err| with respect to the weights W. Here's the gradient with respect to one component, W_i. First, using the chain rule we can break it down as

∂E/∂W_i = ∂|Err|/∂W_i = (∂|Err|/∂Err) (∂Err/∂W_i).

Assuming that Err is never 0, we have

∂|Err|/∂Err = 1 if Err > 0, and −1 if Err < 0.

To avoid writing cases, we can rewrite this expression as ∂|Err|/∂Err = Err/|Err|. So the overall gradient is

∂E/∂W_i = (Err/|Err|) ∂Err/∂W_i = −(Err/|Err|) g′(in) x_i,

where for the last step we just used the expression from the lecture slides for ∂Err/∂W_i. Given the gradient, we can update the weights by moving a little bit in the direction of steepest decrease, which is the opposite direction of the gradient. So the update rule is

W_i ← W_i − α ∂E/∂W_i, i.e. W_i ← W_i + α (Err/|Err|) g′(in) x_i

for some α > 0.

4. (5 pts) Why might we want to use a sigmoid function instead of a threshold function for activation?

The sigmoid is differentiable, so we can use the gradient-based update rule. It also might be more resilient to noise.
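A brief sketch of the derived rule applied to a single sigmoid unit; the data point, label, and learning rate below are purely illustrative.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))          # sigmoid activation

def update(W, x, y, alpha=0.1):
    in_ = sum(w * xi for w, xi in zip(W, x))    # in = sum_i W_i x_i
    err = y - g(in_)                            # Err = y - h_W(x)
    if err == 0:                                # the derivation assumed Err != 0
        return W
    sign = err / abs(err)                       # Err / |Err|
    gprime = g(in_) * (1 - g(in_))              # g'(in) for the sigmoid
    # W_i <- W_i + alpha * (Err/|Err|) * g'(in) * x_i
    return [w + alpha * sign * gprime * xi for w, xi in zip(W, x)]

W = [0.0, 0.0, 0.0]                             # weights, including a bias weight
print(update(W, [1.0, 0.5, -0.3], 1.0))         # one illustrative update step
```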

3 Linear Separators and SVMs - 10 pts

We will construct a support vector machine that computes the XOR function. It will be convenient to use values of 1 and −1 instead of 1 and 0 for the inputs/outputs, so a training example looks like ([−1, 1], 1) or ([−1, −1], −1).

1. (1 point) Draw the four examples using axes x1, x2, denoting the class by +/−. Label them p1, ..., p4.

Figure 3: Plot of the 4 points for XOR, where I forgot to label the axes: the horizontal axis is x1 and the vertical is x2.

2. (6 points) Propose a transformation that will make these points linearly separable (it can be done in 2D). Draw the four input points in this space, and the maximal-margin separator. What is the margin?

One transformation: y1 = x1, y2 = (x1 − x2)^2 (see Figure 4). The margin is 2 (4 was also accepted, since the margin wasn't clearly defined in lecture).

Figure 4: One transformation and the max-margin separator, where I forgot to label the axes again: the horizontal axis is y1 and the vertical is y2.

3. (3 points) Now draw the separating line back in the original Euclidean input space from part 1.

For the transformation given in part 2, the max-margin separator is at y2 = 2. Working backwards, we get y2 = 2, so (x1 − x2)^2 = 2, giving x1 − x2 = √2 or x1 − x2 = −√2.
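As a quick sanity check of the transformation in part 2 (a small illustrative script, not part of the required solution): in (y1, y2) space the positive XOR examples sit at y2 = 4 and the negative ones at y2 = 0, so the line y2 = 2 separates them.

```python
examples = [([-1, -1], -1), ([-1, 1], 1), ([1, -1], 1), ([1, 1], -1)]

def transform(x):
    x1, x2 = x
    return (x1, (x1 - x2) ** 2)        # y1 = x1, y2 = (x1 - x2)^2

for x, label in examples:
    y1, y2 = transform(x)
    predicted = 1 if y2 > 2 else -1    # max-margin separator y2 = 2
    print(x, "->", (y1, y2), "label", label, "predicted", predicted)
```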

4 Pruning Decision Trees - 35 pts

For this question, if you can find a suitable implementation of the algorithms online, you are welcome to use them. Because different implementations were used, answers varied. Plots were expected to be generated from actual data, not hand-drawn.

1. (10 pts) Run ID3 to train a decision tree for the German Credit Approval data set, classifying whether someone is approved for credit (see the data available on the webpage). As you are building the tree, record the error rates on both the training set and the test set (similar to Slide 8 in the second Decision Tree lecture). How many nodes are in the resulting tree?

You would most likely get around 750 nodes.

2. (10 pts) Now run reduced-error pruning based on the validation set. Record the error rate on the validation set as the tree is being pruned, with a plot as in the previous part. How many nodes are in the resulting tree?

You would most likely get around 50-100 nodes. (A sketch of the pruning loop appears after this list.)

3. (10 pts) Now use the statistical significance testing method to prune the tree. Plot the accuracy for different significance thresholds. How many nodes are in the resulting tree?

For the resulting tree we accepted either the tree whose threshold gave the best accuracy, or the smallest tree. You would also most likely get around 50-100 nodes.

4. (5 pts) Interpret your plots. How does pruning affect results on the training and test error? Which pruning method works best, and why do you think that is? If we picked out a new test set to measure accuracy, which pruning method would have better results?

Typically reduced-error (RE) pruning achieved up to 80% accuracy, while significance-testing (SS) pruning achieved between 65% and 80%. RE pruning probably did better because you're essentially fitting on the test set, while SS is ignorant of the test data and therefore a little more immune to overfitting on test sets. For this reason, on a new test set, the resulting tree from SS would perform better than the resulting tree from RE, which was tuned on the test set we used earlier. (Also accepted: if you assumed that we were fitting RE on the new test set, then saying that RE does better was fine.)
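The reduced-error pruning loop referenced in part 2, as a hedged sketch; the Node fields and the helpers internal_nodes, accuracy, and majority_label are hypothetical stand-ins for whatever your ID3 implementation provides.

```python
def reduced_error_prune(tree, validation):
    improved = True
    while improved:
        improved = False
        base = accuracy(tree, validation)           # current validation accuracy
        for node in internal_nodes(tree):           # every candidate split to prune
            backup = node.children
            node.children = None                    # temporarily turn the split into a leaf
            node.label = majority_label(node)       # predicting the majority class there
            if accuracy(tree, validation) >= base:  # keep the prune if validation
                improved = True                     # accuracy does not drop
                break
            node.children = backup                  # otherwise undo it and try the next node
    return tree
```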

5 Clustering - 30 pts

For this question, implement the algorithms yourself and submit your code in the AFS space.

Sometimes, instead of treating data as clumps of points, we may want to assume they follow a certain distribution. Gaussian mixture models are one example of this. For example, suppose that for some reason you wanted to classify a population of students as male or female. We know that men are generally taller than women (with each population having an approximately Gaussian/normal distribution of heights), but there's a lot of overlap. We also know that the male:female ratio at CMU is 3:2, so if you have a data point of middling height, you're probably safer guessing that the student is male. Gaussian mixture models try to take this into account by giving a probabilistic assignment of membership, rather than a deterministic one as in K-means.

One way to go about fitting such a model is to use what's called the EM algorithm. For Gaussian mixture models, the EM algorithm is similar to K-means, only in addition to calculating a mean (center) for each cluster, you also use mixing variables a_i to reflect the overall distribution of the classes (a 3:2 ratio would have (a_0, a_1) = (0.6, 0.4)). Based on these parameters, each point is assigned to a cluster with some probability of membership in each class. After initializing µ_i and a_i for i ∈ {0, 1}, you iterate through the following two steps until convergence.

First, in the expectation ("E") step, you estimate an expected value for the membership m_ij of each point x_j in distribution Y_i:

m_ij = a_i f_{Y_i}(x_j; µ_i, σ_i) / Σ_{k=1}^{K} a_k f_{Y_k}(x_j; µ_k, σ_k),

where f is the density of the normal distribution.

Then, based on the membership values assigned, you re-adjust your estimates of the mean and variance of each Gaussian, as well as the mixing coefficients, to maximize the likelihood (the "M" step). Each mixing coefficient is the mean of the membership values over all points:

a_i = (1/N) Σ_{j=1}^{N} m_ij.

The mean of each Gaussian is the average of all points weighted by their membership values:

µ_i = (Σ_j m_ij x_j) / (Σ_j m_ij).

You can also derive an update for the variance, but for this problem we'll cheat and assume σ = 1.
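A minimal sketch of these two steps for a mixture of two 1-D Gaussians with σ fixed at 1; the initialization range follows the problem statement in part 1 below, while the data-file name and iteration count are assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_gaussians(xs, n_iters=100):
    mu = [random.uniform(-10, 10), random.uniform(-10, 10)]  # random initial means
    a = [0.5, 0.5]                                           # mixing coefficients a_i
    n = len(xs)
    for _ in range(n_iters):
        # E step: membership m[i][j] of point x_j in component Y_i.
        m = [[0.0] * n for _ in range(2)]
        for j, x in enumerate(xs):
            weights = [a[i] * normal_pdf(x, mu[i]) for i in range(2)]
            total = sum(weights)
            for i in range(2):
                m[i][j] = weights[i] / total
        # M step: a_i is the mean membership, mu_i a membership-weighted mean.
        for i in range(2):
            s = sum(m[i])
            a[i] = s / n
            if s > 0:
                mu[i] = sum(m[i][j] * xs[j] for j in range(n)) / s
    return mu, a, m

# xs = [float(line) for line in open("gmm.txt")]   # data file name is an assumption
```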

1. (15 points) Implement EM as above for a mixture of two 1-D Gaussians, and run it on the GMM data set available on the webpage. Do several runs, each time randomizing your initial settings for µ and a (let's say, pick each µ between (−10, 10)). Show the distribution of the end estimates for µ and a over all the initial settings. What is the average accuracy? What is the maximum accuracy attained by any single run?

In general, EM can get stuck at local optima (which is why we initially intended to see the distributions of results). Most people's implementations did not run into this problem; however, if you did occasionally, that's to be expected. The most common convergence point was around (−0.007, 2.482), very close to the generating parameters of (0, 2.5). Implementations that had no issue with local optima tended to have equivalent values for the average and maximum accuracy; others varied.

Accuracy is measured by comparing the resulting class assignments with the classes provided in the gmmclasses.txt file. You can measure accuracy in one of two ways: 1) for each point, if its membership in one class is higher than in the other, automatically assign it to that class; 2) for each point, probabilistically assign membership (flip a biased coin, where the bias is the membership probability of one class).

Since this is an unsupervised learning problem, accuracy is really only about differentiating clusters, not whether or not the class labels are correct. So, since sometimes the class labels get flipped (depending on which µ value was lower initially), you can occasionally get really bad accuracy even if you're doing it right. In an ideal implementation you could simply take accuracy = max(accuracy, 1 − accuracy), since 1 − accuracy is what you'd get if you flipped the labels. Note that it's impossible to get 100% accuracy on this data, because of the overlap; this was one source of missed points. Also, an accuracy around 70% usually indicated that you were labeling everything as one class (since 70% of the data came from one class). Points were also taken off if EM never converged to anything close to accurate, since that indicates a bug.

2. (10 points) Implement K-means (one-dimensional, for K=2) and repeat, also using several runs with random initialization. Show the distribution of the end estimates of the centers over all initial settings. What is the average accuracy? What is the maximum accuracy attained by any single run?

Usually the best that K-means did was around 88% accuracy (a minimal sketch appears at the end of this section). There was generally even less of a problem with local optima, mainly because of the lower dimensionality of the parameters (EM has to fit both the a's and the µ's). Generally the centers chosen were around (−0.2, 2.37).

3. (5 points) How do your results compare? Why do you think one performs better than the other? In which cases would the other perform better?

EM usually performed better. The reason is that the data are distributed in a way that EM for Gaussians is well suited to: the classes aren't very even (a 70-30 split), and the data were generated from two different Gaussians. If the data were generated from a uniform random distribution, K-means might be better. If the data were linearly separable, K-means might perform better. Note that EM is not limited to Gaussians, or even to clustering; EM is an algorithm that can be used for a wide range of problems.
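Going back to part 2, a minimal 1-D K-means (K = 2) sketch with random initialization; convergence checking and tie handling are kept deliberately simple.

```python
import random

def kmeans_1d(xs, n_iters=100):
    centers = random.sample(xs, 2)                 # random initialization from the data
    for _ in range(n_iters):
        clusters = ([], [])
        for x in xs:
            i = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[i].append(x)                  # hard assignment to the nearest center
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]  # recompute each center as a cluster mean
    return centers
```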