Stat 60X Exam — Spring

I have neither given nor received unauthorized assistance on this exam.

Name Signed ____________________  Date ____________

Name Printed ____________________
1. Below is a small p = 2 classification training set (for 2 classes) displayed in graphical and tabular forms (circles are one class and squares are the other).

[Graph and data table of the training set: columns y, x1, x2.]

a) Using the geometry above (and not trying to solve an optimization problem analytically), find the support vector machine for this classification problem. You may find it helpful to know that if u, v, and w are points in R^2 and u ≠ v, then the distance from the point w to the line through u and v is

   || (w − v) − [ ((w − v)·(u − v)) / ||u − v||^2 ] (u − v) ||

b) List the set of support vectors and the "margin" (M) for your SVM.
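The quoted distance is the length of the component of w − v orthogonal to the line's direction. A minimal Python sketch of the formula (the example points are hypothetical, not from the exam's data set):

```python
# Sketch of the point-to-line distance formula in problem 1 a): the distance
# from w to the line through u and v is the norm of the part of (w - v)
# orthogonal to the direction (u - v). Points here are made up.
import math

def dist_point_to_line(w, u, v):
    """Distance from point w to the line through distinct points u, v in R^2."""
    d = (u[0] - v[0], u[1] - v[1])          # direction of the line
    r = (w[0] - v[0], w[1] - v[1])          # w relative to a point on the line
    t = (r[0] * d[0] + r[1] * d[1]) / (d[0] ** 2 + d[1] ** 2)  # projection coeff.
    perp = (r[0] - t * d[0], r[1] - t * d[1])                  # orthogonal part
    return math.hypot(perp[0], perp[1])

# Example: distance from (0, 1) to the x-axis (line through (0,0) and (1,0)).
print(dist_point_to_line((0, 1), (0, 0), (1, 0)))  # 1.0
```

Applied to the two support-vector lines of an SVM, this same computation gives the margin M.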
2. Below is a different p = 2 classification training set for 2 classes (specified as in problem 1).

[Graph and data table of the training set: columns y, x1, x2.]

a) Using empirical misclassification rate as your splitting criterion and standard forward selection, find a reasonably simple binary tree classifier that has empirical error rate 0. Carefully describe it below, using as many nodes as you need.

At the root node: split on x1 / x2 (circle the correct one of these)
   Classify to Class ____ if ________________ (creating Node #1)
   Classify to Class ____ otherwise (creating Node #2)

At Node #1: split on x1 / x2
   Classify to Class ____ if ________________ (creating Node #3)
   Classify to Class ____ otherwise (creating Node #4)

At Node #2: split on x1 / x2
   Classify to Class ____ if ________________ (creating Node #5)
   Classify to Class ____ otherwise (creating Node #6)
At Node #3: split on x1 / x2
   Classify to Class ____ if ________________ (creating Node #7)
   Classify to Class ____ otherwise (creating Node #8)

At Node #4: split on x1 / x2
   Classify to Class ____ if ________________ (creating Node #9)
   Classify to Class ____ otherwise (creating Node #10)

At Node #5: split on x1 / x2
   Classify to Class ____ if ________________ (creating Node #11)
   Classify to Class ____ otherwise (creating Node #12)

At Node #6: split on x1 / x2
   Classify to Class ____ if ________________ (creating Node #13)
   Classify to Class ____ otherwise (creating Node #14)

(Add more of these on another page if you need them.)

b) Draw in the final set of rectangles corresponding to your binary tree on the graph of the previous page.
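Each node of part a) is found by the same greedy step: among all coordinates and cutpoints, pick the split whose two children make the fewest majority-vote errors. A sketch of that step, on a made-up data set (not the exam's table):

```python
# Brute-force search for the single best split under the empirical
# misclassification criterion of problem 2 a). The data are hypothetical.
def best_split(X, y):
    n, p = len(X), len(X[0])
    best = None                              # (errors, coordinate j, cutpoint c)
    for j in range(p):
        for c in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= c]
            right = [y[i] for i in range(n) if X[i][j] > c]
            errs = 0
            for side in (left, right):
                if side:                     # each node classifies to its majority class
                    maj = max(set(side), key=side.count)
                    errs += sum(1 for lab in side if lab != maj)
            if best is None or errs < best[0]:
                best = (errs, j, c)
    return best

X = [(1, 1), (2, 1), (3, 3), (4, 3)]         # hypothetical 2-input points
y = [0, 0, 1, 1]
print(best_split(X, y))                      # (0, 0, 2): split x1 <= 2, no errors
```

Repeating this search inside each impure child node grows the tree until the empirical error rate reaches 0.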
c) For every sub-tree T of your full binary tree above, list in the table below the size (number of final nodes) |T| of the sub-tree and the empirical error rate of its associated classifier.

   Full tree pruned at Nodes #   |   |T| (pruned tree size)   |   err
   None (full tree)              |                            |

(If you need more room in the table, add rows on another page.)

d) Using the values in your table from c), find for every alpha ≥ 0 a sub-tree of your full tree minimizing

   C_alpha(T) = err(T) + alpha·|T|
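Part d) only requires scanning the table from c): for each alpha, the winning sub-tree is the one with the smallest err + alpha·size, and the winner changes at finitely many alpha values. A sketch with a hypothetical sub-tree table (not the one you build in c)):

```python
# Cost-complexity selection as in problem 2 d). The (size, error) table is
# made up for illustration.
subtrees = {          # name -> (size |T|, empirical error rate)
    "full": (5, 0.0),
    "pruned@3": (3, 0.1),
    "root": (1, 0.4),
}

def best_subtree(alpha):
    """Sub-tree minimizing C_alpha(T) = err(T) + alpha * |T|."""
    return min(subtrees, key=lambda t: subtrees[t][1] + alpha * subtrees[t][0])

for alpha in (0.0, 0.06, 0.2):
    print(alpha, best_subtree(alpha))
```

As alpha grows from 0, the minimizer moves monotonically from the full tree toward the root, which is exactly the pattern the table in d) should exhibit.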
3. Here again is the p = 2 classification training set specified in problem 2.

[Graph and data table of the training set: columns y, x1, x2.]

Using "stumps" (binary trees with only 2 final nodes) as your base classifiers, find the M-term AdaBoost classifier for this problem. (Show your work!)
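The iteration asked for here follows the standard discrete AdaBoost recipe: fit the best weighted stump, compute its vote weight alpha_m = (1/2)·ln((1 − err_m)/err_m), up-weight the misclassified points, and repeat. A sketch on a made-up 1-D data set (not the exam's table), with the usual ±1 label coding:

```python
# Discrete AdaBoost with stump base classifiers, as in problem 3.
# The tiny data set is hypothetical.
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical 1-D inputs
y = [1, 1, -1, -1, 1]                # labels coded +/-1

def stump(c, s):                     # classify +s if x <= c, else -s
    return lambda x: s if x <= c else -s

def weighted_err(g, w):
    return sum(wi for wi, xi, yi in zip(w, X, y) if g(xi) != yi)

def adaboost(M):
    w = [1.0 / len(X)] * len(X)      # start with uniform weights
    fits = []
    for _ in range(M):
        cands = [stump(c, s) for c in X for s in (1, -1)]
        g = min(cands, key=lambda g: weighted_err(g, w))
        err = weighted_err(g, w)
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-12))
        fits.append((alpha, g))
        # up-weight misclassified points, down-weight correct ones, renormalize
        w = [wi * math.exp(-alpha * yi * g(xi)) for wi, xi, yi in zip(w, X, y)]
        tot = sum(w)
        w = [wi / tot for wi in w]
    return lambda x: 1 if sum(a * g(x) for a, g in fits) >= 0 else -1

f = adaboost(3)
print([f(x) for x in X])             # [1, 1, -1, -1, 1]: zero training errors
```

The "show your work" part of the problem corresponds to writing down, for each round m, the weights w_i, the chosen stump, its weighted error, and alpha_m.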
4. The machine learning/data mining folklore is full of statements like "combining uncorrelated classifiers through majority voting produces a committee classifier better than every individual in the committee." Consider the scenario outlined in the table below as regards classifiers f1, f2, and f3 and a target (class variable) y taking values in {0, 1}.

[Table of 16 outcomes with columns: Outcome, f1, f2, f3, Committee Classification, y, Probability. The Committee Classification column is blank, to be filled in; probability entries include values such as .008 and .08.]

Fill in the (majority vote) Committee Classification column and say carefully (show appropriate calculations to support your statement(s)) what the example shows about the folklore.
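The bookkeeping the problem asks for can be sketched directly: fill the committee column with the majority vote and accumulate each classifier's probability of agreeing with y. The joint distribution below is invented for illustration and is NOT the exam's table:

```python
# Majority-vote committee accounting, as in problem 4.
# Rows are (f1, f2, f3, y, probability); the distribution is made up.
rows = [
    (0, 0, 0, 0, 0.20),
    (0, 0, 1, 0, 0.05),
    (0, 1, 0, 0, 0.05),
    (1, 0, 0, 0, 0.05),
    (1, 1, 0, 1, 0.20),
    (1, 0, 1, 1, 0.20),
    (0, 1, 1, 1, 0.20),
    (1, 1, 1, 1, 0.05),
]

def majority(f1, f2, f3):
    return 1 if f1 + f2 + f3 >= 2 else 0

acc = {k: 0.0 for k in ("f1", "f2", "f3", "committee")}
for f1, f2, f3, y, p in rows:
    votes = {"f1": f1, "f2": f2, "f3": f3, "committee": majority(f1, f2, f3)}
    for k, v in votes.items():
        if v == y:
            acc[k] += p        # probability of classifying correctly
print(acc)
```

In this made-up distribution the committee (accuracy 1.0) beats each member (accuracy .75), which is what the folklore predicts; the exam's table is designed to test whether that conclusion holds in general, so the same accounting should be applied to its probabilities.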
5. Suppose that in a p = 2 linear discriminant analysis problem, the 4 transformed means mu*_1, mu*_2, mu*_3, mu*_4 and their sample covariance matrix are as given below.

[The four mean vectors and the sample covariance matrix; visible entries in the display include 0, 4, 3.5, .5, .65, and diag(4.75, .5).]

a) Suppose that one wants to do reduced rank (rank 1) linear discrimination based on a single real variable

   w = u'x = u1·x1 + u2·x2.

Identify an appropriate vector u = (u1, u2)', and with your choice of vector, give the function f(w) mapping into {1, 2, 3, 4} that defines this 4-class classifier.

b) What set of "prototypes" (values of w) yields the classifier in a)?
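Parts a) and b) amount to projecting onto the chosen direction u and classifying to the class whose prototype w_k = u'mu*_k is nearest. A sketch with made-up means and direction (not the exam's values):

```python
# Rank-1 linear discrimination as in problem 5: project onto u, then use a
# nearest-prototype rule on the line. All numbers here are hypothetical.
means = {1: (0.0, 0.0), 2: (1.0, 0.0), 3: (2.0, 0.0), 4: (3.0, 0.0)}
u = (1.0, 0.0)                       # made-up discriminant direction

# Part b): the prototypes are the projected class means w_k = u' mu*_k.
protos = {k: u[0] * m[0] + u[1] * m[1] for k, m in means.items()}

def f(w):                            # part a): classify to nearest prototype
    return min(protos, key=lambda k: abs(w - protos[k]))

print(protos)                        # {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}
print(f(1.8))                        # 3: prototype 2.0 is closest
```

On the exam, u should come from the leading discriminant direction implied by the given means and covariance; the classifier f then partitions the w-axis at the midpoints between adjacent prototypes.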
6. In classification problems, searching for a small number of effective basis functions h1, h2, ..., hk to use in transforming a complicated situation involving input x in R^p into a simpler situation involving the "feature vector" (h1(x), h2(x), ..., hk(x)) is a major concern.

a) The p = 1 toy classification problem with K = 2 classes and training data below is not simple.

[Small training-data table: y and x.]

But it's easy to find a single function h(x) that makes the problem into a linearly separable problem with k = 1. Name one such function.

b) In what specific way(s) does the use of kernels and SVM methodology typically lead to identification of a small number of important features (basis functions) that are effective in 2-class classification problems?

c) If you didn't have available SVM software but DID have available Lasso or LAR regression software, say how you might use the kernel idea and the Lasso/LAR software in the search for a few effective features in a 2-class classification problem.
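The two ideas in this problem can be sketched concretely: a single nonlinear basis function can linearize a 1-D problem, and the kernel trick supplies one candidate feature h_i(x) = K(x, x_i) per training point for Lasso/LAR to select among. The tiny data set below is made up, as is the kernel bandwidth:

```python
# Sketch for problem 6. Hypothetical data: two outer points of one class
# flanking a middle point of the other, which no linear rule separates.
import math

x = [-3.0, 0.0, 3.0]
y = [1, 0, 1]

# Part a) idea: a function like h(x) = x^2 maps the data to [9, 0, 9];
# any threshold between 0 and 9 then separates the classes linearly.
h = [xi ** 2 for xi in x]
print(h)                                 # [9.0, 0.0, 9.0]

# Part c) idea: build one kernel feature per training point and let
# Lasso/LAR zero out most of the corresponding coefficients.
def K(a, b, gamma=0.5):                  # Gaussian kernel, made-up bandwidth
    return math.exp(-gamma * (a - b) ** 2)

features = [[K(xi, xj) for xj in x] for xi in x]   # rows: cases, cols: h_j
# Regressing (coded) y on these columns with Lasso/LAR keeps only a few
# columns, i.e. a few "support" training points as effective features.
```

The analogy with SVMs is that the surviving columns play the role of support vectors: both methods end with a classifier built from a small subset of the kernel basis functions.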
7. Consider a Bayesian model averaging problem where x takes values in {0, 1} and y takes values in {0, 1}. The quantity of interest is P[y = 1 | x = 1] / P[y = 0 | x = 1], and there are M = 2 models under consideration. We'll suppose that joint probabilities for (x, y) are as given in the tables below for the two models, for some p in (0, 1) and r in (0, 1):

   Model 1                         Model 2
   y\x      0          1           y\x      0          1
   0     (1−p)/2      p/2          0       .25        .25
   1       .25        .25          1     (1−r)/2      r/2

so that under Model 1 the quantity of interest is .5/p, and under Model 2 it is r/.5. Suppose that under both models, training data (x_i, y_i), i = 1, 2, ..., N, are iid. For priors, in Model 1 suppose that a priori p ~ Beta(·,·), and in Model 2 suppose that a priori r ~ Beta(·,·). Further, suppose that the prior probabilities of the two models are each .5.

Find the posterior probabilities of the two models given the training data T, and the Bayes model average squared error loss predictor of P[y = 1 | x = 1] / P[y = 0 | x = 1]. (You may think of the training data as summarized in the 4 counts N_{x,y} = the number of training vectors with value (x, y).)
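The model posteriors come from Beta-Binomial marginal likelihoods: integrating p (or r) out of the iid likelihood against its Beta prior leaves a closed-form marginal for each model, and the posteriors are proportional to .5 times those marginals. A sketch assuming cell probabilities ((1−p)/2, p/2, 1/4, 1/4) for Model 1 and (1/4, 1/4, (1−r)/2, r/2) for Model 2, with Beta(1, 1) priors and made-up counts (the exam's prior parameters and data are not reproduced here):

```python
# Posterior model probabilities for problem 7, via Beta-Binomial marginals.
# Prior parameters (a, b) and the counts N_{x,y} are assumptions.
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

N = {(0, 0): 3, (1, 0): 2, (0, 1): 4, (1, 1): 5}   # made-up counts N_{x,y}
a, b = 1.0, 1.0                                    # assumed Beta prior parameters

# Model 1: cells ((1-p)/2, p/2, 1/4, 1/4) for (x,y) = (0,0),(1,0),(0,1),(1,1);
# integrating p out against Beta(a, b) gives this log marginal likelihood.
log_m1 = ((N[0, 0] + N[1, 0]) * math.log(0.5)
          + (N[0, 1] + N[1, 1]) * math.log(0.25)
          + log_beta(a + N[1, 0], b + N[0, 0]) - log_beta(a, b))
# Model 2: cells (1/4, 1/4, (1-r)/2, r/2), with r integrated out.
log_m2 = ((N[0, 0] + N[1, 0]) * math.log(0.25)
          + (N[0, 1] + N[1, 1]) * math.log(0.5)
          + log_beta(a + N[1, 1], b + N[0, 1]) - log_beta(a, b))

m1, m2 = math.exp(log_m1), math.exp(log_m2)
post1 = 0.5 * m1 / (0.5 * m1 + 0.5 * m2)   # prior model probabilities both .5
post2 = 1.0 - post1
print(round(post1, 4), round(post2, 4))
```

The Bayes model average predictor then weights each model's posterior-mean value of the quantity of interest by these posterior model probabilities.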