6 Model selection and kernels


Joshua Daniel
5 years ago

6.1 Bias-Variance Dilemma

Exercise 6.1
While fitting a linear model to your data set, you are thinking about changing it to a quadratic one (i.e., a linear model with quadratic features φ(x) = [1, x, x^2]). Which of the following is most likely true?
1. Using the quadratic model will decrease your irreducible error;
2. Using the quadratic model will decrease the bias of your model;
3. Using the quadratic model will decrease the variance of your model;
4. Using the quadratic model will decrease your reducible error.
Provide motivations for your answers.

Exercise 6.2
Are the following statements regarding the No Free Lunch (NFL) theorem true or false? Explain why.
1. On a specific task all the ML algorithms perform in the same way;
2. It is always possible to find a set of data where an algorithm performs arbitrarily badly;
3. All concepts belonging to the concept space F have the same probability to occur when we restrict our attention to a specific task;
4. We can design an algorithm which is always correct on all the samples on a task.

Exercise 6.3
Which of the following are benefits of the sparsity imposed by the Lasso?

1. Sparse models are generally easier to interpret.
2. The Lasso does variable selection by default.
3. Using the Lasso penalty helps to decrease the bias of the fits.
4. Using the Lasso penalty helps to decrease the variance of the fits.
Provide motivation for your answer.

Exercise 6.4
We estimate the regression coefficients in a linear regression model by minimizing the ridge regression objective for a particular value of λ. Describe the behaviour of each of the following quantities as we increase λ from 0 (e.g., remains constant, increases, decreases, increases and then decreases):
1. The training RSS;
2. The test RSS;
3. The variance;
4. The squared bias;
5. The irreducible error.

Exercise 6.5
Suppose that Figure 6.[...] shows the test error of K-NN obtained by using different values of K.

Figure 6.[...]: Dataset and corresponding error for different K in the K-NN classifier (plotted: number of classification errors versus K).

Which of the following is most likely true of what would happen to the test error curve as we increase K further?
1. The test errors will increase;
2. The test errors will decrease;
3. Not enough information is given to decide;
4. It does not make sense to have K > [...].

A.A.7-8 Intelligenza Artificiale - UniBG

Exercise 6.6
Comment on the advantages and drawbacks of the following choices:
1. Increase the model complexity and fix the number of samples.
2. Increase the number of samples and fix the model complexity.

* Exercise 6.7
A hypercube with side length 1 in d dimensions is defined to be the set of points (x_1, x_2, ..., x_d) such that 0 ≤ x_j ≤ 1 for all j = 1, 2, ..., d. The boundary of the hypercube is defined to be the set of all points such that there exists a j for which x_j ≤ 0.05 or 0.95 ≤ x_j (i.e., the set of all points that have at least one dimension in the most extreme 10% of possible values). What proportion of the points in a hypercube of dimension 5 are in the boundary?
In this case, we are considering points with small or large L∞ norm, i.e., a boundary point x is such that ‖x‖∞ ≤ 0.05 or ‖x‖∞ ≥ 0.95. What happens if we consider the L2 norm for vectors? Remember that the volumes of the sphere in 2k and 2k + 1 dimensions are, respectively:
\[ V_{2k}(R) = \frac{\pi^k}{k!} R^{2k}, \qquad V_{2k+1}(R) = \frac{2\,(k!)\,(4\pi)^k}{(2k+1)!} R^{2k+1}. \]

Exercise 6.8
Assume you have two different linear models working on the same dataset of N = [...] samples.
- The first model has k = [...] inputs, considers linear features and has a residual sum of squares of RSS = .5 on a validation set;

- The second model has k = 8 inputs, considers only quadratic features and has a residual sum of squares of RSS = .3 on a validation set.
Which one would you choose? Why? Recall that the F-test statistic for distinguishing between linear models is:
\[ \hat{F} = \frac{N - p_2}{p_2 - p_1} \cdot \frac{RSS_1 - RSS_2}{RSS_2} \sim F(p_2 - p_1, N - p_2), \]
where p_1 and p_2 are the numbers of parameters of the two models and F(a, b) is the Fisher distribution with a and b degrees of freedom.

Exercise 6.9
Which techniques would you consider for model selection in the following cases?
1. A small dataset and a set of simple models;
2. A small dataset and a set of complex models;
3. A large dataset and a set of simple models;
4. A large dataset and a trainer with parallel computing abilities.
Justify your choices.

Exercise 6.10
Suppose you have a dataset and you decided to use all the samples to train your model, including the selection of the parameters of your model and the features you want to consider. What problems arise if you use this methodology? Which procedure should an ML scientist follow?

6.2 Kernel methods

Exercise 6.11
Comment on the following statements about adding new features to your model:
1. It is always a good idea to add features in classification, since they increase the chance of obtaining a feature space where the classes are linearly separable;
2. The addition of new features requires a longer time for the training of the model;

3. The addition of new features requires a longer time in the prediction of newly seen samples;
4. It is not a trivial task to properly choose the features which might improve your learner's capabilities;
5. You need to know the right set of features if you want to make use of them.

Exercise 6.12
For which of the following datasets would you use the kernel trick to represent your data? Would you use some other methodology? Provide motivation for your choice.

Figure 6.3: Different datasets, panels (a) to (d).

Exercise 6.13
Answer the following questions about kernels. Motivate your answers.

1. Can you define a kernel over a feature set composed of colors? For instance, the set could be F = {red, green, blue, black, white}.
2. Can you define a kernel over a feature set composed of graphs?
3. Would you prefer a larger hard drive and/or a faster CPU to apply a kernel method?
4. Assume you have a non-linearly separable dataset, but you know which mapping is able to project it into a linearly separable space. Are there still reasons to consider the use of kernels?

* Exercise 6.14
Derive the kernel formulation for ridge regression, when we consider φ(x) as input features. Is k(x, x') = φ(x)^T φ(x') + λI always a valid kernel?

Exercise 6.15
Consider x, y ∈ R^d. Which of these are similarity measures?
1. k(x, y) = x^T y (dot product);
2. k(x, y) = x^T y + (x^T y)^2;
3. k(x, y) = c k_1(x, y) + k_2(x, y) k_3(x, y), where k_1, k_2 and k_3 are valid kernels in R^d;
4. k(x, y) = log(x) e^y (d = 1);
5. k(x, y) = x^T A y with A = [...] (d = [...]);
6. k(x, y) = ([...] cos[...](x)) cos(y [...] π/[...]) (d = 1).

6.3 SVM

Exercise 6.16
Answer the following questions about SVM. Provide adequate explanation for your answers.

1. If we increase the regularization parameter C in an SVM, how do you expect the margin to behave? Does it become thinner or thicker?
2. If no linear boundary can perfectly classify all the training data, this means we need to use a feature expansion. True or false?
3. The computational effort required to solve a kernel support vector machine becomes greater and greater as the dimension of the basis increases. True or false?
4. Suppose that after our computer works for an hour to fit an SVM on a large data set, we notice that x_4, the feature vector for the fourth example, was recorded incorrectly, i.e., we used x̂_4 instead of x_4 to train our model. However, your coworker notices that the pair (x̂_4, y_4) did not turn out to be a support point in the original fit. He says there is no need to train the SVM again on the corrected data set, because changing the value of a non-support point can't possibly change the fit. True or false?

Exercise 6.17
Which of the following statements are true?
1. Suppose you have 2D input examples (i.e., x_i ∈ R^2). The decision boundary of the SVM (with the linear kernel) is a straight line.
2. If you are training a multi-class SVM with the one-vs-all method, it is not possible to use a kernel.
3. The maximum value of the Gaussian kernel is 1.
4. If the data are linearly separable, an SVM using a linear kernel will return the same parameters w regardless of the chosen value of C.

Exercise 6.18
Consider the linear two-class SVM classifier defined by the parameters w = [...], b = [...]. Answer the following questions, providing adequate motivations.
- Is the point x = [[...], 4] a support vector?
- Give an example of a point which is on the boundary of the SVM.
- How is the point x = [3, [...]] classified according to the trained SVM?

Exercise 6.19

After training a logistic regression classifier with gradient descent on a given dataset, you find that it does not achieve the desired performance on the training set, nor on the cross-validation one. Which of the following might be a promising step to take?
1. Use an SVM with a Gaussian kernel.
2. Introduce a regularization term.
3. Add features based on the problem characteristics.
4. Use an SVM with a linear kernel, without introducing new features.

* Exercise 6.20
Derive the dual formulation from the primal SVM minimization problem with soft margins.

Exercise 6.21
What's the black magic of SVMs? More specifically, which parameters can we tune to better fit a specific classification task with an SVM?
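As a concrete reference for the Gaussian kernel that appears in the SVM exercises above, the following is a minimal Python sketch. It assumes the common parameterization k(x, y) = exp(-γ ‖x − y‖²); the exercise sheet does not fix a parameterization, so the names gaussian_kernel, gram_matrix and gamma, as well as the toy points, are illustrative assumptions and not part of the original text.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2).

    gamma is an assumed name for the width parameter (gamma > 0).
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(points, gamma=1.0):
    """Pairwise kernel values K[i][j] = k(points[i], points[j])."""
    return [[gaussian_kernel(p, q, gamma) for q in points] for p in points]


if __name__ == "__main__":
    # Made-up toy data, only used to show the effect of gamma.
    points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
    for gamma in (0.1, 1.0, 10.0):
        K = gram_matrix(points, gamma)
        # First row: similarity of points[0] to every point in the set.
        print(gamma, [round(v, 4) for v in K[0]])
```

With larger gamma the kernel value decays faster with distance, so each point is treated as similar only to its close neighbours; with smaller gamma distant points still contribute.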
