Statistical foundations of machine learning


1 Statistical foundations of machine learning INFO-F-422 Gianluca Bontempi Machine Learning Group Computer Science Department mlg.ulb.ac.be

2 Some algorithms for nonlinear modeling
- Feedforward neural network
- Regression tree
- Neuro-fuzzy inference systems
- Radial basis function
- Local modeling regression
- Support vector machine
- Hierarchical mixtures of experts

3 Global and local approaches
Global models have two main properties:
1. They assume that the relationship between the inputs and the output values can be described by an analytical function over the whole input domain.
2. They treat learning as a problem of function estimation: given a set of data, they extract the hypothesis which is expected to best approximate the whole data distribution.
Examples of global models are linear models, nonlinear statistical regressions, and neural networks.
Divide-and-conquer approaches attack a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem. This principle has two main advantages:
1. Simpler problems can be solved with simpler estimation techniques; in statistics this means adopting linear techniques, which have been well studied and developed over the years.
2. The learning method can better adjust to the properties of the available dataset.

4 Modular modeling
The divide-and-conquer idea evolved into two different paradigms: modular modeling and local modeling. In modular modeling:
- Modules cover different parts (operating regimes) of the input space.
- Although modular architectures are a combination of local models, their identification is still performed on the basis of the whole dataset. Hence the learning procedure remains a function estimation problem, with the advantage that the parametric identification can be made simpler by the adoption of local linear modules.
- In terms of structural identification the problem is still nonlinear and requires the same procedures used for generic global models.
Examples are Fuzzy Inference Systems, Radial Basis Functions, Local Model Networks, and Classification and Regression Trees.

5 Local modeling
Local modeling techniques:
- turn the problem of function estimation into a problem of value estimation;
- do not return a complete description of the input/output mapping, but approximate the function in a neighborhood of the point to be predicted;
- adopt linear techniques both in parametric and in structural identification, with a gain in terms of analytical tractability and fast design.

6 Artificial neural networks
Artificial neural networks (aka neural nets) are parallel, distributed information processing computational models which draw their inspiration from neurons in the brain. The main class of neural network used in supervised learning for classification and regression is the feed-forward network, also known as the multi-layer perceptron (MLP). Feed-forward ANNs have been applied to a wide range of prediction tasks in fields as diverse as speech recognition, financial prediction, image compression, and adaptive industrial control. One of the most important trends in recent neural computing has been a move away from a biologically inspired interpretation of neural networks towards a more rigorous and statistically founded interpretation, based on results from statistical pattern recognition theory.

7 Feed-forward architecture Feed-forward NN have a layered architecture, with each layer comprising one or more simple processing units called artificial neurons or nodes. Each node is connected to one or more other nodes by real valued weights (in the following we will refer to them as parameters) but not to nodes in the same layer. All FNN have an input layer and an output layer. All FNN have a connectivity structure which is an acyclic graph: no closed loops are admitted. FNN are generally implemented with an additional node, called the bias unit, in all layers except the output layer. It plays the role of the intercept term in linear models. For simplicity we will consider only FNN with one single output.

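8 Two-layer feed-forward NN
[Figure: two-layer feed-forward network with bias units. The network inputs $x_1, \dots, x_n$ (layer 0) are connected through the weights $w^{(1)}$ to the hidden nodes $z_1, \dots, z_H$ (layer 1), which are connected through the weights $w^{(2)}$ to the output $y$ (layer 2).]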

9 Feed-forward architecture
Let
- $n$ be the number of inputs, $L$ the number of layers, and $H^{(l)}$ the number of hidden units of the $l$th layer ($l = 1, \dots, L$) of the FNN,
- $w_{kh}^{(l)}$ denote the weight of the link connecting the $k$th node in layer $l-1$ and the $h$th node in layer $l$,
- $z_h^{(l)}$, $h = 1, \dots, H^{(l)}$, be the output of the $h$th hidden node of the $l$th layer,
- $z_0^{(l)}$ denote the bias of the $l$th layer, $l = 0, \dots, L-1$.

10 Feed-forward architecture
Let $H^{(0)} = n$ and $z_h^{(0)} = x_h$, $h = 1, \dots, n$. For $l \ge 1$ the output of the $h$th hidden unit of the $l$th layer, $h = 1, \dots, H^{(l)}$, is obtained by first forming a weighted linear combination of the $H^{(l-1)}$ outputs of the lower level
$$a_h^{(l)} = \sum_{k=1}^{H^{(l-1)}} w_{kh}^{(l)} z_k^{(l-1)} + w_{0h}^{(l)} z_0^{(l-1)}, \qquad h = 1, \dots, H^{(l)}$$
and then by transforming the sum using an activation function to give
$$z_h^{(l)} = g^{(l)}\left(a_h^{(l)}\right), \qquad h = 1, \dots, H^{(l)}$$
The activation function $g^{(l)}(\cdot)$ is typically a nonlinear transformation like the logistic or sigmoid function
$$g^{(l)}(z) = \frac{1}{1 + e^{-z}}$$

11 Two-layer feed-forward NN
For $L = 2$ (i.e. a single hidden layer), the input/output relation is given by
$$\hat{y} = g^{(2)}\left(a_1^{(2)}\right) = g^{(2)}\left(\sum_{k=1}^{H} w_{k1}^{(2)} z_k + w_{01}^{(2)} z_0\right)$$
where
$$z_k = g^{(1)}\left(\sum_{j=1}^{n} w_{jk}^{(1)} x_j + w_{0k}^{(1)} x_0\right), \qquad k = 1, \dots, H$$
Note that if $g^{(1)}(\cdot)$ and $g^{(2)}(\cdot)$ are linear mappings, this functional form becomes linear. Once the number of inputs and the form of the function $g(\cdot)$ are given, two things remain to be chosen:
1. parameters: the values of the weights $w^{(l)}$, $l = 1, 2$;
2. structural hyperparameters: the number of layers $L$ and of hidden nodes $H$.
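The input/output relation of the two-layer network can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the bias units are fixed to 1, the output activation defaults to the identity (a common choice for regression, not something the slide prescribes), and the names `forward`, `w1`, `b1`, `w2`, `b2` are hypothetical.

```python
import math


def sigmoid(a):
    """Logistic activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w1, b1, w2, b2, g1=sigmoid, g2=lambda a: a):
    """Prediction of a two-layer FNN with n inputs, H hidden nodes, one output.

    w1[j][k]: weight from input j to hidden node k
    b1[k]:    bias weight of hidden node k (bias unit assumed equal to 1)
    w2[k]:    weight from hidden node k to the output
    b2:       bias weight of the output node
    g1, g2:   hidden and output activations (g2 = identity is an assumption)
    """
    n, H = len(x), len(b1)
    # Hidden layer: weighted sum of the inputs, then activation.
    z = [g1(sum(w1[j][k] * x[j] for j in range(n)) + b1[k]) for k in range(H)]
    # Output layer: weighted sum of the hidden outputs, then activation.
    return g2(sum(w2[k] * z[k] for k in range(H)) + b2)
```

Passing the identity for both `g1` and `g2` makes the whole mapping linear in `x`, which is the remark made on the slide.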

12 Backpropagation
Backpropagation is an algorithm which, once the number of hidden nodes $H$ is given, estimates the weights $W = \{w_{kh}^{(l)}\}$, $l = 1, \dots, L$, $h = 1, \dots, H^{(l)}$, $k = 0, \dots, H^{(l-1)}$, on the basis of the training set $D_N$. It is a gradient-based algorithm which aims to minimize the cost function
$$\mathrm{MISE}_{\mathrm{emp}}(W) = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N} = \frac{\sum_{i=1}^{N} (y_i - \hat{y}(x_i, W))^2}{N}$$
The optimal set of weights for the given training set is $\hat{W} = \arg\min_{W} \mathrm{MISE}_{\mathrm{emp}}(W)$. Backprop exploits the network structure in order to compute the gradient recursively.

13 Backpropagation (II)
The simplest (and least effective) backprop algorithm is an iterative gradient descent based on the formula
$$W(k+1) = W(k) - \eta \frac{\partial \mathrm{MISE}_{\mathrm{emp}}(W(k))}{\partial W(k)}$$
where $W(k)$ is the weight vector at the $k$th iteration and $\eta$ is the learning rate, which indicates the relative size of the change in weights. The weights are initialized with random values and are changed in a direction that will reduce the error. Some convergence criterion is used to terminate the algorithm. This method is known to be inefficient: many steps are needed to reach a stationary point, and no monotone decrease of $\mathrm{MISE}_{\mathrm{emp}}$ is guaranteed. More effective versions of the algorithm are based on the Levenberg-Marquardt algorithm.
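The iteration above can be illustrated on a deliberately tiny problem. The sketch below assumes a one-weight linear predictor ŷ = w·x instead of a neural network, so that the gradient of the empirical MISE is available in closed form; the function names, the default learning rate and the stopping rule are illustrative choices, not part of the slides.

```python
def mise_emp(w, xs, ys):
    """Empirical MISE of the toy predictor y_hat = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad_mise_emp(w, xs, ys):
    """Derivative of the empirical MISE with respect to w."""
    return -2.0 / len(xs) * sum((y - w * x) * x for x, y in zip(xs, ys))


def gradient_descent(xs, ys, w0=0.0, eta=0.05, tol=1e-10, max_iter=10_000):
    """Iterate w(k+1) = w(k) - eta * dMISE/dw until the step is tiny."""
    w = w0
    for _ in range(max_iter):
        step = eta * grad_mise_emp(w, xs, ys)
        w -= step
        if abs(step) < tol:  # convergence criterion (an assumption)
            break
    return w
```

With a learning rate that is too large for the data at hand the iterates can oscillate or diverge, which is one way to see why no monotone decrease is guaranteed.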

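14 Backpropagation example
Consider a single-input (i.e. $n = 1$) single-output neural network with one hidden layer, two hidden nodes and no bias units. The predictor has the form
$$\hat{y}(x) = g\left(a_1^{(2)}(x)\right) = g\left(w_{11}^{(2)} z_1 + w_{21}^{(2)} z_2\right) = g\left(w_{11}^{(2)} g\left(a_1^{(1)}\right) + w_{21}^{(2)} g\left(a_2^{(1)}\right)\right) = g\left(w_{11}^{(2)} g\left(w_{11}^{(1)} x\right) + w_{21}^{(2)} g\left(w_{12}^{(1)} x\right)\right)$$
where $W = [w_{11}^{(1)}, w_{12}^{(1)}, w_{11}^{(2)}, w_{21}^{(2)}]$. The backprop algorithm needs the derivatives of $\mathrm{MISE}_{\mathrm{emp}}$ with respect to each weight $w \in W$. Since for each $w \in W$
$$\frac{\partial \mathrm{MISE}_{\mathrm{emp}}}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}(x_i)) \frac{\partial \hat{y}(x_i)}{\partial w}$$
we focus on $\frac{\partial \hat{y}}{\partial w}$.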

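15 Backpropagation example
[Figure: the network of the example. The single network input $x$ (layer 0) is connected through the weights $w_{11}^{(1)}$ and $w_{12}^{(1)}$ to the hidden nodes $z_1$ and $z_2$ (layer 1), which are connected through the weights $w_{11}^{(2)}$ and $w_{21}^{(2)}$ to the output $y$ (layer 2).]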

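16 Backpropagation example (II)
Since $a_1^{(2)}(x) = w_{11}^{(2)} z_1 + w_{21}^{(2)} z_2$, we obtain for the weights of the hidden/output layer
$$\frac{\partial \hat{y}(x)}{\partial w_{h1}^{(2)}} = \frac{\partial g}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial w_{h1}^{(2)}} = g'\left(a_1^{(2)}(x)\right) z_h(x), \qquad h = 1, 2$$
Since $a_h^{(1)} = w_{1h}^{(1)} x$, we obtain for the weights of the input/hidden layer
$$\frac{\partial \hat{y}(x)}{\partial w_{1h}^{(1)}} = \frac{\partial g}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial z_h} \frac{\partial z_h}{\partial a_h^{(1)}} \frac{\partial a_h^{(1)}}{\partial w_{1h}^{(1)}} = g'\left(a_1^{(2)}(x)\right) w_{h1}^{(2)} g'\left(a_h^{(1)}(x)\right) x$$
where the term $g'\left(a_1^{(2)}(x)\right)$ has already been computed for the upper layer. Note that for a sigmoid function $g$
$$g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$$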
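The derivatives of this example are easy to check numerically. The sketch below is a minimal illustration, not the course implementation: it assumes sigmoid activations in both layers and stores the weights as a flat list with the two input/hidden weights first and the two hidden/output weights last; all function names are hypothetical.

```python
import math


def g(a):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))


def g_prime(a):
    """Derivative of the sigmoid: exp(-a) / (1 + exp(-a))^2."""
    return math.exp(-a) / (1.0 + math.exp(-a)) ** 2


def predict(x, W):
    """Output of the 1-input, 2-hidden-node, 1-output network (no bias units)."""
    w1_1, w1_2, w2_1, w2_2 = W
    return g(w2_1 * g(w1_1 * x) + w2_2 * g(w1_2 * x))


def grad_prediction(x, W):
    """Derivatives of y_hat(x) with respect to each weight, in the order of W."""
    w1_1, w1_2, w2_1, w2_2 = W
    a1 = [w1_1 * x, w1_2 * x]          # hidden pre-activations
    z = [g(a1[0]), g(a1[1])]           # hidden outputs
    a2 = w2_1 * z[0] + w2_2 * z[1]     # output pre-activation
    top = g_prime(a2)                  # computed once, reused by the lower layer
    d_w2 = [top * z[0], top * z[1]]
    d_w1 = [top * w2_1 * g_prime(a1[0]) * x, top * w2_2 * g_prime(a1[1]) * x]
    return d_w1 + d_w2


def grad_mise_emp(xs, ys, W):
    """Gradient of the empirical MISE: -(2/N) * sum (y_i - y_hat_i) * dy_hat_i/dw."""
    N = len(xs)
    grad = [0.0] * 4
    for x, y in zip(xs, ys):
        err = y - predict(x, W)
        dy = grad_prediction(x, W)
        for j in range(4):
            grad[j] += -2.0 / N * err * dy[j]
    return grad
```

Comparing `grad_mise_emp` with central finite differences of the empirical MISE is a cheap sanity check whenever a gradient is derived by hand.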

17 Two-layer feed-forward NN
Let us consider a two-layer FNN with sigmoidal hidden units. This has proven to be an important class of network for practical applications. It can be shown that such networks are universal approximators: they can approximate arbitrarily well any functional (one-one or many-one) continuous mapping from one finite-dimensional space to another, provided the number $H$ of hidden units is sufficiently large. Although this result is remarkable, it is of no practical use: no indication is given about the number of hidden nodes to choose for a finite number of samples and a generic nonlinear mapping.

18 R TP: An overfitting example
Consider a dataset $D_N = \{x_i, y_i\}$, $i = 1, \dots, N$, where N = 5 and x ∼ N([,,],) is a 3-dimensional vector. Suppose that $y$ is linked to $x$ by the input/output relation
$$y = x_1^2 + 4 \log(|x_2|) + 5 x_3 + w$$
where $x_i$ is the $i$th component of the vector $x \in \mathbb{R}^3$ and Var[w] =.. Consider as nonlinear model a single-hidden-layer neural network (implemented by the R package nnet) with s = 25 hidden neurons. The number of neurons is an index of the complexity of the model.

19 We want to estimate the prediction accuracy on a new i.i.d. dataset of N ts = 5 samples. Let us train the neural network on the whole training set. The empirical prediction MISE error is
$$\mathrm{MISE}_{\mathrm{emp}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - h(x_i, \alpha_N))^2 = .6\ 6$$
where $\alpha_N$ is obtained by the parametric identification step. However, if we test $h(\cdot, \alpha_N)$ on the test set we obtain
$$\mathrm{MISE}_{ts} = \frac{1}{N_{ts}} \sum_{i=1}^{N_{ts}} (y_i - h(x_i, \alpha_N))^2 = 3.58$$
This neural network is seriously overfitting the dataset. The empirical error is a very bad estimate of the MISE.

20 We perform a K-fold cross-validation in order to have a better estimate of MISE. We put K =. Cross-validation implemented in the cv.r R file. The K = cross-validated estimate of MISE is MISE CV = 8.29 This figure is a much more reliable estimation of the prediction accuracy. The leave-one-out estimate K = N = 5 is MISE loo = 7.2 The cross-validated estimate could be used to select a better number of hidden neurons.
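The K-fold estimate can be sketched independently of the learner. The snippet below is not the cv.r file mentioned on the slide; it is a minimal Python illustration that assumes the samples are already in random order (so interleaved folds are acceptable) and uses hypothetical `fit` and `predict` callables. A constant (mean) predictor stands in for the neural network.

```python
def k_fold_mise(xs, ys, fit, predict, K):
    """K-fold cross-validated estimate of the MISE.

    fit(train_x, train_y) returns a model; predict(model, x) returns y_hat.
    Fold f contains the samples with index f, f + K, f + 2K, ...
    """
    N = len(xs)
    squared_errors = []
    for fold in range(K):
        test_idx = set(range(fold, N, K))
        train_x = [xs[i] for i in range(N) if i not in test_idx]
        train_y = [ys[i] for i in range(N) if i not in test_idx]
        model = fit(train_x, train_y)  # identification without the held-out fold
        squared_errors += [(ys[i] - predict(model, xs[i])) ** 2
                           for i in sorted(test_idx)]
    return sum(squared_errors) / N


def fit_mean(train_x, train_y):
    """Toy learner: the model is the mean of the training outputs."""
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    """The toy model ignores the input."""
    return model
```

Setting `K` equal to the number of samples gives the leave-one-out estimate.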

21 R TP: Bagging against overfitting
Consider a dataset $D_N = \{x_i, y_i\}$, $i = 1, \dots, N$, of N = i.i.d. normally distributed inputs x ∼ N([,, ], I). Suppose that $y$ is linked to $x$ by the input/output relation
$$y = x_1^2 + 4 \log(|x_2|) + 5 x_3 + \epsilon$$
where ε ∼ N(,.25) represents the noise. Let us train a single-hidden-layer neural network with s = 5 hidden neurons on the training set. The prediction accuracy on the test set (N ts = ) is MISE ts = 9.95.

22 Let us apply a bagging combination with B = 5 (R-file bagging.r). The prediction accuracy on the test set of the bagging predictor is MISE ts = This shows that the bagging combination reduces the overfitting of the single neural network. The histogram of the MISE ts accuracy of each bootstrap repetition shows that the performance of the bagging predictor is much better than the average performance.
[Figure: histogram of the MISE ts accuracy of each bootstrap repetition; vertical axis: frequency.]
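The bagging combination itself is short. The sketch below is not the bagging.r file mentioned above: as a labeled substitution it uses a one-dimensional 1-nearest-neighbor regressor as the high-variance base learner instead of a neural network, and all names and the `seed` argument are hypothetical.

```python
import random


def bootstrap_sample(xs, ys, rng):
    """Draw N samples with replacement from the training set."""
    N = len(xs)
    idx = [rng.randrange(N) for _ in range(N)]
    return [xs[i] for i in idx], [ys[i] for i in idx]


def fit_1nn(xs, ys):
    """Base learner (an assumption): 1-nearest-neighbor on scalar inputs."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    """Return the output of the training sample closest to x."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


def bagging_fit(xs, ys, B, seed=0):
    """Fit B base learners, each on its own bootstrap replicate."""
    rng = random.Random(seed)
    return [fit_1nn(*bootstrap_sample(xs, ys, rng)) for _ in range(B)]


def bagging_predict(models, x):
    """Bagged prediction: the average of the B individual predictions."""
    return sum(predict_1nn(m, x) for m in models) / len(models)
```

Keeping the individual predictions as well as their average makes it possible to draw the kind of histogram described on the slide.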

23 Decision Trees
The use of tree-based learners dates back to the work of Morgan and Sonquist in 1963. A decision tree partitions the input space into mutually exclusive regions, each of which is assigned a specific model. The nodes of a decision tree can be classified into internal nodes and terminal nodes. An internal node is a decision-making unit that evaluates a decision function to determine which child node to visit next. A terminal node or leaf has no child nodes and is associated with one of the partitions of the input space. Note that each terminal node has a unique path that leads from the root to itself. In classification trees each terminal node contains a label that indicates the class for the associated input region. In regression trees the terminal node contains a model that specifies the input/output mapping for the corresponding input partition.

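24 A binary decision tree.
[Figure: binary decision tree. The root tests $x_1 < a$. If YES, the node $x_2 < c$ leads to R1 (YES) or R2 (NO). If NO, the node $x_1 < b$ leads to R3 (YES) or to the node $x_2 < d$ (NO), which leads to R4 (YES) or R5 (NO).]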

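25 Input space partitioning
[Figure: partition of the $(x_1, x_2)$ plane into the regions R1 to R5, with thresholds $a$ and $b$ on the $x_1$ axis and $c$ and $d$ on the $x_2$ axis.]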

26 Regression tree predictor
Let $m$ be the number of leaves and $h_j(\cdot, \alpha_j)$, $j = 1, \dots, m$, the input/output model associated with the $j$th leaf. When a prediction at a query point $x_q \in \mathbb{R}^n$ is required, the query is presented to the root node of the decision tree. Depending on the result of the associated decision function, the tree branches to one of the root's children. The procedure is iterated recursively until a leaf $j$ is reached and an input/output model is selected. The returned output is the value $h_j(x_q, \alpha_j)$.
Example: let $x_q = (x_{q1}, x_{q2})$ with $x_{q1} < a$ and $x_{q2} > c$. The predicted output is $\hat{y}_q = h_2(x_q, \alpha_2)$, where $\alpha_2$ is the vector of parameters of the model localized in region R2.

27 Regression tree learning
The learning procedure has two steps, known as tree growing and tree pruning. During tree growing the algorithm makes a succession of splits that partition the training data into disjoint subsets. Starting from the root node, which contains the whole dataset, an exhaustive search is performed to find the split that best reduces a certain cost function. Let us consider a certain node $t$ and let $D(t)$ be the corresponding subset of the original $D_N$. Consider the empirical error of the local model $h_t$ fitting the $N(t)$ data contained in the node $t$:
$$\mathrm{SSE}_{\mathrm{emp}}(t) = \min_{\alpha_t} \sum_{i=1}^{N(t)} L(y_i, h_t(x_i, \alpha_t)) \qquad (1)$$
Note that $h_t$ could be any regression model (e.g. constant, linear, ...).

28 Regression tree construction
For any possible split $s$ of node $t$ into the two children $t_r$ and $t_l$, with $N(t_r) + N(t_l) = N(t)$, we define the quantity
$$\Delta E(s, t) = \mathrm{SSE}_{\mathrm{emp}}(t) - \left(\mathrm{SSE}_{\mathrm{emp}}(t_l) + \mathrm{SSE}_{\mathrm{emp}}(t_r)\right)$$
that represents the decrease of the empirical error due to a further partition of the dataset. The best split is the one that maximizes the decrease $\Delta E$:
$$s^* = \arg\max_{s} \Delta E(s, t) \qquad (2)$$
Once the best split is attained, the dataset is partitioned into the two disjoint subsets of length $N(t_r)$ and $N(t_l)$, respectively. The same method is recursively applied to all the leaves. The procedure terminates either when the error measure associated with a node falls below a certain tolerance level, or when the error reduction $\Delta E$ resulting from further splitting does not exceed a threshold value.
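The exhaustive search for the best split can be sketched for the simplest setting. The code below assumes a single scalar input, a quadratic loss and constant leaf models (so the minimizing parameter of each node is the mean of its outputs); candidate thresholds are taken halfway between consecutive distinct input values, which is one common convention rather than something fixed by the slides. The function names are hypothetical.

```python
def sse_constant(ys):
    """Empirical SSE of a constant leaf model; the best constant is the mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Return (threshold, error decrease) of the best split of a node.

    Returns (None, 0.0) when no candidate split reduces the error.
    """
    parent = sse_constant(ys)
    values = sorted(set(xs))
    best_threshold, best_gain = None, 0.0
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2.0
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        # Decrease of the empirical error due to this split.
        gain = parent - (sse_constant(left) + sse_constant(right))
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain
```

Growing a tree amounts to applying `best_split` recursively to each child until the returned decrease falls below the chosen threshold.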

29 Tree pruning
The tree that the growing procedure yields is typically too large and presents a serious risk of overfitting the dataset. For that reason a pruning procedure is often adopted. Pruning uses a complexity-based measure of the tree performance
$$R_\lambda(T) = \mathrm{MISE}_{\mathrm{emp}}(T) + \lambda |T|$$
where $\lambda$ is a parameter that accounts for the tree's complexity and $|T|$ is the number of terminal nodes of the tree $T$. For a fixed $\lambda$ we denote by $T(\lambda)$ the tree structure which minimizes the quantity $R_\lambda(T)$. The parameter $\lambda$ is gradually increased in order to generate a sequence of tree configurations with decreasing complexity.

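30 Tree pruning
For a generic subtree $T_t \subset T$ we have $\mathrm{MISE}_{\mathrm{emp}}(T_t) > \mathrm{MISE}_{\mathrm{emp}}(T)$ and $|T| > |T_t|$. Hence
$$\mathrm{MISE}_{\mathrm{emp}}(T_t) + \lambda_t |T_t| \le \mathrm{MISE}_{\mathrm{emp}}(T) + \lambda_t |T| \iff \lambda_t \ge \frac{\mathrm{MISE}_{\mathrm{emp}}(T_t) - \mathrm{MISE}_{\mathrm{emp}}(T)}{|T| - |T_t|} \qquad (3)$$
This means that the subtree $T_t$ is preferable to $T$ when $\lambda_t$ is greater than the above quantity. Therefore we choose among all the admissible subtrees $T_t$ the one with the smallest term
$$\frac{\mathrm{MISE}_{\mathrm{emp}}(T_t) - \mathrm{MISE}_{\mathrm{emp}}(T)}{|T| - |T_t|}$$
At the end of the shrinking process we have a sequence of candidate trees which have to be properly assessed (e.g. by cross-validation) in order to perform the structural selection.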
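One pruning step can be illustrated with plain numbers. In the sketch below each tree is summarized by a hypothetical pair (empirical MISE, number of terminal nodes); enumerating the admissible subtrees of a real tree is left out, and the function names are illustrative.

```python
def critical_lambda(mise_subtree, leaves_subtree, mise_tree, leaves_tree):
    """Smallest lambda for which the subtree is preferable to the full tree."""
    return (mise_subtree - mise_tree) / (leaves_tree - leaves_subtree)


def next_pruned_subtree(tree, subtrees):
    """Pick the admissible subtree with the smallest critical lambda.

    tree and each element of subtrees are (mise_emp, number_of_leaves) pairs.
    """
    return min(subtrees,
               key=lambda s: critical_lambda(s[0], s[1], tree[0], tree[1]))
```

For a tree summarized by `(1.0, 5)` and candidate subtrees `(1.2, 4)`, `(1.5, 3)` and `(3.0, 1)`, the critical values are 0.2, 0.25 and 0.5, so the four-leaf subtree is the first one to become preferable as lambda grows.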

31 Random Forests
Ensemble learning is efficient when it combines low-bias estimators. Unpruned decision trees are low-bias and high-variance estimators. Random forest is an ensemble learning technique proposed by Breiman which combines bagging and random feature selection. The algorithm consists of the following steps:
1. Generate by bootstrap a set of B training sets.
2. Fit to each of them a decision tree where the set of variables considered for each split is a random subset of the original one.
3. Return as final prediction the average of the B predictions.
The precision of a random forest improves by improving the single classifiers and by reducing their correlation.

32 Radial Basis Functions
Radial Basis Function (RBF) is a modular architecture which is described by the weighted linear combination
$$y = \sum_{j=1}^{m} \rho_j(x; c_j, B_j) h_j$$
where the weights are returned by the activations of $m$ local nonlinear basis functions $\rho_j$ having the center $c_j$ and the bandwidth $B_j$, and where the term $h_j$ is a constant. The basis or activation function $\rho_j$ is a function $\rho_j : \mathcal{X} \to [0, 1]$, usually designed so that its value monotonically decreases towards zero as the input point moves away from its center $c_j$. An example of basis function is
$$\rho_j(x; c_j, B_j) = \exp\left(-\frac{(x - c_j)^2}{B_j}\right)$$

33 Radial Basis Functions
Once we denote by $\eta_j$ the set $\{c_j, B_j\}$ of parameters of the basis function, we have $\rho_j = \rho_j(\cdot, \eta_j)$. If the basis functions $\rho_j$ have localized receptive fields and a limited degree of overlap with their neighbors, the weights $h_j$ can be interpreted as locally piecewise constant models, whose validity for a given input is indicated by the corresponding activation function. The basis function idea arose almost at the same time in different fields and led to similar approaches, often denoted by different names (Radial Basis Functions, Local Model Networks and Neuro-Fuzzy Inference Systems).

34 Fitting of basis functions For a given number m of basis, a clustering technique (e.g. K-means or EM) can be adopted to locate center and variance. In the example, note that n = 2 and m =

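35 Fitting of RBF weights
Let us consider a dataset $D_N = \{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \dots, \langle x_N, y_N \rangle\}$. Once the parameters $\eta_j$, $j = 1, \dots, m$, are fixed, the parametric identification of the remaining parameters $h_j$ boils down to satisfying the following constraints:
$$\begin{aligned}
y_1 &= h_1 \rho_1(x_1, \eta_1) + h_2 \rho_2(x_1, \eta_2) + \dots + h_m \rho_m(x_1, \eta_m) \\
y_2 &= h_1 \rho_1(x_2, \eta_1) + h_2 \rho_2(x_2, \eta_2) + \dots + h_m \rho_m(x_2, \eta_m) \\
&\;\;\vdots \\
y_N &= h_1 \rho_1(x_N, \eta_1) + h_2 \rho_2(x_N, \eta_2) + \dots + h_m \rho_m(x_N, \eta_m)
\end{aligned}$$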

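36 Fitting of RBF weights
These constraints can be written as
$$Y = X\beta$$
where
$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \qquad X = \begin{bmatrix} \rho_1(x_1, \eta_1) & \dots & \rho_m(x_1, \eta_m) \\ \vdots & & \vdots \\ \rho_1(x_N, \eta_1) & \dots & \rho_m(x_N, \eta_m) \end{bmatrix}, \qquad \beta = [h_1, \dots, h_m]^T$$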
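With the centers and bandwidths fixed, the weights follow from a linear problem. The sketch below assumes scalar inputs, Gaussian basis functions, and a least-squares solution of Y = Xβ through the normal equations, which is a natural reading when N > m but is an assumption here; the small Gaussian-elimination helper only avoids external libraries and does not handle singular systems. All names are hypothetical.

```python
import math


def rbf(x, c, B):
    """Gaussian basis function exp(-(x - c)^2 / B) for a scalar input."""
    return math.exp(-((x - c) ** 2) / B)


def design_matrix(xs, centers, widths):
    """Matrix X whose (i, j) entry is rho_j(x_i, eta_j)."""
    return [[rbf(x, c, B) for c, B in zip(centers, widths)] for x in xs]


def solve(A, b):
    """Solve the square system A beta = b (Gaussian elimination, partial pivoting)."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= factor * M[col][c]
    beta = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(M[r][c] * beta[c] for c in range(r + 1, m))
        beta[r] = (M[r][m] - tail) / M[r][r]
    return beta


def fit_rbf_weights(xs, ys, centers, widths):
    """Least-squares weights h_1..h_m from the normal equations X^T X beta = X^T Y."""
    X = design_matrix(xs, centers, widths)
    m = len(centers)
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(m)] for a in range(m)]
    XtY = [sum(row[a] * y for row, y in zip(X, ys)) for a in range(m)]
    return solve(XtX, XtY)
```

When the basis functions overlap heavily the normal equations become ill-conditioned, so a production implementation would typically rely on a QR or SVD based solver instead.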

37 Local Model Networks
Local Model Networks (LMN) are a generalized form of Basis Function Network, in the sense that the constant weights $h_j$ associated with the basis functions are replaced by local models $h_j(\cdot, \alpha_j)$. The typical form of an LMN is then
$$y = \sum_{j=1}^{m} \rho_j(x, \eta_j) h_j(x, \alpha_j)$$
where the $\rho_j$ are constrained to satisfy
$$\sum_{j=1}^{m} \rho_j(x, \eta_j) = 1 \qquad \forall x \in \mathcal{X}$$
This means that the basis functions form a partition of unity. This ensures that every point in the input space has equal weight, so that any variation in the output over the input space is due only to the models $h_j$.
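A partition of unity can be obtained, for instance, by normalizing Gaussian basis functions. The sketch below assumes scalar inputs and local linear models stored as hypothetical (intercept, slope) pairs; normalization is one possible way to enforce the constraint, not necessarily the one used in the course.

```python
import math


def normalized_basis(x, centers, widths):
    """Gaussian activations rescaled so that they sum to one at x."""
    raw = [math.exp(-((x - c) ** 2) / B) for c, B in zip(centers, widths)]
    total = sum(raw)
    return [r / total for r in raw]


def lmn_predict(x, centers, widths, local_models):
    """Local Model Network output: sum_j rho_j(x) * h_j(x).

    local_models[j] = (intercept, slope) of the j-th local linear model.
    """
    rho = normalized_basis(x, centers, widths)
    return sum(r * (a + b * x) for r, (a, b) in zip(rho, local_models))
```

If all local models coincide, the prediction equals that common model whatever the basis functions are, which is exactly what the partition of unity guarantees.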

38 Local Model Networks
[Figure: three panels showing the local models $h_j(x)$, the basis functions $\rho_j(x)$, and the resulting model $y = \sum_{j=1}^{3} \rho_j(x) h_j(x)$.]

39 Local modeling procedure
The learning of a local model in $x_q \in \mathbb{R}^n$ can be summarized in these steps:
1. Compute the distance between the query $x_q$ and the training samples according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select a subset of the $k$ nearest neighbors according to the bandwidth, which measures the size of the neighborhood.
4. Fit a local model (e.g. constant, linear, ...).
Each of the local approaches has one or more structural (or smoothing) parameters that control the amount of smoothing performed. Let us focus on the bandwidth selection.
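The four steps translate almost literally into code. The sketch below assumes the Euclidean metric and a constant local model (the average of the neighbors' outputs); the function name and the choice of `k` as the bandwidth parameter are illustrative.

```python
import math


def local_constant_predict(x_query, xs, ys, k):
    """Predict at x_query with a constant model fitted on the k nearest neighbors."""
    # Steps 1 and 2: compute the distances to the query and rank the samples.
    ranked = sorted(range(len(xs)), key=lambda i: math.dist(x_query, xs[i]))
    # Step 3: keep the k nearest neighbors (k plays the role of the bandwidth).
    neighbors = ranked[:k]
    # Step 4: fit the local model; for a constant model this is the mean output.
    return sum(ys[i] for i in neighbors) / k
```

No model is built before the query arrives: all the work is done at prediction time, and only for the neighborhood of the query.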

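40 The bandwidth trade-off: overfit
[Figure: local fit in the x-y plane around the query point q, with the prediction error e.]
A bandwidth that is too narrow leads to overfitting and to a large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high variance.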

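41 The bandwidth trade-off: underfit
[Figure: local fit in the x-y plane around the query point q, with the prediction error e.]
A bandwidth that is too large leads to underfitting and to a large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high bias.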

42 Bias/variance decomposition
In the case of a constant local model, the prediction in $x_q$ is the quantity
$$h(x_q, \alpha_N) = \frac{1}{k} \sum_{i=1}^{k} y_{[i]}$$
computed by averaging the value of $y$ for the $k$ closest neighbors $x_{[i]}$, $i = 1, \dots, k$, of $x_q$. The bias/variance decomposition takes the form
$$\mathrm{MSE}(x_q) = \sigma_w^2 + \left(\frac{1}{k} \sum_{i=1}^{k} f(x_{[i]}) - f(x_q)\right)^2 + \frac{\sigma_w^2}{k}$$

43 Bandwidth and bias/variance trade-off
[Figure: mean squared error, bias and variance as functions of 1/bandwidth, ranging from many neighbors (underfitting) to few neighbors (overfitting).]

44 The PRESS statistic
Cross-validation can provide a reliable estimate of the algorithm's generalization error, but it requires the training process to be repeated K times, which sometimes means a large computational effort. In the case of linear models there exists a powerful statistical procedure to compute the leave-one-out cross-validation measure at a reduced computational cost. It is the PRESS (Prediction Sum of Squares) statistic, a simple formula which returns the leave-one-out (l-o-o) error as a by-product of the least-squares fit.

45 Leave-one-out for linear models
[Figure: two routes from the training set to the leave-one-out error. Left: parametric identification on N samples, followed by the PRESS statistic. Right: N times, put the j-th sample aside, perform the parametric identification on N-1 samples, and test on the j-th sample.]
The leave-one-out error can be computed in two equivalent ways: the slower way (on the right), which repeats the training and test procedure N times, and the faster way (on the left), which performs the parametric identification and the computation of the PRESS statistic only once.

46 The PRESS statistic
This allows a fast cross-validation without repeating the leave-one-out procedure N times. The PRESS procedure can be described as follows:
1. We use the whole training set to estimate the linear regression coefficients $\hat{\beta} = (X^T X)^{-1} X^T Y$.
2. This procedure is performed only once on the N samples and returns as a by-product the Hat matrix $H = X (X^T X)^{-1} X^T$.
3. We compute the residual vector $e$, whose $j$th term is $e_j = y_j - x_j^T \hat{\beta}$.
4. We use the PRESS statistic to compute $e_j^{\mathrm{loo}}$ as
$$e_j^{\mathrm{loo}} = \frac{e_j}{1 - H_{jj}}$$
where $H_{jj}$ is the $j$th diagonal term of the matrix $H$.

47 The PRESS statistic
Thus, the leave-one-out estimate of the local mean integrated squared error is
$$\mathrm{MISE}_{\mathrm{LOO}} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{y_i - \hat{y}_i}{1 - H_{ii}}\right)^2$$
Note that PRESS is not an approximation of the loo error but simply a faster way of computing it.
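The equivalence between PRESS and the explicit leave-one-out loop can be checked on a small case. The sketch below assumes a linear model with an intercept and a single scalar input, so that the inverse of the 2×2 matrix X^T X can be written in closed form; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares for y = b0 + b1 * x; returns beta and (X^T X)^{-1}."""
    N = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = N * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, N / det]]  # (X^T X)^{-1}
    beta = [inv[0][0] * sy + inv[0][1] * sxy,
            inv[1][0] * sy + inv[1][1] * sxy]
    return beta, inv


def press_loo_errors(xs, ys):
    """Leave-one-out residuals e_j / (1 - H_jj) from a single fit."""
    beta, inv = fit_line(xs, ys)
    errors = []
    for x, y in zip(xs, ys):
        e = y - (beta[0] + beta[1] * x)
        # H_jj = x_j^T (X^T X)^{-1} x_j with x_j = [1, x]
        h = inv[0][0] + 2.0 * inv[0][1] * x + inv[1][1] * x * x
        errors.append(e / (1.0 - h))
    return errors


def brute_force_loo_errors(xs, ys):
    """Leave-one-out residuals obtained by refitting N times."""
    errors = []
    for j in range(len(xs)):
        beta, _ = fit_line(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:])
        errors.append(ys[j] - (beta[0] + beta[1] * xs[j]))
    return errors


def mise_loo(errors):
    """Leave-one-out MISE: the mean of the squared leave-one-out residuals."""
    return sum(e * e for e in errors) / len(errors)
```

Both routes return the same residuals up to rounding, but the PRESS route needs one fit instead of N.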

48 Selection of the number of neighbours
For a given query point $x_q$, we can compute a set of predictions $\hat{y}_q(k) = x_q^T \hat{\beta}(k)$, together with a set of associated leave-one-out errors $\mathrm{MISE}_{\mathrm{LOO}}(k)$, for a number of neighbors ranging in $[k_{\min}, k_{\max}]$. If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction $\hat{y}_q$ consists in comparing the predictions obtained for each value of $k$ on the basis of the classical mean square error criterion:
$$\hat{y}_q = x_q^T \hat{\beta}(\hat{k}), \qquad \text{with } \hat{k} = \arg\min_{k} \mathrm{MISE}_{\mathrm{LOO}}(k)$$

49 Local Model combination
As an alternative to the winner-takes-all paradigm, we can use a combination of estimates. The final prediction of the value $y_q$ is obtained as a weighted average of the best $b$ models, where $b$ is a parameter of the algorithm. Suppose the predictions $\hat{y}_q(k)$ and the loo errors $\mathrm{MISE}_{\mathrm{LOO}}(k)$ have been ordered, creating a sequence of integers $\{k_i\}$ so that $\mathrm{MISE}_{\mathrm{LOO}}(k_i) \le \mathrm{MISE}_{\mathrm{LOO}}(k_j)$ for all $i < j$. The prediction $\hat{y}_q$ is given by
$$\hat{y}_q = \frac{\sum_{i=1}^{b} \zeta_i \hat{y}_q(k_i)}{\sum_{i=1}^{b} \zeta_i}$$
where the weights are the inverse of the mean square errors: $\zeta_i = 1 / \mathrm{MISE}_{\mathrm{LOO}}(k_i)$.
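Both the winner-takes-all selection and the weighted combination reduce to a few lines once the predictions and leave-one-out errors are available. The sketch below assumes they are stored in two hypothetical dictionaries keyed by the number of neighbors k, and that no leave-one-out error is exactly zero.

```python
def winner_takes_all(predictions, loo_errors):
    """Return the prediction of the k with the smallest leave-one-out MISE."""
    k_hat = min(loo_errors, key=loo_errors.get)
    return predictions[k_hat]


def combine_best(predictions, loo_errors, b):
    """Weighted average of the b best models, with weights 1 / MISE_LOO(k)."""
    best = sorted(loo_errors, key=loo_errors.get)[:b]  # k values by increasing error
    weights = [1.0 / loo_errors[k] for k in best]
    weighted_sum = sum(w * predictions[k] for w, k in zip(weights, best))
    return weighted_sum / sum(weights)
```

With `b = 1` the combination reduces to the winner-takes-all prediction.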
