Statistical foundations of machine learning

1 Statistical foundations of machine learning INFO-F-422 Gianluca Bontempi Machine Learning Group Computer Science Department mlg.ulb.ac.be

2 Some algorithms for nonlinear modeling: feed-forward neural networks, regression trees, neuro-fuzzy inference systems, radial basis functions, local modeling, support vector machines, hierarchical mixtures of experts.

3 Global and local approaches. A global approach has two properties: (1) it assumes that the relationship between the inputs and the output can be described by an analytical function over the whole input domain; (2) learning is seen as a problem of function estimation: given a set of data, it extracts the hypothesis which is expected to best approximate the whole data distribution. Examples of global models are linear models, nonlinear statistical regressions and neural networks. The divide-and-conquer approach attacks a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem. This principle presents two main advantages: (1) simpler problems can be solved with simpler estimation techniques; in statistics this means adopting linear techniques, well studied and developed over the years; (2) the learning method can better adjust to the properties of the available dataset.

4 Modular modeling The divide-and-conquer idea evolved into two different paradigms. In modular modeling, modules cover different parts (operating regimes) of the input space. Although modular architectures are a combination of local models, their identification is still performed on the basis of the whole dataset. Hence, the learning procedure remains a function estimation problem, with the advantage that the parametric identification can be made simpler by the adoption of local linear modules. In terms of structural identification, however, the problem is still nonlinear and requires the same procedures used for generic global models. Examples are Fuzzy Inference Systems, Radial Basis Functions, Local Model Networks, and Classification and Regression Trees.

5 Local modeling Local modeling turns the problem of function estimation into a problem of value estimation: local models do not return a complete description of the input/output mapping but approximate the function only in a neighborhood of the point to be predicted. This allows the adoption of linear techniques both in parametric and structural identification, with a gain in terms of analytical tractability and fast design.

6 Artificial neural networks Artificial neural networks (aka neural nets) are parallel, distributed information-processing computational models which draw their inspiration from neurons in the brain. The main class of neural network used in supervised learning for classification and regression is the feed-forward network, also known as the multi-layer perceptron (MLP). Feed-forward ANNs have been applied to a wide range of prediction tasks in such diverse fields as speech recognition, financial prediction, image compression and adaptive industrial control. One of the most important trends in recent neural computing has been a move away from a biologically inspired interpretation of neural networks towards a more rigorous, statistically founded interpretation based on results from statistical pattern recognition theory.

7 Feed-forward architecture Feed-forward NNs have a layered architecture, with each layer comprising one or more simple processing units called artificial neurons or nodes. Each node is connected to one or more nodes of other layers by real-valued weights (in the following we will refer to them as parameters) but not to nodes in the same layer. All FNNs have an input layer and an output layer, and a connectivity structure which is an acyclic graph: no closed loops are admitted. FNNs are generally implemented with an additional node, called the bias unit, in all layers except the output layer; it plays the role of the intercept term in linear models. For simplicity we will consider only FNNs with a single output.

8 Two-layer feed-forward NN. Figure: network diagram in which the inputs $x_1, \dots, x_n$ plus a bias unit feed, through the layer-1 weights $w^{(1)}_{11}, \dots, w^{(1)}_{nH}$, the hidden units $z_1, \dots, z_H$; these, plus a bias unit, feed through the layer-2 weights $w^{(2)}_1, \dots, w^{(2)}_H$ the output $y$.

9 Feed-forward architecture Let $n$ be the number of inputs, $L$ the number of layers and $H^{(l)}$ the number of hidden units of the $l$th layer ($l = 1,\dots,L$) of the FNN; let $w^{(l)}_{kh}$ denote the weight of the link connecting the $k$th node in the $(l-1)$th layer to the $h$th node in the $l$th layer; let $z^{(l)}_h$, $h = 1,\dots,H^{(l)}$, be the output of the $h$th hidden node of the $l$th layer, and let $z^{(l-1)}_0$ denote the bias for the $l$th layer, $l = 1,\dots,L$.

10 Feed-forward architecture Let $H^{(0)} = n$ and $z^{(0)}_h = x_h$, $h = 1,\dots,n$. For $l \ge 1$ the output of the $h$th hidden unit of the $l$th layer, $h = 1,\dots,H^{(l)}$, is obtained by first forming a weighted linear combination of the $H^{(l-1)}$ outputs of the lower layer,
$$a^{(l)}_h = \sum_{k=1}^{H^{(l-1)}} w^{(l)}_{kh} z^{(l-1)}_k + w^{(l)}_{0h} z^{(l-1)}_0, \qquad h = 1,\dots,H^{(l)},$$
and then by transforming the sum using an activation function to give
$$z^{(l)}_h = g^{(l)}(a^{(l)}_h), \qquad h = 1,\dots,H^{(l)}.$$
The activation function $g^{(l)}(\cdot)$ is typically a nonlinear transformation like the logistic or sigmoid function
$$g^{(l)}(z) = \frac{1}{1 + e^{-z}}.$$

11 Two-layer feed-forward NN For $L = 2$ (i.e. a single hidden layer), the input/output relation is given by
$$\hat{y} = g^{(2)}(a^{(2)}) = g^{(2)}\Bigl(\sum_{k=1}^{H} w^{(2)}_{k} z_k + w^{(2)}_{0} z_0\Bigr)$$
where
$$z_k = g^{(1)}\Bigl(\sum_{j=1}^{n} w^{(1)}_{jk} x_j + w^{(1)}_{0k} x_0\Bigr), \qquad k = 1,\dots,H.$$
Note that if $g^{(1)}(\cdot)$ and $g^{(2)}(\cdot)$ are linear mappings, this functional form becomes linear. Once the number of inputs and the form of the function $g(\cdot)$ are given, two things remain to be chosen: (1) the parameters, i.e. the values of the weights $w^{(l)}$, $l = 1, 2$; (2) the structural hyperparameters, i.e. the number of layers $L$ and the number of hidden nodes $H$.
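
A minimal R sketch of the forward pass of such a network (added for illustration; the weight values and sizes below are arbitrary, and the output activation is taken to be the logistic function, as in the formula above):

# sketch of the forward pass of a one-hidden-layer, single-output network
sigmoid <- function(z) 1 / (1 + exp(-z))

fnn.predict <- function(x, W1, w2) {
  # W1: (n+1) x H matrix of input-to-hidden weights (first row = bias weights)
  # w2: vector of length H+1 of hidden-to-output weights (first element = bias weight)
  z  <- sigmoid(t(W1) %*% c(1, x))   # hidden outputs z_h = g(a_h)
  a2 <- sum(w2 * c(1, z))            # output activation a^(2)
  sigmoid(a2)                        # g^(2)(a^(2)); use the identity here for a linear output unit
}

# example with n = 2 inputs, H = 3 hidden nodes and random weights
set.seed(0)
W1 <- matrix(rnorm(9), 3, 3)         # rows: bias, x1, x2; columns: the 3 hidden nodes
w2 <- rnorm(4)
fnn.predict(c(0.5, -1), W1, w2)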

12 Backpropagation It is an algorithm which, once the number of hidden nodes H is given, estimates the weights $W = \{w^{(l)}_{kh}\}$, $l = 1,\dots,L$, $h = 1,\dots,H^{(l)}$, $k = 0,\dots,H^{(l-1)}$, on the basis of the training set $D_N$. It is a gradient-based algorithm which aims to minimize the cost function
$$\mathrm{MISE}_{\mathrm{emp}}(W) = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N}\sum_{i=1}^{N} \bigl(y_i - \hat{y}(x_i, W)\bigr)^2,$$
where $\hat{W} = \arg\min_W \mathrm{MISE}_{\mathrm{emp}}(W)$ is the optimal set of weights for the given training set. Backprop exploits the network structure in order to compute the gradient recursively.

13 Backpropagation (II) The simplest (and least effective) backprop algorithm is an iterative gradient descent based on the update formula
$$W(k+1) = W(k) - \eta\,\frac{\partial \mathrm{MISE}_{\mathrm{emp}}(W(k))}{\partial W},$$
where $W(k)$ is the weight vector at the $k$th iteration and $\eta$ is the learning rate, which indicates the relative size of the change in weights. The weights are initialized with random values and are changed in a direction that will reduce the error. Some convergence criterion is used to terminate the algorithm. This method is known to be inefficient: many steps are needed to reach a stationary point and no monotone decrease of $\mathrm{MISE}_{\mathrm{emp}}$ is guaranteed. More effective versions of the algorithm are based on the Levenberg-Marquardt algorithm.

14 Backpropagation example Consider a single-input (i.e. $n = 1$), single-output neural network with one hidden layer, two hidden nodes and no bias units. The predictor has the form
$$\hat{y}(x) = g(a^{(2)}(x)) = g\bigl(w^{(2)}_1 z_1 + w^{(2)}_2 z_2\bigr) = g\bigl(w^{(2)}_1 g(a^{(1)}_1) + w^{(2)}_2 g(a^{(1)}_2)\bigr) = g\bigl(w^{(2)}_1 g(w^{(1)}_1 x) + w^{(2)}_2 g(w^{(1)}_2 x)\bigr)$$
where $W = [w^{(1)}_1, w^{(1)}_2, w^{(2)}_1, w^{(2)}_2]$. The backprop algorithm needs the derivatives of $\mathrm{MISE}_{\mathrm{emp}}$ with respect to each weight $w \in W$. Since for each $w \in W$
$$\frac{\partial \mathrm{MISE}_{\mathrm{emp}}}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - \hat{y}(x_i)\bigr) \frac{\partial \hat{y}(x_i)}{\partial w},$$
we focus on $\partial \hat{y}/\partial w$.

15 Backpropagation example. Figure: diagram of the example network, with the input $x$ feeding the two hidden nodes $z_1, z_2$ through the layer-1 weights $w^{(1)}_1, w^{(1)}_2$, and the hidden nodes feeding the layer-2 output $y$ through the weights $w^{(2)}_1, w^{(2)}_2$.

16 Backpropagation example (II) Since $a^{(2)}(x) = w^{(2)}_1 z_1 + w^{(2)}_2 z_2$, we obtain for the weights of the hidden/output layer
$$\frac{\partial \hat{y}(x)}{\partial w^{(2)}_h} = \frac{\partial g}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial w^{(2)}_h} = g'(a^{(2)}(x))\, z_h(x), \qquad h = 1, 2.$$
Since $a^{(1)}_h = w^{(1)}_h x$, we obtain for the weights of the input/hidden layer
$$\frac{\partial \hat{y}(x)}{\partial w^{(1)}_h} = \frac{\partial g}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z_h} \frac{\partial z_h}{\partial a^{(1)}_h} \frac{\partial a^{(1)}_h}{\partial w^{(1)}_h} = g'(a^{(2)}(x))\, w^{(2)}_h\, g'(a^{(1)}_h(x))\, x,$$
where the term $g'(a^{(2)}(x))$ has already been computed for the upper layer. Note that for a sigmoid function $g$
$$g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}.$$
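
The following R sketch (not part of the original slides) implements the four analytic derivatives of this example and checks them against finite differences:

# analytic derivatives of the 4-weight example, checked against finite differences
g  <- function(z) 1 / (1 + exp(-z))          # sigmoid
gp <- function(z) exp(-z) / (1 + exp(-z))^2  # its derivative

yhat <- function(x, W) {                     # W = c(w1_1, w1_2, w2_1, w2_2)
  z <- g(W[1:2] * x)
  g(W[3] * z[1] + W[4] * z[2])
}

grad.yhat <- function(x, W) {
  a1 <- W[1:2] * x; z <- g(a1)
  a2 <- W[3] * z[1] + W[4] * z[2]
  c(gp(a2) * W[3] * gp(a1[1]) * x,           # d yhat / d w1_1
    gp(a2) * W[4] * gp(a1[2]) * x,           # d yhat / d w1_2
    gp(a2) * z[1],                           # d yhat / d w2_1
    gp(a2) * z[2])                           # d yhat / d w2_2
}

set.seed(1); W <- rnorm(4); x <- 0.7
numeric.grad <- sapply(1:4, function(j) {
  e <- rep(0, 4); e[j] <- 1e-6
  (yhat(x, W + e) - yhat(x, W - e)) / 2e-6   # central finite difference
})
round(grad.yhat(x, W) - numeric.grad, 8)     # should be (numerically) zero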

17 Two-layer feed-forward NN Let us consider a two-layer FNN with sigmoidal hidden units. This has proven to be an important class of networks for practical applications. It can be shown that such networks are universal approximators: they can approximate arbitrarily well any continuous functional (one-to-one or many-to-one) mapping from one finite-dimensional space to another, provided the number H of hidden units is sufficiently large. Note that although this result is remarkable, it is of no practical use: no indication is given about the number of hidden nodes to choose for a finite number of samples and a generic nonlinear mapping.

18 R TP: An overfitting example Consider a dataset $D_N = \{x_i, y_i\}$, $i = 1,\dots,N$, where $N = 50$ and $\mathbf{x} \sim \mathcal{N}([1,1,1], I)$ is a 3-dimensional vector. Suppose that $y$ is linked to $\mathbf{x}$ by the input/output relation
$$y = x_1^2 + 4\log(|x_2|) + 5x_3 + w$$
where $x_i$ is the $i$th component of the vector $\mathbf{x} \in \mathbb{R}^3$ and $w$ is a noise term. Consider as non-linear model a single-hidden-layer neural network (implemented by the R package nnet) with s = 25 hidden neurons. The number of neurons is an index of the complexity of the model.

19 We want to estimate the prediction accuracy on a new i.i.d. dataset of $N_{ts} = 50$ samples. Let us train the neural network on the whole training set. The empirical prediction error is
$$\mathrm{MISE}_{\mathrm{emp}} = \frac{1}{N}\sum_{i=1}^{N} \bigl(y_i - h(x_i, \alpha_N)\bigr)^2 = 1.6 \cdot 10^{-6}$$
where $\alpha_N$ is obtained by the parametric identification step. However, if we test $h(\cdot, \alpha_N)$ on the test set we obtain
$$\mathrm{MISE}_{ts} = \frac{1}{N_{ts}}\sum_{i=1}^{N_{ts}} \bigl(y_i - h(x_i, \alpha_N)\bigr)^2 = 3.58.$$
This neural network is seriously overfitting the dataset: the empirical error is a very bad estimate of the MISE.
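
A sketch of this experiment in R is given below. It is not the original course script: the data-generating details (sample sizes, noise level) follow the reconstruction above and are partly assumed, so the error values will not match the figures quoted in the slides exactly.

# sketch of the overfitting experiment (sample sizes and noise level are assumed)
library(nnet)
set.seed(0)
N <- 50; Nts <- 50
genxy <- function(N) {
  X <- matrix(rnorm(N * 3, mean = 1), N, 3)                       # x ~ N([1,1,1], I)
  y <- X[, 1]^2 + 4 * log(abs(X[, 2])) + 5 * X[, 3] + rnorm(N, sd = 0.1)
  data.frame(X, y = y)
}
Dtr <- genxy(N); Dts <- genxy(Nts)

# single-hidden-layer network with s = 25 hidden neurons and linear output
model <- nnet(y ~ ., data = Dtr, size = 25, linout = TRUE, maxit = 500, trace = FALSE)

MISE.emp <- mean((Dtr$y - predict(model, Dtr))^2)   # empirical (training) error
MISE.ts  <- mean((Dts$y - predict(model, Dts))^2)   # error on fresh test data
c(MISE.emp, MISE.ts)          # the training error grossly underestimates the test error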

20 We perform a K-fold cross-validation in order to have a better estimate of the MISE. We put K = 10 (cross-validation is implemented in the R file cv.r). The K = 10 cross-validated estimate of the MISE is MISE_CV = 8.29. This figure is a much more reliable estimation of the prediction accuracy. The leave-one-out estimate (K = N = 50) is MISE_loo = 7.2. The cross-validated estimate could be used to select a better number of hidden neurons.
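
A possible K-fold cross-validation loop, continuing the previous sketch (an illustration, not the course file cv.r):

# K-fold cross-validation of the same model (continues the previous snippet)
K <- 10
folds  <- sample(rep(1:K, length.out = N))     # random assignment of the N points to K folds
cv.err <- numeric(K)
for (k in 1:K) {
  tr <- Dtr[folds != k, ]
  va <- Dtr[folds == k, ]
  m  <- nnet(y ~ ., data = tr, size = 25, linout = TRUE, maxit = 500, trace = FALSE)
  cv.err[k] <- mean((va$y - predict(m, va))^2)
}
MISE.CV <- mean(cv.err)        # cross-validated estimate of the generalization error
MISE.CV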

21 R TP: Bagging against overfitting Consider a dataset $D_N = \{x_i, y_i\}$, $i = 1,\dots,N$, of i.i.d. normally distributed inputs $\mathbf{x} \sim \mathcal{N}([1,1,1], I)$. Suppose that $y$ is linked to $\mathbf{x}$ by the input/output relation
$$y = x_1^2 + 4\log(|x_2|) + 5x_3 + \epsilon$$
where $\epsilon \sim \mathcal{N}(0, 0.25)$ represents the noise. Let us train a single-hidden-layer neural network with s = 5 hidden neurons on the training set. The prediction accuracy on the test set is MISE_ts = 9.95.

22 Let us apply a bagging combination with B = 5 (R file bagging.r). The prediction accuracy on the test set of the bagging predictor improves on that of the single network, which shows that the bagging combination reduces the overfitting of the single neural network. Below is the histogram of the MISE_ts of each bootstrap repetition: the performance of the bagging predictor is much better than the average performance of the individual repetitions. (Figure: histogram; y-axis: frequency.)
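
A sketch of the bagging combination, reusing Dtr, Dts and nnet from the previous snippets (an illustration, not the course file bagging.r; the number of bootstrap replicates B and the network size are placeholders):

# bagging combination of neural networks
bagging.nnet <- function(Dtr, Dts, B = 50, size = 5) {
  preds <- matrix(0, nrow(Dts), B)
  for (b in 1:B) {
    idx <- sample(nrow(Dtr), replace = TRUE)                       # bootstrap resample
    m   <- nnet(y ~ ., data = Dtr[idx, ], size = size,
                linout = TRUE, maxit = 500, trace = FALSE)
    preds[, b] <- predict(m, Dts)
  }
  rowMeans(preds)                                                  # average of the B predictions
}
yhat.bag <- bagging.nnet(Dtr, Dts)
mean((Dts$y - yhat.bag)^2)     # test MISE of the bagging predictor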

23 Decision Trees The use of tree-based learners dates back to the work of Morgan and Sonquist in 1963. A decision tree partitions the input space into mutually exclusive regions, each of which is assigned a specific model. The nodes of a decision tree are classified into internal nodes and terminal nodes. An internal node is a decision-making unit that evaluates a decision function to determine which child node to visit next. A terminal node or leaf has no child nodes and is associated with one of the partitions of the input space. Note that each terminal node has a unique path that leads from the root to itself. In classification trees each terminal node contains a label that indicates the class for the associated input region. In regression trees the terminal node contains a model that specifies the input/output mapping for the corresponding input partition.

24 A binary decision tree. Figure: the root node tests $x_1 < a$; its YES branch leads to a node testing $x_2 < c$ (YES: region R1, NO: region R2), its NO branch to a node testing $x_1 < b$ (YES: region R3, NO: a node testing $x_2 < d$, with YES: region R4 and NO: region R5).

25 Input space partitioning. Figure: the corresponding partition of the $(x_1, x_2)$ input space: for $x_1 < a$ the threshold $c$ on $x_2$ separates R1 (below) from R2 (above); for $a \le x_1 < b$ the region is R3; for $x_1 \ge b$ the threshold $d$ on $x_2$ separates R4 (below) from R5 (above).

26 Regression tree predictor Let $m$ be the number of leaves and $h_j(\cdot, \alpha_j)$, $j = 1,\dots,m$, the input/output model associated with the $j$th leaf. Once a prediction in a query point $x_q \in \mathbb{R}^n$ is required, the query is presented to the root node of the decision tree; depending on the result of the associated decision function, the tree branches to one of the root's children. The procedure is iterated recursively until a leaf $j$ is reached and an input/output model is selected. The returned output will be the value $h_j(x_q, \alpha_j)$. Example: let $x_q = (x_{q1}, x_{q2})$ with $x_{q1} < a$ and $x_{q2} > c$. The predicted output will be $\hat{y}_q = h_2(x_q, \alpha_2)$, where $\alpha_2$ is the vector of parameters of the model localized in region R2.

27 Regression tree learning The learning procedure has two steps, known as tree growing and tree pruning. During tree growing the algorithm makes a succession of splits that partition the training data into disjoint subsets. Starting from the root node, which contains the whole dataset, an exhaustive search is performed to find the split that best reduces a certain cost function. Let us consider a certain node $t$ and let $D(t)$ be the corresponding subset of the original $D_N$. Consider the empirical error of the local model $h_t$ fitting the $N(t)$ data contained in the node $t$:
$$\mathrm{SSE}_{\mathrm{emp}}(t) = \min_{\alpha_t} \sum_{i=1}^{N(t)} L\bigl(y_i, h_t(x_i, \alpha_t)\bigr) \qquad (1)$$
Note that $h_t$ can be any regression model (e.g. constant, linear, ...).

28 Regression tree construction For any possible split $s$ of node $t$ into the two children $t_l$ and $t_r$ (with $N(t_l) + N(t_r) = N(t)$), we define the quantity
$$\Delta E(s, t) = \mathrm{SSE}_{\mathrm{emp}}(t) - \bigl(\mathrm{SSE}_{\mathrm{emp}}(t_l) + \mathrm{SSE}_{\mathrm{emp}}(t_r)\bigr),$$
which represents the decrease of the empirical error due to a further partition of the dataset. The best split is the one that maximizes the decrease,
$$s^{*} = \arg\max_{s} \Delta E(s, t). \qquad (2)$$
Once the best split is attained, the dataset is partitioned into the two disjoint subsets of length $N(t_l)$ and $N(t_r)$, respectively. The same method is recursively applied to all the leaves. The procedure terminates either when the error measure associated with a node falls below a certain tolerance level, or when the error reduction $\Delta E$ resulting from further splitting does not exceed a threshold value.
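
The exhaustive split search can be sketched in R as follows, assuming constant local models so that SSE_emp(t) is the sum of squared deviations from the node mean:

# exhaustive search of the best split of a node, with constant local models
sse <- function(y) sum((y - mean(y))^2)      # SSE_emp of a constant model

best.split <- function(X, y) {
  best <- list(dE = -Inf)
  for (j in 1:ncol(X)) {                     # candidate input variable
    for (thr in sort(unique(X[, j]))[-1]) {  # candidate threshold
      left <- X[, j] < thr
      if (!any(left) || all(left)) next
      dE <- sse(y) - (sse(y[left]) + sse(y[!left]))   # decrease of the empirical error
      if (dE > best$dE) best <- list(variable = j, threshold = thr, dE = dE)
    }
  }
  best
}

# toy usage
set.seed(2)
X <- matrix(runif(200), 100, 2)
y <- ifelse(X[, 1] < 0.4, 1, 3) + rnorm(100, sd = 0.2)
best.split(X, y)               # should select variable 1 with a threshold near 0.4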

29 Tree pruning The tree that the growing procedure yields is typically too large and presents a serious risk of overfitting the dataset. For that reason a pruning procedure is often adopted. Pruning uses a complexity-based measure of the tree performance,
$$R_\lambda(T) = \mathrm{MISE}_{\mathrm{emp}}(T) + \lambda |T|,$$
where $\lambda$ is a parameter that accounts for the tree's complexity and $|T|$ is the number of terminal nodes of the tree $T$. For a fixed $\lambda$ we define $T(\lambda)$ as the tree structure which minimizes the quantity $R_\lambda(T)$. The parameter $\lambda$ is gradually increased in order to generate a sequence of tree configurations with decreasing complexity.

30 Tree pruning For a generic subtree $T_t \subset T$ we have $\mathrm{MISE}_{\mathrm{emp}}(T_t) > \mathrm{MISE}_{\mathrm{emp}}(T)$ and $|T| > |T_t|$, so
$$\mathrm{MISE}_{\mathrm{emp}}(T_t) + \lambda_t |T_t| \le \mathrm{MISE}_{\mathrm{emp}}(T) + \lambda_t |T| \iff \lambda_t \ge \frac{\mathrm{MISE}_{\mathrm{emp}}(T_t) - \mathrm{MISE}_{\mathrm{emp}}(T)}{|T| - |T_t|}. \qquad (3)$$
This means that the subtree $T_t$ is preferable to $T$ when $\lambda_t$ is greater than the above quantity. Therefore we choose, among all the admissible subtrees $T_t$, the one with the smallest term
$$\frac{\mathrm{MISE}_{\mathrm{emp}}(T_t) - \mathrm{MISE}_{\mathrm{emp}}(T)}{|T| - |T_t|}.$$
At the end of the shrinking process we have a sequence of candidate trees which have to be properly assessed (e.g. by cross-validation) in order to perform the structural selection.

31 Random Forests Ensemble learning is efficient when it combines low-bias estimators. Unpruned decision trees are low-bias and high-variance estimators. Random Forest is an ensemble learning technique, proposed by Breiman, which combines bagging and random feature selection. The algorithm consists in: (1) generating by bootstrap a set of B training sets; (2) fitting to each of them a decision tree in which the set of variables considered for each split is a random subset of the original one; (3) returning as final prediction the average of the B predictions. The precision of Random Forest improves by improving the single classifiers and by reducing their correlation.
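
A minimal example using the randomForest R package on the dataset of the previous sketches (the values of ntree and mtry are illustrative):

# a random forest on the regression task of the previous sketches
library(randomForest)
rf <- randomForest(y ~ ., data = Dtr, ntree = 500, mtry = 2)  # mtry = size of the random
                                                              # subset of variables per split
mean((Dts$y - predict(rf, Dts))^2)   # test MISE of the forest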

32 Radial Basis Functions A Radial Basis Function (RBF) network is a modular architecture described by the weighted linear combination
$$y = \sum_{j=1}^{m} \rho_j(x; c_j, B_j)\, h_j$$
where the weights are returned by the activations of $m$ local nonlinear basis functions $\rho_j$ having center $c_j$ and bandwidth $B_j$, and where the term $h_j$ is a constant. The basis or activation function $\rho_j$ is a function $\rho_j : X \to [0,1]$, usually designed so that its value monotonically decreases towards zero as the input point moves away from its center $c_j$. An example of basis function is
$$\rho_j(x; c_j, B_j) = \exp\Bigl(-\frac{\|x - c_j\|^2}{B_j}\Bigr).$$

33 Radial Basis Functions Once we denote by $\eta_j$ the set $\{c_j, B_j\}$ of parameters of the basis function, we have $\rho_j = \rho_j(\cdot, \eta_j)$. If the basis functions $\rho_j$ have localized receptive fields and a limited degree of overlap with their neighbors, the weights $h_j$ can be interpreted as locally piecewise constant models, whose validity for a given input is indicated by the corresponding activation value. The basis function idea arose almost at the same time in different fields and led to similar approaches, often denoted by different names (Radial Basis Functions, Local Model Networks and Neuro-Fuzzy Inference Systems).

34 Fitting of basis functions For a given number $m$ of basis functions, a clustering technique (e.g. K-means or EM) can be adopted to locate the centers and the variances. Figure: an example with $n = 2$ inputs and $m$ basis functions fitted by clustering.

35 Fitting of RBF weights Let us consider a dataset $D_N = \{\langle x_1, y_1\rangle, \langle x_2, y_2\rangle, \dots, \langle x_N, y_N\rangle\}$. Once the parameters $\eta_j$, $j = 1,\dots,m$, are fixed, the parametric identification of the remaining parameters $h_j$ boils down to satisfying the following constraints:
$$y_1 = h_1 \rho_1(x_1, \eta_1) + h_2 \rho_2(x_1, \eta_2) + \dots + h_m \rho_m(x_1, \eta_m)$$
$$y_2 = h_1 \rho_1(x_2, \eta_1) + h_2 \rho_2(x_2, \eta_2) + \dots + h_m \rho_m(x_2, \eta_m)$$
$$\vdots$$
$$y_N = h_1 \rho_1(x_N, \eta_1) + h_2 \rho_2(x_N, \eta_2) + \dots + h_m \rho_m(x_N, \eta_m)$$

36 Fitting of RBF weights The system can be written as $Y = X\beta$, where
$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \qquad X = \begin{bmatrix} \rho_1(x_1, \eta_1) & \dots & \rho_m(x_1, \eta_m) \\ \vdots & & \vdots \\ \rho_1(x_N, \eta_1) & \dots & \rho_m(x_N, \eta_m) \end{bmatrix}, \qquad \beta = [h_1, \dots, h_m]^T.$$
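
A sketch in R of this two-stage fitting (centers by k-means, weights by least squares), assuming Gaussian basis functions with a single shared bandwidth treated as a fixed hyperparameter:

# RBF fitting: centers by k-means, weights h_j by least squares
rbf.fit <- function(X, y, m, bandwidth = 1) {
  centers <- kmeans(X, centers = m)$centers
  # design matrix: one Gaussian basis function per center
  Phi <- apply(centers, 1, function(ctr)
           exp(-rowSums(sweep(X, 2, ctr)^2) / bandwidth))
  beta <- solve(t(Phi) %*% Phi + 1e-8 * diag(m), t(Phi) %*% y)  # least squares (tiny ridge
                                                                # term for numerical stability)
  list(centers = centers, bandwidth = bandwidth, beta = beta)
}

rbf.predict <- function(fit, Xnew) {
  Phi <- apply(fit$centers, 1, function(ctr)
           exp(-rowSums(sweep(Xnew, 2, ctr)^2) / fit$bandwidth))
  as.vector(Phi %*% fit$beta)
}

# toy usage with n = 2 inputs and m = 10 basis functions
set.seed(3)
X <- matrix(runif(200, -2, 2), 100, 2)
y <- sin(X[, 1]) + 0.1 * rnorm(100)
fit <- rbf.fit(X, y, m = 10)
mean((y - rbf.predict(fit, X))^2)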

37 Local Model Networks Local Model Networks (LMN) are a generalized form of basis function network in the sense that the constant weights $h_j$ associated with the basis functions are replaced by local models $h_j(\cdot, \alpha_j)$. The typical form of a LMN is then
$$y = \sum_{j=1}^{m} \rho_j(x, \eta_j)\, h_j(x, \alpha_j),$$
where the $\rho_j$ are constrained to satisfy
$$\sum_{j=1}^{m} \rho_j(x, \eta_j) = 1, \qquad \forall x \in X.$$
This means that the basis functions form a partition of unity. This ensures that every point in the input space has equal weight, so that any variation in the output over the input space is due only to the models $h_j$.

38 Local Model Networks. Figure: an example with three local models $h_j(x)$, their basis functions $\rho_j(x)$, and the resulting model $y = \sum_{j=1}^{3} \rho_j(x) h_j(x)$.

39 Local modeling procedure The learning of a local model in $x_q \in \mathbb{R}^n$ can be summarized in these steps: (1) compute the distance between the query $x_q$ and the training samples according to a predefined metric; (2) rank the neighbors on the basis of their distance to the query; (3) select a subset of the $k$ nearest neighbors according to the bandwidth, which measures the size of the neighborhood; (4) fit a local model (e.g. constant, linear, ...). Each of the local approaches has one or more structural (or smoothing) parameters that control the amount of smoothing performed. Let us focus on the bandwidth selection.
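
These four steps can be sketched in R as follows (local constant and local linear variants; the Euclidean metric and the value of k are illustrative):

# local prediction at a query point x_q with k nearest neighbours
local.predict <- function(X, y, xq, k, model = c("constant", "linear")) {
  model <- match.arg(model)
  d   <- sqrt(rowSums(sweep(X, 2, xq)^2))    # 1. distances between query and training samples
  idx <- order(d)[1:k]                       # 2-3. keep the k nearest neighbours
  if (model == "constant") {
    mean(y[idx])                             # 4a. local constant model
  } else {
    D   <- data.frame(X[idx, , drop = FALSE], y = y[idx])
    fit <- lm(y ~ ., data = D)               # 4b. local linear model
    predict(fit, as.data.frame(t(xq)))
  }
}

# toy usage (X must have column names matching those of xq)
set.seed(4)
X <- matrix(runif(200, -2, 2), 100, 2); colnames(X) <- c("x1", "x2")
y <- sin(X[, 1]) + X[, 2]^2 + 0.1 * rnorm(100)
xq <- c(x1 = 0.3, x2 = -0.5)
c(local.predict(X, y, xq, k = 15, "constant"),
  local.predict(X, y, xq, k = 15, "linear"))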

40 The bandwidth trade-off: overfit. Figure: a too narrow bandwidth around the query point $x_q$ leads to overfitting and a large prediction error $e$. In terms of the bias/variance trade-off, this is typically a situation of high variance.

41 The bandwidth trade-off: underfit. Figure: a too large bandwidth leads to underfitting and a large prediction error $e$. In terms of the bias/variance trade-off, this is typically a situation of high bias.

42 Bias/variance decomposition In the case of a constant local model the prediction in $x_q$ is the quantity
$$h(x_q, \alpha_N) = \frac{1}{k} \sum_{i=1}^{k} y_{[i]},$$
computed by averaging the value of $y$ for the $k$ closest neighbors $x_{[i]}$, $i = 1,\dots,k$, of $x_q$. The bias/variance decomposition takes the form
$$\mathrm{MSE}(x_q) = \sigma_w^2 + \Bigl(\frac{1}{k} \sum_{i=1}^{k} f(x_{[i]}) - f(x_q)\Bigr)^2 + \frac{\sigma_w^2}{k}.$$

43 Bandwidth and bias/variance trade-off. Figure: mean squared error, squared bias and variance plotted against 1/bandwidth, from many neighbors (large bandwidth, underfitting, high bias) to few neighbors (small bandwidth, overfitting, high variance).

44 The PRESS statistic Cross-validation can provide a reliable estimate of the algorithm's generalization error, but it requires the training process to be repeated K times, which sometimes means a large computational effort. In the case of linear models there exists a powerful statistical procedure to compute the leave-one-out cross-validation measure at a reduced computational cost: the PRESS (Prediction Sum of Squares) statistic, a simple formula which returns the leave-one-out (l-o-o) error as a by-product of the least-squares fit.

45 Leave-one-out for linear models. Figure: the leave-one-out error can be computed in two equivalent ways: the slowest way (on the right of the diagram), which repeats N times the procedure of putting the jth sample aside, performing the parametric identification on the remaining N-1 samples and testing on the jth sample; and the fastest way (on the left), which performs the parametric identification only once, on the N samples, followed by the computation of the PRESS statistic.

46 The PRESS statistic This allows a fast cross-validation without repeating N times the leave-one-out procedure. The PRESS procedure can be described as follows: (1) we use the whole training set to estimate the linear regression coefficients
$$\hat{\beta} = (X^T X)^{-1} X^T Y;$$
(2) this step is performed only once on the N samples and returns as a by-product the Hat matrix
$$H = X (X^T X)^{-1} X^T;$$
(3) we compute the residual vector $e$, whose $j$th term is $e_j = y_j - x_j^T \hat{\beta}$; (4) we use the PRESS statistic to compute $e_j^{\mathrm{loo}}$ as
$$e_j^{\mathrm{loo}} = \frac{e_j}{1 - H_{jj}},$$
where $H_{jj}$ is the $j$th diagonal term of the matrix $H$.

47 The PRESS statistic Thus, the leave-one-out estimate of the local mean integrated squared error is
$$\mathrm{MISE}_{\mathrm{LOO}} = \frac{1}{N} \sum_{i=1}^{N} \Bigl(\frac{y_i - \hat{y}_i}{1 - H_{ii}}\Bigr)^2.$$
Note that PRESS is not an approximation of the l-o-o error but simply a faster way of computing it.
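
A short R sketch verifying that the PRESS residuals coincide with the explicitly computed leave-one-out residuals for a toy linear model:

# PRESS residuals vs. explicit leave-one-out residuals for a toy linear model
set.seed(5)
N <- 30
X <- cbind(1, matrix(rnorm(N * 2), N, 2))        # design matrix with an intercept column
y <- X %*% c(1, 2, -1) + rnorm(N, sd = 0.5)

H    <- X %*% solve(t(X) %*% X) %*% t(X)         # hat matrix
e    <- y - H %*% y                              # ordinary residuals e_j
eloo <- e / (1 - diag(H))                        # PRESS (leave-one-out) residuals

# explicit leave-one-out, the slow way
eloo.slow <- sapply(1:N, function(j) {
  b <- solve(t(X[-j, ]) %*% X[-j, ], t(X[-j, ]) %*% y[-j])
  y[j] - X[j, ] %*% b
})

max(abs(eloo - eloo.slow))     # ~ 0: the two computations coincide
mean(eloo^2)                   # MISE_LOO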

48 Selection of the number of neighbours For a given query point $x_q$, we can compute a set of predictions $\hat{y}_q(k) = x_q^T \hat{\beta}(k)$, together with the associated leave-one-out errors $\mathrm{MISE}_{\mathrm{LOO}}(k)$, for a number of neighbors ranging in $[k_{\min}, k_{\max}]$. If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction $\hat{y}_q$ consists in comparing the predictions obtained for each value of $k$ on the basis of the classical mean square error criterion:
$$\hat{y}_q = x_q^T \hat{\beta}(\hat{k}), \qquad \text{with } \hat{k} = \arg\min_k \mathrm{MISE}_{\mathrm{LOO}}(k).$$

49 Local Model combination As an alternative to the winner-takes-all paradigm, we can use a combination of estimates: the final prediction of the value $y_q$ is obtained as a weighted average of the best $b$ models, where $b$ is a parameter of the algorithm. Suppose the predictions $\hat{y}_q(k)$ and the loo errors $\mathrm{MISE}_{\mathrm{LOO}}(k)$ have been ordered, creating a sequence of integers $\{k_i\}$ so that $\mathrm{MISE}_{\mathrm{LOO}}(k_i) \le \mathrm{MISE}_{\mathrm{LOO}}(k_j)$ for $i < j$. The prediction of $\hat{y}_q$ is given by
$$\hat{y}_q = \frac{\sum_{i=1}^{b} \zeta_i \hat{y}_q(k_i)}{\sum_{i=1}^{b} \zeta_i},$$
where the weights are the inverse of the mean square errors: $\zeta_i = 1 / \mathrm{MISE}_{\mathrm{LOO}}(k_i)$.
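
A sketch of the two paradigms (winner-takes-all vs. combination of the b best models), with made-up candidate predictions and loo errors for illustration:

# winner-takes-all vs. combination of the b best local models
combine.local <- function(yhat.k, mise.k, b = 3) {
  winner <- yhat.k[which.min(mise.k)]             # winner-takes-all prediction
  best   <- order(mise.k)[1:b]                    # the b models with the smallest loo error
  zeta   <- 1 / mise.k[best]                      # weights = inverse loo errors
  combo  <- sum(zeta * yhat.k[best]) / sum(zeta)  # weighted average
  c(winner = winner, combination = combo)
}

# made-up candidate predictions for k = 3,...,10 neighbours and their loo errors
yhat.k <- c(2.1, 2.3, 2.2, 2.0, 1.9, 2.4, 2.6, 2.8)
mise.k <- c(1.5, 0.9, 0.7, 0.8, 1.1, 1.6, 2.0, 2.5)
combine.local(yhat.k, mise.k, b = 3)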
