Statistical foundations of machine learning
1 Statistical foundations of machine learning INFO-F-422 Gianluca Bontempi Machine Learning Group Computer Science Department mlg.ulb.ac.be
2 Some algorithms for nonlinear modeling
- Feedforward neural network
- Regression tree
- Neuro-fuzzy inference systems
- Radial basis function
- Local modeling regression
- Support vector machine
- Hierarchical mixtures of experts
3 Global and local approaches
Global models have two properties:
1. they assume that the relationship between the inputs and the output values can be described by an analytical function over the whole input domain;
2. learning is seen as a problem of function estimation: given a set of data, they extract the hypothesis that is expected to best approximate the whole data distribution.
Examples of global models are linear models, nonlinear statistical regressions, and neural networks.
Divide-and-conquer attacks a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem. This principle has two main advantages:
1. simpler problems can be solved with simpler estimation techniques; in statistics this means adopting linear techniques, which have been well studied and developed over the years;
2. the learning method can better adjust to the properties of the available dataset.
4 Modular modeling
The divide-and-conquer idea evolved into two different paradigms: modular modeling and local modeling.
In modular modeling:
- modules cover different parts (operating regimes) of the input space;
- although modular architectures are a combination of local models, their identification is still performed on the basis of the whole dataset. Hence, the learning procedure remains a function estimation problem, with the advantage that the parametric identification can be made simpler by the adoption of local linear modules;
- in terms of structural identification the problem is still nonlinear and requires the same procedures used for generic global models.
Examples are Fuzzy Inference Systems, Radial Basis Functions, Local Model Networks, and Classification and Regression Trees.
5 Local modeling
Local modeling techniques:
- turn the problem of function estimation into a problem of value estimation;
- do not return a complete description of the input/output mapping, but approximate the function in a neighborhood of the point to be predicted;
- adopt linear techniques both in parametric and structural identification, with a gain in analytical tractability and design speed.
6 Artificial neural networks
Artificial neural networks (aka neural nets) are parallel, distributed information-processing computational models which draw their inspiration from neurons in the brain.
The main class of neural network used in supervised learning for classification and regression is the feed-forward network, also known as the multi-layer perceptron (MLP).
Feed-forward ANNs have been applied to a wide range of prediction tasks in fields as diverse as speech recognition, financial prediction, image compression, and adaptive industrial control.
One of the most important trends in recent neural computing has been a move away from a biologically inspired interpretation of neural networks towards a more rigorous and statistically founded interpretation based on results from statistical pattern recognition theory.
7 Feed-forward architecture
- Feed-forward NNs (FNNs) have a layered architecture, with each layer comprising one or more simple processing units called artificial neurons or nodes.
- Each node is connected to one or more other nodes by real-valued weights (in the following we will refer to them as parameters), but not to nodes in the same layer.
- All FNNs have an input layer and an output layer.
- All FNNs have a connectivity structure which is an acyclic graph: no closed loops are admitted.
- FNNs are generally implemented with an additional node, called the bias unit, in all layers except the output layer. It plays the role of the intercept term in linear models.
For simplicity we will consider only FNNs with a single output.
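8 Two-layer feed-forward NN
[Figure: two-layer feed-forward network. The network inputs $x_1,\dots,x_n$ and a bias unit feed the hidden units $z_1,\dots,z_H$ (layer 1) through the weights $w^{(1)}$; the hidden units and a bias unit feed the output $y$ (layer 2) through the weights $w^{(2)}$.]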
9 Feed-forward architecture
Let
- $n$ be the number of inputs, $L$ the number of layers, and $H^{(l)}$ the number of hidden units of the $l$th layer ($l = 1,\dots,L$) of the FNN,
- $w_{kh}^{(l)}$ denote the weight of the link connecting the $k$th node in the $(l-1)$th layer and the $h$th node in the $l$th layer,
- $z_h^{(l)}$, $h = 1,\dots,H^{(l)}$, be the output of the $h$th hidden node of the $l$th layer,
- $z_0^{(l)}$ denote the bias for the $l$th layer, $l = 0,\dots,L-1$.
10 Feed-forward architecture
Let $H^{(0)} = n$ and $z_h^{(0)} = x_h$, $h = 1,\dots,n$.
For $l \ge 1$ the output of the $h$th hidden unit ($h = 1,\dots,H^{(l)}$) of the $l$th layer is obtained by first forming a weighted linear combination of the $H^{(l-1)}$ outputs of the lower level
$$a_h^{(l)} = \sum_{k=1}^{H^{(l-1)}} w_{kh}^{(l)} z_k^{(l-1)} + w_{0h}^{(l)} z_0^{(l-1)}, \qquad h = 1,\dots,H^{(l)}$$
and then by transforming the sum using an activation function to give
$$z_h^{(l)} = g^{(l)}(a_h^{(l)}), \qquad h = 1,\dots,H^{(l)}$$
The activation function $g^{(l)}(\cdot)$ is typically a nonlinear transformation like the logistic or sigmoid function
$$g^{(l)}(z) = \frac{1}{1 + e^{-z}}$$
11 Two-layer feed-forward NN
For $L = 2$ (i.e. a single hidden layer), the input/output relation is given by
$$\hat{y} = g^{(2)}(a_1^{(2)}) = g^{(2)}\left( \sum_{k=1}^{H} w_{k1}^{(2)} z_k + w_{01}^{(2)} z_0 \right)$$
where
$$z_k = g^{(1)}\left( \sum_{j=1}^{n} w_{jk}^{(1)} x_j + w_{0k}^{(1)} x_0 \right), \qquad k = 1,\dots,H$$
Note that if $g^{(1)}(\cdot)$ and $g^{(2)}(\cdot)$ are linear mappings, this functional form becomes linear.
Once the number of inputs and the form of the function $g(\cdot)$ are given, two things remain to be chosen:
1. parameters: the values of the weights $w^{(l)}$, $l = 1, 2$;
2. structural hyperparameters: the number of layers $L$ and of hidden nodes $H$.
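The input/output relation of the two-layer network can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bias units are taken to output 1 (so the bias weights are simply added), the output activation defaults to the identity, and the weight values are made up.

```python
import math

def sigmoid(a):
    """Logistic activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def identity(a):
    return a

def forward(x, w1, b1, w2, b2, g1=sigmoid, g2=identity):
    """Two-layer FNN with n inputs, H hidden nodes and one output.

    w1[j][k]: weight from input j to hidden node k (w_jk^(1))
    b1[k]:    bias weight of hidden node k (w_0k^(1), assuming x_0 = 1)
    w2[k]:    weight from hidden node k to the output (w_k1^(2))
    b2:       bias weight of the output (w_01^(2), assuming z_0 = 1)
    """
    n, H = len(x), len(b1)
    # Hidden layer: weighted sum of the inputs, then activation g1.
    z = [g1(sum(w1[j][k] * x[j] for j in range(n)) + b1[k]) for k in range(H)]
    # Output layer: weighted sum of the hidden outputs, then activation g2.
    a2 = sum(w2[k] * z[k] for k in range(H)) + b2
    return g2(a2)

# Hypothetical weights for n = 2 inputs and H = 3 hidden nodes.
w1 = [[0.5, -1.0, 0.2], [0.3, 0.8, -0.6]]
b1 = [0.0, 0.1, -0.1]
w2 = [1.0, -2.0, 0.5]
b2 = 0.25
y_hat = forward([1.0, 2.0], w1, b1, w2, b2)
```

Passing `g1=identity` makes the whole mapping linear in `x`, which matches the remark on linear activation functions.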
12 Backpropagation
Backpropagation is an algorithm which, once the number of hidden nodes $H$ is given, estimates the weights $W = \{w_{kh}^{(l)}\}$, $l = 1,\dots,L$, $h = 1,\dots,H^{(l)}$, $k = 0,\dots,H^{(l-1)}$, on the basis of the training set $D_N$.
It is a gradient-based algorithm which aims to minimize the cost function
$$\mathrm{MISE}_{emp}(W) = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N} = \frac{\sum_{i=1}^{N} (y_i - \hat{y}(x_i, W))^2}{N}$$
where $\arg\min_W \mathrm{MISE}_{emp}(W)$ is the optimal set of weights for the given training set.
Backprop exploits the network structure in order to compute the gradient recursively.
13 Backpropagation (II)
The simplest (and least effective) backprop algorithm is an iterative gradient descent based on the formula
$$W(k+1) = W(k) - \eta \frac{\partial \mathrm{MISE}_{emp}(W(k))}{\partial W(k)}$$
where $W(k)$ is the weight vector at the $k$th iteration and $\eta$ is the learning rate, which sets the relative size of the change in weights.
The weights are initialized with random values and are changed in a direction that will reduce the error. Some convergence criterion is used to terminate the algorithm.
This method is known to be inefficient: many steps are needed to reach a stationary point and no monotone decrease of $\mathrm{MISE}_{emp}$ is guaranteed.
More effective versions of the algorithm are based on the Levenberg-Marquardt algorithm.
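14 Backpropagation example
Consider a single-input (i.e. $n = 1$), single-output neural network with one hidden layer, two hidden nodes and no bias units. The predictor has the form
$$\hat{y}(x) = g(a_1^{(2)}(x)) = g(w_{11}^{(2)} z_1 + w_{21}^{(2)} z_2) = g(w_{11}^{(2)} g(a_1^{(1)}) + w_{21}^{(2)} g(a_2^{(1)})) = g(w_{11}^{(2)} g(w_{11}^{(1)} x) + w_{21}^{(2)} g(w_{12}^{(1)} x))$$
where $W = [w_{11}^{(1)}, w_{12}^{(1)}, w_{11}^{(2)}, w_{21}^{(2)}]$.
The backprop algorithm needs the derivatives of $\mathrm{MISE}_{emp}$ with respect to each weight $w \in W$. Since for each $w \in W$
$$\frac{\partial \mathrm{MISE}_{emp}}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}(x_i)) \frac{\partial \hat{y}(x_i)}{\partial w}$$
we focus on $\frac{\partial \hat{y}}{\partial w}$.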
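15 Backpropagation example
[Figure: the example network. The network input $x$ feeds the hidden nodes $z_1$ and $z_2$ (layer 1) through the weights $w_{11}^{(1)}$ and $w_{12}^{(1)}$; the hidden nodes feed the output $y$ (layer 2) through the weights $w_{11}^{(2)}$ and $w_{21}^{(2)}$.]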
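16 Backpropagation example (II)
Since $a_1^{(2)}(x) = w_{11}^{(2)} z_1 + w_{21}^{(2)} z_2$, we obtain for the weights of the hidden/output layer
$$\frac{\partial \hat{y}(x)}{\partial w_{h1}^{(2)}} = \frac{\partial g}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial w_{h1}^{(2)}} = g'(a_1^{(2)}(x))\, z_h(x), \qquad h = 1, 2$$
Since $a_h^{(1)} = w_{1h}^{(1)} x$, we obtain for the weights of the input/hidden layer
$$\frac{\partial \hat{y}(x)}{\partial w_{1h}^{(1)}} = \frac{\partial g}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial z_h} \frac{\partial z_h}{\partial a_h^{(1)}} \frac{\partial a_h^{(1)}}{\partial w_{1h}^{(1)}} = g'(a_1^{(2)}(x))\, w_{h1}^{(2)}\, g'(a_h^{(1)}(x))\, x$$
where the term $g'(a_1^{(2)}(x))$ has already been computed for the upper layer.
Note that for a sigmoid function $g$
$$g'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$$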
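The derivatives of this example can be checked numerically. The sketch below implements the four partial derivatives of the prediction, the resulting gradient of the empirical MISE, and a plain gradient descent loop. The data points, the initial weights, the learning rate and the number of steps are all made-up values chosen for illustration.

```python
import math

def g(a):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))

def g_prime(a):
    """Derivative of the sigmoid: exp(-a) / (1 + exp(-a))^2."""
    return math.exp(-a) / (1.0 + math.exp(-a)) ** 2

def predict(x, W):
    # W = [w11^(1), w12^(1), w11^(2), w21^(2)], no bias units.
    w11_1, w12_1, w11_2, w21_2 = W
    z1, z2 = g(w11_1 * x), g(w12_1 * x)
    return g(w11_2 * z1 + w21_2 * z2)

def grad_prediction(x, W):
    """Derivatives of y_hat(x) with respect to the four weights."""
    w11_1, w12_1, w11_2, w21_2 = W
    a1, a2 = w11_1 * x, w12_1 * x
    z1, z2 = g(a1), g(a2)
    d_out = g_prime(w11_2 * z1 + w21_2 * z2)  # term shared with the upper layer
    return [d_out * w11_2 * g_prime(a1) * x,   # input/hidden weights
            d_out * w21_2 * g_prime(a2) * x,
            d_out * z1,                        # hidden/output weights
            d_out * z2]

def mise(data, W):
    return sum((y - predict(x, W)) ** 2 for x, y in data) / len(data)

def grad_mise(data, W):
    """Gradient of the empirical MISE: -(2/N) * sum of error times d y_hat / d w."""
    N = len(data)
    grad = [0.0] * 4
    for x, y in data:
        err = y - predict(x, W)
        for j, d in enumerate(grad_prediction(x, W)):
            grad[j] += -2.0 / N * err * d
    return grad

def gradient_descent(data, W, eta=0.5, steps=200):
    """Iterate W(k+1) = W(k) - eta * gradient for a fixed number of steps."""
    for _ in range(steps):
        W = [w - eta * gw for w, gw in zip(W, grad_mise(data, W))]
    return W

data = [(-1.0, 0.3), (0.0, 0.5), (1.0, 0.7), (2.0, 0.8)]  # made-up samples
W0 = [0.1, -0.2, 0.3, 0.4]                                # made-up initial weights
W_fit = gradient_descent(data, W0)
```

Comparing `grad_mise` with a finite-difference approximation of `mise` is a convenient way to validate the hand-derived formulas before trusting them in training.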
17 Two-layer feed-forward NN
Let us consider a two-layer FNN with sigmoidal hidden units. This has proven to be an important class of network for practical applications.
It can be shown that such networks are universal approximators: they can approximate arbitrarily well any continuous functional mapping (one-one or many-one) from one finite-dimensional space to another, provided the number $H$ of hidden units is sufficiently large.
Note that although this result is remarkable, it is of no practical use: it gives no indication about the number of hidden nodes to choose for a finite number of samples and a generic nonlinear mapping.
18 R TP: An overfitting example
Consider a dataset $D_N = \{x_i, y_i\}$, $i = 1,\dots,N$, where N = 5 and x ∼ N[,,] is a 3-dimensional vector. Suppose that $y$ is linked to $x$ by the input/output relation
$$y = x_1^2 + 4\log(|x_2|) + 5x_3 + w$$
where $x_i$ is the $i$th component of the vector $x \in \mathbb{R}^3$ and Var [w] =..
Consider as nonlinear model a single-hidden-layer neural network (implemented by the R package nnet) with s = 25 hidden neurons. The number of neurons is an index of the complexity of the model.
19 We want to estimate the prediction accuracy on a new i.i.d. dataset of $N_{ts}$ = 5 samples. Let us train the neural network on the whole training set. The empirical prediction MISE error is
$$\mathrm{MISE}_{emp} = \frac{1}{N} \sum_{i=1}^{N} (y_i - h(x_i, \alpha_N))^2$$
=.6 6
where $\alpha_N$ is obtained by the parametric identification step. However, if we test $h(\cdot, \alpha_N)$ on the test set we obtain
$$\mathrm{MISE}_{ts} = \frac{1}{N_{ts}} \sum_{i=1}^{N_{ts}} (y_i - h(x_i, \alpha_N))^2$$
= 3.58
This neural network is seriously overfitting the dataset. The empirical error is a very bad estimate of the MISE.
20 We perform a K-fold cross-validation in order to have a better estimate of the MISE. We put K =. The cross-validation is implemented in the cv.r R file. The K = cross-validated estimate of the MISE is
$\mathrm{MISE}_{CV}$ = 8.29
This figure is a much more reliable estimate of the prediction accuracy. The leave-one-out estimate (K = N = 5) is
$\mathrm{MISE}_{loo}$ = 7.2
The cross-validated estimate could be used to select a better number of hidden neurons.
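The gap between the empirical error and a cross-validated estimate can be reproduced with a small Python sketch. It is not the cv.r script: instead of the nnet neural network it swaps in a 1-nearest-neighbor regressor, which interpolates the training set and therefore overfits by construction, and it uses a made-up one-dimensional dataset.

```python
import random

def fit_1nn(train):
    """Return a predictor that copies the output of the nearest training input."""
    def predict(x):
        return min(train, key=lambda s: abs(s[0] - x))[1]
    return predict

def mise(predict, samples):
    """Mean squared error of a predictor on a list of (x, y) samples."""
    return sum((y - predict(x)) ** 2 for x, y in samples) / len(samples)

def cross_validate(fit, data, K):
    """K-fold cross-validated estimate of the mean squared error."""
    folds = [data[i::K] for i in range(K)]
    total = 0.0
    for i, test in enumerate(folds):
        # Train on all folds except the i-th one, test on the i-th one.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predict = fit(train)
        total += sum((y - predict(x)) ** 2 for x, y in test)
    return total / len(data)

random.seed(0)
xs = [random.uniform(-2, 2) for _ in range(60)]         # made-up inputs
data = [(x, x ** 2 + random.gauss(0, 0.5)) for x in xs]  # made-up noisy outputs

mise_emp = mise(fit_1nn(data), data)                  # error on the training set itself
mise_cv = cross_validate(fit_1nn, data, K=10)         # 10-fold estimate
mise_loo = cross_validate(fit_1nn, data, K=len(data)) # leave-one-out estimate
```

Here `mise_emp` is zero because each training point is its own nearest neighbor, while the cross-validated estimates are strictly positive, which is the same qualitative behavior described for the neural network.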
21 R TP: Bagging against overfitting
Consider a dataset $D_N = \{x_i, y_i\}$, $i = 1,\dots,N$, of N = i.i.d. normally distributed inputs x ∼ N([,, ], I). Suppose that $y$ is linked to $x$ by the input/output relation
$$y = x_1^2 + 4\log(|x_2|) + 5x_3 + \epsilon$$
where ε ∼ N(,.25) represents the noise.
Let us train a single-hidden-layer neural network with s = 5 hidden neurons on the training set. The prediction accuracy on the test set ($N_{ts}$ = ) is $\mathrm{MISE}_{ts}$ = 9.95.
22 Let us apply a bagging combination with B = 5 (R file bagging.r). The prediction accuracy on the test set of the bagging predictor is $\mathrm{MISE}_{ts}$ =
This shows that the bagging combination reduces the overfitting of the single neural network.
[Figure: histogram of the $\mathrm{MISE}_{ts}$ accuracy of each bootstrap repetition (vertical axis: frequency).]
The histogram shows that the performance of the bagging predictor is much better than the average performance.
23 Decision Trees
The use of tree-based learners dates back to the work of Morgan and Sonquist in 1963.
A decision tree partitions the input space into mutually exclusive regions, each of which is assigned a specific model.
The nodes of a decision tree can be classified into internal nodes and terminal nodes.
An internal node is a decision-making unit that evaluates a decision function to determine which child node to visit next.
A terminal node or leaf has no child nodes and is associated with one of the partitions of the input space. Note that each terminal node has a unique path that leads from the root to itself.
In classification trees each terminal node contains a label that indicates the class for the associated input region.
In regression trees the terminal node contains a model that specifies the input/output mapping for the corresponding input partition.
24 A binary decision tree.
[Figure: a binary decision tree whose internal nodes test $x_1 < a$, $x_2 < c$, $x_1 < b$ and $x_2 < d$ (YES/NO branches) and whose leaves are the regions R1 to R5.]
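25 Input space partitioning
[Figure: partition of the input space $(x_1, x_2)$ into the regions R1 to R5, with the thresholds a and b on the $x_1$ axis and c and d on the $x_2$ axis.]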
26 Regression tree predictor
Let $m$ be the number of leaves and $h_j(\cdot, \alpha_j)$, $j = 1,\dots,m$, the input/output model associated with the $j$th leaf.
When a prediction at a query point $x_q \in \mathbb{R}^n$ is required, the query is presented to the root node of the decision tree. Depending on the result of the associated decision function, the tree branches to one of the root's children.
The procedure is iterated recursively until a leaf $j$ is reached and an input/output model is selected. The returned output is the value $h_j(x_q, \alpha_j)$.
Example: let $x_q = (x_{q1}, x_{q2})$ with $x_{q1} < a$ and $x_{q2} > c$. The predicted output is $\hat{y}_q = h_2(x_q, \alpha_2)$, where $\alpha_2$ is the vector of parameters of the model localized in region R2.
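The descent from the root to a leaf can be sketched as follows. The dictionary-based tree layout, the threshold values and the constant leaf models are assumptions made for illustration; the tree is laid out so that a query with $x_{q1} < a$ and $x_{q2} > c$ ends in region R2, as in the example above.

```python
def predict_tree(node, x):
    """Descend from the root until a leaf is reached, then apply its model."""
    while "model" not in node:
        feature, threshold = node["test"]
        # Decision function of an internal node: is x[feature] < threshold?
        node = node["yes"] if x[feature] < threshold else node["no"]
    return node["model"](x)

def leaf(value):
    """Leaf with a constant local model (hypothetical choice of h_j)."""
    return {"model": lambda x: value}

a, b, c, d = 1.0, 2.0, 0.5, 1.5   # made-up thresholds
tree = {
    "test": (0, a),                                            # x1 < a ?
    "yes": {"test": (1, c), "yes": leaf(10.0), "no": leaf(20.0)},  # R1, R2
    "no": {"test": (0, b), "yes": leaf(30.0),                  # R3
           "no": {"test": (1, d), "yes": leaf(40.0), "no": leaf(50.0)}},  # R4, R5
}

# Query with x_q1 < a and x_q2 > c: the prediction comes from the R2 leaf.
y_q = predict_tree(tree, (0.2, 0.9))
```

Replacing the constant returned by `leaf` with any function of `x` gives a tree with non-constant local models in the leaves.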
27 Regression tree learning
The learning procedure has two steps, known as tree growing and tree pruning.
During tree growing the algorithm makes a succession of splits that partition the training data into disjoint subsets. Starting from the root node, which contains the whole dataset, an exhaustive search is performed to find the split that best reduces a certain cost function.
Let us consider a certain node $t$ and let $D(t)$ be the corresponding subset of the original $D_N$. Consider the empirical error of the local model $h_t$ fitting the $N(t)$ data contained in the node $t$:
$$\mathrm{SSE}_{emp}(t) = \min_{\alpha_t} \sum_{i=1}^{N(t)} L(y_i, h_t(x_i, \alpha_t)) \qquad (1)$$
Note that $h_t$ could be any regression model (e.g. constant, linear, ...).
28 Regression tree construction
For any possible split $s$ of node $t$ into the two children $t_r$ and $t_l$, we define the quantity
$$\Delta E(s,t) = \mathrm{SSE}_{emp}(t) - (\mathrm{SSE}_{emp}(t_l) + \mathrm{SSE}_{emp}(t_r)), \qquad \text{with } N(t_r) + N(t_l) = N(t)$$
that represents the decrease of the empirical error due to a further partition of the dataset.
The best split is the one that maximizes the decrease $\Delta E$:
$$s^* = \arg\max_s \Delta E(s,t) \qquad (2)$$
Once the best split is found, the dataset is partitioned into two disjoint subsets of size $N(t_r)$ and $N(t_l)$, respectively. The same method is recursively applied to all the leaves.
The procedure terminates either when the error measure associated with a node falls below a certain tolerance level, or when the error reduction $\Delta E$ resulting from further splitting does not exceed a threshold value.
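A minimal sketch of the exhaustive split search follows, assuming constant local models with a squared-error loss and axis-parallel splits of the form "feature < threshold", with candidate thresholds at the midpoints between consecutive observed values. These choices and the toy dataset are assumptions for illustration.

```python
def sse_constant(ys):
    """Empirical SSE of the best constant model, i.e. the mean of ys."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(X, y):
    """Exhaustive search over features and thresholds.

    Returns (decrease, feature index, threshold); samples with
    x[feature] < threshold go to the left child.
    """
    parent = sse_constant(y)
    best = (0.0, None, None)
    for f in range(len(X[0])):
        values = sorted(set(x[f] for x in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [yi for xi, yi in zip(X, y) if xi[f] < thr]
            right = [yi for xi, yi in zip(X, y) if xi[f] >= thr]
            # Decrease of the empirical error due to the split.
            decrease = parent - (sse_constant(left) + sse_constant(right))
            if decrease > best[0]:
                best = (decrease, f, thr)
    return best

X = [(0.0, 5.0), (1.0, 3.0), (2.0, 4.0), (3.0, 1.0)]  # made-up inputs
y = [1.0, 1.0, 5.0, 5.0]                              # made-up outputs
decrease, feature, threshold = best_split(X, y)
```

On this toy dataset the split on the first feature at 1.5 separates the two output levels perfectly, so it removes the whole parent error.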
29 Tree pruning
The tree that the growing procedure yields is typically too large and presents a serious risk of overfitting the dataset. For that reason a pruning procedure is often adopted.
Pruning uses a complexity-based measure of the tree performance
$$R_\lambda(T) = \mathrm{MISE}_{emp}(T) + \lambda |T|$$
where $\lambda$ is a parameter that accounts for the tree's complexity and $|T|$ is the number of terminal nodes of the tree $T$. For a fixed $\lambda$ we denote by $T(\lambda)$ the tree structure which minimizes the quantity $R_\lambda(T)$.
The parameter $\lambda$ is gradually increased in order to generate a sequence of tree configurations with decreasing complexity.
30 Tree pruning
For a generic subtree $T_t \subset T$, we have $\mathrm{MISE}_{emp}(T_t) > \mathrm{MISE}_{emp}(T)$ and $|T| > |T_t|$, so that
$$\mathrm{MISE}_{emp}(T_t) + \lambda_t |T_t| \le \mathrm{MISE}_{emp}(T) + \lambda_t |T| \iff \lambda_t \ge \frac{\mathrm{MISE}_{emp}(T_t) - \mathrm{MISE}_{emp}(T)}{|T| - |T_t|} \qquad (3)$$
This means that the subtree $T_t$ is preferable to $T$ when $\lambda_t$ is greater than the above quantity.
Therefore we choose among all the admissible subtrees $T_t$ the one with the smallest term
$$\frac{\mathrm{MISE}_{emp}(T_t) - \mathrm{MISE}_{emp}(T)}{|T| - |T_t|}$$
At the end of the shrinking process we have a sequence of candidate trees which have to be properly assessed (e.g. by cross-validation) in order to perform the structural selection.
31 Random Forests
Ensemble learning is effective when it combines low-bias estimators. Unpruned decision trees are low-bias and high-variance estimators.
Random forest is an ensemble learning technique proposed by Breiman which combines bagging and random feature selection. The algorithm consists in:
1. generating by bootstrap a set of B training sets;
2. fitting to each of them a decision tree where the set of variables considered for each split is a random subset of the original one;
3. returning as final prediction the average of the B predictions.
The precision of a random forest improves by improving the single trees and by reducing their correlation.
32 Radial Basis Functions
A Radial Basis Function (RBF) network is a modular architecture described by the weighted linear combination
$$y = \sum_{j=1}^{m} \rho_j(x; c_j, B_j)\, h_j$$
where the weights are returned by the activations of $m$ local nonlinear basis functions $\rho_j$ having center $c_j$ and bandwidth $B_j$, and where the term $h_j$ is a constant.
The basis or activation function $\rho_j$ is a function $\rho_j : X \to [0, 1]$, usually designed so that its value monotonically decreases towards zero as the input point moves away from its center $c_j$.
An example of basis function is:
$$\rho_j(x; c_j, B_j) = \exp\left( -\frac{(x - c_j)^2}{B_j} \right)$$
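As a small illustration, the RBF output for a scalar input can be computed as below, using the exponential basis function given as an example. The centers, bandwidths and constants are made-up values, and restricting the input to one dimension is an assumption made to keep the sketch short.

```python
import math

def rho(x, c, B):
    """Basis function exp(-(x - c)^2 / B): equals 1 at the center, decays to 0."""
    return math.exp(-((x - c) ** 2) / B)

def rbf_predict(x, centers, bandwidths, h):
    """Weighted linear combination of the constants h_j by the activations rho_j."""
    return sum(rho(x, c, B) * hj for c, B, hj in zip(centers, bandwidths, h))

centers = [-1.0, 0.0, 1.0]      # made-up centers c_j
bandwidths = [0.5, 0.5, 0.5]    # made-up bandwidths B_j
h = [2.0, -1.0, 3.0]            # made-up constants h_j
y = rbf_predict(0.0, centers, bandwidths, h)
```

At the query point 0 the middle basis function is fully active and the two neighbors contribute only `exp(-2)` each, which shows how the activations localize the constants.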
33 Radial Basis Functions
Once we denote by $\eta_j$ the set $\{c_j, B_j\}$ of parameters of the basis function, we have
$$\rho_j = \rho_j(\cdot, \eta_j)$$
If the basis functions $\rho_j$ have localized receptive fields and a limited degree of overlap with their neighbors, the weights $h_j$ can be interpreted as locally piecewise constant models, whose validity for a given input is indicated by the corresponding activation function.
The basis function idea arose almost at the same time in different fields and led to similar approaches, often denoted by different names (Radial Basis Functions, Local Model Networks and Neuro-Fuzzy Inference Systems).
34 Fitting of basis functions
For a given number $m$ of basis functions, a clustering technique (e.g. K-means or EM) can be adopted to locate their centers and variances. In the example, note that n = 2 and m =
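35 Fitting of RBF weights
Let us consider a dataset $D_N = \{\langle x_1, y_1 \rangle, \langle x_2, y_2 \rangle, \dots, \langle x_N, y_N \rangle\}$. Once the parameters $\eta_j$, $j = 1,\dots,m$, are fixed, the parametric identification of the remaining parameters $h_j$ boils down to satisfying the following constraints:
$$\begin{aligned} y_1 &= h_1 \rho_1(x_1, \eta_1) + h_2 \rho_2(x_1, \eta_2) + \dots + h_m \rho_m(x_1, \eta_m) \\ y_2 &= h_1 \rho_1(x_2, \eta_1) + h_2 \rho_2(x_2, \eta_2) + \dots + h_m \rho_m(x_2, \eta_m) \\ &\;\;\vdots \\ y_N &= h_1 \rho_1(x_N, \eta_1) + h_2 \rho_2(x_N, \eta_2) + \dots + h_m \rho_m(x_N, \eta_m) \end{aligned}$$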
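36 Fitting of RBF weights
which can be written as
$$Y = X\beta$$
where
$$Y = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}, \qquad X = \begin{bmatrix} \rho_1(x_1, \eta_1) & \dots & \rho_m(x_1, \eta_m) \\ \vdots & & \vdots \\ \rho_1(x_N, \eta_1) & \dots & \rho_m(x_N, \eta_m) \end{bmatrix}, \qquad \beta = [h_1, \dots, h_m]^T$$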
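One common way to satisfy these constraints when N > m is ordinary least squares, which is an assumption here since the slides only state the linear system. The sketch below builds the matrix X for scalar inputs, forms the normal equations and solves them with a small Gaussian elimination routine. The helper `solve`, the exponential basis function and all numerical values are illustrative.

```python
import math

def rho(x, c, B):
    """Basis function exp(-(x - c)^2 / B) for a scalar input."""
    return math.exp(-((x - c) ** 2) / B)

def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, m):
            factor = M[r][col] / M[col][col]
            for k in range(col, m + 1):
                M[r][k] -= factor * M[col][k]
    beta = [0.0] * m
    for r in reversed(range(m)):
        s = sum(M[r][k] * beta[k] for k in range(r + 1, m))
        beta[r] = (M[r][m] - s) / M[r][r]
    return beta

def fit_rbf_weights(xs, ys, centers, bandwidths):
    """Least-squares estimate of beta = [h_1, ..., h_m] in Y = X beta."""
    X = [[rho(x, c, B) for c, B in zip(centers, bandwidths)] for x in xs]
    m = len(centers)
    # Normal equations: (X^T X) beta = X^T Y.
    XtX = [[sum(row[p] * row[q] for row in X) for q in range(m)] for p in range(m)]
    XtY = [sum(row[p] * y for row, y in zip(X, ys)) for p in range(m)]
    return solve(XtX, XtY)

centers, bandwidths = [-1.0, 0.0, 1.0], [0.5, 0.5, 0.5]  # fixed eta_j (made up)
true_h = [2.0, -1.0, 3.0]                                # weights to recover
xs = [-2.0 + 0.25 * i for i in range(17)]
ys = [sum(rho(x, c, B) * hj for c, B, hj in zip(centers, bandwidths, true_h))
      for x in xs]
h_hat = fit_rbf_weights(xs, ys, centers, bandwidths)
```

Because the outputs are generated without noise from the same basis functions, `h_hat` recovers `true_h` up to rounding error; with noisy data it would return the least-squares compromise instead.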
37 Local Model Networks
Local Model Networks (LMN) are a generalized form of Basis Function Network, in the sense that the constant weights $h_j$ associated with the basis functions are replaced by local models $h_j(\cdot, \alpha_j)$.
The typical form of an LMN is then
$$y = \sum_{j=1}^{m} \rho_j(x, \eta_j)\, h_j(x, \alpha_j)$$
where the $\rho_j$ are constrained to satisfy
$$\sum_{j=1}^{m} \rho_j(x, \eta_j) = 1 \qquad \forall x \in X$$
This means that the basis functions form a partition of unity. This ensures that every point in the input space has equal weight, so that any variation in the output over the input space is due only to the models $h_j$.
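One simple way to obtain basis functions that sum to one is to normalize exponential activations by their total, as sketched below. Both the normalization and the choice of linear local models $h_j(x) = a_0 + a_1 x$ are assumptions made for illustration, as are the numerical values.

```python
import math

def normalized_rho(x, centers, bandwidths):
    """Basis functions rescaled so that they sum to 1 at every x."""
    raw = [math.exp(-((x - c) ** 2) / B) for c, B in zip(centers, bandwidths)]
    total = sum(raw)
    return [r / total for r in raw]

def lmn_predict(x, centers, bandwidths, local_models):
    """Combination of linear local models h_j(x) = a0 + a1 * x."""
    weights = normalized_rho(x, centers, bandwidths)
    return sum(w * (a0 + a1 * x) for w, (a0, a1) in zip(weights, local_models))

centers = [-1.0, 0.0, 1.0]       # made-up centers
bandwidths = [0.5, 0.5, 0.5]     # made-up bandwidths
local_models = [(0.0, -1.0), (0.5, 0.0), (0.0, 1.0)]  # made-up (a0, a1) pairs
weights = normalized_rho(0.3, centers, bandwidths)
y = lmn_predict(0.3, centers, bandwidths, local_models)
```

A quick sanity check of the partition of unity: if all local models are identical, the network output coincides with that single model, whatever the basis functions are.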
38 Local Model Networks
[Figure: three panels showing the local models $h_j(x)$, the basis functions $\rho_j(x)$, and the resulting model $y = \sum_{j=1}^{3} \rho_j(x) h_j(x)$.]
39 Local modeling procedure
The learning of a local model in $x_q \in \mathbb{R}^n$ can be summarized in these steps:
1. Compute the distance between the query $x_q$ and the training samples according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select a subset of the $k$ nearest neighbors according to the bandwidth, which measures the size of the neighborhood.
4. Fit a local model (e.g. constant, linear, ...).
Each of the local approaches has one or more structural (or smoothing) parameters that control the amount of smoothing performed.
Let us focus on the bandwidth selection.
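The four steps can be sketched as follows for the simplest case of a constant local model. The Euclidean metric and the toy dataset are assumptions made for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two points (the assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def local_constant_predict(x_q, X, y, k):
    # 1. Compute the distance between the query and the training samples.
    dist = [euclidean(x_q, x) for x in X]
    # 2. Rank the neighbors on the basis of their distance to the query.
    order = sorted(range(len(X)), key=lambda i: dist[i])
    # 3. Select the k nearest neighbors (k plays the role of the bandwidth).
    neighbors = order[:k]
    # 4. Fit a constant local model: the average of the neighbors' outputs.
    return sum(y[i] for i in neighbors) / k

X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]  # made-up inputs
y = [1.0, 2.0, 3.0, 100.0]                            # made-up outputs
y_q = local_constant_predict((0.1, 0.1), X, y, k=3)
```

With `k=3` the distant sample is ignored and the prediction is the average of the three nearby outputs; with `k=4` it would be pulled towards the outlying value, which illustrates the role of the bandwidth.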
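40 The bandwidth trade-off: overfit
[Figure: local fit around the query point q in the (x, y) plane, with the prediction error e.]
Too narrow a bandwidth leads to overfitting and hence to a large prediction error e.
In terms of the bias/variance trade-off, this is typically a situation of high variance.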
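41 The bandwidth trade-off: underfit
[Figure: local fit around the query point q in the (x, y) plane, with the prediction error e.]
Too large a bandwidth leads to underfitting and hence to a large prediction error e.
In terms of the bias/variance trade-off, this is typically a situation of high bias.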
42 Bias/variance decomposition
In the case of a constant local model the prediction in $x_q$ is the quantity
$$h(x_q, \alpha_N) = \frac{1}{k} \sum_{i=1}^{k} y_{[i]}$$
computed by averaging the value of $y$ for the $k$ closest neighbors $x_{[i]}$, $i = 1,\dots,k$, of $x_q$.
The bias/variance decomposition takes the form
$$\mathrm{MSE}(x_q) = \sigma_w^2 + \left( \frac{1}{k} \sum_{i=1}^{k} f(x_{[i]}) - f(x_q) \right)^2 + \frac{\sigma_w^2}{k}$$
43 Bandwidth and bias/variance trade-off
[Figure: mean squared error, bias and variance as functions of 1/bandwidth. Many neighbors correspond to underfitting, few neighbors to overfitting.]
44 The PRESS statistic
Cross-validation can provide a reliable estimate of the generalization error of an algorithm, but it requires the training process to be repeated K times, which sometimes means a large computational effort.
In the case of linear models there exists a powerful statistical procedure to compute the leave-one-out cross-validation measure at a reduced computational cost.
It is the PRESS (Prediction Sum of Squares) statistic, a simple formula which returns the leave-one-out (l-o-o) error as a by-product of the least-squares fit.
45 Leave-one-out for linear models
[Figure: two routes from the training set to the leave-one-out error. Left: parametric identification on N samples followed by the PRESS statistic. Right: N times, put the j-th sample aside, perform the parametric identification on N-1 samples and test on the j-th sample.]
The leave-one-out error can be computed in two equivalent ways: the slowest way (on the right), which repeats the training and test procedure N times; the fastest way (on the left), which performs the parametric identification and the computation of the PRESS statistic only once.
46 The PRESS statistic
This allows a fast cross-validation without repeating the leave-one-out procedure N times. The PRESS procedure can be described as follows:
1. we use the whole training set to estimate the linear regression coefficients $\hat{\beta} = (X^T X)^{-1} X^T Y$;
2. this procedure is performed only once on the N samples and returns as a by-product the Hat matrix $H = X (X^T X)^{-1} X^T$;
3. we compute the residual vector $e$, whose $j$th term is $e_j = y_j - x_j^T \hat{\beta}$;
4. we use the PRESS statistic to compute $e_j^{loo}$ as
$$e_j^{loo} = \frac{e_j}{1 - H_{jj}}$$
where $H_{jj}$ is the $j$th diagonal term of the matrix $H$.
47 The PRESS statistic
Thus, the leave-one-out estimate of the local mean integrated squared error is:
$$\mathrm{MISE}_{LOO} = \frac{1}{N} \sum_{i=1}^{N} \left\{ \frac{y_i - \hat{y}_i}{1 - H_{ii}} \right\}^2$$
Note that PRESS is not an approximation of the loo error but simply a faster way of computing it.
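The equivalence between PRESS and the explicit leave-one-out procedure can be verified numerically. The sketch below restricts itself to a straight-line model $y = b_0 + b_1 x$, for which the diagonal of the Hat matrix has the closed form $H_{jj} = 1/N + (x_j - \bar{x})^2 / \sum_i (x_i - \bar{x})^2$, so that no matrix inversion is needed. The dataset is made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1

def press_residuals(xs, ys):
    """Leave-one-out residuals e_j / (1 - H_jj) from a single fit."""
    n = len(xs)
    b0, b1 = fit_line(xs, ys)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    out = []
    for x, y in zip(xs, ys):
        e = y - (b0 + b1 * x)                 # ordinary residual e_j
        h = 1.0 / n + (x - mx) ** 2 / sxx     # j-th diagonal term of the Hat matrix
        out.append(e / (1.0 - h))
    return out

def explicit_loo_residuals(xs, ys):
    """Leave-one-out residuals obtained by refitting N times (the slow way)."""
    out = []
    for j in range(len(xs)):
        b0, b1 = fit_line(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:])
        out.append(ys[j] - (b0 + b1 * xs[j]))
    return out

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # made-up inputs
ys = [0.1, 1.9, 4.2, 5.8, 8.1, 9.7]   # made-up outputs
e_press = press_residuals(xs, ys)
e_loo = explicit_loo_residuals(xs, ys)
mise_loo = sum(e ** 2 for e in e_press) / len(xs)
```

The two residual vectors agree up to rounding error, while `press_residuals` needs one fit instead of N.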
48 Selection of the number of neighbours
For a given query point $x_q$, we can compute a set of predictions $\hat{y}_q(k) = x_q^T \hat{\beta}(k)$, together with a set of associated leave-one-out errors $\mathrm{MISE}_{LOO}(k)$, for a number of neighbors ranging in $[k_{min}, k_{max}]$.
If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction $\hat{y}_q$ consists in comparing the predictions obtained for each value of $k$ on the basis of the classical mean square error criterion:
$$\hat{y}_q = x_q^T \hat{\beta}(\hat{k}), \qquad \text{with } \hat{k} = \arg\min_k \mathrm{MISE}_{LOO}(k)$$
49 Local Model combination
As an alternative to the winner-takes-all paradigm, we can use a combination of estimates.
The final prediction of the value $y_q$ is obtained as a weighted average of the best $b$ models, where $b$ is a parameter of the algorithm.
Suppose the predictions $\hat{y}_q(k)$ and the loo errors $\mathrm{MISE}_{LOO}(k)$ have been ordered, creating a sequence of integers $\{k_i\}$ so that $\mathrm{MISE}_{LOO}(k_i) \le \mathrm{MISE}_{LOO}(k_j)$, $\forall i < j$. The prediction $\hat{y}_q$ is given by
$$\hat{y}_q = \frac{\sum_{i=1}^{b} \zeta_i\, \hat{y}_q(k_i)}{\sum_{i=1}^{b} \zeta_i}$$
where the weights are the inverse of the mean square errors: $\zeta_i = 1 / \mathrm{MISE}_{LOO}(k_i)$.
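Both ways of extracting the final prediction can be sketched in a few lines. The dictionaries below, which map each candidate number of neighbors k to a prediction and to a leave-one-out error, contain made-up values and stand in for the outputs of the local fits.

```python
def winner_takes_all(predictions, loo_errors):
    """Return (k_hat, prediction) for the k with the smallest loo error."""
    k_hat = min(loo_errors, key=loo_errors.get)
    return k_hat, predictions[k_hat]

def combine_best(predictions, loo_errors, b):
    """Average of the best b predictions, weighted by 1 / MISE_LOO(k)."""
    best = sorted(loo_errors, key=loo_errors.get)[:b]   # k_1, ..., k_b
    weights = [1.0 / loo_errors[k] for k in best]       # zeta_i
    return sum(w * predictions[k] for w, k in zip(weights, best)) / sum(weights)

predictions = {3: 1.0, 4: 1.2, 5: 2.0, 6: 3.0}   # y_hat_q(k), made-up values
loo_errors = {3: 0.5, 4: 0.25, 5: 1.0, 6: 4.0}   # MISE_LOO(k), made-up values

k_hat, y_wta = winner_takes_all(predictions, loo_errors)
y_comb = combine_best(predictions, loo_errors, b=2)
```

With `b=1` the combination reduces to the winner-takes-all prediction, so `b` controls how much the final estimate is smoothed across neighborhood sizes.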
More informationCS6375: Machine Learning Gautam Kunapuli. Mid-Term Review
Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes
More informationApril 3, 2012 T.C. Havens
April 3, 2012 T.C. Havens Different training parameters MLP with different weights, number of layers/nodes, etc. Controls instability of classifiers (local minima) Similar strategies can be used to generate
More informationA Dendrogram. Bioinformatics (Lec 17)
A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and
More informationRadial Basis Function Networks: Algorithms
Radial Basis Function Networks: Algorithms Neural Computation : Lecture 14 John A. Bullinaria, 2015 1. The RBF Mapping 2. The RBF Network Architecture 3. Computational Power of RBF Networks 4. Training
More informationLars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Syllabus Fri. 27.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 3.11. (2) A.1 Linear Regression Fri. 10.11. (3) A.2 Linear Classification Fri. 17.11. (4) A.3 Regularization
More informationFor Monday. Read chapter 18, sections Homework:
For Monday Read chapter 18, sections 10-12 The material in section 8 and 9 is interesting, but we won t take time to cover it this semester Homework: Chapter 18, exercise 25 a-b Program 4 Model Neuron
More informationCSC 411 Lecture 4: Ensembles I
CSC 411 Lecture 4: Ensembles I Roger Grosse, Amir-massoud Farahmand, and Juan Carrasquilla University of Toronto UofT CSC 411: 04-Ensembles I 1 / 22 Overview We ve seen two particular classification algorithms:
More informationLecture #11: The Perceptron
Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be
More informationImage Compression: An Artificial Neural Network Approach
Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and
More informationClassification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska
Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute