An Implementation and Discussion of Random Forest with Gaussian Process Leaves
- Edwin Wiggins
- 5 years ago
Anonymous Author(s)

Abstract

Stationary Gaussian Process regression assumes that a single correlation structure is appropriate at every location in the input space, and therefore fits piece-wise continuous data sets poorly. Various nonstationary GP models have been developed to solve this problem. Here, we propose to use a random forest for partitioning and Gaussian Process regression on the leaves of the trees to handle piece-wise continuous data sets. This combination takes advantage of the randomization and averaging inherent in Random Forest, the independence gained from binary tree partitioning, and the smooth nonlinear regression achieved by Gaussian Processes to provide a solution for general massive-data regression.

1 Introduction

Gaussian Process regression models are widely adopted in machine learning applications, especially in domains that need prediction, such as earth sciences, planning and computer simulation experiments. Because it is built on a correlation matrix, a Gaussian Process model can capture the smoothness of the objective function and reveal the effect of potentially correlated input dimensions. However, this property causes problems where discontinuity is inherent in the objective function. Nonstationary Gaussian Process models address this by partitioning the input space into regions and fitting each region with an independent GP model. This transforms the regression problem into a partitioning problem, for which a tree structure can return a satisfactory disjoint partition under a specified criterion. As in classification forests, each tree conducts the partitioning independently, based on information gain and limited by the tree structure parameters. Several other partitioning methods are based on input space clustering or on MCMC-based posterior computation.
Compared with those options, our method is simpler in principle and can therefore be more general. Random Forest can mitigate the problems caused by an overfitted tree and effectively reduce the number of input dimensions seen by each Gaussian Process, whose results degrade considerably once more than roughly a dozen input dimensions are fed in. In general, the tree structure partitions the data space into sections that Gaussian Process models can fit better, while the bootstrapping and bagging procedure of Random Forest limits the dimensions each tree passes to its Gaussian Process models and mitigates possible overfitting.
2 Method

The construction of the forest follows the general route of classification forests. The only difference is at the leaves of the trees, where the output is the prediction for the input given by the Gaussian Process model held by the leaf node.

2.1 Gaussian Process

After the trees are constructed, each leaf node holds a Gaussian Process model that is independent of the models of the other nodes. However, as in the combination of linear regression and Random Forest, the information gain is measured at every node during construction to decide which split benefits the most. As with decision trees, we need an entropy that represents the regression quality of the current node. To keep things simple, we use the squared-exponential kernel to model smoothness, with the same length scale parameter for every dimension. Because optimization becomes complex when different dimensions have different length scales, a single shared length scale is still the model adopted in many applications. Before we compute the entropy, we fit the Gaussian Process model for the current node to obtain the best-fitting model parameters and, from them, the optimized correlation matrix. Writing $\delta_f$ for the kernel coefficient, $l$ for the length scale and $\delta_n$ for the noise variance:

$$\kappa(x, x') = \delta_f \exp\left(-\frac{1}{2l^2}\lVert x - x' \rVert^2\right)$$

$$\Sigma = \left[\kappa(x, x')\right]_{x, x' \in X} + \operatorname{diag}(\delta_n)$$

$$\log p(Y \mid X) = -\frac{1}{2} Y^\top \Sigma^{-1} Y - \frac{1}{2}\log\lvert\Sigma\rvert - \frac{N}{2}\log(2\pi)$$

$$\hat{\delta}_f, \hat{l}, \hat{\delta}_n = \arg\max_{\delta_f,\, l,\, \delta_n} \left\{ \log p(Y \mid X) \right\}$$

$$\hat{\Sigma} = \left[\hat{\kappa}(x, x')\right]_{x, x' \in X} + \operatorname{diag}(\hat{\delta}_n)$$

Here we give the differential entropy for node $u$:

$$E(u) = -\int P(Y \mid \mu, \hat{\Sigma}) \log P(Y \mid \mu, \hat{\Sigma})\, dY = \frac{1}{2}\log\left\{(2\pi e)^N \lvert\hat{\Sigma}\rvert\right\}$$

2.2 Binary Tree

The construction of the trees is based on splitting decisions. Here we use information gain to quantify the quality of a split [1]. While a tree grows, a sample of the input dimensions is drawn at each node to decide which dimensions are candidates for splitting. Every break point along these dimensions is tested to measure the corresponding information gain. The node forks only when at least one candidate achieves a positive information gain, and it splits at the threshold that leads to the largest information gain.
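As a rough illustration of Section 2.1, the following sketch fits the three kernel parameters of one node by maximizing the log marginal likelihood and then evaluates the node's differential entropy. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the use of NumPy and SciPy's `minimize` with L-BFGS-B, the optimization in log space and the parameter bounds are all choices made here for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def se_kernel(X, coef, length):
    """Squared-exponential kernel matrix for X of shape (n, dims)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return coef * np.exp(-0.5 * sq_dist / length ** 2)


def covariance(X, coef, length, noise):
    """Sigma = K(X, X) + diag(noise)."""
    return se_kernel(X, coef, length) + noise * np.eye(len(X))


def log_marginal_likelihood(X, y, coef, length, noise):
    """log p(y | X) for a zero-mean GP, computed via a Cholesky factor."""
    n = len(y)
    chol = np.linalg.cholesky(covariance(X, coef, length, noise))
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))  # Sigma^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)


def fit_node(X, y, init=(1.0, 1.0, 1.0)):
    """Maximize the log marginal likelihood over (coef, length, noise).

    The parameters are optimized in log space so they stay positive, and
    they are bounded to [exp(-5), exp(5)]; both are assumptions of this
    sketch rather than details given in the paper.
    """
    def objective(log_params):
        return -log_marginal_likelihood(X, y, *np.exp(log_params))

    result = minimize(objective, np.log(init), method="L-BFGS-B",
                      bounds=[(-5.0, 5.0)] * 3)
    return np.exp(result.x)


def node_entropy(X, coef, length, noise):
    """Differential entropy 0.5 * log((2 * pi * e)^N * |Sigma|)."""
    n = len(X)
    chol = np.linalg.cholesky(covariance(X, coef, length, noise))
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + log_det)
```

Note that for fixed parameters the entropy depends only on the inputs; the outputs influence it through the fitted values returned by `fit_node`, which is why the fit has to precede the entropy computation.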
In the information gain, $N$, $N_{left}$ and $N_{right}$ denote the number of data points in the current node, in the left child node and in the right child node respectively:

$$I = H(u) - \frac{N_{left}}{N} H(u_{left}) - \frac{N_{right}}{N} H(u_{right})$$

Here we use the differential entropy $E(u)$ obtained above as $H(u)$:

$$I = E(u) - \frac{N_{left}}{N} E(u_{left}) - \frac{N_{right}}{N} E(u_{right})$$

2.3 Forest

At the forest level, we only need to define several parameters to decide the forest structure:

$n_{tree}$ : number of trees in the forest
$m$ : maximum depth of each tree
$n_{min}$ : minimum number of data points for leaf nodes
$n_{dim}$ : number of dimensions each split will try
$d$ : number of data points each tree owns

The bootstrapping procedure is done in the same way as in classification forests: each tree is fed $d$ data points sampled uniformly from the whole training data set. The bagging process is also the same as in classification forests, except that the output distribution is averaged over all trees, and each tree's output distribution is computed by Gaussian Process regression instead of by counting labels. The prediction $Y$ for an input $X$ is estimated by averaging the predictions from the $T$ trees:

$$P(Y \mid X) = \frac{1}{T} \sum_{t=1}^{T} P_t(Y \mid X)$$

3 Experiments

So far, our Random Forest Gaussian Process regression model has the following parameters:

$n_{tree}$ : number of trees in the forest
$m$ : maximum depth of each tree
$n_{min}$ : minimum number of data points for leaf nodes
$n_{dim}$ : number of dimensions one split will try
$d$ : number of data points each tree owns
$\mu$ : the mean of the prior multivariate Gaussian distribution
$l$ : the initial value of the length scale parameter in the squared-exponential kernel
$\delta_n$ : the initial value of the noise variance added to the correlation matrix
$\delta_f$ : the initial value of the coefficient in the squared-exponential kernel

In our experiments, all training data from real data sets were normalized to zero mean, so $\mu$ is set to 0 in all experiments. The first five parameters decide the size and structure of the forest. The last three parameters should not affect the result in theory, but in practice a successful optimization of the three parameters in the correlation matrix depends on appropriate initial values. The covariance of two input points $x$ and $x'$ is initialized as $\delta_f \exp\left(-\lVert x - x' \rVert^2 / (2l^2)\right)$, so if $\delta_f$ or $l$ is too small, the optimization may produce the correlation matrix of an independent multivariate Gaussian distribution, in which all correlation between distinct data points is eliminated, and the regression fails.
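The bootstrapping and bagging steps of Section 2.3 can be sketched as follows. If each tree's leaf GP returns a Gaussian prediction, the average $\frac{1}{T}\sum_t P_t(Y \mid X)$ is an equal-weight Gaussian mixture, whose mean and variance follow from the law of total variance. The function names are hypothetical and NumPy is an assumed dependency; the paper does not state whether sampling is with replacement or whether the mixture variance is reported, so both are assumptions of this sketch.

```python
import numpy as np


def bootstrap_indices(n_points, d, rng):
    """Uniformly sample d row indices (with replacement, assumed here)."""
    return rng.integers(0, n_points, size=d)


def combine_tree_predictions(means, variances):
    """Mean and variance of the mixture (1/T) * sum_t N(mean_t, var_t).

    means, variances: array-likes of shape (T, n_test), one row per tree.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mix_mean = means.mean(axis=0)
    # Law of total variance: E[var_t + mean_t^2] - (E[mean_t])^2.
    mix_var = (variances + means ** 2).mean(axis=0) - mix_mean ** 2
    return mix_mean, mix_var


rng = np.random.default_rng(0)
sample = bootstrap_indices(n_points=100, d=10, rng=rng)
mix_mean, mix_var = combine_tree_predictions([[0.0], [2.0]], [[1.0], [1.0]])
```

The mixture variance grows when the trees disagree, so it reflects both the per-leaf GP uncertainty and the spread between trees.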
If $\delta_f$ or $l$ is too large, it is also possible to encounter overflow or a singular matrix because of the numerically approximated gradient and Hessian functions. As for the initial value of the noise variance $\delta_n$, a value that is either too large or too small can easily lead to computational failure. In the following subsections, we compare single Gaussian Process regression with Random Forest Gaussian Process regression on simple synthetic data and discuss the parameters.

3.1 Simple synthetic data

Compared with single Gaussian Process regression, the tree partitioned GP model should be able to recognize discontinuities within the data. To test this, we added faults to sin(x) to make it discontinuous. The Gaussian noise added to the objective function follows N(0, 0.1). Figure 1 compares single GP regression with tree partitioned GP regression and shows that the tree partitioned GP regression gives a much better result than the single GP model. The parameters of the random forest GP model are listed below:

$n_{tree} = 1$
$m = 20$
$n_{min} = 3$
$n_{dim} = 1$
$d = 20$ (all training data)
$l = 1.0$
$\delta = 1.0$ for one of the two $\delta$ parameters; the value of the other is missing from the source

Figure 1: Comparison between single GP regression and tree partitioned GP regression. The left graph shows the result of a single GP model; the right one shows the result of a random forest GP model with only one tree. The red dashed line represents the objective function. The black lines represent the result of the regression. The x marks are training data points.

The random forest GP model used in Figure 1 has only one tree and uses all training data, so it is in fact a tree partitioned GP model. One tree is used because there are so few data points and input dimensions. The random forest GP result also benefited from the small noise. As the noise goes up (Figure 2), the results of both the single GP model and the random forest GP become worse, but the benefit of partitioning is still visible.

Another property we want the model to have is the ability to tell whether a split is beneficial. This is achieved by setting a reasonable information gain threshold. So far, we believe zero is a reasonable choice, since we do not want the overall entropy to go up. In this respect regression models can differ from classification models: for a classification model a split never increases the overall entropy, but for a regression model it can. In Figure 3, we use the continuous sin(x) function to test whether our model wrongly splits it. The result shows that our model decided not to generate child branches.

3.2 Discussion of the parameters

This simple synthetic data set provides a good opportunity to observe the influence of the tree-structure parameters because its output is easy to interpret. We discuss the structure-related parameters $n_{min}$ (minimum points per leaf), $m$ (maximum depth), $d$ (points per tree) and $n_{tree}$ (number of trees). Because there is only one input dimension, we cannot discuss the effect of $n_{dim}$ (dimensions tried per split), and we can expect only a minor effect from $m$ because there are few faults.
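The split rule used here (test every break point, keep the one with the largest information gain, and fork only if that gain exceeds the zero threshold) can be sketched for a one-dimensional input as follows. To keep the example short, the node entropy is passed in as a function, and the demo plugs in a stand-in: the per-point entropy of a single Gaussian fitted to the node's outputs, not the fitted-GP entropy $E(u)$ that the paper uses. All function and parameter names are hypothetical.

```python
import numpy as np


def gaussian_entropy(y, floor=1e-12):
    """Stand-in node entropy: per-point entropy of a Gaussian fitted to y."""
    return 0.5 * np.log(2.0 * np.pi * np.e * max(np.var(y), floor))


def information_gain(y, left_mask, entropy):
    """I = H(u) - N_left/N * H(u_left) - N_right/N * H(u_right)."""
    n = len(y)
    n_left = int(left_mask.sum())
    n_right = n - n_left
    return (entropy(y)
            - n_left / n * entropy(y[left_mask])
            - n_right / n * entropy(y[~left_mask]))


def best_split(x, y, entropy, n_min=3, gain_threshold=0.0):
    """Return (threshold, gain) of the best split, or None if no split
    beats gain_threshold while leaving n_min points in each child."""
    xs = np.sort(x)
    best = None
    for i in range(n_min, len(xs) - n_min + 1):
        if xs[i] == xs[i - 1]:
            continue  # no break point between identical inputs
        threshold = 0.5 * (xs[i - 1] + xs[i])
        gain = information_gain(y, x < threshold, entropy)
        if gain > gain_threshold and (best is None or gain > best[1]):
            best = (threshold, gain)
    return best


# Demo: a noisy step function with a single fault at x = 0.5.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.where(x < 0.5, 0.0, 1.0) + 0.01 * rng.standard_normal(20)
split = best_split(x, y, gaussian_entropy)
```

Passing the entropy as a callable means the same search could be driven by a fitted-GP entropy instead; with the stand-in used here, smooth curves would also be split, so the demo only illustrates the search and the threshold rule.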
In addition, to amplify the influence of these parameters, we increase the number of training points from twenty to forty. Given one tree, $n_{min}$ (the minimum number of points per leaf) largely determines the fineness of the regression. Too large an $n_{min}$ leads to underfitting similar to the single GP model, while too small an $n_{min}$ introduces unnecessary zigzags, i.e. an overfitting regression. However, the maximum depth $m$ can also limit the fineness and can correct the overfitting caused by too small an $n_{min}$. The influence of $m$ is, however, sensitive to the location of the discontinuities in the objective function. For example, if one side of the optimal split point has many faults while the other side has only a few, some values of $m$ can result in a regression that overfits on one half and underfits on the other. The number of trees $n_{tree}$ needs to be considered together with $d$ to enable nontrivial bootstrapping and bagging (since our inputs have only one dimension). The results show that bootstrapping and bagging act as a kind of smoothing. Applying them reduces the chance of splitting on a noisy fluctuation by thinning the data, while keeping the overall trend and the meaningful small fluctuations that most of the trees can observe. Because of bootstrapping and bagging, we can use a smaller $n_{min}$ and a larger $m$ without worrying much about overfitting. However, this is only the advantage that bootstrapping and bagging bring from the data set perspective. The benefits from the data dimension perspective are not covered here.

Figure 2: The influence of noise variance on the regression results of the single GP model (left column) and the random forest GP (tree partitioned GP) model (right column). The noise variances imposed upon the models for the three rows are 0.2, 0.5 and 1.0 respectively.

Figure 3: For a well-formed continuous objective function with non-significant noise, the random forest GP model should recognize that it is unnecessary to partition the data. The left graph is returned from a single GP model; the right one is obtained by our random forest GP model. The black lines are the predictions returned by the models, the red line is the objective function sin(x), and the x marks represent training data points.

4 Application to real-world data sets

In this section, we demonstrate how we apply our random forest Gaussian Process regression model to real-world data sets. Two data sets are used. One is the Canada flu trend data downloaded from Google.org [6]. A regression of this data set might be helpful for flu trend prediction. The other data set contains records of salinity, temperature and oxygen density in a deep water region. The data is drawn from the database of the UBC Earth and Ocean Science department. The goal is to find the relation of oxygen density to temperature and salinity.

4.1 Canada flu trends

The data set records the flu intensity index for nine provinces of Canada every seven days from 2003 till now. We use the records from 2004 to 2012 because they are complete. As for the provinces, we picked the records of Alberta, British Columbia, Saskatchewan, Manitoba, Ontario, Quebec, and Newfoundland and Labrador, seven provinces in total. These seven provinces are located roughly in order from the west coast to the east coast, so it is appropriate to treat them as a location axis. The input data contains two dimensions: date and location. The output is the flu intensity index. We hope to find a function of date and location that models the flu intensity index, which might be helpful for flu prediction.

4.1.1 Manipulating the data

Applying the input data directly to the model ends in a failure of optimization, because $\lVert x - x' \rVert$ is so large that the correlation matrix turns into a diagonal one. Besides, the output vector $Y$ also needs to be normalized to ease the calculation of the log likelihood.
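A small numeric check illustrates this failure mode. The sketch assumes weekly samples with the date measured in days and a squared-exponential kernel with coefficient 1 and length scale 1 (illustrative values, as are the function and variable names), and compares the correlation matrix before and after rescaling the date axis so that neighbouring samples are one unit apart.

```python
import numpy as np


def se_correlation(x, length=1.0):
    """Squared-exponential correlation matrix for a 1-D input vector."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * diff ** 2 / length ** 2)


days = np.arange(0.0, 35.0, 7.0)                 # five weekly samples, in days
raw = se_correlation(days)                       # unscaled date axis
scaled = se_correlation((days - days.mean()) / 7.0)  # centred, unit spacing

off_diag = ~np.eye(len(days), dtype=bool)
print("largest off-diagonal entry, raw   :", raw[off_diag].max())
print("largest off-diagonal entry, scaled:", scaled[off_diag].max())
```

With the raw axis the largest off-diagonal entry is exp(-24.5), on the order of 1e-11, so the matrix is numerically diagonal; after rescaling it is exp(-0.5), about 0.61, and neighbouring samples are meaningfully correlated.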
Both the location and the date axes have their mean values subtracted, and the date axis is divided by seven so that both axes have a unit interval between neighbouring samples. The output vector $Y$ is normalized as well (the normalization formula is missing from the source).

4.1.2 Result

Here we compare the regressions achieved by the single GP and the random forest GP. Figure 4 shows the result. The training set is a sample one quarter the size of the whole data set we used. For the random forest model, we set $n_{tree} = 10$ (number of trees), $m = 10$ (maximum depth), $n_{min} = 5$ (minimum points per leaf), $n_{dim} = 1$ (dimensions tried per split), 191 d = (training data set). From Figure 4, we can see that the random forest GP returned a finer-grained surface in general, while it avoided an abnormally high output for the winter of 2009 in all districts. After checking the records for all of these districts, we found that the single GP regression is correct: there is an apparent increase in all these districts in the winter of 2009. The reason why the random forest GP does not return an obvious peak as the single GP does lies in bootstrapping and bagging. Although there is an obvious peak in most locations, the increase lasted only a short time and occupies only a few records. Many trees of the forest did not receive enough data points to describe the peak, which results in a surface that is not very responsive in that region. Although the output of the random forest GP shows a steadier surface, it is actually slightly overfitting. Possible reasons are too small an $n_{min}$, too large an $m$, or even $d$.

Figure 4: The regression result of the flu trend data set. The left one is returned from a single GP; the right one is returned from our random forest GP. The blue spots displayed in the figure are all the points of the data set. Many of them are hidden by the surface, but the differences between the results are still clear.

4.2 Deep water oxygen density

To find the latent oxygen density function of salinity and temperature, we build the random forest GP model with two input axes, normalized salinity and temperature, and one output axis representing the normalized oxygen density. Figure 5 shows the data set. The results from the single GP and our random forest GP are shown in Figure 6. The training data MSE for the single GP and the random forest GP are [value missing] and [value missing] respectively. The testing data MSE for them are [value missing] and [value missing] respectively. From the result, we can see that the random forest GP model returned a finer-grained surface.

Figure 5: The normalized data of oxygen density, temperature and salinity. Both graphs show the same data set, but from different perspectives.

Figure 6: The results of single GP regression (left) and random forest GP regression (right). Blue circles represent all the data points, and the blue-to-red surfaces represent the prediction for the corresponding input.

The random forest GP regression results in more similar MSE values for the training and testing data sets. The result from the single GP is very likely overfitted and is thus not a reliable prediction from which to generalize the underlying pattern. On this data set, the random forest GP performs better at extracting the latent objective function from very noisy data.
This property is due to the averaging effect of forests. The right graph of Figure 6 comes from a forest with forty trees, a maximum growing depth of ten and a minimum of twenty points per leaf node. Each tree is fed only 1/10 of the whole training data.

5 Conclusion and future work

In this work, a random forest with Gaussian Processes on the leaves is implemented for regression, and the reasoning behind it is provided. Many of the steps are inspired by, and refer to, their classification counterparts, including the calculation of the information gain and the bootstrapping and bagging procedures. Based on the simple synthetic data, some properties of the random forest GP model are demonstrated. We found that using a tree structure to partition the data set enables the model to adapt to data sets that contain discontinuities. The combination of tree partitioning and Gaussian Processes keeps the piece-wise smoothness that reflects the correlation between different points, just as single Gaussian Process regression does, while also introducing discontinuities that cut off the false correlation brought by a kernel function that treats all data points in the same way. When the parameters are set properly, the tree partitioned GP regression gives a better regression for a piece-wise continuous data set. The bootstrapping and bagging process, as a form of randomization and averaging, can improve the robustness of the tree partitioning process when plenty of data points are available. Another potential benefit of random forest is limiting the input dimensions for Gaussian Process regression. Since our data sets have only a few dimensions, we did not cover this part, but the performance on high-dimensional data sets is worth studying. However, many problems are left for future studies. The values of the parameters are very important to the final regression, while their concrete effects and the interactions between them can be complex and subtle.
In general, we observe that a smaller number of trees, a smaller minimum number of points per leaf and a larger maximum growing depth generate a finer-grained result that is prone to overfitting. Moving these parameters in the opposite direction is more likely to generate an underfitting regression. How to adjust these parameters for an ideal result is not answered in this paper.

6 Related work

Nonstationary Gaussian Process regression has been studied for many years and many partitioning strategies have been proposed. Chipman et al. [3] proposed a Bayesian approach to tree (CART) models, and Gramacy et al. [4] augmented the tree model with Gaussian Processes at the leaves. The fitting procedure is conducted with an MCMC algorithm guided entirely by posterior estimation. Although their inference and calculation, which are based on the posterior instead of the likelihood, are more accurate and correct in theory, the practical implementation can be too complex. Kim et al. [2] and Das et al. [5] use other clustering algorithms to carry out the partitioning task. Such pre-processing requires some knowledge about the specific data set and thus might not be a general solution. It is nevertheless a promising direction of study that may well improve the results.

References

[1] A. Criminisi, J. Shotton, and E. Konukoglu (2011). Decision Forests for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Tech. Rep. MSR-TR, Microsoft.
[2] H.-M. Kim, B. K. Mallick, and C. C. Holmes (2005). Analyzing Nonstationary Spatial Data Using Piecewise Gaussian Processes. Journal of the American Statistical Association, 100.
[3] H. Chipman, E. George, and R. McCulloch (1998). Bayesian CART model search (with discussion). Journal of the American Statistical Association, 93.
[4] R. B. Gramacy and H. K. H. Lee (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103.
[5] K. Das and A. Srivastava (2010). Block-GP: Scalable Gaussian Process Regression for Multimodal Data. In the 10th IEEE International Conference on Data Mining, ICDM 2010.
[6] Data source: Google Flu Trends.
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More informationSupervised Learning. Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...
Supervised Learning Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression... Supervised Learning y=f(x): true function (usually not known) D: training
More informationInternational Journal of Software and Web Sciences (IJSWS)
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International
More informationWhy CART Works for Variability-Aware Performance Prediction? An Empirical Study on Performance Distributions
GSDLAB TECHNICAL REPORT Why CART Works for Variability-Aware Performance Prediction? An Empirical Study on Performance Distributions Jianmei Guo, Krzysztof Czarnecki, Sven Apel, Norbert Siegmund, Andrzej