
FIGURE 10.1. Schematic of AdaBoost. Classifiers $G_1(x), G_2(x), \ldots, G_M(x)$ are trained on successively reweighted versions of the training sample ($G_1$ on the training sample itself, $G_2, \ldots, G_M$ on weighted samples), and then combined to produce the final prediction $G(x) = \operatorname{sign}\bigl[\sum_{m=1}^{M} \alpha_m G_m(x)\bigr]$.
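
To make the schematic concrete, here is a minimal sketch of the AdaBoost.M1 weighting loop it depicts, assuming a response coded as +/-1 and using scikit-learn decision stumps as the weak learner; the function names (adaboost_m1, adaboost_predict) and the choice of 400 rounds are illustrative, not from the source.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=400):
    # y must be coded as +/-1
    n = len(y)
    w = np.full(n, 1.0 / n)                      # observation weights, start uniform
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-12, 1 - 1e-12)
        alpha = np.log((1 - err) / err)          # weight of this classifier in the vote
        w *= np.exp(alpha * miss)                # upweight the misclassified observations
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(stumps, alphas, X):
    # G(x) = sign( sum_m alpha_m * G_m(x) )
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))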

FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of boosting iterations. Also shown are the test error rates for a single stump and for a 244-node classification tree.
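
A sketch that reproduces the shape of this comparison, assuming the usual reading of example (10.2) as the ten-dimensional "nested spheres" problem (independent Gaussian features, Y = +1 when the squared radius exceeds the chi-squared-10 median of about 9.34) with 2000 training and 10,000 test cases; scikit-learn's AdaBoostClassifier, which boosts depth-1 stumps by default, stands in for the book's Algorithm 10.1.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def nested_spheres(n):
    X = rng.standard_normal((n, 10))
    y = np.where((X ** 2).sum(axis=1) > 9.34, 1, -1)   # 9.34 ~ median of chi-squared(10)
    return X, y

X_tr, y_tr = nested_spheres(2000)
X_te, y_te = nested_spheres(10000)

boost = AdaBoostClassifier(n_estimators=400).fit(X_tr, y_tr)          # boosting with stumps
test_err = [np.mean(p != y_te) for p in boost.staged_predict(X_te)]   # error curve vs. iterations

stump_err = np.mean(DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr).predict(X_te) != y_te)
big_tree_err = np.mean(DecisionTreeClassifier().fit(X_tr, y_tr).predict(X_te) != y_te)  # fully grown tree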

FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss $(1/N)\sum_{i=1}^{N}\exp(-y_i f(x_i))$, as a function of the number of boosting iterations. After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.
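
Both training curves are easy to recompute from the real-valued boosted fit $f(x) = \sum_m \alpha_m G_m(x)$; a small helper, assuming y is coded +/-1 and f holds the fitted values on the training set:

import numpy as np

def training_curves(y, f):
    misclass = np.mean(np.sign(f) != y)   # misclassification rate
    exp_loss = np.mean(np.exp(-y * f))    # average exponential loss (1/N) sum_i exp(-y_i f(x_i))
    return misclass, exp_loss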

FIGURE 10.4. Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is $f$, with class prediction $\operatorname{sign}(f)$. The losses are misclassification, $I(\operatorname{sign}(f) \neq y)$; exponential, $\exp(-yf)$; binomial deviance, $\log(1 + \exp(-2yf))$; squared error, $(y - f)^2$; and support vector, $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point $(0, 1)$.
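
Written as functions of the margin m = y*f (with y = +/-1), the five losses are straightforward to code; here the binomial deviance is divided by log 2 so that, like the others, it passes through (0, 1). A small sketch:

import numpy as np

def misclassification(m): return np.where(m < 0, 1.0, 0.0)              # I(sign(f) != y)
def exponential(m):       return np.exp(-m)                              # exp(-y f)
def binomial_deviance(m): return np.log1p(np.exp(-2 * m)) / np.log(2)    # scaled to equal 1 at m = 0
def squared_error(m):     return (1 - m) ** 2                            # (y - f)^2 when y = +/-1
def support_vector(m):    return np.maximum(0.0, 1 - m)                  # hinge loss (1 - y f)_+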

FIGURE 10.5. A comparison of three loss functions for regression (squared error, absolute error, and Huber), plotted as a function of the margin $y - f$. The Huber loss function combines the good properties of squared-error loss near zero and absolute-error loss when $y - f$ is large.
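
For reference, one standard form of the Huber criterion plotted here, with the transition point $\delta$ chosen so that the two branches join smoothly at $|y - f| = \delta$:

\[
L_\delta(y, f) =
\begin{cases}
(y - f)^2, & |y - f| \le \delta,\\[2pt]
2\delta\,|y - f| - \delta^2, & |y - f| > \delta.
\end{cases}
\]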

FIGURE 10.6. Relative importance of the predictors for the spam data: a bar chart of all the predictors ordered by relative importance (horizontal axis, scaled so that the largest equals 100), with the character and word frequencies !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT and edu among the most important.

FIGURE 10.7. Partial dependence of log-odds of spam on four important predictors (!, remove, edu and hp). The red ticks at the base of the plots are deciles of the input variable.
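
Each curve is an estimate of the partial dependence of the fitted log-odds on a single predictor: with $X_S$ the chosen predictor, $X_C$ the remaining ones, and $f$ the fitted function, the estimate averages over the training-data values of the complement,

\[
\bar f_S(X_S) = \frac{1}{N}\sum_{i=1}^{N} f(X_S, x_{iC}),
\]

so each point on a curve requires a pass over the training set with the plotted variable held fixed.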

FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a function of the joint frequencies of hp and the character !.

FIGURE 10.9. Boosting with different sized trees (stumps, 10-node trees, 100-node trees, and AdaBoost), applied to the example (10.2) used in Figure 10.2; test error is shown as a function of the number of terms. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.

FIGURE 10.10. Coordinate functions $f_1(x_1), f_2(x_2), \ldots, f_{10}(x_{10})$ for the additive logistic trees, estimated by boosting stumps for the simulated example used in Figure 10.9. The true quadratic functions are shown for comparison.

FIGURE 10.11. Test error curves for the simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, with either stumps or six terminal-node trees, and with or without shrinkage (shrinkage 0.2 for the stumps, 0.6 for the six-node trees). The panels show test-set deviance and test-set misclassification error as a function of the number of boosting iterations.

FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity for four-node trees (left panel: test-set deviance; right panel: test-set absolute error). For the curves labeled Sample=0.5, a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right-hand panel using squared-error loss.
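
Figures 10.9, 10.11 and 10.12 vary three tuning knobs of gradient boosting: the size of the trees, the shrinkage (learning rate), and the row-subsampling fraction. In the R gbm package these roughly correspond to interaction.depth, shrinkage and bag.fraction; a scikit-learn sketch with illustrative values (not the settings used for the figures), fit to toy data so the snippet runs on its own:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)  # toy stand-in data

gbm_like = GradientBoostingClassifier(   # default loss is the binomial deviance
    n_estimators=1000,   # number of boosting iterations (trees)
    max_depth=1,         # stumps; deeper trees allow higher-order interactions (cf. Figure 10.9)
    learning_rate=0.1,   # shrinkage nu (cf. Figure 10.11)
    subsample=0.5,       # fit each tree to a fresh 50% subsample, as in the Sample=0.5 curves
    random_state=0,
).fit(X, y)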

FIGURE 10.13. Average absolute error on the training and test sets as a function of the number of iterations M, for the California housing data.
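
A scikit-learn sketch in the spirit of this fit, assuming six-terminal-node trees, shrinkage 0.1 and the Huber loss (a robust choice consistent with the absolute-error summaries shown); the train/test split and the number of iterations are assumptions, and scikit-learn's gradient boosting stands in for the book's MART/gbm fit.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

housing = fetch_california_housing(as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(housing.data, housing.target,
                                          test_size=0.2, random_state=0)

gbr = GradientBoostingRegressor(loss="huber", max_leaf_nodes=6,   # six terminal nodes per tree
                                learning_rate=0.1, n_estimators=800,
                                random_state=0).fit(X_tr, y_tr)

# Average absolute error as a function of the number of iterations M
train_mae = [mean_absolute_error(y_tr, p) for p in gbr.staged_predict(X_tr)]
test_mae = [mean_absolute_error(y_te, p) for p in gbr.staged_predict(X_te)]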

FIGURE 10.14. Relative importance of the predictors (MedInc, Longitude, AveOccup, Latitude, HouseAge, AveRooms, AveBedrms, Population) for the California housing data; horizontal axis: relative importance, 0-100.

FIGURE 10.15. Partial dependence of housing value on the non-location variables (MedInc, AveOccup, HouseAge and AveRooms) for the California housing data. The red ticks at the base of the plots are deciles of the input variables.
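
Relative importances and partial dependence plots like those in Figures 10.14-10.16 can be approximated directly from a fitted scikit-learn model; impurity-based importances and PartialDependenceDisplay are used here as stand-ins for the book's MART diagnostics, and the model settings are illustrative.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

housing = fetch_california_housing(as_frame=True)
X, y = housing.data, housing.target

gbr = GradientBoostingRegressor(max_leaf_nodes=6, learning_rate=0.1,
                                n_estimators=500, random_state=0).fit(X, y)

# Relative importance, rescaled so the largest equals 100 (cf. Figure 10.14)
rel_imp = 100.0 * gbr.feature_importances_ / gbr.feature_importances_.max()
for name, val in sorted(zip(X.columns, rel_imp), key=lambda t: -t[1]):
    print(f"{name:12s} {val:6.1f}")

# Single-variable and two-variable partial dependence (cf. Figures 10.15 and 10.16)
PartialDependenceDisplay.from_estimator(
    gbr, X, features=["MedInc", "AveOccup", "HouseAge", "AveRooms",
                      ("HouseAge", "AveOccup")])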

FIGURE 10.16. Partial dependence of house value on median age (HouseAge) and average occupancy (AveOccup). There appears to be a strong interaction effect between these two variables.

FIGURE 10.17. Partial dependence of median house value on location (longitude and latitude) in California. One unit is $100,000, at 1990 prices, and the values plotted are relative to the overall median of $180,000.

© Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 10.

FIGURE 10.18. Map of New Zealand and its surrounding exclusive economic zone.

FIGURE 10.19. The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (with 1-s.e. bars) and the deviance on the test data; also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves on the test data for the chosen GBM model (vertical line in the left plot) and the GAM model; the AUC is 0.98 for the GBM and 0.97 for the GAM.
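
The model-selection step in the left panel (pick the number of trees that minimises held-out deviance) and the ROC comparison in the right panel can be sketched as follows. The species data are not available here, so make_classification stands in, and a single validation split replaces the 10-fold cross-validation; settings are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=1500, learning_rate=0.05,
                                 random_state=0).fit(X_tr, y_tr)

# Mean binomial deviance (2 x mean negative log-likelihood) on held-out data, per iteration
probs = list(gbm.staged_predict_proba(X_va))
deviance = [2 * log_loss(y_va, p) for p in probs]
best_m = int(np.argmin(deviance)) + 1            # chosen number of trees

# ROC curve and AUC for the chosen model
fpr, tpr, _ = roc_curve(y_va, probs[best_m - 1][:, 1])
auc = roc_auc_score(y_va, probs[best_m - 1][:, 1])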

FIGURE 10.20. The top-left panel shows the relative influence of the predictors (TempResid, AvgDepth, SusPartMatter, SalResid, SSTGrad, ChlaCase2, Slope, TidalCurr, Pentade, CodendSize, DisOrgMatter, Distance, Speed, OrbVel) computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the leading five variables (TempResid, AvgDepth, SusPartMatter, SalResid and SSTGrad), all plotted on the same scale for comparison.

FIGURE 10.21. Geospatial prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.

FIGURE 10.22. Error rate for each occupation (Student, Retired, Prof/Man, Homemaker, Labor, Clerical, Military, Unemployed, Sales) in the demographics data. The overall error rate is 0.425.
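
Per-class error rates like these come straight out of a confusion matrix; a toy multiclass sketch in which make_classification stands in for the nine-occupation demographics data (which are not included here):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           n_classes=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
cm = confusion_matrix(y_te, clf.predict(X_te))         # rows = true classes

overall_error = 1.0 - np.trace(cm) / cm.sum()
per_class_error = 1.0 - np.diag(cm) / cm.sum(axis=1)   # one error rate per class ("occupation")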

FIGURE 10.23. Relative importance of the predictors (age, income, edu, hsld-stat, mar-dlinc, sex, ethnic, mar-stat, typ-home, lang, num-hsld, children, yrs-ba) as averaged over all classes for the demographics data; horizontal axis: relative importance, 0-100.

FIGURE 10.24. Predictor variable importances separately for each of the four classes with lowest error rate for the demographics data: Retired, Student, Prof/Man and Homemaker (each panel is a bar chart of relative importance, 0-100).

FIGURE 10.25. Partial dependence of the odds of three different occupations (Retired, Student and Prof/Man) on age, for the demographics data.