FIGURE 10.1. Schematic of AdaBoost. Classifiers G_1(x), G_2(x), ..., G_M(x) are trained on successively reweighted versions of the dataset, and then combined to produce the final prediction G(x) = sign(sum_{m=1}^M alpha_m G_m(x)).
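The weighting scheme sketched in Figure 10.1 can be made concrete with a minimal AdaBoost.M1 implementation. This is an illustrative sketch, not the book's code: it uses 1-D decision stumps, and the helper names (`fit_stump`, `adaboost`, `predict`) are invented for this example.

```python
# Minimal AdaBoost.M1 sketch (pure Python; hypothetical helper names).
# Each G_m is a decision stump fit to a reweighted sample; the ensemble is
# G(x) = sign(sum_m alpha_m * G_m(x)), as in Figure 10.1.
import math

def fit_stump(x, y, w):
    """Return (threshold, polarity, weighted error) minimizing weighted error."""
    best = (None, 1, float("inf"))
    for t in sorted(set(x)):
        for pol in (1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w)
                      if pol * (1 if xi > t else -1) != yi)
            if err < best[2]:
                best = (t, pol, err)
    return best

def adaboost(x, y, M=10):
    n = len(x)
    w = [1.0 / n] * n                        # uniform initial observation weights
    ensemble = []                            # list of (alpha, threshold, polarity)
    for _ in range(M):
        t, pol, err = fit_stump(x, y, w)
        err = max(err, 1e-12)                # guard against log(0) when separable
        alpha = math.log((1 - err) / err)    # classifier weight alpha_m
        preds = [pol * (1 if xi > t else -1) for xi in x]
        # Up-weight the misclassified points for the next round
        w = [wi * math.exp(alpha * (pi != yi)) for wi, pi, yi in zip(w, preds, y)]
        s = sum(w)
        w = [wi / s for wi in w]
        ensemble.append((alpha, t, pol))
    return ensemble

def predict(ensemble, xi):
    score = sum(a * p * (1 if xi > t else -1) for a, t, p in ensemble)
    return 1 if score >= 0 else -1
```

For example, on six points separable at x = 2.5, the boosted stumps classify the training data correctly after a few rounds.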
FIGURE 10.2. Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and for a 244-node classification tree.
FIGURE 10.3. Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss (1/N) sum_{i=1}^N exp(-y_i f(x_i)). After about 250 iterations the misclassification error is zero, while the exponential loss continues to decrease.
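The phenomenon in Figure 10.3 is easy to reproduce numerically: once every margin y_i f(x_i) is positive the misclassification rate is zero, yet the average exponential loss keeps falling as the margins grow. A small sketch (function names are illustrative):

```python
# Average exponential loss (1/N) sum_i exp(-y_i f(x_i)) and the
# training misclassification rate, for given labels y and scores f.
import math

def avg_exponential_loss(y, f):
    return sum(math.exp(-yi * fi) for yi, fi in zip(y, f)) / len(y)

def misclassification_rate(y, f):
    return sum(1 for yi, fi in zip(y, f) if yi * fi <= 0) / len(y)
```

Doubling all (already positive) margins leaves the misclassification rate at zero but strictly reduces the exponential loss.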
FIGURE 10.4. Loss functions for two-class classification. The response is y = ±1; the prediction is f, with class prediction sign(f). The losses are: misclassification, I(sign(f) != y); exponential, exp(-yf); binomial deviance, log(1 + exp(-2yf)); squared error, (y - f)^2; and support vector, (1 - yf)_+ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).
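The five losses of Figure 10.4 can be written directly as functions of y and f. This sketch applies the same normalization the caption describes — each curve passes through (0, 1) — which for the binomial deviance means dividing by log 2 (an assumption matching the figure's scaling, not a formula stated in the caption):

```python
# The two-class losses of Figure 10.4, as functions of y in {-1,+1} and f.
import math

def misclassification(y, f):
    # I(sign(f) != y); f = 0 is counted as an error here (a convention choice)
    return float(y * f <= 0)

def exponential(y, f):
    return math.exp(-y * f)

def binomial_deviance(y, f):
    # scaled by 1/log(2) so the curve passes through (0, 1)
    return math.log(1 + math.exp(-2 * y * f)) / math.log(2)

def squared_error(y, f):
    return (y - f) ** 2

def support_vector(y, f):
    # hinge loss (1 - yf)_+
    return max(0.0, 1 - y * f)
```

All five evaluate to 1 at margin yf = 0, as in the figure.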
FIGURE 10.5. A comparison of three loss functions for regression, plotted as a function of the margin y - f: squared error, absolute error, and Huber. The Huber loss function combines the good properties of squared-error loss near zero with those of absolute-error loss when |y - f| is large.
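The Huber loss of Figure 10.5 is quadratic for residuals up to a threshold delta and linear beyond it, joined continuously. A sketch with the standard piecewise form (delta = 1 is an arbitrary illustrative choice):

```python
# The three regression losses of Figure 10.5, as functions of r = y - f.
def squared(r):
    return r * r

def absolute(r):
    return abs(r)

def huber(r, delta=1.0):
    # quadratic for |r| <= delta, linear (with matching value) beyond
    if abs(r) <= delta:
        return r * r
    return 2 * delta * abs(r) - delta * delta
```

At |r| = delta the two branches agree (both give delta^2), so the loss is continuous, and beyond delta it grows only linearly, which blunts the influence of outliers.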
FIGURE 10.6. Relative importance of the predictors for the spam data (horizontal axis: relative importance, 0-100). The most important predictors include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, and re.
FIGURE 10.7. Partial dependence of the log-odds of spam on four important predictors: !, remove, edu, and hp. The red ticks at the base of the plots are deciles of the input variable.
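Partial dependence plots like those in Figure 10.7 can be computed for any fitted model by sweeping the variable of interest over a grid and, at each grid value, averaging the model's prediction over the training data for the remaining variables. A minimal sketch (the function name and the toy model are invented for illustration):

```python
# Partial dependence of a model's prediction on feature j:
# fix x_j at each grid value v, average predictions over the data rows.
def partial_dependence(predict, X, j, grid):
    """Return the average prediction as x_j sweeps the grid; X is a list of rows."""
    curve = []
    for v in grid:
        total = 0.0
        for row in X:
            row2 = list(row)
            row2[j] = v          # overwrite feature j, keep the other features
            total += predict(row2)
        curve.append(total / len(X))
    return curve
```

For an additive model such as f(x) = x_0 + 2 x_1, the partial dependence on x_1 recovers the 2 x_1 component shifted by the mean contribution of x_0.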
FIGURE 10.8. Partial dependence of the log-odds of spam vs. email as a function of the joint frequencies of hp and the character !.
FIGURE 10.9. Boosting with different sized trees (stumps, 10-node, 100-node, and AdaBoost), applied to the example (10.2) used in Figure 10.2. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.
FIGURE 10.10. Coordinate functions f_1(x_1), ..., f_10(x_10) estimated by boosting stumps for the simulated example used in Figure 10.9. The true quadratic functions are shown for comparison.
FIGURE 10.11. Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, with either stumps or six terminal-node trees, and with or without shrinkage (0.2 for stumps, 0.6 for six-node trees). The left panels show test-set deviance and the right panels test-set misclassification error, each over 2000 boosting iterations.
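The shrinkage idea behind Figure 10.11 is that each boosting update is scaled by a learning rate nu before being added to the fit: f <- f + nu * h_m, where h_m is a small tree fit to the current residuals. The sketch below uses squared-error loss and a one-split regression stump to keep the code short (the figures themselves use binomial deviance; the names `fit_reg_stump` and `boost` are illustrative):

```python
# Gradient boosting with shrinkage, squared-error loss, 1-D regression stumps.
def fit_reg_stump(x, r):
    """Best single split on 1-D inputs x for residuals r (squared-error loss)."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - ml) ** 2 for ri in left)
               + sum((ri - mr) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def boost(x, y, M=50, nu=0.1):
    """Return the fitted values after M shrunken stump updates."""
    f = [0.0] * len(x)
    for _ in range(M):
        r = [yi - fi for yi, fi in zip(y, f)]        # residuals = negative gradient
        h = fit_reg_stump(x, r)
        f = [fi + nu * h(xi) for fi, xi in zip(f, x)]  # shrunken update
    return f
```

With nu < 1 each iteration moves only part of the way toward the residuals, so more iterations are needed, but (as the figure shows for the deviance) the regularized path typically reaches a better test error.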
FIGURE 10.12. Test-error curves for the simulated example (10.2), showing the effect of stochasticity with four-node trees. For the curves labeled Sample=0.5, a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right-hand panel, using squared-error loss.
FIGURE 10.13. Training and test average absolute error as a function of the number of iterations M for the California housing data.
FIGURE 10.14. Relative importance of the predictors for the California housing data.
FIGURE 10.15. Partial dependence of housing value on the nonlocation variables for the California housing data: MedInc, AveOccup, HouseAge, and AveRooms. The red ticks at the base of the plots are deciles of the input variables.
FIGURE 10.16. Partial dependence of house value on median age (HouseAge) and average occupancy (AveOccup). There appears to be a strong interaction effect between these two variables.
FIGURE 10.17. Partial dependence of median house value on location (latitude and longitude) in California. One unit is $100,000, at 1990 prices, and the values plotted are relative to the overall median of $180,000.
© Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman, 2009, Chap. 10. FIGURE 10.18. Map of New Zealand and its surrounding ...
FIGURE 10.19. The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (with 1-s.e. bars) and test deviance on the test data; also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves (sensitivity vs. specificity) on the test data for the chosen GBM model (vertical line in the left plot) and the GAM model, with AUC 0.98 and 0.97 respectively.
FIGURE 10.20. The top-left panel shows the relative influence computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the leading five variables (TempResid, AvgDepth, SusPartMatter, SalResid, and SSTGrad), all plotted on the same scale for comparison.
FIGURE 10.21. Geospatial prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.
FIGURE 10.22. Error rate for each occupation (Student, Retired, Prof/Man, Homemaker, Labor, Clerical, Military, Unemployed, Sales) in the demographics data. The overall error rate is 0.425.
FIGURE 10.23. Relative importance of the predictors, averaged over all classes, for the demographics data.
FIGURE 10.24. Predictor variable importances, separately for each of the four classes with lowest error rate (Retired, Student, Prof/Man, and Homemaker), for the demographics data.
FIGURE 10.25. Partial dependence of the odds of three different occupations (Retired, Student, and Prof/Man) on age, for the demographics data.