Weighted Sample. Weighted Sample. Weighted Sample. Training Sample


1 [Diagram: the training sample yields $G_1(x)$; successive weighted samples yield $G_2(x), G_3(x), \ldots, G_M(x)$; the final classifier is $G(x) = \operatorname{sign}\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.] FIGURE Schematic of AdaBoost. Classifiers are trained on weighted versions of the dataset, and then combined to produce a final prediction.
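The weighted-sample scheme in this schematic can be sketched in code. The following is a minimal sketch, assuming the standard AdaBoost.M1 weight update (the caption itself states only the final combination rule) and decision stumps as base classifiers. The function names `fit_stump`, `stump_predict`, `adaboost`, and `adaboost_predict` are hypothetical, and labels are assumed to be coded as -1/+1.

```python
import numpy as np

def fit_stump(X, y, w):
    """Return (feature, threshold, polarity) with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, -s, s)
                err = np.sum(w[pred != y])
                if err < best[3]:
                    best = (j, t, s, err)
    return best[:3]

def stump_predict(stump, X):
    j, t, s = stump
    return np.where(X[:, j] <= t, -s, s)

def adaboost(X, y, M):
    """Fit M stumps on successively reweighted versions of (X, y)."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start from uniform weights
    stumps, alphas = [], []
    for _ in range(M):
        stump = fit_stump(X, y, w)
        miss = stump_predict(stump, X) != y
        # Weighted error, clipped so the log below stays finite.
        err = np.clip(np.sum(w[miss]) / np.sum(w), 1e-12, 1 - 1e-12)
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * miss)     # up-weight misclassified points
        w /= w.sum()                     # renormalising does not change err
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """G(x) = sign(sum_m alpha_m G_m(x)); ties are broken towards +1."""
    score = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.where(score >= 0, 1, -1)
```

Calling `adaboost(X, y, M)` and then `adaboost_predict(stumps, alphas, X)` reproduces the flow in the diagram: each stump sees a reweighted sample, and the weighted vote gives the final classifier.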

2 [Plot: test error vs. boosting iterations, with reference levels for a single stump and a 244-node tree.] FIGURE Simulated data (10.2): test error rate for boosting with stumps, as a function of the number of iterations. Also shown are the test error rates for a single stump and a 244-node classification tree.

3 [Plot: training error vs. boosting iterations, with curves for exponential loss and misclassification rate.] FIGURE Simulated data, boosting with stumps: misclassification error rate on the training set, and average exponential loss: $(1/N)\sum_{i=1}^{N} \exp(-y_i f(x_i))$. After about 250 iterations, the misclassification error is zero, while the exponential loss continues to decrease.
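A small numerical illustration of why the two curves can separate: once every margin $y_i f(x_i)$ is positive, the misclassification rate is already zero, but scaling $f$ up still shrinks the average exponential loss. This is only a sketch on made-up values; the arrays `y` and `f` and the function names are hypothetical.

```python
import numpy as np

def misclassification_rate(y, f):
    """Fraction of points where sign(f) disagrees with y (y in {-1, +1})."""
    return np.mean(np.sign(f) != y)

def mean_exponential_loss(y, f):
    """Average exponential loss (1/N) * sum_i exp(-y_i f(x_i))."""
    return np.mean(np.exp(-y * f))

# Made-up labels and scores; every margin y * f is positive.
y = np.array([1, -1, 1, -1])
f = np.array([0.5, -0.2, 1.0, -0.1])

for scale in (1, 2, 4):
    print(scale,
          misclassification_rate(y, scale * f),
          mean_exponential_loss(y, scale * f))
```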

4 [Plot: loss vs. margin $y \cdot f$, with curves for misclassification, exponential, binomial deviance, squared error, and support vector.] FIGURE Loss functions for two-class classification. The response is $y = \pm 1$; the prediction is $f$, with class prediction $\operatorname{sign}(f)$. The losses are misclassification: $I(\operatorname{sign}(f) \neq y)$; exponential: $\exp(-yf)$; binomial deviance: $\log(1 + \exp(-2yf))$; squared error: $(y - f)^2$; and support vector: $(1 - yf)_+$ (see Section 12.3). Each function has been scaled so that it passes through the point (0, 1).
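The losses in this caption can all be written as functions of the margin $m = yf$. The sketch below makes two assumptions that the caption does not spell out: the binomial deviance is divided by $\log 2$ so that it passes through (0, 1), and the misclassification indicator is taken as $I(m \le 0)$, which corresponds to $\operatorname{sign}(0) = 0$. Squared error uses $(y - f)^2 = (1 - yf)^2$, which holds because $y^2 = 1$. The function name `classification_losses` is hypothetical.

```python
import numpy as np

def classification_losses(margin):
    """Evaluate each two-class loss at margin m = y * f, with y in {-1, +1}."""
    m = np.asarray(margin, dtype=float)
    return {
        "misclassification": (m <= 0).astype(float),
        "exponential": np.exp(-m),
        # Rescaled by log(2) so the curve passes through (0, 1).
        "binomial_deviance": np.log1p(np.exp(-2 * m)) / np.log(2),
        "squared_error": (1 - m) ** 2,
        "support_vector": np.maximum(0.0, 1 - m),
    }

# Evaluate on a few margins, including the common point m = 0.
for name, values in classification_losses([-1.0, 0.0, 2.0]).items():
    print(name, values)
```

Note that squared error is the only loss here that increases again for large positive margins.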

5 [Plot: loss vs. $y - f$, with curves for squared error, absolute error, and Huber.] FIGURE A comparison of three loss functions for regression, plotted as a function of the margin $y - f$. The Huber loss function combines the good properties of squared-error loss near zero and absolute error loss when $|y - f|$ is large.
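The behaviour described in this caption can be sketched as follows. The caption gives no formula, so this assumes one common parameterisation of the Huber loss: quadratic for $|y - f| \le \delta$ and linear beyond, matched so that the two pieces join continuously. The threshold `delta` and the function name `huber_loss` are assumptions for illustration.

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    """Squared error for small residuals, linear growth for large ones."""
    r = np.abs(np.asarray(y, dtype=float) - np.asarray(f, dtype=float))
    return np.where(r <= delta, r ** 2, 2 * delta * r - delta ** 2)

residuals = np.array([0.0, 0.5, 1.0, 3.0])
print(huber_loss(residuals, 0.0))   # Huber
print(residuals ** 2)               # squared error, for comparison
print(np.abs(residuals))            # absolute error, for comparison
```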

6 [Bar chart: relative importance of the predictors. Bar labels as extracted: 3d, addresses, labs, telnet, direct, table, cs, 85, parts, #, credit, lab, [, conference, report, original, data, project, font, make, address, order, all, hpl, technology, people, pm, mail, over, 650, meeting, ;, 000, internet, receive, business, re, (, 1999, will, money, our, you, edu, CAPTOT, george, CAPMAX, your, CAPAVE, free, remove, hp, $, !.]

7 [Four panels, each showing partial dependence against one predictor: !, remove, edu, hp.] FIGURE Partial dependence of log-odds of spam on four important predictors. The red ticks at the base of the plots are deciles of the input variable.
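A partial dependence curve of this kind can be approximated by fixing one predictor at a grid value for every training row and averaging the model's predictions. The sketch below assumes a generic `predict` callable that maps a 2-D array to a 1-D array of predictions (for example log-odds); the names `partial_dependence` and `deciles` are hypothetical helpers, not part of any particular library.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average prediction with column `feature` forced to each grid value."""
    values = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v          # overwrite one predictor for all rows
        values.append(predict(Xv).mean())
    return np.array(values)

def deciles(x):
    """The nine interior deciles, like the tick marks at the plot base."""
    return np.percentile(x, np.arange(10, 100, 10))

# Toy model and data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
toy_predict = lambda Z: 2 * Z[:, 0] + Z[:, 1] ** 2
grid_demo = np.linspace(-2, 2, 5)
print(partial_dependence(toy_predict, X_demo, 0, grid_demo))
print(deciles(X_demo[:, 0]))
```

For an additive model such as the toy one, the curve equals the chosen predictor's own term plus a constant.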

8 [Surface plot over the two predictors ! and hp.] FIGURE Partial dependence of the log-odds of spam vs. email as a function of joint frequencies of hp and the character !.

9 [Plot: test error vs. number of terms, with curves for stumps, 10-node trees, 100-node trees, and AdaBoost.] FIGURE Boosting with different sized trees, applied to the simulated example (10.2) used in an earlier figure. Since the generative model is additive, stumps perform the best. The boosting algorithm used the binomial deviance loss in Algorithm 10.3; shown for comparison is the AdaBoost Algorithm 10.1.

10 [Panel title: Coordinate Functions for Additive Logistic Trees. Ten panels: $f_1(x_1), f_2(x_2), \ldots, f_{10}(x_{10})$.] FIGURE Coordinate functions estimated by boosting stumps for the simulated example used in an earlier figure. The true quadratic functions are shown for comparison.

11 [Four panels, each plotted against boosting iterations: Stumps Deviance and Stumps Misclassification Error (curves: no shrinkage vs. shrinkage; the surviving legend value is 0.2), 6-Node Trees Deviance and 6-Node Trees Misclassification Error (curves: no shrinkage vs. shrinkage; the surviving legend value is 0.6). Vertical axes: test set deviance and test set misclassification error.] FIGURE Test error curves for simulated example (10.2) of Figure 10.9, using gradient boosting (MART). The models were trained using binomial deviance, either stumps or six terminal-node trees, and

12 [Panel title: 4-Node Trees. Two panels plotted against boosting iterations: Deviance (test set deviance) and Absolute Error (test set absolute error). Curves: no shrinkage; Shrink=0.1; Sample=0.5; Shrink=0.1 with subsampling.] FIGURE Test-error curves for the simulated example (10.2), showing the effect of stochasticity. For the curves labeled Sample=0.5, a different 50% subsample of the training data was used each time a tree was grown. In the left panel the models were fit by gbm using a binomial deviance loss function; in the right-hand panel using squared-error loss.
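The two regularisation devices named in these curves, shrinkage and subsampling, can be sketched in a toy gradient-boosting loop. This is a simplified sketch, not the gbm implementation: it assumes squared-error loss (so the negative gradient is the residual), a single predictor, and regression stumps instead of 4-node trees. The names `fit_regression_stump` and `stochastic_gradient_boost` and the parameters `shrinkage` and `sample` are hypothetical.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Best single split of 1-D x for residuals r: (threshold, left, right)."""
    best = None
    for t in np.unique(x)[:-1]:          # keep both sides non-empty
        left = x <= t
        cl, cr = r[left].mean(), r[~left].mean()
        sse = ((r[left] - cl) ** 2).sum() + ((r[~left] - cr) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, cl, cr)
    return best[1:]

def stochastic_gradient_boost(x, y, n_trees=100, shrinkage=0.1,
                              sample=0.5, seed=0):
    """Boost stumps on residuals; each stump sees a fresh random subsample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    f = np.full(n, y.mean())             # initial constant fit
    stumps = []
    for _ in range(n_trees):
        size = max(2, int(sample * n))
        idx = rng.choice(n, size=size, replace=False)
        t, cl, cr = fit_regression_stump(x[idx], y[idx] - f[idx])
        f += shrinkage * np.where(x <= t, cl, cr)   # shrunken update
        stumps.append((t, cl, cr))
    return stumps, f

x_demo = np.linspace(0.0, 1.0, 40)
y_demo = (x_demo > 0.5).astype(float)
_, fitted = stochastic_gradient_boost(x_demo, y_demo)
print(np.mean((y_demo - fitted) ** 2))   # training mean squared error
```

Setting `shrinkage=1.0` and `sample=1.0` recovers plain boosting with no regularisation, which corresponds to the "no shrinkage" curves.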

13 [Panel title: Training and Test Absolute Error. Plot: absolute error vs. iterations M, with curves for train error and test error.] FIGURE Average-absolute error as a function of number of iterations for the California housing data.

14 [Bar chart: relative importance. Bar labels as extracted: Population, AveBedrms, AveRooms, HouseAge, Latitude, AveOccup, Longitude, MedInc.] FIGURE Relative importance of the predictors for the California housing data.

15 [Four panels, each showing partial dependence against one predictor: MedInc, AveOccup, HouseAge, AveRooms.] FIGURE Partial dependence of housing value on the nonlocation variables for the California housing data. The red ticks at the base of the plot are deciles of the input variables.

16 [Surface plot over the two predictors HouseAge and AveOccup.] FIGURE Partial dependence of house value on median age and average occupancy. There appears to be a strong interaction effect between these two variables.

17 [Map plot with axes Latitude and Longitude.] FIGURE Partial dependence of median house value on location in California. One unit is $100,000, at 1990 prices, and the values plotted are relative to the overall median of $180,000.

18 © Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 10. FIGURE Map of New Zealand and its sur-

19 [Left panel: mean deviance vs. number of trees, with curves GBM Test, GBM CV, and GAM Test. Right panel: sensitivity vs. specificity, with AUC legend entries for GAM (0.97) and GBM.] FIGURE The left panel shows the mean deviance as a function of the number of trees for the GBM logistic regression model fit to the presence/absence data. Shown are 10-fold cross-validation on the training data (and 1 s.e. bars), and test deviance on the test data. Also shown for comparison is the test deviance using a GAM model with 8 df for each term. The right panel shows ROC curves on the test data for the chosen GBM model (vertical line in left plot) and the GAM model.

20 [Top-left panel: relative influence, with bar labels as extracted: TempResid, AvgDepth, SusPartMatter, SalResid, SSTGrad, ChlaCase2, Slope, TidalCurr, Pentade, CodendSize, DisOrgMatter, Distance, Speed, OrbVel. Remaining panels: f(TempResid), f(AvgDepth), f(SusPartMatter), f(SalResid), f(SSTGrad), each plotted against its own variable.] FIGURE The top-left panel shows the relative influence computed from the GBM logistic regression model. The remaining panels show the partial dependence plots for the leading five variables, all plotted on the same scale for comparison.

21 FIGURE Geological prediction maps of the presence probability (left map) and catch size (right map) obtained from the gradient boosted models.

22 [Bar chart: error rate by occupation, with panel title "Overall Error Rate =". Bar labels as extracted: Student, Retired, Prof/Man, Homemaker, Labor, Clerical, Military, Unemployed, Sales.] FIGURE Error rate for each occupation in the demographics data.

23 [Bar chart: relative importance. Bar labels as extracted: yrs-ba, children, num-hsld, lang, typ-home, mar-stat, ethnic, sex, mar-dlinc, hsld-stat, edu, income, age.] FIGURE Relative importance of the predictors as averaged over all classes for the demographics data.

24 [Four bar charts of relative importance, with bar labels as extracted. Class = Retired: yrs-ba, num-hsld, edu, children, typ-home, lang, mar-stat, hsld-stat, income, ethnic, sex, mar-dlinc, age. Class = Student: children, yrs-ba, lang, mar-dlinc, sex, typ-home, num-hsld, ethnic, edu, mar-stat, income, age, hsld-stat. Class = Prof/Man: children, yrs-ba, mar-stat, lang, num-hsld, sex, typ-home, hsld-stat, ethnic, mar-dlinc, age, income, edu. Class = Homemaker: yrs-ba, hsld-stat, age, income, typ-home, lang, mar-stat, edu, num-hsld, ethnic, children, mar-dlinc, sex.] FIGURE Predictor variable importances separately for each of the four classes with lowest error rate for the demographics data.

25 [Three panels, each showing partial dependence against age: Retired, Student, Prof/Man.] FIGURE Partial dependence of the odds of three different occupations on age, for the demographics data.
