Machine Learning Lecture 11


Course Outline (Machine Learning Lecture 11, 07.06.2016, Bastian Leibe, RWTH Aachen, http://www.vision.rwth-aachen.de, leibe@vision.rwth-aachen.de)
- Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions, Statistical Learning Theory & SVMs, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns (this lecture: AdaBoost & Decision Trees)
- Generative Models (4 weeks): Bayesian Networks, Markov Random Fields

Recap: Stacking
- Idea: Learn L classifiers (based on the training data) and find a meta-classifier that takes as input the outputs of the L first-level classifiers. (Diagram: Data -> Classifier 1, Classifier 2, ..., Classifier L -> Combination Classifier)
- Example: Learn L classifiers with leave-one-out. Interpret the predictions of the L classifiers as an L-dimensional feature vector. Learn a level-2 classifier based on the examples generated this way.
(Slide credit: Bernt Schiele)

Recap: Bayesian Model Averaging
- Model averaging: Suppose we have H different models h = 1, ..., H with prior probabilities p(h). Construct the marginal distribution over the data set: $p(x) = \sum_{h=1}^{H} p(x|h)\, p(h)$
- Average error of the committee: $E_{COM} = \frac{1}{M} E_{AV}$. This suggests that the average error of a model can be reduced by a factor of M simply by averaging M versions of the model! Unfortunately, this assumes that the errors are all uncorrelated; in practice, they will typically be highly correlated.

Topics of This Lecture
- Recap: AdaBoost (algorithm, analysis, extensions)
- Analysis: comparing error functions
- Applications: AdaBoost for face detection
- Decision Trees: CART, impurity measures, stopping criterion, pruning, extensions, issues; historical development: ID3, C4.5

Recap: AdaBoost ("Adaptive Boosting")
- Main idea [Freund & Schapire, 1996]: Instead of resampling, reweight misclassified training examples. Increase the chance of being selected in a sampled training set, or increase the misclassification cost when training on the full set.
- Components: $h_m(x)$ is a "weak" or base classifier (condition: less than 50% training error over any distribution); $H(x)$ is the "strong" or final classifier.
- AdaBoost constructs the strong classifier as a thresholded linear combination of the weighted weak classifiers (see the sketch below): $H(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)$
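To make the combination rule concrete, here is a minimal sketch (not code from the lecture) of the strong classifier as a weighted vote over weak learners; `weak_clfs` and `alphas` are illustrative placeholder names.

```python
import numpy as np

def strong_classifier(X, weak_clfs, alphas):
    """Thresholded linear combination of weighted weak classifiers.

    weak_clfs : list of callables mapping an (N, D) array to labels in {-1, +1}
    alphas    : list of the corresponding weighting coefficients alpha_m
    """
    # Weighted sum of the weak decisions, then take the sign.
    scores = sum(a * clf(X) for a, clf in zip(alphas, weak_clfs))
    return np.sign(scores)
```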

AdaBoost – Algorithm
1. Initialization: Set $w_n^{(1)} = \frac{1}{N}$ for n = 1, ..., N.
2. For m = 1, ..., M iterations:
   a) Train a new weak classifier $h_m(x)$ using the current weighting coefficients $W^{(m)}$ by minimizing the weighted error function $J_m = \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \ne t_n)$
   b) Estimate the weighted error of this classifier on X: $\epsilon_m = \frac{\sum_n w_n^{(m)} I(h_m(x_n) \ne t_n)}{\sum_n w_n^{(m)}}$
   c) Calculate a weighting coefficient for $h_m(x)$: $\alpha_m = ?$
   d) Update the weighting coefficients: $w_n^{(m+1)} = ?$
   How should we do steps c) and d) exactly? (Derived below.)

AdaBoost – Historical Development
- Originally motivated by Statistical Learning Theory; AdaBoost was introduced in 1996 by Freund & Schapire.
- It was empirically observed that AdaBoost often tends not to overfit (Breiman 96, Cortes & Drucker 97, etc.). As a result, the margin theory (Schapire et al. 98) developed, which is based on loose generalization bounds. Note: the margin for boosting is not the same as the margin for SVMs. A bit like retrofitting the theory... However, those bounds are too loose to be of practical value.
- Different explanation (Friedman, Hastie, Tibshirani, 2000): interpretation as sequential minimization of an exponential error function ("Forward Stagewise Additive Modeling"). Explains why boosting works well. Improvements are possible by altering the error function.

AdaBoost – Minimizing Exponential Error
- Exponential error function: $E = \sum_{n=1}^{N} \exp\{-t_n f_m(x_n)\}$, where $f_m(x)$ is a classifier defined as a linear combination of base classifiers $h_l(x)$: $f_m(x) = \frac{1}{2} \sum_{l=1}^{m} \alpha_l h_l(x)$
- Goal: Minimize E with respect to both the weighting coefficients $\alpha_l$ and the parameters of the base classifiers $h_l(x)$.
- Sequential minimization: Suppose that the base classifiers $h_1(x), \ldots, h_{m-1}(x)$ and their coefficients $\alpha_1, \ldots, \alpha_{m-1}$ are fixed; only minimize with respect to $\alpha_m$ and $h_m(x)$. Then
  $E = \sum_{n=1}^{N} \exp\left\{-t_n f_{m-1}(x_n) - \frac{1}{2} t_n \alpha_m h_m(x_n)\right\} = \text{const.} \cdot \sum_{n=1}^{N} w_n^{(m)} \exp\left\{-\frac{1}{2} t_n \alpha_m h_m(x_n)\right\}$
  where $w_n^{(m)} \propto \exp\{-t_n f_{m-1}(x_n)\}$ is treated as constant.
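As a hedged illustration of steps 1 and 2, the following Python sketch uses decision stumps from scikit-learn as weak learners (an arbitrary choice; the slides only require less than 50% weighted error) and already plugs in the answers for steps c) and d) that the derivation below arrives at.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # decision stumps as weak learners (illustrative choice)

def adaboost_train(X, t, M=50):
    """Sketch of the AdaBoost loop from the slides; t must contain labels in {-1, +1}."""
    N = len(t)
    w = np.full(N, 1.0 / N)            # 1. initialization: w_n^(1) = 1/N
    classifiers, alphas = [], []
    for m in range(M):
        # 2a) train a weak classifier minimizing the weighted error J_m
        h = DecisionTreeClassifier(max_depth=1).fit(X, t, sample_weight=w)
        miss = (h.predict(X) != t)
        # 2b) weighted error eps_m of this classifier
        eps = np.sum(w * miss) / np.sum(w)
        if eps >= 0.5:                 # weak-learner condition violated: stop
            break
        # 2c) weighting coefficient alpha_m = ln((1 - eps_m) / eps_m)
        alpha = np.log((1.0 - eps) / eps)
        # 2d) reweight: w_n^(m+1) = w_n^(m) * exp(alpha_m * I(h_m(x_n) != t_n))
        w = w * np.exp(alpha * miss)
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas
```

Prediction then follows the thresholded combination $H(x) = \mathrm{sign}\left(\sum_m \alpha_m h_m(x)\right)$ sketched earlier.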

AdaBoost – Minimizing Exponential Error (cont'd)
- Starting from $E = \sum_{n=1}^{N} w_n^{(m)} \exp\left\{-\frac{1}{2} t_n \alpha_m h_m(x_n)\right\}$, observe: correctly classified points have $t_n h_m(x_n) = +1$ (collect them in $T_m$); misclassified points have $t_n h_m(x_n) = -1$ (collect them in $F_m$).
- Rewrite the error function as
  $E = e^{-\alpha_m/2} \sum_{n \in T_m} w_n^{(m)} + e^{\alpha_m/2} \sum_{n \in F_m} w_n^{(m)} = \left(e^{\alpha_m/2} - e^{-\alpha_m/2}\right) \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \ne t_n) + e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}$
- Minimize with respect to $h_m(x)$: $\frac{\partial E}{\partial h_m(x_n)} \stackrel{!}{=} 0$. Since the second term is constant with respect to $h_m$, this is equivalent to minimizing $J_m = \sum_n w_n^{(m)} I(h_m(x_n) \ne t_n)$ (our weighted error function from step 2a of the algorithm). We're on the right track, let's continue...
- Minimize with respect to $\alpha_m$: $\frac{\partial E}{\partial \alpha_m} \stackrel{!}{=} 0$ gives $\left(\frac{1}{2} e^{\alpha_m/2} + \frac{1}{2} e^{-\alpha_m/2}\right) \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \ne t_n) = \frac{1}{2} e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}$.
  With the weighted error $\epsilon_m := \frac{\sum_n w_n^{(m)} I(h_m(x_n) \ne t_n)}{\sum_n w_n^{(m)}}$ this becomes $\frac{e^{\alpha_m/2} + e^{-\alpha_m/2}}{e^{-\alpha_m/2}} = \frac{1}{\epsilon_m}$, i.e. $e^{\alpha_m} + 1 = \frac{1}{\epsilon_m}$.
  Update for the coefficients: $\alpha_m = \ln\left\{\frac{1 - \epsilon_m}{\epsilon_m}\right\}$
- Remaining step: update the weights. Recall that $E = \sum_n w_n^{(m)} \exp\left\{-\frac{1}{2} t_n \alpha_m h_m(x_n)\right\}$; the term inside the sum becomes $w_n^{(m+1)}$ in the next iteration. Therefore
  $w_n^{(m+1)} = w_n^{(m)} \exp\left\{-\frac{1}{2} t_n \alpha_m h_m(x_n)\right\} = \ldots = w_n^{(m)} \exp\{\alpha_m I(h_m(x_n) \ne t_n)\}$
  (dropping a factor that is the same for all n). This is the update for the weight coefficients.

AdaBoost – Final Algorithm
1. Initialization: Set $w_n^{(1)} = \frac{1}{N}$ for n = 1, ..., N.
2. For m = 1, ..., M iterations:
   a) Train a new weak classifier $h_m(x)$ using the current weighting coefficients $W^{(m)}$ by minimizing the weighted error function $J_m = \sum_n w_n^{(m)} I(h_m(x_n) \ne t_n)$
   b) Estimate the weighted error of this classifier on X: $\epsilon_m = \frac{\sum_n w_n^{(m)} I(h_m(x_n) \ne t_n)}{\sum_n w_n^{(m)}}$
   c) Calculate a weighting coefficient for $h_m(x)$: $\alpha_m = \ln\left\{\frac{1 - \epsilon_m}{\epsilon_m}\right\}$
   d) Update the weighting coefficients: $w_n^{(m+1)} = w_n^{(m)} \exp\{\alpha_m I(h_m(x_n) \ne t_n)\}$

AdaBoost – Analysis
- Result of this derivation: We now know that AdaBoost minimizes an exponential error function in a sequential fashion. This allows us to analyze AdaBoost's behavior in more detail. In particular, we can see how robust it is to outlier data points.
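A quick numeric sanity check of these two update rules, with an arbitrarily chosen weighted error:

```python
import numpy as np

eps_m = 0.25                             # assumed weighted error of the weak classifier
alpha_m = np.log((1 - eps_m) / eps_m)    # alpha_m = ln((1 - eps)/eps) = ln 3 ~ 1.099
w_correct = 1.0                          # relative weight of a correctly classified point (unchanged)
w_wrong = 1.0 * np.exp(alpha_m)          # a misclassified point's weight is multiplied by exp(alpha_m) = 3
print(alpha_m, w_wrong)                  # -> 1.0986..., 3.0
```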

Recap: Error Functions (plotted over $z_n = t_n y(x_n)$)
- Ideal misclassification error (black curve): This is what we want to approximate. Unfortunately, it is not differentiable, and the gradient is zero for misclassified points, so we cannot minimize it by gradient descent.
- Squared error: Used in least-squares classification. Very popular, leads to closed-form solutions. However, it is sensitive to outliers due to the squared penalty, and it penalizes "too correct" data points. Generally does not lead to good classifiers.
- Hinge error: Used in SVMs. Zero error for points outside the margin (z > 1): favors sparse solutions. Linear penalty for misclassified points (z < 1): robust to outliers. Not differentiable around z = 1, so it cannot be optimized directly.

Discussion: AdaBoost Error Function
- Exponential error: Used in AdaBoost. Continuous approximation to the ideal misclassification function; sequential minimization leads to the simple AdaBoost scheme. No penalty for "too correct" data points, fast convergence.
- Disadvantage: exponential penalty for large negative values of z! Hence it is less robust to outliers or misclassified data points.

Discussion: Other Possible Error Functions
- Cross-entropy error: $E = -\sum_n \{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\}$, used in logistic regression. Similar to the exponential error for z > 0, but it only grows linearly for large negative values of z. Making AdaBoost more robust by switching to this error function leads to "GentleBoost". (A small sketch comparing these curves follows below.)
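The curves compared on these slides follow directly from their definitions. A small sketch (the value range is chosen arbitrarily); the cross-entropy curve is rescaled by 1/ln 2 so that it passes through (0, 1), as it is usually plotted alongside the others:

```python
import numpy as np

z = np.linspace(-2, 2, 401)                       # z = t_n * y(x_n)
misclass = (z <= 0).astype(float)                 # ideal 0/1 misclassification error
squared = (z - 1) ** 2                            # squared error; also penalizes "too correct" points (z > 1)
hinge = np.maximum(0.0, 1 - z)                    # hinge error used in SVMs
exponential = np.exp(-z)                          # exponential error used in AdaBoost
cross_entropy = np.log(1 + np.exp(-z)) / np.log(2)  # logistic-regression error, rescaled by 1/ln 2
```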

Summary: AdaBoost
- Properties: Simple combination of multiple classifiers, easy to implement. Can be used with many different types of classifiers; none of them needs to be too good on its own (in fact, they only have to be slightly better than chance). Commonly used in many areas. Empirically good generalization capabilities.
- Limitations: The original AdaBoost is sensitive to misclassified training data points because of the exponential error function; improvement by GentleBoost. Single-class classifier; multiclass extensions available.

Example Application: Face Detection
- Frontal faces are a good example of a class where global appearance models plus a sliding-window detection approach fit well: regular 2D structure, and the center of the face is almost shaped like a "patch"/window.
- Now we'll take AdaBoost and see how the Viola-Jones face detector works. [Viola & Jones, CVPR 2001]

Feature Extraction
- Rectangular filters: the feature output is the difference between sums over adjacent regions.
- Efficiently computable with the integral image: any rectangle sum can be computed in constant time. The value of the integral image at (x, y) is the sum of the pixels above and to the left of (x, y). (A short sketch follows at the end of this section.)
- Avoid scaling images: scale the features directly for the same cost.
(Slide credit: Kristen Grauman)

Large Library of Filters
- Considering all possible filter parameters (position, scale, and type): 180,000+ possible features associated with each 24 x 24 window.
- Use AdaBoost both to select the informative features and to form the classifier.

AdaBoost for Feature+Classifier Selection
- Want to select the single rectangle feature and threshold that best separates positive (faces) and negative (non-faces) training examples, in terms of weighted error.
- (Figure: outputs of a possible rectangle feature on faces and non-faces, with the resulting weak classifier as a threshold on the feature value.)
- For the next round, reweight the examples according to errors and choose another filter/threshold combination.
(Slide credit: Kristen Grauman) [Viola & Jones, CVPR 2001]
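A short sketch of the integral-image trick referred to above (function names are illustrative, not from the Viola-Jones code): the value at (x, y) is the sum of all pixels above and to the left, so any rectangle sum needs only four array look-ups.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum so that ii[y, x] = sum of img[:y+1, :x+1]."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] in constant time from the integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# A two-rectangle Haar-like feature is then just the difference of two such rectangle sums.
```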

AdaBoost for Efficient Feature Selection
- Image features = weak classifiers.
- For each round of boosting: Evaluate each rectangle filter on each example. Sort the examples by filter value. Select the best threshold for each filter (minimum weighted error); the sorted list can be quickly scanned for the optimal threshold (see the sketch at the end of this section). Select the best filter/threshold combination. The weight on this feature is a simple function of the error rate. Reweight the examples.
- P. Viola, M. Jones, Robust Real-Time Face Detection, IJCV, Vol. 57(2), 2004. (First version appeared at CVPR 2001.)
(Slide credit: Kristen Grauman)

Viola-Jones Face Detector: Results
- (Figures: example detections on several test images.)
(Slide credit: Kristen Grauman)

References and Further Reading
- More information on classifier combination and boosting can be found in Chapters 14.1-14.3 of Bishop's book: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- A more in-depth discussion of the statistical interpretation of AdaBoost is available in: J. Friedman, T. Hastie, R. Tibshirani, Additive Logistic Regression: a Statistical View of Boosting, The Annals of Statistics, Vol. 38(2), pages 337-374, 2000.
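To illustrate the "scan the sorted list for the optimal threshold" step from the feature-selection slide above, here is a simplified single-feature sketch (the polarity handling of the full Viola-Jones detector is omitted; names are illustrative):

```python
import numpy as np

def best_threshold(feature_values, labels, weights):
    """Scan one sorted feature column to find the threshold with minimal weighted error.

    labels must be in {-1, +1}; the stump predicts +1 for feature values above the threshold.
    """
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]
    # Error of a threshold below all values: everything is predicted +1,
    # so only the negative examples are wrong.
    err = np.sum(w[y == -1])
    best_err, best_thr = err, f[0] - 1.0
    for i in range(len(f)):
        # Moving the threshold just above f[i] flips the prediction for example i to -1.
        err += w[i] if y[i] == +1 else -w[i]
        if err < best_err:
            best_err, best_thr = err, f[i]
    return best_thr, best_err
```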

Decision Trees
- Very old technique: origin in the 60s, might seem outdated. But...
- Can be used for problems with nominal data, e.g. attributes color ∈ {red, green, blue} or weather ∈ {sunny, rainy}: discrete values, no notion of similarity or even ordering.
- Interpretable results: learned trees can be written as sets of if-then rules.
- Methods developed for handling missing feature values.
- Successfully applied to a broad range of tasks, e.g. medical diagnosis or credit risk assessment of loan applicants.
- Some interesting novel developments building on top of them.
- Example: Classify Saturday mornings according to whether they're suitable for playing tennis. (Image source: T. Mitchell, 1997)

Decision Trees – Elements
- Each node specifies a test for some attribute; each branch corresponds to a possible value of the attribute.
- Assumption: Links must be mutually distinct and exhaustive, i.e. one and only one link will be followed at each step.
- Interpretability: The information in a tree can then be rendered as logical expressions. In our example: (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

Training Decision Trees
- Finding the optimal decision tree is NP-hard.
- Common procedure: greedy top-down growing (a sketch follows below, after the CART questions). Start at the root node. Progressively split the training data into smaller and smaller subsets. In each step, pick the best attribute to split the data. If the resulting subsets are pure (only one label) or if no further attribute can be found that splits them, terminate the tree. Else, recursively apply the procedure to the subsets.

CART Framework
- CART: Classification And Regression Trees (Breiman et al. 1993). A formalization of the different design choices.
- Six general questions:
  1. Binary or multi-valued problem? I.e. how many splits should there be at each node?
  2. Which property should be tested at a node? I.e. how to select the query attribute?
  3. When should a node be declared a leaf? I.e. when to stop growing the tree?
  4. How can a grown tree be simplified or pruned? Goal: reduce overfitting.
  5. How to deal with impure nodes? I.e. when the data itself is ambiguous.
  6. How should missing attributes be handled?
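A compact sketch of the greedy top-down growing procedure described above, for nominal attributes with multiway splits (closer to ID3-style splits than to CART's binary splits); this illustrates the control flow under the entropy impurity, not a reference implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy impurity of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def grow_tree(rows, labels, attributes):
    """Greedy top-down growing; rows are dicts attribute -> value, attributes is a set of keys."""
    if len(set(labels)) == 1 or not attributes:
        return ("leaf", Counter(labels).most_common(1)[0][0])   # pure node or nothing left to test
    # Pick the attribute whose split leaves the lowest weighted child impurity.
    def split_impurity(a):
        groups = Counter(r[a] for r in rows)
        return sum(
            (n / len(rows)) * entropy([l for r, l in zip(rows, labels) if r[a] == v])
            for v, n in groups.items()
        )
    best = min(attributes, key=split_impurity)
    if split_impurity(best) >= entropy(labels):                 # no attribute reduces impurity
        return ("leaf", Counter(labels).most_common(1)[0][0])
    children = {}
    for v in set(r[best] for r in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == v]
        children[v] = grow_tree([r for r, _ in sub], [l for _, l in sub], attributes - {best})
    return ("node", best, children)

# e.g. grow_tree([{"Outlook": "Sunny"}, {"Outlook": "Rain"}], ["No", "Yes"], {"Outlook"})
```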

CART – 1. Number of Splits
- Each multi-valued tree can be converted into an equivalent binary tree. (Image source: R.O. Duda, P.E. Hart, D.G. Stork, 2001)
- We therefore only consider binary trees here.

CART – 2. Picking a Good Splitting Feature
- Goal: We want a tree that is as simple/small as possible (Occam's razor). But finding a minimal tree is an NP-hard optimization problem.
- Greedy top-down search: efficient, but not guaranteed to find the smallest tree. Seek a property T at each node N that makes the data in the child nodes as pure as possible.
- For formal reasons it is more convenient to define an impurity $i(N)$; several possible definitions have been explored.

CART – Impurity Measures
(Figures: impurity $i(P)$ plotted over the class fraction P for a two-class problem. Image source: R.O. Duda, P.E. Hart, D.G. Stork, 2001)
- Misclassification impurity: $i(N) = 1 - \max_j p(C_j|N)$, where $p(C_j|N)$ is the fraction of the training patterns in category $C_j$ that end up in node N. Problem: discontinuous derivative!
- Entropy impurity: $i(N) = -\sum_j p(C_j|N) \log_2 p(C_j|N)$. Reduction in entropy = gain in information.
- Gini impurity (variance impurity): $i(N) = \sum_{i \ne j} p(C_i|N)\, p(C_j|N) = \frac{1}{2}\left[1 - \sum_j p^2(C_j|N)\right]$. This is the expected error rate at node N if the category label is selected randomly.
- Which impurity measure should we choose? The misclassification impurity has some problems: its discontinuous derivative causes problems when searching over a continuous parameter space, and sometimes it does not decrease when the Gini impurity would. Both entropy impurity and Gini impurity perform well, with no big difference in terms of classifier performance. In practice, the stopping criterion and the pruning method are often more important. (A small sketch computing these measures follows below.)
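The three impurity measures can be computed directly from the class fractions $p(C_j|N)$ at a node; a minimal sketch:

```python
import numpy as np

def impurities(p):
    """p: array of class fractions p(C_j|N) at a node, summing to 1."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    misclassification = 1.0 - p.max()                  # i(N) = 1 - max_j p(C_j|N)
    entropy = -np.sum(nonzero * np.log2(nonzero))      # i(N) = -sum_j p log2 p
    gini = 0.5 * (1.0 - np.sum(p ** 2))                # i(N) = (1/2)[1 - sum_j p^2]
    return misclassification, entropy, gini

# e.g. impurities([0.5, 0.5]) -> (0.5, 1.0, 0.25)
```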

CART – 2. Picking a Good Splitting Feature (cont'd)
- Application: Select the query that decreases impurity the most: $\Delta i(N) = i(N) - P_L\, i(N_L) - (1 - P_L)\, i(N_R)$
- For efficiency, splits are often based on a single feature: "monothetic" decision trees.
- Multiway generalization (gain ratio impurity): maximize $\Delta i(s) = \frac{1}{Z}\left(i(N) - \sum_{k=1}^{K} P_k\, i(N_k)\right)$, where the normalization factor $Z = -\sum_{k=1}^{K} P_k \log_2 P_k$ ensures that large K are not inherently favored.
- Evaluating candidate splits: For nominal attributes, exhaustive search over all possibilities. For real-valued attributes, only changes in label need to be considered: order all data points based on attribute $x_i$; only candidate splits where label($x_i$) differs from label($x_{i+1}$) need to be tested. (A sketch follows at the end of this section.)

CART – 3. When to Stop Splitting
- Problem: Overfitting. Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data.
- Reasons: Noise or errors in the training data; poor decisions towards the leaves of the tree that are based on very little data.
- Typical behavior (figure: accuracy over hypothesis complexity): accuracy on the training data keeps rising, while accuracy on unseen test data eventually drops. (Slide adapted from Raymond Mooney)

CART – Overfitting Prevention (Pruning)
- Two basic approaches for decision trees:
  - Prepruning: Stop growing the tree at some point during top-down construction when there is no longer sufficient data to make reliable decisions.
  - Postpruning: Grow the full tree, then remove subtrees that do not have sufficient evidence.
- Label a leaf resulting from pruning with the majority class of the remaining data, $C_N = \arg\max_k p(C_k|N)$, or with a class probability distribution $p(C_k|N)$. (Slide adapted from Raymond Mooney)

Decision Trees – Handling Missing Attributes
- During training: Calculate impurities at a node using only the attribute information present. E.g. for 3-dimensional data where one point is missing attribute $x_3$: compute possible splits on $x_1$ using all N points, on $x_2$ using all N points, and on $x_3$ using the N-1 non-deficient points; then choose the split which gives the greatest reduction in impurity.
- During test: A test pattern that is lacking the decision attribute cannot be handled! In addition to the primary split, store an ordered set of "surrogate splits" that try to approximate the desired outcome based on different attributes.

Decision Trees – Feature Choice
- (Figure: example of a "bad tree" built on poorly chosen feature axes.) Best results are obtained if proper features are used; continued on the next slide.
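A sketch of the impurity-decrease criterion for one real-valued attribute, testing only thresholds where the label changes in sorted order; `impurity` is any single impurity function over a vector of class fractions, e.g. `lambda p: 0.5 * (1 - np.sum(p ** 2))` for the Gini measure:

```python
import numpy as np

def best_real_split(x, labels, impurity):
    """Return the threshold on a single real-valued attribute with maximal impurity decrease."""
    order = np.argsort(x)
    x, labels = x[order], labels[order]
    N = len(x)
    classes = np.unique(labels)
    parent = impurity(np.array([np.mean(labels == c) for c in classes]))
    best_gain, best_thr = 0.0, None
    for i in range(N - 1):
        if labels[i] == labels[i + 1]:
            continue                      # only candidate splits where the label changes
        P_L = (i + 1) / N
        left = np.array([np.mean(labels[:i + 1] == c) for c in classes])
        right = np.array([np.mean(labels[i + 1:] == c) for c in classes])
        # Delta i(N) = i(N) - P_L i(N_L) - (1 - P_L) i(N_R)
        gain = parent - P_L * impurity(left) - (1 - P_L) * impurity(right)
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (x[i] + x[i + 1])
    return best_thr, best_gain
```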

Decision Trees – Feature Choice (cont'd)
- (Figure: a "good tree" on the same data.) Best results if proper features are used; preprocessing to find important axes often pays off.

Decision Trees – Non-Uniform Cost
- Incorporating category priors: It is often desired to incorporate different priors for the categories. Solution: weight the samples to correct for the prior frequencies.
- Incorporating non-uniform loss: Create a loss matrix $\lambda_{ij}$. The loss can easily be incorporated into the Gini impurity: $i(N) = \sum_{ij} \lambda_{ij}\, p(C_i)\, p(C_j)$

Historical Development
- ID3 (Quinlan 1986): One of the first widely used decision tree algorithms. Intended to be used with nominal (unordered) variables; real variables are first binned into discrete intervals. General branching factor. Uses gain ratio impurity based on the entropy (information gain) criterion (see the sketch after this section).
- ID3 algorithm: Select the attribute a that best classifies the examples and assign it to the root. For each possible value $v_i$ of a, add a new tree branch corresponding to the test a = $v_i$. If example_list($v_i$) is empty, add a leaf node with the most common label in example_list(a); else, recursively call ID3 for the subtree with attributes A \ a.
- C4.5 (Quinlan 1993): Improved version with extended capabilities. Ability to deal with real-valued variables. Multiway splits are used with nominal data. Uses gain ratio impurity based on the entropy (information gain) criterion. Heuristics for pruning based on the statistical significance of splits. Rule post-pruning.
- Main difference to CART: the strategy for handling missing attributes. When a missing feature is queried, C4.5 follows all B possible answers; the decision is made based on all B possible outcomes, weighted by the decision probabilities at node N.

Decision Trees – Computational Complexity
- Given: data points {x_1, ..., x_N}, dimensionality D.
- Complexity: storage O(N), test runtime O(log N), training runtime O(D N^2 log N).
- Most expensive part: the critical step is selecting the optimal splitting point. We need to check D dimensions, and for each we need to sort the N data points: O(D N log N).

Summary: Decision Trees
- Properties: Simple learning procedure, fast evaluation. Can be applied to metric, nominal, or mixed data. Often yield interpretable results.
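A minimal sketch of the gain-ratio criterion used by ID3/C4.5: the information gain of a multiway split, normalized by $Z = -\sum_k P_k \log_2 P_k$ so that large branching factors are not inherently favored (helper names are illustrative; inputs are 1-D NumPy arrays):

```python
import numpy as np

def entropy(labels):
    """Entropy impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, attribute_values):
    """Information gain of a multiway split on a nominal attribute, normalized by the split entropy Z."""
    N = len(labels)
    values, counts = np.unique(attribute_values, return_counts=True)
    P_k = counts / N
    gain = entropy(labels) - sum(
        p * entropy(labels[attribute_values == v]) for p, v in zip(P_k, values)
    )
    Z = -np.sum(P_k * np.log2(P_k))
    return gain / Z if Z > 0 else 0.0
```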

Summary: Decision Trees (cont'd)
- Limitations:
  - Often produce noisy (bushy) or weak (stunted) classifiers and do not generalize too well.
  - Training data fragmentation: as the tree grows, splits are selected based on less and less data.
  - Overtraining and undertraining: deep trees fit the training data well but will not generalize well to new test data; shallow trees are not sufficiently refined.
  - Stability: Trees can be very sensitive to details of the training points. If a single data point is only slightly shifted, a radically different tree may come out! This is a result of the discrete and greedy learning procedure.
  - Expensive learning step, mostly due to the costly selection of the optimal split.

References and Further Reading
- More information on Decision Trees can be found in Chapters 8.2-8.4 of Duda & Hart: R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2000.