Machine Learning (CSE 446): Decision Trees

1 Machine Learning (CSE 446): Decision Trees. Sham M. Kakade © 2018 University of Washington. cse446-staff@cs.washington.edu

2 Announcements: First assignment posted; due Thurs, Jan 18th. Remember the late policy (see the website). TA office hours posted. (Please check the website before you go, just in case of changes.) Midterm: Weds, Feb 7. Today: Decision Trees and the supervised learning setup.

3 Features (a conceptual point). Let φ be (one such) function that maps from inputs x to values. There could be many such functions; sometimes we write Φ(x) for the feature vector (it's really a tuple). If φ maps to {0, 1}, we call it a binary feature (function). If φ maps to R, we call it a real-valued feature (function). φ could also map to categorical values, ordinal values, integers, ... Often, there isn't much of a difference between x and the tuple of features.
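
To make the distinction concrete, here is a minimal sketch (in Python; the dict-of-attributes input x is an assumption, not the slides' notation) of binary, real-valued, and categorical feature functions, and the tuple Φ(x) collecting them:

def phi_binary(x):
    # Binary feature: 1 if the car is American-made, else 0.
    return 1 if x["maker"] == "america" else 0

def phi_real(x):
    # Real-valued feature: the car's weight.
    return float(x["weight"])

def phi_categorical(x):
    # Categorical feature: the maker, as-is.
    return x["maker"]

def Phi(x):
    # The feature vector (really a tuple) collecting several phi's.
    return (phi_binary(x), phi_real(x), phi_categorical(x))

print(Phi({"maker": "america", "weight": 3504, "cylinders": 8}))
# -> (1, 3504.0, 'america')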

4 Features. Data derived from the MPG dataset, with columns: mpg; cylinders; displacement; horsepower; weight; acceleration; year; origin. Input: a row in this table; a feature mapping corresponds to a column. Goal: predict whether mpg is < 23 ("bad" = 0) or above ("good" = 1) given the other attributes (other columns). 201 "good" and 197 "bad"; guessing the most frequent class (good) will get 50.5% accuracy.
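
As a sanity check, the majority-class baseline from the counts quoted on this slide (a quick sketch):

n_good, n_bad = 201, 197            # mpg >= 23 ("good" = 1), mpg < 23 ("bad" = 0)
baseline = max(n_good, n_bad) / (n_good + n_bad)
print(f"{baseline:.1%}")            # 50.5% -- always guess "good"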

5 Let's build a classifier! Let's just try to build a classifier. (This is our first learning algorithm.) For now, let's ignore the test set and the question of generalizing. Let's start by just looking at a simple classifier. What is a simple classification rule?

6 Contingency Table: the values of y (rows) tabulated against the values of feature φ: v1, v2, ..., vK (columns).
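
A contingency table is easy to tabulate; a small sketch, with assumed (feature value, label) pairs:

from collections import Counter

def contingency(pairs):
    # pairs: (phi(x), y) for each training example -> count per (value, label) cell.
    return Counter(pairs)

table = contingency([("america", 0), ("america", 1), ("europe", 1), ("asia", 1)])
print(table[("america", 0)])  # -> 1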

7 Decision Stump Example: the contingency table of y against the feature "maker", with values america, europe, asia.

8 Decision Stump Example. Root: 197:201 (counts of bad:good). Splitting on maker: america 174:75, europe 14:56, asia 9:70.

9 Decision Stump Example. Root: 197:201; splitting on maker: america 174:75, europe 14:56, asia 9:70. Errors: 75 + 14 + 9 = 98 (about 25%).

10 Decision Stump Example. Root: 197:201; splitting on cylinders: 3 → 3:1, 4 → 20:184, 5 → 1:2, 6 → 73:11, 8 → 100:3.

11 Decision Stump Example. Root: 197:201; splitting on cylinders: 3 → 3:1, 4 → 20:184, 5 → 1:2, 6 → 73:11, 8 → 100:3. Errors: 1 + 20 + 1 + 11 + 3 = 36 (about 9%).
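
A stump's error count is just the sum of the minority counts across its branches; a small sketch over the counts above:

def stump_errors(branches):
    # branches: dict mapping feature value -> (n_bad, n_good).
    # Each branch predicts its majority class, so the minority count is its error.
    return sum(min(n, p) for n, p in branches.values())

maker = {"america": (174, 75), "europe": (14, 56), "asia": (9, 70)}
cylinders = {3: (3, 1), 4: (20, 184), 5: (1, 2), 6: (73, 11), 8: (100, 3)}
print(stump_errors(maker), stump_errors(cylinders))  # -> 98 36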

12 Key Idea: Recursion. A single feature partitions the data. For each partition, we could choose another feature and partition further. Applying this recursively, we can construct a decision tree.

13 Decision Tree Example. Root 197:201; split on cylinders as before (3 → 3:1, 4 → 20:184, 5 → 1:2, 6 → 73:11, 8 → 100:3); then split the cylinders = 4 partition on maker: america 17:65, europe 0:53, asia 3:66. Error reduction compared to the cylinders stump?

14 Decision Tree Example. Same cylinders split; this time split the cylinders = 6 partition on maker: america 67:7, europe 3:1, asia 3:3. Error reduction compared to the cylinders stump?

15 Decision Tree Example. Same cylinders split; now split the cylinders = 6 partition on a feature ϕ that separates it perfectly: 73:0 and 0:11. Error reduction compared to the cylinders stump?

16 Decision Tree Example. Same cylinders split; split the cylinders = 4 partition on a feature ϕ′ into 12:169 and 8:15, and the cylinders = 6 partition on ϕ into 73:0 and 0:11. Error reduction compared to the cylinders stump?

17 Decision Tree: Making a Prediction. A generic tree: the root n:p splits on ϕ1 into n0:p0 and n1:p1; ϕ2 splits n1:p1 into n10:p10 and n11:p11; ϕ3 splits n10:p10 into n100:p100 and n101:p101; ϕ4 splits n11:p11 into n110:p110 and n111:p111.

18 Decision Tree: Making a Prediction. The same generic tree, with the prediction procedure:

Data: decision tree t, input example x
Result: predicted class
if t has the form Leaf(y) then
    return y;
else
    # t.φ is the feature associated with t;
    # t.child(v) is the subtree for value v;
    return DTreeTest(t.child(t.φ(x)), x);
end
Algorithm 1: DTreeTest
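
A minimal runnable version of DTreeTest (a sketch: the Leaf/Node classes and dict-valued inputs are assumed representations, not the slides' notation):

class Leaf:
    def __init__(self, y):
        self.y = y  # the class predicted at this leaf

class Node:
    def __init__(self, phi, children):
        self.phi = phi            # name of the feature tested at this node
        self.children = children  # dict: feature value -> subtree

def dtree_test(t, x):
    # x maps feature names to values; recurse into the child for x's value.
    if isinstance(t, Leaf):
        return t.y
    return dtree_test(t.children[x[t.phi]], x)

tree = Node("phi1", {0: Leaf(0),
                     1: Node("phi2", {0: Leaf(1), 1: Leaf(0)})})
print(dtree_test(tree, {"phi1": 1, "phi2": 0}))  # -> 1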

19 Decision Tree: Making a Prediction. Equivalent boolean formulas (the tree predicts 1 exactly when one of these clauses holds):
(ϕ1 = 0) ∧ (n0 < p0)
(ϕ1 = 1) ∧ (ϕ2 = 0) ∧ (ϕ3 = 0) ∧ (n100 < p100)
(ϕ1 = 1) ∧ (ϕ2 = 0) ∧ (ϕ3 = 1) ∧ (n101 < p101)
(ϕ1 = 1) ∧ (ϕ2 = 1) ∧ (ϕ4 = 0) ∧ (n110 < p110)
(ϕ1 = 1) ∧ (ϕ2 = 1) ∧ (ϕ4 = 1) ∧ (n111 < p111)

20 Tangent: How Many Formulas? Assume we have D binary features. Each feature could be set to 0, or set to 1, or excluded (wildcard/don't care): 3^D formulas.

21 Building a Decision Tree. Start with all the data at the root: n:p.

22 Building a Decision Tree. Split the root on ϕ1 into n0:p0 and n1:p1. We chose feature ϕ1. Note that n = n0 + n1 and p = p0 + p1.

23 Building a Decision Tree. Same split (n0:p0 and n1:p1 under ϕ1). We chose not to split the left partition. Why not?

24 Building a Decision Tree. Next, split the right partition on ϕ2 into n10:p10 and n11:p11.

25 Building a Decision Tree. Then split n10:p10 on ϕ3 into n100:p100 and n101:p101.

26 Building a Decision Tree. Finally, split n11:p11 on ϕ4 into n110:p110 and n111:p111, giving the full tree from before.

27 Greedily Building a Decision Tree (Binary Features)

Data: data D, feature set Φ
Result: decision tree
if all examples in D have the same label y, or Φ is empty and y is the best guess then
    return Leaf(y);
else
    for each feature φ in Φ do
        partition D into D0 and D1 based on φ-values;
        let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1);
    end
    let φ* be the feature with the smallest number of mistakes;
    return Node(φ*, {DTreeTrain(D0, Φ \ {φ*}), DTreeTrain(D1, Φ \ {φ*})});
end
Algorithm 2: DTreeTrain
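
A minimal runnable sketch of Algorithm 2 for binary features, with Leaf/Node as in the DTreeTest sketch above; D is a list of (x, y) pairs where x maps feature names to 0/1 (an assumed representation):

from collections import Counter

class Leaf:
    def __init__(self, y): self.y = y

class Node:
    def __init__(self, phi, children): self.phi, self.children = phi, children

def majority(D):
    return Counter(y for _, y in D).most_common(1)[0][0]

def mistakes(D):
    # Number of non-majority labels in D (0 for an empty partition).
    return 0 if not D else len(D) - Counter(y for _, y in D).most_common(1)[0][1]

def dtree_train(D, features, default=0):
    if not D:
        return Leaf(default)  # empty partition: fall back to the parent's majority
    if len({y for _, y in D}) == 1 or not features:
        return Leaf(majority(D))
    def split(phi, v):
        return [(x, y) for x, y in D if x[phi] == v]
    # Greedy choice: the feature whose one-level split makes the fewest mistakes.
    best = min(features,
               key=lambda phi: mistakes(split(phi, 0)) + mistakes(split(phi, 1)))
    rest = features - {best}
    return Node(best, {v: dtree_train(split(best, v), rest, majority(D))
                       for v in (0, 1)})

# Usage: learn y = a AND b from all four binary inputs.
D = [({"a": a, "b": b}, int(a and b)) for a in (0, 1) for b in (0, 1)]
tree = dtree_train(D, {"a", "b"})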

28 What could go wrong? Suppose we split on a variable with many values (e.g., a continuous one like "displacement"). Suppose we built our tree out to be very deep and wide.

29 Danger: Overfitting. [Figure: error rate (lower is better) as a function of the depth of the decision tree, with one curve for training data and one for unseen data; past some depth the unseen-data error rises while training error keeps falling: overfitting.]

30 Detecting Overfitting. If you use all of your data to train, you won't be able to draw the red curve on the preceding slide!

31 Detecting Overfitting. If you use all of your data to train, you won't be able to draw the red curve on the preceding slide! Solution: hold some out. This data is called development data. More terms: Decision tree max depth is an example of a hyperparameter. I used my development data to tune the max-depth hyperparameter.

32 Detecting Overfitting. If you use all of your data to train, you won't be able to draw the red curve on the preceding slide! Solution: hold some out. This data is called development data. More terms: Decision tree max depth is an example of a hyperparameter. I used my development data to tune the max-depth hyperparameter. Better yet, hold out two subsets, one for tuning and one for a true, honest-to-science test. Splitting your data into training/development/test requires careful thinking. Starting point: randomly shuffle examples with an 80%/10%/10% split.
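
A quick sketch of that starting point; train_tree and error_rate are hypothetical stand-ins (e.g., DTreeTrain with a max-depth cutoff, and average error on a dataset):

import random

def split_data(examples, seed=0):
    examples = examples[:]                 # shuffle a copy, not the caller's list
    random.Random(seed).shuffle(examples)
    n = len(examples)
    a, b = int(0.8 * n), int(0.9 * n)
    return examples[:a], examples[a:b], examples[b:]   # train, dev, test

# train, dev, test = split_data(all_examples)
# best_depth = min(range(1, 15),
#                  key=lambda d: error_rate(train_tree(train, d), dev))
# print(error_rate(train_tree(train, best_depth), test))  # report test error once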

33 The i.i.d. Supervised Learning Setup. Let ℓ be a loss function; ℓ(y, ŷ) is what we lose by outputting ŷ when y is the correct output. For classification, the zero-one loss: ℓ(y, ŷ) = 1 if y ≠ ŷ, else 0. Let D(x, y) define the true probability of the input/output pair (x, y) in nature. We never know this distribution. The training data D = (x1, y1), (x2, y2), ..., (xN, yN) are assumed to be independent and identically distributed (i.i.d.) samples from D. The test data are also assumed to be i.i.d. samples from D. The space of classifiers we're considering is F; f is a classifier from F, chosen by our learning algorithm.
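
In code, the zero-one loss and the average loss of a classifier f on an i.i.d. sample (the quantity the held-out test set estimates) look like this sketch:

def zero_one_loss(y, y_hat):
    return 1 if y != y_hat else 0

def average_loss(f, examples):
    # examples: (x, y) pairs assumed drawn i.i.d. from D.
    return sum(zero_one_loss(y, f(x)) for x, y in examples) / len(examples)

print(average_loss(lambda x: 1, [({"a": 0}, 1), ({"a": 1}, 0)]))  # -> 0.5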
