Machine Learning (CSE 446): Practical Issues


Noah Smith
© 2017 University of Washington
nasmith@cs.washington.edu
October 18

scary words

Outline of CSE 446
We've already covered stuff in blue!
Problem formulations: classification, regression
Supervised techniques: decision trees, nearest neighbors, perceptron, linear models, probabilistic models, neural networks, kernel methods
Unsupervised techniques: clustering, linear dimensionality reduction
Meta-techniques: ensembles, expectation-maximization
Understanding ML: limits of learning, practical issues, bias & fairness
Recurring themes: (stochastic) gradient descent, bullshit detection

Today: (More) Best Practices
You already know:
  Separating training and test data
  Hyperparameter tuning on development data
Understanding machine learning is partly about knowing algorithms and partly about the art of mapping abstract problems to learning tasks.

Features Matter

Irrelevant Features
One irrelevant feature isn't a big deal; what we're worried about is when irrelevant features outnumber useful ones!
Decision trees (not too deep)? Somewhat protected, but beware spurious correlations!
K-nearest neighbors?
Perceptron?
What about redundant features $\phi_j$ and $\phi_{j'}$ such that $\phi_j \approx \phi_{j'}$?

Technique: Feature Pruning
If a binary feature is present in too small or too large a fraction of $D$, remove it.
Example: $\phi(x)$ = the word "the" occurs in document $x$.
Generalization: if a feature has variance (in $D$) lower than some threshold value, remove it.
Note: in lecture, I mistakenly said to remove high-variance features. Mea culpa.

\[ \text{sample mean}(\phi; D) = \frac{1}{N} \sum_{n=1}^{N} \phi(x_n) \quad \text{(call it } \bar{\phi}\text{)} \]

\[ \text{sample variance}(\phi; D) = \frac{1}{N-1} \sum_{n=1}^{N} \left( \phi(x_n) - \bar{\phi} \right)^2 \quad \text{(call it } \mathrm{Var}(\phi)\text{)} \]
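A minimal sketch of variance-based pruning, assuming the data is a list of rows with one numeric value per feature. The function name `prune_low_variance`, the toy data, and the threshold of 0.1 are illustrative choices, not taken from the lecture. `statistics.variance` uses the same $N - 1$ denominator as the sample variance above.

```python
from statistics import variance

def prune_low_variance(rows, threshold):
    """Return the indices of features whose sample variance is at least `threshold`."""
    columns = list(zip(*rows))  # transpose: one tuple of values per feature
    return [j for j, col in enumerate(columns) if variance(col) >= threshold]

# Toy data (made up): feature 0 fires in every row, like the word "the";
# feature 1 fires in half of the rows.
D = [
    [1, 0],
    [1, 1],
    [1, 0],
    [1, 1],
]

kept = prune_low_variance(D, threshold=0.1)
print(kept)  # feature 0 has zero variance, so only feature 1 survives
```

For a binary feature, a very small or very large fraction of ones both give a variance near zero, so the variance rule covers the "too small or too large a fraction" rule as a special case.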

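Technique: Feature Normalization
Center a feature, where $\bar{\phi}$ is the feature's sample mean:
\[ \phi(x) \leftarrow \phi(x) - \bar{\phi} \]
(This was a required step for principal components analysis!)
Scale a feature. Two choices:
\[ \phi(x) \leftarrow \frac{\phi(x)}{\sqrt{\mathrm{Var}(\phi)}} \quad \text{(variance scaling)} \]
\[ \phi(x) \leftarrow \frac{\phi(x)}{\max_n |\phi(x_n)|} \quad \text{(absolute scaling)} \]
Remember that you'll need to normalize test data before you test!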
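The sketch below illustrates centering plus the two scaling choices for a single feature. It assumes that the mean, standard deviation, and maximum absolute value are estimated on the training data and then reused unchanged on the test data; the slide only says that test data must be normalized, so that reuse is an assumption (though a common one). Function names and numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def fit_variance_scaler(train_values):
    """Estimate the center and the scale (sqrt of sample variance) from training data."""
    return mean(train_values), sqrt(variance(train_values))

def fit_absolute_scaler(train_values):
    """Absolute scaling divides by the largest absolute training value."""
    return max(abs(v) for v in train_values)

train = [2.0, 4.0, 6.0]  # hypothetical training values of one feature
test = [8.0]             # hypothetical test value of the same feature

# Centering followed by variance scaling.
center, scale = fit_variance_scaler(train)
train_norm = [(v - center) / scale for v in train]
test_norm = [(v - center) / scale for v in test]  # assumption: reuse training statistics

# Absolute scaling (no centering here).
max_abs = fit_absolute_scaler(train)
train_abs = [v / max_abs for v in train]

print(train_norm, test_norm, train_abs)
```

Note that a test value can fall outside the range seen in training, as it does here, so normalized test features are not guaranteed to stay within the training range.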

Techniques: Creating New Features
1. Consider two binary features, $\phi_j$ and $\phi_{j'}$. A new conjunction feature can be defined by:
\[ \phi_{j \wedge j'}(x) = \phi_j(x) \wedge \phi_{j'}(x) \]
The classic xor problem: these points are not linearly separable.
Define $x[3] = x[1] \wedge x[2]$.
Rotating the view, the points become linearly separable in three dimensions, for example by the plane
\[ 2x[1] + 2x[2] - 4x[3] - 1 = 0 \]
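As a quick check of the xor example, the sketch below adds the conjunction feature to each of the four binary points and evaluates the left-hand side of the plane equation. The helper names are hypothetical; for 0/1 features the conjunction is computed as a product.

```python
def with_conjunction(x1, x2):
    """Append the conjunction feature x[3] = x[1] AND x[2] (a product for 0/1 features)."""
    return (x1, x2, x1 * x2)

def activation(x):
    """Left-hand side of the plane 2 x[1] + 2 x[2] - 4 x[3] - 1 = 0."""
    x1, x2, x3 = x
    return 2 * x1 + 2 * x2 - 4 * x3 - 1

results = {}
for x1 in (0, 1):
    for x2 in (0, 1):
        results[(x1, x2)] = activation(with_conjunction(x1, x2))
        print((x1, x2), "xor =", x1 ^ x2, "activation =", results[(x1, x2)])
```

The activation is positive exactly for the two points whose xor is 1, so a linear classifier over the three features separates the classes.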

Techniques: Creating New Features (continued)
Generalization of the conjunction feature: take the product of two features.
2. Even more generally, we can create conjunctions (or products) using as many features as we'd like.
This is one view of what decision trees are doing! Every leaf's path (from root) is a conjunction feature.
Why not build decision trees, extract the features, and toss them into the perceptron?
3. Transformations on features can be useful. For example,
\[ \phi(x) \leftarrow \operatorname{sign}(\phi(x)) \cdot \log\left(1 + |\phi(x)|\right) \]
Example: $\phi(x)$ is the count of the word "cool" in document $x$.
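A small sketch of the signed log transformation from item 3, applied to a few made-up word counts. The function name `signed_log` is an illustrative choice.

```python
from math import copysign, log1p

def signed_log(value):
    """Compute sign(value) * log(1 + |value|); zero stays zero, large counts are compressed."""
    return copysign(log1p(abs(value)), value)

# Hypothetical counts of the word "cool" in four documents.
counts = [0, 1, 10, 1000]
transformed = [signed_log(c) for c in counts]
print(transformed)
```

The transformation preserves the ordering of the counts while shrinking the gap between large values, so a document with 1000 occurrences no longer dominates one with 10.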

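Evaluation
Accuracy:
\[ A(f) = \sum_{x} D(x, f(x)) = \sum_{x,y} D(x, y) \cdot \begin{cases} 1 & \text{if } f(x) = y \\ 0 & \text{otherwise} \end{cases} = \sum_{x,y} D(x, y) \cdot \mathbf{1}[f(x) = y] \]
where $D$ is the true distribution over data. Error is $1 - A$; earlier we denoted error $\epsilon(f)$.
This is estimated using a test dataset $\langle x_1, y_1 \rangle, \ldots, \langle x_N, y_N \rangle$:
\[ \hat{A}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[f(x_i) = y_i] \]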
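The test-set estimate can be sketched in a few lines, assuming the test data is a list of (x, y) pairs and the classifier is any Python callable. The threshold classifier and the four examples are made up for illustration.

```python
def estimated_accuracy(f, test_data):
    """Fraction of test pairs (x, y) for which the classifier f predicts y."""
    return sum(1 for x, y in test_data if f(x) == y) / len(test_data)

def f(x):
    """Hypothetical classifier: predict "pos" for positive inputs."""
    return "pos" if x > 0 else "neg"

test_data = [(1, "pos"), (-2, "neg"), (3, "pos"), (-1, "pos")]

acc = estimated_accuracy(f, test_data)
err = 1 - acc
print(acc, err)  # three of the four examples are classified correctly
```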

Issues with Test-Set Accuracy
Class imbalance: if $D(\cdot, \text{not spam}) = 0.99$, then you can get $\hat{A} \approx 0.99$ by always guessing "not spam".
Relative importance of classes or cost of error types.
Variance due to the test data.
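To make the class-imbalance point concrete, here is a sketch with a hypothetical test set of 99 legitimate messages and 1 spam message, scored against a classifier that ignores its input.

```python
def always_not_spam(message):
    """A constant classifier that ignores its input."""
    return "not spam"

# Hypothetical test set: 99 legitimate messages and 1 spam message.
test_labels = ["not spam"] * 99 + ["spam"]
predictions = [always_not_spam(None) for _ in test_labels]

pairs = list(zip(predictions, test_labels))
accuracy = sum(p == y for p, y in pairs) / len(pairs)
spam_found = sum(p == "spam" and y == "spam" for p, y in pairs)

print(accuracy)    # 0.99
print(spam_found)  # 0
```

The accuracy looks excellent even though the classifier never identifies a single spam message, which is why accuracy alone can mislead on imbalanced data.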

Evaluation in the Two-Class Case
Suppose we have two classes, and one of them, $t$, is a target. E.g., given a query, find relevant documents.
Precision and recall encode the goals of returning a "pure" set of targeted instances and capturing all of them.
A: instances actually in the target class; $y = t$
B: instances believed to be in the target class; $f(x) = t$
C: instances correctly labeled as $t$; $C = A \cap B$
\[ \hat{P}(f) = \frac{|C|}{|B|} = \frac{|A \cap B|}{|B|} \]
\[ \hat{R}(f) = \frac{|C|}{|A|} = \frac{|A \cap B|}{|A|} \]
\[ \hat{F}_1(f) = 2 \cdot \frac{\hat{P} \cdot \hat{R}}{\hat{P} + \hat{R}} \]
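The set definitions translate directly into code. In this sketch, A and B are Python sets of instance identifiers; the identifiers are made up, and returning 0.0 when a denominator is empty is an assumed convention that the slides do not specify.

```python
def precision_recall_f1(A, B):
    """Precision, recall, and F1 from the sets A (actual targets) and B (predicted targets)."""
    C = A & B  # correctly labeled as the target class
    precision = len(C) / len(B) if B else 0.0  # assumption: 0.0 when nothing is predicted
    recall = len(C) / len(A) if A else 0.0     # assumption: 0.0 when there are no targets
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

A = {1, 2, 3, 4}  # hypothetical documents that are truly relevant
B = {3, 4, 5}     # hypothetical documents the classifier returned

p, r, f1 = precision_recall_f1(A, B)
print(p, r, f1)  # precision 2/3, recall 1/2, F1 4/7
```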

Another View: Contingency Table

              y = t                     y ≠ t
f(x) = t      C (true positives)        B \ C (false positives)   B
f(x) ≠ t      A \ C (false negatives)   (true negatives)
              A
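The four cells of the table can be counted from paired gold and predicted labels. This is a sketch under the assumption that labels are plain Python values and that any label other than the target counts as "not t"; the function name and the example labels are hypothetical.

```python
def contingency_counts(gold, predicted, target):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for y, y_hat in zip(gold, predicted):
        if y_hat == target and y == target:
            tp += 1  # in C
        elif y_hat == target:
            fp += 1  # in B \ C
        elif y == target:
            fn += 1  # in A \ C
        else:
            tn += 1  # in neither A nor B
    return tp, fp, fn, tn

gold = ["t", "t", "o", "o", "t"]
predicted = ["t", "o", "t", "o", "t"]

counts = contingency_counts(gold, predicted, target="t")
print(counts)  # (2, 1, 1, 1)
```

The row total tp + fp is |B| and the column total tp + fn is |A|, matching the margins of the table.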
