Algorithm Independent Machine Learning


1 Algorithm Independent Machine Learning. 15s1: COMP9417 Machine Learning and Data Mining, School of Computer Science and Engineering, University of New South Wales. April 22, 2015

2 Acknowledgements Material derived from slides for the book Machine Learning by T. Mitchell, McGraw-Hill (1997); slides by Andrew W. Moore; the book Data Mining by Ian H. Witten and Eibe Frank, Morgan Kaufmann; the book Pattern Classification by Richard O. Duda, Peter E. Hart, and David G. Stork, Copyright (c) 2001 by John Wiley & Sons, Inc.; the book The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman, (c) 2001, Springer; and slides for the book Machine Learning by P. Flach, Cambridge University Press (2012).

3 1. Aims This lecture will enable you to define and reproduce fundamental approaches and results from the computational and statistical theory of machine learning. Following it you should be able to:
- describe a basic theoretical framework for the sample complexity of learning
- define training error and true error
- describe the Probably Approximately Correct (PAC) learning framework
- describe the Vapnik-Chervonenkis (VC) dimension framework
- describe the Mistake Bounds framework and apply the WINNOW algorithm
- outline the No Free Lunch Theorem
- describe the framework of the bias-variance decomposition and some of its practical implications

4 2. Introduction What general laws constrain inductive learning? Theoretical results come from two main areas: computational complexity and statistics. In both cases, theoretical results are not always easy to obtain for practically useful machine learning methods and applications: we need to make (sometimes unrealistic) assumptions, and results may be overly pessimistic, e.g., worst-case bounds. Nonetheless, there are results which are important for understanding some key issues in machine learning.

5 Computational Learning Theory We seek theory to relate:
- probability of successful learning
- number of training examples
- complexity of the hypothesis space
- time complexity of the learning algorithm
- accuracy to which the target concept is approximated
- manner in which training examples are presented

6 Computational Learning Theory Characterise classes of algorithms using questions such as:
- Sample complexity: how many training examples are required for the learner to converge (with high probability) to a successful hypothesis?
- Computational complexity: how much computational effort is required for the learner to converge (with high probability) to a successful hypothesis?
- Hypothesis complexity: how do we measure the complexity of a hypothesis? How large is a hypothesis space?
- Mistake bounds: how many training examples will the learner misclassify before converging to a successful hypothesis?

7 Computational Learning Theory What do we consider to be a successful hypothesis: one identical to the target concept? one that mostly agrees with the target concept? ... one that does so most of the time?

8 Prototypical Concept Learning Task Given:
- Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
- Target function c: EnjoySport : X → {0, 1}
- Hypotheses H: conjunctions of literals, e.g., ⟨?, Cold, High, ?, ?, ?⟩
- Training examples D: positive and negative examples of the target function, ⟨x_1, c(x_1)⟩, ..., ⟨x_m, c(x_m)⟩
Determine:
- A hypothesis h in H such that h(x) = c(x) for all x in D?
- A hypothesis h in H such that h(x) = c(x) for all x in X?

9 Sample Complexity How many training examples are sufficient to learn the target concept?
- If the learner proposes instances, as queries to a teacher: the learner proposes instance x, the teacher provides c(x)
- If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form ⟨x, c(x)⟩
- If some random process (e.g., nature) proposes instances: instance x is generated randomly, the teacher provides c(x)

10 Sample Complexity: 1 Learner proposes instance x, teacher provides c(x) (assume c is in the learner's hypothesis space H). Optimal query strategy: play "20 questions":
- pick an instance x such that half of the hypotheses in VS classify x positive, and half classify x negative
- when this is possible, we need log_2|H| queries to learn c
- when it is not possible, we need even more
Here the learner conducts experiments (generates queries); see Version Space learning.

11 Sample Complexity: 2 Teacher (who knows c) provides training examples (assume c is in the learner's hypothesis space H). Optimal teaching strategy: depends on the H used by the learner. Consider the case H = conjunctions of up to n Boolean literals, e.g., (AirTemp = Warm) ∧ (Wind = Strong), where AirTemp, Wind, ... each have 2 possible values. Generally, if there are n possible Boolean attributes in H, then n + 1 examples suffice. Why?

12 Sample Complexity: 3 Given:
- a set of instances X
- a set of hypotheses H
- a set of possible target concepts C
- training instances generated by a fixed, unknown probability distribution D over X
The learner observes a sequence D of training examples of the form ⟨x, c(x)⟩ for some target concept c ∈ C:
- instances x are drawn from distribution D
- the teacher provides the target value c(x) for each
The learner must output a hypothesis h estimating c; h is evaluated by its performance on subsequent instances drawn according to D. Note: randomly drawn instances, noise-free classifications.

13 True Error of a Hypothesis [Figure: instance space X, showing the regions covered by target concept c and hypothesis h, and the region where c and h disagree.]

14 True Error of a Hypothesis Definition: the true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D: error_D(h) ≡ Pr_{x ∈ D}[c(x) ≠ h(x)]

15 Two Notions of Error The training error of hypothesis h with respect to target concept c is how often h(x) ≠ c(x) over the training instances. The true error of hypothesis h with respect to c is how often h(x) ≠ c(x) over future random instances. Our concern: can we bound the true error of h given the training error of h? First consider the case where the training error of h is zero (i.e., h ∈ VS_{H,D}).

16 Exhausting the Version Space [Figure: hypothesis space H, each hypothesis annotated with its true error ("error") and training error ("r"); the hypotheses with training error r = 0 make up the version space VS_{H,D}.]

17 Exhausting the Version Space Note: in the diagram, r = training error and error = true error. Definition: the version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has true error less than ε with respect to c and D: (∀h ∈ VS_{H,D}) error_D(h) < ε. So VS_{H,D} is not ε-exhausted if it contains at least one h with error_D(h) ≥ ε.

18 How many examples will ε-exhaust the VS? Theorem [Haussler, 1988]: if the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than |H| e^{-εm}. Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error_D(h) ≥ ε.

19 How many examples will ε-exhaust the VS? If we want this probability to be below δ, then |H| e^{-εm} ≤ δ, which gives m ≥ (1/ε)(ln|H| + ln(1/δ)).
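
This bound is easy to evaluate numerically. A minimal Python sketch (the function name and the extra example for conjunctions of 10 Boolean literals are ours, not the lecture's):

```python
from math import ceil, log

def pac_sample_size(h_size, epsilon, delta):
    """Smallest m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta)),
    i.e. enough examples to epsilon-exhaust the version space with
    probability at least 1 - delta (Haussler's bound, finite H)."""
    return ceil((log(h_size) + log(1.0 / delta)) / epsilon)

print(pac_sample_size(973, 0.1, 0.05))      # 99: the EnjoySport case below
print(pac_sample_size(3 ** 10, 0.1, 0.05))  # 140: conjunctions of up to 10 Boolean literals
```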

20 Learning Conjunctions of Boolean Literals How many examples are sufficient to assure with probability at least (1 - δ) that every h in VS_{H,D} satisfies error_D(h) ≤ ε? Use our theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)).

21 Learning Conjunctions of Boolean Literals Suppose H contains conjunctions of constraints on up to n Boolean attributes (i.e., n Boolean literals: any h can contain a literal, or its negation, or neither). Then |H| = 3^n, and m ≥ (1/ε)(ln 3^n + ln(1/δ)), or m ≥ (1/ε)(n ln 3 + ln(1/δ)).

22 How About EnjoySport? m ≥ (1/ε)(ln|H| + ln(1/δ)). If H is as given in EnjoySport, then |H| = 973, and m ≥ (1/ε)(ln 973 + ln(1/δ)).

23 How About EnjoySport? ... if we want to assure that with probability 95%, VS contains only hypotheses with error_D(h) ≤ .1, then it is sufficient to have m examples, where m ≥ (1/.1)(ln 973 + ln(1/.05)), i.e., m ≥ 10(ln 973 + ln 20) = 10(6.88 + 3.00), giving m ≥ 98.8.

24 PAC Learning Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H. Definition: C is PAC-learnable by L using H if, for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n and size(c). Probably Approximately Correct learning (Valiant, 1984).

25 Agnostic Learning So far we assumed c ∈ H (consistent learners). In the agnostic learning setting we don't assume c ∈ H. What do we want then? The hypothesis h that makes the fewest errors on the training data. What is the sample complexity in this case? It is derived from Hoeffding bounds: Pr[error_D(h) > error_train(h) + ε] ≤ e^{-2mε²}, where error_train(h) is the error of h on the training sample, which gives m ≥ (1/(2ε²))(ln|H| + ln(1/δ)).
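
For comparison with the consistent-learner bound, a small sketch of the agnostic bound (again, the function name is ours):

```python
from math import ceil, log

def agnostic_sample_size(h_size, epsilon, delta):
    """Smallest m with m >= (1/(2*epsilon**2)) * (ln|H| + ln(1/delta)),
    the Hoeffding-based bound for the agnostic setting (c not assumed in H)."""
    return ceil((log(h_size) + log(1.0 / delta)) / (2 * epsilon ** 2))

# Same |H|, epsilon and delta as the EnjoySport example, but roughly 5x as many examples.
print(agnostic_sample_size(973, 0.1, 0.05))   # 494
```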

26 Unbiased Learners An unbiased concept class C contains all target concepts definable on the instance space X: |C| = 2^|X|. Say X is defined using n Boolean features; then |X| = 2^n and |C| = 2^{2^n}. An unbiased learner has a hypothesis space able to represent all possible target concepts, i.e., H = C, so m ≥ (1/ε)(2^n ln 2 + ln(1/δ)), i.e., sample complexity exponential in the number of features.

27 How Complex is a Hypothesis?

28 How Complex is a Hypothesis? The solid curve is the function sin(50x) for x ∈ [0, 1]. The blue (solid) and green (hollow) points illustrate how the associated indicator function I(sin(αx) > 0) can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency α. Classes are separated based on sin(αx), for frequency α, a single parameter.

29 Shattering a Set of Instances Definition: a dichotomy of a set S is a partition of S into two disjoint subsets. Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

30 Three Instances Shattered [Figure: instance space X, showing a set of three instances shattered by the hypothesis space.]

31 Example 4.7, p.125: Shattering a set of instances Consider the following instances:
m = ManyTeeth ∧ ¬Gills ∧ ¬Short ∧ ¬Beak
g = ¬ManyTeeth ∧ Gills ∧ ¬Short ∧ ¬Beak
s = ¬ManyTeeth ∧ ¬Gills ∧ Short ∧ ¬Beak
b = ¬ManyTeeth ∧ ¬Gills ∧ ¬Short ∧ Beak
There are 16 different subsets of the set {m, g, s, b}. Can each of them be represented by its own conjunctive concept? The answer is yes: for every instance we want to exclude, we add the corresponding negated literal to the conjunction. Thus, {m, s} is represented by ¬Gills ∧ ¬Beak, {g, s, b} is represented by ¬ManyTeeth, {s} is represented by ¬ManyTeeth ∧ ¬Gills ∧ ¬Beak, and so on. We say that this set of four instances is shattered by the hypothesis language of conjunctive concepts.
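
The shattering claim can be checked by brute force. In the sketch below, the encoding of the four instances as 0/1 vectors over (ManyTeeth, Gills, Short, Beak) and all names are our own choices:

```python
from itertools import combinations, product

# The four instances from Example 4.7, written as truth assignments to the
# features (ManyTeeth, Gills, Short, Beak); each has exactly one feature true.
instances = {
    "m": (1, 0, 0, 0),
    "g": (0, 1, 0, 0),
    "s": (0, 0, 1, 0),
    "b": (0, 0, 0, 1),
}

def satisfies(x, conj):
    """conj assigns each feature 1 (positive literal), 0 (negated literal) or None (absent)."""
    return all(lit is None or lit == v for v, lit in zip(x, conj))

# Enumerate all 3^4 conjunctions and check that every subset of the instances
# is picked out by at least one of them, i.e., the set is shattered.
conjunctions = list(product([None, 0, 1], repeat=4))
names = list(instances)
shattered = True
for r in range(len(names) + 1):
    for subset in combinations(names, r):
        target = set(subset)
        if not any({n for n in names if satisfies(instances[n], c)} == target
                   for c in conjunctions):
            shattered = False
            print("no conjunction represents", target)
print("shattered" if shattered else "not shattered")   # prints: shattered
```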

32 The Vapnik-Chervonenkis Dimension Definition: the Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.

33 VC Dim. of Linear Decision Surfaces

34 VC Dim. of Linear Decision Surfaces [Figure: panels (a) and (b), illustrating which sets of points in the plane can and cannot be shattered by linear decision surfaces.]

35 Sample Complexity from VC Dimension How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 - δ)? m ≥ (1/ε)(4 log_2(2/δ) + 8 VC(H) log_2(13/ε))
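
As before, the bound can be computed directly; the sketch below and the VC-dimension value used in the example (3, for linear separators in the plane) are our choices:

```python
from math import ceil, log2

def vc_sample_size(vc_dim, epsilon, delta):
    """Smallest m with m >= (1/epsilon)*(4*log2(2/delta) + 8*VC(H)*log2(13/epsilon)),
    the VC-dimension based sufficient sample size quoted on the slide."""
    return ceil((4 * log2(2.0 / delta) + 8 * vc_dim * log2(13.0 / epsilon)) / epsilon)

# Linear separators in the plane have VC dimension 3.
print(vc_sample_size(3, 0.1, 0.05))   # 1899
```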

36 Mistake Bounds So far: how many examples are needed to learn? What about: how many mistakes before convergence? Consider a setting similar to PAC learning: instances are drawn at random from X according to distribution D, and the learner must classify each instance before receiving the correct classification from the teacher. Can we bound the number of mistakes the learner makes before converging?

37 WINNOW Mistake bounds follow Littlestone (1987), who developed WINNOW, an online, mistake-driven algorithm similar to the perceptron. It learns a linear threshold function and is designed to learn in the presence of many irrelevant features: the number of mistakes grows only logarithmically with the number of irrelevant features. The next slide shows the algorithm known as WINNOW2; in WINNOW1, demotion is replaced by attribute elimination.

38 WINNOW2
While some instances are misclassified
    For each instance a
        classify a using the current weights
        If the predicted class is incorrect
            If a has class 1
                For each a_i = 1, w_i ← α·w_i (if a_i = 0, leave w_i unchanged)    # Promotion
            Otherwise
                For each a_i = 1, w_i ← w_i / α (if a_i = 0, leave w_i unchanged)    # Demotion
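
A runnable sketch of WINNOW2 under the assumptions above (Boolean features, α > 1, threshold θ); the default θ = n/2, the epoch cap and the toy data set are our additions, not part of the lecture's pseudocode:

```python
def winnow2(data, n_features, alpha=2.0, theta=None, max_epochs=100):
    theta = n_features / 2.0 if theta is None else theta   # a common default choice
    w = [1.0] * n_features
    for _ in range(max_epochs):
        mistakes = 0
        for a, label in data:                               # a is a 0/1 feature vector
            predicted = 1 if sum(wi * ai for wi, ai in zip(w, a)) > theta else 0
            if predicted != label:
                mistakes += 1
                for i, ai in enumerate(a):
                    if ai == 1:                              # promote on false negative, demote on false positive
                        w[i] = w[i] * alpha if label == 1 else w[i] / alpha
        if mistakes == 0:                                    # converged: all instances classified correctly
            break
    return w

# Target concept: x_0 OR x_1, with two irrelevant features x_2, x_3.
data = [((1, 0, 1, 0), 1), ((0, 1, 0, 1), 1), ((0, 0, 1, 1), 0), ((0, 0, 0, 0), 0)]
print(winnow2(data, 4))   # relevant features end up with larger weights
```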

39 WINNOW With a user-supplied threshold θ, the class is 1 if Σ_i w_i·a_i > θ. Note the similarity to the perceptron training rule, but WINNOW uses multiplicative weight updates with α > 1. It will do better than the perceptron with many irrelevant attributes: an attribute-efficient learner.

40 Mistake Bounds: Find-S Consider FIND-S when H = conjunctions of Boolean literals.
FIND-S:
    Initialize h to the most specific hypothesis l_1 ∧ ¬l_1 ∧ l_2 ∧ ¬l_2 ∧ ... ∧ l_n ∧ ¬l_n
    For each positive training instance x
        Remove from h any literal that is not satisfied by x
    Output hypothesis h.
How many mistakes before converging to the correct h?
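
A small Python sketch of FIND-S as given above; the representation of a hypothesis as a set of (feature index, value) literals and the toy data are ours:

```python
def find_s(examples, n_features):
    # Most specific hypothesis: every literal and its negation.
    h = {(i, v) for i in range(n_features) for v in (0, 1)}
    for x, label in examples:
        if label == 1:                        # only positive examples generalise h
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

def classify(h, x):
    return 1 if all(x[i] == v for (i, v) in h) else 0

examples = [((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 0), 0)]
h = find_s(examples, 3)
print(sorted(h))                 # [(0, 1), (2, 1)]  i.e. x_0 AND x_2
print(classify(h, (1, 1, 1)))    # 1
```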

41 Mistake Bounds: Find-S FIND-S will converge in the limit to an error-free hypothesis if c ∈ H and the data are noise-free. How many mistakes before converging to the correct h?

42 Mistake Bounds: Find-S FIND-S initially classifies all instances as negative, and generalizes on positive examples by dropping unsatisfied literals. Since c ∈ H, it will never classify a negative instance as positive: if it did, there would exist a hypothesis h' with c ≥_g h' and an instance x such that h'(x) = 1 and c(x) ≠ 1, contradicting the definition of the generality ordering ≥_g. So the mistake bound comes from the number of positives classified as negatives. How many mistakes before converging to the correct h?

43 Mistake Bounds: Find-S There are 2n terms in the initial hypothesis. On the first mistake, we remove half of these terms, leaving n. On each further mistake, we remove at least one term. In the worst case we will have to remove all n remaining terms, which would give the most general concept (everything is positive). So the worst-case number of mistakes is n + 1, corresponding to a worst-case sequence of learning steps that removes only one literal per step.

44 Mistake Bounds: Halving Algorithm Consider the Halving Algorithm: learn a concept using the version space CANDIDATE-ELIMINATION algorithm, and classify new instances by majority vote of the version space members. How many mistakes before converging to the correct h? ... in the worst case? ... in the best case?

46 Mistake Bounds: Halving Algorithm The HALVING ALGORITHM learns exactly when the version space contains only one hypothesis, which corresponds to the target concept. At each step it removes all hypotheses whose vote was incorrect; a mistake occurs when the majority-vote classification is incorrect. What is the mistake bound for converging to the correct h?

47 Mistake Bounds: Halving Algorithm How many mistakes in the worst case? On every step there is a mistake, because the majority vote is incorrect; on each mistake, the number of hypotheses is reduced by at least half; so for hypothesis space size |H|, the worst-case mistake bound is log_2|H|. How many mistakes in the best case? On every step there is no mistake, because the majority vote is correct; we still remove all incorrect hypotheses, up to half; so in the best case there are no mistakes in converging to the correct h.
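
A minimal sketch of the Halving Algorithm, keeping the version space as an explicit list of hypotheses; representing hypotheses as Python functions, the tie-breaking rule and the toy target x[3] are our choices:

```python
def halving(hypotheses, stream):
    vs = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in vs)
        prediction = 1 if votes * 2 > len(vs) else 0    # majority vote (ties predict 0)
        if prediction != label:
            mistakes += 1
        vs = [h for h in vs if h(x) == label]           # discard hypotheses that voted incorrectly
    return vs, mistakes

# H = the four single-feature hypotheses h_i(x) = x[i]; the target is x[3].
hypotheses = [(lambda x, i=i: x[i]) for i in range(4)]
stream = [((1, 0, 1, 0), 0), ((0, 1, 0, 1), 1), ((0, 1, 0, 0), 0)]
vs, m = halving(hypotheses, stream)
print(len(vs), m)   # 1 0: one consistent hypothesis left, no mistakes here (bound is log2(4) = 2)
```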

48 Optimal Mistake Bounds Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C and all possible training sequences): M_A(C) ≡ max_{c ∈ C} M_A(c). Definition: let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C): Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C). It holds that VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log_2(|C|).

49 Weighted Majority A generalisation of the HALVING ALGORITHM: it predicts by a weighted vote of a set of prediction algorithms, and learns by altering the weights of those prediction algorithms. A prediction algorithm simply predicts the value of the target concept given an instance; examples are the elements of a hypothesis space H, or different learning algorithms. When a prediction algorithm is inconsistent with a training example, its weight is reduced. This lets us bound the number of mistakes of the ensemble in terms of the number of mistakes made by the best prediction algorithm.

50 Weighted Majority
a_i is the i-th prediction algorithm; w_i is the weight associated with a_i.
For all i, initialize w_i ← 1
For each training example ⟨x, c(x)⟩
    Initialize q_0 and q_1 to 0
    For each prediction algorithm a_i
        If a_i(x) = 0 then q_0 ← q_0 + w_i
        If a_i(x) = 1 then q_1 ← q_1 + w_i
    If q_1 > q_0 then predict c(x) = 1
    If q_0 > q_1 then predict c(x) = 0
    If q_0 = q_1 then predict 0 or 1 at random for c(x)
    For each prediction algorithm a_i
        If a_i(x) ≠ c(x) then w_i ← β·w_i
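
The pseudocode transcribes almost directly into Python; the pool of predictors and the example stream below are illustrative choices of ours:

```python
import random

def weighted_majority(predictors, stream, beta=0.5):
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for wi, a in zip(w, predictors) if a(x) == 0)
        q1 = sum(wi for wi, a in zip(w, predictors) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q0 > q1 else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        for i, a in enumerate(predictors):
            if a(x) != label:
                w[i] *= beta              # down-weight predictors that erred
    return w, mistakes

predictors = [lambda x: x[0], lambda x: x[1], lambda x: 1 - x[0]]
stream = [((0, 1), 1), ((1, 0), 0), ((1, 1), 1)]   # target is x[1]
print(weighted_majority(predictors, stream))       # the predictor x[1] keeps weight 1
```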

51 Weighted Majority The WEIGHTED MAJORITY algorithm begins by assigning a weight of 1 to each prediction algorithm. On a misclassification by a prediction algorithm, its weight is reduced by multiplying by a constant β, 0 ≤ β < 1. It is equivalent to the HALVING ALGORITHM when β = 0; otherwise, it just down-weights the contribution of algorithms which make errors. Key result: the number of mistakes made by the WEIGHTED MAJORITY algorithm will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the number of prediction algorithms in the pool.

52 Some questions about Machine Learning Are there reasons to prefer one learning algorithm over another? Can we expect any method to be superior overall? Can we even find an algorithm that is overall superior to random guessing?

53 No Free Lunch Theorem Uniformly averaged over all target functions, the expected off-training-set error for all learning algorithms is the same. Even for a fixed training set, averaged over all target functions, no learning algorithm yields an off-training-set error that is superior to any other. Therefore, all statements of the form "learning algorithm 1 is better than algorithm 2" are ultimately statements about the relevant target functions.

54 No Free Lunch example Assuming that the training set D can be learned correctly by all algorithms, averaged over all target functions no learning algorithm gives an off-training-set error superior to any other: Σ_F [E_1(E|F, D) − E_2(E|F, D)] = 0

55 No Free Lunch example [Table: instances x with the target function F, the hypotheses h_1 and h_2 produced by algorithms 1 and 2, and markers for which instances belong to the training set D. On this particular target function F, E_1(E|F, D) = 0.4 and E_2(E|F, D) = 0.6.]

56 No Free Lunch example BUT if we have no prior knowledge about which F we are trying to learn, neither algorithm is superior to the other: both fit the training data correctly, but there are 2^5 target functions consistent with D, and for each of them there is exactly one other function whose output is inverted on each of the off-training-set patterns, so the performance of algorithms 1 and 2 will be inverted, thus ensuring an average error difference of zero.
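
The inversion argument can be checked numerically on a toy setup of our own (the training set, the off-training-set points and the two hypotheses below are illustrative, not the ones in the slide's table): averaged over every target function consistent with D, the two hypotheses have the same off-training-set error.

```python
from itertools import product

train = {(0, 0): 1, (0, 1): 0}                        # training set D (inputs -> labels)
off_train = [(1, 0), (1, 1), (0, 2)]                  # remaining "off-training-set" inputs

# Both hypotheses agree with D; they differ only off the training set.
h1 = lambda x: train[x] if x in train else (1 if x[0] == 1 else 0)
h2 = lambda x: train[x] if x in train else 0

err1 = err2 = 0.0
targets = list(product([0, 1], repeat=len(off_train)))   # all completions of F consistent with D
for labels in targets:
    F = {**train, **dict(zip(off_train, labels))}
    err1 += sum(h1(x) != F[x] for x in off_train) / len(off_train)
    err2 += sum(h2(x) != F[x] for x in off_train) / len(off_train)
print(err1 / len(targets), err2 / len(targets))        # both 0.5
```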

57 A Conservation Theorem of Generalization Performance For every possible learning algorithm for binary classification, the sum of performance over all possible target functions is exactly zero! If we get positive performance on some problems, there must be other problems for which we get an equal and opposite amount of negative performance.

58 A Conservation Theorem of Generalization Performance

59 A Conservation Theorem of Generalization Performance

60 A Conservation Theorem of Generalization Performance Takeaways: even popular, theoretically well-grounded algorithms will perform poorly on some domains, i.e., those where the learning algorithm is not a good match to the class of target functions. It is the assumptions about the learning domains that are relevant. Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems. (from: Duda, Hart & Stork (2001))

61 Ugly Duckling Theorem In the absence of assumptions there is no privileged or "best" feature representation. For n objects, the set of concepts has size 2^n. The number of concepts under which any pair of objects has the same concept definition, or pattern (both members, or both non-members), is 2^{n-1}. Using a finite number of predicates to distinguish any two patterns denoting sets of objects, the number of predicates shared by any two such patterns is constant and independent of those patterns. Therefore, even the notion of similarity between patterns depends on assumptions, i.e., on inductive bias.

62 Bias, Variance and Model Complexity

63 Bias, Variance and Model Complexity We can see the behaviour of different models' predictive accuracy on a test sample and on the training sample as the model complexity is varied. How should we measure model complexity? The number of parameters? Description length? ...

64 Example 3.8, p.92: Line fitting example Consider the following set of five points: [table of x and y values]. We want to estimate y by means of a polynomial in x. Figure 3.2 (left) shows the result for degrees 1 to 5 using linear regression, which will be explained in Chapter 7. The top two degrees fit the given points exactly (in general, any set of n points can be fitted by a polynomial of degree no more than n − 1), but they differ considerably at the extreme ends: e.g., the polynomial of degree 4 leads to a decreasing trend from x = 0 to x = 1, which is not really justified by the data.

65 Figure 3.2, p.92: Fitting polynomials to data (left) Polynomials of different degree fitted to a set of five points. From bottom to top in the top right-hand corner: degree 1 (straight line), degree 2 (parabola), degree 3, degree 4 (which is the lowest degree able to fit the points exactly), degree 5. (right) A piecewise constant function learned by a grouping model; the dotted reference line is the linear function from the left figure.
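
The behaviour described here is easy to reproduce; since the five points of Example 3.8 did not survive transcription, the data below are illustrative stand-ins of our own:

```python
import numpy as np

# Five illustrative points (not the ones from the book).
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([1.0, 1.4, 1.3, 1.9, 1.7])

for degree in range(1, 5):
    coeffs = np.polyfit(x, y, degree)                   # least-squares polynomial fit
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(residual, 6))
# The residual shrinks with degree and is (numerically) zero at degree 4:
# five points determine a degree-4 polynomial exactly.
```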

66 Overfitting again An n-degree polynomial has n + 1 parameters: e.g., a straight line y = a·x + b has two parameters, and the polynomial of degree 4 that fits the five points exactly has five parameters. A piecewise constant model with n segments has 2n − 1 parameters: n y-values and n − 1 x-values where the jumps occur. So the models that are able to fit the points exactly are the models with more parameters. A rule of thumb is that, to avoid overfitting, the number of parameters estimated from the data must be considerably less than the number of data points.

67 Bias and variance I If we underestimate the number of parameters of the model, we will not be able to decrease the loss to zero, regardless of how much training data we have. On the other hand, with a larger number of parameters the model will be more dependent on the training sample, and small variations in the training sample can result in a considerably different model. This is sometimes called the bias-variance dilemma: a low-complexity model suffers less from variability due to random variations in the training data, but may introduce a systematic bias that even large amounts of training data can't resolve; on the other hand, a high-complexity model eliminates such bias but can suffer non-systematic errors due to variance.

68 Bias and variance II We can make this a bit more precise by noting that the expected squared loss on a training example x can be decomposed as follows: E[(f(x) − f̂(x))²] = (f(x) − E[f̂(x)])² + E[(f̂(x) − E[f̂(x)])²]. The expectation is taken over different training sets and hence different function estimators. The first term on the right is zero if these function estimators get it right on average; otherwise the learning algorithm exhibits a systematic bias of some kind. The second term quantifies the variance in the function estimates f̂(x) as a result of variations in the training set.
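
The decomposition can be verified empirically by fitting many models to resampled training sets; the true function, noise level and model degrees in the sketch below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # the "true" function
x_train = np.linspace(0, 1, 10)
x0 = 0.3                                     # the point at which we decompose the loss

def fit_and_predict(degree):
    """Predictions f_hat(x0) from many models, each trained on a fresh noisy sample."""
    preds = []
    for _ in range(2000):
        y = f(x_train) + rng.normal(0, 0.3, size=x_train.shape)
        coeffs = np.polyfit(x_train, y, degree)
        preds.append(np.polyval(coeffs, x0))
    return np.array(preds)

for degree in (1, 5):
    p = fit_and_predict(degree)
    mse = np.mean((f(x0) - p) ** 2)          # left-hand side of the decomposition
    bias2 = (f(x0) - p.mean()) ** 2          # squared bias term
    var = p.var()                            # variance term
    print(degree, round(mse, 4), round(bias2 + var, 4))   # the two columns agree
```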

69 Figure 3.3, p.94: Bias and variance [Figure: four dartboards marked with darts.] A dartboard metaphor illustrating the concepts of bias and variance. Each dartboard corresponds to a different learning algorithm, and each dart signifies a different training sample. The top-row learning algorithms exhibit low bias, staying close to the bull's eye (the true function value for a particular x) on average, while the ones on the bottom row have high bias. The left column shows low variance and the right column high variance.

70 Bias-variance decomposition A theoretical tool for analyzing how much the specific training set affects the performance of a classifier. Assume we have an infinite number of classifiers built from different training sets of size n. The bias of a learning scheme is the expected error of the combined classifier on new data; the variance of a learning scheme is the expected error due to the particular training set used. Total expected error: bias + variance.

71 Bias-variance: a trade-off This is easier to see with regression, in the following figure (from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (2001); to see the details you will have to zoom in in your viewer): each column represents a different model class g(x), shown in red; each row represents a different set of n = 6 training points, D_i, randomly sampled from the target function F(x) with noise, shown in black; probability functions of the mean squared error E are shown.

72 Bias-variance: a trade-off

73 Bias-variance: a trade-off
a) is very poor: a linear model with fixed parameters, independent of the training data; high bias, zero variance.
b) is better: another linear model with fixed parameters, independent of the training data; slightly lower bias, zero variance.
c) is a cubic model with parameters trained by mean squared error on the training data; low bias, moderate variance.
d) is a linear model with parameters adjusted to fit each training set; intermediate bias and variance.
Training with more data would give c) a bias approaching a small value due to noise, but not d); the variance of all models would approach zero.

74 4. Summary: Algorithm Independent Machine Learning Algorithm-independent analyses use techniques from computational complexity, pattern recognition, statistics, etc. Some results are over-conservative from a practical viewpoint, e.g., worst-case rather than average-case analyses. Nonetheless, this theory has led to many practically useful algorithms, e.g.:
- PAC → Boosting
- VC dimension → SVM
- Mistake Bounds → WINNOW
- Bias-variance decomposition → Ensemble learning methods


More information

Randomized algorithms have several advantages over deterministic ones. We discuss them here:

Randomized algorithms have several advantages over deterministic ones. We discuss them here: CS787: Advanced Algorithms Lecture 6: Randomized Algorithms In this lecture we introduce randomized algorithms. We will begin by motivating the use of randomized algorithms through a few examples. Then

More information

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron Matt Gormley Lecture 5 Jan. 31, 2018 1 Q&A Q: We pick the best hyperparameters

More information

Ensemble Methods, Decision Trees

Ensemble Methods, Decision Trees CS 1675: Intro to Machine Learning Ensemble Methods, Decision Trees Prof. Adriana Kovashka University of Pittsburgh November 13, 2018 Plan for This Lecture Ensemble methods: introduction Boosting Algorithm

More information

CS229 Lecture notes. Raphael John Lamarre Townshend

CS229 Lecture notes. Raphael John Lamarre Townshend CS229 Lecture notes Raphael John Lamarre Townshend Decision Trees We now turn our attention to decision trees, a simple yet flexible class of algorithms. We will first consider the non-linear, region-based

More information

Association Rule Learning

Association Rule Learning Association Rule Learning 16s1: COMP9417 Machine Learning and Data Mining School of Computer Science and Engineering, University of New South Wales March 15, 2016 COMP9417 ML & DM (CSE, UNSW) Association

More information

An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains

An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains Seishi Okamoto Fujitsu Laboratories Ltd. 2-2-1 Momochihama, Sawara-ku Fukuoka 814, Japan seishi@flab.fujitsu.co.jp Yugami

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

Neural Networks (Overview) Prof. Richard Zanibbi

Neural Networks (Overview) Prof. Richard Zanibbi Neural Networks (Overview) Prof. Richard Zanibbi Inspired by Biology Introduction But as used in pattern recognition research, have little relation with real neural systems (studied in neurology and neuroscience)

More information

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty) Supervised Learning (contd) Linear Separation Mausam (based on slides by UW-AI faculty) Images as Vectors Binary handwritten characters Treat an image as a highdimensional vector (e.g., by reading pixel

More information

CS6716 Pattern Recognition

CS6716 Pattern Recognition CS6716 Pattern Recognition Prototype Methods Aaron Bobick School of Interactive Computing Administrivia Problem 2b was extended to March 25. Done? PS3 will be out this real soon (tonight) due April 10.

More information

1) Give decision trees to represent the following Boolean functions:

1) Give decision trees to represent the following Boolean functions: 1) Give decision trees to represent the following Boolean functions: 1) A B 2) A [B C] 3) A XOR B 4) [A B] [C Dl Answer: 1) A B 2) A [B C] 1 3) A XOR B = (A B) ( A B) 4) [A B] [C D] 2 2) Consider the following

More information

(Refer Slide Time 6:48)

(Refer Slide Time 6:48) Digital Circuits and Systems Prof. S. Srinivasan Department of Electrical Engineering Indian Institute of Technology Madras Lecture - 8 Karnaugh Map Minimization using Maxterms We have been taking about

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Machine Learning (CS 567)

Machine Learning (CS 567) Machine Learning (CS 567) Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol Han (cheolhan@usc.edu)

More information

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Reference Most of the slides are taken from the third chapter of the online book by Michael Nielson: neuralnetworksanddeeplearning.com

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Deepest Neural Networks

Deepest Neural Networks Deepest Neural Networks arxiv:707.0267v [cs.ne] 9 Jul 207 Raúl Rojas Dahlem Center for Machine Learning and Robotics Freie Universität Berlin July 207 Abstract This paper shows that a long chain of perceptrons

More information