Algorithm Independent Machine Learning


1 Algorithm Independent Machine Learning. 15s1: COMP9417 Machine Learning and Data Mining, School of Computer Science and Engineering, University of New South Wales. April 22, 2015

2 Acknowledgements Material derived from slides for the book Machine Learning by T. Mitchell, McGraw-Hill (1997); slides by Andrew W. Moore; the book Data Mining by Ian H. Witten and Eibe Frank, Morgan Kaufmann; the book Pattern Classification by Richard O. Duda, Peter E. Hart, and David G. Stork, Copyright (c) 2001 by John Wiley & Sons, Inc.; the book The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman, (c) 2001, Springer; and slides for the book Machine Learning by P. Flach, Cambridge University Press (2012).

3 1. Aims This lecture will enable you to define and reproduce fundamental approaches and results from the computational and statistical theory of machine learning. Following it you should be able to:
- describe a basic theoretical framework for the sample complexity of learning
- define training error and true error
- describe the Probably Approximately Correct (PAC) learning framework
- describe the Vapnik-Chervonenkis (VC) dimension framework
- describe the Mistake Bounds framework and apply the WINNOW algorithm
- outline the No Free Lunch Theorem
- describe the framework of the bias-variance decomposition and some of its practical implications

4 2. Introduction What general laws constrain inductive learning? Theoretical results come from two main areas: computational complexity and statistics. In both cases, theoretical results are not always easy to obtain for practically useful machine learning methods and applications: we need to make (sometimes unrealistic) assumptions, and results may be overly pessimistic, e.g., worst-case bounds. Nonetheless, there are results which are important for understanding some key issues in machine learning.

5 Computational Learning Theory We seek theory to relate:
- probability of successful learning
- number of training examples
- complexity of the hypothesis space
- time complexity of the learning algorithm
- accuracy to which the target concept is approximated
- manner in which training examples are presented

6 Computational Learning Theory Characterise classes of algorithms using questions such as:
- Sample complexity: how many training examples are required for the learner to converge (with high probability) to a successful hypothesis?
- Computational complexity: how much computational effort is required for the learner to converge (with high probability) to a successful hypothesis?
- Hypothesis complexity: how do we measure the complexity of a hypothesis? How large is a hypothesis space?
- Mistake bounds: how many training examples will the learner misclassify before converging to a successful hypothesis?

7 Computational Learning Theory What do we consider to be a successful hypothesis: one identical to the target concept? one that mostly agrees with the target concept? ... one that does so most of the time?

8 Prototypical Concept Learning Task Given:
- Instances X: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
- Target function c: EnjoySport : X → {0, 1}
- Hypotheses H: conjunctions of literals, e.g., ⟨?, Cold, High, ?, ?, ?⟩
- Training examples D: positive and negative examples of the target function, ⟨x_1, c(x_1)⟩, ..., ⟨x_m, c(x_m)⟩
Determine:
- A hypothesis h in H such that h(x) = c(x) for all x in D?
- A hypothesis h in H such that h(x) = c(x) for all x in X?

9 Sample Complexity How many training examples are sufficient to learn the target concept?
- If the learner proposes instances, as queries to a teacher: the learner proposes instance x, the teacher provides c(x)
- If the teacher (who knows c) provides training examples: the teacher provides a sequence of examples of the form ⟨x, c(x)⟩
- If some random process (e.g., nature) proposes instances: instance x is generated randomly, the teacher provides c(x)

10 Sample Complexity: 1 Learner proposes instance x, teacher provides c(x) (assume c is in the learner's hypothesis space H). Optimal query strategy: play "20 questions":
- pick an instance x such that half of the hypotheses in VS classify x positive, and half classify x negative
- when this is possible, we need log_2|H| queries to learn c
- when it is not possible, we need even more
Here the learner conducts experiments (generates queries); see Version Space learning.

11 Sample Complexity: 2 Teacher (who knows c) provides training examples (assume c is in the learner's hypothesis space H). Optimal teaching strategy: depends on the H used by the learner. Consider the case H = conjunctions of up to n Boolean literals, e.g., (AirTemp = Warm) ∧ (Wind = Strong), where AirTemp, Wind, ... each have 2 possible values. Generally, if there are n possible Boolean attributes in H, then n + 1 examples suffice. Why?

12 Sample Complexity: 3 Given:
- a set of instances X
- a set of hypotheses H
- a set of possible target concepts C
- training instances generated by a fixed, unknown probability distribution D over X
The learner observes a sequence D of training examples of the form ⟨x, c(x)⟩ for some target concept c ∈ C:
- instances x are drawn from distribution D
- the teacher provides the target value c(x) for each
The learner must output a hypothesis h estimating c; h is evaluated by its performance on subsequent instances drawn according to D. Note: randomly drawn instances, noise-free classifications.

13 True Error of a Hypothesis [Figure: instance space X, showing the regions covered by target concept c and hypothesis h, and the region where c and h disagree.]

14 True Error of a Hypothesis Definition: the true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D: error_D(h) ≡ Pr_{x ∈ D}[c(x) ≠ h(x)]

15 Two Notions of Error The training error of hypothesis h with respect to target concept c is how often h(x) ≠ c(x) over the training instances. The true error of hypothesis h with respect to c is how often h(x) ≠ c(x) over future random instances. Our concern: can we bound the true error of h given the training error of h? First consider the case where the training error of h is zero (i.e., h ∈ VS_{H,D}).

16 Exhausting the Version Space [Figure: hypothesis space H, each hypothesis annotated with its true error ("error") and training error ("r"); the hypotheses with training error r = 0 make up the version space VS_{H,D}.]

17 Exhausting the Version Space Note: in the diagram, r = training error and error = true error. Definition: the version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h in VS_{H,D} has true error less than ε with respect to c and D: (∀h ∈ VS_{H,D}) error_D(h) < ε. So VS_{H,D} is not ε-exhausted if it contains at least one h with error_D(h) ≥ ε.

18 How many examples will ε-exhaust the VS? Theorem [Haussler, 1988]: if the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than |H| e^{-εm}. Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error_D(h) ≥ ε.

19 How many examples will ε-exhaust the VS? If we want this probability to be below δ, then |H| e^{-εm} ≤ δ, which gives m ≥ (1/ε)(ln|H| + ln(1/δ)).
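
This bound is easy to evaluate numerically. A minimal Python sketch (the function name and the extra example for conjunctions of 10 Boolean literals are ours, not the lecture's):

```python
from math import ceil, log

def pac_sample_size(h_size, epsilon, delta):
    """Smallest m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta)),
    i.e. enough examples to epsilon-exhaust the version space with
    probability at least 1 - delta (Haussler's bound, finite H)."""
    return ceil((log(h_size) + log(1.0 / delta)) / epsilon)

print(pac_sample_size(973, 0.1, 0.05))      # 99: the EnjoySport case below
print(pac_sample_size(3 ** 10, 0.1, 0.05))  # 140: conjunctions of up to 10 Boolean literals
```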

20 Learning Conjunctions of Boolean Literals How many examples are sufficient to assure with probability at least (1 - δ) that every h in VS_{H,D} satisfies error_D(h) ≤ ε? Use our theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)).

21 Learning Conjunctions of Boolean Literals Suppose H contains conjunctions of constraints on up to n Boolean attributes (i.e., n Boolean literals: any h can contain a literal, or its negation, or neither). Then |H| = 3^n, and m ≥ (1/ε)(ln 3^n + ln(1/δ)), or m ≥ (1/ε)(n ln 3 + ln(1/δ)).

22 How About EnjoySport? m ≥ (1/ε)(ln|H| + ln(1/δ)). If H is as given in EnjoySport, then |H| = 973, and m ≥ (1/ε)(ln 973 + ln(1/δ)).

23 How About EnjoySport? ... if we want to assure that with probability 95%, VS contains only hypotheses with error_D(h) ≤ .1, then it is sufficient to have m examples, where m ≥ (1/.1)(ln 973 + ln(1/.05)), i.e., m ≥ 10(ln 973 + ln 20) = 10(6.88 + 3.00), giving m ≥ 98.8.

24 PAC Learning Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H. Definition: C is PAC-learnable by L using H if, for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n and size(c). Probably Approximately Correct learning (Valiant, 1984).

25 Agnostic Learning So far we assumed c ∈ H (consistent learners). In the agnostic learning setting we don't assume c ∈ H. What do we want then? The hypothesis h that makes the fewest errors on the training data. What is the sample complexity in this case? It is derived from Hoeffding bounds: Pr[error_D(h) > error_train(h) + ε] ≤ e^{-2mε²}, where error_train(h) is the error of h on the training sample, which gives m ≥ (1/(2ε²))(ln|H| + ln(1/δ)).
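
For comparison with the consistent-learner bound, a small sketch of the agnostic bound (again, the function name is ours):

```python
from math import ceil, log

def agnostic_sample_size(h_size, epsilon, delta):
    """Smallest m with m >= (1/(2*epsilon**2)) * (ln|H| + ln(1/delta)),
    the Hoeffding-based bound for the agnostic setting (c not assumed in H)."""
    return ceil((log(h_size) + log(1.0 / delta)) / (2 * epsilon ** 2))

# Same |H|, epsilon and delta as the EnjoySport example, but roughly 5x as many examples.
print(agnostic_sample_size(973, 0.1, 0.05))   # 494
```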

26 Unbiased Learners An unbiased concept class C contains all target concepts definable on the instance space X: |C| = 2^|X|. Say X is defined using n Boolean features; then |X| = 2^n and |C| = 2^{2^n}. An unbiased learner has a hypothesis space able to represent all possible target concepts, i.e., H = C, so m ≥ (1/ε)(2^n ln 2 + ln(1/δ)), i.e., sample complexity exponential in the number of features.

27 How Complex is a Hypothesis?

28 How Complex is a Hypothesis? The solid curve is the function sin(50x) for x ∈ [0, 1]. The blue (solid) and green (hollow) points illustrate how the associated indicator function I(sin(αx) > 0) can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency α. Classes are separated based on sin(αx), for frequency α, a single parameter.

29 Shattering a Set of Instances Definition: a dichotomy of a set S is a partition of S into two disjoint subsets. Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.

30 Three Instances Shattered [Figure: instance space X, showing a set of three instances shattered by the hypothesis space.]

31 Example 4.7, p.125: Shattering a set of instances Consider the following instances:
m = ManyTeeth ∧ ¬Gills ∧ ¬Short ∧ ¬Beak
g = ¬ManyTeeth ∧ Gills ∧ ¬Short ∧ ¬Beak
s = ¬ManyTeeth ∧ ¬Gills ∧ Short ∧ ¬Beak
b = ¬ManyTeeth ∧ ¬Gills ∧ ¬Short ∧ Beak
There are 16 different subsets of the set {m, g, s, b}. Can each of them be represented by its own conjunctive concept? The answer is yes: for every instance we want to exclude, we add the corresponding negated literal to the conjunction. Thus, {m, s} is represented by ¬Gills ∧ ¬Beak, {g, s, b} is represented by ¬ManyTeeth, {s} is represented by ¬ManyTeeth ∧ ¬Gills ∧ ¬Beak, and so on. We say that this set of four instances is shattered by the hypothesis language of conjunctive concepts.
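
The shattering claim can be checked by brute force. In the sketch below, the encoding of the four instances as 0/1 vectors over (ManyTeeth, Gills, Short, Beak) and all names are our own choices:

```python
from itertools import combinations, product

# The four instances from Example 4.7, written as truth assignments to the
# features (ManyTeeth, Gills, Short, Beak); each has exactly one feature true.
instances = {
    "m": (1, 0, 0, 0),
    "g": (0, 1, 0, 0),
    "s": (0, 0, 1, 0),
    "b": (0, 0, 0, 1),
}

def satisfies(x, conj):
    """conj assigns each feature 1 (positive literal), 0 (negated literal) or None (absent)."""
    return all(lit is None or lit == v for v, lit in zip(x, conj))

# Enumerate all 3^4 conjunctions and check that every subset of the instances
# is picked out by at least one of them, i.e., the set is shattered.
conjunctions = list(product([None, 0, 1], repeat=4))
names = list(instances)
shattered = True
for r in range(len(names) + 1):
    for subset in combinations(names, r):
        target = set(subset)
        if not any({n for n in names if satisfies(instances[n], c)} == target
                   for c in conjunctions):
            shattered = False
            print("no conjunction represents", target)
print("shattered" if shattered else "not shattered")   # prints: shattered
```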

32 The Vapnik-Chervonenkis Dimension Definition: the Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) ≡ ∞.

33 VC Dim. of Linear Decision Surfaces

34 VC Dim. of Linear Decision Surfaces [Figure: panels (a) and (b), illustrating which sets of points in the plane can and cannot be shattered by linear decision surfaces.]

35 Sample Complexity from VC Dimension How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1 - δ)? m ≥ (1/ε)(4 log_2(2/δ) + 8 VC(H) log_2(13/ε))
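
As before, the bound can be computed directly; the sketch below and the VC-dimension value used in the example (3, for linear separators in the plane) are our choices:

```python
from math import ceil, log2

def vc_sample_size(vc_dim, epsilon, delta):
    """Smallest m with m >= (1/epsilon)*(4*log2(2/delta) + 8*VC(H)*log2(13/epsilon)),
    the VC-dimension based sufficient sample size quoted on the slide."""
    return ceil((4 * log2(2.0 / delta) + 8 * vc_dim * log2(13.0 / epsilon)) / epsilon)

# Linear separators in the plane have VC dimension 3.
print(vc_sample_size(3, 0.1, 0.05))   # 1899
```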

36 Mistake Bounds So far: how many examples are needed to learn? What about: how many mistakes before convergence? Consider a setting similar to PAC learning: instances are drawn at random from X according to distribution D, and the learner must classify each instance before receiving the correct classification from the teacher. Can we bound the number of mistakes the learner makes before converging?

37 WINNOW Mistake bounds follow Littlestone (1987), who developed WINNOW, an online, mistake-driven algorithm similar to the perceptron. It learns a linear threshold function and is designed to learn in the presence of many irrelevant features: the number of mistakes grows only logarithmically with the number of irrelevant features. The next slide shows the algorithm known as WINNOW2; in WINNOW1, demotion is replaced by attribute elimination.

38 WINNOW2
While some instances are misclassified
    For each instance a
        classify a using the current weights
        If the predicted class is incorrect
            If a has class 1
                For each a_i = 1, w_i ← α·w_i (if a_i = 0, leave w_i unchanged)    # Promotion
            Otherwise
                For each a_i = 1, w_i ← w_i / α (if a_i = 0, leave w_i unchanged)    # Demotion
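
A runnable sketch of WINNOW2 under the assumptions above (Boolean features, α > 1, threshold θ); the default θ = n/2, the epoch cap and the toy data set are our additions, not part of the lecture's pseudocode:

```python
def winnow2(data, n_features, alpha=2.0, theta=None, max_epochs=100):
    theta = n_features / 2.0 if theta is None else theta   # a common default choice
    w = [1.0] * n_features
    for _ in range(max_epochs):
        mistakes = 0
        for a, label in data:                               # a is a 0/1 feature vector
            predicted = 1 if sum(wi * ai for wi, ai in zip(w, a)) > theta else 0
            if predicted != label:
                mistakes += 1
                for i, ai in enumerate(a):
                    if ai == 1:                              # promote on false negative, demote on false positive
                        w[i] = w[i] * alpha if label == 1 else w[i] / alpha
        if mistakes == 0:                                    # converged: all instances classified correctly
            break
    return w

# Target concept: x_0 OR x_1, with two irrelevant features x_2, x_3.
data = [((1, 0, 1, 0), 1), ((0, 1, 0, 1), 1), ((0, 0, 1, 1), 0), ((0, 0, 0, 0), 0)]
print(winnow2(data, 4))   # relevant features end up with larger weights
```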

39 WINNOW With a user-supplied threshold θ, the class is 1 if Σ_i w_i·a_i > θ. Note the similarity to the perceptron training rule, but WINNOW uses multiplicative weight updates with α > 1. It will do better than the perceptron with many irrelevant attributes: an attribute-efficient learner.

40 Mistake Bounds: Find-S Consider FIND-S when H = conjunctions of Boolean literals.
FIND-S:
    Initialize h to the most specific hypothesis l_1 ∧ ¬l_1 ∧ l_2 ∧ ¬l_2 ∧ ... ∧ l_n ∧ ¬l_n
    For each positive training instance x
        Remove from h any literal that is not satisfied by x
    Output hypothesis h.
How many mistakes before converging to the correct h?
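
A small Python sketch of FIND-S as given above; the representation of a hypothesis as a set of (feature index, value) literals and the toy data are ours:

```python
def find_s(examples, n_features):
    # Most specific hypothesis: every literal and its negation.
    h = {(i, v) for i in range(n_features) for v in (0, 1)}
    for x, label in examples:
        if label == 1:                        # only positive examples generalise h
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

def classify(h, x):
    return 1 if all(x[i] == v for (i, v) in h) else 0

examples = [((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 0), 0)]
h = find_s(examples, 3)
print(sorted(h))                 # [(0, 1), (2, 1)]  i.e. x_0 AND x_2
print(classify(h, (1, 1, 1)))    # 1
```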

41 Mistake Bounds: Find-S FIND-S will converge in the limit to an error-free hypothesis if c ∈ H and the data are noise-free. How many mistakes before converging to the correct h?

42 Mistake Bounds: Find-S FIND-S initially classifies all instances as negative, and generalizes on positive examples by dropping unsatisfied literals. Since c ∈ H, it will never classify a negative instance as positive: if it did, there would exist a hypothesis h' with c ≥_g h' and an instance x such that h'(x) = 1 and c(x) ≠ 1, contradicting the definition of the generality ordering ≥_g. So the mistake bound comes from the number of positives classified as negatives. How many mistakes before converging to the correct h?

43 Mistake Bounds: Find-S There are 2n terms in the initial hypothesis. On the first mistake, we remove half of these terms, leaving n. On each further mistake, we remove at least one term. In the worst case we will have to remove all n remaining terms, which would give the most general concept (everything is positive). So the worst-case number of mistakes is n + 1, corresponding to a worst-case sequence of learning steps that removes only one literal per step.

44 Mistake Bounds: Halving Algorithm Consider the Halving Algorithm: learn a concept using the version space CANDIDATE-ELIMINATION algorithm, and classify new instances by majority vote of the version space members. How many mistakes before converging to the correct h? ... in the worst case? ... in the best case?

46 Mistake Bounds: Halving Algorithm The HALVING ALGORITHM learns exactly when the version space contains only one hypothesis, which corresponds to the target concept. At each step it removes all hypotheses whose vote was incorrect; a mistake occurs when the majority-vote classification is incorrect. What is the mistake bound for converging to the correct h?

47 Mistake Bounds: Halving Algorithm How many mistakes in the worst case? On every step there is a mistake, because the majority vote is incorrect; on each mistake, the number of hypotheses is reduced by at least half; so for hypothesis space size |H|, the worst-case mistake bound is log_2|H|. How many mistakes in the best case? On every step there is no mistake, because the majority vote is correct; we still remove all incorrect hypotheses, up to half; so in the best case there are no mistakes in converging to the correct h.
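
A minimal sketch of the Halving Algorithm, keeping the version space as an explicit list of hypotheses; representing hypotheses as Python functions, the tie-breaking rule and the toy target x[3] are our choices:

```python
def halving(hypotheses, stream):
    vs = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in vs)
        prediction = 1 if votes * 2 > len(vs) else 0    # majority vote (ties predict 0)
        if prediction != label:
            mistakes += 1
        vs = [h for h in vs if h(x) == label]           # discard hypotheses that voted incorrectly
    return vs, mistakes

# H = the four single-feature hypotheses h_i(x) = x[i]; the target is x[3].
hypotheses = [(lambda x, i=i: x[i]) for i in range(4)]
stream = [((1, 0, 1, 0), 0), ((0, 1, 0, 1), 1), ((0, 1, 0, 0), 0)]
vs, m = halving(hypotheses, stream)
print(len(vs), m)   # 1 0: one consistent hypothesis left, no mistakes here (bound is log2(4) = 2)
```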

48 Optimal Mistake Bounds Let M_A(C) be the maximum number of mistakes made by algorithm A to learn concepts in C (the maximum over all possible c ∈ C and all possible training sequences): M_A(C) ≡ max_{c ∈ C} M_A(c). Definition: let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of M_A(C): Opt(C) ≡ min_{A ∈ learning algorithms} M_A(C). It holds that VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log_2(|C|).

49 Weighted Majority A generalisation of the HALVING ALGORITHM: it predicts by a weighted vote of a set of prediction algorithms, and learns by altering the weights of those prediction algorithms. A prediction algorithm simply predicts the value of the target concept given an instance; examples are the elements of a hypothesis space H, or different learning algorithms. When a prediction algorithm is inconsistent with a training example, its weight is reduced. This lets us bound the number of mistakes of the ensemble in terms of the number of mistakes made by the best prediction algorithm.

50 Weighted Majority
a_i is the i-th prediction algorithm; w_i is the weight associated with a_i.
For all i, initialize w_i ← 1
For each training example ⟨x, c(x)⟩
    Initialize q_0 and q_1 to 0
    For each prediction algorithm a_i
        If a_i(x) = 0 then q_0 ← q_0 + w_i
        If a_i(x) = 1 then q_1 ← q_1 + w_i
    If q_1 > q_0 then predict c(x) = 1
    If q_0 > q_1 then predict c(x) = 0
    If q_0 = q_1 then predict 0 or 1 at random for c(x)
    For each prediction algorithm a_i
        If a_i(x) ≠ c(x) then w_i ← β·w_i
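
The pseudocode transcribes almost directly into Python; the pool of predictors and the example stream below are illustrative choices of ours:

```python
import random

def weighted_majority(predictors, stream, beta=0.5):
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        q0 = sum(wi for wi, a in zip(w, predictors) if a(x) == 0)
        q1 = sum(wi for wi, a in zip(w, predictors) if a(x) == 1)
        prediction = 1 if q1 > q0 else 0 if q0 > q1 else random.randint(0, 1)
        if prediction != label:
            mistakes += 1
        for i, a in enumerate(predictors):
            if a(x) != label:
                w[i] *= beta              # down-weight predictors that erred
    return w, mistakes

predictors = [lambda x: x[0], lambda x: x[1], lambda x: 1 - x[0]]
stream = [((0, 1), 1), ((1, 0), 0), ((1, 1), 1)]   # target is x[1]
print(weighted_majority(predictors, stream))       # the predictor x[1] keeps weight 1
```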

51 Weighted Majority The WEIGHTED MAJORITY algorithm begins by assigning a weight of 1 to each prediction algorithm. On a misclassification by a prediction algorithm, its weight is reduced by multiplying by a constant β, 0 ≤ β < 1. It is equivalent to the HALVING ALGORITHM when β = 0; otherwise, it just down-weights the contribution of algorithms which make errors. Key result: the number of mistakes made by the WEIGHTED MAJORITY algorithm will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the number of prediction algorithms in the pool.

52 Some questions about Machine Learning Are there reasons to prefer one learning algorithm over another? Can we expect any method to be superior overall? Can we even find an algorithm that is overall superior to random guessing?

53 No Free Lunch Theorem Uniformly averaged over all target functions, the expected off-training-set error for all learning algorithms is the same. Even for a fixed training set, averaged over all target functions, no learning algorithm yields an off-training-set error that is superior to any other. Therefore, all statements of the form "learning algorithm 1 is better than algorithm 2" are ultimately statements about the relevant target functions.

54 No Free Lunch example Assuming that the training set D can be learned correctly by all algorithms, averaged over all target functions no learning algorithm gives an off-training-set error superior to any other: Σ_F [E_1(E|F, D) − E_2(E|F, D)] = 0

55 No Free Lunch example [Table: instances x with the target function F, the hypotheses h_1 and h_2 produced by algorithms 1 and 2, and markers for which instances belong to the training set D. On this particular target function F, E_1(E|F, D) = 0.4 and E_2(E|F, D) = 0.6.]

56 No Free Lunch example BUT if we have no prior knowledge about which F we are trying to learn, neither algorithm is superior to the other: both fit the training data correctly, but there are 2^5 target functions consistent with D, and for each of them there is exactly one other function whose output is inverted on each of the off-training-set patterns, so the performance of algorithms 1 and 2 will be inverted, thus ensuring an average error difference of zero.
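
The inversion argument can be checked numerically on a toy setup of our own (the training set, the off-training-set points and the two hypotheses below are illustrative, not the ones in the slide's table): averaged over every target function consistent with D, the two hypotheses have the same off-training-set error.

```python
from itertools import product

train = {(0, 0): 1, (0, 1): 0}                        # training set D (inputs -> labels)
off_train = [(1, 0), (1, 1), (0, 2)]                  # remaining "off-training-set" inputs

# Both hypotheses agree with D; they differ only off the training set.
h1 = lambda x: train[x] if x in train else (1 if x[0] == 1 else 0)
h2 = lambda x: train[x] if x in train else 0

err1 = err2 = 0.0
targets = list(product([0, 1], repeat=len(off_train)))   # all completions of F consistent with D
for labels in targets:
    F = {**train, **dict(zip(off_train, labels))}
    err1 += sum(h1(x) != F[x] for x in off_train) / len(off_train)
    err2 += sum(h2(x) != F[x] for x in off_train) / len(off_train)
print(err1 / len(targets), err2 / len(targets))        # both 0.5
```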

57 A Conservation Theorem of Generalization Performance For every possible learning algorithm for binary classification, the sum of performance over all possible target functions is exactly zero! If we get positive performance on some problems, there must be other problems for which we get an equal and opposite amount of negative performance.

58 A Conservation Theorem of Generalization Performance

59 A Conservation Theorem of Generalization Performance

60 A Conservation Theorem of Generalization Performance Takeaways: even popular, theoretically well-grounded algorithms will perform poorly on some domains, i.e., those where the learning algorithm is not a good match to the class of target functions. It is the assumptions about the learning domains that are relevant. Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems. (from: Duda, Hart & Stork (2001))

61 Ugly Duckling Theorem In the absence of assumptions there is no privileged or "best" feature representation. For n objects, the set of concepts has size 2^n. The number of concepts under which any pair of objects has the same concept definition, or pattern (both members, or both non-members), is 2^{n-1}. Using a finite number of predicates to distinguish any two patterns denoting sets of objects, the number of predicates shared by any two such patterns is constant and independent of those patterns. Therefore, even the notion of similarity between patterns depends on assumptions, i.e., on inductive bias.

62 Bias, Variance and Model Complexity

63 Bias, Variance and Model Complexity We can see the behaviour of different models' predictive accuracy on a test sample and on the training sample as the model complexity is varied. How should we measure model complexity? The number of parameters? Description length? ...

64 Example 3.8, p.92: Line fitting example Consider the following set of five points: [table of x and y values]. We want to estimate y by means of a polynomial in x. Figure 3.2 (left) shows the result for degrees 1 to 5 using linear regression, which will be explained in Chapter 7. The top two degrees fit the given points exactly (in general, any set of n points can be fitted by a polynomial of degree no more than n − 1), but they differ considerably at the extreme ends: e.g., the polynomial of degree 4 leads to a decreasing trend from x = 0 to x = 1, which is not really justified by the data.

65 Figure 3.2, p.92: Fitting polynomials to data (left) Polynomials of different degree fitted to a set of five points. From bottom to top in the top right-hand corner: degree 1 (straight line), degree 2 (parabola), degree 3, degree 4 (which is the lowest degree able to fit the points exactly), degree 5. (right) A piecewise constant function learned by a grouping model; the dotted reference line is the linear function from the left figure.
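
The behaviour described here is easy to reproduce; since the five points of Example 3.8 did not survive transcription, the data below are illustrative stand-ins of our own:

```python
import numpy as np

# Five illustrative points (not the ones from the book).
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.array([1.0, 1.4, 1.3, 1.9, 1.7])

for degree in range(1, 5):
    coeffs = np.polyfit(x, y, degree)                   # least-squares polynomial fit
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(degree, round(residual, 6))
# The residual shrinks with degree and is (numerically) zero at degree 4:
# five points determine a degree-4 polynomial exactly.
```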

66 Overfitting again An n-degree polynomial has n + 1 parameters: e.g., a straight line y = a·x + b has two parameters, and the polynomial of degree 4 that fits the five points exactly has five parameters. A piecewise constant model with n segments has 2n − 1 parameters: n y-values and n − 1 x-values where the jumps occur. So the models that are able to fit the points exactly are the models with more parameters. A rule of thumb is that, to avoid overfitting, the number of parameters estimated from the data must be considerably less than the number of data points.

67 Bias and variance I If we underestimate the number of parameters of the model, we will not be able to decrease the loss to zero, regardless of how much training data we have. On the other hand, with a larger number of parameters the model will be more dependent on the training sample, and small variations in the training sample can result in a considerably different model. This is sometimes called the bias-variance dilemma: a low-complexity model suffers less from variability due to random variations in the training data, but may introduce a systematic bias that even large amounts of training data can't resolve; on the other hand, a high-complexity model eliminates such bias but can suffer non-systematic errors due to variance.

68 Bias and variance II We can make this a bit more precise by noting that the expected squared loss on a training example x can be decomposed as follows: E[(f(x) − f̂(x))²] = (f(x) − E[f̂(x)])² + E[(f̂(x) − E[f̂(x)])²]. The expectation is taken over different training sets and hence different function estimators. The first term on the right is zero if these function estimators get it right on average; otherwise the learning algorithm exhibits a systematic bias of some kind. The second term quantifies the variance in the function estimates f̂(x) as a result of variations in the training set.
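
The decomposition can be verified empirically by fitting many models to resampled training sets; the true function, noise level and model degrees in the sketch below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # the "true" function
x_train = np.linspace(0, 1, 10)
x0 = 0.3                                     # the point at which we decompose the loss

def fit_and_predict(degree):
    """Predictions f_hat(x0) from many models, each trained on a fresh noisy sample."""
    preds = []
    for _ in range(2000):
        y = f(x_train) + rng.normal(0, 0.3, size=x_train.shape)
        coeffs = np.polyfit(x_train, y, degree)
        preds.append(np.polyval(coeffs, x0))
    return np.array(preds)

for degree in (1, 5):
    p = fit_and_predict(degree)
    mse = np.mean((f(x0) - p) ** 2)          # left-hand side of the decomposition
    bias2 = (f(x0) - p.mean()) ** 2          # squared bias term
    var = p.var()                            # variance term
    print(degree, round(mse, 4), round(bias2 + var, 4))   # the two columns agree
```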

69 Figure 3.3, p.94: Bias and variance [Figure: four dartboards marked with darts.] A dartboard metaphor illustrating the concepts of bias and variance. Each dartboard corresponds to a different learning algorithm, and each dart signifies a different training sample. The top-row learning algorithms exhibit low bias, staying close to the bull's eye (the true function value for a particular x) on average, while the ones on the bottom row have high bias. The left column shows low variance and the right column high variance.

70 Bias-variance decomposition A theoretical tool for analyzing how much the specific training set affects the performance of a classifier. Assume we have an infinite number of classifiers built from different training sets of size n. The bias of a learning scheme is the expected error of the combined classifier on new data; the variance of a learning scheme is the expected error due to the particular training set used. Total expected error: bias + variance.

71 Bias-variance: a trade-off This is easier to see with regression, in the following figure (from The Elements of Statistical Learning by Hastie, Tibshirani and Friedman (2001); to see the details you will have to zoom in in your viewer): each column represents a different model class g(x), shown in red; each row represents a different set of n = 6 training points, D_i, randomly sampled from the target function F(x) with noise, shown in black; probability functions of the mean squared error E are shown.

72 Bias-variance: a trade-off

73 Bias-variance: a trade-off
a) is very poor: a linear model with fixed parameters, independent of the training data; high bias, zero variance.
b) is better: another linear model with fixed parameters, independent of the training data; slightly lower bias, zero variance.
c) is a cubic model with parameters trained by mean squared error on the training data; low bias, moderate variance.
d) is a linear model with parameters adjusted to fit each training set; intermediate bias and variance.
Training with more data would give c) a bias approaching a small value due to noise, but not d); the variance of all models would approach zero.

74 4. Summary: Algorithm Independent Machine Learning Algorithm-independent analyses use techniques from computational complexity, pattern recognition, statistics, etc. Some results are over-conservative from a practical viewpoint, e.g., worst-case rather than average-case analyses. Nonetheless, this theory has led to many practically useful algorithms, e.g.:
- PAC → Boosting
- VC dimension → SVM
- Mistake Bounds → WINNOW
- Bias-variance decomposition → Ensemble learning methods


More information

Randomized algorithms have several advantages over deterministic ones. We discuss them here:

Randomized algorithms have several advantages over deterministic ones. We discuss them here: CS787: Advanced Algorithms Lecture 6: Randomized Algorithms In this lecture we introduce randomized algorithms. We will begin by motivating the use of randomized algorithms through a few examples. Then

More information

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018

Perceptron Introduction to Machine Learning. Matt Gormley Lecture 5 Jan. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Perceptron Matt Gormley Lecture 5 Jan. 31, 2018 1 Q&A Q: We pick the best hyperparameters

More information

Ensemble Methods, Decision Trees

Ensemble Methods, Decision Trees CS 1675: Intro to Machine Learning Ensemble Methods, Decision Trees Prof. Adriana Kovashka University of Pittsburgh November 13, 2018 Plan for This Lecture Ensemble methods: introduction Boosting Algorithm

More information

CS229 Lecture notes. Raphael John Lamarre Townshend

CS229 Lecture notes. Raphael John Lamarre Townshend CS229 Lecture notes Raphael John Lamarre Townshend Decision Trees We now turn our attention to decision trees, a simple yet flexible class of algorithms. We will first consider the non-linear, region-based

More information

Association Rule Learning

Association Rule Learning Association Rule Learning 16s1: COMP9417 Machine Learning and Data Mining School of Computer Science and Engineering, University of New South Wales March 15, 2016 COMP9417 ML & DM (CSE, UNSW) Association

More information

An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains

An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains An Average-Case Analysis of the k-nearest Neighbor Classifier for Noisy Domains Seishi Okamoto Fujitsu Laboratories Ltd. 2-2-1 Momochihama, Sawara-ku Fukuoka 814, Japan seishi@flab.fujitsu.co.jp Yugami

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule.

Feature Extractors. CS 188: Artificial Intelligence Fall Some (Vague) Biology. The Binary Perceptron. Binary Decision Rule. CS 188: Artificial Intelligence Fall 2008 Lecture 24: Perceptrons II 11/24/2008 Dan Klein UC Berkeley Feature Extractors A feature extractor maps inputs to feature vectors Dear Sir. First, I must solicit

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

Neural Networks (Overview) Prof. Richard Zanibbi

Neural Networks (Overview) Prof. Richard Zanibbi Neural Networks (Overview) Prof. Richard Zanibbi Inspired by Biology Introduction But as used in pattern recognition research, have little relation with real neural systems (studied in neurology and neuroscience)

More information

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty)

Supervised Learning (contd) Linear Separation. Mausam (based on slides by UW-AI faculty) Supervised Learning (contd) Linear Separation Mausam (based on slides by UW-AI faculty) Images as Vectors Binary handwritten characters Treat an image as a highdimensional vector (e.g., by reading pixel

More information

CS6716 Pattern Recognition

CS6716 Pattern Recognition CS6716 Pattern Recognition Prototype Methods Aaron Bobick School of Interactive Computing Administrivia Problem 2b was extended to March 25. Done? PS3 will be out this real soon (tonight) due April 10.

More information

1) Give decision trees to represent the following Boolean functions:

1) Give decision trees to represent the following Boolean functions: 1) Give decision trees to represent the following Boolean functions: 1) A B 2) A [B C] 3) A XOR B 4) [A B] [C Dl Answer: 1) A B 2) A [B C] 1 3) A XOR B = (A B) ( A B) 4) [A B] [C D] 2 2) Consider the following

More information

(Refer Slide Time 6:48)

(Refer Slide Time 6:48) Digital Circuits and Systems Prof. S. Srinivasan Department of Electrical Engineering Indian Institute of Technology Madras Lecture - 8 Karnaugh Map Minimization using Maxterms We have been taking about

More information

CSE 573: Artificial Intelligence Autumn 2010

CSE 573: Artificial Intelligence Autumn 2010 CSE 573: Artificial Intelligence Autumn 2010 Lecture 16: Machine Learning Topics 12/7/2010 Luke Zettlemoyer Most slides over the course adapted from Dan Klein. 1 Announcements Syllabus revised Machine

More information

Bagging for One-Class Learning

Bagging for One-Class Learning Bagging for One-Class Learning David Kamm December 13, 2008 1 Introduction Consider the following outlier detection problem: suppose you are given an unlabeled data set and make the assumptions that one

More information

Machine Learning (CS 567)

Machine Learning (CS 567) Machine Learning (CS 567) Time: T-Th 5:00pm - 6:20pm Location: GFS 118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol Han (cheolhan@usc.edu)

More information

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah

Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Reference Most of the slides are taken from the third chapter of the online book by Michael Nielson: neuralnetworksanddeeplearning.com

More information

5 Learning hypothesis classes (16 points)

5 Learning hypothesis classes (16 points) 5 Learning hypothesis classes (16 points) Consider a classification problem with two real valued inputs. For each of the following algorithms, specify all of the separators below that it could have generated

More information

Deepest Neural Networks

Deepest Neural Networks Deepest Neural Networks arxiv:707.0267v [cs.ne] 9 Jul 207 Raúl Rojas Dahlem Center for Machine Learning and Robotics Freie Universität Berlin July 207 Abstract This paper shows that a long chain of perceptrons

More information