April 3, 2012 T.C. Havens


1 April 3, 2012 T.C. Havens

2 Different training parameters: an MLP can be trained with different initial weights, numbers of layers/nodes, etc., which takes advantage of the instability of such classifiers (convergence to different local minima) to produce diverse ensemble members. Similar strategies can be used to generate different decision trees. Different types of classifiers (MLPs, decision trees, NNs, SVMs) can also be combined to improve diversity. Finally, each classifier can be trained on a random feature set, an approach called the random subspace method.
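
As a brief illustration of the random subspace idea, the sketch below trains each base classifier on a randomly chosen subset of the features; the use of decision trees, the feature fraction, and the function names are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_ensemble(X, y, n_classifiers=10, feature_fraction=0.5, seed=0):
    """Train each base classifier on a random subset of the features (random subspace method)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(feature_fraction * n_features))
    ensemble = []
    for _ in range(n_classifiers):
        features = rng.choice(n_features, size=k, replace=False)  # random feature subset
        clf = DecisionTreeClassifier().fit(X[:, features], y)
        ensemble.append((features, clf))
    return ensemble  # each member remembers which features it was trained on
```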

3 Two questions: How will the individual classifiers be generated? How will they differ from each other? The answers determine the diversity of the classifiers and hence the fusion performance. Ensemble methods therefore seek to improve ensemble diversity through various heuristic methods.

4 Bagging, short for bootstrap aggregating, is one of the earliest ensemble-based algorithms. It is also one of the most intuitive and simplest to implement, with surprisingly good performance. Bagging uses bootstrapped replicas of the training data: a large number of training subsets (say 200) are randomly drawn, with replacement, from the entire training data. Each resampled training set is used to train a different classifier of the same type, and the individual classifiers are combined by taking a majority vote of their decisions. Bagging is especially appealing for small training sets, since a relatively large portion of the samples is included in each subset.

5 Algorithm: Bagging
Input:
- Training data S with correct labels ω_i ∈ Ω = {ω_1, ..., ω_C} representing C classes
- Weak learning algorithm WeakLearn
- Integer T specifying the number of iterations
- Percent (or fraction) F to create bootstrapped training data
Do t = 1, ..., T
1. Take a bootstrapped replica S_t by randomly drawing F percent of S.
2. Call WeakLearn with S_t and receive the hypothesis (classifier) h_t.
3. Add h_t to the ensemble, E.
End
Test - Simple Majority Voting: given an unlabeled instance x,
1. Evaluate the ensemble E = {h_1, ..., h_T} on x.
2. Let v_{t,j} = 1 if h_t picks class ω_j, and 0 otherwise, be the vote given to class ω_j by classifier h_t.
3. Obtain the total vote received by each class: V_j = Σ_{t=1}^T v_{t,j}, j = 1, ..., C.
4. Choose the class that receives the highest total vote as the final classification.
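
A minimal Python sketch of this bagging procedure follows; the choice of decision trees as the weak learner, the default values of T and F, and the helper names are illustrative assumptions rather than part of the original algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=200, F=1.0, seed=0):
    """Train T classifiers, each on a bootstrap replica of F*N samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    N = len(y)
    n = max(1, int(F * N))
    ensemble = []
    for _ in range(T):
        idx = rng.choice(N, size=n, replace=True)        # bootstrapped replica S_t
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Simple majority voting; assumes integer class labels 0..C-1."""
    votes = np.array([h.predict(X) for h in ensemble])   # shape (T, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```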

6

7

8

9 Random Forests: constructed from decision trees. A random forest is created from individual decision trees whose training parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.

10 Given: N training samples, p variables.
Algorithm:
1. For b = 1 to B:
   a. Draw a bootstrap sample of size N from the training data.
   b. Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node, until the minimum node size n_min is reached:
      i. Select m variables at random from the p variables.
      ii. Pick the best variable and split-point among the m.
      iii. Split the node into two child nodes.
2. Output the ensemble of B trees {T_b}.

11 Given: N training samples, p variables.
Algorithm:
1. For b = 1 to B:
   a. Draw a bootstrap sample of size N from the training data.
   b. Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node, until the minimum node size n_min is reached:
      i. Select m variables at random from the p variables. (This is the only difference from bagging with decision trees; m is typically sqrt(p), and can be as low as 1.)
      ii. Pick the best variable and split-point among the m.
      iii. Split the node into two child nodes.
2. Output the ensemble of B trees {T_b}.
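
In practice the per-split random variable selection can be obtained from scikit-learn's decision trees via the max_features parameter; the sketch below is an assumed illustration, not the FORTRAN reference implementation mentioned later.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_train(X, y, B=100, min_node_size=1, seed=0):
    """Grow B trees on bootstrap samples; max_features='sqrt' selects m = sqrt(p) variables per split."""
    rng = np.random.default_rng(seed)
    N = len(y)
    forest = []
    for b in range(B):
        idx = rng.choice(N, size=N, replace=True)             # bootstrap sample of size N
        tree = DecisionTreeClassifier(max_features="sqrt",    # random subset of variables at each split
                                      min_samples_leaf=min_node_size,
                                      random_state=b)
        forest.append(tree.fit(X[idx], y[idx]))
    return forest  # predictions are combined by majority vote, as in bagging
```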

12 Random forests routinely outperform bagged ensembles, and are often competitive with boosting.

13 Random forests provide even more variance reduction than bagged decision trees, but they still do not impact bias. The benefit appears to come from de-correlation of the individual trees; trees grown on bootstrap samples alone still have significant correlation. Random forests are also simpler to train and tune than boosting algorithms.

14 Random Forests was first implemented in FORTRAN by Leo Breiman and Adele Cutler, and the term is trademarked by them. Commercial distribution is licensed exclusively to Salford Systems. There are lots of open-source implementations in various languages and machine-learning packages; in MATLAB it is available as the TreeBagger class (Statistics Toolbox).

15 The goal is to boost the performance of a weak learner to the level of a strong one. Boosting creates an ensemble of classifiers by resampling the data, and the classifiers are combined by majority voting; the resampling is strategically geared to provide the most informative training data for each consecutive classifier. Boosting creates three weak classifiers: the first classifier C1 is trained with a random subset of the available training data; the training set for the second classifier C2 is chosen as the most informative subset given C1, such that half of the training data for C2 is correctly classified by C1 and the other half is misclassified by C1; the third classifier C3 is trained on the instances on which C1 and C2 disagree.

16 Algorithm: Boosting
Input:
- Training data S of size N with correct labels ω_i ∈ Ω = {ω_1, ω_2}
- Weak learning algorithm WeakLearn
Training
1. Select N_1 < N patterns without replacement from S to create data subset S_1.
2. Call WeakLearn and train with S_1 to create classifier C_1.
3. Create dataset S_2 as the most informative dataset, given C_1, such that half of S_2 is correctly classified by C_1 and the other half is misclassified. To do so:
   a. Flip a fair coin. If heads, select samples from S and present them to C_1 until the first one is misclassified. Add this instance to S_2.
   b. If tails, select samples from S and present them to C_1 until the first one is correctly classified. Add this instance to S_2.
   c. Continue flipping coins until no more patterns can be added to S_2.
4. Train the second classifier C_2 with S_2.
5. Create S_3 by selecting those instances on which C_1 and C_2 disagree. Train the third classifier C_3 with S_3.
Test - Given a test instance x,
1. Classify x by C_1 and C_2. If they agree on the class, this class is the final classification.
2. If they disagree, choose the class predicted by C_3 as the final classification.

17 Initially, all samples have equal weights. Samples that are wrongly classified have their weights increased, while samples that are classified correctly have their weights decreased. Samples with higher weights have more influence in subsequent training iterations, so boosting adaptively changes the training data distribution. (Figure: a toy dataset shown for the original data and for boosting rounds 1-3; sample 4 is hard to classify, so its weight is increased.)

18

19

20

21

22

23 AdaBoost (1997) is a more general version of the boosting algorithm; AdaBoost.M1 can handle multiclass problems. AdaBoost generates a set of hypotheses (classifiers) and combines them through weighted majority voting of the classes predicted by the individual hypotheses. The hypotheses are generated by training a weak classifier on samples drawn from an iteratively updated distribution over the training set. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier, so consecutive classifiers are trained on increasingly hard-to-classify samples.

24 AdaBoost maintains a weight distribution D_t(i) on the training instances x_i, i = 1, ..., N, from which the training data subsets S_t are chosen for each consecutive classifier (hypothesis) h_t. A normalized error β_t = ε_t / (1 - ε_t) is then obtained, such that for 0 < ε_t < 1/2 we have 0 < β_t < 1. Distribution update rule: the distribution weights of the instances correctly classified by the current hypothesis are reduced by a factor of β_t, whereas the weights of the misclassified instances are unchanged; after normalization this raises the relative weights of the instances misclassified by h_t and lowers the weights of the correctly classified instances, so AdaBoost focuses on increasingly difficult instances. Once all T hypotheses are generated, AdaBoost is ready to classify unlabeled test instances. Unlike bagging or boosting, AdaBoost uses weighted majority voting: 1/β_t is a measure of performance of the t-th hypothesis and can be used to weight the classifiers.

25 Algorithm: AdaBoost.M1
Input:
- Sequence of N examples S = [(x_i, y_i)], i = 1, ..., N, with labels y_i ∈ Ω, Ω = {ω_1, ..., ω_C}
- Weak learning algorithm WeakLearn
- Integer T specifying the number of iterations
Initialize D_1(i) = 1/N, i = 1, ..., N.
Do for t = 1, 2, ..., T:
1. Select a training data subset S_t, drawn from the distribution D_t.
2. Train WeakLearn with S_t, receive hypothesis h_t.
3. Calculate the error of h_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i). If ε_t > 1/2, abort.
4. Set β_t = ε_t / (1 - ε_t).
5. Update the distribution: D_{t+1}(i) = (D_t(i) / Z_t) × β_t if h_t(x_i) = y_i, and D_{t+1}(i) = D_t(i) / Z_t otherwise, where Z_t is a normalization constant chosen so that D_{t+1} becomes a proper distribution function.
End
Test - Weighted Majority Voting: given an unlabeled instance x,
1. Obtain the total vote received by each class: V_j = Σ_{t: h_t(x) = ω_j} log(1/β_t), j = 1, ..., C.
2. Choose the class that receives the highest total vote as the final classification.
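
A compact Python sketch of this AdaBoost.M1 loop is shown below; fitting the weak learner with per-instance sample weights (instead of explicitly drawing S_t from D_t) and using decision stumps are common simplifications and are assumptions here, not part of the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_train(X, y, T=50):
    """Returns the hypotheses h_t and their voting weights log(1/beta_t)."""
    N = len(y)
    D = np.full(N, 1.0 / N)                      # D_1(i) = 1/N
    hypotheses, log_inv_betas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        miss = h.predict(X) != y
        eps = D[miss].sum()                      # weighted error of h_t
        if eps > 0.5:                            # abort condition
            break
        beta = max(eps, 1e-12) / (1.0 - eps)     # small floor avoids log(1/0)
        D[~miss] *= beta                         # shrink weights of correctly classified instances
        D /= D.sum()                             # normalize (the Z_t step)
        hypotheses.append(h)
        log_inv_betas.append(np.log(1.0 / beta))
    return hypotheses, log_inv_betas

def adaboost_m1_predict(hypotheses, log_inv_betas, X, classes):
    """Weighted majority voting with weights log(1/beta_t)."""
    votes = np.zeros((len(X), len(classes)))
    for h, w in zip(hypotheses, log_inv_betas):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            votes[pred == c, j] += w
    return np.asarray(classes)[votes.argmax(axis=1)]
```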

26 The AdaBoost algorithm is sequential: classifier C_{K-1} is created before classifier C_K.

27

28

29 Bagging: resamples data points; the weight of each classifier is the same; only reduces variance; robust to noise and outliers; easily parallelized. Boosting: reweights data points (modifies the data distribution); the weight of a classifier depends on its accuracy; reduces both bias and variance; noise and outliers can hurt performance.

30 In stacked generalization, an ensemble of classifiers is first created, whose outputs are used as inputs to a second-level meta-classifier that learns the mapping between the ensemble outputs and the actual correct classes. Classifiers C_1, ..., C_T are trained using training parameters θ_1 through θ_T to output hypotheses h_1 through h_T. The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second-level classifier, C_{T+1}.
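
One concrete way to set this up is scikit-learn's StackingClassifier; the particular base learners and the logistic-regression meta-classifier below are illustrative assumptions, not choices made on the slide.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# First-level classifiers C_1, ..., C_T
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100)),
    ("svm", SVC(probability=True)),
]

# Second-level classifier C_{T+1} learns the mapping from first-level outputs to the true classes
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5)   # cross-validated first-level outputs serve as meta-features

# Usage: stack.fit(X_train, y_train); y_pred = stack.predict(X_test)
```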

31 A conceptually similar technique is the mixture-of-experts model, where a set of classifiers C_1, ..., C_T constitutes the ensemble, followed by a second-level classifier C_{T+1} used for assigning weights to the consecutive combiner. The combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all. The weight distribution used by the combiner is determined by a second-level classifier, usually a neural network, called the gating network. The inputs to the gating network are the actual training data instances themselves (unlike the outputs of the first-level classifiers, as in stacked generalization). Mixture-of-experts can therefore be seen as a classifier selection algorithm: the individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x.

32 The pooling system may use the weights in several different ways: it may choose the single classifier with the highest weight, or it may calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.

33 How to combine classifiers? Combination rules can be grouped as (i) trainable vs. non-trainable. Trainable rules: the parameters of the combiner, called weights, are determined through a separate training algorithm; the weights from trainable rules are usually instance-specific, and hence such rules are also called dynamic combination rules. Non-trainable rules: the combination parameters become available as the classifiers are generated; weighted majority voting is an example. (ii) Combination rules for class labels vs. class-specific continuous outputs. Combination rules that apply to class labels only need the classification decision (that is, one of ω_j, j = 1, ..., C); other rules need the continuous-valued outputs of the individual classifiers.

34 Assume that only class labels are available from the classifier outputs. Define the decision of the t-th classifier as d_{t,j} ∈ {0, 1}, t = 1, ..., T and j = 1, ..., C, where T is the number of classifiers and C is the number of classes. If the t-th classifier chooses class ω_j, then d_{t,j} = 1, and 0 otherwise.
a) Majority voting: choose class ω_J such that Σ_{t=1}^T d_{t,J} = max_{j=1,...,C} Σ_{t=1}^T d_{t,j}.
b) Weighted majority voting: choose class ω_J such that Σ_{t=1}^T w_t d_{t,J} = max_{j=1,...,C} Σ_{t=1}^T w_t d_{t,j}, where w_t is the weight of the t-th classifier.
c) Behavior Knowledge Space (BKS): a look-up table.
d) Borda count: each voter (classifier) rank-orders the candidates (classes). If there are N candidates, the first-place candidate receives N-1 votes, the second-place candidate receives N-2, and the candidate in i-th place receives N-i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision.
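
The label-based rules (a) and (b) can be implemented in a few lines; the sketch below is illustrative, assuming integer class labels and externally supplied classifier weights.

```python
import numpy as np

def weighted_majority_vote(predictions, weights, n_classes):
    """predictions: (T, n_samples) array of integer class labels in {0, ..., C-1};
       weights: length-T classifier weights (use all ones for simple majority voting)."""
    T, n_samples = predictions.shape
    votes = np.zeros((n_samples, n_classes))
    for t in range(T):
        votes[np.arange(n_samples), predictions[t]] += weights[t]   # accumulate w_t * d_{t,j}
    return votes.argmax(axis=1)

# Simple majority voting is the special case with equal weights:
# labels = weighted_majority_vote(preds, np.ones(T), C)
```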

35 Algebraic combiners, applied to the continuous-valued supports d_{t,j}(x) of the individual classifiers:
a) Mean rule: μ_j(x) = (1/T) Σ_{t=1}^T d_{t,j}(x)
b) Weighted average: μ_j(x) = (1/T) Σ_{t=1}^T w_t d_{t,j}(x)
c) Minimum/Maximum/Median rule: μ_j(x) = min_{t=1,...,T} d_{t,j}(x), μ_j(x) = max_{t=1,...,T} d_{t,j}(x), μ_j(x) = median_{t=1,...,T} d_{t,j}(x)
d) Product rule: μ_j(x) = (1/T) Π_{t=1}^T d_{t,j}(x)
e) Generalized mean: many of the above rules are in fact special cases of the generalized mean μ_j(x, α) = ((1/T) Σ_{t=1}^T d_{t,j}(x)^α)^(1/α); α → -∞ gives the minimum rule, α → ∞ gives the maximum rule, α → 0 gives the geometric mean, and α = 1 gives the mean rule.
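
A quick numerical sketch of the generalized mean combiner; the support values below are made-up examples.

```python
import numpy as np

def generalized_mean(supports, alpha):
    """supports: values d_{t,j}(x) from T classifiers for one class; alpha: the exponent."""
    d = np.asarray(supports, dtype=float)
    if alpha == 0:                              # the limit alpha -> 0 is the geometric mean
        return float(np.exp(np.mean(np.log(d))))
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

supports = [0.6, 0.8, 0.7]                      # hypothetical supports from T = 3 classifiers
print(generalized_mean(supports, 1))            # mean rule
print(generalized_mean(supports, -50))          # approaches the minimum rule (0.6)
print(generalized_mean(supports, 50))           # approaches the maximum rule (0.8)
```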

36

37 Ensemble Cloud Army (ECA): A Platform for Parallel Processing of Machine Learning Problems in the Amazon Cloud. J. Jeffry Howbert, Insilicos LLC, May 11, 2011

38 Datasets (Name / Source / Domain / Instances / Features / Feature types / Classes):
- satimage / UCI / soil types from satellite images / 4435 train, 2000 test / 36 / numeric (0-255) / 6
- covertype / UCI / forest cover types from cartographic variables / … / numeric, 44 binary qualitative / 7
- jones / Ref. 3 / protein secondary structure / … train, … test / 315 / numeric / 3

39 For ensembles, the training subsets must deliver diversity, accuracy, and fast computation. For the large datasets used with ECA, bootstrap samples are too large for practical computation; instead, much smaller subsets of records are generated by random sampling without replacement. The key principle for effective sampling: using a sample will work almost as well as using the entire data set, provided the sample is representative. A sample is representative if it has approximately the same distribution of properties (of interest) as the original set of data.
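
Drawing such a subset is a one-liner; the sizes below are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000                                        # total number of records (example value)
idx = rng.choice(N, size=5_000, replace=False)     # small subset, sampled without replacement
# X_subset, y_subset = X[idx], y[idx]              # rows used to train one base classifier
```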

40 Ensembles have better accuracy than their individual component classifiers. (Chart: classification accuracy (%) vs. number of instances per base classifier, comparing each ensemble with the average of its individual classifiers for covertype decision trees, Jones neural nets, and Jones decision trees.)

41 Accuracy remains high despite a large reduction in features. (Chart: classification accuracy (%) vs. number of instances per base classifier on the Jones dataset, for neural nets and decision trees trained with 315, 157, and 78 features.)

42 The potential speedup from parallelization is strictly limited by the portion of the computation that cannot be parallelized (Amdahl's law). Assume a proportion P of the computation can be parallelized, and a proportion (1 - P) is necessarily sequential. The speedup from parallelizing on N processors is then 1 / ((1 - P) + P/N). For example, if P = 0.9, the maximum possible speedup is 10, no matter how large N is.
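
A quick check of this formula (the values of P and N are just examples):

```python
def amdahl_speedup(P, N):
    """Speedup on N processors when a fraction P of the computation is parallelizable."""
    return 1.0 / ((1.0 - P) + P / N)

print(amdahl_speedup(0.9, 10))      # ~5.3
print(amdahl_speedup(0.9, 1000))    # ~9.9, approaching the limit of 1 / (1 - 0.9) = 10
```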

43 Computational performance: ensembles of decision trees. (Chart: increase in speed over a single node vs. number of nodes in the cluster, for the Jones and covertype datasets at several instance counts per base classifier, compared against ideal performance.)

44 Computational performance: ensembles of neural networks. (Chart: increase in speed over a single node vs. number of nodes in the cluster, for the Jones dataset at various instance counts per base classifier, compared against ideal performance.)

45 Large-data handling was not as critical as expected: the best ensemble accuracy is associated with smaller partitions (< 5,000 instances), and ensembles with small partitions run much faster than those with larger partitions.

46 Ensembles with small partitions run much faster than a single classifier trained on all of the data, and are more accurate. (Table: Jones dataset, ensemble of decision trees; columns are number of trees, instances per tree, processing mode, number of nodes, node type, runtime, and accuracy (%), comparing serial runs on one 64-bit node with a parallel run across many nodes.)

47 RMPI version released on SourceForge ica.sf.net

48 Given two models with similar generalization errors, one should prefer the simpler model over the more complex one. For a complex model, there is a greater chance that it was fitted accidentally by errors in the data. Model complexity should therefore be considered when evaluating a model.

49

50 Ensemble systems are useful in practice. Diversity of the base classifiers is important. Ensemble generation techniques include bagging, AdaBoost, and mixture of experts; classifier combination strategies include algebraic combiners, voting methods, and decision templates. No single ensemble generation algorithm or combination rule is universally better than the others; effectiveness on real-world data depends on the classifier diversity and the characteristics of the data.
