April 3, 2012 T.C. Havens

Different training parameters: MLPs with different weights, number of layers/nodes, etc. This controls the instability of classifiers (local minima). Similar strategies can be used to generate different decision trees. Different types of classifiers (MLPs, decision trees, NNs, SVMs) can be combined to improve diversity. Random feature sets can also be used; this is called the random subspace method.

Two questions: How will the individual classifiers be generated? How will they differ from each other? The answers determine the diversity of the classifiers and the fusion performance. We seek to improve ensemble diversity by heuristic methods.

Bagging, short for bootstrap aggregating, is one of the earliest ensemble-based algorithms. It is also one of the most intuitive and simplest to implement, with surprisingly good performance. It uses bootstrapped replicas of the training data: a large number (say 200) of training subsets are randomly drawn, with replacement, from the entire training data. Each resampled training set is used to train a different classifier of the same type. The individual classifiers are combined by taking a majority vote of their decisions. Bagging is appealing for small training sets, since a relatively large portion of the samples is included in each subset.

Algorithm: Bagging
Input: Training data $S$ with correct labels $\omega_i \in \Omega = \{\omega_1, \dots, \omega_C\}$ representing $C$ classes; weak learning algorithm WeakLearn; integer $T$ specifying the number of iterations; percent (or fraction) $F$ to create bootstrapped training data.
Do $t = 1, \dots, T$:
1. Take a bootstrapped replica $S_t$ by randomly drawing $F$ percent of $S$.
2. Call WeakLearn with $S_t$ and receive the hypothesis (classifier) $h_t$.
3. Add $h_t$ to the ensemble, $E$.
End
Test (simple majority voting): given an unlabeled instance $x$,
1. Evaluate the ensemble $E$ on $x$.
2. Let $v_{t,j} = 1$ if $h_t$ picks class $\omega_j$, and $0$ otherwise, be the vote given to class $\omega_j$ by classifier $h_t$.
3. Obtain the total vote received by each class, $V_j = \sum_{t=1}^{T} v_{t,j}$, $j = 1, \dots, C$.
4. Choose the class that receives the highest total vote as the final classification.
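The bagging procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the weak learner `train_stump` (a one-dimensional threshold rule), the toy data, and the name `ensemble_vote` are assumptions made for the example, and $F$ is passed as a fraction rather than a percent.

```python
import random
from collections import Counter

def train_stump(sample):
    """Hypothetical WeakLearn: best single-threshold rule on 1-D (x, label) pairs."""
    best = None
    for thr, _ in sample:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= thr else right) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def bagging(data, T, F, weak_learn, rng):
    """Train T classifiers, each on a bootstrap replica holding fraction F of the data."""
    n = max(1, int(F * len(data)))
    ensemble = []
    for _ in range(T):
        replica = [rng.choice(data) for _ in range(n)]  # drawn with replacement
        ensemble.append(weak_learn(replica))
    return ensemble

def ensemble_vote(ensemble, x):
    """Simple majority voting: the class with the highest total vote wins."""
    votes = Counter(h(x) for h in ensemble)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(x / 10, int(x >= 5)) for x in range(10)]  # toy data: class 1 when x >= 0.5
ensemble = bagging(data, T=25, F=1.0, weak_learn=train_stump, rng=rng)
print(ensemble_vote(ensemble, 0.05), ensemble_vote(ensemble, 0.95))
```

Because each replica is drawn independently, the loop over `T` could be run in parallel without changing the result.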

Random Forests: constructed from decision trees. A random forest is created from individual decision trees whose training parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.

Given: $N$ training samples, $p$ variables.
Algorithm:
1. For $b = 1$ to $B$:
   a. Draw a bootstrap sample of size $N$ from the training data.
   b. Grow a random-forest tree $T_b$ on the bootstrapped data, by recursively repeating the following steps for each terminal node, until the minimum node size $n_{min}$ is reached:
      i. Select $m$ variables at random from the $p$ variables.
      ii. Pick the best variable and split-point among the $m$.
      iii. Split the node into two child nodes.
2. Output the ensemble of $B$ trees $\{T_b\}$.
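A compact sketch of the algorithm above, assuming Gini impurity as the split criterion and majority voting over the trees (the listing itself does not fix either choice). All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, m, n_min, rng):
    """rows: list of (features, label). Returns a nested dict, or a class label at a leaf."""
    labels = [y for _, y in rows]
    if len(rows) <= n_min or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]
    best = None
    for j in rng.sample(range(len(rows[0][0])), m):      # i. m of the p variables
        for thr in sorted({x[j] for x, _ in rows})[:-1]:
            left = [r for r in rows if r[0][j] <= thr]
            right = [r for r in rows if r[0][j] > thr]
            score = (len(left) * gini([y for _, y in left])
                     + len(right) * gini([y for _, y in right]))
            if best is None or score < best[0]:          # ii. best variable and split-point
                best = (score, j, thr, left, right)
    if best is None:                                     # no usable split among the m variables
        return Counter(labels).most_common(1)[0][0]
    _, j, thr, left, right = best
    return {"var": j, "thr": thr,                        # iii. two child nodes
            "left": grow_tree(left, m, n_min, rng),
            "right": grow_tree(right, m, n_min, rng)}

def predict_tree(tree, x):
    while isinstance(tree, dict):
        tree = tree["left"] if x[tree["var"]] <= tree["thr"] else tree["right"]
    return tree

def random_forest(rows, B, m, n_min, rng):
    forest = []
    for _ in range(B):
        boot = [rng.choice(rows) for _ in rows]          # bootstrap sample of size N
        forest.append(grow_tree(boot, m, n_min, rng))
    return forest

def predict_forest(forest, x):
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]

rng = random.Random(1)
# Toy data (assumption): the two classes are well separated on every one of p = 4 variables.
rows = [([rng.uniform(0.0, 0.4) for _ in range(4)], 0) for _ in range(20)]
rows += [([rng.uniform(0.6, 1.0) for _ in range(4)], 1) for _ in range(20)]
forest = random_forest(rows, B=15, m=2, n_min=1, rng=rng)
print(predict_forest(forest, [0.0] * 4), predict_forest(forest, [1.0] * 4))
```

The only difference from bagged trees is the `rng.sample(range(p), m)` call at each node; setting `m` equal to `p` recovers plain bagging of trees.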

Random forests routinely outperform bagged ensembles, and are often competitive with boosting.

Random forests reduce variance even more than bagged decision trees, but still do not affect bias. The benefit appears to come from de-correlation of the individual trees; trees grown on bootstrap samples alone still have significant correlation. Random forests are simpler to train and tune than boosting algorithms.

First implemented in FORTRAN by Leo Breiman and Adele Cutler, and the term trademarked by them. http://stat-www.berkeley.edu/users/breiman/randomforests/cc_home.htm Commercial distribution licensed exclusively to Salford Systems. Lots of open-source implementations in various languages and machine learning packages. Available in MATLAB as class TreeBagger (Statistics Toolbox).

Boosting raises the performance of a weak learner to the level of a strong one. It creates an ensemble of classifiers by resampling the data, and the classifiers are combined by majority voting. The resampling is strategically geared to provide the most informative training data for each consecutive classifier. Boosting creates three weak classifiers. The first classifier C1 is trained with a random subset of the available training data. The training set for the second classifier C2 is chosen as the most informative subset, given C1: half of the training data for C2 is correctly classified by C1, and the other half is misclassified by C1. The third classifier C3 is trained on instances on which C1 and C2 disagree.

Algorithm: Boosting
Input: Training data $S$ of size $N$ with correct labels $\omega_i \in \Omega = \{\omega_1, \omega_2\}$; weak learning algorithm WeakLearn.
Training:
1. Select $N_1 < N$ patterns without replacement from $S$ to create data subset $S_1$.
2. Call WeakLearn and train with $S_1$ to create classifier $C_1$.
3. Create dataset $S_2$ as the most informative dataset, given $C_1$, such that half of $S_2$ is correctly classified by $C_1$ and the other half is misclassified. To do so:
   a. Flip a fair coin. If head, select samples from $S$ and present them to $C_1$ until the first instance is misclassified. Add this instance to $S_2$.
   b. If tail, select samples from $S$ and present them to $C_1$ until the first one is correctly classified. Add this instance to $S_2$.
   c. Continue flipping coins until no more patterns can be added to $S_2$.
4. Train the second classifier $C_2$ with $S_2$.
5. Create $S_3$ by selecting those instances for which $C_1$ and $C_2$ disagree. Train the third classifier $C_3$ with $S_3$.
Test: given a test instance $x$,
1. Classify $x$ by $C_1$ and $C_2$. If they agree on the class, this class is the final classification.
2. If they disagree, choose the class predicted by $C_3$ as the final classification.

Initially, all samples have equal weights. Samples that are wrongly classified have their weights increased. Samples that are classified correctly have their weights decreased. Samples with higher weights have more influence in subsequent training iterations. This adaptively changes the training data distribution.
Original Data:       1  2  3  4  5  6  7  8  9  10
Boosting (Round 1):  7  3  2  8  7  9  4  10  6  3
Boosting (Round 2):  5  4  9  4  2  5  1  7  4  2
Boosting (Round 3):  4  4  8  10  4  5  4  6  3  4
Sample 4 is hard to classify, so its weight is increased.

AdaBoost (1997) is a more general version of the boosting algorithm; AdaBoost.M1 can handle multiclass problems. AdaBoost generates a set of hypotheses (classifiers) and combines them through weighted majority voting of the classes predicted by the individual hypotheses. Hypotheses are generated by training a weak classifier on samples drawn from an iteratively updated distribution of the training set. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Consecutive classifiers are therefore trained on increasingly hard-to-classify samples.

AdaBoost maintains a weight distribution $D_t(i)$ on the training instances $x_i$, $i = 1, \dots, N$, from which the training data subset $S_t$ is chosen for each consecutive classifier (hypothesis) $h_t$. A normalized error is then obtained as $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$, so that $0 < \beta_t < 1$ for $0 < \varepsilon_t < 1/2$. Distribution update rule: the weights of instances correctly classified by the current hypothesis are reduced by a factor of $\beta_t$, whereas the weights of misclassified instances are left unchanged. After normalization, this raises the weights of instances misclassified by $h_t$ and lowers the weights of correctly classified instances, so AdaBoost focuses on increasingly difficult instances. Once training is complete, AdaBoost is ready to classify unlabeled test instances. Unlike bagging or boosting, AdaBoost uses weighted majority voting: $1/\beta_t$ is a measure of the performance of the $t$-th hypothesis and can be used to weight the classifiers.

Algorithm: AdaBoost.M1
Input: Sequence of $N$ examples $S = [(x_i, y_i)]$, $i = 1, \dots, N$, with labels $y_i \in \Omega$, $\Omega = \{\omega_1, \dots, \omega_C\}$; weak learning algorithm WeakLearn; integer $T$ specifying the number of iterations.
Initialize $D_1(i) = 1/N$, $i = 1, \dots, N$.
Do for $t = 1, 2, \dots, T$:
1. Select a training data subset $S_t$, drawn from the distribution $D_t$.
2. Train WeakLearn with $S_t$, receive hypothesis $h_t$.
3. Calculate the error of $h_t$: $\varepsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i)$. If $\varepsilon_t > 1/2$, abort.
4. Set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$.
5. Update distribution $D_t$: $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \beta_t$ if $h_t(x_i) = y_i$, and $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times 1$ otherwise, where $Z_t$ is a normalization constant chosen so that $D_{t+1}$ becomes a proper distribution function.
Test (weighted majority voting): given an unlabeled instance $x$,
1. Obtain the total vote received by each class, $V_j = \sum_{t: h_t(x) = \omega_j} \log \frac{1}{\beta_t}$, $j = 1, \dots, C$.
2. Choose the class that receives the highest total vote as the final classification.
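The following Python sketch follows the AdaBoost.M1 steps on a small one-dimensional problem. Two departures from the listing are assumptions of this example: the weak learner `weighted_stump` minimizes the $D_t$-weighted error directly instead of being trained on a subset resampled from $D_t$, and a zero error is clamped to a small positive value so that $\log(1/\beta_t)$ stays finite. The toy labels are chosen so that no single threshold rule classifies them all correctly.

```python
import math

def weighted_stump(xs, ys, D):
    """Hypothetical WeakLearn: 1-D threshold rule minimizing the D-weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            err = sum(d for x, y, d in zip(xs, ys, D)
                      if (left if x <= thr else right) != y)
            if best is None or err < best[0]:
                best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def adaboost_m1(xs, ys, T):
    N = len(xs)
    D = [1.0 / N] * N                                    # D_1(i) = 1/N
    hyps, betas = [], []
    for _ in range(T):
        h = weighted_stump(xs, ys, D)                    # steps 1-2 (weights, not resampling)
        eps = sum(d for x, y, d in zip(xs, ys, D) if h(x) != y)   # step 3
        if eps > 0.5:
            break                                        # abort
        eps = max(eps, 1e-10)                            # assumption: guard against eps = 0
        beta = eps / (1.0 - eps)                         # step 4
        D = [d * (beta if h(x) == y else 1.0) for x, y, d in zip(xs, ys, D)]
        Z = sum(D)                                       # normalization constant
        D = [d / Z for d in D]                           # step 5
        hyps.append(h)
        betas.append(beta)
    return hyps, betas

def weighted_majority(hyps, betas, x):
    """Each hypothesis votes for its predicted class with weight log(1/beta_t)."""
    votes = {}
    for h, beta in zip(hyps, betas):
        votes[h(x)] = votes.get(h(x), 0.0) + math.log(1.0 / beta)
    return max(votes, key=votes.get)

xs = list(range(10))
ys = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
hyps, betas = adaboost_m1(xs, ys, T=3)
print([weighted_majority(hyps, betas, x) for x in xs])
```

On this data the first stump misclassifies three of the ten points ($\varepsilon_1 = 0.3$, $\beta_1 = 3/7$), yet the weighted vote of three stumps labels every training point correctly.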

The AdaBoost algorithm is sequential: classifier $C_{K-1}$ is created before classifier $C_K$.

Bagging: resamples data points; the weight of each classifier is the same; only reduces variance; robust to noise and outliers; easily parallelized.
Boosting: reweights data points (modifies the data distribution); the weight of a classifier depends on its accuracy; reduces both bias and variance; noise and outliers can hurt performance.

In stacked generalization, an ensemble of classifiers is first created, whose outputs are used as inputs to a second-level meta-classifier that learns the mapping between the ensemble outputs and the actual correct classes. Classifiers $C_1, \dots, C_T$ are trained using training parameters $\theta_1$ through $\theta_T$ to output hypotheses $h_1$ through $h_T$. The outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second-level classifier, $C_{T+1}$.
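A minimal sketch of the two-level idea, assuming the first-level classifiers are already trained and using a simple lookup table as the second-level classifier (any trainable classifier could take its place). The names `train_meta` and `stacked_predict` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_meta(base_classifiers, data):
    """Second-level classifier: map each pattern of first-level outputs
    to the true class most often seen with that pattern."""
    table = defaultdict(Counter)
    for x, y in data:
        outputs = tuple(h(x) for h in base_classifiers)
        table[outputs][y] += 1
    return {pattern: counts.most_common(1)[0][0] for pattern, counts in table.items()}

def stacked_predict(base_classifiers, meta, x, default=None):
    """Run the first level, then let the second level decide."""
    outputs = tuple(h(x) for h in base_classifiers)
    return meta.get(outputs, default)

# Hypothetical first-level classifiers on 1-D inputs.
h1 = lambda x: int(x > 2)
h2 = lambda x: int(x > 5)
base = [h1, h2]

# Class 1 occupies only the middle band, which neither h1 nor h2 captures alone.
data = [(0, 0), (1, 0), (3, 1), (4, 1), (6, 0), (7, 0)]
meta = train_meta(base, data)
print(stacked_predict(base, meta, 3.5), stacked_predict(base, meta, 8))
```

In practice the input/output pairs for the second level are usually produced on data the first-level classifiers were not trained on, so that the meta-classifier sees realistic first-level errors.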

A conceptually similar technique is the mixture-of-experts model, where a set of classifiers $C_1, \dots, C_T$ constitutes the ensemble, followed by a second-level classifier $C_{T+1}$ used for assigning weights for the subsequent combiner. The combiner itself is usually not a classifier, but rather a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all. The weight distribution used for the combiner is determined by a second-level classifier, usually a neural network, called the gating network. The inputs to the gating network are the actual training data instances themselves (unlike stacked generalization, where they are the outputs of the first-level classifiers). Mixture-of-experts can therefore be seen as a classifier selection algorithm. Individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance $x$.

The pooling system may use the weights in several different ways. It may choose a single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.
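The two pooling options can be illustrated as follows. The function `pool`, its `mode` argument, and the numbers are assumptions for this example; the gating weights are taken as given rather than produced by a trained gating network.

```python
def pool(expert_outputs, gate_weights, mode="weighted_sum"):
    """expert_outputs[t][j]: support of expert t for class j.
    gate_weights[t]: weight assigned to expert t for this instance.
    Returns the index of the chosen class."""
    if mode == "winner_takes_all":
        # Use only the expert with the highest weight.
        t_best = max(range(len(gate_weights)), key=lambda t: gate_weights[t])
        scores = expert_outputs[t_best]
    else:
        # Weighted sum of the expert outputs for each class.
        n_classes = len(expert_outputs[0])
        scores = [sum(w * out[j] for w, out in zip(gate_weights, expert_outputs))
                  for j in range(n_classes)]
    return max(range(len(scores)), key=lambda j: scores[j])

outputs = [[0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]   # three experts, two classes
weights = [0.40, 0.35, 0.25]

print(pool(outputs, weights, mode="winner_takes_all"))  # expert 0 decides
print(pool(outputs, weights, mode="weighted_sum"))      # all experts contribute
```

With these numbers the two modes disagree: the highest-weight expert prefers class 0, while the weighted sums are 0.325 for class 0 and 0.675 for class 1.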

How to combine classifiers? Combination rules are grouped as:
(i) Trainable vs. non-trainable. Trainable rules: the parameters of the combiner, called weights, are determined through a separate training algorithm. Weights from trainable rules are usually instance specific, and hence these are also called dynamic combination rules. Non-trainable rules: the combination parameters are available as the classifiers are generated; weighted majority voting is an example.
(ii) Combination rules for class labels vs. class-specific continuous outputs. Combination rules that apply to class labels need only the classification decision (that is, one of $\omega_j$, $j = 1, \dots, C$). Other rules need the continuous-valued outputs of the individual classifiers.

Assume that only class labels are available from the classifier outputs. Define the decision of the $t$-th classifier as $d_{t,j} \in \{0, 1\}$, $t = 1, \dots, T$ and $j = 1, \dots, C$, where $T$ is the number of classifiers and $C$ is the number of classes. If the $t$-th classifier chooses class $\omega_j$, then $d_{t,j} = 1$, and $0$ otherwise.
a) Majority voting: choose class $\omega_J$ if $\sum_{t=1}^{T} d_{t,J} = \max_{j=1}^{C} \sum_{t=1}^{T} d_{t,j}$.
b) Weighted majority voting: choose class $\omega_J$ if $\sum_{t=1}^{T} w_t d_{t,J} = \max_{j=1}^{C} \sum_{t=1}^{T} w_t d_{t,j}$, where $w_t$ is the weight of the $t$-th classifier.
c) Behavior Knowledge Space (BKS): lookup table.
d) Borda count: each voter (classifier) rank orders the candidates (classes). If there are $N$ candidates, the first-place candidate receives $N - 1$ votes, the second-place candidate receives $N - 2$, and the candidate in $i$-th place receives $N - i$ votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision.
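Rules (a), (b) and (d) above are short enough to write out directly. This is an illustrative sketch; the function names and example votes are assumptions, and ties are broken arbitrarily in favor of the class seen first.

```python
from collections import Counter

def majority_vote(labels):
    """labels[t]: class chosen by classifier t."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """weights[t]: weight w_t of classifier t."""
    totals = {}
    for label, w in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

def borda_count(rankings):
    """rankings[t]: classes ordered from first to last place by classifier t."""
    totals = Counter()
    for ranking in rankings:
        n = len(ranking)
        for place, cls in enumerate(ranking, start=1):
            totals[cls] += n - place          # i-th place receives N - i votes
    return totals.most_common(1)[0][0]

labels = ["a", "b", "a"]
print(majority_vote(labels))                            # a: two of three votes
print(weighted_majority_vote(labels, [0.2, 0.7, 0.3]))  # b: 0.7 against 0.5
print(borda_count([["a", "b", "c"],
                   ["b", "c", "a"],
                   ["b", "a", "c"]]))                   # b: 5 points, a: 3, c: 1
```

The second call shows how a single highly weighted classifier can overturn the unweighted majority.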

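Algebraic combiners, where $d_{t,j}(x)$ is the continuous-valued support given by the $t$-th classifier to class $\omega_j$:
a) Mean rule: $\mu_j(x) = \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)$
b) Weighted average: $\mu_j(x) = \frac{1}{T} \sum_{t=1}^{T} w_t d_{t,j}(x)$
c) Minimum/maximum/median rule: $\mu_j(x) = \min_{t=1,\dots,T} \{d_{t,j}(x)\}$, $\mu_j(x) = \max_{t=1,\dots,T} \{d_{t,j}(x)\}$, $\mu_j(x) = \operatorname{median}_{t=1,\dots,T} \{d_{t,j}(x)\}$
d) Product rule: $\mu_j(x) = \frac{1}{T} \prod_{t=1}^{T} d_{t,j}(x)$
e) Generalized mean: many of the above rules are in fact special cases of the generalized mean, $\mu_j(x, \alpha) = \left( \frac{1}{T} \sum_{t=1}^{T} d_{t,j}(x)^{\alpha} \right)^{1/\alpha}$. Here $\alpha \to -\infty$ gives the minimum rule, $\alpha \to \infty$ the maximum rule, $\alpha \to 0$ the geometric mean, and $\alpha = 1$ the mean rule.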
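A short numerical check of the generalized mean and its limiting cases. The function below is a sketch that assumes strictly positive supports; the $\alpha = 0$ case is handled separately as the geometric-mean limit, and large finite values of $\alpha$ stand in for the infinite limits.

```python
import math

def generalized_mean(supports, alpha):
    """supports: d_{t,j}(x) for t = 1..T and one class j (assumed > 0)."""
    T = len(supports)
    if alpha == 0:
        # Limit as alpha -> 0: geometric mean.
        return math.exp(sum(math.log(d) for d in supports) / T)
    return (sum(d ** alpha for d in supports) / T) ** (1.0 / alpha)

supports = [0.2, 0.5, 0.8]
for alpha in (-200, 0, 1, 200):
    print(alpha, round(generalized_mean(supports, alpha), 4))
```

For these supports the value moves from close to the minimum (0.2) at $\alpha = -200$, through the mean (0.5) at $\alpha = 1$, to close to the maximum (0.8) at $\alpha = 200$.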

Ensemble Cloud Army (ECA): A Platform for Parallel Processing of Machine Learning Problems in the Amazon Cloud. J. Jeffry Howbert, Insilicos LLC, May 11, 2011

Datasets
Name | Source | Domain | Instances | Features | Feature type(s) | Classes
satimage | UCI | soil types from satellite images | 4435 train, 2000 test | 36 | numeric (0-255) | 6
covertype | UCI | forest cover types from cartographic variables | 581012 | 54 | 10 numeric, 44 binary qualitative | 7
jones | Ref. 3 | protein secondary structure | 209529 train, 17731 test | 315 | numeric | 3

For ensembles, training subsets must deliver diversity, accuracy, and fast computation. For the large datasets used with ECA, bootstrap samples are too large for practical computation. Instead, much smaller subsets of records are generated by random sampling without replacement. The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, provided the sample is representative. A sample is representative if it has approximately the same distribution of properties (of interest) as the original set of data.
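As a rough sketch of this sampling scheme, the code below draws several small subsets without replacement and compares their class fractions with those of the full data, which is one simple way to check that a sample looks representative. The function names, the subset size, and the synthetic labels are assumptions for illustration, not details of ECA.

```python
import random
from collections import Counter

def draw_subsets(n_records, subset_size, n_subsets, rng):
    """Each subset is a list of record indices drawn without replacement."""
    return [rng.sample(range(n_records), subset_size) for _ in range(n_subsets)]

def class_fractions(labels):
    """Fraction of records belonging to each class."""
    n = len(labels)
    return {c: k / n for c, k in sorted(Counter(labels).items())}

rng = random.Random(42)
labels = [i % 3 for i in range(30000)]        # hypothetical labels, three balanced classes
subsets = draw_subsets(len(labels), 2500, 4, rng)

print("full data:", class_fractions(labels))
for idx in subsets:
    print("subset:   ", class_fractions([labels[i] for i in idx]))
```

Different subsets may share records with each other; only the records within a single subset are guaranteed to be distinct.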

[Figure: Ensembles have better accuracy than individual component classifiers. Classification accuracy (%) versus number of instances per base classifier (100 to 100000). Series: covertype, ensemble of decision trees and average of individual decision trees; Jones, ensemble of neural nets and average of individual neural nets; Jones, ensemble of decision trees and average of individual decision trees.]

[Figure: Accuracy remains high despite large reduction in features. Jones dataset: classification accuracy (%) versus number of instances per base classifier (100 to 100000). Series: neural nets and decision trees, each with 315, 157, and 78 features.]

The potential speedup from parallelization is strictly limited by the portion of the computation that cannot be parallelized. Assume proportion $P$ of the computation can be parallelized, and proportion $(1 - P)$ is necessarily sequential. The speedup from parallelizing on $N$ processors is $\frac{1}{(1 - P) + P/N}$. For example, if $P = 0.9$, the maximum possible speedup is 10, no matter how large $N$ is.
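The formula is easy to evaluate directly. The helper name `amdahl_speedup` and the processor counts are chosen for illustration.

```python
def amdahl_speedup(P, N):
    """Speedup on N processors when proportion P of the work can be parallelized."""
    return 1.0 / ((1.0 - P) + P / N)

for n in (1, 10, 60, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With $P = 0.9$, 60 processors give a speedup of about 8.7, and adding more processors only approaches the limit $1 / (1 - P) = 10$.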

[Figure: Computational performance, ensembles of decision trees. Increase in speed over single node versus number of nodes in cluster (0 to 60), compared with ideal performance. Series: Jones with 10000, 5000, 2500, and 1000 instances; covertype with 20000, 10000, 5000, 2500, and 1000 instances.]

[Figure: Computational performance, ensembles of neural networks. Increase in speed over single node versus number of nodes in cluster (0 to 30), compared with ideal performance. Series: Jones with 20000, 10000, 5000, 2500, 1000, 500, 250, and 100 instances.]

Large data handling is not as critical as expected. The best ensemble accuracy is associated with smaller partitions (< 5,000 instances). Ensembles with small partitions run much faster than those with larger partitions.

Ensembles with small partitions run much faster than a single classifier trained on all of the data, and are more accurate.
Number of trees | Instances per tree | Processing mode | Number of nodes | Node type | Runtime | Accuracy, %
1 | 209529 | serial | 1 | 64-bit | 2:01:34 | 58.30
100 | 2500 | serial | 1 | 64-bit | 29:54 | 66.30
180 | 2500 | parallel | 60 | 32-bit | 5:44 | 66.66
Jones dataset, ensemble of decision trees.

RMPI version released on SourceForge ica.sf.net

Given two models with similar generalization errors, one should prefer the simpler model over the more complex model. For a complex model, there is a greater chance that it was fitted accidentally to errors in the data. Model complexity should therefore be considered when evaluating a model.

http://www.datamininglab.com/pubs/paradox_jcgs.pdf

Ensemble systems are useful in practice. Diversity of the base classifiers is important. Ensemble generation techniques: bagging, AdaBoost, mixture of experts. Classifier combination strategies: algebraic combiners, voting methods, and decision templates. No single ensemble generation algorithm or combination rule is universally better than others. Effectiveness on real-world data depends on the classifier diversity and the characteristics of the data.