April 3, 2012 T.C. Havens

- Different training parameters: MLPs with different initial weights, numbers of layers/nodes, etc.; this exploits the instability of such classifiers (convergence to different local minima)
- Similar strategies can be used to generate different decision trees
- Different types of classifiers (MLPs, decision trees, NNs, SVMs) can be combined to improve diversity
- Random feature subsets, known as the random subspace method

Two questions: How will the individual classifiers be generated? How will they differ from each other? The answers determine the diversity of the classifiers and hence the performance of the fusion; ensemble diversity is usually sought through heuristic methods.

Bagging, short for bootstrap aggregating, is one of the earliest ensemble-based algorithms. It is also one of the most intuitive and simplest to implement, with surprisingly good performance.
- Uses bootstrapped replicas of the training data: a large number of training subsets (say, 200) are randomly drawn, with replacement, from the entire training data
- Each resampled training set is used to train a different classifier of the same type
- Individual classifiers are combined by taking a simple majority vote of their decisions
- Bagging is particularly appealing when the training set is small, since a relatively large portion of the samples is included in each subset

Algorithm: Bagging
Input: Training data S with correct labels ω_i ∈ Ω = {ω_1, ..., ω_C} representing C classes; weak learning algorithm WeakLearn; integer T specifying the number of iterations; percent (or fraction) F used to create the bootstrapped training data.
Do t = 1, ..., T:
1. Take a bootstrapped replica S_t by randomly drawing F percent of S.
2. Call WeakLearn with S_t and receive the hypothesis (classifier) h_t.
3. Add h_t to the ensemble, E.
End
Test (simple majority voting) - given an unlabeled instance x:
1. Evaluate the ensemble E = {h_1, ..., h_T} on x.
2. Let v_{t,j} = 1 if h_t picks class ω_j, and 0 otherwise (the vote given to class ω_j by classifier h_t).
3. Obtain the total vote received by each class: V_j = Σ_{t=1}^{T} v_{t,j}, j = 1, ..., C.
4. Choose the class that receives the highest total vote as the final classification.
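
The loop above maps directly onto a few lines of code. Below is a minimal sketch (not from the slides), assuming scikit-learn's DecisionTreeClassifier stands in for WeakLearn and that class labels are integers 0..C-1; function names such as bagging_train are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, T=200, F=1.0, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    ensemble = []
    for _ in range(T):
        idx = rng.choice(N, size=int(F * N), replace=True)             # bootstrapped replica S_t
        ensemble.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # hypothesis h_t
    return ensemble

def bagging_predict(ensemble, X):
    votes = np.stack([h.predict(X) for h in ensemble]).astype(int)     # T x N predicted labels
    # simple majority vote for each test instance
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

Training each h_t on its own bootstrap replica is what makes the individual classifiers diverse; the majority vote then averages away much of their variance.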

Random Forests: constructed from decision trees. A random forest is created from individual decision trees whose training parameters vary randomly. Such parameters can be bootstrapped replicas of the training data, as in bagging, but they can also be different feature subsets, as in random subspace methods.

Given: N training samples, p variables.
Algorithm:
1. For b = 1 to B:
   a. Draw a bootstrap sample of size N from the training data.
   b. Grow a random-forest tree T_b on the bootstrapped data by recursively repeating the following steps for each terminal node, until the minimum node size n_min is reached:
      i. Select m variables at random from the p variables.
      ii. Pick the best variable and split-point among the m.
      iii. Split the node into two child nodes.
2. Output the ensemble of B trees {T_b}.
Step (i) is the only difference from bagging with decision trees; m is typically sqrt(p) (it can be as low as 1).
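
A sketch of that one-line difference, reusing the bagging structure above; here scikit-learn's max_features="sqrt" option plays the role of selecting m ≈ sqrt(p) candidate variables at each split (an assumption about tooling, not something stated on the slides).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_train(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    forest = []
    for _ in range(B):
        idx = rng.choice(N, size=N, replace=True)             # bootstrap sample of size N
        tree = DecisionTreeClassifier(max_features="sqrt")    # m ~ sqrt(p) features per split
        forest.append(tree.fit(X[idx], y[idx]))
    return forest
# Prediction is a majority vote over the forest, exactly as in bagging_predict above.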

Random forests routinely outperform bagged ensembles, and are often competitive with boosting.

Random forests provide even more variance reduction than bagged decision trees, but they still do not reduce bias. The benefit appears to come from de-correlating the individual trees, since bootstrap samples alone still have significant correlation. Random forests are also simpler to train and tune than boosting algorithms.

Random forests were first implemented in FORTRAN by Leo Breiman and Adele Cutler, who trademarked the term (http://stat-www.berkeley.edu/users/breiman/randomforests/cc_home.htm). Commercial distribution is licensed exclusively to Salford Systems, but there are many open-source implementations in various languages and machine learning packages; in MATLAB, random forests are available as the TreeBagger class (Statistics Toolbox).

Boosting aims to boost the performance of a weak learner to the level of a strong one. Like bagging, boosting creates an ensemble of classifiers by resampling the data and combines them by majority voting, but the resampling is strategically geared to provide the most informative training data for each consecutive classifier. The original boosting algorithm creates three weak classifiers:
- The first classifier C1 is trained with a random subset of the available training data
- The training set for the second classifier C2 is chosen as the most informative subset given C1: half of the training data for C2 is correctly classified by C1, and the other half is misclassified by C1
- The third classifier C3 is trained on instances on which C1 and C2 disagree

Algorithm: Boosting
Input: Training data S of size N with correct labels ω_i ∈ Ω = {ω_1, ω_2}; weak learning algorithm WeakLearn.
Training:
1. Select N_1 < N patterns without replacement from S to create data subset S_1.
2. Call WeakLearn and train with S_1 to create classifier C_1.
3. Create dataset S_2 as the most informative dataset, given C_1, such that half of S_2 is correctly classified by C_1 and the other half is misclassified. To do so:
   a. Flip a fair coin. If heads, select samples from S and present them to C_1 until the first one is misclassified. Add this instance to S_2.
   b. If tails, select samples from S and present them to C_1 until the first one is correctly classified. Add this instance to S_2.
   c. Continue flipping coins until no more patterns can be added to S_2.
4. Train the second classifier C_2 with S_2.
5. Create S_3 by selecting those instances on which C_1 and C_2 disagree. Train the third classifier C_3 with S_3.
Test - given a test instance x:
1. Classify x by C_1 and C_2. If they agree on the class, this class is the final classification.
2. If they disagree, choose the class predicted by C_3 as the final classification.
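
A rough sketch of this three-classifier scheme. Names such as train_boosting_trio are hypothetical, decision stumps stand in for the weak learner, the coin-flip filtering is approximated by directly sampling equal numbers of correctly and incorrectly classified instances, and C_1 is assumed to make at least one error on S (otherwise S_2 would be empty).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_boosting_trio(X, y, n1, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    s1 = rng.choice(N, size=n1, replace=False)                  # random subset S_1
    C1 = DecisionTreeClassifier(max_depth=1).fit(X[s1], y[s1])

    pred1 = C1.predict(X)
    right, wrong = np.flatnonzero(pred1 == y), np.flatnonzero(pred1 != y)
    k = min(len(right), len(wrong))                             # half correct, half misclassified
    s2 = np.concatenate([rng.choice(right, k, replace=False),
                         rng.choice(wrong, k, replace=False)])
    C2 = DecisionTreeClassifier(max_depth=1).fit(X[s2], y[s2])

    s3 = np.flatnonzero(C1.predict(X) != C2.predict(X))         # instances where C_1 and C_2 disagree
    C3 = DecisionTreeClassifier(max_depth=1).fit(X[s3], y[s3]) if len(s3) else C1
    return C1, C2, C3

def predict_boosting_trio(models, X):
    C1, C2, C3 = models
    p1, p2 = C1.predict(X), C2.predict(X)
    return np.where(p1 == p2, p1, C3.predict(X))                # agree -> that class; else C_3 decides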

Initially, all samples have equal weights. Samples that are wrongly classified have their weights increased; samples that are classified correctly have their weights decreased. Samples with higher weights have more influence in subsequent training iterations, so the training data distribution changes adaptively. For example, with ten samples:

Original Data:       1   2   3   4   5   6   7   8   9   10
Boosting (Round 1):  7   3   2   8   7   9   4   10  6   3
Boosting (Round 2):  5   4   9   4   2   5   1   7   4   2
Boosting (Round 3):  4   4   8   10  4   5   4   6   3   4

Sample 4 is hard to classify, so its weight is increased and it is drawn more often in later rounds.

AdaBoost (1997) is a more general version of the boosting algorithm; its variant AdaBoost.M1 can handle multiclass problems. AdaBoost generates a set of hypotheses (classifiers) and combines them through weighted majority voting of the classes predicted by the individual hypotheses. The hypotheses are generated by training a weak classifier on samples drawn from an iteratively updated distribution over the training set. This distribution update ensures that instances misclassified by the previous classifier are more likely to be included in the training data of the next classifier, so consecutive classifiers are trained on increasingly hard-to-classify samples.

AdaBoost maintains a weight distribution D_t(i) on the training instances x_i, i = 1, ..., N, from which the training data subset S_t is chosen for each consecutive classifier (hypothesis) h_t. From the training error ε_t a normalized error β_t = ε_t / (1 - ε_t) is obtained, so that for 0 < ε_t < 1/2 we have 0 < β_t < 1. Distribution update rule: the distribution weights of the instances correctly classified by the current hypothesis are reduced by a factor of β_t, whereas the weights of the misclassified instances are left unchanged; after normalization this raises the relative weights of instances misclassified by h_t and lowers those of correctly classified instances, so AdaBoost focuses on increasingly difficult instances. Once trained, the ensemble classifies unlabeled test instances by weighted majority voting, unlike bagging or the original boosting algorithm: 1/β_t is a measure of the performance of the t-th hypothesis and is used to weight the classifiers.

Algorithm: AdaBoost.M1
Input: Sequence of N examples S = [(x_i, y_i)], i = 1, ..., N, with labels y_i ∈ Ω, Ω = {ω_1, ..., ω_C}; weak learning algorithm WeakLearn; integer T specifying the number of iterations.
Initialize D_1(i) = 1/N, i = 1, ..., N.
Do for t = 1, 2, ..., T:
1. Select a training data subset S_t, drawn from the distribution D_t.
2. Train WeakLearn with S_t, and receive the hypothesis h_t.
3. Calculate the error of h_t: ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i). If ε_t > 1/2, abort.
4. Set β_t = ε_t / (1 - ε_t).
5. Update the distribution: D_{t+1}(i) = (D_t(i) / Z_t) × β_t if h_t(x_i) = y_i, and (D_t(i) / Z_t) × 1 otherwise, where Z_t is a normalization constant chosen so that D_{t+1} becomes a proper distribution function.
Test (weighted majority voting) - given an unlabeled instance x:
1. Obtain the total vote received by each class: V_j = Σ_{t: h_t(x) = ω_j} log(1/β_t), j = 1, ..., C.
2. Choose the class that receives the highest total vote as the final classification.
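
A minimal sketch of AdaBoost.M1 following the pseudocode above; decision stumps stand in for WeakLearn, and the "draw S_t from D_t" step is replaced by training with sample weights, a common shortcut rather than what the slides specify. Function names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_train(X, y, T=50):
    N = len(X)
    D = np.full(N, 1.0 / N)                          # D_1(i) = 1/N
    hypotheses, betas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        miss = h.predict(X) != y
        eps = D[miss].sum()                          # epsilon_t
        if eps == 0 or eps > 0.5:                    # abort condition from the pseudocode
            break
        beta = eps / (1.0 - eps)                     # beta_t
        D = np.where(miss, D, D * beta)              # shrink weights of correctly classified instances
        D /= D.sum()                                 # Z_t normalization
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas

def adaboost_m1_predict(hypotheses, betas, X, classes):
    V = np.zeros((len(X), len(classes)))             # total vote per class
    for h, b in zip(hypotheses, betas):
        pred = h.predict(X)
        for j, c in enumerate(classes):
            V[pred == c, j] += np.log(1.0 / b)       # weighted majority voting
    return np.asarray(classes)[V.argmax(axis=1)]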

The AdaBoost algorithm is sequential: classifier C_{k-1} is created before classifier C_k.

Bagging:
- Resamples data points
- Weight of each classifier is the same
- Only reduces variance
- Robust to noise and outliers
- Easily parallelized

Boosting:
- Reweights data points (modifies the data distribution)
- Weight of a classifier depends on its accuracy
- Reduces both bias and variance
- Noise and outliers can hurt performance

Stacked generalization: an ensemble of classifiers is first created, and their outputs are used as inputs to a second-level meta-classifier that learns the mapping between the ensemble outputs and the actual correct classes. Classifiers C_1, ..., C_T are trained using training parameters θ_1 through θ_T to output hypotheses h_1 through h_T; the outputs of these classifiers and the corresponding true classes are then used as input/output training pairs for the second-level classifier, C_{T+1}.
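
A small sketch of stacked generalization under these assumptions: two scikit-learn base classifiers, logistic regression as C_{T+1}, and out-of-fold predictions (cross_val_predict) used as meta-features so the level-1 learner is not trained on outputs the base learners have already memorized (that last refinement is standard practice, not something stated on the slide).

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_stack(X, y):
    base = [DecisionTreeClassifier(max_depth=3), GaussianNB()]
    # level-0 outputs (class probabilities, out-of-fold) become level-1 inputs
    Z = np.hstack([cross_val_predict(c, X, y, cv=5, method="predict_proba") for c in base])
    meta = LogisticRegression(max_iter=1000).fit(Z, y)        # C_{T+1}
    base = [c.fit(X, y) for c in base]                        # refit level-0 classifiers on all data
    return base, meta

def predict_stack(base, meta, X):
    Z = np.hstack([c.predict_proba(X) for c in base])
    return meta.predict(Z)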

A conceptually similar technique is the mixture-of-experts model, where a set of classifiers C_1, ..., C_T constitutes the ensemble, followed by a second-level classifier C_{T+1} used to assign weights for the subsequent combiner. The combiner itself is usually not a classifier but a simple combination rule, such as random selection (from a weight distribution), weighted majority, or weighted winner-takes-all. The weight distribution used by the combiner is determined by the second-level classifier, usually a neural network called the gating network. The inputs to the gating network are the actual training data instances themselves (unlike the outputs of the first-level classifiers, as in stacked generalization). Mixture-of-experts can therefore be seen as a classifier selection algorithm: individual classifiers are experts in some portion of the feature space, and the combination rule selects the most appropriate classifier, or classifiers weighted with respect to their expertise, for each instance x.

The pooling system may use the weights in several different ways: it may choose a single classifier with the highest weight, or calculate a weighted sum of the classifier outputs for each class and pick the class that receives the highest weighted sum.

How to combine classifiers? Combination rules are commonly grouped as:
(i) Trainable vs. non-trainable. Trainable rules: the parameters of the combiner, called weights, are determined through a separate training algorithm; the weights are usually instance-specific, and hence these are also called dynamic combination rules. Non-trainable rules: the combination parameters become available as the classifiers are generated; weighted majority voting is an example.
(ii) Rules for class labels vs. class-specific continuous outputs. Combination rules that apply to class labels need only the classification decision (that is, one of ω_j, j = 1, ..., C); other rules need the continuous-valued outputs of the individual classifiers.

Assume that only class labels are available from the classifier outputs. Define the decision of the t-th classifier as d_{t,j} ∈ {0, 1}, t = 1, ..., T and j = 1, ..., C, where T is the number of classifiers and C is the number of classes; if the t-th classifier chooses class ω_j, then d_{t,j} = 1, and 0 otherwise.
a) Majority voting: choose class ω_J such that Σ_{t=1}^{T} d_{t,J} = max_{j=1,...,C} Σ_{t=1}^{T} d_{t,j}.
b) Weighted majority voting: choose class ω_J such that Σ_{t=1}^{T} w_t d_{t,J} = max_{j=1,...,C} Σ_{t=1}^{T} w_t d_{t,j}, where w_t is the weight of the t-th classifier.
c) Behavior Knowledge Space (BKS): a look-up table.
d) Borda count: each voter (classifier) rank-orders the candidates (classes). If there are N candidates, the first-place candidate receives N-1 votes, the second-place candidate receives N-2, and the candidate in i-th place receives N-i votes. The votes are added up across all classifiers, and the class with the most votes is chosen as the ensemble decision.

Algebraic combiners (for continuous-valued classifier outputs d_{t,j}(x)):
a) Mean rule: μ_j(x) = (1/T) Σ_{t=1}^{T} d_{t,j}(x)
b) Weighted average: μ_j(x) = (1/T) Σ_{t=1}^{T} w_t d_{t,j}(x)
c) Minimum/maximum/median rules: μ_j(x) = min_{t=1,...,T} d_{t,j}(x), μ_j(x) = max_{t=1,...,T} d_{t,j}(x), μ_j(x) = median_{t=1,...,T} d_{t,j}(x)
d) Product rule: μ_j(x) = (1/T) Π_{t=1}^{T} d_{t,j}(x)
e) Generalized mean: μ_j(x, α) = ( (1/T) Σ_{t=1}^{T} d_{t,j}(x)^α )^{1/α}. Many of the above rules are special cases of the generalized mean: α → -∞ gives the minimum rule, α → +∞ gives the maximum rule, α → 0 gives the geometric mean, and α = 1 gives the (arithmetic) mean rule.
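
The label-based and algebraic rules above reduce to a few array operations. The sketch below uses illustrative names and assumes class supports d[t, j] in [0, 1] and integer class labels; it combines a T x C support matrix for a single instance x.

import numpy as np

def majority_vote(labels):
    # labels: length-T array of integer class labels chosen by the T classifiers
    return np.bincount(labels).argmax()

def combine_supports(d, rule="mean", w=None, alpha=1.0):
    # d: (T, C) array of supports d_{t,j}(x); w: optional classifier weights of length T
    T = len(d)
    if rule == "mean":
        mu = d.mean(axis=0)
    elif rule == "weighted":
        mu = (w[:, None] * d).mean(axis=0)
    elif rule == "min":
        mu = d.min(axis=0)
    elif rule == "max":
        mu = d.max(axis=0)
    elif rule == "median":
        mu = np.median(d, axis=0)
    elif rule == "product":
        mu = d.prod(axis=0) / T
    elif rule == "generalized_mean":
        mu = (np.power(d, alpha).mean(axis=0)) ** (1.0 / alpha)
    return mu.argmax()                     # class with the highest total support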

Ensemble Cloud Army (ECA): A Platform for Parallel Processing of Machine Learning Problems in the Amazon Cloud. J. Jeffry Howbert, Insilicos LLC, May 11, 2011.

Datasets:
- satimage (UCI): soil types from satellite images; 4,435 train / 2,000 test instances; 36 features, numeric (0-255); 6 classes
- covertype (UCI): forest cover types from cartographic variables; 581,012 instances; 54 features (10 numeric, 44 binary qualitative); 7 classes
- jones (Ref. 3): protein secondary structure; 209,529 train / 17,731 test instances; 315 numeric features; 3 classes

For ensembles, the training subsets must deliver diversity, accuracy, and fast computation. For the large datasets used with ECA, bootstrap samples are too large for practical computation; instead, much smaller subsets of records are generated by random sampling without replacement. The key principle for effective sampling: using a sample will work almost as well as using the entire data set, provided the sample is representative, i.e., it has approximately the same distribution of properties (of interest) as the original set of data.
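
In code, the change from bagging is just replace=False with a much smaller subset size. The helper below is only an illustration; the sizes (100 subsets of 2,500 instances) echo the Jones experiments later in the talk.

import numpy as np

rng = np.random.default_rng(0)

def draw_subset(N, n, rng):
    # one training partition: n records sampled without replacement from N
    return rng.choice(N, size=n, replace=False)

subsets = [draw_subset(209529, 2500, rng) for _ in range(100)]   # e.g. 100 base classifiers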

[Figure: Ensembles have better accuracy than individual component classifiers. Classification accuracy (%) vs. number of instances per base classifier (100 to 100,000), comparing each ensemble with the average of its individual base classifiers: covertype (decision trees) and Jones (neural nets and decision trees).]

[Figure: Accuracy remains high despite a large reduction in features. Classification accuracy (%) vs. number of instances per base classifier on the Jones dataset, for neural nets and decision trees trained with 315, 157, and 78 features.]

The potential speedup from parallelization is strictly limited by the portion of the computation that cannot be parallelized. Assume a proportion P of the computation can be parallelized, and the proportion (1 - P) is necessarily sequential. The speedup from parallelizing on N processors is:

speedup = 1 / ((1 - P) + P/N)

For example, if P = 0.9, the maximum possible speedup is 10, no matter how large N is.
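
The bound is easy to check numerically with a two-line helper (illustrative, not from the slides):

def speedup(P, N):
    # fraction P parallelizes across N processors, (1 - P) stays serial
    return 1.0 / ((1.0 - P) + P / N)

print(speedup(0.9, 60))     # ~8.7 on 60 nodes
print(speedup(0.9, 10**6))  # approaches the limit 1 / (1 - P) = 10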

[Figure: Computational performance, ensembles of decision trees. Increase in speed over a single node vs. number of nodes in the cluster (up to 60), plotted against ideal performance, for Jones (10,000 / 5,000 / 2,500 / 1,000 instances per tree) and covertype (20,000 / 10,000 / 5,000 / 2,500 / 1,000 instances per tree).]

[Figure: Computational performance, ensembles of neural networks. Increase in speed over a single node vs. number of nodes in the cluster (up to 30), plotted against ideal performance, for the Jones dataset with 20,000 / 10,000 / 5,000 / 2,500 / 1,000 / 500 / 250 / 100 instances per base classifier.]

- Large data handling was not as critical as expected
- The best ensemble accuracy was associated with smaller partitions (< 5,000 instances)
- Ensembles with small partitions run much faster than those with larger partitions

Ensembles with small partitions run much faster than a single classifier trained on all of the data, and are more accurate (Jones dataset, ensemble of decision trees):

Trees  Instances per tree  Processing mode  Nodes  Node type  Runtime   Accuracy, %
1      209,529             serial           1      64-bit     2:01:34   58.30
100    2,500               serial           1      64-bit     29:54     66.30
180    2,500               parallel         60     32-bit     5:44      66.66

RMPI version released on SourceForge ica.sf.net

Given two models with similar generalization errors, one should prefer the simpler model over the more complex one. For complex models, there is a greater chance that the fit was obtained accidentally, by fitting errors in the data. Model complexity should therefore be considered when evaluating a model.

http://www.datamininglab.com/pubs/paradox_jcgs.pdf

- Ensemble systems are useful in practice
- Diversity of the base classifiers is important
- Ensemble generation techniques: bagging, AdaBoost, mixture of experts
- Classifier combination strategies: algebraic combiners, voting methods, and decision templates
- No single ensemble generation algorithm or combination rule is universally better than the others; effectiveness on real-world data depends on classifier diversity and the characteristics of the data