BIOINF 585: Machine Learning for Systems Biology & Clinical Informatics
Lecture 12: Ensemble Learning I
Jie Wang
Department of Computational Medicine & Bioinformatics
University of Michigan
Outline
- Bias and Variance
- Bagging
- Application of Bagging to CART (Classification & Regression Tree)
Bias and Variance
The reducible error of a model can be decomposed into two parts:
- Error due to bias
- Error due to variance
Good references include:
- Scott Fortmann-Roe's blog: http://scott.fortmann-roe.com/docs/biasvariance.html#fn:1
- Lecture 8 of the free online course Introductory Machine Learning (a bit technical): https://work.caltech.edu/telecourse.html
- http://www.hlt.utdallas.edu/~vgogate/ml/2015s/lectures/ensemblemethods.pdf
- https://followthedata.wordpress.com/2012/06/02/practical-advice-for-machine-learning-bias-variance/
The Error Due to Bias
The error due to bias refers to the difference between the expected (or average) prediction of the model and the true value that we are trying to predict.
Suppose that there are infinitely many marbles in a bin. How can we estimate the fraction of red marbles in the bin? We use the fraction of red marbles in a sample to estimate the fraction of red marbles in the bin.
(Figure is from http://work.caltech.edu/slides/slides02.pdf)
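This estimate is easy to simulate; a minimal sketch, where the true fraction `mu` and the sample size `n` are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.3   # hypothetical true fraction of red marbles (unknown in practice)
n = 100    # sample size

# Draw n marbles; each one is red with probability mu.
sample = rng.random(n) < mu
print("sample fraction of red marbles:", sample.mean())  # our estimate of mu
```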
The Error Due to Bias
Is the fraction of red marbles computed from the sample a good approximation to its real value? A single sample can be unlucky, so it is unfair to say a model is bad if it performs poorly on one sample. Instead, we average the performance over many samples.
Repeating the sampling $n$ times gives estimates $\mu_1, \dots, \mu_n$ with average
$$\bar{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mu_i.$$
Because each sample fraction is an unbiased estimate of the true fraction, the error due to bias is 0.
The Error Due to Bias
Low bias means the model fits the data well:
- Linear regression applied to linear data
- SVM applied to linearly separable data with large margin
- Caution: can be overfitting
High bias means poor approximation to the data:
- Linear regression applied to nonlinear data
- KNN with very large k
The Error Due to Variance
The error due to variance refers to the variance of a model prediction for a given data point.
Training on different datasets $D_1, D_2, \dots, D_n$ yields different models $g_{D_1}, g_{D_2}, \dots, g_{D_n}$. For a fixed point $x$, the prediction $g_D(x)$ is a random variable (the randomness comes from the training data $D$), and
$$\text{error due to variance} = \mathrm{Var}_D\big[g_D(x)\big].$$
The Error Due to Variance
Low variance means the prediction of the model is stable:
- Linear regression applied to nonlinear data
- A model that is independent of the data (e.g., a constant predictor)
- Caution: can be under-fitting
High variance means the prediction of the model is unstable:
- High-degree polynomial regression
- KNN with k = 1
Graphical illustration of bias and variance
(Figure is from Scott Fortmann-Roe's blog.)
Bias/Variance Tradeoff
$$E_D\Big[\big(g_D(x) - g(x)\big)^2\Big] = \underbrace{\big(E_D[g_D(x)] - g(x)\big)^2}_{\text{bias}^2} + \underbrace{E_D\Big[\big(g_D(x) - E_D[g_D(x)]\big)^2\Big]}_{\text{variance}}$$
where $g(x)$ is the true model (usually unknown).
We usually have either:
- low bias, high variance (the model is too complex), or
- low variance, high bias (the model is too simple).
Tradeoff: bias² vs. variance. (Duda et al., Pattern Classification)
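The decomposition can be checked numerically. The sketch below uses illustrative assumptions (a made-up true function $g(x) = \sin x$, Gaussian noise, and degree-1 vs. degree-9 polynomial fits) and estimates bias² and variance at a fixed point by repeatedly redrawing the training set:

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.sin                        # the "true model" g(x), chosen for illustration
x0, n, sigma, runs = 1.5, 20, 0.3, 2000

for degree in (1, 9):             # too simple vs. too complex
    preds = np.empty(runs)
    for r in range(runs):
        # Draw a fresh training set D and fit g_D by polynomial least squares.
        X = rng.uniform(0, np.pi, n)
        y = g(X) + rng.normal(0, sigma, n)
        coef = np.polyfit(X, y, degree)
        preds[r] = np.polyval(coef, x0)      # g_D(x0)
    bias2 = (preds.mean() - g(x0)) ** 2      # (E_D[g_D(x0)] - g(x0))^2
    var = preds.var()                        # E_D[(g_D(x0) - E_D[g_D(x0)])^2]
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
```

The low-degree fit shows high bias and low variance; the high-degree fit shows the opposite.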
Bias/Variance Tradeoff
(Figure is from Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)
Tips to reduce errors
High bias:
- Try to get new features
High variance:
- Try to get more training samples
- Try a subset of features
Reduce Variance without Increasing Bias
Averaging reduces variance:
$$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) = \frac{\mathrm{Var}(X)}{N}$$
when the $X_i$ are independent and identically distributed (i.i.d.) random variables.
The training data is given. How can we find more data?
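The identity is easy to verify by simulation; a minimal sketch with an arbitrary choice of $N$ and distribution (standard normal, so $\mathrm{Var}(X) = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 25, 100_000

# Each row is one realization of N i.i.d. standard-normal variables.
X = rng.normal(size=(trials, N))
averages = X.mean(axis=1)

print("Var(X)/N        :", 1 / N)            # theoretical value: 0.04
print("Var of averages :", averages.var())   # empirical value, close to 0.04
```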
Bagging (Breiman, 1996)
- Derived from the bootstrap (Efron, 1993)
- Create classifiers using training sets that are bootstrapped (drawn with replacement)
- Average the results for each case
What is Bagging? (Bootstrap Aggregating)
Basic idea of the bootstrap:
- Suppose we have a model fit to a set of training data $Z = \{z_1, z_2, \dots, z_N\}$, where $z_i = (x_i, y_i)$.
- The basic idea is to randomly draw datasets with replacement from the training data. Each sample set is the same size as the original training set.
- This is done B times, giving B bootstrap datasets.
- We refit the model to each of the bootstrap datasets and examine the behavior over the B replications.
Question: what would happen if you drew the datasets without replacement? (See the sketch below.)
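A minimal sketch of drawing one bootstrap dataset (the toy `X`, `y` arrays stand in for any training set):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw N indices with replacement; return one bootstrap dataset."""
    N = len(X)
    idx = rng.integers(0, N, size=N)   # sampling WITH replacement
    return X[idx], y[idx]

# Example: one bootstrap replicate of a toy training set of 8 cases.
rng = np.random.default_rng(0)
X = np.arange(8).reshape(-1, 1)
y = np.arange(8)
Xb, yb = bootstrap_sample(X, y, rng)
print(yb)   # some of the 8 cases repeat, some are left out

# Without replacement (idx = rng.permutation(N)), every "bootstrap" set would
# just be a shuffled copy of Z, so all B refitted models would be identical.
```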
Review of the Bootstrap
- The bootstrap can be used to assess the accuracy of a parameter estimate or prediction.
- In bagging, we use it to improve the estimate or prediction itself.
- For each bootstrap sample set $Z^{*b}$, $b = 1, 2, \dots, B$, we fit our model, giving prediction $\hat{f}^{*b}(x)$. The bagging estimate is
$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x).$$
- Averaging the predictions reduces variance (not bias).
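Putting the two pieces together, a from-scratch sketch of the bagging estimate for regression; the base learner (a scikit-learn decision tree), B, and the function name are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, B=50, seed=0):
    """Average the predictions of B trees, each fit to a bootstrap dataset."""
    rng = np.random.default_rng(seed)
    N = len(X_train)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, N, size=N)                  # bootstrap set Z*b
        tree = DecisionTreeRegressor(random_state=0)
        preds[b] = tree.fit(X_train[idx], y_train[idx]).predict(X_test)
    return preds.mean(axis=0)                             # f_bag(x) = (1/B) sum_b f*b(x)
```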
Bagging Example (Opitz, 1999)

Original        1 2 3 4 5 6 7 8
Training set 1  2 7 8 3 7 6 3 1
Training set 2  7 8 5 6 4 2 7 1
Training set 3  3 6 2 7 5 6 2 2
Training set 4  4 5 1 4 6 4 3 8
Bagging
[Diagram: T bootstrap training sets are each fed to the learning algorithm (ML), producing models $f_1, f_2, \dots, f_T$, which are combined (averaged) into the final predictor $f$.]
Examples of Tree-Based Methods
Tree-based methods partition the feature space into a set of rectangles, and usually fit a constant in each one.
(The figure is from Wei-Yin Loh (2011), Classification and regression trees, WIREs Data Mining and Knowledge Discovery 1:14-23.)
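To see the "constant in each rectangle" behavior, a small sketch fitting a shallow regression tree to made-up 1-D data (with one feature, the rectangles are intervals):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 80)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)   # at most 4 leaves

grid = np.linspace(0, 10, 200).reshape(-1, 1)
print(np.unique(tree.predict(grid)))   # one constant value per rectangle (interval)
```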
Why Bagging in CART?
- CART = Classification and Regression Tree.
- To construct the tree, MSE (regression) or misclassification error (classification) is usually minimized over the training sample.
- Tree-based methods have very high variance: they are unstable because of their hierarchical structure.
- Bagging averages many trees to reduce the variance.
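scikit-learn provides this combination directly. A minimal sketch comparing a single CART to its bagged version on a synthetic dataset (all settings here are illustrative defaults, not the course's experiment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

print("single tree :", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())
```

The bagged ensemble typically scores noticeably higher, reflecting the variance reduction.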
Example I: Tree-based Regression
Example II: Classification Tree
[Figure panels: Training Sample; Results from one CART; Bagged Tree Decision Boundary]
Results from Breiman 96
Tips for Using Bagging
- Bagging is useful when the base learner is unstable. A base learner is unstable if a small perturbation of the training data leads to a large change in the model.
- Neural networks, decision trees, and KNN with small k are unstable; KNN with large k and the naive Bayes classifier are stable.
- Bagging is not intended for reducing bias.
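This tip can be checked empirically: bagging typically helps a decision tree (unstable) much more than naive Bayes (stable). A sketch under the same synthetic setup as above (illustrative data and settings):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for name, base in [("tree", DecisionTreeClassifier()), ("naive Bayes", GaussianNB())]:
    plain = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(
        BaggingClassifier(base, n_estimators=100, random_state=0), X, y, cv=5
    ).mean()
    print(f"{name:12s} plain={plain:.3f}  bagged={bagged:.3f}")
```

Expect a clear gain for the tree and little or no change for naive Bayes.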
References
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
- LeBlanc, M. and Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 91:1641-1650.
- Textbook: Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning.