Random Forest in Genomic Selection

1 Random Forest in genomic selection. Dpto. Mejora Genética Animal, INIA, Madrid. Universidad Politécnica de Valencia, September 2010.

2 Outline
1. Remind
2. Random Forest: Introduction; Classification And Regression Trees (CART); Algorithm workflow; The out of bag samples
3. Final Remarks
4. RANFOG: Input files; How to run
5. Examples

3 Remind What are ensembles? Ensembles are combinations of different methods (usually simple models). They have very good predictive ability because they exploit the complementarity and additivity of the models' performances. Ensembles generally have better predictive ability than the individual methods used separately. They have known statistical properties (they are not "black boxes"). "In a multitude of counselors there is safety."

4 Remind Building Ensembles: Two steps 1. Developing a population of varied models, also called base learners. These may be weak models (only slightly better than a random guess). Variety can come from using the same or different methods, Feature Subset Selection (FSS), the data values, or a partition of the input space. 2. Combining them to form a composite predictor: voting, estimated weights, or averaging.
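
The second step (combining the base learners) can be sketched in a few lines. This is a minimal illustration, assuming the base learners have already produced their predictions; the function names are hypothetical and do not come from the slides.

```python
from collections import Counter
from statistics import mean

def combine_by_vote(labels):
    """Majority vote over the class labels predicted by the base learners."""
    return Counter(labels).most_common(1)[0][0]

def combine_by_average(values):
    """Plain average of the numeric predictions of the base learners."""
    return mean(values)

def combine_by_weights(values, weights):
    """Weighted average; the weights are assumed to have been estimated elsewhere."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

votes = ["case", "control", "case"]   # toy predictions from three classifiers
estimates = [1.0, 2.0, 3.0]           # toy predictions from three regressors

print(combine_by_vote(votes))
print(combine_by_average(estimates))
print(combine_by_weights(estimates, [0.5, 0.25, 0.25]))
```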

7 Introduction Overview: some properties. Based on Classification And Regression Trees (CART). Uses randomization and bagging. Performs Feature Subset Selection. Convenient for classification problems. Fast computation. Results are simple for humans to interpret.

8 Introduction Overview: brief description
$y = c_0 + c_1 f_1(y, X) + c_2 f_2(y, X) + \dots + c_i f_i(y, X) + \dots + c_M f_M(y, X) + e$
Perform bootstrap on the data: $\Psi = (y, X)$. Build a CART ($f_i(y, X) = h_t(X)$). Repeat $n$ times to reduce residuals by a factor of $n$. Average the estimates: $c_0 = \mu$; $c_i = \frac{1}{n}$.
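
One way to read the "factor of n" is as a statement about the variance of the averaged estimate. The worked equation below is a sketch under an assumption the slides do not state: the n tree estimates share a common variance and a common pairwise correlation.

```latex
% Assumption (not in the slides): the n tree estimates \hat{y}_1, ..., \hat{y}_n
% have common variance \sigma^2 and pairwise correlation \rho.
\[
\operatorname{Var}\!\left(\frac{1}{n}\sum_{t=1}^{n}\hat{y}_t\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
\]
% Independent trees (\rho = 0) give \sigma^2 / n, i.e. a reduction by a factor of n.
% Correlated trees reduce the variance by less, which is why the trees are randomized.
```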

10 Classification And Regression Trees (CART) Let $\Psi = (y, X)$ be a set of data, with $y$ = vector of phenotypes (response variables) and $X = (x_1, x_2)$ = matrix of features:
$$\begin{pmatrix} y_1 & x_{11} & x_{12} \\ \vdots & \vdots & \vdots \\ y_i & x_{i1} & x_{i2} \\ \vdots & \vdots & \vdots \\ y_n & x_{n1} & x_{n2} \end{pmatrix}$$

11 Classification And Regression Trees (CART) Problem 1. Classification

12 Classification And Regression Trees (CART) Problem 1. Classification a) Heuristic search to decide the best feature and partition among all possible cases. b) Split the data into two branches.

13 Classification And Regression Trees (CART) Problem 1. Classification c) Repeat the heuristic search. d) Does the new split improve accuracy (e.g. MSE, misclassification rate, ...)? No: the estimate in that node is the average or majority vote of its observations; end the branch. Yes: split the node into two new branches.
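
The search in steps a) to d) can be sketched as an exhaustive scan over features and thresholds. This is a minimal sketch, assuming the Gini index as the accuracy criterion and splits of the form "value <= threshold"; the function names are illustrative and not taken from the slides.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y):
    """Return (feature, threshold, impurity) of the best split.

    The feature is None when no split improves on the unsplit node,
    which corresponds to ending the branch in step d).
    """
    n = len(y)
    best = (None, None, gini(y))
    for j in range(len(X[0])):
        # Every observed value except the largest is a candidate threshold.
        for thr in sorted(set(row[j] for row in X))[:-1]:
            left = [y[i] for i in range(n) if X[i][j] <= thr]
            right = [y[i] for i in range(n) if X[i][j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, thr, score)
    return best

# Toy data: two features, two classes.
X = [[0, 2], [1, 2], [2, 0], [2, 1]]
y = ["a", "a", "b", "b"]
print(best_split(X, y))
```

When both child nodes are large enough, the same function would be applied again to each branch.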

15 Classification And Regression Trees (CART) Problem 2. Regression

16 Classification And Regression Trees (CART) Problem 2. Regression a) Heuristic search to decide the best feature and partition among all possible cases. b) Split the data into two branches.

17 Classification And Regression Trees (CART) Problem 2. Regression c) Repeat the heuristic search. d) Does the new split improve accuracy (e.g. MSE, misclassification rate, ...)? No: the estimate in that node is the average or majority vote of its observations; end the branch. Yes: split the node into two new branches.
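
For the regression case, the same search can be sketched with a quadratic loss. This is a minimal sketch on a single feature, assuming the sum of squared deviations from the node mean as the criterion; the names and toy data are illustrative.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the node mean (n times the node MSE)."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)

def best_regression_split(x, y):
    """Try every threshold on one feature.

    Returns (threshold, loss, left_mean, right_mean), or None when no
    split lowers the loss of the unsplit node (the branch ends).
    """
    best = None
    parent = sse(y)
    for thr in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        loss = sse(left) + sse(right)
        if loss < parent and (best is None or loss < best[1]):
            best = (thr, loss, mean(left), mean(right))
    return best

x = [0, 0, 1, 2, 2]               # toy feature values
y = [1.0, 1.2, 2.0, 3.0, 3.2]     # toy phenotypes
print(best_regression_split(x, y))
```

The two means returned are the estimates that each new node would report if it became a terminal node.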

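20 Workflow-1 Data Let $\Psi = (y, X)$ be a set of data, with $y$ = vector of phenotypes (response variables) and $X$ = matrix of SNP genotypes:
$$\begin{pmatrix} y_1 & x_{11} & \dots & x_{1j} & \dots & x_{1p} \\ \vdots \\ y_i & x_{i1} & \dots & x_{ij} & \dots & x_{ip} \\ \vdots \\ y_n & x_{n1} & \dots & x_{nj} & \dots & x_{np} \end{pmatrix}$$
with $x_{ij}$ coded as 0, 1, 2 (most common); 11, 12, 21, 22 (haplotypes); or 10, 11, 01 (dummy variates).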

21 Workflow-1 Data With $\Psi = (y, X)$ as above ($y$ = vector of phenotypes, $X$ = $n \times p$ matrix of SNP genotypes), the prediction is the average over $T$ trees: $\hat{y} = \sum_{t=1}^{T} \frac{1}{T} \hat{h}_t(x)$

22 Workflow-1 Step 1. BOOTSTRAP THE DATA. $\Psi_t$ is a bootstrapped set of $n$ records drawn with replacement from the original training set. It contains approximately 63% of the original records (some records appear more than once and others not at all); the remaining 37% or so are kept out of bag (OOB samples).
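
The in-bag and out-of-bag proportions follow from sampling n records with replacement: a given record is missed with probability (1 - 1/n)^n, which tends to e^-1, about 0.368. A quick simulation, with an arbitrary seed and sample size chosen only for illustration:

```python
import random

def bootstrap_indices(n, rng):
    """Draw n record indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

rng = random.Random(2010)   # arbitrary seed, fixed for reproducibility
n = 10000
in_bag, oob = bootstrap_indices(n, rng)

unique_fraction = len(set(in_bag)) / n   # distinct records in the bootstrap sample
oob_fraction = len(oob) / n              # records never drawn

print(f"distinct in-bag records: {unique_fraction:.3f}")
print(f"out-of-bag records:      {oob_fraction:.3f}")
```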

23 Workflow-1 Step 2. SELECT A SNP TO SPLIT THE DATA IN TWO NEW BRANCHES. Select $m$ SNPs out of $p$ at random. Select the SNP $j \in \{1, \dots, m\}$ that minimizes a given loss function: $j = \arg\min_{j} L[\Psi(y, h_t(x_j))]$. Use heuristic methods: take a fresh look at the data and features that have arrived at the node and evaluate all possible splits. $L$ may be a quadratic loss function, entropy, Gini index, cost function, $L_2$, Huber loss function $L_h$, exponential loss function, $L_1$, ...
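
Step 2 can be sketched as follows. This is a minimal sketch, assuming genotypes coded 0, 1, 2 and a quadratic loss; `choose_snp` and the toy data are hypothetical and are not taken from RanFoG.

```python
import random
from statistics import mean

def sse(values):
    """Quadratic loss of a node: squared deviations from the node mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)

def choose_snp(X, y, m, rng):
    """Sample m of the p SNPs and return (loss, snp, threshold) of the best split.

    A record goes left when its genotype at the SNP is <= threshold.
    Returns None if none of the sampled SNPs can split the node.
    """
    p = len(X[0])
    best = None
    for j in rng.sample(range(p), m):
        for thr in (0, 1):                      # genotypes assumed coded 0, 1, 2
            left = [y[i] for i, row in enumerate(X) if row[j] <= thr]
            right = [y[i] for i, row in enumerate(X) if row[j] > thr]
            if not left or not right:
                continue
            loss = sse(left) + sse(right)
            if best is None or loss < best[0]:
                best = (loss, j, thr)
    return best

# Toy data: 6 individuals, 4 SNPs; only the SNP at index 2 affects the phenotype.
X = [[0, 1, 0, 2], [1, 0, 0, 1], [2, 1, 0, 0],
     [0, 2, 2, 1], [1, 0, 2, 2], [2, 2, 2, 0]]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]

print(choose_snp(X, y, m=4, rng=random.Random(1)))   # all SNPs are candidates
print(choose_snp(X, y, m=2, rng=random.Random(1)))   # only 2 random candidates
```

With m smaller than p the causal SNP is sometimes not among the candidates, which is how different trees end up using different SNPs.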

24 Workflow-1 Step 3. SPLIT THE NODE. Create two new branches according to whether or not an individual carries a given genotype at SNP $j$; e.g. individuals with allele A go to one branch, and individuals without allele A go to the other branch.

25 Workflow-1 Step 4. GROW TREE. Repeat steps 2-3 at each new node until a minimum node size (e.g. <5 records) is reached. The estimated phenotype is the average phenotype of the individuals in the terminal node (regression) or their majority vote (classification). Estimates for yet-to-be-observed records are calculated as follows: pass genotype $x_i$ down the tree until it reaches a terminal node; the estimate for individual $i$ is the one corresponding to the terminal node reached, $\hat{y}_{it} = h_t(x_i)$.
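
Growing one regression tree and passing a new genotype down it can be sketched recursively. This is a minimal sketch under stated assumptions: genotypes coded 0, 1, 2, quadratic loss, and a tree stored as nested dictionaries; none of these choices are confirmed by the slides.

```python
import random
from statistics import mean

def sse(values):
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)

def grow_tree(X, y, rows, m, min_size, rng):
    """Recursively grow a regression tree on the records indexed by rows."""
    node_values = [y[i] for i in rows]
    if len(rows) < min_size:                     # node too small: terminal node
        return {"leaf": mean(node_values)}
    best = None
    for j in rng.sample(range(len(X[0])), m):    # m random SNPs at this node
        for thr in (0, 1):
            left = [i for i in rows if X[i][j] <= thr]
            right = [i for i in rows if X[i][j] > thr]
            if not left or not right:
                continue
            loss = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or loss < best[0]:
                best = (loss, j, thr, left, right)
    if best is None or best[0] >= sse(node_values):   # no improvement: stop
        return {"leaf": mean(node_values)}
    _, j, thr, left, right = best
    return {"snp": j, "thr": thr,
            "left": grow_tree(X, y, left, m, min_size, rng),
            "right": grow_tree(X, y, right, m, min_size, rng)}

def predict(tree, genotype):
    """Pass a genotype down the tree until a terminal node is reached."""
    while "leaf" not in tree:
        branch = "left" if genotype[tree["snp"]] <= tree["thr"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Toy data: 6 individuals, 4 SNPs; the SNP at index 2 drives the phenotype.
X = [[0, 1, 0, 2], [1, 0, 0, 1], [2, 1, 0, 0],
     [0, 2, 2, 1], [1, 0, 2, 2], [2, 2, 2, 0]]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]

tree = grow_tree(X, y, list(range(6)), m=4, min_size=4, rng=random.Random(0))
print(tree)
print(predict(tree, [1, 1, 2, 1]))   # a genotype that was not in the training data
```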

26 Workflow-1 Step 5. GROW FOREST. Repeat steps 1-4 a large number of times, $T$. Average the estimates across trees to make the final predictions: $\hat{Y}_i = \frac{1}{T} \sum_{t=1}^{T} \hat{y}_{it}$
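
The forest loop itself is short. To keep the sketch compact, each tree below is truncated to a single split (a stump), which is a simplification and not what the slides describe; the bootstrap, the random choice of m SNPs and the final averaging are the parts being illustrated. All names are hypothetical.

```python
import random
from statistics import mean

def fit_stump(X, y, rows, m, rng):
    """One-split tree on the bootstrapped rows (simplified stand-in for a full tree)."""
    def sse(idx):
        mu = mean(y[i] for i in idx)
        return sum((y[i] - mu) ** 2 for i in idx)
    best = None
    for j in rng.sample(range(len(X[0])), m):
        for thr in (0, 1):                       # genotypes assumed coded 0, 1, 2
            left = [i for i in rows if X[i][j] <= thr]
            right = [i for i in rows if X[i][j] > thr]
            if left and right:
                loss = sse(left) + sse(right)
                if best is None or loss < best[0]:
                    best = (loss, j, thr,
                            mean(y[i] for i in left), mean(y[i] for i in right))
    if best is None:                             # no valid split: predict the mean
        mu = mean(y[i] for i in rows)
        return lambda g: mu
    _, j, thr, mu_left, mu_right = best
    return lambda g: mu_left if g[j] <= thr else mu_right

def grow_forest(X, y, n_trees, m, seed):
    rng = random.Random(seed)
    n = len(y)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]    # step 1: bootstrap
        forest.append(fit_stump(X, y, rows, m, rng))   # steps 2-4, truncated
    return forest

def predict_forest(forest, genotype):
    """Final prediction: average of the per-tree estimates."""
    return mean(tree(genotype) for tree in forest)

X = [[0, 1, 0, 2], [1, 0, 0, 1], [2, 1, 0, 0],
     [0, 2, 2, 1], [1, 0, 2, 2], [2, 2, 2, 0]]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]

forest = grow_forest(X, y, n_trees=50, m=2, seed=1)
print(predict_forest(forest, [0, 0, 0, 0]))
print(predict_forest(forest, [2, 2, 2, 2]))
```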

27 Workflow-1 Diagram

29 The out of bag samples OOB: out of bag samples. Records that are not sampled during the bootstrap process. These records are passed down the tree constructed with $\Psi_t$. The generalization error may be calculated as $\sum_{i=1}^{n_{oob}} (y_i - \hat{y}_i)^2$
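
For a single tree, the OOB calculation amounts to predicting only the records that were left out of its bootstrap sample. A minimal sketch, in which the "tree" is a hand-written rule standing in for a fitted tree and the data are toy values:

```python
def oob_indices(n, in_bag):
    """Records never drawn in the bootstrap sample of this tree."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]

def oob_sum_of_squares(tree, X, y, oob):
    """Sum over the OOB records of (y_i - yhat_i)^2, with yhat_i from this tree."""
    return sum((y[i] - tree(X[i])) ** 2 for i in oob)

def tree(genotype):
    """Hypothetical stand-in for a fitted tree: splits on the first SNP."""
    return 1.0 if genotype[0] <= 0 else 3.0

X = [[0], [2], [0], [2]]
y = [1.0, 3.5, 0.5, 3.0]
in_bag = [0, 3, 3, 0]                      # toy bootstrap sample of size n = 4

oob = oob_indices(len(y), in_bag)          # records 1 and 2
total = oob_sum_of_squares(tree, X, y, oob)
print(oob, total, total / len(oob))        # dividing by n_oob gives the OOB MSE
```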

30 The out of bag samples Variable Importance. The importance of each feature may be calculated using the OOB samples.
1. Pass the OOB records $\Psi_t^{oob}$ down the constructed tree.
2. Calculate the MSE or the loss $L$ of choice: $L_{oob}$.
3. Permute SNP $j \in \{1, \dots, p\}$ in the OOB records.
4. Pass the permuted records $\Psi_t^{oob,j}$ down the constructed tree.
5. Calculate the MSE or the loss $L$ of choice on the permuted records: $L_{oob}^{j}$.
6. The variable importance is the difference between the loss on the permuted OOB records and the loss on the non-permuted OOB records: $L_{oob}^{j} - L_{oob}$.

31 The out of bag samples The variable importance for each SNP is averaged across trees. Relative variable importances are obtained by dividing all variable importances by the largest average value obtained.
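
The permutation procedure and the rescaling can be sketched for one tree as follows. This is a minimal sketch, assuming MSE as the loss; the "tree" is a hand-written rule standing in for a fitted tree, and the averaged importances passed to `relative_importance` are made-up numbers.

```python
import random

def mse(tree, records, y):
    return sum((yi - tree(g)) ** 2 for g, yi in zip(records, y)) / len(y)

def permutation_importance(tree, X_oob, y_oob, j, rng):
    """Loss on the OOB records with SNP j permuted, minus the unpermuted loss."""
    base = mse(tree, X_oob, y_oob)
    column = [g[j] for g in X_oob]
    rng.shuffle(column)                        # permute SNP j among the OOB records
    permuted = [g[:j] + [v] + g[j + 1:] for g, v in zip(X_oob, column)]
    return mse(tree, permuted, y_oob) - base

def relative_importance(average_importance):
    """Divide by the largest value; assumes the largest value is positive."""
    largest = max(average_importance)
    return [v / largest for v in average_importance]

def tree(genotype):
    """Hypothetical stand-in for a fitted tree: it only uses the first SNP."""
    return 1.0 if genotype[0] <= 0 else 3.0

X_oob = [[0, 2], [2, 0], [0, 1], [2, 1]]
y_oob = [1.0, 3.0, 1.0, 3.0]
rng = random.Random(7)

print(permutation_importance(tree, X_oob, y_oob, 0, rng))   # SNP used by the tree
print(permutation_importance(tree, X_oob, y_oob, 1, rng))   # SNP ignored by the tree
print(relative_importance([0.5, 2.0, 1.0]))
```

A SNP that the tree never uses gets an importance of exactly zero, because permuting it cannot change any prediction.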

32 The out of bag samples Set up your own random forest algorithm. THINGS TO CHOOSE: $m$; $L(\cdot)$; the minimum number of observations in the terminal node.

33 Remarks Random Forest. Based on bagging (randomization). Feature Subset Selection. Provides feature importance (for SNP selection). Good performance in classification problems. The percentage of features used at each node, the loss function and the convergence criterion create personalized RF algorithms. Excellent predictive ability. Its decisions resemble the way a human would reach a diagnosis (easy interpretation).

35 Input files Prepare your data. The training and testing sets must have the same format. DO NOT INCLUDE A HEADER IN THE FILES!

36 Input files Parameter file

37 Input files Parameter file

39 How to run Run the program. Use Java to run it:
home> java -jar RanFoG_XXX.jar
Make sure the program, the parameter file and the data files are in the same folder.

40 How to run Run the program

41 How to run Tuning the number of iterations. Randomization or pseudo-randomization: RF is a Monte Carlo process. Random number generators actually use pseudo-random functions that depend on a seed. Run large forests, and run $n$ of them (the program automatically chooses a different seed each time). Obtain $n$ estimates for each individual in the testing set. Average the $n$ estimates of each individual to make the final predictions: $\hat{Y}_i = \frac{1}{n} \sum_{j=1}^{n} \hat{y}_{ij}$
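
Averaging over n forests can be sketched as a column-wise mean. In the sketch below `run_forest` is a hypothetical stand-in that returns noisy estimates from a seed; it does not call RanFoG or read its output files.

```python
import random
from statistics import mean

def run_forest(seed, true_values, noise=0.5):
    """Hypothetical stand-in for one forest run: truth plus seed-dependent noise."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise) for v in true_values]

def average_runs(runs):
    """runs[j][i] is the estimate for individual i from forest j."""
    return [mean(column) for column in zip(*runs)]

true_values = [1.0, 2.0, 3.0]                        # toy values for 3 individuals
runs = [run_forest(seed, true_values) for seed in range(20)]

print(runs[0])               # one forest: noisy
print(average_runs(runs))    # average of 20 forests: usually closer to the truth
```

The same seed always reproduces the same run, which is why each forest needs a different seed for the average to help.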

43 MSE or misclassification rate in the training set and OOB samples: Trees.txt
MSE or misclassification rate in the testing set: Trees.test
Predicted genomic breeding value in the training set: EGBV.txt
Predicted genomic breeding value in the testing set: Predictions.txt
Feature importance information: Variable_Importance.txt

44 Trees.txt Each line contains the record for the corresponding iteration. Each line contains: the MSE (regression) or misclassification rate (classification) for the bootstrapped samples in the training set; the MSE (regression) or misclassification rate (classification) for the OOB samples (tuning set).

45 Trees.txt [Example listing with columns: training, OOB_samples]

46 Trees.test Each line contains the estimates for the corresponding iteration. Each line contains: the MSE (regression) or misclassification rate (classification) in the testing set.

47 Trees.test [Example listing with column: testing]

48 EGBV.txt Each line contains the prediction for the corresponding animal in the training set. Each line contains: the ID; the estimated phenotype in this run.

49 EGBV.txt [Example listing with columns: ID, ŷ]

50 Predictions.txt Each line contains the prediction for the corresponding animal in the testing set. Each line contains: the ID; the estimated genomic value in this run.

51 Predictions.txt [Example listing with columns: ID, ŷ; the first ID shown is 801]

52 Variable_Importance.txt Each line contains the importance of one feature. Each line contains: the feature's order in the data files; the estimated variable importance. Divide by the largest value to obtain the relative importance of each feature (between 0 and 1).

53 Variable_Importance.txt [Example listing with columns: Feature, Variable_Importance]

55 Examples Forest Size

56 Examples mtry Goldstein et al. (2010) BMC Genetics.

57 Examples Predictions

58 Examples Variable importance Relative SNP importance in three different pig lines. González-Recio and Forni. (submitted).

59 Examples Predictive accuracy González-Recio and Forni. (submitted).

60 Examples Predictive accuracy González-Recio and Forni. (submitted).

61 Examples References Goldstein et al. (2010) BMC Genetics 11:49. Seni and Elder (2010) Ensemble Methods in Data Mining. Hastie et al. (2009) Elements of Statistical Learning. 2nd Edition.
