Evaluation of different biological data and computational classification methods for use in protein interaction prediction.


1 Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Proteins, 2006

2 Motivation Correctly identifying the set of interacting proteins in an organism is useful for deciphering the molecular mechanisms underlying given biological functions and for assigning functions to unknown proteins based on their interacting partners.

3 Introduction Physical interaction

4 Introduction Co-complex relationship

5 Introduction Pathway co-membership

7 Introduction Lean mass protein complex (not included in the study)

8 Introduction Yeast protein-protein interactions (Jeong et al. 2001)

10 Direct Methods of PPI Prediction
Current high-throughput experimental approaches have been applied to determine the set of interacting proteins:
Two-hybrid (Y2H)
Mass spectrometry
These methods have high rates of false positives and false negatives.

11 Direct Methods of PPI Prediction Two-hybrid (Y2H)

12 Direct Methods of PPI Prediction Tandem Affinity Purification Mass Spectrometry

16 Indirect Methods of PPI Prediction
Gene expression data
Biological function (GO)
Biological process (GO)
Sequence similarity

20 Key Words
PPI - protein-protein interaction
Gold Standard Dataset - data used to train and test an algorithm
Positive Examples - a set of known interacting protein pairs
Negative Examples - a set of randomly paired proteins believed not to interact with each other

23 Key Words
Feature Encoding - how do we use the data we have?
Detailed - each source is handled separately
Summary - combine similar sources
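
The difference between the two encoding styles can be sketched as follows. The dataset names, the scores and the choice of averaging as the "summary" are assumptions made for illustration; the slides do not say how similar sources are combined.

```python
from statistics import mean

# Hypothetical evidence for one protein pair: correlation scores from three
# separate gene expression datasets (names and values are made up).
expression_scores = {"expr_set_a": 0.82, "expr_set_b": 0.10, "expr_set_c": 0.45}


def detailed_encoding(scores):
    """Detailed style: one feature per individual source."""
    return [scores[name] for name in sorted(scores)]


def summary_encoding(scores):
    """Summary style: similar sources collapsed into one feature.

    Averaging is an assumption of this sketch; other summaries (maximum,
    number of supporting sources) would fit the slide equally well.
    """
    return [mean(scores.values())]


detailed = detailed_encoding(expression_scores)  # three features
summary = summary_encoding(expression_scores)    # one feature
```

The detailed vector keeps each source visible to the classifier, while the summary vector is shorter but hides which source supplied the evidence.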

25 Goal
Combine information from a variety of direct and indirect methods in a supervised learning framework to predict protein-protein interactions
Which one is the best?

26 Past Studies
Varying datasets, encoding styles and classifiers
Jansen et al. - Naïve Bayes, Co-complex, Summary
Lin et al. - Random Forest, Logistic Regression, Co-complex, Summary
Zhang et al. - Decision Tree, Co-complex, Detailed
etc.

31 Systematic Comparison
Reference Datasets = {physical, co-complex, co-pathway}
Encoding Styles = {summary, detailed}
Classifiers = {DT, LR, NB, SVM, RF, krf}
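
Read as a full grid, the three sets give 3 x 2 x 6 = 36 experimental settings. That reading is an assumption; the slides do not state whether every combination was actually run. A minimal sketch of enumerating the grid:

```python
from itertools import product

reference_datasets = ["physical", "co-complex", "co-pathway"]
encoding_styles = ["summary", "detailed"]
classifiers = ["DT", "LR", "NB", "SVM", "RF", "krf"]

# Every (dataset, encoding, classifier) combination in the comparison grid.
experiments = list(product(reference_datasets, encoding_styles, classifiers))
```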

35 Positive Examples
Physical interactions - DIP (Database of Interacting Proteins)
Co-complex interactions - MIPS (Munich Information Center for Protein Sequences)
Co-pathway - KEGG (Kyoto Encyclopedia of Genes and Genomes)

36 Positive Examples

38 Negative Examples
Post-filtering randomized protein pairing (Zhang et al. 2004)
Only a small fraction of all possible protein pairs interact, so ~99% of randomly paired proteins are non-interacting
Final training sets contained one positive example for every 600 negative pairs
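
Random pairing followed by filtering can be sketched as below. The protein identifiers, the positive pairs and the function name `sample_negatives` are hypothetical; only the 1:600 ratio comes from the slide.

```python
import random


def sample_negatives(proteins, positives, n_negatives, seed=0):
    """Draw random protein pairs, discarding any known positive pair."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in positives}
    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:           # post-filter: drop known interactions
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


proteins = [f"P{i:03d}" for i in range(100)]      # hypothetical protein IDs
positives = [("P000", "P001"), ("P002", "P003")]  # hypothetical known pairs
negatives = sample_negatives(proteins, positives, n_negatives=600 * len(positives))
```

Because some true interactions are still undiscovered, a set built this way is only believed to be negative, which matches the definition of negative examples given earlier in the deck.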

39 Features Used

40 Classification Algorithms
SVM - Support Vector Machine
NB - Naïve Bayes
LR - Logistic Regression
DT - Decision Tree
RF - Random Forest
krf - Random Forest-based k-nearest Neighbor

42 Support Vector Machine
Basic idea of support vector machines
Find the optimal hyperplane for linearly separating patterns
Extend to patterns that are not linearly separable by transforming the data into a new space

44 Support Vector
Support vectors are the data points that lie closest to the decision surface
They have a direct bearing on the optimum location of the decision surface

45 Support Vector Machine

46 Support Vector Machine Y = mx + b

49 Support Vector Machine

50 Support Vector Machine $r^2 = X^2 + Y^2$
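
The transformation idea can be illustrated with toy data in which one class sits inside a circle and the other outside it. No straight line separates them in the (X, Y) plane, but a single threshold does once each point is mapped to r^2. This sketch only illustrates the change of space; it does not train an SVM or maximise a margin, and the data and threshold are made up.

```python
import random

rng = random.Random(0)

# Toy data: class 1 inside a circle of radius 1, class 0 in a ring outside it.
points, labels = [], []
for _ in range(200):
    x, y = rng.uniform(-2, 2), rng.uniform(-2, 2)
    r2 = x * x + y * y
    if r2 < 0.8 or 1.2 < r2 < 4.0:  # leave a small gap around r^2 = 1
        points.append((x, y))
        labels.append(1 if r2 < 1.0 else 0)


def transform(point):
    """Map (X, Y) to the single feature r^2 = X^2 + Y^2."""
    x, y = point
    return x * x + y * y


# In the transformed space one threshold separates the classes;
# the value 1.0 is chosen by hand here, not learned.
predictions = [1 if transform(p) < 1.0 else 0 for p in points]
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
```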

51 Support Vector Machine

52 Support Vector Machine As we move to higher dimensions the problem becomes much more complex

54 Naïve Bayes
Basic idea of Naïve Bayes
Calculate the probability of an outcome (e.g. an interaction) given a set of characteristics, from the probability of each characteristic given that outcome
Bayes rule

55 Bayes Rule
Char 1  Char 2  Char 3  Char 4  Char 5  Interaction
Y       Y       N       N       N       Y
Y       N       N       Y       Y       N
N       N       N       Y       Y       N
Y       Y       Y       Y       N       Y
N       N       Y       Y       N       N

56 Naïve Bayes Take the product across all characteristics ($X_i$), assuming that the characteristics are independent of each other given that there is an interaction ($Y = 1$)
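
A minimal sketch of that product, using the values from the Bayes Rule slide. Two assumptions are made here: the flattened values are read row by row (five characteristics followed by the interaction label), and Laplace smoothing is added so that an unseen value does not force a probability of zero. Neither is stated in the slides.

```python
# Y/N values encoded as 1/0; each row is (characteristics, interaction).
table = [
    ((1, 1, 0, 0, 0), 1),
    ((1, 0, 0, 1, 1), 0),
    ((0, 0, 0, 1, 1), 0),
    ((1, 1, 1, 1, 0), 1),
    ((0, 0, 1, 1, 0), 0),
]


def naive_bayes_posterior(table, query, alpha=1.0):
    """Return P(Y = 1 | query) under the naive independence assumption.

    alpha is a Laplace smoothing constant (an addition for this sketch).
    """
    scores = {}
    for label in (0, 1):
        subset = [x for x, y in table if y == label]
        score = len(subset) / len(table)  # prior P(Y = label)
        for i, value in enumerate(query):
            matches = sum(1 for x in subset if x[i] == value)
            # smoothed estimate of P(X_i = value | Y = label)
            score *= (matches + alpha) / (len(subset) + 2 * alpha)
        scores[label] = score
    return scores[1] / (scores[0] + scores[1])


posterior = naive_bayes_posterior(table, query=(1, 1, 0, 1, 0))
```

Dividing by the sum of the two class scores turns the products into a probability, so the value can be compared against a decision threshold.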

57 Logistic Regression
Basic idea of logistic regression
Statistical regression model for binary dependent variables
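
For a binary outcome, logistic regression passes a weighted sum of the features through the logistic function to obtain a probability. The sketch below shows only the prediction step; the weights, bias and feature values are hypothetical, and in practice they would be fitted to the training pairs.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_interaction_probability(features, weights, bias):
    """P(interaction = 1 | features) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for three features (not fitted to any data).
weights = [1.5, 0.8, -0.3]
bias = -2.0
p = predict_interaction_probability([1.0, 1.0, 0.0], weights, bias)
```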

58 Decision Tree
Basic idea of tree-based methods
Construct a binary tree where each node represents a filter for a given characteristic and each leaf contains the decision
The root contains all protein pairs; at each node, pairs are separated into two categories, representing presence or absence of a characteristic

62 Decision Tree
How do we decide which characteristic to use when separating data?
Gini index
Looks at the largest class in the target and tries to find a split, using a feature, to isolate it from the other classes
A perfect series of splits would end up with k pure child nodes
If costs are assigned, we could isolate the most costly feature (most important), the one which tends to drive the cases into a single class
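
One common way to apply the Gini index is to score each candidate split by the weighted impurity of its two children and pick the lowest. The sketch below follows that formulation with made-up data; the slides do not give the exact variant used.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, labels, feature):
    """Weighted Gini impurity of the two children after a yes/no split."""
    yes = [y for x, y in zip(rows, labels) if x[feature]]
    no = [y for x, y in zip(rows, labels) if not x[feature]]
    n = len(labels)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)


def best_feature(rows, labels):
    """Pick the feature whose split leaves the purest children."""
    return min(range(len(rows[0])), key=lambda f: split_impurity(rows, labels, f))


# Hypothetical protein pairs: (similar expression, sequence similarity)
rows = [(1, 0), (1, 1), (0, 1), (0, 0)]
labels = [1, 1, 0, 0]  # 1 = interacting
chosen = best_feature(rows, labels)
```

Here the first feature splits the pairs into two pure children, so it is preferred over the second, which leaves both children evenly mixed.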

67 Decision Tree [Figure: example tree for the question "Interaction?", with nodes testing similar gene expression, sequence similarity (95%) and GO annotation (level 3), labelled Characteristic 1 to 3, and Y/N leaves]

69 Pruning
After splitting stops, the next step is to prune the tree
Cut off branches that provide the least additional predictive power
Cut off weak branches with high misclassification rates
Pruning improves accuracy

70 Decision Tree [Figure: tree for the question "Interaction?", with nodes gene regulation (2-fold), sequence similarity (95%) and GO annotation (level 3), Characteristic 3, and Y/N leaves]

72 Random Forest
Based on the same idea as Decision Tree, only we take random subsets of features and construct multiple trees simultaneously
Classification is chosen based on majority support
200 trees for each run

73 Random Forest

74 Random Forest

75 Random Forest Based on majority rule we would consider the pair tested as an interacting pair
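
The majority rule can be sketched in a few lines. The votes are hypothetical, and only five trees are shown rather than the 200 used per run.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the class predicted by most trees (ties go to the class seen first)."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical votes from five trees for one protein pair (1 = interacting).
votes = [1, 0, 1, 1, 0]
prediction = majority_vote(votes)
```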

77 k-nearest Neighbor
Based on the same idea as Random Forest, only we calculate a similarity matrix based on the tree comparison values
Classification is chosen based on the k nearest neighbors
The authors do not specify the value of k used

78 k-nearest Neighbor <1,1,0,1,1, 0,1> Vector is used to plot data in n-dimensional space (n = 200)
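
A minimal sketch of nearest-neighbor classification on such vote vectors. Hamming distance, the value k = 3 and the six-tree vectors are assumptions made for illustration; the similarity measure actually used by the authors may differ.

```python
from collections import Counter


def hamming(u, v):
    """Number of trees on which two vote vectors disagree."""
    return sum(a != b for a, b in zip(u, v))


def knn_predict(train_vectors, train_labels, query, k):
    """Label the query by majority among its k nearest training vectors."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: hamming(train_vectors[i], query))
    nearest = [train_labels[i] for i in ranked[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical vote vectors from n = 6 trees (the slides use n = 200).
train_vectors = [
    (1, 1, 0, 1, 1, 1),
    (1, 1, 1, 1, 0, 1),
    (0, 0, 0, 1, 0, 0),
    (0, 1, 0, 0, 0, 0),
]
train_labels = [1, 1, 0, 0]
label = knn_predict(train_vectors, train_labels, query=(1, 1, 0, 1, 0, 1), k=3)
```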

79 k-nearest Neighbor

80 k-nearest Neighbor

81 k-nearest Neighbor

83 Performance Evaluation
Decision model was trained with 30,000 protein pairs and then tested with a different 30,000
Plot precision vs. recall
Receiver operating characteristic (ROC) curves

84 Precision vs. Recall
               Reality T                            Reality F
Prediction T   True Positive (TP)                   False Positive, Type I Error (FP)
Prediction F   False Negative, Type II Error (FN)   True Negative (TN)
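
From these four counts, precision is TP / (TP + FP) and recall is TP / (TP + FN). A short sketch with hypothetical counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts at one score threshold.
precision, recall = precision_recall(tp=40, fp=10, fn=60)
```

Neither quantity uses the true negatives, which is why the pair stays informative when negatives vastly outnumber positives, as in the 1:600 training sets.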

86 ROC Curves
Plot of true positives vs. false positives
Area under the curve is used as a measure of diagnostic accuracy
Area measured until 50 false positives are found
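
An area measured only up to a fixed number of false positives can be sketched as below. The normalisation to the range 0 to 1 and the handling of tied scores (ignored here) are choices made for this sketch, not details taken from the slides.

```python
def partial_auc(scores, labels, max_false_positives=50):
    """Area under the ROC curve up to a fixed number of false positives.

    Pairs are ranked by decreasing score. Each time a negative is passed,
    the number of positives already recovered is added to the area. The
    result is divided by the best possible area so that it lies in [0, 1].
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(labels)
    true_pos = false_pos = area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos
            if false_pos == max_false_positives:
                break
    if total_pos == 0 or false_pos == 0:
        return 0.0
    return area / (total_pos * false_pos)


# Hypothetical scores: two positives ranked first, one after a negative.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
score = partial_auc(scores, labels, max_false_positives=2)
```

Stopping early focuses the measure on the top of the ranked list, which is the part that matters when only the highest-scoring predictions would be followed up.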

87 Performance Comparison

88 Feature Importance Gene expression data is the most important in recovering all types of interactions

89 Feature Composition

93 Conclusions
Co-complex relationships are the easiest to predict
Detailed encoding style is preferred
Random Forest classifier performs the best
Different features have different importance in predicting protein interactions

94 Conclusions Questions?
