Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Proteins, 2006
Motivation Correctly identifying the set of interacting proteins in an organism is useful for deciphering the molecular mechanisms underlying given biological functions and for assigning functions to unknown proteins based on their interacting partners.
Introduction Types of protein-protein relationships: physical interaction; co-complex relationship; pathway co-membership. Lean mass protein complex: NOT INCLUDED IN STUDY!
Introduction Yeast protein-protein interactions (Jeong et al. 2001)
Direct Methods of PPI Prediction Current high-throughput experimental approaches have been applied to determine the set of interacting proteins: yeast two-hybrid (Y2H) and mass spectrometry. These methods have high rates of false positives and false negatives.
Direct Methods of PPI Prediction Two-hybrid (Y2H)
Direct Methods of PPI Prediction Tandem Affinity Purification Mass Spectrometry
Indirect Methods of PPI Prediction Gene expression data; Biological function (GO); Biological process (GO); Sequence similarity
Key Words PPI - protein-protein interaction; Gold Standard Dataset - data used to train and test an algorithm; Positive Examples - a set of known interacting protein pairs; Negative Examples - a set of randomly paired proteins believed not to interact with each other
Key Words Feature Encoding - how do we use the data we have? Detailed - each source is handled separately; Summary - combine similar sources
Goal Combine information from a variety of direct and indirect methods in a supervised learning framework to predict protein-protein interactions. Which combination of data, encoding style and classifier works best?
Past Studies Varying datasets, encoding styles and classifiers: Jansen et al. - Naïve Bayes, co-complex, summary; Lin et al. - Random Forest, Logistic Regression, co-complex, summary; Zhang et al. - Decision Tree, co-complex, detailed; etc.
Systematic Comparison Reference Datasets = {physical, co-complex, co-pathway}; Encoding Styles = {summary, detailed}; Classifiers = {DT, LR, NB, SVM, RF, krf}
Positive Examples Physical interactions - DIP (Database of Interacting Proteins); Co-complex interactions - MIPS (Munich Information Center for Protein Sequences); Co-pathway - KEGG (Kyoto Encyclopedia of Genes and Genomes)
Negative Examples Post-filtering randomized protein pairing (Zhang et al. 2004). Only a small fraction of total pairs within the datasets are interacting; ~99% of randomized data is non-interacting. Final training sets contained one positive example for every 600 negative pairs.
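A minimal sketch (not from the paper) of how random negative pairs might be generated and post-filtered against the known positives; the names proteins and positive_pairs are illustrative.

import random

def sample_negatives(proteins, positive_pairs, n_negatives, seed=0):
    """Randomly pair proteins and keep pairs not known to interact."""
    rng = random.Random(seed)
    positives = {frozenset(p) for p in positive_pairs}
    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(proteins, 2)      # pick two distinct proteins at random
        pair = frozenset((a, b))
        if pair not in positives:           # post-filter: drop known interactions
            negatives.add(pair)
    return [tuple(p) for p in negatives]

# e.g. roughly 600 negatives per positive, as in the training sets described above:
# negatives = sample_negatives(proteins, positive_pairs, 600 * len(positive_pairs))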
Features Used
Classification Algorithms SVM - Support Vector Machine NB - Naïve Bayes LR - Logistic Regression DT - Decision Tree RF - Random Forest krf - Random Forest-based k-nearest Neighbor
Support Vector Machine Basic idea: find the optimal hyperplane for linearly separating patterns; extend to patterns that are not linearly separable by transforming the data into a new space.
Support Vectors Support vectors are the data points that lie closest to the decision surface; they have a direct bearing on the optimal location of the decision surface.
Support Vector Machine Example: a linear decision boundary, y = mx + b; a circular boundary, r² = x² + y², which is not linearly separable in the original space.
Support Vector Machine As we move to higher dimensions the problem becomes much more complex
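A minimal scikit-learn sketch of the idea on a toy circular dataset rather than the protein-pair features (make_circles and the RBF kernel are assumptions for illustration, not the paper's setup):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Points on two concentric circles are not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # straight-line boundary (y = mx + b)
rbf_svm = SVC(kernel="rbf").fit(X, y)         # implicit transformation to a new space

print("linear accuracy:", linear_svm.score(X, y))
print("RBF accuracy:   ", rbf_svm.score(X, y))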
Naïve Bayes Basic idea: calculate the probability of the desired outcome (an interaction) given a set of observed characteristics, using Bayes' rule.
Bayes Rule
Char 1  Char 2  Char 3  Char 4  Char 5  Interaction
Y       Y       N       N       N       Y
Y       N       N       Y       Y       N
N       N       N       Y       Y       N
Y       Y       Y       Y       N       Y
N       N       Y       Y       N       N
Fraction of "Y" in each column:
0.6     0.4     0.4     0.8     0.4     0.4
Naïve Bayes Take the product across all characteristics (X_i), assuming each characteristic is independent of the others given the outcome, and that there is an interaction (Y = 1).
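A from-scratch sketch of the product rule, P(Y=1|X) ∝ P(Y=1) · ∏ P(X_i|Y=1), using the toy table above; the query pair is made up for illustration and Laplace smoothing is an added assumption.

import numpy as np

# Rows of the toy table: five characteristics per pair, plus the interaction label
X = np.array([[1,1,0,0,0],
              [1,0,0,1,1],
              [0,0,0,1,1],
              [1,1,1,1,0],
              [0,0,1,1,0]])
y = np.array([1,0,0,1,0])

def naive_bayes_score(query, X, y, label):
    """Unnormalized P(Y=label) * prod_i P(X_i = query_i | Y=label), Laplace-smoothed."""
    Xl = X[y == label]
    prior = len(Xl) / len(X)
    likelihoods = (np.sum(Xl == query, axis=0) + 1) / (len(Xl) + 2)
    return prior * np.prod(likelihoods)

query = np.array([1,1,0,1,0])                      # hypothetical new protein pair
scores = {c: naive_bayes_score(query, X, y, c) for c in (0, 1)}
print(scores[1] / (scores[0] + scores[1]))         # normalized P(interaction | query)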
Logistic Regression Basic idea of Logistic Regression: a statistical regression model for binary dependent variables.
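A minimal sketch of the logistic model, p(interaction) = 1 / (1 + exp(-(w·x + b))), fit with scikit-learn on the same toy table (used here only for illustration, not the paper's data):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1,1,0,0,0],[1,0,0,1,1],[0,0,0,1,1],[1,1,1,1,0],[0,0,1,1,0]])
y = np.array([1,0,0,1,0])

model = LogisticRegression().fit(X, y)
# Probability of the binary dependent variable (interaction) for a hypothetical pair
print(model.predict_proba([[1,1,0,1,0]])[0, 1])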
Decision Tree Basic idea of tree-based methods: construct a binary tree where each node represents a test on a given characteristic and each leaf contains the decision. The root contains all protein pairs; at each node pairs are separated into two groups, representing presence or absence of a characteristic.
Decision Tree How do we decide which characteristic to use when separating the data? Gini index: looks at the largest class in the target and tries to find a split, using a feature, that isolates it from the other classes. A perfect series of splits would end up with k pure child nodes. If costs are assigned, we can identify the most costly (most important) feature, the one which tends to drive the cases into a single class.
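A small sketch of the Gini index and of choosing the split with the largest impurity decrease, reusing the toy table from the Naïve Bayes example (helper names are illustrative):

import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(feature, labels):
    """Impurity decrease when splitting on presence/absence of a binary feature."""
    left, right = labels[feature == 1], labels[feature == 0]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

X = np.array([[1,1,0,0,0],[1,0,0,1,1],[0,0,0,1,1],[1,1,1,1,0],[0,0,1,1,0]])
y = np.array([1,0,0,1,0])
best = max(range(X.shape[1]), key=lambda j: gini_decrease(X[:, j], y))
print("split on characteristic", best + 1)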
Decision Tree Example tree predicting Interaction? - splits on Similar Gene Expression, Sequence Similarity (95%), GO Annotation (Level 3) and further characteristics (Characteristics 1-3); leaves labeled Y/N.
Pruning After splitting stops, the next step is to prune the tree: cut off branches that provide the least additional predictive power, and cut off weak branches with high misclassification rates, to improve accuracy.
Decision Tree Example pruned tree predicting Interaction? - retains splits on Gene Regulation (2-fold), Sequence Similarity (95%), GO Annotation (Level 3) and Characteristic 3; leaves labeled Y/N.
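A sketch of one standard way to realize the pruning idea above, cost-complexity pruning in scikit-learn (an assumption for illustration; the exact pruning procedure is not specified here, and the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate pruning strengths: larger alpha cuts off weaker branches
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f} leaves={tree.get_n_leaves()} test acc={tree.score(X_te, y_te):.3f}")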
Random Forest Based on the same idea as the Decision Tree, except that we take random subsets of features and construct multiple trees. Classification is chosen based on majority support; 200 trees for each run.
Random Forest Example tree votes: 1 1 0 1 0 1. Based on the majority rule we would consider the tested pair an interacting pair.
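A minimal scikit-learn sketch of the majority-vote idea with 200 trees (synthetic data standing in for the protein-pair features):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Class chosen by each individual tree for one example pair; the forest combines these votes
votes = np.array([tree.predict(X[:1])[0] for tree in rf.estimators_])
print("interacting votes:", int(votes.sum()), "of", len(votes))
print("forest prediction:", rf.predict(X[:1])[0])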
k-nearest Neighbor Based on the same idea as Random Forest, except that we calculate a similarity matrix from the tree comparison values. Classification is chosen based on the k nearest neighbors (the authors do not specify the value of k used).
k-nearest Neighbor Example: the vector of per-tree outputs, e.g. <1,1,0,1,1,0,1,...>, is used to plot each protein pair in n-dimensional space (n = 200 trees).
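One way to realize the idea above in scikit-learn (an illustration only; the paper's exact similarity computation may differ, and k = 5 is an arbitrary choice since the value of k is not specified): represent each pair by its vector of 200 per-tree predictions and run k-nearest neighbors on those vectors.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, y_tr, X_te = X[:800], y[:800], X[800:]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

def tree_vectors(forest, X):
    """One 200-dimensional vector of per-tree predictions for each protein pair."""
    return np.array([tree.predict(X) for tree in forest.estimators_]).T

knn = KNeighborsClassifier(n_neighbors=5, metric="hamming")
knn.fit(tree_vectors(rf, X_tr), y_tr)
print(knn.predict(tree_vectors(rf, X_te))[:10])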
Performance Evaluation Each model was trained with 30,000 protein pairs and then tested on a different 30,000. Performance is assessed with precision vs. recall plots and receiver operating characteristic (ROC) curves.
Precision vs. Recall
             Prediction: T                        Prediction: F
Reality: T   True Positive (TP)                   False Negative, Type II Error (FN)
Reality: F   False Positive, Type I Error (FP)    True Negative (TN)
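For reference, the standard definitions in terms of the table above: Precision = TP / (TP + FP), the fraction of predicted interactions that are real; Recall = TP / (TP + FN), the fraction of real interactions that are recovered.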
ROC Curves Plot of true positives vs. false positives. The area under the curve is used as a measure of diagnostic accuracy; the area is measured until 50 false positives are found.
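A small sketch of one way to compute such a truncated curve from classifier scores, counting true positives against false positives and stopping at 50 false positives (an illustration with synthetic scores, not necessarily the paper's exact computation):

import random

def truncated_auc(scores, labels, max_fp=50):
    """Area under the TP-vs-FP count curve, stopped at max_fp false positives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])   # rank by decreasing score
    tp = fp = area = 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
            area += tp          # each new false positive adds a column of height tp
            if fp >= max_fp:
                break
    return area

# Illustrative example with synthetic labels and better-than-random scores
random.seed(0)
labels = [random.randint(0, 1) for _ in range(1000)]
scores = [random.random() + 0.2 * lab for lab in labels]
print(truncated_auc(scores, labels))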
Performance Comparison
Feature Importance Gene expression data is the most important feature for recovering all types of interactions.
Feature Composition
Conclusions Co-complex relationships are the easiest to predict. The detailed encoding style is preferred. The Random Forest classifier performs the best. Different features have different importance in predicting protein interactions.
Questions?