Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Proteins, 2006
Motivation Correctly identifying the set of interacting proteins in an organism is useful for deciphering the molecular mechanisms underlying given biological functions and for assigning functions to unknown proteins based on their interacting partners.
Introduction Types of protein-protein relationships: physical interaction; co-complex relationship; pathway co-membership. Lean mass protein complex: NOT INCLUDED IN STUDY!
Introduction Yeast protein-protein interactions (Jeong et al. 2001)
Direct Methods of PPI Prediction Current high-throughput experimental approaches have been applied to determine the set of interacting proteins: yeast two-hybrid (Y2H) and mass spectrometry. These methods have high rates of false positives and false negatives.
Direct Methods of PPI Prediction Two-hybrid (Y2H)
Direct Methods of PPI Prediction Tandem Affinity Purification Mass Spectrometry
Indirect Methods of PPI Prediction Gene expression data; Biological function (GO); Biological process (GO); Sequence similarity
Key Words PPI - protein-protein interaction; Gold Standard Dataset - data used to train and test an algorithm; Positive Examples - a set of known interacting protein pairs; Negative Examples - a set of randomly paired proteins believed not to interact with each other
Key Words Feature Encoding - how do we use the data we have? Detailed - each source is handled separately; Summary - combine similar sources
Goal Combine information from a variety of direct and indirect methods in a supervised learning framework to predict protein-protein interactions. Which combination of data, encoding style and classifier works best?
Past Studies Varying datasets, encoding styles and classifiers: Jansen et al. - Naïve Bayes, co-complex, summary; Lin et al. - Random Forest, Logistic Regression, co-complex, summary; Zhang et al. - Decision Tree, co-complex, detailed; etc.
Systematic Comparison Reference Datasets = {physical, co-complex, co-pathway}; Encoding Styles = {summary, detailed}; Classifiers = {DT, LR, NB, SVM, RF, krf}
Positive Examples Physical interactions - DIP (Database of Interacting Proteins); Co-complex interactions - MIPS (Munich Information Center for Protein Sequences); Co-pathway - KEGG (Kyoto Encyclopedia of Genes and Genomes)
Negative Examples Post-filtering randomized protein pairing (Zhang et al. 2004). Only a small fraction of total pairs within the datasets are interacting; ~99% of randomized data is non-interacting. Final training sets contained one positive example for every 600 negative pairs.
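A minimal sketch (not from the paper) of how random negative pairs might be generated and post-filtered against the known positives; the names proteins and positive_pairs are illustrative.

import random

def sample_negatives(proteins, positive_pairs, n_negatives, seed=0):
    """Randomly pair proteins and keep pairs not known to interact."""
    rng = random.Random(seed)
    positives = {frozenset(p) for p in positive_pairs}
    negatives = set()
    while len(negatives) < n_negatives:
        a, b = rng.sample(proteins, 2)      # pick two distinct proteins at random
        pair = frozenset((a, b))
        if pair not in positives:           # post-filter: drop known interactions
            negatives.add(pair)
    return [tuple(p) for p in negatives]

# e.g. roughly 600 negatives per positive, as in the training sets described above:
# negatives = sample_negatives(proteins, positive_pairs, 600 * len(positive_pairs))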
Features Used
Classification Algorithms SVM - Support Vector Machine NB - Naïve Bayes LR - Logistic Regression DT - Decision Tree RF - Random Forest krf - Random Forest-based k-nearest Neighbor
Support Vector Machine Basic idea: find the optimal hyperplane for linearly separating patterns; extend to patterns that are not linearly separable by transforming the data into a new space.
Support Vectors Support vectors are the data points that lie closest to the decision surface; they have a direct bearing on the optimal location of the decision surface.
Support Vector Machine Example: a linear decision boundary, y = mx + b; a circular boundary, r² = x² + y², which is not linearly separable in the original space.
Support Vector Machine As we move to higher dimensions the problem becomes much more complex
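A minimal scikit-learn sketch of the idea on a toy circular dataset rather than the protein-pair features (make_circles and the RBF kernel are assumptions for illustration, not the paper's setup):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Points on two concentric circles are not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # straight-line boundary (y = mx + b)
rbf_svm = SVC(kernel="rbf").fit(X, y)         # implicit transformation to a new space

print("linear accuracy:", linear_svm.score(X, y))
print("RBF accuracy:   ", rbf_svm.score(X, y))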
Naïve Bayes Basic idea: calculate the probability of the desired outcome (an interaction) given a set of observed characteristics, using Bayes' rule.
Bayes Rule
Char 1  Char 2  Char 3  Char 4  Char 5  Interaction
Y       Y       N       N       N       Y
Y       N       N       Y       Y       N
N       N       N       Y       Y       N
Y       Y       Y       Y       N       Y
N       N       Y       Y       N       N
Fraction of "Y" in each column:
0.6     0.4     0.4     0.8     0.4     0.4
Naïve Bayes Take the product across all characteristics (X_i), assuming each characteristic is independent of the others given the outcome, and that there is an interaction (Y = 1).
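A from-scratch sketch of the product rule, P(Y=1|X) ∝ P(Y=1) · ∏ P(X_i|Y=1), using the toy table above; the query pair is made up for illustration and Laplace smoothing is an added assumption.

import numpy as np

# Rows of the toy table: five characteristics per pair, plus the interaction label
X = np.array([[1,1,0,0,0],
              [1,0,0,1,1],
              [0,0,0,1,1],
              [1,1,1,1,0],
              [0,0,1,1,0]])
y = np.array([1,0,0,1,0])

def naive_bayes_score(query, X, y, label):
    """Unnormalized P(Y=label) * prod_i P(X_i = query_i | Y=label), Laplace-smoothed."""
    Xl = X[y == label]
    prior = len(Xl) / len(X)
    likelihoods = (np.sum(Xl == query, axis=0) + 1) / (len(Xl) + 2)
    return prior * np.prod(likelihoods)

query = np.array([1,1,0,1,0])                      # hypothetical new protein pair
scores = {c: naive_bayes_score(query, X, y, c) for c in (0, 1)}
print(scores[1] / (scores[0] + scores[1]))         # normalized P(interaction | query)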
Logistic Regression Basic idea of Logistic Regression: a statistical regression model for binary dependent variables.
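A minimal sketch of the logistic model, p(interaction) = 1 / (1 + exp(-(w·x + b))), fit with scikit-learn on the same toy table (used here only for illustration, not the paper's data):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1,1,0,0,0],[1,0,0,1,1],[0,0,0,1,1],[1,1,1,1,0],[0,0,1,1,0]])
y = np.array([1,0,0,1,0])

model = LogisticRegression().fit(X, y)
# Probability of the binary dependent variable (interaction) for a hypothetical pair
print(model.predict_proba([[1,1,0,1,0]])[0, 1])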
Decision Tree Basic idea of tree-based methods: construct a binary tree where each node represents a test on a given characteristic and each leaf contains the decision. The root contains all protein pairs; at each node pairs are separated into two groups, representing presence or absence of a characteristic.
Decision Tree How do we decide which characteristic to use when separating the data? Gini index: looks at the largest class in the target and tries to find a split, using a feature, that isolates it from the other classes. A perfect series of splits would end up with k pure child nodes. If costs are assigned, we can identify the most costly (most important) feature, the one which tends to drive the cases into a single class.
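A small sketch of the Gini index and of choosing the split with the largest impurity decrease, reusing the toy table from the Naïve Bayes example (helper names are illustrative):

import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(feature, labels):
    """Impurity decrease when splitting on presence/absence of a binary feature."""
    left, right = labels[feature == 1], labels[feature == 0]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

X = np.array([[1,1,0,0,0],[1,0,0,1,1],[0,0,0,1,1],[1,1,1,1,0],[0,0,1,1,0]])
y = np.array([1,0,0,1,0])
best = max(range(X.shape[1]), key=lambda j: gini_decrease(X[:, j], y))
print("split on characteristic", best + 1)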
Decision Tree Example tree predicting Interaction? - splits on Similar Gene Expression, Sequence Similarity (95%), GO Annotation (Level 3) and further characteristics (Characteristics 1-3); leaves labeled Y/N.
Pruning After splitting stops, the next step is to prune the tree: cut off branches that provide the least additional predictive power, and cut off weak branches with high misclassification rates, to improve accuracy.
Decision Tree Example pruned tree predicting Interaction? - retains splits on Gene Regulation (2-fold), Sequence Similarity (95%), GO Annotation (Level 3) and Characteristic 3; leaves labeled Y/N.
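A sketch of one standard way to realize the pruning idea above, cost-complexity pruning in scikit-learn (an assumption for illustration; the exact pruning procedure is not specified here, and the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate pruning strengths: larger alpha cuts off weaker branches
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f} leaves={tree.get_n_leaves()} test acc={tree.score(X_te, y_te):.3f}")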
Random Forest Based on the same idea as the Decision Tree, except that we take random subsets of features and construct multiple trees. Classification is chosen based on majority support; 200 trees for each run.
Random Forest Example tree votes: 1 1 0 1 0 1. Based on the majority rule we would consider the tested pair an interacting pair.
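A minimal scikit-learn sketch of the majority-vote idea with 200 trees (synthetic data standing in for the protein-pair features):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Class chosen by each individual tree for one example pair; the forest combines these votes
votes = np.array([tree.predict(X[:1])[0] for tree in rf.estimators_])
print("interacting votes:", int(votes.sum()), "of", len(votes))
print("forest prediction:", rf.predict(X[:1])[0])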
k-nearest Neighbor Based on the same idea as Random Forest, except that we calculate a similarity matrix from the tree comparison values. Classification is chosen based on the k nearest neighbors (the authors do not specify the value of k used).
k-nearest Neighbor Example: the vector of per-tree outputs, e.g. <1,1,0,1,1,0,1,...>, is used to plot each protein pair in n-dimensional space (n = 200 trees).
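One way to realize the idea above in scikit-learn (an illustration only; the paper's exact similarity computation may differ, and k = 5 is an arbitrary choice since the value of k is not specified): represent each pair by its vector of 200 per-tree predictions and run k-nearest neighbors on those vectors.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, y_tr, X_te = X[:800], y[:800], X[800:]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

def tree_vectors(forest, X):
    """One 200-dimensional vector of per-tree predictions for each protein pair."""
    return np.array([tree.predict(X) for tree in forest.estimators_]).T

knn = KNeighborsClassifier(n_neighbors=5, metric="hamming")
knn.fit(tree_vectors(rf, X_tr), y_tr)
print(knn.predict(tree_vectors(rf, X_te))[:10])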
Performance Evaluation Each model was trained with 30,000 protein pairs and then tested on a different 30,000. Performance is assessed with precision vs. recall plots and receiver operating characteristic (ROC) curves.
Precision vs. Recall
             Prediction: T                        Prediction: F
Reality: T   True Positive (TP)                   False Negative, Type II Error (FN)
Reality: F   False Positive, Type I Error (FP)    True Negative (TN)
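For reference, the standard definitions in terms of the table above: Precision = TP / (TP + FP), the fraction of predicted interactions that are real; Recall = TP / (TP + FN), the fraction of real interactions that are recovered.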
ROC Curves Plot of true positives vs. false positives. The area under the curve is used as a measure of diagnostic accuracy; the area is measured until 50 false positives are found.
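A small sketch of one way to compute such a truncated curve from classifier scores, counting true positives against false positives and stopping at 50 false positives (an illustration with synthetic scores, not necessarily the paper's exact computation):

import random

def truncated_auc(scores, labels, max_fp=50):
    """Area under the TP-vs-FP count curve, stopped at max_fp false positives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])   # rank by decreasing score
    tp = fp = area = 0
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
            area += tp          # each new false positive adds a column of height tp
            if fp >= max_fp:
                break
    return area

# Illustrative example with synthetic labels and better-than-random scores
random.seed(0)
labels = [random.randint(0, 1) for _ in range(1000)]
scores = [random.random() + 0.2 * lab for lab in labels]
print(truncated_auc(scores, labels))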
Performance Comparison
Feature Importance Gene expression data is the most important feature for recovering all types of interactions.
Feature Composition
Conclusions Co-complex relationships are the easiest to predict. The detailed encoding style is preferred. The Random Forest classifier performs the best. Different features have different importance in predicting protein interactions.
Questions?