Network-based auto-probit modeling for protein function prediction


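Supplementary material

Xiaoyu Jiang, David Gold, Eric D. Kolaczyk

Derivation of Markov Chain Monte Carlo algorithm with the GO annotation uncertainty

When we consider the Gene Ontology annotation uncertainty and include the probability of being incorrectly un-annotated, $g$, the full conditional distribution for updating an individual $z_i$ differs from before. It is

\[
P(z_i \mid z_{[-i]}, \mu, \beta, g, y) =
\begin{cases}
\dfrac{1}{\sqrt{2\pi}\,\Phi(\tilde z_i)} \exp\left\{-\tfrac{1}{2}(z_i - \tilde z_i)^2\right\}, & z_i \ge 0,\ y_i = 1,\\[2ex]
0, & z_i < 0,\ y_i = 1,\\[1ex]
\dfrac{g}{\sqrt{2\pi}\,[1 - \Phi(\tilde z_i) + g\,\Phi(\tilde z_i)]} \exp\left\{-\tfrac{1}{2}(z_i - \tilde z_i)^2\right\}, & z_i \ge 0,\ y_i = -1,\\[2ex]
\dfrac{1}{\sqrt{2\pi}\,[1 - \Phi(\tilde z_i) + g\,\Phi(\tilde z_i)]} \exp\left\{-\tfrac{1}{2}(z_i - \tilde z_i)^2\right\}, & z_i < 0,\ y_i = -1,
\end{cases}
\]

where $z_{[-i]}$ is all of $z$ except the $i$th element $z_i$,

\[
\tilde z_i = \mu_i + \beta \sum_{j \ne i} \frac{1}{\sqrt{d_i d_j}}\, a_{ij}\,(z_j - \mu_j),
\]

and $\Phi$ is the standard normal cumulative distribution function.

The derivation of the conditional probability is as follows. The annotation model implied by $g$ is $P(y_i = -1 \mid z_i \ge 0, g) = g$, $P(y_i = 1 \mid z_i \ge 0, g) = 1 - g$, $P(y_i = -1 \mid z_i < 0, g) = 1$ and $P(y_i = 1 \mid z_i < 0, g) = 0$. Then

\[
P(z_i \mid z_{[-i]}, \mu, \beta, g, y)
= \frac{P(y \mid z, g)\, P(z_i \mid z_{[-i]}, \mu, \beta)}{\int_{z_i} P(y, z_i \mid z_{[-i]}, \mu, \beta, g)\, dz_i}
= \frac{P(y_i \mid z_i, g)\, P(z_i \mid z_{[-i]}, \mu, \beta)}{\int_{z_i} P(y_i \mid z_i, g)\, P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i}
= C \cdot P(z_i \mid z_{[-i]}, \mu, \beta),
\]

where

\[
\begin{aligned}
C &= \frac{P(y_i \mid z_i, g)}{\int_{z_i} P(y_i \mid z_i, g)\, P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i} \\[1ex]
&= \frac{P(y_i \mid z_i, g)}{\int_{z_i \ge 0} P(y_i \mid z_i, g)\, P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i + \int_{z_i < 0} P(y_i \mid z_i, g)\, P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i} \\[1ex]
&= \begin{cases}
\dfrac{P(y_i \mid z_i, g)}{(1 - g)\int_{z_i \ge 0} P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i + 0 \cdot \int_{z_i < 0} P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i}, & \text{when } y_i = 1,\\[2ex]
\dfrac{P(y_i \mid z_i, g)}{g \int_{z_i \ge 0} P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i + \int_{z_i < 0} P(z_i \mid z_{[-i]}, \mu, \beta)\, dz_i}, & \text{when } y_i = -1,
\end{cases} \\[1ex]
&= \begin{cases}
\dfrac{1}{\Phi(\tilde z_i)}, & \text{when } y_i = 1,\ z_i \ge 0,\\[2ex]
0, & \text{when } y_i = 1,\ z_i < 0,\\[1ex]
\dfrac{g}{g\,\Phi(\tilde z_i) + 1 - \Phi(\tilde z_i)}, & \text{when } y_i = -1,\ z_i \ge 0,\\[2ex]
\dfrac{1}{g\,\Phi(\tilde z_i) + 1 - \Phi(\tilde z_i)}, & \text{when } y_i = -1,\ z_i < 0.
\end{cases}
\end{aligned}
\]

The Gibbs sampler can be used to update $g$, whose full conditional distribution is a beta distribution,

\[
P(g \mid z, \mu, \beta, y) \propto P(y \mid z, g)\, P(g) \propto g^{N_{-+}} (1 - g)^{N_{++}},
\]

where $N_{-+} = \#\{i : y_i = -1,\ z_i \ge 0\}$ and $N_{++} = \#\{i : y_i = +1,\ z_i \ge 0\}$.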
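The two Gibbs updates derived above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: labels are assumed to be coded as 1 (annotated) and -1 (un-annotated), the network is assumed to be stored as a hypothetical `neighbors` dictionary of edge weights a_ij with a `degree` list, the neighbor weights are assumed to be a_ij / sqrt(d_i d_j), and the prior on g is assumed flat so that the beta draw matches the kernel g^{N-+}(1-g)^{N++}. All function names are invented for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf is Phi, inv_cdf is its inverse


def sample_below_zero(mean, rng):
    """Draw z ~ N(mean, 1) conditioned on z < 0 by inverse-CDF sampling."""
    p_neg = STD_NORMAL.cdf(-mean)            # P(z < 0) = 1 - Phi(mean)
    u = (1.0 - rng.random()) * p_neg         # uniform on (0, p_neg]
    u = min(max(u, 1e-300), 1.0 - 1e-15)     # keep inv_cdf's argument inside (0, 1)
    return min(mean + STD_NORMAL.inv_cdf(u), -1e-300)  # guard against round-off


def sample_at_or_above_zero(mean, rng):
    """Draw z ~ N(mean, 1) conditioned on z >= 0, by mirroring the case above."""
    return -sample_below_zero(-mean, rng)


def conditional_mean(i, z, mu, beta, neighbors, degree):
    """Assumed form: mu_i + beta * sum_j a_ij / sqrt(d_i d_j) * (z_j - mu_j)."""
    return mu[i] + beta * sum(
        a_ij / math.sqrt(degree[i] * degree[j]) * (z[j] - mu[j])
        for j, a_ij in neighbors[i].items()
    )


def update_z_i(i, z, mu, beta, g, y, neighbors, degree, rng):
    """One Gibbs draw of z_i from its full conditional."""
    m = conditional_mean(i, z, mu, beta, neighbors, degree)
    if y[i] == 1:
        # Annotated: the density is a normal truncated to z_i >= 0.
        return sample_at_or_above_zero(m, rng)
    # Un-annotated: mixture of the two truncated normals, with weight
    # g*Phi(m) / (g*Phi(m) + 1 - Phi(m)) on the z_i >= 0 component.
    p_pos = STD_NORMAL.cdf(m)
    p_neg = STD_NORMAL.cdf(-m)
    w_pos = g * p_pos / (g * p_pos + p_neg)
    if rng.random() < w_pos:
        return sample_at_or_above_zero(m, rng)
    return sample_below_zero(m, rng)


def update_g(z, y, rng, prior_a=1.0, prior_b=1.0):
    """Draw g from Beta(N_-+ + prior_a, N_++ + prior_b); flat prior assumed by default."""
    n_minus_plus = sum(1 for zi, yi in zip(z, y) if yi == -1 and zi >= 0)
    n_plus_plus = sum(1 for zi, yi in zip(z, y) if yi == 1 and zi >= 0)
    return rng.betavariate(n_minus_plus + prior_a, n_plus_plus + prior_b)
```

A full sweep would call `update_z_i` for each protein in turn and then `update_g` once; the updates for µ and β are not covered here. Sampling the un-annotated case as a two-component mixture avoids rejection sampling, since each component is an ordinary truncated normal.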

Web Figure 1: Traceplots of the posterior samples of β for the two networks. [Top]: the CC network; [Bottom]: the OOB network. (Each panel plots β against iterations.)

Web Figure 2: Histograms of the posterior samples of β for the two networks. [Top]: the CC network; [Bottom]: the OOB network. (Each panel plots frequency against the posterior samples of β.)

Web Figure 3: Histograms of the posterior estimates of µ for the two networks. In both plots, blue bars are for proteins not annotated with the term in question and purple bars are for annotated ones. [Top]: the CC network; [Bottom]: the OOB network. (Each panel plots frequency against the posterior estimates of µ.)

Web Figure 4 contains the histograms of the posterior estimates of the probability of having the target function, given the observed GO annotations, for the two networks. Blue histograms are for proteins that are not annotated with the term in question, while purple ones are for annotated proteins. We also plot the histograms of the posterior estimates of µ in Web Figure 3. Interestingly, in those plots the histograms for the two classes of proteins are well separated: the posterior mean for the un-annotated proteins is lower than that for the annotated proteins in both cases, and there is little overlap between the two classes. This indicates that the auto-probit model can distinguish proteins with different functional status, annotated and un-annotated, by using the network topology and estimating the parameters in a globally coherent fashion.

Web Figure 4: Histograms of the posterior estimates of the probability of having the target function, for un-annotated and annotated proteins. [Top]: intracellular signaling cascade in the CC network; [Bottom]: chromosome organization and biogenesis in the OOB network. (Each panel plots frequency against the posterior predictive probabilities.)

Simulation study

We conducted simulation trials on the CC network. For each trial, we fixed the network topology, pre-specified the parameter values, and simulated the protein annotations. The parameter values were those inferred from the original data. We then applied our model to the simulated annotations in a -fold cross-validation to produce predictions, evaluated predictive accuracy against the simulated annotations, and plotted the ROC curve.

Web Figure 5 below shows the individual ROC curves for the simulation trials (grey). The red ROC curve is plotted with sensitivity and specificity averaged across all trials; its AUC is .867. The upper 5th percentile, lower 5th percentile, and 50th percentile curves are chosen by the relative rank of their AUCs. They are colored purple, blue and green, with AUCs of .87, .988 and .887, respectively. The mean AUC is .8678 with a standard deviation of .282. The simulation results are stable across trials, which indicates that the model's predictions are reproducible.

Web Figure 5: ROC curves for the simulation trials. (Axes: sensitivity against specificity. Legend: individual simulations, the average across simulations, and the percentile-AUC curves.)
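The summary described above (one AUC per trial, then the mean, the standard deviation, and the trials sitting at chosen percentile ranks) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, labels are assumed to be 1 for annotated proteins, and the AUC is computed with the rank (Mann-Whitney) formulation, which may differ from how the ROC curves in the figure were drawn.

```python
import statistics


def auc(labels, scores):
    """Area under the ROC curve via the rank formulation; ties count one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def trial_at_percentile(aucs, q):
    """Index of the trial whose AUC has relative rank q (0 <= q <= 1)."""
    order = sorted(range(len(aucs)), key=aucs.__getitem__)
    return order[round(q * (len(aucs) - 1))]


def summarize(aucs):
    """Mean and sample standard deviation of the per-trial AUCs."""
    return statistics.mean(aucs), statistics.stdev(aucs)
```

Given a list `aucs` with one entry per trial, `trial_at_percentile(aucs, 0.05)`, `trial_at_percentile(aucs, 0.5)` and `trial_at_percentile(aucs, 0.95)` pick the trials whose ROC curves would be highlighted.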

Network-based auto-probit model on large network

We implemented our method on a large yeast network containing 585 proteins to predict the two terms studied in the paper, intracellular signal cascade and chromosome organization and biogenesis, by a -fold cross-validation study, and performed the same analysis as in Section 4.4. Below are the precision and recall plots.

Web Figure 6: Results for predicting function intracellular signal cascade on the network of 585 proteins by a -fold cross-validation. [Top]: recall versus threshold; [Bottom]: precision versus threshold. [Red]: auto-probit method with modeling annotation uncertainty; [Blue]: auto-probit method without modeling annotation uncertainty. (Panel legend: Auto-Probit with g; Auto-Probit without g. Panel axis labels: sensitivity, specificity.)

Web Figure 7: Results for predicting function chromosome organization and biogenesis on the network of 585 proteins by a -fold cross-validation. [Top]: recall versus threshold; [Bottom]: precision versus threshold. [Red]: auto-probit method with modeling annotation uncertainty; [Blue]: auto-probit method without modeling annotation uncertainty. (Panel legend: Auto-Probit with g; Auto-Probit without g. Panel axis labels: sensitivity, specificity.)
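For readers who want to reproduce plots of this kind, precision and recall as functions of the probability threshold can be computed as in the sketch below. It is an illustration only: the function names are hypothetical, labels are assumed to be 1 for annotated proteins, and a protein is assumed to be called positive when its posterior predictive probability is at or above the threshold.

```python
def precision_recall_at(labels, probs, threshold):
    """Precision and recall when probabilities >= threshold are called positive."""
    tp = sum(1 for l, p in zip(labels, probs) if p >= threshold and l == 1)
    fp = sum(1 for l, p in zip(labels, probs) if p >= threshold and l != 1)
    fn = sum(1 for l, p in zip(labels, probs) if p < threshold and l == 1)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


def threshold_curves(labels, probs, thresholds):
    """List of (threshold, precision, recall) triples, one per threshold."""
    return [(t, *precision_recall_at(labels, probs, t)) for t in thresholds]
```

Calling `threshold_curves` once for the predictions made with g and once for those made without g gives the two curves that each panel compares.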

Network-based auto-probit model with a different choice of path in Gene Ontology

To show how our proposed method works with a different choice of path in the GO hierarchy, we predicted the term intracellular signal cascade (ISC) through a different path leading to it. That is, we used the path from the term regulation of cellular process (RCP) to ISC, instead of the path from cellular communication (CC) used in our manuscript. As our network we took the largest connected component of 655 genes annotated with RCP, which we call the RCP network. As in the paper, we first compared the predictive accuracy of our auto-probit model with that of the logistic kernel method and the Nearest-Neighbor (NN) algorithm. We then studied the improvement in prediction from modeling annotation uncertainty in our model. The results tell the same story: our network-based auto-probit model is comparable to the other methods in predictive ability, and modeling the annotation uncertainty improves prediction.

Web Figure 8 below compares the ROC curves from the auto-probit model, the logistic kernel method and the NN algorithm, based on a -fold cross-validation study on the RCP network. The AUCs from the auto-probit model, the logistic kernel method, and the NN algorithm are .84, .8632 and .772, respectively. The p-value for comparing the AUCs of the auto-probit model and the logistic kernel method is .689; the p-value for comparing the AUCs of the auto-probit model and the NN algorithm is .439, indicating that our model has predictive capability similar to that of these commonly used methods.

Web Figure 9 shows the ROC curves from our model with and without annotation uncertainty, from a -fold cross-validation study on the RCP network. This is the same analysis as in Section 4.4 of the manuscript. The annotations used in the cross-validation study were updated in June 26, and model performance is evaluated against annotations updated in November 27. The AUC with annotation uncertainty modeled (under the red curve) is .6297, while the AUC without annotation uncertainty is only ... This shows that predictive accuracy is greatly improved by incorporating the uncertainty in negative annotations.

Web Figure 8: ROC curves for method comparison for predicting intracellular signal cascade on the RCP network. [Red]: auto-probit method with modeling annotation uncertainty; [Blue]: auto-probit method without modeling annotation uncertainty. (Panel legend: Auto-probit: weighted STRING; Logistic kernel method; Nearest Neighbor. Axes: sensitivity against specificity.)

Web Figure 9: Recall plot for predicting function intracellular signal cascade on the RCP network. [Red]: auto-probit method with modeling annotation uncertainty; [Blue]: auto-probit method without modeling annotation uncertainty. (Panel legend: Auto-Probit with g; Auto-Probit without g. Panel axis labels: sensitivity, specificity.)
