Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman. Proteins, 2006.

Motivation Correctly identifying the set of interacting proteins in an organism is useful for deciphering the molecular mechanisms underlying given biological functions and for assigning functions to unknown proteins based on their interacting partners.

Introduction
- Physical interaction
- Co-complex relationship
- Pathway co-membership
- Lean mass protein complex (NOT INCLUDED IN STUDY)

Introduction: yeast protein-protein interaction network (Jeong et al. 2001)

Direct Methods of PPI Prediction
Current high-throughput experimental approaches have been applied to determine the set of interacting proteins:
- Two-hybrid (Y2H)
- Mass spectrometry
These methods have high rates of false positives and false negatives.

Direct Methods of PPI Prediction: Two-hybrid (Y2H)

Direct Methods of PPI Prediction Tandem Affinity Purification Mass Spectrometry

Indirect Methods of PPI Prediction
- Gene expression data
- Biological function (GO)
- Biological process (GO)
- Sequence similarity

Key Words
- PPI: protein-protein interaction
- Gold standard dataset: data used to train and test an algorithm
- Positive examples: a set of known interacting protein pairs
- Negative examples: a set of randomly paired proteins believed not to interact with each other

Key Words
- Feature encoding: how do we use the data we have? (see the sketch below)
- Detailed: each source is handled separately
- Summary: similar sources are combined
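To make the two encoding styles concrete, here is a minimal sketch (not code from the paper) assuming a hypothetical protein pair with expression-correlation values from three separate microarray datasets: the detailed encoding keeps one feature per dataset, while the summary encoding collapses them into a single statistic.

```python
# Sketch (hypothetical values, not from the paper): "detailed" vs "summary"
# feature encoding for one protein pair.
expression_corr = {"dataset_A": 0.82, "dataset_B": 0.15, "dataset_C": 0.64}

# Detailed encoding: each data source becomes its own feature.
detailed_features = [expression_corr["dataset_A"],
                     expression_corr["dataset_B"],
                     expression_corr["dataset_C"]]

# Summary encoding: similar sources are collapsed into one feature
# (here, simply the maximum correlation across datasets).
summary_feature = [max(expression_corr.values())]

print(detailed_features)  # [0.82, 0.15, 0.64]
print(summary_feature)    # [0.82]
```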

Goal
Combine information from a variety of direct and indirect methods in a supervised learning framework to predict protein-protein interactions. Which combination of data, encoding style, and classifier works best?

Past Studies
Varying datasets, encoding styles and classifiers:
- Jansen et al.: Naïve Bayes, co-complex, summary
- Lin et al.: Random Forest, Logistic Regression, co-complex, summary
- Zhang et al.: Decision Tree, co-complex, detailed
- etc.

Systematic Comparison
- Reference datasets = {physical, co-complex, co-pathway}
- Encoding styles = {summary, detailed}
- Classifiers = {DT, LR, NB, SVM, RF, kRF}
Every combination of these choices is evaluated, as sketched below.
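A trivial sketch of the comparison grid (labels taken from the slide; the enumeration itself is just illustrative bookkeeping, not the authors' code):

```python
from itertools import product

# The systematic comparison covers every combination of reference dataset,
# encoding style, and classifier: 3 * 2 * 6 = 36 settings.
datasets = ["physical", "co-complex", "co-pathway"]
encodings = ["summary", "detailed"]
classifiers = ["DT", "LR", "NB", "SVM", "RF", "kRF"]

combinations = list(product(datasets, encodings, classifiers))
print(len(combinations), "combinations")
for dataset, encoding, clf in combinations[:3]:
    print(dataset, encoding, clf)
```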

Positive Examples
- Physical interactions: DIP (Database of Interacting Proteins)
- Co-complex interactions: MIPS (Munich Information Center for Protein Sequences)
- Co-pathway: KEGG (Kyoto Encyclopedia of Genes and Genomes)

Negative Examples
Post-filtering randomized protein pairing (Zhang et al. 2004): only a small fraction of all possible pairs within the datasets interact, so roughly 99% of randomly paired proteins are non-interacting. The final training sets contained one positive example for every 600 negative pairs. A sketch of the procedure follows.
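A minimal sketch of the random-pairing idea under stated assumptions (hypothetical protein identifiers and known-interaction set; not the authors' code): pairs are drawn at random and any pair that appears in the positive set is filtered out, and negatives are generated at roughly 600 per positive.

```python
import random

# Hypothetical inputs for illustration only.
proteins = [f"YP{i:04d}" for i in range(1, 501)]          # hypothetical protein IDs
positives = {("YP0001", "YP0002"), ("YP0003", "YP0010")}  # hypothetical known interactions

def random_negatives(n_pairs, proteins, positives, seed=0):
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.sample(proteins, 2)
        pair = tuple(sorted((a, b)))
        if pair not in positives:            # post-filtering step
            negatives.add(pair)
    return negatives

# Roughly 1 positive for every 600 negatives, as in the final training sets.
negatives = random_negatives(len(positives) * 600, proteins, positives)
print(len(positives), "positives,", len(negatives), "negatives")
```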

Features Used

Classification Algorithms
- SVM: Support Vector Machine
- NB: Naïve Bayes
- LR: Logistic Regression
- DT: Decision Tree
- RF: Random Forest
- kRF: Random Forest-based k-nearest neighbor

Support Vector Machine
Basic idea of support vector machines:
- Find the optimal hyperplane for linearly separating patterns
- Extend to patterns that are not linearly separable by transforming the data into a new space

Support Vectors
Support vectors are the data points that lie closest to the decision surface; they have a direct bearing on the optimum location of the decision surface.

Support Vector Machine (figures): a separating line y = mx + b in two dimensions, and an example where data that are not linearly separable become separable after transforming to r^2 = x^2 + y^2.

Support Vector Machine As we move to higher dimensions the problem becomes much more complex
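A minimal illustrative sketch using scikit-learn (an assumption for illustration, not the implementation used in the study): a linear SVM looks for a separating hyperplane, while a kernel (here RBF) implicitly maps the patterns into a new space in which the circular class from the figures becomes separable. The synthetic data are purely illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic example: points inside a circle are class 1, outside are class 0.
# Not linearly separable in (x, y), but separable after mapping to r^2 = x^2 + y^2.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

linear_svm = SVC(kernel="linear").fit(X, y)   # hyperplane in the original space
rbf_svm = SVC(kernel="rbf").fit(X, y)         # implicit mapping to a new space

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:   ", rbf_svm.score(X, y))
```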

Naïve Bayes
Basic idea of Naïve Bayes: calculate the probability of the desired outcome (an interaction) from a set of characteristics using Bayes' rule.

Bayes Rule
Toy example: five characteristics per protein pair plus the observed interaction label; the bottom row gives the fraction of Y entries in each column (e.g., 3/5 = 0.6 for Char 1).

Char 1  Char 2  Char 3  Char 4  Char 5  Interaction
  Y       Y       N       N       N        Y
  Y       N       N       Y       Y        N
  N       N       N       Y       Y        N
  Y       Y       Y       Y       N        Y
  N       N       Y       Y       N        N
 0.6     0.4     0.4     0.8     0.4      0.4

Naïve Bayes
Take the product across all characteristics X_i, assuming each characteristic is independent given that there is an interaction (Y = 1).
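A small sketch of that product using the toy table above (the helper function is illustrative, not the paper's code): the prior P(Y = 1) is multiplied by P(X_i = x_i | Y = 1) for each characteristic, under the conditional-independence assumption.

```python
# Rows from the toy table: Char1..Char5, Interaction (1 = Y, 0 = N).
rows = [
    (1, 1, 0, 0, 0, 1),
    (1, 0, 0, 1, 1, 0),
    (0, 0, 0, 1, 1, 0),
    (1, 1, 1, 1, 0, 1),
    (0, 0, 1, 1, 0, 0),
]

def naive_bayes_score(x, rows):
    """P(Y=1) * product over i of P(X_i = x_i | Y=1), assuming independence."""
    pos = [r for r in rows if r[-1] == 1]
    score = len(pos) / len(rows)                  # prior P(Y = 1)
    for i, xi in enumerate(x):
        match = sum(1 for r in pos if r[i] == xi)
        score *= match / len(pos)                 # P(X_i = x_i | Y = 1)
    return score

print(naive_bayes_score((1, 1, 0, 1, 0), rows))   # 0.1 for this toy table
```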

Logistic Regression
Basic idea of logistic regression: a statistical regression model for binary dependent variables.
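A minimal scikit-learn sketch (synthetic data, assumed library choice): logistic regression models the probability of the binary outcome as p = 1 / (1 + exp(-(w·x + b))).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                               # three hypothetical features
y = (X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(model.predict_proba(X[:3]))                           # probability of each class
```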

Decision Tree
Basic idea of tree-based methods: construct a binary tree where each internal node applies a filter on a given characteristic and each leaf contains the decision. The root contains all protein pairs, and at each node the pairs are separated into two categories according to the presence or absence of a characteristic.

Decision Tree
How do we decide which characteristic to use when separating the data? The Gini index looks at the largest class in the target and tries to find a split, using a feature, that isolates it from the other classes. A perfect series of splits would end up with k pure child nodes. If costs are assigned, we could isolate the most costly feature (most important), the one which tends to drive the cases into a single class.
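A short sketch of the Gini criterion (toy data and feature names are hypothetical): compute the impurity of each candidate split and choose the feature whose split yields the purest children.

```python
# Sketch (not the paper's code): Gini impurity and choosing the best binary split.
def gini(labels):
    p = sum(labels) / len(labels)                 # fraction of positives
    return 1.0 - p ** 2 - (1.0 - p) ** 2

def split_gini(feature_values, labels):
    """Weighted Gini impurity after splitting on a binary feature."""
    left = [y for x, y in zip(feature_values, labels) if x == 1]
    right = [y for x, y in zip(feature_values, labels) if x == 0]
    n = len(labels)
    return sum(len(side) / n * gini(side) for side in (left, right) if side)

labels = [1, 0, 0, 1, 0]                          # toy interaction labels
features = {"similar_expression": [1, 0, 0, 1, 0],
            "sequence_similarity": [1, 1, 0, 1, 1]}

best = min(features, key=lambda name: split_gini(features[name], labels))
print(best)   # similar_expression: it separates the two classes perfectly
```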

Decision Tree example (figure): the tree asks "Interaction?"; internal nodes test Similar Gene Expression, Sequence Similarity (95%), and GO Annotation (Level 3) as Characteristics 1-3, and the leaves give Y or N decisions.

Pruning
After splitting stops, the next step is to prune the tree:
- Cut off branches that provide the least additional predictive power
- Cut off weak branches with high misclassification rates
- Improves accuracy
A pruning sketch follows.
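One way to prune, sketched with scikit-learn's cost-complexity pruning (an assumption for illustration, not necessarily the pruning method used in the paper): increasing ccp_alpha removes branches that add little predictive power.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y)

print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
```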

Decision Tree example after pruning (figure): nodes test Gene Regulation (2-fold), Sequence Similarity (95%), and GO Annotation (Level 3); weak branches have been removed.

Random Forest
Based on the same idea as the decision tree, except that we take random subsets of features and construct multiple trees simultaneously. Classification is chosen based on majority support, with 200 trees grown for each run.

Random Forest (figure): six trees vote 1, 1, 0, 1, 0, 1 on a candidate pair; based on majority rule we would consider the tested pair an interacting pair.
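A scikit-learn sketch of the same idea (synthetic data and sklearn are assumptions; the 200-tree setting matches the slides): each tree casts a vote and the forest's prediction follows the majority.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                    # hypothetical pairwise features
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # hypothetical interaction labels

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

pair = X[:1]                                       # one candidate protein pair
votes = np.array([tree.predict(pair)[0] for tree in forest.estimators_])
print("votes for 'interacting':", int(votes.sum()), "of", len(votes))
print("forest prediction:", forest.predict(pair)[0])
```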

k-nearest Neighbor (kRF)
Based on the same idea as the random forest, except that a similarity matrix is calculated from the per-tree comparison values. Classification is chosen based on the k nearest neighbors; the value of k used is not specified.

k-nearest Neighbor (figure): the per-tree votes (1, 1, 0, 1, 0, 1, ...) form a vector such as <1, 1, 0, 1, 1, 0, 1> that is used to plot the data in n-dimensional space (n = 200).

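A sketch of one way to realize the random-forest-based nearest-neighbor idea (the shared-leaf similarity and the scikit-learn calls are assumptions, not the authors' exact construction): each example gets a 200-dimensional vector of per-tree leaf assignments, similarity is the fraction of trees in which two examples land in the same leaf, and a query is labelled by a vote over its k most similar neighbors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                     # hypothetical pairwise features
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
train_leaves = forest.apply(X)                     # shape (300, 200): leaf index per tree

def rf_knn_predict(query, k=5):
    query_leaves = forest.apply(query.reshape(1, -1))[0]
    similarity = (train_leaves == query_leaves).mean(axis=1)   # shared-leaf fraction
    neighbours = np.argsort(similarity)[-k:]                    # k most similar examples
    return int(round(y[neighbours].mean()))                     # majority label

print(rf_knn_predict(X[0]))
```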

Performance Evaluation
Each model was trained with 30,000 protein pairs and then tested on a different 30,000. Performance is assessed with precision vs. recall plots and receiver operating characteristic (ROC) curves.

Precision vs. Recall
Confusion matrix:

                   Reality: T                            Reality: F
Prediction: T      True Positive (TP)                    False Positive, Type I Error (FP)
Prediction: F      False Negative, Type II Error (FN)    True Negative (TN)

Precision = TP / (TP + FP); Recall = TP / (TP + FN).

ROC Curves
Plot of true positives vs. false positives. The area under the curve is used as a measure of diagnostic accuracy; the area is measured until 50 false positives are found.
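A sketch of these metrics on synthetic scores (the data and threshold are hypothetical; only the definitions and the "stop at 50 false positives" rule come from the slides): precision and recall at one threshold, plus the area under the true-positive vs false-positive count curve accumulated until 50 false positives have been seen.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
scores = y_true * 0.3 + rng.random(2000)              # hypothetical classifier scores

y_pred = (scores > 0.8).astype(int)
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)

order = np.argsort(-scores)                 # rank pairs from most to least confident
tp = np.cumsum(y_true[order] == 1)          # true positives accumulated so far
fp = np.cumsum(y_true[order] == 0)          # false positives accumulated so far
is_fp = (y_true[order] == 0)
partial_area = int(tp[is_fp & (fp <= 50)].sum())   # area under the TP-vs-FP curve, first 50 FPs
print("partial area (first 50 FPs):", partial_area)
```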

Performance Comparison

Feature Importance
Gene expression data is the most important feature for recovering all types of interactions.

Feature Composition

Conclusions
- Co-complex relationships are the easiest to predict
- The detailed encoding style is preferred
- The Random Forest classifier performs the best
- Different features have different importance in predicting protein interactions

Questions?