Analyzing ICAT Data


Gary Van Domselaar
University of Alberta

ICAT: Isotope Coded Affinity Tag
- Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex protein mixtures
- Steven P. Gygi, Beate Rist, Scott A. Gerber, Frantisek Turecek, Michael H. Gelb, Ruedi Aebersold (1999) Nature Biotechnology 17, 994-999.

The Quantitation Problem
- Mass spec peak intensities do not correlate well with sample amounts for different analytes:
  - Differential ability of peptides to acquire a charge.
  - The relationship between atomic composition and peak intensity is poorly understood.
- Mass spec peak intensities are quantitative for chemically identical peptides under identical experimental conditions.
- The ICAT methodology exploits this fact by isotopically labeling peptide fragments from different cell states.

The ICAT Strategy

Quantitating ICAT Peaks

Identifying ICAT Peaks

ICAT Quantitation Software
- ProICAT (ABI)
- SpectrumMill (Agilent)
- XPRESS (Institute for Systems Biology)
- ASAPRatio (Institute for Systems Biology)
- XPRESS and ASAPRatio work with Peptide Prophet and Protein Prophet

Sashimi: free open source tools for downstream analysis of mass spectrometric data
- Glossolalia: a common file format for MS data
- XPRESS & ASAPRatio: for relative quantification of isotopically labeled peptides
- http://sashimi.sourceforge.net

Goal of ICAT
- To identify changes in expression: how do we know if the changes we see are significant?
- To correlate the changes with biochemical processes: what underlies the changes that we see?

ICAT and GeneChip Comparison

A Real Example
Meehan and Sadar, Proteomics 2004, 4, 1116-1134

Scatter Plots
- Simplest kind of data plot (data scattered over a two-axis plot)
- No assumed connectivity (no lines connecting the dots)
- The challenge is to fit a line or a curve to the raw data to reveal a trend

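A Scatter Plot
[Figure: scatter plot of ICAT data; axes: Heavy, Light]

Correlation
[Figure: three scatter plots showing positive correlation, no correlation, and negative correlation]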

Correlation
[Figure: three scatter plots showing high correlation (r = 0.85), low correlation (r = 0.4), and perfect correlation (r = 1.0)]

Correlation Coefficient
\[ r = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_i (x_i - \mu_x)^2 \, \sum_i (y_i - \mu_y)^2}} \]
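As a minimal sketch, the correlation coefficient on this slide can be computed directly from its definition. The function name `pearson_r` and the light/heavy values are made up for illustration; they are not taken from the presentation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mu_x = sum(xs) / len(xs)
    mu_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    ss_x = sum((x - mu_x) ** 2 for x in xs)
    ss_y = sum((y - mu_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical light/heavy peak areas (illustrative only, not real data)
light = [1.0, 2.0, 3.0, 4.0, 5.0]
heavy = [2.0, 4.0, 6.0, 8.0, 10.0]
r_perfect = pearson_r(light, heavy)
print(r_perfect)
```

Because `heavy` is an exact multiple of `light` here, the points fall on a straight line, which is the "perfect correlation" case.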

Correlation Coefficient
- Sometimes called the coefficient of linear correlation or the Pearson product-moment correlation coefficient
- A quantitative way of determining what model (or equation or type of line) best fits a set of data
- Commonly used to assess most kinds of predictions or simulations

Correlation and Outliers
- Experimental error or something important?
- A single bad point can destroy a good correlation
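The claim that a single bad point can destroy a good correlation is easy to check numerically. The sketch below uses invented data and a small helper `corr` (a hypothetical name) that implements the standard Pearson formula; it compares r with and without one deliberately bad point.

```python
import math

def corr(xs, ys):
    """Pearson correlation coefficient (standard definition)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

# Eight nearly collinear points (made-up values)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0]
r_clean = corr(xs, ys)

# Same data plus one wildly discordant point
r_with_outlier = corr(xs + [9], ys + [-20.0])

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_with_outlier:.3f}")
```

Whether such a point is a measurement error or a real effect cannot be decided from r alone, which is the question the slide raises.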

Outliers
- Can be both good and bad
- When modeling data, you don't like to see outliers (they suggest the model is bad)
- Often a good indicator of experimental or measurement errors; only you can know!
- When plotting ICAT data, you do like to see outliers
- A good indicator of something significant

Cross Sectioning a Scatter Plot

What kind of point scatter do you see?

Gaussian or Normal Distribution

Features of a Normal Distribution
- Symmetric distribution
- Has an average or mean value (µ) at the centre
- Has a characteristic width called the standard deviation (σ)
- Most common type of distribution known

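Gaussian Distribution
\[ P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
[Figure: bell curve with the x-axis marked at µ - 3σ, µ - 2σ, µ - σ, µ, µ + σ, µ + 2σ, µ + 3σ]

Some Equations
Mean: \[ \mu = \frac{\sum_i x_i}{N} \]
Variance: \[ \sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N} \]
Standard deviation: \[ \sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}} \]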
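A short sketch of these equations in code may help. It uses the population forms (dividing by N, as on the slide); the function names and the sample values are assumptions chosen for illustration.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance: divides by N, as on the slide."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    return math.sqrt(variance(xs))

def gaussian_pdf(x, mu, sigma):
    """Height of the normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Made-up sample values
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = mean(sample), std_dev(sample)
print(mu, variance(sample), sigma)
print(gaussian_pdf(mu, mu, sigma))  # density at the peak
```

Many statistics packages divide by N - 1 instead when estimating the variance from a sample; that variant is not what the slide shows.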

Standard Deviations (Z-values)

Range          Fraction within range    Tail             Fraction in tail
µ ± 1.0 S.D.   0.683                    > µ + 1.0 S.D.   0.158
µ ± 2.0 S.D.   0.954                    > µ + 2.0 S.D.   0.023
µ ± 3.0 S.D.   0.9973                   > µ + 3.0 S.D.   0.0014
µ ± 4.0 S.D.   0.99994                  > µ + 4.0 S.D.   0.00003
µ ± 5.0 S.D.   0.9999994                > µ + 5.0 S.D.   0.0000003

[Figure: Gaussian curve P(x) with the x-axis marked from µ - 3σ to µ + 3σ]

Significance & Z-values
- Generally, if something is more than 2 SD away from the mean, it is considered significant (p < 0.05, i.e. it lies outside the central 95% of the distribution)
- Sometimes used to detect signals from noise, or unusual from normal
- Gene expression levels that are 2-2.5 SD above the mean are often considered significant
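For a normal distribution, the fractions in a table like this follow from the error function. The sketch below recomputes them and then applies the "more than 2 SD from the mean" rule to a small invented data set; the function names and the cutoff default are assumptions for illustration.

```python
import math

def fraction_within(z):
    """Fraction of a normal distribution within mu +/- z standard deviations."""
    return math.erf(z / math.sqrt(2))

def fraction_above(z):
    """Fraction of a normal distribution above mu + z standard deviations."""
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"{z:.1f} S.D.  within: {fraction_within(z):.7f}  above: {fraction_above(z):.7f}")

def flag_significant(values, cutoff=2.0):
    """Indices of values more than `cutoff` standard deviations from the mean."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > cutoff]

# Made-up log ratios: nine near zero and one large change
log_ratios = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.0, 0.1, 3.0]
flagged = flag_significant(log_ratios)
print(flagged)
```

Note that an extreme value inflates the standard deviation it is judged against, so with very few data points this simple rule can miss real outliers.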

Mean, Median & Mode
[Figure: skewed distribution with the mode, median, and mean marked]

Log Transformation
[Figure: expt A, ICAT heavy intensity vs. ICAT light intensity, shown on a linear scale (0 to 70000) and on a log2 scale (0 to 18)]

Choice of Base is Not Important
[Figure: expt A plotted on log10 and ln scales]

Why Log2 Transformation?
- Makes variation of intensities and ratios of intensities more independent of absolute magnitude
- Makes normalization additive
- Evens out highly skewed distributions
- Gives a more realistic sense of variation
- Approximates a normal distribution
- Treats increased and diminished expression equally
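Two of these points can be illustrated with a few lines of arithmetic: a log2 scale treats a 2-fold increase and a 2-fold decrease symmetrically, and switching to another base only rescales every value by the same constant. The ratios and intensities below are invented for the illustration.

```python
import math

# Hypothetical heavy/light ratios: a 2-fold increase and a 2-fold decrease
ratio_up = 2.0
ratio_down = 0.5

# On the linear scale the two changes look unequal relative to a ratio of 1
linear_up = ratio_up - 1.0      # +1.0
linear_down = ratio_down - 1.0  # -0.5

# On the log2 scale they are symmetric about 0
log_up = math.log2(ratio_up)
log_down = math.log2(ratio_down)
print(linear_up, linear_down, log_up, log_down)

# Changing the base multiplies every log value by the same constant,
# so the shape of a plot is unchanged; only the axis units differ.
intensities = [150.0, 2000.0, 40000.0]
scale_factors = [math.log10(x) / math.log2(x) for x in intensities]
print(scale_factors)
```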

Log Transformations
Applying a log transformation makes the variance and offset more proportionate along the entire graph.

H         L         H/L
60 000    40 000    1.5
3000      2000      1.5

log2 H    log2 L    log2 ratio
15.87     15.29     0.58
11.55     10.97     0.58

[Figure: log2 H area vs. log2 L area, both axes from 0 to 16]
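The two heavy/light pairs on this slide can be checked with a short sketch. It also shows why the log scale helps: the log2 ratio is simply the difference of the two log2 intensities, so the same 1.5-fold change gives the same offset at high and low intensity. The helper name `log2_row` is hypothetical.

```python
import math

def log2_row(heavy, light):
    """Return (log2 H, log2 L, log2 ratio) for one heavy/light pair."""
    log_h = math.log2(heavy)
    log_l = math.log2(light)
    # log2(H / L) equals log2(H) - log2(L)
    return log_h, log_l, log_h - log_l

pairs = [(60000, 40000), (3000, 2000)]
rows = [log2_row(h, l) for h, l in pairs]
for (h, l), (log_h, log_l, log_ratio) in zip(pairs, rows):
    print(f"H={h:>6} L={l:>6} H/L={h / l:.1f}  "
          f"log2 H={log_h:.2f} log2 L={log_l:.2f} log2 ratio={log_ratio:.2f}")
```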

Signal-to-noise
Significant?

Detecting Clusters
[Figure: scatter plot of Height vs. Weight]

Is it Right to Calculate a Correlation Coefficient?
[Figure: Height vs. Weight scatter plot, r = 0.73]

Or is There More to This?
[Figure: Height vs. Weight scatter plot with the points labeled Boy and Girl]

Clustering Applications in Bioinformatics
- 2D Gel or ProteinChip Analysis
- Microarray or GeneChip Analysis
- Protein Interaction Analysis
- Phylogenetic and Evolutionary Analysis
- Structural Classification of Proteins
- Protein Sequence Families
- ICAT :-)

Clustering
- Definition: a process by which objects that are logically similar in characteristics are grouped together.
- Clustering is different from classification: in classification the objects are assigned to pre-defined classes; in clustering the classes are yet to be defined.
- Clustering helps in classification.

Clustering Requires...
- A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
- A threshold value with which to decide whether an object belongs with a cluster
- A way of measuring the distance between two clusters
- A cluster seed (an object to begin the clustering process)

Clustering Algorithms
- K-means or Partitioning Methods: divide a set of N objects into M clusters, with or without overlap
- Hierarchical Methods: produce a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
- Self-Organizing Feature Maps: produce a cluster set through iterative training
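As one possible illustration of the first requirement, the sketch below builds a dissimilarity matrix from pairwise Euclidean distances. Euclidean distance is an assumption here; the slides do not say which measure was used, and the expression profiles are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def distance_matrix(profiles):
    """Square matrix of pairwise distances; entry [i][j] compares profiles i and j."""
    return [[euclidean(a, b) for b in profiles] for a in profiles]

# Hypothetical log2 ratios for three proteins at two time points
profiles = [(0.0, 0.0), (3.0, 4.0), (0.1, -0.1)]
matrix = distance_matrix(profiles)
for row in matrix:
    print(["%.2f" % d for d in row])
```

Small entries mark similar objects, so a threshold on this matrix gives the "belongs with a cluster" decision listed in the second requirement.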

K-means or Partitioning Methods
1. Make the first object the centroid for the first cluster
2. For the next object, calculate the similarity to each existing centroid
3. If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; otherwise use the object to start a new cluster
4. Return to step 2 and repeat until done

[Figure: initial cluster; choose 1; choose 2; test & join; centroid updated at each step. Rule: λ_T = λ_centroid ± 50 nm]
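The four steps on this slide can be sketched as a single pass over the data. This follows the thresholded procedure as described here, which differs from textbook k-means (where the number of clusters is fixed in advance and assignments are iterated). The sketch assumes one-dimensional values and treats "similarity greater than a threshold" as "distance to the centroid no larger than a threshold"; the function name and the wavelength values are invented.

```python
def threshold_cluster(values, threshold):
    """One-pass partitioning: join the nearest centroid within `threshold`, else start a new cluster."""
    clusters = []   # list of member lists
    centroids = []  # running mean of each cluster
    for v in values:
        best = None
        for i, c in enumerate(centroids):
            d = abs(v - c)
            if d <= threshold and (best is None or d < abs(v - centroids[best])):
                best = i
        if best is None:
            # No centroid is close enough: this object seeds a new cluster
            clusters.append([v])
            centroids.append(float(v))
        else:
            # Join the closest cluster and redetermine its centroid
            clusters[best].append(v)
            centroids[best] = sum(clusters[best]) / len(clusters[best])
    return clusters, centroids

# Hypothetical wavelengths in nm, clustered with a 50 nm threshold
wavelengths = [400, 420, 600, 610, 430, 800]
clusters, centroids = threshold_cluster(wavelengths, threshold=50)
print(clusters)
print(centroids)
```

The result of a one-pass scheme like this can depend on the order in which the objects are presented.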

Hierarchical Clustering
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
3. If more than one cluster remains, return to step 2 until finished

[Figure: initial cluster; pairwise compare; select closest; select next closest. Rule: λ_T = λ_obs ± 50 nm]
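A minimal sketch of these steps, assuming one-dimensional values and single linkage (the distance between two clusters is the distance between their closest members). The slides do not specify a linkage rule, so that choice, the function name, the optional `stop_distance` threshold, and the data are all assumptions.

```python
def single_linkage(values, stop_distance=None):
    """Agglomerative clustering of 1-D values; returns the list of merges in order."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if stop_distance is not None and d > stop_distance:
            break  # remaining clusters are farther apart than the threshold
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Made-up expression levels
levels = [1.0, 1.5, 5.0, 5.2, 9.0]
example_merges = single_linkage(levels)
for left, right, dist in example_merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

Read top to bottom, the merge list is the information a dendrogram displays: which groups joined, and at what distance.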

Hierarchical Clustering
[Figure: dendrogram built step by step over objects A to F. Find the 2 most similar protein expression levels or curves; find the next closest pair of levels or curves; iterate]

Self-Organizing Feature Maps
[Figure: panels labeled T = 0, T = 20 h, T = 1 day, T = 2 days, T = 3 days]

Self-Organizing Feature Maps
[Figure: plot chip data; compute feature map with 6 nodes; examine clusters (Cluster 1 to Cluster 6) for biological meaning]

SOMs - Details
Specify the number of nodes (clusters) desired, and a 2-D geometry for the nodes (rectangular or hexagonal).
[Figure: genes G1 to G29 scattered around nodes N1 to N6; N = Nodes, G = Genes]

SOMs - Details
Choose a random protein, e.g., G9.
Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The farther away a node is from N2, the less it is moved.
[Figure: genes G1 to G29 and nodes N1 to N6, before and after the nodes move toward G9]

SOMs - Details
Repeat the process many (usually several thousand) times, choosing different proteins. With each iteration, the amount that the nodes move is decreased.

Finally, each node will nestle among a cluster of genes, and a protein will be considered to be in the cluster if its distance to the node in that cluster is less than its distance to any other node.
[Figure: nodes N1 to N6 each sitting within a group of nearby genes]
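The training loop described on these slides can be sketched in plain Python. This is an illustration under stated assumptions, not the presenter's implementation: it uses a rectangular 2 x 3 grid (6 nodes), a Gaussian neighbourhood measured on the node grid, and linearly decaying step size and radius. The function names, parameter defaults, and data points are invented.

```python
import math
import random

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def train_som(data, rows, cols, iterations=2000, lr0=0.5, radius0=1.5, seed=0):
    """Train a small rectangular SOM; returns one weight vector per node."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: start each node at a randomly chosen data point
    weights = [list(rng.choice(data)) for _ in grid]
    for t in range(iterations):
        frac = t / iterations
        lr = lr0 * (1.0 - frac)          # nodes move less with each iteration
        radius = radius0 * (1.0 - frac)  # neighbourhood shrinks over time
        x = rng.choice(data)             # choose a random object
        winner = min(range(len(grid)), key=lambda k: sq_dist(weights[k], x))
        for k, (r, c) in enumerate(grid):
            # Nodes farther from the winner on the grid are moved less
            g2 = (r - grid[winner][0]) ** 2 + (c - grid[winner][1]) ** 2
            h = math.exp(-g2 / (2 * radius ** 2))
            for d in range(len(x)):
                weights[k][d] += lr * h * (x[d] - weights[k][d])
    return weights

def assign_clusters(data, weights):
    """Each object belongs to the node it is closest to."""
    return [min(range(len(weights)), key=lambda k: sq_dist(weights[k], x)) for x in data]

# Made-up 2-D profiles forming three loose groups
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
        (0.1, 5.0), (0.0, 5.2), (0.2, 4.9)]
weights = train_som(data, rows=2, cols=3)
assignments = assign_clusters(data, weights)
print(assignments)
```

With more nodes than natural groups, some nodes may end up sharing a group or holding no objects at all, so the node count is a choice that has to be examined against the biology.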

Statistics Software
- Excel
- MATLAB
- Octave
- SAS
- SPSS
- S-PLUS
- Statistica
- R

Cluster Annotation
Once you have your clusters, annotate them and look for patterns that can reveal the underlying process.

Metabolism:
- KEGG http://www.genome.ad.jp/kegg/metabolism.html
- Roche/Boehringer http://www.expasy.org/cgi-bin/search-biochem-ind
- EcoCyc www.ecocyc.org
- PathDB http://www.ncgr.org/pathdb

Cluster Annotation
Interaction Databases:
- BIND http://www.bind.ca
- DIP http://dip.doe-mbi.ucla.edu/
- MINT http://mint.bio.uniroma2.it/mint
- PathCalling http://protal.curagen.com/extpc/com.curagen.portal.s

Cluster Annotation
Bibliographic Databases:
- PubMed Medline http://www.ncbi.nlm.nih.gov/pubmed/
- Science Citation Index http://isi4.isiknowledge.com/portal.cgi
- Your Local Library www.xxxx.ca
- Current Contents http://www.isinet.com/isi

Cluster Annotation
Other:
- SWISSPROT: Curated Expert Annotations http://www.expasy.org/
- Subcellular Localization http://www.cs.ualberta.ca/~bioinfo/pa/sub/
- Gene Ontology http://www.geneontology.org/