Analyzing ICAT Data

Gary Van Domselaar
University of Alberta

ICAT: Isotope Coded Affinity Tag
Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex protein mixtures.
Steven P. Gygi, Beate Rist, Scott A. Gerber, Frantisek Turecek, Michael H. Gelb, Ruedi Aebersold (1999) Nature Biotechnology 17, 994-999.

The Quantitation Problem
Mass spec peak intensities do not correlate well with sample amounts for different analytes:
  Peptides differ in their ability to acquire a charge.
  The relationship between atomic composition and peak intensity is poorly understood.
Mass spec peak intensities are quantitative for chemically identical peptides under identical experimental conditions.
The ICAT methodology exploits this fact by isotopically labeling peptide fragments from different cell states.

The ICAT Strategy

Quantitating ICAT Peaks

Identifying ICAT Peaks

ICAT Quantitation Software
ProICAT        ABI
SpectrumMill   Agilent
XPRESS         Institute for Systems Biology
ASAPRatio      Institute for Systems Biology
XPRESS and ASAPRatio work with Peptide Prophet and Protein Prophet.

Sashimi: free open source tools for downstream analysis of mass spectrometric data
  Glossolalia: a common file format for MS data
  XPRESS & ASAPRatio: for relative quantification of isotopically labeled peptides
  http://sashimi.sourceforge.net

Goal of ICAT
To identify changes in expression
  How do we know if the changes we see are significant?
To correlate the changes with biochemical processes
  What underlies the changes that we see?

ICAT and GeneChip Comparison

A Real Example
Meehan and Sadar, Proteomics 2004, 4, 1116-1134

Scatter Plots
Simplest kind of data plot (data scattered over a two-axis plot)
No assumed connectivity (no lines connecting the dots)
The challenge is to fit a line or a curve to the raw data to reveal a trend

A Scatter Plot
[Figure: scatter plot with axes Heavy and Light]

Correlation
[Figure: three scatter plots showing positive correlation, no correlation, and negative correlation]

Correlation
[Figure: three scatter plots showing high correlation (r = 0.85), low correlation (r = 0.4), and perfect correlation (r = 1.0)]

Correlation Coefficient
r = \frac{\sum_i (x_i - \mu_x)(y_i - \mu_y)}{\sqrt{\sum_i (x_i - \mu_x)^2 \, \sum_i (y_i - \mu_y)^2}}

Correlation Coefficient
Sometimes called the coefficient of linear correlation or the Pearson product-moment correlation coefficient
A quantitative way of determining what model (or equation or type of line) best fits a set of data
Commonly used to assess most kinds of predictions or simulations

Correlation and Outliers
Experimental error or something important?
A single bad point can destroy a good correlation
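As a minimal sketch of the two slides above, the snippet below computes the Pearson coefficient from its definition and then shows how one stray point changes it. The peak areas are invented for illustration and are not taken from the lecture data; `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson r: summed co-deviations divided by the root of the
    product of the summed squared deviations of x and y."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mu_x) ** 2 for x in xs)
    s_yy = sum((y - mu_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Invented light/heavy peak areas that lie exactly on a straight line.
light = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
heavy = [2.0 * x for x in light]
r_clean = pearson_r(light, heavy)

# The same data plus one invented point far off the trend.
r_with_outlier = pearson_r(light + [6000.0], heavy + [500.0])

print(f"r without outlier: {r_clean:.3f}")
print(f"r with outlier:    {r_with_outlier:.3f}")
```

Whether the stray point is an experimental error or a real biological change cannot be decided from r alone; the coefficient only reports how well a single line describes all the points together.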

Outliers
Can be both good and bad
When modeling data, you don't like to see outliers (they suggest the model is bad)
  Often a good indicator of experimental or measurement errors; only you can know!
When plotting ICAT data, you do like to see outliers
  A good indicator of something significant

Cross Sectioning a Scatter Plot

What kind of point scatter do you see?

Gaussian or Normal Distribution

Features of a Normal Distribution
Symmetric distribution
Has an average or mean value (µ) at the centre
Has a characteristic width called the standard deviation (σ)
Most common type of distribution known

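Gaussian Distribution
P(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
[Figure: Gaussian curve with the x-axis marked from µ - 3σ to µ + 3σ]

Some Equations
Mean:               \mu = \frac{\sum_i x_i}{N}
Variance:           \sigma^2 = \frac{\sum_i (x_i - \mu)^2}{N}
Standard deviation: \sigma = \sqrt{\frac{\sum_i (x_i - \mu)^2}{N}}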

Standard Deviations (Z-values)
Range           Fraction inside    Range             Fraction above
µ ± 1.0 S.D.    0.683              > µ + 1.0 S.D.    0.159
µ ± 2.0 S.D.    0.954              > µ + 2.0 S.D.    0.023
µ ± 3.0 S.D.    0.9973             > µ + 3.0 S.D.    0.0014
µ ± 4.0 S.D.    0.99994            > µ + 4.0 S.D.    0.00003
µ ± 5.0 S.D.    0.9999994          > µ + 5.0 S.D.    0.0000003
[Figure: Gaussian curve with the x-axis marked from µ - 3σ to µ + 3σ]

Significance & Z-values
Generally, if something is more than 2 SD away from the mean, it is considered significant (about 95% of values lie within 2 SD, so p < 0.05)
Sometimes used to detect signals from noise or unusual from normal
Gene expression levels that are 2-2.5 SD above the mean are often considered significant
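The Z-value idea can be sketched in a few lines, assuming the data are roughly normal. The snippet uses the population formulas (dividing by N), reproduces the tabulated fractions from the error function, and flags values more than 2 SD from the mean. The ratio values are invented for illustration, and the function names are hypothetical.

```python
import math


def mean_and_sd(values):
    """Population mean and standard deviation (denominator N)."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((v - mu) ** 2 for v in values) / n
    return mu, math.sqrt(variance)


def z_scores(values):
    """Distance of each value from the mean, in units of SD."""
    mu, sd = mean_and_sd(values)
    if sd == 0:
        raise ValueError("all values are identical; Z-values are undefined")
    return [(v - mu) / sd for v in values]


def fraction_within(k):
    """Fraction of a normal distribution lying within mu +/- k SD."""
    return math.erf(k / math.sqrt(2.0))


def fraction_above(k):
    """Fraction of a normal distribution lying above mu + k SD."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def flag_significant(values, threshold=2.0):
    """Indices of values more than `threshold` SD away from the mean."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]


# Invented log2 heavy/light ratios; one protein stands well apart.
ratios = [0.1, -0.2, 0.0, 0.15, -0.1, 0.05, -0.05, 2.5, 0.0, -0.15]
flagged = flag_significant(ratios)

for k in (1.0, 2.0, 3.0):
    print(f"within {k} SD: {fraction_within(k):.4f}  above: {fraction_above(k):.4f}")
print("indices more than 2 SD from the mean:", flagged)
```

One caveat with this approach: a strong outlier inflates the SD that it is judged against, so on small data sets a more robust spread estimate may be worth considering.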

Mean, Median & Mode
[Figure: distribution with the mode, median, and mean marked]

Log Transformation
[Figure: ICAT heavy intensity against ICAT light intensity for experiment A, plotted on a linear scale (0 to 70000) and on a log2 scale]

Choice of Base is Not Important
[Figure: experiment A replotted on log10 and ln scales]

Why Log2 Transformation?
Makes variation of intensities and ratios of intensities more independent of absolute magnitude
Makes normalization additive
Evens out highly skewed distributions
Gives a more realistic sense of variation
Approximates a normal distribution
Treats increased and diminished expression equally

Log Transformations
Applying a log transformation makes the variance and offset more proportionate along the entire graph

H         L         H/L    log2 H    log2 L    log2 ratio
60 000    40 000    1.5    15.87     15.29     0.58
3000      2000      1.5    11.55     10.97     0.58

[Figure: log2 L area plotted against log2 H area, both axes from 0 to 16]
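The numbers in the example above can be reproduced with a short sketch. It also illustrates two of the stated reasons for working in log2: the same fold change gives the same value regardless of absolute intensity, and a doubling and a halving come out as +1 and -1. The function name `log2_ratio` is a hypothetical helper.

```python
import math


def log2_ratio(heavy, light):
    """log2(H/L), computed as log2(H) - log2(L)."""
    if heavy <= 0 or light <= 0:
        raise ValueError("intensities must be positive to take a logarithm")
    return math.log2(heavy) - math.log2(light)


# The two heavy/light pairs from the slide.
pairs = [(60000.0, 40000.0), (3000.0, 2000.0)]
for heavy, light in pairs:
    print(
        f"H={heavy:>8.0f}  L={light:>8.0f}  H/L={heavy / light:.2f}  "
        f"log2 H={math.log2(heavy):.2f}  log2 L={math.log2(light):.2f}  "
        f"log2 ratio={log2_ratio(heavy, light):.2f}"
    )

# Increased and diminished expression are treated symmetrically.
doubled = log2_ratio(2.0, 1.0)
halved = log2_ratio(1.0, 2.0)
print("doubling:", doubled, " halving:", halved)
```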

Signal-to-noise
Significant?

Detecting Clusters
[Figure: scatter plot of height against weight]

Is it Right to Calculate a Correlation Coefficient?
[Figure: scatter plot of height against weight, r = 0.73]

Or is There More to This?
[Figure: the height against weight plot with boys and girls marked as separate groups]

Clustering Applications in Bioinformatics
2D Gel or ProteinChip Analysis
Microarray or GeneChip Analysis
Protein Interaction Analysis
Phylogenetic and Evolutionary Analysis
Structural Classification of Proteins
Protein Sequence Families
ICAT :-)

Clustering
Definition: a process by which objects that are logically similar in characteristics are grouped together.
Clustering is different from classification
In classification the objects are assigned to pre-defined classes; in clustering the classes are yet to be defined
Clustering helps in classification

Clustering Requires...
A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
A threshold value with which to decide whether an object belongs with a cluster
A way of measuring the distance between two clusters
A cluster seed (an object to begin the clustering process)

Clustering Algorithms
K-means or Partitioning Methods: divide a set of N objects into M clusters, with or without overlap
Hierarchical Methods: produce a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
Self-Organizing Feature Maps: produce a cluster set through iterative training

K-means or Partitioning Methods
1. Make the first object the centroid for the first cluster
2. For the next object, calculate the similarity to each existing centroid
3. If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; otherwise use the object to start a new cluster
4. Return to step 2 and repeat until done

K-means or Partitioning Methods
[Figure: worked example with panels "initial cluster", "choose 1", "choose 2", "test & join", showing the centroid at each step. Rule: λ_T = λ_centroid ± 50 nm]
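The four steps above can be sketched as follows, assuming one-dimensional data and using absolute distance as the (inverse) similarity measure. Note that, as described, the procedure creates clusters on the fly from a threshold; it is closer to a sequential threshold scheme than to textbook k-means with a fixed number of clusters. The wavelength values are invented, and the 50 nm threshold simply echoes the rule shown in the figure.

```python
def threshold_cluster(values, threshold):
    """Sequential partitioning: each object joins the nearest existing
    centroid if it lies within `threshold`, else it seeds a new cluster."""
    clusters = []   # list of member lists
    centroids = []  # running mean of each cluster
    for value in values:
        # Step 2: compare the object with every existing centroid.
        best = None
        for index, centroid in enumerate(centroids):
            distance = abs(value - centroid)
            if distance <= threshold and (
                best is None or distance < abs(value - centroids[best])
            ):
                best = index
        if best is None:
            # Steps 1 and 3 (else branch): start a new cluster.
            clusters.append([value])
            centroids.append(value)
        else:
            # Step 3: join the cluster and redetermine its centroid.
            clusters[best].append(value)
            centroids[best] = sum(clusters[best]) / len(clusters[best])
    return clusters, centroids


# Invented absorbance maxima in nm, processed in the order given.
wavelengths = [400.0, 420.0, 600.0, 610.0, 450.0, 800.0]
clusters, centroids = threshold_cluster(wavelengths, threshold=50.0)
print(clusters)
print(centroids)
```

Because objects are processed one at a time, the result can depend on their order; shuffling the input and comparing the outcomes is a cheap sanity check.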

Hierarchical Clustering
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
3. If more than one cluster remains, return to step 2 until finished

Hierarchical Clustering
[Figure: worked example with panels "initial cluster", "pairwise compare", "select closest", "select next closest". Rule: λ_T = λ_obs ± 50 nm]
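A minimal sketch of the merge loop above, assuming one-dimensional data and single linkage (the distance between two clusters is taken as the distance between their closest members). The linkage rule is an assumption; the slides do not say which one is intended. Merging stops when the closest pair is farther apart than the threshold, or when one cluster remains.

```python
def agglomerate(values, threshold):
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[v] for v in values]
    merges = []  # record of (left, right, distance) in merge order
    while len(clusters) > 1:
        # Find the closest pair of clusters under single linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        distance, i, j = best
        if distance > threshold:
            break  # nothing left that is similar enough to merge
        merges.append((list(clusters[i]), list(clusters[j]), distance))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Invented expression levels.
levels = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters, merges = agglomerate(levels, threshold=1.0)
print(clusters)

# With no threshold, merging continues until one cluster remains.
full_tree, full_merges = agglomerate(levels, threshold=float("inf"))
print(len(full_tree), "cluster after", len(full_merges), "merges")
```

The recorded merge order is what a dendrogram draws: early merges sit near the leaves and late merges near the root.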

Hierarchical Clustering
Find the 2 most similar protein expression levels or curves
Find the next closest pair of levels or curves
Iterate
[Figure: items A to F merged step by step]

Self-Organizing Feature Maps
[Figure: panels at T=0, T=20 h, T=1 day, T=2 days, T=3 days]

Self-Organizing Feature Maps
Plot chip data
Compute feature map with 6 nodes
Examine clusters for biological meaning
[Figure: Cluster 1 to Cluster 6]

SOMs - Details
Specify the number of nodes (clusters) desired, and a 2-D geometry for the nodes (rectangular or hexagonal)
[Figure: nodes N1 to N6 placed among points G1 to G29; N = Nodes, G = Genes]

SOMs - Details
Choose a random protein, e.g., G9
[Figure: nodes N1 to N6 among points G1 to G29, with G9 selected]
Move the nodes in the direction of G9. The node closest to G9 (N2) is moved the most, and the other nodes are moved by smaller varying amounts. The farther away the node is from N2, the less it is moved.
[Figure: the same map after the nodes have moved toward G9]

SOMs - Details
Repeat the process many (usually several thousand) times, choosing different proteins. With each iteration, the amount that the nodes move is decreased.
Finally, each node will nestle among a cluster of proteins, and a protein is considered to be in a cluster if its distance to that cluster's node is less than its distance to any other node.
[Figure: final map with each of the nodes N1 to N6 sitting inside its own group of points]
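The training loop described above can be sketched as below, assuming a rectangular node grid, a Gaussian neighbourhood on that grid, and a linearly shrinking learning rate and radius. Those schedules, the starting values (0.5 for the rate, half the grid size for the radius), and the data points are all illustrative assumptions and are not taken from the slides.

```python
import math
import random


def nearest_node(nodes, point):
    """Index of the node with the smallest squared distance to `point`."""
    return min(
        range(len(nodes)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], point)),
    )


def train_som(points, grid_w, grid_h, iterations=2000, seed=0):
    """Train a small self-organizing map and return the node positions."""
    rng = random.Random(seed)
    grid = [(gx, gy) for gy in range(grid_h) for gx in range(grid_w)]
    # Start every node on a randomly chosen data point.
    nodes = [list(rng.choice(points)) for _ in grid]
    for step in range(iterations):
        remaining = 1.0 - step / iterations
        rate = 0.5 * remaining                      # nodes move less over time
        radius = 0.5 + remaining * max(grid_w, grid_h) / 2.0
        point = rng.choice(points)                  # choose a random protein
        winner = nearest_node(nodes, point)
        for k, node in enumerate(nodes):
            # Nodes far from the winner on the grid are moved less.
            grid_dist2 = (grid[k][0] - grid[winner][0]) ** 2 + (
                grid[k][1] - grid[winner][1]
            ) ** 2
            influence = math.exp(-grid_dist2 / (2.0 * radius * radius))
            for d in range(len(point)):
                node[d] += rate * influence * (point[d] - node[d])
    return nodes


def assign(points, nodes):
    """Put each point in the cluster of its nearest node."""
    return [nearest_node(nodes, p) for p in points]


# Invented two-feature profiles (for example log2 ratios at two time points).
points = [
    (0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
    (2.0, 2.1), (2.2, 1.9), (1.9, 2.0),
    (-2.0, 0.1), (-2.1, -0.1),
]
nodes = train_som(points, grid_w=3, grid_h=2)
labels = assign(points, nodes)
print(labels)
```

With six nodes and only three visible groups, some nodes may end up with no points assigned; in practice the node count is a choice to revisit after examining the clusters.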

Statistics Software
Excel, MATLAB, Octave, SAS, SPSS, S-PLUS, Statistica, R

Cluster Annotation
Once you have your clusters, annotate them and look for patterns that can reveal the underlying process

Metabolism:
KEGG               http://www.genome.ad.jp/kegg/metabolism.html
Roche/Boehringer   http://www.expasy.org/cgi-bin/search-biochem-ind
EcoCyc             www.ecocyc.org
PathDB             http://www.ncgr.org/pathdb

Cluster Annotation
Interaction Databases
BIND          http://www.bind.ca
DIP           http://dip.doe-mbi.ucla.edu/
MINT          http://mint.bio.uniroma2.it/mint
PathCalling   http://protal.curagen.com/extpc/com.curagen.portal.s

Cluster Annotation
Bibliographic Databases
PubMed Medline           http://www.ncbi.nlm.nih.gov/pubmed/
Science Citation Index   http://isi4.isiknowledge.com/portal.cgi
Your Local Library       www.xxxx.ca
Current Contents         http://www.isinet.com/isi

Cluster Annotation
Other
SWISSPROT: Curated Expert Annotations   http://www.expasy.org/
Subcellular Localization                http://www.cs.ualberta.ca/~bioinfo/pa/sub/
Gene Ontology                           http://www.geneontology.org/