Introduction to GE Microarray data analysis Practical Course MolBio 2012 Claudia Pommerenke Nov-2012 Transkriptomanalyselabor TAL Microarray and Deep Sequencing Core Facility Göttingen University Medical Center Göttingen 1 / 46
Outline
1. Experimental Design
   - Research Question
   - Controls & Replicates
2. Preprocessing
   - Image Analysis
   - Normalization
3. Differential Expression
   - Student's t-test
   - Gene List Analysis
   - Practical Solutions
4. Summary
Experimental Design: Think before you start!
- Research Question
- Choice of Technology
- Controls & Replicates
Reference: Churchill. 2002. Fundamentals of experimental design for cDNA microarrays. Nature Genetics, Supplement 32: 490-495.
Experimental Design: Study objectives
- class comparison: differential expression (e.g. liver vs. kidney)
Experimental Design: Class Comparison
[Figure: Class A (liver; samples L1, L2, L3) vs. Class B (kidney; samples K1, K2, K3) leads to differentially expressed genes (e.g. Fxyd2, Trf) and a functional characterization of the tissues.]
Experimental Design: Study objectives (continued)
- class prediction: classification (e.g. good vs. bad prognosis for cancer patients)
Experimental Design: Class Prediction
[Figure: patients P1 to P7. Class A (bad prognosis) defines Pattern A, Class B (good prognosis) defines Pattern B. Question for an unclassified sample: is it more like Pattern A or Pattern B?]
Experimental Design: Study objectives (continued)
- class discovery: clustering (e.g. finding new subtypes of a disease)
Experimental Design: Class Discovery
[Figure: clustered heatmap of AML and ALL patient samples (P1 to P47, columns) against probe sets (rows); color key: log2 ratio. Eisen et al. 1998]
Experimental Design: Class Discovery
[Figure: clustered heatmap of the patient samples (P1 to P47) against probe sets, with the sample clusters labeled AML, ALL, Subtype A and Subtype B; color key: log2 ratio. Eisen et al. 1998]
Experimental Design: Sources of variation
1. Biological variation (use replication)
   - genetic variation
   - environmental variation
2. Technical variation (minimize & randomize)
   - RNA source and RNA isolation
   - labeling, dyes and hybridization
   - array design and batch
   - experimenter
3. Measurement error
   - reading fluorescent signals
Experimental Design: Biological replicates
Aim: increase precision and estimate error.
- The biological variation within one group must be known in order to assign significance to the variation between groups.
- The number of replicates needed depends on:
  - statistical power: false positives, false negatives
  - experimental variation (platform-dependent)
  - biological variation (species- and tissue-dependent)
  - biological effect (larger changes are easier to find)
Experimental Design: Layers of design
1. Experimental units: biological replicates
   - e.g. mice in different treatment groups
   - samples should be representative of the population
   - treatments should be assigned randomly
2. Technical replicates
   - two independent RNA extractions or two aliquots of the same extraction
   - in two-color designs: assign to different dyes
3. Arrayed elements
   - e.g. duplicate spots for each probe
Experimental Design: Array controls
- Positive biological controls: genes whose regulation is known
  - check on the biological experiment & data analysis
- Positive technical controls: spikes in mRNA and/or hyb mix
  - check labeling procedure and hybridization
  - detection range (sensitivity) and dynamic range
  - landmarks for gridding software
- Negative controls: non-specific binding
  - check cross-hybridization: buffer, non-homologous DNA
Experimental Design: Rule of thumb...
- two-class or multiclass experiment
- paired or unpaired samples
- differential gene expression (n = 5-25 subjects/group)
- classification (n >> 25 per group)
- cell lines: under very controlled conditions, n = 3 may be enough
Experimental Design: Limitations
- By profiling mRNA you don't look (directly) at regulation at the protein level:
  - protein modification
  - protein turnover
  - protein complexes
- splice forms encoding different proteins
- RNA from different cellular compartments
- detection of lowly expressed transcripts
- you only detect transcripts for which there are (good) probes on the array
Experimental Design: Think before you start
- Think ahead about the final data analysis when you plan the experiment!
- Involve statisticians in your experimental design or they'll give you trouble later!
- If cost is an issue, limit your question: reduce the number of groups, not the number of arrays per group!
Preprocessing: Experimental cycle
Preprocessing: Preprocessing steps
- Image analysis
- Log2 transformation
- Background correction
- Normalization
- Quality control
Preprocessing: From image to numerical data
[Figure: scanned array image, (a) total and (b) detail]
- Segmentation: spot detection in a given grid (fixed circle model)
- Quantization: compute numerical red and/or green intensity values for each spot
- Well-established (commercial) software is available for fully automatic processing!
Preprocessing: Log2 transformation
[Figure: density of the intensities on the original scale and on the log2 scale]
Statistical effects:
- Approximately normally distributed data (an assumption of the t-test)
- Variance stabilization: variation in intensities typically grows with the average intensity, so large intensities tend to be more variable (multiplicative noise)
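The variance-stabilizing effect can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the intensity levels (100 and 10000), the noise level and the function name `simulate_spot` are assumptions chosen for illustration, and the noise is assumed to be purely multiplicative (log-normal).

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_spot(true_level, n=2000, noise_sd=0.2):
    """Simulate n readings of one spot with multiplicative (log-normal) noise."""
    return [true_level * math.exp(random.gauss(0.0, noise_sd)) for _ in range(n)]


low = simulate_spot(100.0)      # weakly expressed gene (assumed level)
high = simulate_spot(10000.0)   # strongly expressed gene (assumed level)

# On the original scale the spread grows with the intensity ...
sd_raw_low = statistics.stdev(low)
sd_raw_high = statistics.stdev(high)

# ... while on the log2 scale both genes show a similar spread.
sd_log_low = statistics.stdev([math.log2(x) for x in low])
sd_log_high = statistics.stdev([math.log2(x) for x in high])

print(f"raw  SD: low={sd_raw_low:.1f}  high={sd_raw_high:.1f}")
print(f"log2 SD: low={sd_log_low:.3f}  high={sd_log_high:.3f}")
```

Real array data also contain additive background noise, so the stabilization is usually less clean at low intensities than in this idealized setting.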
Preprocessing: Normalization
- What is normalization?
- Normalization: why?
- Normalization: how?
Preprocessing: What is Normalization?
- Broad question: How do we compare results across microarrays?
- Focused goal: Getting the numbers (quantification) from one microarray to mean the same as the numbers from another microarray.
Preprocessing: What is Normalization?
- attempt to correct for systematic bias in the data
- remove the impact of non-biological influences on biological data
- allow comparison of data from one array to another:
  - red versus green on one array
  - intensities or ratios from several arrays
Preprocessing: Why is Normalization an Issue?
- amount of RNA
- efficiencies of RNA extraction, reverse transcription, labeling, photo-detection
- PCR yield
- DNA quality
- variation that is obscuring as opposed to interesting
Raw data are not mRNA concentrations!
- RNA degradation
- tissue contamination
- amplification and hybridization efficiency/specificity
- ...
Preprocessing: Displaying variability in Microarray Data
[Figure: box plots of the unnormalized data; y-axis: log2 signal (6 to 18), x-axis: sample no. (1 to 3). Each box plot marks the maximum, Q3 (75 %), median, Q1 (25 %) and minimum.]
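The five values a box plot displays can be computed directly. The following is a rough sketch with made-up log2 signals; the helper name `five_number_summary` is hypothetical, and the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`, which is only one of several common quartile conventions.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) of a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical log2 signals of one sample (not taken from the slides).
log2_signal = [6.0, 7.0, 8.0, 10.0, 12.0, 14.0, 18.0]
summary = five_number_summary(log2_signal)
print(summary)
```

Comparing these summaries across samples shows at a glance whether the intensity distributions differ before normalization.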
Preprocessing: Quantile Normalization Procedure
1. Assume that the distributions of probe intensities should be exactly the same across samples/microarrays.
2. Start with n samples and m genes, and form an m × n matrix X.
3. Sort each column of X, so that the entries in a given row correspond to a fixed quantile.
4. Replace all entries in each row with their mean value.
5. Undo the sort.
This procedure (sorting and averaging) is comparatively fast.
Preprocessing: Quantile Normalization
        Sample A  Sample B  Sample C
Gene1        100       200       140
Gene2         10        40       270
Gene3        100       120        70
Preprocessing: Quantile Normalization
Rank  Sample A     Sample B     Sample C     Mean
1      10 (Gene2)   40 (Gene2)   70 (Gene3)    40
2     100 (Gene1)  120 (Gene3)  140 (Gene1)   120
3     100 (Gene3)  200 (Gene1)  270 (Gene2)   190
Preprocessing: Quantile Normalization
Rank  Sample A     Sample B     Sample C     Mean
1      40 (Gene2)   40 (Gene2)   40 (Gene3)    40
2     120 (Gene1)  120 (Gene3)  120 (Gene1)   120
3     190 (Gene3)  190 (Gene1)  190 (Gene2)   190
Preprocessing: Quantile Normalization
        Sample A  Sample B  Sample C
Gene1        120       190       120
Gene2         40        40       190
Gene3        190       120        40
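The quantile normalization procedure and the worked example can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used in the course: the function name `quantile_normalize` is hypothetical, and ties (Gene1 and Gene3 in Sample A) are broken by input order, as in the tables above, whereas production implementations often average tied ranks.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes x samples matrix given as a list of rows."""
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Per sample: gene indices ordered by intensity.
    # sorted() is stable, so tied genes keep their input order.
    orders = [sorted(range(n_genes), key=col.__getitem__) for col in columns]

    # Mean of the values that share the same rank across all samples.
    rank_means = [
        sum(columns[j][orders[j][rank]] for j in range(n_samples)) / n_samples
        for rank in range(n_genes)
    ]

    # Undo the sort: each gene receives the mean belonging to its rank.
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        for rank, gene in enumerate(orders[j]):
            result[gene][j] = rank_means[rank]
    return result


# Rows: Gene1, Gene2, Gene3; columns: Sample A, Sample B, Sample C.
x = [
    [100, 200, 140],
    [10, 40, 270],
    [100, 120, 70],
]
normalized = quantile_normalize(x)
for gene, row in zip(["Gene1", "Gene2", "Gene3"], normalized):
    print(gene, row)
```

After normalization every sample contains exactly the same set of values (40, 120, 190); only their assignment to genes differs.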
Preprocessing: Quantile Normalization
[Figure: box plots of the quantile-normalized data; y-axis: log2 signal (6 to 18), x-axis: sample no. (1 to 3)]
Preprocessing: Normalization Remarks
- Many different normalization methods exist.
- It is difficult to test which method is best (a matter of taste).
- It is best to minimize the amount of normalization (loss of biological information is possible).
- Further information: Smyth, G. K., and Speed, T. P. (2003). Normalization of cDNA microarray data. Methods 31, 265-273.
Differential Expression: Class Comparison
Perhaps the most common use of microarrays is to determine which genes are differentially expressed between prespecified classes of samples. In general, we refer to this as the class comparison problem. Here, we start by looking at the simplest case:
Given microarray experiments on
- N_A samples of type A (e.g. liver)
- N_B samples of type B (e.g. kidney)
decide which of the G genes on the microarray are differentially expressed between the two groups.
Differential Expression: One gene approach
- Start analyzing microarrays with the one-gene-at-a-time approach.
- Look for a reasonable way to analyze the same problem when we only have one gene.
- Figure out how to adapt that method to thousands of genes.
Differential Expression: Student's t-test
The one-gene version of the class comparison problem with two classes simply asks: is this gene different in the two classes? A classic analytical method is Student's t-test. We start by estimating the mean and standard deviation in both classes (shown for class A; class B is analogous):
$$\hat{X}_A = \frac{1}{N_A}\sum_{i=1}^{N_A} x_i, \qquad \hat{S}_A^2 = \frac{1}{N_A - 1}\sum_{i=1}^{N_A} \left(x_i - \hat{X}_A\right)^2$$
Differential Expression: Weighted difference in means
Next, we pool the estimates of the standard deviation from the two groups:
$$\hat{S}_P^2 = \frac{(N_A - 1)\,\hat{S}_A^2 + (N_B - 1)\,\hat{S}_B^2}{N_A + N_B - 2}$$
The two-sample t-statistic is the difference in means, weighted by the pooled estimate of the standard deviation and the number of samples:
$$t = \frac{\hat{X}_B - \hat{X}_A}{\sqrt{\hat{S}_P^2 \left(1/N_A + 1/N_B\right)}}$$
Question: Why not just use the difference in means?
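The pooled two-sample t-statistic defined above can be computed for a single gene as follows. This is a minimal sketch: the function name `pooled_t_statistic` and the log2 expression values are made up for illustration and do not come from the course data.

```python
import math
import statistics


def pooled_t_statistic(group_a, group_b):
    """Two-sample t-statistic with pooled variance (class B minus class A)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = statistics.mean(group_a)
    mean_b = statistics.mean(group_b)
    var_a = statistics.variance(group_a)  # sample variance, divides by n - 1
    var_b = statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Hypothetical log2 expression values of one gene in three liver
# and three kidney samples.
liver = [8.0, 9.0, 10.0]
kidney = [11.0, 12.0, 13.0]
t = pooled_t_statistic(liver, kidney)
print(f"t = {t:.3f}")
```

For a whole array the same function would simply be applied gene by gene, which is the "one gene at a time" approach.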
Differential Expression: Why the standard deviation matters
[Figure: three density plots with SD = 1, SD = 0.5 and SD = 2]
Differential Expression: t-statistics
Three ways to get a larger t-statistic:
- bigger difference in means
- smaller standard deviation
- more samples
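The three effects can be checked numerically. The sketch below assumes two groups of equal size n that share the same standard deviation, in which case the t-statistic simplifies to difference / (SD · sqrt(2/n)); the function name `t_equal_groups` and all numbers are illustrative assumptions.

```python
import math


def t_equal_groups(mean_difference, sd, n_per_group):
    """t-statistic for two equally sized groups with a common standard deviation."""
    return mean_difference / (sd * math.sqrt(2.0 / n_per_group))


baseline = t_equal_groups(1.0, 1.0, 5)
bigger_difference = t_equal_groups(2.0, 1.0, 5)  # doubled difference in means
smaller_sd = t_equal_groups(1.0, 0.5, 5)         # halved standard deviation
more_samples = t_equal_groups(1.0, 1.0, 20)      # four times as many samples

for name, value in [
    ("baseline", baseline),
    ("bigger difference", bigger_difference),
    ("smaller SD", smaller_sd),
    ("more samples", more_samples),
]:
    print(f"{name:>17}: t = {value:.3f}")
```

In this simplified setting, doubling the difference, halving the standard deviation and quadrupling the sample size each double the t-statistic, which shows why extra replicates pay off more slowly than reducing noise.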
Differential Expression: What about p-values?
- Null hypothesis: the difference in mean expression between the two groups is zero.
- Two-sided alternative hypothesis: the difference in mean expression is non-zero.
- P-value = probability of seeing a t-statistic at least this extreme under the null hypothesis = area in both tails of the distribution.
Interpretation: If the null hypothesis is true and you repeat the same experiment many times (with the same number of samples in the two groups), the p-value is the proportion of times that you would expect to see a t-statistic at least this large in absolute value.
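The "repeat the experiment many times" interpretation can be mimicked by simulation. This is a sketch under stated assumptions rather than the usual way of obtaining p-values (normally one looks up the t-distribution): the observed values are made up, the null experiments are assumed to be normally distributed with equal variance in both groups, and the helper name `t_statistic` is hypothetical.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible


def t_statistic(a, b):
    """Pooled two-sample t-statistic (group b minus group a)."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Hypothetical observed experiment: one gene, three samples per group.
observed_t = t_statistic([8.0, 9.0, 10.0], [11.0, 12.0, 13.0])

# Repeat the experiment many times under the null hypothesis
# (both groups drawn from the same normal distribution).
n_repeats = 20000
extreme = 0
for _ in range(n_repeats):
    a = [random.gauss(0.0, 1.0) for _ in range(3)]
    b = [random.gauss(0.0, 1.0) for _ in range(3)]
    if abs(t_statistic(a, b)) >= abs(observed_t):
        extreme += 1

p_value = extreme / n_repeats
print(f"observed t = {observed_t:.3f}, simulated two-sided p = {p_value:.4f}")
```

The simulated proportion should land close to the two-sided tail area of a t-distribution with 4 degrees of freedom at the observed value, roughly 0.02.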
Differential Expression: Candidate List
ProbeName     GeneSymbol  FoldChange, log2  Tissue  P-Value
A_51_P498442  Slc34a1     15.9              Kidney  0.0039
A_51_P129731  Tmigd1      15.7              Kidney  0.0039
...
A_51_P108659  Pon1        12.9              Liver   0.031
A_51_P108659  Arg1        12.4              Liver   0.022
...
Typical cut-offs: FoldChange > 2, P-value < 0.05
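Applying the typical cut-offs to a result table might look like the sketch below. The gene names, fold changes and p-values are invented placeholders, the fold change is assumed to be given on the linear scale, and down-regulation (fold change below 1/2) is assumed to count as well; none of this is specified on the slide.

```python
import math

# Hypothetical results; fold_change is on the linear scale (B / A).
genes = [
    {"symbol": "geneA", "fold_change": 4.0, "p_value": 0.003},
    {"symbol": "geneB", "fold_change": 1.3, "p_value": 0.001},
    {"symbol": "geneC", "fold_change": 0.2, "p_value": 0.020},
    {"symbol": "geneD", "fold_change": 8.0, "p_value": 0.200},
]


def is_candidate(gene, min_fold_change=2.0, max_p_value=0.05):
    """Keep genes changed more than min_fold_change in either direction."""
    log2_fc = math.log2(gene["fold_change"])
    large_change = abs(log2_fc) > math.log2(min_fold_change)
    return large_change and gene["p_value"] < max_p_value


candidates = [g["symbol"] for g in genes if is_candidate(g)]
print(candidates)
```

Here geneB fails the fold-change cut-off and geneD fails the p-value cut-off, so only geneA and geneC remain on the candidate list.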
Differential Expression: Interpretation of your results
Search your gene list for:
- similar functions (GO)
- overrepresented pathways (KEGG)
- genomic hot-spots
- ...
Popular web tool: DAVID (http://david.abcc.ncifcrf.gov/tools.jsp)
Ref.: Huang et al., Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. (2009) Nat Protoc.
Differential Expression: Practical Solutions for MA Analysis
Many commercial software packages are available (e.g. GeneSpring, Partek).
But most people use R (www.cran.r-project.org):
- complete statistical package and programming language
- useful for all bioscience areas
- powerful graphics
- access to a fast-growing number of analysis packages
- a standard for data mining and biostatistical analysis
- technical advantages: free, open-source, available for all OSs
Further resources:
- www.bioconductor.org/
- manuals.bioinformatics.ucr.edu/home/R_BioCondManual
- simpleR - Using R for Introductory Statistics (Gentleman et al. 2005)
Summary
- Experimental design: Think before you start!
- Use replicates for statistical and biological reasons.
- Differential gene expression is defined by the difference in means (FoldChange) and p-values.
Further information & course material: ftp://www.microarrays.med.uni-goettingen.de/lehre
Questions?