Introduction to GE Microarray data analysis Practical Course MolBio 2012


Claudia Pommerenke, Nov-2012
Transkriptomanalyselabor (TAL), Microarray and Deep Sequencing Core Facility Göttingen
University Medical Center Göttingen

Outline

1. Experimental Design
   - Research Question
   - Controls & Replicates
2. Preprocessing
   - Image Analysis
   - Normalization
3. Differential Expression
   - Student's t-test
   - Gene List Analysis
   - Practical Solutions
4. Summary

Experimental Design: Think before you start!

- Research Question
- Choice of Technology
- Controls & Replicates

Reference: Churchill (2002). Fundamentals of experimental design for cDNA microarrays. Nature Genetics, Supplement 32: 490-495.

Experimental Design: Study objectives

- Class comparison: differential expression (e.g. Liver vs. Kidney)
- Class prediction: classification (e.g. good vs. bad prognosis for cancer patients)
- Class discovery: clustering (e.g. find new subtypes of disease)

Experimental Design: Class Comparison

[Figure: Class A (Liver: L1, L2, L3) vs. Class B (Kidney: K1, K2, K3) gives differentially expressed genes (e.g. Fxyd2, Trf), used for the functional characterization of tissues.]

Experimental Design: Class Prediction

[Figure: Class A (bad prognosis: P1, P2, P3) shows Pattern A; Class B (good prognosis: P4, P5, P6) shows Pattern B. Question for a further sample (P7): more like Pattern A or B?]

Experimental Design: Class Discovery

[Figure: clustered heatmap of log2 ratios (with color key) for patient samples P1 to P47 (columns) and probe sets (rows). The samples separate into AML (Subtype A) and ALL (Subtype B). Eisen et al. 1998]

Experimental Design: Sources of variation

1. Biological variation (use replication)
   - genetic variation
   - environmental variation
2. Technical variation (minimize & randomize)
   - RNA source and RNA isolation
   - labeling, dyes and hybridization
   - array design and batch
   - experimenter
3. Measurement error
   - reading fluorescent signals

Experimental Design: Biological replicates

- Aim: increase precision and estimate error
- The biological variation within one group must be known in order to assign significance to variation between groups
- The number of replicates needed depends on:
  - statistical power: false positives, false negatives
  - experimental variation (platform-dependent)
  - biological variation (species- and tissue-dependent)
  - biological effect (larger changes are easier to find)
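As a rough illustration of why replicates increase precision, the sketch below computes the standard error of the difference between two group means. It assumes (this is not stated in the slides) that both groups share the same standard deviation and have the same number of replicates; the function name and the value sd = 1.0 are illustrative only.

```python
import math


def se_of_mean_difference(sd, n_per_group):
    """Standard error of the difference between two group means.

    Assumes equal standard deviation `sd` and `n_per_group` replicates
    in each group (a simplifying assumption for illustration).
    """
    return sd * math.sqrt(2.0 / n_per_group)


# More replicates per group give a smaller standard error.
for n in (3, 5, 25):
    print(n, round(se_of_mean_difference(1.0, n), 3))
```

Under these assumptions the standard error shrinks with the square root of the group size, so halving it takes four times as many replicates.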

Experimental Design: Layers of design

1. Experimental units: biological replicates
   - e.g. mice in different treatment groups
   - samples should be representative of the population
   - treatments should be assigned randomly
2. Technical replicates
   - two independent RNA extractions, or two aliquots of the same extraction
   - in two-color designs: assign to different dyes
3. Arrayed elements
   - e.g. duplicate spots for each probe

Experimental Design: Array controls

- Positive biological controls: genes whose regulation is known
  - check on biological experiment & data analysis
- Positive technical controls: spikes in mRNA and/or hyb mix
  - check labeling procedure and hybridization
  - detection range (sensitivity) and dynamic range
  - landmarks for gridding software
- Negative controls: non-specific binding
  - check cross-hybridization: buffer, non-homologous DNA

Experimental Design: Rule of thumb

- two-class or multiclass experiment
- paired or unpaired samples
- differential gene expression: n ≈ 5-25 subjects/group
- classification: n >> 25 per group
- cell lines: under very controlled conditions, n = 3 may be enough

Experimental Design: Limitations

- By profiling mRNA you don't look (directly) at regulation at the protein level:
  - protein modification
  - protein turn-over
  - protein complexes
  - splice forms encoding different proteins
  - RNA from different cellular compartments
- Detection of lowly expressed transcripts
- You only detect transcripts for which there are (good) probes on the array

Experimental Design: Think before you start

- Think ahead about the final data analysis when you plan the experiment!
- Involve statisticians in your experimental design or they'll give you trouble later!
- If cost is an issue, limit your question: reduce the number of groups, not the number of arrays per group!

Preprocessing: Experimental cycle

[Figure: experimental cycle]

Preprocessing: Preprocessing steps

- Image analysis
- Log2 transformation
- Background correction
- Normalization
- Quality Control

Preprocessing: From image to numerical data

[Figure: scanned array image, (a) total and (b) detail]

- Segmentation: spot detection in a given grid (fixed circle model)
- Quantization: compute numerical red and/or green intensity values for each spot
- Well-established (commercial) software is available for fully automatic processing!

Preprocessing: Log2 transformation

[Figure: density plots of the intensities on the original scale and on the log2 scale]

Statistical effects:

- Approximately normally distributed data (an assumption of the t-test)
- Variance stabilization: the variation in intensities typically grows with the average intensity, so large intensities tend to be more variable (multiplicative noise)
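The variance-stabilizing effect can be sketched with simulated data. The example below assumes purely multiplicative (log-normal) noise, which is a simplification; the intensity levels, the noise size and the function name `simulate` are made up for illustration and do not come from the course data.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def simulate(true_level, n=2000, noise_sd=0.2):
    """Simulate n intensities around true_level with multiplicative noise."""
    return [true_level * math.exp(random.gauss(0.0, noise_sd)) for _ in range(n)]


low = simulate(100.0)      # a weakly expressed gene (hypothetical level)
high = simulate(10000.0)   # a strongly expressed gene (hypothetical level)

# On the original scale the spread grows with the intensity ...
sd_raw_low = statistics.stdev(low)
sd_raw_high = statistics.stdev(high)

# ... while on the log2 scale the spread is about the same for both genes.
sd_log_low = statistics.stdev([math.log2(x) for x in low])
sd_log_high = statistics.stdev([math.log2(x) for x in high])

print("original scale:", round(sd_raw_low, 1), round(sd_raw_high, 1))
print("log2 scale:    ", round(sd_log_low, 3), round(sd_log_high, 3))
```

Real array data also contain additive background noise, so the stabilization is usually less clean at low intensities than in this idealized simulation.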

Preprocessing: Normalization

- What is Normalization?
- Normalization: Why?
- Normalization: How?

Preprocessing: What is Normalization?

- Broad question: How do we compare results across microarrays?
- Focused goal: Getting numbers (quantification) from one microarray to mean the same as numbers from another microarray.

Preprocessing: What is Normalization?

- An attempt to correct for systematic bias in the data
- Removes the impact of non-biological influences on biological data
- Allows comparison of data from one array to another:
  - red versus green on one array
  - intensities or ratios from several arrays

Preprocessing: Why is Normalization an Issue?

- amount of RNA
- efficiencies of RNA extraction, reverse transcription, labeling, photo-detection
- PCR yield
- DNA quality
- variation that is obscuring as opposed to interesting

Raw data are not mRNA concentrations!

- RNA degradation
- tissue contamination
- amplification and hybridization efficiency/specificity
- ...

Preprocessing: Displaying variability in Microarray Data

[Figure: box plots of unnormalized data; y-axis: Log2 Signal (6 to 18), x-axis: Sample Nr. (1 to 3). Box plot legend: Maximum, Q3 = 75 %, Median, Q1 = 25 %, Minimum]
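The five values that a box plot displays can be computed directly. The following is a minimal sketch using Python's standard library; the log2 signal values are hypothetical, and the "inclusive" quartile method is one of several common conventions, so plotting software may give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) as shown in a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical log2 signals of one sample.
log2_signals = [6.5, 7.1, 7.8, 8.4, 9.0, 9.9, 11.2, 12.6, 15.3]
print(five_number_summary(log2_signals))
```

Comparing these summaries between samples shows the kind of shift between arrays that normalization is meant to remove.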

Preprocessing: Quantile Normalization Procedure

1. Assume that the distributions of probe intensities should be exactly the same across samples/microarrays.
2. Start with n samples and m genes, and form an m × n matrix X.
3. Sort each column of X, so that the entries in a given row correspond to a fixed quantile.
4. Replace all entries in each row with their mean value.
5. Undo the sort.

This procedure (sorting and averaging) is comparatively fast.

Preprocessing: Quantile Normalization (worked example)

Raw data:

         Sample A   Sample B   Sample C
Gene1       100        200        140
Gene2        10         40        270
Gene3       100        120         70

Each sample sorted, with the mean of each rank:

Rank   Sample A      Sample B      Sample C      Mean
1       10 (Gene2)    40 (Gene2)    70 (Gene3)     40
2      100 (Gene1)   120 (Gene3)   140 (Gene1)    120
3      100 (Gene3)   200 (Gene1)   270 (Gene2)    190

Each value replaced by the mean of its rank:

Rank   Sample A      Sample B      Sample C      Mean
1       40 (Gene2)    40 (Gene2)    40 (Gene3)     40
2      120 (Gene1)   120 (Gene3)   120 (Gene1)    120
3      190 (Gene3)   190 (Gene1)   190 (Gene2)    190

Sort undone (original gene order):

         Sample A   Sample B   Sample C
Gene1       120        190        120
Gene2        40         40        190
Gene3       190        120         40
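The sort, average and unsort steps can be sketched in a few lines of Python and checked against the three-gene example. This is a minimal illustration, not the implementation used by any particular package: the function name is made up, and ties within a sample (such as the two values of 100 in Sample A) are broken by original gene order, which is an assumption; other tie rules exist.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equally long columns (one per sample).

    Ties within a column are broken by original row order (stable sort);
    this tie rule is an assumption made for the sketch.
    """
    n_rows = len(columns[0])
    # For each sample: row indices ordered by increasing intensity.
    orders = [sorted(range(n_rows), key=col.__getitem__) for col in columns]
    # Mean of the values that share the same rank across all samples.
    rank_means = [
        sum(col[order[r]] for col, order in zip(columns, orders)) / len(columns)
        for r in range(n_rows)
    ]
    # Undo the sort: write each rank mean back to the gene's original row.
    normalized = []
    for order in orders:
        out = [0.0] * n_rows
        for rank, row in enumerate(order):
            out[row] = rank_means[rank]
        normalized.append(out)
    return normalized


# Rows are Gene1, Gene2, Gene3, as in the example table.
sample_a = [100, 10, 100]
sample_b = [200, 40, 120]
sample_c = [140, 270, 70]

result = quantile_normalize([sample_a, sample_b, sample_c])
for name, column in zip(("Sample A", "Sample B", "Sample C"), result):
    print(name, column)
```

After normalization every sample contains the same set of values (40, 120, 190); only their assignment to genes differs between samples.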

Preprocessing: Quantile Normalization

[Figure: box plots of quantile-normalized data; y-axis: Log2 Signal (6 to 18), x-axis: Sample Nr. (1 to 3)]

Preprocessing: Normalization Remarks

- Many different normalization methods exist
- It is difficult to test which method is the best (a matter of taste)
- It is best to minimize the amount of normalization (loss of biological information is possible)
- Further information: Smyth, G. K., and Speed, T. P. (2003). Normalization of cDNA microarray data. Methods 31, 265-273.

Differential Expression: Class Comparison

Perhaps the most common use of microarrays is to determine which genes are differentially expressed between prespecified classes of samples. In general, we refer to this as the class comparison problem. Here, we start by looking at the simplest case:

Given microarray experiments on

- N_A samples of type A (e.g. Liver)
- N_B samples of type B (e.g. Kidney)

decide which of the G genes on the microarray are differentially expressed between the two groups.

Differential Expression: One gene approach

- Start analyzing microarrays with the "one gene at a time" approach
- Look for a reasonable way to analyze the same problem when we only have one gene
- Figure out how to adapt that method to thousands of genes

Differential Expression: Student's t-test

The one-gene version of the class comparison problem with two classes simply asks: is this gene different in the two classes?

A classic analytical method is Student's t-test.

We start by estimating the mean and standard deviation in both classes (shown for class A, analogously for class B):

$$\hat{X}_A = \frac{1}{N_A} \sum_{i=1}^{N_A} x_i, \qquad \hat{S}_A^2 = \frac{1}{N_A - 1} \sum_{i=1}^{N_A} \left( x_i - \hat{X}_A \right)^2$$

Differential Expression: Weighted difference in means

Next, we pool the variance estimates from the two groups:

$$\hat{S}_P^2 = \frac{(N_A - 1)\,\hat{S}_A^2 + (N_B - 1)\,\hat{S}_B^2}{N_A + N_B - 2}$$

The two-sample t-statistic is the difference in means, weighted by the pooled estimate of the standard deviation and the number of samples:

$$t = \frac{\hat{X}_B - \hat{X}_A}{\sqrt{\hat{S}_P^2 \left( 1/N_A + 1/N_B \right)}}$$

Question: Why not just use the difference in means?

Differential Expression: Why the standard deviation matters

[Figure: density curves over the same x-axis range for SD = 1, SD = 0.5 and SD = 2]

Differential Expression: t-statistics

Three ways to get a larger t-statistic:

- Bigger difference in means
- Smaller standard deviation
- More samples
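The pooled two-sample t-statistic defined above, and the three ways of making it larger, can be tried out with a short sketch. The function name and all expression values are hypothetical; the computation follows the pooled-variance formula, with class B minus class A in the numerator.

```python
import math
import statistics


def pooled_t_statistic(group_a, group_b):
    """Two-sample t-statistic with pooled variance (B minus A)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance uses the N - 1 denominator, as in the slides.
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Hypothetical log2 expression values of one gene in two classes.
base = pooled_t_statistic([1, 2, 3], [3, 4, 5])
bigger_difference = pooled_t_statistic([1, 2, 3], [5, 6, 7])
smaller_sd = pooled_t_statistic([1.5, 2, 2.5], [3.5, 4, 4.5])
more_samples = pooled_t_statistic([1, 2, 3, 1, 2, 3], [3, 4, 5, 3, 4, 5])

print(base, bigger_difference, smaller_sd, more_samples)
```

Each of the three variations changes only one ingredient relative to the base case, and each yields a larger t-statistic, which is why the difference in means alone is not enough to judge a gene.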

Differential Expression: What about p-values?

- Null hypothesis: the difference in mean expression between the two groups is zero.
- Two-sided alternative hypothesis: the difference in mean expression is non-zero.
- P-value = probability of seeing a t-statistic at least this extreme under the null hypothesis = area in both tails of the distribution.

Interpretation: If you repeated the same experiment many times (with the same number of samples in the two groups) and the null hypothesis were true, the p-value is the proportion of times that you would expect to see a t-statistic at least this extreme.
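The "area in both tails" can be computed numerically. The sketch below integrates the density of Student's t distribution with Simpson's rule, using only the Python standard library; in practice a statistics package would be used instead. The function names are illustrative, and the degrees of freedom for the pooled two-sample test are taken to be N_A + N_B - 2.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=10000):
    """Area in both tails beyond |t|, via Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3  # area between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_area)


# Example: 3 samples per group gives 3 + 3 - 2 = 4 degrees of freedom.
print(two_sided_p_value(2.449, df=4))
```

Because the distribution is symmetric, the sketch integrates only from 0 to |t| and obtains the two-tailed area as one minus twice that central area.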

Differential Expression: Candidate List

ProbeName      GeneSymbol   FoldChange, log2   Tissue   P-Value
A_51_P498442   Slc34a1      15.9               Kidney   0.0039
A_51_P129731   Tmigd1       15.7               Kidney   0.0039
...
A_51_P108659   Pon1         12.9               Liver    0.031
A_51_P108659   Arg1         12.4               Liver    0.022
...

Typical cut-offs: FoldChange > 2, P-value < 0.05
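Applying such cut-offs to a result table amounts to a simple filter. In the sketch below, the first four rows are taken from the candidate list; the last row is a hypothetical gene added only to show a rejection. The slide does not say whether the fold-change cut-off refers to the linear ratio or to the log2 value, so the sketch assumes a linear fold change of 2, that is an absolute log2 fold change above 1.

```python
# (gene symbol, log2 fold change, p-value)
candidates = [
    ("Slc34a1", 15.9, 0.0039),
    ("Tmigd1", 15.7, 0.0039),
    ("Pon1", 12.9, 0.031),
    ("Arg1", 12.4, 0.022),
    ("GeneX", 0.4, 0.40),  # hypothetical row, not from the slides
]


def passes_cutoffs(log2_fc, p_value, min_abs_log2_fc=1.0, max_p=0.05):
    """True if a gene exceeds the fold-change cut-off and is below the p-value cut-off.

    min_abs_log2_fc=1.0 corresponds to a linear fold change of 2 (assumption).
    """
    return abs(log2_fc) > min_abs_log2_fc and p_value < max_p


selected = [symbol for symbol, fc, p in candidates if passes_cutoffs(fc, p)]
print(selected)
```

Requiring both conditions keeps genes whose change is large as well as unlikely under the null hypothesis; either criterion alone would admit more candidates.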

Differential Expression: Interpretation of your results

Searching your gene list for:

- similar functions (GO)
- overrepresented pathways (KEGG)
- genomic hot-spots
- ...

Popular web tool: DAVID (http://david.abcc.ncifcrf.gov/tools.jsp)

Ref.: Huang et al. (2009). Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protoc.
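One common way to quantify whether a pathway or function is overrepresented in a gene list is a hypergeometric tail probability. The sketch below shows that plain test; it is not necessarily what DAVID or any other tool computes, and the function name and all gene counts are hypothetical.

```python
from math import comb


def overrepresentation_p_value(n_universe, n_in_category, n_in_list, n_overlap):
    """P(overlap >= n_overlap) when n_in_list genes are drawn at random
    from n_universe genes, of which n_in_category belong to the category."""
    upper = min(n_in_category, n_in_list)
    total = comb(n_universe, n_in_list)
    tail = sum(
        comb(n_in_category, k) * comb(n_universe - n_in_category, n_in_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 genes on the array, 200 of them in a pathway,
# 100 genes in the candidate list, 8 of which belong to the pathway.
print(overrepresentation_p_value(20000, 200, 100, 8))
```

By chance only about one pathway gene would be expected in such a list (100 × 200 / 20000), so an overlap of 8 gives a very small probability.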

Differential Expression: Practical Solutions for MA Analysis

Much commercial software is available (e.g. GeneSpring, Partek).

But most people use R (www.cran.r-project.org):

- Complete statistical package and programming language
- Useful for all bioscience areas
- Powerful graphics
- Access to a fast-growing number of analysis packages
- Is a standard for data mining and biostatistical analysis
- Technical advantages: free, open-source, available for all OSs

Further resources:

- www.bioconductor.org/
- manuals.bioinformatics.ucr.edu/home/r BioCondManual
- simpleR - Using R for Introductory Statistics (Gentleman et al. 2005)

Summary

- Experimental design: Think before you start!
- Use replications for statistical and biological reasons
- Differential gene expression is defined by difference in means (FoldChange) and p-values

Further information & course material: ftp://www.microarrays.med.uni-goettingen.de/lehre

Questions?