10-810 /02-710 Computational Genomics Normalization
Genes and Gene Expression Technology Display of Expression Information
Yeast cell cycle expression [Figure: heat map with genes as rows and experiments (over time) as columns, shown relative to a baseline expression program; colors mark higher or lower expression compared to baseline] Spellman et al., Mol. Biol. Cell 1998
ROS response
Visualization: Relative vs. absolute expression
Exercising the Genome 600 Conditions/Mutations 6200 Genes Environment Single-gene Mutations
Using annotation databases Statistical tests to identify the overlap with various functional categories
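The slide does not name a specific test. One common choice for this kind of overlap is the hypergeometric test, sketched below in Python. The function name and the counts in the example are hypothetical; only the 6200 gene total echoes the yeast data set mentioned earlier.

```python
from math import comb

def hypergeom_pvalue(total_genes, category_genes, selected_genes, overlap):
    """Probability of seeing at least `overlap` category genes when
    `selected_genes` genes are drawn at random without replacement."""
    max_overlap = min(category_genes, selected_genes)
    all_draws = comb(total_genes, selected_genes)
    # Sum the upper tail P(X = k) for k = overlap .. max_overlap.
    tail = sum(
        comb(category_genes, k)
        * comb(total_genes - category_genes, selected_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / all_draws

# Hypothetical counts: 100 genes in the category, 50 selected genes, 10 in both.
p = hypergeom_pvalue(total_genes=6200, category_genes=100,
                     selected_genes=50, overlap=10)
```

A small p suggests the selected genes overlap the category more than expected by chance. When many categories are tested, the p-values would need a multiple testing correction.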
Genome-wide binding experiments (transcription factors) [Figure: matrix with genes as rows and transcription factors (TF1 ... TF8) as columns; each cell shows either no binding by this TF or probably bound by this TF] Lee et al., Science 2002
What you should know The basic idea behind microarray profiling The two different microarray technologies Pros and cons for each Noise factors in microarray experiments (more next)
Gene expression analysis
Gene Expression Analysis (each level: computational task; biological goal)
Model: information fusion; regulatory networks
Pattern Recognition: clustering, classification; functional assignment, response programs
Data Analysis: normalization, missing value estimation; differentially expressed genes
Experimental Design: array design, number of repeats; experiment selection
Experiment design A number of computational issues should be addressed: Selecting short subsequences for oligo arrays to minimize cross hybridizations Determining the number of replicates for each sample Sampling rates for time series experiments
Data analysis Normalization Combining results from replicates Identifying differentially expressed genes Dealing with missing values Static vs. time series
Typical experiment: replicates healthy cancer Technical replicates: same sample using multiple arrays Dye swap: reverse the color code between arrays Clinical replicates: samples from different individuals Many experiments have all three kinds of replicates
Normalization Single Channel Background Correction Probe Set Summarization (Affymetrix) Between Array Normalization Normalization Two Channel Array
Background Correction For Affymetrix arrays, the major normalization packages (e.g. MAS 5, dchip, RMA, gcrma) take different approaches to background correction. For Agilent arrays, the feature extraction software offers several options for background correction. Image from Agilent User Manual
Agilent Background Correction Table from Agilent User Manual
Normalization Single Channel Background Correction Between Array Normalization Normalization Probe Set Summarization (Affymetrix) Two Channel Array
In theory these really should have the same distribution. Image from Terry Speed's slides. Ideally, normalization should remove variation between microarrays that is unrelated to the biology of interest, without removing the variation due to the biology of interest.
Two repeats from Agilent mouse expression data
Normalization: Assumptions In order to normalize between arrays we need to make certain assumptions. The methods we will discuss assume (in decreasing order of assumption strength): 1. The set of values is exactly the same across arrays (though different genes may be assigned different values in each array). 2. Values are normally distributed with the same mean and variance across arrays. 3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).
Three Major Approaches to Between Array Normalization
Quantile Normalization: values are exactly the same between arrays (though different genes may be assigned different values in each array).
Scale Factor Normalization: values are normally distributed with the same mean and variance across arrays.
Invariant Set Normalization: some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).
Normalizing across arrays Consider the following two sets of values:
Let's put them together
Normalizing between arrays The first step in the analysis of microarray data in a given experiment is to normalize between the different arrays. Simple assumption: mRNA quantity is the same for all arrays.
$$M^j = \frac{1}{n}\sum_{i=1}^{n} y_i^j, \qquad M = \frac{1}{T}\sum_{j=1}^{T} M^j$$
where n and T are the total number of genes and arrays, respectively. M^j is known as the sample mean of array j. Next we transform each value to make all arrays have the same mean:
$$\hat{y}_i^j = y_i^j - M^j + M$$
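Mean normalization can be sketched in a few lines of Python: subtract each array's own mean and add back the mean of the array means. The function name and the toy values are hypothetical.

```python
def normalize_means(arrays):
    """Shift every array so that all arrays share the same mean."""
    array_means = [sum(a) / len(a) for a in arrays]        # one mean per array
    grand_mean = sum(array_means) / len(array_means)       # mean of the array means
    return [
        [y - m + grand_mean for y in a]
        for a, m in zip(arrays, array_means)
    ]

# Two toy arrays measuring the same three genes.
arrays = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
normalized = normalize_means(arrays)
# normalized == [[3.0, 4.0, 5.0], [2.0, 4.0, 6.0]]
```

Both arrays now have mean 4, but their spreads still differ.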
Normalizing the mean
Variance normalization In many cases normalizing the mean is not enough. We may further assume that the variance should be the same for each array. Implicitly we assume that the expression distribution is the same for all arrays (though different genes may change in each of the arrays).
$$V^j = \frac{1}{n}\sum_{i=1}^{n} \left(y_i^j - M^j\right)^2, \qquad V = \frac{1}{T}\sum_{j=1}^{T} V^j$$
Here V^j is the sample variance of array j. Next, we transform each value as follows:
$$\hat{y}_i^j = \left(y_i^j - M^j\right)\sqrt{\frac{V}{V^j}} + M$$
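A minimal Python sketch of mean and variance normalization follows. It assumes the deviations from each array mean are rescaled by the ratio of standard deviations, sqrt(V / V^j), so that every array ends up with variance V. That scaling and the function name are assumptions of this sketch, and every array is assumed to have nonzero variance.

```python
from math import sqrt

def normalize_mean_variance(arrays):
    """Give every array the common mean M and the common variance V."""
    means = [sum(a) / len(a) for a in arrays]
    variances = [
        sum((y - m) ** 2 for y in a) / len(a)
        for a, m in zip(arrays, means)
    ]
    M = sum(means) / len(means)            # common target mean
    V = sum(variances) / len(variances)    # common target variance
    return [
        [(y - m) * sqrt(V / v) + M for y in a]
        for a, m, v in zip(arrays, means, variances)
    ]

arrays = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
normalized = normalize_mean_variance(arrays)
```

After the transform both toy arrays have mean 4 and variance 5/3, although individual genes can still differ between arrays.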
Normalizing mean and variance
Mean Scaling Before After
Distribution is still not identical
Quantile Normalization Maps the distribution of expression values for every microarray experiment to the same target distribution. This is the approach of the RMA software from Terry Speed's group. Also an option in Agilent's GeneSpring software.
Mapping raw data to normalized data Image from Terry Speed's slides
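Normalized distribution For each array, sort all the values. Take the mean (or median) value in each position of the sorted arrays. Replace each value with the mean for its position.

Raw values (one row per gene, one column per array):
  A   B   C
  2   9   1
  7   2   2
  6   4   8
  3   8   5

Sorted within each array, with the mean of each position:
  A   B   C   Mean
  7   9   8   8
  6   8   5   6.33
  3   4   2   3
  2   2   1   1.67

Normalized: every value is replaced by the mean of its sorted position (8, 6.33, 3 or 1.67).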
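The procedure can be sketched in Python using the three arrays A, B and C from the example. The function name is hypothetical, ties within an array are broken by original order, and the mean (rather than the median) is used for each position.

```python
def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list per array."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Target distribution: mean of the values that share each rank.
    target = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n)]
    normalized = []
    for a in arrays:
        # Gene indices ordered from smallest to largest value in this array.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, gene in enumerate(order):
            out[gene] = target[rank]
        normalized.append(out)
    return normalized

A = [2, 7, 6, 3]
B = [9, 2, 4, 8]
C = [1, 2, 8, 5]
norm_A, norm_B, norm_C = quantile_normalize([A, B, C])
# norm_A is approximately [1.67, 8.0, 6.33, 3.0]
```

After normalization the three arrays contain exactly the same set of values; only the assignment of values to genes differs.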
Quantile Example Image from Terry Speed's slides
Invariant Set Normalization Normalization based on probes that do not change between arrays: Housekeeping Genes, Control Probes, Rank invariant genes. This is the approach of Wing Wong's dchip software
dchip Invariant Set Normalization Select one baseline array Default Median Intensity All other arrays are normalized against baseline Computationally finds rank invariant probes Different genes may be used for different arrays Running Median curve through rank invariant probes becomes the new y=x curve, all values adjust accordingly
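The first step, finding rank invariant probes, can be illustrated with a simplified Python sketch. This is not dchip's actual procedure: the fixed rank-shift threshold, the function names, and the toy values are assumptions, and fitting the running median curve through the selected probes is left out.

```python
def ranks(values):
    """Rank of each value within its array (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result

def rank_invariant_set(array, baseline, max_rank_shift=1):
    """Indices of probes whose rank differs by at most max_rank_shift
    between the array and the baseline array."""
    array_ranks, baseline_ranks = ranks(array), ranks(baseline)
    return [
        i for i in range(len(array))
        if abs(array_ranks[i] - baseline_ranks[i]) <= max_rank_shift
    ]

baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
array = [1.1, 2.2, 9.0, 4.1, 5.3, 6.2]   # probe 2 is strongly up-regulated
invariant = rank_invariant_set(array, baseline)
# invariant == [0, 1, 3, 4, 5]
```

The strongly changed probe is excluded, so it would not influence the normalization curve fitted through the remaining probes.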
dchip on Replicate Data Spline on All Unnormalized Data Running Median Line Normalized Data Q-Q Plot Image From Li and Wong, Genome Biology 2001
dchip on data with true differentially expressed genes Spline on All Unnormalized Data Running Median Line Normalized Data Q-Q Plot Differentially expressed genes are not selected as rank invariant. Image from Li and Wong, Genome Biology 2001
Normalizing Across Very Different Distributions can be Problematic May want to normalize subsets from different tissues separately Image From dchip manual
cDNA arrays: ratios healthy cancer In many experiments we are interested in the ratio between two samples. For example: - Cancer vs. healthy - Progression of disease (ratio to time point 0)
Transformation While ratios are useful, they are not symmetric. If R = 2*G then R/G = 2 while G/R = ½. This makes it hard to visualize the different changes. Instead, we use a log transform, and focus on the log ratio:
$$y_i = \log\frac{R_i}{G_i} = \log R_i - \log G_i$$
Empirical studies have also shown that in microarray experiments the log ratio of (most) genes tends to be normally distributed.
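A short Python check of the symmetry argument. Base 2 is an assumption here (the slide only says "log"), and the function name and intensities are hypothetical.

```python
from math import log2

def log_ratio(red, green):
    """Log ratio of the two channel intensities (base 2 assumed)."""
    return log2(red / green)

up = log_ratio(200.0, 100.0)     # R = 2*G  ->  1.0
down = log_ratio(100.0, 200.0)   # G = 2*R  -> -1.0
```

A twofold increase and a twofold decrease now have the same magnitude and opposite signs, which the raw ratios 2 and ½ do not.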
Normalizing between arrays: Locally weighted linear regression Normalizing the mean and the variance works well if the variance is independent of the measured value. However, this is not the case in gene expression. For microarrays it turns out that the variance is value dependent.
Locally weighted linear regression Instead of computing a single mean and variance for each array, we can compute different means and variances for different expression values. Given two arrays, R and G, or two channels on the same array, we plot on the x axis the log of their intensity and on the y axis their log ratio. We are interested in normalizing the average (log) expression ratio for the different intensity values.
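Computing local mean and variance Setting
$$m(x) = \frac{1}{k}\sum_{i:\,x_i = x} y_i, \qquad v(x) = \frac{1}{k}\sum_{i:\,x_i = x} \left(y_i - m(x)\right)^2$$
(with k the number of genes whose x value equals x) may work; however, it requires that many genes have the same x value, which is usually not the case. Instead, we can use a weighted sum where the weight decreases with the distance of the point from x:
$$m(x) = \frac{1}{\sum_i w(x_i)}\sum_i w(x_i)\, y_i, \qquad v(x) = \frac{1}{\sum_i w(x_i)}\sum_i w(x_i)\left(y_i - m(x)\right)^2$$
Each value is then transformed as
$$\hat{y}(x) = \left(y(x) - m(x)\right)\sqrt{\frac{V}{v(x)}} + M$$
where M and V are the overall mean and variance, as in the global normalization.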
Determining the weights There are a number of ways to determine the weights. Here we will use a Gaussian centered at x, such that
$$w(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - x_i)^2}{2\sigma^2}}$$
σ² is a parameter that should be selected by the user.
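A minimal Python sketch of the local normalization with Gaussian weights follows. Here xs are log intensities and ys are log ratios. The function names, the toy data, the target mean 0 and target variance 1, and the rescaling by sqrt(V / v(x)) are assumptions of this sketch.

```python
from math import exp, pi, sqrt

def gaussian_weight(x, xi, sigma2):
    """Gaussian weight of point xi for a window centered at x."""
    return exp(-(x - xi) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def local_mean_variance(x, xs, ys, sigma2):
    """Weighted mean m(x) and variance v(x) of ys around x."""
    weights = [gaussian_weight(x, xi, sigma2) for xi in xs]
    total = sum(weights)
    m = sum(w * y for w, y in zip(weights, ys)) / total
    v = sum(w * (y - m) ** 2 for w, y in zip(weights, ys)) / total
    return m, v

def normalize_locally(xs, ys, sigma2, target_mean=0.0, target_var=1.0):
    """Center and rescale each y by the local mean and variance at its x."""
    normalized = []
    for x, y in zip(xs, ys):
        m, v = local_mean_variance(x, xs, ys, sigma2)
        normalized.append((y - m) * sqrt(target_var / v) + target_mean)
    return normalized

xs = [6.0, 7.0, 8.0, 9.0, 10.0]      # log intensities (toy values)
ys = [0.9, 0.4, 0.5, -0.1, 0.2]      # log ratios (toy values)
normalized = normalize_locally(xs, ys, sigma2=1.0)
```

A larger σ² averages over a wider intensity range and approaches the single global mean and variance; a smaller σ² follows the local trend more closely but becomes noisier.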
Locally weighted regression: Results Original values normalized values
Oligo arrays: Negative values In many cases oligo arrays can return values that are less than 0 (Why?). There are a number of ways to handle these values. The most common is to threshold at a certain positive value. A more sophisticated way is to use the negative values to learn something about the variance of the specific gene.
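The thresholding option amounts to a floor on the measured values, sketched here in Python. The floor of 10 and the function name are hypothetical; a suitable value depends on the platform and preprocessing.

```python
def floor_values(values, floor=10.0):
    """Replace every value below the floor (including negatives) by the floor."""
    return [max(v, floor) for v in values]

raw = [-5.0, 3.0, 50.0, 1200.0]
floored = floor_values(raw)
# floored == [10.0, 10.0, 50.0, 1200.0]
```

All floored values are positive, so a log transform can be applied afterwards.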
Data analysis Normalization Combining results from replicates Identifying differentially expressed genes Dealing with missing values Static vs. time series
Motivation In many cases, identifying differentially expressed genes is the goal of the experiment. Such genes can be key to understanding what goes wrong, or gets fixed, under certain conditions (cancer, stress etc.). In other cases, these genes can be used as features for a classifier. These genes can also serve as a starting point for a model for the system being studied (e.g. cell cycle, pheromone response etc.).
Problems As mentioned in the previous lecture, differences in expression values can result from many different noise sources. Our goal is to identify the real differences, that is, differences that cannot be explained by the various errors introduced during the experimental phase. We need to understand the experimental protocol and also take into account the underlying biology / chemistry.
The wrong way During the early days (though some continue to do this today) the common method was to select genes based on their fold change between experiments. The common threshold was a fold change of 2 (an absolute log2 ratio of 1). Obviously this method is not perfect
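For illustration, the fold change rule can be written in a few lines of Python. The function name and the intensities are hypothetical.

```python
from math import log2

def fold_change_genes(healthy, cancer, min_abs_log2=1.0):
    """Indices of genes whose absolute log2 ratio reaches the threshold."""
    return [
        i for i, (h, c) in enumerate(zip(healthy, cancer))
        if abs(log2(c / h)) >= min_abs_log2
    ]

healthy = [10.0, 1000.0, 5000.0]
cancer = [25.0, 1500.0, 20000.0]
selected = fold_change_genes(healthy, cancer)
# selected == [0, 2]
```

Gene 0 is selected on the basis of a difference of only 15 intensity units. The rule applies the same cutoff at every intensity level and uses no information from replicates, so it cannot tell whether such a change exceeds the measurement noise.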
Significance bands for Affy arrays
Value dependent variance
Typical experiment: replicates healthy cancer Technical replicates: same sample using multiple arrays Dye swap: reverse the color code between arrays Clinical replicates: samples from different individuals Many experiments have all three kinds of replicates
What you should know The different noise factors that influence microarray results. The assumptions and corresponding methods for normalizing oligo (single channel) arrays. Methods for normalizing two channel (cDNA) arrays.