
10-810 / 02-710 Computational Genomics: Normalization

Genes and Gene Expression Technology Display of Expression Information

Yeast cell cycle expression
[Figure: expression heat map with genes as rows and experiments (over time) as columns, relative to a baseline expression program; one color marks higher expression compared to baseline, the other lower expression compared to baseline. Spellman et al., Mol. Biol. Cell 1998]

ROS response

Visualization: Relative vs. absolute expression

Exercising the Genome
[Figure: expression matrix of 6200 genes across 600 conditions/mutations (environment, single-gene mutations)]

Using annotation databases
Statistical tests to identify the overlap with various functional categories
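One commonly used test for this kind of overlap is the hypergeometric tail probability. The slide does not say which test is meant, so the choice of test and the function name `hypergeom_pvalue` are assumptions made for illustration. A minimal sketch:

```python
from math import comb

def hypergeom_pvalue(total_genes, category_size, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `total_genes`, of which `category_size` belong to the category."""
    max_overlap = min(category_size, selected)
    tail = sum(
        comb(category_size, k) * comb(total_genes - category_size, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / comb(total_genes, selected)

# Hypothetical numbers: 6200 genes, a category of 100 genes,
# 50 selected genes of which 10 fall in the category.
p = hypergeom_pvalue(6200, 100, 50, 10)
```

A small p-value suggests the selected set overlaps the category more than expected by chance. When many categories are tested, a multiple-testing correction would also be needed.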

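Genome wide binding experiments (transcription factors)
[Figure: binding matrix with genes as rows and transcription factors (TF1 ... TF8) as columns; one color marks genes probably bound by this TF, the other no binding by this TF. Lee et al., Science 2002]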

What you should know
- The basic idea behind microarray profiling
- The two different microarray technologies
- Pros and cons for each
- Noise factors in microarray experiments (more next)

Gene expression analysis

Gene Expression Analysis (levels, listed as computational task; biological goal)
- Model: information fusion; regulatory networks
- Pattern Recognition: clustering, classification; functional assignment, response programs
- Data Analysis: normalization, missing value estimation; differentially expressed genes
- Experimental Design: array design, number of repeats; experiment selection

Experiment design
A number of computational issues should be addressed:
- Selecting short subsequences for oligo arrays to minimize cross hybridization
- Determining the number of replicates for each sample
- Sampling rates for time series experiments

Data analysis
- Normalization
- Combining results from replicates
- Identifying differentially expressed genes
- Dealing with missing values
- Static vs. time series

Typical experiment: replicates (healthy vs. cancer)
- Technical replicates: same sample using multiple arrays
- Dye swap: reverse the color code between arrays
- Clinical replicates: samples from different individuals
Many experiments have all three kinds of replicates.

Normalization
- Single channel: background correction, probe set summarization (Affymetrix), between-array normalization
- Two-channel array

Background Correction
- On Affymetrix arrays, the major normalization packages (e.g. MAS 5, dChip, RMA, GCRMA) take different approaches to background correction
- On Agilent arrays, the Feature Extraction software offers several options for background correction
[Image from Agilent User Manual]

Agilent Background Correction Table from Agilent User Manual

Normalization
- Single channel: background correction, between-array normalization, probe set summarization (Affymetrix)
- Two-channel array

In theory these really should have the same distribution. [Image from Terry Speed's slides]
Ideally, normalization should remove the variation between arrays that is unrelated to the biology of interest, without removing the variation that is due to the biology of interest.

Two repeats from Agilent mouse expression data

Normalization: Assumptions
In order to normalize between arrays we need to make certain assumptions. The methods we will discuss assume (in decreasing order of assumption strength):
1. Values are exactly the same between the arrays (though different genes may be assigned different values in each array).
2. Values are normally distributed with the same mean and variance across arrays.
3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).

Three Major Approaches to Between Array Normalization
- Quantile Normalization: 1. Values are exactly the same between arrays (though different genes may be assigned different values in each array).
- Scale Factor Normalization: 2. Values are normally distributed with the same mean and variance across arrays.
- Invariant Set Normalization: 3. Some of the genes do not change between arrays and thus should have relatively similar values (rank invariant).

Normalizing across arrays Consider the following two sets of values:

Let's put them together

Normalizing between arrays
The first step in the analysis of microarray data in a given experiment is to normalize between the different arrays.
Simple assumption: mRNA quantity is the same for all arrays.
$$M^j = \frac{1}{n} \sum_{i=1}^{n} y_i^j$$
where n and T are the total number of genes and arrays, respectively. $M^j$ is known as the sample mean of array j.
Next we transform each value to make all arrays have the same mean:
$$\hat{y}_i^j = y_i^j - M^j + M, \qquad M = \frac{1}{T} \sum_{j=1}^{T} M^j$$
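The mean transformation can be sketched in a few lines. The function name `normalize_mean` and the list-of-lists layout (one inner list per array) are assumptions made for illustration, not part of the lecture.

```python
def normalize_mean(arrays):
    """arrays: list of T lists, each holding the n values y_i^j of one array."""
    array_means = [sum(a) / len(a) for a in arrays]       # M^j for each array
    global_mean = sum(array_means) / len(array_means)     # M, the mean of the M^j
    # Shift every array so that its mean equals the global mean M
    return [
        [y - m + global_mean for y in a]
        for a, m in zip(arrays, array_means)
    ]

# Two toy arrays with means 2 and 6; both end up with mean 4
shifted = normalize_mean([[1.0, 3.0], [5.0, 7.0]])
```

Only the location of each array changes; the spread of values within an array is left untouched.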

Normalizing the mean

Variance normalization
In many cases normalizing the mean is not enough. We may further assume that the variance should be the same for each array. Implicitly we assume that the expression distribution is the same for all arrays (though different genes may change in each of the arrays).
$$V^j = \frac{1}{n} \sum_{i=1}^{n} (y_i^j - M^j)^2, \qquad V = \frac{1}{T} \sum_{j=1}^{T} V^j$$
Here $V^j$ is the sample variance of array j. Next, we transform each value by centering on the array mean, rescaling by the ratio of standard deviations, and shifting to the global mean:
$$\hat{y}_i^j = (y_i^j - M^j) \sqrt{\frac{V}{V^j}} + M$$
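A minimal sketch of mean and variance normalization follows. It assumes the rescaling uses the ratio of standard deviations, so that every array ends up with variance V. The function name and data layout are illustrative assumptions.

```python
from math import sqrt

def normalize_mean_variance(arrays):
    """arrays: list of T lists, each holding the n values of one array.
    Raises ZeroDivisionError if an array has zero variance."""
    means = [sum(a) / len(a) for a in arrays]                        # M^j
    variances = [
        sum((y - m) ** 2 for y in a) / len(a)                        # V^j
        for a, m in zip(arrays, means)
    ]
    global_mean = sum(means) / len(means)                            # M
    global_var = sum(variances) / len(variances)                     # V
    return [
        [(y - m) * sqrt(global_var / v) + global_mean for y in a]
        for a, m, v in zip(arrays, means, variances)
    ]

# Toy arrays with (mean, variance) = (1, 1) and (12, 4); targets are M = 6.5, V = 2.5
scaled = normalize_mean_variance([[0.0, 2.0], [10.0, 14.0]])
```

After the transformation both toy arrays share the same mean and variance, although their full distributions may still differ.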

Normalizing mean and variance

Mean Scaling
[Figure: distributions before and after]

Distribution is still not identical

Quantile Normalization
- Maps the distribution of expression values for every microarray experiment to the same target distribution
- This is the approach of the RMA software from Terry Speed's group
- Also an option in Agilent's GeneSpring software

Mapping raw data to normalized data
[Image from Terry Speed's slides]

Normalized distribution
- For each array, order all the values in the array
- Take the mean (or median) value in each position

Original values:
A  B  C
3  8  5
6  4  8
7  2  2
2  9  1

Sorted:
A  B  C
2  2  1
3  4  2
6  8  5
7  9  8

Mean of each sorted position (the normalized values): 1.67, 3, 6.33, 8
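The two steps above can be sketched as follows, using the three arrays A, B and C from the example. The row order of the unsorted values and the position-based handling of ties are assumptions. RMA and GeneSpring have their own implementations, which this sketch does not reproduce.

```python
def quantile_normalize(columns):
    """columns: dict mapping array name -> list of values (equal lengths).
    Each value is replaced by the mean, across arrays, of the values
    that share its rank."""
    n = len(next(iter(columns.values())))
    sorted_cols = [sorted(values) for values in columns.values()]
    # Target distribution: mean of the r-th smallest value over all arrays
    target = [sum(col[r] for col in sorted_cols) / len(sorted_cols) for r in range(n)]
    normalized = {}
    for name, values in columns.items():
        order = sorted(range(n), key=lambda i: values[i])  # indices by ascending value
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = target[rank]  # ties are broken by position (an assumption)
        normalized[name] = out
    return normalized

data = {"A": [3, 6, 7, 2], "B": [8, 4, 2, 9], "C": [5, 8, 2, 1]}
result = quantile_normalize(data)
```

After normalization every array contains exactly the same set of values (1.67, 3, 6.33, 8 up to rounding), assigned to genes according to their rank within the array.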

Quantile Example
[Image from Terry Speed's slides]

Invariant Set Normalization
Normalization based on probes that do not change between data sets:
- Housekeeping genes
- Control probes
- Rank invariant genes
This is the approach of Wing Wong's dChip software.

dChip Invariant Set Normalization
- Select one baseline array (default: the array with median intensity)
- All other arrays are normalized against the baseline
- Computationally finds rank invariant probes
- Different genes may be used for different arrays
- The running median curve through the rank invariant probes becomes the new y=x curve, and all values are adjusted accordingly
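The idea of rank invariant probes can be illustrated with a one-pass filter on rank differences. This is a simplified sketch, not dChip's actual selection procedure. The function name and the `max_rank_shift` threshold are hypothetical, and the running median curve fitting is not shown.

```python
def rank_invariant_set(baseline, array, max_rank_shift=0.05):
    """Return indices of probes whose rank differs between `baseline` and
    `array` by at most max_rank_shift * n positions (assumed criterion)."""
    n = len(baseline)

    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rank_base, rank_arr = ranks(baseline), ranks(array)
    return [i for i in range(n) if abs(rank_base[i] - rank_arr[i]) <= max_rank_shift * n]

# Toy data: the array is the baseline scaled by 2, except probe 0, which changes strongly
baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
array = [100.0] + [2.0 * v for v in baseline[1:]]
invariant = rank_invariant_set(baseline, array, max_rank_shift=0.1)
```

In the toy data the strongly changed probe is excluded, while the probes that only differ by the scale factor are kept and could then be used to fit the normalization curve.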

dChip on Replicate Data
[Figure panels: Spline on All; Unnormalized Data; Running Median Line; Normalized Data; Q-Q Plot. Image from Li and Wong, Genome Biology 2001]

dChip on data with true differentially expressed genes
[Figure panels: Spline on All; Unnormalized Data; Running Median Line; Normalized Data; Q-Q Plot. Image from Li and Wong, Genome Biology 2001]
Differentially expressed genes are not selected as rank invariant.

Normalizing Across Very Different Distributions can be Problematic
May want to normalize subsets from different tissues separately.
[Image from dChip manual]

cDNA arrays: ratios (healthy vs. cancer)
In many experiments we are interested in the ratio between two samples. For example:
- Cancer vs. healthy
- Progression of disease (ratio to time point 0)

Transformation
While ratios are useful, they are not symmetric. If R = 2*G then R/G = 2 while G/R = ½. This makes it hard to visualize the different changes.
Instead, we use a log transform, and focus on the log ratio:
$$y_i = \log \frac{R_i}{G_i} = \log R_i - \log G_i$$
Empirical studies have also shown that in microarray experiments the log ratio of (most) genes tends to be normally distributed.
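A short sketch of why the log ratio is symmetric. Base 2 is assumed here, so that a two-fold change corresponds to a log ratio of 1. The function name `log_ratio` is illustrative.

```python
from math import log2

def log_ratio(red, green):
    """Log ratio y = log2(R / G) = log2(R) - log2(G); intensities must be positive."""
    return log2(red) - log2(green)

up = log_ratio(2.0, 1.0)     # R is twice G
down = log_ratio(1.0, 2.0)   # G is twice R
```

A two-fold increase and a two-fold decrease give values of equal magnitude and opposite sign, whereas the raw ratios 2 and ½ sit at unequal distances from 1.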

Normalizing between arrays: Locally weighted linear regression
Normalizing the mean and the variance works well if the variance is independent of the measured value. However, this is not the case in gene expression: for microarrays it turns out that the variance is value dependent.

Locally weighted linear regression
Instead of computing a single mean and variance for each array, we can compute different means and variances for different expression values.
Given two arrays, R and G, or two channels on the same array, we plot the (log) intensity on the x axis and the (log) ratio on the y axis.
We are interested in normalizing the average (log) expression ratio for the different intensity values.

Computing local mean and variance
Setting
$$m(x) = \frac{1}{k} \sum_{x_i = x} y_i, \qquad v(x) = \frac{1}{k} \sum_{x_i = x} (y_i - m(x))^2$$
(where k is the number of genes with $x_i = x$) may work; however, it requires that many genes have the same x value, which is usually not the case.
Instead, we can use a weighted sum where the weight decreases with the distance of the point from x:
$$m(x) = \frac{1}{\sum_i w(x_i)} \sum_i w(x_i)\, y_i, \qquad v(x) = \frac{1}{\sum_i w(x_i)} \sum_i w(x_i)\, (y_i - m(x))^2$$
Each value is then normalized using the local mean and variance, where M and V are the target mean and variance:
$$\hat{y}(x) = (y(x) - m(x)) \sqrt{\frac{V}{v(x)}} + M$$

Determining the weights
There are a number of ways to determine the weights. Here we will use a Gaussian centered at x, such that
$$w(x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - x_i)^2}{2\sigma^2}}$$
$\sigma^2$ is a parameter that should be selected by the user.
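Putting the Gaussian weights together with the local mean and variance gives the sketch below. It assumes the target mean M and variance V are passed in by the caller and that rescaling uses the ratio of standard deviations. The function names are illustrative, and the quadratic-time loop is chosen for clarity rather than speed.

```python
from math import exp, sqrt

def gaussian_weight(x, x_i, sigma):
    # The constant 1 / sqrt(2 * pi * sigma^2) is omitted because it cancels
    # in the weighted averages below.
    return exp(-((x - x_i) ** 2) / (2 * sigma ** 2))

def local_normalize(xs, ys, sigma, target_mean, target_var):
    """xs: (log) intensities, ys: (log) ratios. Returns the normalized ys.
    Raises ZeroDivisionError if a local variance is zero."""
    normalized = []
    for x, y in zip(xs, ys):
        weights = [gaussian_weight(x, x_i, sigma) for x_i in xs]
        total = sum(weights)
        m = sum(w * y_i for w, y_i in zip(weights, ys)) / total             # m(x)
        v = sum(w * (y_i - m) ** 2 for w, y_i in zip(weights, ys)) / total  # v(x)
        normalized.append((y - m) * sqrt(target_var / v) + target_mean)
    return normalized

# With a very wide Gaussian all weights are nearly equal, so the result
# approaches global mean and variance normalization.
flat = local_normalize([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 1.0, 3.0],
                       sigma=1e6, target_mean=0.0, target_var=1.0)
```

A small sigma makes the correction follow local trends closely but estimates each local mean from few points; a large sigma approaches the global normalization described earlier.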

Locally weighted regression: Results
[Figure: original values and normalized values]

Oligo arrays: Negative values
In many cases oligo arrays can return values that are less than 0 (Why?)
There are a number of ways to handle these values:
- The most common is to threshold at a certain positive value
- A more sophisticated way is to use the negative values to learn something about the variance of the specific gene
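Thresholding at a positive value can be sketched as below. The floor of 10.0 is a hypothetical choice, since the slides do not specify a value.

```python
def floor_expression(values, floor=10.0):
    """Replace every value below `floor` (including negative values) with `floor`."""
    return [max(v, floor) for v in values]

floored = floor_expression([-5.0, 3.0, 250.0], floor=10.0)
```

Keeping all values positive allows a later log transform, at the cost of discarding whatever information the low and negative values carry about noise.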

Data analysis
- Normalization
- Combining results from replicates
- Identifying differentially expressed genes
- Dealing with missing values
- Static vs. time series

Motivation
- In many cases, identifying differentially expressed genes is the goal of the experiment.
- Such genes can be key to understanding what goes wrong, or gets fixed, under certain conditions (cancer, stress etc.).
- In other cases, these genes can be used as features for a classifier.
- These genes can also serve as a starting point for a model of the system being studied (e.g. cell cycle, pheromone response etc.).

Problems
- As mentioned in the previous lecture, differences in expression values can result from many different noise sources.
- Our goal is to identify the real differences, that is, differences that cannot be explained by the various errors introduced during the experimental phase.
- We need both to understand the experimental protocol and to take into account the underlying biology and chemistry.

The wrong way
During the early days (though some continue to do this today) the common method was to select genes based on their fold change between experiments. The common threshold was a fold change of 2 (an absolute log2 ratio of 1). Obviously this method is not perfect.
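A minimal sketch of fold change selection, with hypothetical intensity values that show one of its weaknesses. The function name and the numbers are assumptions made for illustration.

```python
from math import log2

def fold_change_genes(expr_a, expr_b, threshold=1.0):
    """Return indices of genes with |log2(a / b)| >= threshold.
    The default threshold of 1.0 corresponds to a two-fold change."""
    return [
        i for i, (a, b) in enumerate(zip(expr_a, expr_b))
        if abs(log2(a / b)) >= threshold
    ]

# Gene 0: low intensities (5 vs. 2), where noise is relatively large.
# Gene 1: high intensities (1900 vs. 1000), just under two-fold.
selected = fold_change_genes([5.0, 1900.0], [2.0, 1000.0])
```

The low-intensity gene passes the cutoff while the high-intensity gene does not, although the first difference could be noise. A fixed cutoff ignores the fact that the variance depends on the measured value.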

Significance bands for Affy arrays

Value dependent variance

What you should know
- The different noise factors that influence microarray results
- The assumptions and corresponding methods for normalizing oligo (single channel) arrays
- Methods for normalizing two channel (cDNA) arrays