DNA arrays and their various applications. Algorithmen der Bioinformatik II - SoSe 2007, Christoph Dieterich


1 Introduction

Motivation
A DNA microarray is a parallel approach to gene screening and target identification. Microarrays are now being applied to:
- disease characterization
- developmental biology
- pathway mapping
- mechanism-of-action studies and toxicology
These applications are in the domain of mRNA or gene expression profiling.

Motivation
Other, more recent applications include:
- Comparative genomic hybridization (array CGH): assessing large genomic rearrangements.
- SNP detection arrays: looking for single nucleotide polymorphisms in the genomes of populations.
- Chromatin immunoprecipitation (ChIP) studies: determining protein binding site occupancy throughout the genome.
Whole-genome tiling arrays are mostly used for these applications.

DNA arrays - gene expression profiling
Probe - A probe is a particular DNA sequence corresponding (complementary) to an mRNA.
Target - The complex mixture of nucleic acid species being tested.
We are given an unknown target nucleic acid sample, and the goal is to detect the identity and/or abundance of its constituents using known probe sequences. Single-stranded DNA probes are called oligonucleotides or oligos. There are two different formats of DNA chips:

Format I: The target ( bp) is attached to a solid surface and exposed to a set of probes, either separately or in a mixture. The earliest chips were of this kind, used for oligo-fingerprinting.

Format II: An array of probes is produced either in situ or by attachment. The array is then exposed to sample DNA. Examples are oligo arrays and cDNA microarrays.

Spotted microarrays
In spotted microarrays (or two-channel or two-colour microarrays), the probes are oligonucleotides, cDNA or small fragments of PCR products that correspond to mRNAs and are spotted onto the microarray surface. This type of array is typically hybridized with cDNA from two samples to be compared (e.g. diseased tissue versus healthy tissue) that are labeled with two different fluorophores (e.g. Cyanine 5 (Cy5, red) and Cyanine 3 (Cy3, green)).

Typical experiment
The specific interaction between probe and target species is based upon DNA hybridization. The relative abundance of individual target species can be measured by the ratio of dye intensities.

Spotted cDNA microarrays
Each spot contains identical cDNA clones, which represent a gene. (Such complementary DNA is obtained by reverse transcription from a known mRNA.) The target is the unknown mRNA extracted from a specific cell.

Oligonucleotide microarrays
In oligonucleotide microarrays (or single-channel microarrays), the probes are designed to match parts of the sequence of known or predicted mRNAs. These microarrays give estimates of the absolute level of gene expression; the comparison of two conditions therefore requires two separate microarrays. Affymetrix produces oligo arrays with the goal of capturing each coding region as specifically as possible. The length of the oligos is about 25 bases. The density of oligos on a chip can be very high, and a 1 cm x 1 cm chip can easily contain a very large number of oligo types. The chip contains both coding oligos and control oligos, the former corresponding to perfect matches to known targets and the controls corresponding to matches with one perturbed base. When reading the chip, hybridization levels at controls are subtracted from the levels of the match probes to reduce the number of false positives. Actual chip designs use 10 match and 10 mismatch probes for each target gene. Today, Affymetrix offers chips for almost every finished genome.

Manufacturing Oligo Arrays
1. Start with a matrix created over a glass substrate.
2. Each cell contains a growing chain of nucleotides that ends with a terminator that prevents chain extension.
3. Cover the substrate with a mask and then illuminate the uncovered cells, breaking the bonds between the chains and their terminators.
4. Expose the substrate to a solution of many copies of a specific nucleotide base, so that each of the unterminated chains is extended by one copy of that base and a new terminator.
5. Repeat using different masks.

Gene expression profiling
The existing methods for measuring gene expression are based on two biological assumptions:
1. The transcription level of genes indicates their regulation: since a protein is generated from a gene in a number of stages (transcription, splicing, synthesis of protein from mRNA), regulation of gene expression can occur at many points. However, we assume that most regulation happens during the transcription phase.
2. Only genes which contribute to organism fitness are expressed; in other words, genes that are irrelevant to the given cell under the given circumstances are not expressed.

Gene expression profiling
Genes affect the cell by being expressed, i.e. transcribed into mRNA and translated into proteins that react with other molecules. From the pattern of expression we may be able to deduce the function of an unknown gene. This is especially true if the pattern of expression of the unknown gene is very similar to the pattern of expression of a gene with known function. Also, the level of expression of a gene in different tissues and at different stages is of significant interest. Hence, it is highly interesting to analyze the expression profile of genes, i.e. in which tissues and at what stages of development they are expressed.

2 Analysis of Microarray Data

From raw to primary data
Generally three steps are necessary for the image analysis:
1. Addressing: assign the location of each spot center. Based on the gridding process, the coordinates of each spot are assigned. The algorithms for this step need to be robust and reproducible.
2. Segmentation: classification of each pixel as foreground (signal) or background (noise).
3. Information extraction: numerical values are computed. For each spot on the array (and for each label, if more than one is used) compute: mean signal intensity, mean background intensity, and a quality value.

Expression values of two-channel arrays
Let $F_{X,j}$ denote the set of foreground pixels in channel $X$ ($X = R$ for red, $X = G$ for green) of the $j$th probe (spot, gene). Similarly, let $B_{X,j}$ denote the set of background pixels in channel $X$ of the $j$th probe. Let $r_i$ and $g_i$ be the intensity of pixel $i$ in the red and green channel, respectively. Furthermore, let $R_f^j$ and $G_f^j$ be the mean foreground signals of the $j$th spot in the red and green channel, and $R_b^j$ and $G_b^j$ the corresponding mean background signals.

Expression values of two-channel arrays
These are computed as
$$R_f^j = \frac{\sum_{i \in F_{R,j}} r_i}{|F_{R,j}|} \quad (1) \qquad G_f^j = \frac{\sum_{i \in F_{G,j}} g_i}{|F_{G,j}|} \quad (2)$$

and
$$R_b^j = \frac{\sum_{i \in B_{R,j}} r_i}{|B_{R,j}|} \quad (3) \qquad G_b^j = \frac{\sum_{i \in B_{G,j}} g_i}{|B_{G,j}|} \quad (4)$$

Expression values of two-channel arrays
For the final expression value of a spot, the background signals are subtracted from the foreground signals:
$$R^j = R_f^j - R_b^j \quad (5) \qquad G^j = G_f^j - G_b^j \quad (6)$$
Care must be taken if $R_b^j > R_f^j$ and/or $G_b^j > G_f^j$. In this case, most image analysis programs return a flagged spot.

Expression values of two-channel arrays
Finally, both expression values are combined into a ratio or log ratio (commonly base 2):
$$e(j) = \log_2\left(\frac{R^j}{G^j}\right) \quad (7)$$
Thus $e(j)$ is the log-ratio expression value of the $j$th spot.

Expression values of one-channel arrays
The expression values for arrays with just one channel are computed similarly. Here we define $e(j)$ to be either the absolute expression intensity or its $\log_2$ value.

The expression matrix
Now that we have defined the expression value of a gene in a single array experiment, we turn to assembling all values of several array experiments into a common matrix.

Definition 1. The expression matrix of a microarray experiment consisting of $p$ arrays, where each array has $n$ genes, is an $n \times p$ matrix whose $ij$th cell contains the expression value of the $i$th gene on the $j$th hybridized array.

The expression matrix
Let us denote the expression profile of the $i$th gene $g_i$ by $e(g_i)$, and the expression value of the $i$th gene in the $j$th experiment by $e(g_{ij})$. Then we denote the mean expression of $g_i$ by
$$\bar{e}(g_i) = \frac{1}{p} \sum_{j=1}^{p} e(g_{ij}).$$
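As a concrete illustration of equations (1)-(7), here is a minimal sketch in Python (numpy assumed; the function name and the per-spot input layout are hypothetical, not from the original text):

```python
import numpy as np

def spot_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """Log-ratio expression value e(j) of one spot.

    Inputs are 1-D arrays of foreground/background pixel intensities
    per channel. Returns (e, flagged); the spot is flagged when the
    mean background exceeds the mean foreground in either channel.
    """
    R = red_fg.mean() - red_bg.mean()      # R^j = R_f^j - R_b^j, eqs (1),(3),(5)
    G = green_fg.mean() - green_bg.mean()  # G^j = G_f^j - G_b^j, eqs (2),(4),(6)
    if R <= 0 or G <= 0:
        return np.nan, True                # flagged spot
    return np.log2(R / G), False           # e(j) = log2(R^j / G^j), eq (7)
```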

Visualisation of gene expression data
A very important aspect of microarray data analysis is visualization. Visualization tools are primarily used to gain biologically important insights into the data. There are a number of approaches to the problem of visualizing microarray data, ranging from viewing the raw image data, to viewing profiles of genes across experiments, to using one of the many scatter plot variants. This section gives a short overview of common visualisation methods.

Scatterplot
In a scatterplot one distribution is plotted against another. Let $\log(X)$ and $\log(Y)$ denote the log values of distributions $X$ and $Y$. Then one plots $\log(Y)$ against $\log(X)$. A very typical application is to plot the intensity values ($\log_2$) of the green channel against those of the red channel.

MA-Plot
Here, rather than plotting $Y$ against $X$ or $\log(Y)$ against $\log(X)$, one plots
$$M = \log(Y/X) = \log(Y) - \log(X)$$
against
$$A = (\log(X) + \log(Y))/2.$$
For the two channels we thus get: $M = \log_2(R/G) = \log_2 R - \log_2 G$ is plotted against $A = (\log_2 R + \log_2 G)/2$.
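Computing the MA coordinates is a two-liner; a minimal sketch (numpy assumed, function name hypothetical):

```python
import numpy as np

def ma_transform(R, G):
    """Map background-corrected red/green intensities to MA coordinates:
    M = log2(R) - log2(G), A = (log2(R) + log2(G)) / 2."""
    logR = np.log2(np.asarray(R, float))
    logG = np.log2(np.asarray(G, float))
    return (logR + logG) / 2.0, logR - logG   # A, M
```

Plotting M against A then gives the MA-plot discussed next.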

MA-Plot
The MA-plot is in fact the original scatter plot rotated by 45 degrees counterclockwise, with subsequent scaling.

MA-Plot
The above example shows differences in the incorporation of the label: here the molecules in the green channel have higher intensities than their respective ones in the red channel.

Heatmap
One of the most popular tools for microarray data visualization is the heatmap (Eisen, 1998). Heatmaps, also known as intensity or matrix plots, present a tabular view of the expression matrix. Using some ordering, the primary data table is represented graphically by filling each cell with a color on the basis of the measured intensity ratio. Typically a single color gradient is used to visualize log-transformed expression ratios in a heatmap. That gradient is constructed from three colors, which usually are green, black and red. Colors from the green-to-black gradient represent negative log-ratios, while colors from the black-to-red gradient represent positive log-ratios. The closer a log-ratio is to 0, the darker the color; the closer a log-ratio is to an extremum, the more saturated the color.

Heatmap
Example:

Profile Plots
Profile plots show the expression profiles of the genes across experiments:

Normalisation
A microarray experiment is always a comparative experiment. Often one wishes, for example, to detect differentially expressed genes between two different conditions of an experiment. In order to reliably detect variation in expression that is the result of biological rather than technical variation, one needs to reduce the technical variation to a minimum. Furthermore, many analysis methods assume that the data come from a normal distribution. Thus a normalisation step, transforming the distribution of the data towards a normal distribution, is necessary.

Normalisation within an array
When conducting a two-color microarray experiment, one often observes differences in the incorporation of the labels, which leads to global intensity differences.

Global Normalisation
Generally we are searching for a function $l$, which depends on parameters such as intensity, location and array type, and transform
$$\log_2(R/G) \rightarrow \log_2(R/G) - l \quad (8)$$
We will look at three strategies:
1. global scaling
2. linear regression
3. non-linear regression

Global Normalisation
For global intensity scaling one assumes $l$ to be constant over all spots and sets it equal to the mean or median of all log ratios:
$$I_{array} = \frac{1}{N} \sum_i \log_2\left(\frac{R_i}{G_i}\right), \qquad \log_2(e_i)' = \log_2(e_i) - I_{array}$$
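A sketch of global scaling on the log ratios (numpy assumed; the median is used as the constant $l$ here, the mean works the same way):

```python
import numpy as np

def global_normalise(M):
    """Global normalisation: subtract a constant l, here the median
    of all log2 ratios on the array, from every spot's log ratio."""
    return M - np.nanmedian(M)   # or np.nanmean(M) for mean scaling
```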

Linear regression for two-channel arrays
To check whether two distributions, such as the green and red channel intensity values, show a high (linear) correlation, one possibility is to compute an original scatterplot (or an MA-plot). In the case of high correlation, the point cloud approximates a straight line with slope 1 and intercept 0 (in the original scatterplot). If, however,
- a non-zero intercept is observed, then one distribution has consistently higher intensities;
- a slope different from 1 is observed, then one distribution shows a different response at higher intensities;
- it is no straight line, then no linear correlation exists.

Linear regression for two-channel arrays
Example: here a linear regression line is shown for the example above.

Linear regression for two-channel arrays
In the case of a non-zero intercept and/or a deviation from slope 1, a linear regression is performed in order to normalize the data. Here we demonstrate this with the example of a green channel normalisation:
$$R_j = \beta_0 + \beta_1 G_j + u_j \quad (9)$$

$\beta_0$ and $\beta_1$ are constants for the intercept and slope, and $u_j$ is a normally distributed random error. An estimator for the slope $b_1$ is given by the solution of the equation
$$b_1 = \frac{\sum_{j=1}^{n}(R_j - \bar{R})(G_j - \bar{G})}{\sum_{j=1}^{n}(G_j - \bar{G})^2} \quad (10)$$
An estimator for the intercept is then simply computed by
$$b_0 = \bar{R} - b_1 \bar{G} \quad (11)$$

Linear regression for two-channel arrays
In the last step, we apply this linear regression function to the intensity values of the red channel:
$$R_j' = \frac{R_j - b_0}{b_1} \quad (12)$$
Rather than doing a linear regression on the two channel distributions, it is often recommended to do a linear regression on the MA-values.

Non-linear regression for two-channel arrays
Very often a non-linear correlation of the two distributions is observed. In this case a commonly used normalisation method is locally weighted linear regression (lowess) (Cleveland, 1979). The basic idea is to move a window along the x-axis of the scatter plot and to perform a linear regression within each window. All regressions are then joined into the lowess curve. When lowess normalisation is applied to the MA-values, the normalised M-value of each feature is calculated by subtracting the lowess fit value $l(A)$ from the raw M-value:
$$M' = \log_2(R/G)' = \log_2(R/G) - l(A) \quad (13)$$

Non-linear regression demonstration
Example: here is a lowess demonstration in six steps.
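A sketch of lowess normalisation of MA-values, using the lowess smoother from statsmodels (the library choice and the window fraction are assumptions, not prescribed by the text):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lowess_normalise(A, M, frac=0.3):
    """Intensity-dependent normalisation, equation (13):
    fit l(A) to the MA cloud and return M - l(A)."""
    l_of_A = lowess(M, A, frac=frac, return_sorted=False)  # fitted values
    return M - l_of_A
```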


Normalisation between arrays
When analysing several experiments, i.e. arrays, further sources of variability in the data have to be considered. Normalisation is performed to correct for the non-biological variability introduced by using several arrays. Three standard methods are often used:
- Scaling
- Centering
- Distribution normalisation

Scaling

The goal of scaling the data is that the means (or medians) of all distributions are equal. By distributions we of course mean the distributions of the genes' expression values. After scaling, the mean of a gene's expression profile is equal to 0:
$$e_{scaled}(g_{ij}) = e(g_{ij}) - \bar{e}(g_i) \quad (14)$$

Centering
The goal of centering the data is to scale the data such that the mean and the standard deviation of all distributions are equal:
$$e_{center}(g_{ij}) = \frac{e(g_{ij}) - \bar{e}(g_i)}{\sigma(e(g_i))} \quad (15)$$
After centering, the mean of a gene's expression profile is equal to 0, and the standard deviation is equal to 1. Box plots are often used to compare several distributions simultaneously, for example to compare replicates, or the distributions before and after normalisation.
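Both operations are one-liners on the expression matrix; a sketch following equations (14) and (15) (numpy assumed; rows = genes, columns = arrays):

```python
import numpy as np

def scale(E):
    """Scaling, eq. (14): every gene profile gets mean 0."""
    return E - E.mean(axis=1, keepdims=True)

def center(E):
    """Centering, eq. (15): every gene profile gets mean 0 and
    standard deviation 1."""
    return scale(E) / E.std(axis=1, ddof=1, keepdims=True)
```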

Similarity and dissimilarity of expression data
In the following we will look at distance measures to compute the (dis)similarity of expression profiles. The computed (dis)similarity values will then be the input of clustering algorithms. Again we assume that we have an expression matrix with $n$ genes and $p$ arrays.

Metrics and semi-metrics for expression data
The most often used distance is the Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$
and/or the normalised Euclidean distance:
$$d(x, y) = \frac{\sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}}{n}$$
or the weighted Euclidean distance. Let $C$ be a diagonal matrix where the $c_{ii}$ are the weights:
$$d_w(x, y) = \sqrt{(x - y)^T C^{-1} (x - y)} = \sqrt{\sum_i (x_i - y_i)^2 / c_{ii}}$$

Metrics and semi-metrics for expression data
Another commonly applied metric is the $L_1$-metric, also known as the Manhattan metric:
$$d_{L_1}(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

Metrics and semi-metrics for expression data
A semi-metric measure is the Pearson correlation coefficient:
$$\rho(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}}$$
It is $\rho(x, y) \in [-1, 1]$; $\rho(x, y) = 1$ implies perfect similarity and $\rho(x, y) = 0$ randomness.

Metrics and semi-metrics for expression data
The Pearson correlation coefficient is a similarity measure, so one needs to transform it into a distance:
$$d_\rho(x, y) = 1 - \rho(x, y) \quad (16)$$

Metrics and semi-metrics for expression data
Mutual information
This distance measure is based on the notion of entropy. The entropy of an expression profile is a measure of the information content of the profile and is computed as
$$H(x) = -\sum_{i=1}^{k} p(x_i) \log_2(p(x_i))$$

Metrics and semi-metrics for expression data
The larger the entropy value, the more random the expression values are (i.e., the lower the information content). The entropy is computed from discrete probability values. However, gene expression values are normally measured on a continuous scale. To compute the entropy, it is therefore common to use the histogram method: first the range of the expression values is computed for each profile. This range is binned into $k$ intervals. $p(x_i)$ is then the relative frequency of the expression values within interval $x_i$.
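The metrics above translate directly into code; a sketch for two profiles x and y of equal length (numpy assumed):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):                       # L1 metric
    return np.sum(np.abs(x - y))

def pearson_distance(x, y):                # d_rho = 1 - rho, eq. (16)
    return 1.0 - np.corrcoef(x, y)[0, 1]
```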

Metrics and semi-metrics for expression data
The mutual information is a measure of the additional information that one gains by looking at a second expression profile. It is computed as
$$M(x, y) = H(x) + H(y) - H(x, y)$$
In other words, the mutual information of two expression profiles is computed by subtracting the joint entropy from the sum of the individual entropies of both profiles. Here
$$H(x, y) = -\sum_{i=1}^{k} \sum_{j=1}^{k} p(x_i, y_j) \log_2(p(x_i, y_j))$$

Metrics and semi-metrics for expression data
$M(x, y) = 0$ implies that the joint distribution of the two expression profiles carries no more information than the two individual profiles. A higher value of $M(x, y)$ implies that the two profiles are not randomly associated. Thus $M(x, y)$ can be used to compute the (dis)similarity of two profiles. However, by definition $M(x, y)$ is a similarity measure. To transform it into a distance measure, we first need to normalise it:
$$M(x, y)_{norm} = \frac{M(x, y)}{\max(H(x), H(y))}$$
Then the mutual information distance is defined by
$$d_{MI}(x, y) = 1 - M(x, y)_{norm}$$

Metrics and semi-metrics for expression data
In summary, in order to compute the mutual information distance between two expression profiles $x$ and $y$, the following computational steps are needed:
$$H(x) = -\sum_{i=1}^{k} p(x_i) \log_2(p(x_i))$$
$$H(y) = -\sum_{i=1}^{k} p(y_i) \log_2(p(y_i))$$
$$H(x, y) = -\sum_{i,j} p(x_i, y_j) \log_2(p(x_i, y_j))$$
$$M(x, y) = H(x) + H(y) - H(x, y)$$
$$M(x, y)_{norm} = \frac{M(x, y)}{\max(H(x), H(y))}$$
$$d_{MI}(x, y) = 1 - M(x, y)_{norm}$$
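The summary above maps to a short histogram-based sketch (numpy assumed; the number of bins k is a free parameter):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution; empty bins ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_distance(x, y, k=10):
    """Mutual information distance d_MI = 1 - M / max(H(x), H(y)),
    with H(x), H(y), H(x,y) estimated via the histogram method."""
    counts, _, _ = np.histogram2d(x, y, bins=k)
    pxy = counts / counts.sum()            # joint bin probabilities
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    M = entropy(px) + entropy(py) - entropy(pxy.ravel())
    return 1.0 - M / max(entropy(px), entropy(py))
```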

Metrics and semi-metrics for expression data
Application example (from Butte and Kohane, 2000)¹: a publicly available RNA expression data set from Stanford, containing 79 separate measurements of 2,467 genes in Saccharomyces cerevisiae. Measurements of all genes were compared against each other, resulting in 3,041,811 total pairwise calculations of mutual information, ranging from 0.2 to 2.8. To assess the significance of this distribution, the RNA expression measurements were permuted, and the distribution of the pairwise mutual information values was recalculated for each permutation.

Clustering

Introduction
To analyse expression profiles in gene expression analysis, classification methods are often applied. The most commonly used methods are discriminant analysis and cluster analysis. While a classification analysis assigns objects to predefined groups (classes), cluster analysis computes groups of objects (which are here either genes or samples). In this section we first turn to cluster analysis; the next one introduces some methods of discriminant analysis. We distinguish two general types of cluster methods: hierarchical and partitioning methods.

¹ AJ Butte and IS Kohane (2000) Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements. PSB 5:

Hierarchical clustering
The result of hierarchical clustering is a set of nested clusters which can be visualized by means of a tree or dendrogram. We distinguish two types of hierarchical cluster approaches:
- bottom-up (agglomerative clustering)
- top-down (divisive clustering)

Bottom-up hierarchical clustering
Initialisation: each object is one cluster.
Iteration: combine the two clusters that have minimal distance.
Termination: one cluster that contains all objects.
Question: how to compute the distance between two clusters?

Bottom-up hierarchical clustering
Bottom-up hierarchical clustering algorithm:

    for i = 1 to n do
        c_i = {x_i}
    C = {c_1, ..., c_n}
    j = n + 1
    while |C| > 1 do
        (c_a, c_b) = argmin_{(c_u, c_v)} d(c_u, c_v)
        c_j = c_a ∪ c_b
        C = (C \ {c_a, c_b}) ∪ {c_j}
        j = j + 1

Bottom-up hierarchical clustering
Several versions exist, among them:
- Single linkage (or minimum method, nearest neighbor): $d(k, i \cup j) = \min(d(k, i), d(k, j))$
- Complete linkage (or maximum method, furthest neighbor): $d(k, i \cup j) = \max(d(k, i), d(k, j))$
- Average linkage (UPGMA): $d(k, i \cup j) = (n_i\, d(k, i) + n_j\, d(k, j)) / (n_i + n_j)$
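In practice one rarely implements the loop above by hand; scipy ships the standard linkage rules. A sketch on toy data (the method names map to single = minimum, complete = maximum, average = UPGMA):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

E = np.random.rand(50, 8)                             # toy matrix: 50 profiles, 8 arrays
Z = linkage(E, method='average', metric='euclidean')  # UPGMA on Euclidean distances
labels = fcluster(Z, t=4, criterion='maxclust')       # cut the dendrogram into 4 clusters
```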

k-means clustering
The goal of k-means clustering is to find a partition $C$ of the set $X$ into $k$ (pre-chosen) clusters, such that a given measure of homogeneity is minimised.

k-means clustering algorithm:
1. Choose $k$.
2. Choose randomly $k$ centers $\mu_1, \ldots, \mu_k$ that serve as the mean values of the clusters.
3. For each gene compute the nearest cluster center: $C(i) = \operatorname{argmin}_{1 \le l \le k} d(x_i, \mu_l)^2$
4. Compute the new mean of each cluster: $\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$
5. Repeat steps 3-4 until the algorithm converges.

k-means clustering
Note: the k-means method minimizes the total intra-cluster variance
$$\sum_{l=1}^{k} \sum_{i:\,C(i)=l} d(x_i, \mu_l)^2 \quad (17)$$
i.e. the sum of the squared distances between each gene expression profile and its respective cluster center. An important parameter of the method is the choice of $k$, the number of clusters. A possibility to optimize this choice is to run the algorithm several times with different values of $k$, compute the total intra-cluster variance each time, and plot the result.
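A direct sketch of steps 1-5 (numpy assumed; for simplicity it assumes no cluster runs empty during the iteration):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 2
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        assign = d.argmin(axis=1)                               # step 3
        new = np.array([X[assign == l].mean(axis=0) for l in range(k)])
        if np.allclose(new, centers):                           # converged
            break
        centers = new                                           # step 4
    return assign, centers
```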

Self-organizing Maps
The method of self-organizing maps (SOM) was originally introduced by Kohonen (1997). Here the n-dimensional space is projected onto a one- or two-dimensional grid. The dimension of the grid, i.e. the number of clusters, is pre-chosen. Each node in the grid is associated with a so-called reference vector. During the run of the algorithm, the input vectors (i.e. the gene profiles) iteratively pull the reference vectors towards the input vector space.

Self-organizing Maps
[Figure: principle of the SOM algorithm; the initial geometry is a 2 x 3 grid (points 1, ..., 6 joined by lines); the arrows indicate hypothetical trajectories of the nodes on their iterative way towards the fit of the data; data points are shown in black. Figure from Tamayo, 1999.]

Self-organizing Maps
Algorithm:
1. Input: the p-dimensional gene expression profiles.
2. Choose a grid with $k \times l$ nodes.
3. Initialize the p-dimensional reference vectors $f_0(v)$ with random values (either drawn from the input data or completely at random).
4. Iterate over $i$:
   a) For each profile $x \in E$, compute the node $v_x$ for which $f_i(v_x)$ is closest to $x$.
   b) Update all reference vectors as follows:
      $$f_{i+1}(v) = f_i(v) + \eta(h(v_x, v), i)\,(x - f_i(v))$$
      where $\eta$ is a learning rate that decreases with the iteration number $i$, and $h(v_x, v)$ is a learning function of the nodes $v_x$ and $v$ (thus, nodes that are not so close to $v_x$ are moved less).
   c) Repeat a) and b) until the algorithm converges.
Learning function, e.g. a Gaussian:
$$h(v_x, v) = \exp\left(-\frac{d(v_x, v)^2}{2\sigma^2}\right)$$

Visualisation of clusters
Profile plot
Plot either all profiles of each cluster or only the profile of the cluster representative.

Silhouette plot
For a profile $x$ from $E$, its silhouette $s(x)$ is defined as follows (Rousseeuw, 1987):
1. Compute $a(x)$, the mean distance of $x$ to all other profiles in the same cluster $C$:
   $$a(x) = \frac{1}{|C|} \sum_{i=1}^{|C|} d(x, x_i)$$
2. For each cluster $C_k$, compute the mean distance of $x$ to all profiles in $C_k$:
   $$d(x, C_k) = \frac{1}{|C_k|} \sum_{x_i \in C_k} d(x, x_i)$$

Silhouette plot
3. Compute the minimum of all $d(x, C_k)$ over the clusters of which $x$ is not a member: $b(x) = \min_k d(x, C_k)$. This is the distance of $x$ to the nearest cluster of which $x$ is not a member.
4. Finally:
   $$s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}$$
Profiles whose silhouette values $s(x)$ are equal or close to 1 lie within well-defined and tight clusters, profiles with $s(x)$ close to 0 lie exactly between two clusters, and those with negative $s(x)$ are in the wrong cluster.

Visualisation of clusters
Example: [Figure: cluster visualisation.]
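Computing $s(x)$ is straightforward; a sketch for one profile (numpy assumed; Euclidean distances chosen for concreteness):

```python
import numpy as np

def silhouette(i, X, labels):
    """s(x) for profile X[i], given one cluster label per profile."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False                            # exclude x itself
    a = d[own].mean()                         # mean within-cluster distance
    b = min(d[labels == k].mean()             # nearest foreign cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)
```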

Summary
- Either gene or experiment profiles can be clustered.
- Cluster algorithms generate visually interpretable images.
- The significance of clustering results is of great importance.
- Problems of unsupervised cluster methods are:
  - no rules can be deduced from the observed data
  - the number of clusters is generally not known
  - the best (dis)similarity measure is generally not known
  - many algorithms only compute approximate solutions without indicating the deviation from the input data

Classification
Microarray experiments are nowadays often used to study differences between types and subtypes of tumors. The goal is to develop diagnostic approaches using marker genes. In a microarray screen with data from different patients with known tumor classes, the goal of supervised learning is to find a subset of genes that allows one to distinguish the different classes. The microarray literature often uses "class detection" for clustering and "class prediction" for supervised learning, while the machine learning world talks of unsupervised and supervised learning.

Classification
The following figure illustrates the two different learning approaches.

[Figure taken from Ramaswamy and Golub]

Classification
A classification procedure typically consists of two tasks:
1. Learning task
   Given: expression profiles of samples and their classes.
   Task: learn a model that allows one to distinguish expression profiles of one class from those of the other classes.
2. Classification task
   Given: the expression profile of a new sample whose class is not known.
   Task: predict the class of the sample.
Let us describe the problem formally. Given is a set of objects to be classified into a predefined number of classes $K$, say $\{1, 2, \ldots, K\}$, or for binary classification tasks $\{-1, +1\}$ or $\{0, 1\}$. With each object we associate
- a class label $Y \in \{1, 2, \ldots, K\}$
- a set of $n$ measurements that define the feature vector $X = (X_1, \ldots, X_n)$.

The task is to classify an object into one of the $K$ classes on the basis of an observed measurement $X = x$; in other words, we want to predict $Y$ from $X$.

Definition 2. A classifier $C$ for $K$ classes partitions the set of gene expression profiles into $K$ disjoint subsets $T_1, \ldots, T_K$ such that for a sample with expression profile $x = (x_1, \ldots, x_n) \in T_j$, the predicted class is $j$. Thus, a classifier for the $K$ classes is a map $C: X \rightarrow \{1, 2, \ldots, K\}$.

Linear discriminants
The linear discriminant algorithms are among the simplest methods. Linear discriminant analysis (often also referred to as Fisher's linear discriminant analysis) is based on determining a linear combination $ax$ of the feature vectors $x = (x_1, \ldots, x_n)$. The various methods differ in their choice of $a$.

Linear discriminants
For the general case, the discriminant rule is
$$C(x) = \operatorname{argmin}_k\, (x - \mu_k) \Sigma_k^{-1} (x - \mu_k)^t \quad (18)$$
where $\Sigma_k$ is the covariance matrix of class $k$. The covariance of two random variables $X$ and $Y$ is defined as
$$\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
and is estimated as
$$\mathrm{cov}(x, y) = \frac{1}{p-1} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

Linear discriminants
The argument in the discriminant function above is just the squared Mahalanobis distance of $x$ to the vector $\mu_k$ of the means of the $k$th class. The Mahalanobis distance between two vectors $x$ and $y$ is defined as
$$d_{ML}(x, y) = (x - y) S^{-1} (x - y)^t \quad (19)$$
where $S$ is generally a positive definite matrix.

Linear discriminants
If $\Sigma_k = \Sigma$ for all $k$ classes, we have linear discriminant analysis (LDA):
$$C(x) = \operatorname{argmin}_k\, (x - \mu_k) \Sigma^{-1} (x - \mu_k)^t \quad (20)$$
In the simplest case, all classes have the same diagonal covariance matrix $\mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$. This leads to diagonal linear discriminant analysis (DLDA):
$$C(x) = \operatorname{argmin}_k \sum_{g=1}^{n} \frac{(x_g - \mu_{kg})^2}{\sigma_g^2} \quad (21)$$
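A sketch of DLDA following equation (21) (numpy assumed; the pooled within-class variance estimate is one common choice and an assumption here, not prescribed by the text):

```python
import numpy as np

def dlda_fit(X, y):
    """X: samples x genes, y: class labels. Returns class labels,
    class means, and a shared per-gene (pooled within-class) variance."""
    classes = np.unique(y)
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = X - mu[np.searchsorted(classes, y)]
    var = (resid ** 2).sum(axis=0) / (len(X) - len(classes))
    return classes, mu, var

def dlda_predict(x, classes, mu, var):
    """Equation (21): argmin_k sum_g (x_g - mu_kg)^2 / sigma_g^2."""
    return classes[(((x - mu) ** 2) / var).sum(axis=1).argmin()]
```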

Linear discriminants
For the binary class case, i.e. $K = 2$, the DLDA rule reduces to the sign of
$$\sum_{g=1}^{n} a_g (x_g - b_g) = \sum_{g=1}^{n} \frac{\mu_{1g} - \mu_{2g}}{\sigma_g^2} \left( x_g - \frac{\mu_{1g} + \mu_{2g}}{2} \right) \quad (22)$$

Testing a classifier
For small data sets, cross-validation, and here especially leave-one-out cross-validation (LOOCV), is used. For LOOCV, one sample is taken out of the whole data set, the predictor is trained on the remaining $p - 1$ samples, and the class of the left-out sample is predicted. This is done for all $p$ samples, and from this an error rate is computed.

Testing a classifier
Let $\{s_1, s_2, \ldots, s_p\}$ be the full sample set.
For $i = 1, \ldots, p$ do:
- generate the predictor from $\{s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_p\}$
- compute the number of false class predictions $ER_i$
Compute the overall cross-validation error, e.g. by taking the mean of all $ER_i$ (see the sketch after this list).

Feature selection
Problem: microarray data has too many features.
- Explicit selection: filter methods applied before generating the classifier (e.g. ReliefF).
- Implicit selection: wrapper methods applied while generating the classifier.

Other machine learning methods
- Decision trees
- Neural networks
- Support vector machines (SVMs)
- Bayesian regression methods
- ...
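A sketch of the LOOCV loop, reusing the DLDA functions sketched above (the fit/predict interface is an assumption):

```python
import numpy as np

def loocv_error(X, y, fit=dlda_fit, predict=dlda_predict):
    """Leave-one-out cross-validation error rate."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # leave sample i out
        model = fit(X[mask], y[mask])
        errors += predict(X[i], *model) != y[i]
    return errors / len(X)
```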

Statistical testing

Differentially expressed genes
One of the main goals of microarray experiments is the detection of differentially expressed genes. The starting point is the comparison of the expression values of a gene in two or more cell populations. This leads to the central question:

Problem 3. How does one distinguish expression differences of a gene between different experiments that have a biological cause from those that are the result of technical variation or noise?

Hypothesis testing
An obvious choice is to use a statistical test that ranks the genes from highest to lowest evidence for differential expression. Then, for a pre-chosen critical value of the rank statistic, all values above the threshold are called significant. In our case we ask, for example: for a given gene $g$, assume its expression values are $(g_{A1}, \ldots, g_{Ak})$ in population A and $(g_{B1}, \ldots, g_{Bl})$ in population B. Let $\bar{g}_A$, $\bar{g}_B$ be the mean expression in populations A and B, respectively. Now we ask: is $\bar{g}_A \neq \bar{g}_B$, and is the difference of the two means significant or random?

Hypothesis testing
Answers are offered by hypothesis testing. Generally, all hypothesis tests involve the comparison of an observation with the value one would expect by chance for a given test statistic. The following two hypotheses are stated:
- Null hypothesis (e.g. the mean of a gene under condition A is not different from the mean of the gene under condition B, i.e. the gene is not differentially expressed), often denoted by H0: the observed value does not differ significantly from the one expected by chance.
- Alternative hypothesis (e.g. the gene is differentially expressed), often denoted by H1: the observed value differs significantly from the one expected by chance.

Hypothesis testing
One distinguishes the one-sided (also called directional) test - the unknown parameter $\theta$ to estimate (e.g. a mean) is either larger or smaller than a given parameter $\theta_0$ - from the two-sided test, which estimates whether the unknown parameter is unequal to the given parameter.

Null hypothesis       Alternative hypothesis    Type of test
H0: θ = θ0            H1: θ ≠ θ0                two-sided test
H0: θ ≤ θ0            H1: θ > θ0                one-sided test
H0: θ ≥ θ0            H1: θ < θ0                one-sided test

Error types
All statistical tests can bear errors. We distinguish the type I error $\alpha$ - the null hypothesis has been falsely rejected - and the type II error $\beta$ - the null hypothesis has been falsely accepted. For the type I error one uses a low significance level, $\alpha$ = 1% or 5%. For a chosen $\alpha$, the significance level is the p-value:

                Accept H0          Reject H0
H0 is true      correct            type I error
H0 is false     type II error      correct

p-value
The p-value is the probability of obtaining, by chance, a value of the test statistic at least as extreme as the one observed for the sample. Thus the p-value is the probability of rejecting the null hypothesis erroneously. If the p-value is smaller than the chosen type I error, the null hypothesis is rejected. The p-value is often compared against the significance level: if the null hypothesis is rejected with significance $\alpha$, then $p < \alpha$.

t-test
A commonly chosen test is the t-test. This method (among others) takes the variation of the expression values into account. The t-test is an example of classical hypothesis testing.

t-distribution
Definition 4. Let $X$ be a standard-normally distributed variable $\mathcal{N}(0, 1)$; then the distribution of $X^2$ is called the $\chi^2$-distribution with one degree of freedom.

Definition 5. Let $X_1, \ldots, X_n$ be independent $\chi^2$-distributed variables with one degree of freedom; then the distribution of
$$Y = \sum_{i=1}^{n} X_i \quad (23)$$
is called the $\chi^2$-distribution with $n$ degrees of freedom.

t-distribution
Definition 6. Let $X$ be a standard-normally distributed random variable and let $Y$ have a $\chi^2$-distribution with $n$ degrees of freedom, with $X$ and $Y$ independent of each other. Then the distribution of
$$T = \frac{X}{\sqrt{Y/n}} \quad (24)$$
is called the t- or Student-distribution with $n$ degrees of freedom.

t-distribution
[Figure: the t-distribution with 5 (left) and 50 (right) degrees of freedom; the green line is the standard normal distribution.]

The one-sample t-test
When we compare the mean expression level of a gene $g$ against a known mean of the underlying population (for example $\mu = 1$ for raw ratios, or $\mu = 0$ for log ratios), we use the one-sample t-test:

Definition 7. The one-sample t-statistic of gene $g = (g_1, \ldots, g_p)$ is defined by
$$t(g) = \frac{\bar{g} - \mu}{\sigma(g)/\sqrt{p}}, \quad (25)$$
where $\bar{g}$ denotes the mean expression value of $g$ and $\sigma(g)$ its standard deviation.

The one-sample t-test
Under the null hypothesis (here $\mu = 0$), the one-sample t-statistic follows a $t_{p-1}$ distribution. For the one-sided test one compares the computed t-value with $t(1 - \alpha, p - 1)$; if $t > t(1 - \alpha, p - 1)$, the null hypothesis ($H_0: \mu = \mu_0 = 0$) is rejected. For the two-sided test one compares the computed t-value with $t(1 - \alpha/2, p - 1)$; if $|t| > t(1 - \alpha/2, p - 1)$, the null hypothesis is rejected.

Example: for $t = 1.7$, the one-sided p-value is half the two-sided p-value; the exact values depend on the degrees of freedom $p - 1$.

The two-sample test
Given a sample $x_1, \ldots, x_{m_1}$ of $\mathcal{N}(\mu_1, \sigma^2)$-distributed random variables $X_1, \ldots, X_{m_1}$, as well as a sample $y_1, \ldots, y_{m_2}$ of $\mathcal{N}(\mu_2, \sigma^2)$-distributed random variables $Y_1, \ldots, Y_{m_2}$. We assume that all $m_1 + m_2$ random variables are independent with the same variance $\sigma^2$. The expected means $\mu_1$ and $\mu_2$ are unknown. We test whether $\mu_1 = \mu_2$. Compute the pooled variance estimate $s^2$ as follows:
$$s^2 = \frac{1}{m_1 + m_2 - 2} \left( \sum_{i=1}^{m_1} (x_i - \bar{x})^2 + \sum_{j=1}^{m_2} (y_j - \bar{y})^2 \right)$$

Definition 8. The two-sample t-statistic is defined by
$$t = \frac{\bar{x} - \bar{y}}{\sqrt{s^2 \left( \frac{1}{m_1} + \frac{1}{m_2} \right)}} \quad (26)$$
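A sketch of the two-sample t-test of Definition 8, with the p-value taken from scipy (scipy's ttest_ind with equal_var=True computes the same quantity):

```python
import numpy as np
from scipy import stats

def two_sample_t(x, y):
    """Pooled-variance two-sample t-statistic, equation (26), and the
    two-sided p-value from the t-distribution with m1+m2-2 dof."""
    m1, m2 = len(x), len(y)
    s2 = (np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (m1 + m2 - 2)
    t = (x.mean() - y.mean()) / np.sqrt(s2 * (1 / m1 + 1 / m2))
    p = 2 * stats.t.sf(abs(t), df=m1 + m2 - 2)
    return t, p

# equivalent: stats.ttest_ind(x, y, equal_var=True)
```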

The two-sample test
Under the null hypothesis, the two-sample t-statistic follows the t-distribution with $m_1 + m_2 - 2$ degrees of freedom. For a chosen $\alpha$, one compares the t-value with $t(1 - \alpha, m_1 + m_2 - 2)$ for the one-sided test, or with $t(1 - \alpha/2, m_1 + m_2 - 2)$ for the two-sided test; if the t-value exceeds this threshold, the null hypothesis is rejected.

One- and two-sided tests
We distinguish the following tests:
- two-sided: do A and B have different means?
- one-sided: does A have a larger (smaller) mean than B?
The interpretation for microarray experiments is that a two-sided test detects differentially regulated genes, while a one-sided test seeks up- (down-)regulated genes.

The two-sample test
Summary of applying the t-test:
- Choose the significance level, i.e. the type I error threshold $\alpha$.
- Compute a one- or two-sided t-statistic for each gene $g$.
- Compute the p-value from the distribution of the test statistic.

3 Sequencing by Hybridization

Sequencing by Hybridization (SBH)
Originally, the hope was that one could use DNA chips to sequence large unknown DNA fragments using a large array of short probes:
1. Produce a chip $C(l)$ spotted with all possible probes of length $l$ ($l = 8$ in the first SBH papers).
2. Apply a solution containing many copies of a fluorescently labeled DNA target fragment to the array.
3. The DNA fragments hybridize to those probes that are complementary to substrings of length $l$ of the fragment.
4. Detect the probes that hybridize with the DNA fragment and obtain the l-tuple composition of the DNA fragment.
5. Apply a combinatorial algorithm to reconstruct the sequence of the DNA target from its l-tuple composition.

The Shortest Superstring Problem
SBH provides information about the l-tuples present in a target DNA sequence, but not about their positions. Suppose we are given the spectrum $S$ of all l-tuples of a target DNA sequence; how do we reconstruct the sequence?

This is a special case of the Shortest Common Superstring Problem (SCS): a superstring for a given set of strings $s_1, s_2, \ldots, s_m$ is a string that contains each $s_i$ as a substring. Given a set of strings, finding the shortest superstring is NP-complete.

The Shortest Superstring Problem
Define $\mathrm{overlap}(s_i, s_j)$ as the length of a maximal prefix of $s_j$ that matches a suffix of $s_i$. The SCS problem can be cast as a Traveling Salesman Problem in a complete directed graph $G$ with $m$ vertices $s_1, s_2, \ldots, s_m$ and edges $(s_i, s_j)$ of length $\mathrm{overlap}(s_i, s_j)$.

The SBH graph
SBH corresponds to the special case in which all substrings have the same length $l$. We say that two SBH probes $p$ and $q$ overlap if the last $l - 1$ letters of $p$ coincide with the first $l - 1$ letters of $q$. Given the spectrum $S$ of a DNA fragment, construct the directed graph $H$ with vertex set $S$ and edge set $E = \{(p, q) \mid p$ and $q$ overlap$\}$. There exists a one-to-one correspondence between paths that visit each vertex of $H$ at least once and the DNA fragments with the spectrum $S$. Vertices: l-tuples of the spectrum $S$; edges: overlapping l-tuples.

S = { ATG AGG TGC TCC GTC GGT GCA CAG }

The path in $H$ visiting all vertices corresponds to the sequence reconstruction ATGCAGGTCC. A path that visits all vertices of a graph exactly once is called a Hamiltonian path. Unfortunately, the Hamiltonian Path Problem is NP-complete, so for larger graphs we cannot hope to find such paths.

Second example of the SBH graph

S = { ATG TGG TGC GTG GGC GCA GCG CGT }

This example has two different Hamiltonian paths and thus two different reconstructed sequences: ATGCGTGGCA and ATGGCGTGCA.

Euler Path
Leonhard Euler wanted to know whether there exists a path that uses each of the seven bridges in Königsberg exactly once:

[Figure: the seven bridges of Königsberg across the Pregel river, around the Kneiphof island.]

The birth of graph theory...

SBH and the Eulerian Path Problem
Let $S$ be the spectrum of a DNA fragment. We define a graph $G$ whose set of nodes consists of all possible $(l-1)$-tuples. We connect one $(l-1)$-tuple $v = v_1 \ldots v_{l-1}$ to another $w = w_1 \ldots w_{l-1}$ by a directed edge $(v, w)$ if the spectrum $S$ contains an l-tuple $u$ with prefix $v$ and suffix $w$, i.e. such that $u = v_1 \ldots v_{l-1} w_{l-1} = v_1 w_1 \ldots w_{l-1}$. Hence, in this graph the probes correspond to edges, and the problem is to find a path that visits all edges exactly once, i.e. an Eulerian path. Finding an Eulerian path is computationally simple.

SBH and the Eulerian Path Problem

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

[Figure: the corresponding graph on the vertices AT, TG, GT, CG, GG, GC, CA.] Vertices represent $(l-1)$-tuples; edges correspond to l-tuples of the spectrum.

SBH and the Eulerian Path Problem
There are two different solutions:

[Figure: the two Eulerian paths of the graph, yielding the reconstructions ATGGCGTGCA and ATGCGTGGCA.]

SBH and the Eulerian Path Problem
A directed graph $G$ is called Eulerian if it contains a cycle that traverses every edge of $G$ exactly once. A vertex $v$ is called balanced if the number of edges entering $v$ equals the number of edges leaving $v$, i.e. $\mathrm{indegree}(v) = \mathrm{outdegree}(v)$. We call $v$ semi-balanced if $|\mathrm{indegree}(v) - \mathrm{outdegree}(v)| = 1$.

Theorem. A directed graph is Eulerian iff it is connected and each of its vertices is balanced.

Lemma. A connected directed graph contains an Eulerian path iff it has at most two semi-balanced vertices.

Probability of unique sequence reconstruction
What is the probability that a randomly generated DNA fragment of length $n$ can be uniquely reconstructed using a DNA array $C(l)$? In other words, how large must $l$ be so that a random sequence of length $n$ can be uniquely reconstructed from its l-tuples? We assume that the bases at each position are chosen independently, each with probability $p = \frac{1}{4}$. Note that a repeat of length $l$ will always lead to a non-unique reconstruction. We expect about $\binom{n}{2} p^l$ repeats of length $l$. Note that $\binom{n}{2} p^l = 1$ implies $l = \log_{1/p} \binom{n}{2}$.

Probability of unique sequence reconstruction
For a given $l$ one should therefore choose $n \le \sqrt{2 \cdot 4^l}$, but not larger. (However, this is a very loose bound, and a much tighter bound is known.)

SBH currently infeasible
The Eulerian path approach to SBH is currently infeasible due to two problems:
- Errors in the data:
  - False positives arise when the target DNA hybridizes to a probe even though an exact match is not present.
  - False negatives arise when an exact match goes undetected.
- Repeats make the reconstruction impossible as soon as the length of the repeated sequence exceeds the word length $l$.
Nevertheless, the ideas developed here are employed in a newer approach to sequence assembly that uses sequenced reads and an Eulerian path representation of the data (Pavel Pevzner, Recomb 2001).
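To make the construction concrete, here is a sketch that builds the $(l-1)$-tuple graph from a spectrum and reads off one Eulerian path with Hierholzer's algorithm (it assumes an error-free spectrum in which an Eulerian path exists):

```python
from collections import defaultdict

def sbh_reconstruct(spectrum):
    graph, indeg = defaultdict(list), defaultdict(int)
    for probe in spectrum:                 # edge: (l-1)-prefix -> (l-1)-suffix
        v, w = probe[:-1], probe[1:]
        graph[v].append(w)
        indeg[w] += 1
    # start at a semi-balanced vertex (outdegree = indegree + 1) if any
    start = next((v for v in list(graph) if len(graph[v]) > indeg[v]),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:                           # Hierholzer's algorithm
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()                         # vertices along the Eulerian path
    return path[0] + ''.join(v[-1] for v in path[1:])

print(sbh_reconstruct(["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
# prints ATGGCGTGCA or ATGCGTGGCA, depending on the edge traversal order
```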


More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays. Comparisons and validation of statistical clustering techniques for microarray gene expression data Susmita Datta and Somnath Datta Presented by: Jenni Dietrich Assisted by: Jeffrey Kidd and Kristin Wheeler

More information

Double Self-Organizing Maps to Cluster Gene Expression Data

Double Self-Organizing Maps to Cluster Gene Expression Data Double Self-Organizing Maps to Cluster Gene Expression Data Dali Wang, Habtom Ressom, Mohamad Musavi, Cristian Domnisoru University of Maine, Department of Electrical & Computer Engineering, Intelligent

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Analyzing ICAT Data. Analyzing ICAT Data

Analyzing ICAT Data. Analyzing ICAT Data Analyzing ICAT Data Gary Van Domselaar University of Alberta Analyzing ICAT Data ICAT: Isotope Coded Affinity Tag Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

What to come. There will be a few more topics we will cover on supervised learning

What to come. There will be a few more topics we will cover on supervised learning Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression

More information

SVM Classification in -Arrays

SVM Classification in -Arrays SVM Classification in -Arrays SVM classification and validation of cancer tissue samples using microarray expression data Furey et al, 2000 Special Topics in Bioinformatics, SS10 A. Regl, 7055213 What

More information

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262 Feature Selection Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 239 / 262 What is Feature Selection? Department Biosysteme Karsten Borgwardt Data Mining Course Basel

More information

Clustering gene expression data

Clustering gene expression data Clustering gene expression data 1 How Gene Expression Data Looks Entries of the Raw Data matrix: Ratio values Absolute values Row = gene s expression pattern Column = experiment/condition s profile genes

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Course on Microarray Gene Expression Analysis

Course on Microarray Gene Expression Analysis Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

Machine learning - HT Clustering

Machine learning - HT Clustering Machine learning - HT 2016 10. Clustering Varun Kanade University of Oxford March 4, 2016 Announcements Practical Next Week - No submission Final Exam: Pick up on Monday Material covered next week is not

More information

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Supervised Learning for Image Segmentation

Supervised Learning for Image Segmentation Supervised Learning for Image Segmentation Raphael Meier 06.10.2016 Raphael Meier MIA 2016 06.10.2016 1 / 52 References A. Ng, Machine Learning lecture, Stanford University. A. Criminisi, J. Shotton, E.

More information

Eulerian Tours and Fleury s Algorithm

Eulerian Tours and Fleury s Algorithm Eulerian Tours and Fleury s Algorithm CSE21 Winter 2017, Day 12 (B00), Day 8 (A00) February 8, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Vocabulary Path (or walk): describes a route from one vertex

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and

More information