DNA arrays and their various applications. Algorithmen der Bioinformatik II - SoSe 2007, Christoph Dieterich


1 Introduction

Motivation
A DNA microarray is a parallel approach to gene screening and target identification. Microarrays are now being applied to:
- disease characterization
- developmental biology
- pathway mapping
- mechanism-of-action studies and toxicology
These applications are in the domain of mRNA or gene expression profiling.

Motivation
Other, more recent applications include:
- Comparative genomic hybridization (array CGH): assessing large genomic rearrangements.
- SNP detection arrays: looking for single nucleotide polymorphisms in the genomes of populations.
- Chromatin immunoprecipitation (ChIP) studies: determining protein binding site occupancy throughout the genome.
Whole-genome tiling arrays are mostly used for these applications.

DNA arrays - gene expression profiling
Probe - A probe is a particular DNA sequence corresponding (complementary) to an mRNA.
Target - The complex mixture of nucleic acid species being tested.
We are given an unknown target nucleic acid sample, and the goal is to detect the identity and/or abundance of its constituents using known probe sequences. Single-stranded DNA probes are called oligonucleotides or oligos. There are two different formats of DNA chips:

Format I: The target ( bp) is attached to a solid surface and exposed to a set of probes, either separately or in a mixture. The earliest chips were of this kind, used for oligo-fingerprinting.

Format II: An array of probes is produced either in situ or by attachment. The array is then exposed to sample DNA. Examples are oligo arrays and cDNA microarrays.

Spotted microarrays
In spotted microarrays (or two-channel or two-colour microarrays), the probes are oligonucleotides, cDNA or small fragments of PCR products that correspond to mRNAs and are spotted onto the microarray surface. This type of array is typically hybridized with cDNA from two samples to be compared (e.g. diseased tissue versus healthy tissue) that are labeled with two different fluorophores (e.g. Cyanine 5 (Cy5, red) and Cyanine 3 (Cy3, green)).

Typical experiment
The specific interaction between probe and target species is based upon DNA hybridization. The relative abundance of individual target species can be measured by the ratio of dye intensities.

Spotted cDNA microarrays
Each spot contains identical cDNA clones, which represent a gene. (Such complementary DNA is obtained by reverse transcription from a known mRNA.) The target is the unknown mRNA extracted from a specific cell.

Oligonucleotide microarrays
In oligonucleotide microarrays (or single-channel microarrays), the probes are designed to match parts of the sequence of known or predicted mRNAs. These microarrays give estimates of the absolute level of gene expression; the comparison of two conditions therefore requires two separate microarrays. Affymetrix produces oligo arrays with the goal of capturing each coding region as specifically as possible. The length of the oligos is about 25 bases. The density of oligos on a chip can be very high, and a 1 cm x 1 cm chip can easily contain a very large number of oligo types. The chip contains both coding oligos and control oligos, the former corresponding to perfect matches to known targets and the controls corresponding to matches with one perturbed base. When reading the chip, hybridization levels at controls are subtracted from the levels of the match probes to reduce the number of false positives. Actual chip designs use 10 match and 10 mismatch probes for each target gene. Today, Affymetrix offers chips for almost every finished genome.

Manufacturing Oligo Arrays
1. Start with a matrix created over a glass substrate.
2. Each cell contains a growing chain of nucleotides that ends with a terminator that prevents chain extension.
3. Cover the substrate with a mask and then illuminate the uncovered cells, breaking the bonds between the chains and their terminators.
4. Expose the substrate to a solution of many copies of a specific nucleotide base, so that each of the unterminated chains is extended by one copy of that base and a new terminator.
5. Repeat using different masks.

Gene expression profiling
The existing methods for measuring gene expression are based on two biological assumptions:
1. The transcription level of genes indicates their regulation: since a protein is generated from a gene in a number of stages (transcription, splicing, synthesis of protein from mRNA), regulation of gene expression can occur at many points. However, we assume that most regulation happens during the transcription phase.
2. Only genes which contribute to organism fitness are expressed; in other words, genes that are irrelevant to the given cell under the given circumstances are not expressed.

Gene expression profiling
Genes affect the cell by being expressed, i.e. transcribed into mRNA and translated into proteins that react with other molecules. From the pattern of expression we may be able to deduce the function of an unknown gene. This is especially true if the pattern of expression of the unknown gene is very similar to the pattern of expression of a gene with known function. Also, the level of expression of a gene in different tissues and at different stages is of significant interest. Hence, it is highly interesting to analyze the expression profile of genes, i.e. in which tissues and at what stages of development they are expressed.

2 Analysis of Microarray Data

From raw to primary data
Generally three steps are necessary for the image analysis:
1. Addressing: assign the location of each spot center. Based on the gridding process, the coordinates of each spot are assigned. The algorithms for this step need to be robust and reproducible.
2. Segmentation: classification of each pixel as foreground (signal) or background (noise).
3. Information extraction: numerical values are computed. For each spot on the array (and for each label, if more than one is used) compute: mean signal intensity, mean background intensity, and a quality value.

Expression values of two-channel arrays
Let $F_{X,j}$ denote the set of foreground pixels in channel $X$ ($X = R$ for red, $X = G$ for green) of the $j$th probe (spot, gene). Similarly, let $B_{X,j}$ denote the set of background pixels in channel $X$ of the $j$th probe. Let $r_i$ and $g_i$ be the intensity of pixel $i$ in the red and green channel, respectively. Furthermore, let $R_f^j$ and $G_f^j$ be the mean foreground signals of the $j$th spot in the red and green channel, and $R_b^j$ and $G_b^j$ the corresponding mean background signals.

Expression values of two-channel arrays
These are computed as
$$R_f^j = \frac{\sum_{i \in F_{R,j}} r_i}{|F_{R,j}|} \quad (1) \qquad G_f^j = \frac{\sum_{i \in F_{G,j}} g_i}{|F_{G,j}|} \quad (2)$$

and
$$R_b^j = \frac{\sum_{i \in B_{R,j}} r_i}{|B_{R,j}|} \quad (3) \qquad G_b^j = \frac{\sum_{i \in B_{G,j}} g_i}{|B_{G,j}|} \quad (4)$$

Expression values of two-channel arrays
For the final expression value of a spot, the background signals are subtracted from the foreground signals:
$$R^j = R_f^j - R_b^j \quad (5) \qquad G^j = G_f^j - G_b^j \quad (6)$$
Care must be taken if $R_b^j > R_f^j$ and/or $G_b^j > G_f^j$. In this case, most image analysis programs return a flagged spot.

Expression values of two-channel arrays
Finally, both expression values are combined into a ratio or log ratio (commonly base 2):
$$e(j) = \log_2\left(\frac{R^j}{G^j}\right) \quad (7)$$
Thus $e(j)$ is the log-ratio expression value of the $j$th spot.

Expression values of one-channel arrays
The expression values for arrays with just one channel are computed similarly. Here we define $e(j)$ to be either the absolute expression intensity or its $\log_2$ value.

The expression matrix
Now that we have defined the expression value of a gene in a single array experiment, we turn to assembling all values of several array experiments into a common matrix.

Definition 1. The expression matrix of a microarray experiment consisting of $p$ arrays, where each array has $n$ genes, is an $n \times p$ matrix whose $ij$th cell contains the expression value of the $i$th gene on the $j$th hybridized array.

The expression matrix
Let us denote the expression profile of the $i$th gene $g_i$ by $e(g_i)$, and the expression value of the $i$th gene in the $j$th experiment by $e(g_{ij})$. Then we denote the mean expression of $g_i$ by
$$\bar{e}(g_i) = \frac{1}{p} \sum_{j=1}^{p} e(g_{ij}).$$
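As a concrete illustration of equations (1)-(7), here is a minimal sketch in Python (numpy assumed; the function name and the per-spot input layout are hypothetical, not from the original text):

```python
import numpy as np

def spot_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """Log-ratio expression value e(j) of one spot.

    Inputs are 1-D arrays of foreground/background pixel intensities
    per channel. Returns (e, flagged); the spot is flagged when the
    mean background exceeds the mean foreground in either channel.
    """
    R = red_fg.mean() - red_bg.mean()      # R^j = R_f^j - R_b^j, eqs (1),(3),(5)
    G = green_fg.mean() - green_bg.mean()  # G^j = G_f^j - G_b^j, eqs (2),(4),(6)
    if R <= 0 or G <= 0:
        return np.nan, True                # flagged spot
    return np.log2(R / G), False           # e(j) = log2(R^j / G^j), eq (7)
```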

Visualisation of gene expression data
A very important aspect of microarray data analysis is visualization. Visualization tools are primarily used to gain biologically important insights into the data. There are a number of approaches to the problem of visualizing microarray data, ranging from viewing the raw image data, to viewing profiles of genes across experiments, to using one of the many scatter plot variants. This section gives a short overview of common visualisation methods.

Scatterplot
In a scatterplot one distribution is plotted against another. Let $\log(X)$ and $\log(Y)$ denote the log values of distributions $X$ and $Y$. Then one plots $\log(Y)$ against $\log(X)$. A very typical application is to plot the intensity values ($\log_2$) of the green channel against those of the red channel.

MA-Plot
Here, rather than plotting $Y$ against $X$ or $\log(Y)$ against $\log(X)$, one plots
$$M = \log(Y/X) = \log(Y) - \log(X)$$
against
$$A = (\log(X) + \log(Y))/2.$$
For the two channels we thus get: $M = \log_2(R/G) = \log_2 R - \log_2 G$ is plotted against $A = (\log_2 R + \log_2 G)/2$.
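Computing the MA coordinates is a two-liner; a minimal sketch (numpy assumed, function name hypothetical):

```python
import numpy as np

def ma_transform(R, G):
    """Map background-corrected red/green intensities to MA coordinates:
    M = log2(R) - log2(G), A = (log2(R) + log2(G)) / 2."""
    logR = np.log2(np.asarray(R, float))
    logG = np.log2(np.asarray(G, float))
    return (logR + logG) / 2.0, logR - logG   # A, M
```

Plotting M against A then gives the MA-plot discussed next.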

MA-Plot
The MA-plot is in fact the original scatter plot rotated by 45 degrees counterclockwise, with subsequent scaling.

MA-Plot
The above example shows differences in the incorporation of the label: here the molecules in the green channel have higher intensities than their respective ones in the red channel.

Heatmap
One of the most popular tools for microarray data visualization is the heatmap (Eisen, 1998). Heatmaps, also known as intensity or matrix plots, present a tabular view of the expression matrix. Using some ordering, the primary data table is represented graphically by filling each cell with a color on the basis of the measured intensity ratio. Typically a single color gradient is used to visualize log-transformed expression ratios in a heatmap. That gradient is constructed from three colors, which usually are green, black and red. Colors from the green-to-black gradient represent negative log-ratios, while colors from the black-to-red gradient represent positive log-ratios. The closer a log-ratio is to 0, the darker the color; the closer a log-ratio is to an extremum, the more saturated the color.

Heatmap
Example:

Profile Plots
Profile plots show the expression profiles of the genes across experiments:

Normalisation
A microarray experiment is always a comparative experiment. Often one wishes, for example, to detect differentially expressed genes between two different conditions of an experiment. In order to reliably detect variation in expression that is the result of biological rather than technical variation, one needs to reduce the technical variation to a minimum. Furthermore, many analysis methods assume that the data come from a normal distribution. Thus a normalisation step, transforming the distribution of the data towards a normal distribution, is necessary.

Normalisation within an array
When conducting a two-color microarray experiment, one often observes differences in the incorporation of the labels, which leads to global intensity differences.

Global Normalisation
Generally we are searching for a function $l$, which depends on parameters such as intensity, location and array type, and transform
$$\log_2(R/G) \rightarrow \log_2(R/G) - l \quad (8)$$
We will look at three strategies:
1. global scaling
2. linear regression
3. non-linear regression

Global Normalisation
For global intensity scaling one assumes $l$ to be constant over all spots and sets it equal to the mean or median of all log ratios:
$$I_{array} = \frac{1}{N} \sum_i \log_2\left(\frac{R_i}{G_i}\right), \qquad \log_2(e_i)' = \log_2(e_i) - I_{array}$$
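A sketch of global scaling on the log ratios (numpy assumed; the median is used as the constant $l$ here, the mean works the same way):

```python
import numpy as np

def global_normalise(M):
    """Global normalisation: subtract a constant l, here the median
    of all log2 ratios on the array, from every spot's log ratio."""
    return M - np.nanmedian(M)   # or np.nanmean(M) for mean scaling
```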

Linear regression for two-channel arrays
To check whether two distributions, such as the green and red channel intensity values, show a high (linear) correlation, one possibility is to compute an original scatterplot (or an MA-plot). In the case of high correlation, the point cloud approximates a straight line with slope 1 and intercept 0 (in the original scatterplot). If, however,
- a non-zero intercept is observed, then one distribution has consistently higher intensities;
- a slope different from 1 is observed, then one distribution shows a different response at higher intensities;
- it is no straight line, then no linear correlation exists.

Linear regression for two-channel arrays
Example: here a linear regression line is shown for the example above.

Linear regression for two-channel arrays
In the case of a non-zero intercept and/or a deviation from slope 1, a linear regression is performed in order to normalize the data. Here we demonstrate this with the example of a green channel normalisation:
$$R_j = \beta_0 + \beta_1 G_j + u_j \quad (9)$$

$\beta_0$ and $\beta_1$ are constants for the intercept and slope, and $u_j$ is a normally distributed random error. An estimator for the slope $b_1$ is given by the solution of the equation
$$b_1 = \frac{\sum_{j=1}^{n}(R_j - \bar{R})(G_j - \bar{G})}{\sum_{j=1}^{n}(G_j - \bar{G})^2} \quad (10)$$
An estimator for the intercept is then simply computed by
$$b_0 = \bar{R} - b_1 \bar{G} \quad (11)$$

Linear regression for two-channel arrays
In the last step, we apply this linear regression function to the intensity values of the red channel:
$$R_j' = \frac{R_j - b_0}{b_1} \quad (12)$$
Rather than doing a linear regression on the two channel distributions, it is often recommended to do a linear regression on the MA-values.

Non-linear regression for two-channel arrays
Very often a non-linear correlation of the two distributions is observed. In this case a commonly used normalisation method is locally weighted linear regression (lowess) (Cleveland, 1979). The basic idea is to move a window along the x-axis of the scatter plot and to perform a linear regression within each window. All regressions are then joined into the lowess curve. When lowess normalisation is applied to the MA-values, the normalised M-value of each feature is calculated by subtracting the lowess fit value $l(A)$ from the raw M-value:
$$M' = \log_2(R/G)' = \log_2(R/G) - l(A) \quad (13)$$

Non-linear regression demonstration
Example: here is a lowess demonstration in six steps.
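A sketch of lowess normalisation of MA-values, using the lowess smoother from statsmodels (the library choice and the window fraction are assumptions, not prescribed by the text):

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def lowess_normalise(A, M, frac=0.3):
    """Intensity-dependent normalisation, equation (13):
    fit l(A) to the MA cloud and return M - l(A)."""
    l_of_A = lowess(M, A, frac=frac, return_sorted=False)  # fitted values
    return M - l_of_A
```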


Normalisation between arrays
When analysing several experiments, i.e. arrays, further sources of variability in the data have to be considered. Normalisation is performed to correct for the non-biological variability introduced by using several arrays. Three standard methods are often used:
- Scaling
- Centering
- Distribution normalisation

Scaling

The goal of scaling the data is that the means (or medians) of all distributions are equal. By distributions we of course mean the distributions of the genes' expression values. After scaling, the mean of a gene's expression profile is equal to 0:
$$e_{scaled}(g_{ij}) = e(g_{ij}) - \bar{e}(g_i) \quad (14)$$

Centering
The goal of centering the data is to scale the data such that the mean and the standard deviation of all distributions are equal:
$$e_{center}(g_{ij}) = \frac{e(g_{ij}) - \bar{e}(g_i)}{\sigma(e(g_i))} \quad (15)$$
After centering, the mean of a gene's expression profile is equal to 0, and the standard deviation is equal to 1. Box plots are often used to compare several distributions simultaneously, for example to compare replicates, or the distributions before and after normalisation.
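Both operations are one-liners on the expression matrix; a sketch following equations (14) and (15) (numpy assumed; rows = genes, columns = arrays):

```python
import numpy as np

def scale(E):
    """Scaling, eq. (14): every gene profile gets mean 0."""
    return E - E.mean(axis=1, keepdims=True)

def center(E):
    """Centering, eq. (15): every gene profile gets mean 0 and
    standard deviation 1."""
    return scale(E) / E.std(axis=1, ddof=1, keepdims=True)
```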

Similarity and dissimilarity of expression data
In the following we will look at distance measures to compute the (dis)similarity of expression profiles. The computed (dis)similarity values will then be the input of clustering algorithms. Again we assume that we have an expression matrix with $n$ genes and $p$ arrays.

Metrics and semi-metrics for expression data
The most often used distance is the Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$
and/or the normalised Euclidean distance:
$$d(x, y) = \frac{\sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}}{n}$$
or the weighted Euclidean distance. Let $C$ be a diagonal matrix where the $c_{ii}$ are the weights:
$$d_w(x, y) = \sqrt{(x - y)^T C^{-1} (x - y)} = \sqrt{\sum_i (x_i - y_i)^2 / c_{ii}}$$

Metrics and semi-metrics for expression data
Another commonly applied metric is the $L_1$-metric, also known as the Manhattan metric:
$$d_{L_1}(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

Metrics and semi-metrics for expression data
A semi-metric measure is the Pearson correlation coefficient:
$$\rho(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}}$$
It is $\rho(x, y) \in [-1, 1]$; $\rho(x, y) = 1$ implies perfect similarity and $\rho(x, y) = 0$ randomness.

Metrics and semi-metrics for expression data
The Pearson correlation coefficient is a similarity measure, so one needs to transform it into a distance:
$$d_\rho(x, y) = 1 - \rho(x, y) \quad (16)$$

Metrics and semi-metrics for expression data
Mutual information
This distance measure is based on the notion of entropy. The entropy of an expression profile is a measure of the information content of the profile and is computed as
$$H(x) = -\sum_{i=1}^{k} p(x_i) \log_2(p(x_i))$$

Metrics and semi-metrics for expression data
The larger the entropy value, the more random the expression values are (i.e., the lower the information content). The entropy is computed from discrete probability values. However, gene expression values are normally measured on a continuous scale. To compute the entropy, it is therefore common to use the histogram method: first the range of the expression values is computed for each profile. This range is binned into $k$ intervals. $p(x_i)$ is then the relative frequency of the expression values within interval $x_i$.
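The metrics above translate directly into code; a sketch for two profiles x and y of equal length (numpy assumed):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):                       # L1 metric
    return np.sum(np.abs(x - y))

def pearson_distance(x, y):                # d_rho = 1 - rho, eq. (16)
    return 1.0 - np.corrcoef(x, y)[0, 1]
```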

Metrics and semi-metrics for expression data
The mutual information is a measure of the additional information that one gains by looking at a second expression profile. It is computed as
$$M(x, y) = H(x) + H(y) - H(x, y)$$
In other words, the mutual information of two expression profiles is computed by subtracting the joint entropy from the sum of the individual entropies of both profiles. Here
$$H(x, y) = -\sum_{i=1}^{k} \sum_{j=1}^{k} p(x_i, y_j) \log_2(p(x_i, y_j))$$

Metrics and semi-metrics for expression data
$M(x, y) = 0$ implies that the joint distribution of the two expression profiles carries no more information than the two individual profiles. A higher value of $M(x, y)$ implies that the two profiles are not randomly associated. Thus $M(x, y)$ can be used to compute the (dis)similarity of two profiles. However, by definition $M(x, y)$ is a similarity measure. To transform it into a distance measure, we first need to normalise it:
$$M(x, y)_{norm} = \frac{M(x, y)}{\max(H(x), H(y))}$$
Then the mutual information distance is defined by
$$d_{MI}(x, y) = 1 - M(x, y)_{norm}$$

Metrics and semi-metrics for expression data
In summary, in order to compute the mutual information distance between two expression profiles $x$ and $y$, the following computational steps are needed:
$$H(x) = -\sum_{i=1}^{k} p(x_i) \log_2(p(x_i))$$
$$H(y) = -\sum_{i=1}^{k} p(y_i) \log_2(p(y_i))$$
$$H(x, y) = -\sum_{i,j} p(x_i, y_j) \log_2(p(x_i, y_j))$$
$$M(x, y) = H(x) + H(y) - H(x, y)$$
$$M(x, y)_{norm} = \frac{M(x, y)}{\max(H(x), H(y))}$$
$$d_{MI}(x, y) = 1 - M(x, y)_{norm}$$
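The summary above maps to a short histogram-based sketch (numpy assumed; the number of bins k is a free parameter):

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution; empty bins ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mi_distance(x, y, k=10):
    """Mutual information distance d_MI = 1 - M / max(H(x), H(y)),
    with H(x), H(y), H(x,y) estimated via the histogram method."""
    counts, _, _ = np.histogram2d(x, y, bins=k)
    pxy = counts / counts.sum()            # joint bin probabilities
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    M = entropy(px) + entropy(py) - entropy(pxy.ravel())
    return 1.0 - M / max(entropy(px), entropy(py))
```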

Metrics and semi-metrics for expression data
Application example (from Butte and Kohane, 2000)¹: a publicly available RNA expression data set from Stanford, containing 79 separate measurements of 2,467 genes in Saccharomyces cerevisiae. Measurements of all genes were compared against each other, resulting in 3,041,811 total pairwise calculations of mutual information, ranging from 0.2 to 2.8. To assess the significance of this distribution, the RNA expression measurements were permuted, and the distribution of the pairwise mutual information values was recalculated for each permutation.

Clustering

Introduction
To analyse expression profiles in gene expression analysis, classification methods are often applied. The most commonly used methods are discriminant analysis and cluster analysis. While a classification analysis assigns objects to predefined groups (classes), cluster analysis computes groups of objects (which are here either genes or samples). In this section we first turn to cluster analysis; the next one introduces some methods of discriminant analysis. We distinguish two general types of cluster methods: hierarchical and partitioning methods.

¹ AJ Butte and IS Kohane (2000) Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements. PSB 5:

Hierarchical clustering
The result of hierarchical clustering is a set of nested clusters which can be visualized by means of a tree or dendrogram. We distinguish two types of hierarchical cluster approaches:
- bottom-up (agglomerative clustering)
- top-down (divisive clustering)

Bottom-up hierarchical clustering
Initialisation: each object is one cluster.
Iteration: combine the two clusters that have minimal distance.
Termination: one cluster that contains all objects.
Question: how to compute the distance between two clusters?

Bottom-up hierarchical clustering
Bottom-up hierarchical clustering algorithm:

    for i = 1 to n do
        c_i = {x_i}
    C = {c_1, ..., c_n}
    j = n + 1
    while |C| > 1 do
        (c_a, c_b) = argmin_{(c_u, c_v)} d(c_u, c_v)
        c_j = c_a ∪ c_b
        C = (C \ {c_a, c_b}) ∪ {c_j}
        j = j + 1

Bottom-up hierarchical clustering
Several versions exist, among them:
- Single linkage (or minimum method, nearest neighbor): $d(k, i \cup j) = \min(d(k, i), d(k, j))$
- Complete linkage (or maximum method, furthest neighbor): $d(k, i \cup j) = \max(d(k, i), d(k, j))$
- Average linkage (UPGMA): $d(k, i \cup j) = (n_i\, d(k, i) + n_j\, d(k, j)) / (n_i + n_j)$
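In practice one rarely implements the loop above by hand; scipy ships the standard linkage rules. A sketch on toy data (the method names map to single = minimum, complete = maximum, average = UPGMA):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

E = np.random.rand(50, 8)                             # toy matrix: 50 profiles, 8 arrays
Z = linkage(E, method='average', metric='euclidean')  # UPGMA on Euclidean distances
labels = fcluster(Z, t=4, criterion='maxclust')       # cut the dendrogram into 4 clusters
```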

k-means clustering
The goal of k-means clustering is to find a partition $C$ of the set $X$ into $k$ (pre-chosen) clusters, such that a given measure of homogeneity is minimised.

k-means clustering algorithm:
1. Choose $k$.
2. Choose randomly $k$ centers $\mu_1, \ldots, \mu_k$ that serve as the mean values of the clusters.
3. For each gene compute the nearest cluster center: $C(i) = \operatorname{argmin}_{1 \le l \le k} d(x_i, \mu_l)^2$
4. Compute the new mean of each cluster: $\mu_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j$
5. Repeat steps 3-4 until the algorithm converges.

k-means clustering
Note: the k-means method minimizes the total intra-cluster variance
$$\sum_{l=1}^{k} \sum_{i:\,C(i)=l} d(x_i, \mu_l)^2 \quad (17)$$
i.e. the sum of the squared distances between each gene expression profile and its respective cluster center. An important parameter of the method is the choice of $k$, the number of clusters. A possibility to optimize this choice is to run the algorithm several times with different values of $k$, compute the total intra-cluster variance each time, and plot the result.
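A direct sketch of steps 1-5 (numpy assumed; for simplicity it assumes no cluster runs empty during the iteration):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 2
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=2)
        assign = d.argmin(axis=1)                               # step 3
        new = np.array([X[assign == l].mean(axis=0) for l in range(k)])
        if np.allclose(new, centers):                           # converged
            break
        centers = new                                           # step 4
    return assign, centers
```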

Self-organizing Maps
The method of self-organizing maps (SOM) was originally introduced by Kohonen (1997). Here the n-dimensional space is projected onto a one- or two-dimensional grid. The dimension of the grid, i.e. the number of clusters, is pre-chosen. Each node in the grid is associated with a so-called reference vector. During the run of the algorithm, the input vectors (i.e. the gene profiles) iteratively pull the reference vectors towards the input vector space.

Self-organizing Maps
[Figure: principle of the SOM algorithm; the initial geometry is a 2 x 3 grid (points 1, ..., 6 joined by lines); the arrows indicate hypothetical trajectories of the nodes on their iterative way towards the fit of the data; data points are shown in black. Figure from Tamayo, 1999.]

Self-organizing Maps
Algorithm:
1. Input: the p-dimensional gene expression profiles.
2. Choose a grid with $k \times l$ nodes.
3. Initialize the p-dimensional reference vectors $f_0(v)$ with random values (either drawn from the input data or completely at random).
4. Iterate over $i$:
   a) For each profile $x \in E$, compute the node $v_x$ for which $f_i(v_x)$ is closest to $x$.
   b) Update all reference vectors as follows:
      $$f_{i+1}(v) = f_i(v) + \eta(h(v_x, v), i)\,(x - f_i(v))$$
      where $\eta$ is a learning rate that decreases with the iteration number $i$, and $h(v_x, v)$ is a learning function of the nodes $v_x$ and $v$ (thus, nodes that are not so close to $v_x$ are moved less).
   c) Repeat a) and b) until the algorithm converges.
Learning function, e.g. a Gaussian:
$$h(v_x, v) = \exp\left(-\frac{d(v_x, v)^2}{2\sigma^2}\right)$$

Visualisation of clusters
Profile plot
Plot either all profiles of each cluster or only the profile of the cluster representative.

Silhouette plot
For a profile $x$ from $E$, its silhouette $s(x)$ is defined as follows (Rousseeuw, 1987):
1. Compute $a(x)$, the mean distance of $x$ to all other profiles in the same cluster $C$:
   $$a(x) = \frac{1}{|C|} \sum_{i=1}^{|C|} d(x, x_i)$$
2. For each cluster $C_k$, compute the mean distance of $x$ to all profiles in $C_k$:
   $$d(x, C_k) = \frac{1}{|C_k|} \sum_{x_i \in C_k} d(x, x_i)$$

Silhouette plot
3. Compute the minimum of all $d(x, C_k)$ over the clusters of which $x$ is not a member: $b(x) = \min_k d(x, C_k)$. This is the distance of $x$ to the nearest cluster of which $x$ is not a member.
4. Finally:
   $$s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}$$
Profiles whose silhouette values $s(x)$ are equal or close to 1 lie within well-defined and tight clusters, profiles with $s(x)$ close to 0 lie exactly between two clusters, and those with negative $s(x)$ are in the wrong cluster.

Visualisation of clusters
Example: [Figure: cluster visualisation.]
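Computing $s(x)$ is straightforward; a sketch for one profile (numpy assumed; Euclidean distances chosen for concreteness):

```python
import numpy as np

def silhouette(i, X, labels):
    """s(x) for profile X[i], given one cluster label per profile."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    own[i] = False                            # exclude x itself
    a = d[own].mean()                         # mean within-cluster distance
    b = min(d[labels == k].mean()             # nearest foreign cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)
```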

Summary
- Either gene or experiment profiles can be clustered.
- Cluster algorithms generate visually interpretable images.
- The significance of clustering results is of great importance.
- Problems of unsupervised cluster methods are:
  - no rules can be deduced from the observed data
  - the number of clusters is generally not known
  - the best (dis)similarity measure is generally not known
  - many algorithms only compute approximate solutions without indicating the deviation from the input data

Classification
Microarray experiments are nowadays often used to study differences between types and subtypes of tumors. The goal is to develop diagnostic approaches using marker genes. In a microarray screen with data from different patients with known tumor classes, the goal of supervised learning is to find a subset of genes that allows one to distinguish the different classes. The microarray literature often uses "class detection" for clustering and "class prediction" for supervised learning, while the machine learning world talks of unsupervised and supervised learning.

Classification
The following figure illustrates the two different learning approaches.

[Figure taken from Ramaswamy and Golub]

Classification
A classification procedure typically consists of two tasks:
1. Learning task
   Given: expression profiles of samples and their classes.
   Task: learn a model that allows one to distinguish expression profiles of one class from those of the other classes.
2. Classification task
   Given: the expression profile of a new sample whose class is not known.
   Task: predict the class of the sample.
Let us describe the problem formally. Given is a set of objects to be classified into a predefined number of classes $K$, say $\{1, 2, \ldots, K\}$, or for binary classification tasks $\{-1, +1\}$ or $\{0, 1\}$. With each object we associate
- a class label $Y \in \{1, 2, \ldots, K\}$
- a set of $n$ measurements that define the feature vector $X = (X_1, \ldots, X_n)$.

The task is to classify an object into one of the $K$ classes on the basis of an observed measurement $X = x$; in other words, we want to predict $Y$ from $X$.

Definition 2. A classifier $C$ for $K$ classes partitions the set of gene expression profiles into $K$ disjoint subsets $T_1, \ldots, T_K$ such that for a sample with expression profile $x = (x_1, \ldots, x_n) \in T_j$, the predicted class is $j$. Thus, a classifier for the $K$ classes is a map $C: X \rightarrow \{1, 2, \ldots, K\}$.

Linear discriminants
The linear discriminant algorithms are among the simplest methods. Linear discriminant analysis (often also referred to as Fisher's linear discriminant analysis) is based on determining a linear combination $ax$ of the feature vectors $x = (x_1, \ldots, x_n)$. The various methods differ in their choice of $a$.

Linear discriminants
For the general case, the discriminant rule is
$$C(x) = \operatorname{argmin}_k\, (x - \mu_k) \Sigma_k^{-1} (x - \mu_k)^t \quad (18)$$
where $\Sigma_k$ is the covariance matrix of class $k$. The covariance of two random variables $X$ and $Y$ is defined as
$$\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
and is estimated as
$$\mathrm{cov}(x, y) = \frac{1}{p-1} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

Linear discriminants
The argument in the discriminant function above is just the squared Mahalanobis distance of $x$ to the vector $\mu_k$ of the means of the $k$th class. The Mahalanobis distance between two vectors $x$ and $y$ is defined as
$$d_{ML}(x, y) = (x - y) S^{-1} (x - y)^t \quad (19)$$
where $S$ is generally a positive definite matrix.

Linear discriminants
If $\Sigma_k = \Sigma$ for all $k$ classes, we have linear discriminant analysis (LDA):
$$C(x) = \operatorname{argmin}_k\, (x - \mu_k) \Sigma^{-1} (x - \mu_k)^t \quad (20)$$
In the simplest case, all classes have the same diagonal covariance matrix $\mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$. This leads to diagonal linear discriminant analysis (DLDA):
$$C(x) = \operatorname{argmin}_k \sum_{g=1}^{n} \frac{(x_g - \mu_{kg})^2}{\sigma_g^2} \quad (21)$$
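A sketch of DLDA following equation (21) (numpy assumed; the pooled within-class variance estimate is one common choice and an assumption here, not prescribed by the text):

```python
import numpy as np

def dlda_fit(X, y):
    """X: samples x genes, y: class labels. Returns class labels,
    class means, and a shared per-gene (pooled within-class) variance."""
    classes = np.unique(y)
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = X - mu[np.searchsorted(classes, y)]
    var = (resid ** 2).sum(axis=0) / (len(X) - len(classes))
    return classes, mu, var

def dlda_predict(x, classes, mu, var):
    """Equation (21): argmin_k sum_g (x_g - mu_kg)^2 / sigma_g^2."""
    return classes[(((x - mu) ** 2) / var).sum(axis=1).argmin()]
```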

Linear discriminants
For the binary class case, i.e. $K = 2$, the DLDA rule reduces to the sign of
$$\sum_{g=1}^{n} a_g (x_g - b_g) = \sum_{g=1}^{n} \frac{\mu_{1g} - \mu_{2g}}{\sigma_g^2} \left( x_g - \frac{\mu_{1g} + \mu_{2g}}{2} \right) \quad (22)$$

Testing a classifier
For small data sets, cross-validation, and here especially leave-one-out cross-validation (LOOCV), is used. For LOOCV, one sample is taken out of the whole data set, the predictor is trained on the remaining $p - 1$ samples, and the class of the left-out sample is predicted. This is done for all $p$ samples, and from this an error rate is computed.

Testing a classifier
Let $\{s_1, s_2, \ldots, s_p\}$ be the full sample set.
For $i = 1, \ldots, p$ do:
- generate the predictor from $\{s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_p\}$
- compute the number of false class predictions $ER_i$
Compute the overall cross-validation error, e.g. by taking the mean of all $ER_i$ (see the sketch after this list).

Feature selection
Problem: microarray data has too many features.
- Explicit selection: filter methods applied before generating the classifier (e.g. ReliefF).
- Implicit selection: wrapper methods applied while generating the classifier.

Other machine learning methods
- Decision trees
- Neural networks
- Support vector machines (SVMs)
- Bayesian regression methods
- ...
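A sketch of the LOOCV loop, reusing the DLDA functions sketched above (the fit/predict interface is an assumption):

```python
import numpy as np

def loocv_error(X, y, fit=dlda_fit, predict=dlda_predict):
    """Leave-one-out cross-validation error rate."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # leave sample i out
        model = fit(X[mask], y[mask])
        errors += predict(X[i], *model) != y[i]
    return errors / len(X)
```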

Statistical testing

Differentially expressed genes
One of the main goals of microarray experiments is the detection of differentially expressed genes. The starting point is the comparison of the expression values of a gene in two or more cell populations. This leads to the central question:

Problem 3. How does one distinguish expression differences of a gene between different experiments that have a biological cause from those that are the result of technical variation or noise?

Hypothesis testing
An obvious choice is to use a statistical test that ranks the genes from highest to lowest evidence for differential expression. Then, for a pre-chosen critical value of the rank statistic, all values above the threshold are called significant. In our case we ask, for example: for a given gene $g$, assume its expression values are $(g_{A1}, \ldots, g_{Ak})$ in population A and $(g_{B1}, \ldots, g_{Bl})$ in population B. Let $\bar{g}_A$, $\bar{g}_B$ be the mean expression in populations A and B, respectively. Now we ask: is $\bar{g}_A \neq \bar{g}_B$, and is the difference of the two means significant or random?

Hypothesis testing
Answers are offered by hypothesis testing. Generally, all hypothesis tests involve the comparison of an observation with the value one would expect by chance for a given test statistic. The following two hypotheses are stated:
- Null hypothesis (e.g. the mean of a gene under condition A is not different from the mean of the gene under condition B, i.e. the gene is not differentially expressed), often denoted by H0: the observed value does not differ significantly from the one expected by chance.
- Alternative hypothesis (e.g. the gene is differentially expressed), often denoted by H1: the observed value differs significantly from the one expected by chance.

Hypothesis testing
One distinguishes the one-sided (also called directional) test - the unknown parameter $\theta$ to estimate (e.g. a mean) is either larger or smaller than a given parameter $\theta_0$ - from the two-sided test, which estimates whether the unknown parameter is unequal to the given parameter.

Null hypothesis       Alternative hypothesis    Type of test
H0: θ = θ0            H1: θ ≠ θ0                two-sided test
H0: θ ≤ θ0            H1: θ > θ0                one-sided test
H0: θ ≥ θ0            H1: θ < θ0                one-sided test

Error types
All statistical tests can bear errors. We distinguish the type I error $\alpha$ - the null hypothesis has been falsely rejected - and the type II error $\beta$ - the null hypothesis has been falsely accepted. For the type I error one uses a low significance level, $\alpha$ = 1% or 5%. For a chosen $\alpha$, the significance level is the p-value:

                Accept H0          Reject H0
H0 is true      correct            type I error
H0 is false     type II error      correct

p-value
The p-value is the probability of obtaining, by chance, a value of the test statistic at least as extreme as the one observed for the sample. Thus the p-value is the probability of rejecting the null hypothesis erroneously. If the p-value is smaller than the chosen type I error, the null hypothesis is rejected. The p-value is often compared against the significance level: if the null hypothesis is rejected with significance $\alpha$, then $p < \alpha$.

t-test
A commonly chosen test is the t-test. This method (among others) takes the variation of the expression values into account. The t-test is an example of classical hypothesis testing.

t-distribution
Definition 4. Let $X$ be a standard-normally distributed variable $\mathcal{N}(0, 1)$; then the distribution of $X^2$ is called the $\chi^2$-distribution with one degree of freedom.

Definition 5. Let $X_1, \ldots, X_n$ be independent $\chi^2$-distributed variables with one degree of freedom; then the distribution of
$$Y = \sum_{i=1}^{n} X_i \quad (23)$$
is called the $\chi^2$-distribution with $n$ degrees of freedom.

t-distribution
Definition 6. Let $X$ be a standard-normally distributed random variable and let $Y$ have a $\chi^2$-distribution with $n$ degrees of freedom, with $X$ and $Y$ independent of each other. Then the distribution of
$$T = \frac{X}{\sqrt{Y/n}} \quad (24)$$
is called the t- or Student-distribution with $n$ degrees of freedom.

t-distribution
[Figure: the t-distribution with 5 (left) and 50 (right) degrees of freedom; the green line is the standard normal distribution.]

The one-sample t-test
When we compare the mean expression level of a gene $g$ against a known mean of the underlying population (for example $\mu = 1$ for raw ratios, or $\mu = 0$ for log ratios), we use the one-sample t-test:

Definition 7. The one-sample t-statistic of gene $g = (g_1, \ldots, g_p)$ is defined by
$$t(g) = \frac{\bar{g} - \mu}{\sigma(g)/\sqrt{p}}, \quad (25)$$
where $\bar{g}$ denotes the mean expression value of $g$ and $\sigma(g)$ its standard deviation.

The one-sample t-test
Under the null hypothesis (here $\mu = 0$), the one-sample t-statistic follows a $t_{p-1}$ distribution. For the one-sided test one compares the computed t-value with $t(1 - \alpha, p - 1)$; if $t > t(1 - \alpha, p - 1)$, the null hypothesis ($H_0: \mu = \mu_0 = 0$) is rejected. For the two-sided test one compares the computed t-value with $t(1 - \alpha/2, p - 1)$; if $|t| > t(1 - \alpha/2, p - 1)$, the null hypothesis is rejected.

Example: for $t = 1.7$, the one-sided p-value is half the two-sided p-value; the exact values depend on the degrees of freedom $p - 1$.

The two-sample test
Given a sample $x_1, \ldots, x_{m_1}$ of $\mathcal{N}(\mu_1, \sigma^2)$-distributed random variables $X_1, \ldots, X_{m_1}$, as well as a sample $y_1, \ldots, y_{m_2}$ of $\mathcal{N}(\mu_2, \sigma^2)$-distributed random variables $Y_1, \ldots, Y_{m_2}$. We assume that all $m_1 + m_2$ random variables are independent with the same variance $\sigma^2$. The expected means $\mu_1$ and $\mu_2$ are unknown. We test whether $\mu_1 = \mu_2$. Compute the pooled variance estimate $s^2$ as follows:
$$s^2 = \frac{1}{m_1 + m_2 - 2} \left( \sum_{i=1}^{m_1} (x_i - \bar{x})^2 + \sum_{j=1}^{m_2} (y_j - \bar{y})^2 \right)$$

Definition 8. The two-sample t-statistic is defined by
$$t = \frac{\bar{x} - \bar{y}}{\sqrt{s^2 \left( \frac{1}{m_1} + \frac{1}{m_2} \right)}} \quad (26)$$
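A sketch of the two-sample t-test of Definition 8, with the p-value taken from scipy (scipy's ttest_ind with equal_var=True computes the same quantity):

```python
import numpy as np
from scipy import stats

def two_sample_t(x, y):
    """Pooled-variance two-sample t-statistic, equation (26), and the
    two-sided p-value from the t-distribution with m1+m2-2 dof."""
    m1, m2 = len(x), len(y)
    s2 = (np.sum((x - x.mean()) ** 2) + np.sum((y - y.mean()) ** 2)) / (m1 + m2 - 2)
    t = (x.mean() - y.mean()) / np.sqrt(s2 * (1 / m1 + 1 / m2))
    p = 2 * stats.t.sf(abs(t), df=m1 + m2 - 2)
    return t, p

# equivalent: stats.ttest_ind(x, y, equal_var=True)
```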

The two-sample test
Under the null hypothesis, the two-sample t-statistic follows the t-distribution with $m_1 + m_2 - 2$ degrees of freedom. For a chosen $\alpha$, one compares the t-value with $t(1 - \alpha, m_1 + m_2 - 2)$ for the one-sided test, or with $t(1 - \alpha/2, m_1 + m_2 - 2)$ for the two-sided test; if the t-value exceeds this threshold, the null hypothesis is rejected.

One- and two-sided tests
We distinguish the following tests:
- two-sided: do A and B have different means?
- one-sided: does A have a larger (smaller) mean than B?
The interpretation for microarray experiments is that a two-sided test detects differentially regulated genes, while a one-sided test seeks up- (down-)regulated genes.

The two-sample test
Summary of applying the t-test:
- Choose the significance level, i.e. the type I error threshold $\alpha$.
- Compute a one- or two-sided t-statistic for each gene $g$.
- Compute the p-value from the distribution of the test statistic.

3 Sequencing by Hybridization

Sequencing by Hybridization (SBH)
Originally, the hope was that one could use DNA chips to sequence large unknown DNA fragments using a large array of short probes:
1. Produce a chip $C(l)$ spotted with all possible probes of length $l$ ($l = 8$ in the first SBH papers).
2. Apply a solution containing many copies of a fluorescently labeled DNA target fragment to the array.
3. The DNA fragments hybridize to those probes that are complementary to substrings of length $l$ of the fragment.
4. Detect the probes that hybridize with the DNA fragment and obtain the l-tuple composition of the DNA fragment.
5. Apply a combinatorial algorithm to reconstruct the sequence of the DNA target from its l-tuple composition.

The Shortest Superstring Problem
SBH provides information about the l-tuples present in a target DNA sequence, but not about their positions. Suppose we are given the spectrum $S$ of all l-tuples of a target DNA sequence; how do we reconstruct the sequence?

This is a special case of the Shortest Common Superstring Problem (SCS): a superstring for a given set of strings $s_1, s_2, \ldots, s_m$ is a string that contains each $s_i$ as a substring. Given a set of strings, finding the shortest superstring is NP-complete.

The Shortest Superstring Problem
Define $\mathrm{overlap}(s_i, s_j)$ as the length of a maximal prefix of $s_j$ that matches a suffix of $s_i$. The SCS problem can be cast as a Traveling Salesman Problem in a complete directed graph $G$ with $m$ vertices $s_1, s_2, \ldots, s_m$ and edges $(s_i, s_j)$ of length $\mathrm{overlap}(s_i, s_j)$.

The SBH graph
SBH corresponds to the special case in which all substrings have the same length $l$. We say that two SBH probes $p$ and $q$ overlap if the last $l - 1$ letters of $p$ coincide with the first $l - 1$ letters of $q$. Given the spectrum $S$ of a DNA fragment, construct the directed graph $H$ with vertex set $S$ and edge set $E = \{(p, q) \mid p$ and $q$ overlap$\}$. There exists a one-to-one correspondence between paths that visit each vertex of $H$ at least once and the DNA fragments with the spectrum $S$. Vertices: l-tuples of the spectrum $S$; edges: overlapping l-tuples.

S = { ATG AGG TGC TCC GTC GGT GCA CAG }

The path in $H$ visiting all vertices corresponds to the sequence reconstruction ATGCAGGTCC. A path that visits all vertices of a graph exactly once is called a Hamiltonian path. Unfortunately, the Hamiltonian Path Problem is NP-complete, so for larger graphs we cannot hope to find such paths.

Second example of the SBH graph

S = { ATG TGG TGC GTG GGC GCA GCG CGT }

This example has two different Hamiltonian paths and thus two different reconstructed sequences: ATGCGTGGCA and ATGGCGTGCA.

Euler Path
Leonhard Euler wanted to know whether there exists a path that uses each of the seven bridges in Königsberg exactly once:

[Figure: the seven bridges of Königsberg across the Pregel river, around the Kneiphof island.]

The birth of graph theory...

SBH and the Eulerian Path Problem
Let $S$ be the spectrum of a DNA fragment. We define a graph $G$ whose set of nodes consists of all possible $(l-1)$-tuples. We connect one $(l-1)$-tuple $v = v_1 \ldots v_{l-1}$ to another $w = w_1 \ldots w_{l-1}$ by a directed edge $(v, w)$ if the spectrum $S$ contains an l-tuple $u$ with prefix $v$ and suffix $w$, i.e. such that $u = v_1 \ldots v_{l-1} w_{l-1} = v_1 w_1 \ldots w_{l-1}$. Hence, in this graph the probes correspond to edges, and the problem is to find a path that visits all edges exactly once, i.e. an Eulerian path. Finding an Eulerian path is computationally simple.

SBH and the Eulerian Path Problem

S = { ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT }

[Figure: the corresponding graph on the vertices AT, TG, GT, CG, GG, GC, CA.] Vertices represent $(l-1)$-tuples; edges correspond to l-tuples of the spectrum.

SBH and the Eulerian Path Problem
There are two different solutions:

[Figure: the two Eulerian paths of the graph, yielding the reconstructions ATGGCGTGCA and ATGCGTGGCA.]

SBH and the Eulerian Path Problem
A directed graph $G$ is called Eulerian if it contains a cycle that traverses every edge of $G$ exactly once. A vertex $v$ is called balanced if the number of edges entering $v$ equals the number of edges leaving $v$, i.e. $\mathrm{indegree}(v) = \mathrm{outdegree}(v)$. We call $v$ semi-balanced if $|\mathrm{indegree}(v) - \mathrm{outdegree}(v)| = 1$.

Theorem. A directed graph is Eulerian iff it is connected and each of its vertices is balanced.

Lemma. A connected directed graph contains an Eulerian path iff it has at most two semi-balanced vertices.

Probability of unique sequence reconstruction
What is the probability that a randomly generated DNA fragment of length $n$ can be uniquely reconstructed using a DNA array $C(l)$? In other words, how large must $l$ be so that a random sequence of length $n$ can be uniquely reconstructed from its l-tuples? We assume that the bases at each position are chosen independently, each with probability $p = \frac{1}{4}$. Note that a repeat of length $l$ will always lead to a non-unique reconstruction. We expect about $\binom{n}{2} p^l$ repeats of length $l$. Note that $\binom{n}{2} p^l = 1$ implies $l = \log_{1/p} \binom{n}{2}$.

Probability of unique sequence reconstruction
For a given $l$ one should therefore choose $n \le \sqrt{2 \cdot 4^l}$, but not larger. (However, this is a very loose bound, and a much tighter bound is known.)

SBH currently infeasible
The Eulerian path approach to SBH is currently infeasible due to two problems:
- Errors in the data:
  - False positives arise when the target DNA hybridizes to a probe even though an exact match is not present.
  - False negatives arise when an exact match goes undetected.
- Repeats make the reconstruction impossible as soon as the length of the repeated sequence exceeds the word length $l$.
Nevertheless, the ideas developed here are employed in a newer approach to sequence assembly that uses sequenced reads and an Eulerian path representation of the data (Pavel Pevzner, Recomb 2001).
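To make the construction concrete, here is a sketch that builds the $(l-1)$-tuple graph from a spectrum and reads off one Eulerian path with Hierholzer's algorithm (it assumes an error-free spectrum in which an Eulerian path exists):

```python
from collections import defaultdict

def sbh_reconstruct(spectrum):
    graph, indeg = defaultdict(list), defaultdict(int)
    for probe in spectrum:                 # edge: (l-1)-prefix -> (l-1)-suffix
        v, w = probe[:-1], probe[1:]
        graph[v].append(w)
        indeg[w] += 1
    # start at a semi-balanced vertex (outdegree = indegree + 1) if any
    start = next((v for v in list(graph) if len(graph[v]) > indeg[v]),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:                           # Hierholzer's algorithm
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()                         # vertices along the Eulerian path
    return path[0] + ''.join(v[-1] for v in path[1:])

print(sbh_reconstruct(["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]))
# prints ATGGCGTGCA or ATGCGTGGCA, depending on the edge traversal order
```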


More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Region-based Segmentation

Region-based Segmentation Region-based Segmentation Image Segmentation Group similar components (such as, pixels in an image, image frames in a video) to obtain a compact representation. Applications: Finding tumors, veins, etc.

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS

MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.

Comparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays. Comparisons and validation of statistical clustering techniques for microarray gene expression data Susmita Datta and Somnath Datta Presented by: Jenni Dietrich Assisted by: Jeffrey Kidd and Kristin Wheeler

More information

Double Self-Organizing Maps to Cluster Gene Expression Data

Double Self-Organizing Maps to Cluster Gene Expression Data Double Self-Organizing Maps to Cluster Gene Expression Data Dali Wang, Habtom Ressom, Mohamad Musavi, Cristian Domnisoru University of Maine, Department of Electrical & Computer Engineering, Intelligent

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation

More information

SOCIAL MEDIA MINING. Data Mining Essentials

SOCIAL MEDIA MINING. Data Mining Essentials SOCIAL MEDIA MINING Data Mining Essentials Dear instructors/users of these slides: Please feel free to include these slides in your own material, or modify them as you see fit. If you decide to incorporate

More information

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University

Classification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 13 UNSUPERVISED LEARNING If you have access to labeled training data, you know what to do. This is the supervised setting, in which you have a teacher telling

More information

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other

More information

Analyzing ICAT Data. Analyzing ICAT Data

Analyzing ICAT Data. Analyzing ICAT Data Analyzing ICAT Data Gary Van Domselaar University of Alberta Analyzing ICAT Data ICAT: Isotope Coded Affinity Tag Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

Variable Selection 6.783, Biomedical Decision Support

Variable Selection 6.783, Biomedical Decision Support 6.783, Biomedical Decision Support (lrosasco@mit.edu) Department of Brain and Cognitive Science- MIT November 2, 2009 About this class Why selecting variables Approaches to variable selection Sparsity-based

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Unsupervised learning Until now, we have assumed our training samples are labeled by their category membership. Methods that use labeled samples are said to be supervised. However,

More information

What to come. There will be a few more topics we will cover on supervised learning

What to come. There will be a few more topics we will cover on supervised learning Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression

More information

SVM Classification in -Arrays

SVM Classification in -Arrays SVM Classification in -Arrays SVM classification and validation of cancer tissue samples using microarray expression data Furey et al, 2000 Special Topics in Bioinformatics, SS10 A. Regl, 7055213 What

More information

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262

Feature Selection. Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester / 262 Feature Selection Department Biosysteme Karsten Borgwardt Data Mining Course Basel Fall Semester 2016 239 / 262 What is Feature Selection? Department Biosysteme Karsten Borgwardt Data Mining Course Basel

More information

Clustering gene expression data

Clustering gene expression data Clustering gene expression data 1 How Gene Expression Data Looks Entries of the Raw Data matrix: Ratio values Absolute values Row = gene s expression pattern Column = experiment/condition s profile genes

More information

ECS 234: Data Analysis: Clustering ECS 234

ECS 234: Data Analysis: Clustering ECS 234 : Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed

More information

Course on Microarray Gene Expression Analysis

Course on Microarray Gene Expression Analysis Course on Microarray Gene Expression Analysis ::: Normalization methods and data preprocessing Madrid, April 27th, 2011. Gonzalo Gómez ggomez@cnio.es Bioinformatics Unit CNIO ::: Introduction. The probe-level

More information

Unsupervised Learning

Unsupervised Learning Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised

More information

Machine learning - HT Clustering

Machine learning - HT Clustering Machine learning - HT 2016 10. Clustering Varun Kanade University of Oxford March 4, 2016 Announcements Practical Next Week - No submission Final Exam: Pick up on Monday Material covered next week is not

More information

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551

More information

Finding Clusters 1 / 60

Finding Clusters 1 / 60 Finding Clusters Types of Clustering Approaches: Linkage Based, e.g. Hierarchical Clustering Clustering by Partitioning, e.g. k-means Density Based Clustering, e.g. DBScan Grid Based Clustering 1 / 60

More information

Dimension reduction : PCA and Clustering

Dimension reduction : PCA and Clustering Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental

More information

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering

Sequence clustering. Introduction. Clustering basics. Hierarchical clustering Sequence clustering Introduction Data clustering is one of the key tools used in various incarnations of data-mining - trying to make sense of large datasets. It is, thus, natural to ask whether clustering

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates?

Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Model Evaluation Metrics for Performance Evaluation How to evaluate the performance of a model? Methods for Performance Evaluation How to obtain reliable estimates? Methods for Model Comparison How to

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray

More information

Supervised Learning for Image Segmentation

Supervised Learning for Image Segmentation Supervised Learning for Image Segmentation Raphael Meier 06.10.2016 Raphael Meier MIA 2016 06.10.2016 1 / 52 References A. Ng, Machine Learning lecture, Stanford University. A. Criminisi, J. Shotton, E.

More information

Eulerian Tours and Fleury s Algorithm

Eulerian Tours and Fleury s Algorithm Eulerian Tours and Fleury s Algorithm CSE21 Winter 2017, Day 12 (B00), Day 8 (A00) February 8, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Vocabulary Path (or walk): describes a route from one vertex

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Hierarchical Clustering

Hierarchical Clustering What is clustering Partitioning of a data set into subsets. A cluster is a group of relatively homogeneous cases or observations Hierarchical Clustering Mikhail Dozmorov Fall 2016 2/61 What is clustering

More information

Multivariate Analysis

Multivariate Analysis Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Leena Salmena and Veli Mäkinen, which are partly from http: //bix.ucsd.edu/bioalgorithms/slides.php. 582670 Algorithms for Bioinformatics Lecture 6: Distance based clustering and

More information