Micro-array Image Analysis using Clustering Methods


Mrs Rekha A Kulkarni
PICT, Pune
kulkarni_rekha@hotmail.com

Abstract

Micro-array imaging is an emerging technology, and several experimental procedures have been developed that produce images with different characteristics. Micro-array images are processed in a multi-step procedure involving image segmentation, information extraction and normalization. The images are structured as high intensity spots located on a grid, and each spot corresponds to a gene. Spots are roughly circular, though some deviate significantly from this shape because of experimental variation. Clustering methods group a set of genes or arrays according to the similarity of the genes to each other. Clustering techniques such as K-means and Partitioning Around Medoids (PAM) have recently been used for micro-array image segmentation and classification. K-means is a partitioning algorithm with a prefixed number k of clusters; it tries to minimize the sum of within-cluster variances. Partitioning Around Medoids is a partitioning algorithm that generalizes K-means.

Key Words: Micro-Array, Gridding, Clustering, K-Means, PAM

1. Introduction

Micro-array technologies, able to measure the expression of thousands of genes in a single experiment, have developed over the past decade and now produce huge amounts of data. New techniques for looking at genetic variations in large human populations, and for identifying interactions between sets of proteins in cells, are pouring data onto file servers around the world. Bioinformatics is charged with managing and making sense of all of this data, keeping pace with both data production and technology development. There is plenty of work to go around.

A micro-array is a glass microscope slide with a large number of ordered target sequences on it. These target sequences normally consist of cDNA or RNA sequences. They are single stranded, as opposed to DNA, which is double stranded.
There are thousands of target sequences on a micro-array. Micro-array technology, as a high throughput approach to differential gene expression studies, efficiently generates massive amounts of gene regulation data, helping scientists to quickly identify which gene candidates to follow up with functional characterization. Traditional techniques for the study of gene expression allow investigators to study only one or a few genes at a time.

Genomic projects aimed at cloning, mapping and sequencing the genomes of various organisms have generated large amounts of sequence data. However, the function, expression and regulation of more than 80% of the genes were unknown. The next phase of the human genome project will place strong emphasis on assigning function to these genes. Of the methods by which one can assign function to genes, DNA micro-array analysis is widely used to extract patterns of gene expression. Although both cDNA micro-arrays and oligonucleotide arrays are capable of analyzing patterns of gene expression, fundamental differences exist between the methods.

2. Existing segmentation techniques

Fixed circle segmentation: Fits a circle with a constant diameter to all spots in the image. It is easy to implement, but the spots need to be of the same shape and size.

Adaptive circle segmentation: The circle diameter is estimated separately for each spot.

Adaptive shape segmentation: Requires the specification of starting points, or seeds. A bonus is that the geometry of the array is already known. Regions grow outwards from the seed points, preferentially according to the difference between a pixel's value and the running mean of the values in an adjoining region.

Histogram segmentation: Uses a target mask chosen to be larger than any spot. Foreground and background intensities are determined from the histogram of pixel values for the pixels within the masked area.

3. Intensity based segmentation

The images are structured as high intensity spots, which correspond to the probes, located on a grid. The spots are roughly circular, though some deviate significantly from this shape because of experimental variation in the spotting procedure. The underlying principle in micro-array image analysis is that the spot intensity is a measure of gene expression. This implicitly assumes that the gene expression of a spot is governed entirely by the distribution of its pixel intensities.

Clustering based segmentation is used to extract the intensity of the spots. The approximate boundaries of the spots in the micro-array are determined by adjusting rectilinear grids. K-means and Partitioning Around Medoids are then used to generate a binary partition of the pixel intensities.

The images contain spots of four colors: red, green, yellow and black. A red spot implies that a particular gene is expressed in the experimental channel. A green spot indicates high expression in the control channel. A yellow spot indicates that the gene is expressed in both channels, and a black spot that it is expressed in neither.

To estimate the differential gene expression, the high intensity regions in the image corresponding to each probe have to be identified. This is done as part of image segmentation. Then the local background noise has to be estimated and removed, which corresponds to background correction. An example of a spot or gene summary statistic for cDNA is the ratio of background corrected mean intensities. We denote by R_i the pixels from the red fluorescent scan and by G_i the pixels from the green fluorescent scan.
The differential expression level R/G is then calculated as the ratio of the background corrected mean ROI intensities:

$$ R/G = \frac{\frac{1}{S}\sum_{i=1}^{S} R_i - \mu_R}{\frac{1}{S}\sum_{i=1}^{S} G_i - \mu_G} $$

where $\mu_R$ and $\mu_G$ are the estimates of the local background in the red and green channels and S is the number of ROI pixels.

4. Micro-array image analysis

Image analysis involves three stages. First, the arrayed genes must be distinguished from spurious signals that can arise from precipitated probe, other hybridization artifacts or dust on the surface of the slide. After gridding, the spot intensities (real signal) and the background (noise) have to be calculated for the spots. It is always better to calculate the background locally for each spot rather than globally for the entire image. The next step in image processing is the extraction of signal, noise and quality control measures for each spot.

Image processing steps:
1. Addressing / Gridding: Find the areas in an image that belong to spots. The combined area of a spot and its background is called the target area.
2. Segmentation: Partition the target area into foreground and background.
3. Reduction: Extract two scalar values R and G for the red and green intensity and assign one value for relative abundance.

4.1. Gridding

This is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high throughput analysis. Ideally, spots should be equally spaced across the array. However, the robot arm that prints the spots often introduces deviations from these positions. The approximate boundaries of the spots are determined by drawing a rectilinear grid and adjusting it, so that each spot is enclosed in a rectangular box.

4.2. Segmentation

Segmentation classifies each pixel as belonging either to a spot of interest or to the background. It involves partitioning the image into disjoint sets. Consider an image I partitioned into L regions $r_i$, $i = 1, \dots, L$. Then $I = \bigcup_{i=1}^{L} r_i$, and each $r_i$ satisfies a predicate.
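The partition definition above can be made concrete with a small sketch. It assumes NumPy, a toy 4x4 target area with made-up intensities, and a fixed intensity threshold as the predicate. The function name `partition_by_predicate` and the threshold are hypothetical stand-ins for illustration only; the paper obtains the split by clustering, not by thresholding.

```python
import numpy as np

def partition_by_predicate(image, threshold):
    """Split a 2-D target area into two regions by an intensity predicate.

    The predicate (intensity >= threshold) is an assumption for illustration;
    it is not the clustering-based split described in the paper.
    """
    image = np.asarray(image, dtype=float)
    foreground = image >= threshold   # region r_1: pixels satisfying the predicate
    background = ~foreground          # region r_2: all remaining pixels
    return foreground, background

# Toy target area with a bright 2x2 spot in the middle (made-up values).
target = np.array([
    [10, 12, 11,  9],
    [11, 90, 95, 10],
    [12, 92, 88, 11],
    [ 9, 10, 12, 10],
])
fg, bg = partition_by_predicate(target, threshold=50)

# The two regions are disjoint and their union covers the whole image.
assert not np.any(fg & bg)
assert np.all(fg | bg)
```

Boolean masks keep the two regions aligned with the image grid, so the disjointness and union properties of the partition can be checked directly.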
Image segmentation can be either texture based or intensity based. The former is governed by spatial characteristics, whereas the latter is governed entirely by the distribution of pixel intensities. The choice of segmentation technique depends on the problem at hand. Since a gene expression value is proportional to the intensity of the spot, segmentation based on intensity is appropriate. The objective is to separate the foreground and background pixel intensities inside each grid. The one-dimensional distributions of the foreground and background pixel intensities can be represented by f(f) and f(b), respectively. The distribution of pixel intensities in a grid can be assumed to be the superposition of f(f) and f(b). The following cases can arise:
1) f(f) and f(b) are narrowly distributed with no overlap.
2) f(f) is spread out whereas f(b) is narrowly distributed.
3) f(f) and f(b) exhibit significant overlap.

4.3. Reduction

Rfg, Gfg and Rbg, Gbg are the cluster means with high and low intensities, respectively; (Rfg-Rbg)/(Gfg-Gbg) is the final relative abundance estimate.

Fig. 1: Flow chart of micro-array image segmentation. Steps: Start; input the micro-array image file in TIFF format and the number of rows and columns (i.e. the total number of spots); manually align the row and column grids to determine the approximate boundary of each spot in the control channel (Cy3); retain the coordinates for the experimental channel (Cy5); set i = 1; map the image matrix I of the i-th grid into a one-dimensional vector v; apply the K-means clustering technique to obtain a binary partition of the vector v; compute f(i) = median of the foreground pixels, b(i) = median of the background pixels and t(i) = f(i) - b(i); decision step (Is ...); Stop.

5. Clustering methods

Clustering techniques such as K-means and PAM are useful for detecting patterns in data generated by unknown processes and have recently been used for micro-array segmentation. The input to the K-means and PAM clustering algorithms is the set of two-dimensional values (Ri, Gi), where Ri and Gi represent the i-th pixel intensity of a given spot in the Cy5 (red) and Cy3 (green) channels, respectively.

Cluster algorithms: k-means

K-means is a partitioning algorithm with a prefixed number k of clusters. It tries to minimize the sum of within-cluster variances. The algorithm chooses a random sample of k different objects as the initial cluster midpoints. Then it alternates between two steps until convergence:
1. Assign each object to the closest of the k midpoints with respect to Euclidean distance.
2. Calculate k new midpoints as the averages of all points assigned to the old midpoints, respectively.

K-means is a randomized algorithm, so two runs usually produce different results.
Thus it has to be applied a few times to the same data set, and the result with the minimal sum of within-cluster variances should be chosen.

PxKmeans: pixel clustering with k-means

1. Construct initial representatives: the starting midpoints are m1 = (Rfg, Gfg) and m2 = (Rbg, Gbg), where Rfg, Gfg are the highest intensity values and Rbg, Gbg the lowest.
2. Find a local optimum of the cluster problem (k-means). Alternate until convergence:
   - Assign each data point to the closest of the two midpoints.
   - Calculate two new midpoints as the means of all points assigned to the old midpoints, respectively.
3. Reduction: Rfg, Gfg and Rbg, Gbg are the cluster means with high and low intensities, respectively; (Rfg-Rbg)/(Gfg-Gbg) is the final relative abundance estimate.

Cluster algorithms: PAM

PAM (Partitioning Around Medoids; Kaufman and Rousseeuw) is a partitioning algorithm and a generalization of k-means.
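Before turning to PAM in detail, the three PxKmeans steps listed above can be sketched in a few lines. This is a minimal illustration assuming NumPy, not the author's implementation. The function name `px_kmeans`, the `max_iter` cap and the toy pixel values are assumptions, and "highest" and "lowest" intensity values are read here as the per-channel maximum and minimum. The sketch does not guard against Gfg equal to Gbg, which would make the ratio undefined.

```python
import numpy as np

def px_kmeans(pixels, max_iter=100):
    """Two-cluster k-means on (R, G) pixel pairs, seeded as in PxKmeans.

    pixels: array of shape (n, 2) holding (R_i, G_i) for one target area.
    Returns (ratio, foreground midpoint, background midpoint).
    """
    x = np.asarray(pixels, dtype=float)
    # Step 1: seed with the per-channel highest and lowest intensities
    # (one reading of "highest/lowest intensity values"; an assumption).
    midpoints = np.array([x.max(axis=0), x.min(axis=0)])
    labels = None
    for _ in range(max_iter):
        # Step 2a: assign each pixel to its closest midpoint (Euclidean).
        dist = np.linalg.norm(x[:, None, :] - midpoints[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the midpoints are stable
        labels = new_labels
        # Step 2b: recompute each midpoint as the mean of its pixels.
        for k in range(2):
            if np.any(labels == k):
                midpoints[k] = x[labels == k].mean(axis=0)
    # Step 3: the cluster with the larger total intensity is the foreground.
    fg, bg = (0, 1) if midpoints[0].sum() >= midpoints[1].sum() else (1, 0)
    (r_fg, g_fg), (r_bg, g_bg) = midpoints[fg], midpoints[bg]
    ratio = (r_fg - r_bg) / (g_fg - g_bg)
    return ratio, midpoints[fg], midpoints[bg]

# Toy spot: three bright and three dark pixels (made-up values).
spot = [[200, 100], [210, 110], [190, 90], [20, 10], [22, 12], [18, 8]]
ratio, m_fg, m_bg = px_kmeans(spot)
print(ratio)  # (200 - 20) / (100 - 10) = 2.0
```

Because the seeds are fixed rather than random, this variant returns the same result on every run, unlike the randomly initialized k-means described earlier.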

For an arbitrary dissimilarity matrix d, it tries to minimize the sum (over all objects) of the distances to the closest of k prototypes.

Objective function: $\sum_{i=1}^{n} \min_{j=1,\dots,k} d(x_i, m_j)$ (d: Manhattan, correlation, etc.)

BUILD phase: Select the initial medoids.

SWAP phase: Repeat until convergence: consider all pairs of objects (i, j), where i is a medoid and j is not, and make the swap of i and j (if any) that decreases the objective function most.

PxPAM: pixel clustering with PAM

0. Calculation of the dissimilarity matrix of the spot pixels: calculate the Manhattan distances between all pairs of pixels, $d_{ij} = d(x_i, x_j) = |R_i - R_j| + |G_i - G_j|$.
1. Construct two initial representatives (PAM BUILD phase): define m1 as the object with the smallest $\sum_{i=1}^{n} d(x_i, m_1)$ and m2 as the object that decreases the objective as much as possible.
2. Find a local optimum of the cluster problem (PAM SWAP phase).
3. Reduction: Rfg and Gfg are the values of the medoid pixel with the higher intensities; Rbg and Gbg are those of the other one. (Rfg-Rbg)/(Gfg-Gbg) is the final relative abundance estimate.

6. Background correction and normalization

The estimation of background intensity is generally considered necessary for the purpose of performing background correction. The motivation for background correction is that a spot's measured fluorescence intensity includes a contribution that is not specifically due to the hybridization of the mRNA samples to the spotted DNA. Background correction of the spot intensities is usually performed by subtracting the background estimate from the red and green foreground values, with the aim of improving accuracy, that is, reducing the bias. Spot quality scores may include measures of spot size or shape, or measures of background intensity relative to foreground intensity. In some cases background adjustment can substantially reduce the precision, that is, increase the variability, of low spot values. The ratio of the fluorescence intensities for each spot is indicative of the relative abundance of the corresponding DNA sequence in the two nucleic acid samples.
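Returning to the PxPAM procedure of Section 5, its steps 0 to 3 can be sketched as follows. This is an illustrative sketch assuming NumPy and at least two pixels, not a reference PAM implementation. The function name `px_pam` and the toy pixel values are assumptions, ties are broken by the lowest pixel index, and the case Gfg equal to Gbg is not handled.

```python
import numpy as np

def px_pam(pixels):
    """Two-medoid PAM on (R, G) pixel pairs with Manhattan dissimilarity.

    Returns (ratio, foreground medoid index, background medoid index).
    """
    x = np.asarray(pixels, dtype=float)
    n = len(x)
    # Step 0: Manhattan dissimilarity matrix d_ij = |R_i - R_j| + |G_i - G_j|.
    d = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)

    def cost(medoids):
        # Objective: sum over all pixels of the distance to the closest medoid.
        return d[:, medoids].min(axis=1).sum()

    # Step 1 (BUILD): m1 minimizes the total distance to all pixels;
    # m2 is the pixel whose addition lowers the objective the most.
    m1 = int(d.sum(axis=0).argmin())
    m2 = min((j for j in range(n) if j != m1), key=lambda j: cost([m1, j]))
    medoids = [m1, m2]

    # Step 2 (SWAP): repeatedly apply the best improving swap of a medoid
    # with a non-medoid; stop when no swap lowers the objective.
    while True:
        best_cost, best = cost(medoids), None
        for pos in range(2):
            for j in range(n):
                if j in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = j
                c = cost(trial)
                if c < best_cost:
                    best_cost, best = c, trial
        if best is None:
            break
        medoids = best

    # Step 3: the medoid pixel with the higher intensities is the foreground.
    fg, bg = sorted(medoids, key=lambda m: x[m].sum(), reverse=True)
    (r_fg, g_fg), (r_bg, g_bg) = x[fg], x[bg]
    return (r_fg - r_bg) / (g_fg - g_bg), fg, bg

# Toy spot: three bright and three dark pixels (made-up values).
spot = [[200, 100], [210, 110], [190, 90], [20, 10], [22, 12], [18, 8]]
pam_ratio, fg_idx, bg_idx = px_pam(spot)
print(pam_ratio, fg_idx, bg_idx)  # 2.0 0 3
```

Unlike the k-means midpoints, the two representatives are actual pixels of the spot, so the ratio is computed from observed intensity pairs rather than from averages.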
There are various types of noise that can affect the final signal produced by the scanner. These can be divided into two categories: source noise and detector noise. Examples of source noise are photon noise, dust on the slides and the treatment of the glass slides. Detector noise includes features of the amplification and digitization process. A perfect image should reflect only measures of the fluorescence intensities for the dye of interest. In practice, however, the input system is imperfect, and images are usually a combination of the desired signal and undesired signals.

7. Gene clustering

The need for gene clustering: Genes clustered together have the same pattern of expression, which indicates co-regulated genes. Co-regulated genes are functionally correlated and might be involved in the same metabolic pathways.

Steps in gene clustering: For each gene, look at x_i. (the expression of the i-th gene through m experiments). Measure the distances between the profiles x_i. of the genes. Clustering is based on similarity or distance metrics: Euclidean distance, vector angle and correlation coefficient. Clustering groups a set of genes or arrays into a tree. The branches of the tree represent the similarity of the genes to each other.

Methods/Algorithms:
Hierarchical clustering
   Single linkage: nearest distance
   Complete linkage: maximum distance
   Average linkage: average between all points in the clusters
K-means clustering

Hierarchical clustering

The clustering solution is represented by a dendrogram, which is a rooted weighted tree with leaves corresponding to the objects. An edge's length reflects the dissimilarity between that cluster and the remaining clusters.

Hierarchical clustering steps:
1. Filter data: remove genes having a lot of missing values or low quality spots.
2. Pre-process data: 1) log transformation, 2) mean/median centering, 3) normalization.
3. Create similarity metrics based on a distance measure.
4. Create distance matrices.

5. Scan the distance matrix and find the smallest distance (for single linkage).
6. Create a node/branch of a tree linking the two genes with the smallest distance. Set the length of the branch to the distance between the two genes.
7. Average the two values and replace the two genes with a new item.
8. Calculate the distance of the new item to the other genes.
9. Repeat the process n-1 times (n is the number of genes or arrays).

K-means clustering

Requires prior knowledge of k (the number of clusters).
Initialize the cluster centroids by one of: data centroid based search, evenly spaced profiles or randomly generated profiles.
Calculate the cluster centroids using a distance measure: Euclidean distance (the distance between two data points in N dimensions) or correlation.

K-means clustering steps:
1. Randomly assign the genes to K clusters.
2. Compute the mean vector of all genes in each cluster and use it as the center of the cluster.
3. Assign each gene to the cluster whose center is closest to the gene vector.
4. Repeat steps 2 and 3 until the maximum number of cycles or a steady state is reached.

8. Conclusion

DNA micro-array analysis allows comparisons to be made between the expression levels of certain genes across different tissues and pathological conditions. Micro-array technology can give us an understanding of the temporal and spatial patterns of expression of all the genes involved in the developmental processes of an organism. Micro-array analysis will improve our understanding of diagnosis and prognosis. Transcription profiling using DNA micro-arrays has great potential as a systematic approach for discovering new classes of tumors, for assigning known tumors to classes and for predicting response to therapy. Thus, it is perceived that gene expression monitoring could provide new insights into many aspects of tumor pathology, including cell of origin, stage, grade, clinical course and response to treatment.
Other applications include the identification of targets for drug development, diagnosis and prognosis, number detection, risk assessment and the study of mutation.

Clustering large and high dimensional data collections is a challenging task. The problem is more complex if, in addition to clustering, one is also interested in learning cluster dependent feature relevance weights. One possible way to alleviate this problem is to use partial supervision to guide the search process and narrow down the space of possible solutions. Recently, semi-supervised learning has emerged as a new research direction in machine learning that aims to improve the performance of unsupervised learning using some supervised information.
