How do microarrays work

1 Lecture 3 (continued)
Alvis Brazma, European Bioinformatics Institute

How do microarrays work?
[Figure: for each condition, mRNA -> cDNA -> hybridise to microarray.]

2 A microarray experiment
[Figure: sample, RNA extract, labelled nucleic acid, hybridisation, array design, array, genes, protocols, normalization, integration, gene expression data matrix.]

Steps in microarray data processing
[Figure: array scans, spot quantitations, genes; panels A to D.]

3 Microarray expression measurements in the cell cycle for over 400 periodic genes in yeast (Rustici et al., Nature Genetics, 2004)

The goal of data normalisation: the gene expression data matrix
Rows are genes (index i) and columns are samples (index j). X(i,j) is the amount of the RNA of the i-th gene in the j-th sample.

4 What are we actually measuring?
Fluorescence intensity depends on RNA abundance, probe efficiency, hybridisation conditions and error.
How do we know probe efficiency and hybridisation conditions?

5 Lecture 4: Expression profiles and their analysis
The goal of data normalisation is the gene expression data matrix, where X(i,j) is the amount of the RNA of the i-th gene in the j-th sample.

6 Gene expression profile
A gene expression profile: (x(i,1), x(i,2), ..., x(i,m)), a vector of real numbers (row i of the matrix, one value per sample).
A sample expression profile: column j of the matrix, one value per gene. Probe effects are large.

7 Gene expression profile
Find genes with similar expression profiles.

8 Gene expression profile
A gene expression profile: (x(i,1), x(i,2), ..., x(i,m)), a vector of real numbers.
[Figure 4.: points A, B and C plotted against two Condition axes.]

9 How to measure distance between two gene (or sample) expression profiles?
[Figure: log ratios, on an axis from -5 to 5, of two profiles A = (-1, 0, -2, 1, -4) and B = (0, 2, -1, 4, -5).]

10 Euclidean distance
A = (-1, 0, -2, 1, -4), B = (0, 2, -1, 4, -5)
Euclidean distance: add up the squares of all arrows (the component-wise differences) and take the square root: (1 + 4 + 1 + 9 + 1)^{1/2} = 4
$D_{Eucl}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
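As an illustration of the formula above, here is a minimal Python sketch. The function name `euclidean_distance` is hypothetical, and the example vectors are the profiles A and B as read from the slide, which is an assumption because the digits are partly lost in the source.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared component-wise differences."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Example profiles (assumed values, read from the slide).
A = [-1, 0, -2, 1, -4]
B = [0, 2, -1, 4, -5]

print(euclidean_distance(A, B))  # squared differences 1 + 4 + 1 + 9 + 1 = 16
```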

11 How to measure similarities in trends?
A = (-1, 0, -2, 1, -4), B = (0, 2, -1, 4, -5)
The absolute values are not very meaningful (remember that sequence effects are large), so the Euclidean distance may not be the best choice.
1. Center all vectors around 0.

12 Chord distance
Centered vectors: A = (0, 1, -1, 2, -3), B = (0, 2, -1, 4, -5)
2. Make the length of both equal to 1. Here |A| ≈ 4 and |B| ≈ 7.

13 The length of a vector
Given a vector A = (a_1, ..., a_k), we define its length |A| as
$|A| = \sqrt{a_1^2 + \dots + a_k^2}$
For A = (0, 1, -1, 2, -3) and B = (0, 2, -1, 4, -5) we have |A| ≈ 4 and |B| ≈ 7, so the scaled vectors are
A' = (0, 1/4, -1/4, 1/2, -3/4)
B' = (0, 2/7, -1/7, 4/7, -5/7)

14 Chord distance
A = (0, 1, -1, 2, -3), B = (0, 2, -1, 4, -5), with |A| ≈ 4 and |B| ≈ 7
A' = (0, 1/4, -1/4, 1/2, -3/4), B' = (0, 2/7, -1/7, 4/7, -5/7)
1. Center all vectors around 0.
2. Make the length of both equal to 1.
3. Calculate the Euclidean distance between the centered and scaled vectors (note that the chord distance in this case is small).
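The three steps of the chord distance can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and it centres on the exact mean and scales by the exact length, whereas the slide rounds both, so its numbers will differ slightly from the slide's.

```python
import math


def center(v):
    """Step 1: shift the vector so that its mean is 0."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def normalise(v):
    """Step 2: scale the vector to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    if length == 0:
        raise ValueError("a constant profile cannot be scaled to length 1")
    return [x / length for x in v]


def chord_distance(a, b):
    """Step 3: Euclidean distance between the centered, scaled vectors."""
    a1 = normalise(center(a))
    b1 = normalise(center(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, b1)))


# Profiles with the same trend are at distance ~0, opposite trends at distance 2.
print(chord_distance([1, 2, 3], [2, 4, 6]))
print(chord_distance([1, 2, 3], [3, 2, 1]))
```

Because of the centering and scaling, the result depends only on the shape of the two profiles, not on their absolute levels.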

15 [Figure: vectors A and B with the angle α between them, showing the Euclidean distance, the angle distance and the chord distance.]

Correlation distance
A' = (0, 1/4, -1/4, 1/2, -3/4), B' = (0, 2/7, -1/7, 4/7, -5/7)
Very similar to the chord distance: calculate the cosine of the angle between the two centered and scaled vectors:
cos(A', B') = 0*0 + (1/4)*(2/7) + (-1/4)*(-1/7) + (1/2)*(4/7) + (-3/4)*(-5/7) = 13/14
Cor_dist = 1 - cos(A', B') = 1/14

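16 Relationships between chord and correlation distances
Let A' = (a'_1, a'_2) and B' = (b'_1, b'_2) be the mean-centered vectors normalised to length 1, and α the angle between them. Then
$D_{chord}(A, B) = \sqrt{(a'_1 - b'_1)^2 + (a'_2 - b'_2)^2} = \sqrt{(a'^2_1 + a'^2_2) + (b'^2_1 + b'^2_2) - 2(a'_1 b'_1 + a'_2 b'_2)} = \sqrt{2(1 - \cos\alpha)}$
$D_{chord}(A, B) = 2\sin(\alpha/2)$, because $\sin(\alpha/2) = \sqrt{(1 - \cos\alpha)/2}$
Cor(A,B) = cos(A', B'), if A' and B' are the mean-centered, normalised (length 1) vectors.
[Figure: vectors A and B with the angle α between them, showing the chord distance and the Euclidean distance.]

Correlation and anticorrelation
1 - cos x: perfect correlation has distance 0, anticorrelation has the maximum distance.
With a sign-insensitive variant such as 1 - |cos x|, both perfect correlation and perfect anticorrelation have distance 0.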
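The relationship between the chord distance and the correlation can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, the two profiles are assumed example values, and the vectors are centered and scaled exactly rather than rounded.

```python
import math


def unit(v):
    """Mean-center a vector and scale it to length 1."""
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    length = math.sqrt(sum(x * x for x in centered))
    return [x / length for x in centered]


def chord_and_cosine(a, b):
    """Return (chord distance, cos of the angle) for two profiles."""
    a1, b1 = unit(a), unit(b)
    cos_alpha = sum(x * y for x, y in zip(a1, b1))
    chord = math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, b1)))
    return chord, cos_alpha


# Assumed example profiles.
chord, cos_alpha = chord_and_cosine([-1, 0, -2, 1, -4], [0, 2, -1, 4, -5])
alpha = math.acos(max(-1.0, min(1.0, cos_alpha)))  # clamp against rounding

print(chord ** 2, 2 * (1 - cos_alpha))   # the two values agree
print(chord, 2 * math.sin(alpha / 2))    # and so do these
print(1 - cos_alpha)                     # the correlation distance
```

Since the chord distance grows monotonically as cos α falls, the chord and correlation distances always rank pairs of profiles in the same order.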

17 Rank correlation
A' = (0, 1/4, -1/4, 1/2, -3/4), B' = (0, 2/7, -1/7, 4/7, -5/7)
Transform the values to ranks: A = (0, 1, -1, 2, -2), B = (0, 1, -1, 2, -2)
Compute the (correlation) distance between them (for that, first normalise them to length 1).

Advantages and disadvantages of rank correlation based distances
Advantage: does not depend on the precise values.
Disadvantage: the ranks do depend on the precise values in high-density areas. For example, when the expression values are very close to each other (closer than the error bars), their relative order (ranks) is very prone to error.
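A rank-based correlation distance could look like the sketch below. Two choices here are assumptions rather than something the slide specifies: ranks run from 1 to n (the slide's ranks are centered on 0, which gives the same correlation), and tied values share their average rank. All function names are hypothetical.

```python
import math


def ranks(v):
    """Ranks from 1 to n; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    result = [0.0] * len(v)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and v[order[end + 1]] == v[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def correlation_distance(a, b):
    """1 - cos of the angle between the mean-centered vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ca = [x - ma for x in a]
    cb = [y - mb for y in b]
    la = math.sqrt(sum(x * x for x in ca))
    lb = math.sqrt(sum(y * y for y in cb))
    if la == 0 or lb == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1 - sum(x * y for x, y in zip(ca, cb)) / (la * lb)


def rank_correlation_distance(a, b):
    """Correlation distance computed on ranks instead of raw values."""
    return correlation_distance(ranks(a), ranks(b))


A1 = [0, 1 / 4, -1 / 4, 1 / 2, -3 / 4]
B1 = [0, 2 / 7, -1 / 7, 4 / 7, -5 / 7]
print(ranks(A1), ranks(B1))               # identical rank vectors
print(rank_correlation_distance(A1, B1))  # approximately 0
```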

18 Distance measures
A distance measure D(A,B) is said to be a metric if it satisfies the following properties:
if A = B, then D(A,B) = 0, i.e., the distance of an object to itself is 0;
if A ≠ B, then D(A,B) ≥ 0, i.e., the distance is always nonnegative;
D(A,B) = D(B,A), i.e., it does not matter in which order we measure the distance;
D(A,B) + D(B,C) ≥ D(A,C), i.e., given three objects, the length of a direct path from the first to the third object cannot be greater than the length of the path through the second object.

Missing data points: why do they arise?
Bad quality spot, e.g. flagged as bad by the image analysis software (e.g., so-called half moon spots, empty circles, ...)
Very low intensity signal in one or both channels (the ratio may be 0 or infinity)
Inconsistency between replicates (on the same or different arrays)

19 Missing data points: why are they a nuisance?
How to compute the distance between vectors with missing data points? Ignore the dimension.
If many comparisons have to be made, missing dimensions may start to accumulate.

How to deal with them?
If replicates are available, they can be used
Replace missing values by 0
Replace by the row average value
K nearest neighbour imputation (KNN imputation)

KNN imputation
We are given a gene expression matrix M.
Let X = (X_1, X_2, ..., X_i, ..., X_n) be a vector in the matrix M with a missing value X_i at dimension i.
Find in the gene expression data matrix vectors X^1, X^2, ..., X^k such that they are the k closest vectors to X in M (in the sense of a chosen distance measure) among the vectors that do not have a missing value at dimension i.
Replace the missing value X_i with the mean (or median) of X^1_i, X^2_i, ..., X^k_i, i.e., the mean (median) of the values at dimension i of vectors X^1, X^2, ..., X^k.
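The KNN imputation procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: missing values are represented by `None`, the chosen distance measure is the Euclidean distance over the dimensions observed in both vectors, and the function names and the toy matrix are hypothetical.

```python
import math


def distance_ignoring_missing(x, y):
    """Euclidean distance over the dimensions observed in both vectors."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        return math.inf  # nothing to compare: treat as infinitely far
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))


def knn_impute(matrix, row, col, k):
    """Mean of column `col` over the k rows closest to `matrix[row]`,
    among the rows that have a value at `col`."""
    target = matrix[row]
    candidates = [v for r, v in enumerate(matrix)
                  if r != row and v[col] is not None]
    if not candidates:
        raise ValueError("no row has a value at this dimension")
    candidates.sort(key=lambda v: distance_ignoring_missing(target, v))
    neighbours = candidates[:k]
    return sum(v[col] for v in neighbours) / len(neighbours)


# Toy gene expression matrix: rows are genes, columns are samples.
M = [
    [1.0, 2.0, None],   # gene with a missing value in the third sample
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [9.0, 9.0, 100.0],  # a distant gene that should not be used
]
print(knn_impute(M, row=0, col=2, k=2))  # mean of 3.0 and 5.0
```

Using the median instead of the mean, as the slide also allows, would only change the last line of `knn_impute`.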

20 Gene expression profile with a missing data point
A gene expression profile: X = (X_1, ..., X_i, ..., X_n), a vector of real numbers, where X_i is a missing data point.

21 [Figure 4.: points A, B and C plotted against two Condition axes.]
Supervised vs. unsupervised analysis: class discovery vs. clustering

22 What is a cluster?
In a set of elements, clusters are subsets of elements that are in some sense closer to each other than average.
Closeness can be defined by a distance measure.
Distance by itself is not sufficient:
How to measure distance between more than 2 points?
Shape of the cluster?
Thresholds of closeness: which elements are in the same cluster, and which are not?

The definition of what is a cluster is difficult.
In practice it is defined by an algorithm that finds clusters.

23 Clustering algorithms: hierarchical vs flat
Hierarchical clustering builds a hierarchical tree (also called a dendrogram) showing the relationship among the elements.
Flat clustering partitions the set of elements into subsets (non-overlapping or overlapping).
[Figure: elements c1 to c5.]

Hierarchical clustering: how does it work?
[Figure: elements 1 to 5 joined step by step into a tree.]

24 Different linkages
Keep joining together the two closest clusters, measuring the distance between clusters by the:
Minimum distance => Single linkage
Maximum distance => Complete linkage
Average distance => Average linkage
Alternative: maintain a centroid in each cluster and use it for linking.
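The three linkages differ only in how the pairwise distances between two clusters are summarised. Here is a minimal sketch, assuming Euclidean distance between points and Python 3.8+ for `math.dist`; the function name and the toy clusters are hypothetical.

```python
import math


def linkage_distance(cluster1, cluster2, method):
    """Distance between two clusters under a given linkage."""
    pairwise = [math.dist(p, q) for p in cluster1 for q in cluster2]
    if method == "single":      # minimum distance
        return min(pairwise)
    if method == "complete":    # maximum distance
        return max(pairwise)
    if method == "average":     # average distance
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {method}")


c1 = [(0, 0), (1, 0)]
c2 = [(3, 0), (5, 0)]
for method in ("single", "complete", "average"):
    print(method, linkage_distance(c1, c2, method))
```

Hierarchical clustering then repeatedly merges the pair of clusters with the smallest linkage distance until a single tree remains.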

25 Centroid of a cluster
[Figure: points A = (2, 5), B = (4, 2) and C = (3, -4) in the x-y plane.]
X = (2 + 4 + 3)/3 = 3
Y = (5 + 2 - 4)/3 = 1

26 [Figure: the same points with their centroid G.]
G = (X, Y) = (3, 1)
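The centroid (gravity center) computation is a per-coordinate average. The sketch below is illustrative: the function name is hypothetical, and the three points are the values read from the slide, which is an assumption because some digits are lost in the source.

```python
def centroid(points):
    """Average of the points, computed coordinate by coordinate."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


# Assumed example points A, B and C.
print(centroid([(2, 5), (4, 2), (3, -4)]))  # (3.0, 1.0)
```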

27 K means clustering
1. Select K points (vectors) called centers in the space somehow (at random, or more intelligently so that they are far away from each other).
2. For each vector in the universe that you want to cluster, calculate the distance between it and all the K centers, and assign it to the center which is the closest. In this way K clusters are defined.
3. In each cluster define the new center as its gravity center.
4. Repeat steps 2-3 until the gravity centers do not move any more, or after some fixed number of steps.
[Figure: 1. Guess K centres. 2. Assign to clusters. 3. Move to gravity centres.]
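The four steps above can be sketched as follows. This is a minimal sketch, not a production implementation: it assumes Euclidean distance, picks the initial centers at random from the data points, keeps the old center when a cluster becomes empty, and uses hypothetical names throughout. It needs Python 3.8+ for `math.dist`.

```python
import math
import random


def kmeans(points, k, max_steps=100, seed=0):
    """Return (centers, clusters) after K means on a list of tuples."""
    if not 1 <= k <= len(points) or max_steps < 1:
        raise ValueError("need 1 <= k <= len(points) and max_steps >= 1")
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: guess K centers
    for _ in range(max_steps):
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 2: assign to closest
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = []
        for c in range(k):                       # step 3: gravity centers
            if clusters[c]:
                n = len(clusters[c])
                new_centers.append(tuple(sum(x) / n for x in zip(*clusters[c])))
            else:
                new_centers.append(centers[c])   # empty cluster: keep center
        if new_centers == centers:               # step 4: stop when stable
            break
        centers = new_centers
    return centers, clusters


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
print(sorted(centers))
print(clusters)
```

The result can depend on the initial guess, which is why the slide suggests choosing starting centers that are far apart.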

28 Other clustering methods
Kohonen's self organising maps
Self organising trees (Dopazo)
Probability distribution based clustering
Two way clustering
Fuzzy clustering

Cluster comparison

29 Clustering genes and samples
When does it make sense to cluster samples?

Ordination methods
Principal components analysis (PCA)

30 Principal Component Analysis (PCA)
Also known as Ordination or SVD (each version having a slightly different meaning).
Fairly nontrivial mathematical apparatus, but quite a simple idea.
[Table: rows are Gene 1 ... Gene n (or Measurement 1 ... Measurement n); columns are Condition 1, Condition 2, Condition 3 (or Temperature, Altitude, Latitude).]

31 [Figure: temperature plotted against altitude and (south) latitude, then against a single combined "alti-latitude" axis.]

32 [Figure: temperature against alti-latitude, with the first and second principal components drawn as new axes.]

PCA in a nutshell
The main idea: in the original n-dimensional space, find the direction of most data variability (i.e., the direction in which the data points are most stretched).
Orient a new coordinate axis in this direction. This will be the first principal component; the relative stretch is the first eigenvalue, and the direction is the first eigenvector.
Then find the direction of the next highest variability orthogonal to the first eigenvector: this is the second component.
And so on.
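To make the idea concrete, the first principal component can be found as the dominant eigenvector of the covariance matrix. The sketch below uses power iteration, which is a swapped-in technique chosen for brevity and not necessarily the method behind the slides; the function names are hypothetical. One caveat: the fixed starting vector must not be exactly orthogonal to the first eigenvector, otherwise the iteration cannot find it.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix of the columns of `data` (rows = points)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(data, iterations=1000):
    """Return (eigenvalue, eigenvector) of the direction of most variability."""
    cov = covariance_matrix(data)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d                  # starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break                                 # no variability along v
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return eigenvalue, v


# Points lying exactly on the line y = 2x: all variability is along (1, 2).
eigenvalue, direction = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
print(eigenvalue, direction)
```

The second component could be obtained the same way after removing the variability along the first direction from the data.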

33 First 5 eigenvalues (X) (Y) (Z)

34 Supervised vs unsupervised analysis

37 Classifiers: applications
Training on known data: find a classifier that can separate one experimental factor value from the other based only on data.
Apply to new data: this will tell us where the new sample belongs (e.g., diseased or normal, as in diagnostics).

38 K nearest neighbours classifier
[Figure: samples plotted on axes x1 and x2.]
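A K nearest neighbours classifier assigns a new sample the majority label among its k closest training samples. The slide only names the method, so the sketch below is an assumption-laden illustration: Euclidean distance, a simple majority vote, hypothetical function names and a made-up toy training set. It needs Python 3.8+ for `math.dist`.

```python
import math
from collections import Counter


def knn_classify(training, sample, k=3):
    """Majority label among the k training points closest to `sample`.

    `training` is a list of (point, label) pairs.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data on two axes (x1, x2).
training = [
    ((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"), ((0.9, 1.1), "normal"),
    ((5.0, 5.0), "diseased"), ((5.2, 4.9), "diseased"), ((4.8, 5.1), "diseased"),
]
print(knn_classify(training, (1.1, 1.0)))  # normal
print(knn_classify(training, (5.0, 5.1)))  # diseased
```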

39 Linear discriminants
[Figure: samples on axes x1 and x2 separated by the discrimination line x2 = a*x1 + b.]
