Statistical Genomics: Gene Expression Profiling
1 Statistical Genomics: Gene Expression Profiling Jung-Hsin Lin School of Pharmacy National Taiwan University r.mc.ntu.edu.tw/~jlin/statisticalgenomics.pdf ext 8404
2 Microarray Experiment Samples Further reading: Pharmacogenomics 3 (2002). Biopsy (OCT embedding), microtome, HE staining or immunofluorescence staining, RNA QC, LCM, RNA extraction, RNA QC. Culture cells or whole blood cells (lysed in Soln. D), RNA extraction, RNA QC. Total RNA, T7-based RNA amplification (1-2 working days), aRNA. Cy3/Cy5 direct labeling (1 working day) or Cy3/Cy5 post-labeling (1-2 working days), Cy3-/Cy5-labeled target cDNA. Hybridization: 17 hours. Scanning: 0.5 working day. Data analysis: 0.5 working days. Procedure I: 2-3 weeks. Procedures II & III: 1-2 weeks.
3 Workflow of a cDNA Microarray EST fragments arrayed in 96- or 384-well plates are spotted at high density onto a glass microscope slide. Subsequently, two different fluorescently labeled cDNA populations derived from independent mRNA samples are hybridized to the array. After washing, a laser scans the slide and the ratio of induced fluorescence of the two samples is calculated for each EST, which indicates the relative amount of transcript for that EST in the two samples.
4 Dye-Swap Experiment Array A Array B
5 Microarray Data Analysis Feature Extraction MIAME Data Visualization Data Clustering Analysis of Variance Expression Profile Comparison Pathway Identification
6 Feature Extraction Algorithms (Agilent) 1. FindSpots and SpotAnalysis algorithms 2. PolyOutlierFlagger algorithm 3. BGSub (Background subtraction) algorithm 4. DyeNorm algorithm 5. Ratio algorithm
7 1. FindSpots and SpotAnalyzer FindSpots algorithm: 1) Locate the corners 2) Locate the spots SpotAnalyzer algorithm: 1) Define features 2) Estimate the radius for the local background 3) Reject outliers 4) Calculate the mean signal of the feature 5) Calculate the mean signal of the local background 6) Determine if the feature is saturated
8 Feature definition with CookieCutter method Feature definition with WholeSpot method
9 2. PolyOutlier algorithm 1) Determine if the feature is a non-uniformity outlier 2) Determine if the feature is a population outlier
10 3. BGSubtractor algorithm 1) Calculate the feature background-subtracted signal: unadjusted background-subtracted signals, adjusted background-subtracted signals 2) Calculate the significance of feature intensity relative to background 3) Determine if the feature background-subtracted signal is well above the background
11 Choices for Background Subtraction Methods Local background (Default) Average of all background areas Average of negative control features Minimum signal (feature or background) Minimum signal (feature) on array
12 4. Dye Normalization algorithm 1) Determine normalization features a) Using all significant, non-control, and non-outlier features method b) Using list of normalization genes method c) Using Rank Consistency Filter method (Default) 2) Calculate the normalization factor 3) Calculate the dye normalized signal
13 Scatter Plot of Normalized Cy5/Cy3 Signals
14 5. Ratio algorithm 1) Calculate the surrogate value 2) Calculate the processed signal 3) Calculate the log ratio of the feature 4) Calculate the p-value and error on the log ratio of the feature
15 Log Ratio vs. Feature ID
16 Histogram of Log Ratio Values
17 ArrayMap Upload the array data and the image. Set up criteria for mapping genes on the array image. Display the array image and circle the features found on the image.
18 More ArrayMap Functions [3D surface plot of intensity vs. X and Y position for an uploaded data file]
19 Searching for Repeated Genes Upload files for across-array gene comparison. Find repeated genes and save the search results in a file.
20 MIAME MIAME (Minimum Information About a Microarray Experiment) is a standard that describes the minimal information needed to fully annotate a gene expression experiment; being MIAME-compliant has become a prerequisite for publishing microarray data. 1. Experimental design 2. Array design 3. Samples 4. Hybridization 5. Measurements 6. Normalization and controls
21 3D Data Visualization To provide an in-depth 3D scatter plot tool and interactive representations of highly complex data. Expression data values or analysis results can be placed on any of the 3 user-defined axes to create a powerful medium for array data presentation.
22 Tree View of Gene Classification To facilitate powerful, easy navigation within dendrograms. To display gene classifications and experiment parameters on gene and condition trees, respectively. To simplify examination of clinical and experimental data in the context of clustered expression patterns.
23 Visual Filtering To provide a simplified and visually intuitive user interface for filtering tools. Real-time generation of graphs of results.
24 Analysis of Variance (ANOVA) To reliably identify differentially expressed genes. To identify genes capable of discriminating between one or more experimental parameters or sample phenotypes. Groups of genes identified by expression profiling can be further characterized by performing sequence searches for potential regulatory elements.
25 Expression Profile Comparison To explore all of the experiments related to a single genome in the databases. To identify target expression patterns and quickly find similar expression profiles within the database of all normalized sample sets. To characterize the results of compound screening experiments and patient treatment studies.
26 Pathway Viewer To visually characterize genes and their expression patterns based on their location within a cellular pathway. Users can design their own pathway diagrams or directly import publicly available pathway maps. Users can predict genes associated with discrete steps in the pathway of interest.
27 Affymetrix Data Analysis Flow Chart Chip Description Files Hybridization + Scanning EXP File Image analysis DAT File + CDF File CEL File Processing 1. Background Correction 2. Normalization 3. PM Correction 4. Expression Index GCOS MAS CHP File Intensity value Absent/Present call RMA Text File Probe ID + Log2(Intensity) RPT File Report File, quality Excel File
28 Affymetrix Data Files: DAT Files. A *.DAT file is ~110 MBytes.
29 From DAT to CEL
30 Affymetrix Data Files: CDF Files
31 MAS 5.0 Analysis output file (*.CHP)
32 RNA Quality Assessment RNA Degradation Plot
33 Statistical Plots: Box Plot
34 Statistical Plots
35 Scatter Plot and MA plot
36 MAS 5.0 Expression Report File (.RPT)
37 MAS 5.0 Expression Report File (.RPT)
38 Options for Normalization Levels: PM & MM (MAS 5.1), PM-MM (dChip), PM only (RMA). Features: all, rank-invariant set, spike-in, housekeeping. Methods: Complete data (no reference chip, information from all arrays used): Quantile Normalization, MVA plot + Loess. Baseline (normalized using a reference chip): MAS 5.0, Li-Wong's model-based, Q-spline.
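Quantile normalization, one of the complete-data methods listed above, forces every array to share the same empirical distribution (the mean of the per-array sorted values). A minimal numpy sketch, with an illustrative toy matrix (function and variable names are not from the slides; ties are broken by order of appearance, a simplification):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (arrays) of X.
    Rows are probes, columns are arrays. Each value is replaced by the
    mean, over arrays, of the values holding the same rank."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value within its column
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)   # shared reference distribution
    return mean_quantiles[ranks]

X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
Xn = quantile_normalize(X)
# every column of Xn now contains the same set of values {2, 3, 14/3, 17/3}
```

After normalization, a quantile-quantile plot of any two arrays is exactly the diagonal, which is why this method needs no reference chip.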
39 Summarization Methods: MAS 5.0 1. Cell intensities are preprocessed for global background. 2. An ideal mismatch value is calculated and subtracted to adjust the PM intensity. 3. The adjusted PM intensities are log-transformed to stabilize the variance. 4. The biweight estimator is used to provide a robust mean of the resulting values; Signal is output as the antilog of the resulting value. 5. Finally, Signal is scaled using a trimmed mean. The probe value PV for every probe pair j in probe set i is PV_ij = log2(V_ij), with V_ij = max(PM_ij - IM_ij, δ), j = 1, ..., n_i, where n_i is the number of probe pairs in the probe set. Then SignalLogValue_i = T_bi(PV_i,1, ..., PV_i,n_i), where T_bi is the one-step Tukey biweight estimate. The scale factor is sf = target signal / TrimMean(2^SignalLogValue, 0.02, 0.98), the normalization factor is nf = TrimMean(2^SPV_baseline, 0.02, 0.98) / TrimMean(2^SPV_experiment, 0.02, 0.98), and ReportedValue(i) = nf · sf · 2^(SignalLogValue_i).
40 Summarization Methods: RMA Median Polish This is the summarization used in the RMA expression summary. A multichip linear model is fit to the data from each probe set; the median polish is an algorithm for fitting this model robustly. It should be noted that expression values calculated with this summary measure are on the log2 scale. For a probe set with probes i = 1, ..., I and data from arrays j = 1, ..., J, fit the model log2(PM_ij) = α_i + β_j + ε_ij, where α_i is a probe effect and β_j is the log2 expression value on array j.
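The median-polish fit itself is simple to sketch. Below is a minimal Tukey median polish of the additive model above, assuming a probes-by-arrays matrix of log2(PM) values (function and variable names are illustrative, not the Bioconductor implementation):

```python
import numpy as np

def median_polish(Y, n_iter=10):
    """Robustly fit Y[i, j] ~ row[i] + col[j] + resid[i, j] by
    alternately sweeping out row and column medians."""
    resid = np.asarray(Y, dtype=float).copy()
    row = np.zeros(resid.shape[0])   # probe effects (alpha_i, plus the grand effect)
    col = np.zeros(resid.shape[1])   # per-array effects (the expression estimates beta_j)
    for _ in range(n_iter):
        rmed = np.median(resid, axis=1)
        row += rmed
        resid -= rmed[:, None]
        cmed = np.median(resid, axis=0)
        col += cmed
        resid -= cmed[None, :]
    return row, col, resid

# Exactly additive toy data: 3 probes x 2 arrays
a = np.array([0.0, 1.0, 2.0])
b = np.array([5.0, 7.0])
Y = a[:, None] + b[None, :]
row, col, resid = median_polish(Y)
# on additive data the residuals vanish and Y == row[:,None] + col[None,:] + resid
```

The decomposition identity Y = row + col + resid is maintained by every sweep, so the fit is exact on additive data and robust (median-based) otherwise.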
41 Software Shareware/Freeware: Bioconductor (R. Gentleman); DNA-Chip Analyzer (dChip v1.3) (Li and Wong); RMAExpress: a simple standalone GUI program for Windows for computing the RMA expression measure. Commercial: Affymetrix Microarray Suite (MAS 5.1); Affymetrix GeneChip Operating Software (GCOS v1.2)
42 Gene Ontology GO-Annotator The GO collaborators are developing three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
43 GO-Annotated GeneChip Result
44 Reference
45 Random Variables Probability density function (pdf): f(x) = P(X = x). Cumulative distribution function (cdf): F(x) = P(X ≤ x).
46 The Normal (Gaussian) Distribution f(x) = (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)) The distribution is symmetric around the mean. Approximately 68% of the values lie within 2 standard deviations of the mean (μ ± 1σ), approximately 95% within 4 standard deviations (μ ± 2σ), and approximately 99.7% within 6 standard deviations (μ ± 3σ). The inflexion points of the curve occur at μ ± σ.
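The density and the coverage fractions above can be checked numerically with the error function; a small sketch (function names are illustrative):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian density f(x) = (1/(sigma*sqrt(2*pi))) * exp(-(x-mu)^2/(2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian cdf via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Fraction of values within k standard deviations on each side of the mean:
coverage = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
# coverage ≈ {1: 0.6827, 2: 0.9545, 3: 0.9973}
```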
47 Mean Population mean: μ = (1/N) Σ_{i=1..N} X_i, where N is the size of the population. Sample mean: X̄ = (1/n) Σ_{i=1..n} X_i, where n is the size of the sample, a subset of the population.
48 Mode, Median & Percentile Mode: the value that occurs most often in a data set. Median: the value situated in the middle of the ordered list of data. p-th percentile: the value that has p% of members below it and (100-p)% of members above it.
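These three location statistics are easy to compute with numpy and the standard library; a quick example on a toy data set (the data values are illustrative):

```python
import numpy as np
from collections import Counter

data = np.array([2, 3, 3, 5, 7, 8, 8, 8, 10])

mode = Counter(data.tolist()).most_common(1)[0][0]  # value occurring most often -> 8
median = np.median(data)                            # middle of the ordered list -> 7.0
p25 = np.percentile(data, 25)                       # 25% of members lie below -> 3.0
```

Note that numpy's default percentile uses linear interpolation between order statistics, so for data without a value exactly at the requested rank the result may fall between two observations.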
49 Range and Variance Range: X_max - X_min. Population variance: σ² = Σ_{i=1..N} (X_i - μ)² / N. Sample variance: s² = Σ_{i=1..n} (X_i - X̄)² / (n-1).
50 Z Value Z = (X - μ) / σ Any normal distribution can be mapped to the standard normal distribution, which has a mean of zero and a standard deviation of one: μ_Z = 0, σ_Z = 1.
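The standardization can be verified empirically: after the Z transform, any sample has mean 0 and standard deviation 1. A short sketch (the location 10 and scale 2 are arbitrary illustrative values):

```python
import numpy as np

def z_score(x, mu, sigma):
    """Map values from N(mu, sigma) onto the standard normal N(0, 1)."""
    return (x - mu) / sigma

rng = np.random.default_rng(0)
samples = rng.normal(loc=10.0, scale=2.0, size=100_000)
z = z_score(samples, samples.mean(), samples.std())
# z now has mean 0 and standard deviation 1 (exactly, since we standardize
# by the sample's own mean and standard deviation)
```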
51 p-Value p = P(Z > x) = 1 - P(Z ≤ x) The p-value provides information about the amount of trust we can place in a decision made using a given threshold: it is the probability of making an error if we choose x as the threshold.
52 Student's t-test for significantly different means t measures the significance that two distributions have different means: t = (X̄_A - X̄_B) / s_D, where s_D = sqrt( [ (Σ_{i∈A}(x_i - X̄_A)² + Σ_{j∈B}(x_j - X̄_B)²) / (N_A + N_B - 2) ] · (1/N_A + 1/N_B) ). The significance is evaluated from the incomplete beta function, A(t|ν) = 1 - I_{ν/(ν+t²)}(ν/2, 1/2), with ν = N_A + N_B - 2 degrees of freedom, where I_x(a,b) = B_x(a,b)/B(a,b) and B_x(a,b) = ∫_0^x s^(a-1) (1-s)^(b-1) ds.
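The pooled-variance t statistic above can be sketched directly in numpy; the function name and toy samples are illustrative:

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic:
    t = (mean_A - mean_B) / s_D, with nu = N_A + N_B - 2 degrees of freedom."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_ss = ((a - a.mean()) ** 2).sum() + ((b - b.mean()) ** 2).sum()
    s_d = np.sqrt(pooled_ss / (na + nb - 2) * (1.0 / na + 1.0 / nb))
    return (a.mean() - b.mean()) / s_d

t = two_sample_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
# means differ by -1, s_D = sqrt(2/3), so t = -1.2247...
```

In a microarray setting this statistic is computed per gene across replicate arrays, with the incomplete beta function (or a t-distribution table) converting t and ν into a significance.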
53 F-test for significantly different variances F = var(A)/var(B) is tested at the significance level at which the hypothesis that distribution 1 has smaller variance than distribution 2 can be rejected. A small numerical value implies a very significant rejection, in turn implying high confidence in the hypothesis that 1 has variance greater than or equal to 2. The significance is Q(F|ν1, ν2) = I_{ν2/(ν2+ν1·F)}(ν2/2, ν1/2), with ν1 = N_A - 1 and ν2 = N_B - 1, where I_x(a,b) is the incomplete beta function.
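The F statistic itself is just the ratio of the two sample variances; a minimal sketch with illustrative toy samples:

```python
import numpy as np

def f_statistic(a, b):
    """F = var(A) / var(B) with sample variances (ddof=1);
    degrees of freedom are nu1 = N_A - 1 and nu2 = N_B - 1."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return a.var(ddof=1) / b.var(ddof=1), len(a) - 1, len(b) - 1

F, nu1, nu2 = f_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 2.5, 3.0])
# var(A) = 5/3, var(B) = 1/4, so F = 20/3 with nu1 = 3, nu2 = 2
```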
54 Corrected Sum of Squares and Standard Deviation Corrected sum of squares (CSS): Σ_{i=1..n} (X_i - X̄)². Population standard deviation: σ = sqrt( Σ_{i=1..N} (X_i - μ)² / N ). Sample standard deviation: s = sqrt( Σ_{i=1..n} (X_i - X̄)² / (n-1) ).
55 Covariance and Correlation Covariance: Cov(X,Y) = Σ_{i=1..n} (X_i - X̄)(Y_i - Ȳ) / (n-1). (Pearson) correlation coefficient: ρ_XY = Cov(X,Y) / (s_X s_Y) = Σ_i (X_i - X̄)(Y_i - Ȳ) / sqrt( Σ_i (X_i - X̄)² · Σ_i (Y_i - Ȳ)² ).
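The two definitions are one line apart in code; a minimal sketch (the toy vectors are illustrative):

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation: Cov(X, Y) / (s_X * s_Y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

r = pearson_corr([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly linear -> 1.0
```

Pearson correlation across arrays is also the usual similarity measure between two gene expression profiles (e.g. in Eisen's hierarchical clustering below).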
56 Covariance Matrix and Correlation Matrix Covariance matrix: Σ = [σ_ij], with rows (σ_11, σ_12, σ_13, ...), (σ_21, σ_22, σ_23, ...), ..., (σ_k1, σ_k2, σ_k3, ..., σ_kk). (Pearson) correlation matrix: ρ_ij = σ_ij / (σ_i σ_j).
57 Principal Component Analysis Covariance matrix: C_ij = ⟨q_i(t) q_j(t)⟩_t. Diagonalization: C T = T Λ, with Λ = diag(λ_1, λ_2, λ_3, ..., λ_f), λ_1 ≥ λ_2 ≥ ... ≥ λ_f, and T = [v_1, v_2, v_3, ..., v_f], so that C v_i = λ_i v_i.
58 Why Principal Component Analysis? There are usually too many degrees of freedom in a complex system. The importance of variables is evaluated statistically. Dimensionality (or the number of degrees of freedom) can be significantly reduced by looking only at the most important new coordinates. The first principal component is the normalized linear combination with maximum variance. The principal components are the characteristic vectors (eigenvectors) of the covariance matrix.
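The eigenvector characterization above translates directly into code: diagonalize the covariance matrix and sort by eigenvalue. A minimal sketch on synthetic data whose variance is concentrated along one known direction (all names and the toy direction (3, 1, 0) are illustrative):

```python
import numpy as np

def pca(X):
    """Principal components = eigenvectors of the covariance matrix,
    sorted by decreasing eigenvalue (variance along each component)."""
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# 500 points in 3-D, almost all variance along the direction (3, 1, 0)
rng = np.random.default_rng(0)
t = rng.normal(size=(500, 1))
X = t @ np.array([[3.0, 1.0, 0.0]]) + 0.05 * rng.normal(size=(500, 3))
lam, V = pca(X)
# lam[0] dominates, and V[:, 0] points (up to sign) along (3, 1, 0)/sqrt(10)
```

Keeping only the first few columns of V is exactly the dimensionality reduction the slide describes.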
59 Clustering Methods: Agglomerative Method (Bottom-Up): proceeds by a series of fusions of the n objects into groups. More commonly used. Divisive Method (Top-Down): proceeds by a series of partitions into finer groups.
60 Anything can be clustered Real Fake
61 Dendrogram A two-dimensional diagram which illustrates the fusions or divisions made at each successive stage of the clustering process.
62 Agglomerative methods According to different ways of defining distances (similarities), we have some popular agglomerative methods as follows: Single linkage clustering Complete linkage clustering Average linkage clustering Average group linkage Ward's hierarchical clustering method
63 Single linkage clustering Also known as the nearest neighbor technique. Defining feature: the distance between groups, D(r,s), is defined as the distance between the closest pair of objects, where only pairs consisting of one object from each group are considered. D(r, s) = min{ d(i, j) : i ∈ r, j ∈ s } The distance between every possible object pair (i, j) is computed, where object i is in cluster r and object j is in cluster s. The minimum value of these distances is said to be the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the shortest link between the clusters. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.
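One agglomeration step of single linkage can be sketched directly from the definition: compute D(r, s) as the shortest link between every pair of clusters and merge the closest pair. The function and toy points below are illustrative (a brute-force sketch, not an efficient implementation):

```python
import numpy as np

def single_linkage_step(points, clusters):
    """Merge the two clusters whose closest pair of points is nearest:
    D(r, s) = min{ d(i, j) : i in r, j in s }."""
    best = (np.inf, None, None)
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(np.linalg.norm(points[i] - points[j])
                    for i in clusters[a] for j in clusters[b])
            if d < best[0]:
                best = (d, a, b)
    d, a, b = best
    merged = clusters[a] + clusters[b]
    rest = [c for k, c in enumerate(clusters) if k not in (a, b)]
    return rest + [merged], d

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
clusters = [[0], [1], [2]]
clusters, d = single_linkage_step(points, clusters)
# the nearest pair is points 0 and 1 (distance 1.0), so they merge first
```

Repeating the step n-1 times yields the full dendrogram; recording d at each merge gives the heights of the fusions.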
64 Complete linkage clustering Also known as the farthest neighbor technique, the opposite of single linkage. Defining feature: the distance between groups, D(r,s), is defined as the distance between the most distant pair of objects, where only pairs consisting of one object from each group are considered. D(r, s) = max{ d(i, j) : i ∈ r, j ∈ s } The distance between every possible object pair (i, j) is computed, where object i is in cluster r and object j is in cluster s. The maximum value of these distances is said to be the distance between clusters r and s. In other words, the distance between two clusters is given by the value of the longest link between the clusters. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.
65 Average linkage clustering Defining feature: the distance between groups, D(r,s), is defined as the average distance between all pairs of objects, where each pair is made up of one object from each group. D(r, s) = T_rs / (N_r N_s), where T_rs is the sum of all pairwise distances between cluster r and cluster s, and N_r and N_s are the sizes of clusters r and s respectively. At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged.
66 Average group linkage Defining feature: groups once formed are represented by the mean values of each variable, i.e., their mean vector, and the intergroup distance, D(r,s), is defined in terms of the distance between two such mean vectors. The two clusters r and s are merged such that, after the merger, the average pairwise distance within the newly formed cluster is minimum. Suppose we label the new cluster formed by merging clusters r and s as t. Then D(r,s), the distance between clusters r and s, is computed as D(r, s) = average{ d(i, j) : i, j ∈ t, t = r ∪ s } At each stage of hierarchical clustering, the clusters r and s for which D(r, s) is minimum are merged. In this case, the two clusters are merged such that the newly formed cluster, on average, has minimum pairwise distances between the points in it.
67 k-means clustering The k-means clustering algorithm partitions (or clusters) N data points into K disjoint subsets S_j, each containing N_j data points, so as to minimize the sum-of-squares criterion H = Σ_{j=1..K} Σ_{n∈S_j} |x_n - μ_j|², where x_n is a vector representing the n-th data point and μ_j is the geometric centroid of the data points in S_j. Algorithm: 1. The data points are assigned at random to the K sets. 2. The centroid is computed for each set, and H is evaluated. Repeat these two steps until H has reached its minimum.
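The alternating assignment/centroid iteration can be sketched in a few lines of numpy; the function name, toy points, and the fixed iteration count are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimize H = sum_j sum_{n in S_j} ||x_n - mu_j||^2 by alternating
    assignment of each point to its nearest centroid and recomputation
    of each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # fancy indexing copies
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated pairs of points -> two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centroids = kmeans(X, 2)
```

Because H only decreases at each step, the iteration converges; in practice one restarts from several random assignments and keeps the partition with the smallest H.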
68 Profile Plots associated with k-means clustering
69 Self-organizing map (SOM) clustering The Self-Organizing Map (SOM) was introduced by Teuvo Kohonen in 1982. The SOM (also known as the Kohonen feature map) algorithm is one of the best known artificial neural network algorithms. In contrast to many other neural networks using supervised learning, the SOM is based on unsupervised learning. Self-adaptive topological maps were initially inspired by modelling perception systems found in the mammalian brain. A perception system involves the reception of external signals and their processing inside the nervous system. Complex mammalian skills, such as seeing and hearing, seemed to bear similarity to each other in the way they worked: the primary characteristic of these systems is that neighbouring neurons encode input signals which are similar to each other. The SOM is quite a unique kind of neural network in the sense that it constructs a topology-preserving mapping from the high-dimensional space onto map units in such a way that relative distances between data points are preserved. The map units, or neurons, usually form a two-dimensional regular lattice where the location of a map unit carries semantic information. The SOM can thus serve as a clustering tool for high-dimensional data. Because of its typical two-dimensional shape, it is also easy to visualize. Another important feature of the SOM is its capability to generalize: in other words, it can interpolate between previously encountered inputs.
70 Outline of the SOM algorithm The SOM defines a mapping from the high-dimensional input data space onto a regular two-dimensional array of neurons. Every neuron i of the map is associated with an n-dimensional reference vector m_i = [m_i1, ..., m_in]^T, where n denotes the dimension of the input vectors. The reference vectors together form a codebook. The neurons of the map are connected to adjacent neurons by a neighbourhood relation, which dictates the topology, or the structure, of the map. The most common topologies in use are rectangular and hexagonal. Adjacent neurons belong to the neighbourhood N_i of the neuron i. In the basic SOM algorithm, the topology and the number of neurons remain fixed from the beginning. The number of neurons determines the granularity of the mapping, which has an effect on the accuracy and generalization of the SOM. During the training phase, the SOM forms an elastic net that folds onto the "cloud" formed by the input data. The algorithm controls the net so that it strives to approximate the density of the data. The reference vectors in the codebook drift to the areas where the density of the input data is high. Eventually, only a few codebook vectors lie in areas where the input data is sparse.
71 The SOM learning process 1. One sample vector x is randomly drawn from the input data set and its similarity (distance) to the codebook vectors is computed by using e.g. the common Euclidean distance measure: ||x - m_c|| = min_i { ||x - m_i|| }. 2. After the Best Matching Unit (BMU) c has been found, the codebook vectors are updated. The BMU itself as well as its topological neighbours are moved closer to the input vector in the input space, i.e. the input vector attracts them. The magnitude of the attraction is governed by the learning rate. As the learning proceeds and new input vectors are given to the map, the learning rate gradually decreases to zero according to the specified learning rate function type. Along with the learning rate, the neighbourhood radius decreases as well. The update rule for the reference vector of unit i is: m_i(t+1) = m_i(t) + α(t)[x(t) - m_i(t)] if i ∈ N_c(t), and m_i(t+1) = m_i(t) otherwise. 3. Steps 1 and 2 together constitute a single training step, and they are repeated until the training ends. The number of training steps must be fixed prior to training the SOM because the rate of convergence in the neighbourhood function and the learning rate is calculated accordingly.
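A single training step (BMU search plus the neighbourhood update rule above) can be sketched as follows, assuming a hard neighbourhood of fixed radius on the map grid; the 2x2 map, its values, and the function name are illustrative toy choices:

```python
import numpy as np

def som_update(codebook, x, alpha, radius, grid):
    """One SOM training step: find the best matching unit (BMU) c,
    then move the BMU and its grid neighbours N_c toward the input x:
    m_i <- m_i + alpha * (x - m_i) for i in N_c."""
    bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))
    for i in range(len(codebook)):
        if np.linalg.norm(grid[i] - grid[bmu]) <= radius:  # i is in the neighbourhood N_c
            codebook[i] += alpha * (x - codebook[i])
    return bmu

# A 2x2 map of 2-D reference vectors; grid holds each unit's map coordinates
grid = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
bmu = som_update(codebook, x=np.array([0.2, 0.1]), alpha=0.5, radius=1.0, grid=grid)
# unit 0 is the BMU; units 0, 1, 2 (grid distance <= 1) move toward x, unit 3 stays put
```

In full training, alpha and radius both decay over the training steps, which is what lets the elastic net first unfold globally and then fine-tune locally.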
72 Dimensionality reduction by SOM In this example, the SOM projects an input space with four dimensions onto a feature map with only 2 dimensions.
73 Gene clustering using SOM Using a SOM, two genes that are plotted next to each other are necessarily similar according to the chosen distance metric. 1D feature map.
74 Eisen's Hierarchical Clustering Method The similarity score between genes i and j is a form of correlation: c_ij = (1/N) Σ_t [ (E_i(t) - Ē_i)/σ_i ] · [ (E_j(t) - Ē_j)/σ_j ], where Ē_i = (1/N) Σ_t E_i(t) and σ_i² = (1/N) Σ_t (E_i(t) - Ē_i)². 1. Construct a matrix of similarity measures between all pairs of genes. 2. Recursively cluster the genes into a tree-like hierarchy: a) Find the pair of clusters with the highest correlation and combine the pair into a single cluster. b) Update the correlation matrix using the average values of the newly combined cluster. c) Repeat steps (a) and (b) N-1 times until all genes have been clustered. 3. Determine the boundaries between individual genes. Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998)
75 Protein-Protein Interaction Network (Biocarta)
76 Pathway Viewer To visually characterize genes and their expression patterns based on their location within a cellular pathway. Users can design their own pathway diagrams or directly import publicly available pathway maps. Users can predict genes associated with discrete steps in the pathway of interest.
77
78
79
80
81
82
83
84 Modeling Metabolic Insulin Signaling Pathways Am J Physiol Endocrinol Metab 283: E1084-E1101, 2002
85 Definition of Variables x1: Insulin input. x2: Concentration of unbound surface insulin receptors. x3: Concentration of unphosphorylated once-bound surface receptors. x4: Concentration of phosphorylated twice-bound surface receptors. x5: Concentration of phosphorylated once-bound surface receptors. x6: Concentration of unbound unphosphorylated intracellular receptors. x7: Concentration of phosphorylated twice-bound intracellular receptors. x8: Concentration of phosphorylated once-bound intracellular receptors. x9: Concentration of unphosphorylated IRS-1. x10: Concentration of tyrosine-phosphorylated IRS-1. x11: Concentration of unactivated PI 3-kinase. x12: Concentration of tyrosine-phosphorylated IRS-1/activated PI 3-kinase complex. x13: Percentage of PI(3,4,5)P3 out of the total lipid population. x14: Percentage of PI(4,5)P2 out of the total lipid population. x15: Percentage of PI(3,4)P2 out of the total lipid population. x16: Percentage of unactivated Akt. x17: Percentage of activated Akt. x18: Percentage of unactivated PKC-ζ. x19: Percentage of activated PKC-ζ. x20: Percentage of intracellular GLUT4. x21: Percentage of cell-surface GLUT4.
86 Mathematical Formulation [System of coupled ordinary differential equations dx_i/dt for the 21 state variables above, describing insulin-receptor binding, phosphorylation, and internalization kinetics regulated by [PTP], lipid turnover regulated by [PTEN] and [SHIP], and downstream activation of Akt, PKC-ζ, and GLUT4 translocation; see the cited paper for the full equations.]
More informationAnalyzing ICAT Data. Analyzing ICAT Data
Analyzing ICAT Data Gary Van Domselaar University of Alberta Analyzing ICAT Data ICAT: Isotope Coded Affinity Tag Introduced in 1999 by Ruedi Aebersold as a method for quantitative analysis of complex
More informationMicroarray data analysis
Microarray data analysis Computational Biology IST Technical University of Lisbon Ana Teresa Freitas 016/017 Microarrays Rows represent genes Columns represent samples Many problems may be solved using
More informationMicro-array Image Analysis using Clustering Methods
Micro-array Image Analysis using Clustering Methods Mrs Rekha A Kulkarni PICT PUNE kulkarni_rekha@hotmail.com Abstract Micro-array imaging is an emerging technology and several experimental procedures
More informationMixture models and clustering
1 Lecture topics: Miture models and clustering, k-means Distance and clustering Miture models and clustering We have so far used miture models as fleible ays of constructing probability models for prediction
More informationA novel firing rule for training Kohonen selforganising
A novel firing rule for training Kohonen selforganising maps D. T. Pham & A. B. Chan Manufacturing Engineering Centre, School of Engineering, University of Wales Cardiff, P.O. Box 688, Queen's Buildings,
More informationUnsupervised Learning
Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 15: Microarray clustering http://compbio.pbworks.com/f/wood2.gif Some slides were adapted from Dr. Shaojie Zhang (University of Central Florida) Microarray
More informationDS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand
More informationData Mining and Analytics. Introduction
Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data
More informationCluster Analysis: Agglomerate Hierarchical Clustering
Cluster Analysis: Agglomerate Hierarchical Clustering Yonghee Lee Department of Statistics, The University of Seoul Oct 29, 2015 Contents 1 Cluster Analysis Introduction Distance matrix Agglomerative Hierarchical
More informationAdvanced visualization techniques for Self-Organizing Maps with graph-based methods
Advanced visualization techniques for Self-Organizing Maps with graph-based methods Georg Pölzlbauer 1, Andreas Rauber 1, and Michael Dittenbach 2 1 Department of Software Technology Vienna University
More informationWorkload Characterization Techniques
Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationClassification. Vladimir Curic. Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University
Classification Vladimir Curic Centre for Image Analysis Swedish University of Agricultural Sciences Uppsala University Outline An overview on classification Basics of classification How to choose appropriate
More informationPreprocessing -- examples in microarrays
Preprocessing -- examples in microarrays I: cdna arrays Image processing Addressing (gridding) Segmentation (classify a pixel as foreground or background) Intensity extraction (summary statistic) Normalization
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationLecture Topic Projects
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, basic tasks, data types 3 Introduction to D3, basic vis techniques for non-spatial data Project #1 out 4 Data
More informationHow and what do we see? Segmentation and Grouping. Fundamental Problems. Polyhedral objects. Reducing the combinatorics of pose estimation
Segmentation and Grouping Fundamental Problems ' Focus of attention, or grouping ' What subsets of piels do we consider as possible objects? ' All connected subsets? ' Representation ' How do we model
More informationClustering. Supervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationSegmentation Computer Vision Spring 2018, Lecture 27
Segmentation http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 218, Lecture 27 Course announcements Homework 7 is due on Sunday 6 th. - Any questions about homework 7? - How many of you have
More informationUnsupervised Learning
Unsupervised Learning Learning without Class Labels (or correct outputs) Density Estimation Learn P(X) given training data for X Clustering Partition data into clusters Dimensionality Reduction Discover
More informationContents. ! Data sets. ! Distance and similarity metrics. ! K-means clustering. ! Hierarchical clustering. ! Evaluation of clustering results
Statistical Analysis of Microarray Data Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation of clustering results Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be
More informationHierarchical Clustering 4/5/17
Hierarchical Clustering 4/5/17 Hypothesis Space Continuous inputs Output is a binary tree with data points as leaves. Useful for explaining the training data. Not useful for making new predictions. Direction
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationIncorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data
Incorporating Known Pathways into Gene Clustering Algorithms for Genetic Expression Data Ryan Atallah, John Ryan, David Aeschlimann December 14, 2013 Abstract In this project, we study the problem of classifying
More informationClustering Jacques van Helden
Statistical Analysis of Microarray Data Clustering Jacques van Helden Jacques.van.Helden@ulb.ac.be Contents Data sets Distance and similarity metrics K-means clustering Hierarchical clustering Evaluation
More informationClustering analysis of gene expression data
Clustering analysis of gene expression data Chapter 11 in Jonathan Pevsner, Bioinformatics and Functional Genomics, 3 rd edition (Chapter 9 in 2 nd edition) Human T cell expression data The matrix contains
More informationFunction approximation using RBF network. 10 basis functions and 25 data points.
1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data
More information3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data
3. Multidimensional Information Visualization II Concepts for visualizing univariate to hypervariate data Vorlesung Informationsvisualisierung Prof. Dr. Andreas Butz, WS 2009/10 Konzept und Basis für n:
More informationAcquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.
Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting
More informationEECS 730 Introduction to Bioinformatics Microarray. Luke Huan Electrical Engineering and Computer Science
EECS 730 Introduction to Bioinformatics Microarray Luke Huan Electrical Engineering and Computer Science http://people.eecs.ku.edu/~jhuan/ GeneChip 2011/11/29 EECS 730 2 Hybridization to the Chip 2011/11/29
More informationECG782: Multidimensional Digital Signal Processing
ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting
More informationComparisons and validation of statistical clustering techniques for microarray gene expression data. Outline. Microarrays.
Comparisons and validation of statistical clustering techniques for microarray gene expression data Susmita Datta and Somnath Datta Presented by: Jenni Dietrich Assisted by: Jeffrey Kidd and Kristin Wheeler
More informationKey properties of local features
Key properties of local features Locality, robust against occlusions Must be highly distinctive, a good feature should allow for correct object identification with low probability of mismatch Easy to etract
More informationMetabolomic Data Analysis with MetaboAnalyst
Metabolomic Data Analysis with MetaboAnalyst User ID: guest6522519400069885256 April 14, 2009 1 Data Processing and Normalization 1.1 Reading and Processing the Raw Data MetaboAnalyst accepts a variety
More informationGene expression & Clustering (Chapter 10)
Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching
More informationWorking with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan
Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using
More informationMultivariate Analysis
Multivariate Analysis Cluster Analysis Prof. Dr. Anselmo E de Oliveira anselmo.quimica.ufg.br anselmo.disciplinas@gmail.com Unsupervised Learning Cluster Analysis Natural grouping Patterns in the data
More informationA Dendrogram. Bioinformatics (Lec 17)
A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and
More informationData Warehousing and Machine Learning
Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring 2008 1 / 35 Preprocessing Before you can start on the actual
More informationMICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS
Mathematical and Computational Applications, Vol. 5, No. 2, pp. 240-247, 200. Association for Scientific Research MICROARRAY IMAGE SEGMENTATION USING CLUSTERING METHODS Volkan Uslan and Đhsan Ömür Bucak
More informationLesson 3. Prof. Enza Messina
Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical
More informationPreprocessing DWML, /33
Preprocessing DWML, 2007 1/33 Preprocessing Before you can start on the actual data mining, the data may require some preprocessing: Attributes may be redundant. Values may be missing. The data contains
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationCluster Analysis. Ying Shen, SSE, Tongji University
Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group
More informationGene Expression an Overview of Problems & Solutions: 1&2. Utah State University Bioinformatics: Problems and Solutions Summer 2006
Gene Expression an Overview of Problems & Solutions: 1&2 Utah State University Bioinformatics: Problems and Solutions Summer 2006 Review DNA mrna Proteins action! mrna transcript abundance ~ expression
More informationIntroduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering
Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical
More informationData Preprocessing. S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha
Data Preprocessing S1 Teknik Informatika Fakultas Teknologi Informasi Universitas Kristen Maranatha 1 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationECS 234: Data Analysis: Clustering ECS 234
: Data Analysis: Clustering What is Clustering? Given n objects, assign them to groups (clusters) based on their similarity Unsupervised Machine Learning Class Discovery Difficult, and maybe ill-posed
More informationCHAPTER 4: CLUSTER ANALYSIS
CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis
More informationCluster Analysis. Angela Montanari and Laura Anderlucci
Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a
More informationEE795: Computer Vision and Intelligent Systems
EE795: Computer Vision and Intelligent Systems Spring 2012 TTh 17:30-18:45 WRI C225 Lecture 04 130131 http://www.ee.unlv.edu/~b1morris/ecg795/ 2 Outline Review Histogram Equalization Image Filtering Linear
More informationDATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm
DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)
More informationDiscussion: Clustering Random Curves Under Spatial Dependence
Discussion: Clustering Random Curves Under Spatial Dependence Gareth M. James, Wenguang Sun and Xinghao Qiao Abstract We discuss the advantages and disadvantages of a functional approach to clustering
More informationClustering algorithms and autoencoders for anomaly detection
Clustering algorithms and autoencoders for anomaly detection Alessia Saggio Lunch Seminars and Journal Clubs Université catholique de Louvain, Belgium 3rd March 2017 a Outline Introduction Clustering algorithms
More informationOlmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.
Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)
More informationSelf-Organizing Maps for cyclic and unbounded graphs
Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Features and Feature Selection Hamid R. Rabiee Jafar Muhammadi Spring 2013 http://ce.sharif.edu/courses/91-92/2/ce725-1/ Agenda Features and Patterns The Curse of Size and
More informationThe Traveling Salesman
Neural Network Approach To Solving The Traveling Salesman Problem The Traveling Salesman The shortest route for a salesman to visit every city, without stopping at the same city twice. 1 Random Methods
More informationSupervised vs unsupervised clustering
Classification Supervised vs unsupervised clustering Cluster analysis: Classes are not known a- priori. Classification: Classes are defined a-priori Sometimes called supervised clustering Extract useful
More informationELEC Dr Reji Mathew Electrical Engineering UNSW
ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion
More information