Gene Expression Data Analysis. Qin Ma, Ph.D. December 10, 2017


2 Bioinformatics / Systems biology
"This interdisciplinary science is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes." Temple Smith, Current Topics in Computational Molecular Biology (2002)
Omics data: Genomics, Transcriptomics, Metabolomics, Metagenomics, Epigenomics, Proteomics, Interactomics


3 Characteristics of Biological Big Data
36.8 million transactions per day on Amazon
Biomedical data (behavioral outcomes in observational study)
Big Small Data vs. Small Big Data
Next Generation Sequencing Data

4 The Hierarchical Structure of Computational Techniques
Models, Algorithms, Programs, Tools, Software

5 DNA → RNA → Protein: Central Dogma
Intro to gene expression (central dogma). (n.d.). Retrieved November 05, 2017, from https://www.

6 Information derivable from gene expression data
Inference: genes x, y are highly expressed under conditions W while genes a, b are not expressed.
Inference: gene X is significantly more highly expressed in diseased cells than in normal cells (control vs. treatment); hence gene X could potentially serve as a marker of the disease (differentially expressed genes).
Inference: genes with similar expression patterns might be functionally related, e.g., working in the same pathway or co-regulated (co-expression -> co-regulation).

7 Gene Expression Measurement
Microarray (GEO)
RNA-seq (SRA)
Read quality check (FastQC)
RNA-seq read mapping (BWA, Bowtie)
RNA-seq assembly with reference genome (Cufflinks)
RNA-seq assembly without reference genome (Trinity: de novo assembly)


8 RNA-seq Process / Purpose: Analysis of Big Genomic Data
Gene Expression Estimation, Variations, Differential Gene Expression Analysis, Functional Enrichment Analysis, Network Analysis
Forde, B. M., & O'Toole, P. W. (2013). Next-generation sequencing technologies and their impact on microbial genomics. Briefings in Functional Genomics, 12(5).

9 Non-trivial RNA-seq Analysis Pipeline
RNA-seq Reads
Quality Check
Data Trimming
Read Mapping
(De-novo) Assembly
Gene Read Count
Operon Prediction
Differential Expression Analysis
Functional Enrichment Analysis
De-novo (Bi)-Clustering
Network Analysis & Modeling

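10 Non-trivial RNA-seq Analysis Tools
RNA-seq Reads
Quality Check: FastQC
Data Trimming: Btrim
Read Mapping: HISAT
(De-novo) Assembly: Cufflinks, Trinity
Gene Read Count: HTSeq
Operon Prediction: DOOR, SeqTU
Differential Expression Analysis: edgeR/DESeq
Functional Enrichment Analysis: DAVID/GO
De-novo (Bi)-Clustering: MCL/QUBIC
Network Analysis & Modeling: NCA/GtrieScanner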

11 Existing RNA-seq Pipeline Tools
Timeline figure (years labeled include 2009 and 2011): GSNAP, edgeR, FastQC, FastX, Novoalign, Bowtie, DESeq, kallisto, BWA, TopHat, GNUMAP, RSEM, Cufflinks, Cutadapt, TopHat2, STAR, Trinity, HISAT2, Bridger, HTSeq, sleuth

12 ViDGER
Tool to assist in interpreting and analyzing count matrices
PCA, MDS, clustering
DGEA
Visualizations
Basic R package
Shiny implementation

13 ViDGER Compatibility
Input: count & condition matrix
Popular DGE tools by citation count (chart): Cuffdiff*, edgeR, DESeq2, DEGseq, limma, sleuth*
Chart categories: DGE & Visualization, Visualization Only, None

14 Shiny Input
Count matrix
Generates basic figures from matrix
Initial analyses: PCA, MDS

15 Differential Gene Expression
Select DGE tool to analyze data
Interactive results table
DGE results visualizations for improved interpretation
Interactivity between table & figures

16 Pitfall I: Popularity ≠ High Performance
Human: MapSplice2 (97.8%), CRAC (86.1%), GSNAP (98.9%), Novoalign (90.3%), TopHat2 (12.5%)

17 Pitfall II: Gene expression estimation
RNA-seq Reads
Quality Check
Data Trimming
Read Mapping (mapping uncertainty!)
(De-novo) Assembly
Gene Read Count
Operon Prediction
Differential Expression Analysis
Functional Enrichment Analysis
De-novo (Bi)-Clustering
Network Analysis & Modeling

18 Pitfall II: Gene expression estimation
RNA-seq reads mapping uncertainty

19 Mapping Uncertainty Occurrences
Plants: highly duplicative nature of genome
Animals: alternative splicing
Metagenomics: sequencing of entire microbial communities simultaneously
  Identical genes across different species
  Similar, mutated or evolved genes
Currently other issues compounding mapping uncertainty

20 Pitfall II: How Serious?
Group             Species                 Unique-mapped   Multi-mapped   Unmapped
Diploid plants    Arabidopsis thaliana    77%~89%         8%~17%         2%~5%
Diploid plants    Vitis vinifera          55%~82%         10%~25%        8%~23%
Diploid plants    Solanum lycopersicum    49%~87%         6%~34%         5%~44%
Polyploid plants  Solanum tuberosum       55%~69%         18%~26%        12%~19%
Polyploid plants  Triticum aestivum       62%~69%         18%~25%        9%~18%
Similar things happen in Human (transcript) and Metagenome

21 Mapping Uncertainty in Real Data
Group             Species                      Unique-mapped   Multi-mapped   Un-mapped   (Multi-mapped)/(Total mapped)
Diploid plants    Arabidopsis thaliana         69%~89%         8%~17%         2%~17%      8%~18%
Diploid plants    Vitis vinifera               55%~82%         9%~25%         8%~23%      10%~31%
Diploid plants    Solanum lycopersicum         52%~88%         5%~34%         4%~16%      6%~39%
Polyploid plants  Panicum virgatum             47%~66%         17%~33%        13%~25%     22%~39%
Polyploid plants  Triticum aestivum            61%~69%         17%~25%        9%~18%      21%~28%
Animal            Human genome                 55%~65%         21%~28%        12%~21%     25%~33%
Animal            Human transcriptome          10%~15%         23%~31%        55%~65%     61%~72%
Animal            Mus musculus genome          40%~70%         10%~38%        3%~31%      13%~48%
Animal            Mus musculus transcriptome   11%~27%         9%~42%         43%~67%     29%~77%
Total                                          55%             22%            23%         29%
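The last column of the table can be checked against the other columns. The following is a minimal sketch, assuming the ratio is simply multi-mapped / (unique-mapped + multi-mapped); the function name is hypothetical and not part of any tool named in the slides.

```python
def multimapped_fraction(unique_pct: float, multi_pct: float) -> float:
    """Share of mapped reads that are multi-mapped: multi / (unique + multi)."""
    total_mapped = unique_pct + multi_pct
    if total_mapped <= 0:
        raise ValueError("no mapped reads")
    return multi_pct / total_mapped


# "Total" row of the table: 55% uniquely mapped, 22% multi-mapped.
total_row = multimapped_fraction(55.0, 22.0)
print(f"{total_row:.1%}")
```

Rounded to a whole percent, the result agrees with the 29% reported in the Total row.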

22 Mapping Uncertainty in Plant Data (figure)

23 Mapping Uncertainty in Animal Data (figure)

24 Pitfall II: How to Proceed?
a) Ignore them: only consider unique mapping. 30%-70% of reads are discarded from further analysis in plants.
b) Random mapping: if there are multiple equally best matches, choose one at random (TopHat).
c) Report all: try to keep more information. Cufflinks: distribute these multiple mapping reads uniformly or based on the expression level of unique mapping reads.
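The three options above can be contrasted on a toy example. This is a minimal sketch under stated assumptions: each read is represented as the list of genes it aligns to equally well, all function names are hypothetical, and the proportional rule only illustrates the idea of weighting by unique-read counts; it is not the actual TopHat or Cufflinks implementation.

```python
import random
from collections import Counter


def count_unique_only(alignments):
    """Option (a): keep only reads that map to exactly one gene."""
    counts = Counter()
    for genes in alignments:
        if len(genes) == 1:
            counts[genes[0]] += 1
    return counts


def count_random(alignments, seed=0):
    """Option (b): assign each multi-mapped read to one candidate at random."""
    rng = random.Random(seed)
    counts = Counter()
    for genes in alignments:
        counts[rng.choice(genes)] += 1
    return counts


def count_proportional(alignments):
    """Option (c): split each multi-mapped read across its candidate genes,
    weighted by unique-read counts (uniformly if no candidate has any)."""
    unique = count_unique_only(alignments)
    counts = Counter({gene: float(c) for gene, c in unique.items()})
    for genes in alignments:
        if len(genes) > 1:
            weights = [unique.get(gene, 0) for gene in genes]
            total = sum(weights)
            for gene, weight in zip(genes, weights):
                share = weight / total if total > 0 else 1 / len(genes)
                counts[gene] += share
    return counts


# Hypothetical toy data: 3 reads unique to A, 1 unique to B, 4 ambiguous.
reads = [["A"]] * 3 + [["B"]] + [["A", "B"]] * 4
unique_counts = count_unique_only(reads)
random_counts = count_random(reads)
proportional_counts = count_proportional(reads)
print(unique_counts, random_counts, proportional_counts)
```

In this toy case option (a) discards half of the reads, while option (c) keeps all eight and preserves the 3:1 ratio seen in the unique reads.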

25 Pitfall II: How to Proceed?
It is an OPEN and challenging problem!

26 Quantifying Mapping Uncertainty
Gene Expression Quality Check (GeneQC)
Computational program collecting relevant information from datasets
Interprets information in a meaningful way to provide quantification of mapping uncertainty
Two levels of observations
  Genomic level: sequence similarity between two genomic locations
  Transcriptomic level: proportion of shared ambiguous reads

27 GeneQC (figure)

28 D-score
Allows for comparable metric of mapping uncertainty
Combines three statistics
  Maximum proportion of shared ambiguous reads
  Maximum base-pair similarity
  Number of gene pair interactions
Normalized between 0 and 1 for each dataset

29 $D_1$: Sequence Similarity * Match Length
$D_1 = \max_y \{ ss_{i,y} \cdot l_{i,y} \}$
$ss_{i,y}$ = sequence similarity of gene $i$ and gene $y$
$l_{i,y}$ = match length
Additional constraints for $D_1$: e-value < $10^{-6}$; SS * Match Length > 100; Mismatch < 5; Gap < 5
Example for gene $i$:
  gene $y_1$: $ss_{i,1}$ = 65%; $l_{i,1}$ = 100
  gene $y_2$: $ss_{i,2}$ = 85%; $l_{i,2}$ = 200
  gene $y_3$: $ss_{i,3}$ = 85%; $l_{i,3}$ = 350
  gene $y_4$: $ss_{i,4}$ = 85%; $l_{i,4}$ = 200
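As a worked example, the maximum of similarity times match length can be computed for the four example hits on this slide. This is a minimal sketch, assuming similarity is treated as a fraction (85% = 0.85) and ignoring the additional constraints and the per-dataset normalization; the function and variable names are hypothetical.

```python
def raw_d1(hits):
    """Maximum of (sequence similarity * match length) over candidate genes.

    hits maps a gene name to (similarity in percent, match length in bp).
    Similarity is converted to a fraction; this unit choice is an assumption.
    """
    if not hits:
        return 0.0
    return max(ss_pct * length / 100 for ss_pct, length in hits.values())


# The four hits listed on the slide for gene i.
example_hits = {
    "y1": (65, 100),
    "y2": (85, 200),
    "y3": (85, 350),
    "y4": (85, 200),
}
best = raw_d1(example_hits)
print(best)
```

Under these assumptions gene y3 gives the largest product, since it ties for the highest similarity and has the longest match.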

30 $D_2$: Max MMR percentage
$D_2 = \frac{|G_i \cap X|}{|G_i|}$
$G_i$ = reads aligned to gene $i$
$X = \arg\max_{G_y} |G_i \cap G_y|$
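The proportion of shared ambiguous reads can be illustrated with read-ID sets. This is a minimal sketch, assuming the statistic is the largest overlap between the reads of gene i and the reads of any other gene, divided by the number of reads aligned to gene i; that reading of the slide, and all names below, are assumptions.

```python
def max_shared_read_fraction(reads_by_gene, gene):
    """Largest fraction of a gene's reads that also align to one other gene."""
    own_reads = reads_by_gene[gene]
    if not own_reads:
        return 0.0
    largest_overlap = max(
        (len(own_reads & other_reads)
         for other, other_reads in reads_by_gene.items()
         if other != gene),
        default=0,
    )
    return largest_overlap / len(own_reads)


# Hypothetical read IDs aligned to three genes.
reads_by_gene = {
    "A": {"r1", "r2", "r3", "r4"},
    "B": {"r3", "r4", "r5"},
    "C": {"r4", "r6"},
}
shared_a = max_shared_read_fraction(reads_by_gene, "A")
print(shared_a)
```

Here gene A shares two of its four reads with gene B and one with gene C, so the maximum shared fraction is one half.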

31 $D_3$: Degree weight
$D_3$ is a logarithmic weight computed from $S_i$ and $M_i$
$S_i$ = {genomic locations where $D_1 > 0$}
$M_i$ = {genomic locations where $D_2 > 0$}
Separated into two populations: $D_2 = 0$ and $D_2 \neq 0$

32 Variables by Species (figure)

33 D-score Development
$D_1$, $D_2$, $D_3$ combined into one distinct value
Regression-based approach to optimize effect of each parameter
$D = \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_1 D_2 + \alpha_5 D_1 D_3 + \alpha_6 D_2 D_3 + \alpha_7 D_1 D_2 D_3$
$SD = D_3 (\alpha_1 D_1 + \alpha_2 D_2)$
$D$ used as dependent variable to represent mapping uncertainty
$G_i$ = reads mapped to gene $i$ (all matches)
$U_i$ = reads uniquely mapped to gene $i$ (unique mapping)
Real alignment $R_i$ falls somewhere between: $U_i \le R_i \le G_i$
[equation defining the dependent variable $D$ from $U_i$ and $G_i$]
$D$ regressed upon ($D_1$, $D_2$, $D_3$) to determine optimized coefficients for each dataset
Interpretations for each set of coefficients can be used to understand biological mechanisms behind species-specific mapping uncertainty
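The relationship between the full seven-term model and the simplified score can be made concrete. This is a minimal sketch that only evaluates both formulas for given coefficients; the coefficient and input values are hypothetical, and no regression fit is performed.

```python
def full_d(d1, d2, d3, alpha):
    """Seven-term model: main effects plus all interaction terms."""
    a1, a2, a3, a4, a5, a6, a7 = alpha
    return (a1 * d1 + a2 * d2 + a3 * d3
            + a4 * d1 * d2 + a5 * d1 * d3 + a6 * d2 * d3
            + a7 * d1 * d2 * d3)


def simplified_d(d1, d2, d3, a1, a2):
    """Simplified score: D3 * (a1 * D1 + a2 * D2)."""
    return d3 * (a1 * d1 + a2 * d2)


# Hypothetical inputs and coefficients, chosen only for illustration.
d1, d2, d3 = 0.5, 0.25, 2.0
sd = simplified_d(d1, d2, d3, 1.0, 2.0)
same_via_full = full_d(d1, d2, d3, (0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0))
print(sd, same_via_full)
```

Expanding the simplified score gives only the D1·D3 and D2·D3 terms, so it corresponds to the full model with every coefficient except those two interaction terms set to zero.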

34 D-scores (figure)

35 Simplified D-score (figure)

36 Simplified D-score Distributions
Density plots appear to show mixture distributions
Individual distributions can help indicate categorizations for mapping uncertainty

37 Level of Mapping Uncertainty from D-scores
Mixture model distributions fit to set of D-scores
Indicates level of mapping uncertainty for each annotated gene
Normal & Gamma distribution fitting
Variable number of distributions
Mixture model fitting using the Expectation-Maximization algorithm
$P(X \mid \theta) = \sum_k \beta_k Y_k(X \mid \theta_k)$
$X = \{x_1, x_2, \dots, x_n\}$ represents the set of D-scores
$\beta_k$ represents the weight of the $k$-th component, with $\sum_k \beta_k = 1$
$Y_k(X \mid \theta_k)$ represents the distribution of the $k$-th component
$\theta_k$ is the set of parameters for the $k$-th component
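The mixture density above can be evaluated directly once component weights and parameters are given. This is a minimal sketch for normal components only (the Gamma case is not covered); the function names and the example parameters are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, mus, sigmas):
    """Weighted sum of component densities; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("component weights must sum to 1")
    return sum(b * normal_pdf(x, m, s)
               for b, m, s in zip(weights, mus, sigmas))


# Hypothetical two-component mixture of D-scores.
density = mixture_density(0.3, [0.6, 0.4], [0.2, 0.7], [0.1, 0.15])
print(density)
```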


38 Mixture Model Fitting: Initialization
Assume $Y_k(X, \theta_k) = N(X; \mu_k, \sigma_k^2)$
Initial parameterization: K-means clustering to separate into $k$ components
$\theta_k$, $\beta_k$ calculated for each component using MLE based on $N_k$
$\mathrm{MLE}(\mu_k) = \frac{1}{N_k} \sum_{x \in k} x$
$\mathrm{MLE}(\sigma_k^2) = \frac{1}{N_k} \sum_{x \in k} (x - \mu_k)^2$
$\beta_k = \frac{N_k}{N}$, with $N_k$ = number of data points in component $k$ and $\sum_k N_k = N$
k = 4
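The initialization can be sketched in a few lines. This is a minimal illustration, assuming a plain one-dimensional k-means with quantile-based starting centers (the slides do not say how the clustering is seeded) and the standard maximum-likelihood estimates per cluster; all names and the toy scores are hypothetical.

```python
def assign(xs, centers):
    """Group each value with its nearest center."""
    groups = [[] for _ in centers]
    for x in xs:
        nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        groups[nearest].append(x)
    return groups


def kmeans_1d(xs, k, max_iter=100):
    """Simple 1-D k-means; starting centers are evenly spaced quantiles."""
    ordered = sorted(xs)
    centers = [ordered[int((j + 0.5) * len(ordered) / k)] for j in range(k)]
    for _ in range(max_iter):
        groups = assign(xs, centers)
        updated = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return assign(xs, centers)


def init_params(xs, k):
    """Per-cluster weight, mean and variance (MLE, dividing by N_k)."""
    n = len(xs)
    params = []
    for group in kmeans_1d(xs, k):
        if not group:
            continue  # an empty cluster contributes no component
        n_k = len(group)
        mu = sum(group) / n_k
        var = sum((x - mu) ** 2 for x in group) / n_k
        params.append({"beta": n_k / n, "mu": mu, "var": var})
    return params


# Hypothetical D-scores forming two well-separated groups.
scores = [0.10, 0.12, 0.14, 0.80, 0.82, 0.84]
params = init_params(scores, 2)
print(params)
```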

39 Mixture Model Fitting: Expectation & Maximization
Expectation: the posterior probability of containment within each component is calculated for each D-score
$P(k \mid x_i) = \frac{P(x_i \mid k) P(k)}{P(x_i)} = \frac{\beta_k N(x_i; \mu_k, \sigma_k^2)}{\sum_j \beta_j N(x_i; \mu_j, \sigma_j^2)}$
Maximization: parameters for each component are calculated after the Expectation step
$\mu_k = \frac{\sum_i P(k \mid x_i) x_i}{\sum_i P(k \mid x_i)}$
$\sigma_k^2 = \frac{\sum_i P(k \mid x_i) (x_i - \mu_k)^2}{\sum_i P(k \mid x_i)}$
$\beta_k = \frac{1}{N} \sum_i P(k \mid x_i)$
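One Expectation and Maximization pass over a set of scores can be written as follows. This is a minimal sketch for normal components, assuming no component's density underflows to zero for every point; the function names, starting values and toy data are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(xs, betas, mus, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    k = len(betas)
    # Expectation: posterior probability of each component for each point.
    posteriors = []
    for x in xs:
        weighted = [betas[j] * normal_pdf(x, mus[j], variances[j])
                    for j in range(k)]
        total = sum(weighted)
        posteriors.append([w / total for w in weighted])
    # Maximization: re-estimate weight, mean and variance per component.
    n = len(xs)
    new_betas, new_mus, new_vars = [], [], []
    for j in range(k):
        mass = sum(p[j] for p in posteriors)
        mu = sum(p[j] * x for p, x in zip(posteriors, xs)) / mass
        var = sum(p[j] * (x - mu) ** 2 for p, x in zip(posteriors, xs)) / mass
        new_betas.append(mass / n)
        new_mus.append(mu)
        new_vars.append(var)
    return new_betas, new_mus, new_vars, posteriors


# Hypothetical scores and starting parameters.
data = [0.0, 0.2, 0.4, 2.0, 2.2, 2.4]
betas, mus, variances, posteriors = em_step(
    data, [0.5, 0.5], [0.0, 2.0], [0.1, 0.1])
print(betas, mus, variances)
```

Repeating `em_step` with its own outputs as the next inputs gives the iterative fit; a stopping rule based on the change in log likelihood would be added around the loop.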

40 Mixture Model Fitting: Optimization
Expectation and Maximization steps are repeated until no significant improvement is achieved after an iteration (the log likelihood fails to substantially increase)
Implementation in R with $k \in \{1, \dots, 9\}$
Best model fit determined by lowest Bayesian Information Criterion (BIC)
k = 4
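Model selection by BIC can be illustrated once a maximized log likelihood is available for each candidate number of components. This is a minimal sketch, assuming 1-D normal components (so a k-component model has 3k - 1 free parameters) and the convention BIC = p ln(n) - 2 ln(L); the log-likelihood values below are hypothetical, not results from the slides.

```python
import math


def gaussian_mixture_bic(log_likelihood, k, n):
    """BIC for a k-component 1-D Gaussian mixture fitted to n points.

    Free parameters: k means, k variances and k - 1 independent weights.
    """
    n_params = 3 * k - 1
    return n_params * math.log(n) - 2.0 * log_likelihood


def best_k(log_likelihoods, n):
    """Number of components with the lowest BIC.

    log_likelihoods maps k to the maximized log likelihood for that k.
    """
    return min(log_likelihoods,
               key=lambda k: gaussian_mixture_bic(log_likelihoods[k], k, n))


# Hypothetical maximized log likelihoods for k = 1..5 on 500 D-scores.
fits = {1: -520.0, 2: -450.0, 3: -410.0, 4: -380.0, 5: -378.0}
chosen = best_k(fits, 500)
print(chosen)
```

The penalty term grows with k, so the small gain in log likelihood from four to five components in this toy input does not justify the extra parameters.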

41 Mixture Model Fitting
k = 4
The four distributions provide criteria for separating genes into 4 categories based on mapping uncertainty level


42 Addressing Mapping Uncertainty
Co-expression Modules (CEMs)
Genes are typically co-expressed at certain rates with other genes, forming co-expression modules
Can use expression levels for known co-expressed genes (CEGs) to predict likely expression levels for the gene locations
This information can in turn be used to determine which location is most likely for any particular ambiguous read
Can use existing information to gain insight into the likelihood of the correct location for alignment
If no prior CEMs are available, biclustering of data can provide dataset-specific CEMs

43 Pitfall III: T-test for differential expression analysis
Wilcoxon (nonparametric) test has better performance than T-test (parametric)
Bioinformatics Nov;18(11): Cited by 308
P-value <
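For reference, the rank-based test named above can be sketched with the standard library. This is a minimal illustration of the Wilcoxon rank-sum (Mann-Whitney U) statistic with a normal approximation for the two-sided p-value, without tie or continuity corrections; the function names and expression values are hypothetical.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Return (U statistic of group a, approximate two-sided p-value)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return u1, 2 * NormalDist().cdf(-abs(z))


# Hypothetical expression values of one gene in control and treatment.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treatment = [6.0, 7.0, 8.0, 9.0, 10.0]
u_stat, p_value = rank_sum_test(control, treatment)
print(u_stat, p_value)
```

Because only ranks are used, a single extreme expression value changes the statistic no more than any other value on the same side, which is one reason rank-based tests are considered robust for small sample sizes.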

44 Pitfall IV: co-expression correlation
Pearson or Spearman?
Example table: expression of Gene1 and Gene2 across chip1 to chip10, with their Pearson and Spearman correlations
Gene2: 8.1 7.2 7.0 8.4 8.9 8.8 6.5 10.4 6.9 7.
Pearson correlation measures linear relationship
Spearman's rank correlation measures monotonic relationship
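The difference between the two coefficients shows up on any monotonic but non-linear relationship. This is a minimal sketch using only the standard library; the two expression profiles are hypothetical and are not the values from the slide's table.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical profiles: gene_b rises exponentially with gene_a.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [math.exp(v) for v in gene_a]
r_pearson = pearson(gene_a, gene_b)
r_spearman = spearman(gene_a, gene_b)
print(r_pearson, r_spearman)
```

The ranks of the two profiles are identical, so the Spearman coefficient is 1, while the Pearson coefficient is lower because the relationship is not linear.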

45 Pitfall V: Co-expression in LARGE data set
One-dimensional clustering (genes or conditions)
More data!!
Bi-clustering (genes & conditions)
Genes are not necessarily co-expressed under all experimental conditions when we have a large data set!

46 Computer Lab Requirement
Recent version of the following software: R, RStudio, MiKTeX (or TeXLive)
Install the following R packages on your personal computer: edgeR, QUBIC, sand

47 Final Report Presentation
12 teams, 3 persons per team
15 minutes per team: 12-minute presentation, 3-minute question-and-answer
One score per team
