HT Expression Data Analysis
Li-Yu Liu (劉力瑜), Department of Agronomy, National Taiwan University
lyliu@ntu.edu.tw
08/03/2018
HT Transcriptomic Data Microarray RNA-seq
Workflow
1. Data import
2. Preprocessing*
3. Visualization
4. DE analysis*
5. Adjust p-values for multiple comparisons
6. Cluster analysis
* Different methods are used for microarray and RNA-seq data
R / Bioconductor for HT Transcriptomic Data
"affylmGUI" for Affymetrix microarrays
"limma" for microarrays in general
"DESeq" for RNA-seq data
Affymetrix Microarrays Example data: (GSE59533) Expression data from Zea mays cultivars Tietar and DKC 6575 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse59533
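Get GEO Data using R
# Install the "GEOquery" package from Bioconductor
# (Bioconductor 3.8 and later use BiocManager::install("GEOquery") instead of biocLite)
> source("http://bioconductor.org/biocLite.R")
> biocLite("GEOquery")
> library(GEOquery)
# Download the CEL files (supplementary files) for GSE59533:
> gse0 = getGEOSuppFiles("GSE59533")
> gse0
# Note where the downloaded files were saved, then decompress them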
> source("http://bioconductor.org/biocLite.R")
> biocLite("affylmGUI")
> library(affylmGUI)
> affylmGUI() # macOS
File -> New
Target File Format
The file at the right is known as the "RNA Targets" file in affylmGUI. It describes the experimental conditions for each of the 12 arrays.
The file should:
  be in tab-delimited text format;
  have 3 columns;
  use column headings exactly as shown:
    Name: the unique name for each chip
    FileName: the Affymetrix .CEL file name for each chip
    Target: used by affylmGUI to group the arrays into different classes (for downstream differential expression analysis)
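A targets file with this layout can be written directly from R. The sketch below is illustrative only: the chip names, .CEL file names, and group labels are hypothetical placeholders (not the actual GSE59533 files), and only three of the 12 arrays are shown.

```r
# Hypothetical example rows; replace with the real chip names and .CEL files.
targets <- data.frame(
  Name     = c("chip1", "chip2", "chip3"),
  FileName = c("chip1.CEL", "chip2.CEL", "chip3.CEL"),
  Target   = c("groupA", "groupA", "groupB"),
  stringsAsFactors = FALSE
)

# Tab-delimited, no quotes, no row names; headings are Name / FileName / Target.
targets_file <- file.path(tempdir(), "targets.txt")
write.table(targets, file = targets_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back to confirm the layout.
check <- read.delim(targets_file, stringsAsFactors = FALSE)
print(check)
```

Writing to `tempdir()` keeps the example from touching the working directory; in practice the file would sit next to the .CEL files.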
Normalization for RNAseq
There are two main sources of systematic variability that require normalization:
(1) RNA fragmentation during library construction causes longer transcripts to generate more reads than shorter transcripts present at the same abundance in the sample (3&4).
(2) Variability in the number of reads produced in each run causes fluctuations in the number of fragments mapped across samples (1&2).
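Normalization for RNAseq
Single-end reads: use the reads per kilobase of transcript per million mapped reads (RPKM) metric:
$\mathrm{RPKM} = \dfrac{10^9 \times R}{N \times L}$
where R is the number of reads mapped to the transcript, N is the total number of mapped reads in the sample, and L is the transcript length in base pairs.
Paired-end reads: use the analogous fragments per kilobase of transcript per million mapped reads (FPKM) metric.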
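The RPKM formula can be sketched in a few lines of base R. This assumes the standard reading of the symbols, spelled out in the comments. The function name `rpkm`, the toy counts, and the gene lengths are all hypothetical. Here N is taken from the column sums of the toy matrix, whereas in a real analysis it would be the total number of mapped reads in each library.

```r
# RPKM = 1e9 * R / (N * L)
#   R: reads mapped to a gene (one matrix entry)
#   N: total mapped reads in the sample (assumed here: column sum of the matrix)
#   L: gene/transcript length in base pairs
rpkm <- function(counts, lengths_bp) {
  stopifnot(nrow(counts) == length(lengths_bp))
  N <- colSums(counts)
  # counts / lengths_bp divides row i by lengths_bp[i];
  # sweep() then divides column j by N[j].
  1e9 * sweep(counts / lengths_bp, 2, N, "/")
}

# Toy data: 2 genes x 2 samples (hypothetical values).
toy_counts <- matrix(c(100, 400,
                       300, 300),
                     nrow = 2,
                     dimnames = list(c("geneA", "geneB"), c("s1", "s2")))
toy_lengths <- c(geneA = 1000, geneB = 2000)

toy_rpkm <- rpkm(toy_counts, toy_lengths)
print(toy_rpkm)
```

In sample s2 both genes have 300 reads, but geneB is twice as long, so its RPKM is half that of geneA; this is the length correction described above.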
Scaling Method in DESeq
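Assuming this slide refers to the median-of-ratios size factors used by DESeq, the idea can be sketched in base R as follows. The function name `size_factors` is hypothetical; the counts are a few rows taken from the example count matrix in these slides, so the resulting factors are illustrative only and will not match those estimated from all 18761 genes.

```r
# Median-of-ratios sketch: for each sample, take the median over genes of
# count / (geometric mean of that gene across samples).
size_factors <- function(counts) {
  log_counts <- log(counts)
  log_geo_mean <- rowMeans(log_counts)   # -Inf when a gene has any zero count
  keep <- is.finite(log_geo_mean)        # drop genes with a zero in any sample
  apply(log_counts[keep, , drop = FALSE], 2,
        function(col) exp(median(col - log_geo_mean[keep])))
}

counts <- matrix(c(
    0,   0,   2,   0,   0,   1,
   20,   8,  12,   5,  19,  26,
   75,  84, 241, 149, 271, 257,
  129, 126, 451, 223, 243, 149,
   13,   4,  21,  19,  31,   4,
  202, 122, 256,  43, 287, 357,
   10,   8,  56, 145,  14,  15,
  104,  60, 218, 213, 111, 121,
    6,   6,  22,  13,  15,   6),
  ncol = 6, byrow = TRUE,
  dimnames = list(
    c("Gene_00001", "Gene_00002", "Gene_00004", "Gene_00006", "Gene_00007",
      "Gene_00009", "Gene_00010", "Gene_00012", "Gene_00013"),
    c("T1a", "T1b", "T2", "T3", "N1", "N2")))

sf <- size_factors(counts)
norm_counts <- sweep(counts, 2, sf, "/")  # divide each sample by its size factor
print(round(sf, 3))
```

Because the median is taken over genes, a few very highly expressed genes have little influence on the scaling, unlike a simple total-count normalization.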
DE Analysis for RNAseq
DESeq (DESeq2) is a Bioconductor package. It assumes that the read counts follow a negative binomial (NB) distribution.
1. Estimate the variance of the NB distribution
2. Perform hypothesis testing under the NB distribution
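To see why the variance has to be estimated separately under the NB model, a short simulation in base R may help. It uses the common parameterization in which the variance is mu + alpha * mu^2, with alpha the dispersion; the values of `mu` and `alpha` below are arbitrary choices for illustration, not estimates from any data set.

```r
set.seed(1)
mu    <- 100   # mean count (hypothetical)
alpha <- 0.2   # dispersion (hypothetical)

# rnbinom() with mu and size has variance mu + mu^2 / size, so size = 1 / alpha.
nb_counts   <- rnbinom(10000, mu = mu, size = 1 / alpha)
pois_counts <- rpois(10000, lambda = mu)

summary_tab <- c(
  mean_nb         = mean(nb_counts),
  var_nb          = var(nb_counts),
  expected_var_nb = mu + alpha * mu^2,
  var_poisson     = var(pois_counts)
)
print(round(summary_tab, 1))
```

The Poisson counts have variance close to the mean, while the NB counts are much more variable at the same mean; the dispersion captures that extra variability.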
DESeq2
Input from count matrix: ctdata.tab

gene         T1a   T1b    T2    T3    N1    N2
Gene_00001     0     0     2     0     0     1
Gene_00002    20     8    12     5    19    26
Gene_00003     3     0     2     0     0     0
Gene_00004    75    84   241   149   271   257
Gene_00005    10    16     4     0     4    10
Gene_00006   129   126   451   223   243   149
Gene_00007    13     4    21    19    31     4
Gene_00008     0     3     0     0     0     0
Gene_00009   202   122   256    43   287   357
Gene_00010    10     8    56   145    14    15
Gene_00011     2     3     5     0     3     0
Gene_00012   104    60   218   213   111   121
Gene_00013     6     6    22    13    15     6

(18761 genes) (6 samples)
DESeq2
> library('DESeq2')
# Read the count table; the first column (gene IDs) becomes the row names
> samplecountdata = read.delim("data/ctdata.tab", row.names = 1)
# One row of sample information per column of the count table
> samplecoldata = DataFrame(
+     condition = as.factor(c("treated","treated","treated","treated","control","control")),
+     row.names = colnames(samplecountdata))
> dds = DESeqDataSetFromMatrix(
+     countData = samplecountdata,
+     colData = samplecoldata,
+     design = ~ condition)
DESeq2
> dds = DESeq(dds)
> res = results(dds)
# Sort genes by adjusted p-value
> res = res[order(res$padj),]
> plotMA(dds)
> write.csv(as.data.frame(res), file="condition_treated_results.csv")
# Save normalized read counts
> norm.cts = counts(dds, normalized=TRUE)
> write.csv(norm.cts, file="normalizedcounts.csv")
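The `padj` column in the DESeq2 results holds p-values adjusted for multiple comparisons, by default with the Benjamini-Hochberg (BH) method. The sketch below shows what that adjustment does, using base R's `p.adjust()` alongside a hand-written version. The p-values are made up for illustration, and `bh_manual` is a hypothetical helper name. DESeq2 also applies independent filtering before adjusting, which this sketch does not reproduce.

```r
# Hypothetical raw p-values for five genes.
p <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Built-in Benjamini-Hochberg adjustment.
padj <- p.adjust(p, method = "BH")

# The same adjustment by hand: scale the i-th smallest p-value by n / i,
# then enforce monotonicity starting from the largest p-value.
bh_manual <- function(p) {
  n  <- length(p)
  o  <- order(p, decreasing = TRUE)   # largest p-value first
  ro <- order(o)                      # to restore the original order
  pmin(1, cummin(n / (n:1) * p[o]))[ro]
}

print(data.frame(p = p, padj = padj, padj_manual = bh_manual(p)))
```

Genes with `padj` below a chosen threshold (for example 0.05) are then reported as differentially expressed, with the false discovery rate controlled at that level.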
DESeq2
# LRT for multiple levels
> colData(dds)$condition = as.factor(c("t1","t1","t2","t2","ctrl","ctrl"))
> colData(dds)$condition = relevel(colData(dds)$condition, "ctrl")
> ddslrt = DESeq(dds, test="LRT", reduced= ~ 1)
> reslrt = results(ddslrt)
> mcols(ddslrt, use.names=TRUE)[1:3,]
# When there is no replicate
> trt = c("t1a","t1b")
> dds.short = DESeqDataSetFromMatrix(countData = samplecountdata[,1:2],
+     colData = DataFrame(condition=as.factor(trt), row.names=trt),
+     design = ~ condition)
> dds.short = DESeq(dds.short)
> plotMA(dds.short)