Visualization and PCA with Gene Expression Data. Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4

Size: px

Start display at page:

Download "Visualization and PCA with Gene Expression Data. Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4"

Alicia Ford
6 years ago
Views:

1 Visualization and PCA with Gene Expression Data Utah State University Fall 2017 Statistical Bioinformatics (Biomedical Big Data) Notes 4 1

2 References Chapter 10 of Bioconductor Monograph Ringner (2008). What is principal component analysis? Nature Biotechnology 26: Chapter 8 of Johnson & Wichern s Applied Multivariate Statistical Analysis, 5 th ed. 2

3 MA plot compare samples MA plot: M=Y-X vs. A=0.5(Y+X) Rotate and scale Y vs. X scatterplot For log-scale expression on arrays Y and X: M = log fold-change (Y vs. X) A = average expression For comparing multiple arrays, can create a pseudo-array reference X by taking median for each gene across all samples Sometimes add a Loess curve: locally weighted polynomial regression 3

4 Visualization and efficiency library(affydata); data(dilution) eset <- log2(exprs(dilution)) dim(eset) # 409,600 rows (gene probes), 4 columns (samples) X <- eset[,1]; Y <- eset[,2] M <- Y - X; A <- (Y+X)/2 plot(a,m,main='default M-A plot',pch=16,cex=0.1); abline(h=0) Interpretation: M-direction shows differential expression A-direction shows average expression Look for: systematic changes outliers patterns quality / normalization (larger M-variability, curvature) Problems: 1. loss of information (overlayed points) 2. file size Note: Dilution is from an Affymetrix experiment, only used for illustration here. 4

5 Alternative: show density by color library(geneplotter); library(rcolorbrewer) blues.ramp <- colorramppalette(brewer.pal(9,"blues")[-1]) dc <- denscols(a,m,colramp=blues.ramp) plot(a,m,col=dc,pch=16,cex=0.1,main='color density M-A plot') abline(h=0) Interpretation: color represents density around corresponding point How is this better? visualize - overlayed points What problem(s) remain? file size (one point per probe) 5

6 Alternative: highlight outer points op <- par() par(bg="black", fg="white", col.axis="white", col.lab="white", col.sub="white", col.main="white") dcol <- denscols(a,m,colramp=blues.ramp) plot(a,m,col=dcol,pch=16,cex=0.1, main='color density M-A plot') abline(h=0) par(op) 6

7 Alternative: smooth density smoothscatter(a,m,nbin=250,nrpoints=50,colramp=blues.ramp, main='smooth scatter M-A plot'); abline(h=0) What does this do? 1. smooth the local density using a kernel density estimator 2. black points represent isolated data points - But it can be a bit blurry (creates visual artifacts) 7

8 Alternative: hexagonal binning library(hexbin) hb <- hexbin(cbind(a,m),xbins=40) plot(hb, colramp = blues.ramp, main='hexagonal binning M-A plot') What does this do? essentially discretizes density - Maybe a little clunky, and adding reference lines can be tricky - But probably the safest plot 8

9 # Same as previous slide, but try adding horizontal ref hb <- hexbin(cbind(a,m),xbins=40) plot(hb, colramp = blues.ramp, main='hexagonal binning M-A plot') abline(h=0) 9

10 Image file types and sizes (slide 5 ex.) dcol <- denscols(a,m,colramp=blues.ramp) pdf("c:\\folder\\f1.pdf") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 3,643 KB postscript("c:\\folder\\f1.ps") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 25,315 KB jpeg("c:\\folder\\f1.jpg") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 14 KB png("c:\\folder\\f1.png") plot(a,m,col=dcol,pch=16,cex=0.1,main='m-a plot'); abline(h=0) dev.off() # 9 KB (Note other options for these functions to control size and quality.) 10

11 Principal Components Analysis A common approach in high-dimensional data: reduce dimensionality Notation: X lj = [log-scale] expression / abundance level for variable (gene / protein / metabolite / substance) j in observation (sample) l of the data [so X T expression set matrix] Define i th principal component (like a new variable or column): PC i = a ij X j j (where X j is the j th column of X) 11

12 Principal Components Analysis In the ith principal component: the coefficients a ij are chosen (automatically) so that: PC 1, PC 2, each have the most variation possible PC 1, PC 2, are independent (uncorrelated) PC 1 has most variation, followed by PC 2, then PC 3, i (a ij ) 2 = 1 for each i PC i = j a ij X j 12

13 PCA: Interpretation Size of a ij s indicates importance in variability Example: Suppose a 1j s are large for a certain class of gene / protein / metabolite, but small for other classes. Then PC1 can be interpreted as representing that class Problem: such clean interpretation not guaranteed 13

14 PCA: Visualization with the Biplot Several tools exist, but the biplot is fairly common Represent both observations / samples (rows of X) and variables [genes / proteins / etc.] (columns of X) Observations usually plotted as text labels at coordinates determined by first two PC s Greater interest: Variables plotted as labeled arrows, to coordinate (on arbitrary scale of top and right axes) weight in the first two PC s 14

15 PCA Implementation and Example Problem with high-dimensional wide data If have many more variables than observations Solution: transpose X and convert back to original space [princomp2 function in msprocess package] Example here: Naples data 4 Ctrl patients, 4 diseased (RCM or DCM) patients Focus on C vs. non-c (D or R) differentiation Just use previously-selected set of 20 genes for now (recall Notes 3 slide 23) 15

16 ## Get subset of Naples data (same as in Notes 3, slide 23) url <- " naples <- read.csv(url, row.names=1) head(naples) emat <- as.matrix(naples) gn <- rownames(emat) gn.list <- c("ensg ","ensg ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ", "ENSG ","ENSG ","ENSG ") t <- is.element(gn,gn.list) small.eset <- emat[t,] group <- c('c','r','c','d','c','r','c','d') colnames(small.eset) <- group # Run PCA and visualize result source(" pc <- pc2(t(small.eset), scale = TRUE) 16

17 biplot(pc, cex=c(2,1), xlim=c(-.8,.8)) Biplot arrows show how certain variables put observations in certain parts of the plot. Here, dominant variables are , , , and

18 small.eset C R C D C R C D ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG ENSG Why would , , , and dominate? 18

19 Caution with high-dimensional PCA With a large # of variables (genes, dimensions, etc.), need to be aware of scale Variables with large variance (perhaps due to scale) will dominate principal components Need to standardize variables (genes; rows here) if they are measured on scales with widely different ranges or if the units of measurement are not [comparable] (Johnson & Wichern text) Log-scale expression values often better behaved 19

20 After standardizing rows of small.eset: Now able to detect effects, both huge and subtle: C > non-c for , , , and C < non-c for 68137, 20

21 Scree plot shows variance of the first few PC s Best to have majority of variance accounted for by the first two (or so) PC s 21

22 # standardize each row X <- log(small.eset+1) # Steps omitted here; each row of X standardized. # Think about how you could do it! pcx <- pc2(t(x), scale=true) biplot(pcx, xlim=c(-.8,.6), cex=c(2,1)) screeplot(pcx) 22

23 Side note: archived package Package msprocess was removed from the CRAN repository. Formerly available versions can be obtained from the archive. (But only there as source, not compiled package) We may return to this package later, but for now, we only need the princomp2 function: This function performs principal component analysis (PCA) for wide data x, i.e. dim(x)[1] < dim(x)[2]. This kind of data can be handled by princomp in S-PLUS but not in R. The trick is to do PCA for t(x) first and then convert back to the original space. It might be more efficient than princomp for high dimensional data. The pc2 function sourced in (previous slide) is a modified (but fully functional) version 23

24 # Try a 3-d biplot library(biplotgui) Biplots(Data=t(X)) # actually, using t(log(small.eset+1)) # looks similar here # lower-left: External --> In 3D 24

25 Summary Visualization efficiency Avoid overlaying points Conserve file size (note file type) Choose meaningful color palette Principal Components Emphasize major sources of variability No guaranteed interpretability (no accounting for bio.) Pay attention to scale of variables 25

26 Warnings why end with this? What can be gained from clustering? general ideas, possible structures; NOT - statistical inference (at least not in the same way) Be wary of: elaborate pictures non-normalized data unjustifiable decisions clustering method distance function color scheme claims of - proof (maybe support ) arbitrary decisions What is clustering best for? exploratory data analysis and summary, NOT statistical inference 26

Visualization and PCA with Gene Expression Data

References Visualization and PCA with Gene Expression Data Utah State University Spring 2014 STAT 5570: Statistical Bioinformatics Notes 2.4 Chapter 10 of Bioconductor Monograph (course text) Ringner (2008).