Introduction to Cancer Genomics
Gene expression data analysis, part I
David Gfeller, Computational Cancer Biology, Ludwig Center for Cancer Research
david.gfeller@unil.ch
Overview
1. Basic understanding of RNA-Seq data processing.
2. Differential expression. Examples of R code.
3. Dimensionality reduction.
Goals
- Help you understand what can be done with a computer -> programming logic.
- Give you a basic idea of how to ask the computer to perform tasks -> syntax.
- Show you a few examples of gene expression data analysis in R that you can reuse in your projects (see also the practical).
Gene expression experiments
- Microarrays: chip with DNA probes that pair with the DNA (reverse-transcribed RNA) in a sample. Intensity is measured as a light signal. Very popular from 2000 to 2010.
- RNA-Seq: directly count how many transcripts (mRNA molecules) originate from each gene in a sample. Increasingly replacing microarrays for gene expression analyses.
RNA-Seq
Workflow: RNA -> RNA fragmentation -> reverse transcription -> adaptors + amplification -> sequencing (>100M reads, e.g. ACCTAG, CGGTAA, ATGGCA, TGGGAC, TATAGG) -> map to reference transcriptome (Gene A, Gene B).
- Gene expression => quite easy (count the reads).
- Gene fusion => more difficult (especially for new fusion events).
- Splicing => more difficult (especially for poorly annotated isoforms).
1 - Typical output of RNA-Seq
Raw sequences:
- FASTQ format (sequence of the reads + quality information)
Processed data:
- Counts: number of reads mapping to each gene/transcript.
- BAM format (compressed)
- SRA format (compressed)
How to think about these data in a computer
Sample1: gene1: 254; gene2: 1284; gene3: 7234
Sample2: gene1: 5; gene2: 362; gene3: 0
Sample3: gene1: 8902; gene2: 2199; gene3: 722
- Each expression value corresponds to a scalar.
- Each sample corresponds to a vector.
- All samples form a matrix M with S rows (samples) and N columns (genes).
- M[s,n] corresponds to the expression of gene n in sample s.
Computers like numbers
In R:
- Scalar (numeric)
- Vector (array)
- Matrix (multidimensional array, e.g. S x N)
Gene expression data are naturally numerical, which makes them especially well suited to computers.
Many other biological objects can be encoded as vectors or matrices:
- Protein/DNA sequences <-> vectors of letters/numbers
- Protein structures <-> vectors/matrices of 3D coordinates
- Interactions <-> N x N matrix of 1s and 0s
- Image <-> matrix of pixels (1/0 for a two-color image)
- Set of measurements <-> vector of values
How to think about these data in a computer
In R, once you load your data into a matrix (M), you can very easily:
- Print one specific column: M[,2]
- Print one specific row: M[1,]
- Plot the values of two genes against each other to see how they correlate: plot(M[,5], M[,7])
- Perform operations on rows or columns.
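To make the indexing concrete, here is a minimal sketch that builds the three-sample, three-gene toy table from the earlier slide as an R matrix. The row and column names, and the variables gene_means and sample_totals, are added for illustration only.

```r
# Toy expression matrix: 3 samples (rows) x 3 genes (columns)
M <- matrix(c( 254, 1284, 7234,
                 5,  362,    0,
              8902, 2199,  722),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Sample1", "Sample2", "Sample3"),
                            c("gene1", "gene2", "gene3")))

M[, 2]     # column 2: gene2 in every sample
M[1, ]     # row 1: every gene in Sample1
M[2, 3]    # one scalar: gene3 in Sample2
dim(M)     # number of samples, number of genes

gene_means    <- colMeans(M)  # an operation on columns: average per gene
sample_totals <- rowSums(M)   # an operation on rows: total count per sample
```

With real data the same commands apply unchanged; only the dimensions grow.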
Let's practice
- Create an empty directory Tutorial_Gfeller and a subdirectory Tutorial_Gfeller/Data.
- Download the file GSE93722_RAW.tar at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse93722
- Put it in Tutorial_Gfeller/Data/, uncompress it, then uncompress the zip files it contains. Each file corresponds to the gene expression profile of one melanoma sample.
- Open RStudio. Set the working directory (Session -> Set Working Directory) to Tutorial_Gfeller.
- Create a new R script (File -> New File -> R Script); this is where you will write your code. Save it in Tutorial_Gfeller as file.r.
Let's load the data
- Each GSMxxx corresponds to one sample.
- First have a look at the files in Excel (or any text editor). To start with, we will focus on the expected_count column.
- The command to load a file is read.delim():
    m1 <- read.delim("Data/GSE93722_RAW/GSM2461003_LAU125.genes.results.txt")
  Here m1 is the name of the object that will store the data, and the quoted string is the path to the file to be loaded (adjust it if the file names on your disk differ).
- Then execute the command in the Console (by pasting it, or with Command+Enter from the script).
- Now you can look at the elements of m1 (e.g., for the first row, type m1[1,] in the console). Does it correspond to the first line of the file?
- With dim(m1) you can check the dimensions of m1.
Let's load the data
- Load the other files into m2 (LAU1255), m3 (LAU1314) and m4 (LAU355).
- Build a matrix taking the fifth column of each file:
    M <- matrix(nrow=4, ncol=dim(m1)[1])   # Initialize an empty matrix with the correct dimensions
    M[1,] <- m1[,5]                        # In the first row, put the 5th column of m1
- Do the same with m2, m3 and m4 (if you had many files, you would use a loop, see exercises).
- Try to query any entry of your matrix (e.g., M[3,5]). Do you get the expected number?
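The loop mentioned above could look like the sketch below. So that it runs anywhere, it first writes four tiny stand-in files to a temporary folder; the folder, the file names, the column names and the values are all hypothetical, and only the layout (expected_count in the fifth column) follows the slide. With the real data you would point list.files() at your Data folder instead.

```r
# Hypothetical stand-in for the unpacked files: four tiny tables with
# expected_count in the 5th column, written to a temporary folder.
data_dir <- file.path(tempdir(), "toy_counts")
dir.create(data_dir, showWarnings = FALSE)
for (k in 1:4) {
  toy <- data.frame(gene_id = c("g1", "g2", "g3"),
                    transcript_id = c("t1", "t2", "t3"),
                    length = c(1000, 2000, 500),
                    effective_length = c(900, 1900, 400),
                    expected_count = c(10, 20, 30) * k)
  write.table(toy, file.path(data_dir, paste0("sample", k, ".genes.results.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# The loop itself: list the files, then fill one row of M per file.
files <- list.files(data_dir, pattern = "genes.results.txt$", full.names = TRUE)
first <- read.delim(files[1])
M <- matrix(nrow = length(files), ncol = nrow(first))
for (k in seq_along(files)) {
  mk <- read.delim(files[k])
  M[k, ] <- mk[, 5]          # 5th column = expected_count
}
```

The loop body is the same line as for m1, with the file index k replacing the hard-coded row number.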
Genes have (many) names
- In these files, we have Ensembl gene IDs. We want to convert them to common gene names.
- We need a file with the mapping (two columns, one for Ensembl IDs, one for gene names).
- Go to: https://www.ensembl.org/biomart/martview/
- Select Ensembl Genes 90, then Human genes.
- In Attributes, select GENE: -> Gene stable ID and EXTERNAL: -> HGNC symbol.
- Click on Results, tick "Unique results only", and click Go to save to a local file (put the file in Tutorial_Gfeller/Data).
Then in R
- Open the file:
    mapping <- read.delim("Data/mart_export.txt")
- Use the match() function to find the position in mapping of all the genes for which you have expression data in m1:
    i <- match(m1[,1], mapping[,1])
- Then build a vector with the gene names:
    gene <- as.character(mapping[i,2])
    N <- length(gene)
- Verify that the mapping is correct by checking a few examples.
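A small sketch of what match() returns may help when checking the mapping. The IDs and gene names below are made up for illustration; the point is that the result follows the order of the expression file and contains NA for IDs missing from the mapping.

```r
# Hypothetical mapping table (made-up IDs), same two-column layout as the export
mapping <- data.frame(id = c("ID_A", "ID_B", "ID_C", "ID_D"),
                      symbol = c("GeneA", "GeneB", "GeneC", "GeneD"))
# IDs in the order they appear in the expression file (one is absent from the mapping)
expr_ids <- c("ID_C", "ID_A", "ID_X", "ID_B")

i <- match(expr_ids, mapping[, 1])   # position of each ID in the mapping, NA if absent
gene <- as.character(mapping[i, 2])  # gene names in the same order as expr_ids
sum(is.na(gene))                     # how many IDs could not be converted
```

Counting the NA entries is a quick first check before looking at individual genes.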
Computers like simple and sequential calculations
- Additions/subtractions and multiplications/divisions.
- You need to decompose any problem into a set of simple operations.
- You need to tell the computer about every step of your calculations (e.g., loop over all entries in one column).
Example: find the average expression of a gene (e.g., EGFR) across samples.
How to do it on a computer
1) Have a matrix M with all expression values and a vector gene with the names of the genes (columns of M).
2) Find the column corresponding to your gene:
    n <- which(gene == "EGFR")
3) Initialize a scalar:
    av <- 0
4) Go through each element of the column M[,n]:
    S <- dim(M)[1]
    for(s in 1:S){ av <- av + M[s,n] }
5) Normalize your value:
    av <- av/S
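As a runnable illustration of these five steps, the sketch below applies them to a small made-up matrix (the expression values are hypothetical) and compares the result with R's built-in mean().

```r
# Toy data (hypothetical values): 4 samples x 3 genes
gene <- c("TP53", "EGFR", "CD19")
M <- matrix(c(10, 200, 0,
              20, 100, 5,
              30, 400, 1,
              40, 300, 2), nrow = 4, byrow = TRUE)

n <- which(gene == "EGFR")             # column of the gene of interest
S <- dim(M)[1]                         # number of samples
av <- 0
for (s in 1:S) { av <- av + M[s, n] }  # accumulate the sum over samples
av <- av / S                           # divide by the number of samples

av_builtin <- mean(M[, n])             # same result with the built-in function
```

Writing the loop once is a good way to understand what mean() does; in practice the built-in is shorter and less error-prone.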
How programming languages work
- The exact commands change between programming languages (R, Python, Perl, C, MATLAB), but the logic remains the same ("grammar").
- Learning the syntax ("words") can be done with many online resources.
- In these two days, we will focus on R, since it is very convenient for graphical visualization of the data.
- R has many built-in functions (e.g., mean() for the average), but it is important to understand the logic.
Computational analyses
Points to consider when mapping the reads (>100M reads, e.g. ACCTAG, CGGTAA, ATGGCA, TGGGAC, TATAGG) to the reference transcriptome (Gene A, Gene B):
- Alignments
- Isoforms (splicing)
- Low complexity regions (repeats)
- Variable regions (TCR, MHC)
- Sequencing errors
- Poorly annotated regions / genomes
What else needs to be considered
- Different samples can have different total numbers of reads (e.g., different sequencing depth). [Figure: reads on Gene A and Gene B in Sample 1 and Sample 2]
- Longer genes have more reads. [Figure: reads on Gene A and Gene B]
- If you want to compare expression between samples, you need to renormalize by the total number of reads.
- If you want to compare expression between genes, you need to renormalize by gene length.
How to do it (naïve way)
[Figure: matrix M with row totals 10 362 093, 12 482 546 and 7 542 733]
    S <- dim(M)[1]
    N <- dim(M)[2]
    M.norm <- matrix(nrow=S, ncol=N)   # Initialize an empty matrix
    for( s in 1:S ){
      tot <- 0
      for (n in 1:N){
        tot <- tot + M[s,n]            # Compute the sum over row s
      }
      for (n in 1:N){
        M.norm[s,n] <- M[s,n]/tot      # Normalize row s
      }
    }
    M.norm <- M.norm*1000000           # Avoid having too small numbers
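The same normalization can be written without explicit loops. The sketch below is one possible vectorized form on a made-up count matrix (values are hypothetical), followed by the sanity check that every row sums to one million.

```r
# Toy count matrix (hypothetical values): 3 samples x 4 genes
M <- matrix(c(100, 300, 500, 100,
               10,  30,  50,  10,
                0, 250, 250, 500), nrow = 3, byrow = TRUE)

# Dividing a matrix by a vector of length nrow(M) divides row s by element s
M.norm <- M / rowSums(M) * 1e6

rowSums(M.norm)   # sanity check: every row should now sum to one million
```

Samples 1 and 2 differ only by sequencing depth (a factor of 10), so their normalized rows come out identical.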
A few names commonly used
- Raw counts: number of reads mapping to a gene.
- Scaled counts: raw counts after renormalization by the total number of counts in the sample.
- Reads Per Kilobase per Million mapped reads (RPKM): divide by the total number of reads and then by the gene length (in kilobases). Multiply by 1 000 000 to have numbers that are easier to read.
- Transcripts Per Million (TPM): divide by gene length and then normalize across all genes (i.e. the sum of the TPMs of all genes is the same for all samples).
Scaled counts vs TPM vs RPKM
- TPM are increasingly used. The sum is always equal to 10^6 in TPM.
- Within a sample, the two values (TPM vs RPKM) are equivalent, up to a renormalizing factor.
- Scaled counts are enough to compare the same gene in different samples.
- TPM/RPKM are required to compare different genes.
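The relation between the two measures can be checked on a toy sample. The counts and gene lengths below are hypothetical, and the formulas follow the definitions given on the previous slide.

```r
# Toy data (hypothetical): read counts for 3 genes in one sample, gene lengths in bases
counts  <- c(geneA = 500,  geneB = 1000, geneC = 500)
lengths <- c(geneA = 1000, geneB = 4000, geneC = 500)

# RPKM: reads per kilobase of gene, per million reads in the sample
rpkm <- counts / (lengths / 1000) / (sum(counts) / 1e6)

# TPM: length-normalize first, then rescale so that the sample sums to one million
rate <- counts / lengths
tpm  <- rate / sum(rate) * 1e6

sum(tpm)      # one million by construction
tpm / rpkm    # the same factor for every gene of this sample
```

Because the factor depends on the sample, RPKM values of different samples do not sum to the same total, whereas TPM values do.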
Studying expression of some gene in two types of samples
[Figure: column M[,n] split into group G1 (blue box) and group G2 (red box)]
1) Define the groups:
    G1 <- c(1,2); G2 <- c(3,4)
2) Find the column corresponding to the gene:
    n <- which(gene == "CD19")
3) Take the mean over the blue box:
    av1 <- 0; for(s in G1) { av1 <- av1 + M.norm[s,n] }; av1 <- av1/length(G1)
4) Take the mean over the red box:
    av2 <- 0; for(s in G2) { av2 <- av2 + M.norm[s,n] }; av2 <- av2/length(G2)
5) Compare expression.
6) With more samples you can do statistics (t-test, boxplot, see exercises).
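For step 6, a minimal sketch with more samples per group might look as follows. The expression values are invented for illustration, and the Welch t-test (the default of t.test()) is only one possible choice of test.

```r
# Hypothetical normalized expression of one gene in two groups of samples
expr_G1 <- c(120, 135, 110, 128, 140, 118)
expr_G2 <- c( 60,  75,  52,  80,  66,  71)

av1 <- mean(expr_G1)
av2 <- mean(expr_G2)

# Welch two-sample t-test (R's default): are the two group means different?
tt <- t.test(expr_G1, expr_G2)
tt$p.value

# Side-by-side boxplot of the two groups
boxplot(list(G1 = expr_G1, G2 = expr_G2), ylab = "Normalized expression")
```

With only two samples per group, as in the tutorial data, such a test has very little power; the example assumes six per group.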
2 - Differential expression
[Figure: expression level of a gene in samples S1 and S2]
How can we quantify these differences?
Differential expression
Log fold change: highly expressed genes can show big differences in counts (10 000 to 20 000) compared to lowly expressed genes (10 to 20), even if they experience the same relative change. It is better to use logarithms: 10 -> 20 and 10 000 -> 20 000 both correspond to a log2 fold change of 1.
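The numbers on this slide can be reproduced directly in R. The short sketch below contrasts the absolute difference with the log2 fold change; the variable names are chosen for illustration.

```r
# Same relative change at very different expression levels
low  <- c(before = 10,    after = 20)
high <- c(before = 10000, after = 20000)

diff_low  <- low[["after"]]  - low[["before"]]     # absolute difference: 10
diff_high <- high[["after"]] - high[["before"]]    # absolute difference: 10000

lfc_low  <- log2(low[["after"]]  / low[["before"]])   # log2 fold change: 1
lfc_high <- log2(high[["after"]] / high[["before"]])  # log2 fold change: 1

# A halving gives -1, so up- and down-regulation are symmetric around 0
lfc_down <- log2(10 / 20)
```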
Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
[Figure: three panels showing expression levels in two conditions]
Differences in the mean values are not enough!
Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
[Figure: expression levels 1 vs 2 in the first case, 1 000 vs 2 000 in the second]
Depending on your random model, the first case may be more likely to appear by chance.
Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
[Figure: expression levels 1 vs 2 in the first case, 1 000 vs 2 000 in the second]
Advanced statistical methods have been developed to estimate P-values in RNA-Seq data!
Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
[Figure: expression levels of Gene 1 to Gene 11]
Many genes (20 000) => many tests => higher chances that some differences are just due to chance.
Tools for differential expression
- Accurate estimation of P-values aims at taking these different issues into account when testing the hypothesis that the expression values come from the same distribution, or have the same mean, in two conditions.
- Consider the multiple testing problem.
- Tools in R: edgeR, DESeq2.
- Typical output: a table P with one row per gene (20 000 genes) and the columns gene, mean, log-fold change, P-value, adjusted P-value.
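The multiple testing problem can be illustrated with simulated P-values. In the sketch below no gene is truly different, so every raw P-value below 0.05 is a false positive; the Benjamini-Hochberg method used for the adjustment is an assumption made here for illustration, since the slides do not name a specific correction.

```r
set.seed(1)                     # reproducible toy example
n_genes <- 20000
# P-values of genes with NO real difference are uniform between 0 and 1
p_null <- runif(n_genes)

sum(p_null < 0.05)              # roughly 5% of 20 000 pass by chance alone

# Benjamini-Hochberg adjustment (one common choice; an assumption here)
p_adj <- p.adjust(p_null, method = "BH")
sum(p_adj < 0.05)               # few or none survive the correction
```

This is why results are filtered on the adjusted P-value column rather than on the raw one.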
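How to show your results?
[Figure: log-fold change against mean, with genes with P_adj < 0.05 and genes with P_adj >= 0.05 in different colors]
Columns of P: gene, mean, log-fold change, P-value, adjusted P-value.
How to plot this on your computer?
1) Select genes with P_adj >= 0.05:
    ind1 <- which( P[,5] >= 0.05 )
2) Plot these points, with axis limits that cover all genes:
    plot( P[ind1, 2], P[ind1, 3], xlim=range(P[,2], na.rm=TRUE), ylim=range(P[,3], na.rm=TRUE) )
3) Select genes with P_adj < 0.05:
    ind2 <- which( P[,5] < 0.05 )
4) Add these points to the same graph:
    points( P[ind2, 2], P[ind2, 3], col="red" )   # points() overlays on the existing plot and keeps the same axes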
3 - Visualizing high-dimensional data
- Each sample can be considered as a point in a very high-dimensional space (N dimensions).
- In this high-dimensional space, are some samples more similar to each other?
  - Replicates
  - Similar cell types
  - Cancer subtypes
Example in 3D (i.e. 3 genes)
[Figure: samples S1 to S5 plotted along the axes Gene 1, Gene 2 and Gene 3]
Visually, you can see that:
- S1, S3, S4 are similar to each other.
- S2, S5 are similar to each other.
Can you quantify it?
- Distance
- Angle (correlation)
Distances - How would you do it on a computer?
[Figure: samples S1 to S5 plotted along the axes Gene 1, Gene 2 and Gene 3]
    S1 <- c(5, 6, -1)
    S2 <- c(-2, 5, 3)
    d12 <- 0
    for(i in 1:3){ d12 <- d12 + (S1[i]-S2[i])**2 }
    d12 <- sqrt(d12)
Here we used ** to take the square of a number (^ is the more common form in R) and the sqrt() function for the square root.
What if you have 20 000 genes?
- Very hard to visualize.
- You can still compute distances:
    d12 <- 0
    N <- length(S1)
    for(i in 1:N){ d12 <- d12 + (S1[i]-S2[i])**2 }
    d12 <- sqrt(d12)
This is a big advantage of using programming languages, compared to Excel (or manual calculations).
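The loop generalizes to any number of genes, and R also offers vectorized shortcuts. The sketch below uses the five toy samples of this section; sqrt(sum(...)) and dist() compute the same Euclidean distance as the loop, and cor() gives the correlation mentioned earlier as the "angle" view of similarity.

```r
# The five toy samples used in this section (3 genes each)
S1 <- c(5, 6, -1);      S2 <- c(-2, 5, 3);  S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3);  S5 <- c(-2.2, 5.3, 3.1)

d12 <- sqrt(sum((S1 - S2)^2))     # vectorized form of the loop

mat <- rbind(S1, S2, S3, S4, S5)  # one sample per row
D <- as.matrix(dist(mat))         # all pairwise Euclidean distances at once
D["S1", "S2"]
D["S1", "S3"]                     # S1 and S3 are much closer than S1 and S2

r12 <- cor(S1, S2)                # Pearson correlation between two samples
```

The same lines work unchanged when each sample has 20 000 values instead of 3.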
Visualization
- Distances are still not very intuitive.
- If you have many points (S), the number of pairwise distances is S(S-1)/2.
- Idea: project the data in 2D, so that the projection optimally represents the raw data (gene expression profiles) in the N-dimensional space.
2D projection: the good choice
[Figure: samples S1 to S5 in the 3D space (Gene 1, Gene 2, Gene 3) with the axes PC1 and PC2, and their projection in 2D onto the plane (PC1, PC2)]
2D projection: the bad choice
[Figure: samples S1 to S5 in the 3D space (Gene 1, Gene 2, Gene 3) with a different choice of axes PC1 and PC2, and their projection in 2D onto that plane]
Principal Component Analysis (PCA)
[Figure: samples S1 to S5 in the 3D space (Gene 1, Gene 2, Gene 3) with the axes PC1 and PC2]
How to select the 2D plane on which to project the data?
- Intuitive idea: take the axes with the largest variance or dispersion (Principal Components).
- The math behind it is not simple (eigenvalue decomposition of the covariance matrix) but does not depend on the number of genes (dimension).
- You do not need to understand the math to use it.
How to do it on your computer
In R, use the function prcomp() (stats package).
Each point in space:
    S1 <- c(5, 6, -1)
    S2 <- c(-2, 5, 3)
    S3 <- c(5.5, 6.5, -1.3)
    S4 <- c(4, 6.5, -0.3)
    S5 <- c(-2.2, 5.3, 3.1)
Coordinates along the x, y, z axes:
    x <- c(S1[1], S2[1], S3[1], S4[1], S5[1])
    y <- c(S1[2], S2[2], S3[2], S4[2], S5[2])
    z <- c(S1[3], S2[3], S3[3], S4[3], S5[3])
Plot the data in 3D:
    library(rgl)
    plot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))
Make a PCA analysis:
    mat <- t(matrix(c(S1, S2, S3, S4, S5), nrow=3))   # Make a matrix with each point in one row
    pca = prcomp(mat)
    plot(pca$x[,1], pca$x[,2])
See practical this afternoon.
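For readers curious about the eigenvalue decomposition mentioned earlier, the sketch below redoes the PCA of the five toy points by hand and compares it with prcomp(). It is an illustration under the assumption of default prcomp() settings (centering, no scaling); the names eig, centered, scores and var_explained are introduced here.

```r
S1 <- c(5, 6, -1);      S2 <- c(-2, 5, 3);  S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3);  S5 <- c(-2.2, 5.3, 3.1)
mat <- rbind(S1, S2, S3, S4, S5)

pca <- prcomp(mat)                  # centers the columns by default

# Same thing "by hand": eigen-decomposition of the covariance matrix
eig <- eigen(cov(mat))
centered <- scale(mat, center = TRUE, scale = FALSE)
scores <- centered %*% eig$vectors  # coordinates of the samples along the PCs

# Fraction of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

The two sets of coordinates agree up to the sign of each axis, which is arbitrary in PCA. For these points the first component carries most of the variance, matching the visual impression of two groups.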
Now let's look at the tumor expression data
Run:
    pca = prcomp(M.norm)
    # Plot the samples along the two first components
    plot(pca$x[,1], pca$x[,2])
What do you see? Does it make sense in light of the expression of CD19?
Principal component analysis: some discussions
[Figure: data in the plane (Gene 1, Gene 2) with the axis PC1]
- The axes with the largest variance do not necessarily reflect the structures in the data.
- In PCA, the principal components are always orthogonal (linear method).
- It is often useful to make sure the mean of the samples is at 0.
Many refinements/alternatives
- In PCA, only select a subset of genes (high expression, high variability, ...).
- Multi-dimensional scaling (MDS): plot the points in 2D so that distances in the original space are best preserved (R function cmdscale()).
- t-distributed Stochastic Neighbor Embedding (t-SNE): very popular these days (R package tsne).
- Non-linear techniques (not a simple projection).
- All these techniques are fully unsupervised: they do not need to know what your data are, which clusters you should expect, ...
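As a small illustration of MDS, the sketch below applies cmdscale() from the stats package to the five toy points used for the PCA example. Classical MDS on Euclidean distances is assumed here; other distance measures or MDS variants are possible.

```r
S1 <- c(5, 6, -1);      S2 <- c(-2, 5, 3);  S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3);  S5 <- c(-2.2, 5.3, 3.1)
mat <- rbind(S1, S2, S3, S4, S5)

d <- dist(mat)               # pairwise Euclidean distances in the original 3D space
mds <- cmdscale(d, k = 2)    # classical MDS: 2D coordinates that best preserve d

plot(mds[, 1], mds[, 2], xlab = "MDS 1", ylab = "MDS 2")
text(mds[, 1], mds[, 2], labels = rownames(mat), pos = 3)

# How well are the distances preserved in 2D?
max(abs(dist(mds) - d))
```

With Euclidean distances, classical MDS gives the same picture as PCA up to the sign of the axes; the two methods differ once another distance is used.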
How to choose?
- Start with PCA.
- If you know what your samples are (e.g., different cell types), you can try to play a bit with the parameters (e.g., choice of genes, choice of algorithm) to obtain meaningful clusters.
- Upside: you find optimal parameters that best capture the signal in your data. => Allows you to discover new things.
- Risk: you overfit your data and see only what you want to see (even if it is not there). => Prevents you from seeing anything new.
Where to access gene expression data
- GEO: largest collection of gene expression data (microarray, RNA-Seq). Often has counts (not only raw data).
- ENA (European Nucleotide Archive): large collection of raw RNA-Seq data (bam files).
- ArrayExpress: functional genomics data.
See exercises this afternoon.
Where can we access cancer gene expression data
- TCGA: large collection of tumor RNA-Seq, Exome-Seq, methylation, clinical information, ...
- More than 10 000 patients with sequenced tumors.
See exercises tomorrow.
General remarks about programming
- Computers like numbers and simple operations. You need to decompose complex tasks into simple steps.
- Learning a programming language takes time, but you do not need to know everything before starting. First understand the logic, then use books or online resources for the syntax.
- Data analysis takes time. Analyzing large datasets is often more challenging than producing them.
General remarks about programming
- Many ways of making many mistakes! We all make mistakes.
- You need to check your outputs when you write code. If you normalize the rows of a matrix, check that the row sums are truly equal.
- If there is something incoherent in your output, always go back and find the mistake (do not attribute it to noise), even if the data come from a bioinformatics expert.
General remarks about programming
- In the beginning, writing a script is a big investment compared to using Excel.
- But in the long run, it allows you to go much faster and to quickly analyze many datasets without having to redo everything each time.
- Many analyses cannot be done in Excel, while R provides many packages that you can use.
How to get support for bioinformatics analyses of gene expression data
- Sequencing facility: GTF (Keith Harshman). Standard pipelines for normalization and PCA.
- Bioinformatics core facility (Delorenzi) or Vital-IT (Xenarios).
- Very specific analyses: groups working in computational biology.
Questions?