Introduction to Cancer Genomics

Gene expression data analysis, part I
David Gfeller
Computational Cancer Biology, Ludwig Center for Cancer Research
david.gfeller@unil.ch

Overview
1. Basic understanding of RNA-Seq data processing.
2. Differential expression.
3. Dimensionality reduction.
With examples of R code.

Goals
- Help you understand what can be done with a computer -> programming logic.
- Give you a basic idea of how to ask the computer to perform some tasks -> syntax.
- Show you a few examples of gene expression data analysis in R that you can reuse for your projects (see also the practical).

Gene expression experiments
- Microarrays: a chip with DNA probes that pair with the DNA (reverse-transcribed RNA) in a sample. Intensity is measured as a light signal. Very popular in 2000-2010.
- RNA-Seq: directly counts how many transcripts (mRNA molecules) originate from each gene in a sample. Increasingly replacing microarrays for gene expression analyses.

RNA-Seq
Workflow: RNA -> fragmentation -> reverse transcription -> adaptors + amplification -> sequencing (>100M reads, e.g. ACCTAG, CGGTAA, ATGGCA, TGGGAC, TATAGG) -> map to the reference transcriptome (Gene A, Gene B).
- Gene expression => quite easy (count the reads).
- Gene fusion => more difficult (especially for new fusion events).
- Splicing => more difficult (especially for poorly annotated isoforms).

1 - Typical output of RNA-Seq
Raw sequences:
- Fastq format (sequence of the reads + quality information)
Processed data:
- Counts: number of reads mapping to each gene/transcript.
- Bam format (compressed)
- Sra format (compressed)

How to think about these data in a computer
Sample1: gene1: 254; gene2: 1284; gene3: 7234
Sample2: gene1: 5; gene2: 362; gene3: 0
Sample3: gene1: 8902; gene2: 2199; gene3: 722
- Each expression value corresponds to a scalar.
- Each sample corresponds to a vector.
- All samples form a matrix M with S samples (rows) and N genes (columns).
- M[s,n] corresponds to the expression of gene n in sample s.

Computers like numbers
In R:
- Scalar (numeric)
- Vector (array)
- Matrix (multidimensional array, e.g. S x N)
Gene expression data are naturally digitized, which makes them especially well suited to computers.
Many other biological objects can be digitized as vectors or matrices:
- Protein/DNA sequences <-> vectors of letters/numbers
- Protein structures <-> vectors/matrices of 3D coordinates
- Interactions <-> N x N matrix with 1's and 0's
- Image <-> matrix of pixels (1/0 for a two-color image)
- Set of measurements <-> vector of values

How to think about these data in a computer
In R, once you load your data into a matrix (M), you can very easily:
- Print one specific column: M[,2]
- Print one specific row: M[1,]
- Plot the correlation of two genes: plot(M[,5], M[,7])
- Perform operations on rows or columns.
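As a minimal sketch of the indexing operations listed above, the toy counts from the earlier slide can be typed in by hand. The row and column labels and the helper names col_means and row_sums are illustrative choices, not part of the course files.

```r
# Toy expression matrix: 3 samples (rows) x 3 genes (columns),
# using the example counts shown earlier.
M <- matrix(c( 254, 1284, 7234,
                 5,  362,    0,
              8902, 2199,  722),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Sample1", "Sample2", "Sample3"),
                            c("gene1", "gene2", "gene3")))

M[, 2]    # one column: expression of gene2 in every sample
M[1, ]    # one row: expression of every gene in Sample1
M[3, 2]   # one entry: gene2 in Sample3

col_means <- colMeans(M)   # operation on columns: mean expression per gene
row_sums  <- rowSums(M)    # operation on rows: total counts per sample
```

With only three genes, plot(M[,1], M[,2]) plays the role of the plot(M[,5], M[,7]) call above.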

Let's practice
- Create an empty directory Tutorial_Gfeller and a subdirectory Tutorial_Gfeller/Data.
- Download the file GSE93722_RAW.tar at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse93722
- Put it in Tutorial_Gfeller/Data/, extract the archive, and then uncompress the zip files it contains. Each of the files corresponds to the gene expression profile of a melanoma sample.
- Open RStudio. Set the working directory (Session -> Set Working Directory) to Tutorial_Gfeller.
- Create a new R script file (File -> New File -> R Script); this is where you will write your code. Save it in Tutorial_Gfeller as file.r.

Let's load the data
- Each GSMxxx corresponds to one sample.
- First have a look at the files in Excel (or any text editor). To start with, we will focus on the expected_count column.
- The command to load a file is read.delim():
    m1 <- read.delim("Data/GSE93722_RAW/GSM2461003_LAU125.genes.results.txt")
  Here m1 is the name of the object that will store the data, and the quoted string is the path to the file to be loaded.
- Then execute the command in the Console (by pasting it or with command+enter).
- Now you can look at the elements of m1 (e.g., for the first line, type m1[1,] in the console). Does it correspond to the first line of the file?
- With dim(m1) you can check the dimensions of m1.

Let's load the data
- Load the other files into m2 (LAU1255), m3 (LAU1314) and m4 (LAU355).
- Build a matrix taking the fifth column in each file:
    M <- matrix(nrow=4, ncol=dim(m1)[1])   # Initialize an empty matrix with the correct dimensions
    M[1,] <- m1[,5]                        # In the first row, put the 5th column of m1
- Do the same with m2, m3 and m4 (if you had many files, you would use a loop, see exercises).
- Try to query any entry of your matrix (e.g., M[3,5]). Do you get the expected number?
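The loop mentioned above could look like the following sketch. So that it runs without the GEO download, it first writes four small stand-in files to a temporary directory. Their file names, gene IDs and counts are made up, and the column layout (gene ID in column 1, expected_count in column 5) is an assumption modelled on the description above. The sketch also assumes that every file lists the genes in the same order.

```r
# Hypothetical stand-in files (made-up values), one per sample.
tmp <- tempfile("toy_counts_")
dir.create(tmp)
samples <- c("LAU125", "LAU1255", "LAU1314", "LAU355")
for (k in seq_along(samples)) {
  toy <- data.frame(gene_id          = c("g1", "g2", "g3"),
                    transcript_id    = c("t1", "t2", "t3"),
                    length           = c(1000, 2000, 500),
                    effective_length = c(900, 1900, 400),
                    expected_count   = c(10, 20, 30) * k)
  write.table(toy, file.path(tmp, paste0(samples[k], ".genes.results.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# The loop itself: one row of M per file, filled from the fifth column.
files <- file.path(tmp, paste0(samples, ".genes.results.txt"))
first <- read.delim(files[1])
M <- matrix(nrow = length(files), ncol = nrow(first))
for (s in seq_along(files)) {
  m <- read.delim(files[s])
  M[s, ] <- m[, 5]
}
rownames(M) <- samples

unlink(tmp, recursive = TRUE)   # remove the temporary stand-in files
```

With the real data, files would instead hold the paths of the downloaded files, for example obtained with list.files().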

Genes have (many) names
- In these files, we have Ensembl gene IDs. We want to convert them to common gene names.
- We need a file with the mapping (two columns, one for Ensembl IDs, one for gene names).
- Go to: https://www.ensembl.org/biomart/martview/
- Select Ensembl Genes 90, then Human genes.
- In Attributes, select GENE: -> Gene stable ID and EXTERNAL: -> HGNC symbol.
- Click on Results, then Unique results only, and Go to save to a local file (put the file in Tutorial_Gfeller/Data).

Then in R
- Open the file:
    mapping <- read.delim("Data/mart_export.txt")
- Use the match() function to find the position in mapping of all the genes for which you have expression data in m1:
    i <- match(m1[,1], mapping[,1])
- Then build a vector with the gene names:
    gene <- as.character(mapping[i,2])
    N <- length(gene)
- Verify that the mapping is correct by checking a few examples.
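To see what match() returns, here is a small sketch with made-up identifiers. The IDs and symbols below are placeholders, not real Ensembl or HGNC entries, and ids stands in for the first column of m1.

```r
# Made-up stand-ins for the expression file IDs and the BioMart export.
ids <- c("ENSG_c", "ENSG_a", "ENSG_x", "ENSG_b")   # order as in the expression file
mapping <- data.frame(id     = c("ENSG_a", "ENSG_b", "ENSG_c"),
                      symbol = c("GeneA", "GeneB", "GeneC"))

i <- match(ids, mapping[, 1])         # position of each ID in mapping (NA if absent)
gene <- as.character(mapping[i, 2])   # gene names, in the order of the expression file

i      # 3 1 NA 2
gene   # "GeneC" "GeneA" NA "GeneB"
```

An ID that is missing from the mapping file gives NA, so counting sum(is.na(gene)) is a quick way to check how many genes could not be converted.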

Computers like simple and sequential calculations
- Additions/subtractions and multiplications/divisions.
- You need to decompose any problem into a set of simple operations.
- You need to tell the computer about every step of your calculations (e.g., loop over all entries in one column).
Example: find the average expression of a gene (e.g., EGFR) across samples.

How to do it on a computer
1) Have a matrix M with all expression values and a vector gene with the names of the genes (columns of M).
2) Find the column corresponding to your gene:
    n <- which(gene == "EGFR")
3) Initialize a scalar:
    av <- 0
4) Go through each element of the column M[,n]:
    S <- dim(M)[1]
    for(s in 1:S){
      av <- av + M[s,n]
    }
5) Normalize your value:
    av <- av/S
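The five steps can be run end to end on a toy matrix. The expression values below are made up for illustration, and the built-in mean() is shown only as a cross-check of the explicit loop.

```r
# Made-up data: 3 samples (rows) x 3 genes (columns).
gene <- c("TP53", "EGFR", "CD19")
M <- matrix(c(10, 200, 3,
              20, 400, 5,
              30, 600, 1), nrow = 3, byrow = TRUE)

n <- which(gene == "EGFR")   # column of the gene of interest
S <- dim(M)[1]               # number of samples
av <- 0
for (s in 1:S) {
  av <- av + M[s, n]         # accumulate the sum over samples
}
av <- av / S                 # divide by the number of samples

av_builtin <- mean(M[, n])   # same result with the built-in function
```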

How programming languages work
- The exact commands change between programming languages (R, Python, Perl, C, Matlab), but the logic (the "grammar") remains the same.
- Learning the syntax (the "words") can be done with many online resources.
- In these two days, we will focus on R, since it is very convenient for graphical visualization of the data.
- R has many built-in functions (e.g., mean()), but it is important to understand the logic.

Typical output of RNA-Seq
Raw sequences:
- Fastq format (sequence of the reads + quality information)
Processed data:
- Counts: number of reads mapping to each gene/transcript.
- Bam format (compressed)
- Sra format (compressed)

Computational analyses
- Alignments
- Isoforms (splicing)
- Low complexity regions (repeats)
- Variable regions (TCR, MHC)
- Sequencing errors
- Poorly annotated regions / genomes
(Figure: >100M reads, e.g. ACCTAG, CGGTAA, ATGGCA, TGGGAC, TATAGG, mapped to the reference transcriptome, Gene A and Gene B.)

What else needs to be considered
- Different samples can have different total numbers of reads (e.g., different sequencing depth). (Figure: reads on Gene A and Gene B in Sample 1 and Sample 2.)
- Longer genes have more reads. (Figure: reads on Gene A and Gene B.)
- If you want to compare expression between samples, you need to renormalize by the total number of reads.
- If you want to compare expression between genes, you need to renormalize by gene length.

How to do it (naïve way)
(Figure: matrix M with the total number of reads in each row, e.g. 10 362 093, 12 482 546 and 7 542 733.)
    S <- dim(M)[1]
    N <- dim(M)[2]
    M.norm <- matrix(nrow=S, ncol=N)   # Initialize an empty matrix
    for(s in 1:S){
      tot <- 0
      for(n in 1:N){
        tot <- tot + M[s,n]            # Compute the sum over row s
      }
      for(n in 1:N){
        M.norm[s,n] <- M[s,n]/tot      # Normalize row s
      }
    }
    M.norm <- M.norm*1000000           # Avoid having too small numbers
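Once the logic of the double loop is clear, the same normalization can be written without loops. The sketch below uses the toy counts from the earlier slide and base R's rowSums(). It relies on R recycling the vector of totals down the columns, so that every entry of row s is divided by the total of row s.

```r
# Toy counts: 3 samples (rows) x 3 genes (columns).
M <- matrix(c( 254, 1284, 7234,
                 5,  362,    0,
              8902, 2199,  722), nrow = 3, byrow = TRUE)

tot <- rowSums(M)          # total number of reads per sample
M.norm <- M / tot * 1e6    # each row divided by its own total, then scaled

rowSums(M.norm)            # every row now sums to one million
```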

A few names commonly used
- Raw counts: number of reads mapping to a gene.
- Scaled counts: raw counts after renormalization by the total number of counts in the sample.
- Reads Per Kilobase per Million mapped reads (RPKM): divide by the total number of reads and then by the gene length (in kilobases). Multiply by 1 000 000 to have numbers that are easier to read.
- Transcripts Per Million (TPM): divide by gene length and then normalize across all genes (i.e., the sum of the TPMs of all genes is the same for all samples).

Scaled counts vs TPM vs RPKM
- TPM are increasingly used. The sum over all genes is always equal to 10^6 in TPM.
- Within a sample, the two values (TPM vs RPKM) are equivalent, up to a renormalizing factor.
- Scaled counts are enough to compare the same gene in different samples.
- TPM/RPKM are required to compare different genes.
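The relation between the two measures can be checked on a toy example. The counts and gene lengths below are made up, and the second sample is simply the first one sequenced twice as deeply. The variable names (counts, len_kb, rpkm, tpm) are illustrative.

```r
# Made-up counts: 2 samples (rows) x 3 genes (columns).
counts <- matrix(c(100, 300,  600,
                   200, 600, 1200), nrow = 2, byrow = TRUE)
len_kb <- c(1, 2, 3)   # made-up gene lengths, in kilobases

# RPKM: scale by the total reads per sample (in millions), then by gene length.
per_million <- rowSums(counts) / 1e6
rpkm <- sweep(counts / per_million, 2, len_kb, "/")

# TPM: divide by gene length first, then rescale each sample to sum to 1e6.
rate <- sweep(counts, 2, len_kb, "/")
tpm <- rate / rowSums(rate) * 1e6

rowSums(tpm)    # 1e6 for every sample
rowSums(rpkm)   # not fixed in general
```

In this sketch tpm equals rpkm / rowSums(rpkm) * 1e6, which is the sample-specific renormalizing factor mentioned above. Both measures also remove the twofold difference in sequencing depth between the two samples.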

Studying expression of some gene in two types of samples
(Figure: column M[,n] of the matrix, with the samples of group G1 in a blue box and those of group G2 in a red box.)
1) Define the groups:
    G1 <- c(1,2); G2 <- c(3,4)
2) Find the column corresponding to the gene:
    n <- which(gene == "CD19")
3) Take the mean over the blue box (G1):
    av1 <- 0; for(s in G1){ av1 <- av1 + M.norm[s,n] }; av1 <- av1/length(G1)
4) Take the mean over the red box (G2):
    av2 <- 0; for(s in G2){ av2 <- av2 + M.norm[s,n] }; av2 <- av2/length(G2)
5) Compare expression.
6) With more samples you can do statistics (t-test, boxplot, see exercises).
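As a sketch of step 6, here is the same comparison with three samples per group and a t-test. The expression values are made up, and using the default Welch two-sample test from t.test() is an assumption of this example, not a recommendation from the slides.

```r
# Made-up normalized expression: 6 samples (rows) x 2 genes (columns).
gene <- c("CD19", "EGFR")
M.norm <- cbind(c(52, 48, 50, 10, 12, 8),   # CD19
                c( 5,  6,  4,  5,  6, 4))   # EGFR

G1 <- 1:3   # first group of samples
G2 <- 4:6   # second group of samples
n <- which(gene == "CD19")

av1 <- mean(M.norm[G1, n])   # mean expression in group 1
av2 <- mean(M.norm[G2, n])   # mean expression in group 2

tt <- t.test(M.norm[G1, n], M.norm[G2, n])   # Welch two-sample t-test (R default)
tt$p.value

# boxplot(list(G1 = M.norm[G1, n], G2 = M.norm[G2, n]))   # visual comparison
```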

2 - Differential expression
How can we quantify these differences?
(Figure: expression level of a gene in samples S1 and S2.)

Differential expression
Log fold change: highly expressed genes can show big differences in counts (10 000 to 20 000) compared to lowly expressed genes (10 to 20), even if they experience the same relative change. It is better to use logarithms: 10 -> 20 and 10 000 -> 20 000 both correspond to a log2 fold change of 1.
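The two cases quoted above can be worked through directly. The variable names in this sketch are illustrative.

```r
low  <- c(before = 10,    after = 20)      # lowly expressed gene
high <- c(before = 10000, after = 20000)   # highly expressed gene

# Absolute differences are very different ...
diff_low  <- low[["after"]]  - low[["before"]]    # 10
diff_high <- high[["after"]] - high[["before"]]   # 10000

# ... but the log2 fold changes are the same.
lfc_low  <- log2(low[["after"]]  / low[["before"]])    # 1
lfc_high <- log2(high[["after"]] / high[["before"]])   # 1
```

A log2 fold change of 1 means a doubling, -1 a halving, and 0 no change.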

Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
(Figure: three panels of expression levels.)
Differences in the mean values are not enough!

Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
(Figure: expression levels of 2 vs 1 in the first case and 2 000 vs 1 000 in the second.)
Depending on your random model, the first case may be more likely to appear by chance.

Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
(Figure: expression levels of 2 vs 1 in the first case and 2 000 vs 1 000 in the second.)
Advanced statistical methods have been developed to estimate P-values in RNA-Seq data!

Differential expression
P-value: gives a statistical significance, but is not trivial to estimate.
(Figure: expression levels of Gene 1 to Gene 11.)
Many genes (20 000) => many tests => higher chances that the differences are just due to chance.

Tools for differential expression
- Accurate estimation of P-values aims at taking these different issues into account when testing the hypothesis that the expression values come from the same distribution, or have the same mean, in two conditions.
- Consider the multiple testing problem.
- Tools in R: edgeR, DESeq2.
(Table P: one row per gene, 20 000 genes, with columns gene, mean, log-fold change, P-value, adjusted P-value.)
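The multiple testing problem can be illustrated with a small simulation in base R. Everything here is an assumption of the sketch: the P-values are simulated for 20 000 genes with no true difference, the seed is arbitrary, and the Benjamini-Hochberg method of p.adjust() is used as one common choice of adjustment.

```r
set.seed(42)       # arbitrary seed, for reproducibility only
n_genes <- 20000

# Simulated null case: no gene is truly different,
# so the P-values are uniform between 0 and 1.
p <- runif(n_genes)
n_raw <- sum(p < 0.05)                # roughly 5% of the genes, by chance alone

p_adj <- p.adjust(p, method = "BH")   # Benjamini-Hochberg adjusted P-values
n_adj <- sum(p_adj < 0.05)            # typically few or none remain
```

Without adjustment, about a thousand genes would be called significant even though nothing is differentially expressed in this simulation.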

How to show your results?
(Figure: scatter plot distinguishing genes with P_adj < 0.05 from genes with P_adj >= 0.05. Table P has columns gene, mean, log-fold change, P-value, adjusted P-value.)
How to plot this on your computer?
1) Select genes with P_adj >= 0.05:
    ind1 <- which( P[,5] >= 0.05 )
2) Plot these points:
    plot( P[ind1, 2], P[ind1, 3] )
3) Select genes with P_adj < 0.05:
    ind2 <- which( P[,5] < 0.05 )
4) Plot these points:
    par(new=TRUE)   # This is to overlay the graphs
    plot( P[ind2, 2], P[ind2, 3], col="red" )
Note that the two plot() calls must use the same xlim and ylim for the overlaid points to share the same axes.
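A variant of these four steps that keeps both groups of points on the same axes is sketched below. The table P is made up (six genes, invented values), and using points() instead of a second plot() call is a substitution made for this example. The pdf(NULL) and dev.off() lines only make the sketch runnable without a display and can be removed to see the plot.

```r
# Made-up results table with the same column order as above.
P <- data.frame(gene = paste0("g", 1:6),
                mean = c(10, 200, 35, 1200, 80, 5),
                lfc  = c(0.1, 2.5, -0.3, -3.0, 0.2, 1.8),
                pval = c(0.8, 0.0001, 0.5, 0.00001, 0.3, 0.04),
                padj = c(0.9, 0.001, 0.7, 0.0002, 0.6, 0.2))

ind1 <- which(P[, 5] >= 0.05)   # not significant
ind2 <- which(P[, 5] <  0.05)   # significant

pdf(NULL)   # null graphics device: nothing is written to disk
plot(P[ind1, 2], P[ind1, 3], log = "x",
     xlim = range(P[, 2]), ylim = range(P[, 3]),   # axes cover all genes
     xlab = "mean expression", ylab = "log fold change")
points(P[ind2, 2], P[ind2, 3], col = "red")        # added to the same axes
dev.off()
```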

3 - Visualizing high-dimensional data
- Each sample can be considered as a point in a very high-dimensional space (N dimensions).
- In this high-dimensional space, are some samples more similar to each other?
  - Replicates
  - Similar cell types
  - Cancer subtypes

Example in 3D (i.e. 3 genes)
(Figure: samples S1 to S5 plotted along the axes Gene 1, Gene 2 and Gene 3.)
Visually, you can see that:
- S1, S3, S4 are similar to each other.
- S2, S5 are similar to each other.
Can you quantify it?
- Distance
- Angle (correlation)

Distances - How would you do it on a computer?
(Figure: samples S1 to S5 plotted along the axes Gene 1, Gene 2 and Gene 3.)
    S1 <- c(5, 6, -1)
    S2 <- c(-2, 5, 3)
    d12 <- 0
    for(i in 1:3){
      d12 <- d12 + (S1[i]-S2[i])**2
    }
    d12 <- sqrt(d12)
Here we used ** for taking the square of a number (^ is equivalent in R) and the sqrt() function for the square root.

What if you have 20 000 genes?
- Very hard to visualize.
- You can still compute distances:
    d12 <- 0
    N <- length(S1)
    for(i in 1:N){
      d12 <- d12 + (S1[i]-S2[i])**2
    }
    d12 <- sqrt(d12)
- This is a big advantage of using programming languages, compared to Excel (or manual calculations).
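Because R works on whole vectors, the loop can also be written in one line, and the same line works for 3 or 20 000 genes. The sketch below compares both forms on the two 3D points used earlier. The cor() call is added here as an illustration of the "angle (correlation)" view, since Pearson correlation is the cosine of the angle between the mean-centered vectors.

```r
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)

# Loop version, as above.
d12_loop <- 0
for (i in 1:length(S1)) {
  d12_loop <- d12_loop + (S1[i] - S2[i])^2
}
d12_loop <- sqrt(d12_loop)

# Vectorized version: same result, any number of genes.
d12_vec <- sqrt(sum((S1 - S2)^2))   # sqrt(66)

# Correlation between the two samples.
r12 <- cor(S1, S2)
```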

Visualization
- Distances are still not very intuitive.
- If you have many points (S), the number of pairwise distances is S(S-1)/2.
- Idea: project the data in 2D, so that the projection optimally represents the raw data (gene expression profiles) in the N-dimensional space.
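The count S(S-1)/2 can be verified with base R's dist(), which computes all pairwise Euclidean distances between the rows of a matrix. The five points are the 3D example samples used in these slides, and the object names are illustrative.

```r
# Five samples (rows) described by three genes (columns).
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

D <- dist(mat)           # Euclidean distance between every pair of samples
n_pairs <- length(D)     # S(S-1)/2 = 10 values for S = 5

Dm <- as.matrix(D)       # same distances as a full S x S matrix
Dm["S1", "S2"]           # far apart
Dm["S1", "S3"]           # close together
```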

2D projection: the good choice
(Figure: samples S1 to S5 in the space of Gene 1, Gene 2 and Gene 3 with the axes PC1 and PC2, and their projection in 2D onto PC1 and PC2.)

2D projection: the bad choice
(Figure: samples S1 to S5 in the space of Gene 1, Gene 2 and Gene 3 with another choice of the axes PC1 and PC2, and their projection in 2D onto these axes.)

Principal Component Analysis (PCA)
(Figure: samples S1 to S5 in the space of Gene 1, Gene 2 and Gene 3, with the axes PC1 and PC2.)
How to select the 2D plane on which to project the data?
- Intuitive idea: take the axes with the largest variance or dispersion (Principal Components).
- The math behind it is not simple (eigenvalue decomposition of the covariance matrix), but it does not depend on the number of genes (dimension).
- You do not need to understand the math to use it.

How to do it on your computer
In R, use the function prcomp (stats package).
Each point in space:
    S1 <- c(5, 6, -1)
    S2 <- c(-2, 5, 3)
    S3 <- c(5.5, 6.5, -1.3)
    S4 <- c(4, 6.5, -0.3)
    S5 <- c(-2.2, 5.3, 3.1)
Coordinates along the x, y, z axes:
    x <- c(S1[1], S2[1], S3[1], S4[1], S5[1])
    y <- c(S1[2], S2[2], S3[2], S4[2], S5[2])
    z <- c(S1[3], S2[3], S3[3], S4[3], S5[3])
Plot the data in 3D:
    library(rgl)
    plot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))
Make a PCA analysis:
    mat <- t(matrix(c(S1, S2, S3, S4, S5), nrow=3))   # Make a matrix with each point in one row
    pca = prcomp(mat)
    plot(pca$x[,1], pca$x[,2])
See practical this afternoon.
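For readers curious about the math, the link between prcomp and the eigenvalue decomposition of the covariance matrix can be checked on the same five points. This is only a sketch using base R's cov() and eigen(). The names centered, eig and scores are illustrative, and the comparison is made up to the sign of each axis, which is arbitrary in PCA.

```r
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

# By hand: center the data, then decompose the covariance matrix.
centered <- scale(mat, center = TRUE, scale = FALSE)
eig <- eigen(cov(centered))           # eigenvalues = variances along the PCs
scores <- centered %*% eig$vectors    # coordinates of the samples on the PCs

# With the built-in function.
pca <- prcomp(mat)

pca$sdev^2    # same as eig$values
```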

Now let's look at the tumor expression data
Run:
    pca = prcomp(M.norm)
    # Plot the samples along the two first components
    plot(pca$x[,1], pca$x[,2])
What do you see? Does it make sense in light of the expression of CD19?

Principal component analysis: some discussion
(Figure: data points along the axes Gene 1 and Gene 2, with the axis PC1.)
- The axes with the largest variance do not necessarily reflect the structures in the data.
- In PCA, the principal components are always orthogonal (linear method).
- It is often useful to make sure the mean of the samples is at 0.
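Two of these points can be inspected directly in R: prcomp() centers the data by default (center = TRUE), and the variance captured by each component is available from its sdev field. The sketch below reuses the five 3D example points, and var_explained is an illustrative name.

```r
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

pca <- prcomp(mat)            # center = TRUE is the default

# After centering, the samples have mean 0 along every component.
max(abs(colMeans(pca$x)))     # effectively zero

# Fraction of the total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# The components are orthogonal: t(R) %*% R is the identity matrix.
orth <- crossprod(pca$rotation)
```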

Many refinements/alternatives
- In PCA, only select a subset of genes (high expression, high variability, ...).
- Multi-dimensional scaling (MDS): plot the points in 2D so that distances in the original space are best preserved (R function cmdscale, stats package).
- t-distributed Stochastic Neighbor Embedding (t-SNE): very popular these days (R package tsne). Non-linear technique (not a simple projection).
- All these techniques are fully unsupervised: they do not need to know what your data are, which clusters you should expect, ...
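As a minimal sketch of MDS, cmdscale() from the stats package takes a distance matrix and returns 2D coordinates. The example reuses the five 3D points. With Euclidean distances, classical MDS gives the same picture as PCA up to the sign of each axis, which the last lines compare.

```r
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

D <- dist(mat)              # pairwise Euclidean distances between samples
mds <- cmdscale(D, k = 2)   # classical MDS: coordinates in 2 dimensions

# plot(mds[, 1], mds[, 2]); text(mds[, 1], mds[, 2], rownames(mds), pos = 3)

# Comparison with the first two principal components (signs may differ).
pca <- prcomp(mat)
same_picture <- all.equal(abs(unname(mds)), abs(unname(pca$x[, 1:2])))
```

Other distances (for example one minus the correlation) can be passed to cmdscale() in the same way, in which case the result no longer coincides with PCA.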

How to choose?
- Start with PCA.
- If you know what your samples are (e.g., different cell types), you can try to play a bit with parameters (e.g., choice of genes, choice of algorithm) to obtain meaningful clusters.
- On the one hand, you find optimal parameters that best capture the signal in your data => allows you to discover new things.
- On the other hand, you may overfit your data: you see only what you want to see (even if it is not there) => prevents you from seeing anything new.

Where to access gene expression data
- GEO: largest collection of gene expression data (microarray, RNA-Seq). Often has counts (not only raw data).
- ENA (European Nucleotide Archive): large collection of raw RNA-Seq data (bam files).
- ArrayExpress: functional genomics data.
See exercises this afternoon.

Where can we access cancer gene expression data
- TCGA: large collection of tumor RNA-Seq, Exome-Seq, methylation, clinical information, ...
- > 10 000 patients with sequenced tumors.
See exercises tomorrow.

General remarks about programming
- Computers like numbers and simple operations. You need to decompose complex tasks into simple steps.
- Learning a programming language takes time, but you do not need to know everything before starting. First understand the logic, then use books or online resources for the syntax.
- Data analysis takes time. Analyzing large datasets is often more challenging than producing them.

General remarks about programming
- There are many ways of making many mistakes! We all make mistakes.
- You need to check your outputs when you write code. If you normalize the rows of a matrix, check that the row sums are truly equal.
- If there is something incoherent in your output, always go back to find the mistake (do not attribute it to noise), even if the data come from a bioinformatics expert.
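Such a check can be written into the script itself, so that it stops as soon as something is wrong. The sketch below uses base R's stopifnot() and all.equal() on made-up counts. Comparing with all.equal() rather than == is a choice made here to tolerate floating-point rounding.

```r
# Made-up counts: 2 samples (rows) x 3 genes (columns).
M <- matrix(c(4, 6, 10,
              1, 1,  2), nrow = 2, byrow = TRUE)

M.norm <- M / rowSums(M) * 1e6   # normalize every row to one million

# Stops with an error if any row does not sum to one million (within rounding).
expected <- rep(1e6, nrow(M.norm))
stopifnot(isTRUE(all.equal(rowSums(M.norm), expected)))
```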

General remarks about programming
- In the beginning, it is a big investment to write a script rather than using Excel.
- But in the long run, it allows you to go much faster and to quickly analyze many datasets without having to redo everything each time.
- Many analyses cannot be done in Excel, while R provides many packages that you can use.

How to get support for bioinformatics analyses of gene expression data
- Sequencing facility: GTF (Keith Harshman).
- Standard pipelines for normalizing and PCA.
- Bioinformatics core facility (Delorenzi) or Vital-IT (Xenarios).
- Very specific analyses: groups working in computational biology.

Questions?