Introduction to Cancer Genomics
- Douglas Little
- 6 years ago
1 Introduction to Cancer Genomics. Gene expression data analysis, part I. David Gfeller, Computational Cancer Biology, Ludwig Center for Cancer Research. david.gfeller@unil.ch
2 Overview. 1. Basic understanding of RNA-Seq data processing. 2. Differential expression. Examples of R code. 3. Dimensionality reduction.
3 Goals. Help you understand what can be done with a computer (programming logic). Give you a basic idea of how to ask the computer to perform some tasks (syntax). Show you a few examples of gene expression data analysis in R that you can reuse for your projects (see also the practical).
4 Gene expression experiments. Microarrays: chip with DNA probes that pair with the DNA (reverse-transcribed RNA) in a sample. Intensity is measured as a light signal. Very popular in ( ). RNA-Seq: directly counts how many transcripts (mRNA molecules) originate from each gene in a sample. Increasingly replacing microarrays for gene expression analyses.
5 RNA-Seq. Workflow: RNA -> fragmentation -> reverse transcription -> adaptors + amplification -> sequencing (>100M reads, e.g. ACCTAG, CGGTAA, ATGGCA, TGGGAC, TATAGG) -> map to reference transcriptome (Gene A, Gene B). Gene expression => quite easy (count the reads). Gene fusion => more difficult (especially for new fusion events). Splicing => more difficult (especially for poorly annotated isoforms).
6 Part 1 - Typical output of RNA-Seq. Raw sequences: - FASTQ format (sequence of the reads + quality information). Processed data: - Counts: number of reads mapping to each gene/transcript. - BAM format (compressed). - SRA format (compressed).
7 How to think about these data in a computer. Sample1: gene1: 254; gene2: 1284; gene3: 7234. Sample2: gene1: 5; gene2: 362; gene3: 0. Sample3: gene1: 8902; gene2: 2199; gene3: 722. Each expression value corresponds to a scalar. Each sample corresponds to a vector. All samples form a matrix M with S rows (samples) and N columns (genes). M[s,n] corresponds to the expression of gene n in sample s.
8 Computers like numbers. In R: - Scalar (numeric) - Vector (array) - Matrix (multidimensional array, e.g. S x N). Gene expression data are naturally numerical, which makes them especially well suited to analysis with computers. Many other biological objects can be encoded as vectors or matrices: - Protein/DNA sequences <-> vectors of letters/numbers - Protein structures <-> vectors/matrices of 3D coordinates - Interactions <-> N x N matrix with 1s and 0s - Image <-> matrix of pixels (1/0 for a two-color image) - Set of measurements <-> vector of values.
9 How to think about these data in a computer. In R, once you load your data into a matrix (M), you can very easily: - Print one specific column: M[,2] - Print one specific row: M[1,] - Plot the correlation of two genes: plot(M[,5], M[,7]) - Perform operations on rows or columns.
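As a minimal sketch of the indexing described above, the following R snippet builds the small 3 x 3 matrix from the counts shown on slide 7 and queries it. The row and column names are added here for readability and are not part of the original slides.

```r
# Toy expression matrix: 3 samples (rows) x 3 genes (columns),
# filled with the counts listed on slide 7.
M <- matrix(c( 254, 1284, 7234,
                 5,  362,    0,
              8902, 2199,  722),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Sample1", "Sample2", "Sample3"),
                            c("gene1", "gene2", "gene3")))

M[, 2]    # one column: gene2 in all samples
M[1, ]    # one row: all genes in Sample1
M[2, 3]   # one scalar: gene3 in Sample2
dim(M)    # number of samples, number of genes
```

With only three genes there is no column 5 or 7, so the plot() call from the slide is left out of this toy example.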
10 Let's practice. Create an empty directory Tutorial_Gfeller and a subdirectory Tutorial_Gfeller/Data. Download the file GSE93722_RAW.tar at: Put it in Tutorial_Gfeller/Data/, uncompress it, and then uncompress the zip files it contains. Each of the files corresponds to the gene expression profiling of a melanoma sample. Open RStudio. Set the working directory (Session -> Set Working Directory) to Tutorial_Gfeller. Create a new R script file (File -> New File -> R Script); this is where you will write your code. Save it in Tutorial_Gfeller as file.r.
11 Let's load the data. Each GSMxxx corresponds to one sample. First have a look at the files in Excel (or any text editor). To start with, we will focus on the expected_count column. The command to load a file is read.delim():
    m1 <- read.delim("Data/GSE93722_RAW/GSMxxx_LAU125.genes.results.txt")
Here m1 is the name of the object that will store the data, and the quoted string is the path to the file to be loaded (GSMxxx stands for the sample accession at the start of the file name). Then execute the command in the Console (by pasting it or pressing command+enter). Now you can look at the elements of m1 (e.g., for the first row, type m1[1,] in the console). Does it correspond to the first line of the file? With dim(m1) you can check the dimensions of m1.
12 Let's load the data. Load the other files into m2 (LAU1255), m3 (LAU1314) and m4 (LAU355). Build a matrix taking the fifth column in each file:
    M <- matrix(nrow=4, ncol=dim(m1)[1])   # Initialize an empty matrix with the correct dimensions
    M[1,] <- m1[,5]                        # In the first row, put the 5th column of m1
Do the same with m2, m3 and m4 (if you had many files, you would use a loop, see exercises). Try to query any entry of your matrix (e.g., M[3,5]). Do you get the expected number?
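The loop mentioned above could look roughly like the following sketch. So that it runs anywhere, it first writes four small made-up tables to a temporary directory. The file names, the column layout (expected_count in the fifth column) and the values are assumptions for illustration, not the real GSE93722 files.

```r
# Hypothetical stand-in for the downloaded files: four small tab-delimited
# tables with expected_count as the fifth column.
tmp <- tempfile("toy_counts_")
dir.create(tmp)
set.seed(1)
for (k in 1:4) {
  toy <- data.frame(gene_id = paste0("ENSG", 1:6),
                    transcript_id = "t",
                    length = 1000,
                    effective_length = 900,
                    expected_count = rpois(6, lambda = 50 * k))
  write.table(toy, file.path(tmp, paste0("sample", k, ".genes.results.txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# List the files, then fill one row of M per file.
files <- list.files(tmp, pattern = "genes.results.txt$", full.names = TRUE)
first <- read.delim(files[1])
M <- matrix(nrow = length(files), ncol = nrow(first))
for (s in seq_along(files)) {
  m <- read.delim(files[s])
  M[s, ] <- m[, 5]          # fifth column of file s goes into row s
}
dim(M)
```

For the real data, pointing list.files() at the directory that holds the uncompressed files should be enough, provided all files list the genes in the same order.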
13 Genes have (many) names. In these files, we have Ensembl gene IDs. We want to convert them to common gene names. We need a file with the mapping (two columns, one for Ensembl IDs, one for gene names). Go to: Select Ensembl Genes 90, then Human genes. In Attributes, select GENE: -> Gene stable ID and EXTERNAL: -> HGNC symbol. Click on Results, tick Unique results only, then click Go to save the results to a local file (put the file in Tutorial_Gfeller/Data).
14 Then in R. Open the file:
    mapping <- read.delim("Data/mart_export.txt")
Use the match() function to find the position in mapping of all the genes for which you have expression data in m1:
    i <- match(m1[,1], mapping[,1])
Then build a vector with the gene names:
    gene <- as.character(mapping[i,2])
    N <- length(gene)
Verify that the mapping is correct by checking a few examples.
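To see what match() returns, here is a small sketch on made-up identifiers (the IDs and the ID-to-symbol pairs below are hypothetical placeholders, not real Ensembl entries). It also shows that an ID absent from the mapping yields NA, which is worth checking for in the real data.

```r
# Hypothetical IDs in the order they appear in the expression file
ids <- c("ENSG_B", "ENSG_C", "ENSG_A", "ENSG_X")

# Hypothetical mapping table (column 1: ID, column 2: gene symbol)
mapping <- data.frame(id     = c("ENSG_A", "ENSG_B", "ENSG_C"),
                      symbol = c("EGFR", "CD19", "TP53"),
                      stringsAsFactors = FALSE)

i <- match(ids, mapping[, 1])        # position of each ID in the mapping
gene <- as.character(mapping[i, 2])  # gene names, in the order of ids

i              # 2 3 1 NA
gene           # "CD19" "TP53" "EGFR" NA
sum(is.na(i))  # number of IDs without a gene name
```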
15 Computers like simple and sequential calculations: additions/subtractions and multiplications/divisions. You need to decompose any problem into a set of simple operations. You need to tell the computer about every step of your calculations (e.g., loop over all entries in one column). Example: find the average expression of a gene (e.g., EGFR) across samples.
16 How to do it on a computer (example: gene = EGFR)
1) Have a matrix M with all expression values and a vector gene with the names of the genes (columns of M).
2) Find the column corresponding to your gene:
    n <- which(gene == "EGFR")
3) Initialize a scalar:
    av <- 0
4) Go through each element of the column M[,n]:
    S <- dim(M)[1]
    for(s in 1:S){ av <- av + M[s,n] }
5) Normalize your value:
    av <- av/S
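The five steps above can be run on a toy matrix and compared with R's built-in mean(). The expression values below are invented for illustration.

```r
# Toy data: 4 samples (rows) x 3 genes (columns)
gene <- c("TP53", "EGFR", "CD19")
M <- matrix(c(10, 200, 0,
              30, 400, 5,
              20, 300, 1,
              40, 100, 2),
            nrow = 4, byrow = TRUE)

n <- which(gene == "EGFR")   # column of the gene of interest
av <- 0
S <- dim(M)[1]
for (s in 1:S) {
  av <- av + M[s, n]         # accumulate the sum over samples
}
av <- av / S                 # divide by the number of samples

av_builtin <- mean(M[, n])   # same result in one call
all.equal(av, av_builtin)
```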
17 How programming languages work. The exact commands change between programming languages (R, Python, Perl, C, MATLAB), but the logic (the "grammar") remains the same. Learning the syntax (the "words") can be done with many online resources. In these two days, we will focus on R, since it is very convenient for graphical visualization of the data. R has many built-in functions (e.g., mean()), but it is important to understand the logic.
18 Typical output of RNA-Seq. Raw sequences: - FASTQ format (sequence of the reads + quality information). Processed data: - Counts: number of reads mapping to each gene/transcript. - BAM format (compressed). - SRA format (compressed).
19 Computational analyses. Alignments: - Isoforms (splicing) - Low-complexity regions (repeats) - Variable regions (TCR, MHC) - Sequencing errors - Poorly annotated regions/genomes. [Figure: >100M reads (ACCTAG, CGGTAA, ATGGCA, TGGGAC, TATAGG) mapped to the reference transcriptome (Gene A, Gene B).]
20 What else needs to be considered. Different samples can have a different total number of reads (e.g., different sequencing depth). [Figure: reads for Gene A and Gene B in Sample 1 and Sample 2.] Longer genes have more reads. [Figure: reads for Gene A and Gene B.] If you want to compare expression between samples, you need to renormalize by the total number of reads. If you want to compare expression between genes, you need to renormalize by gene length.
21 How to do it (naïve way)
    S <- dim(M)[1]; N <- dim(M)[2]
    M.norm <- matrix(nrow=S, ncol=N)                  # Initialize an empty matrix
    for(s in 1:S){
      tot <- 0
      for(n in 1:N){ tot <- tot + M[s,n] }            # Compute the sum over row s
      for(n in 1:N){ M.norm[s,n] <- M[s,n]/tot }      # Normalize row s
    }
    M.norm <- M.norm * 1e6    # Avoid having too small numbers (the factor is missing from this transcript; 1e6 is assumed)
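For comparison with the explicit loops, the same row normalization can be sketched in one line with rowSums(). The scaling factor of 1e6 is an assumption here (the value is not legible in the transcript), and the matrix reuses the toy counts from slide 7.

```r
# Toy counts: 3 samples (rows) x 3 genes (columns)
M <- matrix(c( 254, 1284, 7234,
                 5,  362,    0,
              8902, 2199,  722),
            nrow = 3, byrow = TRUE)

# Dividing a matrix by a vector of length nrow(M) recycles the vector down
# the columns, so every entry of row s is divided by the total of row s.
M.norm <- M / rowSums(M) * 1e6   # 1e6: assumed scaling factor

# Sanity check recommended later in the talk: all row sums must be equal.
rowSums(M.norm)
```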
22 A few names commonly used. Raw counts: number of reads mapping to a gene. Scaled counts: counts after renormalization by the total number of counts in the sample. Reads Per Kilobase per Million mapped reads (RPKM): divide by the total number of reads and then by the gene length, and multiply by a constant scaling factor to have numbers that are easier to read. Transcripts Per Million (TPM): divide by gene length and then normalize across all genes (i.e., the sum of the TPMs of all genes is the same for all samples).
23 Scaled counts vs TPM vs RPKM. TPM are increasingly used. The sum is always equal to 10^6 in TPM. Within a sample, the two values (TPM vs RPKM) are equivalent, up to a renormalizing factor. Scaled counts are enough to compare the same gene in different samples. TPM/RPKM are required to compare different genes.
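A small sketch for a single sample may help to see how the two units relate. It uses the standard definitions (RPKM: reads per kilobase of gene per million reads in the sample; TPM: length-normalized counts rescaled to sum to one million). The counts and gene lengths are invented.

```r
# One toy sample: read counts and gene lengths (in kilobases)
counts <- c(geneA = 100, geneB = 300, geneC = 600)
len_kb <- c(geneA = 1,   geneB = 3,   geneC = 2)

# RPKM: divide by the total number of reads (in millions), then by length
rpkm <- counts / (sum(counts) / 1e6) / len_kb

# TPM: divide by length first, then rescale so that the sample sums to 1e6
rate <- counts / len_kb
tpm  <- rate / sum(rate) * 1e6

sum(tpm)     # equals 1e6 by construction
tpm / rpkm   # the same ratio for every gene of this sample
```

The ratio tpm / rpkm is constant within a sample but generally differs between samples, which is why TPM values are easier to compare across samples than RPKM values.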
24 Studying expression of some gene in two types of samples (groups G1 and G2)
1) Define the groups:
    G1 <- c(1,2); G2 <- c(3,4)
2) Find the column corresponding to the gene:
    n <- which(gene == "CD19")
3) Take the mean over the first group (blue box in the figure):
    av1 <- 0; for(s in G1){ av1 <- av1 + M.norm[s,n] }; av1 <- av1/length(G1)
4) Take the mean over the second group (red box in the figure):
    av2 <- 0; for(s in G2){ av2 <- av2 + M.norm[s,n] }; av2 <- av2/length(G2)
5) Compare expression.
6) With more samples you can do statistics (t-test, boxplot, see exercises).
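With more than two samples per group, the comparison in steps 3 to 6 can be sketched with tapply() and t.test() from base R. The expression values below are invented and only illustrate the calls; they are not taken from the melanoma data.

```r
# Toy normalized expression of one gene in 8 samples (4 per group)
expr  <- c(12, 15, 11, 14, 55, 61, 48, 58)
group <- factor(rep(c("G1", "G2"), each = 4))

av <- tapply(expr, group, mean)   # mean expression per group
av

tt <- t.test(expr ~ group)        # Welch two-sample t-test
tt$p.value
```

A call such as boxplot(expr ~ group) would show the same comparison graphically.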
25 Part 2 - Differential expression. How can we quantify these differences? [Figure: expression level in samples S1 and S2.]
26 Differential expression. Log fold change: highly expressed genes can show big differences in counts ( to ) compared to lowly expressed genes (10 to 20), even if they experience the same relative change. It is better to use logarithms: 10 -> 20 corresponds to a log2 fold change of 1.
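The point about relative changes can be checked numerically. In this sketch the low-expression values (10 and 20) come from the slide, while the high-expression values (1000 and 2000) are hypothetical, since the original numbers are not legible in the transcript.

```r
low  <- c(before = 10,   after = 20)     # from the slide
high <- c(before = 1000, after = 2000)   # hypothetical values

# Absolute differences are very different ...
diff_low  <- unname(low["after"]  - low["before"])    # 10
diff_high <- unname(high["after"] - high["before"])   # 1000

# ... but the log2 fold change is the same for both genes.
lfc_low  <- unname(log2(low["after"]  / low["before"]))
lfc_high <- unname(log2(high["after"] / high["before"]))
c(lfc_low, lfc_high)
```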
27 Differential expression. P-value: gives a statistical significance, but is not trivial to estimate. Differences in the mean values are not enough! [Figure: three panels of expression levels.]
28 Differential expression. P-value: gives a statistical significance, but is not trivial to estimate. Depending on your random model, the first case may be more likely to appear by chance. [Figure: expression levels.]
29 Differential expression. P-value: gives a statistical significance, but is not trivial to estimate. Advanced statistical methods have been developed to estimate P-values in RNA-Seq data! [Figure: expression levels.]
30 Differential expression. P-value: gives a statistical significance, but is not trivial to estimate. Many genes (20 000) => many tests => higher chances that the differences are just due to chance. [Figure: expression levels of Gene 1 to Gene 11.]
31 Tools for differential expression. Tools for accurate estimation of P-values aim to account for these different issues when testing the hypothesis that the expression values come from the same distribution, or have the same mean, in two conditions. They also consider the multiple testing problem. The output is a table P with one row per gene and the columns: gene, mean, log-fold change, P-value, adjusted P-value. Tools in R: - edgeR - DESeq2.
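To illustrate why the adjusted P-value column matters, the following sketch simulates 20 000 tests in which there is no real difference at all and then applies p.adjust() from base R. The Benjamini-Hochberg method is used here as one common choice; this is an illustration of the multiple testing problem, not a description of what any particular tool does internally.

```r
set.seed(1)
n_genes <- 20000

# Under the null hypothesis, P-values are uniformly distributed in [0, 1].
p <- runif(n_genes)

# About 5% of the genes pass P < 0.05 purely by chance.
n_raw <- sum(p < 0.05)
n_raw

# Adjusting for multiple testing (Benjamini-Hochberg) removes most of them.
p_adj <- p.adjust(p, method = "BH")
n_adj <- sum(p_adj < 0.05)
n_adj
```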
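32 How to show your results? Plot the log-fold change against the mean for every gene, with one color for P_adj < 0.05 and another for P_adj >= 0.05. The table P has one row per gene and the columns: gene, mean, log-fold change, P-value, adjusted P-value. How to plot this on your computer?
1) Select genes with P_adj >= 0.05:
    ind1 <- which( P[,5] >= 0.05 )
2) Plot these points, with axes wide enough for all genes:
    plot( P[ind1, 2], P[ind1, 3], xlim=range(P[,2], na.rm=TRUE), ylim=range(P[,3], na.rm=TRUE) )
3) Select genes with P_adj < 0.05:
    ind2 <- which( P[,5] < 0.05 )
4) Add these points to the same graph:
    points( P[ind2, 2], P[ind2, 3], col="red" )   # points() draws on the existing axes, so both groups share the same scale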
33 Part 3 - Visualizing high-dimensional data. Each sample can be considered as a point in a very high-dimensional space (N dimensions). In this high-dimensional space, are some samples more similar to each other? For example: replicates, similar cell types, cancer subtypes.
34 Example in 3D (i.e., 3 genes). [Figure: samples S1 to S5 plotted along the axes Gene 1, Gene 2 and Gene 3.] Visually, you can see that: - S1, S3, S4 are similar to each other. - S2, S5 are similar to each other. Can you quantify it? - Distance - Angle (correlation).
35 Distances - How would you do it on a computer?
    S1 <- c(5, 6, -1)
    S2 <- c(-2, 5, 3)
    d12 <- 0
    for(i in 1:3){ d12 <- d12 + (S1[i]-S2[i])**2 }
    d12 <- sqrt(d12)
Here we used ** for taking the square of a number (in R, ^ is the more common spelling of the same operator) and the sqrt() function for the square root.
36 What if you have N genes? Very hard to visualize. You can still compute distances:
    d12 <- 0
    N <- length(S1)
    for(i in 1:N){ d12 <- d12 + (S1[i]-S2[i])**2 }
    d12 <- sqrt(d12)
This is a big advantage of using programming languages, compared to Excel (or manual calculations).
37 Visualization. Distances are still not very intuitive. If you have many points (S), the number of pairwise distances is S(S-1)/2. Idea: project the data in 2D, so that the projection optimally represents the raw data (gene expression profiles) in the N-dimensional space.
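The loop-based distance and the count of pairwise distances can both be checked with base R. This sketch uses the five 3D points that appear in the PCA example of this talk; dist() and cor() are standard functions of the stats package.

```r
# The five example points (3 genes each)
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)

# Euclidean distance between S1 and S2, written without a loop
d12 <- sqrt(sum((S1 - S2)^2))

# All pairwise distances at once: one row per sample
pts <- rbind(S1, S2, S3, S4, S5)
D <- dist(pts)
D
length(D)       # S(S-1)/2 = 10 distances for S = 5 samples

# The "angle" alternative: correlation between two expression profiles
cor(S1, S3)
```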
38 2D projection: the good choice. [Figure: samples S1 to S5 in the 3D gene space (Gene 1, Gene 2, Gene 3) with the axes PC1 and PC2, and their 2D projection on PC1 and PC2.]
39 2D projection: the bad choice. [Figure: samples S1 to S5 in the 3D gene space (Gene 1, Gene 2, Gene 3) with the axes PC1 and PC2, and their 2D projection on PC1 and PC2.]
40 Principal Component Analysis (PCA). How to select the 2D plane on which to project the data? - Intuitive idea: take the axes with the largest variance or dispersion (principal components). - The math behind it is not simple (eigenvalue decomposition of the covariance matrix) but does not depend on the number of genes (dimension). - You do not need to understand the math to use it. [Figure: samples S1 to S5 in the 3D gene space (Gene 1, Gene 2, Gene 3) with the axes PC1 and PC2.]
41 How to do it on your computer. In R, use the function prcomp (stats package).
    # Each point in space
    S1 <- c(5, 6, -1)
    S2 <- c(-2, 5, 3)
    S3 <- c(5.5, 6.5, -1.3)
    S4 <- c(4, 6.5, -0.3)
    S5 <- c(-2.2, 5.3, 3.1)
    # Coordinates along the x, y, z axes
    x <- c(S1[1], S2[1], S3[1], S4[1], S5[1])
    y <- c(S1[2], S2[2], S3[2], S4[2], S5[2])
    z <- c(S1[3], S2[3], S3[3], S4[3], S5[3])
    # Plot the data in 3D
    library(rgl)
    plot3d(x, y, z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))
    # Make a PCA analysis: first make a matrix with each point in one row
    mat <- t(matrix(c(S1, S2, S3, S4, S5), nrow=3))
    pca <- prcomp(mat)
    plot(pca$x[,1], pca$x[,2])
See practical this afternoon.
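As a short follow-up sketch on the same five points, the prcomp() output can be inspected to see how much of the variance each component captures and which samples fall on the same side of PC1. The row names are added here for readability.

```r
# Same five points as on the slide, one sample per row
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

pca <- prcomp(mat)   # centers each column (gene) by default

# Fraction of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Coordinates of the samples on PC1: samples in the same group share a sign
pc1 <- pca$x[, 1]
sign(pc1)
```

The overall sign of a principal component is arbitrary, so only the grouping of the signs is meaningful, not which group is positive.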
42 Now let's look at the tumor expression data. Run:
    pca <- prcomp(M.norm)
    # Plot the samples along the first two components
    plot(pca$x[,1], pca$x[,2])
What do you see? Does it make sense in light of the expression of CD19?
43 Principal component analysis: some discussion. - The axes with the largest variance do not necessarily reflect the structures in the data. - In PCA, the principal components are always orthogonal (linear method). - It is often useful to make sure the mean of the samples is at 0. [Figure: data in the (Gene 1, Gene 2) plane with the PC1 axis.]
44 Many refinements/alternatives. In PCA, only select a subset of genes (high expression, high variability, ...). Multi-dimensional scaling (MDS): plot the points in 2D so that distances in the original space are best preserved (R function cmdscale() in the stats package). t-distributed Stochastic Neighbor Embedding (t-SNE): very popular these days (R package tsne). Non-linear techniques (not a simple projection). All these techniques are fully unsupervised: they do not need to know what your data are, which clusters you should expect, ...
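As a minimal sketch of the MDS alternative, cmdscale() can be applied to the Euclidean distances between the five example points used earlier in this talk. For Euclidean distances, classical MDS is known to give the same coordinates as PCA up to the sign of each axis, which the last lines check.

```r
# Five example points, one sample per row
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

# Classical multi-dimensional scaling in 2D from pairwise distances
mds <- cmdscale(dist(mat), k = 2)
mds

# Comparison with PCA: same coordinates up to the sign of each axis
pca <- prcomp(mat)
same_axis1 <- all.equal(abs(unname(mds[, 1])), abs(unname(pca$x[, 1])))
same_axis2 <- all.equal(abs(unname(mds[, 2])), abs(unname(pca$x[, 2])))
c(same_axis1, same_axis2)
```

With other distance measures (for example one based on correlation), MDS and PCA no longer coincide, which is one reason to try MDS at all.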
45 How to choose? Start with PCA. If you know what your samples are (e.g., different cell types), you can try to play a bit with the parameters (e.g., choice of genes, choice of algorithm) to obtain meaningful clusters. There are two possible outcomes. Either you find optimal parameters that best capture the signal in your data, which allows you to discover new things. Or you overfit your data and see only what you want to see (even if it is not there), which prevents you from seeing anything new.
46 Where to access gene expression data. GEO: largest collection of gene expression data (microarray, RNA-Seq). Often has counts (not only raw data). ENA (European Nucleotide Archive): large collection of raw RNA-Seq data (BAM files). ArrayExpress: functional genomics data. See exercises this afternoon.
47 Where can we access cancer gene expression data? TCGA: large collection of tumor RNA-Seq, Exome-Seq, methylation, clinical information, ... > patients with sequenced tumors. See exercises tomorrow.
48 General remarks about programming. Computers like numbers and simple operations: you need to decompose complex tasks into simple steps. Learning a programming language takes time, but you do not need to know everything before starting. First understand the logic, then use books or online resources for the syntax. Data analysis takes time: analyzing large datasets is often more challenging than producing them.
49 General remarks about programming. There are many ways of making many mistakes! We all make mistakes. You need to check your outputs when you write code: if you normalize the rows of a matrix, check that the row sums are truly equal. If there is something incoherent in your output, always go back to find the mistake (do not attribute it to "noise"), even if the data come from a bioinformatics expert.
50 General remarks about programming. In the beginning, it is a big investment to write a script rather than using Excel. But in the long run, it allows you to go much faster and to quickly analyze many datasets without having to redo everything each time. Many analyses cannot be done in Excel, while R provides many packages that you can use.
51 How to get support for bioinformatics analyses of gene expression data. Sequencing facility: GTF (Keith Harshman). Standard pipelines for normalizing and PCA. Bioinformatics core facility (Delorenzi) or Vital-IT (Xenarios). Very specific analyses: groups working in computational biology.
52 Questions?
How to store and visualize RNA-seq data
How to store and visualize RNA-seq data Gabriella Rustici Functional Genomics Group gabry@ebi.ac.uk EBI is an Outstation of the European Molecular Biology Laboratory. Talk summary How do we archive RNA-seq
More informationRNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University
RNA-Seq Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University joshua.ainsley@tufts.edu Day four Quantifying expression Intro to R Differential expression
More informationAutomated Bioinformatics Analysis System on Chip ABASOC. version 1.1
Automated Bioinformatics Analysis System on Chip ABASOC version 1.1 Phillip Winston Miller, Priyam Patel, Daniel L. Johnson, PhD. University of Tennessee Health Science Center Office of Research Molecular
More informationColorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi
Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi Although a little- bit long, this is an easy exercise
More informationSequence Analysis Pipeline
Sequence Analysis Pipeline Transcript fragments 1. PREPROCESSING 2. ASSEMBLY (today) Removal of contaminants, vector, adaptors, etc Put overlapping sequence together and calculate bigger sequences 3. Analysis/Annotation
More informationData Processing and Analysis in Systems Medicine. Milena Kraus Data Management for Digital Health Summer 2017
Milena Kraus Digital Health Summer Agenda Real-world Use Cases Oncology Nephrology Heart Insufficiency Additional Topics Data Management & Foundations Biology Recap Data Sources Data Formats Business Processes
More informationCSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo
CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo CSE 6242 A / CS 4803 DVA Feb 12, 2013 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Do Something..
More informationServices Performed. The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples.
Services Performed The following checklist confirms the steps of the RNA-Seq Service that were performed on your samples. SERVICE Sample Received Sample Quality Evaluated Sample Prepared for Sequencing
More informationWhy use R? Getting started. Why not use R? Introduction to R: Log into tak. Start R R or. It s hard to use at first
Why use R? Introduction to R: Using R for statistics ti ti and data analysis BaRC Hot Topics October 2011 George Bell, Ph.D. http://iona.wi.mit.edu/bio/education/r2011/ To perform inferential statistics
More informationRNA-seq. Manpreet S. Katari
RNA-seq Manpreet S. Katari Evolution of Sequence Technology Normalizing the Data RPKM (Reads per Kilobase of exons per million reads) Score = R NT R = # of unique reads for the gene N = Size of the gene
More informationDimension reduction : PCA and Clustering
Dimension reduction : PCA and Clustering By Hanne Jarmer Slides by Christopher Workman Center for Biological Sequence Analysis DTU The DNA Array Analysis Pipeline Array design Probe design Question Experimental
More informationQuantification. Part I, using Excel
Quantification In this exercise we will work with RNA-seq data from a study by Serin et al (2017). RNA-seq was performed on Arabidopsis seeds matured at standard temperature (ST, 22 C day/18 C night) or
More informationUsing R for statistics and data analysis
Introduction ti to R: Using R for statistics and data analysis BaRC Hot Topics October 2011 George Bell, Ph.D. http://iona.wi.mit.edu/bio/education/r2011/ Why use R? To perform inferential statistics (e.g.,
More information/ Computational Genomics. Normalization
10-810 /02-710 Computational Genomics Normalization Genes and Gene Expression Technology Display of Expression Information Yeast cell cycle expression Experiments (over time) baseline expression program
More informationHow do microarrays work
Lecture 3 (continued) Alvis Brazma European Bioinformatics Institute How do microarrays work condition mrna cdna hybridise to microarray condition Sample RNA extract labelled acid acid acid nucleic acid
More informationDimension Reduction CS534
Dimension Reduction CS534 Why dimension reduction? High dimensionality large number of features E.g., documents represented by thousands of words, millions of bigrams Images represented by thousands of
More informationExercise 1 Review. --outfiltermismatchnmax : max number of mismatch (Default 10) --outreadsunmapped fastx: output unmapped reads
Exercise 1 Review Setting parameters STAR --quantmode GeneCounts --genomedir genomedb -- runthreadn 2 --outfiltermismatchnmax 2 --readfilesin WTa.fastq.gz --readfilescommand zcat --outfilenameprefix WTa
More informationDifferential gene expression analysis using RNA-seq
https://abc.med.cornell.edu/ Differential gene expression analysis using RNA-seq Applied Bioinformatics Core, September/October 2018 Friederike Dündar with Luce Skrabanek & Paul Zumbo Day 3: Counting reads
More informationExploratory data analysis for microarrays
Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA
More informationGene Expression Data Analysis. Qin Ma, Ph.D. December 10, 2017
1 Gene Expression Data Analysis Qin Ma, Ph.D. December 10, 2017 2 Bioinformatics Systems biology This interdisciplinary science is about providing computational support to studies on linking the behavior
More informationCompClustTk Manual & Tutorial
CompClustTk Manual & Tutorial Brandon King Copyright c California Institute of Technology Version 0.1.10 May 13, 2004 Contents 1 Introduction 1 1.1 Purpose.............................................
More informationAdvanced RNA-Seq 1.5. User manual for. Windows, Mac OS X and Linux. November 2, 2016 This software is for research purposes only.
User manual for Advanced RNA-Seq 1.5 Windows, Mac OS X and Linux November 2, 2016 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark Contents 1 Introduction
More informationIntroduction to GE Microarray data analysis Practical Course MolBio 2012
Introduction to GE Microarray data analysis Practical Course MolBio 2012 Claudia Pommerenke Nov-2012 Transkriptomanalyselabor TAL Microarray and Deep Sequencing Core Facility Göttingen University Medical
More informationROTS: Reproducibility Optimized Test Statistic
ROTS: Reproducibility Optimized Test Statistic Fatemeh Seyednasrollah, Tomi Suomi, Laura L. Elo fatsey (at) utu.fi March 3, 2016 Contents 1 Introduction 2 2 Algorithm overview 3 3 Input data 3 4 Preprocessing
More informationCLC Server. End User USER MANUAL
CLC Server End User USER MANUAL Manual for CLC Server 10.0.1 Windows, macos and Linux March 8, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark
More informationData Mining - Data. Dr. Jean-Michel RICHER Dr. Jean-Michel RICHER Data Mining - Data 1 / 47
Data Mining - Data Dr. Jean-Michel RICHER 2018 jean-michel.richer@univ-angers.fr Dr. Jean-Michel RICHER Data Mining - Data 1 / 47 Outline 1. Introduction 2. Data preprocessing 3. CPA with R 4. Exercise
More informationOur typical RNA quantification pipeline
RNA-Seq primer Our typical RNA quantification pipeline Upload your sequence data (fastq) Align to the ribosome (Bow>e) Align remaining reads to genome (TopHat) or transcriptome (RSEM) Make report of quality
More informationGene Survey: FAQ. Gene Survey: FAQ Tod Casasent DRAFT
Gene Survey: FAQ Tod Casasent 2016-02-22-1245 DRAFT 1 What is this document? This document is intended for use by internal and external users of the Gene Survey package, results, and output. This document
More informationSingle/paired-end RNAseq analysis with Galaxy
October 016 Single/paired-end RNAseq analysis with Galaxy Contents: 1. Introduction. Quality control 3. Alignment 4. Normalization and read counts 5. Workflow overview 6. Sample data set to test the paired-end
More informationA review of RNA-Seq normalization methods
A review of RNA-Seq normalization methods This post covers the units used in RNA-Seq that are, unfortunately, often misused and misunderstood I ll try to clear up a bit of the confusion here The first
More informationSupplementary Figure 1. Fast read-mapping algorithm of BrowserGenome.
Supplementary Figure 1 Fast read-mapping algorithm of BrowserGenome. (a) Indexing strategy: The genome sequence of interest is divided into non-overlapping 12-mers. A Hook table is generated that contains
More informationRNA-Seq analysis with Astrocyte Differential expression and transcriptome assembly
RNA-Seq analysis with Astrocyte Differential expression and transcriptome assembly Beibei Chen Ph.D BICF 9/28/2016 Agenda Launch Workflows using Astrocyte BICF Workflows BICF RNA-seq Workflow Experimental
More informationTranscript quantification using Salmon and differential expression analysis using bayseq
Introduction to expression analysis (RNA-seq) Transcript quantification using Salmon and differential expression analysis using bayseq Philippine Genome Center University of the Philippines Prepared by
More informationReference guided RNA-seq data analysis using BioHPC Lab computers
Reference guided RNA-seq data analysis using BioHPC Lab computers This document assumes that you already know some basics of how to use a Linux computer. Some of the command lines in this document are
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the
More information11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub
trinityrnaseq / RNASeq_Trinity_Tuxedo_Workshop Trinity De novo Transcriptome Assembly Workshop Brian Haas edited this page on Oct 17, 2015 14 revisions De novo RNA-Seq Assembly and Analysis Using Trinity
More informationTutorial: RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and Expression measures
: RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and February 24, 2014 Sample to Insight : RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and : RNA-Seq Analysis
More informationDifferential Expression
Differential Expression Data In this practical, as before, we will work with RNA-Seq data from Arabidopsis seeds that matured at standard temperature (ST, 22 C day/18 C night) or at high temperature (HT,
More information9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology
9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example
More informationSVM Classification in -Arrays
SVM Classification in -Arrays SVM classification and validation of cancer tissue samples using microarray expression data Furey et al, 2000 Special Topics in Bioinformatics, SS10 A. Regl, 7055213 What
More informationCSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo
CSE 6242 A / CX 4242 DVA March 6, 2014 Dimension Reduction Guest Lecturer: Jaegul Choo Data is Too Big To Analyze! Limited memory size! Data may not be fitted to the memory of your machine! Slow computation!
More informationEasy visualization of the read coverage using the CoverageView package
Easy visualization of the read coverage using the CoverageView package Ernesto Lowy European Bioinformatics Institute EMBL June 13, 2018 > options(width=40) > library(coverageview) 1 Introduction This
More informationRNA-Seq Analysis With the Tuxedo Suite
June 2016 RNA-Seq Analysis With the Tuxedo Suite Dena Leshkowitz Introduction In this exercise we will learn how to analyse RNA-Seq data using the Tuxedo Suite tools: Tophat, Cuffmerge, Cufflinks and Cuffdiff.
More informationChIP-Seq Tutorial on Galaxy
1 Introduction ChIP-Seq Tutorial on Galaxy 2 December 2010 (modified April 6, 2017) Rory Stark The aim of this practical is to give you some experience handling ChIP-Seq data. We will be working with data
More informationExpression Analysis with the Advanced RNA-Seq Plugin
Expression Analysis with the Advanced RNA-Seq Plugin May 24, 2016 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com support-clcbio@qiagen.com
More informationsrap: Simplified RNA-Seq Analysis Pipeline