
Quantification

In this exercise we will work with RNA-seq data from a study by Serin et al. (2017). RNA-seq was performed on Arabidopsis seeds matured at standard temperature (ST, 22 °C day/18 °C night) or at high temperature (HT, 25 °C day/23 °C night). Both conditions have three biological replicates.

The RNA-seq reads are available from the Sequence Read Archive (SRA) at the NCBI:

https://www.ncbi.nlm.nih.gov/traces/study/?acc=srp125020

The reads were downloaded as fastq files using the fastq-dump program from the SRA toolkit:

https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/

The reads were mapped to the Arabidopsis genome using the Hisat2 program and quantified using stringtie and prepde.py to get the read counts per gene (Pertea et al. 2016). A text file called ST_vs_HT.csv with read counts per gene for each of the samples is available at:

http://www.bioinformatics.nl/courses/rnaseq/

Part I, using Excel

Download the file (using "Save target as" from the right mouse button menu) and open it with Excel. You should get a sheet with data in seven columns. The first column contains Arabidopsis gene IDs; columns B-G contain read count data for three replicates from each of the two experimental conditions.

Let us first clean up the data a bit by removing genes that have no counts. In cell H1 type max, and in cell H2 type:

    =MAX(B2:G2)

This gives the maximum count across the six samples for the first gene. Now move your mouse to the bottom right corner of cell H2 until the cursor changes to a black +. If you double-click the +, the formula is copied to all rows that have data. Select columns A-H and click the Filter button on the Data tab. You should now be able to sort the max column from highest to lowest. Remove all rows that have a max count of 0.

Next we will add descriptions to the gene IDs to find out what the genes with the highest counts actually are.
Browse to http://plants.ensembl.org/ and click on BioMart. Choose the database Ensembl Plants Genes and the dataset Arabidopsis thaliana genes. In the menu on the left click on Filters and in the center pane expand GENE. For Limit to genes (external references) select With TAIR ID(s). In the menu on the left click on Attributes and in the center pane expand GENE. Here, make sure only Gene stable ID and Gene description are selected (i.e. deselect Transcript stable ID). If you now click on the Results button you should get a small table with gene IDs and corresponding descriptions. Press the Go button next to Export all results to File TSV, and a file called mart_export.txt should be downloaded by your browser.

Going back to Excel, select the tab for the (empty) Sheet1 and select Import from the File menu to import a text file. Import the gene IDs with their descriptions into Sheet1. Going back to the sheet with the counts, in cell I1 type description and in I2 put the following formula:

    =VLOOKUP(A2,Sheet1!A:B,2,FALSE)

This looks up the gene ID of the row in the first column of the description list on Sheet1 and returns the matching description from the second column. Again, double-click the bottom right corner of the cell to copy the formula to the other rows. Given the origin of the mRNA, do the genes with the highest counts make sense?

Now let's have a look at the data in R, using R Studio.

Part II, using R

We will again have a look at the count data, but this time in R. Open R Studio, create a new R script, and paste the following line into it (without line breaks):

    counts = read.table("http://www.bioinformatics.nl/courses/rnaseq/ST_vs_HT.csv", sep=",", row.names=1, header=TRUE, stringsAsFactors=FALSE)

Select the line and run it. This loads a table with the count data, which you can view using:

    View(counts)

This should look similar to the Excel spreadsheet. To get the number of rows:

    nrow(counts)

To view the counts for a specific gene in a specific sample:

    counts["AT4G22890","ST_1"]

Now we will remove genes without counts, using the maximum count per row:

    mx = apply(counts,1,max)
    counts = counts[mx>0,]
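To see what the row-maximum filter does, it can help to try it first on a small made-up table. The sketch below is only an illustration: the gene names, the column names and all values are invented, and the toy objects are named differently from counts and mx so that they do not overwrite the real data.

```r
# Made-up counts for four hypothetical genes in three samples
# (gene IDs, column names and values are invented for illustration).
toy <- data.frame(
  ST_1 = c(10, 0, 3, 0),
  ST_2 = c(12, 0, 0, 0),
  HT_1 = c(8, 0, 5, 1),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# apply(..., 1, max) runs max() over each row (margin 1) and
# returns a named numeric vector with one value per gene.
toy_max <- apply(toy, 1, max)
print(toy_max)

# Logical indexing keeps only the rows whose maximum is above 0,
# so a gene is dropped only if it has no counts in any sample.
toy_filtered <- toy[toy_max > 0, ]
print(rownames(toy_filtered))
```

Note that a gene with a count in just one sample (geneD in this toy table) is kept, because the filter looks at the maximum over all samples rather than at each sample separately.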

In the filtering step, mx is a vector with the maximum count for each row; the second statement keeps every row for which this value is larger than 0. To check how many rows (genes) we have left:

    nrow(counts)

We will now zoom in on the first two samples, which are biological replicates. First we select the columns:

    ST_1 = counts[,"ST_1"]
    ST_2 = counts[,"ST_2"]

To get an idea of the values we can look at a summary:

    summary(ST_1)

Or plot the values sorted from low to high:

    plot(sort(ST_1))

Or look at the histogram:

    hist(ST_1)

You will probably notice that the plots are not very informative, because a few genes have very many counts. To get a better overview of the counts per gene it helps to first take the logarithm of the counts:

    plot(sort(log2(ST_1)))
    hist(log2(ST_1))

To assess how similar the biological replicates are, we can use a scatterplot:

    plot(log2(ST_1),log2(ST_2))

In this plot every dot represents a gene, with the (log2-transformed) counts in one sample on the x-axis and the counts in the other sample on the y-axis. How similar are the replicates, based on this plot?

The data we have worked with so far were produced by mapping reads onto the genome using the Hisat2 program. Now we will have a look at counts produced by mapping reads onto the transcriptome with the kallisto program. Kallisto performs pseudoalignment and directly returns the counts normalized as tpm (transcripts per million). It also returns estimated counts that should be comparable to the counts we got from Hisat2. However, the counts are now per transcript and not per gene. To load the kallisto data run (again without line breaks):

    kallisto = read.table("http://www.bioinformatics.nl/courses/rnaseq/ST_1_kallisto.tsv", sep="\t", row.names=1, header=TRUE, stringsAsFactors=FALSE)
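The read.table arguments used for loading can be tried on a small inline table. This sketch assumes that the course file follows the usual kallisto abundance layout (columns target_id, length, eff_length, est_counts and tpm); the transcript IDs and all numbers below are invented for illustration.

```r
# A tiny tab-separated table in the assumed layout of a kallisto
# abundance file (transcript IDs and numbers are made up).
toy_tsv <- paste(
  "target_id\tlength\teff_length\test_counts\ttpm",
  "tx1.1\t1000\t801\t50\t400000",
  "tx1.2\t1200\t1001\t25\t160000",
  "tx2.1\t900\t701\t48\t440000",
  sep = "\n"
)

# text= reads from a string instead of a file or URL.
# header=TRUE takes the column names from the first line,
# row.names=1 turns the first column into the row names.
toy_kallisto <- read.table(text = toy_tsv, sep = "\t", row.names = 1,
                           header = TRUE, stringsAsFactors = FALSE)

print(rownames(toy_kallisto))               # transcript IDs
print(colnames(toy_kallisto))               # the remaining four columns
print(toy_kallisto["tx1.2", "est_counts"])  # look up one value by name
```

Because the transcript IDs become row names, individual values can be looked up by name in the same way as in the gene count table.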

You can view the kallisto table in the same way as before:

    View(kallisto)

To get the number of transcripts that were quantified:

    nrow(kallisto)

The tpm value for a transcript is the estimated fraction of all transcripts in the sample represented by that transcript, multiplied by a million to get a more readable number (i.e. 7 instead of 0.000007). What would you expect the sum of the tpm values for all transcripts in a sample to be? Let's find out:

    sum(kallisto$tpm)

Both kallisto and Hisat2 used the same reads, so the counts produced by these tools should (hopefully) be comparable. To check that they are, we first have to aggregate kallisto's transcript counts to gene counts. A simple way to do that is to sum the counts for all transcripts that belong to the same gene. To add a column to the kallisto table that holds the gene for each transcript, we can strip off the last part of the transcript identifier, i.e.:

    AT3G18710.1 -> AT3G18710

The substring function can be used for this:

    kallisto$gene = substring(rownames(kallisto),1,9)

Now we can use the aggregate function to sum the counts:

    kallisto_gene = aggregate(est_counts ~ gene, data=kallisto, sum)

To create a scatterplot of the kallisto and Hisat2 counts we can first combine the data into one table using the merge function:

    combined_counts = merge(kallisto_gene,counts,by.x="gene",by.y=0,all=FALSE)

With the by.x and by.y values we instruct merge to use the 'gene' column of the kallisto table and the row names of the Hisat2 table to combine rows for the same gene. The all=FALSE option removes all genes that are not present in both tables. Now we can plot the data for the first sample ST_1, again as log2 values:

    with(combined_counts,plot(log2(est_counts),log2(ST_1)))

How well do both methods agree? Let's find out tomorrow how this affects the subsequent identification of differentially expressed genes...
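The aggregation and merge steps can be followed on a small example. This is a minimal sketch with invented data: the gene and transcript IDs only mimic the Arabidopsis naming pattern, the values are made up, and the toy tables stand in for the kallisto and Hisat2 tables.

```r
# Made-up transcript-level estimates (IDs and values are invented).
toy_tx <- data.frame(
  est_counts = c(50, 25, 48, 7),
  row.names = c("AT1G00001.1", "AT1G00001.2", "AT1G00002.1", "AT1G00003.1")
)

# Made-up gene-level counts standing in for the Hisat2 table.
toy_gene <- data.frame(
  ST_1 = c(70, 51, 9),
  row.names = c("AT1G00001", "AT1G00002", "AT1G00004")
)

# The first 9 characters of the transcript ID are the gene ID.
toy_tx$gene <- substring(rownames(toy_tx), 1, 9)

# Sum the transcript estimates per gene.
toy_tx_gene <- aggregate(est_counts ~ gene, data = toy_tx, sum)

# Join on the 'gene' column of the first table and the row names
# (by.y = 0) of the second; all = FALSE keeps only shared genes.
toy_combined <- merge(toy_tx_gene, toy_gene,
                      by.x = "gene", by.y = 0, all = FALSE)
print(toy_combined)
```

In this toy example the two transcripts of AT1G00001 are summed into one gene-level value, and the genes that occur in only one of the two tables (AT1G00003 and AT1G00004) are dropped by the merge. Cutting the identifier at 9 characters relies on the fixed length of Arabidopsis gene IDs such as AT3G18710 and would need to be adapted for other naming schemes.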

References

Pertea, M., D. Kim, G. M. Pertea, J. T. Leek and S. L. Salzberg (2016). "Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown." Nature Protocols 11: 1650.

Serin, E. A. R., L. B. Snoek, H. Nijveen, L. A. J. Willems, J. M. Jiménez-Gómez, H. W. M. Hilhorst and W. Ligterink (2017). "Construction of a High-Density Genetic Map from RNA-Seq Data for an Arabidopsis Bay-0 Shahdara RIL Population." Frontiers in Genetics 8: 201.

The counts were produced using these Linux commands:

Hisat2

    hisat2 -x TAIR11 -U SRR6293018.fastq -S SRR6293018.sam
    samtools sort -o SRR6293018.bam SRR6293018.sam
    stringtie -e -B -G genes.gtf -o SRR6293018.gtf SRR6293018.bam
    python prepde.py

Kallisto

    kallisto quant -i TAIR.idx -o SRR6293018 -b 100 --single -l 200 -s 20 -t 10 SRR6293018.fastq