Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6

Similar documents
How to store and visualize RNA-seq data

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

ChIP-seq hands-on practical using Galaxy

Goal: Learn how to use various tool to extract information from RNAseq reads. 4.1 Mapping RNAseq Reads to a Genome Assembly

Single/paired-end RNAseq analysis with Galaxy

Sequence Analysis Pipeline

Cyverse tutorial 1 Logging in to Cyverse and data management. Open an Internet browser window and navigate to the Cyverse discovery environment:

Fast-track to Gene Annotation and Genome Analysis

Protocol: peak-calling for ChIP-seq data / segmentation analysis for histone modification data

Resequencing Analysis. (Pseudomonas aeruginosa MAPO1 ) Sample to Insight

Importing your Exeter NGS data into Galaxy:

David Crossman, Ph.D. UAB Heflin Center for Genomic Science. GCC2012 Wednesday, July 25, 2012

Galaxy Platform For NGS Data Analyses

Maize genome sequence in FASTA format. Gene annotation file in gff format

NGS Data Visualization and Exploration Using IGV

NGS FASTQ file format

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Tutorial 4 BLAST Searching the CHO Genome

Genome Browsers - The UCSC Genome Browser

Getting Started. April Strand Life Sciences, Inc All rights reserved.

Galaxy workshop at the Winter School Igor Makunin

ChIP-seq hands-on practical using Galaxy

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Browser Exercises - I. Alignments and Comparative genomics

NGS Analysis Using Galaxy

How To: Run the ENCODE histone ChIP- seq analysis pipeline on DNAnexus

Copyright 2014 Regents of the University of Minnesota

Copyright 2014 Regents of the University of Minnesota

Performing a resequencing assembly

Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment

RNA-Seq Analysis With the Tuxedo Suite

Accessible, Transparent and Reproducible Analysis with Galaxy

Ensembl RNASeq Practical. Overview

Lecture 8. Sequence alignments

ChIP-Seq Tutorial on Galaxy

Helpful Galaxy screencasts are available at:

TUTORIAL: Generating diagnostic primers using the Uniqprimer Galaxy Workflow

RNA-seq. Manpreet S. Katari

Sequence Alignment. GBIO0002 Archana Bhardwaj University of Liege

Tutorial: De Novo Assembly of Paired Data

Tutorial. Small RNA Analysis using Illumina Data. Sample to Insight. October 5, 2016

Small RNA Analysis using Illumina Data

RNA-Seq analysis with Astrocyte Differential expression and transcriptome assembly

Integrated Genome browser (IGB) installation

Module 1 Artemis. Introduction. Aims IF YOU DON T UNDERSTAND, PLEASE ASK! -1-

CLC Server. End User USER MANUAL

Cytidine-to-Uridine Recognizing Editor for Chloroplasts

Tutorial. De Novo Assembly of Paired Data. Sample to Insight. November 21, 2017

Using the Galaxy Local Bioinformatics Cloud at CARC

MetaStorm: User Manual

de.nbi and its Galaxy interface for RNA-Seq

ArrayExpress and Expression Atlas: Mining Functional Genomics data

Expression Analysis with the Advanced RNA-Seq Plugin

EBI services. Jennifer McDowall EMBL-EBI

Wilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST

Useful software utilities for computational genomics. Shamith Samarajiwa CRUK Autumn School in Bioinformatics September 2017

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

EBI patent related services

NGS : reads quality control

m6aviewer Version Documentation

COMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP. Exercise 2: Predicting Protein-encoding Genes, BlastMatrix, BlastAtlas

4.1. Access the internet and log on to the UCSC Genome Bioinformatics Web Page (Figure 1-

Bioinformatics in next generation sequencing projects

Analyzing ChIP- Seq Data in Galaxy

Goal: Learn how to use various tool to extract information from RNAseq reads.

Exercise 2: Browser-Based Annotation and RNA-Seq Data

8:15 Introduction/Overview Michelle Giglio. 8:45 CloVR background W. Florian Fricke. 9:15 Hands-on: Start CloVR W. Florian Fricke

Advanced UCSC Browser Functions

Practical Course in Genome Bioinformatics

Biostatistics and Bioinformatics Molecular Sequence Databases

Tutorial 1: Exploring the UCSC Genome Browser

Maximizing Public Data Sources for Sequencing and GWAS

epigenomegateway.wustl.edu

Tutorial: RNA-Seq analysis part I: Getting started

Tutorial: RNA-Seq Analysis Part II (Tracks): Non-Specific Matches, Mapping Modes and Expression measures

Tiling Assembly for Annotation-independent Novel Gene Discovery

mrna-seq Basic processing Read mapping (shown here, but optional. May due if time allows) Gene expression estimation

Lecture 12. Short read aligners

Genomic Analysis with Genome Browsers.

Analyzing Variant Call results using EuPathDB Galaxy, Part II

11/8/2017 Trinity De novo Transcriptome Assembly Workshop trinityrnaseq/rnaseq_trinity_tuxedo_workshop Wiki GitHub

ChIP-seq (NGS) Data Formats

Fusion Detection Using QIAseq RNAscan Panels

BovineMine Documentation

Genomic Files. University of Massachusetts Medical School. October, 2015

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

Performing whole genome SNP analysis with mapping performed locally

RNA-seq Data Analysis

Tutorial: chloroplast genomes

DataSTORRE Deposit Guide

Minimum Information for Reporting Immunogenomic NGS Genotyping (MIRING)

Introduction to Genome Browsers

New generation of patent sequence databases Information Sources in Biotechnology Japan

RASER: Reads Aligner for SNPs and Editing sites of RNA (version 0.51) Manual

Topics of the talk. Biodatabases. Data types. Some sequence terminology...

KisSplice. Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data. 29th may 2013

Exeter Sequencing Service

MIRING: Minimum Information for Reporting Immunogenomic NGS Genotyping. Data Standards Hackathon for NGS HACKATHON 1.0 Bethesda, MD September

Annotating a Genome in PATRIC

Transcription:

Mapping RNA sequence data (Part 1: using pathogen portal s RNAseq pipeline) Exercise 6 The goal of this exercise is to retrieve an RNA-seq dataset in FASTQ format and run it through an RNA-sequence analysis pipeline. Step I: Create a login account at Pathogen Portal: 1. Go to http://pathogenportal.org 2. Click on RNA Rocket. 3. Click on Create account and fill in the required information.

Step II: Getting data into your launch pad. The following exercise is based on data generated from the recent study: Franzén et al. Transcriptome profiling of Giardia intestinalis using strand-specific RNA-seq. PLoS Comput Biol. 2013;9(3) http://www.ncbi.nlm.nih.gov/pubmed/23555231 The paper examines transcription in Giardia assemblages A (WB), B (GS) and E (P15). In the paper the authors indicate that the data has been deposited to the sequence read archive (SRA) and they provide a link to GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gsm895812 Examining the information available in GEO and under the SRA accession numbers you will notice that this data is paired end and strand specific. So for each assemblage there should be two files for the forward strand (one for each pair) and two files for the reverse strand. Assemblage A (WB): Assemblage A (AS175): Assemblage B (GS): Assemblage E (P15): http://www.ncbi.nlm.nih.gov/sra/srx129645 http://www.ncbi.nlm.nih.gov/sra/srx129648 http://ncbi.nlm.nih.gov/sra/srx129649 http://www.ncbi.nlm.nih.gov/sra/srx129646 The required input format is something called a FASTQ file, which is similar to a FASTA file. These are simple text files that include sequence and additional information about the sequence (ie. name, quality scores, sequencing machine ID, lane number etc.).

Definition line FASTA >SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDK AVQLLREKGLGKAAKKADRLAAEGLVSVKVSDDFTIAA MRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRL KDPNKPEHKIPQFASRKQLSDAILKEAEEKIKEELKAQ GKPEKIWDNIIPGKMNSFIADNSQLDSKLTLMGQFYVM DDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKT EDFAAEVAAQL Sequence End of Sequence FASTQ Definition line @SRR016080.2 20AKUAAXX:7:1:123:268 TGTAGCATAATGCCGTTTTCTTTGTTTCCATTCATC + II&I&4IICIIIIIIII.III3:III3#6IIII1I) @SRR016080.3 20AKUAAXX:7:1:112:638 TATAGATCTTGGTAACACCCGTTGTATTATTCGCAA + IIIIIIIIIIIIIIIIIIIIIIII-IIIII%%IIII @SRR016080.4 20AKUAAXX:7:1:102:360 TTGCCAGTACAACACCGTTTTGCATCGTTTTTTTTA + IIIIII$IIIIIIII'IIIIIIIIIIII@IIIID35 Sequence Encoded Quality Score - FASTQ files are large and as a result not all sequencing repositories will store this format. However, tools are available to convert, for example, NCBI s.sra format to FASTQ. - Sequence data is housed in three repositories that are synchronized on a regular basis. o The sequence read archive at GenBank o The European Nucleotide Archive at EMBL o The DNA data bank of Japan - RNArocket allows you to use SRA accession numbers and directly retrieve FASTQ files.

Group 1 (Assemblage A (WB)): Accession number: SRX129645 Group 2 (Assemblage A (AS175)): Accession number: SRX129648 Group 3 (Assemblage B (GS)): Accession number: SRX129649 Group 4 (Assemblage E (P15)): Accession number: SRX129646 Ø Here are the steps you take to start uploading data into your Launchpad: 1. Click on the Upload Files link

2. On the next page, notice the instructions to use the global search on the ENA site. Next click on continue. 3. Cut and paste your accession number into the global search box. Click on the search icon. 4. Select the record that matches your search accession number, usually the first one.

5. On the next page you can select the files to load into RNArocket. We will use the Fastq files (Galaxy) since RNArocket is built on Galaxy. Remember, you have to get 4 files. Two (paired) for each strand. Click on the link for File 1 for the first run, then click on the back button on your browser and click on the link for File 2. You should now see a window that looks like this: To view the progress of your upload, click on Project View (red square in image above). In progress tasks will show up in yellow

Completed tasks will show up in green You can inspect the contents of completed tasks (like uploaded files) by clicking on the eye icon next to the name of the file (arrow in above image). Inspecting a FASTQ file should look like this: 6. Once the RNA-sequence FASTQ file has been uploaded you can start the RNAseq pipeline. Pathogen portal uses two algorithms for mapping (TopHat) and transcript prediction and expression value calculation (Cufflinks). Note that there are many algorithms and methods for RNA-seq mapping and analysis each with its advantages and disadvantages. You are encouraged to learn more about the algorithm you are using. o TopHat: o Cufflinks: http://tophat.cbcb.umd.edu/ http://cufflinks.cbcb.umd.edu/index.html - To start the pipeline click on the Launch Pad link (red square in above image). On the next page, scroll down to the RNA-Seq Analysis section and click on Map Reads & Assemble Transcripts.

- - On the next page, scroll down and choose the type of analysis (in this case we are analyzing a paired end eukaryotic sample). Next select the target project from the drop down menu. You should only have one or two projects one of which will contain both FASTQ files you uploaded (probably called Uploaded Files ). Once you select the correct project you should see the two FASTQ files contained within it. Next click on continue. - The next page allows you to configure the pipeline: Step1: Select the upstream read file (ends in _1) and click on the arrow to move it to the Selected window. Step2: Select the downstream read file (ends in _2) and click on the arrow to move it to the Selected window.

Step3: Configure TopHat there are a number of options that may be modified, however, for the purposes of this exercise the default parameters may be used. The only required change is the reference genome -- select Giardia Assemblage A isolate WB Step4: Configure Cufflinks once again there are a number of options to modify. For the purposes of this exercise change the following: Maximum Intron Length (-I): 500 The reference annotation should be automatically selected: Giardia Assemblage A isolate WB Select how to use the provided annotation: Assemble Novel + annotated transcripts. Click on the Run Workflow button.

After you start the workflow you should get a confirmation window that indicates all the steps that have been added to the queue. The progress of your workflow can be viewed to the right. Completed tasks are in green, running tasks are in yellow and tasks waiting in the queue are in grey.