DNA / RNA sequencing

Similar documents
RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

Galaxy Platform For NGS Data Analyses

RNA-seq Data Analysis

Understanding and Pre-processing Raw Illumina Data

Variation among genomes

NGS : reads quality control

Sequence Data Quality Assessment Exercises and Solutions.

Copyright 2014 Regents of the University of Minnesota

Peter Schweitzer, Director, DNA Sequencing and Genotyping Lab

Copyright 2014 Regents of the University of Minnesota

Trimming and quality control ( )

These will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data.

NGS Data Analysis. Roberto Preste

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

NGS Data Visualization and Exploration Using IGV

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

Genomics AGRY Michael Gribskov Hock 331

BIT 815: Analysis of Deep DNA Sequencing Data

NGS NEXT GENERATION SEQUENCING

RNA-seq. Manpreet S. Katari

DNA Sequencing analysis on Artemis

NGS Analysis Using Galaxy

Importing your Exeter NGS data into Galaxy:

An Introduction to Linux and Bowtie

MetaPhyler Usage Manual

Package Rsubread. July 21, 2013

Contact: Raymond Hovey Genomics Center - SFS

NGS Sequence data. Jason Stajich. UC Riverside. jason.stajich[at]ucr.edu. twitter:hyphaltip stajichlab

README _EPGV_DataTransfer_Illumina Sequencing

PRACTICAL SESSION 5 GOTCLOUD ALIGNMENT WITH BWA JAN 7 TH, 2014 STOM 2014 WORKSHOP HYUN MIN KANG UNIVERSITY OF MICHIGAN, ANN ARBOR

Galaxy workshop at the Winter School Igor Makunin

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

ASAP - Allele-specific alignment pipeline

High-throughout sequencing and using short-read aligners. Simon Anders

Lecture 8. Sequence alignments

Sep. Guide. Edico Genome Corp North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

1. Download the data from ENA and QC it:

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

Rsubread package: high-performance read alignment, quantification and mutation discovery

Quality assessment of NGS data

Nevada Genomics Center

Quantitative Biology Bootcamp Intro to Unix: Command Line Interface

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

Rsubread package: high-performance read alignment, quantification and mutation discovery

EpiGnome Methyl Seq Bioinformatics User Guide Rev. 0.1

The Analysis of RAD-tag Data for Association Studies

2015 Workshop on Genomics. Genomics Laboratory

Darwin: A Genomic Co-processor gives up to 15,000X speedup on long read assembly (To appear in ASPLOS 2018)

Mar. Guide. Edico Genome Inc North Torrey Pines Court, Plaza Level, La Jolla, CA 92037

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

RASER: Reads Aligner for SNPs and Editing sites of RNA (version 0.51) Manual

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

Genomic Files. University of Massachusetts Medical School. October, 2014

Ensembl RNASeq Practical. Overview

High-throughput sequencing: Alignment and related topic. Simon Anders EMBL Heidelberg

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

RNAseq analysis: SNP calling. BTI bioinformatics course, spring 2013

User's Guide to DNASTAR SeqMan NGen For Windows, Macintosh and Linux

Sentieon Documentation

NGS Data and Sequence Alignment

Illumina Next Generation Sequencing Data analysis

Demultiplexing Illumina sequencing data containing unique molecular indexes (UMIs)

Single/paired-end RNAseq analysis with Galaxy

Genome 373: Mapping Short Sequence Reads III. Doug Fowler

Fusion Detection Using QIAseq RNAscan Panels

SAM : Sequence Alignment/Map format. A TAB-delimited text format storing the alignment information. A header section is optional.

ls /data/atrnaseq/ egrep "(fastq fasta fq fa)\.gz" ls /data/atrnaseq/ egrep "(cn ts)[1-3]ln[^3a-za-z]\."

SMRT-Portal Exercises. J Fass UCD Genome Center Bioinformatics Core Thursday April 16, 2015

Mapping reads to a reference genome

GALAXY BIOINFORMATICS WORKFLOW ENVIRONMENT. Rutger Vos, 3 April 2012

2013 Workshop on Genomics, Cesky Krumlov

ChIP-seq Analysis. BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute.

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

ChIP-seq hands-on practical using Galaxy

Helpful Galaxy screencasts are available at:

Subread/Rsubread Users Guide

ChIP-seq practical: peak detection and peak annotation. Mali Salmon-Divon Remco Loos Myrto Kostadima

Functional Genomics Research Stream. Computational Meeting: March 29, 2012 RNA-seq Analysis Pipeline

Integrative Genomics Viewer. Prat Thiru

de.nbi Nanopore Training Course Documentation Release latest

Sequence Analysis Pipeline

Short Read Alignment. Mapping Reads to a Reference

Introduction to UNIX command-line II

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Genomic Files. University of Massachusetts Medical School. October, 2015

Quality Control of Sequencing Data

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL

GBS Bioinformatics Pipeline(s) Overview

Using the GBS Analysis Pipeline Tutorial

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

replace my_user_id in the commands with your actual user ID

For Research Use Only. Not for use in diagnostic procedures.

USING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)

Mapping NGS reads for genomics studies

SlopMap: a software application tool for quick and flexible identification of similar sequences using exact k-mer matching

Briefly: Bioinformatics File Formats. J Fass September 2018

Cyverse tutorial 1 Logging in to Cyverse and data management. Open an Internet browser window and navigate to the Cyverse discovery environment:

Package Rsubread. June 29, 2018

HMPL User Manual. Shuying Sun or Texas State University

Transcription:

Outline Ways to generate large amounts of sequence Understanding the contents of large sequence files Fasta format Fastq format Sequence quality metrics Summarizing sequence data quality/quantity Using Unix to look at large files Manipulating large files in Unix

DNA / RNA sequencing Sanger Long reads (800 bp), high quality targeted (primers), slow, expensive, hard to automate 454 Long reads (600-800bp), fairly high quality Insertions/deletions, library prep is expensive, not cheap Illumina Many, many reads, high quality Short(ish) 100bp-250bp Ion torrent error rates, throughput PacBio high error rates (10-15% errors) but very long reads (up to 100kb) Oxford nanopore

DNA / RNA sequencing Sanger Long reads (800 bp), high quality targeted (primers), slow, expensive, hard to automate 454 Long reads (600-800bp), fairly high quality Insertions/deletions, library prep is expensive, not cheap Illumina Many, many reads, high quality Short(ish) 100bp-250bp Ion torrent error rates, throughput PacBio high error rates (10-15% errors) but very long reads (up to 100kb) Oxford nanopore

DNA / RNA sequencing Sanger Long reads (800 bp), high quality targeted (primers), slow, expensive, hard to automate 454 Long reads (600-800bp), fairly high quality Insertions/deletions, library prep is expensive, not cheap Illumina Many, many reads, high quality Short(ish) 100bp-250bp Ion torrent high error rates, throughput PacBio high error rates (10-15% errors) but very long reads (up to 100kb) Oxford nanopore

DNA / RNA sequencing Sanger Long reads (800 bp), high quality targeted (primers), slow, expensive, hard to automate 454 Long reads (600-800bp), fairly high quality Insertions/deletions, library prep is expensive, not cheap Illumina Many, many reads, high quality Short(ish) 100bp-250bp Ion torrent error rates, throughput PacBio high error rates (10-15% errors) but very long reads (up to 100kb) Oxford nanopore

http://www.molecularecologist.com /next-gen-fieldguide-2014/

Illumina library prep Shear the DNA to specific fragment length Ligate on adaptors and barcodes

Illumina sequencing

Illumina sequencing

Illumina sequencing

Data formats Fasta Fastq.fastq.fq.fq.txt.fastq.txt SAM BAM

Basic Unix Unix tutorial http://www.ee.surrey.ac.uk/teaching/unix/ Standard commands: pwd print working directory ls list contents of working directory cd change working directory less look at a text file man read the manual how to use a command wget get a file from another machine

An example dataset These files are already on Reference genome Arabidopsis mitochondrion wget ftp://ftp.arabidopsis.org/home/tair/sequences/mitochondrial/mitochondrial_genomic_sequence Illumina sequence for another genotype http://trace.ncbi.nlm.nih.gov/traces/sra/sra.cgi?cmd=dload&run_list=srr307232&format=fastq (you may have to uncompress this file)

Data formats Fasta Fastq.fastq.fq.fq.txt.fastq.txt SAM BAM

Fasta format

Fasta format First line: a > symbol, and a sequence name After that 1 or more lines of sequence data May have another header and other sequence after that or many headers and sequences >Cannabis sativa CBDAS mrna for cannabidiolic acid synthase, complete cds ATGAAGTGCTCAACATTCTCCTTTTGGTTTGTTTGCAAGATAATATTTTTCTTTTTCTCATTCAATATCC AAACTTCCATTGCTAATCCTCGAGAAAACTTCCTTAAATGCTTCTCGCAATATATTCCCAATAATGCAAC AAATCTAAAACTCGTATACACTCAAAACAACCCATTGTATATGTCTGTCCTAAATTCGACAATACACAAT CTTAGATTCACCTCTGACACAACCCCAAAACCACTTGTTATCGTCACTCCTTCACATGTCTCTCATATCC AAGGCACTATTCTATGCTCCAAGAAAGTTGGCTTGCAGATTCGAACTCGAAGTGGTGGTCATGATTCTGA GGGCATGTCCTACATATCTCAAGTCCCATTTGTTATAGTAGACTTGAGAAACATGCGTTCAATCAAAATA GATGTTCATAGCCAAACTGCATGGGTTGAAGCCGGAGCTACCCTTGGAGAAGTTTATTATTGGGTTAATG AGAAAAATGAGAATCTTAGTTTGGCGGCTGGGTATTGCCCTACTGTTTGCGCAGGTGGACACTTTGGTGG AGGAGGCTATGGACCATTGATGAGAAACTATGGCCTCGCGGCTGATAATATCATTGATGCACACTTAGTC

Now, look at the file mt.fa How do we look at this sequence? What do we know about this sequence? How do you know?

Data formats Fasta Fastq.fastq.fq.fq.txt.fastq.txt SAM BAM

Fastq format 4 line repeating pattern: 1. Header line, starting with @ 2. DNA sequence (ATGCN) 3. spacer line, starting with + 4. Sequence quality scores

Looking at a fastq file using less

Fastq ASCII quality scores http://www.asciitable.com/

Illumina quality scores http://en.wikipedia.org/wiki/fastq_format Sanger format 0 to 93 using ASCII 33 to 126 Solexa/Illumina 1.0 format -5 to 62 using ASCII 59 to 126 Illumina 1.3+ format 0 to 62 using ASCII 64 to 126 Illumina 1.5+ 0 and 1 are no longer used and the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator [6].

Fastq ASCII quality scores http://www.asciitable.com/

Looking at a fastq file using less

Examining the data Look at the fastq file - Less SRR307232.fastq Please work in groups of 2-4 again, and figure out which quality score type is this?

Illumina quality scores http://en.wikipedia.org/wiki/fastq_format Sanger format 0 to 93 using ASCII 33 to 126 Solexa/Illumina 1.0 format -5 to 62 using ASCII 59 to 126 Illumina 1.3+ format 0 to 62 using ASCII 64 to 126 Illumina 1.5+ 0 and 1 are no longer used and the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator [6].

Evaluating quality Fastqc - a good program for quality metrics fastqc sra_data.fastq

For Thursday Do modules 3-4 in the tutorial Read up on fastq files: http://en.wikipedia.org/wiki/fastq_format