Quality assessment of NGS data

Similar documents
ChIP-Seq data analysis workshop

NGS : reads quality control

Quality Control of Sequencing Data

Copyright 2014 Regents of the University of Minnesota

Copyright 2014 Regents of the University of Minnesota

Trimming and quality control ( )

Understanding and Pre-processing Raw Illumina Data

ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013

INF-BIO5121/ Oct 7, Analyzing mirna data using Lifeportal PRACTICALS

WM2 Bioinformatics. ExomeSeq data analysis part 1. Dietmar Rieder

Sequence Preprocessing: A perspective

Tutorial. Find Very Low Frequency Variants With QIAGEN GeneRead Panels. Sample to Insight. November 21, 2017

Galaxy Platform For NGS Data Analyses

Data Preprocessing. Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

Data Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis

Sequence Data Quality Assessment Exercises and Solutions.

NGS Data Analysis. Roberto Preste

Importing your Exeter NGS data into Galaxy:

Next Generation Sequencing quality trimming (NGSQTRIM)

NGS Data Visualization and Exploration Using IGV

Genome 373: Mapping Short Sequence Reads III. Doug Fowler

ChIP-seq Analysis. BaRC Hot Topics - Feb 23 th 2016 Bioinformatics and Research Computing Whitehead Institute.

Fusion Detection Using QIAseq RNAscan Panels

see also:

RNA-seq Data Analysis

Lecture 8. Sequence alignments

!"#$%&$'()#$*)+,-./).01"0#,23+3,303456"6,&((46,7$+-./&((468,

Tutorial: De Novo Assembly of Paired Data

Introduction To NGS Data & Analytic Tools. Steve Pederson Bioinformatics Centre University Of Adelaide

ChIP-seq hands-on practical using Galaxy

Tutorial for Windows and Macintosh. Trimming Sequence Gene Codes Corporation

The Analysis of RAD-tag Data for Association Studies

Next Generation Sequence Alignment on the BRC Cluster. Steve Newhouse 22 July 2010

DNA / RNA sequencing

Pre-processing and quality control of sequence data. Barbera van Schaik KEBB - Bioinformatics Laboratory

RNA-seq. Manpreet S. Katari

User Guide. SLAMseq Data Analysis Pipeline SLAMdunk on Bluebee Platform

QIAseq DNA V3 Panel Analysis Plugin USER MANUAL

ChIP-seq hands-on practical using Galaxy

QIAseq Targeted RNAscan Panel Analysis Plugin USER MANUAL

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Tutorial. OTU Clustering Step by Step. Sample to Insight. March 2, 2017

Dr. Gabriela Salinas Dr. Orr Shomroni Kaamini Rhaithata

Colorado State University Bioinformatics Algorithms Assignment 6: Analysis of High- Throughput Biological Data Hamidreza Chitsaz, Ali Sharifi- Zarchi

cutadapt Documentation

ChIP-seq Analysis. BaRC Hot Topics - March 21 st 2017 Bioinformatics and Research Computing Whitehead Institute.

OTU Clustering Using Workflows

Community analysis of 16S rrna amplicon sequencing data with Chipster. Eija Korpelainen CSC IT Center for Science, Finland

Using Galaxy for NGS Analyses Luce Skrabanek

Tutorial. Small RNA Analysis using Illumina Data. Sample to Insight. October 5, 2016

Small RNA Analysis using Illumina Data

Run Setup and Bioinformatic Analysis. Accel-NGS 2S MID Indexing Kits

Tutorial. OTU Clustering Step by Step. Sample to Insight. June 28, 2018

Genome Assembly: Preliminary Results

BaseSpace - MiSeq Reporter Software v2.4 Release Notes

srna Detection Results

BGGN-213: FOUNDATIONS OF BIOINFORMATICS (Lecture 14)

Exome sequencing. Jong Kyoung Kim

Genome Assembly. 2 Sept. Groups. Wiki. Job files Read cleaning Other cleaning Genome Assembly

HMPL User Manual. Shuying Sun or Texas State University

RCAC. Job files Example: Running seqyclean (a module)

Galaxy workshop at the Winter School Igor Makunin

ASAP - Allele-specific alignment pipeline

Functional Genomics Research Stream. Computational Meeting: March 29, 2012 RNA-seq Analysis Pipeline

1. Download the data from ENA and QC it:

README _EPGV_DataTransfer_Illumina Sequencing

ChIP-Seq Tutorial on Galaxy

Sequence Analysis Pipeline

NGS Analysis Using Galaxy

These will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data.

Performing a resequencing assembly

atropos Documentation

TP RNA-seq : Differential expression analysis

1. mirmod (Version: 0.3)

Accessible, Transparent and Reproducible Analysis with Galaxy

Handling sam and vcf data, quality control

Practical Linux examples: Exercises

Importing FASTQ files

Helsinki 19 Jan Practical course in genome bioinformatics DAY 0

Tutorial and Exercises with KeyWords in WordSmith Tools: Level I

NGS Analyses with Galaxy

Gegenees genome format...7. Gegenees comparisons...8 Creating a fragmented all-all comparison...9 The alignment The analysis...

Analyzing ChIP- Seq Data in Galaxy

Exercises: Analysing RNA-Seq data

RNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF

ssrna Example SASSIE Workflow Data Interpolation

De novo genome assembly

de.nbi and its Galaxy interface for RNA-Seq

Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page.

Sorted bar plot with 45 degree labels in R

Running SNAP. The SNAP Team October 2012

MacVector for Mac OS X. The online updater for this release is MB in size

Captivating Movies! Getting Started with Captivate

NGS Sequence data. Jason Stajich. UC Riverside. jason.stajich[at]ucr.edu. twitter:hyphaltip stajichlab

Meraculous De Novo Assembly of the Ariolimax dolichophallus Genome. Charles Cole, Jake Houser, Kyle McGovern, and Jennie Richardson

Package fastqcr. April 11, 2017

ERPEEG Tutorial. Version 1.0. This tutorial was written by: Sravya Atluri, Matthew Frehlich and Dr. Faranak Farzan.

Running SNAP. The SNAP Team February 2012

Tutorial files are available from the Exelis VIS website or on the ENVI Resource DVD in the image_reg directory.

From the Schnable Lab:

Transcription:

Quality assessment of NGS data Ines de Santiago July 27, 2015 Contents 1 Introduction 1 2 Checking read quality with FASTQC 1 3 Preprocessing with FASTX-Toolkit 2 3.1 Preprocessing with FASTX-Toolkit: fastq quality filter........................... 2 3.2 Preprocessing with FASTX-Toolkit: fastx trimmer.............................. 4 4 Cleaning adapter containing reads from FASTQ data using cutadapt 5 4.1 Checking read quality with FASTQC..................................... 5 4.2 Using cutadapt................................................ 5 5 Conclusions 7 1 Introduction This training gives an introduction to quality assessment of NGS data through various workflows using command line tools. We will use a software called FASTQC to assess the quality of sequenced reads and lead you through the different preprocessing steps, from removing reads of low quality to removing adapter contamination. 2 Checking read quality with FASTQC FASTQC checks whether a set of sequence reads in a.fastq file exhibit any unusual qualities. There are two ways in which FASTQC can be run: in command line mode, or as a GUI (graphical user interface). This exercise addresses the command line version of FASTQC. Use Case: Run FASTQC to check read quality. The data we will use in this session consists of a fastq file with example sequences. The file is called sample.fastq and is located in the located in the Day1/qa directory: fastqc sample.fastq FASTQC generates a html report with a nice graphical summary output about the quality of sequencing reads. Two files are generated, one compressed, and one not. To view the report, open the file sample fastqc.html in a browser. Your output should have a menu on the left-hand side that looks like this: 1

Quality assessment of NGS data 2 Make sure you understand the results. Do they look OK? What do the warnings and errors mean? 3 Preprocessing with FASTX-Toolkit In this section of the tutorial we will be using FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) to filter and trim sequences based on quality. The FASTX-Toolkit is a collection of command line tools for preprocessing of FASTA/FASTQ files. There are many tools available within FASTX-toolkit, we will be using two of those tools: fastq quality filter: Filters sequences based on quality fastx trimmer: Shortening reads in a FASTQ or FASTQ files (removing barcodes or noise). For most programs and scripts, you can see their instructions by typing their name in the terminal followed by the flag -h. There are many options available, and we will use only a few of those. 3.1 Preprocessing with FASTX-Toolkit: fastq quality filter The Per-base sequence quality plot from sample.fastq file shows the average and range of the sequence quality values across the read. It looks like there are many reads in the file with very low quality. We will remove low quality reads using fastq quality filter tool:

Quality assessment of NGS data 3 Use Case: Use fastq quality filter to remove reads with lower quality, keeping only reads that have at least 75% of bases with a quality score of 20 or more. fastq_quality_filter -v -Q 64 -q 20 -p 75 -i sample.fastq -o sample_filtered.fastq the following options were used: -i: input file -o: output file -v: report number of sequences -Q 64: determines the input quality ASCII offset (in this case 64). You will need version 0.0.10 or later f -q 20: the quality value required. -p 75: the percentage of bases that have to have that quality value fastq quality filter should give you the following output: Q u a l i t y cut o f f : 20 Minimum p e r c e n t a g e : 75 I n p u t : 9053 r e a d s. Output : 6629 r e a d s. d i s c a r d e d 2424 (26%) low q u a l i t y r e a d s.

Quality assessment of NGS data 4 3.2 Preprocessing with FASTX-Toolkit: fastx trimmer The Per-base sequence content plot from sample.fastq file shows the average percentage of A, G, C and T across the read length. The pattern we see here at the beginning of the reads show some sort bias for the first 6 bases. We will use fastx trimmer to deal with this issue: Use Case: Use fastx trimmer to trim the first 6 bases from reads. fastx_trimmer -v -f 7 -l 36 -i sample_filtered.fastq -o sample_filtered_and_trimmed.fastq the following options were used: -f: first base to keep -l: last base to keep -i: input file -o: output file -v: report number of sequences fastx trimmer should give you the following output: Trimming : base 7 to 36 I n p u t : 6629 r e a d s. Output : 6629 r e a d s.

Quality assessment of NGS data 5 4 Cleaning adapter containing reads from FASTQ data using cutadapt In this section we will demonstrate the effect of adapter contamination. We will use another example dataset, the file is called sample2.fastq and is located in the Day1/qa directory. First, let s start by running FASTQC with this file. 4.1 Checking read quality with FASTQC Use Case: Run FASTQC to check read quality fastqc sample2.fastq Now open the output file sample2 fastqc.html in a browser and inspect the results. Many of the issues present in the report can be linked to adapter contamination. Do you have adapter sequences in your reads? 4.2 Using cutadapt Under overrepresented sequences option you will see the following table: As seen here, one sequence is present in more than 28.97% of the reads. FASTQC identifies the primer as TruSeq Adapter, Index 5, possibly a left over from the library construction process or from a PCR amplification issue. We now use the cutadapt utility to remove reads containing the TruSeq Adapter GATCGGAAGAGCACACGTCT- GAACTCCAGTCACACA : Use Case: Use cutadapt to remove the unwanted adapter sequences from sample2.fastq. cutadapt -m 20 -e 0.1 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA sample2.fastq \ -o sample2--cutadapt.fastq The following options were used: -a: The sequence of the adaptor is given with the -a option -m 20: throws away processed reads shorter than 20 bases. -e 0.1: The level of error tolerance is adjusted by specifying a maximum 10% error rate. Allowed errors are mismatches, insertions and deletions. -o: output file cutadapt should give you the following output:

Quality assessment of NGS data 6 cutadapt v e r s i o n 1. 1 Command l i n e p a r a m e t e r s : m 20 e 0. 1 a GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA sample2. f a s t q o sample2 c utadapt. f a s t q Maximum e r r o r r a t e : 10.00% P r o c e s s e d r e a d s : 100000 Trimmed r e a d s : 32792 ( 32.8%) Total b a s e p a i r s : 3600000 ( 3. 6 Mbp) Trimmed b a s e p a i r s : 1124674 ( 1. 1 Mbp) (31.24% o f t o t a l ) Too s h o r t r e a d s : 31113 ( 31.1% o f p r o c e s s e d r e a d s ) Too l o n g r e a d s : 0 ( 0.0% o f p r o c e s s e d r e a d s ) Total time : 5.13 s Time per read : 0.05 ms === Adapter 1 === Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA', l e n g t h 36, was trimmed 32792 t i m e s. Lengths o f removed s e q u e n c e s l e n g t h count e x p e c t e d 3 1267 1562.5 4 277 390.6 5 52 97.7 6 22 24.4 7 1 6. 1 8 2 1. 5 9 2 0. 4 10 7 0. 1 11 2 0. 0 12 1 0. 0 16 46 0. 0 17 19 0. 0 18 9 0. 0 20 1 0. 0 21 38 0. 0 23 35 0. 0 24 1 0. 0 36 31010 0. 0 Use Case: Inspect the results after cleaning the data (use FASTQC again). fastqc sample2--cutadapt.fastq Now many of the issues in the report seem to have been resolved. Only a low quality issue with the 3 of reads and some kmer noise seems to remain. Most mappers should not be affected by contaminating adapters or kmers because those reads should not map to the reference genome and are discarded during the alignment step. However, when a known contaminating sequence is detected or known to be present it can be good practice to remove and trim reads prior to alignment.

Quality assessment of NGS data 7 5 Conclusions When you get your sequences back from a sequencing facility, its important to check that they are of high quality. FASTQC can tell us that there are unusual features in your fastq file, but it cant necessarily explain why. If you want more information on what each piece of FASTQCs output means, the documentation is available online at http: //www.bioinformatics.bbsrc.ac.uk/projects/fastqc/help/ Many of the issues observed are related to poor sequencing quality or bad base calling, adapter sequences or library contamination. When known adapter/linker sequences are detected, a simple removal of these contaminating sequences can make a very big difference in the quality of your fastq reads. There are many options that you can use with cutadapt, for instance it is possible to specify more than one adapter sequence, trim more than one adapter from each read, etc. We recommend reading cutadapt documentation for more details and examples (http://cutadapt.readthedocs.org/ en/latest/guide.html).