Quality Control of Sequencing Data


1 Quality Control of Sequencing Data
Surya Saha, Sol Genomics Network (SGN), Boyce Thompson Institute, Ithaca, NY
BTI Plant Bioinformatics Course, 3/27/2017

2 Quality Control of NGS Data
1. Exploration
2. Evaluation
3. Preprocessing

3 Exploration
Goal: Use shell commands and installed tools to explore the file from TAIR.
Data: /home/bioinfo/data/tair10_pep.fasta.gz
Tools: (fasta toolkit)

4 Exploration
Exercise 1:
1. Type fasta and press the TAB key to list all the commands starting with fasta.
2. Uncompress the file TAIR10_pep.fasta.gz and find the length of the sequences in it.
   Tip: Use gzip and fastalength
3. Split the uncompressed file into 3 fasta files.
   Tip: Use fastasplit
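For step 2, the sequence lengths can also be computed with standard tools alone, which is handy for cross-checking what fastalength reports. The following is a minimal sketch, assuming a POSIX shell with awk and gzip; the function name fasta_lengths is made up for illustration and is not part of the fasta toolkit.

```shell
# fasta_lengths: read FASTA on stdin, print "<length> <id>" for each sequence.
# Hypothetical helper for illustration; not part of the fasta toolkit.
fasta_lengths() {
    awk '
        /^>/ {
            if (id != "") print len, id      # flush the previous record
            id = substr($1, 2)               # first word of the header, minus ">"
            len = 0
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }   # a sequence may span many lines
        END { if (id != "") print len, id }
    '
}

# Usage (path taken from the slides; gzip -dc leaves the .gz file in place):
#   gzip -dc /home/bioinfo/data/tair10_pep.fasta.gz | fasta_lengths | head
```

Reading from a pipe avoids writing an uncompressed copy just to inspect the file; for step 3 the uncompressed file is still needed.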

5 Evaluation
Goal: Learn to use read evaluation programs, paying attention to relevant parameters such as quality score and length distributions and unusual read duplication.
Data: data for two tomato ripening stages, /home/bioinfo/data/slch04_demo.tar.gz
Tools:
tar -zxvf or tar -xvf (command line, untar and unzip the files)
head (command line, take a quick look at the files)
mv (command line, change the name of the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process fasta/fastq)
FastQC (gui, to calculate several stats for each file)

6 Evaluation
Exercise 2:
1. Uncompress the file: /home/bioinfo/data/slch04_demo.tar.gz
2. The raw data will be found in 4 files. Print the first 10 lines of each file: SRR404331_ch4.fq, SRR404333_ch4.fq, SRR404334_ch4.fq and SRR404336_ch4.fq.
Question 2.1: Are these files in fastq format?
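Question 2.1 can be answered by eye from the head output, but the same check is easy to script: a fastq record is four lines, where the first starts with @, the third starts with +, and the quality line is as long as the sequence line. The sketch below only inspects the first record and assumes unwrapped (four-line) records; is_fastq is a hypothetical helper name.

```shell
# is_fastq FILE: exit status 0 if the first record of FILE looks like fastq.
# Hypothetical helper; checks only the first four lines.
is_fastq() {
    head -n 4 "$1" | awk '
        NR == 1 { ok1 = ($0 ~ /^@/) }               # header line
        NR == 2 { seqlen = length($0) }             # sequence line
        NR == 3 { ok3 = ($0 ~ /^\+/) }              # separator line
        NR == 4 { ok4 = (length($0) == seqlen) }    # one quality char per base
        END { exit !(NR == 4 && ok1 && ok3 && ok4) }
    '
}

# Usage:
#   for f in SRR4043*_ch4.fq; do
#       if is_fastq "$f"; then echo "$f: fastq"; else echo "$f: not fastq"; fi
#   done
```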

7 Evaluation
Exercise 2:
3. Count the number of sequences in each fastq file using the commands you learnt last time.
   Tip: Use grep
4. Convert the fastq files to fasta.
   Tip: Use fastq_to_fasta -h to see help. Use Google if you are stuck.
   Now count the number of sequences in each fasta file and see whether the number of sequences has changed.
5. Explore other tools in the FASTX toolkit.
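A caveat for step 3: counting with grep -c '^@' can overcount, because @ is also a valid quality character and a quality line may start with it. Dividing the line count by four is safer when records are not wrapped. The sketch below shows that count and an awk stand-in for the conversion in step 4; it is not the FASTX implementation, and both function names are hypothetical. As far as I know, fastq_to_fasta drops reads containing N unless -n is given, which is one reason the counts may differ; check fastq_to_fasta -h to confirm.

```shell
# count_fastq_reads FILE: number of records, assuming 4 lines per record.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# fastq_to_fasta_awk FILE: plain awk conversion (a stand-in, not the FASTX tool).
# Keeps every read, including reads that contain N.
fastq_to_fasta_awk() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # @header becomes >header
         NR % 4 == 2 { print }                     # sequence line is kept as is
        ' "$1"
}

# Usage:
#   count_fastq_reads SRR404331_ch4.fq
#   fastq_to_fasta_awk SRR404331_ch4.fq > SRR404331_ch4.fa
#   grep -c '^>' SRR404331_ch4.fa
```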

8 Evaluation: Sequence Quality [figure: good example]

9 Evaluation: Sequence Quality [figure: good and poor examples]

10 Evaluation: Sequence Quality [figure: Pacific Biosciences example]

11 Evaluation: Sequence Content [figure: good example]

12 Evaluation: Sequence Content [figure: good and poor examples]

13 Evaluation: Duplication [figure: good example]

14 Evaluation: Duplication [figure: good and poor examples]

15 Evaluation: Overrepresented Sequences [figure: good example]

16 Evaluation: Overrepresented Sequences [figure: good and poor examples]

17 Evaluation: Kmer content [figure: good example]

18 Evaluation: Kmer content [figure: good and poor examples]

19 Evaluation
Exercise 3:
1. Type fastqc to start the FastQC program. Load the four fastq sequence files in the program.
Question 3.2: How many sequences are there per file according to FastQC?
Question 3.3: What is the length range of these reads?
Question 3.4: What is the quality score range of these reads? Which file looks best quality-wise?
Question 3.5: Do these samples have overrepresented reads?
Question 3.6: Looking at the kmer content, do you think that the samples contain an adaptor?
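The answers to Questions 3.3 and 3.4 can be cross-checked on the command line. The sketch below scans a fastq file once and reports the minimum and maximum read length and Phred score. It assumes unwrapped four-line records and Phred+33 quality encoding (an assumption; older Illumina data may use Phred+64), and fastq_ranges is a hypothetical helper name.

```shell
# fastq_ranges FILE: print min/max read length and min/max Phred score.
# Assumes 4-line records and Phred+33 encoding (ASCII code minus 33).
fastq_ranges() {
    awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 2 {                                # sequence line
            n = length($0)
            if (minlen == "" || n < minlen) minlen = n
            if (n > maxlen) maxlen = n
        }
        NR % 4 == 0 {                                # quality line
            for (i = 1; i <= length($0); i++) {
                q = ord[substr($0, i, 1)] - 33
                if (minq == "" || q < minq) minq = q
                if (q > maxq) maxq = q
            }
        }
        END { print "length", minlen, maxlen; print "phred", minq, maxq }
    ' "$1"
}

# Usage:
#   fastq_ranges SRR404331_ch4.fq
# FastQC can also be run without the GUI by passing file names, e.g.:
#   fastqc SRR404331_ch4.fq
```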

20 Preprocessing
Goal: Trim the low quality ends of the reads and remove the short reads.
Data: data for two tomato ripening stages, /home/bioinfo/data/slch04_demo.tar.gz
Tools:
fastq-mcf (command line tool to process reads)
FastQC (gui, to calculate several stats for each file)

21 Preprocessing
Exercise 4:
Download the file adapters1.fa from ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/rnaseqcorpoica/adapters1.fa
Run the read processing program over each of the samples using:
Min. qscore of 30
Min. length of 40 bp
Tip: Use fastq-mcf -h to see help
Type fastqc to start the FastQC program. Load the four new fastq sequence files. Compare the results with the previous samples.
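One way to script this exercise is sketched below. The fastq-mcf flags are written as I understand them from its help text (-q for the quality threshold, -l for the minimum remaining length, -o for the output file); confirm them with fastq-mcf -h before relying on this. The function names trim_reads and count_short_reads, and the .trimmed.fq output naming, are made up for illustration. The second function is a plain awk sanity check that no read shorter than the minimum survived.

```shell
# trim_reads ADAPTERS IN.fq OUT.fq
# Thresholds from the exercise: min. qscore 30, min. length 40 bp.
# Flag meanings are an assumption to verify with: fastq-mcf -h
trim_reads() {
    fastq-mcf -q 30 -l 40 -o "$3" "$1" "$2"
}

# count_short_reads MIN_LEN FILE: number of reads shorter than MIN_LEN.
# Assumes 4-line fastq records; prints 0 when none are found.
count_short_reads() {
    awk -v min="$1" 'NR % 4 == 2 && length($0) < min { n++ }
                     END { print n + 0 }' "$2"
}

# Usage:
#   for f in SRR4043*_ch4.fq; do
#       trim_reads adapters1.fa "$f" "${f%.fq}.trimmed.fq"
#       count_short_reads 40 "${f%.fq}.trimmed.fq"    # expect 0
#   done
```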

22 Bonus Material
Here are some simple exercises using the file from TAIR: /home/bioinfo/data/tair10_pep.fasta.gz
Number of unknown proteins per chromosome
Number of proteins per chromosome that are not unknown proteins and are on the forward strand
Number of proteins located within the first 5 Mb of the chromosome (awk)
All genes of length greater than 2000 bp (awk)
All proteins of length greater than 200 aa (awk)
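As a starting point for the first bonus exercise, here is a sketch that counts unknown proteins per chromosome. It assumes the header layout ">ID | Symbols: ... | description | chrN:start-end STRAND LENGTH=n" and that unannotated entries carry the description "unknown protein"; check both against the real file with head before trusting the counts. The function name unknown_per_chr is hypothetical.

```shell
# unknown_per_chr: read FASTA on stdin, print "<chromosome> <count>" for
# headers whose description field contains "unknown protein".
# Assumed header layout (verify with head):
#   >ID | Symbols: ... | description | chrN:start-end STRAND LENGTH=n
unknown_per_chr() {
    awk -F'|' '
        /^>/ && tolower($3) ~ /unknown protein/ {
            loc = $4
            sub(/^[ \t]+/, "", loc)       # drop the space after the "|"
            split(loc, part, ":")         # part[1] is the chromosome name
            count[part[1]]++
        }
        END { for (c in count) print c, count[c] }
    ' | sort
}

# Usage:
#   gzip -dc /home/bioinfo/data/tair10_pep.fasta.gz | unknown_per_chr
```

The remaining exercises follow the same pattern: the strand and the coordinates sit in the same fourth field, so they can be tested inside the same awk block.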
