Molecular Index Error correction

Size: px

Start display at page:

Download "Molecular Index Error correction"

Melina Davidson
5 years ago
Views:

1 Molecular Index Error correction Overview: This section provides directions for generating SSCS (Single Strand Consensus Sequence) reads and trimming molecular indexes from raw fastq files. Learning Objectives: Use python script to generate SSCS Reads. Use flexbar to trim molecular indexes from duplex seq libraries. Tutorial: SSCS Reads Since this will take longer we will first use First we want to generate SSCS reads where we take advantage of the molecular indexes added during library prep. For the purpose of this tutorial, the paired end sequencing of sample DED110 has been placed in the $BI/gva_course /mixed_population directory. To do so we will use a "majority rules" python script (named SSCS_DCS.py) which was heavily modified by DED from a script originally created by Mike Schmitt and Scott Kennedy for the original duplex seq paper. This script can be found in the $BI/scripts directory. Invoking the script is as simple as typing SSCS_DCS.py; adding -h will give a list of the available options. The goal of this command is to generate SSCS reads, for any molecular index where we have at least 2 reads present, and to generate a log file which will tell us some information about the data. Click here for solution of how to copy the DED110 fastq files to a new directorry called BDIB_Error_Correction cds mkdir BDIB_Error_Correction cd BDIB_Error_Correction cp $BI/gva_course/mixed_population/DED110*.fastq. Interrogate the SSCS_DCS.py script to determine how to invoke it. Click here for hints before the answer You can often get more information about python scripts by typing the name of the script followed by the -h command.

2 If you see an error message check here for a probable solution Traceback (most recent call last): File "/corral-repl/utexas/bioiteam/bin/sscs_dcs.py", line 45, in <module> import numpy / init.py", line 170, in <module> from. import add_newdocs /add_newdocs.py", line 13, in <module> from numpy.lib import add_newdoc /lib/ init.py", line 8, in <module> from.type_check import * /lib/type_check.py", line 11, in <module> import numpy.core.numeric as _nx /core/ init.py", line 46, in <module> from numpy.testing import Tester /testing/ init.py", line 13, in <module> from.utils import * /testing/utils.py", line 15, in <module> from tempfile import mkdtemp File "/opt/apps/python/epd/7.3.2/lib/python2.7/tempfile.py", line 34, in <module> from random import Random as _Random File "/opt/apps/python/epd/7.3.2/lib/python2.7/random.py", line 47, in <module> from os import urandom as _urandom ImportError: cannot import name urandom # this error message seems to be being caused by the default version of python which is loaded on login from your.profile module swap python/2.7.3-epd python/2.7.1 # will swap 1 version of python for another. If you start having other problems with other programs you may need to switch back to the other version

3 The -h command should show you these options as being the key options to use/consider -f1 FASTQ1, --fastq1 FASTQ1 fastq read1 file to check -f2 FASTQ2, --fastq2 FASTQ2 fastq read2 file to check -p PREFIX, --prefix PREFIX prefix for output files -s, --SSCS calculate SSCS sequence, off by default. IF DCS specificed, automatically on -m MINIMUM_READS, --minimum_reads MINIMUM_READS minimum number of reads needed to support SSCS reads --log LOG name of output log file Using that information, see if you can figure out how to put the command together SSCS_DCS.py -f1 DED110_CATGGC_L006_R1_001.fastq -f2 DED110_CATGGC_L006_R2_001.fastq -p DED110 -s -m 2 --log SSCS_Log This should take ~15 minutes to complete in an idev shell. While this is completing we suggest generating.fastq files where the molecular index has been trimmed from the read to familiarize yourself with flexbar and to set up input files for a more advanced optional breseq tutorial. Error correction evaluation: The SSCS_Log is a great place to start. Use the tail command to look at the last 8 lines of the log file to determine how many reads made it from raw reads to error corrected SSCS reads. tail -n 8 SSCS_Log help with the tail command Approximately what fraction of raw reads became SSCS reads? There are approximately 1/6th as many SSCS reads as raw reads: Total Reads: SSCS count: While this is somewhat misleading as it takes a minimum of 2 reads to generate a single SSCS read, we do have some additional information regarding what happened to the other reads. The first thing is to consider is the "Dual MI Reads" these represent the reads which correctly had the 12bp of degenerate sequence and the 4bp anchor. In this case, more than 1.5 million reads lacked an identifyable molecular index on read 1 and/or read 2. By that regard, we had ~1/4 as many SSCS reads as raw reads. Perhaps more interesting is the number of errors removed. This is also available in the SSC_Log file, but in the middle of the file and don't have any good handle to grep with. One option is to cat the entire file and scroll around, another is to use tail/head commands you can get the specific lines only: tail -n 94 SSCS_Log head -n 86 help with the tail command

4 The 3 columns are the read posistion, the number of bases changed, and the number of bases not changed. If you copy and paste these 3 columns into excel you can easily calculate the sum of the 2nd column to see that 446,104 bases were changed. The read position is based on the 5-3' sequence, and you should notice that generally the higher the read position, the more errors were corrected. This should make sense based on what we have talked about with decreasing quality scores as read length increases. Tutorial (Trimmed Reads): For the purpose of this tutorial, we will be working with flexbar which like breseq is something that we have installed in the BioITeam as it is not a tacc module. Test that flexbar is installed and working correctly using the which command and 'flexbar -h'. If these commands work, note that this is the perfect example of the benefits of the BioITeam. This is has proven to be an extremely difficult command to install and configure and install in the past, leading me to place it in the $BI/bin directory, but it still required additional options (expand the next section to see what those may have been). Yet if it is working it strongly suggests some other member of the BioITeam found it available, and fixed it in some way so it works by default. If error messages are displayed you may need to try the following as has been necessary in previous years Some additional modules must be loaded in order for it to work correctly, and the LD_LIBRARY_PATH variable must be modified as listed below. module swap intel gcc export LD_LIBRARY_PATH=/corral-repl/utexas/BioITeam/flexbar_v2. 23_linux64:$LD_LIBRARY_PATH For this tutorial, it is sufficient to simply type these commands out, if this becomes something you want to do more often, or want to submit as a job, it would be important to add these lines to your.profile so they are loaded each time you log in, and available by default to the compute nodes. If the above commands are executed properly, typing flexbar -h should display a lengthy list of optional arguments which can be used for a variety of purposes. For the purpose of this tutorial, we will only focus on trimming the first 16 bases off each read as this represents the 12 bases of the molecular index and a 4 base constant region. See if you can figure out what the command is based on the help output pay special attention to the -t option. If you need a hint without the answer click the triangle... The following arguments are the ones that are needed to successfully trim the first 16 bases of the sequence: -u, --max-uncalled NUM Allowed uncalled bases (N or.) in reads, default: 0 -x, --pre-trim-left NUM Trim specified number of bases on 5' end of reads before alignment -t, --target STR Prefix for output file names -r, --reads FILE Input file with reads, that may contain barcodes -p, --reads2 FILE Second input file for paired read scenario -f, --format STR Input format of reads: csfasta, csfastq, fasta, fastq, fastqsanger, fastq-solexa, fastq-i1.3, fastq-i1.5, fastq-i1.8 (illumina 1.8+) -d, --length-dist Write length distribution for read output files

5 click here for the answer flexbar -u 100 -x 16 -t trimmed -r DED110_CATGGC_L006_R1_001.fastq -p DED110_CATGGC_L006_R2_001.fastq -f fastq -d In an idev shell this should take less than 5 minutes to complete. Once completed there should be 4 new files, all of which begin with "trimmed" if you took the answer from the above help, or whatever string you entered for the -t argument if you did not use the above help. These 4 files represent the trimmed fastq files, the length distribution. Using the head command, see if you can figure out which file is which. click here for the answer to which file is which the trimmed_1/2.fastq is the trimmed fastq files the trimmed_1/2.fastq.lengthdist is the length distribution file (which should have 2 lines: a header line and a line showing that all of the reads are 85 bp long now Next step: You should now have 3 new.fastq files which we will use to call variants in: DED110_SSCS.fastq, trimmed_1.fastq, and trimmed_2.fastq. You could take these files into a more in depth breseq tutorial that was prepared last year for comparisons of the specific mutations that are eliminated using the error correction (SSCS). Link to other optional tutorial.

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers

Preparation of alignments for variant calling with GATK: exercise instructions for BioHPC Lab computers Data used in the exercise We will use D. melanogaster WGS paired-end Illumina data with NCBI accessions