These will serve as a basic guideline for read prep. This assumes you have demultiplexed Illumina data.


Neil Logan

We have a few different choices for running jobs on DT2; we will explore both interactive and batch jobs here.

We need to alter your profile so that the proper modules are loaded for all of our jobs. Type

$ nano ~/.bashrc.mine

Enter the following two lines in this file:

module load dept/bioinfo
module load bioinfo

Save and exit, then type

$ source ~/.bashrc.mine

INTERACTIVE JOBS

$ salloc -p bioinfo --share -n 5 -t 2:00:00

Quality Control

1) The first thing that has to be done is QC. We will evaluate each of our read files using FastQC.
2) When doing QC you must evaluate each read file separately. Do not assume that every read file is OK if you only checked one or two files.
3) Move to the Raw_reads directory.
4) Load the FastQC module. (See the FastQC documentation for details.)
   $ module load bioinfo/fastqc/
5) Run FastQC. You can get help by running
   $ fastqc -h

6) Run this command to get QC information on these read files:
   $ fastqc SRR _1.fq SRR _2.fq
7) Each read file will produce two output files: <read_name>_fastqc.html and <read_name>_fastqc.zip.
8) Use firefox to view the *.html files. (Note: this requires X forwarding; you must log in with ssh -X for this to work.)
   $ firefox SRR _1_fastqc.html
   This will display the output from FastQC on your screen.
9) Based on this output you can make decisions about the read quality and what needs to be trimmed. Typically each dataset will need its own parameters.

FASTQ_SCREEN

1) Next we will check that the reads contain the data they are supposed to, in this case S. bayanus DNA.
2) Load fastq_screen.
   $ module load bioinfo/fastq_screen/
3) Check the help for running information.
   $ fastq_screen -h

   Fastq Screen - Screen sequences against a panel of databases

   Synopsis
     fastq_screen [OPTION]... [FastQ FILE]...

   Function
     Fastq Screen is intended to be used as part of a QC pipeline. It allows you to take a sequence dataset and search it against a set of bowtie databases. It will then generate both a text and a graphical summary of the results to see if the sequence dataset contains the kind of sequences you expect or not.

   Options
     --help -h   Print program help and exit
     --subset    Don't use the whole sequence file, but create a temporary dataset of this specified number of reads. The dataset created will be of approximately (within a factor of 2) of this size. If the real dataset is smaller than twice the specified size then the whole dataset will be used. Subsets will be taken evenly from throughout the whole original dataset

4) This program works well with threads. We will use five threads.
   $ fastq_screen --threads 5 --subset 1000 SRR _1.fq SRR _2.fq
5) The output of this run includes a *.png image that you can view with firefox.
   $ firefox SRR _1_screen.png
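The --subset rule quoted in the help text above can be checked against a read file before running fastq_screen. The following is a minimal sketch, not part of fastq_screen itself: the function names count_reads and subset_mode are hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per read.

```shell
# Count reads in an uncompressed FASTQ file.
# Assumption: every record is exactly 4 lines (no wrapped sequence lines).
count_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Report whether a given --subset value would sample the file or use all of it.
# Rule taken from the help text: the whole dataset is used when it holds
# fewer than twice the requested number of reads.
subset_mode() {
    local file=$1 subset=$2 reads
    reads=$(count_reads "$file")
    if [ "$reads" -lt $(( subset * 2 )) ]; then
        echo "whole ($reads reads)"
    else
        echo "subset ($reads reads)"
    fi
}
```

For example, subset_mode reads_1.fq 1000 (reads_1.fq is a placeholder file name) reports "whole" for any file with fewer than 2000 reads, which is a hint that --subset has no effect on that file.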

Trimmomatic

There are a lot of trimming tools. This is the one we will use today.

1) We will submit our trimming commands as a batch script.
2) Before that we need to decide how we want to trim. Note: Trimmomatic applies the trimming steps in the order they are given on the command line.
3) In the case of the S. bayanus reads we want to trim the first 10 bp as well as the adapters found by FastQC. Then we will remove any remaining short sequences. Note: Duplicate reads can also be removed at this stage. Normally, for genome assembly, this is something that we would do. However, in the interest of time we will skip this step. Please review the Trimmomatic manual for all of the settings and suggestions.
4) In order to remove the adapters found by FastQC we need to make our own adapter FASTA file.
   $ nano srr_adapt.fasta
   Add these lines:
   >PE1
   TACACTCTTTCCCTACACGACGCTCTTCCGATCT
   >PE1_rc
   AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
   >PE2
   GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
   >PE2_rc
   AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
   I copied the adapters found by FastQC and added the reverse complement of each sequence. Save that file.
5) Now we can build our batch script (make sure to check your file locations).
   $ nano trimmomatic.sh

#!/bin/bash
#SBATCH -p bioinfo
#SBATCH -n 5
#SBATCH --share
#SBATCH -t 1:00:00
#SBATCH --job-name=trimmomatic
#SBATCH --mail-type=all
#SBATCH --mail-user=<your email>

#Main script for job actions
. ~/.profile
module load bioinfo/trimmomatic
cd <your path>

#Check your file locations. They have to be absolute paths or relative
#to the place you cd to above.

#Run the necessary trimmomatic commands.
#This will trim the paired-end files.
time java -jar $TRIM/trimmomatic.jar PE -threads 5 \
    SRR _1.fq SRR _2.fq \
    SRR_1_pe SRR_1_se SRR_2_pe SRR_2_se \
    HEADCROP:10 \
    ILLUMINACLIP:srr_adapt.fasta:2:30:10 \
    SLIDINGWINDOW:4:20 \
    MINLEN:70

6) Submit that script.
   $ sbatch trimmomatic.sh

EXERCISE

In the RNAseq folder there are 4 pairs of reads. Each pair is labeled with a left and right tag. These sequence reads are from S. pombe. Create a batch script that will do both fastqc and fastq_screen on all 4 pairs of reads. Check the outputs to confirm the quality of these datasets.
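Returning to the adapter file built in step 4 of the Trimmomatic section: the reverse-complement entries do not have to be worked out by hand. The sketch below shows one way to generate them with standard command-line tools. The function names revcomp and adapter_records are hypothetical helpers, not part of Trimmomatic or FastQC, and the sketch assumes plain A/C/G/T sequences (any other character, such as N, is passed through unchanged).

```shell
# Reverse-complement a DNA sequence: reverse the string, then swap
# A<->T and C<->G (upper and lower case). Other characters are left as-is.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Print a FASTA record followed by its reverse-complement record,
# using the same >NAME / >NAME_rc naming as srr_adapt.fasta.
adapter_records() {
    local name=$1 seq=$2
    printf '>%s\n%s\n' "$name" "$seq"
    printf '>%s_rc\n%s\n' "$name" "$(revcomp "$seq")"
}
```

Calling adapter_records PE1 TACACTCTTTCCCTACACGACGCTCTTCCGATCT prints the >PE1 and >PE1_rc records; repeating the call for PE2 and redirecting the combined output to a file gives the same four records listed in step 4.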
