RCAC. Job files Example: Running seqyclean (a module)

Andra Walters

1 RCAC Job files
Why?
When you log into an RCAC server you are using a special server designed for multiple users. This is called a frontend node (or sometimes a head node). There are (I think) three frontend nodes, and they are often very busy.
Frontend node: edit files, send mail, back up data, compile programs. No computing.
The other nodes are called compute nodes. They are allocated and run by a system called PBS/Torque. The preferred way to use PBS is to submit a job file with the command qsub.
When you run a job with qsub, all of the normal output (STDOUT) and error output (STDERR) is sent to files called jobname.o<jobnumber> and jobname.e<jobnumber>, respectively. For example: check_clip_man.o<jobnumber> and check_clip_man.e<jobnumber>.

2 RCAC Job files
Example: Running seqyclean (a module)

    #!/bin/sh -l
    #PBS -N seqyclean_monpu1
    #PBS -q scholar
    #PBS -l nodes=1:ppn=16
    #PBS -l walltime=168:00:00

    module load seqyclean
    cd $PBS_O_WORKDIR
    pwd
    cat seqyclean.job
    date +"%d %B %Y %H:%M:%S"
    echo " "

    seqyclean -t 16 \
        -1 ../../data/Monpu1.genome.rawReads.r1.fq \
        -2 ../../data/Monpu1.genome.rawReads.r2.fq \
        -v adapter.fa \
        -qual \
        -minimum_read_length 30 \
        -o Monpu1.genome.rawReads.seqyclean.stats \
        > seqyclean.log

    echo " "
    date +"%d %B %Y %H:%M:%S"

Notes:
#!/bin/sh -l: the shebang tells unix this is a shell file. It could be a Perl file.
#PBS -N: job name (seen in qstat).
#PBS -q: queue. Use scholar unless otherwise instructed.
#PBS -l nodes=1:ppn=16: number of nodes and CPUs (ppn) to reserve. Usually ppn will be 1 or 16 on scholar.
#PBS -l walltime: maximum wall-clock time the job will run. The scholar queue is limited to 168 hours.
$PBS_O_WORKDIR is a predefined symbol that means the directory from which you submitted the job with qsub.
pwd (optional) prints the directory after the cd; useful for debugging.
cat <filename> copies the command file to the output.
The date and echo commands write the date into the output.
The backslash, \, is a line continuation character in unix. It makes it easier to write and understand very long command lines.
The greater-than symbol, >, redirects output in unix, i.e., everything written to STDOUT is sent to the file seqyclean.log.

3 RCAC Job files

    #PBS -N seqyclean_monpu1
    #PBS -q scholar
    #PBS -l nodes=1:ppn=16
    #PBS -l walltime=168:00:00

PBS commands can also be entered on the command line when you run qsub:

    qsub -N seqyclean_monpu1 -q scholar -l nodes=1:ppn=16 -l walltime=168:00:00 seqyclean.job

I like the PBS commands in the job file:
so I have a record
it's easy to make a mistake on the command line
can save a lot of work by copying from old jobs
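The equivalence between the #PBS lines and the qsub options can be sketched in a few lines of shell. This is only an illustration: pbs_to_qsub is a hypothetical helper (not an RCAC or Torque tool), the job file is a cut-down copy of the seqyclean example written to a temporary directory, and the command is printed rather than submitted.

```shell
#!/bin/sh
# Sketch: turn the "#PBS" directive lines of a job file into the equivalent
# qsub command line. pbs_to_qsub is a hypothetical helper name.
pbs_to_qsub() {
    jobfile=$1
    # Keep only lines that start with "#PBS", strip that prefix, and join
    # the remaining option strings on one line.
    opts=$(sed -n 's/^#PBS[[:space:]]\{1,\}//p' "$jobfile" | tr '\n' ' ')
    # Print the command instead of running it, so it can be inspected first.
    printf 'qsub %s%s\n' "$opts" "$jobfile"
}

# Usage: build a small job file in a temporary directory and show the command.
tmpdir=$(mktemp -d)
cat > "$tmpdir/seqyclean.job" <<'EOF'
#!/bin/sh -l
#PBS -N seqyclean_monpu1
#PBS -q scholar
#PBS -l nodes=1:ppn=16
#PBS -l walltime=168:00:00
module load seqyclean
EOF
pbs_to_qsub "$tmpdir/seqyclean.job"
```

Printing the command first is a deliberate choice: it can be checked by eye before anything is sent to the queue.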

4 RCAC job files
Job files

    # (header stuff removed for example)
    ~/src/btrim/btrim64 \
        -3 \
        -p adapter2.fa \
        -t ../monpu1.genome.rawreads.fastq \
        -o Monpu1.trimmed \
        -s Monpu1.btrim.summary \
        >btrim.log

    echo " "
    date +"%d %B %Y %H:%M:%S"

    # Btrim64: -q -p <pattern file> -t <fastq file> -o <trim file> [-u 5'-error -v 3'-error -l minlen -b <5'-cut> -e <3'-cut> \
    #          -w <window> -a <average> -f <5'-trim> -I]
    #
    # Required for pattern trimming:
    #   -p <pattern file>    each line contains one pair of 5'- and 3'-adaptors; ignored if -q in effect
    #   -t <sequence file>   fastq file to be trimmed
    #   -o <output file>     fastq file of trimmed sequences
    #
    # Required for quality trimming (-q in effect):
    #   -t <sequence file>   fastq file to be trimmed
    #   -o <output file>     fastq file of trimmed sequences
    #
    # Optional:
    #   -q                   toggle to quality trimming [default=adaptor trimming]
    #   -3                   3'-adaptor trimming only [default=off]
    #   -P                   pass if no adaptor is found [default=off]
    #   -Q                   do a quality trimming even if adaptor is found [default=off]
    #   -s <summary file>    detailed trimming info for each sequence
    #   -u <5'-error>        maximum number of errors in 5'-adaptor [default=3]
    #   -v <3'-error>        maximum number of errors in 3'-adaptor [default=4]
    #   -l <minimal length>  minimal insert size [default=25]
    #   -b <5'-range>        the length of sequence to look for 5'-adaptor at the beginning of the sequence [default=1.3 x adaptor length]
    #

Notes:
~/src/btrim/btrim64: this file is in /home/mgribsko/src. In unix, ~ is a symbol for your home directory. ~<username>, for instance ~mgribsko, is a symbol for the named user's home directory.
I often copy the help for the command into the job file as a comment. Comments begin with #. This makes it much easier to change the command later.
Notice that the PBS commands are comments as far as unix is concerned.

5 Sequencing Basics Genome Size

6 Sequencing Basics Illumina Sequencing

7 Sequencing Basics Illumina TruSeq adapters
Figure labels: Index, Universal, barcode
Sequences shown: TCGATCGGAAGAGC, GCTCTTCCGATCT

8 Sequencing Basics Illumina TruSeq System
Figure labels: universal adapter, Primer
Vocabulary: paired-end, mate-pair, contig, scaffold, insert, primer, bar code, index, adapter, read, coverage, consensus
Ekblom, fig 2 (partial)

9 Sequencing Basics Illumina process
1. Bind polymerase primer
2. Add one base (fluorescent); the base is chemically blocked and cannot be extended
3. Detect
4. Unblock, return to 2

Sequencing Basics Fastq format
Header fields: instrument:run:flowcell:lane:tile:x:y pair:filtered:control:bar-code
@HISEQ02:319:C22FKACXX:2:1101:1699:1972 1:N:0:GTAGAG

10 Sequencing Basics Fastq format
instrument:run:flowcell:lane:tile:x:y 1:N:0:GTAGAG
GACCCATCCATTGTTGGACAGCTGAAGACGGGACGATCGTGCTCGTGTTTTGAATGCGAGAATCCCTGCAGAGGCTGCCTGCTTCGGNNNNNNNNNNTCCTCGACAG
+
CCCFFFFFHHHHHJIJJJJGIJJJJJJJJJJJIIJIJJJIIJIIHAFGIJJEHHHHFFFDCDDDDDDCDDDDDDBBDDDDDDCCDDB##########++28<<@BB>
I = ASCII 73: $Q = 73 - 33 = 40$. Since $Q = -10 \log_{10} \varepsilon$, $\varepsilon = 10^{-4}$.
# = ASCII 35: $Q = 35 - 33 = 2$, so $\varepsilon = 10^{-0.2} = 0.63$ = totally bogus.
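The quality arithmetic on this slide can be checked with a short shell sketch. It assumes the Phred+33 encoding (the same offset as the phred33 setting in the bowtie2 command later in the deck); the function names phred_quality and error_probability are made up for illustration.

```shell
#!/bin/sh
# Sketch of the Phred+33 arithmetic; function names are illustrative.

# Quality score of one FASTQ quality character: ASCII code minus 33.
phred_quality() {
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

# Error probability for a quality score Q: 10^(-Q/10).
error_probability() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", 10 ^ (-q / 10) }'
}

phred_quality I          # quality of the character I
error_probability 40     # error probability at Q = 40
phred_quality '#'        # quality of the character #
error_probability 2      # error probability at Q = 2
```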

11 Sequencing Basics Read quality
Base calling
  phasing (no base synthesized)
  pre-phasing (two or more bases synthesized)
  crosstalk
Quality predictors vs empirical data (PhiX174)
  intensity profile
  signal to noise ratio
David Jenkins on Sep 13, 2011

12 Genome Assembly Adapter trimming
I have tried many methods:
AdapterRemoval
AlienTrimmer
Btrim
Cutadapt
Fastx_clip
Fastqmcf
Flexbar
Reaper
Scythe
Seqprep
Seqyclean
Skewer
Trimmomatic

13 Genome Assembly Adapter trimming
Quick and dirty test: use grep to check for the first 14 bases of the universal and index adapters, and their reverse complements.
Why 14? Long enough that you don't expect to see (many) matches by chance.
Why quick and dirty?
  Only exact matches will be found
  Quality is not considered
  Matches may be cut off by the end of the read
This test will UNDERESTIMATE the number of adapters.
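A minimal sketch of the quick-and-dirty test, under stated assumptions: revcomp and count_reads_with are hypothetical helper names, the three reads are invented sample data, and the 14-base string from the TruSeq adapter slide is used purely as sample input. Only the sequence line of each four-line FASTQ record is searched, and only exact matches count.

```shell
#!/bin/sh
# Sketch of the quick-and-dirty adapter check. Function names and the
# sample data are illustrative; only exact matches are counted.

# Reverse complement of a DNA string.
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }'
}

# Count reads in a FASTQ file whose sequence line contains the pattern.
# Sequence lines are every 4th line starting at line 2.
count_reads_with() {
    awk 'NR % 4 == 2' "$1" | grep -c -F -e "$2" || true
}

# Invented sample data: read1 carries the pattern, read3 its reverse complement.
tmpfq=$(mktemp)
cat > "$tmpfq" <<'EOF'
@read1
AAAATCGATCGGAAGAGCTTTT
+
IIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTAC
+
IIIIIIIIIIIIIIIIIIIIII
@read3
CCGCTCTTCCGATCGAGG
+
IIIIIIIIIIIIIIIIII
EOF

adapter=TCGATCGGAAGAGC
count_reads_with "$tmpfq" "$adapter"
count_reads_with "$tmpfq" "$(revcomp "$adapter")"
```

Restricting the search to sequence lines avoids accidental hits in header or quality lines; a plain grep over the whole file would be even quicker and slightly dirtier.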

14 Quality and Cleaning Adapter trimming
Column headings: Total reads; Index adapters (Forward, Reverse); Universal adapters (Forward, Reverse); Adapters remain; % remain
Rows (only the final percentage of each row is present; the other values are missing):

Monpu1.genome.rawReads.r1.fq
Monpu1.genome.rawReads.r2.fq
Monpu1.genome.rawReads.both.fq
Monpu1.genome.filteredReads.fastq   34.11%
adapterremoval                       3.16%
alientrimmer                         5.46%
cutadapt                            65.96%
fastqmcf                            17.39%
flexbar                              1.72%
reaper                               2.66%
scythe                               4.21%
seqprep                              3.26%
skewer                               2.14%
seqyclean all                        0.11%

15 Quality and Cleaning Adapter trimming
Column headings: Total reads; Index adapters (Forward, Reverse); Universal adapters (Forward, Reverse); Adapters remain; % remain
Rows (only the final percentage of each row is present; the other values are missing):

2:15:10 Lead:7 Trail:7 Window:4:13 min_len:30 (no palindrome trimming)
r1 paired
r1 unpaired
r1 total
r2 paired
r2 unpaired
r2 total
trimmomatic all   20.45%

paired r
unpaired r
r1 total           0.11%
paired r
unpaired r
r2 total           0.19%
total              0.15%
2:20:9 Lead:7 Trail:7 Window:4:13 min_len:30

16 Genome Assembly Adapter trimming Group 1- trimmomatic

17 Genome Assembly Adapter trimming

23 Genome Assembly - Data Preprocessing Contaminants
Exogenous
  external contaminants of source material: insects, fungi, bacteria, etc.
  parasites and commensals found in source material
  intercellular and intracellular pathogens: bacteria, viruses, etc.
  laboratory contaminants: E. coli, S. cerevisiae, bacteriophage
Endogenous
  organelles: mitochondria, chloroplast, episomes
  endogenous viruses
  transposons
  ribosomal RNA (RNA-Seq)

24 Genome Assembly - Data Preprocessing Contaminants
Find and remove by mapping reads to known sequences
  "known" sequences are imperfect
  the contaminant may be a different strain or variety
Unknown contaminants: screen final assemblies for outliers
  increasingly difficult the more unique the organism is

25 Genome Assembly Data Preprocessing Other Cleaning
Mitochondrial
Phi-X174
Match to reads using Bowtie2 (or any other mapper)
  use --very-sensitive-local (matches with small gaps)

26 Mapping Read Alignment (mapping)
Find where short sequences (reads) map inside a longer sequence (reference)
  finding overlaps between reads is a similar problem
Brute force
  slide each sequence along the reference and count the number of matches/mismatches
  need to allow for sequence errors
  need to allow for sequence variation
  brute force is too slow
  what about repetitive sequences?
Limitations: memory and/or time
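The brute-force idea can be sketched with awk for a single read; best_hit is a hypothetical helper name, and the sequences are toy input. It slides the read along the reference, counts mismatches at every offset, and keeps the first offset with the fewest mismatches. Gaps, the reverse strand, and repeats are ignored.

```shell
#!/bin/sh
# best_hit READ REFERENCE: slide the read along the reference and report the
# 1-based offset with the fewest mismatches, followed by that mismatch count.
# Illustrative only: forward strand, no gaps, ties go to the first offset.
best_hit() {
    awk -v query="$1" -v ref="$2" 'BEGIN {
        n = length(query); best = -1; pos = 0
        for (start = 1; start + n - 1 <= length(ref); start++) {
            mm = 0
            for (i = 1; i <= n; i++)
                if (substr(query, i, 1) != substr(ref, start + i - 1, 1)) mm++
            if (best < 0 || mm < best) { best = mm; pos = start }
        }
        print pos, best
    }'
}

best_hit ACGT TTACGTTT   # exact match
best_hit ACGA TTACGTTT   # best placement has one mismatch
```

Every offset costs a full comparison of the read, so the work grows with read length times reference length for each read, which is why this approach is too slow for real data.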

27 Mapping Read alignment
Speedups
  consider only limited sequence at the ends
  check for kmers vs kmer index
  hashing

28 Mapping MAQ
Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18.
Find the ungapped match with the lowest mismatch score (sum of qualities at mismatched bases)
  only consider positions with < 2 mismatches in the first 28 bases
  paired reads where the mate is mapped are re-searched using a gapped alignment algorithm
Each alignment has a quality score: the probability that the true alignment is not the one reported
Only one alignment is reported; if there are multiple equally good mappings, one is chosen at random (mapping quality = 0)

29 Mapping MAQ
Read all reads into memory
  for each pair of non-contiguous seeds, e.g., and (for 8 bases)
  calculate hash index; the result is a 24-bit integer
  check only the first 28 bases
Scan the reference base by base (forward and reverse)
  take each 28 base sequence and convert to hash index
  hits with the same index are potential matches
  calculate the sum of qualities of mismatched bases over the whole read
Fig 1. Flicek & Birney, 2009
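How a short seed becomes an integer index can be sketched as follows. This is an assumption-laden illustration, not MAQ's actual code: the 2-bits-per-base encoding (A=0, C=1, G=2, T=3) and the name kmer_to_index are chosen here for the example. With 2 bits per base, 12 bases fill exactly 24 bits.

```shell
#!/bin/sh
# Pack a DNA k-mer into an integer, 2 bits per base (A=0, C=1, G=2, T=3).
# The encoding is an assumption for illustration, not taken from MAQ.
kmer_to_index() {
    printf '%s\n' "$1" | awk '{
        code["A"] = 0; code["C"] = 1; code["G"] = 2; code["T"] = 3
        index_value = 0
        for (i = 1; i <= length($0); i++) {
            base = substr($0, i, 1)
            # Bases outside ACGT (e.g. N) cannot be encoded in 2 bits.
            if (!(base in code)) { print "NA"; exit }
            index_value = index_value * 4 + code[base]
        }
        print index_value
    }'
}

kmer_to_index ACGT            # small example
kmer_to_index TTTTTTTTTTTT    # 12 bases: the largest 24-bit value
```

Reads and reference windows that produce the same integer are candidate matches, so the expensive base-by-base comparison is only done for those.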

30 Mapping MAQ Mapping quality
Base quality tells us the probability that a base is incorrect.
The probability that a mapped read is correct is the probability that the mismatched bases are sequencing errors
  if sequencing quality is high, all mapped reads with mismatches are likely to be errors
  assume errors are independent
If I have two mismatches with quality 10 (P=0.1) and 20 (P=0.01), the probability that the read truly matches the reference and the differences are simply errors is 0.1 x 0.01 = 0.001, the probability of the two sequencing errors occurring in the same sequence.
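The arithmetic on this slide generalizes to any number of mismatches: with independent errors, the product of the individual error probabilities equals 10 raised to minus one tenth of the summed qualities. A small sketch, with mismatch_error_probability as a hypothetical function name:

```shell
#!/bin/sh
# Probability that all listed mismatches are sequencing errors, assuming
# independence: the product of 10^(-Q/10) over the mismatched bases,
# computed as 10^(-sum(Q)/10). Arguments are the base qualities.
mismatch_error_probability() {
    printf '%s\n' "$@" |
        awk '{ sum += $1 } END { printf "%.6f\n", 10 ^ (-sum / 10) }'
}

mismatch_error_probability 10 20   # the two-mismatch example from the slide
```

Summing qualities instead of multiplying probabilities is the same trick MAQ's mismatch score relies on: a lower sum of mismatched-base qualities means a more plausible alignment.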

31 Mapping Bowtie
very fast, 100x MAQ, low memory
based on the Burrows-Wheeler transform with FM indexing
  also BWA, SOAP2, etc.
quality aware
allows mismatches (backtracking strategy)
  when no exact match is found, "select an already-matched position and substitute a different base", then resume matching after the substituted position
  select low quality positions preferentially
  substitution must create a matching suffix
  try to find the mapping with minimal quality value
options in bowtie/bowtie2
  default: report only the best alignment; reads with multiple positions may have quality zero
  -k report up to k alignments
  -a report all alignments

32 Mapping Read alignment comparison Simulated data 100k reads 100 bp se

33 Mapping Mapped reads
SAM = Sequence Alignment/Map format
BAM = Binary Alignment/Map format
BAM is much smaller, but you can't read a BAM file directly; use samtools, picard, or a similar program.

Example records (fragments):
2:1101:21154: A_terreus_NIH S132M4S = AS:i:157 XS:i:170 XN:i:0 XM:i:15 XO:i:0 XG:i:0 NM:i:15
2:1101:19460: A_terreus_NIH S108M = AAGAT...TATTA CCCFF...DDDDE AS:i:127 XS:i:133 XN:i:0 XM:i:12 XO:i:0 XG:i:0 NM:i:12 MD:Z:11A4G16A8T1A17T0C0A3T0T14
2:1101:10080: A_terreus_NIH S50M = TATTT...TATTT CCCFF...EDDE@ AS:i:65 XS:i:65 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:11A4G16A8T1A5
2:1101:3670: A_terreus_NIH S51M = AS:i:67 XS:i:67 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:11A4G16A8T1A6

Header (fragments):
VN:1.0
SN:A_terreus_NIH2624
ID:bowtie2 PN:bowtie2 VN:2.2.3 CL:"/group/bioinfo/apps/apps/bowtie /bowtie2-align-s --wrapper basic-0 --very-sensitive-local -a --maxins phred33 -p 16 -x mito

Fields, in order: Read ID, Bitwise flag, Reference ID, Position, Mapping quality, CIGAR string, Mate ID, Mate Pos, Inferred Length, Sequence, Quality, Optional Fields

CIGAR
  M match
  I insertion relative to reference
  D deletion relative to reference
  S clipped from read sequence
  also N, H, P, =, X

34 Mapping Mapped Reads SAM
Bitwise flag: each bit in the integer has a meaning

Hex      Octal   Decimal   Meaning
0x0001       1         1   Read is paired
0x0002       2         2   Read properly mapped
0x0004       4         4   Query is unmapped
0x0008      10         8   Mate is unmapped
0x0010      20        16   Query strand
0x0020      40        32   Mate strand
0x0040     100        64   First read in pair
0x0080     200       128   Second read in pair
0x0100     400       256   Secondary alignment
0x0200    1000       512   Fails platform/vendor checks
0x0400    2000      1024   Duplicate
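Decoding a flag is just a matter of testing each bit. A minimal sketch, where explain_flag is a hypothetical helper and the bit meanings are the ones listed on this slide:

```shell
#!/bin/sh
# Print the meaning of each bit that is set in a SAM flag.
# The argument may be decimal (13) or 0x-prefixed hex (0x0d).
explain_flag() {
    flag=$(($1))
    while read -r bit meaning; do
        if [ $((flag & bit)) -ne 0 ]; then
            echo "$meaning"
        fi
    done <<'EOF'
1 read is paired
2 read properly mapped
4 query is unmapped
8 mate is unmapped
16 query strand
32 mate strand
64 first read in pair
128 second read in pair
256 secondary alignment
512 fails platform/vendor checks
1024 duplicate
EOF
}

explain_flag 13   # 1 + 4 + 8
```

For flag 13 this lists the paired, query-unmapped, and mate-unmapped bits, which is why a filter on 13 picks out pairs where neither read mapped.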

35 Mapping Mapped reads SAM
CIGAR String: compressed alignment
M+I+S+=+X must equal the length of the sequence
  42S108M     perfect match but clipped
  14S132M4S   clipped on both ends
  18M2D19M    18 match, 2 base deletion, 19 match

Letter   Meaning
M        Alignment match or mismatch
I        Insertion to the reference
D        Deletion from the reference
N        Skipped region in reference (e.g., intron)
S        Soft clipping (clipped bases present in sequence)
H        Hard clipping (clipped bases not present in sequence)
P        Padding
=        Sequence match
X        Sequence mismatch
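The length rule can be checked mechanically. The sketch below sums the operations that consume read bases (M, I, S, =, X) and ignores the rest; cigar_query_length is a hypothetical helper name, and the three CIGAR strings are the examples from this slide.

```shell
#!/bin/sh
# Sum the lengths of the CIGAR operations that consume read bases
# (M, I, S, =, X). D, N, H and P do not add to the sequence length.
cigar_query_length() {
    printf '%s\n' "$1" | awk '{
        s = $0; total = 0
        # Peel off one <count><operation> pair at a time from the front.
        while (match(s, /^[0-9]+[MIDNSHP=X]/)) {
            n = substr(s, 1, RLENGTH - 1) + 0
            op = substr(s, RLENGTH, 1)
            if (index("MIS=X", op) > 0) total += n
            s = substr(s, RLENGTH + 1)
        }
        print total
    }'
}

cigar_query_length 42S108M
cigar_query_length 14S132M4S
cigar_query_length 18M2D19M
```

The two clipped examples both come to 150 bases, while the deletion in 18M2D19M adds nothing to the read length, giving 37.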

36 Mapping Mapped reads
Optional fields: see the SAM format specification or the aligner manual
  AS  alignment score generated by aligner
  XS  alignment score for the best-scoring alignment found other than the alignment reported
  XN  the number of ambiguous bases in the reference covering this alignment
  XM  the number of mismatches in the alignment
  XO  the number of gap opens, for both read and reference gaps
  XG  the number of gap extensions, for both read and reference gaps
  NM  the edit distance; that is, the minimal number of one-nucleotide edits
  MD  a string representation of the mismatched reference bases in the alignment
  YS  score of the paired read
  YT:Z  alignment type
    UU  read was not part of a pair
    CP  part of concordant pair
    DP  part of discordant pair
    UP  part of pair but failed to align

Example (fragments):
e /bowtie2-align-s --wrapper basic-0 --very-sensitive-local -a --maxins phred33 -p 16 -x mitochondria -1../raw/Monpu1.genome.rawReads.r
CCFF...DDDDE AS:i:127 XS:i:133 XN:i:0 XM:i:12 XO:i:0 XG:i:0 NM:i:12 MD:Z:11A4G16A8T1A17T0C0A3T0T14C7C15 YS:i:146 YT:Z:CP
CCFF...DCC>A AS:i:157 XS:i:170 XN:i:0 XM:i:15 XO:i:0 XG:i:0 NM:i:15 MD:Z:11A4G16A8T1A17T0C0A3T0T14C7C17A0T10T9 YS:i:167 YT:Z:CP
CCFF...EDDE@ AS:i:65 XS:i:65 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:11A4G16A8T1A5 YS:i:142 YT:Z:CP
CCFF...CDCDB AS:i:67 XS:i:67 XN:i:0 XM:i:5 XO:i:0 XG:i:0 NM:i:5 MD:Z:11A4G16A8T1A6 YS:i:152 YT:Z:CP

37 Mapping Mapped reads samtools
  view          examine and extract reads from SAM or BAM files
  sort          sort reads by position or name
  merge         combine multiple SAM or BAM files
  mpileup       examine sequences aligned at a position
  index/faidx   index SAM/BAM or reference

38 Mapping Mapped reads samtools
samtools view
  Convert between SAM and BAM
  Select reads that match or do not match the reference
  Count matches
  Select all reads where neither the read nor its mate matches the reference: -f 13
  Select all read 1 that are paired: -f 65
  Select read 2 (paired bit not required): -f 128

39 Mapping Mapped reads samtools
Converting from SAM to BAM is slow, and SAM takes lots of disk space. But Bowtie2 writes SAM output.
Solution: use unix pipes to samtools.
Note the continuation characters, \, and pipe characters, |. The value given to --maxins is missing here and is shown as <maxins>.

    #!/bin/sh -l
    #PBS -N bowtie_monascus_mt
    #PBS -q scholar
    #PBS -l nodes=1:ppn=16
    #PBS -l walltime=120:00:00

    module load samtools
    module load bowtie2
    cd $PBS_O_WORKDIR

    bowtie2 --very-sensitive-local -a --maxins <maxins> --phred33 -p 16 -x mitochondria \
        -1 ../raw/Monpu1.genome.rawReads.r1.fq \
        -2 ../raw/Monpu1.genome.rawReads.r2.fq \
        | samtools view -us - \
        | samtools sort - mitochondrial_raw.sorted

    samtools index mitochondrial_raw.sorted.bam

40 Mapping Bowtie output
Monascus vs several fungal mt genomes (read and pair counts are missing; percentages as shown)

reads; of these:
  (100.00%) were paired; of these:
    (96.93%) aligned concordantly 0 times
    (0.11%) aligned concordantly exactly 1 time
    (2.96%) aligned concordantly >1 times
    pairs aligned concordantly 0 times; of these:
      4104 (0.01%) aligned discordantly 1 time
    pairs aligned 0 times concordantly or discordantly; of these:
      mates make up the pairs; of these:
        (99.75%) aligned 0 times
        (0.04%) aligned exactly 1 time
        (0.22%) aligned >1 times

41 Genome Assembly De Bruijn Graphs (from Homolog.us Bioinformatics)

42 Genome Assembly De Bruijn Graph

43 Genome Assembly De Bruijn Graph Repeats

44 Genome Assembly De Bruijn Graph reads

45 Genome Assembly Velvet
One of the first De Bruijn assemblers
Pruning
  tips: a chain of nodes disconnected on one end
    caused by sequencing errors OR coverage gaps
    errors tend to be short (rule: trim if < 2 kmer)
    errors tend to have low multiplicity at the junction
  bubbles: paths that leave and return
    caused by sequence variation (SNPs)
    length/multiplicity rule: shorter, higher multiplicity paths are preferred
  erroneous connections
    duplicate sequences + errors
    errors will have low coverage, but so will areas with low coverage

Genome Assembly. 2 Sept. Groups. Wiki. Job files Read cleaning Other cleaning Genome Assembly
2 Sept Groups
Group 5 was down to 3 people, so I merged it into the other groups.
Group 1 is now 6 people. Anyone want to change?
The initial drafter is not the official leader; use any management structure.