Lecture 12. Short read aligners

Kevin Kennedy
5 years ago

2 Ebola reference genome
We will align Ebola sequencing data against the 1976 Mayinga reference genome. The reference genome and all of its indices will be kept in one directory:
mkdir -p ~/reference/ebola
Get the Ebola genome in FASTA format:
efetch -db=nuccore -format=fasta -id=af > ~/reference/ebola/1976.fa

3 Build an index with bwa
bwa index ~/reference/ebola/1976.fa

4 Align a paired-end dataset
$ fastq-dump -X split-files SRR
$ bwa mem -t 10 ~/reference/ebola/1976.fa SRR _1.fastq SRR _2.fastq > SRR sam

5 SAM The resulting file is in the so-called SAM format. It is one of the more recent bioinformatics data formats and has become the standard way to store and represent high-throughput sequencing alignments. A SAM file encompasses all known information about the sample and its alignment.

6 Help on bwa mem

8 Sequence Alignment Map (SAM)

9 Overview The SAM format is the standard format for storing sequencing reads mapped to a reference. Its binary analog is BAM. The official specification of the SAM format:

10 Overview The SAM format is a TAB-delimited, line-oriented text format with two sections:
1. Header: each line contains some metadata
2. Alignment: each line contains information on an alignment
The SAM format specification lists the required and optional content for each of these sections.
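
The two-section layout can be sketched in a few lines of Python. This is only an illustration, not how samtools reads files: the function name split_sam and the three example lines are hypothetical, and the sketch relies on the fact that header lines start with "@" while read names may not.

```python
def split_sam(text):
    """Split SAM-formatted text into (header_lines, alignment_lines)."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line:
            continue  # ignore empty lines
        if line.startswith("@"):
            header.append(line)      # header section: metadata records
        else:
            alignments.append(line)  # alignment section: one alignment per line
    return header, alignments


# Hypothetical three-line SAM file: two header lines and one alignment.
example = "\n".join([
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:ref1\tLN:1000",
    "read1\t0\tref1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
])
header, alignments = split_sam(example)
print(len(header), "header lines,", len(alignments), "alignment line")
```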

11 Installing samtools
echo 'export PATH=/opt/genomics/samtools-1.3.1:$PATH' >> ~/.bashrc
source ~/.bashrc

12 SAM header Header lines contain vital metadata about the reference sequences, read and sample information, and (optionally) processing steps and comments. Each header line begins with @ followed by a two-letter code that distinguishes the different types of metadata records in the header. Following this two-letter code are tab-delimited key-value pairs in the format KEY:VALUE.

13 SAM header

14 SAM header: @SQ entries store information about the reference sequence. The required key-values are:
1. SN: the sequence name
2. LN: the sequence length
Every separate sequence in your reference has a corresponding entry in the header.

15 SAM header: @RG entries contain important read group and sample metadata.
1. ID: the read group identifier. ID is the only required key, and it must be unique. This ID value contains information about the origin of a set of reads. Some software relies on read groups to indicate technical groups of reads, to account for batch effects. Consequently, it's beneficial to create read groups related to the specific sequencing run.
2. SM: sample information, the metadata about your experiment's samples.
3. PL: the sequencing platform

16 SAM header: @PG entries contain metadata about the programs used to create and process a set of SAM/BAM files. Each program must have a unique ID value, and metadata such as program version number (VN) and the exact command line (CL) can be saved in these header entries. Many programs will add these lines automatically.
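
As a rough sketch of the KEY:VALUE layout described for header records, the Python below turns one header line into its two-letter record type and a dictionary. The helper name parse_header_line and all example values (ref1, run1, sampleA, the bwa command line) are made up for illustration; comment records (@CO) are free text and are handled separately.

```python
def parse_header_line(line):
    """Return (record_type, key_value_dict) for one SAM header line."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0][1:]  # drop the leading '@'
    if record_type == "CO":
        # Comment lines are free text, not KEY:VALUE pairs.
        return record_type, {"comment": "\t".join(fields[1:])}
    # Split on the first ':' only, because values (e.g. CL) may contain ':'.
    pairs = dict(field.split(":", 1) for field in fields[1:])
    return record_type, pairs


# Hypothetical header lines for the three record types discussed above.
sq = parse_header_line("@SQ\tSN:ref1\tLN:1000")
rg = parse_header_line("@RG\tID:run1\tSM:sampleA\tPL:ILLUMINA")
pg = parse_header_line("@PG\tID:bwa\tVN:0.7.17\tCL:bwa mem ref.fa r1.fq r2.fq")
print(sq)
print(rg)
print(pg)
```

Values are kept as strings here; a real parser would convert LN to an integer and validate the required keys for each record type.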

17 samtools The standard way of interacting with SAM/BAM files is through samtools. samtools view is the general tool for viewing and converting SAM/BAM files. A way to look at an entire SAM/BAM header is with samtools view -H.

18 samtools We can see all read groups by searching the header for @RG lines. samtools view without any options returns the entire alignment section without the header.

19 SAM alignment The alignment section contains the read alignments and usually also includes reads that did not align. Each alignment entry consists of 11 required fields, optionally followed by additional fields: 1. QNAME, 2. FLAG, 3. RNAME, 4. POS, 5. MAPQ, 6. CIGAR, 7-8. RNEXT and PNEXT, 9. TLEN, 10. SEQ, 11. QUAL

20 SAM alignment
1. QNAME: the name of the read
2. FLAG: the bitwise flag, which contains information about the alignment
3. RNAME: the name of the reference sequence the query aligned to. The reference name must be in the SAM header as an SQ entry. If the read is unaligned, this entry may be *.
4. POS: the position on the reference sequence, using 1-based indexing, of the first mapping base in the query sequence. This may be zero if the read does not align.

21 SAM alignment
5. MAPQ: the mapping quality, which is a measure of how likely the read is to actually originate from the position it maps to. Mapping quality is estimated by the aligner. Many tools downstream of aligners filter out reads that map with low mapping quality.
6. CIGAR: the CIGAR string, which is a specialized format for describing the alignment.
7-8. RNEXT and PNEXT: the reference name and position of a paired-end read's partner. The value * indicates RNEXT is not available, and = indicates that RNEXT is the same as RNAME. PNEXT will be 0 when not available.

22 SAM alignment
9. TLEN: the template length for paired-end reads
10. SEQ: the read sequence, always stored in the orientation in which it aligned. If the read aligned to the reverse strand (information kept in the bitwise flag field), SEQ is the reverse complement of the original read sequence.
11. QUAL: the original read base qualities.
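
The eleven required fields map naturally onto a small record type. The following Python is a minimal sketch, not a replacement for a SAM library: the Alignment class, the parse_alignment helper, and the example line (read name, positions, NM tag) are all hypothetical.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    qname: str   # 1. read name
    flag: int    # 2. bitwise flag
    rname: str   # 3. reference name, '*' if unaligned
    pos: int     # 4. 1-based leftmost position, 0 if unaligned
    mapq: int    # 5. mapping quality
    cigar: str   # 6. CIGAR string
    rnext: str   # 7. mate reference name ('=' means same as rname)
    pnext: int   # 8. mate position, 0 if unavailable
    tlen: int    # 9. template length
    seq: str     # 10. read sequence as aligned
    qual: str    # 11. base qualities
    tags: tuple  # optional fields, left unparsed


def parse_alignment(line):
    """Parse one tab-delimited SAM alignment line into an Alignment."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return Alignment(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]))


# Hypothetical alignment line with one optional field (NM:i:0).
aln = parse_alignment(
    "read1\t147\tref1\t100\t60\t5M\t=\t20\t-85\tACGTA\tIIIII\tNM:i:0")
print(aln.qname, aln.flag, aln.rname, aln.pos, aln.cigar, aln.tags)
```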

23 Bitwise flags Bitwise flags are a very space-efficient and common way to encode attributes. Bitwise flags are like a series of toggle switches, each of which can be either on or off. The value of each toggle switch is one bit (0 or 1) of a binary number. Each bit in a bit-field represents a particular attribute of an alignment, with 1 indicating that the attribute is true and 0 indicating that it's false.

24 Bitwise flags

25 Bitwise flags As an example, suppose you encounter the bit-flag 147 (0x93 in hexadecimal) and you want to know what it says about the alignment. In binary this number is 10010011, that is 128 + 16 + 2 + 1. This corresponds to the attributes paired-end (1), aligned in proper pair (2), sequence reverse complemented (16), and second read in the pair (128).
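
The same decoding can be sketched in Python with a bitwise AND against each flag bit. The bit values come from the SAM specification and the names follow the ones samtools prints; the helper decode_flag is a hypothetical name used only for this illustration.

```python
# Flag bits from the SAM specification, named as samtools names them.
FLAG_BITS = [
    (0x1, "PAIRED"),
    (0x2, "PROPER_PAIR"),
    (0x4, "UNMAP"),
    (0x8, "MUNMAP"),
    (0x10, "REVERSE"),
    (0x20, "MREVERSE"),
    (0x40, "READ1"),
    (0x80, "READ2"),
    (0x100, "SECONDARY"),
    (0x200, "QCFAIL"),
    (0x400, "DUP"),
    (0x800, "SUPPLEMENTARY"),
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


print(format(147, "b"))   # binary representation of 147
print(hex(147))           # hexadecimal representation of 147
print(decode_flag(147))
```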

26 Primary, secondary and supplementary
Primary alignment: the best alignment by score. If there are multiple alignments that match equally well, the aligner will designate one as the primary alignment.
Secondary alignment: in case of multiple matches, one is designated as primary and all others are secondary.
Supplementary alignment: a read that cannot be represented as a single linear alignment and matches two different locations without significant overlap. These are also known as chimeric alignments. What gets called a chimeric alignment is software dependent.
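
In the flag field these categories are marked by the secondary (0x100) and supplementary (0x800) bits; an aligned record with neither bit set is a primary alignment. A minimal Python sketch, with alignment_type as a hypothetical helper name:

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def alignment_type(flag):
    """Classify a SAM record by its unmapped/secondary/supplementary bits."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


for flag in (147, 256, 2064, 4):
    print(flag, alignment_type(flag))
```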

27 samtools flags Running samtools flags without arguments prints the flag names together with their hexadecimal and decimal values.

28 samtools flags samtools flags can translate decimal and hexadecimal flags:

29 samtools flagstat samtools flagstat generates summary statistics on the flags.

30 Properly aligned pairs
[Figure: pair configurations. Panels: R2 unmapped; R2 mapped to a different chromosome; incorrect orientation; correct orientation but invalid distance]

31 CIGAR strings CIGAR (Compact Idiosyncratic Gapped Alignment Representation) strings are another specialized way to encode information about an aligned sequence. While bitwise flags store true/false properties about an alignment, CIGAR strings encode information about which bases of an alignment are matches/mismatches, insertions, deletions, soft or hard clipped, and so on.

32 CIGAR strings A basic CIGAR string contains concatenated pairs of integer lengths and character operations.

33 CIGAR strings
Operation  Description
M  Alignment match (note that this could be a sequence match or mismatch)
I  Insertion: the nucleotide is present in the read but not in the reference
D  Deletion: the nucleotide is present in the reference but not in the read
N  Skipped region (from the reference)
S  Soft-clipped region (soft-clipped regions are present in the sequence in the SEQ field)
H  Hard-clipped region (not present in the sequence in the SEQ field)
P  Padding
=  Sequence match
X  Sequence mismatch
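
Because a CIGAR string is just a run of length/operation pairs, a regular expression is enough to take it apart. The sketch below is illustrative only; parse_cigar is a hypothetical helper, and the example string is an arbitrary CIGAR built from the operations in the table.

```python
import re

CIGAR_OPS = "MIDNSHP=X"


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if not re.fullmatch(rf"(\d+[{CIGAR_OPS}])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op)
            for n, op in re.findall(rf"(\d+)([{CIGAR_OPS}])", cigar)]


print(parse_cigar("3S8M1D6M4S"))
```

An unaligned read has the CIGAR value *, which this sketch deliberately rejects rather than treating it as an empty alignment.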

34 CIGAR strings Soft clipping is when only part of the query sequence is aligned to the reference, leaving some portion of the query sequence unaligned. It occurs when an aligner can partially map a read to a location, but the alignment at the end of the sequence is questionable. Hard clipping is similar, but hard-clipped regions are not present in the sequence stored in the SAM field SEQ.

35 CIGAR strings Clipped alignment: In local alignment, a sequence may not be aligned from the first residue to the last one. Subsequences at the ends may be clipped off. Example: 3S8M1D6M4S

36 CIGAR strings Spliced alignment: In cDNA-to-genome alignment, we want to distinguish introns from deletions in exons. Example: 9M32N8M

37 CIGAR strings
51M: a fully aligned 51 bp read without insertions or deletions. By the SAM format specification, M means there's an alignment match, not that all bases in the query and reference sequence are identical.
43S6M1I26M: the first 43 bp are soft clipped, the next 6 are matches/mismatches, then there is a 1 bp insertion relative to the reference, and finally 26 matches/mismatches.
The SAM format specification mandates that the lengths of all M, I, S, =, and X operations add up to the length of the sequence in SEQ.
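
The length rule can be checked with a short Python sketch. It sums the operations that consume the query (M, I, S, =, X) and, for comparison, those that consume the reference (M, D, N, =, X), as defined in the SAM specification. The function names query_length and reference_length are hypothetical.

```python
import re

QUERY_OPS = set("MIS=X")      # operations that consume bases of the read
REFERENCE_OPS = set("MDN=X")  # operations that consume bases of the reference


def cigar_pairs(cigar):
    """Return the (length, operation) pairs of a CIGAR string."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def query_length(cigar):
    """Length the SEQ field must have for this CIGAR string."""
    return sum(n for n, op in cigar_pairs(cigar) if op in QUERY_OPS)


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in cigar_pairs(cigar) if op in REFERENCE_OPS)


for cigar in ("51M", "43S6M1I26M", "3S8M1D6M4S", "9M32N8M"):
    print(cigar, query_length(cigar), reference_length(cigar))
```

For 43S6M1I26M the read is 43 + 6 + 1 + 26 = 76 bases long but covers only 6 + 26 = 32 reference bases, while the spliced example 9M32N8M is a 17-base read spanning 49 reference bases.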

38 Padded alignment Most short-read aligners do not present how inserted sequences are aligned against each other. Alignment with inserted sequences fully aligned is called padded alignment. Padded alignment is produced by de novo assemblers and is important for an alignment viewer to display the alignment properly. To store padded alignment, we introduce P which can be considered as a silent deletion from padded reference sequence.

39 Padded alignment
REF: TCA--GAC
R1 : TCAGAGAC
R2 : TCA-AGAC
R3 : TCA--GAC
Padded CIGAR    R1: 3M2I3M   R2: 3M1P1I3M   R3: 3M2P3M
Unpadded CIGAR  R1: 3M2I3M   R2: 3M1I3M     R3: 6M
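
For the three reads above, the unpadded CIGAR can be obtained by dropping the P operations and merging neighbouring operations of the same type. The Python below is a sketch of that simple rule under the assumption that it is enough for examples like these; unpad_cigar is a hypothetical helper, not a samtools function.

```python
import re


def unpad_cigar(cigar):
    """Drop P operations and merge adjacent operations of the same type."""
    merged = []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op == "P":
            continue  # padding is silent: it consumes neither read nor reference
        if merged and merged[-1][1] == op:
            merged[-1][0] += int(length)  # e.g. 3M + 3M becomes 6M
        else:
            merged.append([int(length), op])
    return "".join(f"{n}{op}" for n, op in merged)


for padded in ("3M2I3M", "3M1P1I3M", "3M2P3M"):
    print(padded, "->", unpad_cigar(padded))
```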

40 samtools

41 samtools view

42 Converting between SAM and BAM samtools view allows us to convert SAM to BAM with the -b option. Without -b, samtools view writes SAM, so the same command converts BAM back to SAM.

43 Converting between SAM and BAM When converting BAM to SAM, samtools view will not include the SAM header by default. SAM files without headers cannot be converted into BAM files. We can include the header with -h.

44 Converting between SAM and BAM We only need to convert BAM to SAM when manually inspecting files. In general, it is better to store files in BAM format, as it's more space efficient, compatible with all samtools subcommands, and faster to process because tools can directly read in binary values rather than parse SAM strings.

45 Sort and index We sort by alignment position and index a BAM file to allow for fast random access to reads aligned within a certain region. We sort a BAM file with samtools sort:

46 Sort and index

47 Sort and index Sorting a large number of alignments can be very computationally intensive, so samtools sort has options that allow you to increase the memory allocation (-m) and parallelize sorting across multiple threads (-@).

48 Sort and index Position-sorted BAM files are the starting point for most later processing steps such as SNP calling and extracting alignments from specific regions. Sorted BAM files are also much more disk-space efficient than unsorted BAM files. Often, we want to work with alignments within a particular region in the genome. Iterating through an entire BAM file just to work with a subset of reads at one position would be inefficient. Consequently, BAM files can be indexed. The BAM file must be sorted first, and a SAM file cannot be indexed.

49 Sort and index To index a position-sorted BAM file, we use samtools index. This creates a file named SRR sorted.bam.bai, which contains the index for the BAM file.

50 Extracting alignments With a position-sorted and indexed BAM file, we can extract specific regions of an alignment with samtools view. As an example, get a GenBank format file for the Ebola virus:
$ efetch -db=nuccore -format=gb -id=af > AF gb

51 Extracting alignments Pick a gene, NP. Then, let's take a look at some alignments in this region.

52 Extracting alignments We can also count the number of alignments that overlap with this region, using the -c option of samtools view.

53 Filtering alignments samtools view has options for filtering alignments based on bitwise flags, mapping quality, and read group. Two options filter based on bitwise flags:
-f: only outputs reads with all of the specified flag bits set
-F: only outputs reads with none of the specified flag bits set
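
The logic behind the two options can be imitated in Python with bitwise AND. This is a sketch of the selection rule only, not of samtools itself: passes_filter and its require/exclude parameters are hypothetical stand-ins for -f and -F, and the list of flags is invented.

```python
def passes_filter(flag, require=0, exclude=0):
    """Mimic -f (all required bits set) and -F (no excluded bit set)."""
    return (flag & require) == require and (flag & exclude) == 0


# Hypothetical flags: 99 and 147 form a proper pair, 77 and 141 are an
# unmapped pair, 65 is a mapped first read that is not in a proper pair.
flags = [99, 147, 77, 141, 65]

unmapped = [f for f in flags if passes_filter(f, require=4)]            # like -f 4
first_in_proper_pair = [f for f in flags if passes_filter(f, require=66)]  # like -f 66
mapped = [f for f in flags if passes_filter(f, exclude=4)]              # like -F 4

print(unmapped, first_in_proper_pair, mapped)
```

Note the asymmetry: a required value such as 66 (0x40 + 0x2) demands that both bits are set, whereas an excluded value rejects a read as soon as any one of its bits is set.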

54 Filtering alignments Suppose you want to output all reads that are unmapped. The unmapped bit is 0x4, so -f 4 selects them.

55 Filtering alignments It is also possible to output reads with multiple bitwise flags set. For example, we could find the first reads of a pair that aligned in a proper pair (bits 0x40 and 0x2, so -f 66).

56 Filtering alignments We can use the -F option to extract alignments that do not have any of the bits of the supplied flag argument set. For example, suppose we wanted to extract all aligned reads: -F 4 excludes every read with the unmapped bit set.

57 Filtering alignments Suppose you want to extract all reads that did not align in a proper pair (the read is aligned and paired, but not aligned in a proper pair). A naïve approach with -F 2 is incorrect because both unmapped reads and unpaired reads will be included.

58 Filtering alignments We do this by combining bits: require the paired bit (-f 1) and exclude both the proper-pair and unmapped bits (-F 6).
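
A small Python sketch shows why excluding bit 0x2 alone is not enough. It assumes the combination of requiring the paired bit (0x1) while excluding the proper-pair (0x2) and unmapped (0x4) bits; the keep helper and the list of flags are hypothetical.

```python
PAIRED, PROPER_PAIR, UNMAPPED = 0x1, 0x2, 0x4


def keep(flag, require=0, exclude=0):
    """True if all 'require' bits are set and no 'exclude' bit is set."""
    return (flag & require) == require and (flag & exclude) == 0


# Hypothetical flags: 99/147 proper pair, 65/129 mapped pair that is not
# proper, 77 paired but unmapped, 0 unpaired and mapped, 4 unpaired and unmapped.
flags = [99, 147, 65, 129, 77, 0, 4]

# Naive: exclude only the proper-pair bit.
naive = [f for f in flags if keep(f, exclude=PROPER_PAIR)]

# Combined: must be paired, must not be proper pair, must not be unmapped.
combined = [f for f in flags
            if keep(f, require=PAIRED, exclude=PROPER_PAIR | UNMAPPED)]

print("naive:   ", naive)
print("combined:", combined)
```

The naive selection still contains the unmapped read (77) and the unpaired reads (0 and 4); only the combined test leaves the aligned, paired reads that are not in a proper pair.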
