Mapping. Reference. read

Dennis Armstrong
5 years ago

1 Mapping Reference read

3 Assembly vs mapping. Assembly: all reads vs all reads, producing contigs (contig1, contig2). Mapping: all reads vs the reference.

4 What's the problem? Reads differ from the genome due to evolution and sequencing errors, so we cannot use exact string matching. Genomes are repetitive, so it is important that reads with multiple matches are treated carefully; often only unique matches are kept. Contamination: some reads are not from the target genome (primers, contamination, etc.).


5 So let's use BLAST. Most used bioinformatics tool on the planet. Gives nice E-values. Only one problem: BLASTing* a lane of Illumina reads against the human genome takes years! *) 250 million reads. Blastn (default params) against the human genome took about 0 minutes per 1000 reads on a single CPU. Zzz-mail-What-happens-when-sleepwalkers-go-online.html

6 Next-generation alignment algorithms. BLAST indexes words in the query, so search time is proportional to the database size. First-generation short-read mappers like Eland and MAQ use a hash table of the reads. It is better to index the genome (this may use lots of memory, though). BLAT makes an index of non-overlapping words in the genome, but it is not so well suited for short reads. Second-generation mappers like Bowtie and BWA are based on a sophisticated index built on the Burrows-Wheeler transform.

7 Mapping. Reference genome / transcriptome: ...gtgggccggcaattcgatatcgcgcatatatttcggcgcatgcttagc... Reads (unmapped): GCATATATTT GCATATATTT TGGGCCGGCA ATTCGATATC ATATTTCGGC CCGGCAATTC TCGCGCATAT CATGCTTAGC GATATCGCGC

8 Mapping. Reference genome / transcriptome: ...gtgggccggcaattcgatatcgcgcatatatttcggcgcatgcttagc... Reads (mapped): TGGGCCGGCA GCATATATTT CATGCTTAGC CCGGCAATTC ATATTTCGGC ATTCGATATC GCATATATTT TCGCGCATAT GATATCGCGC
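The toy example on these two slides can be reproduced with plain exact string search. The sketch below is only an illustration of what "mapping" means; the function name `map_reads_exact` is made up here, and real mappers do not scan the reference like this.

```python
def map_reads_exact(reference, reads):
    """Return {read: [0-based start positions]} for exact matches."""
    ref = reference.upper()
    hits = {}
    for read in reads:
        positions = []
        start = ref.find(read)
        while start != -1:
            positions.append(start)
            # Continue searching one base further to find overlapping hits too.
            start = ref.find(read, start + 1)
        hits[read] = positions
    return hits


reference = "gtgggccggcaattcgatatcgcgcatatatttcggcgcatgcttagc"
reads = ["TGGGCCGGCA", "GCATATATTT", "CATGCTTAGC", "CCGGCAATTC"]

hits = map_reads_exact(reference, reads)
for read, positions in hits.items():
    print(read, positions)
```

Every read here matches exactly once; as soon as reads carry errors or the genome is repetitive, this approach stops being sufficient, which is the point of the following slides.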

9 NGS alignment algorithms
Seed/hash methods: used by BFAST and Stampy. Methodology: find matches for short subsequences, assuming that at least one seed in a read will match perfectly, then align with a sensitive method like Smith-Waterman (SW). Tend to be more sensitive than BWT.
Burrows-Wheeler transform: used by BWA and Bowtie. Faster than hash methods at the same sensitivity level. Compacts the genome into a data structure that is very efficient when searching for perfect matches. Performance decreases exponentially with the number of mismatches.
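A minimal sketch of the seed/hash idea, under simplifying assumptions: the genome is indexed by fixed-length k-mers, non-overlapping seeds from the read are looked up, and each candidate locus is verified. For brevity the verification counts mismatches (Hamming distance) instead of running Smith-Waterman as the slide describes, so indels are not handled. All function names are illustrative and not taken from BFAST or Stampy.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Hash every k-mer of the genome to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_and_extend(read, genome, index, k, max_mismatches):
    """Look up non-overlapping seeds, then verify each candidate locus."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        mismatches = hamming(read, genome[start:start + len(read)])
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


genome = "GTGGGCCGGCAATTCGATATCGCGCATATATTTCGGCGCATGCTTAGC"
index = build_kmer_index(genome, k=5)
# Read copied from genome position 11, with one substitution (G -> C) at read offset 4.
read = "ATTCCATATC"
print(seed_and_extend(read, genome, index, k=5, max_mismatches=1))
```

The first seed contains the substitution and finds nothing; the second seed matches perfectly and rescues the read, which is exactly the "at least one seed matches" assumption.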

10 BWT. The Burrows-Wheeler transform (abbreviated BWT) is an algorithm used in data compression programs such as bzip2. It was invented by Michael Burrows and David Wheeler.[1] When a character string is subjected to the BWT, none of its characters changes value, because the transform only permutes the order of the characters. If the original string contains many repetitions of certain substrings, the transformed string will contain several places where the same character is repeated many times. This is useful for compression, because a string containing long runs of identical characters is easy to compress. TRENTATRE.TRENTINI.ANDARONO.A.TRENTO.TUTTI.E.TRENTATRE.TROTTERELLANDO OIIEEAEO..LDTTNN.RRRRRRRTNTTLEAAIOEEEENTRDRTTETTTTATNNTTNNAAO...OU.T

11 BWT. The transform is made by sorting all rotations of the text and then taking only the last column. For example, the text "^BANANA@" is transformed into "BNN^AA@A" through these steps

14 BWT INDEX CREATION. Genome: X = AGGAGC$ ($ marks the end of the string and is lexicographically smallest). Next Generation Sequencing Analysis

15 BWT INDEX CREATION. X = AGGAGC$. 1. Create all possible rotations of the string (move the first base to the end):
AGGAGC$
GGAGC$A
GAGC$AG
AGC$AGG
GC$AGGA
C$AGGAG
$AGGAGC

20 BWT INDEX CREATION. X = AGGAGC$. 2. Sort the strings lexicographically (the last character of each sorted string is set apart):
$AGGAG C
AGC$AG G
AGGAGC $
C$AGGA G
GAGC$A G
GC$AGG A
GGAGC$ A

25 BWT INDEX CREATION. X = AGGAGC$. 3. Create the Suffix-Array (SA) and the BWT:
i  SA  sorted     BWT
0  6   $AGGAG     C
1  3   AGC$AG     G
2  0   AGGAGC     $
3  5   C$AGGA     G
4  2   GAGC$A     G
5  4   GC$AGG     A
6  1   GGAGC$     A
i = (0,1,2,3,4,5,6) SA = (6,3,0,5,2,4,1) BWT = CG$GGAA
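The three steps for X = AGGAGC$ can be checked with a few lines of code. This is a naive sketch that sorts whole suffixes, which is fine for a seven-letter toy string but far too slow and memory-hungry for a genome; real indexers use specialised suffix-array construction. Because `$` is unique and sorts first, sorting suffixes gives the same order as sorting the rotations. The function names are illustrative.

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """BWT character of each row: the character just before the suffix (cyclically)."""
    return "".join(text[i - 1] for i in sa)


x = "AGGAGC$"          # '$' sorts before A, C, G, T in ASCII
sa = suffix_array(x)
bwt = bwt_from_sa(x, sa)
print(sa)
print(bwt)
```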

28 BWT INDEX CREATION. Our index, with Read = AG:
i  SA  sorted     BWT
0  6   $AGGAG     C
1  3   AGC$AG     G
2  0   AGGAGC     $
3  5   C$AGGA     G
4  2   GAGC$A     G
5  4   GC$AGG     A
6  1   GGAGC$     A
Which strings start with AG? Get the suffix array indices: i = [1,2]. Suffix array values: SA[i] = [3,0], so the read aligns at pos 0 and pos 3. pos 0: [AG]GAGC. pos 3: AGG[AG]C.
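The lookup "which rows start with AG?" is usually done with backward search on the BWT (the FM-index idea), processing the read from its last base to its first. The sketch below assumes the toy index from the slides and stores a full occurrence table, which a real mapper would sample to save memory. The names `fm_index` and `backward_search` are illustrative.

```python
def fm_index(bwt):
    """Build the two FM-index tables from a BWT string."""
    alphabet = sorted(set(bwt))
    # first[c]: number of characters in the text that sort before c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ


def backward_search(read, bwt, sa):
    """Return sorted 0-based text positions where read occurs exactly."""
    first, occ = fm_index(bwt)
    lo, hi = 0, len(bwt)            # half-open interval of index rows
    for c in reversed(read):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa = [6, 3, 0, 5, 2, 4, 1]
bwt = "CG$GGAA"
print(backward_search("AG", bwt, sa))   # rows 1 and 2 of the index
```

For the read AG the interval narrows to rows 1 and 2, whose SA values 3 and 0 are the two alignment positions shown on the slide.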

36 Mismatches. We can find mismatches and indels by backtracking, allowing a maximum of n mismatches. Large genomes can be searched very fast this way, but only when allowing a limited number of mismatches.
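A rough sketch of backtracking over the same toy index: at every step each base of the alphabet is tried, and a base that differs from the read costs one unit of the mismatch budget. Only substitutions are handled here (no indels), and the occurrence counts are recomputed on the fly for brevity. The number of branches grows quickly with the budget, which is why performance drops as more mismatches are allowed. The function name is illustrative and this is not BWA's actual implementation.

```python
def search_with_mismatches(read, bwt, sa, max_mismatches):
    """0-based text positions matching read with at most max_mismatches substitutions."""
    alphabet = sorted(set(bwt) - {"$"})
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total
        total += bwt.count(c)

    def occ(c, i):
        # Occurrences of c in bwt[:i]; a real index precomputes this table.
        return bwt.count(c, 0, i)

    hits = set()

    def extend(i, budget, lo, hi):
        if i < 0:                        # whole read consumed
            hits.update(sa[lo:hi])
            return
        for c in alphabet:
            cost = 0 if c == read[i] else 1
            if cost > budget:
                continue
            new_lo = first[c] + occ(c, lo)
            new_hi = first[c] + occ(c, hi)
            if new_lo < new_hi:          # some rows still match
                extend(i - 1, budget - cost, new_lo, new_hi)

    extend(len(read) - 1, max_mismatches, 0, len(bwt))
    return sorted(hits)


sa = [6, 3, 0, 5, 2, 4, 1]
bwt = "CG$GGAA"
print(search_with_mismatches("AC", bwt, sa, 0))
print(search_with_mismatches("AC", bwt, sa, 1))
```

The read AC has no exact match in AGGAGC, but with one mismatch it hits AG at positions 0 and 3 and GC at position 4.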

37 Problem

38 Mapping sensitivity. Not all reads that should be mapped (aligned) will be mapped. Highly polymorphic regions or large insertions or deletions are difficult to detect. Sensitivity-related mapper characteristics: algorithm; maximum edit distance (number of mismatches); whether small indels are allowed; whether large gaps (e.g. introns) are allowed; global or local alignments. [Figure: mapper performance, sensitivity vs time/memory]

39 Sensitivity vs edit distance. [Figure: overall alignment accuracy vs edit distance. Y axis: % of all alignments at the specified edit distance. X axis: edit distance (bp). Series: attempted and correct alignments for bwa, bowtie and soap.] Michael Stromberg@bioinformatcis.ca

40 Mapping against A. thaliana Col as reference. Sensitivity.
Species        Accession  SRA     % Mapped Reads
A. thaliana    Col        SRR     %
A. thaliana    Ler        SRR     %
A. thaliana    C24        SRR     %
A. lyrata                 SRR     %
Brassica rapa             ERR079  20%
Reads were preprocessed with Q20L0. Mapping tool: Bowtie2. Taken from Aureliano Bombarely

41 Mapping score. MAPQ reflects the probability that the read originated from the region of the genome where it maps. The mapping score of one alignment depends on how similar the read is to the reference and on how many alignments have been found. The mapping score is usually given as a phred score.
Read   ACGTCTAGTTACGATACGTT
Loci1  ACGACTAGTTACGATACGTT  score1
Loci2  ACGTCTAGCTACGCTAGGTT  score2
Loci3  ACGACTAGTTACGATACGTT  score1

42 Mapping quality. Depends on: similarity between read and genome; quality of the read; the number of alternative locations. Mapping quality scores (MapQ) include (some of) these.
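"Given as a phred score" means the score is -10 times the base-10 logarithm of the probability that the reported locus is wrong. A small sketch of the conversion in both directions; the cap of 60 is an assumption made for this example, since each mapper chooses its own maximum.

```python
import math


def mapq_from_error_probability(p_wrong, cap=60):
    """Phred-scale the probability that the reported locus is wrong."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


def error_probability_from_mapq(mapq):
    """Inverse conversion: phred-scaled score back to a probability."""
    return 10 ** (-mapq / 10)


for p in (0.5, 0.1, 0.01, 0.001):
    print(p, mapq_from_error_probability(p))
```

A read that fits two loci equally well has a 50% chance of being misplaced, which gives a score of about 3; a 1 in 1000 chance gives 30.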


43 Reads come with qualities. Illumina and other platforms give quality scores as a one-letter code in the Fastq format:
CTTGGTGGTAGTAGCAAATATTCAAACGAGAACTTTGAAGAGATCGGAA
+
dddddaddadc_cccffcdcdefeeeee^deefffeefdeffdeffffd
[Figure: error probability (1, 0.1, 0.01, 0.001, 0.0001, 0.00001) against quality score, with the one-letter code (base 64) BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh along the quality axis.]
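Decoding the one-letter code is a subtraction followed by the phred formula. The sketch below assumes the base-64 offset shown on the slide (older Illumina pipelines); current Fastq files normally use an offset of 33, so the `offset` parameter has to match the data. The function name is illustrative.

```python
def decode_qualities(quality_string, offset=64):
    """Convert a Fastq quality string into (phred score, error probability) pairs."""
    decoded = []
    for ch in quality_string:
        q = ord(ch) - offset
        decoded.append((q, 10 ** (-q / 10)))
    return decoded


qual = "dddddaddadc_cccffcdcdefeeeee^deefffeefdeffdeffffd"
scores = [q for q, _ in decode_qualities(qual)]
print(scores[:5], min(scores), max(scores))
```

With offset 64 the letter `d` is quality 36, an error probability of roughly 1 in 4000.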

44 MapQ. It is possible to calculate the probability that a match is correct using base quality scores (implemented in PSSM-BWA). In BWA the MapQ score is an approximation of the logarithm of the mapping probability: the worst is 0 and the best is 37.

45 Alignments to report. A read might be aligned to 0, 1 or more regions in the genome. When several alignments are found, we can classify them in two groups. Best alignments: alignments with the best score (Map Quality). Other alignments. We can choose to report: all alignments; all best alignments; one of the best alignments at random; all alignments above a score threshold.

46 UNIQUE, MULTIPLE, UNMAPPED

47 Unique matches may be wrong!

48 Example: Mapping repeats Your read happens to be from an Alu repeat (~50% of your reads from human are from repeats!) You match to the genome and find one exact match (no mismatches) There are 00 matches in the genome with one mismatch. How likely is it that your unique match is the correct one?

49 It is less than you think! If there is 2% error rate, the probability that your unique match is correct is around %
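One way to reason about this is a small Bayesian calculation, sketched here under strong assumptions: every candidate locus is equally likely a priori, base errors are independent, and an error turns a base into each of the three other bases equally often. The count of one-mismatch sites used in the call is purely illustrative and is not a value recovered from the slide.

```python
def prob_exact_hit_is_correct(n_one_mismatch_sites, error_rate):
    """Posterior that the exact-match locus is the true origin of the read."""
    # True origin = exact site: every base was read correctly.
    # True origin = a one-mismatch site: one specific base was misread into
    # one specific other base, and all remaining bases were read correctly.
    # The common factor (1 - e)^(L - 1) cancels, so read length drops out.
    exact = 1 - error_rate
    near = n_one_mismatch_sites * error_rate / 3
    return exact / (exact + near)


# 300 near-matches is an assumed, illustrative number.
print(round(prob_exact_hit_is_correct(300, 0.02), 2))
```

The more near-identical copies a repeat has, the more likely it is that the read really came from one of them and only looks like a perfect match because of a sequencing error.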

50 Experiment: What is the chance that a unique match is wrong? For each length: generate a million reads at random from the human genome; introduce errors uniformly at rates of 1, 2 and 5%; map back to the genome with up to 3 mismatches (exact mapping); record how many map uniquely to the WRONG location.

51 Unique, but wrongly mapped reads. Mapped with up to three mismatches. Exhaustive mapping with up to 3 mismatches and no indels.

52 Contamination & Significance. How likely is it that sequences not belonging to the target genome map anyway?


53 Random matches. The key to the success of BLAST was the introduction of E-values: the expected number of random matches. For local alignment, the expected number of random matches is calculated from the extreme value distribution. It is simpler for mapping DNA reads. Figure from mcb221_2005/class7.html
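For ungapped read mapping the expected number of chance hits can be estimated by simple counting. The sketch below assumes a random genome with independent, equally frequent bases, which real genomes violate, so the result is only a rough guide. The function name and the genome size used in the loop are assumptions made for this example.

```python
from math import comb


def expected_random_hits(read_length, genome_size, max_mismatches=0, strands=2):
    """Expected number of chance matches of one random read in a random genome."""
    # Probability that a random window matches with at most k substitutions:
    # choose the mismatching positions, each with 3 possible wrong bases.
    p_window = sum(comb(read_length, i) * 3 ** i
                   for i in range(max_mismatches + 1)) / 4 ** read_length
    windows = max(genome_size - read_length + 1, 0) * strands
    return windows * p_window


human = 3_000_000_000   # approximate size, for illustration only
for length in (15, 20, 30):
    print(length, expected_random_hits(length, human))
```

Under these assumptions a 15-mer is expected to hit a human-sized genome several times by chance, while a 30-mer essentially never does.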

54 E. coli reads mapped uniquely to the human genome

55 Be careful with very short reads!

56 BWA vs Bowtie2
BWA mem: reads from 70 bp up to 1 Mbp; seeded algorithm plus Smith and Waterman; local alignment; allows gaps up to tens of bp in 100 bp reads; reports chimeric alignments.
Bowtie2: one of the fastest alignment programs for short reads; gapped alignment; global or local; base quality can be used in evaluating alignments; paired end.
BWA backtrack (samse/sampe): short reads up to 70 bp with errors <5%; global alignment; gapped alignment; base quality is not used in evaluating hits; can do paired end.

57 Many alignments vs multiple alignment. Mappers do many alignments, but they do not do multiple alignments. Doing many pairwise alignments is computationally more feasible. There's one drawback.
Many alignments:
Ref     ...aggttttataaaac----aattaagtctacagagcaacta...
Sample  ...aggttttataaaacaaataattaagtctacagagcaacta...
Read1   ...aggttttataaaac****aaataa...
Read2   ...ggttttataaaac****aaataatt...
Read3   ...ttataaaacaaataattaagtctaca...
read4   CaaaT****aattaagtctacagagcaac...
read5   aat****aattaagtctacagagcaact...
read6   T****aattaagtctacagagcaacta...
Multiple alignment:
Ref     ...aggttttataaaac----aattaagtctacagagcaacta...
Sample  ...aggttttataaaacAAATaattaagtctacagagcaacta...
Read1   ...aggttttataaaacAAATaa...
Read2   ...ggttttataaaacAAATaatt...
Read3   ...ttataaaacAAATaattaagtctaca...
read4   caaataattaagtctacagagcaac...
read5   AATaattaagtctacagagcaact...
read6   Taattaagtctacagagcaacta...

58 Many alignments vs multiple alignment. The gaps can be located in different positions.
Many alignments:
ref        aggttttataaaacaaaaaattaagtctacagagcaacta
sample     aggttttataaaacaaa-aattaagtctacagagcaacta
read1      aggttttataaaacaa-aaattaagtctacagagcaacta
read2      aggttttataaaaca-aaaattaagtctacagagcaacta
read3      aggttttataaaac-aaaaattaagtctacagagcaacta
consensus  aggttttataaaacaaaaaattaagtctacagagcaacta
Strategies to mitigate this problem:
Fixing the problem: GATK realignment. It realigns the problematic regions (lots of SNPs or some indels). Computationally slow. It does not fix all problems.
Avoid using the misaligned positions: Samtools BAQ (calmd). For each position, it calculates the probability of being misaligned.

59 SAM!!! Sequence Alignment/Map: a file describing reads aligned to a reference genome. Standard file format. Not meant for human consumption, although it can be opened with a text editor. Normally used by programs in its binary version (BAM). Input for genome browsers (e.g., IGV) and SNP callers. It is usually found with the reads sorted along the reference. There are some differences in the output between mappers: for instance, bwa represents multiple hits with an optional tag (XA), and bowtie with multiple lines (one per hit).

60 SAM!!!

61 Alignment section fields
Col  Field  Brief description
1    QNAME  Query template NAME
2    FLAG   bitwise FLAG
3    RNAME  Reference sequence NAME
4    POS    1-based leftmost mapping POSition
5    MAPQ   MAPping Quality
6    CIGAR  CIGAR string
7    RNEXT  Ref. name of the mate/next read
8    PNEXT  Position of the mate/next read
9    TLEN   Observed Template LENgth
10   SEQ    segment SEQuence
11   QUAL   ASCII of Phred-scaled base QUALity+33
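The eleven mandatory columns are tab-separated, so an alignment line can be split into named fields directly. The sketch below keeps any optional tags as raw strings; the class name, the function name and the example record are made up for illustration.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int       # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple    # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line):
    """Split one alignment line (not a header line) into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tuple(fields[11:]))


# Made-up record, for illustration only.
line = "read1\t0\tchr1\t24\t37\t10M\t*\t0\t0\tGCATATATTT\tIIIIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.mapq, rec.cigar)
```

For real work a dedicated library that also reads BAM is the better choice; this only shows how the columns line up.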

62 http://picard.sourceforge.net/explain-flags.html
Flag    Chr  Description
0x0001  p    the read is paired in sequencing
0x0002  P    the read is mapped in a proper pair
0x0004  u    the query sequence itself is unmapped
0x0008  U    the mate is unmapped
0x0010  r    strand of the query (1 for reverse)
0x0020  R    strand of the mate
0x0040  1    the read is the first read in a pair
0x0080  2    the read is the second read in a pair
0x0100  s    the alignment is not primary
0x0200  f    the read fails platform/vendor quality checks
0x0400  d    the read is either a PCR or an optical duplicate
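Because FLAG is a bit field, a single integer such as 99 encodes several of these properties at once. A small sketch that expands a flag into the descriptions from the table; the short labels and the function name are chosen here for illustration.

```python
SAM_FLAGS = [
    (0x0001, "paired in sequencing"),
    (0x0002, "mapped in a proper pair"),
    (0x0004, "query unmapped"),
    (0x0008, "mate unmapped"),
    (0x0010, "query on reverse strand"),
    (0x0020, "mate on reverse strand"),
    (0x0040, "first read in pair"),
    (0x0080, "second read in pair"),
    (0x0100, "not primary alignment"),
    (0x0200, "fails quality checks"),
    (0x0400, "PCR or optical duplicate"),
]


def decode_flag(flag):
    """List the description of every bit that is set in the flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(99))   # 99 = 0x1 + 0x2 + 0x20 + 0x40
```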

63 SAM QC: flag statistics (samtools flagstat); MAPQ distribution; coverage distribution; mapped/unmapped reads per read group; mapped/unmapped reads per reference (samtools idxstats).

64 SAMSTAT


65 Raw data
  Receiving reads from a sequencing center
Quality control
Cleaning
  Remove adaptors (not yet implemented)
  Quality trimming (PHRED, GC content, KMER content, length)
  Length trimming
New quality control
  IF PHRED > 20/25, no repkmer and length > 5/40 -> new_files_with_reads (pair-end file and single end file)
Mapping
  Mapping pair-ends (CM.5 & CM.)
  Mapping quality filtered single reads
  Minimum QUAL of PHRED 20, allow mismatch and one gap
  Alignment file creation
  Mapping file creation
Processing MAP file (BAM)
  Merge pair-ends and single end file
  Index
  Sort
  Remove PCR duplicates

66 A one-click pipeline to NG resequencing data: S.U.P.E.R W [logo]

67 IGV viewer Visualization tool for interactive exploration of large, integrated datasets. Supports a wide variety of data types including: alignments, microarrays, and genomic annotations.


Welcome to MAPHiTS (Mapping Analysis Pipeline for High-Throughput Sequences) tutorial page. In this page you will learn to use the tools of the MAPHiTS suite. A little advice before starting : rename your