Read Mapping and Assembly



1 Statistical Bioinformatics: Read Mapping and Assembly Stefan Seemann University of Copenhagen April 9th 2019

2 Why sequencing?

3 Why sequencing?
Which organism does the sample come from?
Assembling the genome of an organism.
Which region of the genome is transcribed?
Variability (AS events) of mRNAs (ncRNAs) of the same gene.
Expression level of the mRNA (ncRNA).

4 Next-generation sequencing

5 Workflow of RNA-seq

6 Some terminology
read - a long word that comes from an NGS machine
coverage - the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing - the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
mate pair - a pair of reads from the two ends of the same insert fragment (the approximate distance between them is known)
contig - a contiguous sequence formed by several overlapping reads with no gaps
consensus sequence - the sequence derived from the multiple alignment of the reads in a contig

7 Read Mapping Map a small sequence to a reference genome or a catalog of sequences (reads, transcripts, ...). Often the very first step in the analysis of sequencing data.

8 First there were sequences...

9 Naive mapping Search for query at each position in reference genome ACGTTACCGAATCGATCAAAGTCGA GTTA m = query length, n = genome length


11 Naive mapping Search for the query at each position in the reference genome.
Reference: ACGTTACCGAATCGATCAAAGTCGA
Query: GTTA :)
m = query length, n = genome length
Time: O(mn)
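The naive search on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions (exact matching, 0-based positions); the function name `naive_map` is made up for this example and does not come from the slides.

```python
def naive_map(query: str, reference: str) -> list[int]:
    """Return every 0-based position where query matches reference exactly."""
    m, n = len(query), len(reference)
    hits = []
    # Try each of the n - m + 1 possible start positions in the reference.
    for start in range(n - m + 1):
        # Comparing the slice costs up to m character comparisons,
        # which gives the O(mn) running time stated on the slide.
        if reference[start:start + m] == query:
            hits.append(start)
    return hits


reference = "ACGTTACCGAATCGATCAAAGTCGA"
print(naive_map("GTTA", reference))
```

Running this once per read is what makes the naive approach impractical for many reads and a large genome.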

12 Naive mapping Human genome (queries) would take far too long: Illumina/Solexa sequencing technology produces millions of short reads. Mapping these reads to a 3.2 billion bp human genome is a challenge. It is far worse when we allow for indels and mismatches. Are optimal alignments still feasible?

13 Principles of mapping reads The need: Hundreds of millions of reads (or more). Adapt to handle the growing amount of data. Exploit technological development. Handle protocol development (as was the case with paired-end libraries). What to do?

14 Principles of mapping reads The need: Hundreds of millions of reads (or more). Adapt to handle the growing amount of data. Exploit technological development. Handle protocol development (as was the case with paired-end libraries). What to do? We are looking for nearly identical matches: overlap detection.

15 Overlap detection
Banded Smith-Waterman: dynamic programming algorithm, guaranteed optimal, but slow

16 Overlap detection
Banded Smith-Waterman: dynamic programming algorithm, guaranteed optimal, but slow
BLAST: seed-and-extend for good matches to a DB

17 Overlap detection
Banded Smith-Waterman: dynamic programming algorithm, guaranteed optimal, but slow
BLAST: seed-and-extend for good matches to a DB
K-mer word counting: read mapping

18 Principles of mapping reads Many sequenced reads are redundant!!! We do not need to search the entire genome again for every read.

19 Book analogy Do not search the entire book; instead, search the book index. Gorodkin et al. Methods Mol Biol. 2014

20 Book analogy Do not search the entire book; instead, search the book index. [Figure: index pages of Jan Gorodkin and Walter L. Ruzzo (eds.), RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods, Methods in Molecular Biology, vol. 1097, Springer Science+Business Media New York; the index lists entries such as BLAST, Bowtie and BWA with their page numbers.] Gorodkin et al. Methods Mol Biol. 2014

21 k-mer A k-mer is a substring of length k. For example, for the sequence GGCGATTCATCG:
4-mers: GGCG, GCGA, CGAT, ...
3-mers: GGC, GCG, CGA, GAT, ...
Schatz et al. Genome Res 2010
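As a small illustration of the definition, the following Python sketch lists all k-mers of the example sequence with a sliding window. The helper name `kmers` is chosen for this example only.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return all substrings of length k, in order of their start position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "GGCGATTCATCG"
print(kmers(seq, 4))  # starts with GGCG, GCGA, CGAT as on the slide
print(kmers(seq, 3))  # starts with GGC, GCG, CGA, GAT as on the slide
```

A sequence of length L contains L - k + 1 k-mers (counting repeats), so the example sequence of length 12 yields nine 4-mers and ten 3-mers.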

22 Read mapping through indexing Most computational time is spent on alignment.
Solution: An index is a data structure that improves the speed of data retrieval operations at the cost of additional storage space to maintain the index data structure.
Pros: Quick search for matches in an entire genome.
Cons: The index structure of the entire genome takes a lot of memory. Will my laptop have the memory to run the alignment algorithm?

23 Challenges of read mapping In principle, read mapping means mapping an exact piece of sequence (a string) to the genome. Form groups of 2-3 people and discuss (3 min): Why might an exact mapping not always be what we want? Which concerns might you have when mapping genomic sequence? Which concerns might you have when mapping transcriptomic data?

24 Indexing problems Flexibility and constraints: Errors versus natural variation: trade-off in error threshold. Computational efficiency (time / memory): allowed mismatches / unique mappings. Balance between speed, memory and reported mappings. Indexing is often used for seed matching. Indexing method choice is crucial!

25 Indexing techniques

26 Hash tables for mapping reads Using an index structure to store information. Each entry is a key-value pair: hf(key) = value. The key is a sequence motif (k-mer). The value is a motif match defined by a hash function hf. Example: the hash function defines whether a 6-mer (key) is in the reference sequence (value = 1) or not (value = 0).
Key     Value
TATAAT  1
GATACC  0
AGTCAT  0
TATAAA  1
TATGAT  1
CGTACT  0
TATACT  1
Blastn finds seeds of exact matches between the query and database sequences by using a hash table of all 11-mers (megablast: 28-mers).
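A minimal sketch of the idea, assuming that the value stored for each k-mer is the list of its positions in the reference rather than the 0/1 flag of the slide example. Python's built-in `dict` serves as the hash table; the function names `build_kmer_index` and `seed_hits` are made up for this illustration and do not describe how Blastn is implemented.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def seed_hits(read: str, index: dict[str, list[int]], k: int) -> list[int]:
    """Return candidate reference start positions implied by exact k-mer seeds."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the read would start in the reference
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


reference = "ACGTTACCGAATCGATCAAAGTCGA"
index = build_kmer_index(reference, 4)
print(index["TCGA"])                    # all positions of the 4-mer TCGA
print(seed_hits("TTACCG", index, 4))    # candidate start positions for the read
```

The index is built once and reused for every read; each candidate position would then be checked by an extension or alignment step, which is not shown here.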

27 Suffix arrays for mapping reads Even hashing can be too slow. Why map the same substring multiple times? Concept: Create a seed once, but keep track of all its genomic locations. Employ seed extension. Employ a suffix array to keep track of all suffixes. A suffix of a string (DNA sequence) is any substring that runs from some position to the end of the sequence. For example, AGTT is a suffix of ACTACCAGTT. The Burrows-Wheeler Transform (BWT) is closely related to suffix arrays and allows efficient data compression (used, e.g., in Bowtie).

28 Build a suffix array: Suffix arrays for mapping reads
Reference sequence: AGGAGC
Reference suffixes    Lexicographically sorted suffixes
0: AGGAGC$            6: $
1: GGAGC$             3: AGC
2: GAGC$              0: AGGAGC
3: AGC$               5: C
4: GC$                2: GAGC
5: C$                 4: GC
6: $                  1: GGAGC
SA = [6,3,0,5,2,4,1]
$: the end-of-string symbol, lexicographically smaller than A.
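The table above can be reproduced with a short Python sketch. It sorts the suffix start positions by the suffixes themselves, which is simple but far slower than the construction algorithms used by real mappers; the function name `build_suffix_array` is an assumption of this example.

```python
def build_suffix_array(reference: str) -> list[int]:
    """Return the suffix array of reference + '$' (naive construction)."""
    # '$' marks the end of the string; in ASCII it sorts before A, C, G and T.
    text = reference + "$"
    # Sort the start positions by the suffix that begins at each position.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("AGGAGC")
print(sa)
for start in sa:
    print(start, ("AGGAGC" + "$")[start:])
```

Only the integer positions need to be stored; each suffix can be read from the reference when it is needed.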

29 Suffix arrays for mapping reads Binary search for perfect matches: m = query length, n = reference length. Time: O(m log(n)). After finding a perfect match, look in the adjacent cells of the suffix array for additional matches.

30 Search a suffix array: Suffix arrays for mapping reads
Reference sequence: AGGAGC
Query: AG (found by binary search)
Reference suffixes    Lexicographically sorted suffixes
0: AGGAGC$            6: $
1: GGAGC$             3: AGC       } AG
2: GAGC$              0: AGGAGC    } AG
3: AGC$               5: C
4: GC$                2: GAGC
5: C$                 4: GC
6: $                  1: GGAGC
SA = [6,3,0,5,2,4,1]
$: the end-of-string symbol, lexicographically smaller than A.
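The search can be sketched with two binary searches that find the first and the last row whose suffix starts with the query; all rows in between are matches, because they are adjacent in the sorted order. This is an illustrative sketch with a naive suffix array construction, and the function name `suffix_array_search` is hypothetical.

```python
def suffix_array_search(query: str, reference: str) -> list[int]:
    """Return all 0-based positions where query occurs exactly in reference."""
    text = reference + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    m = len(query)

    def prefix(rank: int) -> str:
        # First m characters of the suffix stored at this row of the array.
        return text[sa[rank]:sa[rank] + m]

    # Lower bound: first row whose m-character prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first row whose m-character prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid

    # Rows first .. lo - 1 are the adjacent cells that all start with query.
    return sorted(sa[first:lo])


print(suffix_array_search("AG", "AGGAGC"))
```

Each comparison costs at most m characters and each binary search takes about log(n) steps, which matches the O(m log(n)) search time given above.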


31 Principles of mapping reads Coping with gaps: Support of long gaps is computationally expensive, but sometimes required (longer reads have a higher chance of containing gaps). Splice junction mappers can be divided into:
1 Exon-first methods: TopHat (based on Bowtie2)
2 Seed-extend methods: Blat, STAR-aligner, Segemehl

32 Strategies of gapped alignments Graphic credit: Garber et al. (2011) Nature Methods 8:

33 Time and memory requirements of read mappers [Time unit is minutes, and memory unit is GB] Fonseca et al. Bioinformatics 2012

34 Why assembling?

35 The assembly process Assembly is a hierarchical process: starting from individual reads, build high-confidence contigs, then incorporate the mates to build scaffolds.

36 The assembly process Reference based (comparative) De novo Combined methods

37 The assembly process Reference based (comparative) De novo Combined methods When to apply which assembly type?

38 Reference based assembly
Applications: Organisms with good reference genomes, except perhaps polyploid organisms
Pros: Can be run in parallel; contamination not a major issue; less coverage needed
Cons: Reference dependent; dependent on known splice sites; long introns may be predicted
Common software: Cufflinks

39 De novo based assembly
Applications: Organisms with poor reference genomes or no reference genome; independent of correct splice sites or intron length
Pros: Identification of novel transcripts; no reference needed
Cons: Computationally intensive; requires high read depth; contamination a major issue
Common software: Trinity

40 Assembly algorithm
Different approaches: Greedy algorithm; Overlap graphs; De Bruijn graphs
Basic ideas: Find all overlaps between reads. Build a graph. Simplify the graph (sequencing errors). Traverse the graph to produce a consensus.

41 Greedy algorithm We pick two strings s_i and s_j with the largest overlap from R (breaking ties arbitrarily) and replace them with their merge. Stop when there is only one string left.

42 Greedy algorithm We pick two strings s_i and s_j with the largest overlap from R (breaking ties arbitrarily) and replace them with their merge. Stop when there is only one string left. What may go wrong?
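The greedy procedure can be sketched as follows. The sketch assumes error-free reads from a single strand and only considers suffix-prefix overlaps; the names `overlap` and `greedy_assemble` and the example reads are made up for this illustration.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str]) -> str:
    """Repeatedly merge the pair of strings with the largest overlap."""
    strings = list(dict.fromkeys(reads))  # drop exact duplicate reads
    if not strings:
        return ""
    while len(strings) > 1:
        best_len, best_i, best_j = -1, 0, 1
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    olen = overlap(a, b)
                    if olen > best_len:  # ties: the first pair found is kept
                        best_len, best_i, best_j = olen, i, j
        merged = strings[best_i] + strings[best_j][best_len:]
        strings = [s for idx, s in enumerate(strings)
                   if idx not in (best_i, best_j)] + [merged]
    return strings[0]


print(greedy_assemble(["ACGTTAC", "TTACCGA", "CCGAAT"]))
```

On this small repeat-free example the merge recovers the original sequence. With repeats, the largest overlap is not necessarily the correct one, which is one answer to the question on the slide.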

43 Why use graphs for assembly? Say a sequencer produces d reads of length n (about 100 nt) in one sequencing run from a genome of length m (for example, human). Task: Glue overlapping reads together to recover the biology. A combinatorial problem best solved by graph theory! [Figure labels: NGS library, Graph, Genome, Transcriptome]

44 Overlap graph
Each read alignment is a node.
Short overlaps (>5bp) are indicated by directed edges.
Transitive overlaps (longer overlaps) are shown as dotted edges.
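A minimal sketch of such a graph, assuming distinct, error-free reads. The minimum overlap is a parameter (`min_overlap`, set to 4 here only because the example reads are short), and an edge is reported as transitive when the same two reads are also connected through an intermediate read. All function names and reads are assumptions of this example.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads: list[str], min_overlap: int) -> dict[tuple[str, str], int]:
    """Directed edges (a, b) with overlap length, for overlaps >= min_overlap."""
    edges = {}
    for a, b in permutations(reads, 2):
        olen = overlap(a, b)
        if olen >= min_overlap:
            edges[(a, b)] = olen
    return edges


def transitive_edges(edges: dict[tuple[str, str], int]) -> set[tuple[str, str]]:
    """Edges a -> c for which a path a -> b -> c also exists."""
    nodes = {read for pair in edges for read in pair}
    return {
        (a, c)
        for (a, c) in edges
        if any((a, b) in edges and (b, c) in edges for b in nodes - {a, c})
    }


reads = ["ACGTTACC", "GTTACCGA", "TACCGAAT"]
edges = overlap_graph(reads, min_overlap=4)
print(edges)
print(transitive_edges(edges))
```

Removing the transitive edges is one way to simplify the graph before it is traversed to produce a consensus.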

45 Overlap graph

46 Overlap graph

47 De Bruijn graph

48 De Bruijn graph
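As a rough illustration of the De Bruijn approach named among the assembly algorithms, the sketch below uses the common formulation in which nodes are (k-1)-mers and every k-mer found in a read adds an edge from its prefix to its suffix. It assumes error-free reads and a graph without branches or repeats; the function names and the example reads are made up for this illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_linear_path(graph: dict[str, set[str]]) -> str:
    """Spell the sequence of a graph that is a single unbranched path."""
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one node without incoming edges")
    node = starts[0]
    sequence = node
    seen = {node}
    while node in graph:
        successors = graph[node]
        if len(successors) != 1:
            raise ValueError("graph branches; a simple walk is not enough")
        node = next(iter(successors))
        if node in seen:
            raise ValueError("graph contains a cycle")
        seen.add(node)
        sequence += node[-1]  # each step adds one new base
    return sequence


reads = ["ACGTTACC", "GTTACCGA", "TACCGAAT"]
graph = de_bruijn_graph(reads, k=4)
print(walk_linear_path(graph))
```

Unlike the overlap graph, no pairwise comparison of reads is needed: overlapping reads share k-mers and therefore share edges.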

49 Evaluating an assembly Evaluating assembly quality is as important as the assembly itself! Assembly quality depends on:
1 Coverage: low coverage is mathematically hopeless
2 Repeat composition: high repeat content is challenging
3 Read length: longer reads help resolve repeats
4 Error rate: errors reduce coverage and increase false positives

50 Repeats Over 50% of the human genome is repetitive!
Repeats: the BIG problem; long-range misassembly
Masking repeats: known repeats; high-depth regions
Paired-end sequencing can meet (some of) the need.

51 Summary
HT sequencing produces millions of reads.
Index structures improve speed.
There is a tradeoff between speed and memory.
Graph theory helps to solve the complexity of assembly.
Sequencing of full-length RNA molecules by new technologies (PacBio, Nanopore) eliminates the need to fragment the RNA during library preparation, so assembler tools are no longer needed (however, new tools are needed).
