splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

Size: px

Start display at page:

Download "splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014"

Alexander Wilcox
6 years ago
Views:

1 splitmem: graphical pan-genome analysis with suffix skips Shoshana Marcus May 7, 2014

2 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis

3 Objective Input! Output! A B C D Several complete genomes! Available today for many microbial species, near future for higher eukaryotes! Pan-genome: analyze multiple genomes of species together Compressed de Bruijn graph! Graphical representation depicts how population variants relate to each other, especially where they diverge at branch points! How well conserved is a sequence?! What are network properties?!

4 de Bruijn graph Node for each distinct kmer Directed edge connects consecutive kmers Nodes overlap by k-1 bp Self-loops, multi-edges AGAAGTCC ATAAGTTA Reconstruct original sequence: Eulerian path through graph, visit each edge once

5 Compressed de Bruijn graph Merge non-branching chains of nodes Min. number of nodes that preserve path labels! Usually built from uncompressed graph! We build directly in O(n log n) time and space

6 Compresssed de Bruijn graph 9 strains of Bacillus anthracis k=25

7 Compresssed de Bruijn graph 9 strains of Bacillus anthracis k=1000

8 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis

9 Suffix Tree Rooted, directed tree with leaf for each suffix. Each internal node, except the root, has at least two children. Each edge is labeled with nonempty substring. No two siblings begin with the same character. Path from root to leaf i spells suffix S[i... n]. Append special character $ to guarantee each suffix ends at leaf.

10 Constructing Suffix Tree NaïveAlgorithm S=banana$ suf 1 banana$ suf 1

11 Constructing Suffix Tree NaïveAlgorithm S=banana$ suf 2 suf 1 banana$ anana$ suf 2

12 Constructing Suffix Tree NaïveAlgorithm S=banana$ nana$ suf 2 suf 1 suf 3 suf 3 nana$

13 Constructing Suffix Tree NaïveAlgorithm S=banana$ nana$ suf 2 suf 1 suf 3 ana$ suf 4

14 Constructing Suffix Tree NaïveAlgorithm S=banana$ suf 2 suf 4 banana$ suf 1 suf 3 ana$ suf 4

15 Constructing Suffix Tree suf 6 suf 2 NaïveAlgorithm $ suf 4 na banana$ suf 1 O(n 2 )Eme suf 3 suf 7 suf 5 S=banana$ banana$ anana$ nana$ ana$ na$ a$ $ suf 1 suf 2 suf 3 suf 4 suf 5 suf 6 suf 7

16 suf 6 suf 2 $ Constructing Suffix Tree suf 3 suf 4 na banana$ suf 1 O(n)Eme suf 7 suf 5 On#line'Construc/n'of'Suffix'Trees,E.Ukkonen Algorithmica(1995) SuffixLinks ana a na

17 Suffix Tree Query suf 6 suf 2 $ suf 3 suf 4 na banana$ suf 1 suf 7 suf 5 S=banana$ Search for ban

18 Suffix Tree Query suf 6 suf 2 $ suf 3 suf 4 na banana$ suf 1 suf 7 suf 5 S=banana$ Search for ban

19 Suffix Tree Query suf 6 suf 2 $ suf 3 suf 4 na banana$ suf 1 suf 7 suf 5 S=banana$ Search for ban

20 Suffix Tree Query S=banana$ suf 6 suf 2 $ suf 3 suf 4 na banana$ suf 1 suf 7 suf 5 Search for ban Found 1 occurrence

21 Suffix Tree Query S=banana$ suf 6 suf 2 $ suf 3 suf 4 na banana$ suf 1 suf 7 suf 5 Search for band Not found

22 Suffix Tree Query S=banana$ suf 6 suf 2 $ suf 3 suf 4 na banana$ suf 1 suf 7 suf 5 Search for an Found 2 occurrences

23 Suffix Tree! Many applications in computational biology! Linear time construction algorithms Linear time solutions to Genome alignment Finding longest common substring All-pairs suffix-prefix matching Locating all maximal repetitions And many more

24 MEMs MaximalExactMatch(MEM) Exactmatchwithinsequencethatcannotbe extendedletorrightwithoutintroducing mismatch. TGCACGCAA Weareinterested inmemslength k

25 MEMs Maximal Exact Match (MEM) Exact match within sequence that cannot be extended left or right without introducing mismatch. MEMs are internal nodes in the suffix tree that have left-diverse descendants. (have descendant leaves that represent suffixes with different characters preceding them)! Linear-time suffix tree traversal to locate MEMs.

26 MEMs in Suffix Tree suf 6 suf 2 Possible MEMs: a, ana, na $ MEM? suf 4 MEM? na banana$ suf 1 MEM? suf 3 MEMs are internal nodes in suffix tree with left-diverse descendants suf 7 suf 5 S=banana$ banana$ anana$ nana$ ana$ na$ a$ $ suf 1 suf 2 suf 3 suf 4 suf 5 suf 6 suf 7

27 MEMs in Suffix Tree suf 2! MEM?! $ suf 6 MEM? suf 4 na MEMs: a, ana banana$ suf 1 MEM? suf 3 MEMs are internal nodes in suffix tree with left-diverse descendants suf 7 suf 5 S=banana$ banana$ nana$ suf 2 suf 4

28 MEMs in Suffix Tree suf 6 suf 2 $ MEM na banana$ suf 3 suf 4 MEM MEMs: a, ana suf 1 XMEM? MEMs are internal nodes in suffix tree with left-diverse descendants suf 7 suf 5 S=banana$ anana$ ana$ suf 3 suf 5

29 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis

30 Compresssed de Bruijn graph Types of nodes: i. repeatnodes ii. uniquenodes Input: AGAAGTCC$ATAAGTTA

31 splitmem Nodes in compressed de Bruijn graph classified as i. repeatnodes ii. uniquenodes Algorithm: 1 Construct set of repeatnodes 2 Sort start positions of repeatnodes 3 Create edges and uniquenodes to link noncontiguous repeatnodes

32 repeatnodes 1 Construct set of repeatnodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs, length k 3. Preprocess suffix tree for LMA queries 4. Compute repeatnodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length k

33 1 MEM occurs twice TGCAC GGCAA GCA

34 Overlapping MEMs TGCCATCGCCAACCAT TGCCATCGCCAACCAT

35 Tandem Repeat AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA AGGCTTGGCTTGGCTTGGCTA

36 repeatnodes 1 Construct set of repeatnodes 1. Build suffix tree of genome 2. Mark internal nodes that are MEMs, length k 3. Preprocess suffix tree for LMA queries 4. Compute repeatnodes in compressed de Bruijn graph by decomposing MEMs and extracting overlapping components, length k

37 SplitMEMtorepeatNodes xxyzα β yxyzα β uα γ MEM α z x y z α u α α β α γ α α β MEM

38 SplitMEMtorepeatNodes xxyzα β yxyzα β uα γ MEM α z α β x y z α β MEM FindMEMinsuffixtree.

39 SplitMEMtorepeatNodes xxyzα β yxyzα β uα γ MEM α z α β x y z α β MEM Traversesuffixlink. LookforMEMasancestor.

40 SplitMEMtorepeatNodes xxyzα β yxyzα β uα γ MEM α z α β x y z α β MEM Traversesuffixlink. LookforMEMasancestor.

41 SplitMEMtorepeatNodes xxyzα β yxyzα β uα γ MEM α z α β x y z α β MEM Traversesuffixlink. LookforMEMasancestor.

42 SplitMEMtorepeatNodes xxyzα β yxyzα β uα γ MEM α z x y z α u α α β α γ α α β MEM FoundMEMasancestor.Decompose. RemoveembeddedMEM(suffixlinks).FindnextembeddedMEM.

43 Suffix Skips! Reduce O(n 2 ) time to O(n log n) time Suffix link: quickly navigate to distant part of tree o Pointer from internal node labeled xs to node S o Trim 1 character in O(1) time o Trim c characters in O(c) time Suffix skip: o Trim c characters in O(log c) time

44 Genome: babab Suffix Skips Suffix skips 0 (dist = 1; suffix links) Suffix skips 1 (dist=2) Suffix skips 2 (dist=4)! Additional Preprocessing:! pointer jumping to rapidly add additional links!

45 splitmem software o C++ splitmem o open source Input modes: o single genome: fasta file o pan-genome: multi-fasta file Multi k-mer construct several compressed de Bruijn graphs without rebuilding suffix tree

46 Outline 1 Overview 2 Data Structures 3 splitmem Algorithm 4 Pan-genome Analysis

47 B. Anthracis and E. coli Pan-genome analysis Examine graph properties: Number nodes, edges, avg. degree Node length distribution Genome sharing among nodes Distribution of node distances to core genome Other properties that can be studied: Girth, Diameter, Modularity, Network Motifs, etc. Functional enrichment of highly conserved or genome specific genes.

48 Pan-genome analysis

49 Pan-genome analysis Graphs of main chromosomes 9 strains of Bacillus anthracis Selection of 9 strains of Escherichia coli Species K Nodes Edges Avg. Degree B. anthracis B. anthracis B. anthracis E. coli E. coli E. coli

50 Pan-genome analysis B. Anthracis and E. coli Examine graph properties Node length distribution Genome sharing among nodes Distribution of node distances to core genome

51 Pan-genome analysis Histogram of Node Lengths

52 Pan-genome analysis Histogram of Node Lengths Spike at 2k: SNPs

53 Pan-genome analysis B. Anthracis and E. coli Examine graph properties Node length distribution Genome sharing among nodes Distribution of node distances to core genome

54 Pan-genome analysis B. anthracis E. coli Fraction of nodes with each level of genome sharing

55 Pan-genome analysis B. Anthracis and E. coli Examine graph properties Node length distribution Genome sharing among nodes Distribution of node distances to core genome

56 Pan-genome analysis Graph encodes sequence context of segments. Core genome: subsequences that occur in at least 70% of underlying genomes. Branch and Bound Search Nodes can be further in terms of hops while closer by base pairs.

57 Pan-genome analysis Node distances to core genome Searched hop radius

58 Summary Identify pan-genome relationships graphically. Topological relationship between suffix tree and compressed de Bruijn graph. Direct construction of compressed de Bruijn graph for single or pan-genome. Introduce suffix skips. Explore pan-genome graphs of B. anthracis, E. coli. SplitMEM: Graphical pan-genome analysis with suffix skips. Marcus, S, Lee, H, Schatz, MC (2014) BioRxiv

59 Improve splitmem software: Future work Reduce space using compressed full-text index instead of suffix tree Approximate indexing of strains to form a pan-genome graph Alignment of reads to pan-genome Biological applications: Functional enrichment of core-genome and genome specific segments Expand study to larger collection of microbes and larger genomes

60 Acknowledgments Michael Schatz Hayan Lee Giuseppe Narzisi James Gurtowski Schatz Lab IT department Todd Heywood

61 Thank You!

62 Pan-genome analysis Branch and bound search (like Dijkstra s shortest path algorithm) to compute bp distance from each non-core node to core genome: Traverse all distinct paths from source until OR o a core node is reached o current node was visited by a shorter path Bounded search once a core node is found, its distance bounds maximum search distance along other paths

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

An Efficient Algorithm for Identifying the Most Contributory Substring Ben Stephenson Department of Computer Science University of Western Ontario Problem Definition Related Problems Applications Algorithm