Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department
|
|
- Jonah Scott
- 5 years ago
- Views:
Transcription
1
2 Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau Carnegie Mellon University Computational Biology Department
3 Eternity II: The Highest-Stakes Puzzle in History Courtesy: Matej Bat ha
4 The Newspaper Problem
5 What is a Genome?
6 The Molecular Structure of DNA DNA s Double Helix (1953) DNA s Molecular Structure
7 Nucleotides Nucleotide: Half of one rung of DNA. There are only four choices for the nucleic acid of a nucleotide: 1. Adenine (A) 2. Cytosine (C) 3. Guanine (G) bonds to C 4. Thymine (T) bonds to A The order of nucleotides determines genetics!
8 DNA Sequencing Genome: The nucleotide sequence read down one side of an organism s chromosomal DNA. CCGTAGTCGCGAACAGTATACGAGACAGTACAGATACGATACGATACGATCATTAACCGAGAGTACCAGATTCCAGATCATACG TTACGCTTAGCTACGGACGTACGATACCCAGATTACGATCCATATAGATATAACCGGTGTGTCTTGCTAATACGTAACGGGGTGCCT TCGATAGGTCAGAATACCAGATCTCTCGATCTTCTTACAGATACTACGATCCCCAGATACTACCCCTACTGACCCATCGTACGGGTA CTACTACGGATATACCGTAGAGGGATCCATATATCCCGAGACGTCTCGCGCATAAGATCATCGTCTAGATACACGTACGTA CTAGACTAGCGTCCTCTTATCGTCCCGATCGAGTCGCGTGCTCAGAAAAGCTACGATACGATACCCGATACTAGACCATAG A human genome has about 3 billion nucleotides. Sequencing a genome means reading all these letters. Key Point: DNA is submicroscopic! How do we read something that we cannot see?
9 Introduction to Genome Sequencing
10 All Humans Are Genetically Very Similar Any two humans share 99.9% of their genomes. The 0.1% difference accounts for height, color, high cholesterol susceptibility, etc. eye CTGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACCACATCGTAGCTACGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA TCGACTACAGATAAAACCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT CTGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACAACATCGTAGCTACGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA TCGACTACAGAAACCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
11 All Humans Are Genetically Very Similar Any two humans share 99.9% of their genomes. The 0.1% difference accounts for height, color, high cholesterol susceptibility, etc. eye CTGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACCACATCGTAGCTACGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA TCGACTACAGATAAAACCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT CTGGACTACGCTACTACTGCTAGCTGTATTACGA TCAGCTACAACATCGTAGCTACGCATTAGCAAGCTAT CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA TCGACTACAGAAACCTAGTACAACAGTATACATA GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
12 Species vs. Individual Sequencing Species Sequencing: What is a species s consensus genome?
13 Species Sequencing vs. Individual Sequencing Individual Sequencing: What makes an individual unique?
14 Why Would We Sequence a Species s Genome? Hug et al., 2016 Courtesy: Nature Biotechnology, Discovery Magazine
15 Why Would We Sequence an Individual s Genome? Personalized Medicine: Tailoring medical treatment to the individual based on their genome. 2010: First person whose life was saved because of genome sequencing.
16 Brief History of Genome Sequencing Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods. 1980: They share the Nobel Prize in Chemistry. Walter Gilbert Still, their sequencing methods were too expensive for large genomes: with a $1 per nucleotide cost, it would cost $3 billion to sequence a human genome. Frederick Sanger
17 Brief History of Genome Sequencing 1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome. Francis Collins 1997: Craig Venter founds Celera Genomics, a private firm, with the same goal. Craig Venter
18 Brief History of Genome Sequencing 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.
19 Brief History of Genome Sequencing 2000s: Many more mammalian genomes are sequenced.
20 The Arrival of Personal Genomics 2010s: The market for sequencing machines takes off. Illumina reduces the cost of sequencing an individual human genome from $3 billion to $1,000. Beijing Genome Institute orders hundreds of sequencing machines. Companies offer partial genome sequencing for ~$500. UK funds genome sequencing for 100,000 citizens.
21 The Future of Genome Sequencing The Near Future (?): Hopefully, sequencing an individual genome will soon become as routine as an X-ray.
22 The Newspaper Problem and Genome Sequencing
23 Returning to The Newspaper Problem Given enough time, how might you solve this problem?
24 The Newspaper Problem as an Overlap Puzzle The newspaper problem is not the same as a jigsaw puzzle: We have multiple copies of the same edition of a newspaper. Plus, some pieces of paper got blown to bits in the explosion. Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said. This gives us a giant overlap puzzle.
25 What Makes Genome Sequencing So Difficult? When we read a book, we can read the entire book one letter at a time from beginning to end. However, modern sequencing machines can only read short pieces of DNA (~250 nucleotides long), called reads. vs.
26 Sequencing a Genome: Illustration Multiple identical copies of a genome Shatter the genome into reads Sequence the reads (Lab) AGAATATCA TGAGAATAT GAGAATATC Assemble the genome using overlapping reads (Computational) AGAATATCA GAGAATATC TGAGAATAT...TGAGAATATCA... What does genome sequencing remind you of?
27 Fragment Assembly = Overlap Puzzle!
28 Complications in Genome Assembly 1. DNA is double-stranded (and may consist of multiple chromosomes). 2. Reads have imperfect coverage of the underlying genome. 3. Sequencing machines are error-prone.
29 Simplifying Assumptions for Genome Assembly 1. DNA is single-stranded (and consists of a single circular chromosome, like bacteria). 2. Reads have perfect coverage of the underlying genome. 3. Sequencing machines are error-free.
30 From Biological Data to a Computational Problem Aim: Construct a shortest circular genome containing all our reads. These reads have length 3, and the answer isn t obvious. How in the world would we solve this problem if we had a billion reads? TTT TCT TTG TCG TTC TCC TTA TCA GTT GCT GTG GCG GTC GCC GTA GCA CTT CCT CTG CCG CTC CCC CTA CCA ATT ACT ACG ATC ACC ATA ACA TGT TAT TGG TAG TGC TAC TGA TAA GGT GAT GGG GAG GGC GAC GGA GAA CGT CAT CGG CAG CGC CAC CGA CAA AGT AAT AGG AAG AGC AAC AGA AAA What method would you use?
31 From Biological Data to a Computational Problem Idea: Look for reads having maximum overlap and build circular string. Say we start with AAT. Which read should we pick next? AAT TTT TCT TTG TCG TTC TCC TTA TCA GTT GCT GTG GCG GTC GCC GTA GCA CTT CCT CTG CCG CTC CCC CTA CCA ATT ACT ACG ATC ACC ATA ACA TGT TAT TGG TAG TGC TAC TGA TAA GGT GAT GGG GAG GGC GAC GGA GAA CGT CAT CGG CAG CGC CAC CGA CAA AGT AAT AGG AAG AGC AAC AGA AAA
32 From Biological Data to a Computational Problem Idea: Look for reads having maximum overlap and build circular string. Say we start with AAT. Which read should we pick next? AAT TTT TCT TTG TCG TTC TCC TTA TCA GTT GCT GTG GCG GTC GCC GTA GCA CTT CCT CTG CCG CTC CCC CTA CCA ATT ACT ACG ATC ACC ATA ACA TGT TAT TGG TAG TGC TAC TGA TAA GGT GAT GGG GAG GGC GAC GGA GAA CGT CAT CGG CAG CGC CAC CGA CAA AGT AAT AGG AAG AGC AAC AGA AAA
33 From Biological Data to a Computational Problem Idea: Look for reads having maximum overlap and build circular string. Now we have a choice between TGC and TGG. AAT TG? TTT TCT TTG TCG TTC TCC TTA TCA GTT GCT GTG GCG GTC GCC GTA GCA CTT CCT CTG CCG CTC CCC CTA CCA ATT ACT ACG ATC ACC ATA ACA TGT TAT TGG TAG TGC TAC TGA TAA GGT GAT GGG GAG GGC GAC GGA GAA CGT CAT CGG CAG CGC CAC CGA CAA AGT AAT AGG AAG AGC AAC AGA AAA
34 From Biological Data to a Computational Problem Idea: Look for reads having maximum overlap and build circular string. If we get lucky with TGC, again we have a choice between GCA and GCG. AAT TGC GC? TTT TCT TTG TCG TTC TCC TTA TCA GTT GCT GTG GCG GTC GCC GTA GCA CTT CCT CTG CCG CTC CCC CTA CCA ATT ACT ACG ATC ACC ATA ACA TGT TAT TGG TAG TGC TAC TGA TAA GGT GAT GGG GAG GGC GAC GGA GAA CGT CAT CGG CAG CGC CAC CGA CAA AGT AAT AGG AAG AGC AAC AGA AAA
35 Repeats Cause Headaches in Genome Assembly 50% of the human genome is made up of repeats: strings that appear multiple times with minor variations. Analogy: The Triazzle contains lots of repeated figures, which makes it difficult to solve (even with just 16 pieces). Repeated information is what makes Eternity II so difficult to solve!
36 Taking a Walk
37 The Bridges of Königsberg The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks.
38 The Bridges of Königsberg They wondered if they could start at home, walk through the city, cross each bridge (blue) exactly once, and return home. Can you see a solution?
39 The Bridges of Königsberg 1735: Leonhard Euler develops an approach to answer this question for any city, even for a city with a million islands. We will soon see how Euler did this. Leonhard Euler
40 The Icosian Game Over a century passes 1857: Irish mathematician William Hamilton designs a game consisting of a board representing 20 islands connected by bridges. Goal: find a walk that visits every island exactly once and returns back where it started. Icosian Game Can you see a solution?
41 These Two Stories Have Something in Common Find a walk that crosses every bridge once and returns home. Find a walk that visits every island once and returns home.
42 ... But How Do They Relate to Genome Sequencing? But how do these problems relate to genome sequencing? Assemble the genome using overlapping reads AGAATATCA GAGAATATC TGAGAATAT...TGAGAATATCA...
43 Hamiltonian and Eulerian Cycles
44 Königsberg Bridges Network For the Königsberg Bridge Problem, we create a network: Nodes = 4 land masses of the city Edges = 7 bridges connecting land masses We are looking for an Eulerian cycle: crossing every edge. Now can you see a solution?
45 Hamiltonian Cycles A Hamiltonian cycle touches each node exactly once and returns to its starting node.
46 Hamiltonian Cycles A Hamiltonian cycle touches each node exactly once and returns to its starting node.
47 Hamiltonian Cycles A Hamiltonian cycle touches each node exactly once and returns to its starting node.
48 Hamiltonian Cycles A Hamiltonian cycle touches each node exactly once and returns to its starting node. Note: We didn t use every edge here.
49 Finding Eulerian and Hamiltonian Cycles Given a network G, we now have two questions that we can program a computer to answer about G. Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G does not have an Eulerian cycle. Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G does not have a Hamiltonian cycle.
50 Eulerian Cycles If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this network. However, no such cycle exists. Why?
51 Eulerian Cycles If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this network. However, no such cycle exists. Why? If we add two more edges, there will be such a cycle.
52 Euler s Theorem
53 Directed Networks Directed Network: A network in which each edge has a direction (represented by an arrow). Think of directed edges as one-way bridges. Undirected Directed
54 Eulerian Cycles in Directed Networks An Eulerian cycle in a directed network must travel down all the edges in the correct direction. Does this network have an Eulerian cycle?
55 Balanced Networks We can label each node with the number of edges in and the number of edges out. A network is balanced if at each node, the number of incoming edges equals the number departing. This network isn t balanced. (2, 2) (1, 1) (2, 2) (2, 2) (1, 1) However, adding some edges makes it balanced. (1, 1) (2, 2)
56 Euler s Theorem Euler s Theorem: A directed network contains an Eulerian cycle when the network is balanced and connected. (2, 2) (2, 2) (1, 1) (2, 2) (1, 1) Not Connected (1, 1) (2, 2) Connected + Balanced = Eulerian
57 Euler s Theorem Euler s Theorem: A directed network contains an Eulerian cycle when the network is balanced and connected. Key point: It is also possible to program a computer to quickly find an Eulerian cycle in a balanced, connected directed network... even one with a billion edges!
58 Difficult Computational Problems
59 So What s the Big Deal? I thought computers were supermachines! Universal Pictures
60 So What s the Big Deal? Computers don t need 300-year old math to solve problems.
61 So What s the Big Deal? Aren t computers going to take over the world anyway? MGM
62 So What s the Big Deal? So let s examine the case of finding a Hamiltonian cycle
63 Searching for an Efficient Solution for the HCP Key Point: No one has ever found a similar efficient test if a network has a Hamiltonian cycle. Of course, we could examine every possible ant walk through the network to solve the HCP. However, this brute force approach is infeasible: there are more walks through the average network with just 1,000 nodes than there are atoms in the universe!
64 Searching for an Efficient Solution for the HCP Attempting to solve the HCP is difficult. I can't find an efficient algorithm, I guess I'm just too dumb. From Garey and Johnson. Computers and Intractability. 1979
65 Searching for an Efficient Solution for the HCP Attempting to solve the HCP is difficult. The hope is that you could verify that you failed because an efficient method solving the HCP doesn t exist. I can't find an efficient algorithm, because no such algorithm is possible. From Garey and Johnson. Computers and Intractability. 1979
66 Searching for an Efficient Solution for the HCP Attempting to solve the HCP is difficult. The hope is that you could verify that you failed because an efficient method solving the HCP doesn t exist. The present state of affairs is somewhere in between. I can't find an efficient algorithm, but neither can all these smart people. From Garey and Johnson. Computers and Intractability. 1979
67 High-Stakes Mathematics The question of whether the HCP can be solved efficiently is one of seven Millennium Problems (Clay Institute). Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will win $1 million. Only one problem has been solved, by Grigory Perelman, who refused the prize. Grigory Perelman
68 From Euler and Hamilton to Assembling Genomes
69 Returning to our Toy Example Goal: Use overlapping DNA reads in order to reconstruct the original genome. GTG GCG TGG TGC GGC GCA CGT CAA AAT Idea: Let s construct a network that represents the overlap information in our reads.
70 Constructing a Network from Reads Create a node for every read. GTG GCG GCA TGG TGC GGC CGT CAA AAT
71 Constructing a Network from Reads Create a node for every read. CGT GGC AAT GTG TGG TGC CAA GCA GCG
72 Constructing a Network from Reads Create a node for every read. Prefix: First 2 nucleotides of a read (CAA) Suffix: Last 2 nucleotides of a read (CAA) Different 3-mers may share a prefix/suffix:, TGA, CTG CGT GGC AAT GTG TGG TGC CAA GCA GCG
73 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
74 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
75 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
76 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
77 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
78 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
79 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
80 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
81 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
82 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
83 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
84 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: CGT GGC AAT GTG TGG TGC CAA GCA GCG
85 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: CGT GGC AAT GTG TGG TGC CAA GCA GCG
86 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG CGT GGC AAT GTG TGG TGC CAA GCA GCG
87 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC CGT GGC AAT GTG TGG TGC CAA GCA GCG
88 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG CGT GGC AAT GTG TGG TGC CAA GCA GCG
89 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT CGT GGC AAT GTG TGG TGC CAA GCA GCG
90 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG CGT GGC AAT GTG TGG TGC CAA GCA GCG
91 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC CGT GGC AAT GTG TGG TGC CAA GCA GCG
92 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA CGT GGC AAT GTG TGG TGC CAA GCA GCG
93 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA CGT GGC AAT GTG TGG TGC CAA GCA GCG
94 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT CGT GGC AAT GTG TGG TGC CAA GCA GCG
95 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à CGT GGC AAT GTG TGG TGC CAA GCA GCG
96 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à
97 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à A T G Genome:
98 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à TGG A T G G Genome: G
99 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à TGG GGC A T G G Genome: GC C
100 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à TGG GGC GCG A T G G Genome: GCG G C
101 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à TGG GGC GCG CGT A T G G Genome: GCGT T G C
102 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à TGG GGC GCG CGT GTG G A T G G Genome: GCGTG T G C
103 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à TGG GGC GCG CGT GTG TGC C G A T G G Genome: GCGTGC T G C
104 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à TGG GGC GCG CGT GTG TGC GCA C G A A T G G Genome: GCGTGCA T G C
105 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA GCGTGCAA C G A T A G T C G G
106 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCAAT C G A T A G T C G G
107 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA C G A T A G T C G G
108 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA C G A T A G T C G G
109 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA C G A T A G T C G G
110 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA C G A T A G T C G G
111 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA C G A T A G T C G G
112 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT à Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA C G A T A G T C G G What is wrong with this approach?
113 The Problem with Our Network Ultimately, we must solve the HCP on our network (millions of nodes) in order to obtain a candidate genome
114 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT
115 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT
116 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT TG
117 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT TG GC
118 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT TG GC
119 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT TG GC
120 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT TG GC CA
121 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
122 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
123 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
124 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
125 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
126 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
127 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
128 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
129 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
130 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
131 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG GC CA
132 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC CA
133 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC CA
134 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC CA
135 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. GT CG GG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC CA
136 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GT GG CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC CA
137 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GT GG CG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC CA
138 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GT GG CG GCG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC CA
139 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GT GG CG GCG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC GCA CA
140 Second Try: Assign Reads to Edges Form a different network as follows: Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GT GG CG GCG Reads GTG GCG GCA TGG TGC GGC CGT CAA AAT AT TG AA GC GCA CA
141 Second Try: Assign Reads to Edges Form a different network as follows: Reads Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GCG GCA TGG TGC GGC CGT CAA AAT AT GTG GT TG TGG GG AA CG GC GCG GCA CA
142 Second Try: Assign Reads to Edges Form a different network as follows: Reads Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GCG GCA TGG TGC GGC CGT CAA AAT AT GTG GT TG TGG GG TGC AA CG GC GCG GCA CA
143 Second Try: Assign Reads to Edges Form a different network as follows: Reads Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GCG GCA TGG TGC GGC CGT CAA AAT AT GTG GT TG TGG GG TGC AA GGC CG GC GCG GCA CA
144 Second Try: Assign Reads to Edges Form a different network as follows: Reads Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GCG GCA TGG TGC GGC CGT CAA AAT AT GTG GT TG TGG CGT GG TGC AA GGC CG GC GCG GCA CA
145 Second Try: Assign Reads to Edges Form a different network as follows: Reads Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GCG GCA TGG TGC GGC CGT CAA AAT AT GTG GT TG TGG CGT GG TGC AA GGC CG GC GCG CAA GCA CA
146 Second Try: Assign Reads to Edges Form a different network as follows: Reads Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GCG GCA TGG TGC GGC CGT CAA AAT AT GTG AAT GT TG TGG CGT GG TGC AA GGC CG GC GCG CAA GCA CA
147 Second Try: Assign Reads to Edges Form a different network as follows: Reads Create a node for each distinct prefix/suffix from reads. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w. GTG GCG GCA TGG TGC GGC CGT CAA AAT AT GTG AAT GT TG TGG CGT GG TGC AA GGC CG GC GCG CAA GCA CA
148 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: GT CGT CG GTG TGG GG GGC GCG AT TG TGC GC GCA CA AAT CAA AA
149 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: GT CGT CG GTG TGG GG GGC GCG AT 1 TG TGC GC GCA CA AAT CAA AA
150 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GT CGT CG AT 1 GTG TGG 2 TG GG TGC GCG GGC GC GCA CA AAT AA CAA
151 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC GT CGT CG AT 1 GTG TGG 2 TG GG TGC GCG GGC 3 GC GCA CA AAT AA CAA
152 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG GT CGT CG AT 1 GTG TGG 2 TG GG TGC 4 GCG GGC 3 GC GCA CA AAT AA CAA
153 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT GT CGT 5 CG AT 1 GTG TGG 2 TG GG TGC 4 GCG GGC 3 GC GCA CA AAT AA CAA
154 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG GT CGT 5 CG AT 1 GTG 6 TGG 2 TG GG TGC 4 GCG GGC 3 GC GCA CA AAT AA CAA
155 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC GT CGT 5 CG AT 1 GTG 6 TGG 2 TG GG TGC 7 4 GCG GGC 3 GC GCA CA AAT AA CAA
156 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA GT CGT 5 CG AT 1 GTG TG GG 6 4 TGG GGC 2 3 GCG TGC GCA 7 GC 8 CA AAT AA CAA
157 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA GT CGT 5 CG AT GTG 1 AAT TG GG 6 4 TGG GGC 2 3 AA GCG TGC GCA 7 GC 8 9 CAA CA
158 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT GT CGT 5 CG AT GTG 1 10 AAT TG GG 6 4 TGG GGC 2 3 AA GCG TGC GCA 7 GC 8 9 CAA CA
159 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT GT CGT 5 CG AT GTG 1 10 AAT TG GG 6 4 TGG GGC 2 3 AA GCG TGC GCA 7 GC 8 9 CAA CA
160 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT This is the same sequence of reads that we found before! TGG GGC GCG CGT GTG TGC GCA CAA AAT
161 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT This is the same sequence of reads that we found before! Thus, we will obtain the same sequenced genome as before. A A T C G G T G C G Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA
162 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT This is the same sequence of reads that we found before! Thus, we will obtain the same sequenced genome as before. A A T C G G T G C G Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA
163 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: à TGG à GGC à GCG à CGT à GTG à TGC à GCA à CAA à AAT This is the same sequence of reads that we found before! Thus we will obtain the same sequenced genome as before. The only difference: a computer can find an Eulerian cycle quickly. Genome: TGG GGC GCG CGT GTG TGC GCA CAA AAT GCGTGCA
164 Dealing with Practical Complications
165 Complications in Genome Assembly 1. DNA is double-stranded (and may consist of multiple chromosomes). 2. Reads have imperfect coverage of the underlying genome. 3. Sequencing machines are error-prone.
166 Complication #1: Assembling Double-Stranded DNA DNA strands run in opposite directions....gcaatacgacagtcagcggacagacgttac......gtaacgtctgtccgctgactgtcgtattgccat...
167 Complication #1: Assembling Double-Stranded DNA DNA strands run in opposite directions....gcaatacgacagtcagcggacagacgttac......gtaacgtctgtccgctgactgtcgtattgccat... Given a read, we need to search for it and its reverse complement against the genome. Read GTCGTATTGC Reverse Complement GCAATACGAC
168 Complication #2: Imperfect Coverage In practice, we can only reconstruct the genome in a number of pieces called contigs.
169 Complication #3: Errors in Reads Cause Bubbles CC TGCCG GCCGT CCGTA CGTAT GT TG GA TGGAC GGACA C TGCC GCCG CCGT CGTA GTAT T G TGGA GGAC GACA GCCGC CCGC Bubble! CGCA GCAT C C CCGCA CGCAT GC
170 Thank You! Questions?
Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner
Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications
More informationDNA Sequencing. Overview
BINF 3350, Genomics and Bioinformatics DNA Sequencing Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Eulerian Cycles Problem Hamiltonian Cycles
More informationPyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds
Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds February 27, 2009 Alexander Mastroianni, Shelley Claridge, A. Paul Alivisatos Department of Chemistry, University of California,
More informationby the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression.
Figure S1. Tissue-specific expression profile of the genes that were screened through the RHEPatmatch and root-specific microarray filters. The gene expression profile (heat map) was drawn by the Genevestigator
More informationHP22.1 Roth Random Primer Kit A für die RAPD-PCR
HP22.1 Roth Random Kit A für die RAPD-PCR Kit besteht aus 20 Einzelprimern, jeweils aufgeteilt auf 2 Reaktionsgefäße zu je 1,0 OD Achtung: Angaben beziehen sich jeweils auf ein Reaktionsgefäß! Sequenz
More information6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3
6.1 Transgene Su(var)3-9-n P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,II 5 I,II,III,IV 3 6 7 I,II,II 8 I,II,II 10 I,II 3 P{GS.ry + UAS(Su(var)3-9)EGFP} A AII 3 B P{GS.ry + (10.5kbSu(var)3-9EGFP)}
More informationAppendix A. Example code output. Chapter 1. Chapter 3
Appendix A Example code output This is a compilation of output from selected examples. Some of these examples requires exernal input from e.g. STDIN, for such examples the interaction with the program
More informationTCGR: A Novel DNA/RNA Visualization Technique
TCGR: A Novel DNA/RNA Visualization Technique Donya Quick and Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 dquick@mail.smu.edu, mhd@engr.smu.edu
More informationwarm-up exercise Representing Data Digitally goals for today proteins example from nature
Representing Data Digitally Anne Condon September 6, 007 warm-up exercise pick two examples of in your everyday life* in what media are the is represented? is the converted from one representation to another,
More informationSupplementary Table 1. Data collection and refinement statistics
Supplementary Table 1. Data collection and refinement statistics APY-EphA4 APY-βAla8.am-EphA4 Crystal Space group P2 1 P2 1 Cell dimensions a, b, c (Å) 36.27, 127.7, 84.57 37.22, 127.2, 84.6 α, β, γ (
More informationDNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization
Eulerian & Hamiltonian Cycle Problems DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization The Bridge Obsession Problem Find a tour crossing every bridge just
More informationSUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells
SUPPLEMENTARY INFORMATION Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells Yuanming Wang 1,2,7, Kaiwen Ivy Liu 2,7, Norfala-Aliah Binte Sutrisnoh
More information10/15/2009 Comp 590/Comp Fall
Lecture 13: Graph Algorithms Study Chapter 8.1 8.8 10/15/2009 Comp 590/Comp 790-90 Fall 2009 1 The Bridge Obsession Problem Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg
More informationSequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics
Computational Biology IST Ana Teresa Freitas 2011/2012 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics (BACs) 1 Must take the fragments
More informationGraph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics Computational Biology IST Ana Teresa Freitas 2015/2016 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics
More informationSequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes
Sequence Assembly BMI/CS 576 www.biostat.wisc.edu/bmi576/ Mark Craven craven@biostat.wisc.edu Some sequencing successes Yersinia pestis Cannabis sativa The sequencing problem We want to determine the identity
More informationAlgorithms for Bioinformatics
Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics Lecture 3: Graph Algorithms
More informationGenome Sequencing Algorithms
Genome Sequencing Algorithms Phillip Compaeu and Pavel Pevzner Bioinformatics Algorithms: an Active Learning Approach Leonhard Euler (1707 1783) William Hamilton (1805 1865) Nicolaas Govert de Bruijn (1918
More information10/8/13 Comp 555 Fall
10/8/13 Comp 555 Fall 2013 1 Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg 10/8/13 Comp 555 Fall 2013 2 Find a cycle that visits every edge exactly once Linear
More informationAlgorithms for Bioinformatics
Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 58670 Algorithms for Bioinformatics Lecture 5: Graph Algorithms
More informationCSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly
CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly Ben Raphael Sept. 22, 2009 http://cs.brown.edu/courses/csci2950-c/ l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset
More informationGenome 373: Genome Assembly. Doug Fowler
Genome 373: Genome Assembly Doug Fowler What are some of the things we ve seen we can do with HTS data? We ve seen that HTS can enable a wide variety of analyses ranging from ID ing variants to genome-
More informationDigging into acceptor splice site prediction: an iterative feature selection approach
Digging into acceptor splice site prediction: an iterative feature selection approach Yvan Saeys, Sven Degroeve, and Yves Van de Peer Department of Plant Systems Biology, Ghent University, Flanders Interuniversity
More informationSequence Assembly Required!
Sequence Assembly Required! 1 October 3, ISMB 20172007 1 Sequence Assembly Genome Sequenced Fragments (reads) Assembled Contigs Finished Genome 2 Greedy solution is bounded 3 Typical assembly strategy
More informationMachine Learning Classifiers
Machine Learning Classifiers Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve Bayes Perceptrons, Multi-layer Neural Networks
More informationRead Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015
Mapping de Novo Assembly Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin Genomics: Lecture #2 WS 2014/2015 Today Genome assembly: the basics Hamiltonian and Eulerian
More information2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA
176 SUPPLEMENTAL TABLES 177 Table S1. ASPE Primers for HIV-1 group M subtype B Primer no Type a Sequence (5'-3') Tag ID b Position c 1 M41 Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR A d 45
More informationDNA Fragment Assembly
Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly Overlap
More informationde novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis
de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics 27626 - Next Generation Sequencing Analysis Generalized NGS analysis Data size Application Assembly: Compare
More informationEfficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside
Efficient Selection of Unique and Popular Oligos for Large EST Databases Stefano Lonardi University of California, Riverside joint work with Jie Zheng, Timothy Close, Tao Jiang University of California,
More informationPurpose of sequence assembly
Sequence Assembly Purpose of sequence assembly Reconstruct long DNA/RNA sequences from short sequence reads Genome sequencing RNA sequencing for gene discovery Amplicon sequencing But not for transcript
More informationA relation between trinucleotide comma-free codes and trinucleotide circular codes
Theoretical Computer Science 401 (2008) 17 26 www.elsevier.com/locate/tcs A relation between trinucleotide comma-free codes and trinucleotide circular codes Christian J. Michel a,, Giuseppe Pirillo b,c,
More informationEulerian Tours and Fleury s Algorithm
Eulerian Tours and Fleury s Algorithm CSE21 Winter 2017, Day 12 (B00), Day 8 (A00) February 8, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Vocabulary Path (or walk): describes a route from one vertex
More informationGraph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 13 Lopresti Fall 2007 Lecture 13-1 - Outline Introduction to graph theory Eulerian & Hamiltonian Cycle
More informationMLiB - Mandatory Project 2. Gene finding using HMMs
MLiB - Mandatory Project 2 Gene finding using HMMs Viterbi decoding >NC_002737.1 Streptococcus pyogenes M1 GAS TTGTTGATATTCTGTTTTTTCTTTTTTAGTTTTCCACATGAAAAATAGTTGAAAACAATA GCGGTGTCCCCTTAAAATGGCTTTTCCACAGGTTGTGGAGAACCCAAATTAACAGTGTTA
More informationCrick s Hypothesis Revisited: The Existence of a Universal Coding Frame
Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame Jean-Louis Lassez*, Ryan A. Rossi Computer Science Department, Coastal Carolina University jlassez@coastal.edu, raross@coastal.edu
More informationEulerian tours. Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck. April 20, 2016
Eulerian tours Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck http://cseweb.ucsd.edu/classes/sp16/cse21-bd/ April 20, 2016 Seven Bridges of Konigsberg Is there a path that crosses each
More informationSupplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot)
Supplementary Data Image Processing Workflow Diagram A - Preprocessing B - Hough Transform C - Angle Histogram (Rose Plot) D - Determination of holes Description of Image Processing Workflow The key steps
More informationGraphs and Puzzles. Eulerian and Hamiltonian Tours.
Graphs and Puzzles. Eulerian and Hamiltonian Tours. CSE21 Winter 2017, Day 11 (B00), Day 7 (A00) February 3, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Exam Announcements Seating Chart on Website Good
More informationLABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI
1. PURPOSE: to describe the standardized laboratory protocol for molecular subtyping of Shiga toxin-producing Escherichia coli O157 (STEC O157) and Salmonella enterica serotypes Typhimurium and Enteritidis.
More informationSupplementary Materials:
Supplementary Materials: Amino acid codo n Numb er Table S1. Codon usage in all the protein coding genes. RSC U Proportion (%) Amino acid codo n Numb er RSC U Proportion (%) Phe UUU 861 1.31 5.71 Ser UCU
More informationDegenerate Coding and Sequence Compacting
ESI The Erwin Schrödinger International Boltzmanngasse 9 Institute for Mathematical Physics A-1090 Wien, Austria Degenerate Coding and Sequence Compacting Maya Gorel Kirzhner V.M. Vienna, Preprint ESI
More informationRESEARCH TOPIC IN BIOINFORMANTIC
RESEARCH TOPIC IN BIOINFORMANTIC GENOME ASSEMBLY Instructor: Dr. Yufeng Wu Noted by: February 25, 2012 Genome Assembly is a kind of string sequencing problems. As we all know, the human genome is very
More informationde Bruijn graphs for sequencing data
de Bruijn graphs for sequencing data Rayan Chikhi CNRS Bonsai team, CRIStAL/INRIA, Univ. Lille 1 SMPGD 2016 1 MOTIVATION - de Bruijn graphs are instrumental for reference-free sequencing data analysis:
More informationScalable Solutions for DNA Sequence Analysis
Scalable Solutions for DNA Sequence Analysis Michael Schatz Dec 4, 2009 JHU/UMD Joint Sequencing Meeting The Evolution of DNA Sequencing Year Genome Technology Cost 2001 Venter et al. Sanger (ABI) $300,000,000
More informationDNA Fragment Assembly
SIGCSE 009 Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly
More informationSupporting Information
Copyright WILEY VCH Verlag GmbH & Co. KGaA, 69469 Weinheim, Germany, 2015. Supporting Information for Small, DOI: 10.1002/smll.201501370 A Compact DNA Cube with Side Length 10 nm Max B. Scheible, Luvena
More informationWalking with Euler through Ostpreußen and RNA
Walking with Euler through Ostpreußen and RNA Mark Muldoon February 4, 2010 Königsberg (1652) Kaliningrad (2007)? The Königsberg Bridge problem asks whether it is possible to walk around the old city in
More informationComputational Methods for de novo Assembly of Next-Generation Genome Sequencing Data
1/39 Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data Rayan Chikhi ENS Cachan Brittany / IRISA (Genscale team) Advisor : Dominique Lavenier 2/39 INTRODUCTION, YEAR 2000
More informationdebgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA
debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA De Bruijn graphs are ubiquitous [Pevzner et al. 2001, Zerbino and Birney,
More informationof Nebraska - Lincoln
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln MAT Exam Expository Papers Math in the Middle Institute Partnership 7-2008 De Bruijn Cycles Val Adams University of Nebraska
More informationDNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2008 DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats Ching Li San Jose State University
More informationI519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB
I519 Introduction to Bioinformatics, 2014 Genome assembly Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents Genome assembly problem Approaches Comparative assembly The string
More information02-711/ Computational Genomics and Molecular Biology Fall 2016
Literature assignment 2 Due: Nov. 3 rd, 2016 at 4:00pm Your name: Article: Phillip E C Compeau, Pavel A. Pevzner, Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29,
More informationProblem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.
CS267 Assignment 3: Problem statement 2 Parallelize Graph Algorithms for de Novo Genome Assembly k-mers are sequences of length k (alphabet is A/C/G/T). An extension is a simple symbol (A/C/G/T/F). The
More informationWorksheet 28: Wednesday November 18 Euler and Topology
Worksheet 28: Wednesday November 18 Euler and Topology The Konigsberg Problem: The Foundation of Topology The Konigsberg Bridge Problem is a very famous problem solved by Euler in 1735. The process he
More informationAssembly in the Clouds
Assembly in the Clouds Michael Schatz October 13, 2010 Beyond the Genome Shredded Book Reconstruction Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools
More informationde novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73
1/73 de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January 2014 2/73 YOUR INSTRUCTOR IS.. - Postdoc at Penn State, USA - PhD at INRIA / ENS Cachan,
More informationIntroduction to Genome Assembly. Tandy Warnow
Introduction to Genome Assembly Tandy Warnow 2 Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp 10,000bp Not all sequencing technologies produce mate-pairs. Different
More informationA Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving on DNA-encoded Data
International Journal of Electrical and Computer Engineering (IJECE) Vol. 4, No. 1, Feburary 2014, pp. 93~100 ISSN: 2088-8708 93 A Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving
More informationEECS 203 Lecture 20. More Graphs
EECS 203 Lecture 20 More Graphs Admin stuffs Last homework due today Office hour changes starting Friday (also in Piazza) Friday 6/17: 2-5 Mark in his office. Sunday 6/19: 2-5 Jasmine in the UGLI. Monday
More informationIntermediate Math Circles Wednesday, February 22, 2017 Graph Theory III
1 Eulerian Graphs Intermediate Math Circles Wednesday, February 22, 2017 Graph Theory III Let s begin this section with a problem that you may remember from lecture 1. Consider the layout of land and water
More informationHow to apply de Bruijn graphs to genome assembly
PRIMER How to apply de Bruijn graphs to genome assembly Phillip E C Compeau, Pavel A Pevzner & lenn Tesler A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling
More informationStructural analysis and haplotype diversity in swine LEP and MC4R genes
J. Anim. Breed. Genet. ISSN - OIGINAL ATICLE Structural analysis and haplotype diversity in swine LEP and MC genes M. D Andrea, F. Pilla, E. Giuffra, D. Waddington & A.L. Archibald University of Molise,
More informationBioinformatics-themed projects in Discrete Mathematics
Bioinformatics-themed projects in Discrete Mathematics Art Duval University of Texas at El Paso Joint Mathematics Meeting MAA Contributed Paper Session on Discrete Mathematics in the Undergraduate Curriculum
More informationProgramming Applications. What is Computer Programming?
Programming Applications What is Computer Programming? An algorithm is a series of steps for solving a problem A programming language is a way to express our algorithm to a computer Programming is the
More informationSequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.
Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD
More information#30: Graph theory May 25, 2009
#30: Graph theory May 25, 2009 Graph theory is the study of graphs. But not the kind of graphs you are used to, like a graph of y = x 2 graph theory graphs are completely different from graphs of functions.
More informationBMI/CS 576 Fall 2015 Midterm Exam
BMI/CS 576 Fall 2015 Midterm Exam Prof. Colin Dewey Tuesday, October 27th, 2015 11:00am-12:15pm Name: KEY Write your answers on these pages and show your work. You may use the back sides of pages as necessary.
More informationGenome Sequencing & Assembly. Slides by Carl Kingsford
Genome Sequencing & Assembly Slides by Carl Kingsford Genome Sequencing ACCGTCCAATTGG...! TGGCAGGTTAACC... E.g. human: 3 billion bases split into 23 chromosomes Main tool of traditional sequencing: DNA
More information7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14
7.36/7.91 recitation DG Lectures 5 & 6 2/26/14 1 Announcements project specific aims due in a little more than a week (March 7) Pset #2 due March 13, start early! Today: library complexity BWT and read
More informationCHAPTER 10 GRAPHS AND TREES. Alessandro Artale UniBZ - artale/z
CHAPTER 10 GRAPHS AND TREES Alessandro Artale UniBZ - http://www.inf.unibz.it/ artale/z SECTION 10.1 Graphs: Definitions and Basic Properties Copyright Cengage Learning. All rights reserved. Graphs: Definitions
More informationIntroduction III. Graphs. Motivations I. Introduction IV
Introduction I Graphs Computer Science & Engineering 235: Discrete Mathematics Christopher M. Bourke cbourke@cse.unl.edu Graph theory was introduced in the 18th century by Leonhard Euler via the Königsberg
More informationPERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS
PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS Mayilvaganan M 1 and Hemalatha 2 1 Associate Professor, Department of Computer Science, PSG College of arts and science,
More informationIE 102 Spring Routing Through Networks - 1
IE 102 Spring 2017 Routing Through Networks - 1 The Bridges of Koenigsberg: Euler 1735 Graph Theory began in 1735 Leonard Eüler Visited Koenigsberg People wondered whether it is possible to take a walk,
More informationOFFICE OF RESEARCH AND SPONSORED PROGRAMS
OFFICE OF RESEARCH AND SPONSORED PROGRAMS June 9, 2016 Mr. Satoshi Harada Department of Innovation Research Japan Science and Technology Agency (JST) K s Gobancho, 7, Gobancho, Chiyoda-ku, Tokyo, 102-0076
More informationDetecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!
Detecting Superbubbles in Assembly Graphs Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)! de Bruijn Graph-based Assembly Reads (substrings of original DNA sequence) de Bruijn
More informationBioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity,
Bioinformatics: Fragment Assembly Walter Kosters, Universiteit Leiden IPA Algorithms&Complexity, 29.6.2007 www.liacs.nl/home/kosters/ 1 Fragment assembly Problem We study the following problem from bioinformatics:
More informationGrade 7/8 Math Circles Graph Theory - Solutions October 13/14, 2015
Faculty of Mathematics Waterloo, Ontario N2L 3G1 Centre for Education in Mathematics and Computing Grade 7/8 Math Circles Graph Theory - Solutions October 13/14, 2015 The Seven Bridges of Königsberg In
More informationSolutions Exercise Set 3 Author: Charmi Panchal
Solutions Exercise Set 3 Author: Charmi Panchal Problem 1: Suppose we have following fragments: f1 = ATCCTTAACCCC f2 = TTAACTCA f3 = TTAATACTCCC f4 = ATCTTTC f5 = CACTCCCACACA f6 = CACAATCCTTAACCC f7 =
More informationEuler and Hamilton paths. Jorge A. Cobb The University of Texas at Dallas
Euler and Hamilton paths Jorge A. Cobb The University of Texas at Dallas 1 Paths and the adjacency matrix The powers of the adjacency matrix A r (with normal, not boolean multiplication) contain the number
More informationCSE 549: Genome Assembly De Bruijn Graph. All slides in this lecture not marked with * courtesy of Ben Langmead.
CSE 549: Genome Assembly De Bruijn Graph All slides in this lecture not marked with * courtesy of Ben Langmead. Real-world assembly methods OLC: Overlap-Layout-Consensus assembly DBG: De Bruijn graph assembly
More informationParallel de novo Assembly of Complex (Meta) Genomes via HipMer
Parallel de novo Assembly of Complex (Meta) Genomes via HipMer Aydın Buluç Computational Research Division, LBNL May 23, 2016 Invited Talk at HiCOMB 2016 Outline and Acknowledgments Joint work (alphabetical)
More informationSequence Alignment 1
Sequence Alignment 1 Nucleotide and Base Pairs Purine: A and G Pyrimidine: T and C 2 DNA 3 For this course DNA is double-helical, with two complementary strands. Complementary bases: Adenine (A) - Thymine
More informationCrossing bridges. Crossing bridges Great Ideas in Theoretical Computer Science. Lecture 12: Graphs I: The Basics. Königsberg (Prussia)
15-251 Great Ideas in Theoretical Computer Science Lecture 12: Graphs I: The Basics February 22nd, 2018 Crossing bridges Königsberg (Prussia) Now Kaliningrad (Russia) Is there a way to walk through the
More informationEULERIAN GRAPHS AND ITS APPLICATIONS
EULERIAN GRAPHS AND ITS APPLICATIONS Aruna R 1, Madhu N.R 2 & Shashidhar S.N 3 1.2&3 Assistant Professor, Department of Mathematics. R.L.Jalappa Institute of Technology, Doddaballapur, B lore Rural Dist
More informationQuasiAlign: Position Sensitive P-Mer Frequency Clustering with Applications to Genomic Classification and Differentiation
QuasiAlign: Position Sensitive P-Mer Frequency Clustering with Applications to Genomic Classification and Differentiation Anurag Nagar Southern Methodist University Michael Hahsler Southern Methodist University
More informationIntroduction to Networks
LESSON 1 Introduction to Networks Exploratory Challenge 1 One classic math puzzle is the Seven Bridges of Königsberg problem which laid the foundation for networks and graph theory. In the 18th century
More information6.2. Paths and Cycles
6.2. PATHS AND CYCLES 85 6.2. Paths and Cycles 6.2.1. Paths. A path from v 0 to v n of length n is a sequence of n+1 vertices (v k ) and n edges (e k ) of the form v 0, e 1, v 1, e 2, v 2,..., e n, v n,
More informationMIT BLOSSOMS INITIATIVE. Taking Walks, Delivering Mail: An Introduction to Graph Theory Karima R. Nigmatulina MIT
MIT BLOSSOMS INITIATIVE Taking Walks, Delivering Mail: An Introduction to Graph Theory Karima R. Nigmatulina MIT Section 1 Hello and welcome everyone! My name is Karima Nigmatulina, and I am a doctoral
More informationMichał Kierzynka et al. Poznan University of Technology. 17 March 2015, San Jose
Michał Kierzynka et al. Poznan University of Technology 17 March 2015, San Jose The research has been supported by grant No. 2012/05/B/ST6/03026 from the National Science Centre, Poland. DNA de novo assembly
More informationFINDING THE RIGHT PATH
Task 1: Seven Bridges of Konigsberg! Today we are going to begin with the story of Konigsberg in the 18 th century, its geography, bridges, and the question asked by its citizens. FINDING THE RIGHT PATH
More information1. PURPOSE: to describe the standardized laboratory protocol for molecular subtyping of Salmonella enterica serotype Enteritidis.
1. PURPOSE: to describe the standardized laboratory protocol for molecular subtyping of Salmonella enterica serotype Enteritidis. 2. SCOPE: to provide the PulseNet participants with a single protocol for
More informationSequence Design Problems in Discovery of Regulatory Elements
Sequence Design Problems in Discovery of Regulatory Elements Yaron Orenstein, Bonnie Berger and Ron Shamir Regulatory Genomics workshop Simons Institute March 10th, 2016, Berkeley, CA Differentially methylated
More informationDescription of a genome assembler: CABOG
Theo Zimmermann Description of a genome assembler: CABOG CABOG (Celera Assembler with the Best Overlap Graph) is an assembler built upon the Celera Assembler, which, at first, was designed for Sanger sequencing,
More informationDiscrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7
CS 70 Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7 An Introduction to Graphs A few centuries ago, residents of the city of Königsberg, Prussia were interested in a certain problem.
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies November 17, 2012 1 Introduction Introduction 2 BLAST What is BLAST? The algorithm 3 Genome assembly De
More informationInstant Insanity Instructor s Guide Make-it and Take-it Kit for AMTNYS 2006
Instant Insanity Instructor s Guide Make-it and Take-it Kit for AMTNYS 2006 THE KIT: This kit contains materials for two Instant Insanity games, a student activity sheet with answer key and this instructor
More informationHow to Run NCBI BLAST on zcluster at GACRC
How to Run NCBI BLAST on zcluster at GACRC BLAST: Basic Local Alignment Search Tool Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu 1 OVERVIEW What is BLAST?
More informationMATH 113 Section 9.2: Topology
MATH 113 Section 9.2: Topology Prof. Jonathan Duncan Walla Walla College Winter Quarter, 2007 Outline 1 Introduction to Topology 2 Topology and Childrens Drawings 3 Networks 4 Conclusion Geometric Topology
More information