Genome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department

2 Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau Carnegie Mellon University Computational Biology Department

3 Eternity II: The Highest-Stakes Puzzle in History (Courtesy: Matej Baťha)

4 The Newspaper Problem

5 What is a Genome?

6 The Molecular Structure of DNA DNA's Double Helix (1953). DNA's Molecular Structure.

7 Nucleotides Nucleotide: Half of one rung of DNA. There are only four choices for the base of a nucleotide:
1. Adenine (A), bonds to T
2. Cytosine (C), bonds to G
3. Guanine (G), bonds to C
4. Thymine (T), bonds to A
The order of nucleotides determines genetics!

8 DNA Sequencing Genome: The nucleotide sequence read down one side of an organism's chromosomal DNA.
CCGTAGTCGCGAACAGTATACGAGACAGTACAGATACGATACGATACGATCATTAACCGAGAGTACCAGATTCCAGATCATACG
TTACGCTTAGCTACGGACGTACGATACCCAGATTACGATCCATATAGATATAACCGGTGTGTCTTGCTAATACGTAACGGGGTGCCT
TCGATAGGTCAGAATACCAGATCTCTCGATCTTCTTACAGATACTACGATCCCCAGATACTACCCCTACTGACCCATCGTACGGGTA
CTACTACGGATATACCGTAGAGGGATCCATATATCCCGAGACGTCTCGCGCATAAGATCATCGTCTAGATACACGTACGTA
CTAGACTAGCGTCCTCTTATCGTCCCGATCGAGTCGCGTGCTCAGAAAAGCTACGATACGATACCCGATACTAGACCATAG
A human genome has about 3 billion nucleotides. Sequencing a genome means reading all these letters. Key Point: DNA is submicroscopic! How do we read something that we cannot see?

9 Introduction to Genome Sequencing

10 All Humans Are Genetically Very Similar Any two humans share 99.9% of their genomes. The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc. Two nearly identical example sequences:
CTGGACTACGCTACTACTGCTAGCTGTATTACGA
TCAGCTACCACATCGTAGCTACGCATTAGCAAGCTAT
CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA
CTATACGAGCTACTACGTACGTACGATCGCGGGACTATTA
TCGACTACAGATAAAACCTAGTACAACAGTATACATA
GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

CTGGACTACGCTACTACTGCTAGCTGTATTACGA
TCAGCTACAACATCGTAGCTACGCATTAGCAAGCTAT
CGATCGATCGATCGATTATCTACGATCGATCGATCGATCA
CTATACGAGCTACTACGTACGTACGATCGCGTGACTATTA
TCGACTACAGAAACCTAGTACAACAGTATACATA
GCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT

12 Species Sequencing vs. Individual Sequencing Species Sequencing: What is a species' consensus genome?

13 Species Sequencing vs. Individual Sequencing Individual Sequencing: What makes an individual unique?

14 Why Would We Sequence a Species' Genome? (Figure credits: Hug et al., 2016. Courtesy: Nature Biotechnology, Discovery Magazine)

15 Why Would We Sequence an Individual's Genome? Personalized Medicine: Tailoring medical treatment to the individual based on their genome. 2010: First person whose life was saved because of genome sequencing.

16 Brief History of Genome Sequencing Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods. 1980: They share the Nobel Prize in Chemistry. Still, their sequencing methods were too expensive for large genomes: at a cost of $1 per nucleotide, it would cost $3 billion to sequence a human genome.

17 Brief History of Genome Sequencing 1990: The public Human Genome Project, later headed by Francis Collins, aims to sequence the human genome. 1998: Craig Venter founds Celera Genomics, a private firm, with the same goal.

18 Brief History of Genome Sequencing 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.

19 Brief History of Genome Sequencing 2000s: Many more mammalian genomes are sequenced.

20 The Arrival of Personal Genomics 2010s: The market for sequencing machines takes off. Illumina reduces the cost of sequencing an individual human genome from $3 billion to $1,000. The Beijing Genomics Institute orders hundreds of sequencing machines. Companies offer partial genome sequencing for ~$500. The UK funds genome sequencing for 100,000 citizens.

21 The Future of Genome Sequencing The Near Future (?): Hopefully, sequencing an individual genome will soon become as routine as an X-ray.

22 The Newspaper Problem and Genome Sequencing

23 Returning to The Newspaper Problem Given enough time, how might you solve this problem?

24 The Newspaper Problem as an Overlap Puzzle The newspaper problem is not the same as a jigsaw puzzle: We have multiple copies of the same edition of a newspaper. Plus, some pieces of paper got blown to bits in the explosion. Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said. This gives us a giant overlap puzzle.

25 What Makes Genome Sequencing So Difficult? When we read a book, we can read the entire book one letter at a time from beginning to end. However, modern sequencing machines can only read short pieces of DNA (~250 nucleotides long), called reads.

26 Sequencing a Genome: Illustration
1. Start with multiple identical copies of a genome.
2. Shatter the genome into reads.
3. Sequence the reads (Lab): AGAATATCA, TGAGAATAT, GAGAATATC
4. Assemble the genome using overlapping reads (Computational):
     AGAATATCA
    GAGAATATC
   TGAGAATAT
...TGAGAATATCA...
What does genome sequencing remind you of?
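The assembly step in the illustration can be sketched in a few lines of Python. This is only a minimal sketch, not how production assemblers work: the helper names `overlap` and `merge` are hypothetical, and it assumes the three reads are error-free and already listed in left-to-right order.

```python
def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge(left: str, right: str) -> str:
    """Glue `right` onto `left`, writing the shared letters only once."""
    return left + right[overlap(left, right):]


# The three reads from the illustration, in an assumed left-to-right order.
reads = ["TGAGAATAT", "GAGAATATC", "AGAATATCA"]

assembled = reads[0]
for read in reads[1:]:
    assembled = merge(assembled, read)

print(assembled)  # TGAGAATATCA
```

With real data the order of the reads is unknown, which is exactly what turns assembly into an overlap puzzle.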

27 Fragment Assembly = Overlap Puzzle!

28 Complications in Genome Assembly 1. DNA is double-stranded (and may consist of multiple chromosomes). 2. Reads have imperfect coverage of the underlying genome. 3. Sequencing machines are error-prone.
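To see why the first complication matters: a read may come from either strand, and the two strands are read in opposite directions, so the same stretch of DNA can appear as a read or as its reverse complement. The sketch below uses only the base-pairing rules (A with T, C with G); the function name is hypothetical.

```python
# Base-pairing rules: A bonds to T, C bonds to G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(read: str) -> str:
    """Return the sequence of the opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(read))


print(reverse_complement("AGAATATCA"))  # TGATATTCT
```

An assembler that ignores this would treat `AGAATATCA` and `TGATATTCT` as unrelated reads, even though they describe the same piece of the genome.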

29 Simplifying Assumptions for Genome Assembly 1. DNA is single-stranded (and consists of a single circular chromosome, like bacteria). 2. Reads have perfect coverage of the underlying genome. 3. Sequencing machines are error-free.

30 From Biological Data to a Computational Problem Aim: Construct a shortest circular genome containing all our reads. These reads have length 3, and the answer isn't obvious. How in the world would we solve this problem if we had a billion reads? What method would you use?
TTT TCT TTG TCG TTC TCC TTA TCA
GTT GCT GTG GCG GTC GCC GTA GCA
CTT CCT CTG CCG CTC CCC CTA CCA
ATT ACT ATG ACG ATC ACC ATA ACA
TGT TAT TGG TAG TGC TAC TGA TAA
GGT GAT GGG GAG GGC GAC GGA GAA
CGT CAT CGG CAG CGC CAC CGA CAA
AGT AAT AGG AAG AGC AAC AGA AAA

31 From Biological Data to a Computational Problem Idea: Look for reads having maximum overlap and build a circular string. Say we start with AAT. Which read should we pick next?

33 From Biological Data to a Computational Problem Idea: Look for reads having maximum overlap and build a circular string. Now we have a choice between TGC and TGG. AAT → ATG → TG?

34 From Biological Data to a Computational Problem Idea: Look for reads having maximum overlap and build a circular string. If we get lucky with TGC, again we have a choice between GCA and GCG. AAT → ATG → TGC → GC?
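The branching described above can be made concrete with a small sketch. It assumes the reads are the ten 3-mers of the toy example used later in the talk (ATG is included as an assumption, since the transcript appears to have dropped it); the function name `next_candidates` is hypothetical.

```python
# Assumed toy reads (3-mers); ATG is restored here as an assumption.
READS = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT", "CAA", "AAT"]


def next_candidates(current: str, unused: set[str]) -> list[str]:
    """Unused reads whose first k-1 letters match the last k-1 letters of `current`."""
    return sorted(read for read in unused if read[:-1] == current[1:])


unused = set(READS) - {"AAT"}
print(next_candidates("AAT", unused))  # ['ATG']

unused -= {"ATG"}
print(next_candidates("ATG", unused))  # ['TGC', 'TGG']

unused -= {"TGC"}
print(next_candidates("TGC", unused))  # ['GCA', 'GCG']
```

Every time the list has more than one entry, a greedy assembler has to guess, and a wrong guess may only be discovered many steps later.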

35 Repeats Cause Headaches in Genome Assembly 50% of the human genome is made up of repeats: strings that appear multiple times with minor variations. Analogy: The Triazzle contains lots of repeated figures, which makes it difficult to solve (even with just 16 pieces). Repeated information is what makes Eternity II so difficult to solve!

36 Taking a Walk

37 The Bridges of Königsberg The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks.

38 The Bridges of Königsberg They wondered if they could start at home, walk through the city, cross each bridge (blue) exactly once, and return home. Can you see a solution?

39 The Bridges of Königsberg 1735: Leonhard Euler develops an approach to answer this question for any city, even for a city with a million islands. We will soon see how Euler did this. Leonhard Euler

40 The Icosian Game Over a century passes. 1857: Irish mathematician William Hamilton designs the Icosian Game, consisting of a board representing 20 islands connected by bridges. Goal: find a walk that visits every island exactly once and returns to where it started. Can you see a solution?

41 These Two Stories Have Something in Common Find a walk that crosses every bridge once and returns home. Find a walk that visits every island once and returns home.

42 ... But How Do They Relate to Genome Sequencing? Recall the task: assemble the genome using overlapping reads.
     AGAATATCA
    GAGAATATC
   TGAGAATAT
...TGAGAATATCA...

43 Hamiltonian and Eulerian Cycles

44 Königsberg Bridges Network For the Königsberg Bridge Problem, we create a network: Nodes = 4 land masses of the city. Edges = 7 bridges connecting land masses. We are looking for an Eulerian cycle: a walk that crosses every edge exactly once and returns to its starting node. Now can you see a solution?

45 Hamiltonian Cycles A Hamiltonian cycle touches each node exactly once and returns to its starting node.

48 Hamiltonian Cycles A Hamiltonian cycle touches each node exactly once and returns to its starting node. Note: We didn't use every edge here.

49 Finding Eulerian and Hamiltonian Cycles Given a network G, we now have two questions that we can program a computer to answer about G. Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G does not have an Eulerian cycle. Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G does not have a Hamiltonian cycle.

51 Eulerian Cycles If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this network. However, no such cycle exists. Why? If we add two more edges, there will be such a cycle.
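One way to answer the "Why?" is to count the bridges touching each land mass. The sketch below relies on the classical criterion for undirected networks (a connected network has an Eulerian cycle only if every node touches an even number of edges), which the slides do not state at this point. The land-mass labels are hypothetical, and the two added bridges are just one possible choice, not necessarily the ones drawn on the slide.

```python
from collections import Counter

# The seven bridges, with hypothetical labels for the four land masses.
BRIDGES = [
    ("north", "island"), ("north", "island"),
    ("south", "island"), ("south", "island"),
    ("east", "island"),
    ("north", "east"), ("south", "east"),
]


def degrees(edges):
    """Count how many bridge ends touch each land mass."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return count


def all_even(edges):
    """True if every land mass touches an even number of bridges."""
    return all(d % 2 == 0 for d in degrees(edges).values())


print(dict(degrees(BRIDGES)))  # every land mass has an odd count
print(all_even(BRIDGES))       # False: no Eulerian cycle is possible

# One possible pair of extra bridges that makes every count even.
extended = BRIDGES + [("north", "island"), ("south", "east")]
print(all_even(extended))      # True
```

Each time a walk enters a land mass it must also leave it, so bridges are used up in pairs; an odd count always leaves one bridge stranded.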

52 Euler's Theorem

53 Directed Networks Directed Network: A network in which each edge has a direction (represented by an arrow). Think of directed edges as one-way bridges. (Figure: an undirected network beside a directed network.)

54 Eulerian Cycles in Directed Networks An Eulerian cycle in a directed network must travel down all the edges in the correct direction. Does this network have an Eulerian cycle?

55 Balanced Networks We can label each node with the number of edges in and the number of edges out. A network is balanced if at each node, the number of incoming edges equals the number of outgoing edges. This network isn't balanced. However, adding some edges makes it balanced. (Figure: after the added edges, every node is labeled (1, 1) or (2, 2).)

56 Euler's Theorem Euler's Theorem: A directed network contains an Eulerian cycle exactly when the network is balanced and connected. (Figure: a balanced network that is not connected has no Eulerian cycle. Connected + Balanced = Eulerian.)

57 Euler's Theorem Euler's Theorem: A directed network contains an Eulerian cycle exactly when the network is balanced and connected. Key point: It is also possible to program a computer to quickly find an Eulerian cycle in a balanced, connected directed network... even one with a billion edges!
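The test in Euler's Theorem is simple enough to sketch directly. The code below is a minimal illustration, assuming the network is given as a list of directed `(from, to)` edges; the function names are hypothetical. Connectivity is checked while ignoring edge directions, which is sufficient once the network is known to be balanced.

```python
from collections import defaultdict


def is_balanced(edges):
    """True if every node has as many incoming edges as outgoing edges."""
    indegree, outdegree = defaultdict(int), defaultdict(int)
    for v, w in edges:
        outdegree[v] += 1
        indegree[w] += 1
    nodes = set(indegree) | set(outdegree)
    return all(indegree[n] == outdegree[n] for n in nodes)


def is_connected(edges):
    """True if all nodes can be reached from one another, ignoring direction."""
    neighbors = defaultdict(set)
    for v, w in edges:
        neighbors[v].add(w)
        neighbors[w].add(v)
    if not neighbors:
        return True
    start = next(iter(neighbors))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == len(neighbors)


def has_eulerian_cycle(edges):
    """Euler's test: balanced and connected."""
    return is_balanced(edges) and is_connected(edges)


print(has_eulerian_cycle([("a", "b"), ("b", "c"), ("c", "a")]))              # True
print(has_eulerian_cycle([("a", "b"), ("b", "a"), ("c", "d"), ("d", "c")]))  # False
```

Both checks look at each edge only a constant number of times, which is why the test stays fast even for very large networks.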

58 Difficult Computational Problems

59 So What's the Big Deal? I thought computers were supermachines! (Image: Universal Pictures)

60 So What's the Big Deal? Computers don't need 300-year-old math to solve problems.

61 So What's the Big Deal? Aren't computers going to take over the world anyway? (Image: MGM)

62 So What's the Big Deal? So let's examine the case of finding a Hamiltonian cycle.

63 Searching for an Efficient Solution for the HCP Key Point: No one has ever found a similarly efficient test for whether a network has a Hamiltonian cycle. Of course, we could examine every possible walk through the network to solve the HCP. However, this brute force approach is infeasible: there are more walks through the average network with just 1,000 nodes than there are atoms in the universe!
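The brute force approach can be sketched as follows. This is an illustration only: the function name is hypothetical, the network is assumed to be a list of directed edges, and the count at the end assumes the worst case in which every ordering of the nodes has to be tried, compared against the commonly quoted rough estimate of 10^80 atoms in the observable universe.

```python
from itertools import permutations
from math import factorial


def hamiltonian_cycle(nodes, edges):
    """Try every ordering of the nodes; return the first one that forms a cycle."""
    edge_set = set(edges)
    first, *rest = nodes
    for order in permutations(rest):
        cycle = [first, *order]
        steps = zip(cycle, cycle[1:] + [first])
        if all(step in edge_set for step in steps):
            return cycle
    return None


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
print(hamiltonian_cycle(nodes, edges))  # ['a', 'b', 'c', 'd']

# With 1,000 nodes there are 999! orderings to try in the worst case.
print(factorial(999) > 10**80)  # True
```

The search is correct, but the number of orderings grows factorially, so it stops being usable long before the network reaches the size of a real sequencing data set.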

64 Searching for an Efficient Solution for the HCP Attempting to solve the HCP is difficult. "I can't find an efficient algorithm, I guess I'm just too dumb." (From Garey and Johnson, Computers and Intractability, 1979.)

65 Searching for an Efficient Solution for the HCP Attempting to solve the HCP is difficult. The hope is that you could verify that you failed because an efficient method solving the HCP doesn't exist. "I can't find an efficient algorithm, because no such algorithm is possible." (From Garey and Johnson, Computers and Intractability, 1979.)

66 Searching for an Efficient Solution for the HCP Attempting to solve the HCP is difficult. The hope is that you could verify that you failed because an efficient method solving the HCP doesn't exist. The present state of affairs is somewhere in between. "I can't find an efficient algorithm, but neither can all these smart people." (From Garey and Johnson, Computers and Intractability, 1979.)

67 High-Stakes Mathematics The question of whether the HCP can be solved efficiently is one of seven Millennium Problems (Clay Institute). Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will win $1 million. Only one of the seven problems has been solved, by Grigory Perelman, who refused the prize.

68 From Euler and Hamilton to Assembling Genomes

69 Returning to our Toy Example Goal: Use overlapping DNA reads in order to reconstruct the original genome. Reads: ATG GTG GCG TGG TGC GGC GCA CGT CAA AAT. Idea: Let's construct a network that represents the overlap information in our reads.

70 Constructing a Network from Reads Create a node for every read: ATG, GTG, GCG, GCA, TGG, TGC, GGC, CGT, CAA, AAT.

72 Constructing a Network from Reads Prefix: the first 2 nucleotides of a read (CA in CAA). Suffix: the last 2 nucleotides of a read (AA in CAA). Different 3-mers may share a prefix/suffix: TG is the suffix of ATG and CTG and the prefix of TGA.

73 Constructing a Network from Reads As for the edges, connect node v to node w with a directed edge if the suffix of v matches the prefix of w. CGT GGC AAT GTG TGG TGC CAA GCA GCG
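The rule for drawing edges translates almost word for word into code. The sketch below assumes the ten toy reads (ATG is included as an assumption, since the transcript appears to have dropped it) and stores the network as a dictionary from each read to the reads it points to; the function name is hypothetical.

```python
# Assumed toy reads (3-mers); ATG is restored here as an assumption.
READS = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT", "CAA", "AAT"]


def overlap_network(reads):
    """Map each read v to every read w whose prefix equals the suffix of v."""
    return {
        v: sorted(w for w in reads if v[1:] == w[:-1])
        for v in reads
    }


network = overlap_network(READS)
for read, successors in network.items():
    print(read, "->", ", ".join(successors))
```

In this network every read is a node, so reconstructing the genome means visiting every node exactly once, which is a Hamiltonian cycle.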

84 A Hamiltonian Cycle Recovers the Genome Here we have a Hamiltonian cycle:
ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT → (back to ATG)
Overlapping consecutive reads spells out the genome:
ATG
 TGG
  GGC
   GCG
    CGT
     GTG
      TGC
       GCA
        CAA
         AAT
ATGGCGTGCAAT
The final AT overlaps the starting AT, so the circular genome is ATGGCGTGCA.
What is wrong with this approach?
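Turning a cycle of reads back into a genome can be sketched as follows. The function name is hypothetical, and the cycle is the one walked through above, with ATG included as an assumption because the transcript appears to have dropped it. Since each read overlaps the next in all but one letter, the first letter of each read is enough to spell the circular genome.

```python
def spell_circular_genome(cycle):
    """Spell the circular genome traced by a cycle of overlapping reads."""
    for v, w in zip(cycle, cycle[1:] + cycle[:1]):
        if v[1:] != w[:-1]:
            raise ValueError(f"{v} does not overlap {w}")
    # Each read contributes exactly one new letter to the circle.
    return "".join(read[0] for read in cycle)


cycle = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
print(spell_circular_genome(cycle))  # ATGGCGTGCA
```

Spelling the genome is the easy part; the expensive part is finding the cycle in the first place.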

113 The Problem with Our Network Ultimately, we must solve the HCP on our network (millions of nodes) in order to obtain a candidate genome, and no one knows an efficient way to do that.

114 Second Try: Assign Reads to Edges Form a different network as follows:
1. Create a node for each distinct prefix/suffix from reads.
2. Connect node v to node w with a directed edge if there is a read whose prefix is v and whose suffix is w.
Reads: ATG GTG GCG GCA TGG TGC GGC CGT CAA AAT
Nodes: GT, TG, GC, CG, CA, AT, GG, AA
Edges (one per read):
AT → TG (ATG)
GT → TG (GTG)
GC → CG (GCG)
GC → CA (GCA)
TG → GG (TGG)
TG → GC (TGC)
GG → GC (GGC)
CG → GT (CGT)
CA → AA (CAA)
AA → AT (AAT)
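The two construction steps can be sketched in Python. This is a minimal illustration under the same assumption as before (the ten toy reads, with ATG restored); the function name is hypothetical. The network is stored as a dictionary from each node to the list of nodes its edges point to, so every read becomes one entry in one list.

```python
from collections import defaultdict

# Assumed toy reads (3-mers); ATG is restored here as an assumption.
READS = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT", "CAA", "AAT"]


def reads_to_edges(reads):
    """Build the network in which each read is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        prefix, suffix = read[:-1], read[1:]
        graph[prefix].append(suffix)
    return dict(graph)


graph = reads_to_edges(READS)
for node, targets in graph.items():
    print(node, "->", ", ".join(targets))
```

Building this network takes a single pass over the reads, whereas the overlap network compares reads against each other.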

148 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network. It traverses the ten edges in this order: 1. ATG (AT → TG), 2. TGG (TG → GG), 3. GGC (GG → GC), 4. GCG (GC → CG), 5. CGT (CG → GT), 6. GTG (GT → TG), 7. TGC (TG → GC), 8. GCA (GC → CA), 9. CAA (CA → AA), 10. AAT (AA → AT). Read off in order, the cycle is ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT.
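The slides do not say how the cycle is found. One standard choice is Hierholzer's algorithm, sketched below as an assumption rather than as the speaker's method. The `GRAPH` literal restates the example network, with the ATG edge from AT to TG included. This network has more than one Eulerian cycle, so the order in which outgoing edges are listed decides which one is returned. With the order used here, the traversal reproduces the cycle on the slide.

```python
def eulerian_cycle(graph, start):
    """Hierholzer's algorithm on a dict mapping node -> list of successor nodes."""
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Follow the first unused edge leaving this node.
            stack.append(remaining[node].pop(0))
        else:
            # No unused edges left here: the node is finished.
            cycle.append(stack.pop())
    return cycle[::-1]

# Example network (nodes are 2-letter prefixes/suffixes of the reads).
GRAPH = {
    "AT": ["TG"], "TG": ["GG", "GC"], "GG": ["GC"], "GC": ["CG", "CA"],
    "CG": ["GT"], "GT": ["TG"], "CA": ["AA"], "AA": ["AT"],
}

cycle = eulerian_cycle(GRAPH, "AT")
# Each consecutive pair of nodes spells one read: prefix plus last letter of suffix.
reads = [a + b[-1] for a, b in zip(cycle, cycle[1:])]
print(" -> ".join(reads))
# ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA -> CAA -> AAT
```

The running time is proportional to the number of edges, which is why the Eulerian formulation scales to large read sets.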

160 An Eulerian Cycle Recovers the Genome We have an Eulerian cycle in this network: ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT. This is the same sequence of reads that we found before! Thus, we will obtain the same sequenced genome as before. The only difference: a computer can find an Eulerian cycle quickly. Genome (circular): ATGGCGTGCA
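Turning the ordered reads back into the genome is mechanical: consecutive reads overlap in all but one letter, so each read contributes one new letter. The sketch below assumes the ten reads of the example, including ATG, and treats the genome as circular. The helper name `spell_circular_genome` is hypothetical.

```python
def spell_circular_genome(reads_in_order):
    """Spell a circular genome from reads listed in Eulerian-cycle order."""
    following = reads_in_order[1:] + reads_in_order[:1]  # wrap around to the start
    for left, right in zip(reads_in_order, following):
        # The suffix of each read must equal the prefix of the next one.
        if left[1:] != right[:-1]:
            raise ValueError(f"{left} does not overlap {right}")
    # Each read contributes its first letter; the overlaps cover the rest.
    return "".join(read[0] for read in reads_in_order)

CYCLE = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
print(spell_circular_genome(CYCLE))  # ATGGCGTGCA
```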

164 Dealing with Practical Complications

165 Complications in Genome Assembly 1. DNA is double-stranded (and may consist of multiple chromosomes). 2. Reads have imperfect coverage of the underlying genome. 3. Sequencing machines are error-prone.

166 Complication #1: Assembling Double-Stranded DNA DNA strands run in opposite directions: one strand reads ...atggcaatacgacagtcagcggacagacgttac... and the other, read in its own direction, is ...gtaacgtctgtccgctgactgtcgtattgccat... Given a read, we need to search the genome for both the read and its reverse complement. Read: GTCGTATTGC. Reverse complement: GCAATACGAC.
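The reverse-complement search can be sketched as follows. The helper names and the `GENOME` fragment (a stretch of one strand from the slide, upper-cased) are illustrative assumptions. A real assembler would apply the same idea to every read rather than searching a known genome.

```python
# Translation table pairing each base with its complement (A-T, C-G).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Complement every base, then reverse, to read the opposite strand."""
    return dna.translate(COMPLEMENT)[::-1]

def find_read(read, genome):
    """Return (strand, position) of the first match, or None if neither strand matches.

    '+' means the read itself matches; '-' means its reverse complement matches.
    The position is always measured along the given genome string.
    """
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        position = genome.find(query)
        if position != -1:
            return strand, position
    return None

GENOME = "GCAATACGACAGTCAGCGGACAGACGTTAC"  # fragment of one strand from the slide
print(reverse_complement("GTCGTATTGC"))    # GCAATACGAC
print(find_read("GTCGTATTGC", GENOME))     # ('-', 0)
```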

168 Complication #2: Imperfect Coverage In practice, we can only reconstruct the genome in a number of pieces called contigs.

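169 Complication #3: Errors in Reads Cause Bubbles Here the edges are 5-letter reads and the nodes are their 4-letter prefixes/suffixes. One path runs ATGCC → TGCCG → GCCGT → CCGTA → CGTAT → GTATG → TATGG → ATGGA → TGGAC → GGACA, through nodes ATGC, TGCC, GCCG, CCGT, CGTA, GTAT, TATG, ATGG, TGGA, GGAC, GACA. A single-nucleotide difference between reads (T versus C, caused by a sequencing error) adds a parallel path GCCGC → CCGCA → CGCAT → GCATG → CATGG through nodes CCGC, CGCA, GCAT, CATG. The two paths split at GCCG and rejoin at ATGG. Bubble!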
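A small experiment shows how one substituted letter produces the bubble. The two sequences below are assumptions reconstructed from the k-mers on the slide: they differ at a single position (T versus C), and the slide does not say which version is the erroneous one. The graph uses 5-letter edges and 4-letter nodes, as on the slide.

```python
from collections import defaultdict

def kmers(sequence, k):
    """All length-k substrings of the sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def build_graph(sequences, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one prefix -> suffix edge."""
    graph = defaultdict(set)
    for sequence in sequences:
        for kmer in kmers(sequence, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

SEQ_T = "ATGCCGTATGGACA"  # version with T at the variable position
SEQ_C = "ATGCCGCATGGACA"  # same sequence with C at that position

graph = build_graph([SEQ_T, SEQ_C], k=5)

# A bubble opens where a node has two successors...
branch_points = sorted(node for node, succ in graph.items() if len(succ) > 1)

# ...and closes where a node has two predecessors.
indegree = defaultdict(int)
for successors in graph.values():
    for node in successors:
        indegree[node] += 1
merge_points = sorted(node for node, count in indegree.items() if count > 1)

print("bubble opens at:", branch_points)   # ['GCCG']
print("bubble closes at:", merge_points)   # ['ATGG']
```

A single substitution changes k consecutive k-mers, so the detour has as many edges as the read length used for the graph (five here).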

170 Thank You! Questions?

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications

More information

DNA Sequencing. Overview

DNA Sequencing. Overview BINF 3350, Genomics and Bioinformatics DNA Sequencing Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Eulerian Cycles Problem Hamiltonian Cycles

More information

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds February 27, 2009 Alexander Mastroianni, Shelley Claridge, A. Paul Alivisatos Department of Chemistry, University of California,

More information

by the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression.

by the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression. Figure S1. Tissue-specific expression profile of the genes that were screened through the RHEPatmatch and root-specific microarray filters. The gene expression profile (heat map) was drawn by the Genevestigator

More information

HP22.1 Roth Random Primer Kit A für die RAPD-PCR

HP22.1 Roth Random Primer Kit A für die RAPD-PCR HP22.1 Roth Random Kit A für die RAPD-PCR Kit besteht aus 20 Einzelprimern, jeweils aufgeteilt auf 2 Reaktionsgefäße zu je 1,0 OD Achtung: Angaben beziehen sich jeweils auf ein Reaktionsgefäß! Sequenz

More information

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3

6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3 6.1 Transgene Su(var)3-9-n P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,II 5 I,II,III,IV 3 6 7 I,II,II 8 I,II,II 10 I,II 3 P{GS.ry + UAS(Su(var)3-9)EGFP} A AII 3 B P{GS.ry + (10.5kbSu(var)3-9EGFP)}

More information

Appendix A. Example code output. Chapter 1. Chapter 3

Appendix A. Example code output. Chapter 1. Chapter 3 Appendix A Example code output This is a compilation of output from selected examples. Some of these examples requires exernal input from e.g. STDIN, for such examples the interaction with the program

More information

TCGR: A Novel DNA/RNA Visualization Technique

TCGR: A Novel DNA/RNA Visualization Technique TCGR: A Novel DNA/RNA Visualization Technique Donya Quick and Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 dquick@mail.smu.edu, mhd@engr.smu.edu

More information

warm-up exercise Representing Data Digitally goals for today proteins example from nature

warm-up exercise Representing Data Digitally goals for today proteins example from nature Representing Data Digitally Anne Condon September 6, 007 warm-up exercise pick two examples of in your everyday life* in what media are the is represented? is the converted from one representation to another,

More information

Supplementary Table 1. Data collection and refinement statistics

Supplementary Table 1. Data collection and refinement statistics Supplementary Table 1. Data collection and refinement statistics APY-EphA4 APY-βAla8.am-EphA4 Crystal Space group P2 1 P2 1 Cell dimensions a, b, c (Å) 36.27, 127.7, 84.57 37.22, 127.2, 84.6 α, β, γ (

More information

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization

DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization Eulerian & Hamiltonian Cycle Problems DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization The Bridge Obsession Problem Find a tour crossing every bridge just

More information

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells

SUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells SUPPLEMENTARY INFORMATION Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells Yuanming Wang 1,2,7, Kaiwen Ivy Liu 2,7, Norfala-Aliah Binte Sutrisnoh

More information

10/15/2009 Comp 590/Comp Fall

10/15/2009 Comp 590/Comp Fall Lecture 13: Graph Algorithms Study Chapter 8.1 8.8 10/15/2009 Comp 590/Comp 790-90 Fall 2009 1 The Bridge Obsession Problem Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg

More information

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics Computational Biology IST Ana Teresa Freitas 2011/2012 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics (BACs) 1 Must take the fragments

More information

Graph Algorithms in Bioinformatics

Graph Algorithms in Bioinformatics Graph Algorithms in Bioinformatics Computational Biology IST Ana Teresa Freitas 2015/2016 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics

More information

Sequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes

Sequence Assembly. BMI/CS 576  Mark Craven Some sequencing successes Sequence Assembly BMI/CS 576 www.biostat.wisc.edu/bmi576/ Mark Craven craven@biostat.wisc.edu Some sequencing successes Yersinia pestis Cannabis sativa The sequencing problem We want to determine the identity

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics Lecture 3: Graph Algorithms

More information

Genome Sequencing Algorithms

Genome Sequencing Algorithms Genome Sequencing Algorithms Phillip Compaeu and Pavel Pevzner Bioinformatics Algorithms: an Active Learning Approach Leonhard Euler (1707 1783) William Hamilton (1805 1865) Nicolaas Govert de Bruijn (1918

More information

10/8/13 Comp 555 Fall

10/8/13 Comp 555 Fall 10/8/13 Comp 555 Fall 2013 1 Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg 10/8/13 Comp 555 Fall 2013 2 Find a cycle that visits every edge exactly once Linear

More information

Algorithms for Bioinformatics

Algorithms for Bioinformatics Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 58670 Algorithms for Bioinformatics Lecture 5: Graph Algorithms

More information

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly

CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly Ben Raphael Sept. 22, 2009 http://cs.brown.edu/courses/csci2950-c/ l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset

More information

Genome 373: Genome Assembly. Doug Fowler

Genome 373: Genome Assembly. Doug Fowler Genome 373: Genome Assembly Doug Fowler What are some of the things we ve seen we can do with HTS data? We ve seen that HTS can enable a wide variety of analyses ranging from ID ing variants to genome-

More information

Digging into acceptor splice site prediction: an iterative feature selection approach

Digging into acceptor splice site prediction: an iterative feature selection approach Digging into acceptor splice site prediction: an iterative feature selection approach Yvan Saeys, Sven Degroeve, and Yves Van de Peer Department of Plant Systems Biology, Ghent University, Flanders Interuniversity

More information

Sequence Assembly Required!

Sequence Assembly Required! Sequence Assembly Required! 1 October 3, ISMB 20172007 1 Sequence Assembly Genome Sequenced Fragments (reads) Assembled Contigs Finished Genome 2 Greedy solution is bounded 3 Typical assembly strategy

More information

Machine Learning Classifiers

Machine Learning Classifiers Machine Learning Classifiers Outline Different types of learning problems Different types of learning algorithms Supervised learning Decision trees Naïve Bayes Perceptrons, Multi-layer Neural Networks

More information

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015

Read Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015 Mapping de Novo Assembly Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin Genomics: Lecture #2 WS 2014/2015 Today Genome assembly: the basics Hamiltonian and Eulerian

More information

2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA

2 41L Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR C K65 Tag- A AAT CCA TAC AAT ACT CCA GTA TTT GCY ATA AAG AA 176 SUPPLEMENTAL TABLES 177 Table S1. ASPE Primers for HIV-1 group M subtype B Primer no Type a Sequence (5'-3') Tag ID b Position c 1 M41 Tag- AA GAA AAA ATA AAA GCA TTA RYA GAA ATT TGT RMW GAR A d 45

More information

DNA Fragment Assembly

DNA Fragment Assembly Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly Overlap

More information

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis

de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics 27626 - Next Generation Sequencing Analysis Generalized NGS analysis Data size Application Assembly: Compare

More information

Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside

Efficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside Efficient Selection of Unique and Popular Oligos for Large EST Databases Stefano Lonardi University of California, Riverside joint work with Jie Zheng, Timothy Close, Tao Jiang University of California,

More information

Purpose of sequence assembly

Purpose of sequence assembly Sequence Assembly Purpose of sequence assembly Reconstruct long DNA/RNA sequences from short sequence reads Genome sequencing RNA sequencing for gene discovery Amplicon sequencing But not for transcript

More information

A relation between trinucleotide comma-free codes and trinucleotide circular codes

A relation between trinucleotide comma-free codes and trinucleotide circular codes Theoretical Computer Science 401 (2008) 17 26 www.elsevier.com/locate/tcs A relation between trinucleotide comma-free codes and trinucleotide circular codes Christian J. Michel a,, Giuseppe Pirillo b,c,

More information

Eulerian Tours and Fleury s Algorithm

Eulerian Tours and Fleury s Algorithm Eulerian Tours and Fleury s Algorithm CSE21 Winter 2017, Day 12 (B00), Day 8 (A00) February 8, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Vocabulary Path (or walk): describes a route from one vertex

More information

Graph Algorithms in Bioinformatics

Graph Algorithms in Bioinformatics Graph Algorithms in Bioinformatics Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 13 Lopresti Fall 2007 Lecture 13-1 - Outline Introduction to graph theory Eulerian & Hamiltonian Cycle

More information

MLiB - Mandatory Project 2. Gene finding using HMMs

MLiB - Mandatory Project 2. Gene finding using HMMs MLiB - Mandatory Project 2 Gene finding using HMMs Viterbi decoding >NC_002737.1 Streptococcus pyogenes M1 GAS TTGTTGATATTCTGTTTTTTCTTTTTTAGTTTTCCACATGAAAAATAGTTGAAAACAATA GCGGTGTCCCCTTAAAATGGCTTTTCCACAGGTTGTGGAGAACCCAAATTAACAGTGTTA

More information

Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame

Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame Crick s Hypothesis Revisited: The Existence of a Universal Coding Frame Jean-Louis Lassez*, Ryan A. Rossi Computer Science Department, Coastal Carolina University jlassez@coastal.edu, raross@coastal.edu

More information

Eulerian tours. Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck. April 20, 2016

Eulerian tours. Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck.  April 20, 2016 Eulerian tours Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck http://cseweb.ucsd.edu/classes/sp16/cse21-bd/ April 20, 2016 Seven Bridges of Konigsberg Is there a path that crosses each

More information

Supplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot)

Supplementary Data. Image Processing Workflow Diagram A - Preprocessing. B - Hough Transform. C - Angle Histogram (Rose Plot) Supplementary Data Image Processing Workflow Diagram A - Preprocessing B - Hough Transform C - Angle Histogram (Rose Plot) D - Determination of holes Description of Image Processing Workflow The key steps

More information

Graphs and Puzzles. Eulerian and Hamiltonian Tours.

Graphs and Puzzles. Eulerian and Hamiltonian Tours. Graphs and Puzzles. Eulerian and Hamiltonian Tours. CSE21 Winter 2017, Day 11 (B00), Day 7 (A00) February 3, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Exam Announcements Seating Chart on Website Good

More information

LABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI

LABORATORY STANDARD OPERATING PROCEDURE FOR PULSENET CODE: PNL28 MLVA OF SHIGA TOXIN-PRODUCING ESCHERICHIA COLI 1. PURPOSE: to describe the standardized laboratory protocol for molecular subtyping of Shiga toxin-producing Escherichia coli O157 (STEC O157) and Salmonella enterica serotypes Typhimurium and Enteritidis.

More information

Supplementary Materials:

Supplementary Materials: Supplementary Materials: Amino acid codo n Numb er Table S1. Codon usage in all the protein coding genes. RSC U Proportion (%) Amino acid codo n Numb er RSC U Proportion (%) Phe UUU 861 1.31 5.71 Ser UCU

More information

Degenerate Coding and Sequence Compacting

Degenerate Coding and Sequence Compacting ESI The Erwin Schrödinger International Boltzmanngasse 9 Institute for Mathematical Physics A-1090 Wien, Austria Degenerate Coding and Sequence Compacting Maya Gorel Kirzhner V.M. Vienna, Preprint ESI

More information

RESEARCH TOPIC IN BIOINFORMANTIC

RESEARCH TOPIC IN BIOINFORMANTIC RESEARCH TOPIC IN BIOINFORMANTIC GENOME ASSEMBLY Instructor: Dr. Yufeng Wu Noted by: February 25, 2012 Genome Assembly is a kind of string sequencing problems. As we all know, the human genome is very

More information

de Bruijn graphs for sequencing data

de Bruijn graphs for sequencing data de Bruijn graphs for sequencing data Rayan Chikhi CNRS Bonsai team, CRIStAL/INRIA, Univ. Lille 1 SMPGD 2016 1 MOTIVATION - de Bruijn graphs are instrumental for reference-free sequencing data analysis:

More information

Scalable Solutions for DNA Sequence Analysis

Scalable Solutions for DNA Sequence Analysis Scalable Solutions for DNA Sequence Analysis Michael Schatz Dec 4, 2009 JHU/UMD Joint Sequencing Meeting The Evolution of DNA Sequencing Year Genome Technology Cost 2001 Venter et al. Sanger (ABI) $300,000,000

More information

DNA Fragment Assembly

DNA Fragment Assembly SIGCSE 009 Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly

More information

Supporting Information

Supporting Information Copyright WILEY VCH Verlag GmbH & Co. KGaA, 69469 Weinheim, Germany, 2015. Supporting Information for Small, DOI: 10.1002/smll.201501370 A Compact DNA Cube with Side Length 10 nm Max B. Scheible, Luvena

More information

Walking with Euler through Ostpreußen and RNA

Walking with Euler through Ostpreußen and RNA Walking with Euler through Ostpreußen and RNA Mark Muldoon February 4, 2010 Königsberg (1652) Kaliningrad (2007)? The Königsberg Bridge problem asks whether it is possible to walk around the old city in

More information

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data

Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data 1/39 Computational Methods for de novo Assembly of Next-Generation Genome Sequencing Data Rayan Chikhi ENS Cachan Brittany / IRISA (Genscale team) Advisor : Dominique Lavenier 2/39 INTRODUCTION, YEAR 2000

More information

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA

debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA debgr: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph Prashant Pandey Stony Brook University, NY, USA De Bruijn graphs are ubiquitous [Pevzner et al. 2001, Zerbino and Birney,

More information

of Nebraska - Lincoln

of Nebraska - Lincoln University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln MAT Exam Expository Papers Math in the Middle Institute Partnership 7-2008 De Bruijn Cycles Val Adams University of Nebraska

More information

DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats

DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2008 DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats Ching Li San Jose State University

More information

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2014 Genome assembly Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents Genome assembly problem Approaches Comparative assembly The string

More information

02-711/ Computational Genomics and Molecular Biology Fall 2016

02-711/ Computational Genomics and Molecular Biology Fall 2016 Literature assignment 2 Due: Nov. 3 rd, 2016 at 4:00pm Your name: Article: Phillip E C Compeau, Pavel A. Pevzner, Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29,

More information

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example.

Problem statement. CS267 Assignment 3: Parallelize Graph Algorithms for de Novo Genome Assembly. Spring Example. CS267 Assignment 3: Problem statement 2 Parallelize Graph Algorithms for de Novo Genome Assembly k-mers are sequences of length k (alphabet is A/C/G/T). An extension is a simple symbol (A/C/G/T/F). The

More information

Worksheet 28: Wednesday November 18 Euler and Topology

Worksheet 28: Wednesday November 18 Euler and Topology Worksheet 28: Wednesday November 18 Euler and Topology The Konigsberg Problem: The Foundation of Topology The Konigsberg Bridge Problem is a very famous problem solved by Euler in 1735. The process he

More information

Assembly in the Clouds

Assembly in the Clouds Assembly in the Clouds Michael Schatz October 13, 2010 Beyond the Genome Shredded Book Reconstruction Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools

More information

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73

de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January /73 1/73 de novo assembly Rayan Chikhi Pennsylvania State University Workshop On Genomics - Cesky Krumlov - January 2014 2/73 YOUR INSTRUCTOR IS.. - Postdoc at Penn State, USA - PhD at INRIA / ENS Cachan,

More information

Introduction to Genome Assembly. Tandy Warnow

Introduction to Genome Assembly. Tandy Warnow Introduction to Genome Assembly Tandy Warnow 2 Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp 10,000bp Not all sequencing technologies produce mate-pairs. Different

More information

A Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving on DNA-encoded Data

A Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving on DNA-encoded Data International Journal of Electrical and Computer Engineering (IJECE) Vol. 4, No. 1, Feburary 2014, pp. 93~100 ISSN: 2088-8708 93 A Novel Implementation of an Extended 8x8 Playfair Cipher Using Interweaving

More information

EECS 203 Lecture 20. More Graphs

EECS 203 Lecture 20. More Graphs EECS 203 Lecture 20 More Graphs Admin stuffs Last homework due today Office hour changes starting Friday (also in Piazza) Friday 6/17: 2-5 Mark in his office. Sunday 6/19: 2-5 Jasmine in the UGLI. Monday

More information

Intermediate Math Circles Wednesday, February 22, 2017 Graph Theory III

Intermediate Math Circles Wednesday, February 22, 2017 Graph Theory III 1 Eulerian Graphs Intermediate Math Circles Wednesday, February 22, 2017 Graph Theory III Let s begin this section with a problem that you may remember from lecture 1. Consider the layout of land and water

More information

How to apply de Bruijn graphs to genome assembly

How to apply de Bruijn graphs to genome assembly PRIMER How to apply de Bruijn graphs to genome assembly Phillip E C Compeau, Pavel A Pevzner & lenn Tesler A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling

More information

Structural analysis and haplotype diversity in swine LEP and MC4R genes

Structural analysis and haplotype diversity in swine LEP and MC4R genes J. Anim. Breed. Genet. ISSN - OIGINAL ATICLE Structural analysis and haplotype diversity in swine LEP and MC genes M. D Andrea, F. Pilla, E. Giuffra, D. Waddington & A.L. Archibald University of Molise,

More information

Bioinformatics-themed projects in Discrete Mathematics

Bioinformatics-themed projects in Discrete Mathematics Bioinformatics-themed projects in Discrete Mathematics Art Duval University of Texas at El Paso Joint Mathematics Meeting MAA Contributed Paper Session on Discrete Mathematics in the Undergraduate Curriculum

More information

Programming Applications. What is Computer Programming?

Programming Applications. What is Computer Programming? Programming Applications What is Computer Programming? An algorithm is a series of steps for solving a problem A programming language is a way to express our algorithm to a computer Programming is the

More information

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.

Sequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems. Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD

More information

#30: Graph theory May 25, 2009

#30: Graph theory May 25, 2009 #30: Graph theory May 25, 2009 Graph theory is the study of graphs. But not the kind of graphs you are used to, like a graph of y = x 2 graph theory graphs are completely different from graphs of functions.

More information

BMI/CS 576 Fall 2015 Midterm Exam

BMI/CS 576 Fall 2015 Midterm Exam BMI/CS 576 Fall 2015 Midterm Exam Prof. Colin Dewey Tuesday, October 27th, 2015 11:00am-12:15pm Name: KEY Write your answers on these pages and show your work. You may use the back sides of pages as necessary.

More information

Genome Sequencing & Assembly. Slides by Carl Kingsford

Genome Sequencing & Assembly. Slides by Carl Kingsford Genome Sequencing & Assembly Slides by Carl Kingsford Genome Sequencing ACCGTCCAATTGG...! TGGCAGGTTAACC... E.g. human: 3 billion bases split into 23 chromosomes Main tool of traditional sequencing: DNA

More information

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14

7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14 7.36/7.91 recitation DG Lectures 5 & 6 2/26/14 1 Announcements project specific aims due in a little more than a week (March 7) Pset #2 due March 13, start early! Today: library complexity BWT and read

More information

CHAPTER 10 GRAPHS AND TREES. Alessandro Artale UniBZ - artale/z

CHAPTER 10 GRAPHS AND TREES. Alessandro Artale UniBZ -  artale/z CHAPTER 10 GRAPHS AND TREES Alessandro Artale UniBZ - http://www.inf.unibz.it/ artale/z SECTION 10.1 Graphs: Definitions and Basic Properties Copyright Cengage Learning. All rights reserved. Graphs: Definitions

More information

Introduction III. Graphs. Motivations I. Introduction IV

Introduction III. Graphs. Motivations I. Introduction IV Introduction I Graphs Computer Science & Engineering 235: Discrete Mathematics Christopher M. Bourke cbourke@cse.unl.edu Graph theory was introduced in the 18th century by Leonhard Euler via the Königsberg

More information

PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS

PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS PERFORMANCE ANALYSIS OF DATAMINIG TECHNIQUE IN RBC, WBC and PLATELET CANCER DATASETS Mayilvaganan M 1 and Hemalatha 2 1 Associate Professor, Department of Computer Science, PSG College of arts and science,

More information

IE 102 Spring Routing Through Networks - 1

IE 102 Spring Routing Through Networks - 1 IE 102 Spring 2017 Routing Through Networks - 1 The Bridges of Koenigsberg: Euler 1735 Graph Theory began in 1735 Leonard Eüler Visited Koenigsberg People wondered whether it is possible to take a walk,

More information

OFFICE OF RESEARCH AND SPONSORED PROGRAMS

OFFICE OF RESEARCH AND SPONSORED PROGRAMS OFFICE OF RESEARCH AND SPONSORED PROGRAMS June 9, 2016 Mr. Satoshi Harada Department of Innovation Research Japan Science and Technology Agency (JST) K s Gobancho, 7, Gobancho, Chiyoda-ku, Tokyo, 102-0076

More information

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!

Detecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)! Detecting Superbubbles in Assembly Graphs Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)! de Bruijn Graph-based Assembly Reads (substrings of original DNA sequence) de Bruijn

More information

Bioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity,

Bioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity, Bioinformatics: Fragment Assembly Walter Kosters, Universiteit Leiden IPA Algorithms&Complexity, 29.6.2007 www.liacs.nl/home/kosters/ 1 Fragment assembly Problem We study the following problem from bioinformatics:

More information

Grade 7/8 Math Circles Graph Theory - Solutions October 13/14, 2015

Grade 7/8 Math Circles Graph Theory - Solutions October 13/14, 2015 Faculty of Mathematics Waterloo, Ontario N2L 3G1 Centre for Education in Mathematics and Computing Grade 7/8 Math Circles Graph Theory - Solutions October 13/14, 2015 The Seven Bridges of Königsberg In

More information

Solutions Exercise Set 3 Author: Charmi Panchal

Solutions Exercise Set 3 Author: Charmi Panchal Solutions Exercise Set 3 Author: Charmi Panchal Problem 1: Suppose we have following fragments: f1 = ATCCTTAACCCC f2 = TTAACTCA f3 = TTAATACTCCC f4 = ATCTTTC f5 = CACTCCCACACA f6 = CACAATCCTTAACCC f7 =

More information
