Graph Algorithms in Bioinformatics
|
|
- Angelina Poole
- 6 years ago
- Views:
Transcription
1 Graph Algorithms in Bioinformatics Bioinformatics: Issues and Algorithms CSE Fall 2007 Lecture 13 Lopresti Fall 2007 Lecture
2 Outline Introduction to graph theory Eulerian & Hamiltonian Cycle Problems Benzer experiment and interval graphs DNA sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by hybridization Lopresti Fall 2007 Lecture
3 The Bridge Obsession Problem Find a tour crossing every bridge just once. Leonhard Euler, 1735 Bridges of Königsberg Lopresti Fall 2007 Lecture
4 Eulerian Cycle Problem Find a cycle that visits every edge exactly once. Algorithm for solution has linear time complexity. More complicated Königsberg Lopresti Fall 2007 Lecture
5 Mapping problems to graphs Arthur Cayley studied chemical structures of hydrocarbons in mid-1800's. He used trees (acyclic connected graphs) to enumerate structural isomers. Lopresti Fall 2007 Lecture
6 Hamiltonian Cycle Problem Find a cycle that visits every vertex exactly once. NP-complete (i.e., a very hard problem to solve). Game invented by Sir William Hamilton in 1857 Lopresti Fall 2007 Lecture
7 Beginning of graph theory in biology Benzer s work: Developed deletion mapping. Proved linearity of gene. Demonstrated internal structure of gene. Seymour Benzer (1950's) Lopresti Fall 2007 Lecture
8 Viruses attack bacteria Normally bacteriophage T4 kills bacteria. However if T4 is mutated (e.g., an important gene is deleted), it gets disabled and loses ability to kill bacteria. Suppose bacteria is infected with two different mutants, each of which is disabled. Would the bacteria still survive? Amazingly, a pair of disable viruses can kill a bacteria even if each of them is disabled. How can this be explained? Lopresti Fall 2007 Lecture
9 Benzer s experiment Idea: infect bacteria with pairs of mutant T4 bacteriophage (virus). Each T4 mutant has an unknown interval deleted from its genome. If two intervals overlap: T4 pair is missing part of its genome and is disabled bacteria survive. If two intervals do not overlap: T4 pair has its entire genome and is enabled bacteria die. Lopresti Fall 2007 Lecture
10 Benzer s experiment Complementation between pairs of mutant T4 bacteriophages Lopresti Fall 2007 Lecture
11 Benzer s experiment and interval graphs Construct interval graph: each T4 mutant is a vertex; place edge between mutant pairs where bacteria survived (i.e., deleted intervals in pair of mutants overlap). Interval graph structure reveals whether DNA is linear or branched DNA. Case for linear genes Lopresti Fall 2007 Lecture
12 Benzer s experiment and interval graphs Case for branched genes Lopresti Fall 2007 Lecture
13 Interval graphs: a comparison Linear genome Branched genome Lopresti Fall 2007 Lecture
14 DNA sequencing: history Sanger method (1977): labeled ddntps* terminate DNA copying at random points. Gilbert method (1977): chemical method to cleave DNA at specific points (G, G+A, T+C, C). Both methods generate labeled fragments of varying lengths that are further electrophoresed. * Dideoxynucleotides, or ddntps, are nucleotides lacking 3'-hydroxyl (-OH) group on their deoxyribose sugar. Note that the sugar normally referred to as deoxyribose lacks a 2'-OH, while the dideoxyribose lacks hydroxyl groups on both its 2' and 3' carbon. The lack of this hydroxyl group means that, after being added by a DNA polymerase to a growing nucleotide chain, no further nucleotides can be added, as no phosphodiester bond can be created. Thus, these molecules form the basis of the dideoxy chaintermination method of DNA sequencing, which was developed by Frederick Sanger in (From Lopresti Fall 2007 Lecture
15 Sanger Method: generating a read Start at primer (restriction site). Grow DNA chain. Include ddntps. Stops reaction at all possible points. Separate products by length, using gel electrophoresis. Lopresti Fall 2007 Lecture
16 DNA sequencing Shear DNA into millions of small fragments. Read nucleotides at a time from the small fragments (Sanger method). Lopresti Fall 2007 Lecture
17 Fragment assembly Computational challenge: assemble individual short fragments (reads) into single genomic sequence. Until late 1990's, the shotgun fragment assembly of human genome was viewed as intractable problem. The Shortest Superstring Problem. Given set of strings, find shortest string containing all of them. Input: Strings s 1, s 2,..., s n. Output: A string s that contains all strings s 1, s 2,..., s n as substrings, such that length of s is minimized. Problem is NP-complete (efficient solution unlikely). This formulation does not take into account sequencing errors! Lopresti Fall 2007 Lecture
18 Shortest Superstring Problem: an example Note: this is an approximation an atttactive mathematical model for a real-world biological problem. Lopresti Fall 2007 Lecture
19 Reducing SSP to TSP Define overlap(s i, s j ) as the length of the longest prefix of s j that matches a suffix of s i. aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa What is overlap(s i, s j ) for these strings? 3? aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa What is overlap(s i, s j ) for these strings? No, 12! Lopresti Fall 2007 Lecture
20 Reducing SSP to TSP Define overlap(s i, s j ) as the length of the longest prefix of s j that matches a suffix of s i. aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa aaaggcatcaaatctaaaggcatcaaa Build graph with n vertices representing strings s 1, s 2,..., s n. Insert edges of length overlap overlap(s i, s j ) between vertices s i and s j. Find longest path which visits every vertex exactly once. This is Traveling Salesman Problem (TSP), which is also NPcomplete. Lopresti Fall 2007 Lecture
21 Reducing SSP to TSP Lopresti Fall 2007 Lecture
22 SSP to TSP: an Example Say that S = {ATC, CCA, CAG, TCC, AGT} SSP AGT CCA ATC ATCCAGT TCC CAG TSP ATC 0 1 AGT 1 2 CAG TCC ATCCAGT CCA Lopresti Fall 2007 Lecture
23 Sequencing by Hybridization (SBH): history 1988: SBH suggested as an an alternative sequencing method. Nobody believed it will ever work. 1991: Light directed polymer synthesis developed by Steve Fodor and colleagues. 1994: Affymetrix develops first 64-kb DNA microarray. First microarray prototype (1989) First commercial DNA microarray prototype w/16,000 features (1994) 500,000 features per chip (2002) Lopresti Fall 2007 Lecture
24 How SBH works Attach all possible DNA probes of length l to flat surface, each probe at distinct and known location. This set of probes is called the DNA array. Apply a solution containing fluorescently labeled DNA fragment to array. The DNA fragment hybridizes with only those probes that are complementary to substrings of length l of fragment. Using a spectroscopic detector, determine which probes hybridize to DNA fragment to obtain l-mer composition of target DNA fragment. Apply combinatorial algorithm (next) to reconstruct sequence of target DNA fragment from l-mer composition. Lopresti Fall 2007 Lecture
25 Hybridization on DNA Array Lopresti Fall 2007 Lecture
26 l-mer composition Spectrum(s, l) is an unordered multiset of all possible (n l + 1) l-mers in a string s of length n. The order of individual elements in Spectrum(s, l) does not matter. For s = TATGGTGC all of following are equivalent representations of Spectrum(s, 3): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG} Lopresti Fall 2007 Lecture
27 l-mer composition Spectrum(s, l) is an unordered multiset of all possible (n l + 1) l-mers in a string s of length n. The order of individual elements in Spectrum(s, l) does not matter. For s = TATGGTGC all of following are equivalent representations of Spectrum(s, 3): {TAT, ATG, TGG, GGT, GTG, TGC} {ATG, GGT, GTG, TAT, TGC, TGG} {TGG, TGC, TAT, GTG, GGT, ATG} We usually choose lexicographically-ordered representation as canonical one. Lopresti Fall 2007 Lecture
28 Different sequences different spectrums Note that different sequences may have same spectrum. Spectrum(GTATCT, 2)= Spectrum(GTCTAT, 2)= {AT, CT, GT, TA, TC} The Sequencing by Hybridization (SBH) Problem. Reconstruct a string from its l-mer composition. Input: Set S representing all l-mers from an unknown string s. Output: String s such that Spectrum(s, l) = S. Lopresti Fall 2007 Lecture
29 SBH: Hamiltonian Path approach S = {ATG AGG TGC TCC GTC GGT GCA CAG} Corresponding Hamiltonian Path Problem H : ATG AGG TGC TCC GTC GGT GCA CAG ATGCAG GTCC Path visits every vertex once! Lopresti Fall 2007 Lecture
30 SBH: Hamiltonian Path approach A more complicated graph: S = {ATG TGG TGC GTG GGC GCA GCG CGT } H Lopresti Fall 2007 Lecture
31 SBH: Hamiltonian Path approach S = {ATG TGG TGC GTG GGC GCA GCG CGT} Path 1: H Path 2: H ATGCGTGGCA ATGGCGTGCA Lopresti Fall 2007 Lecture
32 SBH: Eulerian Path approach S = {ATG TGG TGC GTG GGC GCA GCG CGT} Vertices correspond to (l 1)-mers: {AT, TG, GC, GG, GT, CA, CG} Edges correspond to l-mers from S. GT CG AT TG GC CA GG Path visits every edge once! Lopresti Fall 2007 Lecture
33 SBH: Eulerian Path approach S = {ATG TGG TGC GTG GGC GCA GCG CGT} corresponds to two different paths: GT CG GT CG 5 4 AT TG 6 4 GC CA AT TG 5 3 GC CA GG ATGGCGTGCA GG ATGCGTGGCA I.e., two acceptable solutions to this SBH problem. Lopresti Fall 2007 Lecture
34 Euler Theorem A graph is balanced if for every vertex, the number of incoming edges equals the number of outgoing edges: in(v) = out(v) Theorem: a connected graph is Eulerian (i.e., has an Euler Cycle) if and only if each of its vertices is balanced. Lopresti Fall 2007 Lecture
35 Euler Theorem: Proof Two cases to consider: (1) Eulerian balanced For every edge entering v (incoming edge) there exists an edge leaving v (outgoing edge). Therefore in(v) = out(v) (2) Balanced Eulerian??? Lopresti Fall 2007 Lecture
36 Algorithm for constructing an Eulerian Cycle (a) Start with an arbitrary vertex v and form arbitrary cycle with unused edges until dead-end is reached. Since graph is Eulerian, dead-end must be starting point, i.e., vertex v. w v Lopresti Fall 2007 Lecture
37 Algorithm for constructing an Eulerian Cycle (b) If cycle from Step (a) earlier is not Eulerian cycle, it must contain a vertex w, which has untraversed edges. Perform Step (a) again, using vertex w as the starting point. Once again, we will end up at starting vertex w. w v Lopresti Fall 2007 Lecture
38 Algorithm for constructing an Eulerian Cycle (c) Combine cycles from Step (a) and Step (b) into single cycle and iterate Step (b). w v Lopresti Fall 2007 Lecture
39 Euler Theorem: extension Theorem: A connected graph has an Eulerian path if and only if it contains at most two semi-balanced vertices and all other vertices are balanced. I.e., one vertex has an extra outgoing edge, while another corresponding vertex has an extra incoming edge. Lopresti Fall 2007 Lecture
40 Some Difficulties with SBH Fidelity of Hybridization: difficult to detect differences between probes hybridized with perfect matches and those with one or two mismatches. Array Size: effect of low fidelity can be decreased with longer l-mers, but array size increases exponentially in l. Array size is limited with current technology. Practicality: SBH is still impractical. As DNA microarray technology improves, SBH may become practical in future. Practicality (again): although impractical, SBH still spearheaded techniques for expression and SNP analysis. Lopresti Fall 2007 Lecture
41 Traditional DNA sequencing DNA Shake DNA fragments Vector = circular genome (bacterium, plasmid) Known location (restriction site) + = Lopresti Fall 2007 Lecture
42 Different types of vectors VECTOR Plasmid Cosmid BAC (Bacterial Artificial Chromosome) YAC (Yeast Artificial Chromosome) Size of insert (bp) 2,000-10,000 40,000 70, ,000 > 300,000 Not used much recently Lopresti Fall 2007 Lecture
43 Electrophoresis diagrams Lopresti Fall 2007 Lecture
44 Challenging to read sometimes Filtering Smoothening Correcting for length compression Method for deciding nucleotides Lopresti Fall 2007 Lecture
45 Shotgun sequencing Genomic segment Cut many times at random (shotgun) Get one or two reads from each segment ~500 bp ~500 bp Lopresti Fall 2007 Lecture
46 Fragment assembly Reads Cover region with ~7-fold redundancy. Overlap reads, extend to reconstruct original genomic region. Lopresti Fall 2007 Lecture
47 Read coverage C Length of genomic segment: L Number of reads: n Coverage C = (nl) / L Length of each read: l How much coverage is enough? Lander-Waterman model: Assuming uniform distribution of reads, C = 10 results in 1 gapped region per 1,000,000 nucleotides. Lopresti Fall 2007 Lecture
48 Challenges in fragment assembly Repeats: a major problem for fragment assembly. More than 50% of human genome consists of repeats: over 1 million Alu repeats (about 300 bp), about 200,000 LINE repeats (1000 bp and longer). Repeat Repeat Repeat Green and blue fragments are interchangeable when assembling repetitive DNA Lopresti Fall 2007 Lecture
49 Bad things happen when you confuse repeated regions... Correct assembly (we don't know this starting off): Repeat Repeat Repeat Potential mis-assembly (seems just as plausible): Repeat Repeat Repeat Lopresti Fall 2007 Lecture
50 Triazzle: a fun example The puzzle looks simple BUT there are repeats!!! Repeats make it very difficult. Try it only $7.99 at Lopresti Fall 2007 Lecture
51 Repeat types Low-Complexity DNA (e.g., ATATATATACATA ) Microsatellite repeats (a 1 a k ) N where k ~ 3-6 (e.g., CAGCAGTAGCAGCACCAG) Transposons/retrotransposons SINE LINE LTR retroposons Gene Families Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10 6 copies) Long Interspersed Nuclear Elements ~500-5,000 bp long, 200,000 copies Long Terminal Repeats (~700 bp at each end genes duplicate & then diverge Segmental duplications ~very long, very similar copies Lopresti Fall 2007 Lecture
52 Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors..acgattacaataggtt.. Lopresti Fall 2007 Lecture
53 Overlap Find best match between suffix of one read and prefix of another. Due to sequencing errors, need to use dynamic programming to find optimal overlap alignment. Apply filtration method to filter out pairs of fragments that do not share a significantly long common substring. Lopresti Fall 2007 Lecture
54 Overlapping reads Sort all k-mers in reads (k ~ 24). Find pairs of reads sharing a k-mer. Extend to full alignment throw away if not >95% similar. TACA TAGATTACACAGATTACT GA TAGT TAGATTACACAGATTACTAGA Lopresti Fall 2007 Lecture
55 Overlapping reads and repeats A k-mer that appears N times initiates N 2 comparisons For an Alu that appears 10 6 times, this means comparisons, which is obviously too much. Solution: discard all k-mers that appear more than t x Coverage, (t ~ 10) Lopresti Fall 2007 Lecture
56 Finding overlapping reads Create local multiple alignments from the overlapping reads: TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TAGATTACACAGATTACTGA TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TAGATTACACAGATTACTGA TTACACAGATTATTGA Lopresti Fall 2007 Lecture
57 Finding overlapping reads Correct errors using multiple alignment C: 20 C: 35 T: 30 TAGATTACACAGATTACTGA C: 35 TAGATTACACAGATTACTGA C: 40 TAG TAGATTACACAGATTACTGA TTACACAGATTATTGA TAGATTACACAGATTACTGA A: 15 A: 25 - A: 40 A: 25 Score alignments Accept alignments with good scores A: 15 A: 25 A: 0 A: 40 A: 25 C: 20 C: 35 C: 0 C: 35 C: 40 Lopresti Fall 2007 Lecture
58 Layout Repeats are major challenge. Do two aligned fragments really overlap, or are they from two copies of a repeat? Solution: repeat masking hide repeats!!! Masking results in high rate of misassembly (up to 20%). Misassembly means a lot more work at finishing step. Lopresti Fall 2007 Lecture
59 Merge reads into contigs repeat region Merge reads up to potential repeat boundaries Lopresti Fall 2007 Lecture
60 Repeats, errors, and contig lengths Repeats shorter than read length are okay. Repeats with more base pair differences than sequencing error rate are okay. To make smaller portion of genome appear repetitive: - try to increase effective increase read length, - work to decrease sequencing error rate. Lopresti Fall 2007 Lecture
61 Role of error correction Discards ~90% of single-letter sequencing errors. Lowers error rate: decreases effective repeat content, increases contig length. Lopresti Fall 2007 Lecture
62 Merge reads into contigs Repeat region Ignore non-maximal reads. Merge only maximal reads into contigs. Lopresti Fall 2007 Lecture
63 Merge reads into contigs a b Repeat boundary??? Sequencing error Ignore hanging reads, when detecting repeat boundaries. Lopresti Fall 2007 Lecture
64 Merge reads into contigs????? Unambiguous Insert non-maximal reads whenever unambiguous. Lopresti Fall 2007 Lecture
65 Link contigs into supercontigs Normal density Too dense: Overcollapsed? Inconsistent links: Overcollapsed? Lopresti Fall 2007 Lecture
66 Link contigs into supercontigs Find all links between unique contigs. Connect contigs incrementally, if 2 links. Lopresti Fall 2007 Lecture
67 Link contigs into supercontigs Fill gaps in supercontigs with paths of overcollapsed contigs. Lopresti Fall 2007 Lecture
68 Wrap-up Readings for next time: IBP (hash tables, repeat-finding, exact pattern matching, keyword trees, suffix trees). Remember: Come to class having done the readings. Check Blackboard regularly for updates. Lopresti Fall 2007 Lecture
10/8/13 Comp 555 Fall
10/8/13 Comp 555 Fall 2013 1 Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg 10/8/13 Comp 555 Fall 2013 2 Find a cycle that visits every edge exactly once Linear
More information10/15/2009 Comp 590/Comp Fall
Lecture 13: Graph Algorithms Study Chapter 8.1 8.8 10/15/2009 Comp 590/Comp 790-90 Fall 2009 1 The Bridge Obsession Problem Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg
More informationSequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics
Computational Biology IST Ana Teresa Freitas 2011/2012 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics (BACs) 1 Must take the fragments
More informationGraph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics Computational Biology IST Ana Teresa Freitas 2015/2016 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics
More informationDNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization
Eulerian & Hamiltonian Cycle Problems DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization The Bridge Obsession Problem Find a tour crossing every bridge just
More informationAlgorithms for Bioinformatics
Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 58670 Algorithms for Bioinformatics Lecture 5: Graph Algorithms
More informationAlgorithms for Bioinformatics
Adapted from slides by Alexandru Tomescu, Leena Salmela and Veli Mäkinen, which are partly from http://bix.ucsd.edu/bioalgorithms/slides.php 582670 Algorithms for Bioinformatics Lecture 3: Graph Algorithms
More informationSequence Assembly Required!
Sequence Assembly Required! 1 October 3, ISMB 20172007 1 Sequence Assembly Genome Sequenced Fragments (reads) Assembled Contigs Finished Genome 2 Greedy solution is bounded 3 Typical assembly strategy
More informationCSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly
CSCI2950-C Lecture 4 DNA Sequencing and Fragment Assembly Ben Raphael Sept. 22, 2009 http://cs.brown.edu/courses/csci2950-c/ l-mer composition Def: Given string s, the Spectrum ( s, l ) is unordered multiset
More informationDNA Sequencing. Overview
BINF 3350, Genomics and Bioinformatics DNA Sequencing Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Eulerian Cycles Problem Hamiltonian Cycles
More informationI519 Introduction to Bioinformatics, Genome assembly. Yuzhen Ye School of Informatics & Computing, IUB
I519 Introduction to Bioinformatics, 2014 Genome assembly Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents Genome assembly problem Approaches Comparative assembly The string
More informationDNA Fragment Assembly
Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly Overlap
More informationSequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes
Sequence Assembly BMI/CS 576 www.biostat.wisc.edu/bmi576/ Mark Craven craven@biostat.wisc.edu Some sequencing successes Yersinia pestis Cannabis sativa The sequencing problem We want to determine the identity
More informationGenome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner
Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications
More informationGenome 373: Genome Assembly. Doug Fowler
Genome 373: Genome Assembly Doug Fowler What are some of the things we ve seen we can do with HTS data? We ve seen that HTS can enable a wide variety of analyses ranging from ID ing variants to genome-
More informationCS681: Advanced Topics in Computational Biology
CS681: Advanced Topics in Computational Biology Can Alkan EA224 calkan@cs.bilkent.edu.tr Week 7 Lectures 2-3 http://www.cs.bilkent.edu.tr/~calkan/teaching/cs681/ Genome Assembly Test genome Random shearing
More informationEulerian tours. Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck. April 20, 2016
Eulerian tours Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck http://cseweb.ucsd.edu/classes/sp16/cse21-bd/ April 20, 2016 Seven Bridges of Konigsberg Is there a path that crosses each
More informationGenome Reconstruction: A Puzzle with a Billion Pieces. Phillip Compeau Carnegie Mellon University Computational Biology Department
http://cbd.cmu.edu Genome Reconstruction: A Puzzle with a Billion Pieces Phillip Compeau Carnegie Mellon University Computational Biology Department Eternity II: The Highest-Stakes Puzzle in History Courtesy:
More informationGraphs and Puzzles. Eulerian and Hamiltonian Tours.
Graphs and Puzzles. Eulerian and Hamiltonian Tours. CSE21 Winter 2017, Day 11 (B00), Day 7 (A00) February 3, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Exam Announcements Seating Chart on Website Good
More informationEulerian Tours and Fleury s Algorithm
Eulerian Tours and Fleury s Algorithm CSE21 Winter 2017, Day 12 (B00), Day 8 (A00) February 8, 2017 http://vlsicad.ucsd.edu/courses/cse21-w17 Vocabulary Path (or walk): describes a route from one vertex
More informationDNA Fragment Assembly
SIGCSE 009 Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri DNA Fragment Assembly
More informationRead Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015
Mapping de Novo Assembly Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin Genomics: Lecture #2 WS 2014/2015 Today Genome assembly: the basics Hamiltonian and Eulerian
More information(for more info see:
Genome assembly (for more info see: http://www.cbcb.umd.edu/research/assembly_primer.shtml) Introduction Sequencing technologies can only "read" short fragments from a genome. Reconstructing the entire
More informationRESEARCH TOPIC IN BIOINFORMANTIC
RESEARCH TOPIC IN BIOINFORMANTIC GENOME ASSEMBLY Instructor: Dr. Yufeng Wu Noted by: February 25, 2012 Genome Assembly is a kind of string sequencing problems. As we all know, the human genome is very
More informationGenome Sequencing Algorithms
Genome Sequencing Algorithms Phillip Compaeu and Pavel Pevzner Bioinformatics Algorithms: an Active Learning Approach Leonhard Euler (1707 1783) William Hamilton (1805 1865) Nicolaas Govert de Bruijn (1918
More informationDNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats
San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research 2008 DNA Fragment Assembly Algorithms: Toward a Solution for Long Repeats Ching Li San Jose State University
More informationGenome Assembly and De Novo RNAseq
Genome Assembly and De Novo RNAseq BMI 7830 Kun Huang Department of Biomedical Informatics The Ohio State University Outline Problem formulation Hamiltonian path formulation Euler path and de Bruijin graph
More informationde novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics Next Generation Sequencing Analysis
de novo assembly Simon Rasmussen 36626: Next Generation Sequencing analysis DTU Bioinformatics 27626 - Next Generation Sequencing Analysis Generalized NGS analysis Data size Application Assembly: Compare
More informationPurpose of sequence assembly
Sequence Assembly Purpose of sequence assembly Reconstruct long DNA/RNA sequences from short sequence reads Genome sequencing RNA sequencing for gene discovery Amplicon sequencing But not for transcript
More informationBMI/CS 576 Fall 2015 Midterm Exam
BMI/CS 576 Fall 2015 Midterm Exam Prof. Colin Dewey Tuesday, October 27th, 2015 11:00am-12:15pm Name: KEY Write your answers on these pages and show your work. You may use the back sides of pages as necessary.
More informationBioinformatics-themed projects in Discrete Mathematics
Bioinformatics-themed projects in Discrete Mathematics Art Duval University of Texas at El Paso Joint Mathematics Meeting MAA Contributed Paper Session on Discrete Mathematics in the Undergraduate Curriculum
More informationby the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression.
Figure S1. Tissue-specific expression profile of the genes that were screened through the RHEPatmatch and root-specific microarray filters. The gene expression profile (heat map) was drawn by the Genevestigator
More informationIntroduction to Genome Assembly. Tandy Warnow
Introduction to Genome Assembly Tandy Warnow 2 Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp 10,000bp Not all sequencing technologies produce mate-pairs. Different
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3
More informationReducing Genome Assembly Complexity with Optical Maps
Reducing Genome Assembly Complexity with Optical Maps Lee Mendelowitz LMendelo@math.umd.edu Advisor: Dr. Mihai Pop Computer Science Department Center for Bioinformatics and Computational Biology mpop@umiacs.umd.edu
More informationBioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity,
Bioinformatics: Fragment Assembly Walter Kosters, Universiteit Leiden IPA Algorithms&Complexity, 29.6.2007 www.liacs.nl/home/kosters/ 1 Fragment assembly Problem We study the following problem from bioinformatics:
More informationSolutions Exercise Set 3 Author: Charmi Panchal
Solutions Exercise Set 3 Author: Charmi Panchal Problem 1: Suppose we have following fragments: f1 = ATCCTTAACCCC f2 = TTAACTCA f3 = TTAATACTCCC f4 = ATCTTTC f5 = CACTCCCACACA f6 = CACAATCCTTAACCC f7 =
More informationOmega: an Overlap-graph de novo Assembler for Metagenomics
Omega: an Overlap-graph de novo Assembler for Metagenomics B a h l e l H a i d e r, Ta e - H y u k A h n, B r i a n B u s h n e l l, J u a n j u a n C h a i, A l e x C o p e l a n d, C h o n g l e Pa n
More informationIDBA - A Practical Iterative de Bruijn Graph De Novo Assembler
IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler Yu Peng, Henry Leung, S.M. Yiu, Francis Y.L. Chin Department of Computer Science, The University of Hong Kong Pokfulam Road, Hong Kong {ypeng,
More informationGraphs and Genetics. Outline. Computational Biology IST. Ana Teresa Freitas 2015/2016. Slides source: AED (MEEC/IST); Jones and Pevzner (book)
raphs and enetics Computational Biology IST Ana Teresa Freitas / Slides source: AED (MEEC/IST); Jones and Pevzner (book) Outline l Motivacion l Introduction to raph Theory l Eulerian & Hamiltonian Cycle
More informationIDBA A Practical Iterative de Bruijn Graph De Novo Assembler
IDBA A Practical Iterative de Bruijn Graph De Novo Assembler Yu Peng, Henry C.M. Leung, S.M. Yiu, and Francis Y.L. Chin Department of Computer Science, The University of Hong Kong Pokfulam Road, Hong Kong
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies November 17, 2012 1 Introduction Introduction 2 BLAST What is BLAST? The algorithm 3 Genome assembly De
More informationReducing Genome Assembly Complexity with Optical Maps Mid-year Progress Report
Reducing Genome Assembly Complexity with Optical Maps Mid-year Progress Report Lee Mendelowitz LMendelo@math.umd.edu Advisor: Dr. Mihai Pop Computer Science Department Center for Bioinformatics and Computational
More informationPath Finding in Graphs. Problem Set #2 will be posted by tonight
Path Finding in Graphs Problem Set #2 will be posted by tonight 1 From Last Time Two graphs representing 5-mers from the sequence "GACGGCGGCGCACGGCGCAA" Hamiltonian Path: Eulerian Path: Each k-mer is a
More information8/19/13. Computational problems. Introduction to Algorithm
I519, Introduction to Introduction to Algorithm Yuzhen Ye (yye@indiana.edu) School of Informatics and Computing, IUB Computational problems A computational problem specifies an input-output relationship
More informationEuler and Hamilton paths. Jorge A. Cobb The University of Texas at Dallas
Euler and Hamilton paths Jorge A. Cobb The University of Texas at Dallas 1 Paths and the adjacency matrix The powers of the adjacency matrix A r (with normal, not boolean multiplication) contain the number
More informationCSE 417 Branch & Bound (pt 4) Branch & Bound
CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity
More informationCSE 111 Bio: Program Design I Lecture 3: Python Basics & More Bio
Theresa McCracken @McHumor.com CSE 111 Bio: Program Design I Lecture 3: Python Basics & More Bio Robert Sloan (CS) & Rachel Poretsky (Bio) University of Illinois, Chicago August 31, 2017 DNA sequencing
More informationV1.0: Seth Gilbert, V1.1: Steven Halim August 30, Abstract. d(e), and we assume that the distance function is non-negative (i.e., d(x, y) 0).
CS4234: Optimisation Algorithms Lecture 4 TRAVELLING-SALESMAN-PROBLEM (4 variants) V1.0: Seth Gilbert, V1.1: Steven Halim August 30, 2016 Abstract The goal of the TRAVELLING-SALESMAN-PROBLEM is to find
More informationDescription of a genome assembler: CABOG
Theo Zimmermann Description of a genome assembler: CABOG CABOG (Celera Assembler with the Best Overlap Graph) is an assembler built upon the Celera Assembler, which, at first, was designed for Sanger sequencing,
More informationEfficient Selection of Unique and Popular Oligos for Large EST Databases. Stefano Lonardi. University of California, Riverside
Efficient Selection of Unique and Popular Oligos for Large EST Databases Stefano Lonardi University of California, Riverside joint work with Jie Zheng, Timothy Close, Tao Jiang University of California,
More informationComputational models for bionformatics
Computational models for bionformatics De-novo assembly and alignment-free measures Michele Schimd Department of Information Engineering July 8th, 2015 Michele Schimd (DEI) PostDoc @ DEI July 8th, 2015
More informationSarah Will Math 490 December 2, 2009
Sarah Will Math 490 December 2, 2009 Euler Circuits INTRODUCTION Euler wrote the first paper on graph theory. It was a study and proof that it was impossible to cross the seven bridges of Königsberg once
More informationDNA arrays. and their various applications. Algorithmen der Bioinformatik II - SoSe Christoph Dieterich
DNA arrays and their various applications Algorithmen der Bioinformatik II - SoSe 2007 Christoph Dieterich 1 Introduction Motivation DNA microarray is a parallel approach to gene screening and target identification.
More informationIE 102 Spring Routing Through Networks - 1
IE 102 Spring 2017 Routing Through Networks - 1 The Bridges of Koenigsberg: Euler 1735 Graph Theory began in 1735 Leonard Eüler Visited Koenigsberg People wondered whether it is possible to take a walk,
More informationA THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS
A THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS Munib Ahmed, Ishfaq Ahmad Department of Computer Science and Engineering, University of Texas At Arlington, Arlington, Texas
More informationCSCI 1820 Notes. Scribes: tl40. February 26 - March 02, Estimating size of graphs used to build the assembly.
CSCI 1820 Notes Scribes: tl40 February 26 - March 02, 2018 Chapter 2. Genome Assembly Algorithms 2.1. Statistical Theory 2.2. Algorithmic Theory Idury-Waterman Algorithm Estimating size of graphs used
More informationBuilding approximate overlap graphs for DNA assembly using random-permutations-based search.
An algorithm is presented for fast construction of graphs of reads, where an edge between two reads indicates an approximate overlap between the reads. Since the algorithm finds approximate overlaps directly,
More informationCombinatorial Pattern Matching. CS 466 Saurabh Sinha
Combinatorial Pattern Matching CS 466 Saurabh Sinha Genomic Repeats Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary
More informationGenome 373: Mapping Short Sequence Reads I. Doug Fowler
Genome 373: Mapping Short Sequence Reads I Doug Fowler Two different strategies for parallel amplification BRIDGE PCR EMULSION PCR Two different strategies for parallel amplification BRIDGE PCR EMULSION
More informationMacVector for Mac OS X. The online updater for this release is MB in size
MacVector 17.0.3 for Mac OS X The online updater for this release is 143.5 MB in size You must be running MacVector 15.5.4 or later for this updater to work! System Requirements MacVector 17.0 is supported
More informationPyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds
Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds February 27, 2009 Alexander Mastroianni, Shelley Claridge, A. Paul Alivisatos Department of Chemistry, University of California,
More informationGenome Assembly Using de Bruijn Graphs. Biostatistics 666
Genome Assembly Using de Bruijn Graphs Biostatistics 666 Previously: Reference Based Analyses Individual short reads are aligned to reference Genotypes generated by examining reads overlapping each position
More informationIDBA - A practical Iterative de Bruijn Graph De Novo Assembler
IDBA - A practical Iterative de Bruijn Graph De Novo Assembler Speaker: Gabriele Capannini May 21, 2010 Introduction De Novo Assembly assembling reads together so that they form a new, previously unknown
More informationHow to apply de Bruijn graphs to genome assembly
PRIMER How to apply de Bruijn graphs to genome assembly Phillip E C Compeau, Pavel A Pevzner & lenn Tesler A mathematical concept known as a de Bruijn graph turns the formidable challenge of assembling
More informationDiscrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note 8
CS 70 Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note 8 An Introduction to Graphs Formulating a simple, precise specification of a computational problem is often a prerequisite
More informationWalking with Euler through Ostpreußen and RNA
Walking with Euler through Ostpreußen and RNA Mark Muldoon February 4, 2010 Königsberg (1652) Kaliningrad (2007)? The Königsberg Bridge problem asks whether it is possible to walk around the old city in
More informationReducing Genome Assembly Complexity with Optical Maps Final Report
Reducing Genome Assembly Complexity with Optical Maps Final Report Lee Mendelowitz LMendelo@math.umd.edu Advisor: Dr. Mihai Pop Computer Science Department Center for Bioinformatics and Computational Biology
More informationDecision Problems. Observation: Many polynomial algorithms. Questions: Can we solve all problems in polynomial time? Answer: No, absolutely not.
Decision Problems Observation: Many polynomial algorithms. Questions: Can we solve all problems in polynomial time? Answer: No, absolutely not. Definition: The class of problems that can be solved by polynomial-time
More informationCS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow 2 Shotgun DNA Sequencing DNA target sample SHEAR & SIZE End Reads / Mate Pairs 550bp 10,000bp Not all sequencing technologies
More informationTheorem 2.9: nearest addition algorithm
There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used
More informationEECS 203 Lecture 20. More Graphs
EECS 203 Lecture 20 More Graphs Admin stuffs Last homework due today Office hour changes starting Friday (also in Piazza) Friday 6/17: 2-5 Mark in his office. Sunday 6/19: 2-5 Jasmine in the UGLI. Monday
More informationDiscrete Mathematics for CS Spring 2008 David Wagner Note 13. An Introduction to Graphs
CS 70 Discrete Mathematics for CS Spring 2008 David Wagner Note 13 An Introduction to Graphs Formulating a simple, precise specification of a computational problem is often a prerequisite to writing a
More information11/5/09 Comp 590/Comp Fall
11/5/09 Comp 590/Comp 790-90 Fall 2009 1 Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary secrets Many tumors
More informationTCGR: A Novel DNA/RNA Visualization Technique
TCGR: A Novel DNA/RNA Visualization Technique Donya Quick and Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 dquick@mail.smu.edu, mhd@engr.smu.edu
More informationMITOCW watch?v=zyw2aede6wu
MITOCW watch?v=zyw2aede6wu The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To
More informationWeek 11: Eulerian and Hamiltonian graphs; Trees. 21 and 23 November, 2018
(1/22) MA284 : Discrete Mathematics Week 11: Eulerian and amiltonian graphs; Trees http://www.maths.nuigalway.ie/ niall/ma284/ amilton s Icosian Game (Library or the Royal Irish Academy) 21 and 23 November,
More informationPreliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification
Preliminary Syllabus Sep 30 Oct 2 Oct 7 Oct 9 Oct 14 Oct 16 Oct 21 Oct 25 Oct 28 Nov 4 Nov 8 Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification OCTOBER BREAK
More informationLecture Overview. Sequence search & alignment. Searching sequence databases. Sequence Alignment & Search. Goals: Motivations:
Lecture Overview Sequence Alignment & Search Karin Verspoor, Ph.D. Faculty, Computational Bioscience Program University of Colorado School of Medicine With credit and thanks to Larry Hunter for creating
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the
More information02-711/ Computational Genomics and Molecular Biology Fall 2016
Literature assignment 2 Due: Nov. 3 rd, 2016 at 4:00pm Your name: Article: Phillip E C Compeau, Pavel A. Pevzner, Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29,
More informationSimple Graph. General Graph
Graph Theory A graph is a collection of points (also called vertices) and lines (also called edges), with each edge ending at a vertex In general, it is allowed for more than one edge to have the same
More informationFinishing Circular Assemblies. J Fass UCD Genome Center Bioinformatics Core Thursday April 16, 2015
Finishing Circular Assemblies J Fass UCD Genome Center Bioinformatics Core Thursday April 16, 2015 Assembly Strategies de Bruijn graph Velvet, ABySS earlier, basic assemblers IDBA, SPAdes later, multi-k
More informationChapter 3: Paths and Cycles
Chapter 3: Paths and Cycles 5 Connectivity 1. Definitions: Walk: finite sequence of edges in which any two consecutive edges are adjacent or identical. (Initial vertex, Final vertex, length) Trail: walk
More informationWeek 11: Eulerian and Hamiltonian graphs; Trees. 15 and 17 November, 2017
(1/22) MA284 : Discrete Mathematics Week 11: Eulerian and Hamiltonian graphs; Trees http://www.maths.nuigalway.ie/~niall/ma284/ 15 and 17 November, 2017 Hamilton s Icosian Game (Library or the Royal Irish
More informationFundamental Properties of Graphs
Chapter three In many real-life situations we need to know how robust a graph that represents a certain network is, how edges or vertices can be removed without completely destroying the overall connectivity,
More information12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006
12 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, October 18, 2006 3 Sequence comparison by compression This chapter is based on the following articles, which are all recommended reading: X. Chen,
More informationDiscrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7
CS 70 Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7 An Introduction to Graphs A few centuries ago, residents of the city of Königsberg, Prussia were interested in a certain problem.
More informationGenome Sequencing & Assembly. Slides by Carl Kingsford
Genome Sequencing & Assembly Slides by Carl Kingsford Genome Sequencing ACCGTCCAATTGG...! TGGCAGGTTAACC... E.g. human: 3 billion bases split into 23 chromosomes Main tool of traditional sequencing: DNA
More informationNotes slides from before lecture. CSE 21, Winter 2017, Section A00. Lecture 9 Notes. Class URL:
Notes slides from before lecture CSE 21, Winter 2017, Section A00 Lecture 9 Notes Class URL: http://vlsicad.ucsd.edu/courses/cse21-w17/ Notes slides from before lecture Notes February 8 (1) HW4 is due
More informationSOLiD GFF File Format
SOLiD GFF File Format 1 Introduction The GFF file is a text based repository and contains data and analysis results; colorspace calls, quality values (QV) and variant annotations. The inputs to the GFF
More information6.00 Introduction to Computer Science and Programming Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.00 Introduction to Computer Science and Programming Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationComputational Biology Lecture 12: Physical mapping by restriction mapping Saad Mneimneh
Computational iology Lecture : Physical mapping by restriction mapping Saad Mneimneh In the beginning of the course, we looked at genetic mapping, which is the problem of identify the relative order of
More informationIntroduction III. Graphs. Motivations I. Introduction IV
Introduction I Graphs Computer Science & Engineering 235: Discrete Mathematics Christopher M. Bourke cbourke@cse.unl.edu Graph theory was introduced in the 18th century by Leonhard Euler via the Königsberg
More informationAn Introduction to Graph Theory
An Introduction to Graph Theory Evelyne Smith-Roberge University of Waterloo March 22, 2017 What is a graph? Definition A graph G is: a set V (G) of objects called vertices together with: a set E(G), of
More informationNumber Theory and Graph Theory
1 Number Theory and Graph Theory Chapter 7 Graph properties By A. Satyanarayana Reddy Department of Mathematics Shiv Nadar University Uttar Pradesh, India E-mail: satya8118@gmail.com 2 Module-2: Eulerian
More informationCLC Server. End User USER MANUAL
CLC Server End User USER MANUAL Manual for CLC Server 10.0.1 Windows, macos and Linux March 8, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark
More informationBioinformatics I, WS 09-10, D. Huson, February 10,
Bioinformatics I, WS 09-10, D. Huson, February 10, 2010 189 12 More on Suffix Trees This week we study the following material: WOTD-algorithm MUMs finding repeats using suffix trees 12.1 The WOTD Algorithm
More informationLecture 5: Markov models
Master s course Bioinformatics Data Analysis and Tools Lecture 5: Markov models Centre for Integrative Bioinformatics Problem in biology Data and patterns are often not clear cut When we want to make a
More informationCHAPTER 10 GRAPHS AND TREES. Alessandro Artale UniBZ - artale/z
CHAPTER 10 GRAPHS AND TREES Alessandro Artale UniBZ - http://www.inf.unibz.it/ artale/z SECTION 10.1 Graphs: Definitions and Basic Properties Copyright Cengage Learning. All rights reserved. Graphs: Definitions
More information