DNA Fragment Assembly
- Cynthia Bruce
- 5 years ago
Transcription
SIGCSE 2009
Algorithms in Bioinformatics
Sami Khuri, Department of Computer Science, San José State University, San José, California, USA

DNA Fragment Assembly
- Overlap Graphs
- Shotgun Sequencing
- Repeated Regions
- Sequencing by Hybridization
- Hamiltonian Cycle
- Euler Path

To Sequence
To sequence a DNA molecule is to obtain the string of bases that it contains. Large-scale DNA sequencing deals with large DNA molecules (thousands of base pairs).

Introduction
It is impossible to directly sequence contiguous stretches of more than a few hundred bases. On the other hand, we know how to cut random pieces from a long DNA molecule and how to produce enough copies of each piece to sequence it. A typical approach to sequencing long DNA molecules is therefore to sample fragments from them and sequence those fragments. The problem is that these pieces (fragments) then have to be assembled.

Steps of Fragment Assembly
In large-scale DNA sequencing, we are given a collection of many fragments, each a short DNA sequence. The fragments are approximate substrings of a very long DNA molecule. The fragment assembly problem consists in reconstructing the original sequence from the fragments.
Consensus Sequence Building

Importance of Fragment Assembly
- We need reliable, complete genomic sequences of human and other model organisms.
- The base-pair sequence is the most basic piece of DNA information: gene structure and function are described by sequence.

Why Sequencing?
By comparing genome sequences from carefully chosen organisms, scientists can identify specific DNA sequences that have been conserved throughout the evolution of different species. Such conservation is a strong indicator that these sequences reflect functionally important regions of the genome.

Why Sequencing? (II)
- To catalog all the genes present in one organism.
- To compare the gene content of one organism to that of another.
- To study features other than genes.
- To study genome evolution.
- As a foundation for future experimentation.

Genome Sequencing Strategies
- Human Genome Project: map-based strategy. Individual clones are subjected to shotgun sequencing, and the sequences from the clones (shotgun fragments) are then reassembled.
- Celera: whole-genome strategy based on shotgun sequencing.
Source: "On the sequencing of the human genome" by Waterston et al., PNAS, vol. 99, 2002.
Hierarchical vs. Whole-Genome
[Figure: hierarchical versus whole-genome shotgun sequencing. Source: "On the sequencing of the human genome" by Waterston et al., PNAS, vol. 99, 2002.]

Fragments of a DNA Molecule
Each fragment corresponds to a substring of one of the strands of the target molecule. We do not know:
- which strand the fragment belongs to,
- the position of the fragment relative to the beginning of the strand,
- whether the fragment contains errors.

Shotgun Sequencing
- A large number of fragments are obtained by a sequencing technique: the shotgun method.
- The reconstruction of the target molecule's sequence is based on fragment overlap.
- Fragment lengths range from a few hundred up to 700 bases.
- Target sequences range from tens of thousands to hundreds of thousands of base pairs.

The Fragment Assembly Problem
The problem consists in obtaining the whole sequence of the target DNA molecule. Since we have a collection of fragments to piece together, this problem is known as the fragment assembly problem.

An Example
Consider the following four sequences:
ACCGT
CGTGC
TTAC
TACCGT
Assume it is known that the target sequence has a length of about 10.

The Layout
Position the fragments so that they align well with each other, placing equal bases in the same column, to get a layout:
--ACCGT--
----CGTGC
TTAC-----
-TACCGT--
---------
TTACCGTGC
Consensus of length 9.
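The way a consensus is read off a layout, column by column, can be sketched in a few lines of Python. This is an illustrative sketch rather than anything from the slides: the function name `consensus` and the use of "-" as the gap symbol are assumptions, and ties are broken arbitrarily in favor of the base counted first.

```python
from collections import Counter


def consensus(layout):
    """Column-wise majority vote over equal-length rows; '-' means 'no base here'."""
    width = len(layout[0])
    if any(len(row) != width for row in layout):
        raise ValueError("all rows of the layout must have the same length")
    result = []
    for column in zip(*layout):
        # Count only real bases; gap symbols do not vote.
        votes = Counter(base for base in column if base != "-")
        if votes:
            result.append(votes.most_common(1)[0][0])
    return "".join(result)


# The four fragments of the example, padded to their layout positions.
layout = [
    "--ACCGT--",
    "----CGTGC",
    "TTAC-----",
    "-TACCGT--",
]
print(consensus(layout))  # TTACCGTGC
```

Because the fragments in this example are error-free, every column is unanimous; with real reads the vote is what absorbs isolated base-call errors.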
Consensus Sequence
The consensus sequence, or consensus, is obtained by taking a majority vote among all bases in the same column. The answer, TTACCGTGC, has nine bases, which is close to the approximate target length of 10, and contains each fragment as an exact substring. In general, fragments are seldom exact substrings of the consensus.

Major Sequencing Centers
- Joint Genome Institute (USA), with five national laboratories: Lawrence Berkeley, Lawrence Livermore, Los Alamos, Oak Ridge, Pacific Northwest
- Stanford Human Genome Center
- The Institute for Genomic Research (USA)
- Sanger Institute (UK)
- J. Craig Venter Institute (USA)
- Washington University (USA)
- Integrated Genomics (USA)
- Genoscope (France)
- Broad Institute, a division of the Whitehead Institute (USA)

Chromosome 19 (JGI)
- It has the highest gene density of all human chromosomes, more than double the genome-wide average.
- It contains well over a thousand protein-coding genes, as well as pseudogenes.
- It carries genes linked to diseases such as insulin-dependent diabetes, myotonic dystrophy, migraines, and familial hypercholesterolemia.

Mimulus guttatus (JGI)
Mimulus guttatus (seep monkey flower) is a model organism for studies of evolution and ecology. Mimulus species have:
- a small genome (about 430 Mb),
- a short generation time (a matter of weeks),
- high fecundity (hundreds to thousands of seeds per pollination),
- self-compatibility,
- ease of greenhouse propagation.

Complicating Factors
DNA sequencing is very challenging because:
- Real problem instances are very large.
- Many fragments contain errors: base-call errors, chimeras, vector contamination.
- The orientation of the fragments is frequently unknown, so both strands must be analyzed.
- There might be a lack of coverage.

Repeated Regions
Repeats are sequences that appear two or more times in the target molecule.
- Short repeats are repeats covered by one fragment. They do not pose any problem.
- Long repeats cause most of the problems.
Repeated Regions II
Repeats are not necessarily identical; if the similarity is high enough, the differences between copies can be mistaken for base-call errors. There are two types of repeats:
- Direct repeats
- Inverted repeats

Repeats: An Example
Sequences shown: ATGGCTCATAGGCTCGAG and ATGGCTCGAG
Fragments: GGCTC, TGGCT, ATGGC, GCTCAT, TAGGCT, GGCTCG, GCTCGA, CTCGAG

Fragment   Alignment against ATGGCTCGAG
GGCTC      --GGCTC---
TGGCT      -TGGCT----
ATGGC      ATGGC-----
GCTCAT     ---GCTC-AT
TAGGCT     TAGGCT----
GGCTCG     --GGCTCG--
GCTCGA     ---GCTCGA-
CTCGAG     ----CTCGAG

[Figure: the same fragments laid out against ATGGCTCATAGGCTCGAG.]

Models
Models of the fragment assembly problem:
- Shortest Common Superstring
- Reconstruction
- Multicontig
None addresses the biological issues completely.
Assumption: the fragment collection is free of contamination and chimeras.

Shortest Common Superstring
The Shortest Common Superstring (SCS) was one of the first attempts to formalize the fragment assembly problem: look for the shortest superstring of a collection of given strings.
Limitations of SCS in representing the fragment assembly problem:
- It does not account for errors.
- It is an NP-hard problem, hence approximation algorithms are used.

SCS Problem Definition
Input: a collection F of strings.
Output: a shortest possible string S such that, for every f belonging to F, S is a superstring of f.
- F corresponds to the fragments.
- Each fragment is given by its sequence in the correct orientation.
- S is the sequence of the target DNA molecule.
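The SCS definition above can be made concrete with a brute-force sketch. It is only meant for tiny inputs, since it tries every ordering of the fragments, which is exponential and consistent with the problem being NP-hard. The helper names `merge` and `shortest_common_superstring` are hypothetical and not taken from the slides.

```python
from itertools import permutations


def merge(a, b):
    """Append b to a, reusing the longest suffix of a that is a prefix of b."""
    if b in a:
        return a
    for t in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:t]):
            return a + b[t:]
    return a + b


def shortest_common_superstring(fragments):
    """Exhaustive search over all fragment orders; feasible only for a handful of strings."""
    # Drop duplicates and fragments contained in another fragment.
    unique = list(dict.fromkeys(fragments))
    kept = [f for f in unique if not any(f != g and f in g for g in unique)]
    if not kept:
        return ""
    best = None
    for order in permutations(kept):
        candidate = order[0]
        for fragment in order[1:]:
            candidate = merge(candidate, fragment)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```

For a substring-free collection, every shortest superstring is obtained by maximally overlapping consecutive fragments in some order, which is why enumerating orders is enough here.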
SCS: An Example
Let F = {ACT, CTA, AGT}.
An SCS of F is the sequence S = ACTAGT.
S contains every fragment in F as a substring.

Shortest Common Superstring (SCS)
Problem: given a set of strings, find a shortest string that contains all of them.
Input: strings s_1, s_2, ..., s_n.
Output: a string S that contains all strings s_1, s_2, ..., s_n as substrings, such that the length of S is minimized.
Example: what is the shortest common superstring of {000, 001, 010, 011, 100, 101, 110, 111}?

An Example of SCS
[Figure: an example of SCS. Source: An Introduction to Bioinformatics Algorithms by N. Jones and P. Pevzner.]

SCS: Drawbacks
- The computational problem requires S to be a perfect superstring of each fragment. Hence, SCS does not allow for experimental errors in fragments.
- The orientation must be known, which is seldom the case.
- Even if the above factors are controlled, the SCS might not be the actual biological solution, because of repeated sections in the target DNA sequence.

FAP Algorithms
The algorithms we consider assume that:
- fragments have no errors,
- fragments are of known orientation.
Representing overlaps:
- Common superstrings correspond to paths in a graph built from the collection of fragments.
- Properties of these superstrings translate into properties of paths.
- It is easier to relate new problems to graphs, given our familiarity with and knowledge of them.

Overlap Directed Graphs
Given a set F of fragments, we can construct a directed graph as follows:
- The vertices of the graph represent the given DNA fragments.
- If there is an overlap between a suffix of fragment F_1 and a prefix of fragment F_2, then an edge is drawn from F_1 to F_2.
- Each edge is given a weight corresponding to the length of the overlap.
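The construction of the overlap directed graph can be sketched as follows. This is a minimal illustration that keeps only the longest overlap per ordered pair and omits zero-weight edges; the function names `overlap` and `overlap_graph` and the three sample fragments are chosen for illustration, and the collection is assumed to be substring-free.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that equals a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-t:] == b[:t]:
            return t
    return 0


def overlap_graph(fragments):
    """Map each ordered pair (a, b) with a positive overlap to that overlap length."""
    edges = {}
    for a in fragments:
        for b in fragments:
            if a != b:
                weight = overlap(a, b)
                if weight > 0:
                    edges[(a, b)] = weight
    return edges


for (a, b), weight in overlap_graph(["TACGA", "ACCC", "GACA"]).items():
    print(f"{a} -> {b} (weight {weight})")
# TACGA -> ACCC (weight 1)
# TACGA -> GACA (weight 2)
# GACA -> ACCC (weight 1)
```

The pairwise comparison is quadratic in the number of fragments, which is fine for a classroom example but not for real read sets.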
Overlap Graphs
Note that the overlap graph:
- is a multigraph, since there can be more than one edge between any two vertices,
- has an edge of weight zero between any two vertices.
To find the target DNA sequence, we look for a Hamiltonian path: a path that visits each vertex exactly once. We choose the Hamiltonian path with the largest total edge weight.

Paths Originating Superstrings
Only edges with strictly positive weight are drawn.
Collection F = {a, b, c, d}
a = TACGA
b = ACCC
c =
d = GACA
[Figure: overlap multigraph for F with the paths P_1 = dbc and P_2 = abcd; labels as transcribed: GAGACC, TACGACCAGA.]

Example: Overlap Multigraph
t-overlap: suffix(a, t) = prefix(b, t)
Idea: why don't we search for a path in the overlap multigraph that covers all the vertices?
[Figure: overlap multigraph on the fragments TACGG, GGACAG and GCCC.]
Example: Overlap Multigraph
Fragments F_1, F_2, F_3, F_4, F_5 (sequences given in the figure).
Reconstruct the target DNA sequence from the given fragments.
[Figure: ACCGCATGACCACTA]

Shortest Superstrings As Paths
A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b. In what follows, OM(F) denotes the overlap multigraph of F and S(P) the superstring spelled by a path P.
- Let F be a substring-free collection. Then for every common superstring S of F there is a Hamiltonian path P in OM(F) such that S(P) is a subsequence of S.
- Let F be a substring-free collection. If S is a shortest common superstring of F, there is a Hamiltonian path P such that S = S(P).

The Overlap Graph
Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
- Goal: maximize the weight.
- Simplify the multigraph: consider only the heaviest edge between every pair of nodes and discard the other parallel edges.
- Call the new graph the overlap graph of F, denoted by OG(F).
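The correspondence between maximum-weight Hamiltonian paths in OG(F) and shortest superstrings can be checked on a small substring-free collection with an exhaustive search. The sketch below is illustrative only: the names `best_hamiltonian_path` and `spell` are hypothetical, the fragments are sample data, and the search is factorial in the number of fragments.

```python
from itertools import permutations


def overlap(a, b):
    """Heaviest edge from a to b: longest proper suffix of a equal to a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-t:] == b[:t]:
            return t
    return 0


def best_hamiltonian_path(fragments):
    """Try every vertex order and keep the one with the largest total overlap."""
    best_weight, best_path = -1, None
    for path in permutations(fragments):
        weight = sum(overlap(a, b) for a, b in zip(path, path[1:]))
        if weight > best_weight:
            best_weight, best_path = weight, list(path)
    return best_weight, best_path


def spell(path):
    """Superstring S(P) obtained by gluing consecutive fragments on their overlap."""
    superstring = path[0]
    for a, b in zip(path, path[1:]):
        superstring += b[overlap(a, b):]
    return superstring


weight, path = best_hamiltonian_path(["TACGA", "ACCC", "GACA"])
print(weight, path, spell(path))
# 3 ['TACGA', 'GACA', 'ACCC'] TACGACACCC
```

The length of the spelled superstring is the total fragment length minus the path weight (13 - 3 = 10 here), which is why maximizing the weight minimizes the superstring.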
The Greedy Algorithm
- Edges are processed in non-increasing order of weight.
- Repeatedly add the heaviest available edge, as long as it does not prevent the construction of a Hamiltonian path given the previously chosen edges.
- The procedure ends when exactly n-1 edges have been accepted, or when the accepted edges induce a connected subgraph.

Example: Greedy Algorithm Fails
F = {ATGC, GCC, TGCAT}
Order the edges by weight:
- (ATGC, TGCAT) = 3
- (ATGC, GCC) = 2
- (TGCAT, ATGC) = 2
The greedy algorithm first chooses (ATGC, TGCAT) = 3 and is then forced to select an edge of weight 0 to complete the path: (ATGC, TGCAT), (TGCAT, GCC), for a total weight of 3. The solution should instead be (TGCAT, ATGC) = 2 followed by (ATGC, GCC) = 2, for a total weight of 4.

Sequencing by Hybridization
A universal DNA array detects all the k-mers in a given DNA sample (red dots in the figure).
Source: Genome Sequence Assembly by Mihai Pop, TIGR.

Spectrum(T, l): the set of all (n - l + 1) l-mers in a string T of length n.
The order of the individual elements in Spectrum(T, l) does not matter.
Example: T = ATGCGTGGCA
Spectrum(T, 3) = {ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA}

The SBH Problem
Goal: reconstruct a string T from its l-mer composition.
Input: a set S, representing all l-mers from an (unknown) string T.
Output: a string T such that Spectrum(T, l) = S.

SBH: An Example (I)
S = {ACG, CGC, GCA, CAT, ATC}
[Figure: a DNA sample is hybridized to the array, yielding the spectrum for k = 3. Adapted from Shuai Cheng Li, CS482/682.]
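Returning to the greedy algorithm described earlier on this page, its failure on F = {ATGC, GCC, TGCAT} can be reproduced with a short sketch. The code is an illustrative reading of the three rules, not the author's implementation: the name `greedy_path` and the bookkeeping structures are assumptions, and which zero-weight edge completes the path depends on tie-breaking, so the path may differ from the one on the slide while the total weight is the same.

```python
from itertools import permutations


def overlap(a, b):
    """Longest proper suffix of a that equals a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-t:] == b[:t]:
            return t
    return 0


def greedy_path(fragments):
    """Accept edges by non-increasing weight while they can still extend to a Hamiltonian path."""
    edges = sorted(
        ((overlap(a, b), a, b) for a, b in permutations(fragments, 2)),
        key=lambda edge: -edge[0],
    )
    successor, has_predecessor = {}, set()
    chain = {f: f for f in fragments}  # representative of the chain each fragment is in

    def find(f):
        while chain[f] != f:
            f = chain[f]
        return f

    total = 0
    for weight, a, b in edges:
        # Reject edges that reuse a tail or head, or that would close a cycle.
        if a in successor or b in has_predecessor or find(a) == find(b):
            continue
        successor[a] = b
        has_predecessor.add(b)
        chain[find(b)] = find(a)
        total += weight
        if len(successor) == len(fragments) - 1:
            break
    start = next(f for f in fragments if f not in has_predecessor)
    path = [start]
    while path[-1] in successor:
        path.append(successor[path[-1]])
    return total, path


total, path = greedy_path(["ATGC", "GCC", "TGCAT"])
print(total)  # 3, whereas the best Hamiltonian path has weight 4
```

Once the weight-3 edge (ATGC, TGCAT) is accepted, both weight-2 edges are blocked: one shares its tail with the accepted edge and the other would close a cycle.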
SBH: An Example (II)
S = {ACG, CGC, GCA, CAT, ATC}
T is such that Spectrum(T, 3) = {ACG, CGC, GCA, CAT, ATC}. In other words, Spectrum(T, 3) = S.
[Figure: a DNA sample is hybridized to the array, yielding the spectrum for k = 3. Adapted from Shuai Cheng Li, CS482/682.]

Two Samples, One Spectrum
- Two samples may result in the same spectrum.
- More information is needed to construct a unique sequence.
[Figure: two different samples that produce the same spectrum. Adapted from Shuai Cheng Li, CS482/682.]

SBH and Eulerian Path
Given a spectrum S, draw a directed graph G where:
- each vertex represents a (k-1)-prefix or (k-1)-suffix of a k-mer in S,
- each edge is a k-mer from S, connecting the vertex for its (k-1)-prefix to the vertex for its (k-1)-suffix.
Find an Eulerian path of G and reconstruct the sequence from the path.
Example: Spectrum = {ACG, ATC, CAT, CGC, GCA}
Edges: ACG, ATC, CAT, CGC and GCA
Vertices: AC, CG, AT, TC, CA and GC
Adapted from Shuai Cheng Li, CS482/682.

Eulerian Path: An Example
Example: Spectrum = {ACG, ATC, CAT, CGC, GCA}
Draw the vertices: AC, AT, CA, CG, GC, TC (alphabetical order).

SBH and Eulerian Path (II)
Draw an edge from vertex AC to vertex CG: edge ACG.
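The spectrum and the graph built from it can be sketched as follows. The function names `spectrum` and `sbh_graph` are hypothetical, the spectrum is modeled as a plain set as on the slides (so repeated k-mers are not counted), and the graph is stored as an adjacency list keyed by (k-1)-mers.

```python
def spectrum(text, l):
    """Set of all l-mers of text; at most len(text) - l + 1 of them."""
    return {text[i:i + l] for i in range(len(text) - l + 1)}


def sbh_graph(kmers):
    """One edge per k-mer, from its (k-1)-prefix to its (k-1)-suffix."""
    graph = {}
    for kmer in sorted(kmers):
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
        graph.setdefault(kmer[1:], [])  # make sure pure sinks appear as vertices
    return graph


print(sorted(spectrum("ATGCGTGGCA", 3)))
# ['ATG', 'CGT', 'GCA', 'GCG', 'GGC', 'GTG', 'TGC', 'TGG']

for vertex, targets in sorted(sbh_graph({"ACG", "ATC", "CAT", "CGC", "GCA"}).items()):
    print(vertex, "->", targets)
# AC -> ['CG']
# AT -> ['TC']
# CA -> ['AT']
# CG -> ['GC']
# GC -> ['CA']
# TC -> []
```

The six vertices and five edges match the ones listed for the example spectrum.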
SBH and Eulerian Path (III, IV)
Spectrum = {ACG, ATC, CAT, CGC, GCA}
Draw the vertices: AC, AT, CA, CG, GC, TC (alphabetical order).
- Draw an edge from vertex AC to vertex CG: edge ACG.
- Draw an edge from vertex AT to vertex TC: edge ATC.
- Draw an edge from vertex CA to vertex AT: edge CAT.
- Draw an edge from vertex CG to vertex GC: edge CGC.
- Draw an edge from vertex GC to vertex CA: edge GCA.

SBH and Eulerian Path (V)
An Eulerian path is a path that visits each edge of the graph exactly once.
Eulerian path: AC -> CG -> GC -> CA -> AT -> TC
Sequence: ACGCATC
In general, multiple Eulerian paths are possible.

Uniqueness
Spectrum = {ATG, TGC, GCG, CGT, GTG, TGG, GGC, GCA}
[Figure: two Eulerian paths through the graph on the vertices AT, TG, GC, CA, CG, GT and GG.]
Both ATGCGTGGCA and ATGGCGTGCA have this spectrum.
Adapted from Shuai Cheng Li, CS482/682.

Challenges of SBH
- The solution may not be unique. For example, obtaining an Eulerian cycle instead of a path yields multiple solutions.
- The input data, the spectrum S, may contain errors. For example: false positives, false negatives, uncertain frequency of k-mers.
- Multiple parallel edges lead to ambiguous solutions.

Some Solutions
Several approaches were proposed to address these problems:
- Positional Eulerian Path (PEP) by Hannenhalli et al., 1996
- Positional Sequencing by Hybridization (PSBH): add extra information to the probes
- Interactive protocols by Skiena et al., 1995
- Gapped probes by Preparata et al., 2000 and Frieze et al., 1999
- Analog-Spectrum by Preparata, 2004
Note that we consider the simple case where the spectrum yields an Euler path.
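For the simple case in which the spectrum yields an Euler path, the reconstruction can be sketched with Hierholzer's algorithm. This is one standard way to find an Eulerian path, chosen here for illustration, and it is not claimed to be the method used in the slides. The names `eulerian_path` and `reconstruct` are hypothetical, and when several Eulerian paths exist the sketch returns just one of them.

```python
def eulerian_path(edges):
    """Hierholzer's algorithm on a directed multigraph given as (tail, head) pairs."""
    graph, balance = {}, {}
    for tail, head in edges:
        graph.setdefault(tail, []).append(head)
        graph.setdefault(head, [])
        balance[tail] = balance.get(tail, 0) + 1  # out-degree minus in-degree
        balance[head] = balance.get(head, 0) - 1
    # Start at the vertex with one extra outgoing edge, if there is one.
    start = next((v for v in graph if balance[v] == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("the graph has no Eulerian path")
    return path


def reconstruct(kmers):
    """Spell the string along an Eulerian path of the (k-1)-mer graph."""
    path = eulerian_path([(kmer[:-1], kmer[1:]) for kmer in kmers])
    return path[0] + "".join(vertex[-1] for vertex in path[1:])


print(reconstruct(["ACG", "ATC", "CAT", "CGC", "GCA"]))  # ACGCATC
```

On the spectrum from the Uniqueness slide the same function returns one of the two valid reconstructions, which illustrates why extra information is needed to pin down a unique sequence.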
More informationAs of August 15, 2008, GenBank contained bases from reported sequences. The search procedure should be
48 Bioinformatics I, WS 09-10, S. Henz (script by D. Huson) November 26, 2009 4 BLAST and BLAT Outline of the chapter: 1. Heuristics for the pairwise local alignment of two sequences 2. BLAST: search and
More informationParallel de novo Assembly of Complex (Meta) Genomes via HipMer
Parallel de novo Assembly of Complex (Meta) Genomes via HipMer Aydın Buluç Computational Research Division, LBNL May 23, 2016 Invited Talk at HiCOMB 2016 Outline and Acknowledgments Joint work (alphabetical)
More informationRead Mapping and Assembly
Statistical Bioinformatics: Read Mapping and Assembly Stefan Seemann seemann@rth.dk University of Copenhagen April 9th 2019 Why sequencing? Why sequencing? Which organism does the sample comes from? Assembling
More informationV1.0: Seth Gilbert, V1.1: Steven Halim August 30, Abstract. d(e), and we assume that the distance function is non-negative (i.e., d(x, y) 0).
CS4234: Optimisation Algorithms Lecture 4 TRAVELLING-SALESMAN-PROBLEM (4 variants) V1.0: Seth Gilbert, V1.1: Steven Halim August 30, 2016 Abstract The goal of the TRAVELLING-SALESMAN-PROBLEM is to find
More informationNew Implementation for the Multi-sequence All-Against-All Substring Matching Problem
New Implementation for the Multi-sequence All-Against-All Substring Matching Problem Oana Sandu Supervised by Ulrike Stege In collaboration with Chris Upton, Alex Thomo, and Marina Barsky University of
More informationGene expression & Clustering (Chapter 10)
Gene expression & Clustering (Chapter 10) Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species Dynamic programming Approximate pattern matching
More informationAppendix A. Example code output. Chapter 1. Chapter 3
Appendix A Example code output This is a compilation of output from selected examples. Some of these examples requires exernal input from e.g. STDIN, for such examples the interaction with the program
More informationShortest Path Algorithm
Shortest Path Algorithm C Works just fine on this graph. C Length of shortest path = Copyright 2005 DIMACS BioMath Connect Institute Robert Hochberg Dynamic Programming SP #1 Same Questions, Different
More informationLecture 5: Markov models
Master s course Bioinformatics Data Analysis and Tools Lecture 5: Markov models Centre for Integrative Bioinformatics Problem in biology Data and patterns are often not clear cut When we want to make a
More informationOn Universal Cycles of Labeled Graphs
On Universal Cycles of Labeled Graphs Greg Brockman Harvard University Cambridge, MA 02138 United States brockman@hcs.harvard.edu Bill Kay University of South Carolina Columbia, SC 29208 United States
More informationIntroduction to Bioinformatics Problem Set 3: Genome Sequencing
Introduction to Bioinformatics Problem Set 3: Genome Sequencing 1. Assemble a sequence with your bare hands! You are trying to determine the DNA sequence of a very (very) small plasmids, which you estimate
More informationAdjacent: Two distinct vertices u, v are adjacent if there is an edge with ends u, v. In this case we let uv denote such an edge.
1 Graph Basics What is a graph? Graph: a graph G consists of a set of vertices, denoted V (G), a set of edges, denoted E(G), and a relation called incidence so that each edge is incident with either one
More informationPart II. Graph Theory. Year
Part II Year 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2017 53 Paper 3, Section II 15H Define the Ramsey numbers R(s, t) for integers s, t 2. Show that R(s, t) exists for all s,
More informationSpecial course in Computer Science: Advanced Text Algorithms
Special course in Computer Science: Advanced Text Algorithms Lecture 8: Multiple alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg
More informationClustering Techniques
Clustering Techniques Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 16 Lopresti Fall 2007 Lecture 16-1 - Administrative notes Your final project / paper proposal is due on Friday,
More informationCS270 Combinatorial Algorithms & Data Structures Spring Lecture 19:
CS270 Combinatorial Algorithms & Data Structures Spring 2003 Lecture 19: 4.1.03 Lecturer: Satish Rao Scribes: Kevin Lacker and Bill Kramer Disclaimer: These notes have not been subjected to the usual scrutiny
More informationIn this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace.
5 Multiple Match Refinement and T-Coffee In this section we describe how to extend the match refinement to the multiple case and then use T-Coffee to heuristically compute a multiple trace. This exposition
More informationGraph and Digraph Glossary
1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose
More informationCombinatorial Pattern Matching. CS 466 Saurabh Sinha
Combinatorial Pattern Matching CS 466 Saurabh Sinha Genomic Repeats Example of repeats: ATGGTCTAGGTCCTAGTGGTC Motivation to find them: Genomic rearrangements are often associated with repeats Trace evolutionary
More informationLecture 3: February Local Alignment: The Smith-Waterman Algorithm
CSCI1820: Sequence Alignment Spring 2017 Lecture 3: February 7 Lecturer: Sorin Istrail Scribe: Pranavan Chanthrakumar Note: LaTeX template courtesy of UC Berkeley EECS dept. Notes are also adapted from
More information6 Anhang. 6.1 Transgene Su(var)3-9-Linien. P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,III 2 5 I,II,III,IV 3
6.1 Transgene Su(var)3-9-n P{GS.ry + hs(su(var)3-9)egfp} 1 I,II,III,IV 3 2I 3 3 I,II,III 3 4 I,II,II 5 I,II,III,IV 3 6 7 I,II,II 8 I,II,II 10 I,II 3 P{GS.ry + UAS(Su(var)3-9)EGFP} A AII 3 B P{GS.ry + (10.5kbSu(var)3-9EGFP)}
More informationDNA Sequence Assembly and Multiple Sequence Alignment by an Eulerian Path Approach
DNA Sequence Assembly and Multiple Sequence Alignment by an Eulerian Path Approach Yu Zhang Department of Mathematics University of Southern California Los Angeles, CA 90089-1113 Phone: 213-821-2231 yuzhang@usc.edu
More informationFastA & the chaining problem
FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,
More informationFastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:
FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem
More information6.00 Introduction to Computer Science and Programming Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.00 Introduction to Computer Science and Programming Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationSUPPLEMENTARY INFORMATION. Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells
SUPPLEMENTARY INFORMATION Systematic evaluation of CRISPR-Cas systems reveals design principles for genome editing in human cells Yuanming Wang 1,2,7, Kaiwen Ivy Liu 2,7, Norfala-Aliah Binte Sutrisnoh
More informationGraphs and trees come up everywhere. We can view the internet as a graph (in many ways) Web search views web pages as a graph
Graphs and Trees Graphs and trees come up everywhere. We can view the internet as a graph (in many ways) who is connected to whom Web search views web pages as a graph Who points to whom Niche graphs (Ecology):
More information1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998
7 Multiple Sequence Alignment The exposition was prepared by Clemens Gröpl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all
More informationSAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche
SAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche mkirsche@jhu.edu StringBio 2018 Outline Substring Search Problem Caching and Learned Data Structures Methods Results Ongoing work
More informationWalking with Euler through Ostpreußen and RNA
Walking with Euler through Ostpreußen and RNA Mark Muldoon February 4, 2010 Königsberg (1652) Kaliningrad (2007)? The Königsberg Bridge problem asks whether it is possible to walk around the old city in
More informationSCHOOL OF ENGINEERING & BUILT ENVIRONMENT. Mathematics. An Introduction to Graph Theory
SCHOOL OF ENGINEERING & BUILT ENVIRONMENT Mathematics An Introduction to Graph Theory. Introduction. Definitions.. Vertices and Edges... The Handshaking Lemma.. Connected Graphs... Cut-Points and Bridges.
More informationSVM Classification in -Arrays
SVM Classification in -Arrays SVM classification and validation of cancer tissue samples using microarray expression data Furey et al, 2000 Special Topics in Bioinformatics, SS10 A. Regl, 7055213 What
More informationSequence Design Problems in Discovery of Regulatory Elements
Sequence Design Problems in Discovery of Regulatory Elements Yaron Orenstein, Bonnie Berger and Ron Shamir Regulatory Genomics workshop Simons Institute March 10th, 2016, Berkeley, CA Differentially methylated
More informationDiscrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7
CS 70 Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 7 An Introduction to Graphs A few centuries ago, residents of the city of Königsberg, Prussia were interested in a certain problem.
More information24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, This lecture is based on the following papers, which are all recommended reading:
24 Grundlagen der Bioinformatik, SS 10, D. Huson, April 26, 2010 3 BLAST and FASTA This lecture is based on the following papers, which are all recommended reading: D.J. Lipman and W.R. Pearson, Rapid
More informationAnalysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths
Analysis of Biological Networks 1. Clustering 2. Random Walks 3. Finding paths Problem 1: Graph Clustering Finding dense subgraphs Applications Identification of novel pathways, complexes, other modules?
More information