Computational models for bioinformatics
Transcription
1 Computational models for bioinformatics: De-novo assembly and alignment-free measures. Michele Schimd, Department of Information Engineering (DEI), July 8th, 2015
2 Before starting at DEI...
After graduating in Computer Engineering here at unipd (October 2009) I applied for a Ph.D. at DEI and was admitted, but without a grant. In January 2010 I started working at Avanade Italy s.r.l. as a Junior Analyst in the R&D team, developing prototype applications for touch devices: Microsoft Surface Table, Microsoft Windows 8 touch, iPad.
5 Second attempt
Aurora Science: In September 2010 I started at the ACG group with an 11-month grant on the Aurora Science Project.
Bioinformatics: Collaborating with Paolo Fontana's group (part of the Aurora project) at IASMA (Istituto Agrario San Michele all'Adige), I started working on the de-novo assembly of DNA.
Gap filling: Specifically, their focus was on gap filling, where gaps between reconstructed fragments (contigs) of a genomic sequence need to be filled using new data (i.e., new reads).
8 De-novo assembly problem basics (joint work with Gianfranco Bilardi)
Statement (an attempt): Given a set $\mathcal{R} = \{r_1, r_2, \ldots, r_R\}$ of reads (i.e., strings), find a sequence $S$ that is the most likely sequencing source of the reads in $\mathcal{R}$.
Current approaches (highly distilled): Most approaches involve two steps:
  1. overlap the reads;
  2. reconstruct the sequence (resolve the overlaps).
Example: Overlap Layout Consensus (OLC)
  1. Find the best overlaps (e.g., test all $O(R^2)$ pairs).
  2. Layout the overlap relation on a directed graph.
  3. Find paths corresponding to consensus sequences.
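The overlap step of OLC can be illustrated with a minimal sketch that tests every ordered pair of reads for an exact suffix-prefix match. The function names, the `min_len` threshold and the toy reads are assumptions made for illustration; real assemblers tolerate errors and avoid the quadratic scan.

```python
from itertools import permutations


def longest_overlap(x: str, y: str, min_len: int = 1) -> int:
    """Length of the longest suffix of x that equals a prefix of y (exact match)."""
    for k in range(min(len(x), len(y)), min_len - 1, -1):
        if x[-k:] == y[:k]:
            return k
    return 0


def all_pairs_overlaps(reads: list[str], min_len: int = 3) -> dict[tuple[int, int], int]:
    """Test every ordered pair of distinct reads, i.e. O(R^2) pair checks."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        k = longest_overlap(reads[i], reads[j], min_len)
        if k:
            overlaps[(i, j)] = k  # edge i -> j of the overlap graph, weighted by k
    return overlaps


reads = ["ACGTAC", "TACGGA", "GGATTT"]  # hypothetical toy reads
print(all_pairs_overlaps(reads))
```

The resulting dictionary is the edge list of the directed overlap graph used in the layout step.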
10 Computational aspect of the assembly problem
Can we improve current approaches? Expose locality/parallelism from current approaches. Unfortunately, graph-based algorithms do not directly expose such properties.
State-of-the-art assemblers: In the late 1990s and early 2000s the community faced the challenges posed by Next Generation Sequencers (NGS). The algorithms developed focused on the characteristics of the data: short reads, error rate, massive sequencing. The resulting assemblers are effective but have little focus on computational aspects (meaning not enough to allow aggressive optimizations).
12 Developing a computational oriented assembler
Good news: use of well-known primitives (sorting); development of bounds.
Bad news: real data differs substantially from ideal behavior; lots of heuristics have been used; models may become complicated due to errors and mutations.
14 A first (draft) sort-based assembly algorithm
Overlap by sorting (simple case): Given $\mathcal{R} = \{r_1, r_2, \ldots, r_R\}$
  1. sort $\mathcal{R}$ in lexicographic order;
  2. scan the sorted list and find equivalence classes.
Finding the overlap: If $x$ and $y$ overlap (without errors), there exists a (maximal) $k$ such that
$\mathrm{suff}_k(x) = x_{|x|-k+1} x_{|x|-k+2} \cdots x_{|x|} = y_1 y_2 \cdots y_k = \mathrm{pref}_k(y)$.
For each $x$, create two copies (i.e., aliases) $(x, \mathrm{suff}_k(x))$ and $(x, \mathrm{pref}_k(x))$, and use the second component as the sorting key.
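A minimal sketch of the alias idea, under the simplifying assumption that the overlap length `k` is fixed in advance rather than maximal: every read contributes a suffix alias and a prefix alias, the aliases are sorted by key, and a scan of the sorted list pairs suffixes with equal prefixes. The function name and tuple layout are illustrative, not taken from the slides.

```python
from itertools import groupby


def overlaps_by_sorting(reads: list[str], k: int) -> list[tuple[int, int]]:
    """Pairs (i, j) such that the k-suffix of reads[i] equals the k-prefix of reads[j]."""
    aliases = []
    for idx, r in enumerate(reads):
        if len(r) >= k:
            aliases.append((r[-k:], "suff", idx))  # alias (x, suff_k(x))
            aliases.append((r[:k], "pref", idx))   # alias (x, pref_k(x))
    aliases.sort()  # lexicographic order, the key is the first tuple field

    pairs = []
    # Equal keys are now adjacent: each run of equal keys is one equivalence class.
    for _, group in groupby(aliases, key=lambda a: a[0]):
        members = list(group)
        suffs = [i for _, kind, i in members if kind == "suff"]
        prefs = [i for _, kind, i in members if kind == "pref"]
        pairs.extend((i, j) for i in suffs for j in prefs if i != j)
    return sorted(pairs)


print(overlaps_by_sorting(["ACGTAC", "TACGGA", "GGATTT"], k=3))
```

The work is dominated by the sort, which is the well-understood primitive the approach relies on.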
17 A more realistic picture
Mismatches (i.e., errors and mutations): Create $\varepsilon$ aliases forming a pairwise disjoint partition of $x$. Each matched alias guarantees a correspondence of its size (e.g., $m/\varepsilon$ if the partitions are even).
[Figure: read $x$ of length $m$ split into pieces a, b, c of length $m/\varepsilon$ each, compared with read $y$ split into pieces a, d, c.]
Refine the overlap (e.g., count the mismatches in the unmatched partitions).
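The partition idea can be sketched as follows, assuming for simplicity that the two strings are already aligned and of equal length. Pieces that match exactly each certify a correspondence of their own length, and the refinement step only counts mismatches inside the pieces that did not match. Both function names are hypothetical.

```python
def partition(x: str, eps: int) -> list[str]:
    """Split x into eps contiguous, pairwise disjoint pieces of (nearly) equal length."""
    m = len(x)
    bounds = [h * m // eps for h in range(eps + 1)]
    return [x[bounds[h]:bounds[h + 1]] for h in range(eps)]


def compare_aligned(x: str, y: str, eps: int) -> tuple[int, int]:
    """Return (number of matched pieces, mismatches counted in the unmatched pieces)."""
    matched = 0
    mismatches = 0
    for a, b in zip(partition(x, eps), partition(y, eps)):
        if a == b:
            matched += 1  # guarantees a correspondence of len(a) characters
        else:
            mismatches += sum(c != d for c, d in zip(a, b))  # refinement step
    return matched, mismatches


# One substitution in the middle piece: two pieces still match exactly.
print(compare_aligned("ACGTTGCAA", "ACGTAGCAA", eps=3))
```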
18 General Algorithm
Algorithm: Sort and overlap
  1. Partition each $x \in \mathcal{R}$ into $\varepsilon$ substrings.
  2. For each partition $x^{(h)}$ ($h = 1, \ldots, \varepsilon$), add the alias $(x, x^{(h)})$ to $L$.
  3. Sort $L$ using $x^{(h)}$ as the sorting key.
  4. Identify the equivalence classes from the sorted $L$.
  5. Refine the overlaps and create the contigs.
  6. Iterate, using the produced contigs as the new $\mathcal{R}$ and a larger $\varepsilon$ (i.e., smaller partitions), until no more contigs are created or $\varepsilon > \varepsilon_{\max}$.
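Steps 1 to 4 can be sketched in a few lines; refinement, contig creation and the outer iteration (steps 5 and 6) are left out. The sketch assumes equal-length pieces cut at fixed offsets, so it only detects overlaps whose shift is a multiple of the piece length; how the actual algorithm handles arbitrary shifts is not specified here. The name `equivalence_classes` and the tuple layout are illustrative.

```python
from itertools import groupby


def equivalence_classes(reads: list[str], eps: int) -> list[tuple[str, list[tuple[int, int]]]]:
    """Group aliases (read index, piece index) by identical piece content."""
    L = []
    for idx, x in enumerate(reads):
        m = len(x)
        for h in range(eps):                        # steps 1-2: partition and add aliases
            piece = x[h * m // eps:(h + 1) * m // eps]
            L.append((piece, idx, h))
    L.sort()                                        # step 3: sort on the piece (the key)

    classes = []
    for key, group in groupby(L, key=lambda a: a[0]):   # step 4: runs of equal keys
        members = [(idx, h) for _, idx, h in group]
        if len({idx for idx, _ in members}) > 1:    # keep classes spanning several reads
            classes.append((key, members))
    return classes


reads = ["ACGTTGCAA", "TTGCAAGGC", "CAAGGCTTT"]  # hypothetical toy reads
for key, members in equivalence_classes(reads, eps=3):
    print(key, members)
```

Each class is a candidate set for the refinement step, which is why the class sizes matter for the running time.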
21 Coarse analysis of the algorithm
Sorting: The sorting step requires $O(R \log R)$ comparisons; each comparison requires $O(m)$ operations.
Equivalence class identification: This step can be carried out in $O(R)$, but the average size of the classes, $n_e$, is important.
Overlap extension and refinement: For each equivalence class we must identify the truly overlapping reads. The all-against-all comparison requires $O(n_e^2)$ (hence the smaller $n_e$, the better).
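To see how the three terms compare, the following rough cost model plugs numbers into the bounds above. It is only a back-of-the-envelope sketch: constants are ignored, the function name and the returned keys are made up for illustration, and the class sizes are supplied by the caller.

```python
import math


def coarse_cost(num_items: int, m: int, class_sizes: list[int]) -> dict[str, float]:
    """Rough operation counts for sorting, class scan and all-against-all refinement."""
    sort_ops = num_items * math.log2(num_items) * m if num_items > 1 else 0.0
    scan_ops = float(num_items)                                   # one linear pass
    refine_pairs = float(sum(n * (n - 1) // 2 for n in class_sizes))  # pairs per class
    return {"sort": sort_ops, "scan": scan_ops, "refine": refine_pairs}


# A few large classes can dominate the cost even when sorting is cheap.
print(coarse_cost(num_items=8, m=10, class_sizes=[3, 2, 2]))
print(coarse_cost(num_items=8, m=10, class_sizes=[8]))
```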
22 Contig analysis I
Definitions: Let $N$ be the size of the genome to be reconstructed, $R$ the number of strings, and $m$ the length of the strings (assume $m$ constant), and define the coverage as $c = mR/N$.
Contig extension: Starting from one string, a contig is extended if the next read starts at a distance [...] $m$ ([...] depends on $\varepsilon$).
Uncovered probability: The probability that no read starts at a given position is $\rho_0 \approx 1 - \frac{c}{m}$.
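A small numeric illustration of these definitions, under the assumption (not stated on the slide) that each of the $R$ reads starts at a position chosen uniformly and independently among the $N$ genome positions. Under that assumption the probability that no read starts at a given position is $(1 - 1/N)^R$, which is close to $e^{-c/m}$ and, to first order, to $1 - c/m$. The genome size and read counts below are hypothetical.

```python
import math


def coverage(m: int, R: int, N: int) -> float:
    """Coverage c = m * R / N."""
    return m * R / N


def rho0_uniform(R: int, N: int) -> float:
    """P[no read starts at a given position] when read starts are uniform (assumption)."""
    return (1.0 - 1.0 / N) ** R


N, m, R = 1_000_000, 100, 50_000  # hypothetical genome size, read length, read count
c = coverage(m, R, N)
print("coverage c        =", c)
print("rho_0 (uniform)   =", rho0_uniform(R, N))
print("exp(-c/m)         =", math.exp(-c / m))
print("1 - c/m           =", 1 - c / m)
```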
23 Contig analysis II
A simple model: The probability of seeing an uncovered position is $P[\text{uncovered}] = \rho_0$. At each position, $n$ reads are added with probability $(1 - \rho_0)^n \rho_0$. The average number of reads $n_r$ in a contig is
$E[n_r] = \sum_{n=0}^{\infty} n (1 - \rho_0)^n \rho_0 = \frac{1 - \rho_0}{\rho_0}$.
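For completeness, the closed form follows from the standard power-series identity for $\sum_n n p^n$; the substitution $p = 1 - \rho_0$ is introduced here only for the derivation.

```latex
\begin{align*}
E[n_r] &= \sum_{n=0}^{\infty} n\,(1-\rho_0)^n \rho_0
        = \rho_0 \sum_{n=0}^{\infty} n\,p^n, \qquad p = 1-\rho_0,\\
\sum_{n=0}^{\infty} n\,p^n &= \frac{p}{(1-p)^2} \quad (|p| < 1),\\
E[n_r] &= \rho_0 \cdot \frac{1-\rho_0}{\rho_0^{2}} = \frac{1-\rho_0}{\rho_0}.
\end{align*}
```

For example, $\rho_0 = 0.05$ gives $E[n_r] = 0.95 / 0.05 = 19$ reads per contig on average.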
24 Contig analysis III
Average number of contigs: Suppose that we have $R_d < R$ distinct reads. The average number of contigs $n_c$ can be estimated by
$E[n_c] \approx \frac{R_d}{E[n_r]} = R_d \frac{\rho_0}{1 - \rho_0}$.
A more accurate result: Since $R_d$ is not infinite, the expected value $E[n_r]$ above should be
$E[n_r] = \sum_{n=0}^{R_d} n (1 - \rho_0)^n \rho_0$,
which gives a different result. The error from assuming the simple form is negligible for $R_d \gg 0$.
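The gap between the infinite and the truncated sum can be checked numerically. The sketch below evaluates both forms of $E[n_r]$ and the resulting contig estimate; the function names and the sample values of $\rho_0$ and $R_d$ are chosen only for illustration.

```python
def mean_reads_infinite(rho0: float) -> float:
    """Closed form of the infinite sum: (1 - rho0) / rho0."""
    return (1.0 - rho0) / rho0


def mean_reads_truncated(rho0: float, R_d: int) -> float:
    """Same sum, stopped at n = R_d because only R_d distinct reads exist."""
    return sum(n * (1.0 - rho0) ** n * rho0 for n in range(R_d + 1))


def expected_contigs(R_d: int, rho0: float) -> float:
    """Estimate E[n_c] ~ R_d / E[n_r] using the closed form."""
    return R_d / mean_reads_infinite(rho0)


rho0 = 0.05  # hypothetical value
for R_d in (10, 100, 1000):
    print(R_d, mean_reads_truncated(rho0, R_d), mean_reads_infinite(rho0))
print("E[n_c] for R_d = 1000:", expected_contigs(1000, rho0))
```

The truncated sum approaches the closed form quickly as $R_d$ grows, which is the sense in which the error is negligible for large $R_d$.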
25 Contig analysis IV
A (half-surprising) similar work: [Preparata 2013]. The paper by Franco P. Preparata, "On Contigs and Coverage", published in the Journal of Computational Biology, solves (almost) the same problem:
$E[n_c] = \frac{N}{E[L_c] + E[L_g]}$
where $E[L_c]$ is the average length of the contigs and $E[L_g]$ is the average length of the gaps between contigs. We are further developing our theory, first to match and then to extend the work of Preparata.
28 Wrapping up for de-novo assembly
Our goal: stochastic analysis of efficacy (number of contigs) and efficiency (size of equivalence classes); development of an assembler.
Current results: a promising sorting-based algorithm; preliminary bounds (e.g., contigs).
Up next: size of the equivalence classes $n_e$; complexity analysis $T(N, R, m, \varepsilon)$; associated trade-offs.
31 Alignment-free measures (joint work with Matteo Comin)
Background: When I started working on de-novo assembly we focused on quality scores (QVs), which are numerical values assigned to each nucleobase produced by modern sequencers.
QVs and assembly: The idea was to use QVs to solve de-novo assembly; this task led to a stochastic model of sequencing (i.e., my Ph.D. thesis) with applications to alignment-free measures.
Alignment-free measures: Alignment-free measures are statistics used to assess the (dis)similarity between sequences without using alignment (i.e., overlap) primitives.
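32 Where are we now?
Our measure $D_2^q$:
Use an IID model of errors to compute $P_C(w) = \prod_{l=1}^{|w|} \left(1 - 10^{-q_l/10}\right)$.
For each observed k-mer (i.e., a k-long string) $w$, compute $X_w^q = \sum_{r \in \mathcal{R}} \sum_{w \in r} P_C(w)$.
Finally, compute a statistical measure: $D_2^q = \sum_{w \in \Sigma^k} X_w^q Y_w^q$, i.e., the inner product of the vectors $(X_w^q)_w$ and $(Y_w^q)_w$, where $Y_w^q$ is the same quantity computed on the second set of reads.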
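A minimal sketch of a quality-weighted $D_2$-style statistic, assuming Phred-scaled quality values (a base with quality $q$ is wrong with probability $10^{-q/10}$) and independent errors. Each k-mer occurrence contributes the probability that all of its bases are correct, and the statistic is the inner product of the two weighted count vectors. The data layout (a read as a pair of sequence and quality list) and all function names are illustrative, not the authors' implementation.

```python
from collections import defaultdict


def prob_correct(quals: list[int]) -> float:
    """Probability that every base of a k-mer is correct, under independent errors."""
    p = 1.0
    for q in quals:
        p *= 1.0 - 10.0 ** (-q / 10.0)
    return p


def weighted_kmer_counts(reads: list[tuple[str, list[int]]], k: int) -> dict[str, float]:
    """Sum, over all occurrences of each k-mer, of the probability that it is correct."""
    X = defaultdict(float)
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            X[seq[i:i + k]] += prob_correct(quals[i:i + k])
    return X


def d2q(X: dict[str, float], Y: dict[str, float]) -> float:
    """Inner product of the two weighted count vectors over the shared k-mers."""
    return sum(X[w] * Y[w] for w in X.keys() & Y.keys())


set_a = [("ACG", [10, 10, 10])]  # hypothetical reads with their quality values
set_b = [("ACG", [20, 20, 20])]
print(d2q(weighted_kmer_counts(set_a, 2), weighted_kmer_counts(set_b, 2)))
```

Only k-mers present in both sets contribute, so restricting the sum to the shared keys gives the same value as summing over the whole alphabet $\Sigma^k$.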
34 Results and future extensions
So far... These measures give good results on the clustering of reads (e.g., identification of reads from different species). Recent preliminary results indicate that phylogenetic tree reconstruction can benefit from them.
... what's next: We believe that $D_2^q$ measures can be used to improve compression, overlap in our sort-based assembler, ... ideas?
35 Thank you
Questions and discussion. Further... Coffee break, ACG Lab, Room 416 DEI/G.
More informationCLC Server. End User USER MANUAL
CLC Server End User USER MANUAL Manual for CLC Server 10.0.1 Windows, macos and Linux March 8, 2018 This software is for research purposes only. QIAGEN Aarhus Silkeborgvej 2 Prismet DK-8000 Aarhus C Denmark
More informationAnalysis of parallel suffix tree construction
168 Analysis of parallel suffix tree construction Malvika Singh 1 1 (Computer Science, Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, Gujarat, India. Email: malvikasingh2k@gmail.com)
More informationMacVector for Mac OS X
MacVector 11.0.4 for Mac OS X System Requirements MacVector 11 runs on any PowerPC or Intel Macintosh running Mac OS X 10.4 or higher. It is a Universal Binary, meaning that it runs natively on both PowerPC
More informationShotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.
Shotgun sequencing Genome (unknown) Reads (randomly chosen; have errors) Coverage is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 just draw a line
More informationError Correction in Next Generation DNA Sequencing Data
Western University Scholarship@Western Electronic Thesis and Dissertation Repository December 2012 Error Correction in Next Generation DNA Sequencing Data Michael Z. Molnar The University of Western Ontario
More information10/8/13 Comp 555 Fall
10/8/13 Comp 555 Fall 2013 1 Find a tour crossing every bridge just once Leonhard Euler, 1735 Bridges of Königsberg 10/8/13 Comp 555 Fall 2013 2 Find a cycle that visits every edge exactly once Linear
More informationReducing Genome Assembly Complexity with Optical Maps
Reducing Genome Assembly Complexity with Optical Maps AMSC 663 Mid-Year Progress Report 12/13/2011 Lee Mendelowitz Lmendelo@math.umd.edu Advisor: Mihai Pop mpop@umiacs.umd.edu Computer Science Department
More informationStudy of Data Localities in Suffix-Tree Based Genetic Algorithms
Study of Data Localities in Suffix-Tree Based Genetic Algorithms Carl I. Bergenhem, Michael T. Smith Abstract. This paper focuses on the study of cache localities of two genetic algorithms based on the
More informationNetworked Access to Library Resources
Institute of Museum and Library Services National Leadership Grant Realizing the Vision of Networked Access to Library Resources An Applied Research and Demonstration Project to Establish and Operate a
More informationLecture 5: Multiple sequence alignment
Lecture 5: Multiple sequence alignment Introduction to Computational Biology Teresa Przytycka, PhD (with some additions by Martin Vingron) Why do we need multiple sequence alignment Pairwise sequence alignment
More informationThe Value of Mate-pairs for Repeat Resolution
The Value of Mate-pairs for Repeat Resolution An Analysis on Graphs Created From Short Reads Joshua Wetzel Department of Computer Science Rutgers University Camden in conjunction with CBCB at University
More informationCompares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA.
Compares a sequence of protein to another sequence or database of a protein, or a sequence of DNA to another sequence or library of DNA. Fasta is used to compare a protein or DNA sequence to all of the
More informationFrom Smith-Waterman to BLAST
From Smith-Waterman to BLAST Jeremy Buhler July 23, 2015 Smith-Waterman is the fundamental tool that we use to decide how similar two sequences are. Isn t that all that BLAST does? In principle, it is
More informationExercise 2: Browser-Based Annotation and RNA-Seq Data
Exercise 2: Browser-Based Annotation and RNA-Seq Data Jeremy Buhler July 24, 2018 This exercise continues your introduction to practical issues in comparative annotation. You ll be annotating genomic sequence
More informationDNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization
Eulerian & Hamiltonian Cycle Problems DNA Sequencing The Shortest Superstring & Traveling Salesman Problems Sequencing by Hybridization The Bridge Obsession Problem Find a tour crossing every bridge just
More informationDBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies
DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies Chengxi Ye 1, Christopher M. Hill 1, Shigang Wu 2, Jue Ruan 2, Zhanshan (Sam) Ma
More informationEvolution of Tandemly Repeated Sequences
University of Canterbury Department of Mathematics and Statistics Evolution of Tandemly Repeated Sequences A thesis submitted in partial fulfilment of the requirements of the Degree for Master of Science
More informationA THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS
A THEORETICAL ANALYSIS OF SCALABILITY OF THE PARALLEL GENOME ASSEMBLY ALGORITHMS Munib Ahmed, Ishfaq Ahmad Department of Computer Science and Engineering, University of Texas At Arlington, Arlington, Texas
More information1. R. Durbin, S. Eddy, A. Krogh und G. Mitchison: Biological sequence analysis, Cambridge, 1998
7 Multiple Sequence Alignment The exposition was prepared by Clemens GrÃP pl, based on earlier versions by Daniel Huson, Knut Reinert, and Gunnar Klau. It is based on the following sources, which are all
More informationMultiple Sequence Alignment (MSA)
I519 Introduction to Bioinformatics, Fall 2013 Multiple Sequence Alignment (MSA) Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Multiple sequence alignment (MSA) Generalize
More informationComputational Molecular Biology
Computational Molecular Biology Erwin M. Bakker Lecture 2 Materials used from R. Shamir [2] and H.J. Hoogeboom [4]. 1 Molecular Biology Sequences DNA A, T, C, G RNA A, U, C, G Protein A, R, D, N, C E,
More informationSpace Efficient Linear Time Construction of
Space Efficient Linear Time Construction of Suffix Arrays Pang Ko and Srinivas Aluru Dept. of Electrical and Computer Engineering 1 Laurence H. Baker Center for Bioinformatics and Biological Statistics
More informationPlacement and Motion Planning Algorithms for Robotic Sensing Systems
Placement and Motion Planning Algorithms for Robotic Sensing Systems Pratap Tokekar Ph.D. Thesis Defense Adviser: Prof. Volkan Isler UNIVERSITY OF MINNESOTA Driven to Discover ROBOTIC SENSOR NETWORKS http://rsn.cs.umn.edu/
More information