Read Mapping and Assembly
1 Statistical Bioinformatics: Read Mapping and Assembly Stefan Seemann University of Copenhagen April 9th 2019
2 Why sequencing?
3 Why sequencing? Which organism does the sample come from? Assembling the genome of an organism. Which region of the genome is transcribed? Variability (AS events) of mRNAs (ncRNAs) of the same gene. Expression level of the mRNA (ncRNA).
4 Next-generation sequencing
5 Workflow of RNA-seq
6 Some terminology
read - a long word that comes from an NGS machine
coverage - the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing - the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble
mate pair - a pair of reads from the two ends of the same insert fragment (the approximate distance between them is known)
contig - a contiguous sequence formed by several overlapping reads with no gaps
consensus sequence - the sequence derived from the multiple alignment of the reads in a contig
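The definition of coverage can be made concrete with a small sketch. The function name `mean_coverage` and the representation of reads as (start, length) pairs are assumptions made for this illustration; they do not come from the slides.

```python
def mean_coverage(reads, target_length):
    """Average number of reads covering a position of the target.

    reads: list of (start, length) tuples with 0-based start positions
    (a hypothetical representation chosen for this sketch).
    """
    depth = [0] * target_length
    for start, length in reads:
        # Count this read once at every position it covers.
        for pos in range(start, min(start + length, target_length)):
            depth[pos] += 1
    return sum(depth) / target_length


# Three reads of length 4 on a target of length 10.
toy_reads = [(0, 4), (2, 4), (6, 4)]
print(mean_coverage(toy_reads, 10))
```

When every read lies fully inside the target, this equals (number of reads x read length) / target length.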
7 Read Mapping Map a small sequence to a reference genome or catalog of sequences (reads, transcripts, ...). Often the very first step in the analysis of sequencing data.
8 First there were sequences...
11 Naive mapping Search for the query at each position in the reference genome. Reference: ACGTTACCGAATCGATCAAAGTCGA Query: GTTA (match found). m = query length, n = genome length. Time: O(mn)
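A minimal sketch of the naive search, using the reference and query from the slide. The function name `naive_search` is an assumption for illustration. The two nested loops are where the O(mn) running time comes from.

```python
def naive_search(reference, query):
    """Return all 0-based positions where query occurs exactly in reference."""
    n, m = len(reference), len(query)
    hits = []
    for i in range(n - m + 1):          # every candidate start position
        j = 0
        while j < m and reference[i + j] == query[j]:
            j += 1                      # compare character by character
        if j == m:                      # all m characters matched
            hits.append(i)
    return hits


print(naive_search("ACGTTACCGAATCGATCAAAGTCGA", "GTTA"))  # [2]
```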
12 Naive mapping Naive search against the human genome would take far too long: Illumina/Solexa sequencing technology produces millions of short reads. Mapping these reads to a 3.2 billion bp human genome is a challenge, and it gets far worse when we allow for indels and mismatches. Are optimal alignments still feasible?
14 Principles of mapping reads The need: Hundreds of millions of reads (or more). Adapt to handle the growing amount of data. Exploit technological development. Handle protocol development (as was the case with paired-end libraries). What to do? We are looking for nearly identical matches: overlap detection.
17 Overlap detection Banded Smith-Waterman: dynamic programming algorithm, guaranteed optimal but slow. BLAST: seed-and-extend for good matches to a DB. K-mer word counting: read mapping.
18 Principles of mapping reads Many sequenced reads are redundant! We do not need to search the entire genome anew each time.
19 Book analogy Do not search the entire book; search the book index instead. Gorodkin et al. Methods Mol Biol. 2014
20 Book analogy Do not search the entire book; search the book index instead. [Figure: index page of Jan Gorodkin and Walter L. Ruzzo (eds.), RNA Sequence, Structure, and Function: Computational and Bioinformatic Methods, Methods in Molecular Biology, vol. 1097, Springer Science+Business Media New York.] Gorodkin et al. Methods Mol Biol. 2014
21 k-mer A k-mer is a substring of length k. For example, for the sequence GGCGATTCATCG: 4-mers: GGCG, GCGA, CGAT, ... 3-mers: GGC, GCG, CGA, GAT, ... Schatz et al. Genome Res 2010
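The k-mer definition can be illustrated with a short sketch that lists all k-mers of the example sequence in order. The function name `kmers` is an assumption for illustration.

```python
def kmers(sequence, k):
    """Return all substrings of length k, in order of their start position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


seq = "GGCGATTCATCG"
print(kmers(seq, 4)[:3])  # ['GGCG', 'GCGA', 'CGAT']
print(kmers(seq, 3)[:4])  # ['GGC', 'GCG', 'CGA', 'GAT']
```

A sequence of length L contains L - k + 1 overlapping k-mers, so the example sequence has nine 4-mers.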
22 Read mapping through indexing Most computational time is spent on alignment. Solution: An index is a data structure that improves the speed of data retrieval operations at the cost of additional storage space to maintain the index data structure. Pros: Quick search for matches in an entire genome. Cons: The index structure of the entire genome takes a lot of memory. Will my laptop have the memory to run the alignment algorithm?
23 Challenges of read mapping In principle read mapping is to map an exact piece of sequence (a string) to the genome. Go in groups of 2-3 people and discuss (3 min) Why an exact mapping might not always be what we want. Which concerns might you have when mapping genomic sequence? Which concerns might you have when mapping transcriptomic data?
24 Indexing problems Flexibility and constraints: Errors versus natural variation: trade-off in error threshold. Computational efficiency (time / memory): allowed mismatches / unique mappings. Balance between speed, memory and reported mappings. Indexing is often used for seed matching. Indexing method choice is crucial!
25 Indexing techniques
26 Hash tables for mapping reads Using an index structure to store information. Each entry is a key-value pair: hf(key) = value. The key is a sequence motif (k-mer). The value is a motif match defined by a hash function hf. Example: the hash function defines whether a 6-mer (key) is in the reference sequence (value = 1) or not (value = 0).
Key     Value
TATAAT  1
GATACC  0
AGTCAT  0
TATAAA  1
TATGAT  1
CGTACT  0
TATACT  1
Blastn finds seeds of exact matches between the query and database sequences by using a hash table of all 11-mers (megablast: 28-mers).
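A minimal sketch of a k-mer hash index, using Python's built-in dict as the hash table. Unlike the 0/1 example on the slide, this variant stores the list of reference positions for each k-mer, which is an assumption made here so that a hit can serve directly as a seed. The names `build_kmer_index` and `lookup`, the toy reference and k = 4 are hypothetical choices.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def lookup(index, kmer):
    """Return seed positions for kmer, or an empty list if it is absent."""
    return index.get(kmer, [])


reference = "ACGTTACCGAATCGATCAAAGTCGA"
index = build_kmer_index(reference, 4)
print(lookup(index, "TCGA"))      # [11, 21]
print(lookup(index, "TATA"))      # []
print(int("GTTA" in index))       # 1, the presence/absence value from the slide
```

The index is built once per reference; each later lookup costs roughly constant time, at the price of keeping all k-mers in memory.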
27 Suffix arrays for mapping reads Even hashing can be too slow. Why map the same substring multiple times? Concept: Create a seed once, but keep track of all genomic locations. Employ seed extension. Employ a suffix array to keep track of all suffixes. A suffix of a string (DNA sequence) is any substring that starts at some position and runs to the end of the string. For example, AGTT is a suffix of ACTACCAGTT. The Burrows-Wheeler Transform (BWT) builds on suffix arrays and allows efficient data compression (e.g. used in Bowtie).
28 Suffix arrays for mapping reads
Build a suffix array for the reference sequence AGGAGC:
Reference suffixes    Lexicographically sorted suffixes
0: AGGAGC$            6: $
1: GGAGC$             3: AGC
2: GAGC$              0: AGGAGC
3: AGC$               5: C
4: GC$                2: GAGC
5: C$                 4: GC
6: $                  1: GGAGC
SA = [6,3,0,5,2,4,1]
$: the end-of-string symbol, lexicographically smaller than A.
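The construction can be reproduced with a deliberately simple sketch that sorts the start positions by the suffix beginning there. The function name `build_suffix_array` is an assumption; sorting full suffixes is fine for a toy string but far too slow and memory-hungry for a genome, where dedicated construction algorithms are used.

```python
def build_suffix_array(reference):
    """Return the suffix array of reference + '$' (naive construction)."""
    text = reference + "$"   # '$' sorts before A, C, G, T in ASCII
    return sorted(range(len(text)), key=lambda i: text[i:])


print(build_suffix_array("AGGAGC"))  # [6, 3, 0, 5, 2, 4, 1]
```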
29 Suffix arrays for mapping reads Binary search for perfect matches: m = query length, n = reference length. Time: O(m log(n)). After finding a perfect match, look in the adjacent cells of the suffix array for additional matches.
30 Suffix arrays for mapping reads
Search a suffix array (binary search for the query AG) for the reference sequence AGGAGC:
Reference suffixes    Lexicographically sorted suffixes
0: AGGAGC$            6: $
1: GGAGC$             3: AGC      <- matches query AG
2: GAGC$              0: AGGAGC   <- matches query AG
3: AGC$               5: C
4: GC$                2: GAGC
5: C$                 4: GC
6: $                  1: GGAGC
SA = [6,3,0,5,2,4,1]
$: the end-of-string symbol, lexicographically smaller than A.
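The search can be sketched as two binary searches that delimit the block of suffixes starting with the query; all matches sit in adjacent cells of the suffix array. The function names are assumptions for illustration, and the suffix array is built naively to keep the example self-contained.

```python
def build_suffix_array(text):
    """Naive suffix array of text (text is expected to end with '$')."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, query):
    """Return sorted start positions of query in text using binary search."""
    m = len(query)

    # Lower bound: first suffix whose first m characters are >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose first m characters are > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


text = "AGGAGC$"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "AG"))  # [0, 3]
```

Each of the O(log n) comparison steps inspects at most m characters, which gives the O(m log(n)) bound stated above.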
31 Principles of mapping reads Coping with gaps: Support of long gaps is computationally expensive, but sometimes required (longer reads have a higher chance of containing gaps). Splice junction mappers can be divided into: 1. Exon-first methods: TopHat (based on Bowtie2). 2. Seed-extend methods: Blat, STAR-aligner, Segemehl.
32 Strategies of gapped alignments Graphic credit: Garber et al. (2011) Nature Methods 8:
33 Time and memory requirements of read mappers [Time unit is minutes, and memory unit is GB] Fonseca et al. Bioinformatics 2012
34 Why assembling?
35 The assembly process Assembly is a hierarchical process: starting from individual reads, build high-confidence contigs, then incorporate the mate pairs to build scaffolds.
37 The assembly process Reference-based (comparative); de novo; combined methods. When to apply which assembly type?
38 Reference-based assembly Applications: Organisms with good reference genomes, except perhaps polyploid organisms. Pros: Can be run in parallel; contamination not a major issue; less coverage needed. Cons: Reference dependent; dependent on known splice sites; long introns may be predicted. Common software: Cufflinks.
39 De novo assembly Applications: Organisms with poor reference genomes or no reference genome; independent of correct splice sites or intron length. Pros: Identification of novel transcripts; no reference needed. Cons: Computationally intensive; requires high read depth; contamination a major issue. Common software: Trinity.
40 Assembly algorithm Different approaches: greedy algorithm; overlap graphs; De Bruijn graphs. Basic ideas: find all overlaps between reads; build a graph; simplify the graph (sequencing errors); traverse the graph to produce a consensus.
42 Greedy algorithm From the set of reads R, pick the two strings s_i and s_j with the largest overlap (breaking ties arbitrarily) and replace them with their merge. Stop when there is only one string left. What may go wrong?
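The greedy procedure can be sketched as follows. The function names `overlap` and `greedy_assemble`, and the three toy reads taken from the sequence AGGAGC, are assumptions for illustration. Overlaps are exact suffix-prefix matches, ties are broken by iteration order, and the read list is assumed to be non-empty.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads):
    """Repeatedly merge the pair with the largest overlap until one string is left."""
    reads = list(reads)
    while len(reads) > 1:
        best = (-1, 0, 1)                      # (overlap size, index i, index j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        merged = reads[i] + reads[j][size:]    # glue b onto a, dropping the overlap
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads[0]


print(greedy_assemble(["AGGA", "GAGC", "GGAG"]))  # AGGAGC
```

On this toy input the result matches the original sequence. One answer to "what may go wrong" is repeats: the locally largest overlap may join reads from different copies of a repeat, and a greedy choice is never revisited.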
43 Why use graphs for assembly? Say a sequencer produces d reads of length n in one sequencing run from a genome of length m: d reads n 100 nt m nt human Task: Glue overlapping reads together to recover biology. A combinatorial problem best solved by graph theory! NGS library Graph Genome Transcriptome
44 Overlap graph Each read alignment is a node; short overlaps (>5bp) are indicated by directed edges; transitive overlaps (longer overlaps) are shown as dotted edges.
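A minimal sketch of an overlap graph as an adjacency list, with reads as nodes and a directed edge wherever a suffix of one read exactly matches a prefix of another. The function names, the toy reads and the small `min_overlap` values are assumptions for illustration (the slide uses a threshold of more than 5 bp on real reads), and the reads are assumed to be distinct.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def build_overlap_graph(reads, min_overlap):
    """Return {read: [(successor read, overlap length), ...]}."""
    graph = {read: [] for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                size = overlap(a, b)
                if size >= min_overlap:
                    graph[a].append((b, size))
    return graph


reads = ["AGGA", "GGAG", "GAGC"]
print(build_overlap_graph(reads, 3))
# {'AGGA': [('GGAG', 3)], 'GGAG': [('GAGC', 3)], 'GAGC': []}
```

Lowering `min_overlap` to 2 adds, among others, the edge AGGA -> GAGC, which is already implied by the path AGGA -> GGAG -> GAGC. Comparing all read pairs is quadratic in the number of reads, which is why practical tools first shortlist candidate pairs, for example through shared k-mers.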
45 Overlap graph
46 Overlap graph
47 De Bruijn graph
48 De Bruijn graph
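A minimal sketch of De Bruijn graph construction from reads, assuming the common formulation in which every k-mer of a read becomes a directed edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name, the toy reads and k = 3 are assumptions for illustration; the figures on these slides may use a different convention.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Return {(k-1)-mer: set of successor (k-1)-mers} built from all k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix -> suffix
    return dict(graph)


reads = ["AGGA", "GGAG", "GAGC"]
graph = build_de_bruijn_graph(reads, 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
# AG -> ['GC', 'GG']
# GA -> ['AG']
# GG -> ['GA']
```

No pairwise read comparison is needed: reads that share a k-mer automatically share an edge. The path AG -> GG -> GA -> AG -> GC spells the original toy sequence AGGAGC.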
49 Evaluating an assembly Evaluating assembly quality is as important as the assembly itself! Assembly quality depends on 1 Coverage: low coverage is mathematically hopeless 2 Repeat composition: high repeat content is challenging 3 Read length: longer reads help resolve repeats 4 Error rate: errors reduce coverage and increase false positives
50 Repeats Over 50% of the human genome is repetitive! Repeats: the BIG problem. Long-range misassembly. Masking repeats: known repeats, high-depth regions. Paired-end sequencing can meet (some of) the need.
51 Summary HT sequencing produces millions of reads. Index structures improve speed. Tradeoff between speed and memory. Graph theory helps to solve the complexity in assembling. Sequencing of full-length RNA molecules by new technologies (PacBio, Nanopore) eliminates the need to fragment the RNA during library preparation: no more need for assembler tools (however, new tools are needed).
Manual of SOAPdenovo-Trans-v1.03 Yinlong Xie, 2013-07-19 Gengxiong Wu, 2013-07-19 Jingbo Tang, 2013-07-19 ********** Introduction SOAPdenovo-Trans is a de novo transcriptome assembler basing on the SOAPdenovo
More informationData Preprocessing : Next Generation Sequencing analysis CBS - DTU Next Generation Sequencing Analysis
Data Preprocessing 27626: Next Generation Sequencing analysis CBS - DTU Generalized NGS analysis Data size Application Assembly: Compare Raw Pre- specific: Question Alignment / samples / Answer? reads
More informationIntroduction and tutorial for SOAPdenovo. Xiaodong Fang Department of Science and BGI May, 2012
Introduction and tutorial for SOAPdenovo Xiaodong Fang fangxd@genomics.org.cn Department of Science and Technology @ BGI May, 2012 Why de novo assembly? Genome is the genetic basis for different phenotypes
More informationGraph Algorithms in Bioinformatics
Graph Algorithms in Bioinformatics Computational Biology IST Ana Teresa Freitas 2015/2016 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics
More informationWilson Leung 05/27/2008 A Simple Introduction to NCBI BLAST
A Simple Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at http://www.ncbi.nih.gov/blast/
More informationRead Mapping. de Novo Assembly. Genomics: Lecture #2 WS 2014/2015
Mapping de Novo Assembly Institut für Medizinische Genetik und Humangenetik Charité Universitätsmedizin Berlin Genomics: Lecture #2 WS 2014/2015 Today Genome assembly: the basics Hamiltonian and Eulerian
More informationABSTRACT MST BASED AB INITIO ASSEMBLER OF EXPRESSED SEQUENCE TAGS. By Yuan Zhang
ABSTRACT MST BASED AB INITIO ASSEMBLER OF EXPRESSED SEQUENCE TAGS By Yuan Zhang In the thesis we present a new algorithm for the assembly of ESTs based on a minimum spanning tree construction. By representing
More informationComputational models for bionformatics
Computational models for bionformatics De-novo assembly and alignment-free measures Michele Schimd Department of Information Engineering July 8th, 2015 Michele Schimd (DEI) PostDoc @ DEI July 8th, 2015
More information(for more info see:
Genome assembly (for more info see: http://www.cbcb.umd.edu/research/assembly_primer.shtml) Introduction Sequencing technologies can only "read" short fragments from a genome. Reconstructing the entire
More informationHigh-throughout sequencing and using short-read aligners. Simon Anders
High-throughout sequencing and using short-read aligners Simon Anders High-throughput sequencing (HTS) Sequencing millions of short DNA fragments in parallel. a.k.a.: next-generation sequencing (NGS) massively-parallel
More informationDarwin: A Hardware-acceleration Framework for Genomic Sequence Alignment
Darwin: A Hardware-acceleration Framework for Genomic Sequence Alignment Yatish Turakhia EE PhD candidate Stanford University Prof. Bill Dally (Electrical Engineering and Computer Science) Prof. Gill Bejerano
More informationTaller práctico sobre uso, manejo y gestión de recursos genómicos de abril de 2013 Assembling long-read Transcriptomics
Taller práctico sobre uso, manejo y gestión de recursos genómicos 22-24 de abril de 2013 Assembling long-read Transcriptomics Rocío Bautista Outline Introduction How assembly Tools assembling long-read
More informationRNA-Seq in Galaxy: Tuxedo protocol. Igor Makunin, UQ RCC, QCIF
RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF Acknowledgments Genomics Virtual Lab: gvl.org.au Galaxy for tutorials: galaxy-tut.genome.edu.au Galaxy Australia: galaxy-aust.genome.edu.au
More informationITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013
ITMO Ecole de Bioinformatique Hands-on session: smallrna-seq N. Servant 21 rd November 2013 1. Data and objectives We will use the data from GEO (GSE35368, Toedling, Servant et al. 2011). Two samples were
More informationNGS NEXT GENERATION SEQUENCING
NGS NEXT GENERATION SEQUENCING Paestum (Sa) 15-16 -17 maggio 2014 Relatore Dr Cataldo Senatore Dr.ssa Emilia Vaccaro Sanger Sequencing Reactions For given template DNA, it s like PCR except: Uses only
More informationMeraculous De Novo Assembly of the Ariolimax dolichophallus Genome. Charles Cole, Jake Houser, Kyle McGovern, and Jennie Richardson
Meraculous De Novo Assembly of the Ariolimax dolichophallus Genome Charles Cole, Jake Houser, Kyle McGovern, and Jennie Richardson Meraculous Assembler Published by the US Department of Energy Joint Genome
More informationAccelrys Pipeline Pilot and HP ProLiant servers
Accelrys Pipeline Pilot and HP ProLiant servers A performance overview Technical white paper Table of contents Introduction... 2 Accelrys Pipeline Pilot benchmarks on HP ProLiant servers... 2 NGS Collection
More informationNGS Data and Sequence Alignment
Applications and Servers SERVER/REMOTE Compute DB WEB Data files NGS Data and Sequence Alignment SSH WEB SCP Manpreet S. Katari App Aug 11, 2016 Service Terminal IGV Data files Window Personal Computer/Local
More informationExon Probeset Annotations and Transcript Cluster Groupings
Exon Probeset Annotations and Transcript Cluster Groupings I. Introduction This whitepaper covers the procedure used to group and annotate probesets. Appropriate grouping of probesets into transcript clusters
More informationAlgorithms in Bioinformatics: A Practical Introduction. Database Search
Algorithms in Bioinformatics: A Practical Introduction Database Search Biological databases Biological data is double in size every 15 or 16 months Increasing in number of queries: 40,000 queries per day
More information7.36/7.91 recitation. DG Lectures 5 & 6 2/26/14
7.36/7.91 recitation DG Lectures 5 & 6 2/26/14 1 Announcements project specific aims due in a little more than a week (March 7) Pset #2 due March 13, start early! Today: library complexity BWT and read
More informationShotgun sequencing. Coverage is simply the average number of reads that overlap each true base in genome.
Shotgun sequencing Genome (unknown) Reads (randomly chosen; have errors) Coverage is simply the average number of reads that overlap each true base in genome. Here, the coverage is ~10 just draw a line
More informationSequence Assembly Required!
Sequence Assembly Required! 1 October 3, ISMB 20172007 1 Sequence Assembly Genome Sequenced Fragments (reads) Assembled Contigs Finished Genome 2 Greedy solution is bounded 3 Typical assembly strategy
More informationT-IDBA: A de novo Iterative de Bruijn Graph Assembler for Transcriptome
T-IDBA: A de novo Iterative de Bruin Graph Assembler for Transcriptome Yu Peng, Henry C.M. Leung, S.M. Yiu, Francis Y.L. Chin Department of Computer Science, The University of Hong Kong Pokfulam Road,
More informationPreliminary Syllabus. Genomics. Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification
Preliminary Syllabus Sep 30 Oct 2 Oct 7 Oct 9 Oct 14 Oct 16 Oct 21 Oct 25 Oct 28 Nov 4 Nov 8 Introduction & Genome Assembly Sequence Comparison Gene Modeling Gene Function Identification OCTOBER BREAK
More informationParallel de novo Assembly of Complex (Meta) Genomes via HipMer
Parallel de novo Assembly of Complex (Meta) Genomes via HipMer Aydın Buluç Computational Research Division, LBNL May 23, 2016 Invited Talk at HiCOMB 2016 Outline and Acknowledgments Joint work (alphabetical)
More information