Compressed Full-Text Indexes for Highly Repetitive Collections. Lectio praecursoria Jouni Sirén
|
|
- Claribel Barton
- 5 years ago
- Views:
Transcription
1 Compressed Full-Text Indexes for Highly Repetitive Collections Lectio praecursoria Jouni Sirén
2 ALGORITHM
3 Grossi, Vitter: Compressed suffix arrays and suffix trees with Navarro, Mäkinen: Compressed full-text indexes Ferragina, Manzini: Indexing compressed text Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: New text indexing functionalities of the compressed suffix Burrows, Wheeler: A block sorting lossless data compression algorithm Sadakane: Compressed suffix trees with full functionality Manber, Myers: Suffix arrays: A new method for on-line string searches Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
4 Are there papers with Sadakane as the first author? Grossi, Vitter: Compressed suffix arrays and suffix trees with Navarro, Mäkinen: Compressed full-text indexes Ferragina, Manzini: Indexing compressed text Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: New text indexing functionalities of the compressed suffix Burrows, Wheeler: A block sorting lossless data compression algorithm Sadakane: Compressed suffix trees with full functionality Manber, Myers: Suffix arrays: A new method for on-line string searches Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
5 How many papers have Sadakane as the first author? Grossi, Vitter: Compressed suffix arrays and suffix trees with Navarro, Mäkinen: Compressed full-text indexes Ferragina, Manzini: Indexing compressed text Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: New text indexing functionalities of the compressed suffix Burrows, Wheeler: A block sorting lossless data compression algorithm Sadakane: Compressed suffix trees with full functionality Manber, Myers: Suffix arrays: A new method for on-line string searches Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
6 What are the papers with Sadakane as the first author? Grossi, Vitter: Compressed suffix arrays and suffix trees with Navarro, Mäkinen: Compressed full-text indexes Ferragina, Manzini: Indexing compressed text Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: New text indexing functionalities of the compressed suffix Burrows, Wheeler: A block sorting lossless data compression algorithm Sadakane: Compressed suffix trees with full functionality Manber, Myers: Suffix arrays: A new method for on-line string searches Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
7 Burrows, Wheeler: A block sorting lossless data compression algorithm Ferragina, Manzini: Indexing compressed text Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of Grossi, Vitter: Compressed suffix arrays and suffix trees with Manber, Myers: Suffix arrays: A new method for on-line string searches Navarro, Mäkinen: Compressed full-text indexes Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: Compressed suffix trees with full functionality Sadakane: New text indexing functionalities of the compressed suffix
8 Burrows, Wheeler: A block sorting lossless data compression algorithm Ferragina, Manzini: Indexing compressed text Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of Grossi, Vitter: Compressed suffix arrays and suffix trees with Manber, Myers: Suffix arrays: A new method for on-line string searches Navarro, Mäkinen: Compressed full-text indexes Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: Compressed suffix trees with full functionality Sadakane: New text indexing functionalities of the compressed suffix
9 Burrows, Wheeler: A block sorting lossless data compression algorithm Ferragina, Manzini: Indexing compressed text Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of Grossi, Vitter: Compressed suffix arrays and suffix trees with Manber, Myers: Suffix arrays: A new method for on-line string searches Navarro, Mäkinen: Compressed full-text indexes Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: Compressed suffix trees with full functionality Sadakane: New text indexing functionalities of the compressed suffix
10 Burrows, Wheeler: A block sorting lossless data compression algorithm Ferragina, Manzini: Indexing compressed text Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of Grossi, Vitter: Compressed suffix arrays and suffix trees with Manber, Myers: Suffix arrays: A new method for on-line string searches Navarro, Mäkinen: Compressed full-text indexes Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: Compressed suffix trees with full functionality Sadakane: New text indexing functionalities of the compressed suffix
11 Burrows, Wheeler: A block sorting lossless data compression algorithm Ferragina, Manzini: Indexing compressed text Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of Grossi, Vitter: Compressed suffix arrays and suffix trees with Manber, Myers: Suffix arrays: A new method for on-line string searches Navarro, Mäkinen: Compressed full-text indexes Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: Compressed suffix trees with full functionality Sadakane: New text indexing functionalities of the compressed suffix
12 Burrows, Wheeler: A block sorting lossless data compression algorithm Ferragina, Manzini: Indexing compressed text Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of Grossi, Vitter: Compressed suffix arrays and suffix trees with Manber, Myers: Suffix arrays: A new method for on-line string searches Navarro, Mäkinen: Compressed full-text indexes Raman, Raman, Rao: Succinct indexable dictionaries with applications Sadakane: Compressed suffix trees with full functionality Sadakane: New text indexing functionalities of the compressed suffix
13 DATA STRUCTURE
14 What if we have to preserve the original order of the records? We may want even faster queries. Perhaps there are too many records to fit into memory. Then we probably need another data structure.
15 INDEX
16 Grossi, Vitter: Compressed suffix arrays and suffix trees with Burrows Navarro, Mäkinen: Compressed full-text indexes Ferragina Ferragina, Manzini: Indexing compressed text Grossi Raman, Raman, Rao: Succinct indexable dictionaries with applications Manber Sadakane: New text indexing functionalities of the compressed suffix Navarro Burrows, Wheeler: A block sorting lossless data compression algorithm Raman Sadakane: Compressed suffix trees with full functionality Sadakane Manber, Myers: Suffix arrays: A new method for on-line string searches Ferragina, Manzini, Mäkinen, Navarro: Compressed representations of
17 FULL-TEXT INDEX
18 Suffix Array G A C G T A C T G $ $ A C G T A C T G $ A C T G $ C G T A C T G $ C T G $ G $ G A C G T A C T G $ G T A C T G $ T A C T G $ T G $
19 Suffix Array G A C G T A C T G $
20 While a character takes 1 byte, each pointer requires 4 or 8 bytes. Suffix array usually requires 5 or 9 times more space than the text. We need something smaller to handle large texts.
21 COMPRESSED INDEX
22 Ferragina, Manzini 2000, 2005: FM-index Grossi, Vitter 2000, 2005: Compressed Suffix Array Use Burrows-Wheeler transform to simulate the suffix array. Compresses to 40% to 80% of text size. Yet some data should compress better.
23 HIGHLY REPETITIVE DATA
24 Individual Genomes G A C G T A - C T G C A G A T G - T A A T G C G A C G T A - C T G C A G A T G C T A A T C C G A C G T A G C A G A T G C T A A T G C G A C G T A - C T G C A G - T G C T A A T G C G A C G T A G C A G A T G C T A A T C C G A C G T A - C T G C T G A T G C T A A T G C G A C G T A C C T G C A G A T G C T A A T G C G A C G T A C C T G C A G - T G C T A A T G C G A C G T A - C T G C T G A T G C T A A T G C G A C G T A - C T G C A G A T G C T A A T C C
25 Version History dhcp-eduroam-hy :thesis jltsiren$ svn diff -r 662 thesis.tex Index: thesis.tex =================================================================== --- thesis.tex (revision 662) +++ thesis.tex (working copy) -23,7 +23,7 \isbnpdf{ } \issn{ } \printhouse{unigrafia} -\pubpages{ } % FIXME +\pubpages{ } \supervisorlist{veli Mäkinen, University of Helsinki, Finland} \preexaminera{kunihiko Sadakane, National Institute of Informatics, Japan} \preexaminerb{jorma Tarhio, Aalto University, Finland}
26 Finnish language Wikipedia with full version history 42 gigabytes Suffix array construction 378 gigabytes Run-length compressed suffix array 4.4 gigabytes Do we have 378 gigabytes of memory?
27 INDEX CONSTRUCTION
28 Suffix Array Direct Construction Data Construction RLCSA
29 Suffix Array Direct Construction Data Construction RLCSA
30 INDEXING AUTOMATA
31 # G A C G T A C C T G $ T ACTA A CTA C ACTG A # GA ACG CG GT TA ACC CC CTG TG$ G$ $ # G A C G T A C C T G $ AT A TGT T AG A $ ACC ACG ACTA ACTG AG AT CC CG CTA CTG G$ GA GT TA TG$ TGT # BWT G T G G T T G A A A AC AT # CT CG C A $ Edges
32 Compressed Full-Text Indexes for Highly Repetitive Collections
Sampled longest common prefix array. Sirén, Jouni. Springer 2010
https://helda.helsinki.fi Sampled longest common prefix array Sirén, Jouni Springer 2010 Sirén, J 2010, Sampled longest common prefix array. in A Amir & L Parida (eds), Combinatorial Pattern Matching :
More informationarxiv: v3 [cs.ds] 29 Jun 2010
Sampled Longest Common Prefix Array Jouni Sirén Department of Computer Science, University of Helsinki, Finland jltsiren@cs.helsinki.fi arxiv:1001.2101v3 [cs.ds] 29 Jun 2010 Abstract. When augmented with
More informationLempel-Ziv Compressed Full-Text Self-Indexes
Lempel-Ziv Compressed Full-Text Self-Indexes Diego G. Arroyuelo Billiardi Ph.D. Student, Departamento de Ciencias de la Computación Universidad de Chile darroyue@dcc.uchile.cl Advisor: Gonzalo Navarro
More informationDocument Retrieval on Repetitive Collections
Document Retrieval on Repetitive Collections Gonzalo Navarro 1, Simon J. Puglisi 2, and Jouni Sirén 1 1 Center for Biotechnology and Bioengineering, Department of Computer Science, University of Chile,
More informationFaster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler Transform
Faster Average Case Low Memory Semi-External Construction of the Burrows-Wheeler German Tischler The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus Hinxton, Cambridge, CB10 1SA, United Kingdom
More informationImproved and Extended Locating Functionality on Compressed Suffix Arrays
Improved and Extended Locating Functionality on Compressed Suffix Arrays Simon Gog and Gonzalo Navarro 2 Department of Computing and Information Systems, The University of Melbourne. simon.gog@unimelb.edu.au
More informationCompressed Indexes for Dynamic Text Collections
Compressed Indexes for Dynamic Text Collections HO-LEUNG CHAN University of Hong Kong and WING-KAI HON National Tsing Hua University and TAK-WAH LAM University of Hong Kong and KUNIHIKO SADAKANE Kyushu
More informationSmaller Self-Indexes for Natural Language
Smaller Self-Indexes for Natural Language Nieves R. Brisaboa 1, Gonzalo Navarro 2, and Alberto Ordóñez 1 1 Database Lab., Univ. of A Coruña, Spain. {brisaboa,alberto.ordonez}@udc.es 2 Dept. of Computer
More informationIndexing Graphs for Path Queries with Applications in Genome Research
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 1 Indexing Graphs for Path Queries with Applications in Genome Research Jouni Sirén, Niko Välimäki, Veli Mäkinen Abstract We propose a
More informationA Linear-Time Burrows-Wheeler Transform Using Induced Sorting
A Linear-Time Burrows-Wheeler Transform Using Induced Sorting Daisuke Okanohara 1 and Kunihiko Sadakane 2 1 Department of Computer Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0013,
More informationIndexing Graphs for Path Queries. Jouni Sirén Wellcome Trust Sanger Institute
Indexing raphs for Path Queries Jouni Sirén Wellcome rust Sanger Institute J. Sirén, N. Välimäki, V. Mäkinen: Indexing raphs for Path Queries with pplications in enome Research. WBI 20, BB 204.. Bowe,.
More informationLocally Compressed Suffix Arrays
Locally Compressed Suffix Arrays Rodrigo González, Gonzalo Navarro, and Héctor Ferrada Dept. of Computer Science, University of Chile. {rgonzale,gnavarro,hferrada}@dcc.uchile.cl We introduce a compression
More informationSpace-efficient Construction of LZ-index
Space-efficient Construction of LZ-index Diego Arroyuelo and Gonzalo Navarro Dept. of Computer Science, University of Chile, Blanco Encalada 212, Santiago, Chile. {darroyue,gnavarro}@dcc.uchile.cl Abstract.
More informationParallel Lightweight Wavelet Tree, Suffix Array and FM-Index Construction
Parallel Lightweight Wavelet Tree, Suffix Array and FM-Index Construction Julian Labeit Julian Shun Guy E. Blelloch Karlsruhe Institute of Technology UC Berkeley Carnegie Mellon University julianlabeit@gmail.com
More informationA Compressed Text Index on Secondary Memory
A Compressed Text Index on Secondary Memory Rodrigo González and Gonzalo Navarro Department of Computer Science, University of Chile. {rgonzale,gnavarro}@dcc.uchile.cl Abstract. We introduce a practical
More informationarxiv: v2 [cs.ds] 31 Aug 2012
Algorithmica manuscript No. (will be inserted by the editor) Faster Approximate Pattern Matching in Compressed Repetitive Texts Travis Gagie Pawe l Gawrychowski Christopher Hoobin Simon J. Puglisi the
More informationEngineering a Lightweight External Memory Su x Array Construction Algorithm
Engineering a Lightweight External Memory Su x Array Construction Algorithm Juha Kärkkäinen, Dominik Kempa Department of Computer Science, University of Helsinki, Finland {Juha.Karkkainen Dominik.Kempa}@cs.helsinki.fi
More informationEngineering a Compressed Suffix Tree Implementation
Engineering a Compressed Suffix Tree Implementation Niko Välimäki 1, Wolfgang Gerlach 2, Kashyap Dixit 3, and Veli Mäkinen 1 1 Department of Computer Science, University of Helsinki, Finland. {nvalimak,vmakinen}@cs.helsinki.fi
More informationScalable and Versatile k-mer Indexing for High-Throughput Sequencing Data
Scalable and Versatile k-mer Indexing for High-Throughput Sequencing Data Niko Välimäki Genome-Scale Biology Research Program, and Department of Medical Genetics, Faculty of Medicine, University of Helsinki,
More informationSAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche
SAPLING: Suffix Array Piecewise Linear INdex for Genomics Michael Kirsche mkirsche@jhu.edu StringBio 2018 Outline Substring Search Problem Caching and Learned Data Structures Methods Results Ongoing work
More informationEfficient compression of large repetitive strings
Efficient compression of large repetitive strings A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy Christopher Hoobin B.Eng (Hons.), B.App.Sci (Hons.), School
More informationPractical Rank/Select Queries over Arbitrary Sequences
Practical Rank/Select Queries over Arbitrary Sequences Francisco Claude and Gonzalo Navarro Department of Computer Science, Universidad de Chile {fclaude,gnavarro}@dcc.uchile.cl Abstract. We present a
More informationHISAT2. Fast and sensi0ve alignment against general human popula0on. Daehwan Kim
HISA2 Fast and sensi0ve alignment against general human popula0on Daehwan Kim infphilo@gmail.com History about BW, FM, XBW, GBW, and GFM BW (1994) BW for Linear path Burrows M, Wheeler DJ: A Block Sor0ng
More informationOptimal Succinctness for Range Minimum Queries
Optimal Succinctness for Range Minimum Queries Johannes Fischer Universität Tübingen, Center for Bioinformatics (ZBIT), Sand 14, D-72076 Tübingen fischer@informatik.uni-tuebingen.de Abstract. For a static
More informationInexact Sequence Mapping Study Cases: Hybrid GPU Computing and Memory Demanding Indexes
Inexact Sequence Mapping Study Cases: Hybrid GPU Computing and Memory Demanding Indexes José Salavert 1, Andrés Tomás 1, Ignacio Medina 2, and Ignacio Blanquer 1 1 GRyCAP department of I3M, Universitat
More informationImproved and Extended Locating Functionality on Compressed Suffix Arrays
Improved and Extended Locating Functionality on Compressed Suffix Arrays Simon Gog a, Gonzalo Navarro b, Matthias Petri c a Institute of Theoretical Informatics, Karlsruhe Institute of Technology, Germany
More informationResearch Article Using the Sadakane Compressed Suffix Tree to Solve the All-Pairs Suffix-Prefix Problem
Hindawi Publishing Corporation BioMed Research International Volume 2014, Article ID 745298, 11 pages http://dx.doi.org/10.1155/2014/745298 Research Article Using the Sadakane Compressed Suffix Tree to
More informationEngineering Rank and Select Queries on Wavelet Trees
Engineering Rank and Select Queries on Wavelet Trees Jan H. Knudsen, 292926 Roland L. Pedersen, 292817 Master s Thesis, Computer Science Advisor: Gerth Stølting Brodal June, 215 abracadrabra 1111 abacaaba
More informationHardware-Oriented Succinct-Data-Structure for Text Processing Based on Block-Size-Constrained Compression
International Journal of Computer Information Systems and Industrial Management Applications. ISSN 215-7988 Volume 8 (216) pp. 1 11 c MIR Labs, www.mirlabs.net/ijcisim/index.html Hardware-Oriented Succinct-Data-Structure
More informationarxiv: v1 [cs.ds] 3 Nov 2015
Burrows-Wheeler transform for terabases arxiv:.88v [cs.ds] 3 Nov Jouni Sirén Wellcome Trust Sanger Institute Wellcome Trust Genome ampus Hinxton, ambridge, B S, UK jouni.siren@iki.fi bstract In order to
More informationPermuted Longest-Common-Prefix Array
Permuted Longest-Common-Prefix Array Juha Kärkkäinen University of Helsinki Joint work with Giovanni Manzini and Simon Puglisi CPM 2009 Lille, France, June 2009 CPM 2009 p. 1 Outline 1. Background 2. Description
More informationSpace-efficient Algorithms for Document Retrieval
Space-efficient Algorithms for Document Retrieval Niko Välimäki and Veli Mäkinen Department of Computer Science, University of Helsinki, Finland. {nvalimak,vmakinen}@cs.helsinki.fi Abstract. We study the
More informationA Lempel-Ziv Text Index on Secondary Storage
A Lempel-Ziv Text Index on Secondary Storage Diego Arroyuelo and Gonzalo Navarro Dept. of Computer Science, University of Chile, Blanco Encalada 212, Santiago, Chile. {darroyue,gnavarro}@dcc.uchile.cl
More informationI519 Introduction to Bioinformatics. Indexing techniques. Yuzhen Ye School of Informatics & Computing, IUB
I519 Introduction to Bioinformatics Indexing techniques Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Contents We have seen indexing technique used in BLAST Applications that rely
More informationPractical Approaches to Reduce the Space Requirement of Lempel-Ziv-Based Compressed Text Indices
Practical Approaches to Reduce the Space Requirement of Lempel-Ziv-Based Compressed Text Indices DIEGO ARROYUELO University of Chile and GONZALO NAVARRO University of Chile Given a text T[1..u] over an
More informationCS : Data Structures Michael Schatz. Nov 14, 2016 Lecture 32: BWT
CS 600.226: Data Structures Michael Schatz Nov 14, 2016 Lecture 32: BWT HW8 Assignment 8: Competitive Spelling Bee Out on: November 2, 2018 Due by: November 9, 2018 before 10:00 pm Collaboration: None
More informationMapping Reads to Reference Genome
Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene
More informationApplications of Succinct Dynamic Compact Tries to Some String Problems
Applications of Succinct Dynamic Compact Tries to Some String Problems Takuya Takagi 1, Takashi Uemura 2, Shunsuke Inenaga 3, Kunihiko Sadakane 4, and Hiroki Arimura 1 1 IST & School of Engineering, Hokkaido
More informationSpace-Efficient Framework for Top-k String Retrieval Problems
Space-Efficient Framework for Top-k String Retrieval Problems Wing-Kai Hon Department of Computer Science National Tsing-Hua University, Taiwan Email: wkhon@cs.nthu.edu.tw Rahul Shah Department of Computer
More informationMEMORY-EFFICIENT STORAGE OF MULTIPLE GENOMES OF THE SAME SPECIES
UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING BACHELOR THESIS No. 4005 MEMORY-EFFICIENT STORAGE OF MULTIPLE GENOMES OF THE SAME SPECIES Vinko Kodžoman Zagreb, June 2015 Acknowledgements
More informationApplication of the BWT Method to Solve the Exact String Matching Problem
Application of the BWT Method to Solve the Exact String Matching Problem T. W. Chen and R. C. T. Lee Department of Computer Science National Tsing Hua University, Hsinchu, Taiwan chen81052084@gmail.com
More informationCS : Data Structures
CS 600.226: Data Structures Michael Schatz Nov 16, 2016 Lecture 32: Mike Week pt 2: BWT Assignment 9: Due Friday Nov 18 @ 10pm Remember: javac Xlint:all & checkstyle *.java & JUnit Solutions should be
More informationCache-efficient string sorting for Burrows-Wheeler Transform. Advait D. Karande Sriram Saroop
Cache-efficient string sorting for Burrows-Wheeler Transform Advait D. Karande Sriram Saroop What is Burrows-Wheeler Transform? A pre-processing step for data compression Involves sorting of all rotations
More informationAn Efficient, Versatile Approach to Suffix Sorting
An Efficient, Versatile Approach to Suffix Sorting 1.2 1.2 MICHAEL A. MANISCALCO Independent Researcher and SIMON J. PUGLISI Curtin University of Technology Sorting the suffixes of a string into lexicographical
More informationAutoíndices Comprimidos para Texto Basados en Lempel-Ziv
Universidad de Chile Facultad de Ciencias Físicas y Matemáticas Departamento de Ciencias de la Computación Autoíndices Comprimidos para Texto Basados en Lempel-Ziv Por Diego Arroyuelo Billiardi Tesis para
More informationCompressed Full-Text Indexes
Compressed Full-Text Indexes Gonzalo Navarro Center for Web Research, Department of Computer Science, University of Chile Veli Mäkinen Department of Computer Science, University of Helsinki, Finland Full-text
More informationDynamic Rank/Select Dictionaries with Applications to XML Indexing
Dynamic Rank/Select Dictionaries with Applications to XML Indexing Ankur Gupta Wing-Kai Hon Rahul Shah Jeffrey Scott Vitter Abstract We consider a central problem in text indexing: Given a text T over
More informationWhen Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications
When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications LUCA FOSCHINI Scuola Superiore Sant Anna, Pisa, Italy ROBERTO GROSSI Università di Pisa, Pisa, Italy ANKUR
More informationarxiv: v7 [cs.ds] 1 Apr 2015
arxiv:1407.8142v7 [cs.ds] 1 Apr 2015 Parallel Wavelet Tree Construction Julian Shun Carnegie Mellon University jshun@cs.cmu.edu We present parallel algorithms for wavelet tree construction with polylogarithmic
More informationColored Range Queries and Document Retrieval
Colored Range Queries and Document Retrieval Travis Gagie 1, Gonzalo Navarro 1, and Simon J. Puglisi 2 1 Dept. of Computer Science, Univt. of Chile, {tgagie,gnavarro}@dcc.uchile.cl 2 School of Computer
More informationUltra-succinct Representation of Ordered Trees
Ultra-succinct Representation of Ordered Trees Jesper Jansson Kunihiko Sadakane Wing-Kin Sung April 8, 2006 Abstract There exist two well-known succinct representations of ordered trees: BP (balanced parenthesis)
More informationCMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays
CMSC423: Bioinformatic Algorithms, Databases and Tools Exact string matching: Suffix trees Suffix arrays Searching multiple strings Can we search multiple strings at the same time? Would it help if we
More informationA Compressed Suffix Tree Based Implementation With Low Peak Memory Usage 1
Available online at www.sciencedirect.com Electronic Notes in Theoretical Computer Science 302 (2014) 73 94 www.elsevier.com/locate/entcs A Compressed Suffix Tree Based Implementation With Low Peak Memory
More informationSelf-Indexing Natural Language
Self-Indexing Natural Language Nieves R. Brisaboa 1, Antonio Fariña 1, Gonzalo Navarro 2, Angeles S. Places 1, and Eduardo Rodríguez 1 1 Database Lab. Univ. da Coruña, Spain. {brisaboa fari asplaces erodriguezl}@udc.es
More informationBMC Bioinformatics. Research article On finding minimal absent words Armando J Pinho*, Paulo JSG Ferreira, Sara P Garcia and João MOS Rodrigues
BMC Bioinformatics BioMed Central Research article On finding minimal absent words Armando J Pinho*, Paulo JSG Ferreira, Sara P Garcia and João MOS Rodrigues Open Access Address: Signal Processing Lab,
More informationOn enhancing variation detection through pan-genome indexing
Standard approach...t......t......t......acgatgctagtgcatgt......t......t......t... reference genome Variation graph reference SNP: A->T...ACGATGCTTGTGCATGT donor genome Can we boost variation detection
More informationDynamic FM-Index for a Collection of Texts with Application to Space-efficient Construction of the Compressed Suffix Array.
Technische Fakultät Diplomarbeit Wintersemester 2006/2007 Dynamic FM-Index for a Collection of Texts with Application to Space-efficient Construction of the Compressed Suffix Array Diplomarbeit im Fach
More informationDynamic Extended Suffix Arrays
Dynamic Extended Suffix Arrays M. Salson a,. Lecroq a M. Léonard a L. Mouchard a,b, a Université de Rouen, LIIS EA 408, 768 Mont-Saint-Aignan, France b Algorithm Design Group, Department of omputer Science,
More informationarxiv: v2 [cs.ds] 17 Nov 2016
EPR-dictionaries: A practical and fast data structure for constant time searches in unidirectional and bidirectional FM-indices Christopher Pockrandt, Marcel Ehrhardt, and Knut Reinert Department of Computer
More informationISTITUTO DI MATEMATICA COMPUTAZIONALE. An experimental study of a compressed index
ISTITUTO DI MATEMATICA COMPUTAZIONALE CNR, Area della Ricerca I-56100 Pisa, Italy Tel. +39 050 3152402 Fax +39 050 3152333 E-mail: imc@imc.pi.cnr.it URL: http://www.imc.pi.cnr.it An experimental study
More informationSuffix Array Construction
Suffix Array Construction Suffix array construction means simply sorting the set of all suffixes. Using standard sorting or string sorting the time complexity is Ω(DP (T [0..n] )). Another possibility
More informationData Compression. Guest lecture, SGDS Fall 2011
Data Compression Guest lecture, SGDS Fall 2011 1 Basics Lossy/lossless Alphabet compaction Compression is impossible Compression is possible RLE Variable-length codes Undecidable Pigeon-holes Patterns
More informationApproximate string matching using compressed suffix arrays
Theoretical Computer Science 352 (2006) 240 249 www.elsevier.com/locate/tcs Approximate string matching using compressed suffix arrays Trinh N.D. Huynh a, Wing-Kai Hon b, Tak-Wah Lam b, Wing-Kin Sung a,
More informationLinear Suffix Array Construction by Almost Pure Induced-Sorting
Linear Suffix Array Construction by Almost Pure Induced-Sorting Ge Nong Computer Science Department Sun Yat-Sen University Guangzhou 510275, P.R.C. issng@mail.sysu.edu.cn Sen Zhang Dept. of Math., Comp.
More informationCompressed Data Structures with Relevance
Compressed Data Structures with Relevance Jeff Vitter Provost and Executive Vice Chancellor The University of Kansas, USA (and collaborators Rahul Shah, Wing-Kai Hon, Roberto Grossi, Oğuzhan Külekci, Bojian
More informationSequence Assembly. BMI/CS 576 Mark Craven Some sequencing successes
Sequence Assembly BMI/CS 576 www.biostat.wisc.edu/bmi576/ Mark Craven craven@biostat.wisc.edu Some sequencing successes Yersinia pestis Cannabis sativa The sequencing problem We want to determine the identity
More informationCompressed String Dictionaries
Compressed String Dictionaries Nieves R. Brisaboa 1, Rodrigo Cánovas 2, Francisco Claude 3, Miguel A. Martínez-Prieto 2,4, and Gonzalo Navarro 2 1 Database Lab, Universidade da Coruña, Spain 2 Department
More informationarxiv: v1 [cs.ds] 15 Nov 2018
Vectorized Character Counting for Faster Pattern Matching Roman Snytsar Microsoft Corp., One Microsoft Way, Redmond WA 98052, USA Roman.Snytsar@microsoft.com Keywords: Parallel Processing, Vectorization,
More informationDictionary Compression
Image: ALCÁZAR (SEGOVIA, SPAIN) Dictionary Compression Reducing Symbolic Redundancy in RDF Antonio Fariña, Javier D. Fernández and Miguel A. Martinez-Prieto 3rd KEYSTONE Training School Keyword search
More informationSequencing. Short Read Alignment. Sequencing. Paired-End Sequencing 6/10/2010. Tobias Rausch 7 th June 2010 WGS. ChIP-Seq. Applied Biosystems.
Sequencing Short Alignment Tobias Rausch 7 th June 2010 WGS RNA-Seq Exon Capture ChIP-Seq Sequencing Paired-End Sequencing Target genome Fragments Roche GS FLX Titanium Illumina Applied Biosystems SOLiD
More informationCompressed Data Structures: Dictionaries and Data-Aware Measures
Compressed Data Structures: Dictionaries and Data-Aware Measures Ankur Gupta, Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter {agupta,wkhon,rahul,jsv}@cs.purdue.edu Department of Computer Sciences,
More informationParallel Lightweight Wavelet-Tree, Suffix-Array and FM-Index Construction
Parallel Lightweight Wavelet-Tree, Suffix-Array and FM-Index Construction Bachelor Thesis of Julian Labeit At the Department of Informatics Institute for Theoretical Computer Science Reviewer: Second reviewer:
More informationCS : Data Structures Michael Schatz. Nov 26, 2016 Lecture 34: Advanced Sorting
CS 600.226: Data Structures Michael Schatz Nov 26, 2016 Lecture 34: Advanced Sorting Assignment 9: StringOmics Out on: November 16, 2018 Due by: November 30, 2018 before 10:00 pm Collaboration: None Grading:
More informationUSING BRAT-BW Table 1. Feature comparison of BRAT-bw, BRAT-large, Bismark and BS Seeker (as of on March, 2012)
USING BRAT-BW-2.0.1 BRAT-bw is a tool for BS-seq reads mapping, i.e. mapping of bisulfite-treated sequenced reads. BRAT-bw is a part of BRAT s suit. Therefore, input and output formats for BRAT-bw are
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3
More informationLecture 17: Suffix Arrays and Burrows Wheeler Transforms. Not in Book. Recall Suffix Trees
Lecture 17: Suffix Arrays and Burrows Wheeler Transforms Not in Book 1 Recall Suffix Trees 2 1 Suffix Trees 3 Suffix Tree Summary 4 2 Suffix Arrays 5 Searching Suffix Arrays 6 3 Searching Suffix Arrays
More informationTop-k Document Retrieval in External Memory
Top-k Document Retrieval in External Memory Rahul Shah 1, Cheng Sheng 2, Sharma V. Thankachan 1 and Jeffrey Scott Vitter 3 1 Louisiana State University, USA. {rahul,thanks}@csc.lsu.edu 2 The Chinese University
More informationA 135mW Fully Integrated Data Processor for
5mW Fully Integrated Data Processor for Next-eneration Sequencing Yi-hung Wu, Jui-Hung Hung, hia-hsiang Yang, National aiwan University, National hiao ung University 7 IEEE 4.8: 5mW Fully Integrated Data
More informationPractical Approaches to Reduce the Space Requirement of Lempel-Ziv-Based Compressed Text Indices
Practical Approaches to Reduce the Space Requirement of Lempel-Ziv-Based Compressed Text Indices DIEGO ARROYUELO Yahoo! Research Chile and GONZALO NAVARRO University of Chile Given a text T[..n] over an
More informationAn experimental study of an opportunistic index
An experimental study of an opportunistic index Paolo Ferragina* Giovanni Manzini t Abstract The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities.
More informationThe Compressed Permuterm Index
The Compressed Permuterm Index PAOLO FERRAGINA University of Pisa, Italy ROSSANO VENTURINI ISTI-CNR, Pisa, Italy Abstract. The Permuterm index [Garfield 1976] is a time-efficient and elegant solution to
More informationThe Burrows-Wheeler Transform and Bioinformatics. J. Matthew Holt April 1st, 2015
The Burrows-Wheeler Transform and Bioinformatics J. Matthew Holt April 1st, 2015 Outline Recall Suffix Arrays The Burrows-Wheeler Transform The FM-index Pattern Matching Multi-string BWTs Merge Algorithms
More informationCompressed Suffix Trees: Design, Construction, and Applications
Universität Ulm 89069 Ulm Germany Fakultät für Ingenieurwissenschaften und Informatik Institut für Theoretische Informatik Compressed Suffix Trees: Design, Construction, and Applications DISSERTATION zur
More informationNGS Data and Sequence Alignment
Applications and Servers SERVER/REMOTE Compute DB WEB Data files NGS Data and Sequence Alignment SSH WEB SCP Manpreet S. Katari App Aug 11, 2016 Service Terminal IGV Data files Window Personal Computer/Local
More informationBlock Graphs in Practice
Travis Gagie 1,, Christopher Hoobin 2 Simon J. Puglisi 1 1 Department of Computer Science, University of Helsinki, Finland 2 School of CSIT, RMIT University, Australia Travis.Gagie@cs.helsinki.fi Abstract
More informationFaster Compressed Suffix Trees for Repetitive Text Collections
Faster Compressed Suffix Trees for Repetitive Text Collections Gonzalo Navarro 1 and Alberto Ordóñez 2 1 Dept. of Computer Science, Univ. of Chile, Chile. gnavarro@dcc.uchile.cl 2 Lab. de Bases de Datos,
More informationOptimizing Multi-Core Algorithms for Pattern Search
Optimizing Multi-Core Algorithms for Pattern Search Veronica Gil-Costa 1,2, Cesar Ochoa 1 and Marcela Printista 1,2 1 LIDIC, Universidad Nacional de San Luis, Ejercito de los Andes 950, San Luis, Argentina
More informationBRAT-BW: Efficient and accurate mapping of bisulfite-treated reads [Supplemental Material]
BRAT-BW: Efficient and accurate mapping of bisulfite-treated reads [Supplemental Material] Elena Y. Harris 1, Nadia Ponts 2,3, Karine G. Le Roch 2 and Stefano Lonardi 1 1 Department of Computer Science
More informationOptimized Succinct Data Structures for Massive Data
SOFTWARE PRACTICE AND EXPERIENCE Softw. Pract. Exper. 0000; 00:1 28 Published online in Wiley InterScience (www.interscience.wiley.com). Optimized Succinct Data Structures for Massive Data Simon Gog 1
More informationVolume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies
Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com
More informationThe Burrows-Wheeler Transform and Bioinformatics. J. Matthew Holt
The Burrows-Wheeler Transform and Bioinformatics J. Matthew Holt holtjma@cs.unc.edu Last Class - Multiple Pattern Matching Problem m - length of text d - max length of pattern x - number of patterns Method
More informationA Fast Block sorting Algorithm for lossless Data Compression
A Fast Block sorting Algorithm for lossless Data Compression DI Michael Schindler Vienna University of Technology Karlsplatz 13/1861, A 1040 Wien, Austria, Europe michael@eiunix.tuwien.ac.at if.at is transformed
More informationCompression of Concatenated Web Pages Using XBW
Compression of Concatenated Web Pages Using XBW Radovan Šesták and Jan Lánský Charles University, Faculty of Mathematics and Physics, Department of Software Engineering Malostranské nám. 25, 118 00 Praha
More informationarxiv: v3 [cs.ds] 1 Feb 2019
Relaxing Wheeler raphs for Indexing Reads arxiv:09.00v [cs.ds] Feb 09 ravis agie School of omputer Science and elecommunications, Universidad Diego Portales, hile enter for Biotechnology and Bioengineering,
More informationIndexed Hierarchical Approximate String Matching
Indexed Hierarchical Approximate String Matching Luís M. S. Russo 1,3, Gonzalo Navarro 2, and Arlindo L. Oliveira 1,4 1 INESC-ID, R. Alves Redol 9, 1000 Lisboa, Portugal. aml@algos.inesc-id.pt 2 Dept.
More informationLightweight Data Indexing and Compression in External Memory
Lightweight Data Indexing and Compression in External Memory Paolo Ferragina 1, Travis Gagie 2, and Giovanni Manzini 3 1 Dipartimento di Informatica, Università di Pisa, Italy 2 Departamento de Ciencias
More informationDetecting Superbubbles in Assembly Graphs. Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)!
Detecting Superbubbles in Assembly Graphs Taku Onodera (U. Tokyo)! Kunihiko Sadakane (NII)! Tetsuo Shibuya (U. Tokyo)! de Bruijn Graph-based Assembly Reads (substrings of original DNA sequence) de Bruijn
More informationShort Read Alignment Algorithms
Short Read Alignment Algorithms Raluca Gordân Department of Biostatistics and Bioinformatics Department of Computer Science Department of Molecular Genetics and Microbiology Center for Genomic and Computational
More informationEfficient Wavelet Tree Construction and Querying for Multicore Architectures
Efficient Wavelet Tree Construction and Querying for Multicore Architectures José Fuentes-Sepúlveda, Erick Elejalde, Leo Ferres, Diego Seco Universidad de Concepción, Concepción, Chile {jfuentess,eelejalde,lferres,dseco}@udec.cl
More informationApproximate String Matching Using a Bidirectional Index
Approximate String Matching Using a Bidirectional Index Gregory Kucherov, Kamil Salikhov, Dekel Tsur To cite this version: Gregory Kucherov, Kamil Salikhov, Dekel Tsur. Approximate String Matching Using
More information