Shortest Unique Substring Query Revisited
Atalay Mert İleri, Bilkent University, Turkey
M. Oğuzhan Külekci, Istanbul Medipol University, Turkey
Bojian Xu, Eastern Washington University, USA
June 18, 2014
Outline
1. Preliminary Information
2. Left Bounded Shortest Unique Substrings
3. Finding Shortest Unique Substring For All Positions
4. Experimental Results
5. Conclusion
6. Acknowledgments
7. Q&A
Shortest Unique Substring (SUS)
A substring $S[i..j]$ is called a shortest unique substring (SUS) covering a position $k$ if $S[i..j]$ is unique, $i \le k \le j$, and no other unique substring $S[i'..j']$ exists such that $i' \le k \le j'$ and $j' - i' < j - i$.
Problem 1 (find details in paper): Given a string location k, find the leftmost SUS covering location k.
Problem 2 (find details in paper): Given a string location k, find all the SUSes covering location k.
Problem 3 (discussed in this talk): Find the leftmost SUS covering every string location 1, 2, ..., n.
Problem 4 (find details in paper): Find all the SUSes covering every string location 1, 2, ..., n.
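The definition can be checked directly with a brute-force search. The sketch below is only an illustration of Problem 1 and not the method from the talk; the function names `is_unique` and `leftmost_sus` are hypothetical, positions are 1-based as on the slides, and results are returned as (start, length) pairs.

```python
def is_unique(s, sub):
    """True if sub occurs exactly once in s (overlapping occurrences count)."""
    first = s.find(sub)
    return first != -1 and s.find(sub, first + 1) == -1


def leftmost_sus(s, k):
    """Leftmost shortest unique substring covering 1-based position k.

    Returns (start, length) with a 1-based start. Brute force: try every
    length in increasing order and every start from left to right.
    """
    n = len(s)
    for length in range(1, n + 1):
        first_start = max(1, k - length + 1)   # window must still reach k
        last_start = min(k, n - length + 1)    # window must start at or before k
        for i in range(first_start, last_start + 1):
            if is_unique(s, s[i - 1:i - 1 + length]):
                return (i, length)
    return None  # only reached for an empty string or k out of range


print(leftmost_sus("mississippi", 4))  # (4, 3), the substring "sis"
```

Because lengths are tried in increasing order and starts from left to right, the first hit is both shortest and leftmost.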
Why SUS queries?
Distinguishing results in a text search
Finding DNA signatures of organisms
Extracting the event context in historical events
Preliminaries: Suffix Array and Rank Array
Suffixes of S = mississippi by starting position:
 1  mississippi
 2  ississippi
 3  ssissippi
 4  sissippi
 5  issippi
 6  ssippi
 7  sippi
 8  ippi
 9  ppi
10  pi
11  i
Suffixes in lexicographic order. SA[r] is the starting position of the r-th smallest suffix:
 r  SA[r]  suffix
 1  11     i
 2   8     ippi
 3   5     issippi
 4   2     ississippi
 5   1     mississippi
 6  10     pi
 7   9     ppi
 8   7     sippi
 9   4     sissippi
10   6     ssippi
11   3     ssissippi
RANK[i] is the lexicographic rank of the suffix starting at position i, so RANK[SA[r]] = r:
 i        1  2   3  4  5   6  7  8  9  10  11
 RANK[i]  5  4  11  9  3  10  8  2  7   6   1
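For small inputs, both arrays can be built by sorting the suffixes directly. This is a minimal sketch for illustration, not a linear-time construction; the names `suffix_array` and `rank_array` are hypothetical, and positions are 1-based to match the slides.

```python
def suffix_array(s):
    """1-based starting positions of the suffixes of s in lexicographic order.

    Naive construction by sorting whole suffixes; fine for short strings.
    """
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def rank_array(sa):
    """rank[i] = lexicographic rank (1-based) of the suffix starting at i.

    Index 0 is unused so that positions stay 1-based.
    """
    rank = [0] * (len(sa) + 1)
    for r, i in enumerate(sa, start=1):
        rank[i] = r
    return rank


sa = suffix_array("mississippi")
rank = rank_array(sa)
print(sa)        # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(rank[1:])  # [5, 4, 11, 9, 3, 10, 8, 2, 7, 6, 1]
```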
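Preliminaries: Longest Common Prefix Array
LCP[r] is the length of the longest common prefix of the suffixes of rank r - 1 and rank r, with LCP[1] = 0:
 r  SA[r]  LCP[r]  suffix
 1  11     0       i
 2   8     1       ippi
 3   5     1       issippi
 4   2     4       ississippi
 5   1     0       mississippi
 6  10     0       pi
 7   9     1       ppi
 8   7     0       sippi
 9   4     2       sissippi
10   6     1       ssippi
11   3     3       ssissippi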
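One common way to obtain the LCP array in linear time once the suffix array is known is the algorithm of Kasai et al. The slides do not say which construction was used, so the sketch below is an assumption chosen for illustration. The function name `lcp_array` is hypothetical, the suffix array is built naively, and indices are 0-based inside the code (entry r of the result corresponds to LCP[r + 1] on the slide).

```python
def lcp_array(s):
    """LCP array of s, 0-based: lcp[r] is the longest common prefix length of
    the suffixes of rank r - 1 and rank r, and lcp[0] = 0."""
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])  # naive suffix array, 0-based
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r

    lcp = [0] * n
    h = 0  # matched length carried over from the previous text position
    for i in range(n):
        r = rank[i]
        if r == 0:
            h = 0
            continue
        j = sa[r - 1]  # start of the lexicographic predecessor suffix
        while i + h < n and j + h < n and s[i + h] == s[j + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


print(lcp_array("mississippi"))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```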
Left Bounded Shortest Unique Substring
Definition: For a particular string location $k \in \{1, 2, \dots, n\}$, the left-bounded shortest unique substring (LSUS) starting at location $k$, denoted $\mathrm{LSUS}_k$, is a unique substring $S[k..j]$ such that either $k = j$ or no proper prefix of $S[k..j]$ is unique.
Examples for S = mississippi:
LSUS_1 = S[1, 1] = m
LSUS_2 = S[2, 6] = issis
LSUS_10 = S[10, 11] = pi
LSUS_11 does not exist
Fast computation of LSUSes from LCP array
For $i = 1, 2, \dots, n$:
$$\mathrm{LSUS}_i = \begin{cases} S[i..i+L_i], & \text{if } i + L_i \le n \\ \text{not existing}, & \text{otherwise} \end{cases}$$
where $L_i = \max\{\mathrm{LCP}[\mathrm{RANK}[i]],\ \mathrm{LCP}[\mathrm{RANK}[i]+1]\}$.
Fast computation of LSUSes from LCP array: S = mississippi
 i  S[i]  RANK[i]  LCP[i]  L_i  LSUS_i
 1  m      5       0       0    m
 2  i      4       1       4    issis
 3  s     11       1       3    ssis
 4  s      9       4       2    sis
 5  i      3       0       4    issip
 6  s     10       0       3    ssip
 7  s      8       1       2    sip
 8  i      2       0       1    ip
 9  p      7       2       1    pp
10  p      6       1       1    pi
11  i      1       3       1    -
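The formula for LSUS_i can be tried out with a short sketch. It assumes LCP[n + 1] = 0 so that the suffix of highest rank has a well-defined right neighbour value, and it builds SA, RANK and LCP naively for brevity. The names `common_prefix_len` and `lsus_all` are hypothetical, and each LSUS is returned as a 1-based (start, end) pair.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lsus_all(s):
    """Map each 1-based position i to LSUS_i as (start, end), or None."""
    n = len(s)
    sa = sorted(range(1, n + 1), key=lambda i: s[i - 1:])  # sa[r - 1] has rank r
    rank = [0] * (n + 1)
    for r, i in enumerate(sa, start=1):
        rank[i] = r

    # lcp[r] compares the suffixes of rank r - 1 and r; lcp[1] = lcp[n + 1] = 0.
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        lcp[r] = common_prefix_len(s[sa[r - 2] - 1:], s[sa[r - 1] - 1:])

    result = {}
    for i in range(1, n + 1):
        L = max(lcp[rank[i]], lcp[rank[i] + 1])
        result[i] = (i, i + L) if i + L <= n else None
    return result


s = "mississippi"
for i, span in lsus_all(s).items():
    print(i, "-" if span is None else s[span[0] - 1:span[1]])
```

With precomputed arrays, each LSUS_i costs a constant number of array lookups; only the naive array construction here is slower.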
Properties of LSUS
LSUS_1 always exists, because at least it can be the whole string, which is unique.
LSUS_i cannot end before LSUS_{i-1} ends: if S[i..j] is unique, then S[i-1..j] is unique as well.
[Figure: string abcadabcdcab... with its repeated and unique parts marked]
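The second property can be observed on concrete strings with a brute-force check. This is an illustrative sketch only; `lsus_ends` and `ends_are_monotone` are hypothetical names and positions are 1-based.

```python
def lsus_ends(s):
    """End position (1-based) of LSUS_i for i = 1..n, or None where it does not exist."""
    n = len(s)
    ends = []
    for i in range(1, n + 1):
        end = None
        for j in range(i, n + 1):
            sub = s[i - 1:j]
            if s.find(sub, s.find(sub) + 1) == -1:  # no second occurrence
                end = j
                break
        ends.append(end)
    return ends


def ends_are_monotone(s):
    """True if the end positions of the existing LSUSes never decrease."""
    existing = [e for e in lsus_ends(s) if e is not None]
    return all(a <= b for a, b in zip(existing, existing[1:]))


print(lsus_ends("mississippi"))  # [1, 6, 6, 6, 9, 9, 9, 9, 10, 11, None]
print(ends_are_monotone("mississippi"))  # True
```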
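Finding leftmost SUS for all positions: Overview
$$\mathrm{SUS}_i = \begin{cases} \mathrm{SUS}_{i-1} + S[i], & \text{if } |\mathrm{SUS}_{i-1} + S[i]| \le |\mathrm{SLS}_i| \\ \mathrm{SLS}_i, & \text{if } |\mathrm{SUS}_{i-1} + S[i]| > |\mathrm{SLS}_i| \end{cases}$$
Here $\mathrm{SUS}_{i-1} + S[i]$ denotes $\mathrm{SUS}_{i-1}$ extended, if necessary, to cover location $i$.
$\mathrm{SLS}_i$: leftmost shortest LSUS covering string location $i$.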
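The case distinction can be written as a small selection function. This sketch assumes that substrings are given as (start, length) pairs with 1-based starts, that the previous SUS is extended only when it does not already reach position i, and that SLS_i may be missing when no LSUS covers i. The name `next_sus` is hypothetical.

```python
def next_sus(prev_sus, sls, i):
    """Choose SUS_i from SUS_{i-1} and SLS_i.

    prev_sus and sls are (start, length) pairs; sls may be None when no LSUS
    covers position i. Ties go to the extended previous SUS, which starts
    further left.
    """
    start, length = prev_sus
    end = start + length - 1
    if end >= i:
        candidate = prev_sus                  # already covers position i
    else:
        candidate = (start, i - start + 1)    # extend to the right up to i
    if sls is None or candidate[1] <= sls[1]:
        return candidate
    return sls


# Values taken from the mississippi example run in the slides.
print(next_sus((1, 1), (2, 5), 2))   # (1, 2)
print(next_sus((1, 3), (4, 3), 4))   # (4, 3)
print(next_sus((8, 2), (9, 2), 10))  # (9, 2)
```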
[Figure: step-by-step illustration on the string abcabcbcab]
Finding the candidate: Data structure
A linked list in which each node stores: Chunk Start, Chunk End, LSUS Start, LSUS Length, and Prev/Next links.
[Figure: list nodes over the example string abcacbca...]
Example Run: S = mississippi
CS: Chunk Start, CE: Chunk End, LS: LSUS Start, LL: LSUS Length. SUS and SLS entries are (start, length) pairs.
 i   List nodes (CS, CE, LS, LL)        SUS_{i-1}  SLS_i    SUS_i
 1   (1, 1, 1, 1)                       -          (1, 1)   (1, 1)
 2   (2, 6, 2, 5)                       (1, 1)     (2, 5)   (1, 2)
 3   (3, 6, 3, 4)                       (1, 2)     (3, 4)   (1, 3)
 4   (4, 6, 4, 3)                       (1, 3)     (4, 3)   (4, 3)
 5   two nodes                          (4, 3)     (4, 3)   (4, 3)
 6   two nodes                          (4, 3)     (4, 3)   (4, 3)
 7   (7, 9, 7, 3)                       (4, 3)     (7, 3)   (7, 3)
 8   (8, 9, 8, 2)                       (7, 3)     (8, 2)   (8, 2)
 9   two nodes                          (8, 2)     (8, 2)   (8, 2)
10   two nodes, the first with CE = 10  (8, 2)     (9, 2)   (9, 2)
11   (11, 11, 10, 2)                    (9, 2)     (10, 2)  (10, 2)
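The run above can be reproduced with a short sketch. The slides show only the node fields, so the update rule used here is an assumption: the chunk list is kept as a double-ended queue whose LSUS lengths never decrease from head to tail, chunks at the tail that a new, shorter LSUS supersedes are removed, and the head chunk yields SLS_i. The names `lsus_table` and `leftmost_sus_all` are hypothetical, and the LSUSes are found by brute force for brevity, so this sketch as a whole is not linear time.

```python
from collections import deque


def lsus_table(s):
    """table[i] = LSUS_i as (start, end), 1-based, or None; brute force."""
    n = len(s)
    table = [None] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            sub = s[i - 1:j]
            if s.find(sub, s.find(sub) + 1) == -1:  # no second occurrence
                table[i] = (i, j)
                break
    return table


def leftmost_sus_all(s):
    """Leftmost SUS covering each position 1..n, as (start, length) pairs."""
    lsus = lsus_table(s)
    chunks = deque()  # each chunk is [CS, CE, LS, LL]
    result = []
    prev = None
    for i in range(1, len(s) + 1):
        # Chunks that end before position i are no longer needed.
        while chunks and chunks[0][1] < i:
            chunks.popleft()
        if lsus[i] is not None:
            start, end = lsus[i]
            length = end - start + 1
            # A strictly shorter LSUS supersedes longer ones at the tail.
            while chunks and chunks[-1][3] > length:
                chunks.pop()
            cs = max(i, chunks[-1][1] + 1) if chunks else i
            if cs <= end:
                chunks.append([cs, end, start, length])
        sls = (chunks[0][2], chunks[0][3]) if chunks else None
        if prev is None:
            cur = sls  # position 1: LSUS_1 always exists
        else:
            p_start, p_len = prev
            if p_start + p_len - 1 >= i:
                cand = prev                        # already covers i
            else:
                cand = (p_start, i - p_start + 1)  # extend to cover i
            cur = cand if sls is None or cand[1] <= sls[1] else sls
        result.append(cur)
        prev = cur
    return result


print(leftmost_sus_all("mississippi"))
```

For mississippi the output matches the SUS_i column of the example run. Each chunk is appended and removed at most once, which is the kind of amortized constant-time maintenance the complexity slide relies on.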
Time Complexity
 Step                               Time
 Derivation of LCP and Rank array   O(n)
 Calculation of LSUS                O(1)
 Maintenance of data structure      O(1)
 Finding SLS                        O(1)
 Determining SUS                    O(1)
Number of iterations: n
Total Time: O(n) + n (O(1) + O(1) + O(1) + O(1)) = O(n)
Space Complexity
 Item            Memory (in words)
 LCP array       O(n)
 Rank array      O(n)
 Data structure  O(n)
 Previous SUS    O(1)
Total Memory: O(n) + O(n) + O(n) + O(1) = O(n)
Related work included in the experimental study
J. Pei, W. C. H. Wu, M. Y. Yeh, On shortest unique substring queries, ICDE 2013
K. Tsuruta, S. Inenaga, H. Bannai, M. Takeda, Shortest unique substrings queries in optimal time, SOFSEM 2014
Experimental Results: Time Measurements
[Figure: time measurements]
Experimental Results: Memory Measurements
[Figure: memory measurements]
Conclusion
We developed O(n) time and space algorithms for finding the SUS covering one position and the SUSes covering all positions.
Compared with the prior work of Pei et al., our work is 8 times faster and uses 20 times less memory.
Acknowledgments
I want to thank the CPM Organizing Committee and Bilim Akademisi for their monetary support for my trip to CPM 2014.
Q&A
Thanks! Questions?
Dynamic Programming: 1D Optimization Fibonacci Sequence To efficiently calculate F [x], the xth element of the Fibonacci sequence, we can construct the array F from left to right (or bottom up ). We start
More informationBLAST & Genome assembly
BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3
More informationEfficient algorithms for two extensions of LPF table: the power of suffix arrays
Efficient algorithms for two extensions of LPF table: the power of suffix arrays Maxime Crochemore 1,3, Costas S. Iliopoulos 1,4, Marcin Kubica 2, Wojciech Rytter 2,5, and Tomasz Waleń 2 1 Dept. of Computer
More informationModeling Delta Encoding of Compressed Files
Modeling Delta Encoding of Compressed Files EXTENDED ABSTRACT S.T. Klein, T.C. Serebro, and D. Shapira 1 Dept of CS Bar Ilan University Ramat Gan, Israel tomi@cs.biu.ac.il 2 Dept of CS Bar Ilan University
More informationExact String Matching Part II. Suffix Trees See Gusfield, Chapter 5
Exact String Matching Part II Suffix Trees See Gusfield, Chapter 5 Outline for Today What are suffix trees Application to exact matching Building a suffix tree in linear time, part I: Ukkonen s algorithm
More informationMapping Reads to Reference Genome
Mapping Reads to Reference Genome DNA carries genetic information DNA is a double helix of two complementary strands formed by four nucleotides (bases): Adenine, Cytosine, Guanine and Thymine 2 of 31 Gene
More informationCompression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:
CS231 Algorithms Handout # 31 Prof. Lyn Turbak November 20, 2001 Wellesley College Compression The Big Picture We want to be able to store and retrieve data, as well as communicate it with others. In general,
More informationCMPUT 403: Strings. Zachary Friggstad. March 11, 2016
CMPUT 403: Strings Zachary Friggstad March 11, 2016 Outline Tries Suffix Arrays Knuth-Morris-Pratt Pattern Matching Tries Given a dictionary D of strings and a query string s, determine if s is in D. Using
More informationComputing Patterns in Strings I. Specific, Generic, Intrinsic
Outline : Specific, Generic, Intrinsic 1,2,3 1 Algorithms Research Group, Department of Computing & Software McMaster University, Hamilton, Ontario, Canada email: smyth@mcmaster.ca 2 Digital Ecosystems
More informationEE/CSCI 451 Spring 2018 Homework 2 Assigned: February 7, 2018 Due: February 14, 2018, before 11:59 pm Total Points: 100
EE/CSCI 45 Spring 08 Homework Assigned: February 7, 08 Due: February 4, 08, before :59 pm Total Points: 00 [0 points] Explain the following terms:. Diameter of a network. Bisection width of a network.
More information