Shortest Unique Substring Query Revisited

Similar documents
Suffix Trees. Martin Farach-Colton Rutgers University & Tokutek, Inc

An empirical evaluation of a metric index for approximate string matching. Bilegsaikhan Naidan and Magnus Lie Hetland

Indexing Web pages. Web Search: Indexing Web Pages. Indexing the link structure. checkpoint URL s. Connectivity Server: Node table

Suffix Tree and Array

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

On stabbing queries for generalised longest repeats

arxiv: v3 [cs.ds] 10 Jan 2014

Compressed Data Structures with Relevance

Parallel Distributed Memory String Indexes

Auto-assemblage for Suffix Tree Clustering

Suffix Array Construction

Information Retrieval and Organisation

Text Indexing: Lecture 8

Searching Strings by Substring

Data structures for string pattern matching: Suffix trees

SUFFIX ARRAYS A Competitive Choice for Fast Lempel-Ziv Compressions

Suffix trees, suffix arrays, BWT

Bioinformatics: Fragment Assembly. Walter Kosters, Universiteit Leiden. IPA Algorithms&Complexity,

ASSIGNMENT OF MASTER S THESIS

Universidade Nova de Lisboa Faculdade de Ciências e Tecnologia Departamento de Informática. Burrows-Wheeler Transform in Secondary Memory

Dynamic FM-Index for a Collection of Texts with Application to Space-efficient Construction of the Compressed Suffix Array.

Suffix Trees and Arrays

An Efficient Algorithm for Identifying the Most Contributory Substring. Ben Stephenson Department of Computer Science University of Western Ontario

Space Efficient Linear Time Construction of

Suffix Arrays Slides by Carl Kingsford

Reporting Consecutive Substring Occurrences Under Bounded Gap Constraints

Linear Work Suffix Array Construction

Lempel-Ziv compression: how and why?

Exact String Matching. The Knuth-Morris-Pratt Algorithm

1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

Suffix Arrays. Slides adapted from the course by Ben Langmead

tagerator: a program for mapping short sequence tags a manual

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

Suffix trees and applications. String Algorithms

Suffix-based text indices, construction algorithms, and applications.

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

Determining gapped palindrome density in RNA using suffix arrays

Parameterized Suffix Arrays for Binary Strings

arxiv: v3 [cs.ds] 29 Jun 2010

Inexact Matching, Alignment. See Gusfield, Chapter 9 Dasgupta et al. Chapter 6 (Dynamic Programming)

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

1 Dynamic Programming

Linear-Time Suffix Array Implementation in Haskell

Fast Algorithms for Top-k Approximate String Matching

Searching a Sorted Set of Strings

Strings. Zachary Friggstad. Programming Club Meeting

COMP4128 Programming Challenges

Problem Set 8 Due: Start of Class, November 16

New Implementation for the Multi-sequence All-Against-All Substring Matching Problem

A Fast Order-Preserving Matching with q-neighborhood Filtration Using SIMD Instructions

Permuted Longest-Common-Prefix Array

Methods for High Degrees of Similarity. Index-Based Methods Exploiting Prefixes and Suffixes Exploiting Length

Lecture 7 February 26, 2010

Frequency Covers for Strings

Indexing and Searching

CSED233: Data Structures (2017F) Lecture12: Strings and Dynamic Programming

Pattern Mining in Frequent Dynamic Subgraphs

On the Design and Analysis of Data Center Network Architectures for Interconnecting Dual-Port Servers. Dawei Li and Prof. Jie Wu Temple University

Modern Information Retrieval

Symbolic String Verification: Combining String Analysis and Size Analysis

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Exact Matching Part III: Ukkonen s Algorithm. See Gusfield, Chapter 5 Visualizations from

Lecture 10: Suffix Trees

CS173 Longest Increasing Substrings. Tandy Warnow

Sequencing Alignment I

Improved Processing of Path Query on RDF Data Using Suffix Array

Divide and Conquer Algorithms

Lecture 18 April 12, 2005

FACIAL MOVEMENT BASED PERSON AUTHENTICATION

CS256 Applied Theory of Computation

Lecture L16 April 19, 2012

Applications of Succinct Dynamic Compact Tries to Some String Problems

marc skodborg, simon fischer,

Longest Common Extensions in Trees

Abdullah-Al Mamun. CSE 5095 Yufeng Wu Spring 2013

W O R S T- C A S E L I N E A R T I M E S U F F I X A R - R AY C O N S T R U C T I O N A L G O R I T H M S

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

CSE 417 Dynamic Programming (pt 5) Multiple Inputs

String Matching Algorithms

Efficient Computation of Substring Equivalence Classes with Suffix Arrays

Parallel and Sequential Data Structures and Algorithms Lecture (Spring 2012) Lecture 25 Suffix Arrays

Suffix trees. December Computational Genomics

Lecture #39. Initial grading run Wednesday. GUI files up soon. Today: Dynamic programming and memoization.

Applications of Suffix Tree

Solution Methods for the Multi-trip Elementary Shortest Path Problem with Resource Constraints

CS4800: Algorithms & Data Jonathan Ullman

Indexing Land Surface for Efficient knn Query

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions

Learn Relational Database from Scratch. Dan Li, Ph.D. Associate Professor Computer Science Eastern Washington University

Dynamic Programming II

Dynamic Programming: 1D Optimization. Dynamic Programming: 2D Optimization. Fibonacci Sequence. Crazy 8 s. Edit Distance

BLAST & Genome assembly

Efficient algorithms for two extensions of LPF table: the power of suffix arrays

Modeling Delta Encoding of Compressed Files

Exact String Matching Part II. Suffix Trees See Gusfield, Chapter 5

Mapping Reads to Reference Genome

Compression. storage medium/ communications network. For the purpose of this lecture, we observe the following constraints:

CMPUT 403: Strings. Zachary Friggstad. March 11, 2016

Computing Patterns in Strings I. Specific, Generic, Intrinsic

EE/CSCI 451 Spring 2018 Homework 2 Assigned: February 7, 2018 Due: February 14, 2018, before 11:59 pm Total Points: 100

Transcription:

Shortest Unique Substring Query Revisited Atalay Mert İleri Bilkent University, Turkey M. Oğuzhan Külekci Istanbul Medipol University, Turkey Bojian Xu Eastern Washington University, USA June 18, 2014

Outline 1 Preliminary Information 2 Left Bounded Shortest Unique Substrings 3 Finding Shortest Unique Substring For All Positions 4 Experimental Results 5 Conclusion 6 Acknowledgments 7 Q&A

Shortest Unique Substring (SUS) A substring S[i j] is called Shortest Unique Substring covering a position k, if i k j and there is no other unique substring S [i j ] exists such that i k j and j i < j i

Shortest Unique Substring (SUS) A substring S[i j] is called Shortest Unique Substring covering a position k, if i k j and there is no other unique substring S [i j ] exists such that i k j and j i < j i Problem 1 (find details in paper) Given a string location k, find the leftmost SUS covering location k. Problem 2 (find details in paper) Given a string location k, find all the SUSes covering location k. Problem 3 (discussed in this talk) Find the leftmost SUS covering every string location 1, 2,..., n. Problem 4 (find details in paper) Find all the SUSes covering every string location 1, 2,..., n.

Why SUS queries? Distinguishing results in a text search Finding DNA signatures of organisms Extracting the event context in historical events

Preliminary Information Preliminaries: Suffix Array and Rank Array 1 mississippi 2 ississippi 3 ssissippi 4 sissippi 5 issippi 6 ssippi 7 sippi 8 ippi 9 ppi 10 pi 11 i

Preliminary Information Preliminaries: Suffix Array and Rank Array 1 mississippi 2 ississippi 3 ssissippi 4 sissippi 5 issippi 6 ssippi 7 sippi 8 ippi 9 ppi 10 pi 11 i 11 i 8 ippi 5 issippi 2 ississippi 1 mississippi 10 pi 9 ppi 7 sippi 4 sissippi 6 ssippi 3 ssissippi

Preliminary Information Preliminaries: Suffix Array and Rank Array 1 mississippi 2 ississippi 3 ssissippi 4 sissippi 5 issippi 6 ssippi 7 sippi 8 ippi 9 ppi 10 pi 11 i 11 i 8 ippi 5 issippi 2 ississippi 1 mississippi 10 pi 9 ppi 7 sippi 4 sissippi 6 ssippi 3 ssissippi SA 11 8 5 2 1 10 9 7 4 6 3 RANK 5 4 11 9 3 10 8 2 7 6 1

Preliminary Information Preliminaries: Longest Common Prefix Array 11 i 8 ippi 5 issippi 2 ississippi 1 mississippi 10 pi 9 ppi 7 sippi 4 sissippi 6 ssippi 3 ssissippi LCP 0 1 1 4 0 0 1 0 2 1 3

Left Bounded Shortest Unique Substrings Left Bounded Shortest Unique Substring Definition For a particular string location k {1, 2,..., n}, the left-bounded shortest unique substring (LSUS) starting at location k, denoted as LSUS k, is a unique substring S[k... j], such that either k = j or any proper prefix of S[k... j] is not unique. Examples: mississippi LSUS 1 = S[1, 1] mississippi LSUS 2 = S[2, 6]. mississippi LSUS 10 = S[10, 11] mississippi LSUS 11 does not exist

Left Bounded Shortest Unique Substrings Fast computation of LSUS es from LCP array For i = 1, 2,..., n: { S[i... i + Li ], if i + L LSUS i = i n not existing, otherwise where L i = max{lcp[rank[i]], LCP[RANK[i] + 1]}.

Left Bounded Shortest Unique Substrings Fast computation of LSUS es from LCP array mississippi RANK 5 4 11 9 3 10 8 2 7 6 1 LCP 0 1 1 4 0 0 1 0 2 1 3 L i 0 LSUS i m

Left Bounded Shortest Unique Substrings Fast computation of LSUS es from LCP array mississippi RANK 5 4 11 9 3 10 8 2 7 6 1 LCP 0 1 1 4 0 0 1 0 2 1 3 L i 0 4 LSUS i m issis

Left Bounded Shortest Unique Substrings Fast computation of LSUS es from LCP array mississippi RANK 5 4 11 9 3 10 8 2 7 6 1 LCP 0 1 1 4 0 0 1 0 2 1 3 L i 0 4 3 LSUS i m issis ssis

Left Bounded Shortest Unique Substrings Fast computation of LSUS es from LCP array mississippi RANK 5 4 11 9 3 10 8 2 7 6 1 LCP 0 1 1 4 0 0 1 0 2 1 3 L i 0 4 3 2 4 3 2 1 1 LSUS i m issis ssis sis issip ssip sip ip pp

Left Bounded Shortest Unique Substrings Fast computation of LSUS es from LCP array mississippi RANK 5 4 11 9 3 10 8 2 7 6 1 LCP 0 1 1 4 0 0 1 0 2 1 3 L i 0 4 3 2 4 3 2 1 1 1 LSUS i m issis ssis sis issip ssip sip ip pp pi

Left Bounded Shortest Unique Substrings Fast computation of LSUS es from LCP array mississippi RANK 5 4 11 9 3 10 8 2 7 6 1 LCP 0 1 1 4 0 0 1 0 2 1 3 L i 0 4 3 2 4 3 2 1 1 1 1 LSUS i m issis ssis sis issip ssip sip ip pp pi -

Left Bounded Shortest Unique Substrings Properties of LSUS LSUS 1 always exists, because at least it can be the whole string, which is unique.

Left Bounded Shortest Unique Substrings Properties of LSUS LSUS 1 always exists, because at least it can be the whole string, which is unique. LSUS i cannot end before LSUS i 1 Repeat abcadabcdcab... Unique

Finding Shortest Unique Substring For All Positions Finding leftmost SUS for all positions: Overview SUS i = { SUSi 1 + S[i], if SUS i 1 + 1 SLS i SLS i, if SUS i 1 + 1 > SLS i SLS i Leftmost shortest LSUS covering string location i.

Finding Shortest Unique Substring For All Positions abcabcbcab

Finding Shortest Unique Substring For All Positions abcabcbcab abcabcbcab

Finding Shortest Unique Substring For All Positions abcabcbcab abcabcbcab

Finding Shortest Unique Substring For All Positions abcabcbcab abcabcbcab

Finding Shortest Unique Substring For All Positions abcabcbcab abcabcbcab

Finding Shortest Unique Substring For All Positions Finding the candidate: Data structure Chunk Start LSUS Start Prev Next Chunk End LSUS Length abcacbca...

Finding Shortest Unique Substring For All Positions Example Run S = mississippi List

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i CS LS List 1 1 CE LL 1 1 SUS 0 = SLS 1 = (1, 1) SUS 1 = (1, 1) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i List CS 2 LS 2 CE 6 LL 5 SUS 1 = (1, 1) SLS 2 = (2, 5) SUS 2 = (1, 2) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i List CS LS 3 3 CE 6 LL 4 SUS 2 = (1, 2) SLS 3 = (3, 4) SUS 3 = (1, 3) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i List CS LS 4 4 CE 6 LL 3 SUS 3 = (1, 3) SLS 4 = (4, 3) SUS 4 = (4, 3) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i CS LS CS LS List 5 4 7 5 CE LL CE LL 6 3 9 5 SUS 4 = (4, 3) SLS 5 = (4, 3) SUS 5 = (4, 3) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i List CS LS CS LS 6 4 7 6 CE LL CE LL 6 3 9 4 SUS 5 = (4, 3) SLS 6 = (4, 3) SUS 6 = (4, 3) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i List CS 7 LS 7 CE 9 LL 3 SUS 6 = (4, 3) SLS 7 = (7, 3) SUS 7 = (7, 3) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i List CS LS 8 8 CE 9 LL 2 SUS 7 = (7, 3) SLS 8 = (8, 2) SUS 8 = (8, 2) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i CS LS CS LS List 9 8 10 9 CE LL CE LL 9 2 10 2 SUS 8 = (8, 2) SLS 9 = (8, 2) SUS 9 = (8, 2) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i CS LS CS LS List 10 9 11 10 CE 10 LL CE LL 2 11 2 SUS 9 = (8, 2) SLS 10 = (9, 2) SUS 10 = (9, 2) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Example Run 1 2 3 4 5 6 7 8 9 10 11 m i s s i s s i p p i List CS 11 LS 10 CE 11 LL 2 SUS 10 = (9, 2) SLS 11 = (10, 2) SUS 11 = (10, 2) CS: Chunk Start CE: Chunk End LS: LSUS Start LL: LSUS Length

Finding Shortest Unique Substring For All Positions Time Complexity Derivation of LCP and Rank array Calculation of LSUS Maintenance of data structure Finding SLS Determining SUS Number of iterations: Time O(n) O(1) O(1) O(1) O(1) n Total Time: O(n) + n (O(1) + O(1) + O(1) + O(1)) = O(n)

Finding Shortest Unique Substring For All Positions Space Complexity LCP array Rank array Data structure Previous SUS Memory (in words) O(n) O(n) O(n) O(1) Total Memory: O(n) + O(n) + O(n) + O(1) = O(n)

Experimental Results Involved work in experimental study J. Pei, W. C. H. Wu, M. Y. Yeh, On shortest unique substring queries, ICDE 2013 K. Tsuruta, S. Inenaga, H. Bannai, M. Takeda, Shortest unique substrings queries in optimal time, SOFSEM 2014

Experimental Results Time Measurements

Experimental Results Memory Measurements

Conclusion Conclusion We developed O(n) time and space algorithms for finding SUS covering one position and all positions. Our work improved the prior work of Pei et al. 8 times in terms of time and 20 times in terms of memory.

Acknowledgments Acknowledgments I want to thank them for their monetary support for my trip to CPM2014. CPM Organizing Commitee Bilim Akademisi

Q&A Thanks! Questions?