Tries. Yufei Tao KAIST. April 9, Y. Tao, April 9, 2013 Tries

Similar documents
What are suffix trees?

Definition of Regular Expression

Information Retrieval and Organisation

Suffix trees, suffix arrays, BWT

COMP 423 lecture 11 Jan. 28, 2008

CS201 Discussion 10 DRAWTREE + TRIES

Outline. Introduction Suffix Trees (ST) Building STs in linear time: Ukkonen s algorithm Applications of ST

COMBINATORIAL PATTERN MATCHING

Orthogonal line segment intersection

In the last lecture, we discussed how valid tokens may be specified by regular expressions.

Applied Databases. Sebastian Maneth. Lecture 13 Online Pattern Matching on Strings. University of Edinburgh - February 29th, 2016

Lecture 10: Suffix Trees

Dr. D.M. Akbar Hussain

Suffix Tries. Slides adapted from the course by Ben Langmead

If you are at the university, either physically or via the VPN, you can download the chapters of this book as PDFs.

Suffix trees. December Computational Genomics

Intermediate Information Structures

CS481: Bioinformatics Algorithms

ΕΠΛ323 - Θεωρία και Πρακτική Μεταγλωττιστών

Position Heaps: A Simple and Dynamic Text Indexing Data Structure

Fig.25: the Role of LEX

Algorithm Design (5) Text Search

The Greedy Method. The Greedy Method

CS321 Languages and Compiler Design I. Winter 2012 Lecture 5

CS143 Handout 07 Summer 2011 June 24 th, 2011 Written Set 1: Lexical Analysis

CSCI 104. Rafael Ferreira da Silva. Slides adapted from: Mark Redekopp and David Kempe

Finite Automata. Lecture 4 Sections Robb T. Koether. Hampden-Sydney College. Wed, Jan 21, 2015

Ma/CS 6b Class 1: Graph Recap

An Algorithm for Enumerating All Maximal Tree Patterns Without Duplication Using Succinct Data Structure

ΕΠΛ323 - Θεωρία και Πρακτική Μεταγλωττιστών. Lecture 3b Lexical Analysis Elias Athanasopoulos

Alignment of Long Sequences. BMI/CS Spring 2012 Colin Dewey

Deterministic. Finite Automata. And Regular Languages. Fall 2018 Costas Busch - RPI 1

Topic 2: Lexing and Flexing

Today. CS 188: Artificial Intelligence Fall Recap: Search. Example: Pancake Problem. Example: Pancake Problem. General Tree Search.

CS 241. Fall 2017 Midterm Review Solutions. October 24, Bits and Bytes 1. 3 MIPS Assembler 6. 4 Regular Languages 7.

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. Example: Pancake Problem. Example: Pancake Problem

CSc 453. Compilers and Systems Software. 4 : Lexical Analysis II. Department of Computer Science University of Arizona

Presentation Martin Randers

10.2 Graph Terminology and Special Types of Graphs

MATH 25 CLASS 5 NOTES, SEP

From Indexing Data Structures to de Bruijn Graphs

CS 432 Fall Mike Lam, Professor a (bc)* Regular Expressions and Finite Automata

On String Matching in Chunked Texts

Representation of Numbers. Number Representation. Representation of Numbers. 32-bit Unsigned Integers 3/24/2014. Fixed point Integer Representation

CS 241 Week 4 Tutorial Solutions

Lexical Analysis: Constructing a Scanner from Regular Expressions

I/O Efficient Dynamic Data Structures for Longest Prefix Queries

Lecture T1: Pattern Matching

Implementing Automata. CSc 453. Compilers and Systems Software. 4 : Lexical Analysis II. Department of Computer Science University of Arizona

1.1. Interval Notation and Set Notation Essential Question When is it convenient to use set-builder notation to represent a set of numbers?

A dual of the rectangle-segmentation problem for binary matrices

CS311H: Discrete Mathematics. Graph Theory IV. A Non-planar Graph. Regions of a Planar Graph. Euler s Formula. Instructor: Işıl Dillig

Chapter44. Polygons and solids. Contents: A Polygons B Triangles C Quadrilaterals D Solids E Constructing solids

ITEC2620 Introduction to Data Structures

Here is an example where angles with a common arm and vertex overlap. Name all the obtuse angles adjacent to

Compression Outline :Algorithms in the Real World. Lempel-Ziv Algorithms. LZ77: Sliding Window Lempel-Ziv

Announcements. CS 188: Artificial Intelligence Fall Recap: Search. Today. General Tree Search. Uniform Cost. Lecture 3: A* Search 9/4/2007

Typing with Weird Keyboards Notes

2 Computing all Intersections of a Set of Segments Line Segment Intersection

CIS 1068 Program Design and Abstraction Spring2015 Midterm Exam 1. Name SOLUTION

The dictionary model allows several consecutive symbols, called phrases

10.5 Graphing Quadratic Functions

Languages. L((a (b)(c))*) = { ε,a,bc,aa,abc,bca,... } εw = wε = w. εabba = abbaε = abba. (a (b)(c)) *

Suffix Trees and Arrays

Regular Expression Matching with Multi-Strings and Intervals. Philip Bille Mikkel Thorup

ASTs, Regex, Parsing, and Pretty Printing

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence Winter 2016

CS 340, Fall 2014 Dec 11 th /13 th Final Exam Note: in all questions, the special symbol ɛ (epsilon) is used to indicate the empty string.

Lecture 8: Graph-theoretic problems (again)

Lecture 7: Integration Techniques

Notes for Graph Theory

The Fundamental Theorem of Calculus

MTH 146 Conics Supplement

binary trees, expression trees

Phylogeny and Molecular Evolution

Lists in Lisp and Scheme

Rational Numbers---Adding Fractions With Like Denominators.

Pointwise convergence need not behave well with respect to standard properties such as continuity.

Introduction to Integration

UNIT 11. Query Optimization

Ma/CS 6b Class 1: Graph Recap

Quiz2 45mins. Personal Number: Problem 1. (20pts) Here is an Table of Perl Regular Ex

4/29/18 FIBONACCI NUMBERS GOLDEN RATIO, RECURRENCES. Fibonacci function. Fibonacci (Leonardo Pisano) ? Statue in Pisa Italy

CS412/413. Introduction to Compilers Tim Teitelbaum. Lecture 4: Lexical Analyzers 28 Jan 08

Reducing a DFA to a Minimal DFA

a(e, x) = x. Diagrammatically, this is encoded as the following commutative diagrams / X

Lecture 12 : Topological Spaces

A Tautology Checker loosely related to Stålmarck s Algorithm by Martin Richards

9 Graph Cutting Procedures

CSCI1950 Z Computa4onal Methods for Biology Lecture 2. Ben Raphael January 26, hhp://cs.brown.edu/courses/csci1950 z/ Outline

Before We Begin. Introduction to Spatial Domain Filtering. Introduction to Digital Image Processing. Overview (1): Administrative Details (1):

INTRODUCTION TO SIMPLICIAL COMPLEXES

Stack. A list whose end points are pointed by top and bottom

Solving Problems by Searching. CS 486/686: Introduction to Artificial Intelligence

Context-Free Grammars

CSE 549: Suffix Tries & Suffix Trees. All slides in this lecture not marked with * of Ben Langmead.

Paradigm 5. Data Structure. Suffix trees. What is a suffix tree? Suffix tree. Simple applications. Simple applications. Algorithms

CS 221: Artificial Intelligence Fall 2011

From Dependencies to Evaluation Strategies

EXPONENTIAL & POWER GRAPHS

Transcription:

Tries Yufei To KAIST April 9, 2013 Y. To, April 9, 2013 Tries

In this lecture, we will discuss the following exct mtching prolem on strings. Prolem Let S e set of strings, ech of which hs unique integer id. Given query string q, query reports: the id of q if it exists in S nothing otherwise. Exmple Suppose tht S {,,,,,,, }. Let the ids of these strings e (from left to right) 1, 2,..., 8, respectively. Given q, query returns id 3, wheres given q, it returns nothing. Y. To, April 9, 2013 Tries

Think How is this prolem relted to inverted indexes nd serch engines? Y. To, April 9, 2013 Tries

Nottions nd A Nive Solution Let A e the lphet (i.e., every chrcter of ny string must come from A). s e the length of string s, i.e., the numer of chrcters in s. m S, i.e., the numer of strings in S. n the totl length of the strings in S, i.e., n s S s. When A is smll nd ll strings in S re short (e.g., s 10 for ll s S), the exct mtching prolem on strings cn e reduced to exct mtching on integers. For exmple, consider tht ech string s represents n English word, nd tht every s hs length t most 10. We cn mp s to n integer from 0 to 26 10 1. Think Why does the method no longer work if A is lrge or strings cn e ritrrily long? Y. To, April 9, 2013 Tries

Next, we will descrie nother solution sed on dt structure clled trie. First, let us define the concept of prefix. Let s e string of length t. We cn write its chrcters (from left to right) s s[1], s[2],..., s[t], respectively. Then, for ny i [1, t], the string formed y the sequence s[1],..., s[i] is clled prefix of s. Specilly, n empty string is lso prefix of s. Exmple s hs 6 prefixes:,,,,, nd. Let S e set of strings. We sy tht string s is possile prefix of S if s is prefix of t lest one string in S. Y. To, April 9, 2013 Tries

A set S of strings is clled prefix-free if no string in S is prefix of ny other string in S. Every set of strings cn e mde prefix-free y ppending specil termintion symol to ech string in S. Exmple Let S {,,,,,,, }. We cn convert S to S {,,,,,,, }, which is prefix-free. From now on, we will consider tht S is prefix-free, nd tht every string in S ends with. Y. To, April 9, 2013 Tries

Tries The trie on S is tree T defined s follows: Ech node u of T corresponds to distinct possile prefix of S. Let P(u) e the prefix tht u represents. Let u e node, nd v child node of u. Then: P(u) is prefix of P(v). P(v) P(u) + 1. Ech node u is leled with chrcter c, which is the lst chrcter of P(u). Y. To, April 9, 2013 Tries

Exmple: Let S {,,,,,,, }. The trie is: Note tht every -node u corresponds to distinct string s S. We therefore store the id of s t u. Y. To, April 9, 2013 Tries

Lemm The trie on S hs t most n nodes. Y. To, April 9, 2013 Tries

How do we nswer n exct mtching query with q? How out q? Y. To, April 9, 2013 Tries

How to delete the string? How out inserting? Y. To, April 9, 2013 Tries

Notice tht the efficiency of queries, insertions nd deletions depends on how well we cn solve the following prolem: Given node u nd chrcter σ A {}, how to find the child of v of u tht corresponds to σ? Different trdeoffs exist: By orgnizing the child nodes of u in n rry, we cn find v in O(1) time, ut the rry occupies O( A ) spce. By orgnizing the child nodes of u in inry serch tree (BST), we cn find v in O(log A ) time, nd the tree occupies O( f ) spce, where f is the numer of child nodes of u. Y. To, April 9, 2013 Tries

Theorem By using the rry implementtion, trie occupies O( A n) spce, nswers query with string q in O( q ) time, nd supports the insertion nd deletion of string s in O( A s ) time. By using the BST implementtion, trie occupies O(n) spce, nswers query with string q in O( q log A ) time, nd supports the insertion nd deletion of string s in O( s log A ) time. Y. To, April 9, 2013 Tries

Next, we will descrie nother trie vrint, clled lnced trie, which occupies O(n) spce, nd nswers query with string q in O(log m + q ) time. The trie, however, is sttic, nmely, it does not support insertions nd deletions. Y. To, April 9, 2013 Tries

From now on, we consider tht S is sorted lpheticlly (plcing efore ll chrcters of A). In generl, given set S of x sorted strings, we refer to the one in S whose rnk is x/2 s the medin of S. Exmple The medin of {,,,,,,, } is. Furthermore, given prefix p, denote y S(p) the set of strings in S with prefix p. Exmple Let S {,,,,,,, }. Then S() {,, }. Y. To, April 9, 2013 Tries

We lso need to define wht it mens y conctention. The conctention of two strings s 1 nd s 2 forms string y ppending the chrcters of s 2 t the end of s 1. Exmple If s 1 nd s 2, then conctention gives. If s 1 nd s 2, then conctention gives. Similrly, if s 1 nd s 2, conctention gives. Y. To, April 9, 2013 Tries

Let S e set of strings. The lnced trie on S is tree T defined s follows: Every node u in T corresponds to set S(u) of strings, nd crries lel L(u) nd positionl index I (u), which will e formlly defined elow. L(u) is the i-th chrcter of the medin of S(u), where i I (u). Ech u corresponds to possile prefix P(u) of S, where P(u) is the conctention of the lels of the nodes on the pth from the root to u. If u is the root, S(u) S, nd I (u) 1. u is lef if S(u) 1 nd I (u) s, where s is the (only) string in S(u). An internl u hs t most 3 child nodes u <, u, nd u > such tht: S(u <) is the set of strings in S(u) lpheticlly less thn P(u). I (u <) I (u). S(u ) is the set of strings in S(u) tht hve P(u) s their prefixes. I (u ) I (u) + 1. S(u >) is the set of remining strings in S(u). I (u >) I (u) Y. To, April 9, 2013 Tries

Exmple: Let S {,,,,,,, }. The lnced trie is: (, 1) (, 2) (, 3) (, 2) < (, 3) (, 4) (, 3) < < (, 4) (, 4) (, 5) (, 3) (, 4) > (, 6) (, 5) (, 5) (, 4) (, 5) < > (, 6) (, 5) (, 6) (, 5) (, 6) Ech node u is denoted in the form (L(u), I (u)). > (, 6) Y. To, April 9, 2013 Tries

(, 1) (, 2) (, 3) (, 2) < (, 3) (, 4) (, 3) < < (, 4) (, 4) (, 5) (, 3) (, 4) > (, 6) (, 5) (, 5) (, 4) (, 5) < > (, 6) (, 5) (, 6) (, 5) (, 6) > (, 6) How do we nswer n exct mtching query with q? How out q? Y. To, April 9, 2013 Tries

Theorem A lnced trie occupies O(n) spce, nd nswers query with string q in O(log m + q ) time Y. To, April 9, 2013 Tries