Information Retrieval and Organisation


Information Retrieval and Organisation: Suffix Trees
Adapted from http://www.math.tau.ac.il/~haimk/seminar02/suffixtrees.ppt
Dell Zhang, Birkbeck, University of London

Trie. A tree representing a set of strings, for example { aeef, ad, bbfe, bbfg, c }.

Trie. Assume no string is a prefix of another.
1) Each edge is labeled by a letter.
2) No two edges outgoing from the same node are labeled the same.
3) Each string corresponds to a leaf.
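The three rules above can be sketched with nested dictionaries. This is only an illustration: the names `build_trie` and `contains`, the use of a `None` key as the leaf marker, and the example words are assumptions made for this sketch, not part of the original slides.

```python
def build_trie(words):
    """Build a trie as nested dicts; a None key marks the leaf of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Rule 2: reuse the edge for this letter if it already exists,
            # so no node ever has two outgoing edges with the same label.
            node = node.setdefault(ch, {})
        node[None] = True  # Rule 3: the word ends at its own leaf
    return root


def contains(trie, word):
    """Follow one edge per letter (rule 1) and check that we stop at a leaf."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return None in node


# Illustrative prefix-free set of words.
trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(contains(trie, "bbfe"), contains(trie, "bbf"))
```

Because the set is prefix-free, a lookup for a proper prefix such as "bbf" ends at an internal node and is rejected.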

Compressed Trie. Compress unary nodes, label edges by strings.

Suffix Tree. Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

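Suffix Tree. For example, let s = abab. A suffix tree of s is a compressed trie of all suffixes of abab$: { $, b$, ab$, bab$, abab$ }. Note that a suffix tree has O(n) nodes, where n = |s|. Why? Each suffix ends at its own leaf, so there are n + 1 leaves, and every internal node has at least two children, so there are at most n internal nodes.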

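Suffix Tree Construction. The trivial algorithm, shown for s = abab, inserts the suffixes of abab$ one at a time, longest first:
1) Put the longest suffix, abab$, in.
2) Put the suffix bab$ in.
3) Put the suffix ab$ in.
4) Put the suffix b$ in.
5) Put the suffix $ in.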

We will also label each leaf with the starting point of the corresponding suffix. (Figure: the suffix tree with leaf labels 0, 2, 3, 1, 4.)
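A minimal sketch of the trivial algorithm, assuming the running example s = abab. For brevity it builds the uncompressed suffix trie rather than the compressed tree, so it can use quadratic space; the function names and the `None` leaf key are choices made for this illustration.

```python
def build_suffix_trie(s, terminator="$"):
    """Insert every suffix of s + terminator into a trie, longest suffix first."""
    text = s + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        # The terminator makes the suffixes prefix-free, so each suffix ends
        # at its own leaf; label that leaf with the suffix's starting point.
        node[None] = start
    return root


def leaf_labels(node):
    """Collect leaf labels below node, following edges in sorted letter order."""
    labels = []
    for ch in sorted(key for key in node if key is not None):
        labels.extend(leaf_labels(node[ch]))
    if None in node:
        labels.append(node[None])
    return labels


print(leaf_labels(build_suffix_trie("abab")))  # [4, 2, 0, 3, 1]
```

The order of the printed labels depends on how the edges are ordered; here "$" sorts before the letters, which need not match the left-to-right order of a drawn tree.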

Suffix Tree Construction. The trivial algorithm takes O(n^2) time. It is possible to build a suffix tree in O(n) time using Ukkonen's algorithm. But how can that be? Does the tree even take O(n) space? Written out as strings, the edge labels can have total length quadratic in n. To use only O(n) space, encode the edge labels as (beginning-position, end-position) pairs into s.

Consider the string aaaaaabbbbbb$ (positions 0 to 12). The edge labelled bbbbbb$ is stored as (6,12). The full set of edge labels is:
(0,0) (6,12) (1,1) (6,12) (2,2) (6,12) (3,3) (6,12) (4,4) (6,12) (5,12) (12,12) (6,6) (12,12) (7,7) (12,12) (8,8) (12,12) (9,9) (12,12) (10,10) (12,12) (11,12)
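To see what the position pairs stand for, the sketch below decodes a few of them. It assumes the 13-character string aaaaaabbbbbb$ (six a's, six b's and the terminator), which is consistent with the pairs listed above; `edge_label` is a hypothetical helper name.

```python
text = "aaaaaabbbbbb$"  # assumed example string, positions 0..12


def edge_label(text, begin, end):
    """Decode an edge stored as an inclusive (begin, end) pair into its string."""
    return text[begin:end + 1]


# Each edge costs two integers, however long the string it spells.
for begin, end in [(0, 0), (6, 12), (5, 12), (12, 12), (11, 12)]:
    print((begin, end), edge_label(text, begin, end))
```

Stored this way, an edge such as (6,12) takes constant space even though it spells seven characters, so a tree with O(n) edges takes O(n) space overall.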

Suffix Tree Applications. What can we do with it?
1) Exact String Matching
2) Exact Set Matching
3) The Substring Problem for a Database of Patterns
4) Longest Common Substring of Two Strings
5) Recognising DNA Contamination
6) Common Substring of More Than Two Strings

Exact String Matching. Given a text T (|T| = n), pre-process it such that when a pattern P (|P| = m) arrives you can quickly decide whether it occurs in T. We may also want to find all occurrences of P in T.

Exact String Matching. In pre-processing, we just build a suffix tree of T in O(n) time.

Exact String Matching. Given a pattern P, we traverse the tree from the root according to the pattern. If we do not get stuck traversing the pattern then the pattern occurs in the text, otherwise it does not. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(m + k) time.
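The traversal can be sketched as follows. This is a simplified illustration on an uncompressed suffix trie, so collecting the leaves is not O(k) as it would be in a real suffix tree; the function names and the example text abab$ are assumptions of the sketch.

```python
def build_suffix_trie(text):
    """Suffix trie of text (which should end in a unique terminator)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # leaf label: starting point of the suffix
    return root


def occurrences(trie, pattern):
    """Return the sorted starting positions of pattern in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []  # we got stuck, so the pattern does not occur
        node = node[ch]
    # Every leaf below the node we reached is one occurrence.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("abab$")
print(occurrences(trie, "ab"), occurrences(trie, "bb"))
```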

Exact String Matching. How do we match a pattern (query) against a database of strings (documents)?

Generalized Suffix Tree. Given a set of strings S, the generalized suffix tree of S is a compressed trie of all suffixes of each s ∈ S. To make these suffixes prefix-free we add a special char, say $, at the end of s. To associate each suffix with a unique string in S, add a different special char to each s. Each leaf node needs to be labelled by the document id together with the suffix position.

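Generalized Suffix Tree. For example, take two strings s1 and s2, terminated by $ and # respectively. Their generalized suffix tree contains every suffix of s1$ and every suffix of s2#, and each leaf is labelled with the string it belongs to and the starting position of the suffix. (Figure: generalized suffix tree for s1 and s2.)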

Longest Common Substring. Given two strings s1 and s2, we build their generalized suffix tree. Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring, and vice versa. Find such a node with the largest string depth.
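A rough sketch of this idea, under simplifying assumptions: it uses an uncompressed generalized suffix trie (quadratic space, so only suitable for short strings), and instead of inspecting leaf descendants it records on every node which of the two strings passes through it, which amounts to the same test. The function name and example strings are illustrative.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node reached by suffixes of both s1 and s2."""
    def new_node():
        return {"children": {}, "docs": set()}

    root = new_node()
    for doc_id, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, new_node())
                node["docs"].add(doc_id)  # a suffix of this string passes here

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if child["docs"] == {1, 2}:  # shared by both strings
                candidate = path + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best


print(longest_common_substring("abab", "aab"))
```

When several common substrings share the maximum length, which one is returned depends on traversal order.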

Lowest Common Ancestor. A lot more can be gained from the suffix tree if we pre-process it so that we can answer LCA queries on it in constant time.

Lowest Common Ancestor. Why? The string spelled on the path from the root to the LCA of two leaves is the longest common prefix (LCP) of the two corresponding suffixes.

Finding Maximal Palindromes. A palindrome is a string that reads the same forwards and backwards. To find all maximal palindromes in a string s (of length m), we build a generalized suffix tree for the string s and the reversed string s^r. The maximal palindrome with centre between positions i-1 and i extends on each side by the length of the LCP of the suffix at position i of s and the suffix at position m-i of s^r.

Finding Maximal Palindromes. For example, for a string s, prepare a generalized suffix tree for s$ and s^r#. For every i, find the LCA of the leaf for suffix i of s and the leaf for suffix m-i of s^r. With constant-time LCA queries, all maximal palindromes can be identified in linear time.
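The pairing of suffix i of s with suffix m-i of s^r can be checked with a small sketch. It computes each LCP by direct character comparison instead of a constant-time LCA query, so it is quadratic in the worst case, and it covers only palindromes whose centre lies between two positions (even length). The function names and the test string are illustrative assumptions.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def even_palindromes(s):
    """Maximal even-length palindromes of s, one per centre that has any."""
    m = len(s)
    rev = s[::-1]
    result = []
    for i in range(1, m):
        # Suffix i of s reads rightwards from the centre; suffix m-i of the
        # reversed string reads leftwards from the same centre.
        half = lcp_length(s[i:], rev[m - i:])
        if half > 0:
            result.append(s[i - half:i + half])
    return result


print(even_palindromes("cbaaba"))  # ['baab']
```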

(Figure: generalized suffix tree for s$ and s^r#, with leaves labelled by suffix positions 0 to 6.)

Suffix Tree Drawbacks. It is O(n) but the constant is quite big. It consumes a lot of space. Notice that if we indeed want to traverse an edge in O(1) time then we need an array (of pointers) of size |Σ| in each node, where Σ is the alphabet.

Suffix Array. It is much simpler and easier to implement. Compared with suffix trees, we lose some functionality, but we save space.

Suffix Array. For example, let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 2 0 3 1.

Suffix Array Construction.
The trivial algorithm: sort the suffixes directly, for example with Quicksort.
The linear time algorithm: build a suffix tree in O(n) time first, then traverse the tree depth-first, following the edges outgoing from each node in lexicographic order, and fill the suffix array with the leaf labels as they are visited. The suffix array can also be built in O(n) time directly.
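The trivial construction fits in one line of Python. This sketch relies on the built-in `sorted` (Timsort rather than Quicksort) and compares whole suffix slices, so it is far from linear time; it is only meant to make the definition concrete.

```python
def suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))  # [2, 0, 3, 1]
```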

Exact String Matching. How do we search for a pattern P in the text T, using the suffix array of T? If P occurs in T, then all its occurrences are consecutive in the suffix array. So we can do two binary searches on the suffix array: the first search locates the starting position of the interval, and the second one determines the end position. It takes O(m log n) time, as a single suffix comparison needs to compare up to m characters.

Exact String Matching. It is also possible to do it in O(m + log n) time with an additional array of LCP values. Manber & Myers (1990)

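T = mississippi, P = iss. L, M and R mark the left, middle and right positions of the binary search over the suffix array:

Position  Suffix
10        i
7         ippi
4         issippi
1         ississippi
0         mississippi
9         pi
8         ppi
6         sippi
3         sissippi
5         ssippi
2         ssissippi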
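The two binary searches can be sketched as below, using the mississippi example. The variables `lo`, `hi` and `mid` play the roles of L, R and M; the function names are illustrative, and the suffix array is built by plain sorting rather than a linear-time method.

```python
def suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(text, sa, pattern):
    """All starting positions of pattern in text, via two binary searches."""
    m = len(pattern)

    # First search: leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Second search: leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Everything in sa[left:lo] starts with pattern.
    return sorted(sa[left:lo])


text = "mississippi"
sa = suffix_array(text)
print(find_occurrences(text, sa, "iss"))  # [1, 4]
```

Each comparison looks at no more than m characters, which gives the O(m log n) bound stated above.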