Text search. CSE 392, Computers Playing Jeopardy!, Fall

Similar documents
1 Introduciton. 2 Tries /651: Algorithms CMU, Spring Lecturer: Danny Sleator

Exact String Matching Part II. Suffix Trees See Gusfield, Chapter 5

CS 345A Data Mining. MapReduce

String Matching. Pedro Ribeiro 2016/2017 DCC/FCUP. Pedro Ribeiro (DCC/FCUP) String Matching 2016/ / 42

Lecture 7 February 26, 2010

Lecture 5: Suffix Trees

Applied Databases. Sebastian Maneth. Lecture 14 Indexed String Search, Suffix Trees. University of Edinburgh - March 9th, 2017

CSE 5095 Topics in Big Data Analytics Spring 2014; Homework 1 Solutions

Lecture L16 April 19, 2012

Data Structures in Java

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

CSE 230 Intermediate Programming in C and C++ Binary Tree

Lecture 18 April 12, 2005

SORTING. Practical applications in computing require things to be in order. To consider: Runtime. Memory Space. Stability. In-place algorithms???

Indexing and Searching

Applications of Suffix Tree

Chapter 7. Space and Time Tradeoffs. Copyright 2007 Pearson Addison-Wesley. All rights reserved.

Clustering Lecture 8: MapReduce

COMP4128 Programming Challenges

Suffix Trees and Arrays

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Programming II (CS300)

CSE 214 Computer Science II Searching

CSE 214 Computer Science II Introduction to Tree

CSCI-401 Examlet #5. Name: Class: Date: True/False Indicate whether the sentence or statement is true or false.

58093 String Processing Algorithms. Lectures, Autumn 2013, period II

CSL 201 Data Structures Mid-Semester Exam minutes

Suffix links are stored for compact trie nodes only, but we can define and compute them for any position represented by a pair (u, d):

Indexing and Searching

LCP Array Construction

Data structures for string pattern matching: Suffix trees

INSTITUTE OF AERONAUTICAL ENGINEERING

CSE 100: B+ TREE, 2-3 TREE, NP- COMPLETNESS

Chapter 8: Data Abstractions

Indexing and Searching

PAPER Constructing the Suffix Tree of a Tree with a Large Alphabet

Data Structures And Algorithms

MapReduce. Stony Brook University CSE545, Fall 2016

CS 307 Final Spring 2009

CSE 214 Computer Science II Heaps and Priority Queues

Given a text file, or several text files, how do we search for a query string?

CMPUT 403: Strings. Zachary Friggstad. March 11, 2016

CSCI 104 Tries. Mark Redekopp David Kempe

Some Search Structures. Balanced Search Trees. Binary Search Trees. A Binary Search Tree. Review Binary Search Trees

Key Java Simple Data Types

Data Structures and Algorithms

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Data Structures and Methods. Johan Bollen Old Dominion University Department of Computer Science

UNIVERSITY OF CALIFORNIA Department of Electrical Engineering and Computer Sciences Computer Science Division. P. N. Hilfinger.

Module 9: Binary trees

Randomized Ternary Search Tries

Suffix-based text indices, construction algorithms, and applications.

Lecture Notes on Tries

CSE 417 Dynamic Programming (pt 4) Sub-problems on Trees

Balanced Search Trees. CS 3110 Fall 2010

Trees in java.util. A set is an object that stores unique elements In Java, two implementations are available:

Special course in Computer Science: Advanced Text Algorithms

Java Simple Data Types

B+ Tree Review. CSE332: Data Abstractions Lecture 10: More B Trees; Hashing. Can do a little better with insert. Adoption for insert

CSE 530A. B+ Trees. Washington University Fall 2013

CSE Data Structures and Introduction to Algorithms... In Java! Instructor: Fei Wang. Binary Search Trees. CSE2100 DS & Algorithms 1

Physical Level of Databases: B+-Trees

MI-PDB, MIE-PDB: Advanced Database Systems

Friday Four Square! 4:15PM, Outside Gates

Solution to Problem 1 of HW 2. Finding the L1 and L2 edges of the graph used in the UD problem, using a suffix array instead of a suffix tree.

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

File Organization and Storage Structures

[1] Strings, Arrays, Pointers, and Hashing

Data Structure and Algorithm Homework #3 Due: 1:20pm, Thursday, May 16, 2017 TA === Homework submission instructions ===

CSE 230 Computer Science II (Data Structure) Introduction

Homework 3: Wikipedia Clustering Cliff Engle & Antonio Lupher CS 294-1

Suffix trees and applications. String Algorithms

Assoc. Prof. Marenglen Biba. (C) 2010 Pearson Education, Inc. All rights reserved.

CSE 1223: Exam II Autumn 2016

Integer Algorithms and Data Structures

Parallel Programming Concepts

Data Structures and Algorithms. Course slides: Radix Search, Radix sort, Bucket sort, Patricia tree

Midterm 3 practice problems

DO NOT. In the following, long chains of states with a single child are condensed into an edge showing all the letters along the way.

CS61B Fall 2015 Guerrilla Section 3 Worksheet. 8 November 2015

Module 8: Binary trees

Semantic Analysis. CSE 307 Principles of Programming Languages Stony Brook University

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

CMSC423: Bioinformatic Algorithms, Databases and Tools. Exact string matching: Suffix trees Suffix arrays

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

Assoc. Prof. Marenglen Biba. (C) 2010 Pearson Education, Inc. All rights reserved.

4. Suffix Trees and Arrays

CS171 Final Practice Exam

CS 12 Fall 2003 Solutions for mid-term exam #2

Recall: Properties of B-Trees

Section 1: True / False (1 point each, 15 pts total)

CS 231 Data Structures and Algorithms Fall Binary Search Trees Lecture 23 October 29, Prof. Zadia Codabux

Review: Trees Binary Search Trees Sets in Java Collections API CS1102S: Data Structures and Algorithms 05 A: Trees II

PROGRAMMING FUNDAMENTALS

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

1. AVL Trees (10 Points)

CS 231 Data Structures and Algorithms Fall Recursion and Binary Trees Lecture 21 October 24, Prof. Zadia Codabux

Binary Search Trees. Carlos Moreno uwaterloo.ca EIT

Transcription:

Text search CSE 392, Computers Playing Jeopardy!, Fall 2011 Stony Brook University http://www.cs.stonybrook.edu/~cse392 1

Today 2 parts: theoretical: costs of searching substrings, data structures for string search practical: implementation of text search

Sub-array algorithm example Given an array {t,h,i,s,i,s,a,t,e,s,t} and a pattern {t,e,s,t}, write a program that checks whether the pattern is present in the array: public static boolean substring(char[] s, char[] sub){ for(int i=0; i < s.length - sub.length; i++) if(startswith(s,sub,i)) t i)) return true; return false; } public static boolean startswith(char[] s, char[] sub, int m){ for(int i=0; i<sub.length; i++) if(sub[i]!= s[m+i]) return false; return true; } Cost: m x n 3 10/4/2011

Suffix arrays and trees Idea: preprocess the text, so the search of the substring is fast Specialized data structures (e.g., tries) Assumption: no suffix is a prefix of another suffix (can be a substring, but not a prefix) Assure this by adding a character $ to end of S Costs: Build data structure for text (e.g., suffix tree) This is preprocessing O(m) Search time: For example: Suffix trees: O(n+k) where k is the number of occurrences of P in T

Suffix arrays Anarrayof integers giving i the starting ti positions of suffixes of a string in lexicographical order 1 2 3 4 5 6 7 8 T E S T I N G $ 8 suffixes: TESTING$, ESTING$, STING$, TING$, ING$, NG$, G$, $. One-based indexing: {8,2,5,7,6,3,1,4} index Sorted suffix lcp Longest common prefix: how many characters 8 $ 0 one suffix has in common with the one above it 2 ESTING$ 0 5 ING$ 0 7 G$ 0 6 NG$ 0 3 STING$ 0 1 TESTING$ 0 4 TING$ 1

Suffix arrays Construction: ti comparison sort or suffix trees Application: fast search of every occurrence of a substring within a string find every suffix that begins with the substring Cost: O(m log n) time if W <= suffixat(pos[1]) then ans = 1 else if W > suffixat(pos[n]) then ans = n else{ L = 1, R = n while R-L > 1 do{ M = (L + R)/2 if W <= suffixat(pos[m]) then R = M else L = M } ans = R }

Suffix tries Tries = ordered tree data structure that is used to store associative arrays where the keys are usually strings The time to insert, or to delete or to find is identical

Suffix trees A data structure that presents the suffixes of a given string in a way that allows for fast implementation of string operations

Building trees: O(m 2 ) algorithm Initialize One edge for the entire string S[1..m]$ For i = 2 to m Add suffix S[i..m] to suffix tree Find match point for string S[i..m] in current tree If in middle of edge, create new node w Add remainder of S[i..m] as edge label l to suffix i leaf Running Time O(m-i) time to add suffix S[i..m]

Assignment The Suffix Array Representing "BANANAS" The Suffix Trie Representing "BANANAS The Suffix Tree Representing "BANANAS"

Before search: Tokenization Automatically recognize words and sentences identify what constitutes an individual or distinct word, referred to as a token Tokenizer or lexer sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, email addresses, phone numbers, and URLs

Other indexes Theoretical: Gödel numbering (assigns to each symbol and well-formed formula of some formal language a unique natural number) not practical Hashing: fast, but not unique collisions, clustering B-trees: balanced search trees where every node has between m/2 and m children, where m>1 is a fixed integer

Inverted index A mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents T 0 = "it is what it is "a": {2} T 1 = "what is it T 2 = "it is a banana" "banana": {2} "is": {0, 1, 2} "it": {0, 1, 2} "what": {0, 1} search for the terms "what", "is" and "it" would give

Hash tables hash table: an array of some fixed size, that positions elements according to an algorithm called a hash function 0 14 elements (e.g., strings) hash hf func. h(element) length 1 hash table

Hashing, hash functions Map every element into some index in the array Lookup becomes constant-time: time: simply look at that one slot again later to see if the element is there add, remove, contains all become O(1)! Example: h(i) = i % array.length 15

B-trees The data items are stored at leaves The nonleaf nodes store up to M-1 keys to guide the searching; key I represents the smallest key in subtree I +1. The root is either a leaf or has between two and M children. All nonleaf nodes (except the root) have between [M/2] and M children All leaves are at the same depth and have between [L/2] and L children, for some L (the determination of L is described shortly).

Apache Lucene http://lucene.apache.org/ Tutorial: http://www.lucenetutorial.com/lucene-in-5-minutes.html

Parallelism: MapReduce Input: a set of key/value pairs User supplies two functions: map(k,v) list(k1,v1) reduce(k1, list(v1)) v2 (k1,v1) is an intermediate key/value pair Output is the set of (k1,v2) pairs Prasad (c) 2011 L06MapReduce 18 Pearson Education, Inc. & P.Fodor (CS Stony Brook)

Hadoop An open-source implementation of Map Reduce in Java Uses HDFS for stable storage Download from: http://lucene.apache.org/hadoop/ p. p http://developer.yahoo.com/hadoop/tutorial/module3.html Prasad (c) 2011 L06MapReduce 19 Pearson Education, Inc. & P.Fodor (CS Stony Brook)