L02 : 08/21/2015 L03 : 08/24/2015.


L02 : 08/21/2015
http://www.csee.wvu.edu/~adjeroh/classes/cs493n/

Multimedia used to be the idea of Big Data.
The definition of Big Data is a moving target (it will be different in 5 to 10 years).
Big Data is highly complex.
One way to look at Big Data is by its drivers: What makes Big Data possible? What is the hype?

1 Terabyte (TB) = 1,000 GB
1 Petabyte (PB) = 1,000 TB
1 Exabyte (EB) = 1,000 PB

L03 : 08/24/2015
Nature of Big Data
Challenges
Data structures for Big Data

5 V's
1) Velocity
   a) Data in Motion
   b) Streaming Data
2) Volume
3) Variety
4) Value
5) Veracity

L04 : 08/26/2015
OVERVIEW
I/O Problems (continued)
Searching on Big Data
Suffix trees
> Intro
> Properties
> Applications

We have a CPU, and the data is stored in RAM. We go back and forth between the CPU and RAM to do calculations.
The basic model assumes that there is infinite memory and that there is no difference between RAM access time and CPU time.
In practice, getting data from the bottom level of the memory hierarchy into RAM can take a lot of time.

Basic RAM model of computation:
Capability problem, given this situation:
> Disk input/output is really slow.
> A disk access takes approximately 10^6 times as long as the CPU needs to process the data.
Scalability problem:
> Processing time keeps growing as the data grows.

Ultimately we want a single I/O, but what we have is block I/O.
How to solve the problem of reading and writing data for computing systems:
> Technology can reduce the time between disk and CPU.
> Reduce the number of disk I/O operations.

Adjeroh's solution below. We need to introduce some notation:
B = number of items read in one block I/O (block size)
N = total number of items (amount of data we need to read)
M = number of items that can fit in main memory (main memory size)
Assume that memory is at least B^2 (M >= B^2).

To read every item you have to do N/B I/Os: simple scanning takes N/B I/Os rather than N I/O operations.
Locality is key!

Simple example: traversing a linked list with N = 10, B = 2, M = 4.
Basic Algorithm:

Reading data 2 items at a time.
> Basic algorithm: approximately N = 10 I/Os
Improved placement:
> Number of I/Os is approximately N/B = 5 I/Os

Consider N = 256 x 10^6, B = 8000, disk access time = 1 ms:
> Using the basic algorithm: N = 256 x 10^6 I/Os, time needed is about 71 hours
> Using improved placement: N/B = 32,000 I/Os, time needed is about 32 sec

*Block I/O is a hardware issue, but we must understand the software side of the issue.*

Standard results on block I/O:

Operation    Basic Algorithm    Improved Algorithm
Scanning     N                  N/B
Sorting      N log N            (N/B) * log_{M/B}(N/B)
Permuting    N                  min{ N, (N/B) * log_{M/B}(N/B) }
Searching    log_2 N            log_B N

We want to sort data to make reading data more efficient.
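The arithmetic behind the 71 hours versus 32 seconds comparison can be checked with a few lines of Python. This is only a sketch: the function names `scan_ios` and `scan_seconds` are made up for illustration, and it assumes every I/O costs the same fixed access time of 1 ms, as in the notes.

```python
import math

def scan_ios(n_items, block_size=1):
    """Number of I/Os needed to read n_items when one I/O fetches block_size items."""
    return math.ceil(n_items / block_size)

def scan_seconds(n_ios, access_time_ms=1):
    """Total time in seconds, assuming a fixed cost per I/O (an assumption of this sketch)."""
    return n_ios * access_time_ms / 1000

N = 256 * 10**6   # total number of items
B = 8000          # items per block

basic_seconds = scan_seconds(scan_ios(N))       # one item per I/O
blocked_seconds = scan_seconds(scan_ios(N, B))  # B items per I/O

print(basic_seconds / 3600)  # a little over 71 hours
print(blocked_seconds)       # 32 seconds
```

The same helper reproduces the small linked-list example: `scan_ios(10, 2)` gives 5 I/Os instead of 10.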

Search Data Structures
> Finding the item
> Ranking the web pages

Simple Naive Search
Given the database T and the pattern P, find all positions in T where P occurs.
Three types of search questions:
> Decision query
> Counting query
> Enumerate/location query

T: positions 1 .. N
P: positions 1 .. m
The naive search will take a long time to find the answer: O(Nm). We need to focus on decreasing this time to seconds.

L03 : 08/24/2015
Suffix Trees
> Intro
> Searching With
> Construction
> Problems
Suffix Arrays

Naive Search Algorithm
Inputs: T = t_1 t_2 ... t_n
        P = p_1 p_2 ... p_m
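A minimal sketch of the naive search in Python, written from the problem statement above rather than from the slide. The function name `naive_search` is hypothetical. It slides P over every start position of T and compares symbol by symbol, which is where the O(nm) worst case comes from.

```python
def naive_search(text, pattern):
    """Return all 1-based positions in text where pattern occurs."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):          # n - m + 1 candidate start positions
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                      # up to m comparisons per position
        if j == m:
            hits.append(i + 1)          # report 1-based position, as in the notes
    return hits

T = "acraca$"
hits = naive_search(T, "ac")

found = len(hits) > 0   # decision query
count = len(hits)       # counting query
where = hits            # enumerate/location query
```

The three query types differ only in how much of the result they need: a yes/no answer, a count, or the full list of positions.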

Best case is n.
> If you have a big data set, n can be quite big.
Overall time = O((n - m + 1)m) = O(nm)
> On average: O(n)
EXAMPLE: searching on Google

Suffix Trees
T = acraca$
    1234567
Prefixes: a, ac, acr, ..., acraca$
Suffix Tree (ST):
> A tree that represents all of the suffixes of a given string.
ex. T = acraca$

> If we take a given node, the branches from that node start with different symbols.
> There are algorithms for constructing these trees (in the slide handout).
> Look at SUFFIX TREE FROM LCA LIST to construct a tree in linear time. (Pg. 70)
Storing the data in O(n) space is a problem. A search in the trie only takes O(m).
A suffix tree requires 33n bytes to store (each integer is 4 bytes).

L06 : 08/31/2015
Problems with STs
Suffix Arrays
> Intro
> Searching with
> Construction

Generalized Suffix Tree
If we have multiple sequences and want to search on them at the same time:
ex. T_1, T_2, ..., T_k
T = T_1 $_1 T_2 $_2 ... T_k $_k

Representing a node as an array
> Consider the two types of nodes:
> Internal Node:
> Leaf Node:

Ways to represent nodes (search time):
> O(m): using an array at each node (fastest search)
> O(m * |Sigma|): using a linked list at each node
> O(m * log |Sigma|): using a binary tree at each node
(|Sigma| is very small compared to the total length of the sequence)

Size of the ST
> Original text = 1n bytes (assuming |Sigma| is 256)
> We can represent 1 symbol using 1 byte.
> At each node we have an integer ID.
Internal nodes:
> Node ID: 1 int = 4n bytes
> Parent ID: 1 int = 4n bytes
> Edge labels: 2 ints = 8n bytes
Leaf nodes:
> ID: 4n bytes
> Parent: 4n bytes
Suffix links: 8n bytes
Total: 1n + 32n = 33n bytes
The issue is that 33n bytes can be quite huge when n is large.

T = a c r a c a $
    1 2 3 4 5 6 7

Suffix      Start position
acraca$     1
craca$      2
raca$       3
aca$        4
ca$         5
a$          6
$           7
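Sorting the suffixes listed above gives the suffix array used in the rest of the notes. A minimal sketch of this direct approach follows; the function name `suffix_array_naive` is hypothetical, and it assumes `$` sorts before every letter, which holds for ASCII.

```python
def suffix_array_naive(text):
    """Return the 1-based start positions of the suffixes of text, in sorted order."""
    positions = range(1, len(text) + 1)
    # Compare suffixes as whole strings; each comparison can cost O(n).
    return sorted(positions, key=lambda i: text[i - 1:])

T = "acraca$"
SA = suffix_array_naive(T)
for pos in SA:
    print(pos, T[pos - 1:])   # start position and the suffix it refers to
```

Only the text (1 byte per symbol) and one integer per suffix (4 bytes each) need to be kept, which is why the suffix array is so much smaller than the 33n bytes of the suffix tree.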

Searching with the SA:
> Binary search using the SA, based on the example above:
P = c r y
    1 2 3
P = p_1 p_2 ... p_m
SA = [7 6 4 1 5 2 3]
STEP 1: c == T[SA[4]] = a?  NOPE, c > a
STEP 2: c == T[SA[6]] = c?  YES
m is the number of binary searches we need to make, so the search takes O(m log n) time.
> Size will be 1n + 4n bytes = 5n bytes
> We want to avoid suffix trees and move to suffix arrays.

L07 : 09/02/2015
Searching on SA
SA Construction
LCP (Longest Common Prefix)
From SA to ST

Recall:
T = a c r a c a $
    1 2 3 4 5 6 7
P = c r y
    1 2 3
n = |T|, m = |P|

i    SA[i]    Sorted suffix
1    7        $
2    6        a$
3    4        aca$
4    1        acraca$
5    5        ca$
6    2        craca$
7    3        raca$

*Trace through this example with the code below to find out if the pattern matches.*
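The code from the lecture did not survive in these notes, so here is a minimal Python sketch of binary search over the suffix array for tracing the example. The function name `sa_contains` is hypothetical. One simplification to note: it compares the first m symbols of a suffix against P in a single string comparison, instead of symbol by symbol with an index k.

```python
def sa_contains(text, sa, pattern):
    """Decision query: does pattern occur in text? sa holds 1-based start positions."""
    m = len(pattern)
    low, high = 0, len(sa) - 1            # 0-based indices into sa
    while low <= high:
        mid = (low + high) // 2
        start = sa[mid] - 1               # convert 1-based position to a Python index
        prefix = text[start:start + m]    # first m symbols of the suffix at SA[mid]
        if prefix == pattern:
            return True
        if prefix < pattern:
            low = mid + 1                 # pattern sorts after this suffix: go right
        else:
            high = mid - 1                # pattern sorts before this suffix: go left
    return False

T = "acraca$"
SA = [7, 6, 4, 1, 5, 2, 3]                # suffix array from the table above
print(sa_contains(T, SA, "cry"))          # the example pattern from the notes
print(sa_contains(T, SA, "ca"))
```

For P = cry the search first probes the suffix at position 1 (acraca$), moves right because c > a, then probes the suffix at position 2 (craca$), matching the two steps traced in the notes.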

> We can traverse the suffix tree leaves from left to right to obtain the suffix array.

Searching with SA (via Binary Search)
Example:
T = a c r a c a $
    1 2 3 4 5 6 7
P = c r y
    1 2 3
When k = 1: low = 1, high = 7, mid = (1 + 7)/2 = 4
T_SA[mid][1] == P[1]?  T_1[1] = a, P[1] = c: NO, c > a
low = mid + 1 = 5; mid = (low + high)/2 = (5 + 7)/2 = 6

ST: size(ST_T) >= 33n bytes
SA: size(SA_T) >= 5n bytes
> A suffix array is lightweight.

Construction of suffix array
1) Simply list the suffixes, then sort them. Each suffix has length up to n, so one comparison can cost O(n).
> Need O(n log n) * O(n) => O(n^2 log n)
2) Traverse the ST depth-first from left to right.
=> O(n) time, O(n) space
*Look at the Manber-Myers suffix sorting algorithm in the text*

L08 : 09/04/2015
Suffix Arrays (continued)
> Construction
> LCP
PageRank
> Intro
> Algorithm
> Problems
> Trust Rank

O(n^2 log n) direct sorting of suffixes

O(n) via ST
Today we will go through O(n log n) successive doubling (without ST), and we will talk about O(n) construction without ST.

History of ST and SA
> The whole idea of the suffix tree was introduced in 1973.
> It was not until around 1991 that we had what is now called Ukkonen's Algorithm. (33n)
> Farach in 1996 introduced dividing suffixes into two groups. (76n)
> In 1993 the suffix array was introduced: Manber & Myers.
  T + SA = (1 + 4)n bytes = 5n bytes. Required O(n log n) to construct.
> Use the first column to induce the other columns.
> Can exploit the letters already found in previous columns.

Successive doubling: Constructing Suffix Array
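The successive doubling idea can be sketched in Python as below. This is an illustration, not the Manber-Myers implementation: it uses Python's general-purpose sort in every round, so it runs in O(n log^2 n) rather than the O(n log n) obtained with bucket sorting. The function name `suffix_array_doubling` is hypothetical. In each round the suffixes are ranked by their first k symbols, and the ranks of positions i and i + k are combined to rank them by their first 2k symbols.

```python
def suffix_array_doubling(text):
    """Suffix array (1-based positions) built by successive doubling."""
    n = len(text)
    rank = [ord(c) for c in text]     # round 0: rank by the first symbol only
    sa = list(range(n))
    k = 1
    while n > 0:
        def key(i):
            # Rank of the first k symbols, then rank of the next k symbols.
            # -1 marks "past the end", so a shorter suffix sorts first.
            second = rank[i + k] if i + k < n else -1
            return (rank[i], second)

        sa.sort(key=key)
        new_rank = [0] * n
        for idx in range(1, n):
            changed = key(sa[idx - 1]) != key(sa[idx])
            new_rank[sa[idx]] = new_rank[sa[idx - 1]] + (1 if changed else 0)
        rank = new_rank
        if rank[sa[-1]] == n - 1:     # all ranks distinct: order is final
            break
        k *= 2                        # compare twice as many symbols next round
    return [i + 1 for i in sa]

print(suffix_array_doubling("acraca$"))
```

For T = acraca$ two rounds are enough: after k = 1 the suffixes at positions 1 and 4 still tie on "ac", and the k = 2 round separates them.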

> LCP (Longest Common Prefix)
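A minimal sketch of an LCP array for the running example, assuming the usual convention that entry i holds the length of the longest common prefix of the suffixes at SA[i-1] and SA[i], with the first entry set to 0. The function name `lcp_array` is hypothetical, and the direct comparison used here is quadratic in the worst case; linear-time methods exist but are not shown.

```python
def lcp_array(text, sa):
    """lcp[i] = longest common prefix length of the suffixes at sa[i-1] and sa[i]."""
    n = len(text)

    def common_prefix_len(a, b):
        # a and b are 0-based start indices of two suffixes
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        return k

    lcp = [0] * len(sa)                   # lcp[0] stays 0 by convention
    for i in range(1, len(sa)):
        lcp[i] = common_prefix_len(sa[i - 1] - 1, sa[i] - 1)
    return lcp

T = "acraca$"
SA = [7, 6, 4, 1, 5, 2, 3]                # 1-based suffix array of T
LCP = lcp_array(T, SA)
print(LCP)
```

For example, the neighbours aca$ and acraca$ share the prefix "ac", so their entry is 2.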

*If you have your suffix array, you can construct a suffix tree and find the LCP.*
LCA = lowest common ancestor
depth of LCA: the LCP of two suffixes is the string depth of the LCA of their leaves in the suffix tree.

Page Rank

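The damping factor d means that every node always has some chance of reaching any other node: with probability d the surfer follows a link, and with probability 1 - d the surfer jumps to a random node.
PR_i(k+1) = (1 - d)/n + d * (sum over nodes j linking to i of PR_j(k) / L_j)
where n is the number of nodes, k is the iteration, and L_j is the number of outgoing links of node j.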
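A minimal sketch of the standard PageRank iteration in Python, to make the role of the damping factor concrete. Several things here are assumptions and not from the notes: the function name `pagerank`, the value d = 0.85 (a commonly quoted default), the fixed number of iterations, and the treatment of nodes without outgoing links (their rank is spread evenly over all nodes). Every link target is assumed to appear as a key of `links`.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each node to the list of nodes it points to."""
    nodes = list(links)
    n = len(nodes)
    pr = {p: 1.0 / n for p in nodes}              # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: (1.0 - d) / n for p in nodes}  # random-jump term (1 - d)/n
        for p in nodes:
            out = links[p]
            if out:
                share = d * pr[p] / len(out)      # split rank over outgoing links
                for q in out:
                    new_pr[q] += share
            else:
                share = d * pr[p] / n             # assumption: dangling node links to all
                for q in nodes:
                    new_pr[q] += share
        pr = new_pr
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(ranks)
```

The ranks always sum to 1, and in this small graph C ends up ranked above B because it receives links from both A and B.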