L02 : 08/21/2015 L03 : 08/24/2015

L02 : 08/21/2015
http://www.csee.wvu.edu/~adjeroh/classes/cs493n/

> Multimedia used to be the idea of Big Data.
> The definition of Big Data keeps moving (it will be different in 5 to 10 years).
> Big Data is highly complex.
> One way to look at Big Data is by its drivers: What makes Big Data possible? What is the hype?

1 Terabyte (TB) = 1,000 GB
1 Petabyte (PB) = 1,000 TB
1 Exabyte (EB) = 1,000 PB

L03 : 08/24/2015
Nature of Big Data
Challenges
Data structures for Big Data

The 5 V's
1) Velocity
   a) Data in motion
   b) Streaming data
2) Volume
3) Variety
4) Value
5) Veracity

L04 : 08/26/2015
OVERVIEW
I/O Problems (continued)
Searching on Big Data
Suffix trees
   Intro
   Properties
   Applications

> We have a CPU, and the data is stored in RAM. We can go between CPU and RAM to do calculations.
> Assume there is infinite memory.
> Assume there is no difference between the access time for RAM and for the CPU.
> In practice, getting data from the bottom level (disk) up to RAM could take a lot of time.

[Figure: basic RAM model of computation]

Capability problem: disk input/output is really slow. Disk access time is approximately 10^6 times the time the CPU needs to process the data.
Scalability problem: processing time keeps growing.

Ultimately we want a single I/O, but what we have is block I/O.
How to solve the problem of reading and writing data in computing systems:
> Technology can reduce the time between disk and CPU.
> Reduce the number of disk I/O operations.

Adjeroh's solution is below. We need to introduce some notation:
B = number of items read at a time (items per block)
N = total number of items (amount of data we need to read)
M = number of items that can fit in main memory (main memory size)
Assume that memory is bigger than B^2 (M >= B^2).

If you want to read every item you have to do N/B I/Os: simple scanning takes N/B I/Os rather than N I/O operations.
Locality is key!!!

Simple example: traversing a linked list, with N = 10, B = 2, M = 4.
Basic algorithm:

Reading data 2 items at a time.
> Basic algorithm: approximately N = 10 I/Os
> Improved placement: approximately N/B = 5 I/Os

Consider when N = 256 x 10^6, B = 8000, disk access time = 1 ms:
> Using the basic algorithm: time needed = N x 1 ms, approximately 71 hours
> Using improved placement: time needed = (N/B) x 1 ms, approximately 32 sec

*Block I/O is a hardware issue, but we must understand the software side of the issue.*

Standard results on block I/O (number of I/Os):

Operation    Basic Algorithm    Improved Algorithm
Scanning     N                  N/B
Sorting      N log N            (N/B) log_{M/B}(N/B)
Permuting    N                  min{ N, (N/B) log_{M/B}(N/B) }
Searching    log_2 N            log_B N

We want to sort data to make reading data more efficient.
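The I/O counts and running times above can be checked with a small simulation. This is a minimal sketch, not the lecture's own code: the function name count_block_ios, the LRU cache holding M/B blocks, and the two placements (scattered and contiguous) are assumptions chosen to reproduce the N = 10 and N/B = 5 figures.

```python
from collections import OrderedDict


def count_block_ios(visit_order, block_of, cache_blocks):
    """Count block reads when items are visited in order.

    Assumption: main memory acts as an LRU cache of `cache_blocks` blocks.
    """
    cache = OrderedDict()
    ios = 0
    for item in visit_order:
        block = block_of[item]
        if block in cache:
            cache.move_to_end(block)       # block already in memory
        else:
            ios += 1                       # one disk read brings in a whole block
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return ios


N, B, M = 10, 2, 4
order = list(range(N))                      # list nodes in traversal order

# Hypothetical bad placement: neighbouring list nodes sit in different blocks.
scattered = {i: i % (N // B) for i in order}
# Improved placement: neighbouring list nodes share a block.
contiguous = {i: i // B for i in order}

basic_ios = count_block_ios(order, scattered, M // B)
improved_ios = count_block_ios(order, contiguous, M // B)

# The larger example: 1 ms per disk access.
N_big, B_big = 256 * 10**6, 8000
basic_hours = N_big / 1000 / 3600           # N accesses, converted from ms to hours
improved_seconds = (N_big // B_big) / 1000  # N/B accesses, converted from ms to seconds

print(basic_ios, improved_ios, round(basic_hours, 1), improved_seconds)
```

With the scattered placement every access misses the two-block cache, so the traversal costs N I/Os; with the contiguous placement each block is read once, giving N/B I/Os.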

Search Data Structures
> Finding the item
> Ranking the web pages

Simple Naive Search
Given the database T and the pattern P, find all positions in T where P occurs.
Three types of search questions:
> Decision query (does P occur in T?)
> Counting query (how many times does P occur?)
> Enumerate/location query (where does P occur?)

T: positions 1..N
P: positions 1..m
The naive search takes O(Nm), which is a long time on big data. We need to focus on decreasing this time to seconds.

L03 : 08/24/2015
Suffix Trees
   Intro
   Searching With
   Construction
   Problems
Suffix Arrays

Naive Search Algorithm
Inputs: T = t_1 t_2 ... t_n
        P = p_1 p_2 ... p_m
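A minimal sketch of the naive search, written to answer all three query types. The function name naive_search and the sample strings are assumptions for illustration; positions are reported 1-based to match the notes.

```python
def naive_search(text, pattern):
    """Return all 1-based positions in text where pattern occurs.

    Tries every alignment i and compares up to m symbols, so the
    worst case is O((n - m + 1) * m) = O(nm).
    """
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:                 # all m symbols matched at alignment i
            positions.append(i + 1)
    return positions


hits = naive_search("acraca$", "a")
occurs = len(hits) > 0             # decision query
count = len(hits)                  # counting query
print(occurs, count, hits)         # enumerate/location query is `hits` itself
```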

> Best case is n.
> If you have a big data set, n can be quite big.
> Overall time = O((n - m + 1) m) = O(nm)
> On average: O(n)
EXAMPLE: searching on Google

Suffix Trees
T = a c r a c a $
    1 2 3 4 5 6 7
Prefixes: a, ac, acr, ..., acraca$
Suffix Tree (ST):
> A tree that represents all of the suffixes of a given string.
ex. T = acraca$
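To make the idea concrete, here is a minimal sketch that inserts every suffix of T into an uncompressed trie built from nested dictionaries. This is an assumption-laden illustration, not the linear-time construction from the handout: a real suffix tree compresses chains of single-child nodes into labelled edges, while this version can use O(n^2) nodes. The names build_suffix_trie and trie_contains are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts (uncompressed)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # follow or create the branch for ch
    return root


def trie_contains(root, pattern):
    """Decision query: walk one branch per pattern symbol, O(m) steps."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("acraca$")
print(sorted(trie))                  # first symbols of the branches leaving the root
print(trie_contains(trie, "rac"), trie_contains(trie, "cry"))
```

Every substring of T is a prefix of some suffix, which is why a single walk from the root answers the decision query.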

> If we take a given node, the branches from that node all start with different symbols.
> There are algorithms for constructing these trees (in the slide handout).
> Look at SUFFIX TREE FROM LCA LIST to construct a tree in linear time. (Pg. 70)
> Storage is O(n), but that is still a problem. A search on the tree only takes O(m).
> A suffix tree requires 33n bytes to store (each integer is 4 bytes).

L06 : 08/31/2015
Problems with STs
Suffix Arrays
   Intro
   Searching with
   Construction

Generalized Suffix Tree
If we have multiple sequences and want to search on them at the same time:
ex. T_1, T_2, ..., T_k
    T = T_1 $_1 T_2 $_2 ... T_k $_k

Representing a node as an array
> Consider the two types of nodes:
> Internal node:
> Leaf node:

Ways to represent nodes (search time for a pattern of length m):
> O(m): using arrays at each node (fastest search)
> O(m * |Sigma|): using a linked list at each node
> O(m * log |Sigma|): using a binary tree at each node
(|Sigma| is very small compared to the total length of the sequence.)

Size of the ST
> Original text = 1n bytes (assuming |Sigma| is 256, we can represent 1 symbol using 1 byte)
> At each node we have an integer ID.
> Internal nodes:
     Node ID       1 int  = 4n bytes
     Parent ID     1 int  = 4n bytes
     Edge labels   2 ints = 8n bytes
> Leaf nodes:
     ID            4n bytes
     Parent        4n bytes
> Suffix links     8n bytes
Total: 1n + 4n + 4n + 8n + 4n + 4n + 8n = 33n bytes
The issue is that 33n bytes can be quite huge when n is big.

T = a c r a c a $
    1 2 3 4 5 6 7
Suffixes of T:
   1  acraca$
   2  craca$
   3  raca$
   4  aca$
   5  ca$
   6  a$
   7  $
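Sorting the suffixes listed above and keeping only their start positions gives the suffix array. A minimal sketch, assuming the hypothetical name suffix_array_naive, 1-based positions, and that $ sorts before the letters (true for ASCII):

```python
def suffix_array_naive(text):
    """Sort the 1-based suffix start positions by the suffix each one begins."""
    starts = range(1, len(text) + 1)
    return sorted(starts, key=lambda i: text[i - 1:])


T = "acraca$"
SA = suffix_array_naive(T)
for rank, start in enumerate(SA, 1):
    print(rank, start, T[start - 1:])   # rank, SA entry, sorted suffix
```

This direct approach compares whole suffixes, so it is only meant for small examples.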

Searching with the SA:
> Binary search using the SA, based on the example above:
P = c r y        P = p_1 p_2 ... p_m
    1 2 3
SA = [7 6 4 1 5 2 3]
STEP 1: is c == T_SA[4][1] = a ?  NOPE, c > a
STEP 2: is c == T_SA[6][1] = c ?  YES
> Each of the O(log n) binary search steps compares up to m symbols, so searching takes O(m log n).
> Size will be 1n + 4n bytes = 5n bytes.
> We want to avoid suffix trees and get into suffix arrays.

L07 : 09/02/2015
Searching on SA
SA Construction
LCP (Longest Common Prefix)
From SA to ST

Recall:
T = a c r a c a $        P = c r y
    1 2 3 4 5 6 7            1 2 3
n = |T|, m = |P|

 i   SA[i]   Sorted suffix
 1   7       $
 2   6       a$
 3   4       aca$
 4   1       acraca$
 5   5       ca$
 6   2       craca$
 7   3       raca$

*Trace through this example with the code below to find out if the pattern matches.*
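A minimal sketch of a binary search over the suffix array, not necessarily the listing used in lecture. The function name sa_search is an assumption, and for brevity it compares the first m symbols of a suffix against P in one slice comparison rather than symbol by symbol.

```python
def sa_search(text, sa, pattern):
    """Decision query on a suffix array with 1-based entries: O(m log n)."""
    m = len(pattern)
    low, high = 0, len(sa) - 1          # 0-based indices into sa
    while low <= high:
        mid = (low + high) // 2
        start = sa[mid] - 1             # convert the 1-based suffix start
        prefix = text[start:start + m]  # first m symbols of that suffix
        if prefix == pattern:
            return True
        if prefix < pattern:
            low = mid + 1               # pattern sorts after this suffix
        else:
            high = mid - 1              # pattern sorts before this suffix
    return False


T = "acraca$"
SA = [7, 6, 4, 1, 5, 2, 3]
print(sa_search(T, SA, "cry"), sa_search(T, SA, "cr"))
```

For P = cry the search probes the 4th entry (acraca$), then the 6th (craca$), then the 7th (raca$), and stops without a match.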

> We can traverse the suffix tree depth first, from left to right; the leaves give us the suffix array.

Searching with SA (via Binary Search)
Example:
T = a c r a c a $        P = c r y
    1 2 3 4 5 6 7            1 2 3
When k = 1 (low = 1, high = 7):
   mid = (1 + 7)/2 = 4
   T_SA[mid][1] == P[1] ?  T_1[1] = a, P[1] = c: NO, c > a
   low = mid + 1 = 5; mid = (low + high)/2 = 6

ST: size(ST_T) >= 33n bytes
SA: size(SA_T) >= 5n bytes
> A suffix array is lightweight compared to a suffix tree.

Construction of suffix array
1) Simply list the suffixes, then sort them. Each suffix has length up to n.
   > Need O(n log n) comparisons * O(n) per comparison => O(n^2 log n)
2) Traverse the ST depth first from left to right.
   => O(n) time, O(n) space
*Look at the Manber-Myers suffix sorting algorithm in the text.*

L08 : 09/04/2015
Suffix Arrays (continued)
   Construction
   LCP
PageRank
   Intro
   Algorithm
   Problems
   Trust Rank

> O(n^2) direct sorting of suffixes

> O(n) via ST
Today we will go through O(n log n) successive doubling (without ST), and we will talk about O(n) construction without ST.

History of ST and SA
> The whole idea of the suffix tree was introduced in 1973.
> It was not until around 1991 that we had what is now called Ukkonen's algorithm. (33n)
> Farach in 1996 introduced dividing the suffixes into two groups. (76n)
> In 1993 the suffix array was introduced: Manber & Myers. T + SA = (1 + 4)n bytes = 5n bytes. Required O(n log n) to construct.

> Use the first column to induce the other columns.
> Can exploit the letters already found in previous columns.
Successive doubling: Constructing Suffix Array
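A minimal sketch of successive doubling. It ranks suffixes by their first k symbols, then reuses those ranks to order them by their first 2k symbols, doubling k until all ranks are distinct. The name suffix_array_doubling is an assumption, and this version uses Python's comparison sort in each round, so it runs in O(n log^2 n); the O(n log n) bound quoted in the notes needs bucket or radix sorting in each round instead.

```python
def suffix_array_doubling(text):
    """Build a 1-based suffix array by successive (prefix) doubling."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]   # round 0: rank by the first symbol
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # Rank of the first k symbols, then rank of the next k symbols
            # (-1 when the suffix is too short to have a second half).
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for idx in range(1, n):
            prev, cur = sa[idx - 1], sa[idx]
            new_rank[cur] = new_rank[prev] + (key(cur) != key(prev))
        rank = new_rank
        if rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            break
        k *= 2
    return [i + 1 for i in sa]


print(suffix_array_doubling("acraca$"))
```

On T = acraca$ the first round separates suffixes by their first two symbols and leaves aca$ and acraca$ tied; the second round breaks the tie using the ranks already computed.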

> LCP (Longest Common Prefix)
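A minimal sketch of an LCP array for the running example, computed by directly comparing neighbouring suffixes in sorted order. The name lcp_array and the convention that entry r holds the LCP of the suffixes at ranks r-1 and r (with the first entry set to 0) are assumptions; direct comparison can take O(n^2) time, and faster methods exist.

```python
def lcp_array(text, sa):
    """LCP of each suffix with its predecessor in sorted order (sa is 1-based)."""
    n = len(text)

    def common_prefix_len(a, b):
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        return length

    lcp = [0] * len(sa)
    for r in range(1, len(sa)):
        lcp[r] = common_prefix_len(sa[r - 1] - 1, sa[r] - 1)
    return lcp


T = "acraca$"
SA = [7, 6, 4, 1, 5, 2, 3]
LCP = lcp_array(T, SA)
print(LCP)
```

For example, aca$ and acraca$ are neighbours in sorted order and share the prefix ac, so their entry is 2.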

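*If you have your suffix array you can construct a suffix tree and find the LCP.*
LCA = lowest common ancestor
depth of LCA: the string depth of the LCA of two leaves equals the length of the longest common prefix of their two suffixes.

Page Rank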

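The damping factor d models a surfer who follows a link with probability d and jumps to a random page with probability 1 - d, so every node effectively always points to every other node.

PR_i(k+1) = (1 - d)/n + d * SUM_{j -> i} PR_j(k) / out(j)

where n is the number of pages, the sum runs over the pages j that link to page i, and out(j) is the number of outgoing links of page j.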
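A minimal sketch of the iterative PageRank computation on a tiny hypothetical graph. The function name pagerank, the value d = 0.85, the fixed iteration count, and the treatment of pages with no outgoing links (their rank is spread evenly over all pages) are assumptions, not details from the notes.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the PageRank update on links: {page: [pages it links to]}.

    Assumption: every page appears as a key, even if it has no out-links.
    """
    nodes = list(links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}            # start from the uniform vector
    for _ in range(iterations):
        new_pr = {v: (1 - d) / n for v in nodes}  # random-jump term
        for j, outs in links.items():
            if outs:
                share = d * pr[j] / len(outs)   # j splits its rank over its links
                for i in outs:
                    new_pr[i] += share
            else:
                share = d * pr[j] / n           # dangling page: spread evenly
                for i in nodes:
                    new_pr[i] += share
        pr = new_pr
    return pr


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical 3-page web
ranks = pagerank(graph)
print({page: round(score, 3) for page, score in ranks.items()})
```

The scores stay a probability distribution (they sum to 1), and page C, which receives links from both A and B, ends up ranked highest.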
