Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

1 Big Data Infrastructure: CS 489/698 Big Data Infrastructure (Winter 2017). Week 4: Analyzing Text (/). January 6, 2017. Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license.

2 Source: Search!

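3 Abstract IR Architecture [Figure: online, the query passes through a representation function to give a query representation; offline, the documents pass through a representation function to give a document representation, which is stored in the index; a comparison function matches the query representation against the index and returns hits.]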

4 Doc 1: one, two. Doc 2: red, blue. Doc 3: cat in the hat. Doc 4: green eggs and ham. [Figure: term-document matrix with one row per term (blue, cat, egg, green, ham, hat, one, red, two) and one column per document (1 to 4).] What goes in each cell? Boolean, count, or positions.

5 [Figure: the same term-document matrix.] Indexing: building this structure. Retrieval: manipulating this structure.

6 [Figure: the matrix rewritten as an inverted index: each term points to a postings list of the docids that contain it, e.g. cat → 3, hat → 3, egg → 4, green → 4, ham → 4.]

7 [Figure: the inverted index with a df stored for each term and a tf stored in each posting.]

8 [Figure: the inverted index with a df for each term and, in each posting, the docid, the tf, and the list of positions at which the term occurs.]
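The structure built up on slides 4 to 8 can be sketched in a few lines. This is an illustrative sketch only: `build_index` is a hypothetical helper, and whitespace tokenization with 1-based positions and no stopword removal are assumptions, not something the slides specify.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to (df, postings), where postings is a docid-sorted
    list of (docid, tf, positions) and positions are 1-based token offsets."""
    positions = defaultdict(dict)  # term -> {docid: [positions]}
    for docid, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            positions[term].setdefault(docid, []).append(pos)
    index = {}
    for term, by_doc in positions.items():
        postings = [(d, len(p), p) for d, p in sorted(by_doc.items())]
        index[term] = (len(postings), postings)  # df = number of postings
    return index

docs = {3: "cat in the hat", 4: "green eggs and ham"}
index = build_index(docs)
print(index["hat"])  # (1, [(3, 1, [4])])
```

Storing the tf is redundant once positions are kept (it equals the length of the position list), but keeping it explicit mirrors the layout on the slide.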

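9 Inverted Indexing with MapReduce [Figure: data flow for Doc 1 (one, two), Doc 2 (red, blue), and Doc 3 (cat in the hat).] Map: each document emits one (term, posting) pair per term. Shuffle and Sort: aggregate values by keys. Reduce: the postings of each term are collected into a single postings list.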

10 Inverted Indexing: Pseudo-Code

    class Mapper
        method Map(docid n, doc d)
            H ← new AssociativeArray               ▷ histogram to hold term frequencies
            for all term t ∈ doc d do              ▷ processes the doc, e.g., tokenization and stopword removal
                H{t} ← H{t} + 1
            for all term t ∈ H do
                Emit(term t, posting ⟨n, H{t}⟩)    ▷ emits individual postings

    class Reducer
        method Reduce(term t, postings [⟨n, f⟩ ...])
            P ← new List
            for all ⟨n, f⟩ ∈ postings [⟨n, f⟩ ...] do
                P.Append(⟨n, f⟩)                   ▷ appends postings unsorted
            P.Sort()                               ▷ sorts for compression
            Emit(term t, postingsList P)
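The pseudo-code on slide 10 can be simulated on a single machine. The sketch below is not Hadoop code: `map_doc`, `reduce_term`, and `run_job` are hypothetical names, a dictionary stands in for the shuffle-and-sort phase, and whitespace tokenization is assumed.

```python
from collections import Counter, defaultdict

def map_doc(docid, text):
    """Mapper: emit (term, (docid, tf)) for each distinct term in the doc."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (docid, tf)

def reduce_term(term, postings):
    """Reducer: buffer all postings of a term, then sort them by docid."""
    return term, sorted(postings)

def run_job(docs):
    groups = defaultdict(list)  # stands in for shuffle and sort
    for docid, text in docs.items():
        for term, posting in map_doc(docid, text):
            groups[term].append(posting)
    return dict(reduce_term(t, p) for t, p in sorted(groups.items()))

index = run_job({2: "red blue", 1: "one two", 3: "cat in the hat"})
```

Note that the reducer holds every posting of a term in memory before sorting, which is the weakness the later slides address.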

11 Positional Indexes [Figure: the same MapReduce data flow for Doc 1 (one, two), Doc 2 (red, blue), and Doc 3 (cat in the hat), with each posting now carrying the tf and the list of term positions.] Map. Shuffle and Sort: aggregate values by keys. Reduce.

12 Positional Indexes [Pseudo-code: the same Mapper and Reducer as on slide 10.]

13 Another Try [Figure: on the left, a single key (the term) with a list of postings as its values; on the right, one key per posting, made of the term and the docid, with the positions as the value.] How is this different? Let the framework do the sorting. Term frequency implicitly stored. Where have we seen this before?

14 Inverted Indexing: Pseudo-Code

    class Mapper
        method Map(docid n, doc d)
            H ← new AssociativeArray
            for all term t ∈ doc d do              ▷ builds a histogram of term frequencies
                H{t} ← H{t} + 1
            for all term t ∈ H do
                Emit(tuple ⟨t, n⟩, tf H{t})        ▷ emits individual postings, with a tuple as the key

    class Partitioner
        method Partition(tuple ⟨t, n⟩, tf f)
            return Hash(t) mod NumOfReducers       ▷ keys of same term are sent to same reducer

    class Reducer
        method Initialize
            t_prev ← ∅
            P ← new PostingsList
        method Reduce(tuple ⟨t, n⟩, tf [f])
            if t ≠ t_prev ∧ t_prev ≠ ∅ then
                Emit(term t_prev, postings P)      ▷ emits postings list of term t_prev
                P.Reset()
            P.Append(⟨n, f⟩)                       ▷ appends postings in sorted order
            t_prev ← t
        method Close
            Emit(term t_prev, postings P)          ▷ emits last postings list from this reducer
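A single-machine sketch of the pattern on slide 14 follows. The function names are hypothetical, Python's `sorted` stands in for the framework's sort, and Python's built-in `hash` stands in for the partitioner's hash function.

```python
from collections import Counter

def map_doc(docid, text):
    """Mapper: the key is the tuple (term, docid); the value is the tf."""
    for term, tf in Counter(text.lower().split()).items():
        yield (term, docid), tf

def partition(key, num_reducers):
    """Route on the term alone so all keys of a term reach one reducer."""
    term, _docid = key
    return hash(term) % num_reducers

def reduce_partition(sorted_pairs):
    """Reducer: consume ((term, docid), tf) pairs in key order and emit one
    postings list per term, without sorting postings in memory."""
    prev, postings = None, []
    for (term, docid), tf in sorted_pairs:
        if prev is not None and term != prev:
            yield prev, postings          # term changed: emit the previous list
            postings = []
        postings.append((docid, tf))      # arrives already sorted by docid
        prev = term
    if prev is not None:                  # the Close step: flush the last term
        yield prev, postings

def run_job(docs, num_reducers=2):
    parts = [[] for _ in range(num_reducers)]
    for docid, text in docs.items():
        for key, tf in map_doc(docid, text):
            parts[partition(key, num_reducers)].append((key, tf))
    out = {}
    for part in parts:
        out.update(reduce_partition(sorted(part)))  # framework-style sort by key
    return out
```

Because the docid is part of the key, the sort that the framework performs anyway delivers postings in docid order, so the reducer only needs to remember the previous term.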

15 Postings Encoding. Conceptually: [Figure: a postings list storing docnos.] In practice: don't encode docnos, encode gaps (or d-gaps), also known as delta encoding, delta compression, or gap compression. But it's not obvious that this saves space.
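The gap idea on slide 15 can be sketched as follows. `to_gaps` and `from_gaps` are hypothetical helper names, and the docids in the usage note are made up for illustration.

```python
def to_gaps(docids):
    """Replace each docid in an ascending list by its distance from the
    previous docid (the first gap is measured from 0)."""
    gaps, prev = [], 0
    for d in docids:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_gaps(gaps):
    """Invert to_gaps with a running sum."""
    docids, total = [], 0
    for g in gaps:
        total += g
        docids.append(total)
    return docids
```

For example, `to_gaps([1, 9, 21, 34, 35, 80])` gives `[1, 8, 12, 13, 1, 45]`. The gaps by themselves take no less space in fixed-width integers; the saving only appears once they are written with a variable-length code that spends fewer bits on small numbers.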

16 Overview of Integer Compression. Byte-aligned technique: VByte. Bit-aligned: unary codes, γ/δ codes, Golomb codes (local Bernoulli model). Word-aligned: Simple family, bit packing family (PForDelta, etc.).

17 VByte. Simple idea: use only as many bytes as needed. Need to reserve one bit per byte as the continuation bit; use remaining bits for encoding value: one byte holds 7 bits, two bytes hold 14 bits, three bytes hold 21 bits. Works okay, easy to implement. Beware of branch mispredicts!
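A minimal VByte sketch follows. Implementations differ on byte order and on whether a set high bit means "continue" or "stop"; this version assumes low-order bits first and a set high bit meaning that more bytes follow. The function names are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer with 7 payload bits per byte, low bits
    first. The high bit is the continuation bit (1 = more bytes follow)."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of VByte-encoded integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation bit set: keep accumulating
        else:
            values.append(n)    # last byte of this integer
            n, shift = 0, 0
    return values
```

The `if byte & 0x80` test inside the decoding loop is the data-dependent branch that the slide's warning about mispredicts refers to.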

18 Simple-9. How many different ways can we divide up 28 bits? 28 1-bit numbers, 14 2-bit numbers, 9 3-bit numbers, 7 4-bit numbers, and so on (9 total ways), chosen by the selectors. Efficient decompression with hard-coded decoders. Simple Family: general idea applies to 64-bit words, etc. Beware of branch mispredicts?

19 Bit Packing. What's the smallest number of bits we need to code a block (=128) of integers? Efficient decompression with hard-coded decoders. PForDelta: bit packing + separate storage of overflow bits. Beware of branch mispredicts?
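The question on slide 19 can be made concrete with a small sketch. It uses a Python integer as the bit buffer purely for illustration; `pack_block` and `unpack_block` are hypothetical names, and a real implementation would use fixed-size words and one hard-coded decoder per width.

```python
def pack_block(values):
    """Pack a non-empty block of non-negative ints, giving every value the
    smallest bit width that fits the largest value in the block."""
    width = max(1, max(values).bit_length())
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return width, len(values), packed

def unpack_block(width, count, packed):
    """Recover the values by masking out consecutive width-bit fields."""
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]
```

A single large value forces the whole block to a wide setting, which is why PForDelta, as the slide notes, stores the overflow separately.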

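20 Golomb Codes. For $x \ge 1$ and parameter $b$: encode $q + 1$ in unary, where $q = \lfloor (x - 1) / b \rfloor$; then encode $r$ in binary, where $r = x - qb - 1$, in $\lfloor \log b \rfloor$ or $\lceil \log b \rceil$ bits. Example: $b = 3$: $r = 0, 1, 2$ (0, 10, 11); $b = 6$: $r = 0, 1, 2, 3, 4, 5$ (00, 01, 100, 101, 110, 111). $x = 9$, $b = 3$: $q = 2$, $r = 2$, code = 110:11. $x = 9$, $b = 6$: $q = 1$, $r = 2$, code = 10:100. Punch line: optimal $b \approx 0.69\,(N/\mathrm{df})$. Different $b$ for every term!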
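The definition on slide 20 translates into a short encoder. This is a sketch under two assumptions taken from the slide's examples: unary means q ones followed by a zero, and the remainder uses a truncated binary code. `golomb_encode` is a hypothetical name, and the result is a bit string rather than packed bits.

```python
def golomb_encode(x, b):
    """Golomb code of an integer x >= 1 with parameter b >= 1, as a bit
    string: q + 1 in unary, followed by r in truncated binary."""
    q, r = divmod(x - 1, b)       # q = floor((x - 1) / b), r = x - q*b - 1
    unary = "1" * q + "0"
    k = (b - 1).bit_length()      # ceil(log2 b)
    short = (1 << k) - b          # number of remainders that get k - 1 bits
    if k == 0:                    # b == 1: the remainder is always 0
        binary = ""
    elif r < short:
        binary = format(r, "0{}b".format(k - 1))
    else:
        binary = format(r + short, "0{}b".format(k))
    return unary + binary

print(golomb_encode(9, 3))  # 11011  (110:11 on the slide)
print(golomb_encode(9, 6))  # 10100  (10:100 on the slide)
```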

21 Chicken and Egg? [Figure: a reducer receiving the (key, value) pairs of one term in sorted order and writing postings.] But wait! How do we set the Golomb parameter b? Recall: optimal b ≈ 0.69 (N/df). We need the df to set b, but we don't know the df until we've seen all postings! Sound familiar?

22 Getting the df In the mapper: Emit special key-value pairs to keep track of df In the reducer: Make sure special key-value pairs come first: process them to determine df Remember: proper partitioning!

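23 Getting the df: Modified Mapper. Input document: Doc 1 (one, two). Emit normal key-value pairs: one pair per term, keyed by the term and the docid, with the positions as the value. Emit special key-value pairs to keep track of df: one extra pair per term, keyed by the term and a special marker in place of the docid.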

24 Getting the df: Modified Reducer. [Figure: the (key, value) pairs of one term as they reach the reducer: the special key-value pairs first, then the postings in docid order.] First, compute the df by summing contributions from all special key-value pairs. Compute b from df. Write postings. Important: properly define sort order to make sure special key-value pairs come first! Where have we seen this before?
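The special-key trick on slides 22 to 24 can be sketched on one machine. Everything named here is an assumption made for illustration: `SPECIAL = -1` is a hypothetical marker that sorts before all real docids (assumed to be at least 1), `sorted` stands in for the framework's sort, and truncating 0.69 (N/df) to an integer of at least 1 is one possible rounding choice.

```python
import itertools

SPECIAL = -1  # hypothetical marker docid; sorts before every real docid >= 1

def map_doc(docid, terms):
    """Emit a df contribution and a normal posting for each distinct term."""
    for term in set(terms):
        yield (term, SPECIAL), 1                 # special pair: counts toward df
        yield (term, docid), terms.count(term)   # normal pair: the tf

def reduce_sorted(pairs, num_docs):
    """pairs: ((term, docid), value) sorted by key.
    Yields (term, df, b, postings) for each term."""
    for term, group in itertools.groupby(pairs, key=lambda kv: kv[0][0]):
        df, b, postings = 0, None, []
        for (_, docid), value in group:
            if docid == SPECIAL:
                df += value                      # special pairs arrive first
                continue
            if b is None:                        # first real posting: df is complete
                b = max(1, int(0.69 * num_docs / df))
            postings.append((docid, value))      # a real indexer would encode with b here
        yield term, df, b, postings
```

The reducer never buffers a term's postings to count them: the sort order guarantees that the df is complete before the first posting has to be encoded.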

25 Basic Inverted Indexer: Reducer. [Figure: the (key, value) pairs of one term as they reach the reducer, in docid order.] Write postings compressed.

26 Inverted Indexing: IP (~Pairs) [Pseudo-code: the same Mapper, Partitioner, and Reducer as on slide 14.]

27 Merging Postings. Let's define an operation ⊕ on postings lists P: Postings(…, 39, 54) ⊕ Postings(…, 46) = Postings(…, 39, 46, 54), that is, the sorted merge of the two lists. Then we can rewrite our indexing algorithm! flatMap: emit singleton postings. reduceByKey: ⊕.
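The rewrite suggested on slide 27 can be sketched without Spark. `reduce_by_key` below is a plain-Python stand-in for the reduceByKey idea, not the Spark API, and the postings are made up for illustration.

```python
import heapq

def merge(p, q):
    """The merge operation on two docid-sorted postings lists."""
    return list(heapq.merge(p, q))

def reduce_by_key(pairs, op):
    """Combine all values that share a key with the binary operation op."""
    out = {}
    for key, value in pairs:
        out[key] = op(out[key], value) if key in out else value
    return out

# flatMap step: one singleton postings list per (term, document) occurrence
singletons = [("blue", [(2, 1)]), ("hat", [(3, 1)]), ("blue", [(1, 1)])]
index = reduce_by_key(singletons, merge)
```

Because merging sorted lists is associative and commutative, the result does not depend on the order in which the framework combines partial lists.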

28 What's the issue? Postings1 ⊕ Postings2 = PostingsM. Solution: apply compression as needed!

29 Inverted Indexing: LP (~Stripes). Slightly less elegant implementation but uses same idea.

    class Mapper
        method Initialize
            M ← new AssociativeArray               ▷ holds partial lists of postings
        method Map(docid n, doc d)
            H ← new AssociativeArray               ▷ builds a histogram of term frequencies
            for all term t ∈ doc d do
                H{t} ← H{t} + 1
            for all term t ∈ H do
                M{t}.Add(posting ⟨n, H{t}⟩)        ▷ adds a posting to partial postings lists
            if MemoryFull() then
                Flush()
        method Flush                               ▷ flushes partial lists of postings as intermediate output
            for all term t ∈ M do
                P ← SortAndEncodePostings(M{t})
                Emit(term t, postingsList P)
            M.Clear()
        method Close
            Flush()

30 Inverted Indexing: LP (~Stripes)

    class Reducer
        method Reduce(term t, postingsLists [P1, P2, ...])
            P_f ← new List                         ▷ temporarily stores partial lists of postings
            R ← new List                           ▷ stores merged partial lists of postings
            for all P ∈ postingsLists [P1, P2, ...] do
                P_f.Add(P)
                if MemoryNearlyFull() then
                    R.Add(MergeLists(P_f))
                    P_f.Clear()
            R.Add(MergeLists(P_f))
            Emit(term t, postingsList MergeLists(R))   ▷ emits fully merged postings list of term t

31 LP vs. IP? Experiments on ClueWeb09 collection: segments + 0.8m documents (47 GB compressed,.97 TB uncompressed). [Figure: indexing time (minutes) against number of documents (millions) for the IP algorithm and the LP algorithm.] Table columns: Alg., Time, Intermediate Pairs, Intermediate Size. IP: 38.5 min, bytes. LP: 9.6 min, bytes. From: Elsayed et al., Brute-Force Approaches to Batch Retrieval: Scalable Indexing with MapReduce, or Why Bother? 2010

32 Another Look at LP [Pseudo-code: the LP Mapper from slide 29 and the LP Reducer from slide 30.] flatMap: emit singleton postings. reduceByKey: ⊕. RDD[(K, V)] → aggregateByKey with seqOp: (U, V) ⇒ U and combOp: (U, U) ⇒ U → RDD[(K, U)].
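The signature at the end of slide 32 can be illustrated with a plain-Python stand-in. `aggregate_by_key` below only imitates the shape of the operation (a per-partition `seq_op` followed by a cross-partition `comb_op`); it is not the Spark API, and the partitions are made up for illustration.

```python
import bisect
import heapq

def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """seq_op folds each value (V) into a per-partition accumulator (U);
    comb_op merges accumulators of the same key across partitions."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc[key] if key in acc else zero(), value)
        partials.append(acc)
    result = {}
    for acc in partials:
        for key, u in acc.items():
            result[key] = comb_op(result[key], u) if key in result else u
    return result

def add_posting(postings, posting):   # seqOp: (U, V) -> U
    bisect.insort(postings, posting)  # keep the partial list sorted by docid
    return postings

def merge_lists(p, q):                # combOp: (U, U) -> U
    return list(heapq.merge(p, q))

partitions = [[("a", (3, 1)), ("a", (1, 2))], [("a", (2, 1)), ("b", (5, 1))]]
index = aggregate_by_key(partitions, list, add_posting, merge_lists)
```

The per-partition accumulation plays the role of the LP mapper's in-memory partial postings lists, and the combine step plays the role of the LP reducer's merge.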

33 Algorithm design in a nutshell. Exploit associativity and commutativity via commutative monoids (if you can). Exploit framework-based sorting to sequence computations (if you can't). Source: Wikipedia (Walnut)

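34 Abstract IR Architecture [Figure: online, the query passes through a representation function to give a query representation; offline, the documents pass through a representation function to give a document representation, which is stored in the index; a comparison function matches the query representation against the index and returns hits.]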

35 MapReduce it? The indexing problem: scalability is critical; must be relatively fast, but need not be real time; fundamentally a batch operation; incremental updates may or may not be important; for the web, crawling is a challenge in itself. The retrieval problem: must have sub-second response time; for the web, only need relatively few results.

36 Assume everything fits in memory on a single machine (For now)

37 Boolean Retrieval. Users express queries as a Boolean expression: AND, OR, NOT; can be arbitrarily nested. Retrieval is based on the notion of sets: any query divides the collection into two sets, retrieved and not-retrieved. Pure Boolean systems do not define an ordering of the results.

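38 Boolean Retrieval. To execute a Boolean query: build the query syntax tree; for each clause, look up postings; traverse postings and apply the Boolean operator. [Figure: syntax tree with OR at the root, whose children are ham and an AND node over blue and a second term, with a postings list under each term.]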
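The three steps on slide 38 can be sketched as a recursive walk over the syntax tree. The tuple representation of the tree, the helper names, and the toy index are assumptions made for this illustration; NOT is left out for brevity.

```python
def intersect(p, q):
    """AND: linear merge of two sorted docid lists."""
    i = j = 0
    out = []
    while i < len(p) and j < len(q):
        if p[i] == q[j]:
            out.append(p[i])
            i += 1
            j += 1
        elif p[i] < q[j]:
            i += 1
        else:
            j += 1
    return out

def union(p, q):
    """OR: sorted union of two docid lists."""
    return sorted(set(p) | set(q))

def evaluate(node, index):
    """node is a term (str) or a tuple (op, left, right), op in {"AND", "OR"}."""
    if isinstance(node, str):
        return index.get(node, [])      # look up the postings of a term
    op, left, right = node
    l, r = evaluate(left, index), evaluate(right, index)
    return intersect(l, r) if op == "AND" else union(l, r)

index = {"blue": [2, 5, 9], "red": [2, 9, 11], "ham": [1, 3]}  # made-up postings
hits = evaluate(("OR", ("AND", "blue", "red"), "ham"), index)
```

Each operator fully materializes its result before the parent consumes it, which corresponds to the term-at-a-time strategy discussed on the following slide.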

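39 Term-at-a-Time [Figure: the syntax tree is evaluated one operator at a time: the AND over two postings lists produces an intermediate list, which is then ORed with the postings of ham.] Efficiency analysis? What's RPN?

40 Document-at-a-Time [Figure: the postings lists of all query terms are traversed in parallel, one document at a time.] Tradeoffs? Efficiency analysis?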

41 Boolean Retrieval. Users express queries as a Boolean expression: AND, OR, NOT; can be arbitrarily nested. Retrieval is based on the notion of sets: any query divides the collection into two sets, retrieved and not-retrieved. Pure Boolean systems do not define an ordering of the results.

42 Ranked Retrieval. Order documents by how likely they are to be relevant: estimate relevance(q, d_i), then sort documents by relevance. How do we estimate relevance? Take similarity as a proxy for relevance.

43 Vector Space Model [Figure: document vectors d1 to d5 plotted in a space whose axes are the terms t1, t2, t3, with angles θ and φ between pairs of vectors.] Assumption: documents that are close together in vector space talk about the same things. Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ closeness).

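44 Similarity Metric. Use the angle between the vectors $d_j = [w_{j,1}, w_{j,2}, w_{j,3}, \ldots, w_{j,n}]$ and $d_k = [w_{k,1}, w_{k,2}, w_{k,3}, \ldots, w_{k,n}]$: $\cos\theta = \dfrac{d_j \cdot d_k}{|d_j|\,|d_k|}$, so $\mathrm{sim}(d_j, d_k) = \dfrac{d_j \cdot d_k}{|d_j|\,|d_k|} = \dfrac{\sum_{i=1}^{n} w_{j,i}\, w_{k,i}}{\sqrt{\sum_{i=1}^{n} w_{j,i}^2}\,\sqrt{\sum_{i=1}^{n} w_{k,i}^2}}$. Or, more generally, inner products: $\mathrm{sim}(d_j, d_k) = d_j \cdot d_k = \sum_{i=1}^{n} w_{j,i}\, w_{k,i}$.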
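The cosine formula on slide 44 is short enough to write out directly. The sketch assumes dense weight vectors of equal length and returns 0.0 when either vector is all zeros (a convention chosen here, not stated on the slide).

```python
import math

def cosine(dj, dk):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(dj, dk))
    norm = math.sqrt(sum(a * a for a in dj)) * math.sqrt(sum(b * b for b in dk))
    return dot / norm if norm else 0.0
```

Dividing by the vector lengths makes the score independent of document length: `cosine([1, 2], [2, 4])` is 1 up to floating-point rounding even though the second vector is twice as long.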

45 Term Weighting. Term weights consist of two components. Local: how important is the term in this document? Global: how important is the term in the collection? Here's the intuition: terms that appear often in a document should get high weights; terms that appear in many documents should get low weights. How do we capture this mathematically? Term frequency (local) and inverse document frequency (global).

46 TF.IDF Term Weighting. $w_{i,j} = \mathrm{tf}_{i,j} \cdot \log \dfrac{N}{n_i}$, where $w_{i,j}$ is the weight assigned to term $i$ in document $j$, $\mathrm{tf}_{i,j}$ is the number of occurrences of term $i$ in document $j$, $N$ is the number of documents in the entire collection, and $n_i$ is the number of documents with term $i$.
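The weighting formula on slide 46 as a one-line function. The slide does not state the base of the logarithm; the natural logarithm is assumed here, and `tfidf` is a hypothetical name.

```python
import math

def tfidf(tf, n_docs, n_i):
    """Weight of a term that occurs tf times in a document, in a collection
    of n_docs documents of which n_i contain the term."""
    return tf * math.log(n_docs / n_i)
```

A term that appears in every document gets weight 0 regardless of its tf, since log(N/N) = 0; this matches the intuition that terms appearing in many documents should get low weights.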

47 Retrieval in a Nutshell Look up postings lists corresponding to query terms Traverse postings for each query term Store partial query-document scores in accumulators Select top k results to return

48 Retrieval: Document-at-a-Time. Evaluate documents one at a time (score all query terms). Accumulators (e.g. min heap): is the document score in the top k? Yes: insert document score, extract-min if heap too large. No: do nothing. Tradeoffs: small memory footprint (good); skipping possible to avoid reading all postings (good); more seeks and irregular data accesses (bad).
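A sketch of the document-at-a-time loop with a size-k min-heap follows. It assumes in-memory postings of the form (docid, score contribution) sorted by docid, and an additive scoring model; `daat_top_k` and the toy postings are hypothetical.

```python
import heapq

def daat_top_k(postings, k):
    """postings: {term: [(docid, contribution), ...]} sorted by docid.
    Scores one document at a time; a size-k min-heap holds the current top k."""
    cursors = {t: 0 for t in postings}
    heap = []  # (score, docid); heap[0] is the weakest of the current top k
    while True:
        live = [p[cursors[t]][0] for t, p in postings.items() if cursors[t] < len(p)]
        if not live:
            break
        doc = min(live)  # next document in docid order
        score = 0.0
        for t, p in postings.items():
            i = cursors[t]
            if i < len(p) and p[i][0] == doc:
                score += p[i][1]
                cursors[t] = i + 1  # advance every list positioned on this doc
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))  # evict the current minimum
    return sorted(heap, reverse=True)

postings = {"blue": [(1, 1.0), (2, 0.5)], "ham": [(2, 2.0), (3, 0.2)]}
top = daat_top_k(postings, k=2)
```

Memory use is bounded by k heap entries plus one cursor per query term, which is the small footprint the slide lists as an advantage.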

49 Retrieval: Term-At-A-Time. Evaluate documents one query term at a time, usually starting from the most rare term (often with tf-sorted postings). [Figure: the postings of one query term are traversed, each posting adding a partial score Score_{q=x}(doc n) = s to that document's accumulator.] Accumulators (e.g., hash). Tradeoffs: early termination heuristics (good); large memory footprint (bad), but filtering heuristics possible.

50 Assume everything fits in memory on a single machine. Okay, let's relax this assumption now.
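The term-at-a-time strategy on slide 49 can be sketched with a hash of accumulators. The sketch assumes additive scores and uses the length of a postings list as the measure of rarity; early termination and accumulator filtering are omitted, and `taat_top_k` and the toy postings are hypothetical.

```python
import heapq
from collections import defaultdict

def taat_top_k(postings, query_terms, k):
    """postings: {term: [(docid, contribution), ...]}.
    Processes one query term at a time, rarest (shortest list) first."""
    accumulators = defaultdict(float)  # docid -> partial score
    for term in sorted(query_terms, key=lambda t: len(postings.get(t, []))):
        for docid, contribution in postings.get(term, []):
            accumulators[docid] += contribution
    return heapq.nlargest(k, ((s, d) for d, s in accumulators.items()))

postings = {"blue": [(1, 1.0), (2, 0.5)], "ham": [(2, 2.0), (3, 0.2)]}
top = taat_top_k(postings, ["blue", "ham"], k=2)
```

Unlike the document-at-a-time loop, the accumulator table can grow to one entry per document that matches any query term, which is the large memory footprint the slide mentions.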

51 Important Ideas Partitioning (for scalability) Replication (for redundancy) Caching (for speed) Routing (for load balancing) The rest is just details!

52 Term vs. Document Partitioning [Figure: the term-document matrix (terms T by documents D) split by rows into term partitions T1, T2, T3, ..., or by columns into document partitions D1, D2, D3, ....]

53 [Figure: a front end (FE) sends queries to brokers, which fan out to index partitions; each partition has replicas and a cache.]

54 [Figure: several datacenters, each with brokers in front of multiple tiers; every tier holds index partitions, each with replicas and a cache.]

55 Important Ideas Partitioning (for scalability) Replication (for redundancy) Caching (for speed) Routing (for load balancing)

56 Questions? Source: Wikipedia (Japanese rock garden)
