Scalable out-of-core classification (of large test sets): can we do better than the current approach?


1 Announcements: Working AWS codes will be out soon. Project deadlines are now posted. Waitlist: you need approval if you are an MS student on the 805 waitlist. William has no office hours next week.

2 Scalable out-of-core classification (of large test sets): can we do better than the current approach?

3 Review of NB algorithms:

  HW | Train events | Test events             | Parallel?
  1  | Msgs -> Disk | HashMap (for subset)    | no
  2  | Msgs -> Disk | HashMap (for subset)    | yes
  3  | Msgs -> Disk | Msgs on Disk (coming.)  | yes

4 Testing Large-vocab Naïve Bayes

Train: for each example id, y, x1,...,xd in train, emit event-counter update messages; sort the event-counter update messages; scan and add the sorted messages and output the final counter values. Model size: O(|V|).

Test:
- Initialize a HashSet NEEDED and a hashtable C.
- For each example id, y, x1,...,xd in test: add x1,...,xd to NEEDED. Time: O(n2), where n2 = size of test; memory: same.
- For each event, C(event) in the summed counters: [for the assignment] if the event involves a NEEDED term x, read it into C. Time: O(n2); memory: same.
- For each example id, y, x1,...,xd in test: for each y in dom(Y), compute log Pr(y, x1,...,xd) = ... Time: O(n2); memory: same.
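A minimal Python sketch of this test-side procedure, under assumptions that are not in the slide: the summed counter file holds tab-separated lines like "X=w^Y=sports<TAB>5245" plus class totals "Y=sports<TAB>n", and add-one smoothing is used for scoring.

    import math

    def needed_terms(test_examples):
        # Pass over test: collect every term into NEEDED.
        needed = set()
        for _id, _y, terms in test_examples:
            needed.update(terms)
        return needed

    def load_needed_counters(counter_lines, needed):
        # Stream the summed counters; keep only events for NEEDED terms.
        C = {}
        for line in counter_lines:
            event, n = line.rstrip("\n").rsplit("\t", 1)
            if event.startswith("Y=") or event.split("^")[0][2:] in needed:
                C[event] = int(n)
        return C

    def classify(terms, C, labels, vocab_size):
        # Score log Pr(y, x1..xd) with add-one smoothing (an assumption).
        best, best_score = None, float("-inf")
        for y in labels:
            ny = C.get("Y=%s" % y, 0)
            score = math.log(ny + 1)
            for w in terms:
                nwy = C.get("X=%s^Y=%s" % (w, y), 0)
                score += math.log((nwy + 1.0) / (ny + vocab_size))
            if score > best_score:
                best, best_score = y, score
        return best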

5 Can we do better?

Test data:
  id1  w1,1 w1,2 w1,3 ... w1,k1
  id2  w2,1 w2,2 w2,3 ...
  id3  w3,1 w3,2 ...
  id4  w4,1 w4,2
  id5  w5,1 w5,2
  ...

Event counts:
  X=w1^Y=sports
  X=w1^Y=worldNews
  X=...
  X=w2^Y=...
  X=...

What we'd like: each test row paired with the counters for its words:
  id1  w1,1 w1,2 w1,3 ... w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=...], C[X=w1,2^...]
  id2  w2,1 w2,2 w2,3 ...         C[X=w2,1^Y=...]=1054, ..., C[X=w2,k2^...]
  id3  w3,1 w3,2 ...              C[X=w3,1^Y=...]=...

6 Can we do better? Step 1: group counters by word w. How: stream and sort: for each C[X=w^Y=y]=n, print "w C[Y=y]=n"; sort, and build a list of values associated with each key w, like an inverted index.

Event counts:
  X=w1^Y=sports
  X=w1^Y=worldNews
  X=...
  X=w2^Y=...
  X=...

Counts associated with each word w:
  w1  C[w1^Y=sports]=2
  w2  C[w2^Y=sports]=1027, C[w2^Y=worldNews]=564
  w3  C[w3^Y=sports]=21, C[w3^Y=worldNews]=...
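In Python, Step 1 might look like the sketch below; itertools.groupby stands in for Unix sort's grouping, and the "X=w^Y=y<TAB>n" message format is an assumption, not from the slide.

    from itertools import groupby

    def regroup_by_word(counter_lines):
        # Rewrite "X=w^Y=y<TAB>n" as the message (w, "C[Y=y]=n").
        msgs = []
        for line in counter_lines:
            event, n = line.rstrip("\n").rsplit("\t", 1)
            x, y = event.split("^")              # "X=w", "Y=sports"
            msgs.append((x[2:], "C[%s]=%s" % (y, n)))
        msgs.sort()                              # in real use: out-of-core Unix sort
        # One inverted-index-style record per word.
        for w, group in groupby(msgs, key=lambda kv: kv[0]):
            yield w, ",".join(v for _, v in group)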

7 If these records were in a key-value DB, we would know what to do.

Test data: id1 w1,1 w1,2 w1,3 ... w1,k1; id2 w2,1 w2,2 w2,3 ...; id3 w3,1 w3,2 ...; id4 w4,1 w4,2; id5 w5,1 w5,2; ...
Record of all event counts for each word: w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...

Step 2: stream through the test data; for each test case idi wi,1 wi,2 wi,3 ... wi,ki, the classification logic requests the event counters needed to classify idi from the event-count DB, then classifies using the answers.

8 Is there a stream-and-sort analog of this request-and-answer pattern? (Same picture as the previous slide: the test data, the record of all event counts for each word, and Step 2's classification logic requesting counters from the event-count DB and classifying using the answers.)

9 Recall stream-and-sort counting: sort messages so the recipient can stream through them. Counting logic on Machine A scans example 1, example 2, example 3, ... and emits counter-update messages "C[x] += D" (C[x1] += D1, C[x1] += D2, ...); the messages are sorted (Machine C); logic on Machine B streams through the sorted messages and combines the counter updates.

10 Is there a stream-and-sort analog of this request-and-answer pattern? Idea: instead of looking counters up in a DB, the classification logic emits request messages: "w1,1 counters to id1", "w1,2 counters to id1", ..., "wi,j counters to idi".

11 Is there a stream-and-sort analog of this request-and-answer pattern? Example: for the test case "id1 found an aarvark in zynga's farmville today!", the classification logic emits "found ctrs to id1", "aarvark ctrs to id1", ..., "today ctrs to id1".

12 Is there a stream-and-sort analog of this request-and-answer pattern? Same example, but now the requests are "found ~ctr to id1", "aarvark ~ctr to id1", ..., "today ~ctr to id1". Why the ~? ~ is the last printable ASCII character, and "% export LC_COLLATE=C" means that it will sort after anything else with Unix sort, so a word's counter record always comes before the requests for it.

13 Is there a stream-and-sort analog of this request-and-answer pattern?

Counter records (one per word): w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...
Requests (from the classification logic): "found ~ctr to id1", "... ~ctr to id2", ..., "today ~ctr to idi".

Combine and sort.

14 A stream-and-sort analog of the request-and-answer pattern. Combine and sort the per-word counter records with the requests; the request-handling logic then sees, for each word, its counter record followed by every request for that word:

  w1  C[w1^Y=sports]=2
  w1  ~ctr to id1
  w2  C[w2^Y=sports]=...
  w2  ~ctr to id345
  w2  ~ctr to id9854
  w3  C[...]
  w3  ~ctr to id345
  w3  ~ctr to id34742
  ...

15 A stream-and-sort analog of the request-and-answer pattern. The request-handling logic streams through the combined and sorted records and requests:

    previousKey = somethingImpossible
    for each (key, val) in input:
        if key == previousKey:
            Answer(recordForPrevKey, val)
        else:
            previousKey = key
            recordForPrevKey = val

    define Answer(record, request):
        find id where request = "~ctr to id"
        print "id ~ctr for request is record"
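Read as Python, the same loop might look like this sketch; input lines are assumed to be tab-separated "key<TAB>value" pairs, already combined and sorted so that each word's counter record precedes its requests.

    import sys

    def handle_requests(lines):
        prev_key, record = None, None
        for line in lines:
            key, val = line.rstrip("\n").split("\t", 1)
            if key == prev_key:
                # val is a request "~ctr to <id>"; answer it from the record.
                rid = val[len("~ctr to "):]
                print("%s\t~ctr for %s is %s" % (rid, key, record))
            else:
                prev_key, record = key, val   # first line per word: its record

    if __name__ == "__main__":
        handle_requests(sys.stdin)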

16 A stream-and-sort analog of the request-and-answer pattern. The same request-handling logic as on the previous slide, with its output:

  id1  ~ctr for w1 is C[w1^Y=sports]=2
  id1  ~ctr for w2 is ...
  ...

17 A stream-and-sort analog of the request-and-answer pattern. The output of the request-handling logic (id1 ~ctr for w1 is C[w1^Y=sports]=2; id1 ~ctr for w2 is ...; ...) is combined and sorted with the test data itself (id1 found an aarvark in zynga's farmville today!; id2 ...; id3 ...; ...), feeding some final logic: ????

18 What we'd wanted: each test row paired with the counters for its words:

  id1  w1,1 w1,2 w1,3 ... w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=...], C[X=w1,2^...]
  id2  w2,1 w2,2 w2,3 ...         C[X=w2,1^Y=...]=1054, ..., C[X=w2,k2^...]
  id3  w3,1 w3,2 ...              C[X=w3,1^Y=...]=...
  id4  w4,1 w4,2

What we ended up with, and it's good enough!

  Key   Value
  id1   found ... farmville today ~ctr for w1 is C[w1^Y=sports]=2 ~ctr for found is C[found^Y=sports]=1027,C[found^Y=worldNews]=564 ...
  id2   w2,1 w2,2 w2,3 ... ~ctr for w2,1 is ...

19 Implementation summary:

  java CountForNB train.dat > eventcounts.dat
  java CountsByWord eventcounts.dat | sort | java CollectRecords > words.dat
  java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

train.dat holds the training examples (id1 w1,1 w1,2 w1,3 ... w1,k1; id2 w2,1 w2,2 w2,3 ...; ...); eventcounts.dat holds the event counts (X=w1^Y=sports, X=w1^Y=worldNews, X=..., X=w2^Y=..., ...).

20 Implementation summary: the same pipeline,

  java CountForNB train.dat > eventcounts.dat
  java CountsByWord eventcounts.dat | sort | java CollectRecords > words.dat
  java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

where words.dat holds the counts associated with each word w:
  w1  C[w1^Y=sports]=2
  w2  C[w2^Y=sports]=1027, C[w2^Y=worldNews]=564
  w3  C[w3^Y=sports]=21, C[w3^Y=worldNews]=4464

21 Implementation summary: in the pipeline

  java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

the output of requestWordCounts looks like request messages (found ~ctr to id1; ... ~ctr to id2; today ~ctr to idi; ...), the input words.dat looks like per-word records (w -> C[w^Y=sports]=2; ...), and after cat - words.dat | sort the stream interleaves each word's record with the requests for it:

  w1  C[w1^Y=sports]=2
  w1  ~ctr to id1
  w2  C[w2^Y=sports]=...
  w2  ~ctr to id345
  w2  ~ctr to id9854
  ...

22 Implementation summary: in the same pipeline, the output of answerWordCountRequests looks like

  id1  ~ctr for w1 is C[w1^Y=sports]=2
  id1  ~ctr for w2 is ...

and test.dat holds the raw test examples (id1 found an aarvark in zynga's farmville today!; id2 ...; id3 ...; ...).

23 Implementation summary: after cat - test.dat | sort, the input to testNBUsingRequests looks like

  Key   Value
  id1   found ... farmville today ~ctr for w1 is C[w1^Y=sports]=2 ~ctr for found is C[found^Y=sports]=1027,C[found^Y=worldNews]=564 ...
  id2   w2,1 w2,2 w2,3 ... ~ctr for w2,1 is ...

24 Summary: we've filled in the missing piece. Training (event-counting): from an in-memory hashtable to on-disk, low-memory. Testing (classification): from an in-memory hashtable to on-disk, low-memory.

25 Rocchio's Algorithm

26 Rocchio's algorithm:

DF(w) = # different docs w occurs in
TF(w, d) = # different times w occurs in doc d
IDF(w) = |D| / DF(w)
u(w, d) = log(TF(w, d) + 1) * log(IDF(w))
u(d) = <u(w_1, d), ..., u(w_|V|, d)>

$$u(y) = \alpha \frac{1}{|C_y|} \sum_{d \in C_y} \frac{u(d)}{\|u(d)\|_2} \;-\; \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} \frac{u(d')}{\|u(d')\|_2}$$

$$f(d) = \arg\max_y \; \frac{u(d)}{\|u(d)\|_2} \cdot \frac{u(y)}{\|u(y)\|_2}, \qquad \|u\|_2 = \sqrt{\textstyle\sum_i u_i^2}$$

Many variants of these formulae, as long as u(w,d) = 0 for words not in d! Store only the non-zeros in u(d), so its size is O(|d|). But the size of u(y) is O(|V|).

27 Rocchio's algorithm:

DF(w) = # different docs w occurs in
TF(w, d) = # different times w occurs in doc d
IDF(w) = |D| / DF(w)
u(w, d) = log(TF(w, d) + 1) * log(IDF(w))
u(d) = <u(w_1, d), ..., u(w_|V|, d)>,  v(d) = u(d) / ||u(d)||_2 = <v(w_1, d), ...>

$$u(y) = \alpha \frac{1}{|C_y|} \sum_{d \in C_y} v(d) \;-\; \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} v(d'), \qquad v(y) = \frac{u(y)}{\|u(y)\|_2}$$

$$f(d) = \arg\max_y \; v(d) \cdot v(y)$$

Given a table mapping w to DF(w), we can compute v(d) from the words in d, and the rest of the learning algorithm is just adding.
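A sketch of that observation in Python: with the DF table and corpus size in hand, v(d) is a purely local computation per document (the function and argument names here are illustrative, not from the slides).

    import math
    from collections import Counter

    def doc_vector(words, df, num_docs):
        # u(w,d) = log(TF(w,d)+1) * log(IDF(w)),  IDF(w) = |D| / DF(w)
        tf = Counter(words)
        u = {w: math.log(n + 1) * math.log(num_docs / df[w]) for w, n in tf.items()}
        # v(d) = u(d) / ||u(d)||_2, stored sparsely: only non-zeros, size O(|d|)
        norm = math.sqrt(sum(x * x for x in u.values()))
        return {w: x / norm for w, x in u.items()}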

28 Imagine a similar process, but for labeled documents: Rocchio vs. Bayes. Train data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...) produces event counts (X=w1^Y=sports, X=w1^Y=worldNews, X=..., X=w2^Y=..., ...). Recall the Naïve Bayes test process: each test row is paired with the counters for its words (id1 y1 w1,1 ... with C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=...], C[X=w1,2^...]; id2 y2 w2,1 ... with C[X=w2,1^Y=...]=1054, ..., C[X=w2,k2^...]; ...).

29 Rocchio: from the train data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...), first compute the DF counts; then, for each document, compute the weights v(w1,1, id1), v(w1,2, id1), ..., v(w1,k1, id1); v(w2,1, id2), v(w2,2, id2); ...

30 Recap: is there a stream-and-sort analog of this request-and-answer pattern? Test data on one side; the record of all event counts for each word on the other (w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...); the classification logic emits requests (found ~ctr to id1; ... ~ctr to id2; today ~ctr to idi), and the counter records and requests are combined and sorted.

31 Recap: is there a stream-and-sort analog of this request-and-answer pattern? Same picture, but now the per-word records are DF values (w -> DF(w)) instead of event counters; the classification logic emits the same requests, which are combined and sorted with the DF records.

32 Recap: a stream-and-sort analog of the request-and-answer pattern. After combining and sorting, each word's record (e.g. w with count 4) is followed by the requests for it (~ctr to id1, ~ctr to id345, ~ctr to id9854, ~ctr to id34742, ...), and the request-handling logic streams through them.

33 Recap: a stream-and-sort analog of the request-and-answer pattern. The request-handling logic is as before:

    previousKey = somethingImpossible
    for each (key, val) in input:
        if key == previousKey:
            Answer(recordForPrevKey, val)
        else:
            previousKey = key
            recordForPrevKey = val

    define Answer(record, request):
        find id where request = "~ctr to id"
        print "id ~ctr for request is record"

34 Rocchio: from the train data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...), compute the DF counts (e.g. 12, 1054, 3, ...); then, with

DF(w) = # different docs w occurs in
TF(w, d) = # different times w occurs in doc d
IDF(w) = |D| / DF(w)
u(w, d) = log(TF(w, d) + 1) * log(IDF(w))
u(d) = <u(w_1, d), ..., u(w_|V|, d)>,  v(d) = u(d) / ||u(d)||_2

output one document vector per document: v(id1), v(id2), ...

35 Rocchio:

$$u(y) = \alpha \frac{1}{|C_y|} \sum_{d \in C_y} v(d) \;-\; \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} v(d'), \qquad v(y) = \frac{u(y)}{\|u(y)\|_2}, \qquad f(d) = \arg\max_y \; v(d) \cdot v(y)$$

v(w1,1 w1,2 w1,3 ... w1,k1) is the document vector for id1; v(w2,1 w2,2 w2,3 ...) = <v(w2,1, d), v(w2,2, d), ...>. For each (y, v), go through the non-zero values in v (one for each w in the document d) and increment a counter for that dimension of v(y). Message: increment v(y1)'s weight for w1,1 by alpha * v(w1,1, d) / |C_y|. Message: increment v(y1)'s weight for w1,2 by alpha * v(w1,2, d) / |C_y|.
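As a stream-and-sort job, those increment messages might be emitted like this sketch, which builds on the doc_vector sketch after slide 27; the tab-separated message format is made up, and the beta term for the negative class is omitted for brevity.

    def emit_increments(labeled_docs, df, num_docs, class_sizes, alpha=1.0):
        # labeled_docs: iterable of (doc_id, y, words); class_sizes[y] = |C_y|
        for _id, y, words in labeled_docs:
            v = doc_vector(words, df, num_docs)   # as sketched after slide 27
            for w, x in v.items():
                # "increment v(y)'s weight for w by alpha * v(w,d) / |C_y|"
                print("%s\t%s\t%.6f" % (y, w, alpha * x / class_sizes[y]))

Sorting the output on (y, w) and summing the increments then yields u(y), which is just the "adding" the slide promises.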

36 Rocchio at test time:

$$f(d) = \arg\max_y \; v(d) \cdot v(y) = \arg\max_y \sum_w v(w, d) \cdot v(w, y)$$

Training produces the DF counts and the per-word class weights (w -> v(y1, w) = 0.013, v(y2, w) = ..., ...). At test time, stream through the test data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...) and pair each document vector with the weights for its words: v(id1) with v(w1,1, y1), ..., v(w1,k1, yk); v(id2) with v(w2,1, y1), ...; ...

37 Rocchio summary:
- Compute DF: one scan through the docs; time O(n), n = corpus size (like NB event-counts).
- Compute v(id_i) for each document: output size O(n); one scan if the DFs fit in memory (otherwise, like the first part of the NB test procedure); time O(n).
- Add up the vectors to get each v(y): one scan if the v(y)'s fit in memory (otherwise, like NB training); time O(n).
- Classification ~= disk NB.

38 One? more Rocchio observation: split the Documents/labels into subsets (Documents/labels-1, -2, -3); compute DFs on each subset (DFs-1, DFs-2, DFs-3); then sort and add the counts to get the DFs.

39 One?? more Rocchio observation: given the DFs, split the Documents/labels into subsets (Documents/labels-1, -2, -3); compute partial v(y)'s on each subset (v-1, v-2, v-3); then sort and add the vectors to get the v(y)'s.

40 O(1) more Rocchio observation: split the Documents/labels into subsets; give each subset a copy of the DFs; compute partial v(y)'s (v-1, v-2, v-3); sort and add the vectors to get the v(y)'s (coming up: how). We have shared access to the DFs, but only shared read access; we don't need to share write access. So we only need to copy the information across the different processes.

41 ABSTRACTIONS FOR STREAM-AND-SORT AND MAP-REDUCE

42 Abstractions on top of map-reduce. We've decomposed some algorithms into a map-reduce workflow (a series of map-reduce steps): naive Bayes training, naïve Bayes testing. How else can we express these sorts of computations? Are there some common special cases of map-reduce steps that we can parameterize and reuse?

43 Abstractions on top of map-reduce. Some obvious streaming processes: for each row in a table, (a) transform it and output the result, or (b) decide if you want to keep it with some boolean test, and copy out only the ones that pass the test.

Example: stem words in a stream of word-count pairs: ("aardvarks",1) -> ("aardvark",1).
Proposed syntax: f(row) -> row;  table2 = MAP table1 TO λ row : f(row)

Example: apply stop words: ("aardvark",1) -> ("aardvark",1); ("the",1) -> deleted.
Proposed syntax: f(row) -> {true,false};  table2 = FILTER table1 BY λ row : f(row)
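In Python, MAP and FILTER are one-liners over a stream of rows; this is a sketch of the proposed operators, not code from the slides.

    def MAP(table, f):
        # table2 = MAP table1 TO λ row : f(row)
        return (f(row) for row in table)

    def FILTER(table, f):
        # table2 = FILTER table1 BY λ row : f(row)
        return (row for row in table if f(row))

    # e.g. stop-wording: FILTER(pairs, lambda wc: wc[0] not in {"the", "a", "an"})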

44 Abstractions on top of map-reduce. A non-obvious(?) streaming process: for each row in a table, transform it to a list of items, then splice all the lists together to get the output table (flatten).

Example: tokenizing a line: "I found an aardvark ..." -> ["i", "found", "an", "aardvark", ...]; "We love zymurgy" -> ["we", "love", "zymurgy"]; but the final table is one word per row: i / found / an / ... / we / love / ...

Proposed syntax: f(row) -> list of rows;  table2 = FLATMAP table1 TO λ row : f(row)
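FLATMAP, sketched in the same style:

    def FLATMAP(table, f):
        # table2 = FLATMAP table1 TO λ row : f(row)
        return (out for row in table for out in f(row))

    # e.g. tokenizing: FLATMAP(lines, lambda line: line.lower().split())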

45 Abstractions on top of map-reduce. An example: the Naïve Bayes test program.

46 NB test step. How: stream and sort: for each C[X=w^Y=y]=n, print "w C[Y=y]=n"; sort, and build a list of values associated with each key w, like an inverted index. The event counts (X=w1^Y=sports, X=w1^Y=worldNews, X=..., X=w2^Y=..., ...) become per-word records:

  w1  C[w1^Y=sports]=2
  w2  C[w2^Y=sports]=1027, C[w2^Y=worldNews]=564
  w3  C[w3^Y=sports]=21, C[w3^Y=worldNews]=4464

47 NB test step, the general case: we're taking rows from a table in a particular format (event, count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value: a grouping operation, a special case of a map-reduce. The event counts become per-word records, as on the previous slide.

Proposed syntax: f(row) -> field;  GROUP table BY λ row : f(row)
(We could define f via a function, a field of a defined record structure, ...)

48 NB test step, the general case (continued). Aside: you guys know how to implement this, right?
1. Output pairs (f(row), row) with a map/streaming process.
2. Sort the pairs by key, which is f(row).
3. Reduce and aggregate by appending together all the values associated with the same key.

Proposed syntax: f(row) -> field;  GROUP table BY λ row : f(row)
(We could define f via a function, a field of a defined record structure, ...)
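Those three steps, sketched in Python, with sorted plus itertools.groupby playing the role of the sort-based shuffle:

    from itertools import groupby

    def GROUP(table, f):
        # GROUP table BY λ row : f(row)
        pairs = sorted(((f(row), row) for row in table), key=lambda kr: kr[0])
        for k, grp in groupby(pairs, key=lambda kr: kr[0]):
            yield k, [row for _, row in grp]

    # e.g. group (event, count) rows by the word inside the event:
    #   GROUP(counts, lambda ec: ec[0].split("^")[0])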

49 Abstractions on top of map-reduce. And another example from the Naïve Bayes test program.

50 Request-and-answer. Test data (id1 w1,1 w1,2 w1,3 ... w1,k1; id2 w2,1 w2,2 w2,3 ...; ...) on one side; the record of all event counts for each word (w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...) on the other. Step 2: stream through the test data; for each test case idi wi,1 wi,2 wi,3 ... wi,ki, the classification logic requests the event counters needed to classify idi from the event-count DB, then classifies using the answers.

51 Request-and-answer. Break it down into stages:
- Generate the data being requested (indexed by key, here a word), e.g. with GROUP BY.
- Generate the requests as (key, requestor) pairs, e.g. with FLATMAP TO.
- Join these two tables by key. A join is defined as (1) a cross-product and (2) a filter that drops pairs with different values for the keys. This replaces the step of concatenating two different tables of key-value pairs and reducing them together.
- Postprocess the joined result.

52 The join, pictured: the requests table (w, request), e.g. (found, ~ctr to id1), is joined with the counters table (w, counters), e.g. (found, C[found^Y=sports]=1027, C[found^Y=worldNews]=...), giving rows that carry both the counters and the requests for each word:

  w1  C[w1^Y=sports]=2    ~ctr to id1
  w2  C[w2^Y=sports]=...  ~ctr to id345, ~ctr to id9854
  w3  C[...]              ~ctr to id34742
  ...

53 The join, as syntax. Examples:

  JOIN wordInDoc BY word, wordCounters BY word        --- if word(row) is defined correctly
  JOIN wordInDoc BY lambda (word,docid):word,
       wordCounters BY lambda (word,counters):word    --- using Python syntax for functions

Proposed syntax: JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)

The result pairs each word's counters with the ids that requested it:

  w1  C[w1^Y=sports]=2    id1
  w2  C[w2^Y=sports]=...  id345, id9854
  ...
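A sketch of JOIN in the same Python style: a sort-merge over the two keyed streams, which is equivalent to the cross-product-then-filter definition but done per key.

    from itertools import groupby

    def JOIN(table1, f, table2, g):
        # JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)
        tagged = [(f(r), 0, r) for r in table1] + [(g(r), 1, r) for r in table2]
        tagged.sort(key=lambda t: (t[0], t[1]))
        for _, grp in groupby(tagged, key=lambda t: t[0]):
            grp = list(grp)
            left = [r for _, side, r in grp if side == 0]
            right = [r for _, side, r in grp if side == 1]
            for a in left:
                for b in right:
                    yield (a, b)

    # e.g. JOIN(wordInDoc, lambda wd: wd[0], wordCounters, lambda wc: wc[0])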

54 Abstract implementation: [TF]IDF (1/2)

data = pairs (docid, term), where term is a word that appears in the document with id docid, e.g. (d123, found), (d134, found), ...
Operators: DISTINCT, MAP, JOIN, GROUP BY ... [RETAINING ...] REDUCING TO ... (a reduce step).

  docFreq = DISTINCT data
            | GROUP BY λ(docid,term): term REDUCING TO count      /* (term, df), e.g. (found, 2456) */
  docIds = MAP data BY λ(docid,term): docid | DISTINCT
  numDocs = GROUP docIds BY λ docid: 1 REDUCING TO count          /* (1, numDocs) */
            -- question: how many reducers should I use here?
  dataPlusDF = JOIN data BY λ(docid,term): term, docFreq BY λ(term,df): term
            | MAP λ((docid,term),(term,df)): (docid,term,df)      /* (docid, term, document-freq) */
  unnormalizedDocVecs = JOIN dataPlusDF BY λ row: 1, numDocs BY λ row: 1
            | MAP λ((docid,term,df),(dummy,numDocs)): (docid,term,log(numDocs/df))
                                       /* (docid, term, weight-before-normalizing) : u */
            -- question: how many reducers should I use here?


56 Abstract implementation: TFIDF (2/2)

  normalizers = GROUP unnormalizedDocVecs BY λ(docid,term,w): docid
                RETAINING λ(docid,term,w): w^2
                REDUCING TO sum                       /* (docid, sum-of-square-weights) */
     e.g. the group for d1234 holds (d1234, found, 1.542), (d1234, ..., 13.23), ...

  docVec = JOIN unnormalizedDocVecs BY λ(docid,term,w): docid, normalizers BY λ(docid,norm): docid
           | MAP λ((docid,term,w),(docid,norm)): (docid,term,w/sqrt(norm))
                                                      /* (docid, term, weight) */
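Chaining the sketched operators gives a runnable single-machine version of this dataflow; it is a toy stand-in for the map-reduce pipeline, and, per the slides, the unnormalized weight is log(numDocs/df), with the TF factor left out.

    import math
    from collections import Counter, defaultdict

    def tfidf_vectors(pairs):
        # pairs: iterable of (docid, term)
        distinct = set(pairs)                              # DISTINCT data
        df = Counter(t for _, t in distinct)               # docFreq
        num_docs = len({d for d, _ in distinct})           # numDocs
        u = defaultdict(dict)                              # unnormalizedDocVecs
        for d, t in distinct:
            u[d][t] = math.log(num_docs / df[t])
        vecs = {}                                          # normalizers + docVec
        for d, wt in u.items():
            norm = math.sqrt(sum(x * x for x in wt.values()))
            vecs[d] = {t: x / norm for t, x in wt.items()}
        return vecs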

57 Two ways to join: reduce-side join and map-side join.

58 Two ways to join. Reduce-side join for A, B:

  A (term, df):    (aardvark, 7), ..., (found, 2456), ...
  B (term, docid): (aardvark, d15), ..., (found, d7), (found, d23), ...

Concat and sort, then do the join:

  (aardvark, 7)
  (aardvark, d15)
  ...
  (found, 2456)
  (found, d7)
  (found, d23)

59 Two ways to join. Reduce-side join for A, B, the tricky bit: we need to sort by the first two values (term, then relation tag A/B), because we want the DFs to come first, but all tuples with the same key should still go to the same worker:

  aardvark  A  7
  aardvark  B  d15
  ...
  found     A  2456
  found     B  d7
  found     B  d23

Do the join: (aardvark, 7) then (aardvark, d15); (found, 2456) then (found, d7), (found, d23); ...

60 Two ways to join. Reduce-side join for A, B, the tricky bit (continued): the sort needs (term, relation tag) as a secondary sort key, while partitioning stays on term alone. In Hadoop: a custom sort (secondary sort key) via a Writable with your own Comparator, and a custom Partitioner (specified for the job like the Mapper, Reducer, ...).
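Simulated in Python, the secondary-sort trick looks like this sketch; in real Hadoop the sorting and partitioning are done by the framework via the Comparator and Partitioner above, not in your own code.

    def reduce_side_join(A, B):
        # A: (term, df) rows; B: (term, docid) rows.
        # Tag A with 0 and B with 1, so each term's DF sorts first.
        tagged = [(t, 0, df) for t, df in A] + [(t, 1, d) for t, d in B]
        tagged.sort()                      # secondary sort key: (term, tag)
        current_df = None
        for term, tag, val in tagged:
            if tag == 0:
                current_df = val           # remember this term's DF
            else:
                yield (term, val, current_df)

    # list(reduce_side_join([("found", 2456)], [("found", "d7"), ("found", "d23")]))
    # -> [("found", "d7", 2456), ("found", "d23", 2456)]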

61 Two ways to join. Map-side join: write the smaller relation out to disk; send it to each map worker (DistributedCache); when you initialize each Mapper, load in the small relation (configure() is called at initialization time); then map through the larger relation and do the join. Faster, but requires one relation to fit in memory.
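A map-side join sketch in the same style, where an in-memory dict plays the role of the distributed-cached small relation:

    def map_side_join(small, large_stream):
        # small: (term, df) pairs, assumed to fit in memory
        df = dict(small)                   # loaded once, at "configure" time
        for term, docid in large_stream:   # stream the larger relation
            if term in df:
                yield (term, docid, df[term])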
