Scalable out-of-core classification (of large test sets): can we do better than the current approach?


1 Announcements: Working AWS codes will be out soon. Project deadlines are now posted. Waitlist: you need approval if you are an MS student on the 805 waitlist. William has no office hours next week.

2 Scalable out-of-core classification (of large test sets): can we do better than the current approach?

3 Review of NB algorithms:

  HW | Train events | Test events             | Parallel?
  1  | Msgs -> Disk | HashMap (for subset)    | no
  2  | Msgs -> Disk | HashMap (for subset)    | yes
  3  | Msgs -> Disk | Msgs on Disk (coming.)  | yes

4 Testing Large-vocab Naïve Bayes

Train: for each example id, y, x1,...,xd in train, emit event-counter update messages; sort the event-counter update messages; scan and add the sorted messages and output the final counter values. Model size: O(|V|).

Test:
- Initialize a HashSet NEEDED and a hashtable C.
- For each example id, y, x1,...,xd in test: add x1,...,xd to NEEDED. Time: O(n2), where n2 = size of test; memory: same.
- For each event, C(event) in the summed counters: [for the assignment] if the event involves a NEEDED term x, read it into C. Time: O(n2); memory: same.
- For each example id, y, x1,...,xd in test: for each y in dom(Y), compute log Pr(y, x1,...,xd) = ... Time: O(n2); memory: same.
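A minimal Python sketch of this test-side procedure, under assumptions that are not in the slide: the summed counter file holds tab-separated lines like "X=w^Y=sports<TAB>5245" plus class totals "Y=sports<TAB>n", and add-one smoothing is used for scoring.

    import math

    def needed_terms(test_examples):
        # Pass over test: collect every term into NEEDED.
        needed = set()
        for _id, _y, terms in test_examples:
            needed.update(terms)
        return needed

    def load_needed_counters(counter_lines, needed):
        # Stream the summed counters; keep only events for NEEDED terms.
        C = {}
        for line in counter_lines:
            event, n = line.rstrip("\n").rsplit("\t", 1)
            if event.startswith("Y=") or event.split("^")[0][2:] in needed:
                C[event] = int(n)
        return C

    def classify(terms, C, labels, vocab_size):
        # Score log Pr(y, x1..xd) with add-one smoothing (an assumption).
        best, best_score = None, float("-inf")
        for y in labels:
            ny = C.get("Y=%s" % y, 0)
            score = math.log(ny + 1)
            for w in terms:
                nwy = C.get("X=%s^Y=%s" % (w, y), 0)
                score += math.log((nwy + 1.0) / (ny + vocab_size))
            if score > best_score:
                best, best_score = y, score
        return best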

5 Can we do better?

Test data:
  id1  w1,1 w1,2 w1,3 ... w1,k1
  id2  w2,1 w2,2 w2,3 ...
  id3  w3,1 w3,2 ...
  id4  w4,1 w4,2
  id5  w5,1 w5,2
  ...

Event counts:
  X=w1^Y=sports
  X=w1^Y=worldNews
  X=...
  X=w2^Y=...
  X=...

What we'd like: each test row paired with the counters for its words:
  id1  w1,1 w1,2 w1,3 ... w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=...], C[X=w1,2^...]
  id2  w2,1 w2,2 w2,3 ...         C[X=w2,1^Y=...]=1054, ..., C[X=w2,k2^...]
  id3  w3,1 w3,2 ...              C[X=w3,1^Y=...]=...

6 Can we do better? Step 1: group counters by word w. How: stream and sort: for each C[X=w^Y=y]=n, print "w C[Y=y]=n"; sort, and build a list of values associated with each key w, like an inverted index.

Event counts:
  X=w1^Y=sports
  X=w1^Y=worldNews
  X=...
  X=w2^Y=...
  X=...

Counts associated with each word w:
  w1  C[w1^Y=sports]=2
  w2  C[w2^Y=sports]=1027, C[w2^Y=worldNews]=564
  w3  C[w3^Y=sports]=21, C[w3^Y=worldNews]=...
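In Python, Step 1 might look like the sketch below; itertools.groupby stands in for Unix sort's grouping, and the "X=w^Y=y<TAB>n" message format is an assumption, not from the slide.

    from itertools import groupby

    def regroup_by_word(counter_lines):
        # Rewrite "X=w^Y=y<TAB>n" as the message (w, "C[Y=y]=n").
        msgs = []
        for line in counter_lines:
            event, n = line.rstrip("\n").rsplit("\t", 1)
            x, y = event.split("^")              # "X=w", "Y=sports"
            msgs.append((x[2:], "C[%s]=%s" % (y, n)))
        msgs.sort()                              # in real use: out-of-core Unix sort
        # One inverted-index-style record per word.
        for w, group in groupby(msgs, key=lambda kv: kv[0]):
            yield w, ",".join(v for _, v in group)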

7 If these records were in a key-value DB, we would know what to do.

Test data: id1 w1,1 w1,2 w1,3 ... w1,k1; id2 w2,1 w2,2 w2,3 ...; id3 w3,1 w3,2 ...; id4 w4,1 w4,2; id5 w5,1 w5,2; ...
Record of all event counts for each word: w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...

Step 2: stream through the test data; for each test case idi wi,1 wi,2 wi,3 ... wi,ki, the classification logic requests the event counters needed to classify idi from the event-count DB, then classifies using the answers.

8 Is there a stream-and-sort analog of this request-and-answer pattern? (Same picture as the previous slide: the test data, the record of all event counts for each word, and Step 2's classification logic requesting counters from the event-count DB and classifying using the answers.)

9 Recall stream-and-sort counting: sort messages so the recipient can stream through them. Counting logic on Machine A scans example 1, example 2, example 3, ... and emits counter-update messages "C[x] += D" (C[x1] += D1, C[x1] += D2, ...); the messages are sorted (Machine C); logic on Machine B streams through the sorted messages and combines the counter updates.

10 Is there a stream-and-sort analog of this request-and-answer pattern? Idea: instead of looking counters up in a DB, the classification logic emits request messages: "w1,1 counters to id1", "w1,2 counters to id1", ..., "wi,j counters to idi".

11 Is there a stream-and-sort analog of this request-and-answer pattern? Example: for the test case "id1 found an aarvark in zynga's farmville today!", the classification logic emits "found ctrs to id1", "aarvark ctrs to id1", ..., "today ctrs to id1".

12 Is there a stream-and-sort analog of this request-and-answer pattern? Same example, but now the requests are "found ~ctr to id1", "aarvark ~ctr to id1", ..., "today ~ctr to id1". Why the ~? ~ is the last printable ASCII character, and "% export LC_COLLATE=C" means that it will sort after anything else with Unix sort, so a word's counter record always comes before the requests for it.

13 Is there a stream-and-sort analog of this request-and-answer pattern?

Counter records (one per word): w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...
Requests (from the classification logic): "found ~ctr to id1", "... ~ctr to id2", ..., "today ~ctr to idi".

Combine and sort.

14 A stream-and-sort analog of the request-and-answer pattern. Combine and sort the per-word counter records with the requests; the request-handling logic then sees, for each word, its counter record followed by every request for that word:

  w1  C[w1^Y=sports]=2
  w1  ~ctr to id1
  w2  C[w2^Y=sports]=...
  w2  ~ctr to id345
  w2  ~ctr to id9854
  w3  C[...]
  w3  ~ctr to id345
  w3  ~ctr to id34742
  ...

15 A stream-and-sort analog of the request-and-answer pattern. The request-handling logic streams through the combined and sorted records and requests:

    previousKey = somethingImpossible
    for each (key, val) in input:
        if key == previousKey:
            Answer(recordForPrevKey, val)
        else:
            previousKey = key
            recordForPrevKey = val

    define Answer(record, request):
        find id where request = "~ctr to id"
        print "id ~ctr for request is record"
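Read as Python, the same loop might look like this sketch; input lines are assumed to be tab-separated "key<TAB>value" pairs, already combined and sorted so that each word's counter record precedes its requests.

    import sys

    def handle_requests(lines):
        prev_key, record = None, None
        for line in lines:
            key, val = line.rstrip("\n").split("\t", 1)
            if key == prev_key:
                # val is a request "~ctr to <id>"; answer it from the record.
                rid = val[len("~ctr to "):]
                print("%s\t~ctr for %s is %s" % (rid, key, record))
            else:
                prev_key, record = key, val   # first line per word: its record

    if __name__ == "__main__":
        handle_requests(sys.stdin)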

16 A stream-and-sort analog of the request-and-answer pattern. The same request-handling logic as on the previous slide, with its output:

  id1  ~ctr for w1 is C[w1^Y=sports]=2
  id1  ~ctr for w2 is ...
  ...

17 A stream-and-sort analog of the request-and-answer pattern. The output of the request-handling logic (id1 ~ctr for w1 is C[w1^Y=sports]=2; id1 ~ctr for w2 is ...; ...) is combined and sorted with the test data itself (id1 found an aarvark in zynga's farmville today!; id2 ...; id3 ...; ...), feeding some final logic: ????

18 What we'd wanted: each test row paired with the counters for its words:

  id1  w1,1 w1,2 w1,3 ... w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=...], C[X=w1,2^...]
  id2  w2,1 w2,2 w2,3 ...         C[X=w2,1^Y=...]=1054, ..., C[X=w2,k2^...]
  id3  w3,1 w3,2 ...              C[X=w3,1^Y=...]=...
  id4  w4,1 w4,2

What we ended up with, and it's good enough!

  Key   Value
  id1   found ... farmville today ~ctr for w1 is C[w1^Y=sports]=2 ~ctr for found is C[found^Y=sports]=1027,C[found^Y=worldNews]=564 ...
  id2   w2,1 w2,2 w2,3 ... ~ctr for w2,1 is ...

19 Implementation summary:

  java CountForNB train.dat > eventcounts.dat
  java CountsByWord eventcounts.dat | sort | java CollectRecords > words.dat
  java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

train.dat holds the training examples (id1 w1,1 w1,2 w1,3 ... w1,k1; id2 w2,1 w2,2 w2,3 ...; ...); eventcounts.dat holds the event counts (X=w1^Y=sports, X=w1^Y=worldNews, X=..., X=w2^Y=..., ...).

20 Implementation summary: the same pipeline,

  java CountForNB train.dat > eventcounts.dat
  java CountsByWord eventcounts.dat | sort | java CollectRecords > words.dat
  java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

where words.dat holds the counts associated with each word w:
  w1  C[w1^Y=sports]=2
  w2  C[w2^Y=sports]=1027, C[w2^Y=worldNews]=564
  w3  C[w3^Y=sports]=21, C[w3^Y=worldNews]=4464

21 Implementation summary: in the pipeline

  java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

the output of requestWordCounts looks like request messages (found ~ctr to id1; ... ~ctr to id2; today ~ctr to idi; ...), the input words.dat looks like per-word records (w -> C[w^Y=sports]=2; ...), and after cat - words.dat | sort the stream interleaves each word's record with the requests for it:

  w1  C[w1^Y=sports]=2
  w1  ~ctr to id1
  w2  C[w2^Y=sports]=...
  w2  ~ctr to id345
  w2  ~ctr to id9854
  ...

22 Implementation summary: in the same pipeline, the output of answerWordCountRequests looks like

  id1  ~ctr for w1 is C[w1^Y=sports]=2
  id1  ~ctr for w2 is ...

and test.dat holds the raw test examples (id1 found an aarvark in zynga's farmville today!; id2 ...; id3 ...; ...).

23 Implementation summary: after cat - test.dat | sort, the input to testNBUsingRequests looks like

  Key   Value
  id1   found ... farmville today ~ctr for w1 is C[w1^Y=sports]=2 ~ctr for found is C[found^Y=sports]=1027,C[found^Y=worldNews]=564 ...
  id2   w2,1 w2,2 w2,3 ... ~ctr for w2,1 is ...

24 Summary: we've filled in the missing piece. Training (event-counting): from an in-memory hashtable to on-disk, low-memory. Testing (classification): from an in-memory hashtable to on-disk, low-memory.

25 Rocchio's Algorithm

26 Rocchio's algorithm:

DF(w) = # different docs w occurs in
TF(w, d) = # different times w occurs in doc d
IDF(w) = |D| / DF(w)
u(w, d) = log(TF(w, d) + 1) * log(IDF(w))
u(d) = <u(w_1, d), ..., u(w_|V|, d)>

$$u(y) = \alpha \frac{1}{|C_y|} \sum_{d \in C_y} \frac{u(d)}{\|u(d)\|_2} \;-\; \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} \frac{u(d')}{\|u(d')\|_2}$$

$$f(d) = \arg\max_y \; \frac{u(d)}{\|u(d)\|_2} \cdot \frac{u(y)}{\|u(y)\|_2}, \qquad \|u\|_2 = \sqrt{\textstyle\sum_i u_i^2}$$

Many variants of these formulae, as long as u(w,d) = 0 for words not in d! Store only the non-zeros in u(d), so its size is O(|d|). But the size of u(y) is O(|V|).

27 Rocchio's algorithm:

DF(w) = # different docs w occurs in
TF(w, d) = # different times w occurs in doc d
IDF(w) = |D| / DF(w)
u(w, d) = log(TF(w, d) + 1) * log(IDF(w))
u(d) = <u(w_1, d), ..., u(w_|V|, d)>,  v(d) = u(d) / ||u(d)||_2 = <v(w_1, d), ...>

$$u(y) = \alpha \frac{1}{|C_y|} \sum_{d \in C_y} v(d) \;-\; \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} v(d'), \qquad v(y) = \frac{u(y)}{\|u(y)\|_2}$$

$$f(d) = \arg\max_y \; v(d) \cdot v(y)$$

Given a table mapping w to DF(w), we can compute v(d) from the words in d, and the rest of the learning algorithm is just adding.
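A sketch of that observation in Python: with the DF table and corpus size in hand, v(d) is a purely local computation per document (the function and argument names here are illustrative, not from the slides).

    import math
    from collections import Counter

    def doc_vector(words, df, num_docs):
        # u(w,d) = log(TF(w,d)+1) * log(IDF(w)),  IDF(w) = |D| / DF(w)
        tf = Counter(words)
        u = {w: math.log(n + 1) * math.log(num_docs / df[w]) for w, n in tf.items()}
        # v(d) = u(d) / ||u(d)||_2, stored sparsely: only non-zeros, size O(|d|)
        norm = math.sqrt(sum(x * x for x in u.values()))
        return {w: x / norm for w, x in u.items()}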

28 Imagine a similar process, but for labeled documents: Rocchio vs. Bayes. Train data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...) produces event counts (X=w1^Y=sports, X=w1^Y=worldNews, X=..., X=w2^Y=..., ...). Recall the Naïve Bayes test process: each test row is paired with the counters for its words (id1 y1 w1,1 ... with C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=...], C[X=w1,2^...]; id2 y2 w2,1 ... with C[X=w2,1^Y=...]=1054, ..., C[X=w2,k2^...]; ...).

29 Rocchio: from the train data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...), first compute the DF counts; then, for each document, compute the weights v(w1,1, id1), v(w1,2, id1), ..., v(w1,k1, id1); v(w2,1, id2), v(w2,2, id2); ...

30 Recap: is there a stream-and-sort analog of this request-and-answer pattern? Test data on one side; the record of all event counts for each word on the other (w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...); the classification logic emits requests (found ~ctr to id1; ... ~ctr to id2; today ~ctr to idi), and the counter records and requests are combined and sorted.

31 Recap: is there a stream-and-sort analog of this request-and-answer pattern? Same picture, but now the per-word records are DF values (w -> DF(w)) instead of event counters; the classification logic emits the same requests, which are combined and sorted with the DF records.

32 Recap: a stream-and-sort analog of the request-and-answer pattern. After combining and sorting, each word's record (e.g. w with count 4) is followed by the requests for it (~ctr to id1, ~ctr to id345, ~ctr to id9854, ~ctr to id34742, ...), and the request-handling logic streams through them.

33 Recap: a stream-and-sort analog of the request-and-answer pattern. The request-handling logic is as before:

    previousKey = somethingImpossible
    for each (key, val) in input:
        if key == previousKey:
            Answer(recordForPrevKey, val)
        else:
            previousKey = key
            recordForPrevKey = val

    define Answer(record, request):
        find id where request = "~ctr to id"
        print "id ~ctr for request is record"

34 Rocchio: from the train data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...), compute the DF counts (e.g. 12, 1054, 3, ...); then, with

DF(w) = # different docs w occurs in
TF(w, d) = # different times w occurs in doc d
IDF(w) = |D| / DF(w)
u(w, d) = log(TF(w, d) + 1) * log(IDF(w))
u(d) = <u(w_1, d), ..., u(w_|V|, d)>,  v(d) = u(d) / ||u(d)||_2

output one document vector per document: v(id1), v(id2), ...

35 Rocchio:

$$u(y) = \alpha \frac{1}{|C_y|} \sum_{d \in C_y} v(d) \;-\; \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} v(d'), \qquad v(y) = \frac{u(y)}{\|u(y)\|_2}, \qquad f(d) = \arg\max_y \; v(d) \cdot v(y)$$

v(w1,1 w1,2 w1,3 ... w1,k1) is the document vector for id1; v(w2,1 w2,2 w2,3 ...) = <v(w2,1, d), v(w2,2, d), ...>. For each (y, v), go through the non-zero values in v (one for each w in the document d) and increment a counter for that dimension of v(y). Message: increment v(y1)'s weight for w1,1 by alpha * v(w1,1, d) / |C_y|. Message: increment v(y1)'s weight for w1,2 by alpha * v(w1,2, d) / |C_y|.
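As a stream-and-sort job, those increment messages might be emitted like this sketch, which builds on the doc_vector sketch after slide 27; the tab-separated message format is made up, and the beta term for the negative class is omitted for brevity.

    def emit_increments(labeled_docs, df, num_docs, class_sizes, alpha=1.0):
        # labeled_docs: iterable of (doc_id, y, words); class_sizes[y] = |C_y|
        for _id, y, words in labeled_docs:
            v = doc_vector(words, df, num_docs)   # as sketched after slide 27
            for w, x in v.items():
                # "increment v(y)'s weight for w by alpha * v(w,d) / |C_y|"
                print("%s\t%s\t%.6f" % (y, w, alpha * x / class_sizes[y]))

Sorting the output on (y, w) and summing the increments then yields u(y), which is just the "adding" the slide promises.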

36 Rocchio at test time:

$$f(d) = \arg\max_y \; v(d) \cdot v(y) = \arg\max_y \sum_w v(w, d) \cdot v(w, y)$$

Training produces the DF counts and the per-word class weights (w -> v(y1, w) = 0.013, v(y2, w) = ..., ...). At test time, stream through the test data (id1 y1 w1,1 w1,2 w1,3 ... w1,k1; id2 y2 w2,1 w2,2 w2,3 ...; ...) and pair each document vector with the weights for its words: v(id1) with v(w1,1, y1), ..., v(w1,k1, yk); v(id2) with v(w2,1, y1), ...; ...

37 Rocchio summary:
- Compute DF: one scan through the docs; time O(n), n = corpus size (like NB event-counts).
- Compute v(id_i) for each document: output size O(n); one scan if the DFs fit in memory (otherwise, like the first part of the NB test procedure); time O(n).
- Add up the vectors to get each v(y): one scan if the v(y)'s fit in memory (otherwise, like NB training); time O(n).
- Classification ~= disk NB.

38 One? more Rocchio observation: split the Documents/labels into subsets (Documents/labels-1, -2, -3); compute DFs on each subset (DFs-1, DFs-2, DFs-3); then sort and add the counts to get the DFs.

39 One?? more Rocchio observation: given the DFs, split the Documents/labels into subsets (Documents/labels-1, -2, -3); compute partial v(y)'s on each subset (v-1, v-2, v-3); then sort and add the vectors to get the v(y)'s.

40 O(1) more Rocchio observation: split the Documents/labels into subsets; give each subset a copy of the DFs; compute partial v(y)'s (v-1, v-2, v-3); sort and add the vectors to get the v(y)'s (coming up: how). We have shared access to the DFs, but only shared read access; we don't need to share write access. So we only need to copy the information across the different processes.

41 ABSTRACTIONS FOR STREAM-AND-SORT AND MAP-REDUCE

42 Abstractions on top of map-reduce. We've decomposed some algorithms into a map-reduce workflow (a series of map-reduce steps): naive Bayes training, naïve Bayes testing. How else can we express these sorts of computations? Are there some common special cases of map-reduce steps that we can parameterize and reuse?

43 Abstractions on top of map-reduce. Some obvious streaming processes: for each row in a table, (a) transform it and output the result, or (b) decide if you want to keep it with some boolean test, and copy out only the ones that pass the test.

Example: stem words in a stream of word-count pairs: ("aardvarks",1) -> ("aardvark",1).
Proposed syntax: f(row) -> row;  table2 = MAP table1 TO λ row : f(row)

Example: apply stop words: ("aardvark",1) -> ("aardvark",1); ("the",1) -> deleted.
Proposed syntax: f(row) -> {true,false};  table2 = FILTER table1 BY λ row : f(row)
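In Python, MAP and FILTER are one-liners over a stream of rows; this is a sketch of the proposed operators, not code from the slides.

    def MAP(table, f):
        # table2 = MAP table1 TO λ row : f(row)
        return (f(row) for row in table)

    def FILTER(table, f):
        # table2 = FILTER table1 BY λ row : f(row)
        return (row for row in table if f(row))

    # e.g. stop-wording: FILTER(pairs, lambda wc: wc[0] not in {"the", "a", "an"})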

44 Abstractions on top of map-reduce. A non-obvious(?) streaming process: for each row in a table, transform it to a list of items, then splice all the lists together to get the output table (flatten).

Example: tokenizing a line: "I found an aardvark ..." -> ["i", "found", "an", "aardvark", ...]; "We love zymurgy" -> ["we", "love", "zymurgy"]; but the final table is one word per row: i / found / an / ... / we / love / ...

Proposed syntax: f(row) -> list of rows;  table2 = FLATMAP table1 TO λ row : f(row)
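FLATMAP, sketched in the same style:

    def FLATMAP(table, f):
        # table2 = FLATMAP table1 TO λ row : f(row)
        return (out for row in table for out in f(row))

    # e.g. tokenizing: FLATMAP(lines, lambda line: line.lower().split())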

45 Abstractions on top of map-reduce. An example: the Naïve Bayes test program.

46 NB test step. How: stream and sort: for each C[X=w^Y=y]=n, print "w C[Y=y]=n"; sort, and build a list of values associated with each key w, like an inverted index. The event counts (X=w1^Y=sports, X=w1^Y=worldNews, X=..., X=w2^Y=..., ...) become per-word records:

  w1  C[w1^Y=sports]=2
  w2  C[w2^Y=sports]=1027, C[w2^Y=worldNews]=564
  w3  C[w3^Y=sports]=21, C[w3^Y=worldNews]=4464

47 NB test step, the general case: we're taking rows from a table in a particular format (event, count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value: a grouping operation, a special case of a map-reduce. The event counts become per-word records, as on the previous slide.

Proposed syntax: f(row) -> field;  GROUP table BY λ row : f(row)
(We could define f via a function, a field of a defined record structure, ...)

48 NB test step, the general case (continued). Aside: you guys know how to implement this, right?
1. Output pairs (f(row), row) with a map/streaming process.
2. Sort the pairs by key, which is f(row).
3. Reduce and aggregate by appending together all the values associated with the same key.

Proposed syntax: f(row) -> field;  GROUP table BY λ row : f(row)
(We could define f via a function, a field of a defined record structure, ...)
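Those three steps, sketched in Python, with sorted plus itertools.groupby playing the role of the sort-based shuffle:

    from itertools import groupby

    def GROUP(table, f):
        # GROUP table BY λ row : f(row)
        pairs = sorted(((f(row), row) for row in table), key=lambda kr: kr[0])
        for k, grp in groupby(pairs, key=lambda kr: kr[0]):
            yield k, [row for _, row in grp]

    # e.g. group (event, count) rows by the word inside the event:
    #   GROUP(counts, lambda ec: ec[0].split("^")[0])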

49 Abstractions on top of map-reduce. And another example from the Naïve Bayes test program.

50 Request-and-answer. Test data (id1 w1,1 w1,2 w1,3 ... w1,k1; id2 w2,1 w2,2 w2,3 ...; ...) on one side; the record of all event counts for each word (w -> C[w^Y=sports]=..., C[w^Y=worldNews]=..., ...) on the other. Step 2: stream through the test data; for each test case idi wi,1 wi,2 wi,3 ... wi,ki, the classification logic requests the event counters needed to classify idi from the event-count DB, then classifies using the answers.

51 Request-and-answer. Break it down into stages:
- Generate the data being requested (indexed by key, here a word), e.g. with GROUP BY.
- Generate the requests as (key, requestor) pairs, e.g. with FLATMAP TO.
- Join these two tables by key. A join is defined as (1) a cross-product and (2) a filter that drops pairs with different values for the keys. This replaces the step of concatenating two different tables of key-value pairs and reducing them together.
- Postprocess the joined result.

52 The join, pictured: the requests table (w, request), e.g. (found, ~ctr to id1), is joined with the counters table (w, counters), e.g. (found, C[found^Y=sports]=1027, C[found^Y=worldNews]=...), giving rows that carry both the counters and the requests for each word:

  w1  C[w1^Y=sports]=2    ~ctr to id1
  w2  C[w2^Y=sports]=...  ~ctr to id345, ~ctr to id9854
  w3  C[...]              ~ctr to id34742
  ...

53 The join, as syntax. Examples:

  JOIN wordInDoc BY word, wordCounters BY word        --- if word(row) is defined correctly
  JOIN wordInDoc BY lambda (word,docid):word,
       wordCounters BY lambda (word,counters):word    --- using Python syntax for functions

Proposed syntax: JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)

The result pairs each word's counters with the ids that requested it:

  w1  C[w1^Y=sports]=2    id1
  w2  C[w2^Y=sports]=...  id345, id9854
  ...
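A sketch of JOIN in the same Python style: a sort-merge over the two keyed streams, which is equivalent to the cross-product-then-filter definition but done per key.

    from itertools import groupby

    def JOIN(table1, f, table2, g):
        # JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)
        tagged = [(f(r), 0, r) for r in table1] + [(g(r), 1, r) for r in table2]
        tagged.sort(key=lambda t: (t[0], t[1]))
        for _, grp in groupby(tagged, key=lambda t: t[0]):
            grp = list(grp)
            left = [r for _, side, r in grp if side == 0]
            right = [r for _, side, r in grp if side == 1]
            for a in left:
                for b in right:
                    yield (a, b)

    # e.g. JOIN(wordInDoc, lambda wd: wd[0], wordCounters, lambda wc: wc[0])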

54 Abstract implementation: [TF]IDF (1/2)

data = pairs (docid, term), where term is a word that appears in the document with id docid, e.g. (d123, found), (d134, found), ...
Operators: DISTINCT, MAP, JOIN, GROUP BY ... [RETAINING ...] REDUCING TO ... (a reduce step).

  docFreq = DISTINCT data
            | GROUP BY λ(docid,term): term REDUCING TO count      /* (term, df), e.g. (found, 2456) */
  docIds = MAP data BY λ(docid,term): docid | DISTINCT
  numDocs = GROUP docIds BY λ docid: 1 REDUCING TO count          /* (1, numDocs) */
            -- question: how many reducers should I use here?
  dataPlusDF = JOIN data BY λ(docid,term): term, docFreq BY λ(term,df): term
            | MAP λ((docid,term),(term,df)): (docid,term,df)      /* (docid, term, document-freq) */
  unnormalizedDocVecs = JOIN dataPlusDF BY λ row: 1, numDocs BY λ row: 1
            | MAP λ((docid,term,df),(dummy,numDocs)): (docid,term,log(numDocs/df))
                                       /* (docid, term, weight-before-normalizing) : u */
            -- question: how many reducers should I use here?


56 Abstract implementation: TFIDF (2/2)

  normalizers = GROUP unnormalizedDocVecs BY λ(docid,term,w): docid
                RETAINING λ(docid,term,w): w^2
                REDUCING TO sum                       /* (docid, sum-of-square-weights) */
     e.g. the group for d1234 holds (d1234, found, 1.542), (d1234, ..., 13.23), ...

  docVec = JOIN unnormalizedDocVecs BY λ(docid,term,w): docid, normalizers BY λ(docid,norm): docid
           | MAP λ((docid,term,w),(docid,norm)): (docid,term,w/sqrt(norm))
                                                      /* (docid, term, weight) */
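Chaining the sketched operators gives a runnable single-machine version of this dataflow; it is a toy stand-in for the map-reduce pipeline, and, per the slides, the unnormalized weight is log(numDocs/df), with the TF factor left out.

    import math
    from collections import Counter, defaultdict

    def tfidf_vectors(pairs):
        # pairs: iterable of (docid, term)
        distinct = set(pairs)                              # DISTINCT data
        df = Counter(t for _, t in distinct)               # docFreq
        num_docs = len({d for d, _ in distinct})           # numDocs
        u = defaultdict(dict)                              # unnormalizedDocVecs
        for d, t in distinct:
            u[d][t] = math.log(num_docs / df[t])
        vecs = {}                                          # normalizers + docVec
        for d, wt in u.items():
            norm = math.sqrt(sum(x * x for x in wt.values()))
            vecs[d] = {t: x / norm for t, x in wt.items()}
        return vecs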

57 Two ways to join: reduce-side join and map-side join.

58 Two ways to join. Reduce-side join for A, B:

  A (term, df):    (aardvark, 7), ..., (found, 2456), ...
  B (term, docid): (aardvark, d15), ..., (found, d7), (found, d23), ...

Concat and sort, then do the join:

  (aardvark, 7)
  (aardvark, d15)
  ...
  (found, 2456)
  (found, d7)
  (found, d23)

59 Two ways to join. Reduce-side join for A, B, the tricky bit: we need to sort by the first two values (term, then relation tag A/B), because we want the DFs to come first, but all tuples with the same key should still go to the same worker:

  aardvark  A  7
  aardvark  B  d15
  ...
  found     A  2456
  found     B  d7
  found     B  d23

Do the join: (aardvark, 7) then (aardvark, d15); (found, 2456) then (found, d7), (found, d23); ...

60 Two ways to join. Reduce-side join for A, B, the tricky bit (continued): the sort needs (term, relation tag) as a secondary sort key, while partitioning stays on term alone. In Hadoop: a custom sort (secondary sort key) via a Writable with your own Comparator, and a custom Partitioner (specified for the job like the Mapper, Reducer, ...).
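Simulated in Python, the secondary-sort trick looks like this sketch; in real Hadoop the sorting and partitioning are done by the framework via the Comparator and Partitioner above, not in your own code.

    def reduce_side_join(A, B):
        # A: (term, df) rows; B: (term, docid) rows.
        # Tag A with 0 and B with 1, so each term's DF sorts first.
        tagged = [(t, 0, df) for t, df in A] + [(t, 1, d) for t, d in B]
        tagged.sort()                      # secondary sort key: (term, tag)
        current_df = None
        for term, tag, val in tagged:
            if tag == 0:
                current_df = val           # remember this term's DF
            else:
                yield (term, val, current_df)

    # list(reduce_side_join([("found", 2456)], [("found", "d7"), ("found", "d23")]))
    # -> [("found", "d7", 2456), ("found", "d23", 2456)]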

61 Two ways to join. Map-side join: write the smaller relation out to disk; send it to each map worker (DistributedCache); when you initialize each Mapper, load in the small relation (configure() is called at initialization time); then map through the larger relation and do the join. Faster, but requires one relation to fit in memory.
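A map-side join sketch in the same style, where an in-memory dict plays the role of the distributed-cached small relation:

    def map_side_join(small, large_stream):
        # small: (term, df) pairs, assumed to fit in memory
        df = dict(small)                   # loaded once, at "configure" time
        for term, docid in large_stream:   # stream the larger relation
            if term in df:
                yield (term, docid, df[term])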
