Scalable out-of-core classification (of large test sets): can we do better than the current approach?
1 Announcements: Working AWS codes will be out soon. Project deadlines are now posted. Waitlist is at ; you need approval if you are an MS student on the 805 waitlist. William has no office hours next week.
2 Scalable out-of-core classification (of large test sets): can we do better than the current approach?
3 Review of NB algorithms
HW | Train events | Test events           | Parallel?
1  | Msgs -> Disk | HashMap (for subset)  | no
2  | Msgs -> Disk | HashMap (for subset)  | yes
3  | Msgs -> Disk | Msgs on Disk (coming) | yes
4 Testing Large-vocab Naïve Bayes
For each example id, y, x1,...,xd in train: sort the event-counter update messages; scan and add the sorted messages and output the final counter values. [Model size: O(|V|)]
Initialize a HashSet NEEDED and a hashtable C. For each example id, y, x1,...,xd in test: add x1,...,xd to NEEDED. [Time: O(n2), n2 = size of test set; Memory: same]
For each event, C(event) in the summed counters: [for the assignment] if the event involves a NEEDED term x, read it into C. [Time: O(n2); Memory: same]
For each example id, y, x1,...,xd in test: for each y in dom(Y): compute log Pr(y, x1,...,xd) = ... [Time: O(n2); Memory: same]
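The NEEDED-set trick above can be sketched in a few lines of Python. This is an illustrative sketch only, not the course's actual code; the function and argument names (load_needed_counters, counter_lines, test_docs) are mine:

```python
def load_needed_counters(counter_lines, test_docs):
    """Scan the summed event counters, keeping only counters whose term
    appears in some test document (the NEEDED HashSet of the slide)."""
    needed = set(w for _, words in test_docs for w in words)
    C = {}
    for event, count in counter_lines:   # event is a (term, class) pair
        term, _y = event
        if term in needed:
            C[event] = count
    return C
```

The point of the sketch: memory holds only counters for terms that actually occur in the test set, not the full O(|V|) model.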
5 Can we do better?
Test data: id 1 w 1,1 w 1,2 w 1,3. w 1,k1; id 2 w 2,1 w 2,2 w 2,3.; id 3 w 3,1 w 3,2.; id 4 w 4,1 w 4,2; id 5 w 5,1 w 5,2...
Event counts: X=w 1^Y=sports; X=w 1^Y=worldNews; X=..; X=w 2^Y=; X=
What we'd like: for id 1, C[X=w 1,1^Y=sports]=5245, C[X=w 1,1^Y=..], C[X=w 1,2^]; for id 2, C[X=w 2,1^Y=.]=1054, ..., C[X=w 2,k2^]; for id 3, C[X=w 3,1^Y=.]=...
6 Can we do better? Step 1: group counters by word w How: Stream and sort: for each C[X=w^Y=y]=n print w C[Y=y]=n sort and build a list of values associated with each key w Like an inverted index w Event counts X=w 1^Y=sports X=w 1^Y=worldNews X=.. X=w 2^Y= X= Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=worldNews]=564 C[w^Y=sports]=21,C[w^Y=worldNews]=
7 If these records were in a key-value DB we would know what to do. Test data Record of all event counts for each word id 1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 w 2,1 w 2,2 w 2,3. id 3 w 3,1 w 3,2. id 4 w 4,1 w 4,2 id 5 w 5,1 w 5,2... w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Step 2: stream through and for each test case id i w i,1 w i,2 w i,3. w i,ki Classification logic request the event counters needed to classify id i from the event-count DB, then classify using the answers
8 Is there a stream-and-sort analog of this request-and-answer pattern? Test data Record of all event counts for each word id 1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 w 2,1 w 2,2 w 2,3. id 3 w 3,1 w 3,2. id 4 w 4,1 w 4,2 id 5 w 5,1 w 5,2... w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Step 2: stream through and for each test case id i w i,1 w i,2 w i,3. w i,ki Classification logic request the event counters needed to classify id i from the event-count DB, then classify using the answers
9 Recall: Stream and Sort Counting: sort messages so the recipient can stream through them example 1 example 2 example 3. C[x1] += D1 C[x1] += D2. Counting logic C[x] +=D Sort Logic to combine counter updates Machine A Machine C Machine B
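The stream-and-sort counting recapped above can be sketched as follows; a minimal illustration with made-up names, not the course's implementation:

```python
def stream_and_sort_count(messages):
    """Counting via sorting: emit 'C[x] += delta' messages, sort them by
    key, then stream through the sorted run combining adjacent updates,
    so only one counter is live in memory at a time."""
    out = []
    prev_key, total = None, 0
    for key, delta in sorted(messages):
        if key == prev_key:
            total += delta               # same counter: accumulate
        else:
            if prev_key is not None:
                out.append((prev_key, total))
            prev_key, total = key, delta # new counter starts
    if prev_key is not None:
        out.append((prev_key, total))
    return out
```

In the real pipeline the sort is done by unix sort between two streaming processes, so no hashtable ever has to fit in memory.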
10 Is there a stream-and-sort analog of this request-and-answer pattern? Test data Record of all event counts for each word id 1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 w 2,1 w 2,2 w 2,3. id 3 w 3,1 w 3,2. id 4 w 4,1 w 4,2 id 5 w 5,1 w 5,2... w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Classification logic W 1,1 counters to id 1 W 1,2 counters to id 1 W i,j counters to id i
11 Is there a stream-and-sort analog of this request-and-answer pattern? Test data Record of all event counts for each word id 1 found an aardvark in s farmville today! id 2 id 3. id 4 id 5.. w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Classification logic found ctrs to id 1 ctrs to id 1 today ctrs to id 1
12 Is there a stream-and-sort analog of this request-and-answer pattern? Test data Record of all event counts for each word id 1 found an aardvark in s farmville today! id 2 id 3. id 4 id 5.. w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN ~ is the last printable ASCII character Classification logic found ~ctr to id 1 ~ctr to id 1 today ~ctr to id 1 % export LC_COLLATE=C means that it will sort after anything else with unix sort
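The "~" trick can be checked directly: '~' is byte 0x7E, the last printable ASCII character, so under byte-wise ordering (which is what unix sort does with LC_COLLATE=C, and what Python string comparison does for ASCII) a request line for word w sorts after the counter-record line for the same w. A small demonstration with made-up lines:

```python
# Counter records and request messages for the same word, mixed together.
lines = [
    "found ~ctr to id1",
    "found C[found^Y=sports]=1027",
    "aardvark ~ctr to id1",
    "aardvark C[aardvark^Y=sports]=2",
]
# Byte-wise sort: within each word, 'C' (0x43) precedes '~' (0x7E),
# so every record arrives before the requests that need it.
lines.sort()
```

This ordering is exactly what lets the request-handler stream: it always sees a word's record before that word's requests.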
13 Is there a stream-and-sort analog of this request-and-answer pattern? Test data Record of all event counts for each word id 1 found an in s farmville today! id 2 id 3. id 4 id 5.. w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Counter records Classification logic found ~ctr to id 1 ~ctr to id 2 today ~ctr to id i requests Combine and sort
14 A stream-and-sort analog of the request-and-answer pattern Record of all event counts for each word w Counts C[w^Y=sports]=2 Counter records w Counts C[w^Y=sports]=2 ~ctr to id1 C[w^Y=sports]= ~ctr to id345 ~ctr to id9854 ~ctr to id345 ~ctr to id34742 C[] found ~ctr to id 1 ~ctr to id 1 today ~ctr to id 1 requests Combine and sort ~ctr to id1 Request-handling logic
15 A stream-and-sort analog of the request-and-answer pattern
previousKey = somethingImpossible
For each (key,val) in input:
  If key == previousKey: Answer(recordForPrevKey, val)
  Else: previousKey = key; recordForPrevKey = val
define Answer(record, request):
  find id where request = ~ctr to id
  print id ~ctr for request is record
w Counts: C[w^Y=sports]=2, ~ctr to id1; C[w^Y=sports]=, ~ctr to id345, ~ctr to id9854, ~ctr to id345, ~ctr to id34742; C[], ~ctr to id1 requests Combine and sort Request-handling logic
16 A stream-and-sort analog of the request-and-answer pattern
previousKey = somethingImpossible
For each (key,val) in input:
  If key == previousKey: Answer(recordForPrevKey, val)
  Else: previousKey = key; recordForPrevKey = val
define Answer(record, request):
  find id where request = ~ctr to id
  print id ~ctr for request is record
Output: id1 ~ctr for is C[w^Y=sports]=2; id1 ~ctr for is . Combine and sort requests ~ctr to id1 Request-handling logic
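The request-handling pseudocode above can be made runnable. A minimal sketch under my own naming (handle_requests); it assumes the input lines are already sorted so each word's record precedes that word's requests:

```python
def handle_requests(sorted_lines):
    """Stream through sorted 'word record' and 'word ~ctr to id' lines,
    holding only the current word's record, and answer each request."""
    output = []
    prev_key, record = None, None
    for line in sorted_lines:
        key, _, val = line.partition(" ")
        if key == prev_key and val.startswith("~ctr to "):
            req_id = val[len("~ctr to "):]
            output.append(f"{req_id} ~ctr for {key} is {record}")
        else:
            # new key (or a record line): remember it for later requests
            prev_key, record = key, val
    return output
```

Note the memory footprint: one key and one record at a time, regardless of vocabulary size.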
17 A stream-and-sort analog of the request-and-answer pattern w Counts C[w^Y=sports]=2 ~ctr to id1 C[w^Y=sports]= ~ctr to id345 ~ctr to id9854 ~ctr to id345 ~ctr to id34742 C[] ~ctr to id1 Output: id1 ~ctr for is C[w^Y=sports]=2 id1 ~ctr for is. id 1 found an in s farmville today! id 2 id 3. id 4 id 5.. Request-handling logic Combine and sort????
18 id 1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 w 2,1 w 2,2 w 2,3. id 3 w 3,1 w 3,2. id 4 w 4,1 w 4,2 What we'd wanted: C[X=w 1,1^Y=sports]=5245, C[X=w 1,1^Y=..],C[X=w 1,2^] C[X=w 2,1^Y=.]=1054,, C[X=w 2,k2^] C[X=w 3,1^Y=.]= What we ended up with, and it's good enough! Key id1, Value: found farmville today ~ctr for is C[w^Y=sports]=2 ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564; Key id2, Value: w 2,1 w 2,2 w 2,3. ~ctr for w 2,1 is
19 Implementation summary
java CountForNB train.dat > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests
train.dat: id 1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 w 2,1 w 2,2 w 2,3. id 3 w 3,1 w 3,2. id 4 w 4,1 w 4,2 id 5 w 5,1 w 5,2...
counts.dat: X=w1^Y=sports X=w1^Y=worldNews X=.. X=w2^Y= X=
20 Implementation summary
java CountForNB train.dat > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests
words.dat: w Counts associated with W: C[w^Y=sports]=2; C[w^Y=sports]=1027,C[w^Y=worldNews]=564; C[w^Y=sports]=21,C[w^Y=worldNews]=4464
21 Implementation summary
java CountForNB train.dat > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests
requestWordCounts output looks like this: found ~ctr to id 1; ~ctr to id 2; today ~ctr to id i
cat - words.dat input looks like this (words.dat): w Counts C[w^Y=sports]=2; after combining with requests: w Counts C[w^Y=sports]=2 ~ctr to id1 C[w^Y=sports]= ~ctr to id345 ~ctr to id9854 ~ctr to id345
22 Implementation summary
java CountForNB train.dat > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests
answerWordCountRequests output looks like this: id1 ~ctr for is C[w^Y=sports]=2; id1 ~ctr for is .
test.dat: id 1 found an in s farmville today! id 2 id 3. id 4 id 5..
23 Implementation summary
java CountForNB train.dat > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat
java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests
Input to testNBUsingRequests looks like this: Key id1, Value: found farmville today ~ctr for is C[w^Y=sports]=2 ~ctr for found is C[w^Y=sports]=1027,C[w^Y=worldNews]=564; Key id2, Value: w 2,1 w 2,2 w 2,3. ~ctr for w 2,1 is
24 Summary: We've filled in the missing piece. Training (event-counting): hashtable -> in-memory -> on-disk, low-memory. Testing (classification): hashtable -> in-memory -> on-disk, low-memory.
25 Rocchio's Algorithm
26 Rocchio's algorithm
DF(w) = # different docs w occurs in
TF(w,d) = # times w occurs in doc d
IDF(w) = |D| / DF(w)
u(w,d) = log(TF(w,d) + 1) * log(IDF(w))
u(d) = <u(w 1,d), ..., u(w |V|,d)>
u(y) = alpha * (1/|C_y|) * sum_{d in C_y} u(d)/||u(d)||2 - beta * (1/|D - C_y|) * sum_{d' in D - C_y} u(d')/||u(d')||2
f(d) = argmax_y (u(d)/||u(d)||2) . (u(y)/||u(y)||2)
where ||u||2 = sqrt(sum_i u_i^2)
Many variants of these formulae, as long as u(w,d)=0 for words not in d! Store only non-zeros in u(d), so size is O(|d|). But size of u(y) is O(|V|).
27 Rocchio's algorithm
DF(w) = # different docs w occurs in; TF(w,d) = # times w occurs in doc d; IDF(w) = |D| / DF(w)
u(w,d) = log(TF(w,d) + 1) * log(IDF(w))
u(d) = <u(w 1,d), ..., u(w |V|,d)>; v(d) = u(d)/||u(d)||2 = <v(w 1,d), ...>
u(y) = alpha * (1/|C_y|) * sum_{d in C_y} v(d) - beta * (1/|D - C_y|) * sum_{d' in D - C_y} v(d'); v(y) = u(y)/||u(y)||2
f(d) = argmax_y v(d) . v(y)
Given a table mapping w to DF(w), we can compute v(d) from the words in d, and the rest of the learning algorithm is just adding.
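Computing v(d) from a DF table, as the slide describes, can be sketched as follows. An illustrative sketch with my own names (doc_vector, num_docs), not the course's code:

```python
import math

def doc_vector(words, df, num_docs):
    """v(d): u(w,d) = log(TF(w,d)+1) * log(IDF(w)), then L2-normalize.
    Only words appearing in d get non-zero weight, so the vector is
    stored sparsely as a dict of size O(|d|)."""
    tf = {}
    for w in words:
        tf[w] = tf.get(w, 0) + 1
    u = {w: math.log(n + 1) * math.log(num_docs / df[w]) for w, n in tf.items()}
    norm = math.sqrt(sum(x * x for x in u.values()))
    return {w: x / norm for w, x in u.items()}
```

After this step the rest of training really is "just adding": v(y) is a scaled sum of the v(d)'s in each class.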
28 Imagine a similar process but for labeled documents Train data Rocchio v Bayes Event counts id 1 y1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 y2 w 2,1 w 2,2 w 2,3. id 3 y3 w 3,1 w 3,2. id 4 y4 w 4,1 w 4,2 id 5 y5 w 5,1 w 5,2... X=w 1^Y=sports X=w 1^Y=worldNews X=.. X=w 2^Y= X= Recall Naïve Bayes test process? id 1 y1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 y2 w 2,1 w 2,2 w 2,3. id 3 y3 w 3,1 w 3,2. C[X=w 1,1^Y=sports]=5245, C[X=w 1,1^Y=..],C[X=w 1,2^] C[X=w 2,1^Y=.]=1054,, C[X=w 2,k2^] C[X=w 3,1^Y=.]= id 4 y4 w 4,1 w 4,2 28
29 Rocchio. Train data id 1 y1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 y2 w 2,1 w 2,2 w 2,3. id 3 y3 w 3,1 w 3,2. id 4 y4 w 4,1 w 4,2 id 5 y5 w 5,1 w 5,2... Rocchio: DF counts id 1 y1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 y2 w 2,1 w 2,2 w 2,3. id 3 y3 w 3,1 w 3,2. id 4 y4 w 4,1 w 4,2 v(w 1,1,id1), v(w 1,2,id1)v(w 1,k1,id1) v(w 2,1,id2), v(w 2,2,id2) 29
30 recap: Is there a stream-and-sort analog of this request-and-answer pattern? Test data Record of all event counts for each word id 1 found an in s farmville today! id 2 id 3. id 4 id 5.. w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Counter records Classification logic found ~ctr to id 1 ~ctr to id 2 today ~ctr to id i requests Combine and sort
31 recap: Is there a stream-and-sort analog of this request-and-answer pattern? Test data Record of all event counts for each word id 1 found an in s farmville today! id 2 id 3. id 4 id 5.. w DF(w) Counter records Classification logic found ~ctr to id 1 ~ctr to id 2 today ~ctr to id i requests Combine and sort
32 recap: A stream-and-sort analog of the request-and-answer pattern Record of all event counts for each word w Counts C[w^Y=sports]=2 Counter records w 4 Counts ~ctr to id1 ~ctr to id345 ~ctr to id9854 ~ctr to id345 ~ctr to id34742 C[] found ~ctr to id 1 ~ctr to id 1 today ~ctr to id 1 requests Combine and sort ~ctr to id1 Request-handling logic
33 recap: A stream-and-sort analog of the request-and-answer pattern
previousKey = somethingImpossible
For each (key,val) in input:
  If key == previousKey: Answer(recordForPrevKey, val)
  Else: previousKey = key; recordForPrevKey = val
define Answer(record, request):
  find id where request = ~ctr to id
  print id ~ctr for request is record
w 4 Counts ~ctr to id1 ~ctr to id345 ~ctr to id9854 ~ctr to id345 ~ctr to id34742 C[] ~ctr to id1 requests Combine and sort Request-handling logic
34 Rocchio.
Train data: id 1 y1 w 1,1 w 1,2 w 1,3. w 1,k1; id 2 y2 w 2,1 w 2,2 w 2,3.; id 3 y3 w 3,1 w 3,2.; id 4 y4 w 4,1 w 4,2; id 5 y5 w 5,1 w 5,2...
Rocchio: DF counts (e.g. 12, 1054, 3)
DF(w) = # different docs w occurs in; TF(w,d) = # times w occurs in doc d; IDF(w) = |D| / DF(w)
u(w,d) = log(TF(w,d) + 1) * log(IDF(w)); u(d) = <u(w 1,d), ..., u(w |V|,d)>; v(d) = u(d)/||u(d)||2
Output: v(id 1), v(id 2), ...
35 Rocchio.
u(y) = alpha * (1/|C_y|) * sum_{d in C_y} v(d) - beta * (1/|D - C_y|) * sum_{d' in D - C_y} v(d'); v(y) = u(y)/||u(y)||2; f(d) = argmax_y v(d) . v(y)
v(w 1,1 w 1,2 w 1,3 . w 1,k1) is the document vector for id 1; v(w 2,1 w 2,2 w 2,3 .) = <v(w 2,1,d), v(w 2,2,d), ...>
For each (y, v), go through the non-zero values in v (one for each w in the document d) and increment a counter for that dimension of v(y).
Message: increment v(y1)'s weight for w 1,1 by alpha * v(w 1,1,d) / |C_y|
Message: increment v(y1)'s weight for w 1,2 by alpha * v(w 1,2,d) / |C_y|
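The message-style counter updates above (the positive, alpha term only; the slide's -beta term over D - C_y is handled analogously) can be sketched as follows. An illustrative sketch; rocchio_train_positive and labeled_vectors are my own names:

```python
from collections import defaultdict

def rocchio_train_positive(labeled_vectors, alpha=1.0):
    """labeled_vectors: list of (y, v(d)) with v(d) a sparse dict.
    For each (y, v), increment v(y)'s weight for every non-zero w
    by alpha * v(w,d) / |C_y|, exactly the messages on the slide."""
    class_sizes = defaultdict(int)
    for y, _ in labeled_vectors:
        class_sizes[y] += 1
    v_y = defaultdict(lambda: defaultdict(float))
    for y, vec in labeled_vectors:
        for w, weight in vec.items():
            v_y[y][w] += alpha * weight / class_sizes[y]
    return v_y
```

Because the update is a sum of per-document messages, the same computation parallelizes with sort-and-add, just like the NB event counts.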
36 Rocchio at Test Time
f(d) = argmax_y v(d) . v(y) = argmax_y sum_w v(w,d) * v(w,y)
Train data: id 1 y1 w 1,1 w 1,2 w 1,3. w 1,k1; id 2 y2 w 2,1 w 2,2 w 2,3.; id 3 y3 w 3,1 w 3,2.; id 4 y4 w 4,1 w 4,2; id 5 y5 w 5,1 w 5,2...
Rocchio: DF counts, plus per-word class weights, e.g. v(y1,w)=0.013, v(y2,w)=...
Test output: v(id 1), <v(w 1,1,y1), v(w 1,1,y1), ..., v(w 1,k1,yk)>; v(id 2), <v(w 2,1,y1), ...>
37 Rocchio Summary
Compute DF: one scan thru docs; time O(n), n = corpus size; like NB event-counts.
Compute v(id i) for each document: output size O(n); time O(n), one scan if DF fits in memory; like the first part of the NB test procedure otherwise.
Add up vectors to get v(y): time O(n), one scan if the v(y)'s fit in memory; like NB training otherwise.
Classification ~= disk NB.
38 One? more Rocchio observation: Documents/labels -> split into document subsets -> Documents/labels 1, Documents/labels 2, Documents/labels 3 -> compute DFs -> DFs-1, DFs-2, DFs-3 -> sort and add counts -> DFs
39 One?? more Rocchio observation: with the DFs in hand, Documents/labels -> split into document subsets -> Documents/labels 1, Documents/labels 2, Documents/labels 3 -> compute partial v(y)'s -> v-1, v-2, v-3 -> sort and add vectors -> v(y)'s
40 O(1) more Rocchio observation: Documents/labels -> split into document subsets -> Documents/labels 1, Documents/labels 2, Documents/labels 3 -> compute partial v(y)'s (each subset with its own copy of the DFs; coming up: how) -> v-1, v-2, v-3 -> sort and add vectors -> v(y)'s
We have shared access to the DFs, but only shared read access; we don't need to share write access. So we only need to copy the information across the different processes.
41 ABSTRACTIONS FOR STREAM AND SORT AND MAP-REDUCE
42 Abstractions On Top Of Map-Reduce
We've decomposed some algorithms into a map-reduce workflow (a series of map-reduce steps): naive Bayes training; naïve Bayes testing.
How else can we express these sorts of computations? Are there some common special cases of map-reduce steps we can parameterize and reuse?
43 Abstractions On Top Of Map-Reduce
Some obvious streaming processes: for each row in a table, transform it and output the result, or decide if you want to keep it with some boolean test and copy out only the ones that pass the test.
Example: stem words in a stream of word-count pairs: ( s,1) -> (,1)
Proposed syntax: f(row) -> row; table2 = MAP table1 TO λ row : f(row)
Example: apply stop words: (,1) -> (,1); ( the,1) -> deleted
Proposed syntax: f(row) -> {true,false}; table2 = FILTER table1 BY λ row : f(row)
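The two proposed operators can be sketched directly as Python functions over lists of rows; a toy illustration of the semantics (the real versions would be streaming map-reduce steps), with uppercase names matching the proposed syntax:

```python
def MAP(table, f):
    """table2 = MAP table1 TO λ row : f(row): transform every row."""
    return [f(row) for row in table]

def FILTER(table, f):
    """table2 = FILTER table1 BY λ row : f(row): keep rows passing the test."""
    return [row for row in table if f(row)]
```

For instance, a crude stemmer is a MAP and a stop-word list is a FILTER, exactly as in the slide's examples.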
44 Abstractions On Top Of Map-Reduce
A non-obvious? streaming process: for each row in a table, transform it to a list of items, then splice all the lists together to get the output table (flatten).
Proposed syntax: f(row) -> list of rows; table2 = FLATMAP table1 TO λ row : f(row)
Example: tokenizing a line: I found an -> [ i, found, an, ]; We love zymurgy -> [ we, love, zymurgy ]
...but the final table is one word per row: i / found / an / ... / we / love
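FLATMAP can be sketched the same way as MAP and FILTER above; again a toy list-based illustration of the semantics, not a real map-reduce implementation:

```python
def FLATMAP(table, f):
    """table2 = FLATMAP table1 TO λ row : f(row): f returns a list of
    rows; splice all the lists together into one output table."""
    return [out_row for row in table for out_row in f(row)]
```

Tokenization is the canonical use: one input line becomes many one-word rows.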
45 Abstractions On Top Of Map-Reduce: An example: the Naïve Bayes test program
46 NB Test Step Event counts How: Stream and sort: for each C[X=w^Y=y]=n print w C[Y=y]=n sort and build a list of values associated with each key w Like an inverted index X=w 1^Y=sports X=w 1^Y=worldNews X=.. X=w 2^Y= X= w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world News]=564 C[w^Y=sports]=21,C[w^Y=worldNe ws]=4464
47 NB Test Step
The general case: we're taking rows from a table in a particular format (event,count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value -> a grouping operation, a special case of a map-reduce.
Event counts: X=w 1^Y=sports; X=w 1^Y=worldNews; X=..; X=w 2^Y=; X=
Proposed syntax: f(row) -> field; GROUP table BY λ row : f(row). Could define f via: a function, a field of a defined record structure, ...
w Counts associated with W: C[w^Y=sports]=2; C[w^Y=sports]=1027,C[w^Y=worldNews]=564; C[w^Y=sports]=21,C[w^Y=worldNews]=4464
48 NB Test Step
The general case: we're taking rows from a table in a particular format (event,count), applying a function to get a new value (the word for the event), and grouping the rows of the table by this new value -> a grouping operation, a special case of a map-reduce.
Proposed syntax: f(row) -> field; GROUP table BY λ row : f(row). Could define f via: a function, a field of a defined record structure, ...
Aside: you guys know how to implement this, right? 1. Output pairs (f(row),row) with a map/streaming process. 2. Sort pairs by key, which is f(row). 3. Reduce and aggregate by appending together all the values associated with the same key.
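The three-step implementation in the aside (emit (f(row),row) pairs, sort by key, append per key) can be sketched in miniature; a toy in-memory version with my own naming, where sorted() plays the role of the sort phase:

```python
from itertools import groupby

def GROUP_BY(table, f):
    """GROUP table BY λ row : f(row), the map-reduce way:
    1) emit (f(row), row) pairs, 2) sort them by key,
    3) append together all rows sharing a key."""
    pairs = sorted((f(row), row) for row in table)
    return {key: [row for _, row in grp]
            for key, grp in groupby(pairs, key=lambda p: p[0])}
```

Applied to (event,count) rows keyed by the event's word, this produces exactly the inverted-index-like records of the NB test step.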
49 Abstractions On Top Of Map-Reduce: And another example from the Naïve Bayes test program
50 Request-and-answer Test data Record of all event counts for each word id 1 w 1,1 w 1,2 w 1,3. w 1,k1 id 2 w 2,1 w 2,2 w 2,3. id 3 w 3,1 w 3,2. id 4 w 4,1 w 4,2 id 5 w 5,1 w 5,2... w Counts associated with W C[w^Y=sports]=2 C[w^Y=sports]=1027,C[w^Y=world C[w^Y=sports]=21,C[w^Y=worldN Step 2: stream through and for each test case id i w i,1 w i,2 w i,3. w i,ki Classification logic request the event counters needed to classify id i from the event-count DB, then classify using the answers
51 Request-and-answer
Break down into stages:
- Generate the data being requested (indexed by key, here a word), e.g. with GROUP BY.
- Generate the requests as (key, requestor) pairs, e.g. with FLATMAP TO.
- Join these two tables by key. Join is defined as (1) a cross-product and (2) filtering out pairs with different values for the keys. This replaces the step of concatenating two different tables of key-value pairs and reducing them together.
- Postprocess the joined result.
52 w found Request ~ctr to id1 w Counters C[w^Y=sports]=2 ~ctr to id1 C[w^Y=sports]=1027,C[w^Y=worldNews ~ctr to id1 C[w^Y=sports]=21,C[w^Y=worldNews]= ~ctr to id2 w Counters Requests C[w^Y=sports]=2 ~ctr to id1 C[w^Y=sports]= ~ctr to id345 C[w^Y=sports]= ~ctr to id9854 C[w^Y=sports]= ~ctr to id345 C[w^Y=sports]= ~ctr to id34742 C[] ~ctr to id1 C[]
53 w Request: found -> id1, -> id1, -> id1, -> id2; w Counters: C[w^Y=sports]=2; C[w^Y=sports]=1027,C[w^Y=worldNews; C[w^Y=sports]=21,C[w^Y=worldNews]=
Examples: JOIN wordInDoc BY word, wordCounters BY word --- if word(row) is defined correctly; JOIN wordInDoc BY lambda (word,docid):word, wordCounters BY lambda (word,counters):word --- using python syntax for functions
Proposed syntax: JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)
Counters / Requests: C[w^Y=sports]=2 id1; C[w^Y=sports]= id345; C[w^Y=sports]= id9854; C[w^Y=sports]= id345; C[w^Y=sports]= id34742; C[] id1; C[]
54 Abstract Implementation: [TF]IDF (1/2)
data = pairs (docid,term) where term is a word that appears in the document with id docid, e.g. (d123,found), (d134,found), ...
operators: DISTINCT, MAP, JOIN, GROUP BY ... [RETAINING ...] REDUCING TO ... (a reduce step)
docFreq = DISTINCT data | GROUP BY λ(docid,term):term REDUCING TO count /* (term,df) */
docIds = MAP data BY λ(docid,term):docid | DISTINCT
numDocs = GROUP docIds BY λdocid:1 REDUCING TO count /* (1,numDocs) */ -- question: how many reducers should I use here?
dataPlusDF = JOIN data BY λ(docid,term):term, docFreq BY λ(term,df):term | MAP λ((docid,term),(term,df)):(docid,term,df) /* (docid,term,document-freq) */
unnormalizedDocVecs = JOIN dataPlusDF BY λrow:1, numDocs BY λrow:1 | MAP λ((docid,term,df),(dummy,numDocs)):(docid,term,log(numDocs/df)) /* (docid,term,weight-before-normalizing) : u */ -- question: how many reducers should I use here?
56 Abstract Implementation: TFIDF (2/2)
normalizers = GROUP unnormalizedDocVecs BY λ(docid,term,w):docid RETAINING λ(docid,term,w):w^2 REDUCING TO sum /* (docid,sum-of-square-weights) */
docVec = JOIN unnormalizedDocVecs BY λ(docid,term,w):docid, normalizers BY λ(docid,norm):docid | MAP λ((docid,term,w),(docid,norm)):(docid,term,w/sqrt(norm)) /* (docid,term,weight) */
e.g. unnormalized rows (d1234,found,1.542), (d1234,,13.23), ... grouped under key d1234, then normalized per docid.
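The whole [TF]IDF dataflow above can be sketched with plain dicts instead of GROUP/JOIN operators, to check the logic end to end. An illustrative sketch only (my own function name tfidf; it assumes each docid has at least one term that does not occur in every document, so the normalizer is non-zero):

```python
import math
from collections import defaultdict

def tfidf(data):
    """data: list of (docid, term) pairs, as in the slides."""
    distinct = set(data)
    df = defaultdict(int)                      # docFreq: (term, df)
    for docid, term in distinct:
        df[term] += 1
    num_docs = len({docid for docid, _ in distinct})
    unnorm = [(docid, term, math.log(num_docs / df[term]))
              for docid, term in data]         # weight before normalizing
    norms = defaultdict(float)                 # sum of squared weights per doc
    for docid, _, w in unnorm:
        norms[docid] += w * w
    return [(docid, term, w / math.sqrt(norms[docid]))
            for docid, term, w in unnorm]      # (docid, term, weight)
```

Each local variable corresponds to one named table in the dataflow (docFreq, numDocs, unnormalizedDocVecs, normalizers, docVec), which is the point of the abstraction: the program is the same, only the operators scale.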
57 Two ways to join: reduce-side join; map-side join
58 Two ways to join
Reduce-side join for A,B. A: term/df rows, e.g. ( , 7), (found, 2456). B: term/docid rows, e.g. ( , d15), (found, d7), (found, d23). Concat and sort so each term's A row precedes its B rows, then stream through and do the join: ( , A, 7), ( , B, d15), (found, A, 2456), (found, B, d7), (found, B, d23).
59 Two ways to join
Reduce-side join for A,B, continued. Concat and sort the tagged tuples: ( , A, 7), (found, A, 2456) from A and ( , B, d15), (found, B, d7), (found, B, d23) from B, then do the join by streaming.
Tricky bit: we need to sort by the first two values (word, then A before B), since we want the DF's to come first, but all tuples with the same key should go to the same worker.
60 Two ways to join
Same reduce-side join; the tricky bit is handled with a custom sort (secondary sort key): a Writable with your own Comparator, plus a custom Partitioner (specified for the job like the Mapper, Reducer, ...).
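The reduce-side join with its secondary sort key can be sketched in miniature; a toy single-process illustration (my own naming; in Hadoop the sort and the per-key routing are done by the framework via the custom Comparator and Partitioner):

```python
def reduce_side_join(rel_a, rel_b):
    """Tag tuples from A with 'A' and from B with 'B', sort by
    (key, tag) so each key's A tuple arrives first, then stream,
    holding the current A value and emitting one row per B tuple."""
    tagged = [(k, "A", v) for k, v in rel_a] + [(k, "B", v) for k, v in rel_b]
    tagged.sort()                      # secondary sort: 'A' < 'B'
    out, cur_key, cur_a = [], None, None
    for key, tag, val in tagged:
        if tag == "A":
            cur_key, cur_a = key, val  # remember the DF for this word
        elif key == cur_key:
            out.append((key, cur_a, val))
    return out
```

Note the streaming property: only one A value is held in memory, which is exactly why the "DF's come first" ordering matters.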
61 Two ways to join
Map-side join: write the smaller relation out to disk; send it to each Map worker (DistributedCache); when you initialize each Mapper, load in the small relation (configure() is called at initialization time); then map through the larger relation and do the join. Faster, but requires one relation to fit in memory.
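The map-side join is simpler still; a toy sketch with my own names (the dict build stands in for the DistributedCache + configure() step):

```python
def map_side_join(small_relation, big_relation_stream):
    """Load the small relation into an in-memory dict once, then
    stream the big relation through a map that looks each key up.
    No sort or reduce phase is needed."""
    lookup = dict(small_relation)      # done once, at Mapper init time
    return [(k, lookup[k], v)
            for k, v in big_relation_stream if k in lookup]
```

The trade-off stated on the slide is visible in the code: one pass and no sort, but the whole small relation lives in every worker's memory.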
More informationML from Large Datasets
10-605 ML from Large Datasets 1 Announcements HW1b is going out today You should now be on autolab have a an account on stoat a locally-administered Hadoop cluster shortly receive a coupon for Amazon Web
More informationIntroduction to MapReduce
732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 4 9/1/2011 Today Finish up spelling correction Realistic indexing Block merge Single-pass in memory Distributed indexing Next HW details 1 Query
More informationTypes, lists & functions
Week 2 Types, lists & functions Data types If you want to write a program that allows the user to input something, you can use the command input: name = input (" What is your name? ") print (" Hello "+
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 07) Week 4: Analyzing Text (/) January 6, 07 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationCC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018
CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018 Lecture 4: Apache Pig Aidan Hogan aidhog@gmail.com HADOOP: WRAPPING UP 0. Reading/Writing to HDFS Creates a file system for default configuration Check
More informationData Analytics Job Guarantee Program
Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction
More information1 Document Classification [60 points]
CIS519: Applied Machine Learning Spring 2018 Homework 4 Handed Out: April 3 rd, 2018 Due: April 14 th, 2018, 11:59 PM 1 Document Classification [60 points] In this problem, you will implement several text
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationGraphs (Part II) Shannon Quinn
Graphs (Part II) Shannon Quinn (with thanks to William Cohen and Aapo Kyrola of CMU, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University) Parallel Graph Computation Distributed computation
More informationMap- Reduce. Everything Data CompSci Spring 2014
Map- Reduce Everything Data CompSci 290.01 Spring 2014 2 Announcements (Thu. Feb 27) Homework #8 will be posted by noon tomorrow. Project deadlines: 2/25: Project team formation 3/4: Project Proposal is
More information2.3 Algorithms Using Map-Reduce
28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure
More informationAdministrative. Distributed indexing. Index Compression! What I did last summer lunch talks today. Master. Tasks
Administrative Index Compression! n Assignment 1? n Homework 2 out n What I did last summer lunch talks today David Kauchak cs458 Fall 2012 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt
More informationBinary Relations Part One
Binary Relations Part One Outline for Today Binary Relations Reasoning about connections between objects. Equivalence Relations Reasoning about clusters. A Fundamental Theorem How do we know we have the
More informationEECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling
EECS 395/495 Lecture 3 Scalable Indexing, Searching, and Crawling Doug Downey Based partially on slides by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze Announcements Project progress report
More informationClustering Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationCompile-Time Code Generation for Embedded Data-Intensive Query Languages
Compile-Time Code Generation for Embedded Data-Intensive Query Languages Leonidas Fegaras University of Texas at Arlington http://lambda.uta.edu/ Outline Emerging DISC (Data-Intensive Scalable Computing)
More informationProgramming and Data Structure
Programming and Data Structure Dr. P.P.Chakraborty Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture # 09 Problem Decomposition by Recursion - II We will
More informationCSCI6900 Assignment 1: Naïve Bayes on Hadoop
DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF GEORGIA CSCI6900 Assignment 1: Naïve Bayes on Hadoop DUE: Friday, January 29 by 11:59:59pm Out January 8, 2015 1 INTRODUCTION TO NAÏVE BAYES Much of machine
More informationCSCI544, Fall 2016: Assignment 1
CSCI544, Fall 2016: Assignment 1 Due Date: September 23 rd, 4pm. Introduction The goal of this assignment is to get some experience implementing the simple but effective machine learning technique, Naïve
More informationHigher level data processing in Apache Spark
Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable References Bigtable: A Distributed Storage System for Structured Data. Fay Chang et. al. OSDI
More informationCorpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing
Corpus methods in linguistics and NLP Lecture 7: Programming for large-scale data processing Richard Johansson December 1, 2015 today's lecture as you've seen, processing large corpora can take time! for
More informationECE5610/CSC6220 Introduction to Parallel and Distribution Computing. Lecture 6: MapReduce in Parallel Computing
ECE5610/CSC6220 Introduction to Parallel and Distribution Computing Lecture 6: MapReduce in Parallel Computing 1 MapReduce: Simplified Data Processing Motivation Large-Scale Data Processing on Large Clusters
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationChapter 6: Information Retrieval and Web Search. An introduction
Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods
More informationIntro to Programming. Unit 7. What is Programming? What is Programming? Intro to Programming
Intro to Programming Unit 7 Intro to Programming 1 What is Programming? 1. Programming Languages 2. Markup vs. Programming 1. Introduction 2. Print Statement 3. Strings 4. Types and Values 5. Math Externals
More informationMap-Reduce. Marco Mura 2010 March, 31th
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationParallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014
Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationCSC209. Software Tools and Systems Programming. https://mcs.utm.utoronto.ca/~209
CSC209 Software Tools and Systems Programming https://mcs.utm.utoronto.ca/~209 What is this Course About? Software Tools Using them Building them Systems Programming Quirks of C The file system System
More informationSummer 2017 Discussion 10: July 25, Introduction. 2 Primitives and Define
CS 6A Scheme Summer 207 Discussion 0: July 25, 207 Introduction In the next part of the course, we will be working with the Scheme programming language. In addition to learning how to write Scheme programs,
More informationPig A language for data processing in Hadoop
Pig A language for data processing in Hadoop Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Apache Pig: Introduction Tool for querying data on Hadoop
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationSCHEME 7. 1 Introduction. 2 Primitives COMPUTER SCIENCE 61A. October 29, 2015
SCHEME 7 COMPUTER SCIENCE 61A October 29, 2015 1 Introduction In the next part of the course, we will be working with the Scheme programming language. In addition to learning how to write Scheme programs,
More informationMapReduce Patterns. MCSN - N. Tonellotto - Distributed Enabling Platforms
MapReduce Patterns 1 Intermediate Data Written locally Transferred from mappers to reducers over network Issue - Performance bottleneck Solution - Use combiners - Use In-Mapper Combining 2 Original Word
More informationVECTOR SPACE CLASSIFICATION
VECTOR SPACE CLASSIFICATION Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. Chapter 14 Wei Wei wwei@idi.ntnu.no Lecture
More informationInformation Retrieval
Information Retrieval Suan Lee - Information Retrieval - 05 Index Compression 1 05 Index Compression - Information Retrieval - 05 Index Compression 2 Last lecture index construction Sort-based indexing
More informationCSE 344 FEBRUARY 14 TH INDEXING
CSE 344 FEBRUARY 14 TH INDEXING EXAM Grades posted to Canvas Exams handed back in section tomorrow Regrades: Friday office hours EXAM Overall, you did well Average: 79 Remember: lowest between midterm/final
More informationSpring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,
Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, 2013 1 Motivation for External Sort Often have a large (size greater than the available
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationWeb Information Retrieval Exercises Boolean query answering. Prof. Luca Becchetti
Web Information Retrieval Exercises Boolean query answering Prof. Luca Becchetti Material rif 3. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schueze, Introduction to Information Retrieval, Cambridge
More informationData Partitioning and MapReduce
Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,
More informationClustering Documents. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve
More informationSTIDistrict Query (Basic)
STIDistrict Query (Basic) Creating a Basic Query To create a basic query in the Query Builder, open the STIDistrict workstation and click on Utilities Query Builder. When the program opens, database objects
More informationLecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks
COMP 322: Fundamentals of Parallel Programming Lecture 30: Distributed Map-Reduce using Hadoop and Spark Frameworks Mack Joyner and Zoran Budimlić {mjoyner, zoran}@rice.edu http://comp322.rice.edu COMP
More informationRecitation 9. CS435: Introduction to Big Data. GTA: Bibek R. Shrestha March 23, 2018
Recitation 9 CS435: Introduction to Big Data GTA: Bibek R. Shrestha Email: cs435@cs.colostate.edu March 23, 2018 Today... Discussion on issues in Programming Assignment 2 2 How to program PA2: Part-A?
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More informationIntroduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data
Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction
More informationWeb Information Retrieval. Lecture 4 Dictionaries, Index Compression
Web Information Retrieval Lecture 4 Dictionaries, Index Compression Recap: lecture 2,3 Stemming, tokenization etc. Faster postings merges Phrase queries Index construction This lecture Dictionary data
More informationText Analytics. Index-Structures for Information Retrieval. Ulf Leser
Text Analytics Index-Structures for Information Retrieval Ulf Leser Content of this Lecture Inverted files Storage structures Phrase and proximity search Building and updating the index Using a RDBMS Ulf
More informationCSE 344 Final Examination
CSE 344 Final Examination March 15, 2016, 2:30pm - 4:20pm Name: Question Points Score 1 47 2 17 3 36 4 54 5 46 Total: 200 This exam is CLOSED book and CLOSED devices. You are allowed TWO letter-size pages
More informationMap Reduce. Yerevan.
Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate
More informationthis is so cumbersome!
Pig Arend Hintze this is so cumbersome! Instead of programming everything in java MapReduce or streaming: wouldn t it we wonderful to have a simpler interface? Problem: break down complex MapReduce tasks
More informationCounting Problems; and Recursion! CSCI 2824, Fall 2012!
Counting Problems; and Recursion! CSCI 2824, Fall 2012!!! Assignments To read this week: Sections 5.5-5.6 (Ensley/Crawley Problem Set 3 has been sent out today. Challenge problem today!!!!! So, to recap
More informationLecture 10 September 19, 2007
CS 6604: Data Mining Fall 2007 Lecture 10 September 19, 2007 Lecture: Naren Ramakrishnan Scribe: Seungwon Yang 1 Overview In the previous lecture we examined the decision tree classifier and choices for
More informationVendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo
Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These
More informationindex construct Overview Overview Recap How to construct index? Introduction Index construction Introduction to Recap
to to Information Retrieval Index Construct Ruixuan Li Huazhong University of Science and Technology http://idc.hust.edu.cn/~rxli/ October, 2012 1 2 How to construct index? Computerese term document docid
More informationStructured Streaming. Big Data Analysis with Scala and Spark Heather Miller
Structured Streaming Big Data Analysis with Scala and Spark Heather Miller Why Structured Streaming? DStreams were nice, but in the last session, aggregation operations like a simple word count quickly
More informationCME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh HW#3 - Due at the beginning of class May 18th.
CME 323: Distributed Algorithms and Optimization Instructor: Reza Zadeh (rezab@stanford.edu) HW#3 - Due at the beginning of class May 18th. 1. Download the following materials: Slides: http://stanford.edu/~rezab/dao/slides/itas_workshop.pdf
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections
More informationQuery Evaluation Strategies
Introduction to Search Engine Technology Term-at-a-Time and Document-at-a-Time Evaluation Ronny Lempel Yahoo! Labs (Many of the following slides are courtesy of Aya Soffer and David Carmel, IBM Haifa Research
More informationMapReduce Simplified Data Processing on Large Clusters
MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /
More information