"Big Data" Open Source Systems. CS347: Map-Reduce & Pig. Motivation for Map-Reduce. Building Text Index - Part II. Building Text Index - Part I

Size: px

Start display at page:

Download ""Big Data" Open Source Systems. CS347: Map-Reduce & Pig. Motivation for Map-Reduce. Building Text Index - Part II. Building Text Index - Part I"

Chad Jordan
6 years ago
Views:

1 "Big Data" Open Source Systems CS347: Map-Reduce & Pig Hector Garcia-Molina Stanford University Infrastructure for distributed data computations Map-Reduce, S4, Hyracks, Pregel [Storm, Mupet] Components MemCachedD, ZooKeeper, Kestrel Data services Pig, F1 Cassandra, H-Base, Big Table [Hive] 1 2 Motivation for Map-Reduce Another example: Asymmetric fragment + replicate join Recall one of our sort strategies: R1 ko R2 Local sort R 1 Local sort R 2 Result Ra Rb R1 R2 Local join S S Sa Sb process data & partition k1 R3 Local sort additional processing R 3 f partition process data & partition R3 Result additional processing S union CS 347 Notes 03 CS 347 Notes Building Text Index - Part I Building Text Index - Part II Page stream original Map-Reduce application... 2 rat 3 1 cat Loading rat Tokenizing Sorting FLUSHING Intermediate runs Disk 5 (ant, 5) (cat, 4) Merge Intermediate Runs (ant, 5) (cat, 4) (, 4) (, 5) (eel, 6) (, 4) (, 5) (eel, 6) (ant: 2) (cat: 2,4) (: 1,2,3,4,5) (eel: 6) (rat: 1, 3) Final index 6 1

2 Generalizing: Map-Reduce Generalizing: Map-Reduce Page stream 2 rat 3 1 cat Loading rat Map Tokenizing Sorting FLUSHING Intermediate runs Disk 7 Merge (ant, 5) (cat, 4) Intermediate Runs (ant, 5) (cat, 4) (, 4) (, 5) (, 4) (eel, 6) (, 5) (eel, 6) (ant: 2) (cat: 2,4) (: 1,2,3,4,5) (eel: 6) (rat: 1, 3) Reduce Final index 8 Map Reduce References Input: R={r 1, r 2,...r n }, functions M, R M(r i ) { [k 1, v 1 ], [k 2, v 2 ],.. } R(k i, valset) [k i, valset ] Let S={ [k, v] [k, v] M(r) for some r R } Let K = {k [k,v] S, for any v } Let G(k) = { v [k, v] S } G is bag Output = { [k, T] k K, T=R(k, G(k)) } S is bag MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, available at Pig Latin: A Not-So-Foreign Language for Data Processing, Christopher Olston, Benjamin Reedy, Utkarsh Srivastavava, Ravi Kumar, Andrew Tomkins, available at Example: Counting Word Occurrences Example: Counting Word Occurrences map(string doc, String value); // doc is document name Why does map have 2 parameters? // value is document content for each word w in value: EmitIntermediate(w, 1 ); Example: map(doc, cat cat bat ) emits [cat 1], [ 1], [cat 1], [bat 1], [ 1] reduce(string key, Iterator values); // key is a word // values is a list of counts int result = 0; for each v in values: result += ParseInt(v) Emit(AsString(result)); Example: reduce(, ) emits 4 11 should emit (, 4)?? 12 2

3 Google MR Overview Implementation Issues Combine function File system Partition of input, keys Failures Backup tasks Ordering of results Combine Function Distributed File System [cat 1], [cat 1], [cat 1]... [ 1], [ 1]... all data transfers are through distributed file system must be able to access any part of input file reduce must be able to access local disks on map s any must be able to write its part of answer; answer is left as distributed file Combine is like a local reduce applied before distribution: [cat 3]... [ 2] Partition of input, keys Failures How many s, partitions of input file? How many splits? How many s? Best to have many splits per : Improves load balance; if fails, easier to spread its tasks Should s be assigned to splits near them? Similar questions for reduce s Distributed implementation should produce same output as would have been produced by a nonfaulty sequential execution of the program. General strategy: Master detects failures, and has work re-done by another. split j ok? redo j master

4 Backup Tasks Straggler is a machine that takes unusually long (e.g., bad disk) to finish its work. A straggler can delay final completion. When task is close to finishing, master schedules backup executions for remaining tasks. Must be able to eliminate redundant results Ordering of Results Final result (at each node) is in key order also in key order: [k 1, v 1 ] [k 3, v 3 ] [k 1, T 1 ] [k 2, T 2 ] [k 3, T 3 ] [k 4, T 4 ] Example: Sorting Records Other Issues 5 e 3 c 6 f 2 b 1 a 6 f* W1 W2 3 c 5 e 6 f 1 a 2 b 6 f* W5 1 a 3 c 5 e 7 g 9 i one or two records for k=6? Skipping bad records Debugging 8 h 4 d 9 i 7 g W3 7 g 9 i 4 d 8 h W6 2 b 4 d 6 f 6 f* 8 h Map: CS347 extract k, output [k, record] Notes 09 Reduce: Do nothing! MR Claimed Advantages MR Possible Disadvantages Model easy to use, hides details of parallelization, fault recovery Many problems expressible in MR framework Scales to thousands of machines 1-input 2-stage data flow rigid, hard to adapt to other scenarios Custom code needs to be written even for the most common operations, e.g., projection and filtering Opaque nature of map, reduce functions impedes optimization

5 Questions Can MR be made more declarative? How can we perform joins? How can we perform approximate grouping? example: for all keys that are similar reduce all values for those keys Additional Topics Hadoop: open-source Map-Reduce system Pig: Yahoo system that builds on MR but is more declarative Pig & Pig Latin A layer on top of map-reduce (Hadoop) Pig is the system Pig Latin is the query language Pig Latin is a hybrid between: high-level declarative query language in the spirit of SQL low-level, procedural programming à la mapreduce. Example Table urls: (url, category, pagerank) Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category. In SQL: SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > Example in Pig Latin good_urls = FILTER urls BY pagerank > 0.2; SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10 6 In Pig Latin: good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls)>10 6 ; output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank); urls: url, category, pagerank z.cnn.com.com 0.9 y.yale.edu.edu 0.5 w.uc.edu.edu 0.1 x.nyt.com.com 0.8 y.ut.edu.edu 0.6 w.wh.gov.gov 0.7 good_urls: url, category, pagerank z.cnn.com.com 0.9 y.yale.edu.edu 0.5 x.nyt.com.com 0.8 y.ut.edu.edu 0.6 w.wh.gov.gov

6 groups = GROUP good_urls BY category; good_urls: url, category, pagerank z.cnn.com.com 0.9 y.yale.edu.edu 0.5 x.nyt.com.com 0.8 y.ut.edu.edu 0.6 w.wh.gov.gov 0.7 groups: category, good_urls.com {(z.cnn.com,.com, 0.9) (x.nyt.com,.com, 0.8)}.edu {(y.yale.edu,.edu, 0.5) (y.ut.edu,.edu, 0.6)}.gov {(w.wh.gov,.gov,.07)} big_groups = FILTER groups BY COUNT(good_urls)>1; groups: category, good_urls.com {(z.cnn.com,.com, 0.9) (x.nyt.com,.com, 0.8)}.edu {(y.yale.edu,.edu, 0.5) (y.ut.edu,.edu, 0.6)}.gov {(w.wh.gov,.gov,.07)} big_groups: category, good_urls.com {(z.cnn.com,.com, 0.9) (x.nyt.com,.com, 0.8)}.edu {(y.yale.edu,.edu, 0.5) (y.ut.edu,.edu, 0.6)} output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank); big_groups: category, good_urls.com {(z.cnn.com,.com, 0.9) (x.nyt.com,.com, 0.8)}.edu {(y.yale.edu,.edu, 0.5) (y.ut.edu,.edu. 0.6)} output: category, good_urls.com 0.85.edu Features Similar to specifying a query execution plan (i.e., a dataflow graph), thereby making it easier for programmers to understand and control how their data processing task is executed. Support for a flexible, fully nested data model Extensive support for user-defined functions Ability to operate over plain input files without any schema information. Novel debugging environment useful when dealing with enormous data sets. 34 Execution Control: Good or Bad? User Defined Functions Example: spam_urls = FILTER urls BY isspam(url); culprit_urls = FILTER spam_urls BY pagerank>0.8; Should system re-order filters? Example groups = GROUP urls BY category; output = FOREACH groups GENERATE category, top10(urls); should be groups.url?.gov {(x.fbi.gov,.gov, 0.7)...}.edu {(y.yale.edu,.edu, 0.5)...} UDF top10 can return scalar or set.com {(z.cnn.com,.com, 0.9)...} 35.gov {(fbi.gov) (cia.gov)...}.edu {(yale.edu)...}.com {(cnn.com) (ibm.com)...} 36 6

$So {1, 2, 3} is stored as {(1) (2) (3)} 37 See flatten examples ahead 38 Specifying Input Data For Each handle for future use queries = LOAD `query_log.$ txt' USING myload() AS (userid, querystring, timestamp); input file custom deserializer output schema expanded_queries = FOREACH queries GENERATE userid,

txt' USING myload() AS (userid, querystring, timestamp); input file custom deserializer output schema expanded_queries = FOREACH queries GENERATE userid,

queries GENERATE userid, FLATTEN(expandQuery(queryString)); 39 40 ForEach and Flattening lakers rumors is a single string value Flattening Example (Fill In) X A B C a1 {(b1,

7 Data Model Expressions in Pig Latin Atom, e.g., `alice' Tuple, e.g., (`alice', `lakers') Bag, e.g., { (`alice', `lakers') (`alice', (`ipod', `apple')} Map, e.g., [ `fan of' { (`lakers') (`ipod') } `age 20 ] Should be (1) + (2) Note: Bags can currently only hold tuples. So {1, 2, 3} is stored as {(1) (2) (3)} 37 See flatten examples ahead 38 Specifying Input Data For Each handle for future use queries = LOAD `query_log.txt' USING myload() AS (userid, querystring, timestamp); input file custom deserializer output schema expanded_queries = FOREACH queries GENERATE userid, expandquery(querystring); See example next slide Note each tuple is processed independently; good for parallelism To remove one level of nesting: expanded_queries = FOREACH queries GENERATE userid, FLATTEN(expandQuery(queryString)); ForEach and Flattening lakers rumors is a single string value Flattening Example (Fill In) X A B C a1 {(b1, b2) (b3, b4) (b5)} {(c1) (c2)} a2 {(b6, (b7,b8))} {(c3) (c4)} plus userid Y = FOREACH X GENERATE A, FLATTEN(B), C

8 Flattening Example (Fill In) Flattening Example a1 b1, b2 {(c1) (c2)} X A B C a1 b3, b4 {(c1) (c2)} a1 b5 {(c1) (c2)} a2 b6 {(c3) (c4)} a2 (b7, b8) {(c3) (c4)} Y = FOREACH X GENERATE A, FLATTEN(B), C Z = FOREACH Y GENERATE A, B, FLATTEN(C) a1 {(b1, b2) (b3, b4) (b5)} {(c1) (c2)} a2 {(b6, (b7,b8))} {(c3) (c4)} a1 b1 b2 {(c1) (c2)} a1 b3 b4 {(c1) (c2)} a1 b5 {(c1) (c2)} a2 b6 (b7, b8) {(c3) (c4)} Note first tuple is (a1, b1, b2, {(c1)(c2)}) Y = FOREACH X GENERATE A, FLATTEN(B), C Is Z=Z where Z = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C)? 43 Flatten is not recursive Note attribute naming gets complicated. For example, $2 for first tuple is b2; for third tuple it is {(c1)(c2)}. 44 Flattening Example Filter a1 b1, b2 {(c1) (c2)} a1 b3, b4 {(c1) (c2)} a1 b5 {(c1) (c2)} a2 b6 {(c3) (c4)} a2 (b7, b8) {(c3) (c4)} a1 b1 b2 c1 a1 b1 b2 c2 a1 b3 b4 c1 a1 b3 b4 c2 a1 b5 c1 a1 b5 c2 a2 b6 (b7, b8) c3 a2 b6 (b7, b8) c4 Y = FOREACH X GENERATE A, FLATTEN(B), C Z = FOREACH Y GENERATE A, B, FLATTEN(C) Note that Z=Z where Z = FOREACH X GENERATE A, FLATTEN(B), FLATTEN(C) 45 real_queries = FILTER queries BY userid neq `bot'; real_queries = FILTER queries BY NOT isbot(userid); UDF function 46 Co-Group Two data sets for example: results: (querystring, url, position) revenue: (querystring, adslot, amount) grouped_data = COGROUP results BY querystring, revenue BY querystring; url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue)); Co-Group more flexible than SQL JOIN CoGroup vs Join

9 Group (Simple CoGroup) grouped_revenue = GROUP revenue BY querystring; query_revenues = FOREACH grouped_revenue GENERATE querystring, SUM(revenue.amount) AS totalrevenue; CoGroup Example 1 Z1 = GROUP X BY A Z1 A X CoGroup Example 1 CoGroup Example 2 Z1 = GROUP X BY A Z1 A X 1 {(1, 1, c1) (1, 1, c2)} 2 {(2, 2, c3) (2, 2, c4)} Z2 = GROUP X BY (A, B) Z1? X Syntax not in paper but being added CoGroup Example 2 CoGroup Example 3 Z2 = GROUP X BY (A, B) Syntax not in paper but being added Z3 = COGROUP X BY A, Y BY A Z1 A/B? X Z1 A X Y (1, 1) {(1, 1, c1) (1, 1, c2)} (2, 2) {(2, 2, c3) (2, 2, c4)}

10 CoGroup Example 3 CoGroup Example 4 Z3 = COGROUP X BY A, Y BY A Z4 = COGROUP X BY A, Y BY B Z1 A X Y Z1 A X Y 1 {(1, 1, c1) (1, 1, c2)} {(1, 1, d1) (1, 2, d2)} 2 {(2, 2, c3) (2, 2, c4)} {(2, 1, d3) (2, 2, d4)} CoGroup Example 4 CoGroup With Function Call? X A B (2,2) a (1,3) b (2,2) c Y = GROUP X BY A Z = GROUP X BY SUM(A) Adds integers in tuple Z4 = COGROUP X BY A, Y BY B Y A X Z? X Z1 A X Y 1 {(1, 1, c1) (1, 1, c2)} {(1, 1, d1) (2, 1, d3)} 2 {(2, 2, c3) (2, 2, c4)} {(1, 2, d2) (2, 2, d4)} CoGroup With Function Call? Pig Latin Join X A B (2,2) a (1,3) b (2,2) c Y A X (2,2) {((2,2), a) ((2,2), c)} (1,3) {((1,3), b)} Y = GROUP X BY A Z = GROUP X BY SUM(A) Adds integers in tuple Z SUM(A)/A? X 4 {((2,2), a) ((1,3), b) ((2,2), c)} join_result = JOIN results BY querystring, revenue BY querystring; Shorthand for: temp_var = COGROUP results BY querystring, revenue BY querystring; join_result = FOREACH temp_var GENERATE FLATTEN(results), FLATTEN(revenue);

11 MapReduce in Pig Latin map_result = FOREACH input GENERATE FLATTEN(map(*)); key_groups = GROUP map_result BY $0; output = FOREACH key_groups GENERATE reduce(*); key is first attribute Store To materialize result in a file: STORE query_revenues INTO `myoutput' USING mystore(); output file custom serializer all attributes Hadoop HDFS: Hadoop file system How to use Hadoop, examples 63 11

Lecture 23: Supplementary slides for Pig Latin. Friday, May 28, 2010

Lecture 23: Supplementary slides for Pig Latin Friday, May 28, 2010 1 Outline Based entirely on Pig Latin: A not-so-foreign language for data processing, by Olston, Reed, Srivastava, Kumar, and Tomkins,