Introduction to Data Management CSE 344



Lecture 27: MapReduce and Pig Latin
CSE 344, Fall 2014

Announcements
HW8 is out now, due the last Thursday of the quarter.
You should have received an AWS credit code; send mail if there are problems.
Start setting up your account now(!). It takes time. And follow the instructions!! That is usually the biggest problem.
Final exam: Mon. 12/8, 2:30-4:20, this room.
Review: Sun. 12/7, 2 pm, EE 037.
Comprehensive (but no relational calculus or datalog).
Closed book; reference notes included in the test.
HW grading: yes, we're behind. You should have HW4 and HW5 back by Friday night. HW5 and HW6 solutions are posted now.

Outline
Example of a large MapReduce job
Whirlwind tour of Pig Latin for HW8
You'll need to learn from the slides, the starter code, and Hadoop and related web pages; we will not go through the details in class like we did for SQL.

Executing a Large MapReduce Job

Anatomy of a Query Execution
Running problem #4
20 nodes = 1 master + 19 workers
Using PARALLEL 50

March 2013

Some other time (March 2012)
Let's see what happened

1h 16min
[JobTracker status page: user hadoop, job name PigLatin:DefaultJobName, status Running, started Sun Mar 4 19:08:29 UTC 2012, running for 1hrs, 16mins, 33sec; copy / sort / reduce completion graphs]
map 33.17% complete (5229 of 15816 map tasks), reduce 4.17% complete
Only 19 reducers active, out of 50. Why?
Copying by 19 reducers in parallel with mappers.
When will the other 31 reducers be scheduled?

3h 50min
map 100% complete, reduce 32.42% complete
Mappers completed. Sorting, and the rest of Reduce, may proceed now.
Speculative execution
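One plausible reading of "only 19 reducers active, out of 50" is that each of the 19 workers offers a single reduce slot, so reducers are scheduled in waves. The slot count per worker is an assumption, not something the slides state; the sketch below only shows the resulting wave sizes under that assumption.

```python
def reduce_waves(num_reducers, num_workers, reduce_slots_per_worker=1):
    """Split reducer ids into scheduling waves.

    Assumption (not from the slides): every worker offers a fixed number
    of reduce slots, and a reducer keeps its slot until it finishes.
    """
    slots = num_workers * reduce_slots_per_worker
    ids = list(range(num_reducers))
    return [ids[i:i + slots] for i in range(0, len(ids), slots)]


# PARALLEL 50 on 19 workers, one reduce slot each (hypothetical setting).
waves = reduce_waves(num_reducers=50, num_workers=19)
wave_sizes = [len(w) for w in waves]
print(wave_sizes)
```

Under this assumption the first wave can copy map output while the mappers are still running; the remaining 31 reducers only start once slots free up.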

3h 51min
map 100% complete (15816 of 15816 map tasks), reduce 37.72% complete
Some of the 19 reducers have finished.
Next batch of reducers started.

3h 52min
map 100% complete, reduce 42.35% complete
Launched reduce tasks: 31 at 3h 51min, 39 at 3h 52min
Next batch of 19 reducers

4h 18min
map 99.88% complete, reduce 48.42% complete
Several servers failed: fetch error. Their map tasks need to be rerun.
All reducers are waiting.
Why did we lose some reducers?

7h 10min
map 100% complete, reduce 94.15% complete
Mappers finished, reducers resumed.

7h 20min
Success! 7hrs, 20mins.
Status: Succeeded. Started at Sun Mar 4 19:08:29 UTC 2012; finished at Mon Mar 5 02:28:39 UTC 2012; finished in 7hrs, 20mins, 10sec.
Black-listed TaskTrackers: 3
map 100% complete, reduce 100% complete (50 of 50 reduce tasks)

Pig Latin Mini-Tutorial
(will not discuss in detail in class; please read in order to do homework 8)

Pig Latin Overview
Data model = loosely typed nested relations
Query model = a SQL-like, dataflow language
Execution model:
  Option 1: run locally on your machine, e.g. to debug
  Option 2: compile into a graph of MapReduce jobs, run on a cluster supporting Hadoop

Example
Input: a table of urls: (url, category, pagerank)
Compute the average pagerank of all sufficiently high pageranks, for each category
Return the answers only for categories with sufficiently many such pages

First in SQL
Page(url, category, pagerank)
  SELECT category, AVG(pagerank)
  FROM Page
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 1000000

... then in Pig Latin
Page(url, category, pagerank)
  good_urls = FILTER urls BY pagerank > 0.2
  groups = GROUP good_urls BY category
  big_groups = FILTER groups BY COUNT(good_urls) > 1000000
  output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
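To make the dataflow concrete, here is a minimal Python sketch that mirrors the four Pig Latin steps on a toy table. The function name, the sample rows, and the lowered count threshold are all made up for illustration; this models the semantics only, not how Pig executes the query.

```python
from collections import defaultdict


def avg_high_pagerank(urls, min_rank=0.2, min_count=1_000_000):
    """Toy model of the query: rows are (url, category, pagerank)."""
    # good_urls = FILTER urls BY pagerank > 0.2
    good_urls = [(u, c, r) for (u, c, r) in urls if r > min_rank]
    # groups = GROUP good_urls BY category
    groups = defaultdict(list)
    for _url, category, rank in good_urls:
        groups[category].append(rank)
    # big_groups = FILTER groups BY COUNT(good_urls) > min_count
    # output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank)
    return {c: sum(rs) / len(rs) for c, rs in groups.items() if len(rs) > min_count}


# Hypothetical sample data; the threshold is lowered so the toy input qualifies.
urls = [
    ("a.example", "news", 0.75),
    ("b.example", "news", 0.5),
    ("c.example", "news", 0.125),
    ("d.example", "sports", 0.75),
]
result = avg_high_pagerank(urls, min_count=1)
print(result)
```

Note that the count is taken after the pagerank filter, so "sports" (one qualifying page) is dropped even though it has a high-ranked url.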

Types in Pig Latin
Atomic: string or number, e.g. 'Alice' or 55
Tuple: ('Alice', 55, 'salesperson')
Bag: {('Alice', 55, 'salesperson'), ('Betty', 44, 'manager'), ...}
Maps: we will try not to use these

Types in Pig Latin
Tuple components can be referenced by number: $0, $1, $2, ...
Bags can be nested! Non-1st Normal Form
{('a', {1,4,3}), ('c', {}), ('d', {2,2,5,3,2})}


Loading data [Olston 2008]
Input data = FILES! Heard that before?
The LOAD command parses an input file into a bag of records
Both the parser (= "deserializer") and the output type are provided by the user
For HW8: simply use the code provided


Loading data [Olston 2008]
  queries = LOAD 'query_log.txt' USING userloadfcn() AS (userid, querystring, timestamp)
Pig provides a set of built-in load/store functions:
  A = LOAD 'student' USING PigStorage('\t') AS (name: chararray, age: int, gpa: float);
is the same as
  A = LOAD 'student' AS (name: chararray, age: int, gpa: float);

Loading data [Olston 2008]
USING userfunction() is optional
  Default deserializer expects a tab-delimited file
AS type is optional
  Default is a record with unnamed fields; refer to them as $0, $1, ...
The return value of LOAD is just a handle to a bag
  The actual reading is done in pull mode, or parallelized
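A rough Python analogue of these defaults, as a sketch only: a generator stands in for the "handle to a bag" (nothing is read until records are pulled), fields are split on tabs, and without a schema the fields are only addressable by position. The `load` function and its `schema` parameter are hypothetical names, not Pig's API.

```python
import io


def load(lines, delimiter="\t", schema=None):
    """Lazily parse delimited text into records, one per input line.

    Without a schema, each record is a tuple of strings (fields $0, $1, ...).
    With a schema of (name, cast) pairs, each record is a dict of typed fields.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if schema is None:
            yield tuple(fields)
        else:
            yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


# Made-up sample file contents.
data = "alice\t20\t3.5\nbob\t22\t3.9\n"

untyped = list(load(io.StringIO(data)))
typed = list(load(io.StringIO(data),
                  schema=[("name", str), ("age", int), ("gpa", float)]))
print(untyped[0][0], typed[1]["age"])
```

Calling `load(...)` by itself does no work; parsing only happens when the caller iterates, which is the pull-mode behavior described above.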

FOREACH [Olston 2008]
  expanded_queries = FOREACH queries GENERATE userid, expandquery(querystring)
expandquery() is a UDF that produces likely expansions
Note: it returns a bag, hence expanded_queries is a nested bag

FOREACH [Olston 2008]
  expanded_queries = FOREACH queries GENERATE userid, flatten(expandquery(querystring))
Now we get a flat collection


FLATTEN [Olston 2008]
Note that it is NOT a normal function! (That's one thing I don't like about Pig Latin.)
A normal FLATTEN would do this:
  FLATTEN({{2,3},{5},{},{4,5,6}}) = {2,3,5,4,5,6}
  Its type is: {{T}} → {T}
The Pig Latin FLATTEN does this:
  FLATTEN({4,5,6}) = 4, 5, 6
  What is its type? {T} → T, T, T, ..., T ?????
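The difference between the two behaviors can be sketched in Python, with lists standing in for bags. Both function names are made up; the second models what FLATTEN does inside FOREACH ... GENERATE, where each input record is expanded into one output record per element of its nested bag.

```python
def flatten_bag_of_bags(bags):
    """The 'normal' flatten: {{T}} -> {T}."""
    return [x for bag in bags for x in bag]


def foreach_generate_flatten(records):
    """Model of FOREACH r GENERATE key, FLATTEN(bag).

    Each record is (key, bag). The output has one flat (key, element)
    tuple per bag element, so a record with an empty bag contributes nothing.
    """
    return [(key, x) for key, bag in records for x in bag]


normal = flatten_bag_of_bags([[2, 3], [5], [], [4, 5, 6]])
# Hypothetical output of a query-expansion UDF for two users.
expanded = foreach_generate_flatten([("u1", ["lakers", "lakers score"]), ("u2", [])])
print(normal, expanded)
```

Seen this way, Pig's FLATTEN is less a function on a value than an instruction to the enclosing FOREACH to emit several records.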

FILTER [Olston 2008]
Remove all queries from Web bots:
  real_queries = FILTER queries BY userid neq 'bot'
Better: use a complex UDF to detect Web bots:
  real_queries = FILTER queries BY NOT isbot(userid)

JOIN [Olston 2008]
results: {(querystring, url, position)}
revenue: {(querystring, adslot, amount)}
  join_result = JOIN results BY querystring, revenue BY querystring
join_result: {(querystring, url, position, adslot, amount)}


GROUP BY [Olston 2008]
revenue: {(querystring, adslot, amount)}
  grouped_revenue = GROUP revenue BY querystring
  query_revenues = FOREACH grouped_revenue GENERATE querystring, SUM(revenue.amount) AS totalrevenue
grouped_revenue: {(querystring, {(adslot, amount)})}
query_revenues: {(querystring, totalrevenue)}
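A small Python model of these two statements, following the schemas as written on the slide (the nested bag keeps only adslot and amount). The helper name and the sample revenue rows are invented for illustration.

```python
from collections import defaultdict


def group_revenue(revenue):
    """GROUP revenue BY querystring -> {querystring: [(adslot, amount), ...]}."""
    groups = defaultdict(list)
    for querystring, adslot, amount in revenue:
        groups[querystring].append((adslot, amount))
    return dict(groups)


# Made-up sample rows: (querystring, adslot, amount).
revenue = [("lakers", "top", 50), ("lakers", "side", 20), ("kings", "top", 30)]

grouped_revenue = group_revenue(revenue)
# FOREACH grouped_revenue GENERATE querystring, SUM(revenue.amount) AS totalrevenue
query_revenues = {q: sum(amount for _slot, amount in bag)
                  for q, bag in grouped_revenue.items()}
print(query_revenues)
```

Unlike SQL's GROUP BY, the grouping step alone already produces a usable (nested) relation; aggregation is a separate FOREACH.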

Simple MapReduce [Olston 2008]
input: {(field1, field2, field3, ...)}
  map_result = FOREACH input GENERATE FLATTEN(map(*))
  key_groups = GROUP map_result BY $0
  output = FOREACH key_groups GENERATE $0, reduce($1)
map_result: {(a1, a2, a3, ...)}
key_groups: {(a1, {(a2, a3, ...)})}
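The same three-step pattern can be written as a tiny in-memory Python function. This is a sketch of the dataflow only (no parallelism, no disk); `map_reduce`, `word_map`, and the word-count input are illustrative choices, not part of the slides.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """In-memory model of the three Pig Latin statements above."""
    # map_result = FOREACH input GENERATE FLATTEN(map(*))
    map_result = [pair for record in records for pair in map_fn(record)]
    # key_groups = GROUP map_result BY $0
    key_groups = defaultdict(list)
    for key, value in map_result:
        key_groups[key].append(value)
    # output = FOREACH key_groups GENERATE $0, reduce($1)
    return {key: reduce_fn(values) for key, values in key_groups.items()}


def word_map(line):
    """Map function for word count: emit (word, 1) for every word."""
    return [(word, 1) for word in line.split()]


counts = map_reduce(["a b a", "b c"], word_map, sum)
print(counts)
```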


Co-Group [Olston 2008]
results: {(querystring, url, position)}
revenue: {(querystring, adslot, amount)}
  grouped_data = COGROUP results BY querystring, revenue BY querystring;
grouped_data: {(querystring, results:{(url, position)}, revenue:{(adslot, amount)})}
What is the output type in general?

Co-Group [Olston 2008]
Is this an inner join, or an outer join?

Co-Group [Olston 2008]
grouped_data: {(querystring, results:{(url, position)}, revenue:{(adslot, amount)})}
  url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRevenue(results, revenue));
distributeRevenue is a UDF that accepts the search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them.


Co-Group vs. Join [Olston 2008]
grouped_data: {(querystring, results:{(url, position)}, revenue:{(adslot, amount)})}
  grouped_data = COGROUP results BY querystring, revenue BY querystring;
  join_result = FOREACH grouped_data GENERATE FLATTEN(results), FLATTEN(revenue);
Result is the same as JOIN
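A Python sketch may help with the inner-versus-outer question. In this model, COGROUP keeps every key that occurs in either input and pairs it with two bags, one of which may be empty; flattening both bags then yields a per-key cross product, so keys with an empty bag disappear. Function names and sample rows are made up, and lists again stand in for bags.

```python
def cogroup(left, right):
    """Model of COGROUP left BY $0, right BY $0.

    Returns {key: (left_bag, right_bag)}; a key missing from one input
    gets an empty bag on that side.
    """
    keys = {t[0] for t in left} | {t[0] for t in right}
    return {
        k: ([t[1:] for t in left if t[0] == k],
            [t[1:] for t in right if t[0] == k])
        for k in keys
    }


def join_via_cogroup(left, right):
    """FLATTEN both bags: cross product per key; an empty bag gives no rows."""
    out = []
    for k, (left_bag, right_bag) in cogroup(left, right).items():
        for l in left_bag:
            for r in right_bag:
                out.append((k, *l, *r))
    return out


# Made-up sample data.
results = [("lakers", "nba.example", 1), ("lakers", "espn.example", 2),
           ("kings", "nhl.example", 1)]
revenue = [("lakers", "top", 50), ("iphone", "side", 20)]

grouped_data = cogroup(results, revenue)
join_result = join_via_cogroup(results, revenue)
print(grouped_data["kings"], join_result)
```

So the grouping itself behaves like a full outer operation, while COGROUP followed by two FLATTENs gives the inner join.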

Asking for Output: STORE [Olston 2008]
  STORE query_revenues INTO 'theoutput' USING userstorefcn();
Meaning: write query_revenues to the file 'theoutput'

Implementation [Olston 2008]
Over Hadoop!
Parse query:
  Everything between LOAD and STORE → one logical plan
Logical plan → graph of MapReduce ops
All statements between two (CO)GROUPs → one MapReduce job

Implementation [Olston 2008]

Review: MapReduce
Data is typically a file in the Google File System (HDFS for Hadoop)
  File system partitions the file into chunks
  Each chunk is replicated on k (typically 3) machines
Each machine can run a few map and reduce tasks simultaneously
Each map task consumes one chunk
  Can adjust how much data goes into each map task using "splits"
  Scheduler tries to schedule each map task where its input data is located
Map output is partitioned across reducers
Map output is also written locally to disk
Number of reduce tasks is configurable
System shuffles data between map and reduce tasks
Reducers sort-merge data before consuming it
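The partition and sort-merge steps in this review can be sketched in a few lines of Python. This is a toy, single-process model: lists stand in for the local files written by map tasks, `zlib.crc32` is just a stand-in for whatever hash the real partitioner uses, and all function names are hypothetical.

```python
import heapq
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key (crc32 is an arbitrary stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(pairs, num_reducers):
    """Partition one map task's output and sort each partition,
    modeling the sorted runs a map task leaves on local disk."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        parts[partition(key, num_reducers)].append((key, value))
    return [sorted(p) for p in parts]


def run_reduce_task(reducer_id, map_outputs):
    """Merge the sorted runs fetched from every map task, then sum per key."""
    merged = heapq.merge(*(out[reducer_id] for out in map_outputs))
    totals = {}
    for key, value in merged:
        totals[key] = totals.get(key, 0) + value
    return totals


# Two map tasks, two reducers, made-up (key, value) pairs.
map_outputs = [
    run_map_task([("a", 1), ("b", 1), ("a", 1)], num_reducers=2),
    run_map_task([("b", 1), ("c", 1)], num_reducers=2),
]
result = {}
for reducer_id in range(2):
    result.update(run_reduce_task(reducer_id, map_outputs))
print(result)
```

Because every occurrence of a key hashes to the same reducer, each reducer sees all values for its keys, and the merge delivers them in key order.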

MapReduce Phases
[Figure: MapReduce phases; label: Local storage]
