Programming and Debugging Large-Scale Data Processing Workflows

1 Programming and Debugging Large-Scale Data Processing Workflows
Christopher Olston, Google Research (work done at Yahoo! Research, with many colleagues)

2 Big-Data Yahoo: Use Cases
web search pre-processing; cross-dataset linkage; web information extraction
ingestion → storage & processing → serving

3 Storage/Processing Architecture
storage & processing:
    workflow manager, e.g. Oozie, Nova
    dataflow programming framework, e.g. Pig
    distributed sorting & hashing, e.g. Map-Reduce
    scalable file system, e.g. GFS
THIS TALK
Debugging aids:
    Before: example data generator
    During: instrumentation framework
    After: provenance metadata manager

4 Pig: A High-Level Dataflow Language & Runtime for Hadoop
Web browsing sessions with happy endings.

    Visits = load '/data/visits' as (user, url, time);
    Visits = foreach Visits generate user, Canonicalize(url), time;

    Pages = load '/data/pages' as (url, pagerank);

    VP = join Visits by url, Pages by url;
    UserVisits = group VP by user;
    Sessions = foreach UserVisits generate flatten(findsessions(*));
    HappyEndings = filter Sessions by BestIsLast(*);

    store HappyEndings into '/data/happy_endings';

5 vs. map-reduce: less code!
"The [Hofmann PLSA E/M] algorithm was implemented in pig in lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig." -- Prasenjit Mukherjee, Mahout project
1/20 the lines of code; 1/16 the development time (bar charts comparing Hadoop and Pig; development time in minutes)
Performs on par with raw Hadoop

6 vs. SQL: step-by-step style; lower-level control
"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data." -- Jasmine Novak, Engineer, Yahoo!
"PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP.. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on Map-Reduce] doesn't)." -- Ricky Ho, Adobe Software

7 Conceptually: A Graph of Data Transformations
Find users who tend to visit good pages.
    Load Visits(user, url, time)
    Transform to (user, Canonicalize(url), time)
    Load Pages(url, pagerank)
    Join url = url
    Group by user
    Transform to (user, Average(pagerank) as avgpr)
    Filter avgpr > 0.5
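The transformation graph on this slide can be mimicked on in-memory tuples. The following is a minimal Python sketch, not Pig itself: the `canonicalize` rule, the Pages rows, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical stand-in for the Canonicalize UDF: drop scheme and path, ensure 'www.'."""
    url = url.split("://", 1)[-1]
    host = url.split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


def good_page_visitors(visits, pages, threshold=0.5):
    # Transform to (user, Canonicalize(url), time)
    visits = [(user, canonicalize(url), time) for user, url, time in visits]
    # Join url = url
    pagerank = dict(pages)
    joined = [(user, pagerank[url]) for user, url, _ in visits if url in pagerank]
    # Group by user
    by_user = defaultdict(list)
    for user, pr in joined:
        by_user[user].append(pr)
    # Transform to (user, Average(pagerank) as avgpr), then Filter avgpr > threshold
    averages = {user: sum(prs) / len(prs) for user, prs in by_user.items()}
    return {user: avg for user, avg in averages.items() if avg > threshold}


visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "http://www.snails.com", "9am"),
    ("Fred", "www.snails.com/index.html", "11am"),
]
pages = [("www.cnn.com", 0.9), ("www.snails.com", 0.4)]  # assumed sample rows
result = good_page_visitors(visits, pages)
```

With these rows only Amy survives the final filter, since her average pagerank is about 0.65 while Fred's is 0.4.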

8 Illustrated!
Load Visits(user, url, time): (Amy, cnn.com, 8am), (Amy, http://www.snails.com, 9am), (Fred, www.snails.com/index.html, 11am)
Transform to (user, Canonicalize(url), time): (Amy, www.cnn.com, 8am), (Amy, www.…, 9am), (Fred, …, 11am)
Load Pages(url, pagerank): (…, 0.9), (…, 0.4)
Join url = url: (Amy, …, 8am, 0.9), (Amy, …, 9am, 0.4), (Fred, …, 11am, 0.4)
Group by user: (Amy, { (Amy, …, 8am, 0.9), (Amy, …, 9am, 0.4) }), (Fred, { (Fred, …, 11am, 0.4) })
Transform to (user, Average(pagerank) as avgpr): (Amy, 0.65), (Fred, 0.4)
Filter avgpr > 0.5: (Amy, 0.65)
"ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive." -- Russell Jurney, LinkedIn

9 (Naïve Algorithm)
Load Visits(user, url, time): (Amy, cnn.com, 8am), (Amy, http://…, 9am), (Fred, …, 11am)
Transform to (user, Canonicalize(url), time): (Amy, …, 8am), (Amy, …, 9am), (Fred, …, 11am)
Load Pages(url, pagerank): (…, 0.9), (…, 0.4)
Join url = url
Group by user
Transform to (user, Average(pagerank) as avgpr)
Filter avgpr > 0.5

10 Pig Today
Open-source (Apache)
Dev./support/training by Cloudera, Hortonworks
Offered on Amazon Elastic Map-Reduce
Used by LinkedIn, Netflix, Salesforce, Twitter, Yahoo...
At Yahoo, as of early 2011: 1000s of jobs/day; 75%+ of Hadoop jobs
Mortar: start-up building GUI around Illustrate

11 Next: INSPECTOR GADGET
storage & processing:
    workflow manager, e.g. Nova
    dataflow programming framework, e.g. Pig
    distributed sorting & hashing, e.g. Map-Reduce
    scalable file system, e.g. GFS
Debugging aids:
    Before: example data generator
    During: instrumentation framework
    After: provenance metadata manager

12 Motivated by User Interviews
Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments)
Asked them how they (wish they could) debug

13 Summary of User Interviews
# of requests   feature
7               crash culprit determination
5               row-level integrity alerts
4               table-level integrity alerts
4               data samples
3               data summaries
3               memory use monitoring
3               backward tracing (provenance)
2               forward tracing
2               golden data/logic testing
2               step-through debugging
2               latency alerts
1               latency profiling
1               overhead profiling
1               trial runs

14 Our Approach
Goal: a programming framework for adding these behaviors, and others, to Pig
Precept: avoid modifying Pig or tampering with data flowing through Pig
Approach: perform Pig script rewriting; insert special UDFs that look like no-ops to Pig
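The rewriting idea on this slide can be illustrated outside Pig. The sketch below is a hedged Python analogy, not the Inspector Gadget implementation: a pipeline is modeled as a list of named functions, and `rewrite`, `make_agent`, and `run` are hypothetical helpers. Each inserted agent observes its input and returns it untouched, so the instrumented pipeline yields the same result as the plain one.

```python
def make_agent(name, log):
    """Return a pass-through step: it records what it sees and returns its input untouched."""
    def agent(records):
        log.append((name, len(records)))
        return records
    return agent


def rewrite(steps, log):
    """Insert a pass-through agent after every (name, function) step of the pipeline."""
    rewritten = []
    for name, fn in steps:
        rewritten.append((name, fn))
        rewritten.append(("agent:" + name, make_agent(name, log)))
    return rewritten


def run(steps, records):
    """Apply each step in order to the whole record list."""
    for _, fn in steps:
        records = fn(records)
    return records


steps = [
    ("filter", lambda rs: [r for r in rs if r[1] > 0.5]),
    ("project", lambda rs: [r[0] for r in rs]),
]
data = [("Amy", 0.65), ("Fred", 0.4)]
log = []
plain = run(steps, data)
instrumented = run(rewrite(steps, log), data)
```

Here `plain` and `instrumented` are equal, while `log` holds one observation per original step.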

15 Pig w/ Inspector Gadget
[Diagram: dataflow of load, filter, load, join, group, count, store, connected to an IG coordinator]

16 Example: Integrity Alerts
[Diagram: dataflow of load, filter, load, join, group, count, store, connected to an IG coordinator]
alert! → IG coordinator → propagate alert to user
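A row-level integrity alert along these lines might look as follows. This is a minimal Python sketch under stated assumptions: the class names, the `observe` method, the agent id, and the pagerank range check are all hypothetical, and messages are delivered by a direct in-process call rather than over a network.

```python
class AlertCoordinator:
    """Collects alerts from agents so they can be propagated to the user."""

    def __init__(self):
        self.alerts = []

    def receive_message(self, source, message):
        self.alerts.append((source, message))


class RowIntegrityAgent:
    """Pass-through agent that checks each record against a row-level predicate."""

    def __init__(self, agent_id, is_valid, coordinator):
        self.agent_id = agent_id
        self.is_valid = is_valid
        self.coordinator = coordinator

    def observe(self, records):
        for record in records:
            if not self.is_valid(record):
                self.coordinator.receive_message(self.agent_id, record)
            yield record  # the record itself is never altered or dropped


coordinator = AlertCoordinator()
agent = RowIntegrityAgent(
    "after-load-pages",                 # hypothetical agent id
    lambda r: 0.0 <= r[1] <= 1.0,       # assumed rule: pagerank must lie in [0, 1]
    coordinator,
)
pages = [("www.cnn.com", 0.9), ("www.snails.com", 1.7)]
passed_through = list(agent.observe(pages))
```

The bad row still flows downstream unchanged; only the coordinator learns about it.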

17 Example: Crash Culprit Determination
[Diagram: dataflow of load, load, filter, join, group, count, store, connected to an IG coordinator]
Sent to the IG coordinator: Phases 1 to n-1: record counts; Phase n: records
IG coordinator: Phases 1 to n-1: maintain count lower bounds; Phase n: maintain last-seen records
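One way to read the phased scheme is sketched below in Python, collapsed to two phases and a single operator. It is an interpretation, not the actual application: it assumes the crash is deterministic across reruns, and `parse_time`, `run_with_agent`, `CulpritCoordinator`, and `find_culprit` are hypothetical names. The first run reports only counts, giving a lower bound on how many records were processed safely; the second run reports records, and the coordinator keeps the last one seen beyond that bound.

```python
def parse_time(record):
    """Hypothetical UDF that crashes on malformed input."""
    user, hour = record
    return (user, int(hour))


def run_with_agent(records, udf, on_record):
    """Run udf over records; the agent calls on_record(index, record) before each call."""
    out = []
    for i, record in enumerate(records):
        on_record(i, record)
        out.append(udf(record))
    return out


class CulpritCoordinator:
    def __init__(self):
        self.count_lower_bound = 0
        self.last_seen = None

    def note_count(self, index, record):
        # Counting phase: `index` records are known to have completed before this one.
        self.count_lower_bound = index

    def note_record(self, index, record):
        # Record phase: keep only records at or beyond the known-good prefix.
        if index >= self.count_lower_bound:
            self.last_seen = record


def find_culprit(records, udf):
    coordinator = CulpritCoordinator()
    for observer in (coordinator.note_count, coordinator.note_record):
        try:
            run_with_agent(records, udf, observer)
        except Exception:
            continue  # crashed as expected; move on to the next phase
        return None   # the run completed, so there is no culprit to report
    return coordinator.last_seen


culprit = find_culprit([("Amy", "8"), ("Fred", "eleven"), ("Bob", "9")], parse_time)
```

With this input the culprit is the `("Fred", "eleven")` record, the last one handed to the operator before it failed.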

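18 Example: Forward Tracing
[Diagram: dataflow of load, load, filter, join, group, count, store, connected to an IG coordinator]
IG coordinator → agents: tracing instructions
agents → IG coordinator: traced records
IG coordinator: report traced records to user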
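Forward tracing can be approximated by tagging selected input records and following the tags through each operator. The Python sketch below is an illustration only: it carries tags next to each record as a `(record, tags)` pair, which is an assumption about the mechanism, and `tag_source`, `join_on_url`, and `report_traced` are hypothetical names.

```python
def tag_source(records, should_trace):
    """Agent after a load: tag the records selected by the tracing instruction."""
    return [(record, {"traced"} if should_trace(record) else set()) for record in records]


def join_on_url(visits, pages):
    """Join on url; each output record inherits the union of its inputs' tags."""
    return [
        ((user, url, pagerank), visit_tags | page_tags)
        for (user, url), visit_tags in visits
        for (page_url, pagerank), page_tags in pages
        if url == page_url
    ]


def report_traced(stage, tagged_records, traced_log):
    """Agent after an operator: report tagged records, pass everything through."""
    for record, tags in tagged_records:
        if "traced" in tags:
            traced_log.append((stage, record))
    return tagged_records


traced_log = []
visits = tag_source(
    [("Amy", "www.cnn.com"), ("Fred", "www.snails.com")],
    lambda r: r[0] == "Fred",           # tracing instruction: follow Fred's visits
)
pages = tag_source([("www.cnn.com", 0.9), ("www.snails.com", 0.4)], lambda r: False)
joined = report_traced("join", join_on_url(visits, pages), traced_log)
```

After the join, `traced_log` contains only the output derived from Fred's visit.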

19 Flow
end user → application: dataflow program + app. parameters
application → IG driver library: launch instrumented dataflow run(s)
dataflow engine runtime: load, load, filter, join, store, with IG coordinator
IG driver library → application: raw result(s)
application → end user: result

20 Agent & Coordinator APIs
Agent Class:
    init(args)
    tags = observerecord(record, tags)
    receivemessage(source, message)
    finish()
Agent Messaging:
    sendtocoordinator(message)
    sendtoagent(agentid, message)
    senddownstream(message)
    sendupstream(message)
Coordinator Class:
    init(args)
    receivemessage(source, message)
    output = finish()
Coordinator Messaging:
    sendtoagent(agentid, message)
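To show how an application might plug into these hooks, here is a small Python sketch of a record-count summary. It mirrors the method names on the slide in snake_case but is otherwise an assumption: the real framework is Java, and `CountAgent`, `CountCoordinator`, `run_app`, and the direct in-process message delivery are hypothetical stand-ins.

```python
class CountCoordinator:
    """Coordinator side: gathers one count per agent and returns them as the output."""

    def init(self, args):
        self.counts = {}

    def receive_message(self, source, message):
        self.counts[source] = message

    def finish(self):
        return dict(self.counts)


class CountAgent:
    """Agent side: counts the records it observes and reports the total at the end."""

    def __init__(self, agent_id, coordinator):
        self.agent_id = agent_id
        self._coordinator = coordinator

    def init(self, args):
        self.count = 0

    def observe_record(self, record, tags):
        self.count += 1
        return tags  # tags are passed along unchanged

    def receive_message(self, source, message):
        pass  # this application never sends messages to agents

    def finish(self):
        self.send_to_coordinator(self.count)

    def send_to_coordinator(self, message):
        # Stand-in for the framework's messaging layer: a direct method call.
        self._coordinator.receive_message(self.agent_id, message)


def run_app(streams):
    """Drive one agent per named record stream, then collect the coordinator's output."""
    coordinator = CountCoordinator()
    coordinator.init(args=None)
    for agent_id, records in streams.items():
        agent = CountAgent(agent_id, coordinator)
        agent.init(args=None)
        for record in records:
            agent.observe_record(record, tags=set())
        agent.finish()
    return coordinator.finish()


summary = run_app({"load-visits": ["r1", "r2", "r3"], "filter": ["r1"]})
```

The application logic lives entirely in the two small classes; the driver loop stands in for whatever the framework does to route records and messages.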

21 Applications Developed For IG
# of requests   feature                          lines of code (Java)
7               crash culprit determination      …
5               row-level integrity alerts       89
4               table-level integrity alerts     99
4               data samples                     97
3               data summaries                   …
3               memory use monitoring            N/A
3               backward tracing (provenance)    …
2               forward tracing                  …
2               golden data/logic testing        …
2               step-through debugging           N/A
2               latency alerts                   …
1               latency profiling                …
1               overhead profiling               …
1               trial runs                       93

22 Related Work
Pig: DryadLINQ, Hive, Jaql, Scope, relational query languages
Example data generator: [Mannila/Raiha, PODS '86], reverse query processing, constraint databases, hardware verification & model checking
Inspector Gadget: XTrace, taint tracking, aspect-oriented programming

23 Collaborators
Shubham Chopra, Tyson Condie, Anish Das Sarma, Alan Gates, Pradeep Kamath, Ravi Kumar, Shravan Narayanamurthy, Olga Natkovich, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Andrew Tomkins

24 What I'm Working on at Google
We've got fantastic cloud building blocks: BigTable, MapReduce, Pregel, and on and on (and so do you: EC2, Hadoop, Redis, ZooKeeper, …)
To build your app:
1. Think hard, and choose a few building blocks (BBs)
2. Stick your app logic into a blender, and pour it into the various BB abstractions (keys/values, MR functions, callbacks, …)
3. Tune it: "stupid map-reduce tricks"; set bazillions of flags
4. Hope that:
    Your BB choices were right, and stay right for a while
    Nobody ever has to understand your app by reading the code
    You never attempt big changes to your app logic or algorithms
What if we could separate the app logic from the assemblage of building blocks?

25 Research at Google
High-risk/high-reward research happening across the company
Successful research projects often become successful products (e.g. speech recognition)
No pressure to publish incremental papers
Interdisciplinary. In my case: DB + PL + AI
We're hiring :)
