PIGFARM - LAS Sponsored Computer Science Senior Design Class Project, Spring 2017, Carson Cumbee - LAS

1 PIGFARM - LAS Sponsored Computer Science Senior Design Class Project Spring 2017 Carson Cumbee - LAS

2 What is Big Data? Big Data is data that is too large to fit on a single server. It requires an extra layer of software to coordinate the analysis across multiple servers. What counts as too large changes over time.

3 What is Hadoop/MapReduce? Hadoop is the de facto open source Big Data platform. It provides a fault-tolerant distributed file system, based on a 2003 paper from Google about their internal file system [1]. Map/Reduce is a parallel computing paradigm that stresses low memory usage: a Map step is executed on local nodes, and the results are sent over the network to Reducers, which complete the task. It comes from another famous Google paper [2]. If you want to query data, use a database. If you want to make a database, use Map/Reduce.
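The Map and Reduce steps described above can be illustrated with a small in-memory word count. This is only a sketch of the paradigm in plain Java: it does not use the Hadoop API, and the class name MiniMapReduce and its method names are made up for illustration. The run method stands in for the framework's shuffle, which on a real cluster moves map output over the network to the Reducers.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    // Map step: runs independently on each input split and emits (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce step: receives every value emitted for one key and sums them.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: groups map output by key (the "shuffle"), then reduces each group.
    public static Map<String, Integer> run(List<String> splits) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> kv : map(split)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each string plays the role of one input split stored on a different node.
        System.out.println(run(Arrays.asList("big data is big", "data is everywhere")));
    }
}
```

Because each call to map sees only its own split, the map work can run wherever the data already lives; only the small (word, count) pairs have to travel to the reducers.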

4 What is Pig? Instead of all this (Java):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    // ... (remaining imports elided on the slide)
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class L2 extends Configured implements Tool {

        /** MAPPER */
        public static class Join extends Mapper<LongWritable, Text, Text, Text> {

            // Keys of the small table, loaded once per mapper
            private Set<String> hash;

            public void setup(Context context) {
                try {
                    Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                    if (paths == null || paths.length < 1) {
                        throw new RuntimeException("DistributedCache no work.");
                    }
                    // Open the small table
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(new FileInputStream(paths[0].toString())));
                    String line;
                    hash = new HashSet<String>(500);
                    while ((line = reader.readLine()) != null) {
                        if (line.length() < 1) continue;
                        String[] fields = line.split("\u0001");
                        if (fields[0].length() != 0) hash.add(fields[0]);
                    }
                } catch (IOException ioe) {
                    throw new RuntimeException(ioe);
                }
            }

            public void map(LongWritable k, Text val, Context context)
                    throws IOException, InterruptedException {
                List<Text> fields = Library.splitLine(val, '\u0001');
                // Emit (user, estimated_revenue) only for users found in the small table
                if (hash.contains(fields.get(0).toString())) {
                    context.write(fields.get(0), fields.get(6));
                }
            }
        }

        /** RUN */
        public int run(String[] args) throws Exception {
            if (args.length != 3) {
                System.err.println("Usage: wordcount <input_dir> <output_dir> <reducers>");
                return -1;
            }
            Job job = new Job(getConf(), "PigMix L2");
            job.setJarByClass(L2.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(Join.class);
            Properties props = System.getProperties();
            Configuration conf = job.getConfiguration();
            for (Map.Entry<Object, Object> entry : props.entrySet()) {
                conf.set((String) entry.getKey(), (String) entry.getValue());
            }
            DistributedCache.addCacheFile(new URI(args[0] + "/pigmix_power_users"), conf);
            FileInputFormat.addInputPath(job, new Path(args[0] + "/pigmix_page_views"));
            FileOutputFormat.setOutputPath(job, new Path(args[1] + "/L2out"));
            // Map-only job: the join happens entirely in the mapper
            job.setNumReduceTasks(0);
            return job.waitForCompletion(true) ? 0 : -1;
        }

        /** @param args input_dir, output_dir, reducers */
        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new L2(), args);
            System.exit(res);
        }
    }

5 This (Pig Latin)*!

    rmf /PIGFARM/pigmixout/l2out
    register /proj/pigfarm/pigmix/pigperf.jar;
    A = LOAD '/PIGFARM/pigmix/pigmix_page_views'
        using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
        AS (user, action, timespent, query_term, ip_addr, timestamp,
            estimated_revenue, page_info, page_links);
    B = FOREACH A GENERATE user, estimated_revenue;
    alpha = LOAD '/PIGFARM/pigmix/pigmix_users' using PigStorage('\u0001')
        AS (name, phone, address, city, state, zip);
    beta = FOREACH alpha GENERATE name;
    C = JOIN B BY user, beta BY name;
    STORE C INTO '/PIGFARM/pigmixout/l2out';

* This is PIGMIX benchmark script L2.pig.

6 PIGFARM Multiple Query Optimization (MQO): the idea that several queries against a single database can be made more efficient if they are combined and issued at the same time. When large firms have data scientists throughout their business units writing Pig scripts against common data sets in an uncoordinated manner, there is an opportunity to use MQO to improve the analytical bandwidth of these systems.
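The saving that MQO aims for can be sketched with a toy example in plain Java. This is an illustration under simplifying assumptions, not PIGFARM's implementation: the "data set" is an in-memory list, each "query" is a predicate that counts matching rows, and the class SharedScan with its recordsRead counter is hypothetical. Run separately, every query rescans the data; combined, each row is read once and offered to every query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class SharedScan {

    // Counts how many rows were read from the (simulated) data source.
    public static int recordsRead = 0;

    // Uncoordinated execution: every query scans the whole data set itself.
    public static List<Long> runSeparately(List<String> data, List<Predicate<String>> queries) {
        List<Long> counts = new ArrayList<>();
        for (Predicate<String> query : queries) {
            long matches = 0;
            for (String row : data) {
                recordsRead++;
                if (query.test(row)) {
                    matches++;
                }
            }
            counts.add(matches);
        }
        return counts;
    }

    // Combined execution: one scan, with each row offered to every query.
    public static List<Long> runCombined(List<String> data, List<Predicate<String>> queries) {
        long[] matches = new long[queries.size()];
        for (String row : data) {
            recordsRead++;
            for (int i = 0; i < queries.size(); i++) {
                if (queries.get(i).test(row)) {
                    matches[i]++;
                }
            }
        }
        List<Long> counts = new ArrayList<>();
        for (long m : matches) {
            counts.add(m);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("yellow", "blue", "red", "blue");
        List<Predicate<String>> queries = new ArrayList<>();
        queries.add(row -> row.equals("yellow"));
        queries.add(row -> row.equals("blue"));
        queries.add(row -> row.equals("red"));

        recordsRead = 0;
        System.out.println(runSeparately(data, queries) + " after " + recordsRead + " reads");
        recordsRead = 0;
        System.out.println(runCombined(data, queries) + " after " + recordsRead + " reads");
    }
}
```

Both runs produce the same counts, but with three queries over four rows the separate version reads 12 rows and the combined version reads 4. On a cluster the read is the expensive part, so the gap grows with the number of scripts that share a data set.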

7 The Real Idea [Figure: Farmer: "I only like yellow data". Labels: Big Data feed, CPU, PIGSCRIPT 1, NOOPS, /dev/null]

8 The Real Idea [Figure: Farmer: "I only like blue data". Labels: Big Data feed, CPU, PIGSCRIPT 2, NOOPS, /dev/null]

9 The Real Idea [Figure: Farmer: "I only like red data". Labels: Big Data feed, CPU, PIGSCRIPT N, NOOPS, /dev/null]

10 The Real Idea Instead of this... [Figure: N_1, N_2, ..., N_N]

11 ...this: fuse the initial map [Figure: N_1, N_2, ..., N_N]

12 fuse the LOAD statement At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.

    --Script determines the number of distinct pred/obj pairs that have math in them
    rmf /PIGFARM/Merged/test001.gz
    table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
    filt1 = filter table by (obj matches '.*math.*') or (pred matches '.*math.*');
    unduped = DISTINCT filt1;
    store unduped into '/PIGFARM/Merged/test001.gz' using PigStorage('\t');

    --Script determines the number of unique objects with North Carolina
    rmf /PIGFARM/Merged/test003.gz
    table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj, period);
    filt = filter table by (obj matches '.*"North Carolina".*');
    objs = foreach filt generate obj;
    uniq_objs = distinct objs;
    grouped_users = group uniq_objs all;
    count = foreach grouped_users generate COUNT(uniq_objs);
    joined = union count, grouped_users;
    store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');

    --Script computes the average height of people for each subject
    rmf /PIGFARM/Merged/test002.gz
    table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
    filt1 = filter table BY (pred matches '.*"people.person.height_meters".*');
    removequotes = FOREACH filt1 GENERATE sub, REGEX_EXTRACT(obj, '"(.*)"',1) as num;
    casted = FOREACH removequotes GENERATE sub, (double)num;
    grouped = GROUP casted BY sub;
    avged = FOREACH grouped GENERATE casted.sub, AVG(casted.num);
    store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');

13 fuse the LOAD statement At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.

    -- An LAS PIGFARM Compiled Pig Script
    -- Compiled on: 23/02/17-07:09
    -- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
    -- 1: table from /proj/pigfarm/script_farm/tomerge/test002.pig
    -- 2: table from /proj/pigfarm/script_farm/tomerge/test001.pig
    -- 3: table from /proj/pigfarm/script_farm/tomerge/test003.pig
    boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t')
        AS (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);

    -- Below is the remainder of: /proj/pigfarm/script_farm/tomerge/test002.pig
    --Script computes the average height of people for each subject
    rmf /PIGFARM/Merged/test002.gz
    filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
    removequotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"',1) as num;
    casted = FOREACH removequotes GENERATE laughing_wing, (double)num;
    grouped = GROUP casted BY laughing_wing;
    avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
    store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');

    -- Below is the remainder of: /proj/pigfarm/script_farm/tomerge/test003.pig
    --Script determines the number of unique objects with North Carolina
    rmf /PIGFARM/Merged/test003.gz
    filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
    objs = foreach filt generate jovial_golick;
    uniq_objs = distinct objs;
    grouped_users = group uniq_objs all;
    count = foreach grouped_users generate COUNT(uniq_objs);
    joined = union count, grouped_users;
    store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');
    ...
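The "consistently renaming the variables" step can be sketched as a token-level rewrite. The helper below is hypothetical and is not the PIGFARM combiner: it assumes the rename map (for example table to boring_aryabhata) has already been chosen, replaces only whole identifiers, and leaves single-quoted string literals such as regular expressions and file paths untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AliasRenamer {

    // Matches either a single-quoted literal (group 1) or an identifier (group 2).
    private static final Pattern TOKEN =
            Pattern.compile("('(?:[^'\\\\]|\\\\.)*')|([A-Za-z_][A-Za-z0-9_]*)");

    // Returns the script with every identifier found in 'renames' replaced.
    public static String rename(String script, Map<String, String> renames) {
        Matcher m = TOKEN.matcher(script);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group();
            String identifier = m.group(2);
            if (identifier != null && renames.containsKey(identifier)) {
                replacement = renames.get(identifier);
            }
            // quoteReplacement keeps backslashes and '$' in literals from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> renames = new HashMap<>();
        renames.put("table", "boring_aryabhata");
        renames.put("obj", "jovial_golick");
        String line = "filt = filter table by (obj matches '.*\"North Carolina\".*');";
        System.out.println(rename(line, renames));
    }
}
```

A real combiner would also need to skip `--` comments and avoid collisions between aliases that different scripts reuse (filt1 appears in two of the scripts above); this sketch does neither.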

14 fuse the LOAD statement But this didn't work. Pig just submitted the job as if it were the 3 sequential Pig jobs (although it might still work with Tez). We then decided to move the STORE statements to the end. This caused very large temporary files to be created, which was a performance killer. Finally we decided to identify the initial Map portions of the scripts, STORE them compressed, and then read them back in: essentially explicit temporary files. This seems to work.

15 fuse the initial mapper

    -- An LAS PIGFARM Compiled Pig Script
    -- Compiled on: 23/02/17-07:09
    rmf /PIGFARM/cumbeeMerged/test001.gz
    rmf /PIGFARM/cumbeeMerged/test002.gz
    rmf /PIGFARM/cumbeeMerged/test003.gz
    rmf /PIGFARM/cumbeeMerged/casted.gz
    rmf /PIGFARM/cumbeeMerged/objs.gz
    rmf /PIGFARM/cumbeeMerged/filt2.gz
    rmf /PIGFARM/cumbeeMerged/filt3.gz
    -- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
    -- 1: table from /proj/pigfarm/script_farm/tomerge/test002.pig
    -- 2: table from /proj/pigfarm/script_farm/tomerge/test001.pig
    -- 3: table from /proj/pigfarm/script_farm/tomerge/test003.pig
    boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t')
        AS (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);

    -- Initial map portions of all three scripts, sharing the single LOAD
    filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
    filt2 = filter boring_aryabhata by (stoic_allen matches '.*math.*');
    filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
    filt3 = filter boring_aryabhata by (jovial_golick matches '.*math.*');
    objs = foreach filt generate jovial_golick;
    removequotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"',1) as num;
    casted = FOREACH removequotes GENERATE laughing_wing, (double)num;

    -- Explicit temporary files: store the map output compressed
    store casted into '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t');
    store objs into '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t');
    store filt2 into '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t');
    store filt3 into '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t');

    -- Read the temporary files back in
    casted = LOAD '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t') as (laughing_wing, num:double);
    objs = LOAD '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t') as (jovial_golick);
    filt2 = LOAD '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t')
        as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
    filt3 = LOAD '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t')
        as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);

    -- Remainder of each script
    grouped = GROUP casted BY laughing_wing;
    avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
    uniq_objs = distinct objs;
    grouped_users = group uniq_objs all;
    count = foreach grouped_users generate COUNT(uniq_objs);
    joined = union count, grouped_users;

16 Datasets PIGMIX: the standard synthetic Pig benchmark. 250 million rows, mostly dense, 400 GB uncompressed. Used to test Apache Pig vs. Java Map/Reduce performance. Freebase: a large knowledge graph available on the internet. 3 billion+ (subject, predicate, object) tuples, 250 GB uncompressed. We made a special loader UDF for Freebase called FBLoader().

17 Test Cluster OSCAR LAB Hortonworks cluster. 12 blades: 1 login/name server and 11 compute nodes. Each blade has 65 GB of RAM. 12 TB of HDFS. Replication factor of 1.

18 Preliminary Results [Table: run time in minutes, parallel submission vs. PIGFARM. Rows: Freebase Test 1-3; PRL 1,..8 individual scripts; PRL_1-4; PRL_1-6; PRL_ (N/A)] These scripts were compiled by hand, using Pig defaults for the number of reducers.

19 PIGFARMers

21 Work is ongoing The team is still working on the script combiner. Make sure it can handle FBLoader(). Create a similarity function for scripts based on the data they access. Run all of the experiments and write the results up in a paper. It is also worthwhile to rerun all experiments with Tez vs. MapReduce. A furious finish: only 5 weeks left in the semester!
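One possible starting point for the similarity function mentioned above is a Jaccard index over the paths each script LOADs. This is only a sketch of one candidate measure, not the team's design: the class ScriptSimilarity, the regular expression for finding LOAD paths, and the choice of Jaccard are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptSimilarity {

    // Finds the quoted path that follows a LOAD keyword (case-insensitive).
    private static final Pattern LOAD = Pattern.compile("(?i)\\bload\\s+'([^']+)'");

    // Collects the distinct data paths a script reads.
    public static Set<String> loadPaths(String script) {
        Set<String> paths = new HashSet<>();
        Matcher m = LOAD.matcher(script);
        while (m.find()) {
            paths.add(m.group(1));
        }
        return paths;
    }

    // Jaccard index: shared paths divided by all paths used by either script.
    public static double similarity(String scriptA, String scriptB) {
        Set<String> a = loadPaths(scriptA);
        Set<String> b = loadPaths(scriptB);
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // no data accessed, nothing to share
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        Set<String> all = new HashSet<>(a);
        all.addAll(b);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        String s1 = "table = load '/PIGFARM/data2.gz' using PigStorage('\\t') as (sub, pred, obj);";
        String s2 = "A = LOAD '/PIGFARM/pigmix/pigmix_page_views' AS (user, action);";
        System.out.println(similarity(s1, s1));
        System.out.println(similarity(s1, s2));
    }
}
```

Scripts that score high against each other read the same inputs and are natural candidates to merge; a fuller measure might also compare the columns and filters each script uses.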

22 Conclusion If a large firm is writing Apache Pig scripts to perform Map/Reduce jobs on common data sets, there could be substantial performance gains from fusing the maps together with PIGFARM, especially if most of the jobs are map heavy.

23 Thanks! Dr. Aaron Wiechman - LAS; Dr. Sean Lynch - LAS; Ms. Margaret Heil - Director, SDC; Dr. David Sturgill - Tech Advisor Session 3

24 Questions?

25 References
[1] Ghemawat, S.; Gobioff, H.; Leung, S. T. (2003). "The Google file system". Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles - SOSP '03. p. 29.
[2] Dean, J.; Ghemawat, S. (2004). "MapReduce: Simplified data processing on large clusters". Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation.
[3] Camacho-Rodríguez, J.; Colazzo, D.; Herschel, M.; Manolescu, I.; Roy Chowdhury, S. "PigReuse: A Reuse-based Optimizer for Pig Latin". Technical Report, Inria Saclay. <hal >
