1 PIGFARM - LAS Sponsored Computer Science Senior Design Class Project Spring 2017 Carson Cumbee - LAS
2 What is Big Data?
Big Data is data that is too large to fit on a single server, so an extra layer of software is needed to coordinate the analysis across servers.
What counts as too large changes over time.
3 What is Hadoop/MapReduce?
Hadoop is the de facto open source Big Data platform.
- Fault-tolerant distributed file system, based on a 2003 paper from Google about their internal file system [1].
- Map/Reduce: a parallel computing paradigm that stresses low memory usage. A Map step is executed on local nodes and the results are sent over the network to Reducers, which complete the task. Described in another famous Google paper [2].
If you want to query data, use a database. If you want to make a database, use Map/Reduce.
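To make the Map and Reduce roles concrete, here is a minimal single-process sketch in plain Java. It does not use the Hadoop API: the class MiniMapReduce and its methods map, reduce and run are made up for illustration, and the in-memory grouping in run only stands in for the network shuffle a real cluster performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy, not the Hadoop API: word count as map -> shuffle -> reduce.
final class MiniMapReduce {
    /** Map step: one input line in, a list of (word, 1) pairs out. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** Reduce step: all values that share a key are combined into one result. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Stand-in for the framework: map every record, group by key, then reduce each group. */
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> shuffled = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                shuffled.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffled.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {big=2, data=3, is=2}
        System.out.println(run(List.of("big data is big", "data is data")));
    }
}
```

The map function only ever sees one record at a time, which is where the low memory usage comes from; everything that needs to see many records together happens after the grouping step.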
4 What is Pig?
Instead of all this (Java)

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Library (used in map below) is the PigMix helper class; it is not shown on the slide.
    public class L2 extends Configured implements Tool {

        /** MAPPER: map-side join against the small table, which is held in memory. */
        public static class Join extends Mapper<LongWritable, Text, Text, Text> {
            private Set<String> hash;

            @Override
            public void setup(Context context) {
                try {
                    Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                    if (paths == null || paths.length < 1) {
                        throw new RuntimeException("DistributedCache no work.");
                    }
                    // Open the small table
                    BufferedReader reader = new BufferedReader(
                        new InputStreamReader(new FileInputStream(paths[0].toString())));
                    String line;
                    hash = new HashSet<String>(500);
                    while ((line = reader.readLine()) != null) {
                        if (line.length() < 1) continue;
                        String[] fields = line.split("\u0001");
                        if (fields[0].length() != 0) hash.add(fields[0]);
                    }
                } catch (IOException ioe) {
                    throw new RuntimeException(ioe);
                }
            }

            @Override
            public void map(LongWritable k, Text val, Context context)
                    throws IOException, InterruptedException {
                List<Text> fields = Library.splitLine(val, '\u0001');
                // Emit (user, estimated_revenue) only for users present in the small table
                if (hash.contains(fields.get(0).toString())) {
                    context.write(fields.get(0), fields.get(6));
                }
            }
        }

        /** RUN */
        public int run(String[] args) throws Exception {
            if (args.length != 3) {
                System.err.println("Usage: wordcount <input_dir> <output_dir> <reducers>");
                return -1;
            }
            Job job = new Job(getConf(), "PigMix L2");
            job.setJarByClass(L2.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(Join.class);
            Properties props = System.getProperties();
            Configuration conf = job.getConfiguration();
            for (Map.Entry<Object, Object> entry : props.entrySet()) {
                conf.set((String) entry.getKey(), (String) entry.getValue());
            }
            DistributedCache.addCacheFile(new URI(args[0] + "/pigmix_power_users"), conf);
            FileInputFormat.addInputPath(job, new Path(args[0] + "/pigmix_page_views"));
            FileOutputFormat.setOutputPath(job, new Path(args[1] + "/L2out"));
            job.setNumReduceTasks(0);   // map-only job
            return job.waitForCompletion(true) ? 0 : -1;
        }

        /** args */
        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new L2(), args);
            System.exit(res);
        }
    }
5 This (Pig Latin)*!

    rmf /PIGFARM/pigmixout/l2out
    register /proj/pigfarm/pigmix/pigperf.jar;
    A = LOAD '/PIGFARM/pigmix/pigmix_page_views'
        using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
        AS (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
    B = FOREACH A GENERATE user, estimated_revenue;
    alpha = LOAD '/PIGFARM/pigmix/pigmix_users' using PigStorage('\u0001')
        AS (name, phone, address, city, state, zip);
    beta = FOREACH alpha GENERATE name;
    C = JOIN B BY user, beta BY name;
    STORE C INTO '/PIGFARM/pigmixout/l2out';

* This is PIGMIX Benchmark script L2.pig
6 PIGFARM
Multiple Query Optimization (MQO): the idea that several queries against a single database can be made more efficient if they are combined and issued at the same time.
When large firms have data scientists throughout their business units writing Pig scripts against common data sets in an uncoordinated manner, there is an opportunity to use MQO to improve the analytical bandwidth of these systems.
7 The Real Idea
[Figure labels: Big Data feed; Farmer CPU; PIGSCRIPT 1 ("I only like yellow data"); NOOPS; /dev/null]
8 The Real Idea
[Figure labels: Big Data feed; Farmer CPU; PIGSCRIPT 2 ("I only like blue data"); NOOPS; /dev/null]
9 The Real Idea
[Figure labels: Big Data feed; Farmer CPU; PIGSCRIPT N ("I only like red data"); NOOPS; /dev/null]
10 The Real Idea
Instead of this
[Figure labels: N_1, N_2, ..., N_N]
11 this: fuse the initial map
[Figure labels: N_1, N_2, ..., N_N]
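The difference between the two pictures can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: each "script" is reduced to a row filter, the feed is an in-memory list, and the names SharedScanSketch, runSeparately and runFused are hypothetical. It is not PIGFARM code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration: N separate scans versus one fused scan.
final class SharedScanSketch {
    /** Number of rows read from the (pretend) Big Data feed so far. */
    static long rowsRead = 0;

    /** Uncoordinated: every script scans the whole feed on its own. */
    static List<List<String>> runSeparately(List<String> feed, List<Predicate<String>> scripts) {
        List<List<String>> results = new ArrayList<>();
        for (Predicate<String> script : scripts) {
            List<String> kept = new ArrayList<>();
            for (String row : feed) {
                rowsRead++;
                if (script.test(row)) kept.add(row);
            }
            results.add(kept);
        }
        return results;
    }

    /** Fused: the feed is scanned once and each row is offered to every script. */
    static List<List<String>> runFused(List<String> feed, List<Predicate<String>> scripts) {
        List<List<String>> results = new ArrayList<>();
        for (int i = 0; i < scripts.size(); i++) results.add(new ArrayList<>());
        for (String row : feed) {
            rowsRead++;
            for (int i = 0; i < scripts.size(); i++) {
                if (scripts.get(i).test(row)) results.get(i).add(row);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> feed = List.of("yellow", "blue", "red", "blue", "yellow");
        List<Predicate<String>> scripts = List.<Predicate<String>>of(
            row -> row.equals("yellow"),
            row -> row.equals("blue"),
            row -> row.equals("red"));

        rowsRead = 0;
        List<List<String>> separate = runSeparately(feed, scripts);
        long separateReads = rowsRead;

        rowsRead = 0;
        List<List<String>> fused = runFused(feed, scripts);
        long fusedReads = rowsRead;

        System.out.println(separate.equals(fused));               // prints true
        System.out.println(separateReads + " vs " + fusedReads);  // prints 15 vs 5
    }
}
```

Both strategies produce the same per-script results; the fused version simply reads each row once instead of once per script, which is the saving the slides are after when the scan dominates the cost.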
12 fuse the LOAD statement
At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.

    --Script determines the number of distinct pred/obj pairs that have math in them
    rmf /PIGFARM/Merged/test001.gz
    table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
    filt1 = filter table by (obj matches '.*math.*') or (pred matches '.*math.*');
    unduped = DISTINCT filt1;
    store unduped into '/PIGFARM/Merged/test001.gz' using PigStorage('\t');

    --Script determines the number of unique objects with North Carolina
    rmf /PIGFARM/Merged/test003.gz
    table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj, period);
    filt = filter table by (obj matches '.*"North Carolina".*');
    objs = foreach filt generate obj;
    uniq_objs = distinct objs;
    grouped_users = group uniq_objs all;
    count = foreach grouped_users generate COUNT(uniq_objs);
    joined = union count, grouped_users;
    store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');

    --Script computes the average height of people for each subject
    rmf /PIGFARM/Merged/test002.gz
    table = load '/PIGFARM/data2.gz' using PigStorage('\t') as (sub, pred, obj);
    filt1 = filter table BY (pred matches '.*"people.person.height_meters".*');
    removequotes = FOREACH filt1 GENERATE sub, REGEX_EXTRACT(obj, '"(.*)"',1) as num;
    casted = FOREACH removequotes GENERATE sub, (double)num;
    grouped = GROUP casted BY sub;
    avged = FOREACH grouped GENERATE casted.sub, AVG(casted.num);
    store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');
13 fuse the LOAD statement
At first we thought this would just mean fusing the LOAD statements together, consistently renaming the variables, and letting Apache Pig work its magic.

    -- An LAS PIGFARM Compiled Pig Script
    -- Compiled on: 23/02/17-07:09
    -- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
    -- 1: table from /proj/pigfarm/script_farm/tomerge/test002.pig
    -- 2: table from /proj/pigfarm/script_farm/tomerge/test001.pig
    -- 3: table from /proj/pigfarm/script_farm/tomerge/test003.pig
    boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t') AS (
        laughing_wing, stoic_allen, jovial_golick, elegant_davinci
    );

    -- Below is the remainder of: /proj/pigfarm/script_farm/tomerge/test002.pig
    --Script computes the average height of people for each subject
    rmf /PIGFARM/Merged/test002.gz
    filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
    removequotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"',1) as num;
    casted = FOREACH removequotes GENERATE laughing_wing, (double)num;
    grouped = GROUP casted BY laughing_wing;
    avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
    store avged into '/PIGFARM/Merged/test002.gz' using PigStorage('\t');

    -- Below is the remainder of: /proj/pigfarm/script_farm/tomerge/test003.pig
    --Script determines the number of unique objects with North Carolina
    rmf /PIGFARM/Merged/test003.gz
    filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
    objs = foreach filt generate jovial_golick;
    uniq_objs = distinct objs;
    grouped_users = group uniq_objs all;
    count = foreach grouped_users generate COUNT(uniq_objs);
    joined = union count, grouped_users;
    store joined into '/PIGFARM/Merged/test003.gz' using PigStorage('\t');
    ..
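The consistent renaming step can be sketched as a small text rewrite. The version below is a guess at the general shape, not the team's script combiner: AliasRenamer and rename are hypothetical names, and the sketch assumes that identifiers can be renamed wherever they appear outside single-quoted string literals. It does not treat relation aliases and field names differently, and it does not skip "--" comments.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rename whole identifiers in a Pig script, leaving quoted strings alone.
final class AliasRenamer {
    // A token is either a single-quoted literal (kept as-is) or an identifier (rename candidate).
    private static final Pattern TOKEN =
        Pattern.compile("'(?:[^'\\\\]|\\\\.)*'|[A-Za-z_][A-Za-z0-9_]*");

    /** Returns the script with every identifier found in renames replaced by its new name. */
    static String rename(String script, Map<String, String> renames) {
        Matcher m = TOKEN.matcher(script);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            String replacement = token.startsWith("'")
                ? token                                  // string literal: never touched
                : renames.getOrDefault(token, token);    // identifier: renamed only on exact match
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> renames = Map.of(
            "table", "boring_aryabhata",
            "obj", "jovial_golick");
        System.out.println(rename("filt = filter table by (obj matches '.*obj.*');", renames));
    }
}
```

Because whole tokens are matched, an alias such as objs is left alone when only obj is in the rename map, and the text inside the regular expression literal is not rewritten.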
14 fuse the LOAD statement
But this didn't work. Pig just submitted the job as if it were the 3 sequential Pig jobs. (Although it might still work with Tez.)
- Decided to move the STORE statements to the end. This actually caused very large temporary files to be created: a performance killer.
- Decided to identify the initial Map portions of the scripts, STORE them compressed, and then read them back in: essentially explicit temporary files. This seems to work.
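The explicit temporary file approach can be pictured with a single-machine analogue in plain Java. This is a sketch under stated assumptions, not how Pig or PIGFARM implements it: the initial map portion of each script is reduced to a row filter, local gzip files stand in for compressed HDFS output, and FusedMapWithTempFiles, fusedMap and readBack are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: one pass over the input, one compressed temp file per script.
final class FusedMapWithTempFiles {
    /** Scans rows once; each named filter writes its matching rows to its own gzip file. */
    static Map<String, Path> fusedMap(List<String> rows, Map<String, Predicate<String>> filters)
            throws IOException {
        Map<String, Path> files = new LinkedHashMap<>();
        Map<String, Writer> writers = new LinkedHashMap<>();
        try {
            for (String name : filters.keySet()) {
                Path file = Files.createTempFile(name + "-", ".gz");
                files.put(name, file);
                writers.put(name, new OutputStreamWriter(
                    new GZIPOutputStream(Files.newOutputStream(file)), StandardCharsets.UTF_8));
            }
            for (String row : rows) {
                for (Map.Entry<String, Predicate<String>> filter : filters.entrySet()) {
                    if (filter.getValue().test(row)) {
                        writers.get(filter.getKey()).write(row + "\n");
                    }
                }
            }
        } finally {
            for (Writer w : writers.values()) w.close();   // also finishes the gzip streams
        }
        return files;
    }

    /** Reads one temp file back in, as the remainder of a script would. */
    static List<String> readBack(Path file) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            List<String> rows = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) rows.add(line);
            return rows;
        }
    }
}
```

The point of the design is that the large input is read once, and each later stage only reads back its own, usually much smaller, filtered file.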
15 fuse the initial mapper

    -- An LAS PIGFARM Compiled Pig Script
    -- Compiled on: 23/02/17-07:09
    rmf /PIGFARM/cumbeeMerged/test001.gz
    rmf /PIGFARM/cumbeeMerged/test002.gz
    rmf /PIGFARM/cumbeeMerged/test003.gz
    rmf /PIGFARM/cumbeeMerged/casted.gz
    rmf /PIGFARM/cumbeeMerged/objs.gz
    rmf /PIGFARM/cumbeeMerged/filt2.gz
    rmf /PIGFARM/cumbeeMerged/filt3.gz
    -- The following variable accesses the data source: '/PIGFARM/data/spli*.gz' using function: PigStorage('\t')
    -- 1: table from /proj/pigfarm/script_farm/tomerge/test002.pig
    -- 2: table from /proj/pigfarm/script_farm/tomerge/test001.pig
    -- 3: table from /proj/pigfarm/script_farm/tomerge/test003.pig
    boring_aryabhata = LOAD '/PIGFARM/data2.gz' USING PigStorage('\t') AS (
        laughing_wing, stoic_allen, jovial_golick, elegant_davinci
    );

    -- Initial map portions of all three scripts, fused over the single LOAD
    filt1 = filter boring_aryabhata BY (stoic_allen matches '.*"people.person.height_meters".*');
    filt2 = filter boring_aryabhata by (stoic_allen matches '.*math.*');
    filt = filter boring_aryabhata by (jovial_golick matches '.*"North Carolina".*');
    filt3 = filter boring_aryabhata by (jovial_golick matches '.*math.*');
    objs = foreach filt generate jovial_golick;
    removequotes = FOREACH filt1 GENERATE laughing_wing, REGEX_EXTRACT(jovial_golick, '"(.*)"',1) as num;
    casted = FOREACH removequotes GENERATE laughing_wing, (double)num;

    -- Explicit temporary files: store the map outputs, then read them back in
    store casted into '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t');
    store objs into '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t');
    store filt2 into '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t');
    store filt3 into '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t');
    casted = LOAD '/PIGFARM/cumbeeMerged/casted.gz' using PigStorage('\t') as (laughing_wing, num:double);
    objs = LOAD '/PIGFARM/cumbeeMerged/objs.gz' using PigStorage('\t') as (jovial_golick);
    filt2 = LOAD '/PIGFARM/cumbeeMerged/filt2.gz' using PigStorage('\t') as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);
    filt3 = LOAD '/PIGFARM/cumbeeMerged/filt3.gz' using PigStorage('\t') as (laughing_wing, stoic_allen, jovial_golick, elegant_davinci);

    -- Remaining stages of each script
    grouped = GROUP casted BY laughing_wing;
    avged = FOREACH grouped GENERATE casted.laughing_wing, AVG(casted.num);
    uniq_objs = distinct objs;
    grouped_users = group uniq_objs all;
    count = foreach grouped_users generate COUNT(uniq_objs);
    joined = union count, grouped_users;
16 Datasets
PIGMIX: standard synthetic Pig benchmark
- 250 million rows
- Mostly dense, 400 GB uncompressed
- Used to test Apache Pig vs Java Map/Reduce performance
Freebase: large knowledge graph available on the internet
- 3+ billion (subject, predicate, object) tuples
- 250 GB uncompressed
- We made a special loader function UDF for Freebase called FBLoader()
17 Test Cluster
OSCAR LAB Hortonworks cluster
- 12 blades: 1 login/name server, 11 compute nodes
- Each blade has 65 GB of RAM
- 12 TB of HDFS
- Replication factor of 1
18 Preliminary Results
[Table: run time in minutes, parallel submission vs PIGFARM. Rows: Freebase Test 1-3; PRL 1,..8 (individual scripts); PRL_1-4; PRL_1-6; PRL_ (N/A). The numeric values did not survive transcription.]
These scripts were compiled by hand.
Using Pig defaults for the number of reducers.
19 PIGFARMers
20 PIGFARMers
21 Work is ongoing
Team is still working on the script combiner
- Make sure it can handle FBLoader()
- Create a similarity function for scripts based on the data they access
- Run all of the experiments and write the results up in a paper
- Also worthwhile to rerun all experiments with Tez vs MapReduce
A furious finish: only 5 weeks left in the semester!
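The slides do not say what the similarity function will look like. One simple possibility, offered here purely as an assumption, is the Jaccard similarity of the sets of paths two scripts LOAD. The class ScriptSimilarity, its methods, and the regular expression used to find LOAD paths are all hypothetical and only cover LOAD statements whose path is a single-quoted literal.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: compare scripts by the data sources they LOAD.
final class ScriptSimilarity {
    // Matches load '<path>' in any letter case and captures the quoted path.
    private static final Pattern LOAD = Pattern.compile("(?i)\\bload\\s+'([^']+)'");

    /** Collects the paths named in the LOAD statements of a Pig script. */
    static Set<String> loadedPaths(String script) {
        Set<String> paths = new HashSet<>();
        Matcher m = LOAD.matcher(script);
        while (m.find()) {
            paths.add(m.group(1));
        }
        return paths;
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;   // assumption: two scripts that load nothing are not considered similar
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

Under this measure, scripts that load exactly the same data score 1.0 and scripts with no source in common score 0.0, so a combiner could, for example, only try to merge scripts above some threshold.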
22 Conclusion
If a large firm is writing Apache Pig scripts to perform Map/Reduce jobs on common data sets, there could be substantial performance gains from fusing the maps together with PIGFARM, especially if most of the jobs are map heavy.
23 Thanks!
Dr. Aaron Wiechman - LAS
Dr. Sean Lynch - LAS
Ms. Margaret Heil - Director, SDC
Dr. David Sturgill - Tech Advisor
Session 3
24 Questions?
25 References
[1] Ghemawat, S.; Gobioff, H.; Leung, S. T. (2003). "The Google file system". Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). p. 29.
[2] Dean, J.; Ghemawat, S. (2004). "MapReduce: Simplified data processing on large clusters". Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation. p
[3] Camacho-Rodríguez, J.; Colazzo, D.; Herschel, M.; Manolescu, I.; Roy Chowdhury, S. "PigReuse: A Reuse-based Optimizer for Pig Latin". [Technical Report] Inria Saclay. <hal >
Big Data: Architectures and Data Analytics
Big Data: Architectures and Data Analytics June 26, 2018 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer
More informationBig Data: Architectures and Data Analytics
Big Data: Architectures and Data Analytics June 26, 2018 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer
More informationBig Data: Architectures and Data Analytics
Big Data: Architectures and Data Analytics January 22, 2018 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer
More informationBig Data: Architectures and Data Analytics
Big Data: Architectures and Data Analytics July 14, 2017 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer
More informationUNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus
UNIT V PROCESSING YOUR DATA WITH MAPREDUCE Syllabus Getting to know MapReduce MapReduce Execution Pipeline Runtime Coordination and Task Management MapReduce Application Hadoop Word Count Implementation.
More informationParallel Processing - MapReduce and FlumeJava. Amir H. Payberah 14/09/2018
Parallel Processing - MapReduce and FlumeJava Amir H. Payberah payberah@kth.se 14/09/2018 The Course Web Page https://id2221kth.github.io 1 / 83 Where Are We? 2 / 83 What do we do when there is too much
More informationLarge-scale Information Processing
Sommer 2013 Large-scale Information Processing Ulf Brefeld Knowledge Mining & Assessment brefeld@kma.informatik.tu-darmstadt.de Anecdotal evidence... I think there is a world market for about five computers,
More informationMapReduce Simplified Data Processing on Large Clusters
MapReduce Simplified Data Processing on Large Clusters Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) MapReduce 1393/8/5 1 /
More informationIntroduction to Map/Reduce. Kostas Solomos Computer Science Department University of Crete, Greece
Introduction to Map/Reduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover What is MapReduce? How does it work? A simple word count example (the Hello World! of
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 16. Big Data Management VI (MapReduce Programming)
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 16 Big Data Management VI (MapReduce Programming) Credits: Pietro Michiardi (Eurecom): Scalable Algorithm
More informationHadoop Integration Guide
HP Vertica Analytic Database Software Version: 7.0.x Document Release Date: 4/7/2016 Legal Notices Warranty The only warranties for HP products and services are set forth in the express warranty statements
More informationCS435 Introduction to Big Data Spring 2018 Colorado State University. 2/5/2018 Week 4-A Sangmi Lee Pallickara. FAQs. Total Order Sorting Pattern
W4.A.0.0 CS435 Introduction to Big Data W4.A.1 FAQs PA0 submission is open Feb. 6, 5:00PM via Canvas Individual submission (No team submission) If you have not been assigned the port range, please contact
More informationBig Data Analysis using Hadoop. Map-Reduce An Introduction. Lecture 2
Big Data Analysis using Hadoop Map-Reduce An Introduction Lecture 2 Last Week - Recap 1 In this class Examine the Map-Reduce Framework What work each of the MR stages does Mapper Shuffle and Sort Reducer
More informationHadoop Integration Guide
HP Vertica Analytic Database Software Version: 7.0.x Document Release Date: 5/2/2018 Legal Notices Warranty The only warranties for Micro Focus products and services are set forth in the express warranty
More informationCOMP4442. Service and Cloud Computing. Lab 12: MapReduce. Prof. George Baciu PQ838.
COMP4442 Service and Cloud Computing Lab 12: MapReduce www.comp.polyu.edu.hk/~csgeorge/comp4442 Prof. George Baciu csgeorge@comp.polyu.edu.hk PQ838 1 Contents Introduction to MapReduce A WordCount example
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Apache Spark Feb. 2, 2016 1 / 67 Big Data small data big data Amir H. Payberah (SICS) Apache Spark
More informationMap-Reduce Applications: Counting, Graph Shortest Paths
Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationSteps: First install hadoop (if not installed yet) by, https://sl6it.wordpress.com/2015/12/04/1-study-and-configure-hadoop-for-big-data/
SL-V BE IT EXP 7 Aim: Design and develop a distributed application to find the coolest/hottest year from the available weather data. Use weather data from the Internet and process it using MapReduce. Steps:
More informationMAPREDUCE - PARTITIONER
MAPREDUCE - PARTITIONER http://www.tutorialspoint.com/map_reduce/map_reduce_partitioner.htm Copyright tutorialspoint.com A partitioner works like a condition in processing an input dataset. The partition
More informationMap-Reduce Applications: Counting, Graph Shortest Paths
Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationJava in MapReduce. Scope
Java in MapReduce Kevin Swingler Scope A specific look at the Java code you might use for performing MapReduce in Hadoop Java program recap The map method The reduce method The whole program Running on
More informationSession 1 Big Data and Hadoop - Overview. - Dr. M. R. Sanghavi
Session 1 Big Data and Hadoop - Overview - Dr. M. R. Sanghavi Acknowledgement Prof. Kainjan M. Sanghavi For preparing this prsentation This presentation is available on my blog https://maheshsanghavi.wordpress.com/expert-talk-fdp-workshop/
More informationIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationParallel Data Processing with Hadoop/MapReduce. CS140 Tao Yang, 2014
Parallel Data Processing with Hadoop/MapReduce CS140 Tao Yang, 2014 Overview What is MapReduce? Example with word counting Parallel data processing with MapReduce Hadoop file system More application example
More informationBatch Inherence of Map Reduce Framework
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.287
More informationClustering Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve
More informationSQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism
Big Data and Hadoop with Azure HDInsight Andrew Brust Senior Director, Technical Product Marketing and Evangelism Datameer Level: Intermediate Meet Andrew Senior Director, Technical Product Marketing and
More informationExperiences with a new Hadoop cluster: deployment, teaching and research. Andre Barczak February 2018
Experiences with a new Hadoop cluster: deployment, teaching and research Andre Barczak February 2018 abstract In 2017 the Machine Learning research group got funding for a new Hadoop cluster. However,
More informationParallel Computing. Prof. Marco Bertini
Parallel Computing Prof. Marco Bertini Apache Hadoop Chaining jobs Chaining MapReduce jobs Many complex tasks need to be broken down into simpler subtasks, each accomplished by an individual MapReduce
More informationClustering Documents. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve
More informationGuidelines For Hadoop and Spark Cluster Usage
Guidelines For Hadoop and Spark Cluster Usage Procedure to create an account in CSX. If you are taking a CS prefix course, you already have an account; to get an initial password created: 1. Login to https://cs.okstate.edu/pwreset
More informationAndrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why
More informationData Analysis Using MapReduce in Hadoop Environment
Data Analysis Using MapReduce in Hadoop Environment Muhammad Khairul Rijal Muhammad*, Saiful Adli Ismail, Mohd Nazri Kama, Othman Mohd Yusop, Azri Azmi Advanced Informatics School (UTM AIS), Universiti
More informationPig A language for data processing in Hadoop
Pig A language for data processing in Hadoop Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Apache Pig: Introduction Tool for querying data on Hadoop
More informationMapReduce-style data processing
MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic
More informationMapReduce. Arend Hintze
MapReduce Arend Hintze Distributed Word Count Example Input data files cat * key-value pairs (0, This is a cat!) (14, cat is ok) (24, walk the dog) Mapper map() function key-value pairs (this, 1) (is,
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationMap Reduce. MCSN - N. Tonellotto - Distributed Enabling Platforms
Map Reduce 1 MapReduce inside Google Googlers' hammer for 80% of our data crunching Large-scale web search indexing Clustering problems for Google News Produce reports for popular queries, e.g. Google
More informationAn Introduction to Big Data Analysis using Spark
An Introduction to Big Data Analysis using Spark Mohamad Jaber American University of Beirut - Faculty of Arts & Sciences - Department of Computer Science May 17, 2017 Mohamad Jaber (AUB) Spark May 17,
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationOverview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.
MapReduce and HDFS This presentation includes course content University of Washington Redistributed under the Creative Commons Attribution 3.0 license. All other contents: Overview Why MapReduce? What
More informationCS 470 Spring Parallel Algorithm Development. (Foster's Methodology) Mike Lam, Professor
CS 470 Spring 2018 Mike Lam, Professor Parallel Algorithm Development (Foster's Methodology) Graphics and content taken from IPP section 2.7 and the following: http://www.mcs.anl.gov/~itf/dbpp/text/book.html
More informationTopics covered in this lecture
9/5/2018 CS435 Introduction to Big Data - FALL 2018 W3.B.0 CS435 Introduction to Big Data 9/5/2018 CS435 Introduction to Big Data - FALL 2018 W3.B.1 FAQs How does Hadoop mapreduce run the map instance?
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationBig Data landscape Lecture #2
Big Data landscape Lecture #2 Contents 1 1 CORE Technologies 2 3 MapReduce YARN 4 SparK 5 Cassandra Contents 2 16 HBase 72 83 Accumulo memcached 94 Blur 10 5 Sqoop/Flume Contents 3 111 MongoDB 12 2 13
More informationHadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)
Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationThe core source code of the edge detection of the Otsu-Canny operator in the Hadoop
Attachment: The core source code of the edge detection of the Otsu-Canny operator in the Hadoop platform (ImageCanny.java) //Map task is as follows. package bishe; import java.io.ioexception; import org.apache.hadoop.fs.path;
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationClick Stream Data Analysis Using Hadoop
Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Spring 2015 Click Stream Data Analysis Using Hadoop Krishna Chand Reddy Gaddam Governors
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationDatabases and Big Data Today. CS634 Class 22
Databases and Big Data Today CS634 Class 22 Current types of Databases SQL using relational tables: still very important! NoSQL, i.e., not using relational tables: term NoSQL popular since about 2007.
More informationA Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science
A Guide to Running Map Reduce Jobs in Java University of Stirling, Computing Science Introduction The Hadoop cluster in Computing Science at Stirling allows users with a valid user account to submit and
More informationImporting and Exporting Data Between Hadoop and MySQL
Importing and Exporting Data Between Hadoop and MySQL + 1 About me Sarah Sproehnle Former MySQL instructor Joined Cloudera in March 2010 sarah@cloudera.com 2 What is Hadoop? An open-source framework for
More informationGLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)
DE: A Scalable Framework for Efficient Analytics Florin Rusu (University of California, Merced) Alin Dobra (University of Florida) Big Data Analytics Big Data Storage is cheap ($100 for 1TB disk) Everything
More informationProcessing Large / Big Data through MapR and Pig
Processing Large / Big Data through MapR and Pig Arvind Kumar-Senior ERP Solution Architect / Manager Suhas Pande- Solution Architect (IT and Security) Abstract - We live in the data age. It s not easy
More informationVoldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation