Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse
|
|
- Damian Doyle
- 5 years ago
- Views:
Transcription
1 Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo
2 Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core processors Large capacity of main memory High-speed network Focus mainly on compute-intensive applications Data-intensive workloads are emerging as supercomputing problems Graph processing Pre-processing of simulation data
3 Thanh-Chung Dao 3 MapReduce Simple parallel paradigm to process large datasets Hidden parallelization & communication PageRank example Input Splitting Mapping Shuffling Reducing Result Function Mapper Input PageA à PageB, PageC Begin N = outbound links For each outbound link output <Page, 1/N> End PageA à PageB, PageC PageB à PageC PageC à PageA, PageB PageA à PageB, PageC PageB à PageC PageC à PageA, PageB <PageB, 0.5> <PageC, 0.5> Shuffling Rank contribution Done automatically <PageC, 1> (Users can ignore) <PageA, 0.5> <PageB, 0.5> <PageA, Function 0.5> Reducer <PageA, 0.5> Input <PageA, x 1 >, <PageA, x n > Begin rank = 0 <PageB, 0.5> <PageB, 1> <PageB, For 0.5> each item x i rank += x i output <PageA, rank> End <PageC, 0.5> <PageC, 1> <PageC, 1.5> PageA 0.5 PageB 1 PageC 1.5
4 Thanh-Chung Dao 4 Hadoop MapReduce Standard of MapReduce implementation Provide easy-to-use MapReduce APIs TCP/IP-based communication Designed to run on commodity clusters Lab clusters, or Amazon EC2 Scalability (32,000 nodes at Yahoo) & Resilience Written in Java
5 Thanh-Chung Dao 5 Improving Hadoop MapReduce Performance on Supercomputers Hadoop MapReduce is good choice on supercomputers Maturity Productivity Resource allocation at runtime (# of processes, memory, CPU) Supercomputer Static Hadoop Dynamic Communication MPI TCP/IP Workload Compute-intensive Data-intensive
6 Thanh-Chung Dao 6 Our Approach JVM Reuse Statically create JVM processes and dynamically allocate to Hadoop tasks Enable efficient MPI communication by Hadoop tasks Statically created processes can exploit efficient MPI Dynamic allocation enables to use the original Hadoop implementation Shorten start-up time of processes Technique Process pool is used to implement JVM Reuse Minimize changes of the original Hadoop engine
7 Thanh-Chung Dao 7 Why MPI is required for Hadoop The de facto high-speed communication on supercomputers Improve slow MapReduce shuffling Enable Hadoop to co-host traditional MPI applications Combine MPI and MapReduce models Rich data analysis workflow Throughput (Mbps) Efficient data sharing between MPI and MapReduce models E.g. MPI can access data located at Hadoop file system (HDFS) Throughput (Mbps) MPI TCP On FX10 supercomputer 10 times faster Message size (Bytes)
8 Thanh-Chung Dao 8 Slow MapReduce shuffling on Hadoop TCP/IP-based communication JVM-Bypass (Wang et al., IPDPS 2013) MapTasks Map output 1 Map output n HTTP Servlet Server ReduceTasks Sort & Merge nodes Local disk Multiple requests at once Reducing Mapping Phase Shuffling Phase Reducing Phase
9 Thanh-Chung Dao 9 Dynamic Process Creation on MPI Discouraged on supercomputers Reasons of performance Collective mechanism (MPISpawn) Gang scheduling (error-prone if not enough resource) Gerbil (Xu el al., CCGrid 2015) Co-hosting MPI applications on Hadoop Creating dynamically processes Its experiments showed significant overhead Resources should be specified before running MPI applications Number of processes is known (static) Memory and CPU cores
10 Thanh-Chung Dao 10 Dynamic Process Creation on Hadoop Required Resources are allocated on demand to run MapReduce applications Number of processes is unknown (dynamic)
11 Thanh-Chung Dao 11 Dynamic Process Creation on Hadoop Required Resources are allocated on demand to run MapReduce applications Number of processes is unknown (dynamic) 1 Master node 2 n A Node
12 Thanh-Chung Dao 12 Dynamic Process Creation on Hadoop Required Resources are allocated on demand to run MapReduce applications Number of processes is unknown (dynamic) 6 tasks 1 User Job Submission Master node 6 tasks 2 8 tasks n A Node Request sending
13 Thanh-Chung Dao 13 Dynamic Process Creation on Hadoop Required Resources are allocated on demand to run MapReduce applications Number of processes is unknown (dynamic) 6 tasks 1 Process creation 6 processes User Job Submission Master node 6 tasks 2 Process creation 6 processes 8 tasks n Process creation 8 processes Processes A Node Request sending Each task is run on a process
14 Thanh-Chung Dao 14 Dynamic Process Creation on Hadoop Required Resources are allocated on demand to run MapReduce applications Number of processes is unknown (dynamic) 6 tasks 1 Process creation Task running User Job Submission Master node 6 tasks 2 Process creation Task running 8 tasks n Process creation Task running Processes A Node Request sending Each task is run on a process
15 Thanh-Chung Dao 15 Dynamic Process Creation on Hadoop Required Resources are allocated on demand to run MapReduce applications Number of processes is unknown (dynamic) 6 tasks 1 Process creation Terminated User Job Submission Master node 6 tasks 2 Process creation Terminated 8 tasks n Process creation Terminated Processes A Node Request sending Each task is run on a process
16 Thanh-Chung Dao 16 Dynamic Process Creation on Hadoop Required Resources are allocated on demand to run MapReduce applications Number of processes is unknown (dynamic) 1 User Job Completion Master node 2 n A Node Request sending
17 Thanh-Chung Dao 17 Idea of Reusing added Idle JVM processes Number of processes is statically fixed 1 idle idle idle Master node 2 idle idle idle Processes A Node n idle idle idle
18 Thanh-Chung Dao 18 Idea of Reusing added Idle JVM processes Number of processes is statically fixed 6 tasks 1 idle idle idle User Job Submission Master node 6 tasks 2 idle idle idle Processes A Node 8 tasks n Request sending idle idle idle
19 Thanh-Chung Dao 19 Idea of Reusing added Idle JVM processes Number of processes is statically fixed 6 tasks 1 Allocation Busy Busy idle User Job Submission Processes A Node Master node 6 tasks 8 tasks 2 n Request sending Allocation Busy Busy idle Busy Busy Busy
20 Thanh-Chung Dao 20 Idea of Reusing added Idle JVM processes Number of processes is statically fixed 6 tasks 1 Running Running idle User Job Submission Master node 6 tasks 2 Running Running idle Processes A Node 8 tasks n Request sending Running Running Running
21 Thanh-Chung Dao 21 Idea of Reusing added Idle JVM processes Number of processes is statically fixed 6 tasks 1 Cleanup Cleanup idle User Job Submission Master node 6 tasks 2 Cleanup Cleanup idle Processes A Node 8 tasks n Request sending Cleanup Cleanup Cleanup
22 Thanh-Chung Dao 22 Idea of Reusing added Idle JVM processes Number of processes is statically fixed 6 tasks 1 idle idle idle User Job Submission Master node 6 tasks 2 idle idle idle Processes A Node 8 tasks n Request sending idle idle idle
23 Thanh-Chung Dao 23 Idea of Reusing added Idle JVM processes Number of processes is statically fixed 1 idle idle idle User Job Completion Master node 2 idle idle idle Processes A Node n Request sending idle idle idle
24 Thanh-Chung Dao 24 JVM Reuse enables MPI communication MPI communication is established at the beginning JVM Reuse keeps processes running MPI connection is always available
25 Thanh-Chung Dao 25 JVM Reuse shortens start-up time JVM start-up flow of Program A Cmd: java A Class loader subsystem OS level process creation Class loading Class linking (verification & initializing) Invoke main() method of A Execution engine JIT compiler After A finishes, Program B wants to reuse JVM of A Execute A instructions Process creation & class loader are skipped Invoke main() method of B Execute B instructions
26 Thanh-Chung Dao 26 Iterative jobs benefit from JVM Reuse Iterative jobs Many short running JVM processes PageRank is an example Iterative job flow Yes, then quit No Stop Cond? Initial data Pre-processing job Job A Maps use results of the previous Iteration Job A Map Reduce Map Reduce Map Reduce
27 Thanh-Chung Dao 27 Implementation: Process Pool Process creation request Hadoop YARN (Node manager) Process Request sending State changing Pool Manager Round-robin scheduling Allocation Set busy Export env. variables Create new class loader Invoke main() method Busy Busy idle Busy Busy Busy idle De-allocation Clean static fields Unset busy Process Pool on each node
28 Thanh-Chung Dao 28 Our MPI shuffling design MapTasks Map output 2 Map output n Node 2 Local disk MPI send/recv Shuffle Manager One request at once ReduceTask 2 Sort & Merge Reducing Mapping Phase Shuffling Phase Reducing Phase
29 Thanh-Chung Dao 29 Reuse s Technical Issues Loading user s classes The original flow exports CLASSPATH before running Reflection Load user s classes at runtime Create a new class loader for each user Avoid class confliction Clean-up Static fields Security problem e.g. UserGroup static field Must be reset Current design Reset user information and job conf. static fields
30 Thanh-Chung Dao 30 Other Technical Issues Enable Hadoop YARN to host traditional MPI applications YARN is a resource manager Work in progress MPI AppMaster Monitor MPI ranks MPI Container Host a rank Avoid gang scheduling of MPI Work in progress
31 Thanh-Chung Dao 31 Evaluation Hadoop version v2.2.0 Changes in our implementation Line of code / total of Hadoop: ~1000 / 1,851,473 Number of classes / total of Hadoop: 9 / 35142
32 Thanh-Chung Dao 32 Cluster setup FX10 supercomputer Sparc64 Ixfx GHz (16 cores) & 32GB RAM MPI over Tofu interconnection (5GB/s) Central storage Hadoop setup One master and many slaves OpenJDK 7 HDFS is run on the central storage OpenMPI 1.6 Java MPI binding (Vega-Gisbert et al.) MCA parameter: plm_ple_cpu_affinity = 0
33 Thanh-Chung Dao 33 Evaluation of JVM Reuse MPI benefit MPI vs. TCP/IP shuffling Tera-sort job Run on 32 FX10 nodes 4-slot pool & -Xmx4096 Start-up time JVM Reuse vs. the original PageRank iterative job 400 GB wikipedia data Run on 8 FX10 nodes 6-slot pool & -Xmx4096
34 Thanh-Chung Dao 34 MPI vs. TCP/IP shuffling Tera-sort Execution Time (s) Nonblocking TCP Shuffle Blocking TCP shuffle Blocking MPI shuffle up to 10% faster Input size (GB) Nonblocking : accept multi-connection at once Blocking : accept one connection at once
35 Thanh-Chung Dao 35 Shorten start-up time (PageRank) Task ID Start-up (JVM & User info) Task initializing Shuffling (at Reducers) Data reading Task running Task finishing (MapOutput writing) Start-up time 1 st iteration 2 nd iteration Original Hadoop 0s 40s 80s 120s 160s 200s 240s 280s 320s 360s 400s Task ID nd iteration 50 seconds faster JVM Reuse Hadoop 0s 40s 80s 120s 160s 200s 240s 280s 320s 360s 400s Time (s)
36 Thanh-Chung Dao 36 More iterations Sum of start-up time (s) Original Hadoop JVM Reuse-based Hadoop Iteration number (n) Execution Time (s) Original Hadoop JVM Reuse-based Hadoop Iteration number (n) Execution Time (s) Start-up time Computation time Original Hadoop Approach JVM Reuse Hadoop Sum of start-up time Execution time Ratio
37 Thanh-Chung Dao 37 Related work M3R (VLDB 2012) Also apply JVM Reuse to enable in-memory MapReduce Not providing any optimization of JVM Reuse and its evaluation Hadoop MapReduce (HMR) engine is written in X10 We keep the original HMR engine with minimum changes JVM Reuse in Hadoop v1 (2012) Only for a single job JVM processes are terminated after their job is completed Gerbil: MPI + YARN (CCGrid 2015) Hadoop Yarn co-hosts MPI applications Long start-up time and significant overhead DataMPI (IPDPS 2014) Hadoop-like MapReduce implementation using MPI & C JVM-Bypass (IPDPS 2013) C-based shuffling engine & RDMA supported We focus on using MPI over Hadoop processes
38 Thanh-Chung Dao 38 Summary Improving Hadoop MapReduce performance on supercomputers Approach: JVM Reuse Statically create JVM processes and dynamically allocate to Hadoop tasks Enable efficient MPI communication on Hadoop Shorten start-up time Minimum changes of the original Hadoop Future work JVM Reuse drawback Affect CPU-bound tasks Co-host MPI applications more efficiently Full cleanup
HPC-Reuse: efficient process creation for running MPI and Hadoop MapReduce on supercomputers
Chung Dao 1 HPC-Reuse: efficient process creation for running MPI and Hadoop MapReduce on supercomputers Thanh-Chung Dao and Shigeru Chiba The University of Tokyo, Japan Chung Dao 2 Running Hadoop and
More informationSEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers
1 SEMem: deployment of MPI-based in-memory storage for Hadoop on supercomputers and Shigeru Chiba The University of Tokyo, Japan 2 Running Hadoop on modern supercomputers Hadoop assumes every compute node
More informationProgress on Efficient Integration of Lustre* and Hadoop/YARN
Progress on Efficient Integration of Lustre* and Hadoop/YARN Weikuan Yu Robin Goldstone Omkar Kulkarni Bryon Neitzel * Some name and brands may be claimed as the property of others. MapReduce l l l l A
More informationHadoop Map Reduce 10/17/2018 1
Hadoop Map Reduce 10/17/2018 1 MapReduce 2-in-1 A programming paradigm A query execution engine A kind of functional programming We focus on the MapReduce execution engine of Hadoop through YARN 10/17/2018
More informationMixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp
MixApart: Decoupled Analytics for Shared Storage Systems Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp Hadoop Pig, Hive Hadoop + Enterprise storage?! Shared storage
More informationHadoop MapReduce Framework
Hadoop MapReduce Framework Contents Hadoop MapReduce Framework Architecture Interaction Diagram of MapReduce Framework (Hadoop 1.0) Interaction Diagram of MapReduce Framework (Hadoop 2.0) Hadoop MapReduce
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationGuoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14.
Guoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14 Page 1 Introduction & Notations Multi-Job optimization Evaluation Conclusion
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationMap Reduce & Hadoop Recommended Text:
Map Reduce & Hadoop Recommended Text: Hadoop: The Definitive Guide Tom White O Reilly 2010 VMware Inc. All rights reserved Big Data! Large datasets are becoming more common The New York Stock Exchange
More informationIntroduction to MapReduce. Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng.
Introduction to MapReduce Instructor: Dr. Weikuan Yu Computer Sci. & Software Eng. Before MapReduce Large scale data processing was difficult! Managing hundreds or thousands of processors Managing parallelization
More informationMATE-EC2: A Middleware for Processing Data with Amazon Web Services
MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering
More informationMOHA: Many-Task Computing Framework on Hadoop
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More information2/26/2017. For instance, consider running Word Count across 20 splits
Based on the slides of prof. Pietro Michiardi Hadoop Internals https://github.com/michiard/disc-cloud-course/raw/master/hadoop/hadoop.pdf Job: execution of a MapReduce application across a data set Task:
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationSpark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies
Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay 1 Apache Spark - Intro Spark within the Big Data ecosystem Data Sources Data Acquisition / ETL Data Storage Data Analysis / ML Serving 3 Apache
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More information2/4/2019 Week 3- A Sangmi Lee Pallickara
Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1
More informationIntroduction To YARN. Adam Kawa, Spotify The 9 Meeting of Warsaw Hadoop User Group 2/23/13
Introduction To YARN Adam Kawa, Spotify th The 9 Meeting of Warsaw Hadoop User Group About Me Data Engineer at Spotify, Sweden Hadoop Instructor at Compendium (Cloudera Training Partner) +2.5 year of experience
More informationBig Data for Engineers Spring Resource Management
Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models
More informationMapReduce Design Patterns
MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together
More informationHaLoop Efficient Iterative Data Processing on Large Clusters
HaLoop Efficient Iterative Data Processing on Large Clusters Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst University of Washington Department of Computer Science & Engineering Presented
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationLarge-Scale GPU programming
Large-Scale GPU programming Tim Kaldewey Research Staff Member Database Technologies IBM Almaden Research Center tkaldew@us.ibm.com Assistant Adjunct Professor Computer and Information Science Dept. University
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationProgramming Models MapReduce
Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases
More informationAndrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why
More informationDon t Get Caught In the Cold, Warm-up Your JVM Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems
Don t Get Caught In the Cold, Warm-up Your JVM Understand and Eliminate JVM Warm-up Overhead in Data-parallel Systems David Lion, Adrian Chiu, Hailong Sun*, Xin Zhuang, Nikola Grcevski, Ding Yuan University
More informationExpert Lecture plan proposal Hadoop& itsapplication
Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile
More informationYARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa
YARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa ozawa.tsuyoshi@lab.ntt.co.jp ozawa@apache.org About me Tsuyoshi Ozawa Research Engineer @ NTT Twitter: @oza_x86_64 Over 150 reviews in 2015
More informationCloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe
Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability
More informationBig Data 7. Resource Management
Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part
More informationMAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti
International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationMapReduce, Hadoop and Spark. Bompotas Agorakis
MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationHADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together!
HADOOP 3.0 is here! Dr. Sandeep Deshmukh sandeep@sadepach.com Sadepach Labs Pvt. Ltd. - Let us grow together! About me BE from VNIT Nagpur, MTech+PhD from IIT Bombay Worked with Persistent Systems - Life
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationCCA-410. Cloudera. Cloudera Certified Administrator for Apache Hadoop (CCAH)
Cloudera CCA-410 Cloudera Certified Administrator for Apache Hadoop (CCAH) Download Full Version : http://killexams.com/pass4sure/exam-detail/cca-410 Reference: CONFIGURATION PARAMETERS DFS.BLOCK.SIZE
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationDistributed Filesystem
Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the
More informationHiTune. Dataflow-Based Performance Analysis for Big Data Cloud
HiTune Dataflow-Based Performance Analysis for Big Data Cloud Jinquan (Jason) Dai, Jie Huang, Shengsheng Huang, Bo Huang, Yan Liu Intel Asia-Pacific Research and Development Ltd Shanghai, China, 200241
More informationResource and Performance Distribution Prediction for Large Scale Analytics Queries
Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia
More informationBatch Processing Basic architecture
Batch Processing Basic architecture in big data systems COS 518: Distributed Systems Lecture 10 Andrew Or, Mike Freedman 2 1 2 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores 3
More informationCS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab
CS6030 Cloud Computing Ajay Gupta B239, CEAS Computer Science Department Western Michigan University ajay.gupta@wmich.edu 276-3104 1 Acknowledgements I have liberally borrowed these slides and material
More informationBig Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.
Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationMapReduce and Friends
MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web
More informationHigh Performance Computing on MapReduce Programming Framework
International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma
More informationBig Data Hadoop Stack
Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware
More informationChisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique
Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.
More informationVoldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data
More informationFacilitating Consistency Check between Specification & Implementation with MapReduce Framework
Facilitating Consistency Check between Specification & Implementation with MapReduce Framework Shigeru KUSAKABE, Yoichi OMORI, Keijiro ARAKI Kyushu University, Japan 2 Our expectation Light-weight formal
More informationPerformance Best Practices Paper for IBM Tivoli Directory Integrator v6.1 and v6.1.1
Performance Best Practices Paper for IBM Tivoli Directory Integrator v6.1 and v6.1.1 version 1.0 July, 2007 Table of Contents 1. Introduction...3 2. Best practices...3 2.1 Preparing the solution environment...3
More informationHigh Performance Data Analytics for Numerical Simulations. Bruno Raffin DataMove
High Performance Data Analytics for Numerical Simulations Bruno Raffin DataMove bruno.raffin@inria.fr April 2016 About this Talk HPC for analyzing the results of large scale parallel numerical simulations
More informationRESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING
RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,
More informationFLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568
FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected
More information7 Deadly Hadoop Misconfigurations. Kathleen Hadoop Talks Meetup, 27 March 2014
7 Deadly Hadoop Misconfigurations Kathleen Ting kathleen@apache.org @kate_ting Hadoop Talks Meetup, 27 March 2014 Who Am I? Started 3 yr ago as 1 st Cloudera Support Eng Now manages Cloudera s 2 largest
More informationCamdoop Exploiting In-network Aggregation for Big Data Applications Paolo Costa
Camdoop Exploiting In-network Aggregation for Big Data Applications costa@imperial.ac.uk joint work with Austin Donnelly, Antony Rowstron, and Greg O Shea (MSR Cambridge) MapReduce Overview Input file
More informationMapReduce: Simplified Data Processing on Large Clusters 유연일민철기
MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,
More informationGlobal Journal of Engineering Science and Research Management
A FUNDAMENTAL CONCEPT OF MAPREDUCE WITH MASSIVE FILES DATASET IN BIG DATA USING HADOOP PSEUDO-DISTRIBUTION MODE K. Srikanth*, P. Venkateswarlu, Ashok Suragala * Department of Information Technology, JNTUK-UCEV
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models RCFile: A Fast and Space-efficient Data
More informationTP1-2: Analyzing Hadoop Logs
TP1-2: Analyzing Hadoop Logs Shadi Ibrahim January 26th, 2017 MapReduce has emerged as a leading programming model for data-intensive computing. It was originally proposed by Google to simplify development
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More informationInternational Journal of Advance Engineering and Research Development. A Study: Hadoop Framework
Scientific Journal of Impact Factor (SJIF): e-issn (O): 2348- International Journal of Advance Engineering and Research Development Volume 3, Issue 2, February -2016 A Study: Hadoop Framework Devateja
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationCS 345A Data Mining. MapReduce
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes
More informationWelcome to the New Era of Cloud Computing
Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationWHITEPAPER. Improve Hadoop Performance with Memblaze PBlaze SSD
Improve Hadoop Performance with Memblaze PBlaze SSD Improve Hadoop Performance with Memblaze PBlaze SSD Exclusive Summary We live in the data age. It s not easy to measure the total volume of data stored
More informationParallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case
1 / 39 Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case PAN Jie 1 Yann LE BIANNIC 2 Frédéric MAGOULES 1 1 Ecole Centrale Paris-Applied Mathematics and Systems
More informationiihadoop: an asynchronous distributed framework for incremental iterative computations
DOI 10.1186/s40537-017-0086-3 RESEARCH Open Access iihadoop: an asynchronous distributed framework for incremental iterative computations Afaf G. Bin Saadon * and Hoda M. O. Mokhtar *Correspondence: eng.afaf.fci@gmail.com
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationLarge Scale Processing with Hadoop
Large Scale Processing with Hadoop William Palmer Some slides courtesy of Per Møldrup-Dalum (State and University Library, Denmark) and Sven Schlarb (Austrian National Library) SCAPE Information Day British
More informationAnalysis in the Big Data Era
Analysis in the Big Data Era Massive Data Data Analysis Insight Key to Success = Timely and Cost-Effective Analysis 2 Hadoop MapReduce Ecosystem Popular solution to Big Data Analytics Java / C++ / R /
More informationIntroduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)
Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction
More informationResilient Distributed Datasets
Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationToday s content. Resilient Distributed Datasets(RDDs) Spark and its data model
Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,
More informationResearch on Load Balancing in Task Allocation Process in Heterogeneous Hadoop Cluster
2017 2 nd International Conference on Artificial Intelligence and Engineering Applications (AIEA 2017) ISBN: 978-1-60595-485-1 Research on Load Balancing in Task Allocation Process in Heterogeneous Hadoop
More informationWorkload Characterization and Optimization of TPC-H Queries on Apache Spark
Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research
More informationApache Hadoop.Next What it takes and what it means
Apache Hadoop.Next What it takes and what it means Arun C. Murthy Founder & Architect, Hortonworks @acmurthy (@hortonworks) Page 1 Hello! I m Arun Founder/Architect at Hortonworks Inc. Lead, Map-Reduce
More information