Progress on Efficient Integration of Lustre* and Hadoop/YARN


Progress on Efficient Integration of Lustre* and Hadoop/YARN
Weikuan Yu, Robin Goldstone, Omkar Kulkarni, Bryon Neitzel
* Some names and brands may be claimed as the property of others.

MapReduce
- A simple data processing model to process big data
- Designed for commodity off-the-shelf hardware components
- Strong merits for big data analytics:
  - Scalability: increase throughput by increasing the number of nodes
  - Fault tolerance: quick and low-cost recovery from task failures
- YARN, the next generation of the Hadoop MapReduce implementation
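Not part of the original slides: a minimal sketch of the map and reduce interfaces this model exposes, using the canonical Hadoop WordCount example. The class and property names are the stock Hadoop MapReduce API, not anything specific to the Lustre integration described later.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit <word, 1> for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word after the shuffle.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```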

High-Level Overview of YARN
- Consists of the HDFS and MapReduce frameworks; exposes map and reduce interfaces
- ResourceManager and NodeManagers
- MRAppMaster, MapTask, and ReduceTask
[Figure: job flow among HDFS, the ResourceManager, the MRAppMaster, and the Map/Reduce tasks]

Supercomputers and Lustre*
- Lustre is popularly deployed on supercomputers
- A vast number of compute nodes (CNs) for computation
- A parallel pool of back-end storage servers, composed of a large pool of storage nodes
[Figure: compute nodes connected to Lustre over the interconnect]

Lustre for MapReduce-based Analytics?
- Desire: integrate Lustre as a storage solution
  - Understand the requirements of MapReduce on data organization, task placement, and data movement, and their implications for Lustre
- Approach:
  - Mitigate the impact of the centralized data store on Lustre
  - Reduce repetitive data movement between compute nodes and storage nodes
  - Cater to the preference of task scheduling for data locality

Overarching Goal
- Enable analytics shipping on Lustre* storage servers
  - Users ship their analytics jobs to storage nodes (SNs) on demand
- Retain the default I/O model for scientific applications, storing data to Lustre
- Enable in-situ analytics at the storage nodes
[Figure: simulation output written to Lustre, with analytics jobs shipped to the storage nodes]

Technical Objectives
- Segregate analytics and storage functionalities within the same storage nodes
  - Mitigate interference between YARN and Lustre*
- Develop coordinated data placement and task scheduling between Lustre and YARN
  - Enable and exploit data and task locality
- Improve intermediate data organization for efficient shuffling on Lustre

YARN and Lustre* Integration with Performance Segregation
- Leverage KVM to create virtual machine (VM) instances on the SNs
  - Create Lustre storage servers on the physical machines (PMs)
  - Run YARN programs and Lustre clients in the VMs
- Placement of YARN intermediate data: on Lustre or on local disks?
[Figure: YARN in a KVM guest on each physical machine, with the OSS and OSTs on the host; 8 OSTs in total]

Running Hadoop/YARN on KVM
- 50 TeraSort jobs, 1 GB input each; one job submitted every 3 seconds
- There is a large overhead caused by running YARN on KVM
- Running IOR on 6 other machines: the impact is not very significant
- KVM configuration on each node:
  - 4 cores, 6 GB memory
  - Using a different hard drive for Lustre*
  - Using a different network port
[Chart: average job execution time (s) when running alone on the PM, alone on KVM, with IOR-POSIX, and with IOR-MPIIO; the IOR cases are annotated at 4.9% and 6.8%]

KVM Overhead
- 4 cores are not enough for YARN jobs
- 6 cores help improve the performance of YARN
- Increasing memory from 4 GB to 6 GB has little effect when the number of cores is the bottleneck
[Chart: average job execution time (s) on the PM versus KVM with 4 cores/6 GB memory (62.6%), 6 cores/4 GB (31.8%), and 6 cores/6 GB (29.2%)]

YARN Memory Utilization
- Running YARN on the physical machines alone
- The NodeManager is given 8 GB of memory, with 1 GB per container and a 1 GB heap per task
- HDFS with local ext3 disks; intensive writes to HDFS (via local ext3)
[Chart: memory usage (GB) over execution time (sec) for YARN, ext3, YARN plus ext3, and total memory]
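Not in the original slides: a hedged sketch of the standard Hadoop/YARN properties that such a memory budget corresponds to. In a real deployment these live in yarn-site.xml and mapred-site.xml on the cluster; the values below simply mirror the slide's numbers (8 GB per NodeManager, 1 GB per container, 1 GB heap per task) and are not the authors' exact configuration files.

```java
import org.apache.hadoop.conf.Configuration;

public class MemoryConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Total memory the NodeManager may hand out to containers (8 GB).
    conf.setInt("yarn.nodemanager.resource.memory-mb", 8192);
    // Smallest container allocation the scheduler will grant (1 GB per container).
    conf.setInt("yarn.scheduler.minimum-allocation-mb", 1024);
    // Per-task container size and JVM heap (1 GB heap per task, as on the slide).
    conf.setInt("mapreduce.map.memory.mb", 1024);
    conf.setInt("mapreduce.reduce.memory.mb", 1024);
    conf.set("mapreduce.map.java.opts", "-Xmx1024m");
    conf.set("mapreduce.reduce.java.opts", "-Xmx1024m");

    System.out.println(conf.get("yarn.nodemanager.resource.memory-mb"));
  }
}
```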

Intermediate Data on Local Disk or Lustre*
- Place intermediate data on the local ext3 file system or on Lustre, which is mapped into the KVM guest (yarn.nodemanager.local-dirs)
- YARN and the Lustre clients are placed in the KVM guests; the OSS/OSTs run on the physical machines (8 OSS/OSTs + 8 OSCs)
- TeraSort (4 GB) and PageRank (1 GB) benchmarks were measured, with YARN's intermediate data directory configured on local ext3 or on Lustre
[Chart: job execution time (sec); TeraSort (4 GB) takes 97 s on local ext3 versus 149 s on Lustre (35%), and PageRank (1 GB) takes 301 s versus 364 s (17%)]
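Not in the original slides: a minimal sketch of the yarn.nodemanager.local-dirs setting the slide refers to. The directory paths are hypothetical, and in a real deployment the property is set in yarn-site.xml on every NodeManager rather than in code.

```java
import org.apache.hadoop.conf.Configuration;

public class LocalDirsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Case 1: intermediate (shuffle) data on the node-local ext3 disk.
    conf.set("yarn.nodemanager.local-dirs", "/data/ext3/yarn-local");

    // Case 2: intermediate data on the Lustre client mount visible inside
    // the KVM guest (hypothetical mount point).
    // conf.set("yarn.nodemanager.local-dirs", "/mnt/lustre/yarn-local");

    System.out.println(conf.get("yarn.nodemanager.local-dirs"));
  }
}
```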

Data Locality for YARN on Lustre*
[Figure: a YARN client submits a job to the ResourceManager, whose scheduler assigns containers via assignNodeLocalContainers(), assignRackLocalContainers(), and assignOffSwitchContainers(). The stripe location of each input file is obtained with lfs getstripe (e.g., lfs getstripe /lustre/stripe1 yields <stripe1, OST1>), producing mappings such as <stripe1, Slave1, Map1>, <stripe2, Slave2, Map2>, and <stripe3, Slave3, Map3>. The MRAppMaster then has the NodeManagers launch Map1, Map2, and Map3 in containers on the slaves whose OSC/OSS/OST pairs (OSS1/OST1 through OSS3/OST3, with the MDS on the master) hold stripe1, stripe2, and stripe3.]
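Not the authors' code: a small sketch of how a scheduler could discover which OSTs hold a file's stripes by shelling out to lfs getstripe, as the figure suggests. The obdidx-table parsing is defensive because the output layout can vary between Lustre versions, and the OST-index-to-hostname table is a hypothetical, site-specific mapping.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class StripeLocality {

  /** Run `lfs getstripe <path>` and collect the OST indices (obdidx column). */
  static List<Integer> ostIndices(String path) throws Exception {
    Process p = new ProcessBuilder("lfs", "getstripe", path).start();
    List<Integer> osts = new ArrayList<>();
    try (BufferedReader r =
             new BufferedReader(new InputStreamReader(p.getInputStream()))) {
      boolean inTable = false;
      String line;
      while ((line = r.readLine()) != null) {
        String t = line.trim();
        if (t.startsWith("obdidx")) { inTable = true; continue; } // table header
        if (!inTable || t.isEmpty()) continue;
        // The first column of each table row is the OST index; parse defensively.
        String first = t.split("\\s+")[0];
        try { osts.add(Integer.parseInt(first)); } catch (NumberFormatException ignore) { }
      }
    }
    p.waitFor();
    return osts;
  }

  public static void main(String[] args) throws Exception {
    // Hypothetical OST-index -> hostname table for an 8-OST testbed.
    String[] ostToHost = {"slave1", "slave2", "slave3", "slave4",
                          "slave5", "slave6", "slave7", "slave8"};
    for (int ost : ostIndices(args[0])) {
      if (ost < ostToHost.length) {
        System.out.println("stripe on OST" + ost + " -> prefer host " + ostToHost[ost]);
      }
    }
  }
}
```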

Background of the TeraSort Test
- Four cases are compared, varying where intermediate data is placed (Lustre* or local disks) and whether map tasks are scheduled with data locality:
  - lustre_shfl_opt: on Lustre, with locality
  - lustre_shfl_orig: on Lustre, without locality
  - local_shfl_opt: on local disks, with locality
  - local_shfl_orig: on local disks, without locality
- Test environment: Lustre 2.5, with datasets from 10 GB to 30 GB and a 128 MB stripe size and block size

Average Number of Local Map Tasks
[Chart: average number of local map tasks versus input size (10 GB to 30 GB) for local_shfl_orig, local_shfl_opt, lustre_shfl_orig, and lustre_shfl_opt]
- local_shfl_opt and lustre_shfl_opt achieve high locality
- The other two cases have low locality

TeraSort under Lustre* 2.5
[Chart: execution time (sec) versus input size (10 GB to 30 GB) for the four cases]
- On average, local_shfl_orig has the best performance
- lustre_shfl_opt falls between the best and worst cases

Data Flow in Original YARN over HDFS
- This figure shows all of the disk I/O in original Hadoop
- Map task: input split, spilled data, MapOutput
- Reduce task: shuffled data, merged data, output data
[Figure: the MapTask pipeline (map, sortAndSpill, mergeParts) reads the input split from HDFS and produces spilled data and the MapOutput; the ReduceTask pipeline (shuffle, MergeManagerImpl with repetitive merge, reduce) consumes the shuffled and merged data and writes the output back to HDFS]

Data Flow in YARN over Lustre*
- This figure shows all of the disk I/O of YARN over Lustre
- Avoid as much disk I/O as possible
- Speed up reading the input data and writing the output data
[Figure: the same MapTask and ReduceTask pipelines, but with the input split, spilled data, MapOutput, merged data, and output all placed on Lustre]

New Implementation Design Review
- Improve I/O performance
  - Read from and write to the local OST
  - Avoid unnecessary shuffle spill and repetitive merge
  - After all MapOutput has been written, launch the reduce tasks to read the data
- Avoid Lustre* write/read lock issues?
- Reduce Lustre write/read contention?
- Reduce network contention
  - Most data is written to and read from the local OST through the virtio bridged network
  - Reserve more network bandwidth for the Lustre shuffle
[Figure: the revised pipeline keeps the input split, spilled data, MapOutput, and merged data local where possible, avoids the shuffle spill and repetitive merge, and shuffles through Lustre]
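Not the authors' implementation: a conceptual sketch of the idea that when map output lives on a shared Lustre directory, a reducer can read each map's partition file directly once it has been fully written, instead of pulling it through the usual HTTP shuffle. The directory layout, file names, and mount point are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LustreShuffleSketch {
  // Hypothetical shared job directory on the Lustre mount.
  static final Path JOB_DIR = Paths.get("/mnt/lustre/yarn-shuffle/job_0001");

  /** A map task writes its final, merged output for one reduce partition. */
  static void writeMapOutput(int mapId, int partition, String data) throws IOException {
    Path dir = JOB_DIR.resolve("partition_" + partition);
    Files.createDirectories(dir);
    // One file per (map, partition), written once after the map-side merge.
    Files.write(dir.resolve("map_" + mapId + ".out"),
                data.getBytes(StandardCharsets.UTF_8));
  }

  /** A reduce task reads its partition's files straight from Lustre. */
  static void readPartition(int partition) throws IOException {
    Path dir = JOB_DIR.resolve("partition_" + partition);
    try (Stream<Path> files = Files.list(dir)) {
      files.sorted().forEach(f -> {
        try {
          System.out.println(f.getFileName() + ": "
              + new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }

  public static void main(String[] args) throws IOException {
    writeMapOutput(0, 0, "apple\t3\n");
    writeMapOutput(1, 0, "banana\t2\n");
    readPartition(0); // the "shuffle" becomes direct reads of the shared files
  }
}
```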

Evaluation Results
- SATA disk for each OST, 10G networking, Lustre* 2.5
- Running the TeraSort benchmark with 1 master node and 8 slave nodes
- The optimized YARN performs about 21% better than the original YARN on average
[Chart: execution time (sec) of YARN-Lustre versus YARN-Lustre-Opt for 8 GB, 16 GB, 24 GB, and 32 GB inputs]

Summary
- Explored the design of an analytics shipping framework that integrates Lustre* and YARN
- Provided end-to-end optimizations on data organization, data movement, and task scheduling for efficient integration of Lustre and YARN
- Demonstrated its performance benefits to analytics applications

Acknowledgment
- Contributions from Cong Xu, Yandong Wang, and Huansong Fu.
- Awards to Auburn University from Intel, the Alabama Department of Commerce, and the US National Science Foundation.
- This work was also performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-647813).