CloudBATCH: A Batch Job Queuing System on Clouds with Hadoop and HBase. Chen Zhang Hans De Sterck University of Waterloo

Similar documents
Distributed Transactions in Hadoop's HBase and Google's Bigtable

MOHA: Many-Task Computing Framework on Hadoop

PARALLEL PROGRAM EXECUTION SUPPORT IN THE JGRID SYSTEM

Mixing and matching virtual and physical HPC clusters. Paolo Anedda

2/26/2017. For instance, consider running Word Count across 20 splits

Large Scale Computing Infrastructures

Hadoop On Demand: Configuration Guide

CHAPTER 7 CONCLUSION AND FUTURE SCOPE

Work Queue + Python. A Framework For Scalable Scientific Ensemble Applications

GoDocker. A batch scheduling system with Docker containers

L3.4. Data Management Techniques. Frederic Desprez Benjamin Isnard Johan Montagnat

Embedded Technosolutions

Chapter 3. Design of Grid Scheduler. 3.1 Introduction

April Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.

Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,

Building a Data-Friendly Platform for a Data- Driven Future

Lecture 10.1 A real SDN implementation: the Google B4 case. Antonio Cianfrani DIET Department Networking Group netlab.uniroma1.it

SmartSuspend. Achieve 100% Cluster Utilization. Technical Overview

HPC learning using Cloud infrastructure

PROOF-Condor integration for ATLAS

Chapter 5. The MapReduce Programming Model and Implementation

An Introduction to Virtualization and Cloud Technologies to Support Grid Computing

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

Dynamic Virtual Clusters in a Grid Site Manager

MERCED CLUSTER BASICS Multi-Environment Research Computer for Exploration and Discovery A Centerpiece for Computational Science at UC Merced

CHAPTER 3 GRID MONITORING AND RESOURCE SELECTION

Scale-out Storage Solution and Challenges Mahadev Gaonkar igate

Automating Real-time Seismic Analysis

An agent-based peer-to-peer grid computing architecture

AWS Solution Architecture Patterns

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc. The study on magnanimous data-storage system based on cloud computing

GT 4.2.0: Community Scheduler Framework (CSF) System Administrator's Guide

Grid Scheduling Architectures with Globus

The Gearman Cookbook OSCON Eric Day Senior Software Rackspace

Isolation Forest for Anomaly Detection

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

Computational Mini-Grid Research at Clemson University

IBM FileNet Content Manager and IBM GPFS

Workload Aware Load Balancing For Cloud Data Center

EnterpriseLink Benefits

Parameter Sweeping Programming Model in Aneka on Data Mining Applications

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Sentinet for Microsoft Azure SENTINET

PART B UNIT II COMMUNICATION IN DISTRIBUTED SYSTEM PART A

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

REMEM: REmote MEMory as Checkpointing Storage

In contrast to symmetric virtualisation, in asymmetric virtualisation the data flow is separated from the control flow. This is achieved by moving

Grid Compute Resources and Job Management

Getting Started with Amazon EC2 and Amazon SQS

A New HadoopBased Network Management System with Policy Approach

Lecture 11 Hadoop & Spark

Grid Computing in SAS 9.4

A Compact Computing Environment For A Windows PC Cluster Towards Seamless Molecular Dynamics Simulations

A Study on Load Balancing Techniques for Task Allocation in Big Data Processing* Jin Xiaohong1,a, Li Hui1, b, Liu Yanjun1, c, Fan Yanfang1, d

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Scaling Data Center Application Infrastructure. Gary Orenstein, Gear6

Chapter 1: Introduction 1/29

Sparrow. Distributed Low-Latency Spark Scheduling. Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Software Architecture Patterns

Large Scale Sky Computing Applications with Nimbus

1. Which programming language is used in approximately 80 percent of legacy mainframe applications?

Design and Implementation of High Performance and Availability Java RMI Server Group

The LGI Pilot job portal. EGI Technical Forum 20 September 2011 Jan Just Keijser Willem van Engen Mark Somers

Adaptive Cluster Computing using JavaSpaces

PBS PROFESSIONAL VS. MICROSOFT HPC PACK

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

A Network-aware Scheduler in Data-parallel Clusters for High Performance

Parallel and Distributed Systems. Programming Models. Why Parallel or Distributed Computing? What is a parallel computer?

CISC 7610 Lecture 2b The beginnings of NoSQL

Certified Big Data Hadoop and Spark Scala Course Curriculum

Distributed Systems Conclusions & Exam. Brian Nielsen

Consolidated Cluster Systems for Data Centers in the Cloud Age: a Survey and Analysis

Batch Systems. Running calculations on HPC resources

Identity Management and Resource Allocation in the Network Virtualization Environment

Cycle Sharing Systems

A Robust Cloud-based Service Architecture for Multimedia Streaming Using Hadoop

Grid Computing in SAS 9.2. Second Edition

Experiment-Driven Evaluation of Cloud-based Distributed Systems

Cloud Computing. Up until now

Distributed Systems Conclusions & Exam. Brian Nielsen

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

DEPLOYMENT ROADMAP May 2015

The Google File System

I Tier-3 di CMS-Italia: stato e prospettive. Hassen Riahi Claudio Grandi Workshop CCR GRID 2011

VMware vsphere Customized Corporate Agenda

Complete Data Protection & Disaster Recovery Solution

INTRODUCTION TO NEXTFLOW

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Overview. Distributed Systems. Distributed Software Architecture Using Middleware. Components of a system are not always held on the same host

Tuning Cognos ReportNet for a High Performance Environment

Sky Computing on FutureGrid and Grid 5000 with Nimbus. Pierre Riteau Université de Rennes 1, IRISA INRIA Rennes Bretagne Atlantique Rennes, France

Outline. Definition of a Distributed System Goals of a Distributed System Types of Distributed Systems

SCALABLE DISTRIBUTED DEEP LEARNING

Job sample: SCOPE (VLDBJ, 2012)

Configuring and Deploying Hadoop Cluster Deployment Templates

Unit 5: Distributed, Real-Time, and Multimedia Systems

A Distributed Sensor Network Management System. Anders Hustvedt. A thesis submitted to the. Faculty of University of Colorado in partial

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Using Global Behavior Modeling to improve QoS in Cloud Data Storage Services

IN THE LAST decade a large number of Web sites offering

Transcription:

CloudBATCH: A Batch Job Queuing System on Clouds with Hadoop and HBase Chen Zhang Hans De Sterck University of Waterloo

Outline Introduction Motivation Related Work System Design Future Work

Introduction - Motivation Users have hybrid computing needs Legacy programs dispatched by traditional cluster systems MapReduce programs requiring Hadoop Hadoop is incompatible with traditional cluster management systems Hadoop is yet another system to manage cluster machines Administrative barriers for deploying dedicated Hadoop cluster on existing cluster machines Using Hadoop on public clouds (Amazon) is not cost-effective for large scale data processing tasks CloudBATCH makes it possible to Use Hadoop to manage clusters for both legacy and MapReduce computing needs

Introduction CloudBATCH CloudBATCH is built on top of Hadoop (job execution) and HBase (META-data management) with a scalable architecture where there is no single point of failure What CloudBATCH can do Batch job queuing for legacy applications besides MapReduce jobs Job queue management with individually associated policies, user access control, job priority, pre-scheduled jobs, etc. Transparently handle jobs with simple non-nfs file staging needs Task level fault tolerance against machine failures Work complementarily with Hadoop schedulers, such as setting the minimum number of node allocation guaranteed to each queue The HBase table records can also serve as a good basis for future data provenance supports What CloudBATCH cannot do MPI-type jobs, reserve actual compute nodes, etc.

Introduction - Related Work (1) Hadoop Schedulers Focus on optimizing task distribution and execution Do not provide rich functionality for cluster level management, such as user access control and accounting, and advanced job and job queue management Hadoop On Demand (HOD) Creates a Hadoop cluster on-the-fly by running Hadoop daemons through the cluster resource management system on a set of reserved compute nodes Does not exploit data locality for MapReduce jobs because the nodes where the Hadoop cluster is created on may not host the needed data at all Sun Grid Engine (SGE) Hadoop Integration

Introduction - Related Work (2) Sun Grid Engine (SGE) Hadoop Integration Claims to be the first cluster resource management system with Hadoop integration Creates a Hadoop cluster on-the-fly like HOD with better data locality concerns Potential risks of overloading a compute node as a result of catering for data locality concerns (non-exclusive node usage) Shares a major drawback with HOD: a possible significant waste of resources in the Reduce phase, or in the case of having unbalanced executions of Map tasks Because the size of the on-the-fly cluster is statically determined at the beginning by users, normally according to the number of Map tasks On-the-fly Hadoop cluster is not shared across job submissions

System Design - Overview

System Design Client Clients accept job submission commands from users Check user credentials, determine queue/job policy, add job to CloudBATCH system by inserting job information into an HBase table called Job Table

System Design Broker and Monitor Brokers constantly poll the Job Table for a list of jobs with Status:submitted status For every job on the list Changes the job status to Status:queued Submits to the Hadoop MapReduce framework a Wrapper program for job execution Note: if the Wrapper fails directly after being submitted, jobs could stay in Status:queued forever. This is handled by the Monitor program Monitors set a time threshold T, periodically poll the Job Table for jobs that stay in the Status:queued status for a time period longer than T, and change their status back to Status:submitted

System Design Wrapper Wrappers are Map-only MapReduce programs acting as agents to execute user programs at compute nodes where they are scheduled to run by the underlying Hadoop schedulers When a Wrapper starts at some compute node: Grabs from the HBase table the necessary job information and transfers files that need to be staged to the local machine Updates the job status to Status:running Starts the job execution through commandline invocation, MapReduce and legacy jobs alike During job execution: Checks the execution status such as total running time to follow job policies, and terminates a running job if policies are violated After job execution completes: Either successful or failed, the Wrapper will update the job status ( Status:successful or Status:failed ) in the Job Table, Performs cleanup of the temporary job execution directory Terminates itself normally

System Design - Discussion Limitations No support for MPI-type applications Jobs must be commandline based, no interactive execution Data consistency under concurrent access Guard against conflicting concurrent updates to Job Table Multiple Brokers updating conflicting job status Solution: use transactional HBase with snapshot isolation Performance bottleneck Multiple number of system components (Brokers, etc.) can be run according to the scale of the cluster and job requests The bottleneck is at how fast concurrent clients can insert job information to the Job Table

Future Work Monitors are still not fully explored for our prototype and may be extended in the future for detecting jobs that have been marked as Status:running but actually failed Further test the system under multi-queue, multi-user scenarios with heavy load and refine the prototype implementation for trial production deployment in solving real-world use cases Exploit the usage of CloudBATCH to make dedicated Hadoop clusters useful for the load balancing of legacy batch job submissions to other coexisting traditional clusters

Thank you! And Questions? Please contact me at: c15zhang@cs.uwaterloo.ca Thank you!