Lightweight Streaming-based Runtime for Cloud Computing. Shrideep Pallickara. Community Grids Lab, Indiana University


Transcription:

Slide 1: Lightweight Streaming-based Runtime for Cloud Computing: Granules. Shrideep Pallickara, Community Grids Lab, Indiana University

Slide 2: A unique confluence of factors has driven the need for cloud computing
- DEMAND PULLS: process and store large data volumes (2002: 22 EB; 2006: 161 EB; 2010: 988 EB, roughly 1 ZB)
- TECHNOLOGY PUSHES: falling hardware costs & better networks
- RESULT: aggregation to scale

Slide 3: Since the cloud is not monolithic, it is easier to cope with flux and evolve
- Replacement and upgrades are two sides of the same coin
- Desktop: 4 GB RAM, 400 GB disk, 50 GFLOPS & 4 cores
- 250 x desktop = 1 TB RAM, 100 TB disk, 6.25 TFLOPS and 1000 cores

Slide 4: Cloud and traditional HPC systems have some fundamental differences
- One job at a time = underutilization: execution pipelines, IO-bound activities
- An application is the sum of its parts
- The cloud strategy is to interleave 1000s of tasks on the same resource

Slide 5: Projects utilizing NaradaBrokering: QuakeSim

Slide 6: Ecological monitoring: PISCES

Slide 7: Projects utilizing NaradaBrokering: sensor grids (RFID tag, RFID reader, GPS sensor, Nokia N800)

Slide 8: Lessons learned from multi-disciplinary settings
- Framework for processing streaming data
- Compute demands will outpace availability
- Manage computational load transparently

Slide 9: Big picture
- Simulations & services, networked devices {sensors, instruments}, experimental models
- Burgeoning data volumes

Slide 10: These fueled the need for a new type of computational task
- Operates on dynamic and voluminous data
- Comparatively small CPU-bound times: milliseconds to minutes
- BUT 1000s of these long-running tasks must be interleaved concurrently
- Granules provisions this

Slide 11: Granules is dispersed over, and permeates, distributed components. [Architecture diagram: Granules instances (G) on resources R1..RN connected by a content distribution network; components include Discovery Services, Concurrent Execution Queue, Lifecycle Management, Deployment Manager, Diagnostics, and Dataset Initializer.]

Slide 12: An application is the sum of its computational tasks, which are
- Agnostic about the resources that they execute on
- Responsible for processing a subset of the data: a fragment of a stream, a subset of files, or portions of a database (see the partitioning sketch below)
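To make the "each task owns a subset of the data" idea concrete, here is a minimal, purely illustrative Java sketch (not the Granules API) that splits a file listing round-robin into per-task subsets:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: splits a file listing into per-task subsets so that each
    // computational task is responsible for its own slice of the data.
    public class DatasetPartitioner {
        // Round-robin assignment of the files to 'tasks' subsets.
        public static List<List<String>> partition(List<String> files, int tasks) {
            List<List<String>> subsets = new ArrayList<>();
            for (int i = 0; i < tasks; i++) subsets.add(new ArrayList<>());
            for (int i = 0; i < files.size(); i++) {
                subsets.get(i % tasks).add(files.get(i));
            }
            return subsets;
        }

        public static void main(String[] args) {
            List<String> files = List.of("chunk-00.dat", "chunk-01.dat", "chunk-02.dat", "chunk-03.dat");
            // [[chunk-00.dat, chunk-02.dat], [chunk-01.dat, chunk-03.dat]]
            System.out.println(partition(files, 2));
        }
    }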

Slide 13: Granules does most of the work for applications, except for
1. Processing functionality
2. Specifying the datasets
3. The scheduling strategy for constituent tasks

Slide 14: Granules processing functionality (1) is domain specific
- Implement just one method: execute() (sketched below)
- Processing 1 TB of data over 100 machines is done in 150 lines of Java code
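As a minimal sketch of what "implement just one method" could look like, the snippet below defines a hypothetical task interface with an execute() callback and a task that histograms the words in its data fragment. The interface name and constructor are assumptions for illustration, not the actual Granules API.

    // Hypothetical sketch; only the idea of a single execute() entry point comes from the slide.
    interface ComputationalTask {
        void execute();   // domain-specific processing goes here
    }

    // Example task that histograms the words in the data fragment it was handed.
    class WordHistogramTask implements ComputationalTask {
        private final String fragment;   // subset of the data assigned to this task
        private final java.util.Map<String, Integer> counts = new java.util.HashMap<>();

        WordHistogramTask(String fragment) { this.fragment = fragment; }

        @Override
        public void execute() {
            for (String word : fragment.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
            System.out.println(counts);   // prints the word counts for this fragment
        }

        public static void main(String[] args) {
            new WordHistogramTask("to be or not to be").execute();
        }
    }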

Slide 15: Computational tasks often need to cope with multiple datasets (2)
- TYPES: streams, files, databases & URIs
- INITIALIZE: configuration & allocations
- ACCESS: permissions and authorizations
- DISPOSE: reclaim allocated resources
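A small Java sketch of the INITIALIZE / ACCESS / DISPOSE lifecycle above, applied to a file-backed dataset; the class and method names are illustrative assumptions, not the Granules dataset API.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical file-backed dataset following the lifecycle named on the slide.
    class FileDataset {
        private final Path path;
        private BufferedReader reader;

        FileDataset(Path path) { this.path = path; }

        // INITIALIZE: configuration & allocations (permission checks would also go here).
        void initialize() throws IOException {
            if (!Files.isReadable(path)) throw new IOException("not readable: " + path);
            reader = Files.newBufferedReader(path);
        }

        // ACCESS: hand data to the computational task; initialize() must be called first.
        String readLine() throws IOException {
            return reader.readLine();
        }

        // DISPOSE: reclaim allocated resources.
        void dispose() throws IOException {
            if (reader != null) reader.close();
        }
    }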

Slide 16: Computational tasks specify their lifetime and scheduling (3) strategy
- Dimensions: data availability, number of times, periodicity
- Tasks can permute on any of these dimensions, change them during execution, and assert completion
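The following sketch shows one way a scheduling specification over those three dimensions could be expressed; the class, fields, and activation rule are assumptions for illustration rather than the Granules scheduling API.

    // Hypothetical scheduling specification with the slide's three dimensions:
    // run counts, periodicity, and data availability.
    class SchedulingStrategy {
        int maxExecutions = 1;              // number of times the task should run (-1 = unbounded)
        long periodMillis = 0;              // > 0 schedules the task periodically
        boolean onDataAvailability = false; // activate whenever new data arrives on a stream

        boolean shouldActivate(int executionsSoFar, boolean dataArrived, long millisSinceLastRun) {
            if (maxExecutions >= 0 && executionsSoFar >= maxExecutions) return false; // task asserts completion
            if (onDataAvailability && dataArrived) return true;
            return periodMillis > 0 && millisSinceLastRun >= periodMillis;
        }
    }

Any of the fields can be changed while the task is running, which is how a task would "permute" its scheduling during execution in this sketch.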

Slide 17: Granules discovers resources to deploy computational tasks
- Deploys computation instances on multiple resources
- Instantiates computational tasks & executes them in a sandbox
- Initializes each task's STATE and DATASETS
- Interleaves multiple computations on a given machine

Slide 18: Deploying computational tasks. [Diagram: computations (C) are pushed through the content distribution network to Granules instances (G) on resources R1, R2, ..., RN.]

Slide 19: Granules manages the state transitions for computational tasks
- Transition triggers: data availability, periodicity, termination conditions, external requests
- States: a task is initialized into Dormant, activated into Activated, and eventually terminated
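A compact sketch of that dormant/activated lifecycle as a state machine; the enum values mirror the slide, but the class and method names are illustrative assumptions.

    // Hypothetical lifecycle: initialized into DORMANT, activated by a trigger,
    // back to DORMANT after executing, terminated when its completion condition holds.
    class TaskLifecycle {
        enum State { DORMANT, ACTIVATED, TERMINATED }
        enum Trigger { DATA_AVAILABLE, TIMER_FIRED, EXTERNAL_REQUEST, TERMINATION_CONDITION }

        private State state = State.DORMANT;   // Initialize -> Dormant

        State onTrigger(Trigger trigger) {
            if (state == State.DORMANT && trigger == Trigger.TERMINATION_CONDITION) {
                state = State.TERMINATED;       // Terminate
            } else if (state == State.DORMANT) {
                state = State.ACTIVATED;        // activate and run execute()
            }
            return state;
        }

        void onExecutionFinished() {
            if (state == State.ACTIVATED) state = State.DORMANT; // await the next trigger
        }
    }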

Slide 20: Activation and processing of computational tasks. [Diagram: a task alternates between dormant (D) and processing (P) phases, and is eventually disposed.]

Slide 21: MAP-REDUCE enables concurrent processing of large datasets. [Diagram: a dataset is split into fragments d1..dN, each processed by Map1..MapN; reducers Reduce1..ReduceK combine the map results into outputs o1..oK.]
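A worked, single-process illustration of that model in plain Java: each map turns one fragment d_i into partial word counts, and a reduce merges the partial outputs. This is only a sketch of the programming model, not a distributed implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MapReduceSketch {
        // Map_i over fragment d_i: count words in one fragment.
        static Map<String, Integer> map(String fragment) {
            Map<String, Integer> partial = new HashMap<>();
            for (String w : fragment.split("\\s+")) partial.merge(w, 1, Integer::sum);
            return partial;
        }

        // Reduce over the partial outputs: merge the per-fragment counts.
        static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
            Map<String, Integer> total = new HashMap<>();
            for (Map<String, Integer> p : partials)
                p.forEach((w, c) -> total.merge(w, c, Integer::sum));
            return total;
        }

        public static void main(String[] args) {
            List<String> dataset = List.of("a rose is a rose", "is a rose");
            List<Map<String, Integer>> partials = new ArrayList<>();
            for (String fragment : dataset) partials.add(map(fragment));
            System.out.println(reduce(partials)); // {a=3, rose=3, is=2} (key order may vary)
        }
    }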

Slide 22: Substantial benefits can be accrued in a streaming version of MAP-REDUCE
- File-based = expensive disk IO; streaming is much faster
- Allows access to intermediate results
- Enables time-bound responses
- Granules' Map-Reduce is based on streams

Slide 23: In Granules, MAP and REDUCE are two roles of a computational task
- Linking MAP-REDUCE roles is easy: M1.addReduce(R1) or R1.addMap(M1) (sketched below); unlinking is easy too (remove)
- Maps generate result streams, which are consumed by reducers
- Reducers can track outputs from maps
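The sketch below mirrors the addReduce()/addMap() calls shown on the slide; the MapTask/ReduceTask classes, their fields, and the symmetric bookkeeping are assumptions made only so the example is self-contained.

    import java.util.ArrayList;
    import java.util.List;

    class MapTask {
        final String name;
        final List<ReduceTask> reducers = new ArrayList<>();
        MapTask(String name) { this.name = name; }

        void addReduce(ReduceTask r) {      // link: this map's result stream feeds r
            if (!reducers.contains(r)) { reducers.add(r); r.addMap(this); }
        }
        void removeReduce(ReduceTask r) {   // unlinking is just as easy
            if (reducers.remove(r)) r.removeMap(this);
        }
    }

    class ReduceTask {
        final String name;
        final List<MapTask> maps = new ArrayList<>();  // lets the reducer track outputs from its maps
        ReduceTask(String name) { this.name = name; }

        void addMap(MapTask m)    { if (!maps.contains(m)) { maps.add(m); m.addReduce(this); } }
        void removeMap(MapTask m) { if (maps.remove(m)) m.removeReduce(this); }

        public static void main(String[] args) {
            MapTask m1 = new MapTask("M1");
            ReduceTask r1 = new ReduceTask("R1");
            m1.addReduce(r1);               // equivalent to r1.addMap(m1)
            System.out.println(m1.reducers.size() + " " + r1.maps.size()); // 1 1
        }
    }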

Slide 24: In Granules, MAP-REDUCE roles are interchangeable. [Diagram: maps M1, M2 feed R1 and maps M3, M4 feed R2; a task RM is linked to all four via RM.addReduce(M1), RM.addReduce(M2), RM.addReduce(M3), RM.addReduce(M4), illustrating that the same task can play either role.]

Slide 25: Scientific applications can harness MAP-REDUCE variants in Granules
- ITERATIVE: a fixed number of times
- RECURSIVE: till a termination condition
- PERIODIC
- DATA-AVAILABILITY DRIVEN
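A toy illustration of the ITERATIVE and RECURSIVE variants: one map-reduce round is stood in for by a simple numeric update, re-run until a termination condition holds or an iteration cap is reached. Purely illustrative; nothing here is the Granules API.

    public class RecursiveVariant {
        // Placeholder for one map-reduce round (here, a Newton step towards sqrt(2)).
        static double round(double estimate) {
            return (estimate + 2.0 / estimate) / 2.0;
        }

        public static void main(String[] args) {
            double current = 1.0;
            int rounds = 0;
            while (rounds < 100) {                        // ITERATIVE: fixed cap on rounds
                double next = round(current);
                rounds++;
                boolean converged = Math.abs(next - current) < 1e-12;
                current = next;
                if (converged) break;                     // RECURSIVE: till termination condition
            }
            System.out.println("estimate " + current + " after " + rounds + " rounds");
        }
    }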

Slide 26: Complex computational pipelines can be set up using Granules
- Stages (STAGE 1 -> STAGE 2 -> STAGE 3) can be iterative, periodic, recursive & data driven
- Each stage could comprise computations dispersed on multiple machines
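An illustrative three-stage pipeline in the spirit of the slide, where each stage's output feeds the next. Plain Java function composition; in Granules each stage could itself be a set of tasks spread over many machines.

    import java.util.function.Function;

    public class PipelineSketch {
        public static void main(String[] args) {
            Function<String, String[]> stage1 = line -> line.split("\\s+");     // e.g. tokenize
            Function<String[], Long> stage2   = tokens -> (long) tokens.length; // e.g. count
            Function<Long, String> stage3     = n -> "tokens: " + n;            // e.g. format/report

            Function<String, String> pipeline = stage1.andThen(stage2).andThen(stage3);
            System.out.println(pipeline.apply("streaming map reduce on granules")); // tokens: 5
        }
    }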

Slide 27: Granules manages pipeline communications complexity
- No arduous management of fan-ins (which may run into the thousands, e.g. 10,000 preceding tasks)
- Facilities to track outputs and confirm receipt from all preceding stages

Slide 28: Granules allows computational tasks to be cloned. [Diagram: M1, M2 and R1 are cloned; M3, M4 and R2 are not.]
- Fine-tune redundancies
- Double-check results
- Discard duplicates from clones (see the sketch below)
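A small sketch of how a consumer might handle results from cloned tasks: keep the first result per task, cross-check that a clone agrees, and silently discard the duplicate. The class and method names are illustrative assumptions, not the Granules API.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class CloneDeduplication {
        private final Map<String, Integer> accepted = new LinkedHashMap<>();

        void onResult(String taskKey, int value) {
            Integer previous = accepted.putIfAbsent(taskKey, value);
            if (previous != null && previous != value) {
                System.err.println("clone disagreement for " + taskKey); // double-check results
            } // else: duplicate from a clone, silently discarded
        }

        public static void main(String[] args) {
            CloneDeduplication sink = new CloneDeduplication();
            sink.onResult("M1", 42);   // original result accepted
            sink.onResult("M1", 42);   // clone's duplicate discarded
            System.out.println(sink.accepted); // {M1=42}
        }
    }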

Slide 29: Related work
- HADOOP: file-based, Java, HDFS
- DRYAD: dataflow graphs, C#, LINQ, MSD
- DISCO: file-based, Erlang
- PHOENIX: multicore
- GOOGLE CLOUD: GFS

Slide 30: Granules outperforms Hadoop & Dryad in a traditional IR benchmark. [Plot: overhead for histogramming words (seconds, 0-3000) versus total dataset size (20-100 GB) for Hadoop, Dryad, and Granules.]

Slide 31: Clustering data points using K-means. [Plot: overhead (seconds, log scale 0.1-10000) versus number of 2D data points (0.1-100 million) for Hadoop, Granules, and MPI.]

Slide 32: Computing the product of two 16K x 16K matrices using streaming datasets. [Plot: computing overhead (seconds, 1024-16384) versus number of machines (1-8). Each Granules instance deals with at least 18K streams.]

Slide 33: Maximizing core utilization when assembling mRNA sequences. [Plot: computing overhead (seconds) and speed-up versus number of cores (1-64). The program aims to reconstruct full-length mRNA sequences for each expressed gene.]

Slide 34: Preserving scheduling periodicity for 10^4 concurrent computational tasks. [Plot: time each task was scheduled at (0-70 seconds) across 10,000 computational tasks.]

Slide 35: The streaming substrate provides consistently high-performance throughput. [Plot: mean transit delay and standard deviation (milliseconds, 0-5) versus content payload size, 100 B - 100 KB.]

Slide 36: The streaming substrate provides consistently high-performance throughput. [Plot: mean transit delay (0-60 ms) and standard deviation (0-14 ms) versus content payload size, 100 B - 1 MB.]

Slide 37: Cautionary tale: gains when disk IO cannot keep pace with processing. [Plot: processing overhead (seconds, up to 14000) for histogramming words in a 500 GB dataset (1600 files & 100 map functions) versus number of processors (5-50), annotated with gains of 95%, 72%, 65%, 42%, and 41%.]

Slide 38: Key innovations within Granules
- Easy to develop applications
- Support for real-time streaming datasets
- Rich lifecycle and scheduling support for computational tasks
- Enforces the semantics of complex, distributed computational graphs
- Seamless cloning at finer & coarser levels

Slide 39: Future work
- Probabilistic guarantees within the cloud
- Efficient generation of compute streams
- Throttling and steering of computations
- Staging datasets to maximize throughput
- Support for policies with global & local scope

Slide 40: Conclusions
- There is a pressing need to cloud-enable network data-intensive systems
- Complexity should be managed BY the runtime, and NOT by domain specialists
- The autonomy of Granules instances allows the system to cope well with resource pool expansion
- Provisioning lifecycle metrics for the parts makes it easier to do so for the sum