MATE-EC2: A Middleware for Processing Data with Amazon Web Services


MATE-EC2: A Middleware for Processing Data with Amazon Web Services
Tekin Bicer, David Chiu* (presenter), and Gagan Agrawal
Department of Computer Science and Engineering, Ohio State University
*School of Engineering and Computer Science, Washington State University, Vancouver
The 4th Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS 11)

The Problem
Today's applications are increasingly data- and compute-intensive
Many-Task Computing paradigms are becoming pervasive: workflows (WF) and Map-Reduce (MR)
E.g., map-reducible applications solve common problems: data mining, graph processing, etc.

Clouds
Infrastructure-as-a-Service: anyone, anywhere can allocate unlimited virtualized compute/storage resources
Amazon Web Services: the most popular IaaS provider
Elastic Compute Cloud (EC2)
Simple Storage Service (S3)

Amazon Web Services: EC2
On-Demand Instances (virtual machines)
Types: Small, Large, Extra Large, etc.
For example, an Extra Large instance:
Cost: $0.68 per allocated hour
15 GB memory; 1.7 TB disk (ephemeral)
8 compute units
High I/O performance

Amazon Web Services: S3
Accessible anywhere; high reliability and availability
Objects are arbitrary data blobs
Objects are stored as (key, value) pairs in buckets
Up to 5 TB of data per object
Unlimited objects per bucket

Amazon Web Services: S3 (Cont.)
Simple FTP-like interface over web-service protocols (SOAP and REST)
Put, Get (including partial Get), and Delete
High throughput (~40 MB/sec) that scales well to multiple clients
Low cost
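The partial Get mentioned above maps to an ordinary HTTP Range request against S3. A minimal sketch of that mapping, assuming only that we want the header string S3 expects (no request is actually sent; the helper name is ours, not an AWS API):

```python
def range_header(offset: int, length: int) -> dict:
    """Build the HTTP header for a partial Get of `length` bytes at `offset`.

    S3 byte ranges are inclusive on both ends, so a 1 MB read starting at
    byte 0 asks for bytes 0-1048575.
    """
    return {"Range": f"bytes={offset}-{offset + length - 1}"}

# e.g. range_header(0, 1048576) yields {"Range": "bytes=0-1048575"}
```

This is the mechanism that later lets MATE-EC2 pull individual units of a chunk instead of whole objects.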

Amazon Web Services: S3 (Cont.)
449 billion objects in S3 as of July 2011, doubling each year

Motivation and Focus
Virtualization is characteristic of any cloud environment: clouds are black boxes
Storage and elastic compute services exhibit performance variabilities that we should leverage

Goals
As users increasingly move to cloud-based solutions for computing, we need services and tools that
Get the most out of cloud resources for data-intensive processing
Provide a simple programming interface

Outline
Background
System Overview
Experiments
Conclusion

MATE-EC2 System Design
Cloud middleware that uses a set of possibly heterogeneous EC2 instances to scalably and efficiently process data stored in S3
MATE is a generalized-reduction PDC structure, similar to Map-Reduce


MATE and Map-Reduce
Map-Reduce: input data → map() → shuffle → reduce() → result
MATE: input data → local reduction() → global reduction() → result
[Diagram: map() emits (k, v) pairs that are shuffled by key and reduced to (k, v') results; in MATE, proc(e) accumulates (k, v') entries directly into per-node reduction objects robj 1 … robj K, which the global reduction combines into a single reduction object of (k, v'') results]
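The contrast in the diagram can be made concrete with a toy word count. This is a minimal sketch of the two structures, not the MATE-EC2 implementation: the Map-Reduce version materializes and groups intermediate (k, v) pairs, while the generalized-reduction version accumulates directly into per-node reduction objects (robj) and then combines them.

```python
from collections import defaultdict

def mapreduce_wordcount(docs):
    # Map phase: emit (word, 1) pairs.
    pairs = [(w, 1) for doc in docs for w in doc.split()]
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce phase: fold each group into a final value.
    return {k: sum(vs) for k, vs in groups.items()}

def mate_wordcount(docs, num_nodes=2):
    # Local reduction: each node updates its reduction object in place
    # (proc(e) in the slide), so no intermediate pairs are shuffled.
    robjs = [defaultdict(int) for _ in range(num_nodes)]
    for i, doc in enumerate(docs):
        robj = robjs[i % num_nodes]
        for w in doc.split():
            robj[w] += 1
    # Global reduction: combine the per-node reduction objects.
    combined = defaultdict(int)
    for robj in robjs:
        for k, v in robj.items():
            combined[k] += v
    return dict(combined)
```

Both functions compute the same result; the difference is where the reduction happens and how much intermediate state exists.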

MATE-EC2 Design
[Diagram: an S3 data store holds data objects partitioned into chunks (Chunk 0 … Chunk n); the S3 data retrieval layer on each EC2 slave instance pulls chunks through threads Th 0–Th 3 into a buffer feeding the computing layer; the job scheduler on the EC2 master instance holds the metadata]

MATE-EC2 Design: Data Organization
Objects: physical representation of the data in S3
Chunks: logical data partitions within objects (improve memory utilization)
Units: fixed-size data pieces within a chunk, used for retrieval (exploit concurrency)
Metadata: chunk offset, chunk size, unit size
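Given the metadata above (chunk offset, chunk size, unit size), the byte ranges of a chunk's units follow mechanically. A minimal sketch of that derivation, with names of our own choosing:

```python
def unit_ranges(chunk_offset: int, chunk_size: int, unit_size: int):
    """Split one chunk into (offset, size) pairs, one per retrieval unit.

    The final unit may be smaller than unit_size when unit_size does not
    evenly divide chunk_size.
    """
    units = []
    off = chunk_offset
    end = chunk_offset + chunk_size
    while off < end:
        units.append((off, min(unit_size, end - off)))
        off += unit_size
    return units
```

Each (offset, size) pair is exactly what a partial Get needs, which is how units become the scheduling granularity without any physical re-partitioning of the S3 objects.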

MATE-EC2 Design: Data Retrieval
Threaded chunk retrieval: each chunk is retrieved concurrently by a number of threads
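The threaded retrieval can be sketched as follows. This is illustrative only: `fetch_unit` is a hypothetical callback standing in for a partial S3 Get, not an AWS API, and the real middleware is C/C++ rather than Python.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chunk(fetch_unit, ranges, num_threads=4):
    """Retrieve one chunk by fetching its units concurrently.

    `ranges` is a list of (offset, size) pairs for the chunk's units;
    `fetch_unit(offset, size)` returns the bytes of one unit. The results
    of pool.map come back in submission order, so concatenating them
    reassembles the chunk correctly.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        parts = list(pool.map(lambda r: fetch_unit(*r), ranges))
    return b"".join(parts)
```

Because each unit is an independent HTTP range request, adding threads overlaps request latency, which is the effect measured in the data-retrieval experiment later in the talk.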

Dynamic Load Balancing
The master's job scheduler maintains a job pool of units to be scheduled for processing
Simple greedy heuristic: select a unit belonging to the least-connected chunk
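One plausible reading of the greedy heuristic, sketched under our own assumptions (the data structures and names are illustrative, not from the paper): among the units still in the job pool, pick one from the chunk that currently has the fewest slaves connected to it, spreading retrieval connections across chunks.

```python
from collections import Counter

def select_unit(pending, active):
    """Greedy selection of the next unit to schedule.

    pending: list of (chunk_id, unit_id) jobs still in the pool.
    active:  Counter mapping chunk_id -> number of slaves currently
             retrieving from that chunk.
    Returns the pending job whose chunk has the fewest active connections
    (ties broken by pool order).
    """
    return min(pending, key=lambda job: active[job[0]])
```

A usage example: with `active = Counter({0: 2, 1: 1})`, a pool containing units of chunks 0, 1, and 2 yields a unit of chunk 2, since no slave is reading it yet.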

MATE-EC2 Processing Flow
(1) A slave's compute node requests a job from the master
(2) The assigned chunk is retrieved from S3 in units
(3) Retrieved units are passed to the computing layer and processed
(4) The slave requests another job from the master
Multiple slave instances proceed through this loop concurrently
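The four steps above amount to a simple slave-side loop. A minimal sketch, where all three callbacks (`request_job`, `retrieve`, `process`) are hypothetical stand-ins for the master RPC, the S3 retrieval layer, and the computing layer:

```python
def slave_loop(request_job, retrieve, process):
    """Run one slave's processing loop until the master's pool is empty.

    request_job() returns the next job or None when the pool is exhausted;
    retrieve(job) fetches the job's chunk in units; process(data) runs the
    local reduction over the retrieved data.
    """
    results = []
    while (job := request_job()) is not None:  # (1) request a job from the master
        data = retrieve(job)                   # (2) chunk retrieved in units
        results.append(process(data))          # (3) pass to computing layer
    return results                             # loop reissues the request: step (4)
```

Several slaves running this loop against one master is what makes the scheduling dynamic: faster instances simply come back for jobs more often.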

Experiments
Goals:
Find the most suitable settings for AWS
Performance of MATE-EC2 on heterogeneous and homogeneous compute environments
Performance comparison of MATE-EC2 and Map-Reduce

Experiments (Cont.)
Setup:
4 Large EC2 slave instances
1 Large instance for the master
For each application, the dataset is split into 16 data objects on S3
Large instance: 4 compute units (each comparable to 1.0–1.2 GHz), 7.5 GB memory, 850 GB disk (ephemeral), high I/O

Experiments (Cont.)
App                I/O       Comp      RObj Size   Dataset
KMeans Clustering  Low/Med   Med/High  Small       8.2 GB (10.7 billion points)
PCA                Low       High      Large       8.2 GB
PageRank           High      Low/Med   Very Large  1 GB (9.6M nodes, 131M edges)

Effect of Chunk Sizes (KMeans)
1 data retrieval thread
Chunk sizes above 8M vs. 128KB: 2.07x to 2.49x speedup

Data Retrieval (KMeans)
[Figure: execution time vs. number of S3 data retrieval threads (1–32) at 128M chunk size, and execution time vs. chunk size (8M–128M) with 16 data retrieval threads]
One retrieval thread vs. more threads: 1.37x to 1.90x speedup
8M chunks vs. larger chunk sizes: 1.13x to 1.30x speedup

Job Assignment (KMeans)
[Figure: execution time of sequential vs. selective job assignment across chunk sizes 8M–128M]
Speedup of selective over sequential: 1.01x for 8M; 1.1x to 1.14x for other chunk sizes

MATE-EC2 vs. Elastic MapReduce
Chunk size: 128MB; data retrieval threads: 16
[Figure: execution time vs. elastic compute units (8, 16, 32) for EMR-no-combine, EMR-combine, and MATE-EC2, on KMeans and PageRank]
KMeans speedups vs. EMR-combine: 3.54x to 4.58x
PageRank speedups vs. EMR-combine: 4.08x to 7.54x

MATE-EC2 on Heterogeneous Instances
Overheads:
KMeans: 1%
PCA: 1.1%, 7.4%, 11.7%

In Conclusion
We explored the AWS environment for data-intensive computing
64M and 128M data chunks with 16 data retrieval threads appear optimal for our middleware
Our data retrieval and processing optimizations significantly improve the performance of our middleware
MATE-EC2 outperforms Map-Reduce in both scalability and performance

Thank You
Questions & Discussion
Acknowledgments: We thank the MTAGS 11 organizers and the reviewers for their constructive comments
This work was sponsored by: