MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Size: px

Start display at page:

Download "MATE-EC2: A Middleware for Processing Data with Amazon Web Services"

Raymond O’Brien’
5 years ago
Views:

MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State

1 MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering and Computer Science (Presenter) Washington State University, Vancouver The 4th Workshop on Many-Task Computing on Grids and Supercomputers: MTAGS 11

2 The Problem Today s applications are increasingly data and compute intensive 2

3 The Problem Today s applications are increasingly data and compute intensive Many-Task Computing paradigms becoming pervasive: WF, MR 2

4 The Problem Today s applications are increasingly data and compute intensive Many-Task Computing paradigms becoming pervasive: WF, MR E.g., Map-Reducible applications are solving common problems Data mining Graph processing etc. 2

5 Clouds Infrastructure-as-a-Service Anyone, anywhere can allocate unlimited virtualized compute/storage resources Amazon Web Services: Most popular IaaS provider Elastic Compute Cloud (EC2) Simple Storage Service (S3) 3

6 Amazon Web Services: EC2 On-Demand Instances (Virtual Machines) Types: Extra Large, Large, Small, etc. For example, Extra Large Instance: Cost: $0.68 per allocated-hour 15GB memory; 1.7TB disk (ephemeral) 8 Compute Units High I/O performance 4

7 Amazon Web Services: S3 Accessible anywhere. High reliability and availability Objects are arbitrary data blobs Objects stored as (key,value) in Buckets 5TB of data per object Unlimited objects per bucket 5

8 Amazon Web Services: S3 (Cont.) Simple FTP-like interface using web service protocol Put, Get (Partial Get), and Delete SOAP and REST High throughput (~40MB/sec) Scales well to multiple clients Low costs 6

9 Amazon Web Services: S3 (Cont.) 449 billion objects in S3 as of July 2011 Doubling each year 7

10 Motivation and Focus Virtualization is characteristic of any cloud environment: Clouds are black boxes Storage and elastic compute services exhibit performance variabilities we should leverage 8

11 Goals? As users are increasingly moving to cloud-based solutions for computing. We have a need for services and tools that can Get the most out of cloud resources for dataintensive processing Provide a simple programming interface 9

12 Outline Background System Overview Experiments Conclusion 10

13 MATE-EC2 System Design Cloud middleware that is able to Use a set of possibly heterogeneous EC2 instances to scalably and efficiently process data stored in S3 MATE is a generalized reduction PDC structure like Map-Reduce 11

14 MATE and Map-Reduce Map-Reduce input data map() shuffle reduce() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') 12

15 MATE and Map-Reduce Map-Reduce MATE input data map() shuffle reduce() result input data local reduction() global reduction() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') proc(e) (k1,v') (k3,v') robj 1 (k1,v') (k1,v') robj 1 robj K (k1,v'') (k2,v'') (k3,v'') combined robj 12

16 MATE and Map-Reduce Map-Reduce MATE input data map() shuffle reduce() result input data local reduction() global reduction() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') proc(e) (k1,v') (k3,v') robj 1 (k1,v') (k1,v') robj 1 robj K (k1,v'') (k2,v'') (k3,v'') combined robj 12

17 MATE and Map-Reduce Map-Reduce MATE input data map() shuffle reduce() result input data local reduction() global reduction() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') proc(e) (k1,v') (k3,v') robj 1 (k1,v') (k1,v') robj 1 robj K (k1,v'') (k2,v'') (k3,v'') combined robj 12

18 MATE-EC2 Design S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 13

19 MATE-EC2 Design: Data Organization Objects: Physical representation of the data in S3 S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14

20 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14

21 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer Buffer input Computing Layer Th 0 Th 1 Th 2 Chunks: Logical data partitions within objects (exploits memory utilization) Th 3 S3 Data Object Chunk 0 Chunk n Job Scheduler S3 Data Store Metadata EC2 Slave Instance EC2 Master Instance 14

22 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14

23 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata Units: Fixed data units within a chunk for retrieval EC2 (exploits Slave concurrency) Instance EC2 Master Instance 14

24 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14

25 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Metadata: chunk offset, chunk size, unit size 14

26 MATE-EC2 Design: Data Retrieval S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance Threaded chunk retrieval: Chunk retrieved concurrently with a number of threads EC2 Master Instance 15

27 Dynamic Load Balancing S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Job Pool 16

28 Dynamic Load Balancing S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Schedule for processing Simple greedy heuristic: Select a unit belonging to the least connected chunk Job Pool 17

29 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (1) Compute node requests a job from Master

30 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance

31 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (2) Chunk retrieved in units

32 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process

33 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process

34 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process

35 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process

36 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process

37 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (4) Request another job from Master

38 MATE-EC2 Processing Flow Slave Instance S3 Data Retrieval Layer Buffer Th 0 Th 1 Th 2 S3 Data Object Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Slave Instance Slave Instance

39 Experiments Goals Finding the most suitable setting for AWS Performance of MATE-EC2 on heterogeneous and homogeneous compute environments Performance comparison of MATE-EC2 and Map- Reduce 23

40 Experiments (Cont.) Setup: 4 Large EC2 slave instances 1 Large instance for master instance For each application, the dataset is split into16 data objects on S3 Large Instance: 4 compute units (each comparable to GHz) 7.5GB (memory) 850GB (disk, ephemeral) High I/O 24

41 Experiments (Cont.) App I/O Comp RObj Size Dataset KMeans Clustering Low/Med Med/High Small 8.2GB 10.7 billion points PCA Low High Large 8.2GB PageRank High Low/Med Very Large 1GB 9.6M nodes, 131M edges 25

42 Effect of Chunk Sizes (KMeans) 1 Data Retrieval Thread Performance increase: 128KB vs. >8M 2.07x to 2.49x speedup 26

43 Data Retrieval (KMeans) 128M Chunk Size 16 Data Retrieval Threads 350 Execution Time (sec) S3 Data Retrieval Threads Execution Time (sec) M 16M 32M 64M 128M Chunk Size One Thread vs. others: 1.37x x 8M vs. others speedup: 1.13x x 27

44 Job Assignment (KMeans) Execution Time (sec) Sequential Selective Speedup: 1.01x for 8M 1.1x to 1.14x for others 8M 16M 32M 64M 128M Chunk Size 28

45 MATE-EC2 vs. Elastic MapReduce Chunk Size: 128MB Data retrieval Threads: 16 KMeans PageRank Execution Time (sec) EMR-no-combine EMR-combine MATE-EC2 Execution Time (sec) EMR-no-combine EMR-combine MATE-EC Elastic Compute Units Elastic Compute Units Speedups vs. EMR-combine 3.54x to 4.58x Speedups vs. EMR-combine 4.08x to 7.54x 29

46 MATE-EC2 on Heterogeneous Instances Overheads KMeans: 1% PCA: 1.1%, 7.4%, 11.7% 30

47 In Conclusion AWS environment is explored for data-intensive computing 64M and 128M data chunks w/ 16 data retrieval threads seems to be optimal for our middleware Our data retrieval and processing optimizations significantly improve the performance of our middleware MATE-EC2 outperforms MR both in scalability and performance 31

48 Thank You Questions & Discussion? Acknowledgments: We want to thank the MTAGS 11 organizers and the reviewers for their constructive comments This work was sponsored by: 32

Supporting Fault Tolerance in a Data-Intensive Computing Middleware

Supporting Fault Tolerance in a Data-Intensive Computing Middleware Tekin Bicer, Wei Jiang and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University IPDPS 2010, Atlanta,