MATE-EC2: A Middleware for Processing Data with Amazon Web Services
|
|
- Raymond O’Brien’
- 5 years ago
- Views:
Transcription
1 MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering and Computer Science (Presenter) Washington State University, Vancouver The 4th Workshop on Many-Task Computing on Grids and Supercomputers: MTAGS 11
2 The Problem Today s applications are increasingly data and compute intensive 2
3 The Problem Today s applications are increasingly data and compute intensive Many-Task Computing paradigms becoming pervasive: WF, MR 2
4 The Problem Today s applications are increasingly data and compute intensive Many-Task Computing paradigms becoming pervasive: WF, MR E.g., Map-Reducible applications are solving common problems Data mining Graph processing etc. 2
5 Clouds Infrastructure-as-a-Service Anyone, anywhere can allocate unlimited virtualized compute/storage resources Amazon Web Services: Most popular IaaS provider Elastic Compute Cloud (EC2) Simple Storage Service (S3) 3
6 Amazon Web Services: EC2 On-Demand Instances (Virtual Machines) Types: Extra Large, Large, Small, etc. For example, Extra Large Instance: Cost: $0.68 per allocated-hour 15GB memory; 1.7TB disk (ephemeral) 8 Compute Units High I/O performance 4
7 Amazon Web Services: S3 Accessible anywhere. High reliability and availability Objects are arbitrary data blobs Objects stored as (key,value) in Buckets 5TB of data per object Unlimited objects per bucket 5
8 Amazon Web Services: S3 (Cont.) Simple FTP-like interface using web service protocol Put, Get (Partial Get), and Delete SOAP and REST High throughput (~40MB/sec) Scales well to multiple clients Low costs 6
9 Amazon Web Services: S3 (Cont.) 449 billion objects in S3 as of July 2011 Doubling each year 7
10 Motivation and Focus Virtualization is characteristic of any cloud environment: Clouds are black boxes Storage and elastic compute services exhibit performance variabilities we should leverage 8
11 Goals? As users are increasingly moving to cloud-based solutions for computing. We have a need for services and tools that can Get the most out of cloud resources for dataintensive processing Provide a simple programming interface 9
12 Outline Background System Overview Experiments Conclusion 10
13 MATE-EC2 System Design Cloud middleware that is able to Use a set of possibly heterogeneous EC2 instances to scalably and efficiently process data stored in S3 MATE is a generalized reduction PDC structure like Map-Reduce 11
14 MATE and Map-Reduce Map-Reduce input data map() shuffle reduce() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') 12
15 MATE and Map-Reduce Map-Reduce MATE input data map() shuffle reduce() result input data local reduction() global reduction() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') proc(e) (k1,v') (k3,v') robj 1 (k1,v') (k1,v') robj 1 robj K (k1,v'') (k2,v'') (k3,v'') combined robj 12
16 MATE and Map-Reduce Map-Reduce MATE input data map() shuffle reduce() result input data local reduction() global reduction() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') proc(e) (k1,v') (k3,v') robj 1 (k1,v') (k1,v') robj 1 robj K (k1,v'') (k2,v'') (k3,v'') combined robj 12
17 MATE and Map-Reduce Map-Reduce MATE input data map() shuffle reduce() result input data local reduction() global reduction() result (k1,v) (k1,v) (k3,v) (k1,v) (k1,v) (k2,v) (k2,v) (k2,v) (k1,v') (k2,v') (k3,v') proc(e) (k1,v') (k3,v') robj 1 (k1,v') (k1,v') robj 1 robj K (k1,v'') (k2,v'') (k3,v'') combined robj 12
18 MATE-EC2 Design S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 13
19 MATE-EC2 Design: Data Organization Objects: Physical representation of the data in S3 S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14
20 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14
21 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer Buffer input Computing Layer Th 0 Th 1 Th 2 Chunks: Logical data partitions within objects (exploits memory utilization) Th 3 S3 Data Object Chunk 0 Chunk n Job Scheduler S3 Data Store Metadata EC2 Slave Instance EC2 Master Instance 14
22 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14
23 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata Units: Fixed data units within a chunk for retrieval EC2 (exploits Slave concurrency) Instance EC2 Master Instance 14
24 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance 14
25 MATE-EC2 Design: Data Organization S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Metadata: chunk offset, chunk size, unit size 14
26 MATE-EC2 Design: Data Retrieval S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance Threaded chunk retrieval: Chunk retrieved concurrently with a number of threads EC2 Master Instance 15
27 Dynamic Load Balancing S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Job Pool 16
28 Dynamic Load Balancing S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Schedule for processing Simple greedy heuristic: Select a unit belonging to the least connected chunk Job Pool 17
29 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (1) Compute node requests a job from Master
30 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance
31 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (2) Chunk retrieved in units
32 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process
33 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process
34 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process
35 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process
36 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (3) Pass to Compute Layer, and process
37 MATE-EC2 Processing Flow S3 Data Retrieval Layer S3 Data Object Buffer Th 0 Th 1 Th 2 Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance (4) Request another job from Master
38 MATE-EC2 Processing Flow Slave Instance S3 Data Retrieval Layer Buffer Th 0 Th 1 Th 2 S3 Data Object Chunk 0 Chunk n S3 Data Store input Th 3 Computing Layer Job Scheduler Metadata EC2 Slave Instance EC2 Master Instance Slave Instance Slave Instance
39 Experiments Goals Finding the most suitable setting for AWS Performance of MATE-EC2 on heterogeneous and homogeneous compute environments Performance comparison of MATE-EC2 and Map- Reduce 23
40 Experiments (Cont.) Setup: 4 Large EC2 slave instances 1 Large instance for master instance For each application, the dataset is split into16 data objects on S3 Large Instance: 4 compute units (each comparable to GHz) 7.5GB (memory) 850GB (disk, ephemeral) High I/O 24
41 Experiments (Cont.) App I/O Comp RObj Size Dataset KMeans Clustering Low/Med Med/High Small 8.2GB 10.7 billion points PCA Low High Large 8.2GB PageRank High Low/Med Very Large 1GB 9.6M nodes, 131M edges 25
42 Effect of Chunk Sizes (KMeans) 1 Data Retrieval Thread Performance increase: 128KB vs. >8M 2.07x to 2.49x speedup 26
43 Data Retrieval (KMeans) 128M Chunk Size 16 Data Retrieval Threads 350 Execution Time (sec) S3 Data Retrieval Threads Execution Time (sec) M 16M 32M 64M 128M Chunk Size One Thread vs. others: 1.37x x 8M vs. others speedup: 1.13x x 27
44 Job Assignment (KMeans) Execution Time (sec) Sequential Selective Speedup: 1.01x for 8M 1.1x to 1.14x for others 8M 16M 32M 64M 128M Chunk Size 28
45 MATE-EC2 vs. Elastic MapReduce Chunk Size: 128MB Data retrieval Threads: 16 KMeans PageRank Execution Time (sec) EMR-no-combine EMR-combine MATE-EC2 Execution Time (sec) EMR-no-combine EMR-combine MATE-EC Elastic Compute Units Elastic Compute Units Speedups vs. EMR-combine 3.54x to 4.58x Speedups vs. EMR-combine 4.08x to 7.54x 29
46 MATE-EC2 on Heterogeneous Instances Overheads KMeans: 1% PCA: 1.1%, 7.4%, 11.7% 30
47 In Conclusion AWS environment is explored for data-intensive computing 64M and 128M data chunks w/ 16 data retrieval threads seems to be optimal for our middleware Our data retrieval and processing optimizations significantly improve the performance of our middleware MATE-EC2 outperforms MR both in scalability and performance 31
48 Thank You Questions & Discussion? Acknowledgments: We want to thank the MTAGS 11 organizers and the reviewers for their constructive comments This work was sponsored by: 32
Supporting Fault Tolerance in a Data-Intensive Computing Middleware
Supporting Fault Tolerance in a Data-Intensive Computing Middleware Tekin Bicer, Wei Jiang and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University IPDPS 2010, Atlanta,
More informationResearch in Middleware Systems For In-Situ Data Analytics and Instrument Data Analysis
Research in Middleware Systems For In-Situ Data Analytics and Instrument Data Analysis Gagan Agrawal The Ohio State University (Joint work with Yi Wang, Yu Su, Tekin Bicer and others) Outline Middleware
More informationProgramming model and implementation for processing and. Programs can be automatically parallelized and executed on a large cluster of machines
A programming model in Cloud: MapReduce Programming model and implementation for processing and generating large data sets Users specify a map function to generate a set of intermediate key/value pairs
More informationChisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique
Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.
More informationCloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe
Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability
More informationNFS, GPFS, PVFS, Lustre Batch-scheduled systems: Clusters, Grids, and Supercomputers Programming paradigm: HPC, MTC, and HTC
Segregated storage and compute NFS, GPFS, PVFS, Lustre Batch-scheduled systems: Clusters, Grids, and Supercomputers Programming paradigm: HPC, MTC, and HTC Co-located storage and compute HDFS, GFS Data
More informationCloud Computing Paradigms for Pleasingly Parallel Biomedical Applications
Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics, Pervasive Technology Institute Indiana University
More informationThe amount of data increases every day Some numbers ( 2012):
1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect
More information2/26/2017. The amount of data increases every day Some numbers ( 2012):
The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to
More informationMOHA: Many-Task Computing Framework on Hadoop
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
More informationCPET 581 Cloud Computing: Technologies and Enterprise IT Strategies
CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies Lecture 8 Cloud Programming & Software Environments: High Performance Computing & AWS Services Part 2 of 2 Spring 2015 A Specialty Course
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationImproving Hadoop MapReduce Performance on Supercomputers with JVM Reuse
Thanh-Chung Dao 1 Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse Thanh-Chung Dao and Shigeru Chiba The University of Tokyo Thanh-Chung Dao 2 Supercomputers Expensive clusters Multi-core
More informationDepartment of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang
Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';
More informationThe MapReduce Abstraction
The MapReduce Abstraction Parallel Computing at Google Leverages multiple technologies to simplify large-scale parallel computations Proprietary computing clusters Map/Reduce software library Lots of other
More informationAn Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs
An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering
More informationData Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research
Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO FS Albis Streaming
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationCS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014
CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationPLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS
PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS By HAI JIN, SHADI IBRAHIM, LI QI, HAIJUN CAO, SONG WU and XUANHUA SHI Prepared by: Dr. Faramarz Safi Islamic Azad
More informationNFS, GPFS, PVFS, Lustre Batch-scheduled systems: Clusters, Grids, and Supercomputers Programming paradigm: HPC, MTC, and HTC
Segregated storage and compute NFS, GPFS, PVFS, Lustre Batch-scheduled systems: Clusters, Grids, and Supercomputers Programming paradigm: HPC, MTC, and HTC Co-located storage and compute HDFS, GFS Data
More informationGuoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14.
Guoping Wang and Chee-Yong Chan Department of Computer Science, School of Computing National University of Singapore VLDB 14 Page 1 Introduction & Notations Multi-Job optimization Evaluation Conclusion
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationAWS Solution Architecture Patterns
AWS Solution Architecture Patterns Objectives Key objectives of this chapter AWS reference architecture catalog Overview of some AWS solution architecture patterns 1.1 AWS Architecture Center The AWS Architecture
More informationAccelerating MapReduce on a Coupled CPU-GPU Architecture
Accelerating MapReduce on a Coupled CPU-GPU Architecture Linchuan Chen Xin Huo Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210 {chenlinc,huox,agrawal}@cse.ohio-state.edu
More informationRAMCloud and the Low- Latency Datacenter. John Ousterhout Stanford University
RAMCloud and the Low- Latency Datacenter John Ousterhout Stanford University Most important driver for innovation in computer systems: Rise of the datacenter Phase 1: large scale Phase 2: low latency Introduction
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationCross-layer Optimization for Virtual Machine Resource Management
Cross-layer Optimization for Virtual Machine Resource Management Ming Zhao, Arizona State University Lixi Wang, Amazon Yun Lv, Beihang Universituy Jing Xu, Google http://visa.lab.asu.edu Virtualized Infrastructures,
More informationBigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13
Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University
More informationThe Fusion Distributed File System
Slide 1 / 44 The Fusion Distributed File System Dongfang Zhao February 2015 Slide 2 / 44 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique
More informationCIT 668: System Architecture. Amazon Web Services
CIT 668: System Architecture Amazon Web Services Topics 1. AWS Global Infrastructure 2. Foundation Services 1. Compute 2. Storage 3. Database 4. Network 3. AWS Economics Amazon Services Architecture Regions
More information2013 AWS Worldwide Public Sector Summit Washington, D.C.
2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic
More informationA Cloud-based Dynamic Workflow for Mass Spectrometry Data Analysis
A Cloud-based Dynamic Workflow for Mass Spectrometry Data Analysis Ashish Nagavaram, Gagan Agrawal, Michael A. Freitas, Kelly H. Telu The Ohio State University Gaurang Mehta, Rajiv. G. Mayani, Ewa Deelman
More informationPrincipal Solutions Architect. Architecting in the Cloud
Matt Tavis Principal Solutions Architect Architecting in the Cloud Cloud Best Practices Whitepaper Prescriptive guidance to Cloud Architects Just Search for Cloud Best Practices to find the link ttp://media.amazonwebservices.co
More informationLEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud
LEEN: Locality/Fairness- Aware Key Partitioning for MapReduce in the Cloud Shadi Ibrahim, Hai Jin, Lu Lu, Song Wu, Bingsheng He*, Qi Li # Huazhong University of Science and Technology *Nanyang Technological
More informationEnosis: Bridging the Semantic Gap between
Enosis: Bridging the Semantic Gap between File-based and Object-based Data Models Anthony Kougkas - akougkas@hawk.iit.edu, Hariharan Devarajan, Xian-He Sun Outline Introduction Background Approach Evaluation
More informationDistributed Computations MapReduce. adapted from Jeff Dean s slides
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
More informationPocket: Elastic Ephemeral Storage for Serverless Analytics
Pocket: Elastic Ephemeral Storage for Serverless Analytics Ana Klimovic*, Yawen Wang*, Patrick Stuedi +, Animesh Trivedi +, Jonas Pfefferle +, Christos Kozyrakis* *Stanford University, + IBM Research 1
More informationAmbry: LinkedIn s Scalable Geo- Distributed Object Store
Ambry: LinkedIn s Scalable Geo- Distributed Object Store Shadi A. Noghabi *, Sriram Subramanian +, Priyesh Narayanan +, Sivabalan Narayanan +, Gopalakrishna Holla +, Mammad Zadeh +, Tianwei Li +, Indranil
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)
More informationAnalysis in the Big Data Era
Analysis in the Big Data Era Massive Data Data Analysis Insight Key to Success = Timely and Cost-Effective Analysis 2 Hadoop MapReduce Ecosystem Popular solution to Big Data Analytics Java / C++ / R /
More informationCS 61C: Great Ideas in Computer Architecture. MapReduce
CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing
More informationBest Practices and Performance Tuning on Amazon Elastic MapReduce
Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or
More informationConcurrency for data-intensive applications
Concurrency for data-intensive applications Dennis Kafura CS5204 Operating Systems 1 Jeff Dean Sanjay Ghemawat Dennis Kafura CS5204 Operating Systems 2 Motivation Application characteristics Large/massive
More informationLarge Scale Computing Infrastructures
GC3: Grid Computing Competence Center Large Scale Computing Infrastructures Lecture 2: Cloud technologies Sergio Maffioletti GC3: Grid Computing Competence Center, University
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationFINAL REPORT: K MEANS CLUSTERING SAPNA GANESH (sg1368) VAIBHAV GANDHI(vrg5913)
FINAL REPORT: K MEANS CLUSTERING SAPNA GANESH (sg1368) VAIBHAV GANDHI(vrg5913) Overview The partitioning of data points according to certain features of the points into small groups is called clustering.
More informationLocality-Aware Dynamic VM Reconfiguration on MapReduce Clouds. Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng
Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng Virtual Clusters on Cloud } Private cluster on public cloud } Distributed
More informationTypically applied in clusters and grids Loosely-coupled applications with sequential jobs Large amounts of computing for long periods of times
Typically applied in clusters and grids Loosely-coupled applications with sequential jobs Large amounts of computing for long periods of times Measured in operations per month or years 2 Bridge the gap
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationDatabase Applications (15-415)
Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April
More informationPROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP
ISSN: 0976-2876 (Print) ISSN: 2250-0138 (Online) PROFILING BASED REDUCE MEMORY PROVISIONING FOR IMPROVING THE PERFORMANCE IN HADOOP T. S. NISHA a1 AND K. SATYANARAYAN REDDY b a Department of CSE, Cambridge
More informationPiccolo. Fast, Distributed Programs with Partitioned Tables. Presenter: Wu, Weiyi Yale University. Saturday, October 15,
Piccolo Fast, Distributed Programs with Partitioned Tables 1 Presenter: Wu, Weiyi Yale University Outline Background Intuition Design Evaluation Future Work 2 Outline Background Intuition Design Evaluation
More informationPSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationescience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows
escience in the Cloud: A MODIS Satellite Data Reprojection and Reduction Pipeline in the Windows Jie Li1, Deb Agarwal2, Azure Marty Platform Humphrey1, Keith Jackson2, Catharine van Ingen3, Youngryel Ryu4
More informationIan Foster, An Overview of Distributed Systems
The advent of computation can be compared, in terms of the breadth and depth of its impact on research and scholarship, to the invention of writing and the development of modern mathematics. Ian Foster,
More informationToward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017
Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store Wei Xie TTU CS Department Seminar, 3/7/2017 1 Outline General introduction Study 1: Elastic Consistent Hashing based Store
More informationWhat is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)?
What is Cloud Computing? What are the Private and Public Clouds? What are IaaS, PaaS, and SaaS? What is the Amazon Web Services (AWS)? What is Amazon Machine Image (AMI)? Amazon Elastic Compute Cloud (EC2)?
More informationZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency
ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency Thanos Makatos, Yannis Klonatos, Manolis Marazakis, Michail D. Flouris, and Angelos Bilas {mcatos,klonatos,maraz,flouris,bilas}@ics.forth.gr
More informationSynonymous with supercomputing Tightly-coupled applications Implemented using Message Passing Interface (MPI) Large of amounts of computing for short
Synonymous with supercomputing Tightly-coupled applications Implemented using Message Passing Interface (MPI) Large of amounts of computing for short periods of time Usually requires low latency interconnects
More informationIntroduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work
Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work Introduction & Motivation Problem Statement Proposed Work Evaluation Conclusions Future Work Today (2014):
More informationChapter 3. Database Architecture and the Web
Chapter 3 Database Architecture and the Web 1 Chapter 3 - Objectives Software components of a DBMS. Client server architecture and advantages of this type of architecture for a DBMS. Function and uses
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationMapReduce. Stony Brook University CSE545, Fall 2016
MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining
More informationMASSACHUSETTS INSTITUTE OF TECHNOLOGY Database Systems: Fall 2008 Quiz II
Department of Electrical Engineering and Computer Science MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.830 Database Systems: Fall 2008 Quiz II There are 14 questions and 11 pages in this quiz booklet. To receive
More informationDatabase Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:
Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive
More informationA Case for Packing and Indexing in Cloud File Systems
A Case for Packing and Indexing in Cloud File Systems Saurabh Kadekodi, Bin Fan*, Adit Madan*, Garth Gibson and Greg Ganger University, *Alluxio Inc. In a nutshell Cloud object stores have a per-operation
More informationDistributed Systems 16. Distributed File Systems II
Distributed Systems 16. Distributed File Systems II Paul Krzyzanowski pxk@cs.rutgers.edu 1 Review NFS RPC-based access AFS Long-term caching CODA Read/write replication & disconnected operation DFS AFS
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationAutomatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.
Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Vazhkudai Instance of Large-Scale HPC Systems ORNL s TITAN (World
More informationCombining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms
Combining Distributed Memory and Shared Memory Parallelization for Data Mining Algorithms Ruoming Jin Department of Computer and Information Sciences Ohio State University, Columbus OH 4321 jinr@cis.ohio-state.edu
More informationRAMCloud: Scalable High-Performance Storage Entirely in DRAM John Ousterhout Stanford University
RAMCloud: Scalable High-Performance Storage Entirely in DRAM John Ousterhout Stanford University (with Nandu Jayakumar, Diego Ongaro, Mendel Rosenblum, Stephen Rumble, and Ryan Stutsman) DRAM in Storage
More informationCloud Computing CS
Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part
More informationIntroduction to Amazon Web Services
Introduction to Amazon Web Services Introduction Amazon Web Services (AWS) is a collection of remote infrastructure services mainly in the Infrastructure as a Service (IaaS) category, with some services
More informationNext-Generation Cloud Platform
Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology
More informationDynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds 1
Dynamic Load-Balanced Multicast for Data-Intensive Applications on Clouds 1 Contents: Introduction Multicast on parallel distributed systems Multicast on P2P systems Multicast on clouds High performance
More informationHaLoop Efficient Iterative Data Processing on Large Clusters
HaLoop Efficient Iterative Data Processing on Large Clusters Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst University of Washington Department of Computer Science & Engineering Presented
More informationData Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research
Data Processing at the Speed of 100 Gbps using Apache Crail Patrick Stuedi IBM Research The CRAIL Project: Overview Data Processing Framework (e.g., Spark, TensorFlow, λ Compute) Spark-IO Albis Pocket
More informationEnhancing Checkpoint Performance with Staging IO & SSD
Enhancing Checkpoint Performance with Staging IO & SSD Xiangyong Ouyang Sonya Marcarelli Dhabaleswar K. Panda Department of Computer Science & Engineering The Ohio State University Outline Motivation and
More informationMapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec
MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)
More informationMapReduce. U of Toronto, 2014
MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in
More informationSMCCSE: PaaS Platform for processing large amounts of social media
KSII The first International Conference on Internet (ICONI) 2011, December 2011 1 Copyright c 2011 KSII SMCCSE: PaaS Platform for processing large amounts of social media Myoungjin Kim 1, Hanku Lee 2 and
More information4 Myths about in-memory databases busted
4 Myths about in-memory databases busted Yiftach Shoolman Co-Founder & CTO @ Redis Labs @yiftachsh, @redislabsinc Background - Redis Created by Salvatore Sanfilippo (@antirez) OSS, in-memory NoSQL k/v
More informationAccelerating Parallel Analysis of Scientific Simulation Data via Zazen
Accelerating Parallel Analysis of Scientific Simulation Data via Zazen Tiankai Tu, Charles A. Rendleman, Patrick J. Miller, Federico Sacerdoti, Ron O. Dror, and David E. Shaw D. E. Shaw Research Motivation
More informationIBM Education Assistance for z/os V2R2
IBM Education Assistance for z/os V2R2 Item: RSM Scalability Element/Component: Real Storage Manager Material current as of May 2015 IBM Presentation Template Full Version Agenda Trademarks Presentation
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationChapter 4:- Introduction to Grid and its Evolution. Prepared By:- NITIN PANDYA Assistant Professor SVBIT.
Chapter 4:- Introduction to Grid and its Evolution Prepared By:- Assistant Professor SVBIT. Overview Background: What is the Grid? Related technologies Grid applications Communities Grid Tools Case Studies
More informationEC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures
EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University
More informationData Partitioning and MapReduce
Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,
More informationAn Introduction to Big Data Formats
Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION
More informationOasis: An Active Storage Framework for Object Storage Platform
Oasis: An Active Storage Framework for Object Storage Platform Yulai Xie 1, Dan Feng 1, Darrell D. E. Long 2, Yan Li 2 1 School of Computer, Huazhong University of Science and Technology Wuhan National
More informationCorrelation based File Prefetching Approach for Hadoop
IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie
More informationDistributed Systems CS6421
Distributed Systems CS6421 Intro to Distributed Systems and the Cloud Prof. Tim Wood v I teach: Software Engineering, Operating Systems, Sr. Design I like: distributed systems, networks, building cool
More informationRESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury
RESTORE: REUSING RESULTS OF MAPREDUCE JOBS Presented by: Ahmed Elbagoury Outline Background & Motivation What is Restore? Types of Result Reuse System Architecture Experiments Conclusion Discussion Background
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationBIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE
BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest
More informationCATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING
CATEGORIZATION OF THE DOCUMENTS BY USING MACHINE LEARNING Amol Jagtap ME Computer Engineering, AISSMS COE Pune, India Email: 1 amol.jagtap55@gmail.com Abstract Machine learning is a scientific discipline
More information