Azure MapReduce. Thilina Gunarathne Salsa group, Indiana University

Similar documents
Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications

Introduction to. Amazon Web Services. Thilina Gunarathne Salsa Group, Indiana University. With contributions from Saliya Ekanayake.

Y790 Report for 2009 Fall and 2010 Spring Semesters

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications

Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure

Scalable Parallel Scientific Computing Using Twister4Azure

SCALABLE PARALLEL COMPUTING ON CLOUDS:

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Chapter 5. The MapReduce Programming Model and Implementation

Towards a Collective Layer in the Big Data Stack

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Abstract 1. Introduction

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Computing: MapReduce Jin, Hai

Applying Twister to Scientific Applications

Introduction to Hadoop and MapReduce

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Abstract 1. Introduction

Hybrid MapReduce Workflow. Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US

CS6030 Cloud Computing. Acknowledgements. Today s Topics. Intro to Cloud Computing 10/20/15. Ajay Gupta, WMU-CS. WiSe Lab

Clouds and MapReduce for Scientific Applications

Overview. Why MapReduce? What is MapReduce? The Hadoop Distributed File System Cloudera, Inc.

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Developing Microsoft Azure Solutions: Course Agenda

More AWS, Serverless Computing and Cloud Research

Course Outline. Lesson 2, Azure Portals, describes the two current portals that are available for managing Azure subscriptions and services.

The MapReduce Framework

ARCHITECTURE AND PERFORMANCE OF RUNTIME ENVIRONMENTS FOR DATA INTENSIVE SCALABLE COMPUTING. Jaliya Ekanayake

MapReduce for Data Intensive Scientific Analyses

Clustering Lecture 8: MapReduce

Improving the MapReduce Big Data Processing Framework

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Lecture 11 Hadoop & Spark

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014

A Survey on Parallel Rough Set Based Knowledge Acquisition Using MapReduce from Big Data

Dynamic Cluster Configuration Algorithm in MapReduce Cloud

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Challenges for Data Driven Systems

Course Outline. Developing Microsoft Azure Solutions Course 20532C: 4 days Instructor Led

Large-Scale GPU programming

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

DryadLINQ for Scientific Analyses

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Map Reduce Group Meeting

The Datacenter Needs an Operating System

L22: SC Report, Map Reduce

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

ABSTRACT I. INTRODUCTION

Hadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017

MapReduce, Hadoop and Spark. Bompotas Agorakis

Shark: Hive on Spark

Genetic Algorithms with Mapreduce Runtimes

CSE 444: Database Internals. Lecture 23 Spark

Windows Azure Services - At Different Levels

Batch Inherence of Map Reduce Framework

Sequence Clustering Tools

Collective Communication Patterns for Iterative MapReduce

Cloud, Big Data & Linear Algebra

MPJ Express Meets YARN: Towards Java HPC on Hadoop Systems

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Towards Scalable Data Management for Map-Reduce-based Data-Intensive Applications on Cloud and Hybrid Infrastructures

What Is Datacenter (Warehouse) Computing. Distributed and Parallel Technology. Datacenter Computing Architecture

Welcome to the New Era of Cloud Computing

MapReduce and Hadoop

Outline. CS-562 Introduction to data analysis using Apache Spark

CompSci 516: Database Systems

The MapReduce Abstraction

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Developing Enterprise Cloud Solutions with Azure

Distributed Filesystem

Efficient Dynamic Resource Allocation Using Nephele in a Cloud Environment

Embedded Technosolutions

Techno Expert Solutions

Google File System (GFS) and Hadoop Distributed File System (HDFS)

Developing Microsoft Azure Solutions (MS 20532)

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Towards a Real- time Processing Pipeline: Running Apache Flink on AWS

Integrate MATLAB Analytics into Enterprise Applications

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Locality Aware Fair Scheduling for Hammr

CPET 581 Cloud Computing: Technologies and Enterprise IT Strategies

Virtual Appliances and Education in FutureGrid. Dr. Renato Figueiredo ACIS Lab - University of Florida

Summary of Big Data Frameworks Course 2015 Professor Sasu Tarkoma

Survey on MapReduce Scheduling Algorithms

Cloud Computing & Visualization

Pivoting Entity-Attribute-Value data Using MapReduce for Bulk Extraction

Data Intensive Scalable Computing. Thanks to: Randal E. Bryant Carnegie Mellon University

Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson, Nelson Araujo, Dennis Gannon, Wei Lu, and

Advanced Database Technologies NoSQL: Not only SQL

Towards a next generation of scientific computing in the Cloud

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

CS 61C: Great Ideas in Computer Architecture (Machine Structures) Warehouse-Scale Computing, MapReduce, and Spark

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

Batch Processing Basic architecture

Applicability of DryadLINQ to Scientific Applications

Efficient Map Reduce Model with Hadoop Framework for Data Processing

Transcription:

Azure MapReduce Thilina Gunarathne Salsa group, Indiana University

Agenda Recap of Azure Cloud Services Recap of MapReduce Azure MapReduce Architecture Application development using AzureMR Pairwise distance alignment implementation Next steps

Cloud Computing On demand computational services over web Backed by massive commercial infrastructures giving economies of scale Spiky compute needs of the scientists Horizontal scaling with no additional cost Increased throughput Cloud infrastructure services Storage, messaging, tabular storage Cloud oriented services guarantees Virtually unlimited scalability Future seems to be CLOUDY!!!

Azure Platform Windows Azure Compute.net platform as a service Worker roles & web roles Azure Storage Blobs Queues Table Development SDK, fabric and storage

MapReduce Automatic parallelization & distribution Fault-tolerant Provides status and monitoring tools Clean abstraction for programmers map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) -> out_value list

Motivation Currently no parallel programming framework on Azure No MPI, No Dryad Well known, easy to use programming model Cloud nodes are not as reliable as conventional cluster nodes

Azure MapReduce Concepts Take advantage of the cloud services Distributed services, Unlimited scalability Backed by industrial strength data centers and technologies Decentralized control Dynamically scale up/down Eventual consistency Large latencies Coarser grained map tasks Global queue based scheduling

1 1.Client driver loads the map & reduce tasks to the queues

2. Map workers retrieve map tasks from the queue 2

3. Map workers download data from the Blob storage and start processing 3

4. Reduce workers pick the tasks from the queue and start monitoring the reduce task tables 4

5. Finished map tasks upload the results to Blob storage. Add entries to the respective reduce task tables. 5

6. Reduce tasks download the intermediate data products 6

7. Start reducing when all the map tasks are finished and when a reduce task is finished downloading the intermediate data products 7

Azure MapReduce Architecture Client API and driver Map tasks Reduce tasks Intermediate data transfer Monitoring Configurations

Fault tolerance Use the visibility timeout of the queues Currently maximum is 3 hours Delete the message from the queue only after everything is successful Execution, upload, update status Tasks will rerun when timeout happens Ensures eventual completion Intermediate data are persisted in blob storage Retry up to 3 times Many retries in service invocations

Programming Model Data Handling Scheduling Apache Hadoop [24] /(Google MR) MapReduce HDFS (Hadoop Distributed File System) Data Locality; Rack aware, Dynamic task scheduling through global queue Failure Handling Re-execution of failed tasks; Duplicate execution of slow tasks Environment Linux Clusters, Amazon Elastic Map Reduce on EC2 Intermediate data transfer File, Http Microsoft Dryad [25] Twister [19] Azure Map Reduce/Twister DAG execution, Extensible to MapReduce and other patterns Shared Directories & local disks Data locality; Network topology based run time graph optimizations; Static task partitions Re-execution of failed tasks; Duplicate execution of slow tasks Windows HPCS cluster File, TCP pipes, sharedmemory FIFOs Iterative MapReduce Local disks and data management tools Data Locality; Static task partitions Re-execution of Iterations Linux Cluster EC2 Publish/Subscribe messaging MapReduce-- will extend to Iterative MapReduce Azure Blob Storage Dynamic task scheduling through global queue Re-execution of failed tasks; Duplicate execution of slow tasks Window Azure Compute, Windows Azure Local Development Fabric Files, TCP

Why Azure Services No need to install software stacks In fact you can t Eg: NaradaBrokering, HDFS, Database Virtually unlimited scalable distributed services Zero maintenance Let the platform take care of you No single point of failures Availability guarantees Ease of development

API ProcessMapRed(jobid, container, params, numreducetasks, storageaccount, mapqname, reduceqname,list maptasks) Map(key, value, programargs, Dictionary outputcollector) Reduce(key, List values, programargs, Dictionary outputcollector)

Develop applications using Azure MapReduce Local debugging using Azure development fabric DistributedCache capability Bundle with Azure Package Compile in release mode before creating the package. Deploy using Azure web interface Errors logged to a Azure Table

SWG Pairwise Distance Alignment SmithWaterman-GOTOH Pairwise sequence alignment Align each sequence with all the other sequences

Application architecture 1 (1-100) Block decomposition 2 (101-200) 3 (201-300) 4 (301-400) 1 (1-100) M1 M2 from M6 M3 Reduce 1 2 (101-200) from M2 M4 M5 from M9 Reduce 2 3 (201-300) M6 from M5 M7 M8 Reduce 3 4 (301-400) from M3 M9 from M8 M10 Reduce 4

AzureMR SWG Performance 10k Sequences Execution Time (s) 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 0 32 64 96 128 160 Number of Azure Small Instances Execution Time(s)

AzureMR SWG Performance 10k Sequences 7 Alignment Time (ms) 6 5 4 3 2 1 0 Time Per Alignment Per Instance 0 32 64 96 128 160 Number of Azure Small Instances

AzureMR SWG Performance on Different Instance Types Execution Time (s) 700 600 500 400 300 200 100 0 Small Medium Large ExtraLarge Instance Type Execution Time

AzureMR SWG Performance on Different Data Sizes Time for an Actual Aligement (ms) 8 7 6 5 4 Time Per Alignment Per Core (ms) 3 2 1 0 4000 5000 6000 7000 8000 9000 10000 Number of Sequences

Next Steps In the works Monitoring web interface Alternative intermediate data communication mechanisms Public release Future plans AzureTwister Iterative MapReduce

Questions? Thanks!!

References J. Dean, and S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM, vol. 51, no. 1, pp. 107-113., 2008. J.Ekanayake, H.Li, B.Zhang et al., Twister: A Runtime for iterative MapReduce, in Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference June 20-25, 2010, Chicago, Illinois, 2010. Cloudmapreduce, http://sites.google.com/site/huanliu/cloudmapreduce.pdf "Apache Hadoop," http://hadoop.apache.org/ M. Isard, M. Budiu, Y. Yu et al., "Dryad: Distributed dataparallel programs from sequential building blocks." pp. 59-72.

Acknowledgments Prof. Geoffrey Fox, Dr. Judy Qiu and the Salsa group Dr. Ying Chen and Alex De Luca from IBM Almaden Research Center Virtual School Organizers