Performing Large Science Experiments on Azure: Pitfalls and Solutions

Performing Large Science Experiments on Azure: Pitfalls and Solutions
Wei Lu, Jared Jackson, Jaliya Ekanayake, Roger Barga, Nelson Araujo
Microsoft eXtreme Computing Group

Windows Azure
An application is built on three layers: Compute, Storage, and the Fabric.

Suggested Application Model
Use queues for reliable messaging between roles:
1) A Web Role (IIS, ASP.NET, WCF, etc.) receives the request
2) It puts work in the queue
3) A Worker Role gets work from the queue
4) The Worker Role does the work
This decouples the system, absorbs bursts, is resilient to instance failures, and is easy to scale: just add more instances of either role. A minimal sketch of the pattern follows.
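The talk's implementation was .NET roles; as a language-neutral illustration, here is a sketch of the same pattern with the current azure-storage-queue Python package. The connection string, queue name, and do_work() are hypothetical placeholders, not from the talk.

    # Sketch of the queue-based web/worker pattern above.
    from azure.storage.queue import QueueClient

    CONN_STR = "<storage-account-connection-string>"   # placeholder
    queue = QueueClient.from_connection_string(CONN_STR, "work-items")

    def do_work(payload: str) -> None:
        ...   # the actual task, e.g. run one BLAST partition

    # Web role side, step 2: put work in the queue.
    def submit(task_id: str) -> None:
        queue.send_message(task_id)

    # Worker role side, steps 3-4: get work, do work, then delete.
    def worker_loop() -> None:
        while True:
            for msg in queue.receive_messages(visibility_timeout=600):
                do_work(msg.content)        # step 4: do the work
                queue.delete_message(msg)   # ack only after success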

Azure Queue
- A communication channel between instances
- Messages in the queue are reliable and durable, with a 7-day lifetime
- Fault-tolerance mechanism: a de-queued message becomes visible again after its visibilitytimeout if it has not been deleted (2-hour maximum)
- This at-least-once delivery requires idempotent processing, as sketched below
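A sketch of why processing must be idempotent: if a worker dies, or runs past the visibility timeout, before deleting a message, the message reappears and another instance re-processes it. The helpers output_exists(), run_task(), and save_output() are hypothetical placeholders.

    # Idempotent task processing under at-least-once delivery.
    def handle(queue, msg) -> None:
        task_id = msg.content
        if output_exists(task_id):        # an earlier attempt finished
            queue.delete_message(msg)     # just acknowledge the duplicate
            return
        result = run_task(task_id)
        save_output(task_id, result)      # persist before the ack...
        queue.delete_message(msg)         # ...so a crash here is harmless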

AzureBLAST
BLAST (Basic Local Alignment Search Tool) is the most important software in bioinformatics: it identifies similarity between biological sequences.
- BLAST is highly computation-intensive, performing large numbers of pairwise alignment operations
- The size of sequence databases has been growing exponentially
Two conventional choices for running large BLAST jobs:
- Build a local cluster
- Submit jobs to NCBI or EBI, and accept long job-queuing times
BLAST is easy to parallelize by query segmentation: a splitting task fans the query out to many BLAST tasks, and a merging task combines their outputs (see the sketch below).
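A sketch of query segmentation under stated assumptions: queries arrive as FASTA text, and the partition size is a hypothetical tuning knob, not a value from the talk.

    # Split the query set into partitions (one BLAST task each),
    # then merge the per-partition outputs.
    def split_queries(fasta_lines, queries_per_partition=1000):
        """Yield lists of FASTA lines, one list per BLAST task."""
        partition, count = [], 0
        for line in fasta_lines:
            if line.startswith(">"):          # new sequence header
                count += 1
                if count > queries_per_partition and partition:
                    yield partition
                    partition, count = [], 1
            partition.append(line)
        if partition:
            yield partition

    def merge_outputs(output_files, merged_path):
        """Merging task: concatenate per-partition BLAST outputs."""
        with open(merged_path, "w") as out:
            for path in output_files:
                with open(path) as f:
                    out.write(f.read())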

AzureBLAST Architecture
- Job Management Role: web portal, web service, job registration, job scheduler
- A global dispatch queue feeding the Worker instances
- Database Updating Role: keeps the NCBI databases current
- Azure Table: the job registry
- Azure Blob: BLAST databases, temporary data, etc.

The All-by-All BLAST Experiment
- An all-by-all query compares the database against itself
- Goal: discover homologs, the inter-relationships among known protein sequences
- Large protein database: 4.2 GB, 9,865,668 sequences in total
- In theory, 100 billion sequence comparisons!
- Performance estimation put the work at 14 CPU-years
- One of the biggest BLAST jobs we are aware of

Our Solution
- Allocated 3,776 weighted instances (475 extra-large instances) across three datacenters: US South Central, West Europe, and North Europe
- Divided the ~10 million sequences into several segments, each submitted to one datacenter as one job
- Each segment consists of smaller partitions
- The job took two weeks; the total size of all outputs is ~230 GB

Understanding Azure by Analyzing Logs
A normal log stream alternates "Executing" and "done" records for each task:
  3/31/2010 6:14  RD00155D3611B0  Executing the task 251523...
  3/31/2010 6:25  RD00155D3611B0  Execution of task 251523 is done, it takes 10.9 mins
  3/31/2010 6:25  RD00155D3611B0  Executing the task 251553...
  3/31/2010 6:44  RD00155D3611B0  Execution of task 251553 is done, it takes 19.3 mins
  3/31/2010 6:44  RD00155D3611B0  Executing the task 251600...
  3/31/2010 7:02  RD00155D3611B0  Execution of task 251600 is done, it takes 17.27 mins
Otherwise, something is wrong (e.g., a lost task):
  3/31/2010 8:22  RD00155D3611B0  Executing the task 251774...   (no matching "done" record)
  3/31/2010 9:50  RD00155D3611B0  Executing the task 251895...
  3/31/2010 11:12 RD00155D3611B0  Execution of task 251895 is done, it takes 82 mins
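A sketch of detecting lost tasks from such logs: any "Executing" record without a matching "done" record on the same node never finished. The regexes match the log format shown above; log_lines is a hypothetical input.

    import re

    EXEC = re.compile(r"(\S+ \S+)\s+(\S+)\s+Executing the task (\d+)")
    DONE = re.compile(r"(\S+ \S+)\s+(\S+)\s+Execution of task (\d+) is done")

    def find_lost_tasks(log_lines):
        in_flight = {}                        # (node, task) -> start time
        for line in log_lines:
            if m := EXEC.search(line):
                ts, node, task = m.groups()
                in_flight[(node, task)] = ts
            elif m := DONE.search(line):
                ts, node, task = m.groups()
                in_flight.pop((node, task), None)
        return in_flight                      # whatever remains never finished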

Challenges & Pitfalls
- Failures
- Instance idle time
- Limitations of the current Azure Queue
- Performance/cost estimation
- Minimizing the need for programming

Case Study 1
North Europe datacenter, 34,265 tasks processed in total.
- A node replacement occurred: avoid relying on machine names in your program
- It caused almost a one-day delay: avoid orchestrating instances with tight synchronization (e.g., barriers)

Case Study 2
North Europe datacenter, 34,256 tasks processed in total.
- All 62 nodes lost tasks and then came back, group by group
- This is the Update Domain at work: ~6 nodes per group, ~30 minutes per group

Case Study 3
West Europe datacenter; 30,976 tasks completed before the job was killed.
- 35 nodes experienced blob-write failures at the same time
- A reasonable guess: the Fault Domain at work

Challenges & Pitfalls
- Failures
  - Failures are to be expected, yet unpredictable
  - Design with failure in mind
  - Most are recovered automatically by the cloud
- Instance idle time
- Limitations of the current Azure Queue
- Performance/cost estimation
- Minimizing the need for programming

Challenges & Pitfalls
- Failures
- Instance idle time
  - Gap time between two jobs
  - Diversity of the workload
  - Load imbalance
- Limitations of the current Azure Queue
- Performance/cost estimation
- Minimizing the need for programming

Load Imbalance
North Europe datacenter, 2,058 tasks.
- Two days of very low system throughput, caused by a few long-tail tasks
- Task 56823 needed 8 hours to complete; it was re-executed by 8 different nodes because of the 2-hour maximum value of a message's visibilitytimeout

Challenges & Pitfalls
- Failures
- Instance idle time
- Limitations of the current Azure Queue
  - 2-hour maximum visibilitytimeout: each individual task has to finish within 2 hours
  - 7-day maximum message lifetime: the entire experiment has to finish in less than 7 days
- Performance/cost estimation
- Minimizing the need for programming

Challenges & Pitfalls
- Failures
- Instance idle time
- Limitations of the current Azure Queue
- Performance/cost estimation
  - The better you understand your application, the more money you can save
  - BLAST alone has about 20 arguments, and the choice of VM size adds another dimension
- Minimizing the need for programming

Cirrus: A Parameter-Sweeping Service on Azure
- Job Manager Role: web portal, web service, job registration, job scheduler, scaling engine, parametric engine, sampling filter
- A dispatch queue feeding the Worker instances
- Azure Table and Azure Blob for state and data

Declarative Job Definition
- Derived from Nimrod
- Each job can have: prolog commands, parameters, Azure-related operators (AzureCopy, AzureMount, SelectBlobs), and a job configuration
- Goal: minimize the programming needed to run legacy binaries on Azure (BLAST, Bayesian networks, machine learning, image rendering)
Example job definition:

    <job name="blast">
      <prolog>
        azurecopy http://.../uniref.fasta uniref.fasta
      </prolog>
      <cmd>
        azurecopy %partition% input
        blastall.exe -p blastp -d uniref.fasta -i input -o output
        azurecopy output %partition%.out
      </cmd>
      <parameter name="partition">
        <selectblobs>
          <prefix>partitions/</prefix>
        </selectblobs>
      </parameter>
      <configure>
        <mininstances>2</mininstances>
        <maxinstances>4</maxinstances>
        <shutdownwhendone>true</shutdownwhendone>
        <sampling>true</sampling>
      </configure>
    </job>

Dynamic Scaling
- Scale in/out per individual job, staying inside the [min, max] window specified in the job configuration
- Synchronous scaling: tasks are dispatched only after the scaling operation is done
- Asynchronous scaling: task execution and the scaling operation proceed simultaneously
- Scale in when load imbalance appears, when no new jobs arrive for a period of time, or when the job is configured as shutdown-when-done (usually used for the reducing job)
A sketch of the scaling decision follows.
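A minimal sketch of a per-job scaling decision that follows the best practice established later in the talk: scale out asynchronously, scale in synchronously. Job, pending_tasks(), and the management calls are hypothetical placeholders, not the Cirrus API.

    def desired_instances(job) -> int:
        # Clamp the demand into the job's [min, max] window.
        return min(job.max_instances,
                   max(job.min_instances, pending_tasks(job)))

    def rescale(job, current: int) -> None:
        target = desired_instances(job)
        if target > current:
            # Scale out asynchronously: existing instances keep
            # working while new ones take 20-80 minutes to join.
            request_instances_async(job, target)
        elif target < current:
            # Scale in synchronously: drain dispatch first, since
            # Azure picks the instances to shut down at random.
            pause_dispatch(job)
            shrink_instances(job, target)
            resume_dispatch(job)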

Job Pause-Reconfigure-Resume
- Each job maintains a task status table
- Checkpoint by snapshotting the task table; a task may be recorded as incomplete
- Works around the 7-day / 2-hour queue limitations
- Handles exceptions optimistically: ignore the exceptions, retry incomplete tasks with a reduced number of instances, and minimize the cost of failures
- Also handles load imbalance
A sketch of the checkpoint/resume pattern follows.
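A sketch of the idea under stated assumptions: task state lives in a table (here an in-memory dict stands in for an Azure Table), so the queue can be drained and later re-populated with only the incomplete tasks, resetting the queue's lifetime and visibility clocks.

    task_table = {}   # task_id -> "pending" | "done"

    def checkpoint() -> dict:
        """Pause: snapshot the task table. In-flight work may be
        lost, but completed tasks stay marked done."""
        return dict(task_table)

    def resume(snapshot: dict, queue) -> None:
        """Resume: re-enqueue only tasks not yet done, restarting
        the 7-day message lifetime and 2-hour visibility clock."""
        task_table.update(snapshot)
        for task_id, state in snapshot.items():
            if state != "done":
                queue.send_message(task_id)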

Performance Estimation by Sampling
An observation-based approach:
- Randomly sample the parameter space with sampling ratio a
- Dispatch only the sampled tasks, scaling in to just n instances to save cost
- Assuming tasks are uniformly distributed, scale the sampled cost up to estimate the whole job (a sketch follows)
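The exact formula on the slide did not survive transcription; the following is the standard uniform-sampling estimator (scale the sample's cost by 1/a), which is consistent with the evaluation numbers below. run_and_time() is a hypothetical helper that executes one task and returns its duration in seconds.

    import random

    def estimate_total_time(task_ids, a=0.02):
        """Estimate total CPU time from a random a-fraction sample."""
        sample = random.sample(task_ids, max(1, int(a * len(task_ids))))
        sampled_time = sum(run_and_time(t) for t in sample)
        return sampled_time / a   # scale up under uniformity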

Evaluation: Sampling
A complete BLAST run takes 2 hours on 16 instances (32 instance-hours). A 2%-sampling run that achieves 96% estimation accuracy takes only about 18 minutes on 2 instances (0.6 instance-hours), so the sampling run costs only 1.8% of the complete run.

Evaluation: Scaling
Scaling out:
- A synchronous operation stalled all instances for 80 minutes
- An asynchronous operation let existing instances keep working while new instances took 20-80 minutes to join; the 16-instance run was 1.4x faster
Scaling in:
- A synchronous operation finished in 3 minutes
- An asynchronous operation caused random message loss (Azure picks the instances to shut down at random), which may lead to more instance idle time
Best practices: scale out asynchronously; scale in synchronously.

Conclusion
Running a large-scale parameter-sweeping experiment on Azure, we identified these pitfalls and practices:
- Design with failure in mind (most failures are recoverable)
- Watch out for instance idle time
- Understand your application to save cost
- Minimize the need for programming
Our parameter-sweeping solution, Cirrus, provides:
- Declarative job definitions
- Dynamic scaling and the job pause-reconfigure-resume pattern
- Performance estimation by sampling