HPC in Cloud. Presenter: Naresh K. Sehgal Contributors: Billy Cox, John M. Acken, Sohum Sohoni

Similar documents
CIT 668: System Architecture. Amazon Web Services

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

Introduction. Architecture Overview

High Cost of ESL Design

Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands

Paperspace. Architecture Overview. 20 Jay St. Suite 312 Brooklyn, NY Technical Whitepaper

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Next Generation Storage for The Software-Defned World

Introduction to Cloud Computing

OpenNebula on VMware: Cloud Reference Architecture

The vsphere 6.0 Advantages Over Hyper- V

Cloud Connect. Gain highly secure, performance-optimized access to third-party public and private cloud providers

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

Héctor Fernández and G. Pierre Vrije Universiteit Amsterdam

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE

Pocket: Elastic Ephemeral Storage for Serverless Analytics

Mission-Critical Databases in the Cloud. Oracle RAC in Microsoft Azure Enabled by FlashGrid Software.

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

The Google File System

Certified Reference Design for VMware Cloud Providers

Google File System. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google fall DIP Heerak lim, Donghun Koo

Agenda. This Session: Azure Networking Basics, On-prem connectivity options DEMO Create VNET/Gateway Cost-estimation for VNET/Gateways

Magellan Project. Jeff Broughton NERSC Systems Department Head October 7, 2009

Data Centers and Cloud Computing

Data Centers and Cloud Computing. Slides courtesy of Tim Wood

TPC-E testing of Microsoft SQL Server 2016 on Dell EMC PowerEdge R830 Server and Dell EMC SC9000 Storage

Enabling NVMe I/O Scale

Data Centers and Cloud Computing. Data Centers

Lecture 7: Data Center Networks

Cisco Tetration Analytics

Windows Server 2012 Hands- On Camp. Learn What s Hot and New in Windows Server 2012!

Nested Virtualization and Server Consolidation

VVD for Cloud Providers: Scale and Performance Guidelines. October 2018

WEBSITE & CLOUD PERFORMANCE ANALYSIS. Evaluating Cloud Performance for Web Site Hosting Requirements

Distributed Filesystem

THE DEFINITIVE GUIDE FOR AWS CLOUD EC2 FAMILIES

EMC Business Continuity for Microsoft Applications

Data Services. Reliable, high-speed data connectivity. Group Ltd

Cross-layer Optimization for Virtual Machine Resource Management

Deadline Guaranteed Service for Multi- Tenant Cloud Storage Guoxin Liu and Haiying Shen

Consolidating Complementary VMs with Spatial/Temporalawareness

Application-Specific Configuration Selection in the Cloud: Impact of Provider Policy and Potential of Systematic Testing

MySQL In the Cloud. Migration, Best Practices, High Availability, Scaling. Peter Zaitsev CEO Los Angeles MySQL Meetup June 12 th, 2017.

CSE 124: THE DATACENTER AS A COMPUTER. George Porter November 20 and 22, 2017

IaaS. IaaS. Virtual Server

Forget IOPS: A Proper Way to Characterize & Test Storage Performance Peter Murray SwiftTest

Data Services. Reliable, high-speed data connectivity

Title DC Automation: It s a MARVEL!

the Edge. Vinod Eswaraprasad Wipro Technologies Storage Developer Conference. Insert Your Company Name. All Rights Reserved.

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Shen, Tang, Yang, and Chu

Cloud Computing. What is cloud computing. CS 537 Fall 2017

Data Services. Reliable, high-speed data connectivity

Amazon Elastic File System

Architecting Applications to Scale in the Cloud

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

The Google File System

What s New in VMware vsphere 4.1 Performance. VMware vsphere 4.1

IaaS. IaaS. Virtual Server

FlashGrid Software Enables Converged and Hyper-Converged Appliances for Oracle* RAC

Spark and Flink running scalable in Kubernetes Frank Conrad

The Google File System

Are You Insured Against Your Noisy Neighbor Sunku Ranganath, Intel Corporation Sridhar Rao, Spirent Communications

CSE 124: Networked Services Lecture-16

Efficient On-Demand Operations in Distributed Infrastructures

CS 655 Advanced Topics in Distributed Systems

Beyond 1001 Dedicated Data Service Instances

Towards Deadline Guaranteed Cloud Storage Services Guoxin Liu, Haiying Shen, and Lei Yu

CLOUD-SCALE FILE SYSTEMS

Fast and Easy Persistent Storage for Docker* Containers with Storidge and Intel

Application Provisioning

SoftNAS Cloud Performance Evaluation on AWS

Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson, Nelson Araujo, Dennis Gannon, Wei Lu, and

Doubling Performance in Amazon Web Services Cloud Using InfoScale Enterprise

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Amazon AWS-Solution-Architect-Associate Exam

Running Databases in Containers.

IaaS. IaaS. Virtual Server

Reliable, fast data connectivity

Building Service Platforms using OpenStack and CEPH: A University Cloud at Humboldt University

RACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING

A10 HARMONY CONTROLLER

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CACHE ME IF YOU CAN! GETTING STARTED WITH AMAZON ELASTICACHE. AWS Charlotte Meetup / Charlotte Cloud Computing Meetup Bilal Soylu October 2013

Lecture 09: VMs and VCS head in the clouds

Next-Generation Cloud Platform

VMware vcloud Air User's Guide

IBM Education Assistance for z/os V2R2

Pulse Secure Application Delivery

IaaS. IaaS. Virtual Server

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Network-Aware Resource Allocation in Distributed Clouds

Amazon Web Services Cloud Computing in Action. Jeff Barr

NPTEL Course Jan K. Gopinath Indian Institute of Science

Provisioning IT at the Speed of Need with Microsoft Azure. Presented by Mark Gordon and Larry Kuhn Hashtag: #HAND5

Elastic Efficient Execution of Varied Containers. Sharma Podila Nov 7th 2016, QCon San Francisco

SOFT CONTAINER TOWARDS 100% RESOURCE UTILIZATION ACCELA ZHAO, LAYNE PENG

What is Cloud Computing? Cloud computing is the dynamic delivery of IT resources and capabilities as a Service over the Internet.

1V0-602.exam. Number: 1V0-602 Passing Score: 800 Time Limit: 120 min. Vmware 1V VMware Certified Associate 6 Hybrid Cloud Fundamentals

SoftNAS Cloud Performance Evaluation on Microsoft Azure

Transcription:

HPC in Cloud Presenter: Naresh K. Sehgal Contributors: Billy Cox, John M. Acken, Sohum Sohoni

2 Agenda What is HPC? Problem Statement(s) Cloud Workload Characterization Translation from High Level Issues and dealing with them Next Steps and Summary

What is HPC? Means Different Things to Different Customers Lockheed-Martin F-22A Raptor, Mach 2 1 passenger up to 5,000 miles Costs US$150 million High Speed RAF Typhoon, Mach 2 1 passenger up to 3,000 miles cost $130 million High speed at lesser cost Boeing 737, Mach 0.74 200 passengers up to 3,000 nautical miles Cost US$84.4 million More load at less cost 3 Airbus A380, Mach 0.86 top speed 853 passengers up to 9,600 miles Cost US$389.9 million Delivers more load over larger distance

4 HPC in Cloud Problem Statement(s) No agreed upon definitions of Cloud workload categories Difficulties in matching Customers requirements with available resources Problems in meeting SLA (Service-level agreements) Uncertain QoS (Quality of Service) for Cloudy Apps Resource usage planning for CSPs (Cloud Service Providers)

5 Computing Resource for WL categories Workload Category User View or Example Providers Limiting Resources Level of cloud relevance: How cloud heavy is this category? Big Streaming Data Netflix Network Bandwidth Heavy Big Database Creation and Calculation Google, US census Persistent Storage, Computational Capability, Caching Heavy Big Database Search and Access US census, Google, online shopping, online reservations Persistent Storage, Network, Caching Heavy Big Data Storage Rackspace, Softlayer Livedrive, Zip cloud Sugarsync, MyPC Persistent Storage, Caching Heavy In Memory Database Redis, SAP Main Memory Size, Caching Medium Many Tiny Tasks (Ants) Simple games, word or phrase translators, dictionary Network, Many processors Heavy High Performance Computing (HPC) Computer Aided Engineering, molecular modeling, genome analysis, and numerical modeling Processor assignment and computational capability Heavy Highly Interactive Single Person Terminal access, server administration Network (latency) Some Highly Interactive Multi-Person Jobs Collaborative online environment, e.g. Google Docs Network (latency) Medium Single Computer Intensive Jobs EDA tools (logic simulation, circuit simulation, board layout) Computational capability None Private Local Tasks Offline tasks Persistent Storage None Slow Communication E-mail, blog Network Some Real-Time Local Tasks Any Home Security System Network None Location aware computing Travel guidance Local input hardware ports added for security Real-Time Geographically Dispersed Remote machinery or vehicle control Network Light now, but future Access Control PayPal Network Some, light Voice or video over IP Skype, SIP Network Light if any

EDA Cloud Workload Characterization Some used a set of fixed traces [1] Profiles of WLs based on machine utilization and wait times Others have evaluated HPC performance in Cloud [2] Tradeoffs in migrating HPC workloads to public clouds Evaluated the performance of a suite of benchmarks Spectrum of applications for a typical HPC center Cloud EDA WL category may change with phases/steps of a HPC job [1] Q. Zhang, J. L. Hellerstein, and R. Boutaba, "Characterizing task usage shapes in Google s compute clusters," Proc. of Large-Scale Distributed Systems and Middleware (LADIS 2011), 2011. 6 [2] K. R. Jackson, L. Ramakrishnan, K. Muriki, S. Canon, S. Cholia, J. Shalf, H. J. Wasserman, and N. J. Wright, "Performance analysis of high performance computing applications on the amazon web services cloud," in Cloud Computing Technology and Science (CloudCom), 2010 IEEE Second International Conference on, 2010, pp. 159-168.

7 Provider Options for VMs Linux * cloud instances AWS 1 Rackspace 2 Type Standard Small Standard Medium Standard Large Standard Extra Large 2 nd Gen Standard Extra Large 2 nd Gen Standard Double Extra Large Micro High Memory Extra Large High Memory Double Extra Large High Memory Quadruple Extra Large High CPU Medium High CPU Extra Large Cluster Quadruple Extra Large Cluster Eight Extra Large High Memory Cluster Eight Extra Large High I/O Quadruple Extra Large High Storage Eight Extra Large RAM vcpu Disk Public Network Internal Network 512MB 1 20GB 20Mbps 40Mbps 1GB 1 40GB 30Mbps 60Mbps 2GB 2 80GB 60Mbps 120Mbps 4GB 2 160GB 100Mbps 200Mbps 8GB 4 320GB 150Mbps 300Mbps 15GB 6 620GB 200Mbps 400Mbps 30GB 8 1.2TB 300Mbps 600Mbps 1) From: http://aws.amazon.com/ec2/pricing/ 2) From: http://www.rackspace.com/cloud/servers/pricing/

8 Optimizer CDN Anatomy of a Typical Cloud App Cloud app architecture Firewall Firewall Load Balancer Load Balancer Web Server Web Server Web Server Database Database Database Database Cache Cache Cache +Cache = optimize accesses to the (slow) database Scale out at all levels for availability and capacity Web server scales based on demand Database and cache size is fixed at design time into shards Next: Apply architecture to a provider

Cloud Service Provider: Actions to Host VM App Owner For each VM: Reference design to VM Image VM instance metadata Type from catalog Implies SLA CSP location Credentials REST API Cloud Service Provider Scheduler Objective Scheduler actions: Convert implied SLA to SLO Build list of compatible hosts Sort host by fit Connect network Connect any mounted storage Assign VM to host and start VM Compute Network Storage Data VM Images compatible = able to run the image and has the capacity fit = metric on prioritizing hosts (spare capacity, proximity, etc.) 9 Next: Convert the SLA into an SLO

10 SLA to SLO Translations Example SLA Request: Standard Service Level Objective SLA = 1000 MIPS of CPU instructions, 50Mb read/write/sec 1 GBPS of Network speed <3 ms response time, 90% of requests 3TB backend data, growing to 20TB 99.9999% up time (~1 hr/year) Service Provider SLO Table Name Host CPU (G/s) Mem I/O DC I/O Standard 1.000 1G 60M 30M Medium 2.000 2G 100M 50M Large 4.000 8G 200M 100M Example only Translated values are used in VM placement to ensure SLA is met. SLO result: 1.000/1G/60M/30M

11 TXT VM C-1 VM C-4 VM C-7 VM C-23 VM C-12 TXT VM C-1 VM C-4 VM C-7 VM C-23 VM C-12 VM??? Issues: Multi-tenant Multi-tenant infrastructures allow sharing of common resources Storage VMM VMM vswitch Server vswitch Server Controller Network Network traffic is multiplexed by the vswitch Virtualization provides the compute isolation Intentional bad guys may be in the vm next to you Unintentional noisy neighbor may slow you down Shared infrastructure exposes new challenges

12 Issues: Performance Variations Small variation within same processor type Larger variation based on processor for same instance request. From: Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2 1% to 17% variation in performance (CPU) at same location for same instance type From: Cloud Harmony, Mar 14 Pi day, Aggregate CPU performance metric Cost effectiveness of instances can vary significantly.

Response Time Issues: Performance Performance limit Excursion normal range 13 # concurrent users Assume: service has been operating within parameters Question: What caused the excursion? Consider: Shared exclusion locks JVM garbage collection Database contention From the infrastructure (invisible) Overloaded compute Noisy neighbor VM Overloaded network is hosted on

Dynamic Monitoring and Resource Allocation Proposed HPC Cloud Solution: A cluster augmented with low level measurements local measurements to determine resource contention A robust aggregation framework to determine the cluster level efficacy Research Questions: What is the extent of resource contention in various shared cluster setting (e.g. HPC applications running in the cloud)? Is there contention to warrant dynamic real-time monitoring? How often should these measurements be taken? sampling overhead, network bandwidth for data aggregation etc. How these measurements drive smarter resource allocation? how do low level metrics measurements relate to and indicate which resource is the limiting one? 14

15 Next Steps and Summary Use cloud design principles for HPC apps Scale-out, assume failures, wide variations in loading Local Excursions in response times Define SLAs and translate to SLOs Measure, fail, analyze, adjust, repeat Find best platform/cost for your HPC Apps Test for scalability, for failures Expect multitenant environments Isolation, trust Optimize for performance and availability of EDA apps