A Comparative Study of High Performance Computing on the Cloud. Lots of authors, including Xin Yuan. Presentation by: Carlos Sanchez


What is The Cloud? The cloud is just a bunch of computers connected over a network (usually Ethernet with internet protocols) which provides a service. Cloud networks usually have a layer of virtualization. When you connect to a cloud network, you're connecting to a virtual machine, which could have been dynamically allocated on any of the cloud's machines. Virtualization allows cloud networks to be more flexible in terms of location, size, etc. because the virtual hardware is abstracted from the physical hardware. The physical hardware can be changed/upgraded on the fly, but the virtual hardware (which users connect to) stays the same.

High Performance Computing Clusters Computational problems such as simulations which require a great amount of computation time and which are highly parallelizable are usually run on High Performance Computing (HPC) clusters. HPC clusters are dedicated networks of computers with specialized hardware whose main purpose is to compute as much as possible in the shortest amount of time. These clusters are generally used for research and paid for with research funding, so users can access these resources for free.

HPC Clusters VS. Cloud Computing HPC clusters are focused on maximum performance for a very specific service (complex computations). Cloud networks are focused on maximum utilization of resources for a wide variety of services (and profit). HPC clusters for research are generally free thanks to funding, but your jobs are placed in a queue to be scheduled at the next available time. Cloud networks provide a service for a fee, but that service is available on-demand.

Some Issues Addressed in the Paper HPC users believe that dedicated clusters are better than cloud platforms due to the communication overhead. The authors believe the difference is smaller than perceived and that total turnaround time and cost are better determining factors than a program's start and finish times. As research HPC clusters are free, a pricing model has been created in order to compare the price of the on-demand cloud vs. the HPC clusters. Performance between the cloud and the HPC clusters is also compared.

Which Cloud? Amazon's Elastic Compute Cloud (EC2) is used, as it is a highly successful and popular platform. However, in terms of high performance computing, its successes are mixed. Most successful high performance computing applications that were run on EC2 were embarrassingly parallel programs. Embarrassingly parallel programs are programs which require little or no effort to parallelize.
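To make the term concrete, here is a minimal sketch (not from the paper) of an embarrassingly parallel job: each worker runs an independent Monte Carlo estimate of pi, and the only communication is collecting the results at the end. The worker count and sample sizes are arbitrary.

```python
# Minimal sketch of an embarrassingly parallel job: each worker estimates pi
# from independent random samples; the only communication is gathering the
# final results. Workloads like this barely exercise the interconnect.
import random
from multiprocessing import Pool

def estimate_pi(samples: int) -> float:
    """Monte Carlo estimate of pi from `samples` random points."""
    hits = sum(1 for _ in range(samples)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

if __name__ == "__main__":
    with Pool(processes=8) as pool:            # 8 independent workers
        parts = pool.map(estimate_pi, [1_000_000] * 8)
    print(sum(parts) / len(parts))             # combine results once, at the end
```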

Why The Cloud Doesn't Work For HPC Because cloud platforms use Ethernet for interconnectivity, latency and bandwidth are an issue. Dedicated HPC clusters use InfiniBand. Virtualization causes computation overhead, and the relative location of virtual machines could be an issue. If the machines are virtual and can be anywhere, their physical proximity isn't necessarily close or within a small number of hops, which could cause even more communication issues.

Why The Cloud Could Work It isn't fair to compare a paid on-demand service which is available 24/7 to the traditional HPC cluster based only on execution time. Traditional HPC clusters can have a significant queue wait time, especially during peak usage and for jobs which require many nodes. The cloud is available on-demand. When comparing HPC clusters with the cloud, looking at total turnaround time and projected cost (should traditional HPC clusters stop being free) may show that the cloud is the way to go for tasks with certain computational, cost, and time requirements.

What The Paper Gives Us Evaluate Amazon's EC2 and 5 HPC clusters from the Lawrence Livermore National Laboratory (LLNL) along the traditional axis of execution time, using over 1000 cores. Develop a pricing model to evaluate (free) HPC clusters in node-hour prices based on system performance, resource availability, and user bias. Evaluate EC2 and the HPC clusters along the axes of total turnaround time and total cost.

EC2 Basics There are several EC2 instances with different computation and network capabilities. The paper chose the highest-end instance, cluster compute eight extra large, as it is intended for HPC applications. There are also several ways of purchasing time. The paper chose on-demand pricing, where the purchaser receives access to the machine immediately. By default, EC2 does not guarantee node proximity, but you can request it through a placement group. This was not used for the experiment. HPC systems that use batch scheduling can't guarantee physical proximity either.
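For reference only (the paper did not use a placement group), requesting proximity on EC2 looks roughly like the boto3 sketch below. The region, AMI ID, group name, and node count are placeholders, and cc2.8xlarge is simply the "cluster compute eight extra large" family the slide names.

```python
# Hypothetical sketch, not what the paper did: ask EC2 to co-locate instances
# via a cluster placement group. All identifiers below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# A "cluster" placement group requests low-latency proximity between instances.
ec2.create_placement_group(GroupName="hpc-demo-group", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-xxxxxxxx",                  # placeholder AMI
    InstanceType="cc2.8xlarge",              # the instance family named in the slide
    MinCount=4,
    MaxCount=4,
    Placement={"GroupName": "hpc-demo-group"},
)
```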

Quick Physical Specs Note: all clusters other than EC2 are machines at LLNL. udawn is a BlueGene/P system, and is one of the slowest machines. Cab and Sierra are newer machines, and are usually the highest performers.

Program Setup Core 0 is avoided due to the EC2 interrupt system. Only half of the cores (or the closest power of two) on a particular machine are used, to reduce the overhead of uneven communication. No hyperthreading on the Sandy Bridge processors. Strong scaling (all benchmark sizes are identical across different task counts). A wide variety of benchmarks is used; some are communication intensive, others are computation intensive.
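A small sketch of the task-placement rule described above, under the assumption that "closest power of two" means rounding down; the exact rounding and core pinning the authors used is not stated here.

```python
# Sketch of the task-placement rule: use half the cores on a node, rounded
# (here: down) to a power of two, and skip core 0, which EC2 tends to use for
# servicing interrupts. The rounding direction is an assumption.
def tasks_per_node(cores: int) -> int:
    half = max(1, cores // 2)
    p = 1
    while p * 2 <= half:        # largest power of two not exceeding half the cores
        p *= 2
    return p

def usable_cores(cores: int) -> list[int]:
    # Start at core 1 so core 0 is left free for interrupt handling.
    return list(range(1, 1 + tasks_per_node(cores)))

print(tasks_per_node(16), usable_cores(16))   # -> 8 [1, 2, 3, 4, 5, 6, 7, 8]
```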

Tested Network Latency Cab and Sierra have the best stats thanks to InfiniBand QDR. EC2 has throughput comparable with udawn, but the latency is far worse. The virtualization overhead couldn't be characterized, since there was no access to physical hardware identical to the virtual machines. Values for EC2 are consistent with other reported results.
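Latency numbers like these typically come from a ping-pong microbenchmark between two MPI ranks. The following is a minimal sketch of one, assuming mpi4py and NumPy are available; it is not the benchmark the authors ran.

```python
# Minimal ping-pong latency sketch (assumes mpi4py). Run with: mpirun -np 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = np.zeros(1, dtype="b")        # 1-byte message: exposes latency, not bandwidth
reps = 1000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(msg, dest=1, tag=0)
        comm.Recv(msg, source=1, tag=0)
    else:
        comm.Recv(msg, source=0, tag=0)
        comm.Send(msg, dest=0, tag=0)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # Each iteration is one round trip; half of that is the one-way latency.
    print(f"one-way latency ~ {elapsed / reps / 2 * 1e6:.2f} us")
```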

Tested Overall Performance [Charts: performance normalized to udawn (1 MPI task, computation only), and performance normalized to EC2 (1024 MPI tasks, execution time only).]

Overall Performance Cont. Results show that communication intensive benchmarks such as SMG2000 run quite a bit slower on EC2 than on the other machines (except udawn). However, benchmarks such as Sweep3D run better on EC2 than on any other system. Not only that, but for benchmarks such as LAMMPS, some machines perform better and some perform worse than EC2. On the grounds of program execution time alone, the choice of whether to use the cloud depends on the qualities of the application.

Queue Wait Times Average queue wait times must be known in order to measure turnaround time. However, wait times vary depending on the cluster, job size, and maximum run time. Estimating queue wait times isn't easy: the scheduling implementation isn't known, the wait time isn't necessarily linear with respect to the number of nodes or time requested, and cluster utilization fluctuates depending on current jobs.

Queue Wait Times Cont. Wait times are thus estimated by a simulator which is configured to work like the clusters. The simulator used job logs from a different cluster, taken in 2009, to simulate a proper workload. This workload has characteristics similar to the test clusters. To test the validity of the simulator, real queue wait times were collected for a few test jobs over a two-month period.
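The authors' simulator is not reproduced here, but the idea of replaying a job log to obtain wait times can be illustrated with a toy first-come-first-served sketch; the job log and node count below are made up.

```python
# Toy sketch (not the authors' simulator): replay a job log through a
# first-come-first-served scheduler on a fixed node count and record how long
# each job waits. Jobs are (submit_time, nodes, runtime) tuples in seconds.
import heapq

def simulate_waits(jobs, total_nodes):
    """Return per-job queue wait times for an FCFS schedule."""
    running = []                  # heap of (finish_time, nodes)
    free = total_nodes
    clock = 0.0
    waits = []
    for submit, nodes, runtime in sorted(jobs):
        clock = max(clock, submit)
        # Reclaim nodes from jobs that must finish before this one can start.
        while free < nodes and running:
            finish, n = heapq.heappop(running)
            clock = max(clock, finish)
            free += n
        waits.append(clock - submit)
        free -= nodes
        heapq.heappush(running, (clock + runtime, nodes))
    return waits

log = [(0, 4, 3600), (60, 8, 1800), (120, 8, 7200)]   # made-up jobs
print(simulate_waits(log, total_nodes=8))              # -> [0.0, 3540.0, 5280.0]
```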

Queue Wait Times Cont. The simulated times follow the same trend as the real wait times. Note that EC2 wait times were orders of magnitude smaller, with the highest being 244 seconds for 64 nodes.

Total Turnaround Time Now that the simulation gives a reliable distribution of queue wait times, it can be used to determine the total turnaround time relative to EC2. Remember, EC2 wait times are considerably lower than those of the HPC clusters because of the clusters' queuing. Tested using an MPI task set to run for 5 hours on EC2. Results are scaled relative to EC2's run time. Tested with 256, 512, and 1024 tasks.
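The bookkeeping behind the comparison is simple: total turnaround is queue wait plus execution time, reported relative to EC2. A sketch with made-up numbers:

```python
# Turnaround bookkeeping with made-up numbers: turnaround = wait + execution,
# reported relative to EC2. Only the EC2 244-second wait comes from the slides.
ec2 = {"wait_s": 244, "exec_s": 5 * 3600}                  # EC2: near-zero wait, 5-hour run
clusters = {
    "cluster_a": {"wait_s": 4 * 3600, "exec_s": 3 * 3600}, # placeholder values
    "cluster_b": {"wait_s": 1 * 3600, "exec_s": 9 * 3600},
}

ec2_turnaround = ec2["wait_s"] + ec2["exec_s"]
for name, c in clusters.items():
    turnaround = c["wait_s"] + c["exec_s"]
    print(f"{name}: {turnaround / ec2_turnaround:.2f}x EC2 turnaround")
```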

Total Turnaround Time Trends EC2 execution time is generally better at lower scales. However, as the number of tasks increases, EC2 is outstripped by the better-scaling high-end LLNL clusters. This makes sense, as EC2 has weaker communication. Turnaround time for the LLNL clusters is highly dependent on the queue wait time, as a higher task count causes longer turnaround times. This dependency can be seen in the fact that 1st-quartile (best case) queue wait times give the LLNL clusters the better turnaround time, while 4th-quartile (worst case) queue wait times give EC2 the advantage.

Graphical Turnaround Time 256 Tasks Notice that EC2 beats many of the dedicated clusters for low task counts.

Graphical Turnaround Time 512 Tasks Here, we see that machines like Cab and Sierra are sometimes right on the line with EC2, which means that the queue wait time will decide which is better.

Graphical Turnaround Time 1024 Tasks Here, we see that in many cases the better turnaround time depends entirely on the queue wait time. However, Cab and Sierra outright beat EC2 for LU and SMG2000, while EC2 still beats them all in Sweep3D.

Expected Amount of Time LLNL is Better Using QBETS (a time series estimation method), it can be determined how often LLNL beats EC2 in terms of total turnaround time. This was only done for 1024 tasks. The 0% instances mean EC2 is always better. Even with the fastest machine, the expectation ranges from 25-40%. This is further proof that queue wait times play a significant factor in total turnaround time.
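QBETS itself is not reimplemented here; the quantity being estimated can be sketched empirically as the fraction of sampled queue waits for which an LLNL cluster's wait plus run time beats EC2's turnaround. The wait distribution and times below are placeholders.

```python
# Not QBETS: an empirical sketch of the quantity being estimated, i.e. the
# fraction of sampled queue waits for which (wait + cluster run time) beats
# EC2's turnaround. All distributions and times below are placeholders.
import random

random.seed(0)
wait_samples = [random.expovariate(1 / 7200) for _ in range(10_000)]  # fake waits, mean 2 h
llnl_exec, ec2_turnaround = 3 * 3600, 5 * 3600                        # placeholder times (s)

beats = sum(1 for w in wait_samples if w + llnl_exec < ec2_turnaround)
print(f"LLNL better in {100 * beats / len(wait_samples):.1f}% of samples")
```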

Cost Calculation The price of EC2 is known, but the LLNL clusters need a price assigned to them. Calculated prices are based on EC2 ($2.40/node-hour). Can't view computation as a utility, because operating cost data for the clusters isn't publicly available. Assume the LLNL clusters are competitors in the same market as EC2. Do not incorporate queue wait times in the cost. Users in this market make optimal choices for profit maximization.

Cost Calculation Cont. The user takes into account: p = price per node-hour, n = number of nodes to allocate, t(n) = execution time when using n nodes, and a = the implicit cost the user attaches to a unit of time. The user wants to minimize the combined explicit and implicit costs: C = p * n * t(n) + a * t(n). The first term is the explicit cost; it is the actual monetary value to be spent on the calculation. The second term is the implicit cost; this can be the cost of waiting longer for something slow to finish, etc.
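The same cost model written as code, with placeholder inputs (the $2.40/node-hour figure is EC2's on-demand price from the earlier slide; the node count, run time, and implicit cost rate are made up):

```python
# The cost model above, as code: total cost = explicit node-hour charges plus
# an implicit cost of waiting. All numbers here are placeholders.
def total_cost(p, n, t_n, a):
    """p: price per node-hour, n: nodes allocated, t_n: run time in hours on
    n nodes, a: implicit cost the user attaches to one hour of waiting."""
    explicit = p * n * t_n      # money actually paid for compute
    implicit = a * t_n          # value of the user's time while the job runs
    return explicit + implicit

# Example: 16 nodes at $2.40/node-hour for a 5-hour run, valuing time at $10/hour.
print(total_cost(p=2.40, n=16, t_n=5.0, a=10.0))   # -> 242.0
```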

Cost Calculation Cont. Using this model of user choice, cluster operators can choose a price which maximizes profit. Lots of math later, and we get node-hour prices for each cluster. Notice that the slowest machine, udawn, has a very low price, while the best machine, Cab, has a fairly high price. Note that EC2 uses hourly pricing, but in order to gather data more effectively for the short jobs that were run, the prices were prorated by the second for the following data.

What Do These Results Show? Different applications have their total cost minimized on different clusters. In other words, a single cluster (such as the cheap udawn) doesn't always have the lowest cost. For instance, it is cheapest to run computationally intensive but communication-light programs on EC2, as EC2's price is optimized for computation. Alternatively, it is cheapest to run communication-intense programs on Cab, since Cab's price is optimized for communication. As we saw before, different applications have their shortest running time on different clusters. It all depends on the qualities of the application: is it cache heavy? Communication-intense? Computationally intense?

More Results If a program must be completed within a certain time, that can yet again change the machine which provides the lowest cost. Similarly, if a price bound exists, the machine which provides the fastest time may change. Basically, many factors affect the choice of machine: queue wait times, the application itself, whether turnaround time or cost is more important, and time/cost bounds.
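A toy version of the decision just described: among candidate machines, pick the cheapest one whose expected turnaround meets a deadline. All names and numbers below are placeholders, not the paper's measurements.

```python
# Toy version of the bounded choice: pick the cheapest machine whose expected
# turnaround meets a deadline. Candidate values are placeholders.
machines = [
    # (name, expected turnaround in hours, total cost in dollars)
    ("EC2",        5.1, 242.0),
    ("cluster_a",  7.0, 180.0),
    ("cluster_b", 10.5,  60.0),
]

def cheapest_within_deadline(candidates, deadline_h):
    feasible = [m for m in candidates if m[1] <= deadline_h]
    return min(feasible, key=lambda m: m[2]) if feasible else None

print(cheapest_within_deadline(machines, deadline_h=8))    # -> ('cluster_a', 7.0, 180.0)
print(cheapest_within_deadline(machines, deadline_h=12))   # -> ('cluster_b', 10.5, 60.0)
```

Tightening or loosening the deadline (or swapping the roles of cost and time) changes which machine wins, which is exactly the point the slide makes.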

Well, Which One To Choose? The choice between HPC platforms isn't clear; users cannot exhaustively test their application on a multitude of clusters/cloud providers to determine which one is best for their needs. The authors propose software that can choose a cluster automatically. This would require that systems provide wait queue data, which could be a security concern.

Bottom Line The cloud isn't entirely useless for HPC problems; it outperforms dedicated HPC hardware in some cases. When measured in terms of total turnaround time, the cloud can outperform dedicated HPC hardware in many cases due to significant queue wait times on HPC clusters. Should current HPC cluster providers start charging for computation in a manner similar to cloud providers, a multitude of factors can affect a user's machine choice. The choice is not clear in this case, and the responsibility of navigating these factors shouldn't be placed on the user.