Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies


Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies
Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan
{merritt.alex,abhishek.verma}@gatech.edu, {vishakha,ada,schwan}@cc.gatech.edu

Heterogeneous Computing
Hardware used in high-performance computing is evolving rapidly: many-core processors, accelerators, and complex memory structures, with GPGPUs predominant. Systems span many node configurations (cache and memory hierarchies, GPU specifications) and face a broadening set of application domains: scientific simulations, data-intensive applications, low-latency codes (finance), and high-throughput codes (massive web applications).
Georgia Institute of Technology - For VTDC'11

Sample Configurations
Oak Ridge Keeneland, per node: 12 cores in 2 sockets, 3 GPGPUs (Fermi M2070), QDR InfiniBand network, 24 GiB memory.
Amazon EC2, per instance: 33.5 EC2 Compute Units, 2 GPGPUs (Fermi M2050), 10G Ethernet, 22 GiB memory.
TeraGrid Forge: 64 cores in 8 sockets, 8 GPGPUs (Fermi M2070), 64 GiB memory.
Others with older GPGPUs and fewer CPU cores.

Existing Constraints
Efficiency demands fine-tuning of code (the CUDA Best Practices Guide runs 76 pages). Node configurations are static: code must be matched to the hardware, and nodes can contain only so many devices.

Outline
Introduction · GPGPU Assemblies (Idea & Design, Implementation) · Evaluation · Conclusion

GPGPU Assembly
A virtual platform of CPUs and GPGPUs constructed per application: GPGPUs are abstracted as vGPUs to applications, and an assembly may span multiple nodes. Goals: 1. Scalability, 2. Utilization, 3. Portability.

Shadowfax: Implementation
(Architecture diagram: CUDA calls issued by VCPUs in a guest domain are forwarded to the management domain, both running atop the Xen hypervisor over the local node hardware — CPUs and GPGPUs.)

(Diagram: a local vGPU services calls through a call buffer and shared memory to a server thread and the local CUDA runtime/GPGPU; a remote vGPU crosses the machine boundary to a polling thread fronting the remote node's CUDA runtime and GPGPU.)

Assembly: Key Ideas
A consistent programming model; dynamic vGPU-to-GPGPU attachment (any GPGPU in the cluster); matching workloads to hardware availability.

Outline
Introduction · GPGPU Assemblies · Evaluation (Experimental System, Workloads, Results) · Conclusion

Experimental System
Common hardware: 4 CPU cores, 4 GiB memory, NVIDIA GeForce 8800 and GeForce 9800 (the GeForce 9800 contains two GPUs internally), Gigabit Ethernet between the application-VM node and an additional node.
Guest configuration: 1 VCPU, 256 MiB memory, Linux 2.6.18.
Software stack: Xen 3.2.1, Linux 2.6.18, CUDA v1.1, NVIDIA driver 169.09.

Workloads

Class   | Source             | Benchmarks
Finance | CUDA SDK           | Binomial Options, Black-Scholes
Media   | CUDA SDK & Parboil | Matrix Multiply, MRI-FHD
Science | Parboil            | Coulomb Potential (CP)

Benchmark        | Characteristics
Binomial Options | Few fat kernels, some synchronization
Black-Scholes    | Many small asynchronous kernels
Matrix Multiply  | Large upfront data movement, one kernel
MRI-FHD & CP     | Few kernels, data movement and kernels interleaved

Performance of a Single Thread Contending for the GPGPU
(Graph: Binomial Options; time taken to compute a constant workload size.)

Performance of a Single Thread Contending for the GPGPU (continued)
(Graph: Binomial Options against a no-contention baseline; time taken to compute a constant workload size.)

Increased Aggregate Application Throughput
(Graph: two applications, each with its own assembly, running Binomial Options — a typo in the paper says Black-Scholes — on a 4-vGPU assembly.)

CPU Contention Can Limit Performance in an Assembly
(Graph: multiple VMs, each running one instance of Black-Scholes.)

Sensitivity to Overheads: Different Across Workloads

Insights
Scalability is achievable: more GPGPUs yield greater performance, though CPUs can become a bottleneck. Optimizations are still needed, but they can be pushed to the system-software layer. Network overhead remains the greatest contributor, and is significant for highly synchronous and data-intensive workloads.

Outline
Introduction · GPGPU Assemblies · Evaluation · Conclusion (Future & Current Efforts, Related Work)

Future Work
Understanding utilization: better characterize workload profiles; add flexibility and instrumentation. Integrate with GPU Ocelot for a deeper view into GPGPU utilization, and to additionally allow CPUs to execute GPGPU code. The dynamic matching algorithm will need to be intelligent and distributed; incorporating a monitoring system at each layer opens greater opportunities to share hardware.

Related Work
vCUDA: XML-RPC for API marshalling; no remote GPGPU access. GViM: also no remote GPGPU access. rCUDA: concurrent use of non-local GPGPUs; reduces power consumption.

Conclusion
Heterogeneous clusters are becoming more diverse, with GPGPUs the dominant accelerator; nodes are not reconfigurable, so applications are tailored to specific configurations. Assemblies provide a flexible framework for mapping applications to hardware, and they enable scalability, though some workloads still require optimization. Future directions: additional flexibility by enabling CUDA code to also execute on CPUs, and even greater flexibility in the assembly-to-hardware mapping.

Acknowledgements
Student funding was provided in part by the National Science Foundation; equipment funding in part by Intel and NVIDIA. Thank you to the reviewers for their thorough feedback during submission!

[End of Talk]