rcuda: an approach to provide remote access to GPU computational power


rcuda: an approach to provide remote access to GPU computational power. Rafael Mayo Gual, Universitat Jaume I, Spain. HPC Advisory Council Workshop.

Outline: GPU computing; cost of a GPU node; rcuda goals; rcuda structure; rcuda implementations; rcuda current status.

GPU computing will be the near future in HPC; in fact, it is already here!

GPU computing. It is massively parallel. For the right kind of code, the use of GPU computing brings huge benefits. Development tools and libraries facilitate the use of the GPU.

GPU computing. Two approaches to GPU computing: CUDA, NVIDIA proprietary; OpenCL, an open standard.

GPU computing. Basically, OpenCL and CUDA have the same work scheme. Compilation: separate CPU and GPU code. Running: CPU and GPU code (GPU kernels). Data transfers between the CPU and GPU memory spaces: before kernel execution, data goes from the CPU memory space to the GPU memory space; computation: kernel execution; after kernel execution, results go from the GPU memory space back to the CPU memory space.
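As an illustration only (not taken from the talk), a minimal CUDA host program following this scheme could look as below; the scale kernel, its launch geometry and the data sizes are invented for the example:

```cuda
#include <cuda_runtime.h>

// Trivial kernel: one GPU thread per element.
__global__ void scale(float *v, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= a;
}

int main(void) {
    const int n = 1 << 20;
    float *h = new float[n];                    // CPU memory space
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));          // GPU memory space

    // Before kernel execution: data from CPU to GPU memory space.
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Computation: kernel execution on the GPU.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    // After kernel execution: results from GPU back to CPU memory space.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    delete[] h;
    return 0;
}
```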

CPU-GPU computing. Not all algorithms profit from GPU power. In some cases only part of a program must be run on a GPU. Depending on the algorithm, the GPU can be idle for long periods.

Outline: GPU computing; cost of a GPU node; rcuda goals; rcuda structure; rcuda implementations; rcuda current status.

GPU computing cost. A Tesla S2050 draws near 900 watts (TDP, manufacturer specification). Usage: 75% of the time, so 25% idle time. Then each node wastes about 160 kWh/month, i.e. 2 MWh/year. That can amount to several hundred kg of CO2 per year.
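For reference, the wasted-energy figure follows from simple arithmetic (assuming roughly 720 hours per month):

900 W x 0.25 idle time x 720 h/month ≈ 162 kWh/month ≈ 1.9 MWh/year

which rounds to the figures above.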

GPU computing. You can find two different scenarios. Scenario 1: if all your programs are going to use the GPU for long periods, add a GPU to each node. You don't need our tool.

GPU computing. You can find two different scenarios. Scenario 2: you could think of adding a GPU to only some of the nodes. OUR TOOL CAN HELP YOU!

Outline: GPU computing; cost of a GPU node; rcuda goals; rcuda structure; rcuda implementations; rcuda current status.

rcuda goals (figure-only slides)

Outline: GPU computing; cost of a GPU node; rcuda goals; rcuda structure; rcuda implementations; rcuda current status.

rcuda structure (figures): the CUDA application runs on the client side; the GPU is on the server side.

rcuda functionality

rcuda functionality. Supported CUDA 4.0 Runtime functions:

Module                          Functions   Supported
Device management                      13          13
Error handling                          3           3
Event management                        7           7
Execution control                       7           7
Memory management                      47          43
Peer device memory access               5           4
Stream management                       2           2
Surface reference management            8           8
Texture reference management            6           6
Thread management                       6           6
Version management                      2           2

rcuda functionality. NOT YET supported CUDA 4.0 Runtime functions:

Module                          Functions   Supported
Unified addressing                     11           0
Peer device memory access               3           0
OpenGL interoperability                 3           0
Direct3D 9 interoperability             5           0
Direct3D 10 interoperability            5           0
Direct3D 11 interoperability            5           0
VDPAU interoperability                  4           0
Graphics interoperability               6           0

rcuda functionality. Supported CUBLAS functions:

Module                      Functions   Supported
Helper function reference          15          15
BLAS-1                             54           1
BLAS-2                             66           0
BLAS-3                             30           6

Outline computing Cost of a node rcuda goals rcuda structure rcuda implementations rcuda current status (25 of 60) HPC Advisory Council Workshop

rcuda: basic TCP/IP version

rcuda: basic TCP/IP version. Example of rcuda interaction: rcuda initialization.

rcuda: basic TCP/IP version. Example of rcuda interaction: cudaMemcpy(..., cudaMemcpyHostToDevice);
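As an illustration of what the client side does with such a call, a hypothetical wrapper is sketched below. This is not the actual rcuda source: rcuda_send_request and the RCUDA_MEMCPY_H2D opcode are invented names standing for the layer that marshals the call onto the TCP socket opened at rcuda initialization.

```c
/* Hypothetical sketch of client-side interception -- NOT the real rcuda
 * code. rcuda_send_request() is an invented helper that marshals a call
 * onto the TCP socket opened at rcuda initialization. */
#include <cuda_runtime.h>

enum rcuda_op { RCUDA_MEMCPY_H2D };

extern cudaError_t rcuda_send_request(enum rcuda_op op, void *dev_dst,
                                      const void *host_src, size_t count);

/* The client library exports the CUDA Runtime symbols, so the unmodified
 * application ends up calling this wrapper instead of the local runtime. */
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                       enum cudaMemcpyKind kind)
{
    if (kind == cudaMemcpyHostToDevice)
        /* ship the payload to the server, which runs the real cudaMemcpy
         * on the node that owns the GPU */
        return rcuda_send_request(RCUDA_MEMCPY_H2D, dst, src, count);
    /* ... remaining kinds handled analogously ... */
    return cudaErrorInvalidMemcpyDirection;
}
```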

rcuda: basic TCP/IP version. Main problem: data movement overhead. In CUDA this overhead is due to the PCIe transfer. In rcuda it is due to the network transfer plus the PCIe transfer (the latter also appears in CUDA).

rcuda: basic TCP/IP version. Figure: data transfer time for matrix multiplication (two matrices from client to remote GPU, one from remote GPU to client); time (sec) versus matrix dimension, rcuda versus CUDA.

rcuda: basic TCP/IP version. Figure: execution time for matrix multiplication on a Tesla C1060 with an Intel Xeon E5410 at 2.33 GHz; time (sec) versus matrix dimension, broken down into CPU, rcuda kernel, rcuda data transfer and rcuda misc.

rcuda: basic TCP/IP version. Figure: estimated execution time for matrix multiplication, including data transfers, for some HPC networks; time (sec) versus matrix dimension, comparing CPU, 10 Gbit Ethernet, 10 Gbit InfiniBand and 40 Gbit InfiniBand.

rcuda: basic TCP/IP version. We have shown the functionality. As we decrease the network overhead, our solution approaches the performance of the CUDA solution.

rcuda: InfiniBand version

rcuda: InfiniBand Verbs implementation. Same user-level functionality. Use of GPUDirect. Use of pipelined transfers. Client to/from remote GPU bandwidth near the peak of InfiniBand network performance.

rcuda: GPUDirect. Without GPUDirect, three copies are needed: 1. GPU memory to main memory; 2. main memory to main memory (from the CUDA buffer to the network buffer); 3. main memory to network. (Figure: CPU, main memory, chipset, InfiniBand, GPU memory.)

rcuda: GPUDirect. WITH GPUDirect, only two copies are needed: 1. GPU memory to main memory; 2. main memory to network. A memory copy is avoided, because the CUDA runtime and the InfiniBand driver share the same pinned buffer. (Figure: CPU, main memory, chipset, InfiniBand, GPU memory.)
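The shared-buffer idea can be sketched as follows; this is illustrative only (not rcuda source) and assumes an already-created InfiniBand protection domain pd. The same page-locked buffer is made visible both to the CUDA runtime (cudaHostAlloc) and to the InfiniBand HCA (ibv_reg_mr), which is what removes the intermediate main-memory-to-main-memory copy:

```c
/* Illustrative sketch of the GPUDirect (v1) shared-buffer idea. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

void *make_shared_buffer(struct ibv_pd *pd, size_t size, struct ibv_mr **mr)
{
    void *buf;
    /* pinned allocation usable by cudaMemcpy/cudaMemcpyAsync */
    cudaHostAlloc(&buf, size, cudaHostAllocDefault);
    /* register the same pages with the InfiniBand HCA for RDMA */
    *mr = ibv_reg_mr(pd, buf, size,
                     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    return buf;
}
```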

rcuda: pipelined transfers. (Figures: client node and server node, each with CPU, main memory, chipset and InfiniBand; the GPU memory is on the server node.)

Without pipelined transfers, the stages run one after another: copy to the network buffers on the client, data transfer over the network, and copy to the GPU on the server.

WITH pipelined transfers, the stages overlap: while one chunk is transferred over the network, the next chunk is being copied to the network buffers on the client and the previous chunk is being copied to the GPU on the server. A sketch of this scheme follows.
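A minimal sketch of the server side of this pipeline, assuming a hypothetical net_recv() helper that fills a buffer from the InfiniBand connection (this is not the actual rcuda implementation):

```c
/* Illustrative double-buffering sketch -- not the real rcuda code.
 * net_recv() is a hypothetical helper that receives up to 'max' bytes of
 * the incoming payload and returns how many bytes it got. */
#include <cuda_runtime.h>
#include <stddef.h>

#define CHUNK (1 << 20)                 /* 1 MiB pipeline chunk (arbitrary) */

extern size_t net_recv(void *buf, size_t max);

void pipelined_recv_to_gpu(char *d_dst, size_t total, cudaStream_t stream)
{
    char *buf[2];
    cudaEvent_t done[2];
    for (int i = 0; i < 2; ++i) {
        cudaHostAlloc((void **)&buf[i], CHUNK, cudaHostAllocDefault); /* pinned */
        cudaEventCreate(&done[i]);
    }

    size_t off = 0;
    int cur = 0;
    while (off < total) {
        /* wait until the previous async copy out of this buffer finished,
         * so it is safe to overwrite it with new network data */
        cudaEventSynchronize(done[cur]);

        size_t n = net_recv(buf[cur], CHUNK);   /* stage 1: network */
        if (n == 0) break;

        /* stage 2: PCIe copy to the GPU; it runs asynchronously, so the
         * next net_recv() on the other buffer overlaps with it */
        cudaMemcpyAsync(d_dst + off, buf[cur], n,
                        cudaMemcpyHostToDevice, stream);
        cudaEventRecord(done[cur], stream);

        off += n;
        cur ^= 1;                               /* swap buffers */
    }
    cudaStreamSynchronize(stream);              /* drain the pipeline */

    for (int i = 0; i < 2; ++i) {
        cudaEventDestroy(done[i]);
        cudaFreeHost(buf[i]);
    }
}
```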

rcuda: optimized InfiniBand version. Figure: bandwidth for GEMM 13824x13824, in MB/sec, comparing rcuda GigaE, rcuda IPoIB, rcuda IBVerbs and CUDA.

rcuda: optimized InfiniBand version. Figure: time for GEMM 4096x4096, in seconds: rcuda GigaE 2.28, rcuda IPoIB 1.30, rcuda IBVerbs 0.70, CUDA 0.65, CPU (MKL on an Intel Xeon E5645) 0.62.

Outline: GPU computing; cost of a GPU node; rcuda goals; rcuda structure; rcuda implementations; rcuda current status.

rcuda: work in progress. rcuda port to Microsoft Windows. rcuda thread safety. rcuda support for CUDA 4.0. Support for the CUDA C extensions. ropencl.

rcuda: near future. Dynamic remote GPU scheduling. Workload balance. Remote data cache. Remote kernel cache.

rcuda: more information. http://www.gap.upv.es/rcuda http://www.hpca.uji.es/rcuda

GPU virtualization in high performance clusters. J. Duato, F. Igual, R. Mayo, A. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization and High-Performance Cloud Computing, VHPC'09.

rcuda: reducing the number of GPU-based accelerators in high performance clusters. J. Duato, A. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems, OPTIM 2010.

Performance of CUDA virtualized remote GPUs in high performance clusters. J. Duato, R. Mayo, A. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing, ICPP 2011 (accepted).

rcuda: credits. Parallel Architectures Group, Technical University of València, http://www.gap.upv.es. High Performance Computing and Architectures Group, University Jaume I of Castelló, http://hpca.uji.es.