GpuWrapper: A Portable API for Heterogeneous Programming at CGG

GpuWrapper: A Portable API for Heterogeneous Programming at CGG
Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon
March 2nd, 2016

GpuWrapper: Objectives & Design

GpuWrapper is CGG's compatibility layer supporting both CUDA and OpenCL.

Objectives:
- Increase application portability
- Future-proof accelerated application development
- Reach maximum performance on Nvidia and AMD GPUs
- Consider the complete platform, not only the hardware: compiler, runtime, debugger, profiler, numerical libraries

Design: accelerated application = unified host code + device-specific kernels
- One version of the host code for all devices (the majority of application code is host code)
- One version of each kernel per device (enables device-specific tuning)

Target GPU architectures:

GPU architecture   Nvidia Kepler   Nvidia Maxwell   AMD Hawaii   AMD Fiji
Memory BW          336 GB/s        336 GB/s         320 GB/s     512 GB/s
Compute (SP)       5 Tflops        6 Tflops         5 Tflops     8 Tflops

GpuWrapper Overview

[Diagram: software stack]
- Application host code (shared), with application kernels written in CUDA on one side and in OpenCL on the other
- GpuWrapper API
- libgpuwrapper.so, CUDA implementation, built on the CUDA Driver API and targeting Nvidia GPUs
- libgpuwrapper.so, OpenCL implementation, built on the OpenCL 1.2 API and targeting OpenCL devices
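The slides do not say how an application ends up bound to one of the two libgpuwrapper.so builds. A plausible mechanism, assumed here rather than stated in the talk, is that both builds export the same symbols and the choice is made at load time through the dynamic linker search path or an explicit dlopen. A minimal sketch (gpuwInit is a hypothetical entry point):

    // Sketch only: assumes both back ends ship a libgpuwrapper.so exporting the
    // same C symbols; which one is loaded depends on LD_LIBRARY_PATH or on the
    // path passed to dlopen. gpuwInit is a hypothetical entry point.
    #include <dlfcn.h>
    #include <cstdio>

    int main() {
        void* lib = dlopen("libgpuwrapper.so", RTLD_NOW);   // CUDA or OpenCL build
        if (!lib) { std::fprintf(stderr, "load failed: %s\n", dlerror()); return 1; }

        using InitFn = int (*)(int);
        auto gpuwInit = reinterpret_cast<InitFn>(dlsym(lib, "gpuwInit"));
        if (gpuwInit) gpuwInit(0);                           // same call, either back end

        dlclose(lib);
        return 0;
    }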

GpuWrapper API
- Initialization & cleanup of devices
- Allocation and memory transfers
- Kernel launch
- Synchronization with streams and events
- Device-to-device communications
- FFT API

GpuWrapper is targeted at CGG applications, hence it is simpler: about 50 API calls and roughly 3k LOC each for the OpenCL and CUDA implementations.
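To make the scope of those roughly 50 calls concrete, here is a hypothetical header sketch organized along the six areas listed above. Every name and signature is invented for illustration; the talk does not expose CGG's actual API.

    // gpuwrapper_sketch.h -- hypothetical API surface, not CGG's real header.
    // One set of declarations; libgpuwrapper.so provides either a CUDA Driver
    // API or an OpenCL 1.2 implementation behind it.
    #pragma once
    #include <cstddef>

    struct gpuwBuffer;   // opaque device buffer
    struct gpuwStream;   // opaque stream / command queue
    struct gpuwEvent;    // opaque event
    struct gpuwKernel;   // opaque kernel handle

    // Initialization & cleanup of devices
    int  gpuwInit(int deviceId);
    void gpuwFinalize(int deviceId);

    // Allocation and memory transfers
    gpuwBuffer* gpuwMalloc(std::size_t bytes);
    void gpuwFree(gpuwBuffer* buf);
    void gpuwMemcpyHostToDevice(gpuwBuffer* dst, const void* src, std::size_t bytes, gpuwStream* s);
    void gpuwMemcpyDeviceToHost(void* dst, const gpuwBuffer* src, std::size_t bytes, gpuwStream* s);

    // Kernel launch (sizes given in work-items, OpenCL style)
    gpuwKernel* gpuwGetKernel(const char* name);
    void gpuwSetKernelArg(gpuwKernel* k, int index, std::size_t size, const void* value);
    void gpuwLaunchKernel(gpuwKernel* k, const std::size_t global[3], const std::size_t local[3], gpuwStream* s);

    // Synchronization with streams and events
    gpuwStream* gpuwStreamCreate();
    void gpuwStreamSynchronize(gpuwStream* s);
    gpuwEvent*  gpuwEventRecord(gpuwStream* s);
    void gpuwStreamWaitEvent(gpuwStream* s, gpuwEvent* e);

    // Device-to-device communications
    gpuwBuffer* gpuwAllocShared(int deviceA, int deviceB, std::size_t bytes);

    // FFT API (common front end over cuFFT / clFFT)
    void gpuwFft1d(gpuwBuffer* inout, std::size_t n, int direction, gpuwStream* s);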

OpenCL vs CUDA

OpenCL and CUDA are quite similar, but there are some differences:

Feature                      CUDA                                OpenCL                                     GpuWrapper
GPU memory buffers           Pinned to a GPU                     Automatically moved to device before use  Similar to CUDA
Kernel compilation           Offline                             JIT                                        Similar to CUDA
Kernel launch                <<< >>> syntax, compiler support    clSetKernelArg + clEnqueueNDRangeKernel    Similar to OpenCL
Pointer arithmetic on host   Supported                           Use sub-buffers                            Similar to OpenCL
Device initialization        Implicit                            Explicit                                   Similar to CUDA
FFT library                  cuFFT                               clFFT                                      Common API to use both
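The kernel-launch row is where the two models diverge the most: CUDA's <<< >>> syntax needs compiler support, so a wrapper naturally adopts the OpenCL style of setting arguments by index and launching through a plain function call, mapped onto the CUDA Driver API on the Nvidia side. The snippet below shows the two underlying launches such a wrapper has to hide; the API calls are real CUDA Driver API and OpenCL 1.2 entry points, but the handles and the 256-thread block size are illustrative.

    // Launching the same logical kernel through the two back ends.
    // Handles (fn, stream, kernel, queue, dA) are assumed to be already created.
    #include <cuda.h>     // CUDA Driver API
    #include <CL/cl.h>    // OpenCL 1.2

    void launch_cuda(CUfunction fn, CUstream stream, CUdeviceptr dA, int n) {
        void* args[] = { &dA, &n };
        // CUDA expresses the grid in BLOCKS of blockDim threads.
        cuLaunchKernel(fn, (n + 255) / 256, 1, 1,    // gridDim
                       256, 1, 1,                    // blockDim
                       0, stream, args, nullptr);
    }

    void launch_opencl(cl_kernel kernel, cl_command_queue queue, cl_mem dA, int n) {
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &dA);
        clSetKernelArg(kernel, 1, sizeof(int), &n);
        // OpenCL expresses the global size in WORK-ITEMS (total threads).
        size_t global = static_cast<size_t>((n + 255) / 256) * 256;
        size_t local  = 256;
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, &local,
                               0, nullptr, nullptr);
    }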

Nvidia GPUs vs AMD GPUs

Nvidia and AMD GPU architectures are similar, but differences sometimes require different kernel implementations:
- To fit within specifications
- To get good performance (especially differences impacting occupancy)

Short list of characteristics:

Characteristic                Nvidia (CC 3.0)   Nvidia (CC 3.5)   AMD Hawaii
Max threads per block         1024              1024              256
Max shared memory per block   48 KB             48 KB             32 KB
Max registers per thread      63                255               255
Warp size                     32                32                64
Cache line size               128 B             128 B             64 B
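Since the maximum work-group size alone differs by a factor of four (1024 on these Nvidia parts vs. 256 on Hawaii), a launch configuration hard-coded for one vendor may not even launch on the other. Below is a sketch of the standard run-time queries a wrapper or tuning layer can use to stay within spec; the clamping policy at the end is purely illustrative.

    // Query per-device launch limits so the same host code can pick a legal
    // block / work-group size on Nvidia (up to 1024) and AMD Hawaii (256).
    #include <cuda.h>
    #include <CL/cl.h>
    #include <algorithm>
    #include <cstddef>

    std::size_t max_block_cuda(CUdevice dev) {
        int maxThreads = 0;
        cuDeviceGetAttribute(&maxThreads, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
        return static_cast<std::size_t>(maxThreads);
    }

    std::size_t max_block_opencl(cl_device_id dev) {
        std::size_t maxWg = 0;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(maxWg), &maxWg, nullptr);
        return maxWg;
    }

    // Illustrative policy: ask for 1024 threads but never exceed the device limit.
    std::size_t pick_block_size(std::size_t deviceLimit) {
        return std::min<std::size_t>(1024, deviceLimit);
    }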

Low-Latency GPU-to-GPU Communications

GpuWrapper API semantics: a common memory buffer accessible by the kernels of a pair of GPUs.

GpuWrapper implementations:
1. Memory buffer allocated in pinned host memory
2. Memory buffer allocated on one of the GPUs (GPUDirect / DirectGMA)

Policy to select the implementation (sketched in code below):
- Use #2 when the GPUs are on the same socket, otherwise use #1
- Specifically for AMD: use #1 for small buffers (better performance) or when both remote read and remote write are required (remote read is not supported)

This functionality is no longer duplicated in each application.

[Chart: bandwidth (MB/s) vs. buffer size (KB, 0-2000) for AMD DirectGMA, AMD host memory, Nvidia GPUDirect, and Nvidia host memory]
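A rough sketch of the socket-based selection policy on the Nvidia side, using the standard CUDA peer-access queries (runtime API shown for brevity). The same-socket test is a hypothetical helper, e.g. built on hwloc or PCI bus IDs, and the AMD DirectGMA path and small-buffer exception are not shown.

    // Choose between (1) a pinned host buffer and (2) a buffer resident on one
    // GPU with peer access enabled, following the policy above.
    // sameSocket() is a hypothetical topology helper (e.g. via hwloc).
    #include <cuda_runtime.h>
    #include <cstddef>

    bool sameSocket(int devA, int devB);   // hypothetical, not part of CUDA

    void* alloc_shared_buffer(int devA, int devB, std::size_t bytes) {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, devA, devB);

        if (canAccess && sameSocket(devA, devB)) {
            // Implementation #2: buffer lives on devB, devA accesses it directly.
            cudaSetDevice(devA);
            cudaDeviceEnablePeerAccess(devB, 0);
            cudaSetDevice(devB);
            void* d = nullptr;
            cudaMalloc(&d, bytes);
            return d;
        }
        // Implementation #1: pinned host memory, mapped so kernels on both GPUs
        // can address it (assumes unified virtual addressing).
        void* h = nullptr;
        cudaHostAlloc(&h, bytes, cudaHostAllocPortable | cudaHostAllocMapped);
        return h;
    }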

Porting Applications from CUDA to the GpuWrapper API

Two-step process to port applications:
1. Port the CUDA host code to the GpuWrapper API (keep the kernels in CUDA); check results on an Nvidia GPU
2. Implement OpenCL versions of the CUDA kernels (the host code stays the same); check results on an AMD GPU

Factors impacting results: different compilers, compilation flags, different FFT libraries.

OpenCL flag                              Default   CUDA flag                         Default
-cl-mad-enable                           False     -fmad=true                        True
-cl-fp32-correctly-rounded-divide-sqrt   False     -prec-div=true, -prec-sqrt=true   True
-cl-denorms-are-zero                     False     -ftz=true                         False
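One concrete consequence of the flag table: by default nvcc contracts multiply-adds (-fmad=true) and uses correctly rounded single-precision division and square root, while the OpenCL defaults do neither, so comparisons between the CUDA baseline and the OpenCL port can differ for compiler reasons alone. A plausible way to narrow the gap (an assumption, not necessarily the exact flags CGG used) is to pass the matching options when the OpenCL kernels are JIT-compiled:

    // Build OpenCL kernels with options that approximate nvcc's floating-point
    // defaults (-fmad=true, -prec-div=true, -prec-sqrt=true).
    // program and device are assumed to be valid OpenCL handles.
    #include <CL/cl.h>

    cl_int build_like_nvcc_defaults(cl_program program, cl_device_id device) {
        const char* options =
            "-cl-mad-enable "                            // counterpart of -fmad=true
            "-cl-fp32-correctly-rounded-divide-sqrt";    // counterpart of -prec-div/-prec-sqrt=true
        // -cl-denorms-are-zero is deliberately left off, matching nvcc's default -ftz=false.
        // Note: the correctly-rounded option is only accepted on devices that support it.
        return clBuildProgram(program, 1, &device, options, nullptr, nullptr);
    }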

Evaluation of the GpuWrapper Overhead

Does the GpuWrapper add overhead over the CUDA implementation on the same hardware? Negligible overhead.

Experimental setup:
- Gather a representative set of production jobs for each application
- Run each job on each generation of Nvidia GPU accelerated nodes
- Compare the performance of the pure CUDA implementation and the GpuWrapper implementation

Performance Comparison on Wave-Equation Modeling

[Chart: performance for increasing wave-equation complexity on GPUs with 190 GB/s, 330 GB/s (1.8x), and 500 GB/s (2.7x) memory bandwidth]

Performance scales with memory bandwidth.

Related Work: Portable Heterogeneous Programming
- Directive-based: OpenMP 4, OpenACC (higher level)
- Source-to-source translation: CU2CL, Ocelot, Kim et al. 2015; great for an initial port, but some features are not supported and it is not viable over the lifetime of the application
- Compatibility layer: Swan, OCCA, Souza et al. 2015, AMD HIP; OCCA and HIP also provide a common kernel language; very similar to our approach but missing specific features

Conclusion & Acknowledgments

GpuWrapper API:
- Multiple applications successfully deployed in production
- Compatible with both CUDA and OpenCL
- Fully transparent for geophysicists
- No overhead on Nvidia GPUs

Over 1 SP petaflop of AMD GPUs successfully deployed in production:
- With two system vendors (Supermicro, Dell)
- Great performance on wave-equation modeling

A wrapper for future manycores?

Acknowledgements:
- Great support from AMD (many thanks to Benjamin Coquelle)
- Great system support from Dell