Mapping C++ AMP to OpenCL / HSA
Wen-Heng Jack Chung <jack@multicorewareinc.com>

MulticoreWare
Founded in 2009. Largest independent OpenCL team.
Locations: Changchun, Champaign, Beijing, St. Louis, Taiwan, Sunnyvale, Chennai.
Photo caption: Dr. Wen-mei Hwu, MCW CTO and PI for the UIUC Blue Waters Supercomputer, accepts the Second Annual Achievement Award at GTC 2013.

Agenda
- C++ AMP
- Kalmar: a C++ AMP implementation on OpenCL
- Case studies
- Kalmar on HSA
- Toward heterogeneous C++

C++ AMP Programming Model

  void MultiplyWithAMP(int* amatrix, int* bmatrix, int *productmatrix) {
    array_view<int, 2> a(3, 2, amatrix);
    array_view<int, 2> b(2, 3, bmatrix);
    array_view<int, 2> product(3, 3, productmatrix);
    parallel_for_each(
      product.extent,
      [=](index<2> idx) restrict(amp) {
        int row = idx[0];
        int col = idx[1];
        for (int inner = 0; inner < a.get_extent()[1]; ++inner)
          product[idx] += a(row, inner) * b(inner, col);
      });
    product.synchronize();
  }

The following four slides repeat this listing verbatim, each highlighting one element of the model:
- GPU data is modeled as data container objects (the array_view instances).
- parallel_for_each is the execution interface, marking an implicitly parallel region for GPU execution.
- Kernels are modeled as lambdas; kernel arguments are implicitly modeled as captured variables.
- product.synchronize() synchronizes GPU data back to the host.
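For context, a minimal host-side driver for MultiplyWithAMP might look as follows. This is a sketch, not from the slides: <amp.h> and the concurrency namespace come from the C++ AMP specification, and the sample matrices are made up here.

  #include <amp.h>
  #include <iostream>
  using namespace concurrency;

  // MultiplyWithAMP as defined above.

  int main() {
    int a[6]       = {1, 2, 3, 4, 5, 6};     // 3x2 matrix
    int b[6]       = {7, 8, 9, 10, 11, 12};  // 2x3 matrix
    int product[9] = {0};                    // 3x3 result; zero-initialized
                                             // because the kernel uses +=
    MultiplyWithAMP(a, b, product);
    for (int i = 0; i < 9; ++i)              // first row should be 27 30 33
      std::cout << product[i] << ((i % 3 == 2) ? '\n' : ' ');
    return 0;
  }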

Kalmar
- An open-source implementation contributed by MulticoreWare
- Based on Clang & LLVM 3.3 and 3.5
- Lowers C++ AMP code to:
  - OpenCL SPIR
  - OpenCL C
  - HSA HSAIL / BRIG
  - x86-64 (CPU fallback)
- Compatible with all major OpenCL stacks, and with HSA
- https://bitbucket.org/multicoreware/cppamp-driver-ng/

Build Process (diagram): the compiler driver splits C++ AMP source code into two paths. Device code compilation produces LLVM IR, which GPU target lowering turns into OpenCL C, SPIR, or HSA BRIG. Host code compilation produces a host object, which is linked with the C++ AMP runtime into the binary executable.

Tasks for Device Code Compilation (C++ AMP source to LLVM IR)
- Enforce additional rule checks for device code:
  - the restriction specifier
  - C++ data types unsupported on the GPU
  - C++ syntax unsupported on the GPU
- Turn captured variables into OpenCL arguments; members of captured objects are flattened into a list of primitive arguments
- Construct the kernel function prototype; kernel names are mangled per the Itanium ABI
- Deserialization: emit device code that reconstructs the OpenCL arguments into captured objects on the device (see the sketch below)
- Populate the work-item index into the kernel
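A hedged sketch of what flattening and deserialization mean for one captured array_view<int, 2>. The struct, function, and parameter names are illustrative, not Kalmar's actual generated code; the OpenCL-side view appears in the comments.

  // On the host, a lambda capturing an array_view<int, 2> named 'a' is
  // serialized member by member, so the device kernel receives a list of
  // primitive arguments, conceptually:
  //
  //   kernel void k(global int* a_data, int a_extent0, int a_extent1, ...)
  //
  // On the device, emitted code reassembles ("deserializes") those
  // primitives back into an object the kernel body can use as 'a'.
  struct array_view_int2 {   // illustrative stand-in for array_view<int, 2>
    int* data;               // a global pointer after address-space deduction
    int  extent0, extent1;
  };

  inline array_view_int2 deserialize(int* data, int e0, int e1) {
    array_view_int2 v = {data, e0, e1};  // rebuild the captured object
    return v;
  }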

Tasks for GPU Target Lowering (LLVM IR to OpenCL C, SPIR, or HSA BRIG)
- Deduce memory address spaces for kernel arguments
- Link with platform-specific built-in functions (illustrated below):
  - atomic functions
  - floating-point functions
  - work-item indexing
- Inlining and optimization
- Produce SPIR binary
- Produce OpenCL C: a modified C backend from LLVM emits OpenCL C from SPIR
- Produce HSAIL BRIG: depends on the HSAIL compiler from the HSA Foundation
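As an illustration of the kind of mapping this linking involves (the work-item row is confirmed by the conceptual-compilation slide below; the other rows pair standard C++ AMP built-ins with standard OpenCL C built-ins and are plausible examples, not Kalmar's verified internals):

  index<2> work-item index, idx[0]       ->  get_global_id(0)
  concurrency::atomic_fetch_add(&x, 1)   ->  atomic_add(&x, 1)
  concurrency::fast_math::sqrtf(v)       ->  native_sqrt(v)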

Conceptual Compilation of Device Code

C++ AMP:

  [=](index<2> idx) restrict(amp) {
    int row = idx[0];
    int col = idx[1];
    for (int inner = 0; inner < a.get_extent()[1]; ++inner)
      product[idx] += a(row, inner) * b(inner, col);
  }

OpenCL C:

  kernel void _Z6kernelPiiS_iiS_ii(
      global int* vp, int rowp, int colp,
      global int* va, int rowa, int cola,
      global int* vb, int rowb, int colb) {
    int row = get_global_id(0);
    int col = get_global_id(1);
    for (int inner = 0; inner < cola; ++inner)
      vp[row * colp + col] += va[row * cola + inner] * vb[inner * colb + col];
  }

Tasks for the C++ AMP Runtime
- Lazily determine which platform-dependent runtime is available
- Dynamically load and delegate runtime calls to the platform-dependent C++ AMP runtimes
(Diagram: the binary executable carries the host object plus OpenCL C, SPIR, and HSA BRIG device code; the C++ AMP runtime dispatches to the C++ AMP OpenCL runtime, the C++ AMP HSA runtime, or the C++ AMP CPU runtime.)
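A minimal sketch of how such lazy, dynamic dispatch can work, assuming dlopen-based loading on Linux; the library names are hypothetical, not Kalmar's actual file names.

  #include <dlfcn.h>

  // Try each platform runtime in priority order; the first that loads wins.
  void* load_platform_runtime() {
    static const char* const candidates[] = {
      "libamp_hsa.so",     // HSA runtime     (hypothetical name)
      "libamp_opencl.so",  // OpenCL runtime  (hypothetical name)
      "libamp_cpu.so"      // CPU fallback    (hypothetical name)
    };
    for (const char* name : candidates)
      if (void* handle = dlopen(name, RTLD_LAZY))
        return handle;
    return nullptr;  // no platform runtime available
  }

Individual entry points would then be resolved with dlsym and runtime calls delegated through them.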

Case Study: SGEMM in BLAS

Configuration:
- CPU: Intel i5-4440 @ 3.10 GHz
- GPU: AMD FirePro W8100
(Performance chart not transcribed.)
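The slide carries only the benchmark configuration. For reference, a naive (untiled) C++ AMP SGEMM sketch is shown below; it illustrates the programming model only, not the tuned kernel that was actually benchmarked.

  #include <amp.h>
  using namespace concurrency;

  // C = alpha * A * B + beta * C; row-major, no transpose.
  void sgemm(int m, int n, int k, float alpha, const float* A,
             const float* B, float beta, float* C) {
    array_view<const float, 2> a(m, k, A);
    array_view<const float, 2> b(k, n, B);
    array_view<float, 2> c(m, n, C);
    parallel_for_each(c.extent, [=](index<2> idx) restrict(amp) {
      float acc = 0.0f;                  // dot product of one row and column
      for (int i = 0; i < k; ++i)
        acc += a(idx[0], i) * b(i, idx[1]);
      c[idx] = alpha * acc + beta * c[idx];
    });
    c.synchronize();                     // copy the result back to C
  }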

Case Study: Neural Network

                          Configuration A             Configuration B
  CPU                     Intel i5-4440 @ 3.10 GHz    Intel i5-4440 @ 3.10 GHz
  GPU                     Nvidia GeForce GTX 780 Ti   AMD FirePro W8100
  Neural network package  Torch7                      Torch7
  GPU backend             cutorch + cunn              C++ AMP implementation
  Training data set       AlexNet                     AlexNet
  Training epochs         100                         100
  Time                    700 secs                    1,013 secs

Matrix Multiplication on HSA

  void MultiplyWithSVM(int* amatrix, int* bmatrix, int *productmatrix) {
    parallel_for_each(
      extent<2>(3, 3),
      [&](index<2> idx) restrict(amp) {
        int row = idx[0];
        int col = idx[1];
        for (int inner = 0; inner < 2; inner++)
          productmatrix[row * 3 + col] +=
              amatrix[row * 2 + inner] * bmatrix[inner * 3 + col];
      });
  }

With shared virtual memory (SVM) on HSA:
- array_view instances are no longer needed: pointers on the host can be accessed directly in GPU kernels (note the by-reference [&] capture).
- Synchronization back to the host is no longer necessary.

Platform Atomics on HSA

Atomic objects in <atomic> can be shared between CPU and GPU via SVM, giving lock-free synchronization between CPU and GPU.

GPU:

  // send command request to CPU
  cmd.store(val, std::memory_order_release);
  // wait until the CPU returns
  while (result.load(std::memory_order_acquire))
    ;

CPU:

  while (true) {
    // fetch command request from GPU
    val = cmd.load(std::memory_order_acquire);
    // do something according to val
    // (omitted)
    ret = ...;
    // store results to GPU
    result.store(ret, std::memory_order_release);
  }
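The CPU-GPU pair cannot be reproduced without HSA hardware, but the same acquire/release handshake can be sketched with two host threads standing in for the two devices. This is a self-contained illustration using plain std::atomic, not code from the slides (build with -pthread):

  #include <atomic>
  #include <cstdio>
  #include <thread>

  std::atomic<int> cmd{0};     // the "GPU" writes a request here
  std::atomic<int> result{0};  // the "CPU" publishes the answer here

  int main() {
    // "CPU" side: service one command request.
    std::thread server([] {
      int val;
      while ((val = cmd.load(std::memory_order_acquire)) == 0)
        ;                                            // spin until a request arrives
      int ret = val * 2;                             // do something according to val
      result.store(ret, std::memory_order_release);  // store result to the "GPU"
    });

    // "GPU" side: send a request, then wait for the answer.
    cmd.store(21, std::memory_order_release);
    int ret;
    while ((ret = result.load(std::memory_order_acquire)) == 0)
      ;                                              // wait until the "CPU" returns
    std::printf("result = %d\n", ret);               // prints: result = 42
    server.join();
    return 0;
  }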

Towards Heterogeneous C++
- The Kalmar infrastructure allows one compiler frontend to accept various GPU programming paradigms:
  - C++ AMP
  - C++17 proposals for parallelism (e.g. N4167, N4409, N4494), being implemented and available in Q3 '15
- Allow more C++ syntax within GPU kernels
- OpenCL 2.0 & SPIR-V support
- Upstream our work
