Generic System Calls for GPUs


Generic System Calls for GPUs
Ján Veselý (Rutgers University), Arkaprava Basu (Indian Institute of Science), Abhishek Bhattacharjee (Rutgers University), Gabriel H. Loh (Advanced Micro Devices Inc.), Mark Oskin (University of Washington), Steven K. Reinhardt (Microsoft)

Towards heterogeneous computing
[diagram: an application spanning CPU, accelerators (Acc), and GPU]

Programs require system services

CPU:
    function process(port, file) {
        data, response_data, response = malloc();   // memory allocation
        data = recvmsg(port);                        // network
        idx = get_idx(data);
        response_data = pread(file, idx);            // storage
        response = process(response_data, data);
        log("request processed\n");                  // terminal/storage
        sendmsg(port, response);                     // network
        free(data, response_data, response);         // memory allocation
    }

Computation can be offloaded

CPU (GPU management and invocation):
    function process(port, file) {
        data, response_data, response = malloc();
        data[] = recvmsgs(port);
        copy_to_device(data[]);
        gpu_get_idx(&idx[], data[]);                          // GPU kernel 1: idx[i] = get_idx(data[i])
        copy_from_device(idx[]);
        for (i in data)
            response_data[i] = pread(file, idx[i]);
        copy_to_device(response_data[]);
        gpu_process(&response[], response_data[], data[]);    // GPU kernel 2: response[i] = process(response_data[i], data[i])
        copy_from_device(response[]);
        log("requests processed\n");
        sendmsgs(port, response[]);
    }

GPUs are tightly integrated
- Unified virtual memory (UVM): HSA, CUDA UVM, OpenCL SVM
- CPU-GPU cache coherence: HSA, CCIX, Gen-Z

UVM and cache coherence ease programmability
With unified virtual memory and coherent caches, the explicit copy_to_device/copy_from_device calls and much of the GPU management code above are no longer needed: CPU and GPU kernels operate on the same data directly. The pread() in the middle of the pipeline, however, still forces control back to the CPU.

Next step is system services
- Memory allocation: HSA, CUDA
- Printf: HSA, OpenCL, CUDA
- Academic research: GPUfs [Silberstein, ASPLOS 13], GPUnet [Kim, OSDI 14], SPIN [Bergman, ATC 17]

Some services can be invoked from GPU

CPU:
    function process(port, file) {
        data, response_data, response = malloc();
        gpu_process(port, file, response[], response_data[], data[]);
        free(data, response_data, response);
    }

GPU:
    void gpu_group_process(port, file) {
        data = Grecv(port);                  // GPUnet
        idx = gpu_get_idx(&idx, data);
        response_data = Gread(file, idx);    // GPUfs
        response = process(response_data, data);
        Gprintf("request processed\n");      // CUDA
        Gsend(port, response);               // GPUnet
    }

Previous solutions took the first steps
- Subsystem specific
- Specialized, restricted functionality
- Custom API/semantics

Our work takes the next step
- GENEric SYStem call interface (Genesys)
- Efficient direct-to-OS communication
- Allows all system calls implementable for GPUs: 79% of all system calls
- Original OS (Linux) semantics, POSIX-like
- Available on GitHub: https://github.com/radeonopencompute/{rock,roct,hcc}_syscall

Genesys subsumes previous work (and more)

CPU:
    function process(port, file) {
        gpu_process(port, file, response[], response_data[], data[]);
    }

GPU (Genesys):
    void gpu_process(port, file) {
        data, response_data, response = malloc();
        data = recvmsg(port);
        idx = get_idx(data);
        response_data = pread(file, idx);
        response = process(response_data, data);
        log("requests processed\n");
        sendmsg(port, response);
        free(data, response_data, response);
    }

Ideal system service properties
- Familiarity: known semantics
- Flexibility: do not restrict programmers
- Adaptability: adapt to workload needs

Flexibility in application interface
- Invocation granularity
- Observed ordering
- Blocking vs. non-blocking

Flexibility: Any thread can invoke a system call
GPU execution hierarchy (a kernel consists of workgroups, each consisting of workitems):
- Workitem (thread): can invoke system calls
- Workgroup (thread group): can invoke system calls
- Kernel: can invoke system calls
- Wavefront (warp): hardware specific — do not expose as an invocation granularity!
A sketch of workitem- vs. workgroup-granularity invocation follows.
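To make the granularity options concrete, here is a minimal sketch in HIP/CUDA-style C++. It is illustrative only: gpu_write is a stand-in for a hypothetical Genesys device-side wrapper around write(2) and is not a documented ROCm/HIP API.

```cpp
// Hypothetical Genesys device-side wrapper; declared here only so the sketch
// is self-contained. Not a real ROCm/HIP API.
__device__ long gpu_write(int fd, const void *buf, unsigned long n);

// Workitem granularity: every thread issues its own system call.
__global__ void per_workitem(const char *msgs, unsigned long len)
{
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    gpu_write(1 /* stdout */, msgs + tid * len, len);
}

// Workgroup granularity: one representative workitem issues the call on
// behalf of its whole workgroup, and the group synchronizes afterwards.
__global__ void per_workgroup(const char *msgs, unsigned long len)
{
    if (threadIdx.x == 0)
        gpu_write(1 /* stdout */, msgs + blockIdx.x * len, len);
    __syncthreads();   // the rest of the workgroup waits for the call
}
```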

Flexibility: Ordering can be relaxed (workgroup granularity)
- Strict ordering: both barriers — one before the OS reads the call's arguments, one after it writes the results
- Relaxed ordering: remove one of the two barriers (the before-read barrier or the after-write barrier) when the application does not need it
A sketch of the two barrier placements follows.
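The same HIP/CUDA-style sketch can show where the two barriers sit for a workgroup-granularity call (gpu_write remains a hypothetical wrapper). The first __syncthreads() is the "before read" barrier: the OS must not read the buffer until every workitem has filled its part. The second is the "after write" barrier: no workitem may reuse the buffer until the call has completed.

```cpp
__device__ long gpu_write(int fd, const void *buf, unsigned long n);   // hypothetical wrapper

__device__ char compute_byte(unsigned i) { return (char)('a' + (i % 26)); }  // illustrative work

__global__ void strict_ordering(char *buf, unsigned long len)
{
    buf[threadIdx.x] = compute_byte(threadIdx.x);  // every workitem fills its slot
    __syncthreads();          // "before read": buffer is complete before the OS reads it
    if (threadIdx.x == 0)
        gpu_write(1 /* stdout */, buf, len);
    __syncthreads();          // "after write": nobody reuses buf until the call is done
    buf[threadIdx.x] = 0;     // safe to overwrite only after the second barrier
    // Relaxed ordering: drop one of the two __syncthreads() calls when the
    // corresponding guarantee is not needed (e.g. buf is never reused).
}
```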

Flexibility: Allow non-blocking invocation
- Blocking invocation: wait for the result
- Non-blocking invocation: return value collected later, or not at all
A sketch of both forms follows.
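A sketch of blocking versus non-blocking invocation from the device side, assuming hypothetical wrappers gpu_write (blocking, returns the result), gpu_write_nb (non-blocking, returns a handle), and gpu_syscall_wait (collects the result later). None of these names are real APIs.

```cpp
// Hypothetical device-side wrappers, declared only to make the sketch
// self-contained; none of these are real ROCm/HIP APIs.
__device__ long gpu_write(int fd, const void *buf, unsigned long n);      // blocking
__device__ int  gpu_write_nb(int fd, const void *buf, unsigned long n);   // non-blocking, returns a handle
__device__ long gpu_syscall_wait(int handle);                             // collect the result later

__global__ void logging_kernel(const char *msg, unsigned long len, long *out)
{
    // Blocking: the wavefront waits until the CPU has executed the call.
    long written = gpu_write(2 /* stderr */, msg, len);
    (void)written;

    // Non-blocking: fire and forget, or collect the return value later.
    int h = gpu_write_nb(2 /* stderr */, msg, len);
    // ... independent work can proceed here while the CPU services the request ...
    if (threadIdx.x == 0)
        out[blockIdx.x] = gpu_syscall_wait(h);   // optional: could be skipped entirely
}
```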

Ideal system service properties (recap): familiarity, flexibility, adaptability.

Adaptability in implementation
- Don't waste resources on syscall-light applications — important for heterogeneous systems, which share a power and energy budget
- Use as many resources as possible for syscall-heavy applications

Implementation
GPU and CPU communicate through a syscall area in main memory:
1. Fill parameters (GPU writes them into the syscall area)
2. Send interrupt (optionally suspend the wavefront)
3. Process interrupt (CPU)
4. Execute system call (CPU)
5. Fill return value (CPU writes it back to the syscall area)
6. Wake up wavefront (if suspended)
A sketch of the CPU side of this protocol follows.
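The CPU side of steps 3–5 can be pictured with a small C++ sketch of one request slot in the shared syscall area. The structure layout, field names, and the single-slot simplification are assumptions made for illustration, not the actual Genesys ABI: the GPU fills sysno and args (steps 1–2), and the CPU handler then does roughly the following before the wavefront is woken (step 6).

```cpp
#include <atomic>
#include <cstdint>
#include <unistd.h>
#include <sys/syscall.h>

// One request slot in the shared "syscall area" (illustrative layout only).
enum slot_state { SLOT_FREE, SLOT_REQUESTED, SLOT_DONE };

struct syscall_slot {
    std::atomic<int> state;   // FREE -> REQUESTED (GPU, steps 1-2) -> DONE (CPU, step 5)
    int      sysno;           // e.g. __NR_pread64
    uint64_t args[6];         // step 1: parameters filled by the GPU wavefront
    int64_t  retval;          // step 5: return value filled by the CPU
};

// CPU interrupt/worker path (steps 3-5): execute the call on behalf of the
// wavefront and publish the result with release semantics so the GPU's
// subsequent acquire load observes both retval and the state change.
void cpu_handle_request(syscall_slot *s)
{
    long r = syscall(s->sysno, s->args[0], s->args[1], s->args[2],
                     s->args[3], s->args[4], s->args[5]);
    s->retval = r;
    s->state.store(SLOT_DONE, std::memory_order_release);
}
```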

Genesys works on off-the-shelf hardware
- AMD FX-9800P: 4 CPU cores, 8 CUs (GPU cores), sharing 15 W of TDP; 16 GB DDR4 RAM
- GPU L2 cache is CPU coherent; GPU L1 coherence is handled in software
- Provides CPU-GPU atomic operations

Ideal system service properties (recap): familiarity, flexibility, adaptability.

Genesys supports a wide range of use cases
- Storage
- Networking
- Memory management
- Device control

Storage workload: grep
- Parallelize across the number of files
- Exploit high-throughput storage devices
- Each workitem (thread) issues: open, read, write(stdout), close (see the sketch below)
- [chart: grep runtime in seconds, lower is better, comparing CPU original, CPU OpenMP (4 threads), Genesys workgroup, and Genesys workitem variants]
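The per-file body each workitem runs is ordinary POSIX file I/O. Below is a minimal host-side C++ sketch of that body (the function name and the simplified substring matching are mine; a real grep also handles patterns split across read boundaries). In the Genesys port, each GPU workitem issues this same open/read/write/close sequence for its own file.

```cpp
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Per-file body, one file per workitem: open, read, write(stdout) on a
// match, close.
void grep_one_file(const char *path, const char *pattern)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return;

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf) - 1)) > 0) {
        buf[n] = '\0';
        if (strstr(buf, pattern) != nullptr) {       // simplified matching
            write(STDOUT_FILENO, path, strlen(path));
            write(STDOUT_FILENO, "\n", 1);
            break;
        }
    }
    close(fd);
}
```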

Networking workload: memcached
- Heterogeneous application: CPU and GPU work on the same data (SET on CPU; GET on GPU and CPU)
- Each workgroup (thread group) issues: recvmsg, write(stderr), sendmsg (see the sketch below)
- Parallelized across the workgroup: hash, lookup, data copy
- [charts: throughput in operations per second (higher is better) and latency in ms (lower is better), for hits and misses, comparing CPU, Genesys GPU, and GPU without syscalls]
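The socket calls each workgroup issues are the standard recvmsg/sendmsg pair. A minimal host-side C++ sketch of one request/response exchange over a UDP socket is below; build_response is a placeholder for the hashing, lookup, and data copying that the workgroup parallelizes, and all names here are illustrative.

```cpp
#include <cstring>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Placeholder for the parallelized hash/lookup/copy work (illustrative).
size_t build_response(const char *req, size_t req_len, char *rsp, size_t rsp_cap);

// One GET exchange, as issued per workgroup: receive a request, build the
// reply, send it back to the sender.
void serve_one_request(int sock)
{
    char req[1500], rsp[1500];
    sockaddr_storage peer{};

    iovec riov{req, sizeof(req)};
    msghdr rmsg{};
    rmsg.msg_name    = &peer;
    rmsg.msg_namelen = sizeof(peer);
    rmsg.msg_iov     = &riov;
    rmsg.msg_iovlen  = 1;
    ssize_t n = recvmsg(sock, &rmsg, 0);      // per-workgroup recvmsg
    if (n <= 0)
        return;

    size_t rlen = build_response(req, (size_t)n, rsp, sizeof(rsp));

    iovec siov{rsp, rlen};
    msghdr smsg{};
    smsg.msg_name    = &peer;                 // reply to the request's sender
    smsg.msg_namelen = rmsg.msg_namelen;
    smsg.msg_iov     = &siov;
    smsg.msg_iovlen  = 1;
    sendmsg(sock, &smsg, 0);                  // per-workgroup sendmsg
}
```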

Memory management workload: miniAMR
- The algorithm includes a memory allocator (adaptive mesh refinement)
- Enables judicious use of system resources and accelerator multiprogramming
- Coarsened across workitems (threads); uses madvise(MADV_DONTNEED) (see the sketch below)
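The page-release call itself is standard Linux: madvise with MADV_DONTNEED returns the physical pages backing a range while leaving the virtual mapping intact. A minimal sketch (function name is mine):

```cpp
#include <sys/mman.h>
#include <cstddef>

// Release the physical pages backing a mesh block that refinement no longer
// needs; the virtual range stays mapped and refaults as zero-filled pages if
// touched again.
int release_block(void *block, size_t bytes)
{
    return madvise(block, bytes, MADV_DONTNEED);
}
```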

Device control: ioctl
- Audio devices, USB devices, network devices, the GPU itself!
- Example: display frame buffer (see the sketch below)
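Frame-buffer control is a concrete ioctl example: query the Linux fbdev geometry, then write pixels. A minimal host-side C++ sketch (device path, pixel value, and error handling are simplified and illustrative):

```cpp
#include <cstring>
#include <fcntl.h>
#include <linux/fb.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Query the frame buffer's geometry via ioctl, then paint its first scanline.
int paint_first_line(const char *fbdev /* e.g. "/dev/fb0" */)
{
    int fd = open(fbdev, O_RDWR);
    if (fd < 0)
        return -1;

    fb_var_screeninfo vinfo{};
    if (ioctl(fd, FBIOGET_VSCREENINFO, &vinfo) < 0) {    // device control
        close(fd);
        return -1;
    }

    size_t line_bytes = (size_t)vinfo.xres * vinfo.bits_per_pixel / 8;
    char line[8192];
    if (line_bytes > sizeof(line))
        line_bytes = sizeof(line);
    memset(line, 0xFF, line_bytes);                      // "white" pixels
    write(fd, line, line_bytes);                         // first scanline
    close(fd);
    return 0;
}
```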

Conclusion
- Generic, POSIX-like system calls for GPUs are viable
- Improvements in the programming environment lead to new applications and better performance for traditional ones
- All code is available on GitHub, hosted under the AMD ROCm project: https://github.com/radeonopencompute/{rock,roct,hcc}_syscall

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Attribution: © 2018 Advanced Micro Devices, Inc. All rights reserved. AMD, Radeon, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.