Generic System Calls for GPUs
Ján Veselý*, Arkaprava Basu, Abhishek Bhattacharjee*, Gabriel H. Loh, Mark Oskin, Steven K. Reinhardt
*Rutgers University, Indian Institute of Science, Advanced Micro Devices Inc., University of Washington, Microsoft Inc.

Towards heterogeneous computing
  [Diagram: an application spanning CPU, GPU, and accelerator (Acc)]
ISCA 2018

Programs require system services
  CPU:
    function process(port, file) {
        data, response_data, response = malloc();   // Memory allocation
        data = recvmsg(port);                        // Network
        idx = get_idx(data);
        response_data = pread(file, idx);            // Storage
        response = process(response_data, data);
        log("request processed\n");                  // Terminal/Storage
        sendmsg(port, response);                     // Network
        free(data, response_data, response);         // Memory allocation
    }
  GPU:

Computation can be offloaded
  Legend: GPU management, GPU invocation, GPU kernel
  CPU:
    function process(port, file) {
        data, response_data, response = malloc();
        data[] = recvmsgs(port);
        copy_to_device(data[]);                               // GPU management
        gpu_get_idx(&idx[], data[]);                          // GPU invocation
        copy_from_device(idx[]);                              // GPU management
        for (i in indices of data)
            response_data[i] = pread(file, idx[i]);
        copy_to_device(response_data[]);                      // GPU management
        gpu_process(&response[], response_data[], data[]);    // GPU invocation
        copy_from_device(response[]);                         // GPU management
        log("requests processed\n");
        sendmsgs(port, response[]);
    }
  GPU:
    Kernel 1: idx[i] = get_idx(data[i])
    Kernel 2: response[i] = process(response_data[i], data[i])

GPUs are tightly integrated
  Unified virtual memory (UVM)
    HSA, CUDA UVM, OpenCL SVM
  CPU-GPU cache coherence
    HSA, CCIX, Gen-Z

UVM and cache coherence ease programmability
  Legend: GPU management, GPU invocation, GPU kernel
  CPU:
    function process(port, file) {
        data, response_data, response = malloc();
        data[] = recvmsgs(port);
        copy_to_device(data[]);                               // GPU management
        gpu_get_idx(&idx[], data[]);                          // GPU invocation
        copy_from_device(idx[]);                              // GPU management
        for (i in indices of data)
            response_data[i] = pread(file, idx[i]);
        copy_to_device(response_data[]);                      // GPU management
        gpu_process(&response[], response_data[], data[]);    // GPU invocation
        copy_from_device(response[]);                         // GPU management
        log("requests processed\n");
        sendmsgs(port, response[]);
    }
  GPU:
    Kernel 1: idx[i] = get_idx(data[i])
    ?
    Kernel 2: response[i] = process(response_data[i], data[i])

Next step is system services
  Memory allocation
    HSA, CUDA
  Printf
    HSA, OpenCL, CUDA
  Academic research
    GPUfs [Silberstein, ASPLOS '13], GPUnet [Kim, OSDI '14], SPIN [Bergman, ATC '17]

Some services can be invoked from GPU
  CPU:
    function process(port, file) {
        data, response_data, response = malloc();
        gpu_process(port, file, response[], response_data[], data[]);
        free(data, response_data, response);
    }
  GPU:
    void gpu_group_process(port, file) {
        data = Grecv(port);                         // GPUnet
        idx = gpu_get_idx(&idx, data);
        response_data = Gread(file, idx);           // GPUfs
        response = process(response_data, data);
        Gprintf("request processed\n");             // CUDA
        Gsend(port, response);
    }

Previous solutions took the first steps
  Subsystem specific
  Specialized, restricted functionality
  Custom API/semantics

Our work takes the next step
  GENEric SYStem call interface
    Efficient direct-to-OS communication
  Allows all system calls implementable for GPUs
    79% of all system calls
  Original OS (Linux) semantics
    POSIX-like
  Available on GitHub
    https://github.com/radeonopencompute/{rock,roct,hcc}_syscall

Genesys subsumes previous work (and more)
  CPU:
    function process(port, file) {
        gpu_process(port, file, response[], response_data[], data[]);
    }
  GPU (GENESYS):
    void gpu_process(port, file) {
        data, response_data, response = malloc();
        data = recvmsg(port);
        idx = get_idx(data);
        response_data = pread(file, idx);
        response = process(response_data, data);
        log("requests processed\n");
        sendmsg(port, response);
        free(data, response_data, response);
    }

Ideal system services properties
  Familiarity
    Known semantics
  Flexibility
    Do not restrict programmers
  Adaptability
    Adapt to workload needs

Flexibility in application interface
  Invocation granularity
  Observed ordering
  Blocking vs. non-blocking

Flexibility: Any thread can invoke system call
  GPU execution hierarchy
    Workitem (thread)
    Workgroup (thread group)
    Kernel
  Wavefront (warp)
    HW specific! Do not expose!
  [Diagram: a kernel made up of four workgroups, with a marker labeled "Invokes system call"]

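To make the effect of invocation granularity concrete, the sketch below counts how many system call invocations a single kernel launch would issue at each level of the hierarchy. The function name and the example sizes are illustrative assumptions, not figures from the talk.

```python
def syscall_invocations(num_workgroups, workitems_per_group, granularity):
    """Count system call invocations issued by one kernel launch.

    granularity is one of "workitem", "workgroup", or "kernel".
    """
    if granularity == "workitem":
        # Every thread issues its own call.
        return num_workgroups * workitems_per_group
    if granularity == "workgroup":
        # One call on behalf of each thread group.
        return num_workgroups
    if granularity == "kernel":
        # One call on behalf of the whole launch.
        return 1
    raise ValueError(f"unknown granularity: {granularity}")
```

For a hypothetical launch of 4 workgroups with 64 workitems each, this gives 256, 4, and 1 invocations respectively.
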
Flexibility: Ordering can be relaxed (group)
  Strict ordering
    Both barriers (before and after the system call)
  Relaxed ordering
    Remove one barrier
      The barrier before a read
      The barrier after a write
  [Animation: a workgroup invoking write]

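A minimal CPU-side simulation of the two orderings for a group-level write, under the assumption that Python threads stand in for workitems and `threading.Barrier` stands in for a workgroup barrier. The names `group_write` and `strict` are hypothetical. For a write, the barrier before the call is kept so that every workitem's data is ready; relaxed ordering drops only the barrier after the call.

```python
import os
import threading


def group_write(num_items, fd, strict):
    """Simulate one workgroup issuing a single group-level write().

    Returns the number of bytes written by the invoking workitem.
    """
    parts = [b""] * num_items
    before = threading.Barrier(num_items)
    after = threading.Barrier(num_items)
    written = []

    def workitem(i):
        parts[i] = b"item%d\n" % i     # each workitem produces its share
        before.wait()                   # all data is ready before the call
        if i == 0:                      # one workitem invokes for the group
            written.append(os.write(fd, b"".join(parts)))
        if strict:
            after.wait()                # strict: wait until the call has returned
        # relaxed: the other workitems continue immediately

    threads = [threading.Thread(target=workitem, args=(i,))
               for i in range(num_items)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return written[0]
```

Both modes write the same bytes; the difference is only whether the non-invoking workitems are held until the call completes.
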
Flexibility: Allow non-blocking invocation
  Blocking invocation
    Wait for result
  Non-blocking invocation
    Return value collected later or not at all
  [Animation: a workgroup invoking write]

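The difference between the two invocation modes can be sketched on the CPU with a single worker thread standing in for the processor that services requests. This is an analogy only: `blocking_call` and `nonblocking_call` are hypothetical helper names, and a `Future` plays the role of a return value that may be collected later or ignored.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# One worker thread stands in for the CPU that executes the requested calls.
cpu = ThreadPoolExecutor(max_workers=1)


def blocking_call(func, *args):
    """Issue the call and wait for its return value."""
    return cpu.submit(func, *args).result()


def nonblocking_call(func, *args):
    """Issue the call and return at once.

    The returned Future can be checked later or simply dropped.
    """
    return cpu.submit(func, *args)
```

A caller that does not need the result of, say, a logging write can drop the `Future` and keep computing.
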
Adaptability in implementation
  Don't waste resources
    Syscall-light applications
    Important for heterogeneous systems
      Share power and energy budget
  Use as many resources as possible
    Syscall-heavy applications

Implementation
  1. Fill parameters
  2. Send interrupt (suspend)
  3. Process interrupt
  4. Execute system call
  5. Fill return value
  6. Wake up wavefront (if suspended)
  [Diagram: GPU and CPU sharing a syscall area in main memory, annotated with steps 1 to 6]

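The six steps can be mimicked with two CPU threads, which may help in following the handshake. This is a sketch under stated assumptions, not the Genesys implementation: `SyscallSlot` is a hypothetical layout for one entry of the syscall area, a `queue.Queue` stands in for the interrupt, and a `threading.Event` stands in for suspending and waking a wavefront.

```python
import os
import queue
import threading


class SyscallSlot:
    """One entry in the shared syscall area (hypothetical layout)."""

    def __init__(self):
        self.func = None
        self.args = ()
        self.result = None
        self.done = threading.Event()


interrupts = queue.Queue()  # stands in for the GPU-to-CPU interrupt


def cpu_worker():
    while True:
        slot = interrupts.get()                # 3. process interrupt
        if slot is None:                       # shutdown marker, not a protocol step
            return
        slot.result = slot.func(*slot.args)    # 4. execute call, 5. fill return value
        slot.done.set()                        # 6. wake up the suspended caller


def gpu_invoke(slot, func, *args):
    slot.func, slot.args = func, args          # 1. fill parameters
    slot.done.clear()
    interrupts.put(slot)                       # 2. send interrupt ...
    slot.done.wait()                           #    ... and suspend until woken
    return slot.result


def demo():
    worker = threading.Thread(target=cpu_worker)
    worker.start()
    try:
        return gpu_invoke(SyscallSlot(), os.getpid)
    finally:
        interrupts.put(None)
        worker.join()
```

Calling `demo()` returns the process ID as obtained by the worker thread on behalf of the caller.
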
Genesys works on off-the-shelf hardware
  AMD FX-9800P
    4 CPU cores, 8 CUs (GPU cores)
    Share 15 W of TDP
    16 GB DDR4 RAM
  GPU L2 cache is CPU coherent
  GPU L1 coherence is handled in software
  Provides CPU-GPU atomic operations

Genesys supports wide range of use cases
  Storage
  Networking
  Memory management
  Device control

Storage workload: grep
  Parallelize across number of files
  Exploit high-throughput storage devices
  Each workitem (thread): open, read, write(stdout), close
  [Chart: grep run time in seconds, lower is better. Series: CPU original, CPU openmp (4T), Genesys workgroup, Genesys workitem]

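The per-workitem pattern of open, read, write, close can be sketched on the CPU with one thread per file. This is an analogy that assumes threads in place of GPU workitems; `grep_file` and `parallel_grep` are hypothetical names, and the output descriptor defaults to stdout as on the slide.

```python
import os
import threading


def grep_file(path, pattern, out_fd):
    """One workitem: open, read, write matching lines, close."""
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
        for line in b"".join(chunks).splitlines():
            if pattern in line:
                os.write(out_fd, os.fsencode(path) + b":" + line + b"\n")
    finally:
        os.close(fd)


def parallel_grep(paths, pattern, out_fd=1):
    """Search every file concurrently, one thread per file."""
    threads = [threading.Thread(target=grep_file, args=(p, pattern, out_fd))
               for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Output lines from different files may interleave in any order, since each thread writes independently.
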
Networking workload: memcached
  Heterogeneous application
    CPU and GPU work on the same data
    SET: CPU
    GET: GPU, CPU
  Each workgroup (thread group): recvmsg, write(stderr), sendmsg
    Parallelize: hash, lookup, data copy
  [Chart: memcached throughput in operations per second, higher is better, for hits and misses. Series: CPU, GPU Genesys, GPU without syscalls]
  [Chart: memcached latency in ms, lower is better, for hits and misses. Series: CPU, GPU Genesys, GPU without syscalls]

Memory management: miniAMR
  Algorithm includes memory allocator
    Adaptive mesh refining
  Enable judicious use of system resources
    Accelerator multiprogramming
  Coarsening workitems (threads)
    madvise(MADV_DONTNEED)

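For readers unfamiliar with the call, the sketch below shows what `madvise(MADV_DONTNEED)` does when issued from the CPU on Linux: the pages of a private anonymous mapping are handed back to the OS and read as zeros afterwards. The helper name `release_pages` is hypothetical, and the function returns `None` on platforms where the call or these semantics are not available.

```python
import mmap
import sys


def release_pages(length=mmap.PAGESIZE):
    """Fill a private anonymous mapping, then release its pages.

    Returns (first byte before, first byte after), or None if unsupported.
    """
    if not sys.platform.startswith("linux"):
        return None
    if not hasattr(mmap, "MADV_DONTNEED") or not hasattr(mmap.mmap, "madvise"):
        return None
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"x" * length
        before = buf[0:1]
        buf.madvise(mmap.MADV_DONTNEED)   # the kernel may reclaim these pages
        after = buf[0:1]                  # on Linux the pages are zero-filled on next access
        return before, after
    finally:
        buf.close()
```

The mapping stays valid after the call, so an allocator can reuse the address range later without mapping it again.
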
Device control: ioctl
  Audio devices
  USB devices
  Network devices
  GPU!
    Display frame buffer

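As a small illustration of how `ioctl` controls a device, the sketch below sets and reads back the window size of a pseudo-terminal. A pseudo-terminal is used here only because it is available without special hardware; it is a stand-in for the frame buffer mentioned on the slide, and the helper name `set_and_get_winsize` is hypothetical.

```python
import fcntl
import os
import struct
import termios


def set_and_get_winsize(rows, cols):
    """Set a pseudo-terminal's window size via ioctl, then read it back.

    Returns (rows, cols), or None if no pseudo-terminal can be opened.
    """
    try:
        master, slave = os.openpty()
    except OSError:
        return None
    try:
        # struct winsize: rows, columns, x pixels, y pixels (unsigned shorts)
        size = struct.pack("HHHH", rows, cols, 0, 0)
        fcntl.ioctl(slave, termios.TIOCSWINSZ, size)
        raw = fcntl.ioctl(slave, termios.TIOCGWINSZ, bytes(8))
        return struct.unpack("HHHH", raw)[:2]
    finally:
        os.close(master)
        os.close(slave)
```

The same pattern of a file descriptor, a request code, and a packed argument buffer applies to other device classes, each with its own request codes.
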
Conclusion
  Generic POSIX-like system calls for GPUs are viable
  Improvement in programming environment leads to new applications and improved performance of traditional ones
  All code is available on GitHub, hosted by the AMD ROCm project
    https://github.com/radeonopencompute/{rock,roct,hcc}_syscall

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2018 Advanced Micro Devices, Inc. All rights reserved. AMD, Radeon, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.