Generic System Calls for GPUs

Ján Veselý*, Arkaprava Basu, Abhishek Bhattacharjee*, Gabriel H. Loh, Mark Oskin, Steven K. Reinhardt
*Rutgers University, Indian Institute of Science, Advanced Micro Devices Inc., University of Washington, Microsoft Inc.

Towards heterogeneous computing
  [Figure: an Application on top of CPU, Acc, and GPU]
  ISCA

Programs require system services
CPU:
  function process(port, file) {
    data, response_data, response = malloc();    // Memory allocation
    data = recvmsg(port);                        // Network
    idx = get_idx(data);
    response_data = pread(file, idx);            // Storage
    response = process(response_data, data);
    log("request processed\n");                  // Terminal/Storage
    sendmsg(port, response);                     // Network
    free(data, response_data, response);         // Memory allocation
  }
GPU: (empty)
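The request handler on the slide can be tried out on a CPU with a minimal Python sketch. This is an illustration only, not the authors' code: `RECORD_SIZE`, the decimal request format, and the uppercase "processing" step are assumptions made for the example, and Python's automatic memory management stands in for the explicit `malloc`/`free`.

```python
import os
import socket
import tempfile

RECORD_SIZE = 16  # hypothetical fixed record size in the data file


def get_idx(data: bytes) -> int:
    """Interpret the request payload as a decimal record index."""
    return int(data.decode().strip())


def process(port: socket.socket, fd: int) -> bytes:
    """Handle one request; each step uses a system service from the slide."""
    data = port.recv(64)                                           # network
    idx = get_idx(data)
    response_data = os.pread(fd, RECORD_SIZE, idx * RECORD_SIZE)   # storage
    response = response_data.strip().upper()                       # computation
    os.write(2, b"request processed\n")                            # terminal
    port.sendall(response)                                         # network
    return response


if __name__ == "__main__":
    client, server = socket.socketpair()
    with tempfile.TemporaryFile() as f:
        f.write(b"".join(f"record-{i}".ljust(RECORD_SIZE).encode() for i in range(4)))
        f.flush()
        client.sendall(b"2")
        process(server, f.fileno())
        print(client.recv(64))
    client.close()
    server.close()
```

Every commented line crosses into the operating system, which is the part a GPU kernel traditionally cannot do on its own.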

Computation can be offloaded
Legend: GPU MANAGEMENT, GPU INVOCATION, GPU KERNEL
CPU:
  function process(port, file) {
    data, response_data, response = malloc();
    data[] = recvmsgs(port);
    copy_to_device(data[]);
    gpu_get_idx(&idx[], data[]);
    copy_from_device(idx[]);
    for (i in data) response_data[i] = pread(file, idx[i]);
    copy_to_device(response_data[]);
    gpu_process(&response[], response_data[], data[]);
    copy_from_device(response[]);
    log("requests processed\n");
    sendmsgs(port, response[]);
  }
GPU:
  Kernel 1: idx[i] = get_idx(data[i])
  Kernel 2: response[i] = process(response_data[i], data[i])

GPUs are tightly integrated
  Unified virtual memory (UVM): HSA, CUDA UVM, OpenCL SVM
  CPU-GPU cache coherence: HSA, CCIX, Gen-Z

UVM and cache coherence ease programmability
Legend: GPU MANAGEMENT, GPU INVOCATION, GPU KERNEL
CPU:
  function process(port, file) {
    data, response_data, response = malloc();
    data[] = recvmsgs(port);
    copy_to_device(data[]);
    gpu_get_idx(&idx[], data[]);
    copy_from_device(idx[]);
    for (i in data) response_data[i] = pread(file, idx[i]);
    copy_to_device(response_data[]);
    gpu_process(&response[], response_data[], data[]);
    copy_from_device(response[]);
    log("requests processed\n");
    sendmsgs(port, response[]);
  }
GPU:
  Kernel 1: idx[i] = get_idx(data[i])
  Kernel 2: response[i] = process(response_data[i], data[i])
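The offloaded flow above splits the GPU work into two kernels because the `pread` calls in the middle must run on the CPU. A rough Python sketch of that shape follows; the two `kernel_*` functions are ordinary CPU stand-ins for GPU kernels, and `RECORD_SIZE` and the record layout are assumptions for the example.

```python
import os
import tempfile

RECORD_SIZE = 8  # hypothetical fixed record size


def kernel_get_idx(data):
    """Stand-in for GPU kernel 1: one index per request, no system calls."""
    return [int(d) for d in data]


def kernel_process(response_data, data):
    """Stand-in for GPU kernel 2: combine each record with its request."""
    return [r.strip() + b":" + d for r, d in zip(response_data, data)]


def offloaded_process(requests, fd):
    """Kernel 1, then CPU-side storage access, then kernel 2."""
    idx = kernel_get_idx(requests)
    # The CPU has to step in here: only it can issue the pread calls.
    response_data = [os.pread(fd, RECORD_SIZE, i * RECORD_SIZE) for i in idx]
    return kernel_process(response_data, requests)


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(b"".join(b"rec%d" % i + b" " * 4 for i in range(4)))
        f.flush()
        print(offloaded_process([b"1", b"3"], f.fileno()))
```

The round trip through the CPU between the two kernels is the cost that direct system calls from the GPU aim to remove.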

Next step is system services
  Memory allocation: HSA, CUDA
  Printf: HSA, OpenCL, CUDA
  Academic research: GPUfs [Silberstein, ASPLOS '13], GPUnet [Kim, OSDI '14], SPIN [Bergman, ATC '17]

Some services can be invoked from GPU
CPU:
  function process(port, file) {
    data, response_data, response = malloc();
    gpu_process(port, file, response[], response_data[], data[]);
    free(data, response_data, response);
  }
GPU:
  void gpu_group_process(port, file) {
    data = Grecv(port);                          // GPUnet
    idx = gpu_get_idx(&idx, data);
    response_data = Gread(file, idx);            // GPUfs
    response = process(response_data, data);
    Gprintf("request processed\n");              // CUDA
    Gsend(port, response);
  }

Previous solutions took the first steps
  Subsystem specific
  Specialized, restricted functionality
  Custom API/semantics

Our work takes the next step
  GENEric SYStem call interface (Genesys)
  Efficient direct-to-OS communication
  Allows all system calls implementable for GPUs: 79% of all system calls
  Original OS (Linux) semantics: POSIX-like
  Available on GitHub

Genesys subsumes previous work (and more)
CPU:
  function process(port, file) {
    gpu_process(port, file, response[], response_data[], data[]);
  }
GPU (GENESYS):
  void gpu_process(port, file) {
    data, response_data, response = malloc();
    data = recvmsg(port);
    idx = get_idx(data);
    response_data = pread(file, idx);
    response = process(response_data, data);
    log("requests processed\n");
    sendmsg(port, response);
    free(data, response_data, response);
  }

Ideal system services properties
  Familiarity: known semantics
  Flexibility: do not restrict programmers
  Adaptability: adapt to workload needs

Flexibility in application interface
  Invocation granularity
  Observed ordering
  Blocking vs. non-blocking

Flexibility: Any thread can invoke system call
  GPU execution hierarchy
    Workitem (thread)
    Workgroup (thread group)
    Kernel
    Wavefront (warp): HW specific! Do not expose!
  [Figure: a kernel made up of four workgroups, with an "Invokes system call" marker]

Flexibility: Ordering can be relaxed (group)
  Strict ordering: both barriers
  Relaxed ordering: remove one barrier
    Before read
    After write
  [Figure: a workgroup issuing a write system call]
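As a rough illustration of the two barriers, the sketch below simulates one workgroup with Python threads. It is not Genesys code: `GROUP_SIZE`, the `workgroup_write` helper, and the choice of workitem 0 as the one that issues the call are assumptions. For a write, the barrier before the call makes sure every workitem has produced its data; relaxed ordering drops the barrier after the call.

```python
import os
import threading

GROUP_SIZE = 4  # hypothetical workgroup size


def workgroup_write(fd, strict=True):
    """Run one simulated workgroup that issues a single write for the group."""
    buf = [b""] * GROUP_SIZE
    before = threading.Barrier(GROUP_SIZE)  # all inputs ready before the call
    after = threading.Barrier(GROUP_SIZE)   # nobody continues until it returns

    def workitem(i):
        buf[i] = b"item%d\n" % i            # produce this workitem's share
        before.wait()                       # kept in both modes for a write
        if i == 0:                          # one workitem invokes for the group
            os.write(fd, b"".join(buf))
        if strict:
            after.wait()                    # relaxed ordering removes this one

    threads = [threading.Thread(target=workitem, args=(i,)) for i in range(GROUP_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    workgroup_write(1, strict=False)
```

A read would mirror this: the barrier before the call can go, while the barrier after it stays so that no workitem consumes data that has not arrived yet.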

Flexibility: Allow non-blocking invocation
  Blocking invocation: wait for result
  Non-blocking invocation: return value collected later or not at all
  [Figure: a workgroup issuing a write system call]
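The difference between the two invocation modes can be sketched with a Python thread pool playing the role of the CPU side. The helper names `syscall_blocking` and `syscall_nonblocking` are made up for this example and are not part of Genesys.

```python
import os
from concurrent.futures import ThreadPoolExecutor

_cpu_worker = ThreadPoolExecutor(max_workers=1)  # stand-in for the CPU side


def syscall_blocking(fn, *args):
    """Blocking invocation: submit the request and wait for its result."""
    return _cpu_worker.submit(fn, *args).result()


def syscall_nonblocking(fn, *args):
    """Non-blocking invocation: return at once; the caller may collect later."""
    return _cpu_worker.submit(fn, *args)


if __name__ == "__main__":
    syscall_blocking(os.write, 1, b"waited for this write\n")
    pending = syscall_nonblocking(os.write, 1, b"did not wait for this one\n")
    pending.result()  # optional: collect the return value later
```

With the non-blocking form the caller is free to keep computing and may never look at the returned handle, which matches "collected later or not at all".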

Adaptability in implementation
  Don't waste resources
    Syscall-light applications
    Important for heterogeneous systems: share power and energy budget
  Use as many resources as possible
    Syscall-heavy applications

Implementation
  1. Fill parameters
  2. Send interrupt (suspend)
  3. Process interrupt
  4. Execute system call
  5. Fill return value
  6. Wake up wavefront (if suspended)
  [Figure: GPU and CPU connected through a syscall area in main memory, annotated with the step numbers]
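The six steps can be walked through with a small simulation in which Python threads play the GPU and CPU roles. This is a sketch under stated assumptions, not the Genesys implementation: `SyscallSlot`, the `interrupts` queue, and the assignment of steps 1 and 2 to the caller and steps 3 to 6 to the worker are choices made for the example.

```python
import os
import queue
import threading


class SyscallSlot:
    """One entry of a simulated syscall area shared by 'GPU' and 'CPU'."""

    def __init__(self):
        self.fn = None
        self.args = ()
        self.ret = None
        self.done = threading.Event()


interrupts = queue.Queue()  # stand-in for the GPU-to-CPU interrupt


def cpu_worker():
    """CPU side: serve requests until a None sentinel arrives."""
    while True:
        slot = interrupts.get()            # 3. process interrupt
        if slot is None:
            break
        slot.ret = slot.fn(*slot.args)     # 4. execute call, 5. fill return value
        slot.done.set()                    # 6. wake up the waiting caller


def gpu_invoke(fn, *args, blocking=True):
    """GPU side: publish a request and optionally wait for completion."""
    slot = SyscallSlot()
    slot.fn, slot.args = fn, args          # 1. fill parameters
    interrupts.put(slot)                   # 2. send interrupt
    if blocking:
        slot.done.wait()                   # suspended until step 6
    return slot


if __name__ == "__main__":
    worker = threading.Thread(target=cpu_worker)
    worker.start()
    print(gpu_invoke(os.getpid).ret)
    interrupts.put(None)
    worker.join()
```

A non-blocking caller simply skips the wait and can check `slot.done` later.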

Genesys works on off-the-shelf hardware
  AMD FX-9800P
    4 CPU cores, 8 CUs (GPU cores)
    Share 15W of TDP
    16GB DDR4 RAM
  GPU L2 cache is CPU coherent
  GPU L1 coherence is handled in software
  Provides CPU-GPU atomic operations

Genesys supports wide range of use cases
  Storage
  Networking
  Memory management
  Device control

Storage workload: grep
  Parallelize across number of files
  Exploit high-throughput storage devices
  Each workitem (thread): open, read, write(stdout), close
  [Figure: grep run time in seconds, lower is better; series: CPU original, CPU openmp (4T), Genesys workgroup, Genesys workitem]
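The per-workitem sequence of open, read, write, close can be sketched on a CPU with one Python thread per file. This only illustrates the access pattern and is not the benchmarked code; `grep_file`, `parallel_grep`, and the `path:line` output format are assumptions.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def grep_file(path, pattern, out_fd=1):
    """One 'workitem': open, read, write matching lines, close."""
    fd = os.open(path, os.O_RDONLY)                          # open
    try:
        data = b""
        while chunk := os.read(fd, 65536):                   # read
            data += chunk
        hits = [line for line in data.splitlines() if pattern in line]
        for line in hits:                                    # write
            os.write(out_fd, os.fsencode(path) + b":" + line + b"\n")
    finally:
        os.close(fd)                                         # close
    return len(hits)


def parallel_grep(paths, pattern, out_fd=1, workers=4):
    """Search every file concurrently and return the total match count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: grep_file(p, pattern, out_fd), paths))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for name, text in (("a.txt", b"gpu syscall\nother\n"), ("b.txt", b"nothing\n")):
            paths.append(os.path.join(d, name))
            with open(paths[-1], "wb") as f:
                f.write(text)
        parallel_grep(paths, b"syscall")
```

Each file is handled independently, so the work scales with the number of files, which is what lets a wide GPU keep a fast storage device busy.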

Networking workload: memcached
  Heterogeneous application
    CPU and GPU work on the same data
    SET: CPU
    GET: GPU, CPU
  Each workgroup (thread group): recvmsg, write(stderr), sendmsg
    Parallelize: hash, lookup, data copy
  [Figure: memcached throughput in operations per second for hits and misses, higher is better; series: CPU, GPU Genesys, GPU without syscalls]
  [Figure: memcached latency in ms for hits and misses, lower is better; series: CPU, GPU Genesys, GPU without syscalls]
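One GET served by a workgroup follows the sequence recvmsg, write(stderr), sendmsg. The Python sketch below shows that sequence on a CPU; the `cache` dictionary, the `HIT`/`MISS` reply format, and `serve_get` are assumptions for illustration and do not reflect the real memcached protocol.

```python
import os
import socket

cache = {b"key1": b"value1"}  # hypothetical table; SETs would come from the CPU


def serve_get(sock):
    """One 'workgroup': receive a GET, look it up, log, send the reply."""
    data, _ancdata, _flags, _addr = sock.recvmsg(1024)   # recvmsg
    value = cache.get(data)                # hash + lookup (parallel on a GPU)
    reply = b"HIT " + value if value is not None else b"MISS"
    os.write(2, b"GET served\n")                         # write(stderr)
    sock.sendmsg([reply])                                # sendmsg
    return value is not None


if __name__ == "__main__":
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    client.send(b"key1")
    serve_get(server)
    print(client.recv(1024))
    client.close()
    server.close()
```

Only the three commented calls need the operating system; the hashing, lookup, and data copy in between are the parts a workgroup can spread across its workitems.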

Memory management: miniAMR
  Algorithm includes memory allocator
    Adaptive mesh refining
  Enable judicious use of system resources
    Accelerator multiprogramming
  Coarsening: workitems (threads) call madvise(MADV_DONTNEED)
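Returning memory with `madvise(MADV_DONTNEED)` can be tried from Python (3.8 or newer), which exposes the call on `mmap` objects. The sketch assumes a private anonymous mapping as the allocator's backing store; `alloc_region` and `release_pages` are hypothetical helper names.

```python
import mmap

PAGE = mmap.PAGESIZE


def alloc_region(pages):
    """Map a private anonymous region, as a simple allocator might."""
    return mmap.mmap(-1, pages * PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


def release_pages(region, first_page, count):
    """Tell the OS the pages are no longer needed; the mapping stays valid."""
    if not hasattr(mmap, "MADV_DONTNEED"):  # not available on every platform
        return False
    region.madvise(mmap.MADV_DONTNEED, first_page * PAGE, count * PAGE)
    return True


if __name__ == "__main__":
    region = alloc_region(4)
    region[0:4] = b"mesh"
    release_pages(region, 0, 2)   # hand two pages back after coarsening
    region[0:4] = b"next"         # the range can still be reused afterwards
    region.close()
```

The address range stays mapped, so the application can touch it again later, while the physical pages become available to other work in the meantime.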

Device control: ioctl
  Audio devices
  USB devices
  Network devices
  GPU!
    Display frame buffer
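An `ioctl` call passes a device-specific request code and an argument buffer to whatever driver sits behind a file descriptor. A frame buffer is hard to exercise portably, so the sketch below uses a different and much simpler request as a stand-in: `FIONREAD` on a pipe, which reports how many bytes are waiting. The helper name `bytes_pending` is an assumption.

```python
import array
import fcntl
import os
import termios


def bytes_pending(fd):
    """Ask the driver behind fd how many bytes are waiting to be read."""
    n = array.array("i", [0])
    fcntl.ioctl(fd, termios.FIONREAD, n, True)  # request code + mutable buffer
    return n[0]


if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"hello")
    print(bytes_pending(r))
```

The same call shape, with a different request code and argument structure, is how a program would query or configure audio, USB, network, or display devices.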

Conclusion
  Generic POSIX-like system calls for GPUs are viable
  Improvement in programming environment leads to new applications and improved performance of traditional ones
  All code is available on GitHub, hosted by the AMD ROCm project

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2018 Advanced Micro Devices, Inc. All rights reserved. AMD, Radeon, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.
