PTask: Operating System Abstractions to Manage GPUs as Compute Devices
Chris Rossbach, Jon Currey (Microsoft Research); Mark Silberstein (Technion); Baishakhi Ray, Emmett Witchel (UT Austin)
SOSP, October 25, 2011
There are lots of GPUs
- 3 of the top 5 supercomputers use GPUs
- In all new PCs, smart phones, tablets
- Great for gaming and HPC/batch
- Unusable in other application domains

GPU programming challenges
- GPU and main memory are disjoint
- GPU treated as an I/O device by the OS

These two things are related: we need OS abstractions.
Outline
- The case for OS support
- PTask: Dataflow for GPUs
- Evaluation
- Related Work
- Conclusion
The traditional stack: programmer-visible interface, OS-level abstractions, hardware interface
- 1:1 correspondence between OS-level and user-level abstractions
The GPU stack: the programmer-visible interface is GPGPU APIs (shaders/kernels, language integration, the DirectX/CUDA/OpenCL runtime), backed by only 1 OS-level abstraction!
1. No kernel-facing API
2. No OS resource management
3. Poor composability
GPU benchmark throughput, no CPU load vs. high CPU load (higher is better)
- Image convolution in CUDA; throughput collapses under CPU load
- The CPU scheduler and the GPU scheduler are not integrated!
(Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230)
The OS cannot prioritize cursor updates: WDDM + DWM + CUDA == dysfunction (flatter lines are better)
(Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, NVIDIA GeForce GT230)
Motivating example: a gestural interface (NOT Kinect: this is a harder problem!)
- capture: raw camera images
- xform: geometric transformation, producing a noisy point cloud
- filter: noise filtering
- detect: gesture detection, producing hand events
- High data rates; the data-parallel algorithms are a good fit for the GPU
#> capture | xform | filter | detect &
- Modular design: flexibility, reuse
- Utilize heterogeneous hardware: data-parallel components on the GPU, sequential components on the CPU
- Using OS-provided tools: processes and pipes
Why the GPU is not just another CPU
- GPUs cannot run OS code: different ISA
- Disjoint memory space, no coherence
- The host CPU must manage GPU execution: copy inputs to GPU memory, send commands, copy outputs back
- Program inputs are explicitly transferred and bound at runtime; device buffers are pre-allocated
- User-mode apps must implement all of this
What `capture | xform | filter | detect &` actually does today:
- Each stage boundary is a write()/read() pair through the OS executive (IRPs through camdrv, the GPU driver, HIDdrv)
- Each GPU stage adds a copy to GPU memory, a PCI transfer in, the GPU run, a PCI transfer out, and a copy from GPU memory
- Result: redundant data movement on every frame
What we need: GPU analogues for
- the process API
- the IPC API
- scheduler hints
Abstractions that enable:
- fairness/isolation
- OS use of the GPU
- composition and data-movement optimization
PTask: Dataflow for GPUs
The PTask API: OS-managed objects
- ptask (parallel task): analogous to a process, but for GPU execution; has a priority for fairness; has a list of input/output resources (cf. stdin, stdout)
- port: a data source or sink; can be mapped to ptask inputs/outputs
- channel: similar to a pipe; connects arbitrary ports; specialized to eliminate double-buffering
- graph: a DAG of connected ptasks, ports, and channels
- datablock: a memory-space-transparent buffer
These are OS objects, so OS resource management becomes possible; with dataflow, programs specify where data goes, not how it gets there.
The gesture pipeline `capture | xform | filter | detect &` as a ptask graph
- capture (CPU process) → xform (GPU ptask) → filter (GPU ptask) → detect (CPU process), connected through ports (rawimg, cloud, f-in, f-out) and channels
- Camera data enters via mapped memory; intermediate datablocks stay in GPU memory
- Optimized data movement; data arrival triggers computation
Scheduling
- Graphs are scheduled dynamically: ptasks queue for dispatch when their inputs are ready
- The queue is kept in dynamic priority order; ptask priority is user-settable and normalized to OS priority
- Multiple GPUs are supported transparently; ptasks are scheduled for input locality
Datablocks: one logical buffer backed by multiple physical buffers (main memory, GPU 0 memory, GPU 1 memory)

Per-space state for one datablock:
  space  V  M  RW  data
  main   1  1  1   1
  gpu0   0  1  1   0
  gpu1   1  1  1   0

- Buffers are created/updated lazily; mem-mapping is used to share across process boundaries
- Buffer validity is tracked per memory space; writes invalidate the other views
- Flags for access control and data placement
Example: one datablock flowing through capture → xform → filter
- As the block moves from the CPU stage into the GPU stages, its per-space V/M/RW bits are updated step by step: first valid only in main memory, then materialized and valid in GPU memory
With ports and datablocks there is again a 1:1 correspondence between programmer-visible and OS-level abstractions
- Existing GPU APIs can be built on top of the new OS abstractions
Evaluation
Implementation
- Windows 7: full PTask API implementation; stacked UMDF/KMDF driver; kernel component handles mem-mapping and signaling; user component wraps DirectX, CUDA, and OpenCL; syscalls are DeviceIoControl() calls
- Linux 2.6.33.2: changed OS scheduling to manage the GPU; GPU accounting added to task_struct
Gestural interface evaluation: Windows 7, Core2-Quad, GTX580 (EVGA)
Implementations:
- pipes: capture | xform | filter | detect, separate processes
- modular: capture+xform+filter+detect in one process
- handcode: data movement hand-optimized, one process
- ptask: ptask graph
Configurations:
- real-time: driven by cameras
- unconstrained: driven by in-memory playback
Gestural interface performance (runtime, user, sys; relative to handcode; lower is better) for handcode, modular, pipes, and ptask
- Compared to hand-code: 11.6% higher throughput; lower CPU utilization (no driver program)
- Compared to pipes: 16x higher throughput; ~2.7x less CPU usage; ~45% less memory usage
(Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, GTX580 (EVGA))
PTask invocations/second vs. PTask priority, FIFO vs. ptask policies (higher is better)
- FIFO: queues invocations in arrival order
- ptask: aged priority queue with OS priority
- Graphs: 6x6 matrix multiply; priority set the same for every PTask node
- PTask provides throughput proportional to priority
(Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, GTX580 (EVGA))
Multi-GPU scheduling: speedup over 1 GPU, priority vs. data-aware policies (higher is better); synthetic graphs of varying depths
- Data-aware (== priority + locality) provides the best throughput and preserves priority
- Graph depth > 1 is required for any benefit
(Windows 7 x64, 8GB RAM, Intel Core 2 Quad 2.66GHz, 2x GTX580 (EVGA))
Performance isolation on Linux
- Stack: user programs (R/W benchmark, cuda-1, cuda-2) over user libraries (EncFS, FUSE, libc) over the OS (PTask in Linux 2.6.33) over hardware (SSD1, SSD2, GPU)
- EncFS: nice -20; cuda-*: nice +19; AES with XTS chaining; SATA SSDs in RAID; sequential R/W of 200MB
- Simple GPU usage accounting restores performance:

          GPU/CPU  cuda-1 Linux  cuda-2 Linux  cuda-1 PTask  cuda-2 PTask
  Read    1.17x    -10.3x        -30.8x        1.16x         1.16x
  Write   1.28x    -4.6x         -10.3x        1.21x         1.20x
Related Work
- OS support for heterogeneous platforms: Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]
- GPU scheduling: TimeGraph [Kato 11], Pegasus [Gupta 11]
- Graph-based programming models: Synthesis [Massalin 89], Monsoon/Id [Arvind], Dryad [Isard 07], StreamIt [Thies 02], DirectShow, TCP offload [Currid 04]
- Tasking: Tessellation, Apple GCD
Conclusion
- OS abstractions for GPUs are critical: they enable fairness and priority, and they let the OS itself use the GPU
- Dataflow is a good-fit abstraction: the system manages data movement, and the performance benefits are significant
Thank you. Questions?