GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin) 1
Traditional System Architecture Applications OS CPU 2
Modern System Architecture Accelerated applications OS CPU Manycore processors FPGA Hybrid CPU-GPU GPUs 3
Software-hardware gap is widening Accelerated applications OS CPU Manycore processors FPGA Hybrid CPU-GPU GPUs 4
Software-hardware gap is widening Accelerated applications OS CPU Ad-hoc abstractions and management mechanisms Manycore processors FPGA Hybrid CPU-GPU GPUs 5
On-accelerator OS support closes the programmability gap Accelerated applications Native accelerator applications OS On-accelerator OS support CPU Coordination Manycore processors FPGA Hybrid CPU-GPU GPUs 6
GPUfs: File I/O support for GPUs Motivation Goals Understanding the hardware Design Implementation Evaluation 7
Building systems with GPUs is hard. Why? 8
Goal of GPU programming frameworks GPU CPU Data transfers GPU invocation Memory management Parallel Algorithm 9
Headache for GPU programmers GPU CPU Data transfers Invocation Memory management Parallel Algorithm Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC 10
GPU kernels are isolated GPU CPU Data transfers Invocation Memory management Parallel Algorithm 11
Example: accelerating photo collage http://www.codeproject.com/articles/36347/face-collage While(Unhappy()){ Read_next_image_file() Decide_placement() Remove_outliers() } 12
CPU Implementation CPU Application While(Unhappy()){ Read_next_image_file() Decide_placement() Remove_outliers() } 13
Offloading computations to GPU CPU Application Move to GPU While(Unhappy()){ Read_next_image_file() Decide_placement() Remove_outliers() } 14
Offloading computations to GPU Co-processor programming model CPU Data transfer GPU Kernel start Kernel termination 15
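A minimal sketch of this co-processor choreography in CUDA. The kernel name process and its placeholder computation are illustrative, not from the talk; the point is the four CPU-side steps wrapped around it.

    #include <cuda_runtime.h>

    // Hypothetical kernel; the interesting part is the host-side choreography.
    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f;   // placeholder computation
    }

    void run_on_gpu(float *host_data, int n) {
        float *dev_data;
        cudaMalloc(&dev_data, n * sizeof(float));
        // (1) Data transfer: CPU -> GPU
        cudaMemcpy(dev_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
        // (2) Kernel start
        process<<<(n + 255) / 256, 256>>>(dev_data, n);
        // (3) Kernel termination: the CPU blocks until the GPU finishes
        cudaDeviceSynchronize();
        // (4) Data transfer: GPU -> CPU
        cudaMemcpy(host_data, dev_data, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev_data);
    }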
Kernel start/stop overheads: copy to GPU, invoke kernel, copy to CPU, plus cache flush, invocation latency, and GPU/CPU synchronization 16
Hiding the overheads: asynchronous invocation, manual data reuse management, double buffering (copy to GPU, invoke, copy to GPU, copy to CPU, overlapped) 17
Implementation complexity, management overhead: asynchronous invocation, manual data reuse management, double buffering 18
Implementation complexity, management overhead: asynchronous invocation, manual data reuse management, double buffering. Why do we need to deal with low-level system details? 19
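A sketch of what "hiding the overheads" costs the programmer, assuming two CUDA streams and reusing the illustrative process kernel from the previous sketch: chunk k computes while chunk k+1 is in flight. This is exactly the low-level plumbing the talk argues applications should not have to own.

    // Double buffering over two streams. host_data must be pinned
    // (cudaHostAlloc) for the async copies to actually overlap.
    void pipelined(float *host_data, int n, int chunk) {
        cudaStream_t s[2];
        float *dev[2];
        for (int b = 0; b < 2; b++) {
            cudaStreamCreate(&s[b]);
            cudaMalloc(&dev[b], chunk * sizeof(float));
        }
        for (int off = 0, k = 0; off < n; off += chunk, k++) {
            int b = k & 1;                               // alternate the two buffers
            int len = (n - off < chunk) ? n - off : chunk;
            cudaMemcpyAsync(dev[b], host_data + off, len * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process<<<(len + 255) / 256, 256, 0, s[b]>>>(dev[b], len);
            cudaMemcpyAsync(host_data + off, dev[b], len * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
            // Stream ordering makes reusing dev[b] two iterations later safe.
        }
        for (int b = 0; b < 2; b++) {
            cudaStreamSynchronize(s[b]);
            cudaFree(dev[b]);
            cudaStreamDestroy(s[b]);
        }
    }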
The reason is... GPUs are peer-processors. They need OS services, such as I/O. 20
GPUfs: application view. GPU1, GPU2, and GPU3 call open("shared_file"), write(), and mmap() directly; CPUs see the same files; GPUfs bridges to the Host File System. 21
GPUfs: application view. A system-wide shared namespace, a POSIX-like API on the GPU as on the CPU, and persistent storage through the Host File System. 22
Accelerating collage app with GPUfs No CPU management code CPU GPUfs GPU open/read from GPU 23
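A sketch of what the collage loop might look like once file access moves into the kernel. The helpers decide_placement and remove_outliers are hypothetical stand-ins for the app's logic, O_GRDONLY is assumed by analogy with the deck's O_GRDWR, and the gread argument order (fd, size, buffer, offset) follows the square-root example later in the deck.

    __device__ void decide_placement(const char *img, char *out);  // hypothetical helper
    __device__ void remove_outliers(char *out);                    // hypothetical helper

    __global__ void collage_kernel(const char *filename, char *out, size_t img_bytes) {
        __shared__ char img[16 * 1024];          // per-threadblock staging buffer
        int fd = gopen(filename, O_GRDONLY);     // opened from GPU code, no CPU involved
        size_t offset = blockIdx.x * img_bytes;  // each block reads its own image
        gread(fd, img_bytes, img, offset);       // served from the GPU buffer cache
        decide_placement(img, out);              // assumes img_bytes <= sizeof(img)
        remove_outliers(out);
        gclose(fd);
    }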
Accelerating collage app with GPUfs. The CPU performs read-ahead into the GPUfs buffer cache on the GPU, overlapping computations and transfers. 24
Accelerating collage app with GPUfs CPU GPUfs GPU Data reuse Random data access 25
Challenge GPU CPU 26
Massive parallelism. Parallelism is essential for performance in deeply multi-threaded wide-vector hardware: AMD HD5870 and NVIDIA Fermi run 23,000-31,000 active threads. From M. Houston/A. Lefohn/K. Fatahalian, "A trip through the architecture of modern GPUs". 27
Heterogeneous memory. GPUs inherently impose high bandwidth demands on memory: GPU memory delivers 288-360 GB/s and CPU memory 10-32 GB/s, while the interconnect between them provides only 6-16 GB/s, roughly a 20x gap. 28
How to build an FS layer on this hardware? 29
GPUfs: principled redesign of the whole file system stack. Relaxed FS API semantics for parallelism; relaxed FS consistency for heterogeneous memory; GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, and more. 30
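The point about GPU-specific synchronization is concrete: even a spinlock must respect SIMD execution. A minimal sketch (not GPUfs's actual implementation) of a block-level lock where a single representative thread spins, since letting every lane of a warp contend on the same lock can livelock on SIMD hardware:

    __device__ int file_lock = 0;        // 0 = free, 1 = held

    __device__ void block_lock_acquire() {
        if (threadIdx.x == 0)
            while (atomicCAS(&file_lock, 0, 1) != 0)
                ;                        // only one thread per block spins
        __syncthreads();                 // the rest of the block waits for the holder
        __threadfence();                 // make protected data visible to this block
    }

    __device__ void block_lock_release() {
        __syncthreads();                 // all threads must be done with the section
        __threadfence();                 // flush writes before releasing
        if (threadIdx.x == 0)
            atomicExch(&file_lock, 0);
    }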
GPUfs high-level design. CPU side: unchanged applications use the OS File API; GPUfs hooks sit in the OS file system interface above the Host File System and Disk; CPU memory. GPU side: GPU applications use the GPUfs File API via the GPUfs GPU file I/O library (built for massive parallelism); a GPUfs distributed buffer cache (page cache) spans the heterogeneous CPU and GPU memories. 31
GPUfs high-level design. CPU: unchanged applications using the OS File API, GPUfs hooks in the OS file system interface, Host File System, Disk, CPU memory. GPU: GPU application using the GPUfs File API, GPUfs GPU file I/O library, GPUfs distributed buffer cache (page cache), GPU memory. 32
Buffer cache semantics Local or Distributed file system data consistency? 33
GPUfs buffer cache: weak data consistency model, close(sync)-to-open semantics (as in AFS). Timeline: GPU2 issues write(1), fsync(), write(2); GPU1 issues open(), read(1); unsynced writes are not visible to the CPU. This fits hardware where the remote-to-local memory performance ratio is similar to that of a distributed system. 34
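In code, the timeline above might look as follows; a sketch in the deck's own API style, with an illustrative file name and payload:

    // GPU2 (producer):
    float rec = 1.0f;                             // illustrative payload
    int fd = gopen("shared_file", O_GRDWR);
    gwrite(fd, sizeof(rec), &rec, 0);             // write(1): private to GPU2 so far
    gfsync(fd);                                   // publish, AFS-style close(sync)-to-open
    gwrite(fd, sizeof(rec), &rec, 0);             // write(2): private again until the next sync

    // GPU1 (consumer), opening after GPU2's gfsync():
    int fd2 = gopen("shared_file", O_GRDWR);
    gread(fd2, sizeof(rec), &rec, 0);             // read(1): sees only the synced version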
In the paper: the on-GPU file I/O API. open/close become gopen/gclose; read/write become gread/gwrite; mmap/munmap become gmmap/gmunmap; fsync/msync become gfsync/gmsync; ftrunc becomes gftrunc. The changes in semantics are crucial. 35
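A sketch of how the renamed calls compose inside a kernel. The file name is illustrative, gftrunc is assumed to take a length like ftruncate, and the gread/gwrite argument order follows the square-root example later in the deck. The kernel assumes n equals the number of launched threads, since GPUfs calls are issued cooperatively by all threads of a SIMD vector.

    __global__ void write_results(const float *results, int n) {
        int fd = gopen("results.dat", O_GRDWR);   // illustrative file name
        gftrunc(fd, n * sizeof(float));           // size the file up front (assumed signature)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        gwrite(fd, sizeof(float), (void *)&results[i], i * sizeof(float));
        gfsync(fd);                               // publish under close-to-open semantics
        gclose(fd);
    }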
Implementation bits (in the paper): paging support; dynamic data structures and memory allocators; a lock-free radix tree; inter-processor communications (IPC); hybrid H/W-S/W barriers; a consistency module in the OS kernel. ~1.5K GPU LOC, ~600 CPU LOC. 36
Evaluation All benchmarks are written as a GPU kernel: no CPU-side development 37
Matrix-vector product (inputs/outputs in files). Vector: 1x128K elements, page size = 2MB, GPU = TESLA C2075. [Plot: throughput (MB/s, 0-3500) vs. input matrix size (280-11200 MB) for CUDA pipelined, CUDA optimized, and GPU file I/O.] 38
Word frequency count in text. Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree. English dictionary: 58,000 words. Challenges: dynamic working set; small files; lots of file I/O (33,000 files, 1-5 KB each); unpredictable output size. 39
Results:
Workload                              8 CPUs   GPU-vanilla   GPU-GPUfs
Linux source (33,000 files, 524MB)    6h       50m (7.2X)    53m (6.8X)
Shakespeare (1 file, 6MB)             292s     40s (7.3X)    40s (7.3X) 40
Results (GPUfs vs. the hand-tuned GPU version: 8% overhead, with unbounded input/output size support):
Workload                              8 CPUs   GPU-vanilla   GPU-GPUfs
Linux source (33,000 files, 524MB)    6h       50m (7.2X)    53m (6.8X)
Shakespeare (1 file, 6MB)             292s     40s (7.3X)    40s (7.3X) 41
GPUfs is the first system to provide native access to host OS services from GPU programs. Code is available for download at: https://sites.google.com/site/silbersteinmark/home/gpufs http://goo.gl/ofj6j 42
Our life would have been easier with: PCI atomics; preemptive background daemons; GPU-CPU signaling support; in-GPU exceptions; a GPU virtual memory API (host-based or device); compiler optimizations for register-heavy libraries. Some of these appear to have been accomplished in CUDA 5.0. 43
Sequential access to a file: 3 versions. CUDA whole-file transfer: the CPU reads the entire file, then transfers it to the GPU. CUDA pipelined transfer: the CPU reads chunks and transfers each chunk to the GPU, overlapping reads with transfers. GPU file I/O: the GPU itself calls gmmap(); see the sketch below. 44
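A sketch of the third version. Names are illustrative, O_GRDONLY is assumed, and gmmap's signature and flag names are assumed to mirror mmap's; the kernel pages the file in on demand through the GPU buffer cache, so chunking and transfer scheduling disappear from the application.

    __global__ void sum_file(const char *path, float *result, size_t n) {
        int fd = gopen(path, O_GRDONLY);
        // Map the file into GPU-visible memory; GPUfs fetches pages from the
        // host on demand instead of the app staging explicit chunks.
        float *data = (float *)gmmap(NULL, n * sizeof(float),
                                     PROT_READ, MAP_SHARED, fd, 0);
        float acc = 0.0f;
        for (size_t i = threadIdx.x; i < n; i += blockDim.x)
            acc += data[i];                    // strided sequential scan
        atomicAdd(result, acc);                // combine per-thread partial sums
        gmunmap(data, n * sizeof(float));
        gclose(fd);
    }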
Sequential read: throughput vs. page size. [Plot: throughput (MB/s, 0-4000) vs. page size (16K-2M) for GPU File I/O, CUDA whole file, and CUDA pipeline.] 45
Sequential read: throughput vs. page size. [Plot as above.] Benefit: decouples performance constraints from application logic. 46
On-accelerator OS support. Yesterday: accelerators as co-processors. Tomorrow: accelerators as peers. 47
What about software? Yesterday: the CPU with accelerators as co-processors. Tomorrow: the CPU and GPUs as peers. 48
Set GPUs free! 49
Parallel square root on GPU. The same code runs in all of the thousands of GPU threads:
    gpu_thread(thread_id i) {
        float buffer;
        int fd = gopen(filename, O_GRDWR);
        size_t offset = sizeof(float) * i;
        gread(fd, sizeof(float), &buffer, offset);
        buffer = sqrt(buffer);
        gwrite(fd, sizeof(float), &buffer, offset);
        gclose(fd);
    }
50
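On the host side, all that remains is the launch. A sketch wrapping the slide's gpu_thread body in a CUDA kernel, where the thread id comes from the launch geometry; gpufs_init() is a hypothetical host-side initialization call, not a name from the talk.

    __global__ void sqrt_kernel(const char *filename) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // the slide's thread_id
        float buffer;
        int fd = gopen(filename, O_GRDWR);
        size_t offset = sizeof(float) * (size_t)i;
        gread(fd, sizeof(float), &buffer, offset);
        buffer = sqrtf(buffer);
        gwrite(fd, sizeof(float), &buffer, offset);
        gclose(fd);
    }

    int main() {
        gpufs_init();                                    // hypothetical initialization
        // The kernel needs a device-resident copy of the path; this is the
        // only copy the CPU makes: the dataset itself never passes through it.
        const char path[] = "input.dat";                 // illustrative file name
        char *d_path;
        cudaMalloc(&d_path, sizeof(path));
        cudaMemcpy(d_path, path, sizeof(path), cudaMemcpyHostToDevice);
        sqrt_kernel<<<4096, 256>>>(d_path);              // ~1M threads, one float each
        cudaDeviceSynchronize();
        return 0;
    }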
GPUfs impact on GPU programs: memory overhead; register pressure; very little CPU coding; makes exitless GPU kernels possible; pay-as-you-go design. 51
Preserve CPU semantics? What does it mean to open/read/write/close/mmap a file in thousands of threads? GPU threads are different from CPU threads: they are grouped into SIMD vectors. 52
Preserve CPU semantics? What does it mean to open/read/write/close/mmap a file in thousands of threads? GPU threads are different from CPU threads: a GPU kernel is a single data-parallel application. 53
GPUfs semantics (see more discussion in the paper): int fd = gopen(filename, O_GRDWR); One call per SIMD vector: bulk-synchronous cooperative execution. One file descriptor per file: open()/close() state cached on the GPU. 54
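A sketch of what these two rules mean in practice, at threadblock granularity with illustrative names: every thread reaches the call together, the library executes it once per SIMD vector, and all lanes observe the same cached descriptor.

    __global__ void per_block_io(const char *path) {
        __shared__ char buf[4096];
        // All threads arrive at the call together; it executes once per SIMD
        // vector and every lane receives the same file descriptor.
        int fd = gopen(path, O_GRDWR);
        // Cooperative read: logically one call, with all lanes participating
        // in moving the data (bulk-synchronous style).
        gread(fd, sizeof(buf), buf, 0);
        __syncthreads();
        // ... the whole block can now process buf ...
        gclose(fd);                      // open()/close() state is cached on the GPU
    }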
GPU hardware characteristics Parallelism Heterogeneous memory 55
API semantics: int fd = gopen(filename, O_GRDWR); 56
API semantics on the CPU: int fd = gopen(filename, O_GRDWR); 57
On the GPU, this code runs in 100,000 GPU threads: int fd = gopen(filename, O_GRDWR); 58