GPUfs: Integrating a file system with GPUs

1 ASPLOS 2013. GPUfs: Integrating a file system with GPUs. Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

2 Traditional System Architecture: Applications; OS; CPU

3 Modern System Architecture: Accelerated applications; OS; Manycore processors, FPGA, Hybrid CPU-GPU, GPUs

4 Software-hardware gap is widening: Accelerated applications; OS; Manycore processors, FPGA, Hybrid CPU-GPU, GPUs

5 Software-hardware gap is widening: Accelerated applications; OS; Ad-hoc abstractions and management mechanisms; Manycore processors, FPGA, Hybrid CPU-GPU, GPUs

6 On-accelerator OS support closes the programmability gap: Accelerated applications, Native accelerator applications; OS, On-accelerator OS support, Coordination; Manycore processors, FPGA, Hybrid CPU-GPU, GPUs

7 GPUfs: File I/O support for GPUs. Motivation, Goals, Understanding the hardware, Design, Implementation, Evaluation

8 Building systems with GPUs is hard. Why?

9 Goal of GPU programming frameworks: Data transfers, GPU invocation, Memory management; GPU: Parallel Algorithm

10 Headache for GPU programmers: Data transfers, Invocation, Memory management; GPU: Parallel Algorithm. Half of the CUDA SDK 4.1 samples: at least 9 CPU LOC per 1 GPU LOC

11 GPU kernels are isolated: Data transfers, Invocation, Memory management; GPU: Parallel Algorithm


12 Example: accelerating photo collage

While(Unhappy()){
    Read_next_image_file()
    Decide_placement()
    Remove_outliers()
}

13 Implementation: Application (same loop as slide 12)

14 Offloading computations to GPU: Application, "Move to GPU" (same loop as slide 12)

15 Offloading computations to GPU: Co-processor programming model. Data transfer, Kernel start, Kernel termination; GPU

16 Kernel start/stop overheads: copy to GPU, invoke, copy to CPU. Overheads shown: Cache flush, Invocation latency, Synchronization

17 Hiding the overheads: Asynchronous invocation, Manual data reuse management, Double buffering. Timeline labels: copy to GPU, invoke, copy to GPU, copy to CPU

18 Implementation complexity, Management overhead: Asynchronous invocation, Manual data reuse management, Double buffering. Timeline labels: copy to GPU, invoke, copy to GPU, copy to CPU

19 Implementation complexity, Management overhead: Asynchronous invocation, Manual data reuse management, Double buffering. Why do we need to deal with low-level system details?
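
The double-buffering pattern named on slides 17 to 19 can be sketched on the host side as follows. This is a minimal illustration, not code from the talk: `load_chunk` and `compute` are hypothetical stand-ins for a copy to the GPU and a kernel invocation, and a worker thread plays the role of the asynchronous copy engine.

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(i):
    """Hypothetical stand-in for copying input chunk i to the device."""
    return [i] * 4

def compute(chunk):
    """Hypothetical stand-in for a kernel invocation on one chunk."""
    return sum(chunk)

def run_double_buffered(num_chunks):
    """Process chunks while the next chunk is being loaded in the background."""
    results = []
    if num_chunks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(load_chunk, 0)           # prefetch the first buffer
        for i in range(num_chunks):
            current = pending.result()                   # wait until buffer i is ready
            if i + 1 < num_chunks:
                pending = copier.submit(load_chunk, i + 1)  # start the next copy
            results.append(compute(current))             # overlaps with that copy
    return results
```

Even in this toy form, the application has to track which buffer is in flight and when it is safe to reuse it, which is the management overhead the slides point at.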

20 The reason is... GPUs are peer-processors. They need I/O OS services.

21 GPUfs: application view. Processors: CPUs, GPU1, GPU2, GPU3. Calls shown: open("shared_file") (twice), write(), mmap(). Layers: GPUfs, Host File System

22 GPUfs: application view. Same diagram, annotated with: System-wide shared namespace; POSIX (CPU)-like API; Persistent storage

23 Accelerating collage app with GPUfs: No management code; open/read from GPU

24 Accelerating collage app with GPUfs: GPUfs buffer cache; Read-ahead; Overlapping computations and transfers
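
To make the buffer cache and read-ahead idea concrete, here is a small host-side sketch. It is an assumption-laden illustration rather than the GPUfs implementation: the class name `PageCache`, its parameters, the LRU eviction policy, and the tiny page size are all hypothetical.

```python
from collections import OrderedDict

class PageCache:
    """Toy page cache with LRU eviction and sequential read-ahead."""

    def __init__(self, backing, page_size=4, capacity=8, read_ahead=1):
        self.backing = backing          # bytes object standing in for a file
        self.page_size = page_size
        self.capacity = capacity
        self.read_ahead = read_ahead
        self.pages = OrderedDict()      # page number -> page bytes, oldest first
        self.misses = 0

    def _load(self, page_no):
        start = page_no * self.page_size
        if start >= len(self.backing) or page_no in self.pages:
            return                      # past end of file, or already cached
        self.pages[page_no] = self.backing[start:start + self.page_size]
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page

    def read_page(self, page_no):
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # refresh LRU position on a hit
        else:
            self.misses += 1
            self._load(page_no)
        data = self.pages.get(page_no, b"")
        for k in range(1, self.read_ahead + 1):
            self._load(page_no + k)          # prefetch the following pages
        return data
```

With a sequential access pattern, only the first read misses; later pages are already in the cache when they are requested, which is what lets transfers overlap with computation.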

25 Accelerating collage app with GPUfs: Data reuse; Random data access

26 Challenge: GPU

27 Massive parallelism: Parallelism is essential for performance in deeply multi-threaded wide-vector hardware. Devices shown: AMD HD5870*, NVIDIA Fermi*. Figures shown: 23,000 active threads, 31,000 active threads. *From M. Houston, A. Lefohn, K. Fatahalian, "A trip through the architecture of modern GPUs"

28 Heterogeneous memory: GPUs inherently impose high bandwidth demands on memory. [Diagram: GPU and memories with bandwidth labels in GB/s; only "10-32 GB/s" is legible]

29 How to build an FS layer on this hardware?

30 GPUfs: principled redesign of the whole file system stack. Relaxed FS API semantics for parallelism; Relaxed FS consistency for heterogeneous memory; GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, ...

31 GPUfs high-level design. CPU side: Unchanged applications using OS File API; GPUfs hooks; OS File System Interface; OS. GPU side: GPU application using GPUfs File API; GPUfs GPU File I/O library. Shared: GPUfs Distributed Buffer Cache (Page cache) over CPU Memory and GPU Memory; Host File System; Disk. Callouts: Massive parallelism; Heterogeneous memory

32 GPUfs high-level design. Same diagram as slide 31, without the callouts

33 Buffer cache semantics: Local or Distributed file system data consistency?

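34 GPUfs buffer cache: Weak data consistency model; close(sync)-to-open semantics (AFS). Example timeline across GPU1 and GPU2: one side performs write(1), fsync(), write(2); the other performs open(), read(1); the write issued after fsync() is marked "Not visible to" the remote side. Remote-to-Local memory performance ratio is similar to a distributed system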
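
The close(sync)-to-open behaviour on slide 34 can be illustrated with a tiny simulation. This is a sketch under assumptions, not GPUfs code: `SharedFile` and `LocalHandle` are hypothetical names, the "file" is a dictionary, and each handle stands for one processor's local buffer cache.

```python
class SharedFile:
    """Authoritative copy, standing in for the host file system."""
    def __init__(self):
        self.data = {}

class LocalHandle:
    """One processor's view of the file; opening takes a snapshot."""
    def __init__(self, shared):
        self.shared = shared
        self.cache = dict(shared.data)   # data visible at open time
        self.dirty = {}                  # local writes not yet synced

    def write(self, key, value):
        self.cache[key] = value
        self.dirty[key] = value

    def read(self, key):
        return self.cache.get(key)

    def fsync(self):
        self.shared.data.update(self.dirty)   # publish local writes
        self.dirty.clear()

    def close(self):
        self.fsync()

shared = SharedFile()
writer = LocalHandle(shared)
writer.write("x", 1)
writer.fsync()
writer.write("x", 2)            # not synced yet
reader = LocalHandle(shared)    # open() after the fsync
first_read = reader.read("x")   # sees the synced value only
writer.close()                  # publishes the second write
stale_read = reader.read("x")   # reader keeps its snapshot until it reopens
fresh_read = LocalHandle(shared).read("x")
```

The reader observes the second write only after the writer syncs or closes and the reader opens the file again, mirroring the write(1), fsync(), write(2) versus open(), read(1) timeline on the slide.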

35 On-GPU File I/O API (in the paper)

Host call       On-GPU call
open/close      gopen/gclose
read/write      gread/gwrite
mmap/munmap     gmmap/gmunmap
fsync/msync     gfsync/gmsync
ftrunc          gftrunc

Changes in the semantics are crucial

36 Implementation bits (in the paper): Paging support; Dynamic data structures and memory allocators; Lock-free radix tree; Inter-processor communications (IPC); Hybrid H/W-S/W barriers; Consistency module in the OS kernel. ~1.5K GPU LOC, ~600 LOC
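
As background for the radix tree item on slide 36, the sketch below shows how a radix tree can index cached pages by page number. It is illustrative only: it is single-threaded and not lock-free, and the class name, fan-out, and depth are assumptions rather than details taken from GPUfs.

```python
class RadixTree:
    """Fixed-depth radix tree mapping integer keys (page numbers) to values."""

    BITS = 4                 # bits consumed per level, so fan-out is 16
    FANOUT = 1 << BITS

    def __init__(self, levels=3):
        self.levels = levels
        self.max_key = 1 << (self.BITS * levels)
        self.root = [None] * self.FANOUT

    def _indices(self, key):
        """Split the key into one slot index per level, most significant first."""
        mask = self.FANOUT - 1
        return [(key >> (self.BITS * lvl)) & mask
                for lvl in reversed(range(self.levels))]

    def insert(self, key, value):
        if not 0 <= key < self.max_key:
            raise KeyError(key)
        *inner, leaf = self._indices(key)
        node = self.root
        for idx in inner:
            if node[idx] is None:
                node[idx] = [None] * self.FANOUT   # allocate interior node lazily
            node = node[idx]
        node[leaf] = value

    def lookup(self, key):
        if not 0 <= key < self.max_key:
            return None
        *inner, leaf = self._indices(key)
        node = self.root
        for idx in inner:
            node = node[idx]
            if node is None:
                return None                        # path not populated
        return node[leaf]
```

A lookup touches a fixed number of nodes regardless of how many pages are cached, which is one reason this structure suits a page cache index.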

37 Evaluation: All benchmarks are written as a GPU kernel: no CPU-side development

38 Matrix-vector product (Inputs/Outputs in files). Vector 1x128K elements, Page size = 2MB, GPU=TESLA C. [Plot: Throughput (MB/s) vs. Input matrix size (MB); series: CUDA pipelined, CUDA optimized, GPU file I/O]

39 Word frequency count in text: Count frequency of modern English words in the works of Shakespeare, and in the Linux kernel source tree. English dictionary: 58,000 words. Challenges: Dynamic working set; Small files; Lots of file I/O (33,000 files, 1-5KB each); Unpredictable output size
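
The word-count task itself is simple to state in code. The following sketch is a plain host-side version for reference, not the GPU kernel from the talk; the function name, the tokenization rule (lowercase runs of letters), and the inputs are assumptions.

```python
import re
from collections import Counter

def count_dictionary_words(texts, dictionary):
    """Count occurrences of dictionary words across a collection of texts."""
    counts = Counter()
    for text in texts:
        # Treat each run of ASCII letters as a word, ignoring case.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in dictionary:
                counts[word] += 1
    return counts

sample_counts = count_dictionary_words(
    ["To be, or not to be", "static int be;"],
    {"to", "be", "not"},
)
```

The output size depends on which dictionary words actually occur in the input, which is the "unpredictable output size" challenge listed on the slide.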

40 Results

Input                               8 CPUs   GPU-vanilla    GPU-GPUfs
Linux source (33,000 files, 524MB)  6h       50m (7.2X)     53m (6.8X)
Shakespeare (1 file, 6MB)           292s     40s (7.3X)     40s (7.3X)

41 Results. Same table as slide 40, annotated with: 8% overhead; Unbounded input/output size support
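
The speedup factors in the results table can be re-derived from the reported running times. The short check below assumes the first column is the baseline against which both GPU variants are compared; the function and variable names are illustrative.

```python
def speedup(baseline_seconds, accelerated_seconds):
    """Ratio of baseline running time to accelerated running time."""
    return baseline_seconds / accelerated_seconds

linux_baseline = 6 * 3600          # 6h
linux_vanilla = 50 * 60            # 50m
linux_gpufs = 53 * 60              # 53m
shakespeare_baseline = 292         # 292s
shakespeare_gpu = 40               # 40s for both GPU variants

linux_vanilla_x = round(speedup(linux_baseline, linux_vanilla), 1)
linux_gpufs_x = round(speedup(linux_baseline, linux_gpufs), 1)
shakespeare_x = round(speedup(shakespeare_baseline, shakespeare_gpu), 1)
```

Under that assumption the computed ratios match the factors printed on the slide.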

42 GPUfs is the first system to provide native access to host OS services from GPU programs. Code is available for download at:
