GPUfs: Integrating a file system with GPUs

1 GPUfs: Integrating a file system with GPUs. Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

2 Traditional System Architecture. Diagram labels: Applications, OS, CPU

3 Modern System Architecture. Diagram labels: Accelerated applications, OS, CPU, manycore processors, FPGA, hybrid CPU-GPU, GPUs

4 Software-hardware gap is widening. Diagram labels: Accelerated applications, OS, CPU, manycore processors, FPGA, hybrid CPU-GPU, GPUs

5 Software-hardware gap is widening: ad-hoc abstractions and management mechanisms. Diagram labels: Accelerated applications, OS, CPU, manycore processors, FPGA, hybrid CPU-GPU, GPUs

6 On-accelerator OS support closes the programmability gap. Diagram labels: Accelerated applications, Native accelerator applications, OS, On-accelerator OS support, Coordination, CPU, manycore processors, FPGA, hybrid CPU-GPU, GPUs

7 GPUfs: File I/O support for GPUs. Outline: Motivation; Goals; Understanding the hardware; Design; Implementation; Evaluation

8 Building systems with GPUs is hard. Why?

9 Goal of GPU programming frameworks. Diagram labels: GPU, CPU, data transfers, GPU invocation, memory management, parallel algorithm

10 Headache for GPU programmers. Diagram labels: GPU, CPU, data transfers, invocation, memory management, parallel algorithm. Half of the CUDA SDK 4.1 samples have at least 9 CPU LOC per 1 GPU LOC.

11 GPU kernels are isolated. Diagram labels: GPU, CPU, data transfers, invocation, memory management, parallel algorithm

12 Example: accelerating photo collage

    while (Unhappy()) {
        Read_next_image_file();
        Decide_placement();
        Remove_outliers();
    }

13 CPU Implementation. The CPU application runs the whole loop: while (Unhappy()) { Read_next_image_file(); Decide_placement(); Remove_outliers(); }

14 Offloading computations to GPU. CPU application, with a "Move to GPU" label on the loop: while (Unhappy()) { Read_next_image_file(); Decide_placement(); Remove_outliers(); }

15 Offloading computations to GPU: co-processor programming model. Diagram labels: CPU, GPU, data transfer, kernel start, kernel termination

16 Kernel start/stop overheads. Timeline labels (CPU and GPU lanes): copy to GPU, invoke, copy to CPU. Overheads: cache flush, invocation latency, synchronization

17 Hiding the overheads: asynchronous invocation; manual data reuse management; double buffering. Timeline labels (CPU and GPU lanes): copy to GPU, invoke, copy to GPU, copy to CPU

18 Implementation complexity; management overhead. Techniques: asynchronous invocation; manual data reuse management; double buffering. Timeline labels (CPU and GPU lanes): copy to GPU, invoke, copy to GPU, copy to CPU

19 Implementation complexity; management overhead. Techniques: asynchronous invocation; manual data reuse management; double buffering. Timeline labels (CPU and GPU lanes): copy to GPU, invoke, copy to GPU, copy to CPU. Why do we need to deal with low-level system details?
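The double buffering named on the slides above can be sketched on the host alone. This is a minimal illustration, not code from the talk: `load_chunk`, `process_chunk`, and `pipelined_sum` are hypothetical names, `load_chunk` stands in for "read the next chunk and copy it to the GPU", and `process_chunk` stands in for the kernel. The point is only the structure: the transfer of chunk c+1 is started before the computation on chunk c begins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for "read the next chunk and copy it to the GPU".
std::vector<float> load_chunk(const std::vector<float>& file,
                              std::size_t chunk, std::size_t chunk_len) {
    std::size_t begin = chunk * chunk_len;
    std::size_t end = std::min(begin + chunk_len, file.size());
    return std::vector<float>(file.begin() + begin, file.begin() + end);
}

// Hypothetical stand-in for the GPU kernel: here it just sums the chunk.
float process_chunk(const std::vector<float>& buf) {
    return std::accumulate(buf.begin(), buf.end(), 0.0f);
}

// Double buffering: while the current chunk is processed, the next one
// is already being loaded by an asynchronous task.
float pipelined_sum(const std::vector<float>& file, std::size_t chunk_len) {
    if (file.empty() || chunk_len == 0) return 0.0f;
    std::size_t chunks = (file.size() + chunk_len - 1) / chunk_len;
    float total = 0.0f;

    // Prefetch chunk 0 before entering the loop.
    std::future<std::vector<float>> next = std::async(
        std::launch::async,
        [&file, chunk_len] { return load_chunk(file, 0, chunk_len); });

    for (std::size_t c = 0; c < chunks; ++c) {
        std::vector<float> current = next.get();   // wait for the pending load
        if (c + 1 < chunks) {
            // Start loading chunk c+1; it overlaps with the processing below.
            next = std::async(
                std::launch::async,
                [&file, c, chunk_len] { return load_chunk(file, c + 1, chunk_len); });
        }
        total += process_chunk(current);
    }
    return total;
}
```

Even in this toy form the application loop has to carry the prefetch bookkeeping itself, which is the management overhead the slides complain about.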

20 The reason is... GPUs are peer-processors. They need I/O OS services.

21 GPUfs: application view. Diagram labels: GPU1, GPU2, GPU3, CPUs; calls open("shared_file") (twice), mmap(), write(); GPUfs; Host File System

22 GPUfs: application view. System-wide shared namespace; POSIX (CPU)-like API; persistent storage. Diagram labels: GPU1, GPU2, GPU3, CPUs; calls open("shared_file") (twice), mmap(), write(); GPUfs; Host File System

23 Accelerating collage app with GPUfs: no CPU management code; open/read from GPU. Diagram labels: CPU, GPUfs, GPU

24 Accelerating collage app with GPUfs: read-ahead; GPUfs buffer cache; overlapping computations and transfers. Diagram labels: CPU, GPUfs, GPU

25 Accelerating collage app with GPUfs: data reuse; random data access. Diagram labels: CPU, GPUfs, GPU

26 Challenge. Diagram labels: GPU, CPU

27 Massive parallelism: parallelism is essential for performance in deeply multi-threaded wide-vector hardware. Slide labels: AMD HD5870*, NVIDIA Fermi*, 23,000 active threads, 31,000 active threads. *From M. Houston, A. Lefohn, K. Fatahalian, "A trip through the architecture of modern GPUs"

28 Heterogeneous memory: GPUs inherently impose high bandwidth demands on memory. Diagram labels: GPU, CPU, Memory, Memory; bandwidth labels: 10-32GB/s, GB/s, ~x GB/s

29 How to build an FS layer on this hardware?

30 GPUfs: principled redesign of the whole file system stack. Relaxed FS API semantics for parallelism; relaxed FS consistency for heterogeneous memory; GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, etc.

31 GPUfs high-level design. CPU side: unchanged applications using OS File API; GPUfs hooks; OS File System Interface; OS. GPU side: GPU application using GPUfs File API; GPUfs GPU File I/O library. Shared: GPUfs Distributed Buffer Cache (page cache) over CPU memory and GPU memory; host file system; disk. Callouts: massive parallelism; heterogeneous memory

32 GPUfs high-level design. CPU side: unchanged applications using OS File API; GPUfs hooks; OS File System Interface; OS. GPU side: GPU application using GPUfs File API; GPUfs GPU File I/O library. Shared: GPUfs Distributed Buffer Cache (page cache) over CPU memory and GPU memory; host file system; disk

33 Buffer cache semantics: local or distributed file system data consistency?

34 GPUfs buffer cache: weak data consistency model; close(sync)-to-open semantics (AFS). Timeline labels: GPU1: open(), read(1); GPU2: write(1), fsync(), write(2); "Not visible to CPU". Remote-to-local memory performance ratio is similar to a distributed system.
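The close(sync)-to-open rule on the slide above can be illustrated with a toy model. This is a sketch under stated assumptions, not the GPUfs buffer cache: `HostFile`, `GpuCache`, `Observed`, and `run_scenario` are hypothetical names, whole-file strings replace pages, and the only behavior modeled is that a writer's data reaches the host on sync or close, and a reader picks it up on its next open.

```cpp
#include <cassert>
#include <string>

// Toy model: the host holds the authoritative copy of one file.
struct HostFile { std::string data; };

// Each GPU works on a private cached copy between open() and close().
class GpuCache {
public:
    explicit GpuCache(HostFile& host) : host_(host) {}
    void open() { local_ = host_.data; }             // fetch the host's current version
    std::string read() const { return local_; }      // reads see only the local copy
    void write(const std::string& d) { local_ = d; dirty_ = true; }  // local only
    void sync() {                                    // push local changes to the host
        if (dirty_) { host_.data = local_; dirty_ = false; }
    }
    void close() { sync(); }
private:
    HostFile& host_;
    std::string local_;
    bool dirty_ = false;
};

struct Observed {
    std::string before_sync;       // reader's view while the write is still local
    std::string after_sync;        // reader's view after writer sync, without reopening
    std::string after_reopen;      // reader's view after close + open
};

Observed run_scenario() {
    HostFile host{"v0"};
    GpuCache gpu1(host), gpu2(host);
    gpu1.open();
    gpu2.open();

    gpu2.write("v1");
    Observed o;
    o.before_sync = gpu1.read();   // the write is not visible outside gpu2

    gpu2.sync();
    o.after_sync = gpu1.read();    // gpu1 still holds the copy from its open()

    gpu1.close();
    gpu1.open();
    o.after_reopen = gpu1.read();  // a fresh open observes the synced data
    return o;
}
```

In this model `run_scenario()` yields "v0", "v0", "v1": synced data becomes visible to another processor only at its next open.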

35 On-GPU File I/O API (in the paper). CPU call -> GPU call: open/close -> gopen/gclose; read/write -> gread/gwrite; mmap/munmap -> gmmap/gmunmap; fsync/msync -> gfsync/gmsync; ftrunc -> gftrunc. Changes in the semantics are crucial.

36 Implementation bits (in the paper): paging support; dynamic data structures and memory allocators; lock-free radix tree; inter-processor communications (IPC); hybrid H/W-S/W barriers; consistency module in the OS kernel. ~1.5K GPU LOC, ~600 CPU LOC

37 Evaluation. All benchmarks are written as a GPU kernel: no CPU-side development.

38 Matrix-vector product (inputs/outputs in files). Vector: 1x128K elements; page size = 2MB; GPU = TESLA C. Plot: throughput (MB/s) vs. input matrix size (MB); series: CUDA pipelined, CUDA optimized, GPU file I/O

39 Word frequency count in text: count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree. English dictionary: 58,000 words. Challenges: dynamic working set; small files; lots of file I/O (33,000 files, 1-5KB each); unpredictable output size

40 Results

                                      8 CPUs   GPU-vanilla   GPU-GPUfs
Linux source (33,000 files, 524MB)    6h       50m (7.2X)    53m (6.8X)
Shakespeare (1 file, 6MB)             292s     40s (7.3X)    40s (7.3X)

41 Results

                                      8 CPUs   GPU-vanilla   GPU-GPUfs
Linux source (33,000 files, 524MB)    6h       50m (7.2X)    53m (6.8X)
Shakespeare (1 file, 6MB)             292s     40s (7.3X)    40s (7.3X)

Annotations: 8% overhead; unbounded input/output size support

42 GPUfs is the first system to provide native access to host OS services from GPU programs. Diagram labels: GPUfs, CPU, GPU. Code is available for download at:

43 Our life would have been easier with: PCI atomics; preemptive background daemons; GPU-CPU signaling support; in-GPU exceptions; GPU virtual memory API (host-based or device); compiler optimizations for register-heavy libraries. Seems like accomplished in

44 Sequential access to file: 3 versions. CUDA whole file transfer: CPU reads the file, then transfers it to the GPU. CUDA pipelined transfer: CPU reads a chunk and transfers it to the GPU, repeated chunk by chunk with reads and transfers overlapping. GPU file I/O: the GPU calls gmmap().

45 Sequential read: throughput vs. page size. Plot: throughput (MB/s, axis up to 4000) vs. page size (16K, 64K, 256K, 512K, 1M, 2M); series: GPU File I/O, CUDA whole file, CUDA pipeline

46 Sequential read: throughput vs. page size. Plot: throughput (MB/s, axis up to 4000) vs. page size (16K, 64K, 256K, 512K, 1M, 2M); series: GPU File I/O, CUDA whole file, CUDA pipeline. Benefit: decouple performance constraints from application logic.

47 On-accelerator OS support. Yesterday: accelerators as co-processors. Tomorrow: accelerators as peers.

48 What about software? Diagram labels: Yesterday, Tomorrow, CPU, GPU, accelerators as coprocessors, accelerators as peers, and a question mark

49 Set GPUs free!

50 Parallel square root on GPU. The same code will run in all thousands of the GPU threads:

    gpu_thread(thread_id i) {
        float buffer;
        int fd = gopen(filename, O_GRDWR);
        size_t offset = sizeof(float) * i;
        gread(fd, sizeof(float), &buffer, offset);
        buffer = sqrt(buffer);
        gwrite(fd, sizeof(float), &buffer, offset);
        gclose(fd);
    }
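For comparison, the per-thread logic of the slide above can be mirrored on a CPU with POSIX positional I/O. This is only an analog under stated assumptions: `cpu_thread_sqrt` and `sqrt_file_roundtrip` are hypothetical names, `pread`/`pwrite` stand in for `gread`/`gwrite`, and a sequential loop stands in for the thousands of GPU threads. It does not use or reproduce the GPUfs library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// CPU analog of the slide's gpu_thread(): one call handles element i.
// Returns false if any I/O step fails.
bool cpu_thread_sqrt(const char* filename, std::size_t i) {
    float buffer;
    int fd = open(filename, O_RDWR);
    if (fd < 0) return false;
    off_t offset = static_cast<off_t>(sizeof(float) * i);
    const ssize_t want = static_cast<ssize_t>(sizeof(float));
    bool ok = pread(fd, &buffer, sizeof(float), offset) == want;
    if (ok) {
        buffer = std::sqrt(buffer);
        ok = pwrite(fd, &buffer, sizeof(float), offset) == want;
    }
    close(fd);
    return ok;
}

// Hypothetical driver: writes the input to a temporary file, runs one
// "thread" per element (sequentially here), and reads the result back.
// Returns an empty vector on any I/O failure.
std::vector<float> sqrt_file_roundtrip(const std::vector<float>& input) {
    char path[] = "/tmp/gpufs_sqrt_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return {};
    const ssize_t bytes = static_cast<ssize_t>(input.size() * sizeof(float));
    bool ok = write(fd, input.data(), static_cast<std::size_t>(bytes)) == bytes;
    close(fd);

    for (std::size_t i = 0; ok && i < input.size(); ++i) {
        ok = cpu_thread_sqrt(path, i);   // each iteration plays one GPU thread
    }

    std::vector<float> out(input.size());
    fd = open(path, O_RDONLY);
    ok = ok && fd >= 0 &&
         pread(fd, out.data(), static_cast<std::size_t>(bytes), 0) == bytes;
    if (fd >= 0) close(fd);
    unlink(path);
    return ok ? out : std::vector<float>{};
}
```

Because every "thread" touches a disjoint offset, no locking is needed around the file, which is the same property the slide's kernel relies on.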

51 GPUfs impact on GPU programs: memory overhead; register pressure; very little CPU coding; makes exitless GPU kernels possible; pay-as-you-go design

52 Preserve CPU semantics? What does it mean to open/read/write/close/mmap a file in thousands of threads? GPU threads are different from CPU threads. Diagram: individual threads vs. threads grouped into SIMD vectors

53 Preserve CPU semantics? What does it mean to open/read/write/close/mmap a file in thousands of threads? GPU kernel is a single data-parallel application. GPU threads are different from CPU threads. Diagram: individual threads vs. threads grouped into SIMD vectors

54 GPUfs semantics (see more discussion in the paper): int fd = gopen(filename, O_GRDWR); One call per SIMD vector: bulk-synchronous cooperative execution. One file descriptor per file: open()/close() cached on a GPU. Diagram: threads grouped into SIMD vectors
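The "one file descriptor per file" rule above can be pictured as a reference-counted table. The sketch below is an assumption-laden illustration, not the GPUfs implementation: `OpenFileTable`, `gopen_model`, and `gclose_model` are hypothetical names, and the model assumes that only the first open and the last close of a file are forwarded to the host while all other calls are served from the table.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy open-file table: every open of the same file returns the same
// descriptor; the host is contacted only on first open and last close.
class OpenFileTable {
public:
    int gopen_model(const std::string& name) {
        auto it = entries_.find(name);
        if (it == entries_.end()) {
            ++host_opens_;                           // first opener reaches the host
            it = entries_.emplace(name, Entry{next_fd_++, 0}).first;
        }
        ++it->second.refs;                           // later openers only bump the count
        return it->second.fd;
    }

    void gclose_model(const std::string& name) {
        auto it = entries_.find(name);
        if (it == entries_.end()) return;            // unknown file: nothing to do
        if (--it->second.refs == 0) {
            ++host_closes_;                          // last closer reaches the host
            entries_.erase(it);
        }
    }

    int host_opens() const { return host_opens_; }
    int host_closes() const { return host_closes_; }

private:
    struct Entry { int fd; int refs; };
    std::map<std::string, Entry> entries_;
    int next_fd_ = 3;                                // arbitrary starting descriptor
    int host_opens_ = 0;
    int host_closes_ = 0;
};
```

With such a table, many cooperating callers opening the same file cost one host round trip rather than one per caller.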

55 GPU hardware characteristics: parallelism; heterogeneous memory

56 API semantics: int fd = gopen(filename, O_GRDWR);

57 API semantics: int fd = gopen(filename, O_GRDWR); Diagram label: CPU

58 This code runs in 100,000 GPU threads: int fd = gopen(filename, O_GRDWR); Diagram labels: CPU, GPU

GPUfs: Integrating a file system with GPUs. ASPLOS 2013.