GPUfs: Integrating a file system with GPUs


1 GPUfs: Integrating a file system with GPUs. Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin)

2 Building systems with GPUs is hard. Why?

3 Goal of GPU programming frameworks. [diagram: the CPU side handles data transfers, GPU invocation, and memory management; the GPU side runs the parallel algorithm]

4 Headache for GPU programmers. [same diagram: CPU-side data transfers, invocation, and memory management] Half of the CUDA SDK 4.1 samples have at least 9 lines of CPU code per 1 line of GPU code.

5 GPU kernels are isolated. [same diagram]

6 Example: accelerating photo collage
While(Unhappy()){
    Read_next_image_file()
    Decide_placement()
    Remove_outliers()
}

7 CPU implementation. [diagram: the CPU application runs the whole loop]
While(Unhappy()){
    Read_next_image_file()
    Decide_placement()
    Remove_outliers()
}

8 Offloading computations to GPU. Move the While(Unhappy()) loop body from the CPU application to the GPU; this offloading pattern is common in commodity applications.

9 Offloading computations to GPU. The co-processor programming model requires kernel termination. [diagram: CPU application, data transfer, GPU kernel start, kernel termination]

10 Kernel start/stop overheads. [timeline: copy to GPU, invoke kernel, copy GPU to CPU, cache flush, invocation latency, GPU synchronization]

11 Implementation complexity. Asynchronous invocation, manual data-reuse management, double buffering. [pipelined timeline: copy to GPU, invoke kernel, copy to GPU, copy GPU to CPU]

12 Implementation complexity. Management overhead: asynchronous invocation, manual data-reuse management, double buffering. [same pipelined timeline]

13 Implementation complexity. Management overhead: asynchronous invocation, manual data-reuse management, double buffering. [same pipelined timeline] Why do we need to deal with low-level system details?

14 The reason is... GPUs are peer-processors.

15 The reason is... GPUs are peer-processors. They need I/O and OS services.

16 GPUfs: application view. [diagram: GPU1, GPU2, and GPU3 call open(shared_file), write(), and mmap() through GPUfs, which is backed by the host file system on the CPUs]

17 GPUfs: application view. System-wide shared namespace; POSIX (CPU)-like API; persistent storage. [same diagram: GPUs call open(shared_file), write(), mmap() through GPUfs to the host file system]

18 Accelerating collage app with GPUfs: no CPU management code; the GPU opens and reads files directly. [diagram: CPU, GPUfs, GPU]
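To make this concrete, here is a minimal CUDA-style sketch of what such a GPU-side loop could look like. It is not the authors' collage code: the fixed image size, the decide_placement/remove_outliers helpers, and the grid-stride file assignment are assumptions; only the gopen/gread/gclose call shape follows slides 34-39.

#define IMG_BYTES 4096                           // hypothetical fixed image size

__global__ void collage_kernel(char** filenames, int nfiles) {
    __shared__ unsigned char img[IMG_BYTES];     // one image staged per thread block
    for (int i = blockIdx.x; i < nfiles; i += gridDim.x) {
        int fd = gopen(filenames[i], O_GRDWR);   // all threads in the block call together
        gread(fd, 0, img, IMG_BYTES);            // bulk-synchronous read, no CPU involved
        decide_placement(img);                   // hypothetical placement step
        remove_outliers(img);                    // hypothetical filtering step
        gclose(fd);
    }
}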

19 Accelerating collage app with GPUfs. [diagram: CPU-side read-ahead fills the GPUfs buffer cache, overlapping computations and transfers]

20 Accelerating collage app with GPUfs. [diagram: the buffer cache enables data reuse and random data access]

21 Challenge. [diagram: GPU vs. CPU]

22 GPUfs: file I/O support for GPUs. Outline: understanding the hardware; design; implementation; evaluation.

23 Massive parallelism. Parallelism is essential for performance in deeply multi-threaded, wide-vector hardware: AMD HD5870 and NVIDIA Fermi GPUs keep 23,000-31,000 threads active. (From M. Houston/A. Lefohn/K. Fatahalian, "A trip through the architecture of modern GPUs")

24 Heterogeneous memory. GPUs inherently impose high bandwidth demands on memory. [diagram: CPU memory at 10-32 GB/s, GPU memory at much higher bandwidth, joined by a far slower CPU-GPU link]

25 How to build an FS layer on this hardware?

26 GPUfs: principled redesign of the whole file system stack. Relaxed FS API semantics for parallelism; relaxed FS consistency for heterogeneous memory; GPU-specific implementation of synchronization primitives, lock-free data structures, memory allocation, and more.

27 High-level design. [diagram: on the CPU, unchanged applications use the OS File API through the OS file system interface, with GPUfs hooks in the OS; on the GPU, applications use the GPUfs File API via the GPUfs GPU file I/O library (massive parallelism). A GPUfs distributed buffer cache (page cache) spans CPU and GPU memory (heterogeneous memory), on top of the host file system and disk.]

28 On-GPU file I/O API (CPU call → GPU call):
open/close → gopen/gclose
read/write → gread/gwrite
mmap/munmap → gmmap/gmunmap
fsync/msync → gfsync/gmsync
ftrunc → gftrunc

29 On-GPU file I/O API (CPU call → GPU call):
open/close → gopen/gclose
read/write → gread/gwrite
mmap/munmap → gmmap/gmunmap
fsync/msync → gfsync/gmsync
ftrunc → gftrunc
Should we preserve CPU semantics?

30 FS API semantics: int fd = open(filename, O_GRDWR);

31 FS API semantics: int fd = open(filename, O_GRDWR); [diagram: a single CPU thread issues the call]

32 GPU threads are not CPU threads: int fd = open(filename, O_GRDWR); [diagram: one CPU thread vs. thousands of GPU threads issuing the same call]

33 File descriptors are not designed for data-parallel programs: int fd = open(filename, O_GRDWR); [same CPU vs. GPU diagram]

34 Parallel square root on CPU (thread i):
cpu_thread(thread_id i) {
    float buffer;
    int fd = open(filename, O_RDWR);
    offset = sizeof(float)*i;                    // find i-th element in file
    pread(fd, &buffer, sizeof(float), offset);   // read a_i
    buffer = sqrt(buffer);
    pwrite(fd, &buffer, sizeof(float), offset);  // write new a_i
    close(fd);
}

35 GPUfs implementation:
sqrt_gpu(char* filename) {
    __shared__ float buffer[block_size];
    int fd = gopen(filename, O_GRDWR);
    offset = block_size*sizeof(float)*blockIdx.x;
    gread(fd, offset, &buffer, block_size*sizeof(float));
    buffer[threadIdx.x] = sqrt(buffer[threadIdx.x]);
    gwrite(fd, offset, &buffer, block_size*sizeof(float));
    if (is_last_block()) gfsync(fd);
    gclose(fd);
}

36 GPUfs implementation:
sqrt_gpu(char* filename) {
    __shared__ float buffer[block_size];
    int fd = gopen(filename, O_GRDWR);   // same FD for all kernel threads: calls from
                                         // multiple TBs return the same FD; each TB
                                         // can call with a different file name
    offset = block_size*sizeof(float)*blockIdx.x;
    gread(fd, offset, &buffer, block_size*sizeof(float));
    buffer[threadIdx.x] = sqrt(buffer[threadIdx.x]);
    gwrite(fd, offset, &buffer, block_size*sizeof(float));
    if (is_last_block()) gfsync(fd);
    gclose(fd);
}

37 GPUfs implementation:
sqrt_gpu(char* filename) {
    __shared__ float buffer[block_size];
    int fd = gopen(filename, O_GRDWR);
    offset = block_size*sizeof(float)*blockIdx.x;
    gread(fd, offset, &buffer, block_size*sizeof(float));    // bulk-synchronous read
    buffer[threadIdx.x] = sqrt(buffer[threadIdx.x]);
    gwrite(fd, offset, &buffer, block_size*sizeof(float));   // bulk-synchronous write
    if (is_last_block()) gfsync(fd);
    gclose(fd);
}

38 GPUfs implementation:
sqrt_gpu(char* filename) {
    __shared__ float buffer[block_size];
    int fd = gopen(filename, O_GRDWR);
    offset = block_size*sizeof(float)*blockIdx.x;
    gread(fd, offset, &buffer, block_size*sizeof(float));
    buffer[threadIdx.x] = sqrt(buffer[threadIdx.x]);
    gwrite(fd, offset, &buffer, block_size*sizeof(float));
    if (is_last_block()) gfsync(fd);   // explicit sync
    gclose(fd);
}

39 GPUfs implementation:
sqrt_gpu(char* filename) {
    __shared__ float buffer[block_size];
    int fd = gopen(filename, O_GRDWR);
    offset = block_size*sizeof(float)*blockIdx.x;
    gread(fd, offset, &buffer, block_size*sizeof(float));
    buffer[threadIdx.x] = sqrt(buffer[threadIdx.x]);
    gwrite(fd, offset, &buffer, block_size*sizeof(float));
    if (is_last_block()) gfsync(fd);
    gclose(fd);   // matching close per open
}

40 GPUfs high-level design. [same architecture diagram as slide 27: OS File API and GPUfs hooks on the CPU, GPUfs file I/O library on the GPU, distributed buffer cache across CPU and GPU memory, host file system and disk below]

41 Buffer cache semantics: local or distributed file system data consistency?

42 GPUfs buffer cache: weak data consistency model with close(sync)-to-open semantics, as in AFS. [timeline: GPU1 issues write(1), fsync(), write(2); un-synced writes are not visible to the CPU; GPU2 then issues open() and read(1) and observes the synced data.] The remote-to-local memory performance ratio (>>1) is similar to that of a distributed system.
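As a sketch of what this contract means for code, assuming the g* API from slides 34-39; the kernels, their launch order, and everything beyond the slide-35 call shapes are illustrative, not the paper's code:

// GPU1: producer. Writes become visible to other processors only
// after gfsync() or gclose(): close(sync)-to-open semantics.
__global__ void producer(char* fname, float* data, int n) {
    int fd = gopen(fname, O_GRDWR);
    size_t off = (size_t)blockIdx.x * blockDim.x * sizeof(float);
    gwrite(fd, off, data + blockIdx.x * blockDim.x, blockDim.x * sizeof(float));
    gfsync(fd);                          // publish: the sync-to-open point
    gclose(fd);
}

// GPU2 (or the CPU): consumer. An open issued after the producer's
// gfsync() is guaranteed to observe the synced writes; an open that
// raced with the writes might not.
__global__ void consumer(char* fname, float* out, int n) {
    int fd = gopen(fname, O_GRDWR);
    gread(fd, 0, out, n * sizeof(float));
    gclose(fd);
}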

43 Implementation bits (in the paper): full paging support; dynamic data structures and memory allocators; a lock-free radix tree; inter-processor communications (IPC); hybrid hardware/software barriers; a consistency module in the OS kernel.
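For flavor, a hypothetical sketch of the kind of lock-free lookup such a radix tree could use for the page index; the fan-out, node layout, and publication protocol are assumptions, not the paper's actual data structure:

#define RADIX_BITS 6
#define FANOUT (1 << RADIX_BITS)

struct rt_node {
    struct rt_node* volatile slot[FANOUT];   // children, or cached pages at the leaves
};

// Lock-free read path: an inserter publishes a child pointer only after
// the node is fully initialized, so readers never observe a torn node.
__device__ void* rt_lookup(struct rt_node* root, size_t pageno, int height) {
    struct rt_node* n = root;
    for (int shift = (height - 1) * RADIX_BITS; n != NULL && shift >= 0; shift -= RADIX_BITS)
        n = n->slot[(pageno >> shift) & (FANOUT - 1)];
    return n;                                // leaf = cached file page, NULL = miss
}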

44 GPUfs impact on GPU programs: some memory overhead and register pressure; very little CPU coding; makes exitless GPU kernels possible; pay-as-you-go design.

45 Evaluation. All benchmarks are written as a GPU kernel: no CPU-side development.

46 Evaluation. All benchmarks are written as a GPU kernel: no CPU-side development. Setup: CUDA 5.0; 4x Tesla C2075; Supermicro server with two quad-core Xeon L5630 CPUs; 12 GB RAM.

47 Matrix-vector product (inputs/outputs in files). Vector: 1x128K elements. [chart: throughput (MB/s) vs. input matrix size (MB), comparing CUDA pipelined, CUDA optimized, and GPU file I/O implementations]
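For reference, a hypothetical sketch of a file-backed matrix-vector kernel; the tile size, block-per-row mapping, and atomic reduction are assumptions (only the gopen/gread call shape follows slide 35, and gread is assumed to synchronize the calling block, per the bulk-synchronous semantics of slide 37). y must be zeroed before launch.

#define TILE 1024                             // columns fetched per gread (assumption)

__global__ void matvec_file(char* matfile, const float* x, float* y,
                            int rows, int cols) {
    __shared__ float tile[TILE];
    int fd = gopen(matfile, O_GRDWR);         // every block gets the same FD (slide 36)
    for (int r = blockIdx.x; r < rows; r += gridDim.x) {
        float partial = 0.0f;
        for (int c0 = 0; c0 < cols; c0 += TILE) {
            gread(fd, ((size_t)r * cols + c0) * sizeof(float),
                  tile, TILE * sizeof(float));
            for (int c = threadIdx.x; c < TILE && c0 + c < cols; c += blockDim.x)
                partial += tile[c] * x[c0 + c];
            __syncthreads();                  // done with this tile before the next gread
        }
        atomicAdd(&y[r], partial);            // per-thread partials reduced via atomics
    }
    gclose(fd);
}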

48 Prioritized image matching. Problem: given an ordered set of databases D_i and a query set of images Q, for every image in Q find the database with a matching image that has the smallest i. Challenges: dynamic working set; unpredictable input size.

49 Image matching. Input: 2,016 random images (31 MB raw data). Databases: 3 files of 400 MB each, 25,000 images in total. Matching function: Euclidean distance.

50 Image matching. Input: 2,016 random images (31 MB raw data). Databases: 3 files of 400 MB each, 25,000 images in total. Matching function: Euclidean distance. CPU and GPUfs versions: 130 +/- 10 LOC.

              8 CPUs   1 GPU   2 GPUs   3 GPUs   4 GPUs
Exact match:  119s     53s     27s      18s      13s (x4)
No match:     100s     40s     21s      14s      11s (x3.6)

51 Word frequency count in text. Count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree. English dictionary: 58,000 words. Challenges: dynamic working set; small files; lots of file I/O (33,000 files, 1-5 KB each); unpredictable output size.
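A hypothetical sketch of such a kernel, assuming gread returns the number of bytes read; the chunk size, file-per-block assignment, and count_words helper (tokenize and hash against the 58,000-word dictionary) are illustrative, not the benchmark's code:

#define CHUNK 4096                            // bytes per gread (assumption)

__global__ void wordcount(char** files, int nfiles, unsigned int* counts) {
    __shared__ char buf[CHUNK];
    // One small file per thread block, grid-stride over the 33,000 inputs.
    for (int f = blockIdx.x; f < nfiles; f += gridDim.x) {
        int fd = gopen(files[f], O_GRDWR);
        size_t off = 0, got;
        while ((got = gread(fd, off, buf, CHUNK)) > 0) {   // call shape per slide 35
            count_words(buf, got, counts);    // hypothetical: tokenize and count
            off += got;
        }
        gclose(fd);
    }
}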

52 Results:

                                       8 CPUs   GPU-vanilla   GPU-GPUfs
Linux source (33,000 files, 524 MB)    6h       50m (7.2x)    53m (6.8x)
Shakespeare (1 file, 6 MB)             292s     40s (7.3x)    40s (7.3x)

53 Results:

                                       8 CPUs   GPU-vanilla   GPU-GPUfs
Linux source (33,000 files, 524 MB)    6h       50m (7.2x)    53m (6.8x)
Shakespeare (1 file, 6 MB)             292s     40s (7.3x)    40s (7.3x)

Callouts: roughly 8% GPUfs overhead versus the vanilla GPU version; unbounded input/output size support.

54 Matrix product (input/output in files). Implementation based on the matrixMul sample. Input: A = 8192x1024, B = 1024x8192. Challenges: high contention on file system state; high sensitivity to overheads.

55 Results. Kepler K20c. Matrix sizes: A (8320 x 1024), B (1024 x 4096), C (8320 x 4096); A = 33 MB, B = 16 MB, C = 130 MB. [table comparing CUDA matmul against GPUfs matmul with data in GPU buffer cache/memory, input in files with output in GPU buffer cache/memory, and data in files; reported throughputs are 167, 154, 164, 40, 132, and 161 GFLOP/s, with data in GPU memory reaching 167 GFLOP/s (CUDA) versus 154 GFLOP/s (GPUfs)]

56 GPUfs is the first system to provide native access to host OS services from GPU programs. [diagram: GPUfs bridging CPUs and GPUs] Code and paper are available for download at:
