SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman Tanya Brokhman Tzachi Cohen Mark Silberstein

1 SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs
Shai Bergman, Tanya Brokhman, Tzachi Cohen, Mark Silberstein

2 What do we do? Enable efficient file I/O for GPUs
- Why? Support diverse I/O workloads involving GPUs
- How? Make P2P a first class citizen within the file I/O stack
- Results: better throughput, standard file API, cross-GPU portability

3 Fast data transfers
- Data resides in SSD
- CPU mediated data transfers introduce extra latency with lower throughput
- Bounded by extra copy
CPUIO - CPU mediated transfer (over PCIe)

4 GPU vendors support P2P & PCIe
- Eliminates redundant copies
- Not involved in data path
- Better power efficiency
- Higher throughput & lower latency
[1] J. Zhang, D. Donofrio, J. Shalf, M. T. Kandemir, and M. Jung, NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures
[2] H.-W. Tseng, Y. Liu, M. Gahagan, J. Li, Y. Jin, and S. Swanson, Gullfoss: Accelerating and Simplifying Data Movement Among Heterogeneous Computing and Storage Resources
[3] M. Shihab, K. Taht, and M. Jung, GPUdrive: Reconsidering Storage Accesses for GPU Acceleration
[4] H.-W. Tseng, Q. Zhao, Y. Zhou, M. Gahagan, and S. Swanson, Morpheus: creating application objects efficiently for heterogeneous computing
[5] Project Donard

5 P2P is great!*
*But wait: only for certain data access patterns
[Figure: speedup of P2P over CPUIO for sequential reads (e.g. grep) vs. block size, 512B to 1M. CPUIO - CPU mediated transfer. Data is not preloaded to the page cache.]
P2P is not a silver bullet!
- Large sequential reads: P2P ~1.4x speedup
- Short sequential reads: P2P ~33x slower than CPUIO

6 What went wrong? P2P bypasses the kernel!
- No page cache integration
  - No read ahead
  - Cannot utilize P$ for data reuse
- Hard to utilize
  - Non-standard API
  - No misaligned accesses
  - LVM/MDADM incompatible
- No file consistency
  - Can read stale data
  - Requires explicit flushes to SSD

7 What do we want? Regular file I/O to GPU memory
int fd; ... // open file ... pread64(fd, gpu_buffer, 20*1024, 0); ...
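
The call on this slide is the ordinary POSIX positional read. As a minimal sketch of the calling pattern only, the helper below reads a byte range with `pread`, retrying on short reads. The name `read_range` is hypothetical, and a host buffer stands in for the GPU buffer, because the slides do not show how `gpu_buffer` is allocated or mapped.

```c
#define _FILE_OFFSET_BITS 64 /* 64-bit off_t, so pread acts like pread64 */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes starting at offset off of the
 * file at path into dst. Returns the number of bytes read (short only at
 * end of file) or -1 on error. */
ssize_t read_range(const char *path, void *dst, size_t len, off_t off)
{
    assert(path != NULL && dst != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {
        /* pread does not move the file position; the offset is explicit. */
        ssize_t n = pread(fd, (char *)dst + done, len - done,
                          off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal, retry */
            close(fd);
            return -1;
        }
        if (n == 0)
            break; /* end of file */
        done += (size_t)n;
    }
    close(fd);
    return (ssize_t)done;
}
```

The point of the slide is that application code keeps this shape; only the destination pointer would refer to GPU memory.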

8 SPIN: Contributions
- Standard file API
  - Underlying block device support (RAID, LVM)
  - Activate P2P when beneficial
- Data consistency + POSIX file semantics
  - Keep POSIX file semantics + data consistency, even when CPU + GPU work on the same file
- Combine page cache and P2P
  - Interleave system memory and SSD when possible
- Read ahead
  - Activate read ahead mechanism when determined beneficial
  - Nested page cache within CPU memory for GPU use

9 SPIN: High level overview
[Diagram: a pread(fd, GPU buffer) request (RQ) enters the VFS and reaches the P-router, which asks: From a compatible SSD? Destined to GPU? Part of a sequential read? Data resides in page cache? Components shown: P-Read Ahead policy, P-router, P-cache checker, page cache, CPU PC RA, block layer, NVMe driver with address extraction, current FS API, file, NVMe SSD, and the P2PDMA transfer over PCIe into the GPU buffer. Steps are labeled 1, 2.a/2.b, 3.a/3.b, 4.a/4.b, 5.a/5.b.]
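
The four questions in the diagram suggest a routing decision per read request. The sketch below is one possible reading of it, not the SPIN implementation: every type and function name is hypothetical, and the order of the checks is an assumption. The three cache outcomes follow the later slides (empty cache: P2P; everything cached: copy from memory; partially cached: interleave).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* How much of the requested range is already in the page cache. */
enum cache_state { CACHE_NONE, CACHE_SOME, CACHE_ALL };

enum route {
    ROUTE_REGULAR_IO, /* not a GPU/P2P case: ordinary file I/O path */
    ROUTE_P2P,        /* nothing cached: SSD to GPU over PCIe */
    ROUTE_PAGE_CACHE, /* everything cached: copy from system memory */
    ROUTE_INTERLEAVE, /* partially cached: mix both sources */
};

/* Hypothetical request descriptor mirroring the diagram's questions. */
struct read_request {
    bool ssd_p2p_capable;    /* From a compatible SSD? */
    bool dest_is_gpu;        /* Destined to GPU? */
    bool sequential;         /* Part of a sequential read? */
    enum cache_state cached; /* Data resides in page cache? */
};

struct decision {
    enum route route;
    bool consider_read_ahead; /* hand the request to the read-ahead policy */
};

struct decision route_read(const struct read_request *rq)
{
    assert(rq != NULL);
    struct decision d = { ROUTE_REGULAR_IO, false };

    /* Assumption: anything that is not SSD-to-GPU stays on the normal path. */
    if (!rq->ssd_p2p_capable || !rq->dest_is_gpu)
        return d;

    d.consider_read_ahead = rq->sequential;
    switch (rq->cached) {
    case CACHE_ALL:  d.route = ROUTE_PAGE_CACHE; break;
    case CACHE_SOME: d.route = ROUTE_INTERLEAVE; break;
    case CACHE_NONE: d.route = ROUTE_P2P;        break;
    }
    return d;
}
```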

16 SPIN: Combine page cache and P2P
Sometimes the requested data resides in the P$, e.g. due to previous usage of the data by the CPU.
[Figure: transfer time (usec) from P$ and from SSD via P2P vs. request size (KiB); lower is better.]
Transferring data from P$ is faster!

17 SPIN: Combine page cache and P2P
pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB (Page #0 to Page #4)
Page cache is empty: the P-cache checker selects a P2P transfer over PCIe.

18 SPIN: Combine page cache and P2P
pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB (Page #0 to Page #4)
All pread64 contents in P$: the P-cache checker selects a transfer from memory (CPUIO).

19 SPIN: Combine page cache and P2P
pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB (Page #0 to Page #4)
Some of the data is in P$? (legend: page resides in SSD only / page resides in SSD and P$)
Fine grained interleaving is a bad idea!

20 SPIN: Combine page cache and P2P
pread64(fd,gpu_dest,5*4096,0); //5 pages of 4KiB (Page #0 to Page #4; legend: page resides in SSD only / page resides in SSD and P$)
- 3 transfers of 4KiB via P2P (40.1us each): 120.3us
- Single transfer of 20KiB via P2P: 74.3us

21 SPIN: Combine page cache and P2P
Fine grained interleaving = poor performance!
SSDs:
- Short IO requests are less efficient (low parallelism)
- Invocation overhead per request
Optimization problem: find the transfer schedule that minimizes transfer time

22 SPIN: Combine page cache and P2P
To solve the problem & get an optimal schedule we need:
- $T^{s}_{p2p}$: P2P transfer time for a given request size $s$
- $T^{s}_{P\$}$: P$ transfer time for a given request size $s$
We model the SSD and RAM performance characteristics:
- Assume P2P transfer time as piece-wise linear
- Assume RAM transfer time as linear
[Figure: Transfer Time [sec] vs. Request Size [KiB] for P2P and P$.]
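
The two cost models named on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' fitted model: the first two breakpoints are the P2P timings quoted earlier in the deck (4 KiB in 40.1us, 20 KiB in 74.3us), while the remaining breakpoints and the page cache rate are invented placeholders that would have to be measured on real hardware. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One breakpoint of a piece-wise linear curve: request size -> time. */
struct breakpoint {
    double size_kib;
    double time_us;
};

/* First two points: P2P timings quoted in the slides.
 * Last two points: made-up placeholders, NOT measured values. */
static const struct breakpoint p2p_curve[] = {
    { 4.0, 40.1 }, { 20.0, 74.3 }, { 128.0, 150.0 }, { 1024.0, 600.0 },
};
#define P2P_POINTS (sizeof p2p_curve / sizeof p2p_curve[0])

/* Assumed page cache copy cost per KiB (placeholder, not from the slides). */
#define PCACHE_US_PER_KIB 0.5

/* Piece-wise linear P2P time: interpolate between neighbouring breakpoints,
 * clamp below the first one, extrapolate along the last segment. */
double t_p2p_us(double size_kib)
{
    assert(size_kib > 0.0);
    if (size_kib <= p2p_curve[0].size_kib)
        return p2p_curve[0].time_us;

    size_t i = 1;
    while (i < P2P_POINTS - 1 && size_kib > p2p_curve[i].size_kib)
        i++;

    const struct breakpoint *a = &p2p_curve[i - 1];
    const struct breakpoint *b = &p2p_curve[i];
    double slope = (b->time_us - a->time_us) / (b->size_kib - a->size_kib);
    return a->time_us + slope * (size_kib - a->size_kib);
}

/* Linear page cache (RAM) time. */
double t_pcache_us(double size_kib)
{
    assert(size_kib >= 0.0);
    return PCACHE_US_PER_KIB * size_kib;
}
```

With these numbers, three separate 4 KiB P2P requests cost more than one 20 KiB request, which matches the comparison shown on slide 20.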

23 SPIN: Combine page cache and P2P
- Solution is polynomial in number of blocks
- Costly to calculate for every transfer
We apply a greedy heuristic:
- Examine every 3 consecutive data chunks: Chunk #n, Chunk #n+1, Chunk #n+2 (legend: page resides in SSD only / page resides in SSD and P$)
- Calculate: $T^{n+(n+1)+(n+2)}_{p2p}$ vs. $T^{n}_{p2p} + T^{n+1}_{P\$} + T^{n+2}_{p2p}$

24 SPIN: Combine page cache and P2P
- Greedy heuristic is only 1.6% slower than optimal scheduling
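
A minimal sketch of the three-chunk comparison follows, assuming the pattern is SSD-only, cached, SSD-only and that a cheaper merged transfer replaces the three chunks with one P2P request. This is an interpretation of the slide, not the SPIN code. The P2P cost is a single line through the two timings quoted in the deck (4 KiB in 40.1us, 20 KiB in 74.3us); the page cache rate of 0.5us per KiB is an invented placeholder. All identifiers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum source { FROM_SSD_P2P, FROM_PAGE_CACHE };

struct chunk {
    double size_kib;
    enum source src;
};

/* Line through the two P2P timings from the slides (one linear segment). */
static double t_p2p_us(double kib)    { return 31.55 + 2.1375 * kib; }
/* Assumed linear page cache cost; the rate is a placeholder. */
static double t_pcache_us(double kib) { return 0.5 * kib; }

/* Greedy pass over chunks[0..n): whenever an SSD chunk is followed by a
 * cached chunk and another SSD chunk, compare one merged P2P transfer
 * against three separate transfers and merge if that is cheaper.
 * Rewrites the array in place and returns the new chunk count. */
size_t greedy_schedule(struct chunk *c, size_t n)
{
    assert(c != NULL || n == 0);
    size_t out = 0; /* write index; never passes the read index i */

    for (size_t i = 0; i < n; i++) {
        struct chunk cur = c[i];
        while (cur.src == FROM_SSD_P2P && i + 2 < n &&
               c[i + 1].src == FROM_PAGE_CACHE &&
               c[i + 2].src == FROM_SSD_P2P) {
            double merged = t_p2p_us(cur.size_kib + c[i + 1].size_kib +
                                     c[i + 2].size_kib);
            double split = t_p2p_us(cur.size_kib) +
                           t_pcache_us(c[i + 1].size_kib) +
                           t_p2p_us(c[i + 2].size_kib);
            if (merged >= split)
                break; /* keeping the cached chunk separate is cheaper */
            cur.size_kib += c[i + 1].size_kib + c[i + 2].size_kib;
            i += 2; /* the two following chunks were absorbed */
        }
        c[out++] = cur;
    }
    return out;
}
```

For five 4 KiB pages that alternate between SSD-only and cached, this sketch collapses the schedule into a single 20 KiB P2P transfer, while a large cached chunk between two small SSD chunks is left as three separate transfers.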

25 SPIN: Implementation: P2P & P$ transfers
- P2P: address tunneling mechanism
- P$: memcpy from P$ to mapped GPU memory
[Diagram: the high-level overview from slide 9, repeated.]

26 SPIN: Implementation & Evaluation
SPIN is implemented as a kernel module, a patched NVMe module & an LD_PRELOAD library. No kernel modifications are required.
System specs:
- Intel P3700 NVMe SSD
- AMD Radeon R9 Fury & NVIDIA Tesla K40c
- Ubuntu + Linux kernel
- Intel Core i7-5930k (6 physical cores) & X99 chipset
- 24GB DDR4 RAM

27 SPIN: Implementation & Evaluation
We have evaluated the following:
Threaded IO (TIOtest) benchmark (1-4 threads):
- Sequential reads (including software RAID)
- Random reads/writes
- Effects of P$ residency on read throughput
- Effects of CPU & I/O stress on read throughput
Application benchmarks:
- Aerial imagery rendering
- GPU accelerated log server
- Image collage utilizing GPUfs

28 SPIN: Implementation & Evaluation: Effect of P$ on Read Throughput
- Potential performance gains for producer-consumer workloads
- No data in P$: less than 5% overhead
- All data in P$: less than 5% overhead
[Figure: relative throughput (%) and throughput (MB/sec) vs. % of file in page cache, with P2PDMA and CPUIO as reference; 512B reads.]

29 SPIN: Implementation & Evaluation: GPU Accelerated Log Server
- Store a log into SSD
- Analyze log using GPU acceleration for string matching
- Similar to fail2ban
Real time configuration:
- Log arrives to server
- Server stores logs in SSD
- GPU analyzes logs by reading file
Offline configuration:
- Log is already in SSD
- GPU analyzes logs by reading file

30 SPIN: Implementation & Evaluation: GPU Accelerated Log Server
- We want our application to work efficiently in any configuration (real time or offline)

31 SPIN: Implementation & Evaluation: GPU Accelerated Log Server
- Real time configuration: data resides in P$ and SSD; SPIN reads data from P$
- Offline configuration: data resides in SSD only; SPIN utilizes P2P
[Figure: two bar charts of BW [MBPS] per transfer mechanism (P2P, CPUIO), one per configuration; annotations: "I/O contention" (real time), "Additional hop" (offline).]

32 SPIN:
- SPIN seamlessly integrates P2P as a first class citizen into the file I/O stack
- SPIN utilizes several mechanisms to speed up data transfers transparently
- With SPIN, the same code performs well under all setups
Thank you!
shaiberg1@tx.technion.ac.il
github.com/acsl-technion/spin
