Efficient Memory Mapped File I/O for In-Memory File Systems
Jungsik Choi, Jiwon Kim, Hwansoo Han
Storage Latency Close to DRAM
[Figure: operations per second vs. storage latency, spanning milliseconds to nanoseconds: SATA HDD (ms), SATA/SAS Flash SSD (~100 μs), PCIe Flash SSD (~60 μs), 3D-XPoint & Optane (~10 μs), NVDIMM-N (tens of ns)]
Large Overhead in Software
- Existing OSes were designed for fast CPUs and slow block devices
- With low-latency storage, SW overhead becomes the largest burden
- Software overhead includes:
  - Complicated I/O stacks
  - Redundant memory copies
  - Frequent user/kernel mode switches
- SW overhead must be addressed to fully exploit low-latency NVM storage
[Chart: normalized latency decomposition into software, storage, and NIC across configurations from HDD + 1Gb NIC to PCM + 100Gb NIC; as storage and network latency shrink, the software share grows to dominate total latency (source: Ahn et al., MICRO-48)]
Eliminate SW Overhead with mmap
- In recent studies, memory mapped I/O has been commonly proposed
- Memory mapped I/O can expose the NVM storage's raw performance
  - Mapping files onto user memory space provides memory-like access
  - Avoids complicated I/O stacks
  - Minimizes user/kernel mode switches
  - No data copies
- The mmap syscall will be a critical interface (see the sketch below)
[Diagram: traditional I/O on disk/flash goes from the app through r/w syscalls, the file system, and the device driver; memory mapped I/O on persistent memory lets the app access the file directly with load/store instructions]
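For concreteness, a minimal user-level C program that touches a file both ways; the 4 KB buffer size is illustrative:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }

    /* Traditional I/O: every read() is a user/kernel mode switch plus
     * a copy from the kernel into buf. */
    char buf[4096];
    long sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++) sum += buf[i];

    /* Memory mapped I/O: after one mmap() call, the file is accessed
     * with plain load instructions -- no further syscalls, no copies. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    for (off_t i = 0; i < st.st_size; i++) sum += p[i];

    printf("checksum: %ld\n", sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```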
Memory Mapped I/O is Not as Fast as Expected
- Microbenchmark: sequential read of a 4GB file (Ext4-DAX on NVDIMM-N)
  - read : read system calls
  - default-mmap : mmap without any special flags
  - populate-mmap : mmap with the MAP_POPULATE flag
[Chart: elapsed time (sec) decomposed into page fault overhead, page table entry construction, and file access; default-mmap spends much of its time in page fault handling, while populate-mmap pays the page table construction cost up front inside the mmap call]
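The only source-level difference between the two mmap variants is one flag, as this sketch shows (fd and len are supplied by the caller):

```c
#include <sys/mman.h>

/* default-mmap: PTEs are built lazily, so every first touch of a page
 * takes a page fault. */
void *map_lazy(int fd, size_t len)
{
    return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
}

/* populate-mmap: MAP_POPULATE builds all PTEs during the mmap call,
 * trading per-access fault overhead for a longer mmap syscall. */
void *map_eager(int fd, size_t len)
{
    return mmap(NULL, len, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
}
```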
Overhead in Memory Mapped File I/O
- Memory mapped I/O can avoid the SW overhead of traditional file I/O
- However, memory mapped I/O introduces its own SW overhead:
  - Page fault overhead
  - TLB miss overhead
  - PTE construction overhead
- This overhead diminishes the advantages of memory mapped file I/O
Techniques to Alleviate Storage Latency
- Readahead: preload pages that are expected to be accessed → map-ahead
- Page cache: cache frequently accessed pages → mapping cache
- fadvise/madvise interfaces: utilize user hints to manage pages → extended madvise
- However, the existing techniques can't be used in memory based FSs, so new optimizations are needed there
[Diagram: app accessing main memory (page cache) over secondary storage, with user hints (madvise/fadvise) driving readahead; the proposed counterparts (map-ahead, mapping cache, extended madvise) target memory based FSs]
Map-ahead Constructs PTEs in Advance
- When a page fault occurs, the page fault handler handles
  - the page that caused the page fault (existing demand paging)
  - pages that are expected to be accessed next (map-ahead)
- The kernel analyzes page fault patterns to predict the pages to be accessed
  - Sequential fault : map-ahead window size grows
  - Random fault : map-ahead window size shrinks
- Map-ahead reduces the number of page faults (see the sketch below)
[Diagram: fault sequences classified as PURE_SEQ, NEARLY_SEQ, and NON_SEQ, with the map-ahead window size (winsize) adjusted per class]
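A minimal sketch of the window adaptation described above, assuming the window doubles on sequential faults and resets on random ones (the growth policy, bounds, and names like winsize are assumptions, not the authors' kernel code):

```c
/* Simplified sketch of map-ahead window adaptation (illustrative only;
 * the real logic lives in the kernel page fault handler). */
enum pattern { PURE_SEQ, NEARLY_SEQ, NON_SEQ };

#define WIN_MIN 1
#define WIN_MAX 512   /* assumed cap on pages mapped ahead */

static unsigned long winsize = WIN_MIN;

unsigned long on_page_fault(enum pattern p)
{
    switch (p) {
    case PURE_SEQ:    /* strictly sequential faults: grow the window */
        winsize = winsize * 2 > WIN_MAX ? WIN_MAX : winsize * 2;
        break;
    case NEARLY_SEQ:  /* mostly sequential: keep the current window */
        break;
    case NON_SEQ:     /* random faults: shrink back to demand paging */
        winsize = WIN_MIN;
        break;
    }
    /* The handler then maps the faulting page plus up to 'winsize - 1'
     * following pages in one go, instead of one page per fault. */
    return winsize;
}
```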
Comparison of Demand Paging & Map-ahead
- Existing demand paging: each page (a, b, c, d, e, f) is mapped by a separate page fault, and every fault pays the user/kernel mode switch, page fault, and TLB miss overhead
- Map-ahead: one fault maps the faulting page plus the following pages in its window, so subsequent accesses hit already-constructed PTEs
[Timeline diagram: with demand paging, the application alternates with the page fault handler on every page; with a map-ahead window size of 5, a single fault covers pages a-e, yielding a performance gain]
Async Map-ahead Hides Mapping Overhead
- Sync map-ahead: the faulting thread constructs all PTEs in its window before returning to the application
- Async map-ahead: the handler maps only the first pages synchronously and hands the rest of the window to kernel threads, overlapping PTE construction with application execution
[Timeline diagram: with a map-ahead window size of 5, async map-ahead returns to the application earlier while kernel threads finish the remaining mappings, yielding a performance gain]
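A sketch of how the handler might split the window between synchronous and deferred work; map_one_page, queue_kernel_work, and SYNC_PAGES are hypothetical stand-ins, not the kernel's API:

```c
#include <stddef.h>

void map_one_page(unsigned long pfn);                 /* stub: build one PTE */
void queue_kernel_work(unsigned long pfn, size_t n);  /* stub: defer to kthread */

enum { SYNC_PAGES = 2 };  /* assumed: pages mapped before returning to user */

void handle_fault_async(unsigned long fault_page, size_t winsize)
{
    size_t i;

    /* Map the faulting page (plus a couple more) so the app can resume. */
    for (i = 0; i < SYNC_PAGES && i < winsize; i++)
        map_one_page(fault_page + i);

    /* Hand the rest of the window to kernel threads, overlapping PTE
     * construction with application execution. */
    if (winsize > SYNC_PAGES)
        queue_kernel_work(fault_page + SYNC_PAGES, winsize - SYNC_PAGES);
}
```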
Extended madvise Makes User Hints Available
- Taking advantage of user hints can maximize the app's performance
- However, the existing interfaces only optimize paging
- Extended madvise makes user hints available in memory based FSs (usage sketch below)

  Hint                              Existing madvise              Extended madvise
  MADV_SEQUENTIAL / MADV_WILLNEED   Run readahead aggressively    Run map-ahead aggressively
  MADV_RANDOM                       Stop readahead                Stop map-ahead
  MADV_DONTNEED                     Release pages in page cache   Release mapping in mapping cache
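The interface itself is unchanged from standard madvise; a minimal sketch of passing these hints:

```c
#include <sys/mman.h>

/* Hint a mapped region before a sequential scan. With extended madvise,
 * this same standard call triggers aggressive map-ahead on a memory
 * based FS instead of readahead. */
void hint_sequential(void *addr, size_t len)
{
    madvise(addr, len, MADV_SEQUENTIAL);
}

/* Hint before random accesses: stops map-ahead (extended semantics). */
void hint_random(void *addr, size_t len)
{
    madvise(addr, len, MADV_RANDOM);
}
```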
Mapping Cache Reuses Mapping Data
- When munmap is called, the existing kernel releases the VMA and PTEs
- In memory based FSs, this mapping overhead is very large
- With a mapping cache, the mapping overhead can be reduced (sketch below)
  - VMAs and PTEs are cached in the mapping cache on munmap
  - When mmap is called, they are reused if possible (cache hit)
[Diagram: VMAs in the mm_struct red-black tree map the process address space through the page table to physical memory; cached VMAs/PTEs are kept for reuse by later mmap calls]
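A user-space flavored sketch of the cache lookup, assuming a (file, offset, length) key; all structure and function names here are hypothetical, not the authors' kernel code:

```c
#include <stddef.h>

/* Illustrative mapping cache entry keyed by (file, offset, length). */
struct cached_mapping {
    unsigned long inode_no;      /* identity of the mapped file */
    unsigned long offset, len;   /* mapped range */
    void *vma;                   /* preserved VMA (and its PTEs) */
    struct cached_mapping *next;
};

static struct cached_mapping *cache_head;

/* munmap path: instead of tearing down the VMA and PTEs, stash them. */
void mapping_cache_put(struct cached_mapping *m)
{
    m->next = cache_head;
    cache_head = m;
}

/* mmap path: reuse a cached mapping on a hit, skipping PTE construction. */
struct cached_mapping *mapping_cache_get(unsigned long ino,
                                         unsigned long off, unsigned long len)
{
    for (struct cached_mapping *m = cache_head; m; m = m->next)
        if (m->inode_no == ino && m->offset == off && m->len == len)
            return m;   /* cache hit: VMA and PTEs reused as-is */
    return NULL;        /* miss: fall back to building a new mapping */
}
```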
Experimental Environment
- Experimental machine
  - Intel Xeon E5-2620
  - DRAM + NVDIMM-N (Netlist DDR4 NVDIMM-N 16GB)
  - Linux kernel 4.x, Ext4-DAX filesystem
- Benchmarks
  - fio : sequential & random read (an illustrative invocation is sketched below)
  - YCSB on MongoDB : load workload
  - Apache HTTP server with httperf
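An illustrative fio invocation approximating the sequential read test; the mount point, file size, and record size are assumptions, not the exact setup:

```
fio --name=seq-read --ioengine=mmap --rw=read \
    --bs=16k --size=4g --filename=/mnt/ext4dax/testfile
```

The mmap ioengine makes fio access the file through a memory mapping rather than read() calls; sweeping --bs reproduces the record-size axis of the charts, and --rw=randread gives the random read variant.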
fio : Sequential Read
[Chart: bandwidth (GB/s) vs. record size (2-64 KB) for default-mmap, populate-mmap, map-ahead, and extended madvise; map-ahead and extended madvise cut page faults from millions (default-mmap) down to thousands and deliver the highest bandwidth]
fio : Random Read
[Chart: bandwidth (GB/s) vs. record size (2-64 KB) for the same four configurations; with madvise(MADV_RANDOM), map-ahead is disabled, so extended madvise behaves like default-mmap]
Web Server Experimental Setting
- Apache HTTP server
  - Memory mapped file I/O
  - Tens of thousands of MB-size HTML files (from Wikipedia), tens of GB in total
- httperf clients
  - 8 additional machines
  - Zipf-like distribution (an illustrative client invocation is sketched below)
[Diagram: Apache HTTP server connected to the httperf clients through 10Gbps switch ports (iperf: 8,800 Mbps bandwidth)]
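An illustrative httperf invocation for one client machine; the host, rate, URI, and connection count are placeholders, and the real runs drew URIs from a Zipf-like distribution across the file set:

```
httperf --server apache-host --port 80 \
        --uri /wiki/sample.html --rate 900 --num-conns 10000
```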
Apache HTTP Server
[Chart: CPU cycles (trillions, 10^12) split among kernel, libphp, libc, and others vs. request rate (600-1000 reqs/s), comparing no mapping cache, a size-limited mapping cache, and an unlimited mapping cache; the mapping cache reduces the kernel share of CPU cycles]
Conclusion
- SW latency is becoming larger than the storage latency
- Memory mapped file I/O can avoid the SW overhead of traditional file I/O
- However, it still incurs expensive additional overhead
  - Page fault, TLB miss, and PTE construction overhead
- To exploit the benefits of memory mapped I/O, we propose
  - Map-ahead, extended madvise, and mapping cache
- Our techniques demonstrate good performance by mitigating the mapping overhead, with map-ahead, with map-ahead plus extended madvise, and with mapping cache
Q&A
chjs@skku.edu