Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han

Efficient Memory Mapped File I/O for In-Memory File Systems Jungsik Choi, Jiwon Kim, Hwansoo Han

Storage Latency Close to DRAM
[Chart: operations per second vs. access latency: SATA HDD in the millisecond range, SATA/SAS Flash SSD around 100 μs, PCIe Flash SSD around 60 μs, 3D-XPoint/Optane around 10 μs, NVDIMM-N in the nanosecond range]

Large Overhead in Software
Existing OSes were designed for fast CPUs and slow block devices. With low-latency storage, software overhead becomes the largest burden. This software overhead includes complicated I/O stacks, redundant memory copies, and frequent user/kernel mode switches, and it must be addressed to fully exploit low-latency NVM storage.
[Chart: normalized latency decomposition into software, storage, and NIC components for HDD, NVMe SSD, and PCM storage with 10Gb/100Gb NICs; the software share grows as the storage and network get faster. Source: Ahn et al., MICRO-48]

Eliminate SW Overhead with mmap
In recent studies, memory mapped I/O has been commonly proposed. Memory mapped I/O can expose the NVM storage's raw performance: it maps files onto user memory space, provides memory-like access, avoids complicated I/O stacks, minimizes user/kernel mode switches, and requires no data copies. The mmap syscall will be a critical interface.
[Diagram: traditional I/O goes App, read/write syscalls, file system, device driver, disk/flash; memory mapped I/O goes App, load/store instructions, persistent memory]
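
To make the contrast concrete, here is a minimal user-space sketch of the two access paths on this slide. The path /mnt/pmem/data is a placeholder for a file on a DAX-mounted in-memory file system; the rest is ordinary POSIX usage.

```c
/* Minimal sketch of the two access paths: read() syscalls vs. load
 * instructions on an mmap'ed file. /mnt/pmem/data is a placeholder. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/pmem/data", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Traditional I/O: every read() crosses the user/kernel boundary
     * and copies data into the user buffer. */
    char buf[4096];
    ssize_t n = pread(fd, buf, sizeof(buf), 0);
    if (n < 0) { perror("pread"); return 1; }

    /* Memory mapped I/O: after mmap(), the file is accessed with plain
     * load instructions, with no per-access syscall and no data copy. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += (unsigned char)p[i];      /* touches one byte per page */

    printf("first byte via read: %d, checksum via mmap: %lu\n", buf[0], sum);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```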

Memory Mapped I/O is Not as Fast as Expected
Microbenchmark: sequential read of a GB-scale file (Ext4-DAX on NVDIMM-N). Three configurations are compared: read (read system calls), default-mmap (mmap without any special flags), and populate-mmap (mmap with the MAP_POPULATE flag).
[Chart: elapsed time in seconds for each configuration, broken down into page fault overhead, page table entry construction, and file access]
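
For reference, a sketch of the populate-mmap configuration measured above. MAP_POPULATE is the standard Linux flag; the helper name and error handling are illustrative.

```c
/* Sketch of the "populate-mmap" configuration: MAP_POPULATE asks the
 * kernel to prefault the mapping (build the page table entries) at
 * mmap() time instead of on first access. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

char *map_populated(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }
    *len = st.st_size;

    /* Pay the whole mapping cost up front, so later loads hit
     * already-constructed PTEs and take no page faults. */
    char *p = mmap(NULL, *len, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```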

Overhead in Memory Mapped File I/O
Memory mapped I/O can avoid the SW overhead of traditional file I/O. However, memory mapped I/O causes another kind of SW overhead: page fault overhead, TLB miss overhead, and PTE construction overhead. This overhead diminishes the advantages of memory mapped file I/O.
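
As an aside (not from the talk), the page-fault component is easy to observe from user space with getrusage(); a small illustrative sketch, with hypothetical helper names:

```c
/* Illustrative only: count the page faults taken while touching an
 * mmap'ed region, using getrusage(). */
#include <stdio.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

void touch_and_count(char *map, size_t len)
{
    long before = minor_faults();
    volatile unsigned long sum = 0;
    for (size_t i = 0; i < len; i += 4096)   /* one access per 4KB page */
        sum += (unsigned char)map[i];
    printf("minor page faults while touching %zu bytes: %ld\n",
           len, minor_faults() - before);
}
```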

Techniques to Alleviate Storage Latency
Traditional kernels alleviate storage latency with readahead (preload pages that are expected to be accessed), the page cache (cache frequently accessed pages), and the fadvise/madvise interfaces (utilize user hints to manage pages). However, these cannot be used in memory-based file systems, so new optimizations are needed there. We propose counterparts for memory-based file systems: map-ahead in place of readahead, a mapping cache in place of the page cache, and extended madvise in place of the existing user hints. A sketch of the traditional hint interface follows.
[Diagram: app accessing main memory (page cache) backed by secondary storage, with readahead and user hints (madvise/fadvise) managing the page cache]
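
The sketch below shows the traditional hint interface named above, posix_fadvise, which steers readahead and the page cache on block-device file systems; as the slide notes, these hints do not help memory-based file systems, which is what motivates extended madvise later. The helper name is illustrative.

```c
/* Traditional user hints for block-device file systems: fadvise hints
 * that control readahead for an open file. */
#include <fcntl.h>

void hint_file(int fd, int sequential)
{
    if (sequential)
        /* Expect sequential reads: the kernel may readahead aggressively. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    else
        /* Expect random reads: disable readahead. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}
```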

Map-ahead Constructs PTEs in Advance
When a page fault occurs, the page fault handler maps not only the page that caused the fault (existing demand paging) but also the pages that are expected to be accessed next (map-ahead). The kernel analyzes page fault patterns to predict which pages will be accessed: on sequential faults the map-ahead window size grows, and on random faults it shrinks. Map-ahead can therefore reduce the number of page faults. A schematic sketch of this heuristic follows.
[State diagram: fault patterns classified as PURE_SEQ, NEARLY_SEQ, or NON_SEQ, with the map-ahead window size adjusted on each state transition]
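
A schematic sketch of the window-size heuristic described above, not the authors' kernel code: the state names mirror the slide, while the growth/shrink factors and the bound are illustrative assumptions.

```c
/* Schematic map-ahead window heuristic. State names follow the slide
 * (PURE_SEQ, NEARLY_SEQ, NON_SEQ); factors and limits are illustrative. */
enum fault_pattern { PURE_SEQ, NEARLY_SEQ, NON_SEQ };

struct mapahead_state {
    unsigned long last_addr;   /* address of the previous fault        */
    unsigned long winsize;     /* pages to map ahead on the next fault */
    enum fault_pattern state;
};

static void mapahead_on_fault(struct mapahead_state *s, unsigned long addr,
                              unsigned long page_size)
{
    int sequential = (addr == s->last_addr + page_size);
    s->last_addr = addr;

    if (sequential) {
        /* Sequential fault: grow the window (up to some cap). */
        s->state = (s->state == NON_SEQ) ? NEARLY_SEQ : PURE_SEQ;
        if (s->winsize < 512)
            s->winsize *= 2;
    } else {
        /* Random fault: shrink the window back toward demand paging. */
        s->state = NON_SEQ;
        if (s->winsize > 1)
            s->winsize /= 2;
    }
    /* The fault handler would now construct PTEs for `winsize` pages
     * starting at `addr`, instead of only the single faulting page. */
}
```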

Comparison of Demand Paging & Map-ahead Existing demand paging application a b c d e f page fault handler 5 time 9

Comparison of Demand Paging & Map-ahead Existing demand paging overhead in user/kernel mode switch, page fault, TLB miss application a b c d e f page fault handler 5 time 9

Comparison of Demand Paging & Map-ahead Existing demand paging overhead in user/kernel mode switch, page fault, TLB miss application a b c d e f page fault handler 5 time Map-ahead application a b c d e f page fault handler 5 time Performance gain Map-ahead window size == 5 9

Async Map-ahead Hides Mapping Overhead
[Timeline diagram: with synchronous map-ahead (window size 5), the faulting thread constructs all of the mappings in the window before returning to the application. With asynchronous map-ahead, the page fault handler maps only the first pages and hands the rest of the window to kernel threads that build the remaining mappings in the background, so the application resumes sooner and gains additional performance.]

Extended madvise Makes User Hints Available
Taking advantage of user hints can maximize an application's performance, but the existing interfaces only optimize paging. Extended madvise makes user hints effective in memory-based FSs as well:
MADV_SEQUENTIAL, MADV_WILLNEED: existing madvise runs readahead aggressively; extended madvise runs map-ahead aggressively.
MADV_RANDOM: existing madvise stops readahead; extended madvise stops map-ahead.
MADV_DONTNEED: existing madvise releases pages in the page cache; extended madvise releases mappings in the mapping cache.
A usage sketch follows.
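
The calls below are the standard madvise(2) interface; with the authors' extended madvise, the same hints also steer map-ahead and the mapping cache on memory-based file systems. The helper name is illustrative.

```c
/* Usage sketch: passing access hints on an mmap'ed file region with the
 * standard madvise() interface. */
#include <sys/mman.h>

void hint_mapping(void *addr, size_t len, int sequential)
{
    if (sequential) {
        /* Expect sequential access: run (map-)ahead aggressively. */
        madvise(addr, len, MADV_SEQUENTIAL);
        madvise(addr, len, MADV_WILLNEED);
    } else {
        /* Expect random access: stop (map-)ahead. */
        madvise(addr, len, MADV_RANDOM);
    }
}
```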

Mapping Cache Reuses Mapping Data
When munmap is called, the existing kernel releases the VMA and the PTEs. In memory-based FSs this mapping overhead is very large, and a mapping cache can reduce it: VMAs and PTEs are kept in the mapping cache, and when mmap is called again they are reused if possible (a cache hit). A schematic illustration of the idea follows.
[Diagram: VMAs in the process address space (mm_struct red-black tree) and page table entries translating virtual to physical addresses are preserved for mmap'ed regions instead of being torn down]
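
A schematic user-level illustration of the reuse idea, not the authors' in-kernel mapping cache (which preserves VMAs and PTEs across munmap/mmap); the structure and function names are hypothetical.

```c
/* Illustration of the mapping-cache idea at user level: a tiny table keeps
 * established mappings keyed by device+inode, so a repeated "map this file"
 * request reuses the existing mapping (cache hit) instead of rebuilding it. */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct cached_map { dev_t dev; ino_t ino; void *addr; size_t len; int used; };
static struct cached_map cache[64];

void *map_cached(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return NULL; }

    for (int i = 0; i < 64; i++)               /* cache hit: reuse mapping */
        if (cache[i].used && cache[i].dev == st.st_dev &&
            cache[i].ino == st.st_ino) {
            close(fd);
            *len = cache[i].len;
            return cache[i].addr;
        }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return NULL;
    for (int i = 0; i < 64; i++)               /* cache miss: remember it */
        if (!cache[i].used) {
            cache[i] = (struct cached_map){ st.st_dev, st.st_ino, p,
                                            (size_t)st.st_size, 1 };
            break;
        }
    *len = st.st_size;
    return p;
}
```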

Experimental Environment
Experimental machine: Intel Xeon E5, DRAM plus a Netlist DDR NVDIMM-N, Linux kernel with the Ext4-DAX file system.
Benchmarks: fio (sequential & random read), YCSB on MongoDB (load workload), and the Apache HTTP server with httperf.

fio: Sequential Read
[Chart: bandwidth (GB/s) across record sizes (KB) for default-mmap, populate-mmap, map-ahead, and extended madvise. Map-ahead and extended madvise outperform default-mmap: they take on the order of thousands of page faults, while default-mmap takes millions.]

fio: Random Read
[Chart: bandwidth (GB/s) across record sizes (KB) for default-mmap, populate-mmap, map-ahead, and extended madvise. With madvise(MADV_RANDOM), extended madvise behaves like default-mmap.]

Web Server Experimental Setting
Apache HTTP server using memory mapped file I/O, serving tens of thousands of MB-sized HTML files (from Wikipedia). httperf clients on 8 additional machines generate requests following a Zipf-like distribution, connected through 10Gbps switch ports (iperf-measured bandwidth: 8,800 Mbps).

Apache HTTP Server
[Chart: CPU cycles (trillions) split among kernel, libphp, libc, and others at request rates of 600 to 1000 reqs/s, comparing no mapping cache, a size-limited mapping cache, and an unlimited mapping cache. The mapping cache reduces the CPU cycles spent in the kernel.]

Conclusion
Software latency is becoming larger than storage latency. Memory mapped file I/O can avoid the traditional SW overhead, but it still incurs expensive additional overhead: page faults, TLB misses, and PTE construction. To exploit the benefits of memory mapped I/O, we propose map-ahead, extended madvise, and the mapping cache. These techniques mitigate the mapping overhead and demonstrate good performance gains for map-ahead, map-ahead combined with extended madvise, and the mapping cache.

Q&A
chjs@skku.edu