Using persistent memory to build a high-performant, fully user space filesystem Krzysztof Czuryło, Intel
What is it? pmemfile: Low-overhead userspace implementation of file APIs using persistent memory. Open source: https://github.com/pmem/pmemfile (BSD license)
What is it? Fully user-space, not FUSE. Builds on libpmemobj (http://pmem.io/nvml/). Specifically designed for Persistent Memory.
pmem family (https://pmem.io): libpmem, libpmemobj, libpmemblk, libpmemlog, libvmem, libvmmalloc, librpmem, libpmemfile, libpmemfile-posix, syscall_intercept, vltrace, antool
Motivation: Performance: when HW is fast, SW becomes a bottleneck; move the kernel out of the stack. Speed up PMEM adoption: test your program with pmemfile (without any changes); if it performs better, consider rewriting it to use PMEM directly (i.e. libpmemobj).
Features: Strong consistency/atomicity guarantees (metadata and user data; could be limited to metadata only). Focused on performance (takes advantage of NVDIMM bandwidth/latency). Fine-grained granularity.
Design
Design: libpmemfile-posix: syscall-like API; can be directly used by applications. libpmemfile: transparent access to libpmemfile-posix pools thanks to syscall_intercept.
Design [diagram: GNU/Linux stack] User Space: User Applications, GNU C Library (glibc). Kernel Space: System Call Interface, Kernel, Architecture-Dependent Kernel Code. Hardware Platform.
Design [diagram: GNU/Linux stack with pmemfile] User Space: User Applications (write, pmemfile_write), glibc, pmemfile components (libpmemfile, syscall_intercept, libpmemfile-posix, libpmemobj), load/store access to NVDIMM. Kernel Space: System Call Interface (SYS_write), Kernel, Architecture-Dependent Kernel Code. Hardware Platform: NVDIMM.
Design: Syscall-like APIs vs. file I/O? With file I/O functions there are hundreds of functions to implement/intercept: open => open, open2, open64, ...; write => write, pwrite, fwrite, fprintf, ... What about statically linked libc? What about syscalls issued by the program itself? Drawbacks: intercepting system calls is not trivial.
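The "hundreds of functions" point can be made concrete with a small sketch. The function below (a hypothetical name, not part of pmemfile) writes to one file through four different libc entry points. Each of them ends up in the kernel's write-family system calls, which is why hooking at the syscall level covers all of them at once. Running a caller of this function under strace should show this; that is a suggestion, not something taken from the talk.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: writes 4 bytes each via write(), pwrite(),
 * fwrite() and fprintf(), then returns the resulting file size
 * (or -1 on any error). The file is removed before returning.
 */
long write_four_ways(const char *path)
{
	assert(path != NULL);

	/* Unbuffered, descriptor-based entry points. */
	int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, "aaaa", 4) != 4 || pwrite(fd, "bbbb", 4, 4) != 4) {
		close(fd);
		return -1;
	}
	if (close(fd) != 0)
		return -1;

	/* Buffered stdio entry points; data reaches the kernel on fclose(). */
	FILE *f = fopen(path, "a");
	if (f == NULL)
		return -1;
	int ok = fwrite("cccc", 1, 4, f) == 4 && fprintf(f, "%s", "dddd") == 4;
	if (fclose(f) != 0 || !ok)
		return -1;

	struct stat st;
	if (stat(path, &st) != 0)
		return -1;
	unlink(path);
	return (long)st.st_size;
}
```

Calling `write_four_ways("/tmp/four_ways.txt")` exercises four libc functions, yet a syscall-level hook only has to handle the write-family calls they end up in.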
Intercepting system calls: syscall_intercept provides a low-level interface for hooking Linux system calls in user space. Very simple API. Open source: https://github.com/pmem/syscall_intercept
Design: Builds on libpmemobj. DAX-enabled filesystem / Device DAX: memory-mapped files, direct access to persistent memory (load/store). Persistent memory allocator. Replication. Fail-safety: transactions, atomic operations.
libpmemfile-posix
libpmemfile-posix: File system for persistent memory. Runs in user space; no kernel overhead. Interfaces modeled after the corresponding POSIX interfaces for file management: about 60 pmemfile_* functions (open, openat, creat, close, link, unlink, ..., read, write, ...). Easier transition for application developers.
libpmemfile-posix
#include <libpmemfile-posix.h>
PMEMfile *pmemfile_open(PMEMfilepool *pfp, const char *pathname, int flags, ...);
ssize_t pmemfile_read(PMEMfilepool *pfp, PMEMfile *file, void *buf, size_t count);
PMEMfilepool - filesystem (pmemfile pool) handle, passed to each pmemfile_* function as the first argument
PMEMfile - (pmem)file descriptor
libpmemfile-posix: Multiple root directories: multiple, distinct directory trees in one pool; one pmemfile pool can handle multiple mount points.
unsigned pmemfile_root_count(PMEMfilepool *pfp);
PMEMfile *pmemfile_open_root(PMEMfilepool *pfp, unsigned index, int flags);
libpmemfile
libpmemfile: User space persistent memory file system which is automatically enabled when libpmemfile is pre-loaded. Nearly transparent access to files resident in persistent memory. Intercepts standard Linux glibc interfaces.
Create filesystem: mkfs-pmemfile path size. Creates a pmemfile pool ("filesystem image"); path should point to a pmem-aware filesystem or Device DAX.
$ mkfs-pmemfile /mnt/pmem/myfs 1G
or
$ mkfs-pmemfile /dev/dax1.0 0
Mount: pmemfile-mount path mount-point. Convenient way to "mount" a pmemfile pool / filesystem at a given location; libpmemfile reads mounts at load time.
$ sudo pmemfile-mount /mnt/pmem/myfs /tmp/mountpoint
Mount: If pmemfile-mount can't be used (e.g. no root privileges), set PMEMFILE_POOLS=/tmp/mountpoint:/dev/dax0.0. Files seen at /tmp/mountpoint/* are actually stored on the filesystem backed by /dev/dax0.0. Syscalls related to those files are transparently redirected to libpmemfile-posix.
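To illustrate the redirection rule above, here is a minimal sketch of how a "mount_point:pool" entry could be split and how a path could be tested against the mount point. Both function names are hypothetical and this is not libpmemfile's actual code; a real implementation also has to deal with relative paths, symlinks, the current working directory and more than one pool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split "mount_point:pool_path" at the first ':'.
 * Returns 0 on success, -1 if the entry is malformed or a buffer is too small.
 */
int parse_pool_entry(const char *entry, char *mount, size_t mount_sz,
		char *pool, size_t pool_sz)
{
	assert(entry != NULL && mount != NULL && pool != NULL);

	const char *colon = strchr(entry, ':');
	if (colon == NULL)
		return -1;

	size_t mount_len = (size_t)(colon - entry);
	size_t pool_len = strlen(colon + 1);
	if (mount_len == 0 || pool_len == 0 ||
	    mount_len >= mount_sz || pool_len >= pool_sz)
		return -1;

	memcpy(mount, entry, mount_len);
	mount[mount_len] = '\0';
	memcpy(pool, colon + 1, pool_len + 1); /* includes the terminating NUL */
	return 0;
}

/*
 * Hypothetical helper: returns 1 if the absolute path is the mount point
 * itself or lies below it, 0 otherwise.
 * Assumes mount has no trailing '/'.
 */
int path_under_mount(const char *path, const char *mount)
{
	size_t n = strlen(mount);

	if (strncmp(path, mount, n) != 0)
		return 0;
	/* Require a component boundary so /tmp/mountpoint2 does not match. */
	return path[n] == '\0' || path[n] == '/';
}
```

With the entry from the slide, `/tmp/mountpoint/dir/file` would be routed to the pool, while `/tmp/mountpoint2/file` would stay with the kernel.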
Example
$ alias pf='LD_PRELOAD=libpmemfile.so'
or
$ alias pf='LD_PRELOAD=libpmemfile.so \
PMEMFILE_POOLS=/tmp/mountpoint:/dev/dax0.0'
$ pf mkdir /tmp/mountpoint/dir_in_pmemfile
$ pf cp README.md /tmp/mountpoint/dir_in_pmemfile
$ pf ls -l /tmp/mountpoint/
total 0
drwxrwxrwx 2 user group 4008 Feb 16 17:46 dir_in_pmemfile
$ pf ls -l /tmp/mountpoint/dir_in_pmemfile
total 16
-rw-r--r-- 1 user group 1014 Feb 16 17:46 README.md
$ pf cat /tmp/mountpoint/dir_in_pmemfile/README.md | wc -c
$ ls -l /tmp/mountpoint/
total 0
$ ls -l /tmp/mountpoint/dir_in_pmemfile
ls: cannot access '/tmp/mountpoint/dir_in_pmemfile': No such file or directory
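The same idea applies to your own programs. The sketch below is ordinary POSIX code with no pmemfile-specific calls; the function name is made up for illustration. It uses only open, write, lseek, read and close, none of which appear on the talk's list of unsupported syscalls. Run normally, it goes through the kernel; run with LD_PRELOAD=libpmemfile.so and a path under a pmemfile mount point, the talk's claim is that the same calls are redirected to libpmemfile-posix.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: writes msg to path, seeks back to the start and
 * reads it into out (NUL-terminated). Returns the number of bytes read,
 * or -1 on error.
 */
long write_then_read(const char *path, const char *msg, char *out,
		size_t out_sz)
{
	assert(path != NULL && msg != NULL && out != NULL && out_sz > 0);

	size_t len = strlen(msg);
	int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, 0644);
	if (fd < 0)
		return -1;

	long got = -1;
	if (write(fd, msg, len) == (ssize_t)len &&
	    lseek(fd, 0, SEEK_SET) == 0) {
		ssize_t n = read(fd, out, out_sz - 1);
		if (n >= 0) {
			out[n] = '\0';
			got = (long)n;
		}
	}
	close(fd);
	return got;
}
```

A small main() that calls `write_then_read("/tmp/mountpoint/hello.txt", "hello pmem", buf, sizeof buf)` could then be compared with and without the preload to see where the file ends up.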
Limitations
There are many...
Limitations: No support for I/O event notification (epoll_*, inotify*, poll, select, ...). No extended attributes. All writes are synchronous (it's a feature actually!): no asynchronous I/O; flushes are not needed / no-op (sync, fsync, fdatasync, ...). No file locks (flock).
Limitations: Memory mapping is not supported (yet): mmap, munmap, msync, ... Can't execute program binaries stored in a pmemfile pool (because of mmap). No special files (mknod). ... and some other minor issues; see the libpmemfile man page for details.
Limitations: No multi-process access (or very limited): libpmemobj limitation; memory-mapped files (MAP_SHARED), no COW; workaround available (veeeery slow). Works only on Linux x86_64: other *NIX-like systems could be supported; syscall_intercept/libpmemobj work only on x86_64.
Limitations: Limited support for clone(). fork(): child process has no access to pmem files. vfork() not supported. No remote replication (not fail-safe).
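One reading of the "MAP_SHARED, no COW" bullet is that a pool mapped MAP_SHARED is not snapshotted on fork(): parent and child keep writing to the same memory. The sketch below shows only that kernel behaviour, using anonymous memory as a stand-in for a pool file; the function name is hypothetical and nothing here touches libpmemobj.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: maps one int with the given flag (MAP_SHARED or
 * MAP_PRIVATE), stores 1, forks, lets the child store 2, and returns the
 * value the parent sees afterwards (or -1 on error).
 */
int value_after_child_write(int map_flags)
{
	assert(map_flags == MAP_SHARED || map_flags == MAP_PRIVATE);

	int *cell = mmap(NULL, sizeof(*cell), PROT_READ | PROT_WRITE,
			map_flags | MAP_ANONYMOUS, -1, 0);
	if (cell == MAP_FAILED)
		return -1;
	*cell = 1;

	pid_t pid = fork();
	if (pid < 0) {
		munmap(cell, sizeof(*cell));
		return -1;
	}
	if (pid == 0) {
		/* Child: with MAP_PRIVATE this write hits a private COW copy. */
		*cell = 2;
		_exit(0);
	}

	int status;
	int result = -1;
	if (waitpid(pid, &status, 0) == pid)
		result = *cell;
	munmap(cell, sizeof(*cell));
	return result;
}
```

With MAP_SHARED the parent observes the child's write; with MAP_PRIVATE it does not, because the child wrote to its own copy-on-write page.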
vltrace: Tool for tracing applications and evaluating whether libpmemfile.so supports them (https://github.com/pmem/vltrace)
Performance results
Results: Not much difference for read-only workloads. Performs well for write-heavy workloads: small writes, appends. For large data transfers memcpy is the limit. Outperforms ext4+dax by up to 2x, depending on the workload.
Q&A
Backup
Limitations: Full list of unsupported syscalls: chroot, epoll_ctl, epoll_pwait, epoll_wait, fgetxattr, flistxattr, fremovexattr, fsetxattr, getsockname, getsockopt, inotify_add_watch, inotify_rm_watch, ioctl, lgetxattr, listxattr, lremovexattr, lsetxattr, madvise, mknod, mknodat, mmap, mount, mprotect, mremap, msync, munlock, munlockall, munmap, poll, ppoll, pselect, removexattr, select, setxattr, swapoff, tee, umount2, vfork
Build and install
git clone https://github.com/pmem/pmemfile
cd pmemfile
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr
make
sudo make install
cmake .. -DCMAKE_BUILD_TYPE=Debug -DDEVELOPER_MODE=1 \
-DTEST_DIR=/mnt/pmem/pmemfile-tests
...
ctest --output-on-failure