Closing the Performance Gap Between Volatile and Persistent K-V Stores

Yihe Huang, Harvard University; Matej Pavlovic, EPFL; Virendra Marathe, Oracle Labs; Margo Seltzer, Oracle Labs; Tim Harris, Oracle Labs; Steve Byan, Oracle Labs
ATC 2018, 07/13/2018, Boston MA, USA

Key-Value Stores: Persistent or Fast?
- Persistent and slow, or volatile and fast
- What about both?

Non-Volatile Memory
- Faster than HDD/SSD
- Slower than DRAM
- Byte-addressable
- Tricky to manage consistently
- DRAM K-V stores don't easily translate to NVM

Outline
- Cross-Referencing Logs: Getting the best of DRAM and NVM
- Bullet: Fast and persistent key-value store

Caching
- Volatile store (frontend) caches the persistent store (backend) in NVM
- How?

Reading: Standard
- Reads go to the volatile store (frontend)
- On a cache miss, look up the persistent store (backend) in NVM
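
The read path on this slide can be sketched as a conventional read-through cache. This is a minimal illustration, not Bullet's code: plain dicts stand in for the DRAM and NVM hash tables, the class and attribute names are hypothetical, and filling the frontend after a miss is an assumption the slide does not state.

```python
class CachedStore:
    """Toy frontend/backend pair; dicts stand in for the DRAM and NVM tables."""

    def __init__(self, backend):
        self.frontend = {}       # volatile store (DRAM in the talk)
        self.backend = backend   # persistent store (NVM in the talk)
        self.misses = 0

    def get(self, key):
        if key in self.frontend:           # cache hit: served from the frontend
            return self.frontend[key]
        self.misses += 1
        value = self.backend.get(key)      # cache miss: look up the backend
        if value is not None:
            self.frontend[key] = value     # assumption: fill the cache on a miss
        return value


store = CachedStore({"k": "v"})
first = store.get("k")    # miss, then fill
second = store.get("k")   # hit
```

Only the first `get` touches the backend; the second is served from the frontend.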

Writing: How?
Logs sit between the volatile store (frontend) and the persistent store (backend) in NVM.
1. Persistently log the operation
2. Update the frontend
3. Return to the client
4. Update the backend in the background
Challenges: persistence, low latency, scalability, consistency
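
The four write steps can be sketched as follows. This is a single-threaded toy under stated assumptions: a `deque` stands in for a persistent log in NVM, the names `LoggedStore`, `put`, and `drain` are hypothetical, and `drain` is called explicitly here where the talk uses background work.

```python
from collections import deque


class LoggedStore:
    """Toy write path: log first, update the cache, apply to the backend later."""

    def __init__(self):
        self.log = deque()    # stand-in for a persistent log in NVM
        self.frontend = {}    # volatile store
        self.backend = {}     # persistent store

    def put(self, key, value):
        self.log.append((key, value))   # 1. persistently log the operation
        self.frontend[key] = value      # 2. update the frontend
        return True                     # 3. return to the client

    def drain(self):
        """4. Apply logged operations to the backend (background work in the talk)."""
        while self.log:
            key, value = self.log.popleft()
            self.backend[key] = value


store = LoggedStore()
store.put("a", 1)
backend_before = dict(store.backend)   # still empty: the backend lags the frontend
store.drain()
```

The client sees the write as complete after step 3, so the backend update stays off the write's critical path.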

Cross-Referencing Logs
- Persistence: place logs in NVM
- Low latency: few NVM accesses per log append
- Scalability: multiple logs (one per thread)
- Consistency: cross-log references

Cross-Referencing Logs
Log entry fields: Key, OP, Applied, Prev
[Animated diagram: entries are appended at one end of the logs and consumed at the other. Entries shown: Key 1, OP: put A; Key 2, OP: put B; Key 2, OP: delete; Key 3, OP: delete; Key 2, OP: put C. Successive frames add an "Applied, Pr:" annotation first to the Key 3 delete entry, then to the Key 2 entries (put B, delete, put C).]
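
One way to read the animation is that each entry's Prev field links to the previous entry for the same key, possibly in a different log, and that a consumer follows the chain so older operations reach the backend first. The sketch below illustrates that reading only. The `latest` index, the method names, and the assignment of entries to logs are assumptions, and the real logs live in NVM and are accessed concurrently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    key: int
    op: str                           # "put" or "delete"
    value: Optional[str] = None
    applied: bool = False
    prev: Optional["Entry"] = None    # previous entry for the same key, in any log


class XRefLogs:
    def __init__(self, num_logs):
        self.logs = [[] for _ in range(num_logs)]
        self.latest = {}              # key -> newest entry (hypothetical helper index)
        self.backend = {}

    def append(self, log_id, key, op, value=None):
        entry = Entry(key, op, value, prev=self.latest.get(key))
        self.logs[log_id].append(entry)
        self.latest[key] = entry
        return entry

    def consume(self, entry):
        chain = []
        while entry is not None and not entry.applied:
            chain.append(entry)       # collect unapplied predecessors
            entry = entry.prev
        for e in reversed(chain):     # apply oldest first
            if e.op == "put":
                self.backend[e.key] = e.value
            else:
                self.backend.pop(e.key, None)
            e.applied = True


logs = XRefLogs(num_logs=2)
put_a = logs.append(0, 1, "put", "A")
put_b = logs.append(0, 2, "put", "B")
delete_2 = logs.append(1, 2, "delete")
logs.append(1, 3, "delete")
put_c = logs.append(0, 2, "put", "C")
logs.consume(put_c)   # applies put B, then delete, then put C for key 2
```

Consuming the newest Key 2 entry pulls in the two older Key 2 entries, which sit in different logs, and leaves the Key 1 entry untouched.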

Outline
- Cross-Referencing Logs: Getting the best of DRAM and NVM
- Bullet: Fast and persistent key-value store
  - Basic design
  - Optimizations

Bullet K-V Store
[Diagram: the volatile store (frontend) caches the persistent store (backend) in NVM.]

Experimental Setup
- 16 CPU cores, 512 GB DRAM
- Intel's NVM emulation platform
- Zipfian distribution of keys (YCSB)
- Measuring 99th-percentile latency
- Comparing against HiKV

[Figure: Raw Hash Table Performance]

Bullet K-V Store
[Diagram, built up over several frames: the volatile store (frontend) and the persistent store (backend) are connected through the X-Ref Logs, with writers and gleaners added in the last frames.]

[Figure: Bullet Performance, read-only workload. Latency in ns (0 to 800) versus throughput in ops/sec (0 to 25,000,000). Series: volatile, phash, hikv-ht, bullet-st.]

[Figure: Bullet Performance, 15% writes. Latency in ns (0 to 1800) versus throughput in ops/sec (0 to 14,000,000). Series: volatile, phash, hikv-ht, bullet-st.]

Optimization: Dynamic Worker Threads
[Diagram: volatile store (frontend), writers, X-Ref Logs, gleaners, persistent store (backend). Separate frames show a read-heavy workload and a write-heavy workload.]

Optimization: Overwriting
Log entry fields: Key, OP, Applied, Prev
[Animated diagram: the same log contents as in the Cross-Referencing Logs example (Key 1 put A; Key 2 put B; Key 2 delete; Key 3 delete, annotated "Applied, Pr:"; Key 2 put C). In the final frame the two older Key 2 entries (delete and put B) appear to be labelled IGNORE, leaving put C as the Key 2 operation to consume.]
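
The IGNORE labels suggest that when several pending operations target the same key, only the newest needs to reach the backend. The sketch below illustrates that idea under assumptions: `Entry` mirrors the log entry fields on the slide, `consume_with_overwrite` is a hypothetical name, and marking skipped entries as applied is one possible way to record that they are superseded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    key: int
    op: str                           # "put" or "delete"
    value: Optional[str] = None
    applied: bool = False
    prev: Optional["Entry"] = None    # previous entry for the same key


def consume_with_overwrite(entry, backend):
    """Apply only this entry's op; mark older pending ops for the key as done.

    Returns the number of backend writes performed (0 or 1).
    """
    if entry.applied:
        return 0
    if entry.op == "put":
        backend[entry.key] = entry.value
    else:
        backend.pop(entry.key, None)
    while entry is not None and not entry.applied:
        entry.applied = True          # older entries are superseded, so skip them
        entry = entry.prev
    return 1


put_b = Entry(2, "put", "B")
delete_2 = Entry(2, "delete", prev=put_b)
put_c = Entry(2, "put", "C", prev=delete_2)
backend = {}
writes = consume_with_overwrite(put_c, backend)
```

Three logged operations on Key 2 cost one backend write instead of three, and a later attempt to consume the delete entry is a no-op.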

[Diagram: Optimization: Overwriting. Volatile store (frontend), writers, X-Ref Logs, gleaners, persistent store (backend).]

[Figure: Bullet Performance, read-only workload. Latency in ns (0 to 800) versus throughput in ops/sec (0 to 25,000,000). Series: volatile, phash, hikv-ht, bullet-st, bullet-dyn-ovr.]

[Figure: Bullet Performance, 15% writes. Latency in ns (0 to 1800) versus throughput in ops/sec (0 to 14,000,000). Series: volatile, phash, hikv-ht, bullet-st, bullet-dyn-ovr.]

[Figure: Bullet Performance, 2% writes. Latency in ns (0 to 1000) versus throughput in ops/sec (0 to 25,000,000). Series: volatile, phash, hikv-ht, bullet-st, bullet-dyn-ovr.]

Summary
- DRAM cache for NVM
  - DRAM frontend, NVM backend
  - Novel caching scheme: cross-referencing logs
- Bullet K-V store
  - Almost identical hash tables in DRAM and NVM
  - DRAM performance for reads
  - Substantial latency improvements for writes
- More details in the paper!

Backup Slides

Optimization Summary
- Dynamic worker threads
- Overwrites
- Non-blocking reads (see paper)
- Backend access off critical path (see paper)

[Figures: Bullet Performance for the read-only, 2% writes, 15% writes, and 50% writes workloads.]

[Figures: Bullet Performance (end-to-end) for the read-only, 2% writes, 15% writes, and 50% writes workloads.]

[Figures: Latency Distribution for reads and for writes.]

[Figure: Log Size Sensitivity]

Experimental Setup
- 16 CPU cores
- 512 GB DRAM
- NVM simulated using DRAM
  - 384 GB (out of 512 GB)
  - 300 ns load (2x DRAM)
  - 100 ns persist barrier
  - Same throughput
- DRAM cache fits all data
- Zipfian distribution of keys (YCSB)
- 50M K/V pairs, 16 B keys, 100 B values
- Measuring 99th-percentile latency
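
A quick calculation makes the "DRAM cache fits all data" point concrete. It uses only the numbers on the slide. The variable names are illustrative, the payload figure ignores hash-table and allocator overhead, and reading "2x DRAM" as a 150 ns DRAM load is an interpretation.

```python
NUM_PAIRS = 50_000_000       # 50M K/V pairs
KEY_BYTES = 16
VALUE_BYTES = 100
TOTAL_DRAM_GB = 512
SIMULATED_NVM_GB = 384
NVM_LOAD_NS = 300
NVM_TO_DRAM_RATIO = 2        # "2x DRAM" on the slide

payload_bytes = NUM_PAIRS * (KEY_BYTES + VALUE_BYTES)   # raw keys plus values
payload_gb = payload_bytes / 1e9                        # decimal gigabytes
remaining_dram_gb = TOTAL_DRAM_GB - SIMULATED_NVM_GB    # DRAM not used as fake NVM
implied_dram_load_ns = NVM_LOAD_NS / NVM_TO_DRAM_RATIO

print(f"raw payload: {payload_gb:.1f} GB")
print(f"DRAM left after carving out NVM: {remaining_dram_gb} GB")
print(f"implied DRAM load latency: {implied_dram_load_ns:.0f} ns")
```

The raw payload is about 5.8 GB, a small fraction of the 128 GB of DRAM left after 384 GB is set aside to simulate NVM.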

[Figure: Dynamic Worker Thread Switching]

Cross-Referencing Logs
- Persistence: circular buffers in NVM
- Low latency: fast appends (3 persist barriers)
- Scalability
  - Multiple logs, stall only when all full
  - Coarse-grained synchronization
  - Epoch-based space reclamation
- Consistency: cross-log references

Cross-Referencing Logs Summary
- Persistent circular buffers (in NVM)
- Cross-log pointers for consistency
- Coarse-grained synchronization
- Fast appends
- Epoch-based space reclamation
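
The circular-buffer and "stall only when all full" points can be illustrated with a small single-threaded sketch. It is not the paper's implementation: the class and function names are hypothetical, persist barriers and synchronization are omitted, and a simple tail advance over applied entries stands in for epoch-based space reclamation.

```python
class CircularLog:
    """Fixed-capacity ring of log slots with monotonically increasing positions."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # position of the next append
        self.tail = 0    # position of the oldest slot not yet reclaimed

    def is_full(self):
        return self.head - self.tail == len(self.slots)

    def append(self, record):
        if self.is_full():
            return False
        self.slots[self.head % len(self.slots)] = {"record": record, "applied": False}
        self.head += 1
        return True

    def mark_applied(self, position):
        self.slots[position % len(self.slots)]["applied"] = True

    def reclaim(self):
        """Free applied slots at the tail (stand-in for epoch-based reclamation)."""
        freed = 0
        while self.tail < self.head and self.slots[self.tail % len(self.slots)]["applied"]:
            self.tail += 1
            freed += 1
        return freed


def append_any(logs, record):
    """Try each log in turn; return its index, or None when every log is full."""
    for index, log in enumerate(logs):
        if log.append(record):
            return index
    return None


logs = [CircularLog(1), CircularLog(1)]
placements = [append_any(logs, r) for r in ("op1", "op2", "op3")]
logs[0].mark_applied(0)
freed = logs[0].reclaim()
retry = append_any(logs, "op3")
```

The third append finds both one-slot logs full and has to wait. Once the first log's entry is applied and reclaimed, the retry succeeds.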

[Figure: read-only workload. Latency in ns (0 to 800) versus throughput in ops/sec (0 to 25,000,000). Series: volatile, phash, hikv-ht, bullet-st, bullet-dyn, bullet-dyn-ovr.]