* Contributed while interning at SAP. September 1 st, 2017 PUBLIC

Similar documents
Instant Recovery for Main-Memory Databases

Adaptive Recovery for SCM-Enabled Databases

Memory Management Techniques for Large-Scale Persistent-Main-Memory Systems [VLDB 2017]

STORAGE LATENCY x. RAMAC 350 (600 ms) NAND SSD (60 us)

Strata: A Cross Media File System. Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, Thomas Anderson

Accelerating Microsoft SQL Server Performance With NVDIMM-N on Dell EMC PowerEdge R740

An Approach for Hybrid-Memory Scaling Columnar In-Memory Databases

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

Leveraging Modern Hardware in SAP HANA How SAP and Intel collaborate to Lead the Edge

Rethink the Sync 황인중, 강윤지, 곽현호. Embedded Software Lab. Embedded Software Lab.

I/O Profiling Towards the Exascale

NEXTGenIO Performance Tools for In-Memory I/O

Module Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies.

Aerie: Flexible File-System Interfaces to Storage-Class Memory [Eurosys 2014] Operating System Design Yongju Song

Using Transparent Compression to Improve SSD-based I/O Caches

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

January 28-29, 2014 San Jose

No compromises: distributed transactions with consistency, availability, and performance

Falcon: Scaling IO Performance in Multi-SSD Volumes. The George Washington University

Analyzing I/O Performance on a NEXTGenIO Class System

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

Accelerate Applications Using EqualLogic Arrays with directcache

Column Stores vs. Row Stores How Different Are They Really?

PebblesDB: Building Key-Value Stores using Fragmented Log Structured Merge Trees

BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory

Architecture of a Real-Time Operational DBMS

Azor: Using Two-level Block Selection to Improve SSD-based I/O caches

Hewlett Packard Enterprise HPE GEN10 PERSISTENT MEMORY PERFORMANCE THROUGH PERSISTENCE

HP ProLiant DL380 Gen8 and HP PCle LE Workload Accelerator 28TB/45TB Data Warehouse Fast Track Reference Architecture

Team SimpleMind. Ismail Oukid (TU Dresden), Ingo Müller (KIT), Iraklis Psaroudakis (EPFL) ACM SIGMOD 2015 Programming SIGMOD 2015 (June 2)

Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories

Recovering Disk Storage Metrics from low level Trace events

NV-Tree Reducing Consistency Cost for NVM-based Single Level Systems

BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE

WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems

DATABASES AND THE CLOUD. Gustavo Alonso Systems Group / ECC Dept. of Computer Science ETH Zürich, Switzerland

Soft Updates Made Simple and Fast on Non-volatile Memory

NOVA: The Fastest File System for NVDIMMs. Steven Swanson, UC San Diego

WHITEPAPER. Improve PostgreSQL Performance with Memblaze PBlaze SSD

SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory

Efficient Memory Mapped File I/O for In-Memory File Systems. Jungsik Choi, Jiwon Kim, Hwansoo Han

The Btrfs Filesystem. Chris Mason

Implica(ons of Non Vola(le Memory on So5ware Architectures. Nisha Talagala Lead Architect, Fusion- io

Field Testing Buffer Pool Extension and In-Memory OLTP Features in SQL Server 2014

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

JOURNALING techniques have been widely used in modern

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es

Kaladhar Voruganti Senior Technical Director NetApp, CTO Office. 2014, NetApp, All Rights Reserved

Loose-Ordering Consistency for Persistent Memory

Evaluation Report: HP StoreFabric SN1000E 16Gb Fibre Channel HBA

Firebird Tour 2017: Performance. Vlad Khorsun, Firebird Project

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Design and Implementation of a Random Access File System for NVRAM

Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet

White Paper. File System Throughput Performance on RedHawk Linux

Join Processing for Flash SSDs: Remembering Past Lessons

System Software for Persistent Memory

Exploiting the benefits of native programming access to NVM devices

CA485 Ray Walshe Google File System

Single-pass restore after a media failure. Caetano Sauer, Goetz Graefe, Theo Härder

stec Host Cache Solution

Column-Stores vs. Row-Stores. How Different are they Really? Arul Bharathi

March NVM Solutions Group

HYRISE In-Memory Storage Engine

Innodb Performance Optimization

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

Accelerating OLTP performance with NVMe SSDs Veronica Lagrange Changho Choi Vijay Balakrishnan

Design of Flash-Based DBMS: An In-Page Logging Approach

Flavors of Memory supported by Linux, their use and benefit. Christoph Lameter, Ph.D,

88X + PERFORMANCE GAINS USING IBM DB2 WITH BLU ACCELERATION ON INTEL TECHNOLOGY

A Breakthrough in Non-Volatile Memory Technology FUJITSU LIMITED

Dynamic Interval Polling and Pipelined Post I/O Processing for Low-Latency Storage Class Memory

Non-Volatile Memory Through Customized Key-Value Stores

In-Memory Data Management

Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA Copyright 2003, SAS Institute Inc. All rights reserved.

Closing the Performance Gap Between Volatile and Persistent K-V Stores

Operating System Supports for SCM as Main Memory Systems (Focusing on ibuddy)

WALL: A Writeback-Aware LLC Management for PCM-based Main Memory Systems

A Case Study: Performance Evaluation of a DRAM-Based Solid State Disk

Flash Memory Summit Persistent Memory - NVDIMMs

Redesigning LSMs for Nonvolatile Memory with NoveLSM

Disclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme

Crescando: Predictable Performance for Unpredictable Workloads

Deukyeon Hwang UNIST. Wook-Hee Kim UNIST. Beomseok Nam UNIST. Hanyang Univ.

Data Processing at the Speed of 100 Gbps using Apache Crail. Patrick Stuedi IBM Research

Beyond Block I/O: Rethinking

Oracle Exadata: Strategy and Roadmap

IaaS Vendor Comparison

Incrementally Parallelizing. Twofold Speedup on a Quad-Core. Thread-Level Speculation. A Case Study with BerkeleyDB. What Am I Working on Now?

SMORE: A Cold Data Object Store for SMR Drives

NVthreads: Practical Persistence for Multi-threaded Applications

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

IBM V7000 Unified R1.4.2 Asynchronous Replication Performance Reference Guide

How In-memory Database Technology and IBM soliddb Complement IBM DB2 for Extreme Speed

Why Does Solid State Disk Lower CPI?

SRM-Buffer: An OS Buffer Management SRM-Buffer: An OS Buffer Management Technique toprevent Last Level Cache from Thrashing in Multicores

MySQL Database Scalability

Firebird Tour 2017: Performance. Vlad Khorsun, Firebird Project

Efficient Storage Management for Aged File Systems on Persistent Memory

Transcription:

Adaptive Recovery for SCM-Enabled Databases Ismail Oukid (TU Dresden & SAP), Daniel Bossle* (SAP), Anisoara Nica (SAP), Peter Bumbulis (SAP), Wolfgang Lehner (TU Dresden), Thomas Willhalm (Intel) * Contributed while interning at SAP ADMS@VLDB, September 1 st, 2017 PUBLIC

Non-Volatile Memory Non- Volatile We assume hardware-based wear-leveling Limited write endurance Low, asymmetric latency Writes noticeably slower than reads NVM Byteaddressable Denser than DRAM More capacity and cheaper than DRAM 3 TB per socket for first-gen 3D XPoint Energy efficient NVM is a merging point between main memory and storage 2

Architecting NVM An NVM-optimized filesystem provides zero-copy mmap Direct access to NVM via load/store instructions Several filesystem proposals: NOVA, PMFS, SCMFS, etc. Linux ext4 and xfs already provide Direct Access support NVM may become a universal memory 3

SOFORT: A Hybrid NVM-DRAM Storage Engine Primary data persisted in and accessed from NVM Secondary data can be persistent, transient, or hybrid NVM enables a single-level database storage architecture 4

Recovery time [sec] Normalized throughput Recovery Time 100 10 1 TPC-C (128 Warehouse) 166,4 Disk SSD NVM 234,9 237,8 21,7 21,7 19,7 8,7 1,87 0,3 1 0,8 0,6 0,4 0,2 TPC-C (128 Warehouse) 17% 30% 0,1 Sofort-STXTree Sofort-FPTree Sofort-wBTree 0 Sofort-STXTree Sofort-FPTree Sofort-wBTree Rebuilding secondary data is the new recovery bottleneck Goal: improve recovery without compromising query performance 5

Synchronous Recovery Recovery protocol 1. Recover primary data 2. Undo in-flight TXs 3. Rebuild secondary data 4. Accept queries + Secondary data rebuilt as fast as possible - System is not responsive during recovery 6

Asynchronous Recovery (aka Instant Recovery) Intuition: Primary data is sufficient to answer queries Accept queries right after recovering primary data During recovery Dictionary index lookup replaced with dictionary array scan Column index lookups replaced by column scans CPU resources split between query processing and recovery + Near-instant responsiveness of the database - It takes longer to reach pre-failure throughput 7

Adaptive Recovery Intuition 1: Secondary data structures are not equally important Recover indexes in the order of their importance Intuition 2: Some secondary data structures are not useful for the currently running workload Release recovery resources after rebuilding important indexes How to decide the importance of secondary data structures? 8

Index Benefit Functions Benefit_indep Computes the benefit of an index for a query plan independently of other indexes Benefit_dep Computes the benefit of an index while captures its dependencies to currently available indexes S: Set of all indexes s: Considered index Q: Considered query Benefit(s,Q,S) = Cost(Q,SNVM)-Cost(Q,SNVM U{s}) SNVM: Set of persistent indexes Benefit(s,Q,S) = Cost(Q, S(tQ))-Cost(Q,S(tQ) U{s}) S(tQ): Set of available indexes at time tq 9

Ranking of Secondary Data Structures Workload before failure (WoPast) Workload during recovery (WoRestart) Time WoPast and WoRestart can be similar or different A ranking function must take both into consideration Weighted benefits on WoPast and WoRestart rank(s,t) = α(n)*benefit(s,wopast,s) +(1-α(n))*Benefit(s,WoRestart(t),S) -rebuild(s) Cost of index rebuild e.g., α(n) = α n and 0 < α < 1 WoPast s weight decays with the number of statements in WoRestart 10

Evaluation Setup Intel NVM Emulator Intel Xeon E5 @2.60Ghz, 20MB L3 cache, 8 physical cores Benchmarks run on a single socket 11

Experimental Setup (Cont d) TATP Benchmark 80% Get Subscriber Data 20% Update Location TPC-C Benchmark 50% Order Status 50% Stock Level Sofort Configuration 8 users (threads) All indexes in DRAM 12

Recovery Strategies Recovery workload same as pre-failure workload Asynchronous recovery worse than synchronous recovery! Pre-failure throughput regained before the end of recovery 13

Workload Change During Recovery Pre-failure workload: Full TATP and TPC-C mixes Recovery workload: only TATP and TPC-C queries presented earlier Adaptive recovery adapts well to workload changes 14

Worst-Case Analysis Synthetic benchmark 10 tables, 10 columns and 1 Mio row each All columns uniformly queried Resources are released as soon as the job queue is empty Synchronous recovery is a lower bound for adaptive recovery 15

Conclusion NVM enables a single-level database storage architecture Rebuilding secondary data is the new recovery bottleneck Regaining pre-failure throughput near-instantly possible, but at a significant query performance cost When query performance is paramount, adaptive recovery allows to swiftly regains pre-failure throughput 16

Thank you. Contact information: Ismail Oukid ismail.oukid@sap.com

Resource Allocation Recovery workload same as pre-failure workload Only a subset of indexes is relevant to the workload Adapting resources significantly improves recovery performance 18