Beyond Block I/O: Rethinking Traditional Storage Primitives
Xiangyong Ouyang*, David Nellans, Robert Wipfel, David Flynn, D. K. Panda*
* The Ohio State University; Fusion-io

Agenda
- Introduction and Motivation
- Solid State Storage (SSS) Characteristics
- Duplicated Efforts at SSS and Upper Layers
- Atomic Write Primitive within the FTL
- Leveraging Atomic Write in DBMS: Example with MySQL
- Experimental Results
- Conclusion and Future Work

Evolution of Storage Devices
- The interface to persistent storage has remained unchanged for decades: seek, read, write
- It fits well with mechanical hard disks
- Solid State Storage (SSS) merits: fast random access and high throughput; low power consumption; shock resistance and small form factor
- Yet SSS exposes the same disk-based block I/O interface, which raises challenges

NAND-Flash-Based Solid State Storage (SSS)
Pitfalls:
- Asymmetric read/write latency
- Cannot overwrite before erasure; erasure happens in large units (64-256 pages) and is very slow (1+ ms)
- Flash wear-out: limited write durability (SLC: 30K erase/program cycles; MLC: 3K erase/program cycles)
Flash Translation Layer (FTL):
- Input: Logical Block Address (LBA); output: Physical Block Address (PBA)
- Stack: Applications -> OS / File System -> Flash Translation Layer -> Flash Media

Log-Structured FTL
[diagram: an LBA -> PBA mapping table redirects logical blocks onto an append-only log over the physical blocks, with log head and log tail pointers]

Log-Structured FTL: Write Request
[diagram: an incoming write request is appended at the log tail and the LBA -> PBA mapping entries are redirected to the newly written physical blocks]
Log FTL advantages:
- Avoids in-place updates (block remapping)
- Even wear leveling
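
The remapping behavior above can be sketched as a toy model; `LogFTL` and its fields are illustrative names for this sketch, not the actual driver code:

```python
# A minimal sketch of a log-structured FTL: every write, even an overwrite
# of an existing LBA, is appended at the log tail, and the LBA -> PBA
# mapping is redirected to the new block instead of updating in place.
class LogFTL:
    def __init__(self, num_blocks):
        self.flash = [None] * num_blocks   # physical blocks (append-only log)
        self.mapping = {}                  # LBA -> PBA table
        self.tail = 0                      # next free physical block

    def write(self, lba, data):
        pba = self.tail                    # always program the next free block
        self.flash[pba] = data
        self.mapping[lba] = pba            # remap; the old PBA becomes stale
        self.tail += 1
        return pba

    def read(self, lba):
        return self.flash[self.mapping[lba]]
```

Writing the same LBA twice lands in two different physical blocks, which is exactly how the log FTL avoids in-place updates and spreads wear across the medium.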

Duplicated Efforts at Upper Layers and the FTL
- Multi-versioning at the upper layers: DBMS (transactional log); file systems (metadata journaling, copy-on-write); both exist to achieve write atomicity (ACID: Atomicity, Consistency, Isolation, Durability)
- Block remapping at the FTL: avoids in-place updates in the critical path
- Common thread: multiple versions of the same data; why duplicate this effort?
- Proposed approach: offload the write-atomicity guarantee to the FTL and provide an Atomic Write primitive to the upper layers

Atomic Write: A New Block I/O Primitive
- Offloads the write-atomicity guarantee into the FTL
- Combines multi-block writes into one logical group (contiguous or non-contiguous)
- Commits the group as an atomic unit if the compound operation succeeds
- Rolls back the whole group if any individual write fails

Atomic Write (1): Flag Bit in the Block Header
- One flag bit per block header identifies blocks belonging to the same atomic group
- A non-atomic write carries flag 1; an atomic write group carries flags 0 0 ... 1, with the final block of the group carrying the 1
[diagram: flag bits over the LBA and PBA rows of the log, with the log tail marked]
- Non-atomic writes are not allowed to interleave with an Atomic Write

Atomic Write (2): Deferred Mapping Table Update
- The mapping table update is deferred so that partial state is never exposed to readers
[diagram: an incoming atomic write group for LBAs 4, 6, 8 is appended at the log tail; the LBA -> PBA mapping entries are updated only after the whole group is written]
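
A minimal sketch of the deferred update, with illustrative names and a simplified single-threaded model of the device:

```python
# Sketch of the deferred mapping-table update: blocks of an atomic group
# are appended to the log first, and the shared LBA -> PBA mapping is
# published only once the whole group is written, so a reader never
# observes a partially applied group.
class AtomicLogFTL:
    def __init__(self, num_blocks):
        self.flash = [None] * num_blocks   # append-only log of physical blocks
        self.mapping = {}                  # LBA -> PBA table seen by readers
        self.tail = 0

    def atomic_write(self, writes):        # writes: list of (lba, data) pairs
        staged = []
        for lba, data in writes:           # phase 1: append data to the log
            self.flash[self.tail] = data
            staged.append((lba, self.tail))
            self.tail += 1
        for lba, pba in staged:            # phase 2: publish the new mapping
            self.mapping[lba] = pba

    def read(self, lba):
        pba = self.mapping.get(lba)
        return None if pba is None else self.flash[pba]
```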

Atomic Write (3): Failure Recovery
- (1) Failure during writing leaves an incomplete atomic write group: scan the log backwards from the tail, discard blocks whose flag bit is 0, and roll back the partial blocks to the previous version
- (2) Failure after writing (a complete atomic write group): scan the log from the beginning and rebuild the FTL mapping
- (3) Failure while updating the FTL mapping: the log tail contains the closing flag bit, so recovery proceeds the same as in (2)
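
The backward scan and replay can be sketched as follows; the log-entry layout `(lba, data, flag)` is an assumption made for this sketch:

```python
# Recovery sketch under the flag-bit scheme described above (simplified):
# blocks inside an atomic group carry flag 0 and the group's final block
# carries flag 1, so an incomplete group shows up as a run of 0-flag
# blocks at the log tail with no closing 1.
def recover(log):
    """Scan backwards, drop a trailing incomplete atomic group, then replay
    the surviving log from the beginning to rebuild the LBA -> PBA mapping."""
    end = len(log)
    while end > 0 and log[end - 1][2] == 0:   # trailing 0-flags: partial group
        end -= 1                              # discard the partial blocks
    mapping = {}
    for pba in range(end):                    # rebuild the mapping by replay
        lba, _data, _flag = log[pba]
        mapping[lba] = pba                    # later entries win, as in the log FTL
    return mapping, end
```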

Proposed Storage Stack
[diagram: Applications, DBMS, and the File System sit on a generalized solid-state storage layer that provides write atomicity, wear leveling, and more, on top of solid state storage]
Example: leveraging Atomic Write in a DBMS (MySQL)

DoubleWrite with the MySQL InnoDB Storage Engine
- Dirty buffer pool pages are flushed to the table file on memory pressure, commit(), or a timeout
- Phase I: pages go from memory to the DoubleWrite area on stable storage; Phase II: pages go to the TableSpace area
- Every data page is written twice, which hurts performance
- Doubling the amount of writes to the flash media halves the device's lifespan
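
The two-phase flush can be sketched in a few lines; `doublewrite_flush` and its arguments are illustrative for this sketch, not InnoDB's actual code:

```python
# Sketch of InnoDB-style doublewrite: each dirty page is first written to a
# doublewrite area and made durable, then written to its final tablespace
# location. Every page hits stable storage twice, but a torn page in the
# tablespace can be repaired from the doublewrite copy after a crash.
def doublewrite_flush(dirty_pages, doublewrite_area, tablespace):
    # Phase I: write all pages sequentially into the doublewrite area.
    for slot, (page_no, data) in enumerate(dirty_pages):
        doublewrite_area[slot] = (page_no, data)
    # (a real engine would fsync() the doublewrite area here)
    # Phase II: write each page to its home location in the tablespace.
    for page_no, data in dirty_pages:
        tablespace[page_no] = data
    # Total page writes issued = 2x the size of the dirty page set.
    return 2 * len(dirty_pages)
```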

MySQL InnoDB with Atomic Write
int atomic_write(int fd, void *buf[], long length[], long offsets[], int num);
- Reduces the data written by half and doubles the effective wear-out life
- Simplifies the upper-layer design and gives better performance
- Guarantees the same level of data integrity as DoubleWrite
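
How the call might be used can be sketched with a Python stand-in for the C prototype; `atomic_write`, `flush_dirty_pages`, and the dict-based device here are all hypothetical illustrations, not the real driver interface:

```python
# Illustrative mock of the proposed C call
#   int atomic_write(int fd, void *buf[], long length[], long offsets[], int num);
# showing how a buffer-pool flush shrinks to one compound request: the
# device either applies every (buffer, length, offset) triple or none.
def atomic_write(device, bufs, lengths, offsets):
    assert len(bufs) == len(lengths) == len(offsets)
    staged = {}
    for buf, length, off in zip(bufs, lengths, offsets):
        if len(buf) != length:                 # any bad member fails the group
            return -1                          # rollback: nothing is applied
        staged[off] = buf
    device.update(staged)                      # commit the group as one unit
    return 0

def flush_dirty_pages(device, pages, page_size=16384):
    """Flush dirty pages (page_no -> bytes) as a single atomic group,
    writing each page exactly once instead of twice as with doublewrite."""
    bufs = list(pages.values())
    offsets = [page_no * page_size for page_no in pages]
    return atomic_write(device, bufs, [len(b) for b in bufs], offsets)
```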

Experiment Setup
- Fusion-io 320 GB MLC NAND-flash-based device; Atomic Write implemented in a research branch of the v2.x Fusion-io driver
- MySQL 5.1.49 InnoDB, extended with Atomic Write
- 2 machines connected with GigE; both the transaction log and the table file stored on solid state
- Processor: Xeon X3210 @ 2.13 GHz; DRAM: 8 GB DDR2 667 MHz (4 x 2 GB)
- Boot device: 250 GB SATA II 3.0 Gb/s; DB storage device: Fusion-io ioDrive 320 GB, PCIe 1.0 x4 lanes
- OS: Ubuntu 9.10, Linux kernel 2.6.33

Micro-Benchmark
Different write mechanisms:
- Synchronous: write() + fsync()
- Asynchronous: libaio
- Atomic Write
Different write patterns: sequential, strided, random
Buffer strategies:
- Buffered I/O: through the OS page cache
- Direct I/O: bypasses the OS page cache
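
The synchronous baseline can be sketched with Python's os module; this covers the buffered-I/O configuration only (O_DIRECT alignment handling is omitted for brevity), and `sync_write` is an illustrative name:

```python
import os

# The synchronous write mechanism from the microbenchmark: each block is
# written with write()/pwrite() and then forced to stable storage with
# fsync() before the next block is issued.
def sync_write(path, blocks, block_size=512):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for i, block in enumerate(blocks):
            os.pwrite(fd, block, i * block_size)   # write one block at its offset
            os.fsync(fd)                           # force it to stable storage
    finally:
        os.close(fd)
```

The per-block fsync() is what makes the synchronous latencies in the next slide so much higher than the asynchronous and atomic-write numbers.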

I/O Microbenchmark: Latency
Write latency in us (lower is better), 64 blocks of 512 B each:

Pattern     Buffering   Sync   Async   Atomic Write
Random      Buffered    4042   2       NA
Random      Direct I/O  3542   85      67
Strided     Buffered    4006   46      NA
Strided     Direct I/O  3447   857     669
Sequential  Buffered    3955   330     NA
Sequential  Direct I/O  3402   898     685

Atomic Write issues all blocks in one compound write; synchronous write is write() + fsync(); asynchronous write uses Linux libaio.

I/O Microbenchmark: Bandwidth
Write bandwidth in MB/s (higher is better), 64 blocks of 16 KB each:

Pattern     Buffering   Sync   Async   Atomic Write
Random      Buffered    302    30      NA
Random      Direct I/O  22     505     53
Strided     Buffered    306    300     NA
Strided     Direct I/O  27     503     53
Sequential  Buffered    308    304     NA
Sequential  Direct I/O  23     507     54

Atomic Write issues all blocks in one compound write; synchronous write is write() + fsync(); asynchronous write uses Linux libaio.

Transaction Throughput
[chart: normalized transaction throughput for stock MySQL, DoubleWrite disabled, and Atomic Write across TPC-C, TPC-H, and SysBench]
- DoubleWrite disabled: 8% improvement (not ACID compliant)
- Atomic Write: 23% improvement (ACID compliant)
- Buffer pool : database = 1 : 10; DB workloads: TPC-C (DBT2), TPC-H (DBT3), SysBench

Data Written to SSS
[chart: normalized data written for stock MySQL, DoubleWrite disabled, and Atomic Write across TPC-C, TPC-H, and SysBench]
- DoubleWrite disabled: 46% reduction (not ACID compliant)
- Atomic Write: 43% reduction (ACID compliant); higher throughput generates more transaction log
- Buffer pool : database = 1 : 10; DB workloads: TPC-C (DBT2), TPC-H (DBT3), SysBench

Transaction Latency
[chart: normalized transaction latency for stock MySQL, DoubleWrite disabled, and Atomic Write across TPC-C, TPC-H, and SysBench]
- DoubleWrite disabled: 9% improvement (not ACID compliant)
- Atomic Write: 20% improvement (ACID compliant)
- Buffer pool : database = 1 : 10; DB workloads: TPC-C (DBT2), TPC-H (DBT3), SysBench

Varying the DB Buffer Pool Size : DB-on-Disk Size Ratio
[charts: transactions/minute (higher is better) and data written (lower is better) for Atomic Write vs. DoubleWrite, with DoubleWrite as the baseline, at ratios 1:1, 1:2, 1:4, 1:10, 1:25, 1:100, 1:500, 1:1000]
- 7% to 33% improvement; the 1:10 ratio corresponds to the results in the previous slides
- DB workload: TPC-C (DBT2)

Varying the DB Record Update Ratio
[charts: transactions/second (higher is better) and data written (lower is better) for Atomic Write vs. DoubleWrite, with DoubleWrite as the baseline, at update ratios of 0%, 10%, 33%, 50%, 67%, 90%, and 100% of the total workload]
- Up to 33% throughput improvement and a 28-40% reduction in data written
- DB workload: SysBench

Conclusions
- Solid State Storage opens opportunities for higher-order primitives in storage interfaces
- Atomic Write allows multi-block write operations to be completed as an atomic unit
- It benefits upper layers with ACID requirements (OS, file system, DBMS, applications): reduced complexity, improved performance, and improved device durability

Future Work
- Work with Linux kernel maintainers to integrate Atomic Write in a non-proprietary way
- Support multiple outstanding atomic write groups
- Full transactional support
- Explore other higher-order I/O primitives

Thank You!