Design of Flash-Based DBMS: An In-Page Logging Approach

Similar documents
Design of Flash-Based DBMS: An In-Page Logging Approach

SSDs vs HDDs for DBMS by Glen Berseth York University, Toronto

LAST: Locality-Aware Sector Translation for NAND Flash Memory-Based Storage Systems

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

16:30 18:00, June 20 (Monday), 2011 # (even student IDs) # (odd student IDs) Scope

Sep. 22 nd, 2008 Sang-Won Lee. Toward Flash-based Enterprise Databases

SFS: Random Write Considered Harmful in Solid State Drives

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Storage Architecture and Software Support for SLC/MLC Combined Flash Memory

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Join Processing for Flash SSDs: Remembering Past Lessons

NAND Flash-based Storage. Computer Systems Laboratory Sungkyunkwan University

Transactional In-Page Logging for Multiversion Read Consistency and Recovery

STORING DATA: DISK AND FILES

BPCLC: An Efficient Write Buffer Management Scheme for Flash-Based Solid State Disks

A Hybrid Solid-State Storage Architecture for the Performance, Energy Consumption, and Lifetime Improvement

Solid State Drives (SSDs) Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

A Buffer Replacement Algorithm Exploiting Multi-Chip Parallelism in Solid State Disks

A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers

Understanding the Relation between the Performance and Reliability of NAND Flash/SCM Hybrid Solid- State Drive

Using Transparent Compression to Improve SSD-based I/O Caches

Chapter 14 HARD: Host-Level Address Remapping Driver for Solid-State Disk

A Caching-Oriented FTL Design for Multi-Chipped Solid-State Disks. Yuan-Hao Chang, Wei-Lun Lu, Po-Chun Huang, Lue-Jane Lee, and Tei-Wei Kuo

Page Mapping Scheme to Support Secure File Deletion for NANDbased Block Devices

A New Cache Management Approach for Transaction Processing on Flash-based Database

STORAGE SYSTEMS. Operating Systems 2015 Spring by Euiseong Seo

CSCI-GA Database Systems Lecture 8: Physical Schema: Storage

Toward Seamless Integration of RAID and Flash SSD

Beyond Block I/O: Rethinking

CSE 451: Operating Systems Spring Module 12 Secondary Storage

CSE 451: Operating Systems Spring Module 12 Secondary Storage. Steve Gribble

2009. October. Semiconductor Business SAMSUNG Electronics

VSSIM: Virtual Machine based SSD Simulator

L9: Storage Manager Physical Data Organization

Disks and Files. Storage Structures Introduction Chapter 8 (3 rd edition) Why Not Store Everything in Main Memory?

COS 318: Operating Systems. Storage Devices. Vivek Pai Computer Science Department Princeton University

Disks, Memories & Buffer Management

A Reliable B-Tree Implementation over Flash Memory

Advanced Database Systems

Principles of Data Management. Lecture #2 (Storing Data: Disks and Files)

Solving the I/O bottleneck with Flash

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

Flash Memory Based Storage System

smxnand RTOS Innovators Flash Driver General Features

Cooperating Write Buffer Cache and Virtual Memory Management for Flash Memory Based Systems

A Flash-Aware Buffering Scheme with the On-the-Fly Redo for Efficient Data Management in Flash Storage

Database Systems. November 2, 2011 Lecture #7. topobo (mit)

Baoping Wang School of software, Nanyang Normal University, Nanyang , Henan, China

SICV Snapshot Isolation with Co-Located Versions

Storing Data: Disks and Files

u Covered: l Management of CPU & concurrency l Management of main memory & virtual memory u Currently --- Management of I/O devices

Memory Hierarchy Y. K. Malaiya

Page-Differential Logging: An Efficient and DBMS- Independent Approach for Storing Data into Flash Memory

[537] Flash. Tyler Harter

Database Recovery. Haengrae Cho Yeungnam University. Database recovery. Introduction to Database Systems

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

Presented by: Nafiseh Mahmoudi Spring 2017

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Computer Architecture ECE 568

CS143: Disks and Files

Delayed Partial Parity Scheme for Reliable and High-Performance Flash Memory SSD

Storage Systems : Disks and SSDs. Manu Awasthi CASS 2018

Overview of Storage & Indexing (i)

LBM: A Low-power Buffer Management Policy for Heterogeneous Storage in Mobile Consumer Devices

ZBD: Using Transparent Compression at the Block Level to Increase Storage Space Efficiency

Partitioned Real-Time NAND Flash Storage. Katherine Missimer and Rich West

Power Analysis for Flash Memory SSD. Dongkun Shin Sungkyunkwan University

I/O & Storage. Jin-Soo Kim ( Computer Systems Laboratory Sungkyunkwan University

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

NAND Flash Memory. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

PC-based data acquisition II

Why Is This Important? Overview of Storage and Indexing. Components of a Disk. Data on External Storage. Accessing a Disk Page. Records on a Disk Page

CFDC A Flash-aware Replacement Policy for Database Buffer Management

GREEN COMPUTING- A CASE FOR DATA CACHING AND FLASH DISKS?

OSSD: A Case for Object-based Solid State Drives

Storage Systems : Disks and SSDs. Manu Awasthi July 6 th 2018 Computer Architecture Summer School 2018

Hard Disk Drives. Nima Honarmand (Based on slides by Prof. Andrea Arpaci-Dusseau)

Reducing Solid-State Storage Device Write Stress Through Opportunistic In-Place Delta Compression

Review 1-- Storing Data: Disks and Files

Computer Systems Laboratory Sungkyunkwan University

Clustered Page-Level Mapping for Flash Memory-Based Storage Devices

STORAGE LATENCY x. RAMAC 350 (600 ms) NAND SSD (60 us)

DSM: A Low-Overhead, High-Performance, Dynamic Stream Mapping Approach for MongoDB

CBM: A Cooperative Buffer Management for SSD

Hard Disk Drives (HDDs) Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

AN ALTERNATIVE TO ALL- FLASH ARRAYS: PREDICTIVE STORAGE CACHING

CFLRU:A A Replacement Algorithm for Flash Memory

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management

File System Management

Goals for Today. CS 133: Databases. Relational Model. Multi-Relation Queries. Reason about the conceptual evaluation of an SQL query

CSE 232A Graduate Database Systems

Storing Data: Disks and Files

Big and Fast. Anti-Caching in OLTP Systems. Justin DeBrabant

Hard Disk Drives (HDDs)

1. Creates the illusion of an address space much larger than the physical memory

I/O CANNOT BE IGNORED

Storing Data: Disks and Files

Architecture Exploration of High-Performance PCs with a Solid-State Disk

Main Points. File systems. Storage hardware characteristics. File system usage patterns. Useful abstractions on top of physical devices

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Database Management Systems need to:

Chapter 6. Storage and Other I/O Topics

Transcription:

SIGMOD 07 Design of Flash-Based DBMS: An In-Page Logging Approach Sang-Won Lee School of Info & Comm Eng Sungkyunkwan University Suwon,, Korea 440-746 wonlee@ece.skku.ac.kr Bongki Moon Department of Computer Science University of Arizona Tucson, AZ 85721, U.S.A. bkmoon@cs.arizona.edu ACM SIGMOD 2007, Beijing, China -1-

Introduction In recent years, NAND flash memory wins over hard disk in mobile storage market PDA, MP3, mobile phone, digital camera,... Advantages: size, weight, shock resistance, power consumption, noise n Now, compete with hard disk in personal computer market 32GB Flash SSD: M-TronM Tron,, Samsung, SanDisk Vendors launched new lines of personal computers only with NAND flash memory instead of hard disk In near future, full database servers can run on computing platforms with TB-scale Flash SSD second storage C.G. Hwang predicted twofold increase of NAND flash density each year until 2012 [ProcIEEE[ 2003] Database workload different from multimedia applications ACM SIGMOD 2007, Beijing, China -2-

Characteristics of NAND Flash No in-place update No data item or sector can be updated in place before erasing it first. An erase unit (16KB or 128 KB) is much larger than a sector. No mechanical latency Flash memory is an electronic device without moving parts Provides uniform random access speed without seek/rotational latency Asymmetric read & write speed Read speed is typically at least twice faster than write speed Write (and erase) optimization is critical ACM SIGMOD 2007, Beijing, China -3-

Magnetic Disk vs NAND Flash Read time Write time Erase time Magnetic Disk 12.7 msec 13.7 msec N/A NAND Flash 80 µsec 200 µsec 1.5 msec Magnetic Disk : Seagate Barracuda 7200.7 ST380011A NAND Flash : Samsung K9WAG08U1A 16 Gbits SLC NAND Unit of read/write: 2KB, Unit of erase: 128KB ACM SIGMOD 2007, Beijing, China -4-

Disk-Based DBMS on Flash Memory What happens if disk-based DBMS runs on NAND Flash? Due to No In-place Update, an update causes a write into another clean page Consume free sectors quickly causing frequent garbage collection and erase SQL: Update / Insert / Delete Update Buffer Mgr. Page : 4KB Erase Unit: 128KB Dirty Block Write Flash Memory Data Block Area ACM SIGMOD 2007, Beijing, China -5-

Disk-Based DBMS Performance Run SQL queries on a commercial DBMS Sequential scan or update of a table Non-sequential read or update of a table (via B-tree B index) Experimental settings Storage: Magnetic disk vs M- Tron SSD (Samsung flash) Data page of 8KB 10 tuples per page, 640,000 tuples in a table (64,000 pages, 512MB) ACM SIGMOD 2007, Beijing, China -6-

Disk-Based DBMS Performance Read performance : The result is not surprising at all Sequential Non-sequential Disk Flash 14.0 sec 11.0 sec 61.1 ~ 172.0 sec 12.1 ~ 13.1 sec Hard disk Read performance is poor for non-sequential accesses, mainly because of seek and rotational latency Flash memory Read performance is insensitive to access patterns ACM SIGMOD 2007, Beijing, China -7-

Disk-Based DBMS Performance Write performance Sequential Non-sequential Disk Flash 34.0 sec 26.0 sec 151.9 ~ 340.7 sec 61.8 ~ 369.9 sec Hard disk Write performance is poor for non-sequential accesses, mainly because of seek and rotational latency Flash memory Write performance is poor (worse than disk) for non-sequential accesses due to out-of-place update and erase operations Demonstrate the need of write optimization for DBMS running on Flash ACM SIGMOD 2007, Beijing, China -8-

In-Page Logging (IPL) Approach Design Principles Take advantage of the characteristics of flash memory Uniform random access speed Fast read speed Overcome the erase-before-write limitation of flash memory Minimize the changes made to the overall DBMS architecture Key Ideas Limited to buffer manager and storage manager Changes written to log instead of updating them in place Avoid frequent write and erase operations Log records are co-located with data pages No need to write them sequentially to a separate log region Read current data more efficiently than sequential logging ACM SIGMOD 2007, Beijing, China -9-

Design of the IPL Logging on Per-Page Page basis in both Memory and Flash Database Buffer Flash Memory in-memory data page (8KB) Erase unit: 128KB update-in-place in-memory log sector (512B) 15 data pages (8KB each) An In-memory log sector can be associated with a buffer frame in memory Allocated on demand when a page becomes dirty An In-flash log segment is allocated in each erase unit.. log area (8KB): 16 sectors The log area is shared by all the data pages in an erase unit ACM SIGMOD 2007, Beijing, China -10-

IPL Write Data pages in memory Updated in place, and Physiological log records written to its in-memory log sector In-memory log sector is written to the in-flash log segment, when Data page is evicted from the buffer pool, or The log sector becomes full When a dirty page is evicted, the content is not written to flash memory The previous version remains intact Data pages and their log records are physically co-located in the same erase unit Buffer Mgr. Update / Insert / Delete update-in-place physiological log Sector : 512B Page : 8KB Block : 128KB Flash Memory Data Block Area ACM SIGMOD 2007, Beijing, China -11-

IPL Read When a page is read from flash, the current version is computed on the fly Buffer Mgr. P i Re-construct the current in-memory copy Apply the physiological action to the copy read from Flash (CPU overhead) Read from Flash Original copy of P i All log records belonging to P i (IO overhead) Flash Memory data area (120KB): 15 pages log area (8KB): 16 sectors ACM SIGMOD 2007, Beijing, China -12-

IPL Merge When all free log sectors in an erase unit are consumed Log records are applied to the corresponding data pages The current data pages are copied into a new erase unit A Physical Flash Block Merge 15 up-to-date data pages log area (8KB): 16 sectors B old B new clean log area ACM SIGMOD 2007, Beijing, China -13-

IPL Simulation with TPC-C TPC-C C Log Data Generation Run a commercial DBMS to generate reference streams of TPC-C benchmark HammerOra utility used for TPC-C C workload generation Each trace contains log records of physiological updates as well as physical page writes Average length of a log record: 20 ~ 50B TPC-C C Traces 100M.20M.10u: 100MB DB, 20 MB buffer, 10 simulated users 1G.20M.100u: 1GB DB, 20 MB buffer, 100 simulated users 1G.40M.100u: 1GB DB, 40 MB buffer, 100 simulated users ACM SIGMOD 2007, Beijing, China -14-

IPL Simulation IPL Event-driven Simulator Event-driven simulation of IPL using the TPC-C C traces Events: insert/delete/update log, physical writes of data pages For each physiological log, Add the log record to the in-memory log sector; Generate a sector write event if the log sector is full For each physical page write Generate a sector write event; clear the in-memory log sector For each sector write event Increment the write counter If in-flash log segment is full, increment the merge counter Parameter setting for the simulator to estimate write performance Write (2KB): 200 us Merge (128KB): 20 ms ACM SIGMOD 2007, Beijing, China -15-

Log Segment Size vs Merges TPC-C Write frequencies are highly skewed (and low temporal locality) Erase units containing hot pages consume log sectors quickly Could cause a large number of erase operations More storage but less frequent merges with more log sectors ACM SIGMOD 2007, Beijing, China -16-

Estimated Write Performance Performance trend with varying buffer sizes The size of log segment was fixed at 8KB Estimated write time With IPL = (# of sector writes) 200us + (# of merges) 20ms Without IPL = α (# of page writes) 20ms α is the probability that a page write causes the container erase unit to be copied and erased ACM SIGMOD 2007, Beijing, China -17-

Support for Recovery IPL helps realize a lean recovery mechanism Additional logging: transaction log and list of dirty pages Transaction Commit Similarly to flushing log tail An in-memory log sector is forced out to flash if it contains at least one log record of a committing transaction No explicit REDO action required at system restart Transaction Abort De-apply the log records of an aborting transaction Use selective merge instead of regular merge, because it s s irreversible If committed, merge the log record If aborted, discard the log record If active, carry over the log record to a new erase unit To avoid a thrashing behavior, allow an erase unit to have overflow log sectors No explicit UNDO action required ACM SIGMOD 2007, Beijing, China -18-

Conclusion Clear and present evidence that Flash can co-exist or even replace Disk IPL approach demonstrates its potential for TPC-C C type database applications by Overcoming the erase-before-write limitation Exploiting the fast and uniform random access IPL also helps realize a lean recovery mechanism ACM SIGMOD 2007, Beijing, China -19-