Flash-Conscious Cache Population for Enterprise Database Workloads


IBM Research
Flash-Conscious Cache Population for Enterprise Database Workloads
Hyojun Kim, Ioannis Koltsidas, Nikolas Ioannou, Sangeetha Seshadri, Paul Muench, Clem Dickey, Larry Chiu
IBM Research Almaden and IBM Research Zurich
ADMS 2014, 1st September 2014, Hangzhou, China

Outline
- Flash-based caching and why it is different
- Weaknesses of existing approaches
- The Scalable Cache Engine
- Experimental evaluation using TPC-E
- Conclusions

Host-based Flash Caching
- Flash fills the gap between DRAM and HDDs very well, so it is a good fit as a caching layer between them.
- Goal: bring the data as close as possible to the application. A direct-attached SSD in the host server (application or database server) is managed by a block-level cache driver that sits below the filesystem and in front of the SAN storage.
- Benefits:
  - Low latency and (typically) higher throughput
  - Transparent to file systems and applications
  - Lower administration overhead compared to SAN flash
  - Eliminates SAN congestion and increases storage consolidation
- Typically employs a write-through mode towards the SAN storage.

Flash-based vs. DRAM-based Caching
Inherent differences necessitate a different way of thinking:
- A. A different place in the hierarchy: flash caches typically see different workloads than DRAM caches.
- B. A different memory technology: flash caches use NAND flash, while DRAM caches use DRAM.

Difference A: Their place in the storage hierarchy (DRAM above flash, flash above backing storage)
- They encounter different workloads: the DRAM cache receives the hottest portion of the data, while the flash cache sees a wider and sparser range with lower access frequency. Flash caches therefore see weaker locality.
- Flash caches need to have relatively large capacities, which makes them prone to long warm-up times.
- Flash caches consume premium memory space for metadata: high-performance flash caches keep their metadata in main memory (DRAM), so the metadata footprint may limit their scalability.

Difference B: Their underlying memory technology
- DRAM: symmetric reads and writes, uniform and extremely low latency, does not wear out.
- Flash: asymmetric reads and writes, higher latency that depends heavily on access patterns, limited endurance.
With flash:
- Cache population is expensive; with DRAM, all cache-missed data can be cached with just a memory copy.
- Unnecessary flash writes reduce performance and shorten the device lifetime: they leave less flash bandwidth available for reads, especially for small and random writes.
- Unconditional population can result in thrashing.
- Populating only the cache-missed data results in unacceptably long warm-up times: more than 10 hours for a 300 GB flash cache*.
* Byan et al., "Mercury: Host-side flash caching for the data center", MSST 2012

Experimental setup for TPC-E
- Workload: TPC-E (5,000 customers, 300 initial trading days, 40 client threads), running against DB2 Express-C v10.5 inside a guest virtual machine.
- Experiments with existing open-source flash-caching solutions; all three perform synchronous population of all cache misses in the user data path, in write-through mode.
- Comparisons to the baseline with no cache (nocache) and to a reference configuration.
- Host server: IBM System x3650 M3, 24 cores, 64 GiB RAM; KVM hypervisor; Fedora 19, kernel 3.11.7-200.fc19.x86_64; PCIe SSD (200 GB, enterprise-class device) as the caching device.
- Guest virtual machine: RHEL 6.4, kernel 2.6.32-358.18.1.el6.x86_64; 8 virtual CPUs, 15 GiB RAM; a 160 GiB volume backed by a 160 GiB partition on an iSCSI volume, with the flash cache management layer in between.
- iSCSI target: IBM System x3650 M2, 16 cores, 24 GiB RAM; RHEL 6.4, kernel 2.6.32-358.23.2.el6.x86_64; scsi-target-utils-1.0.24-3.el6_4.x86_64; RAID over 3x 73 GB HDDs (15K RPM, 6 Gb SAS); connected over Gigabit Ethernet.
- Metrics: transactions per second, average transaction latency, cache statistics, and raw storage access statistics from /proc/diskstats.

TPC-E Evaluation Results
(Figure: transactions per second (TpsE) over the 8-hour run for the three open-source flash cache solutions and the no-cache baseline. The best solution reaches roughly 5x the no-cache throughput, the other two roughly 1.5x.)

Table 1: TPC-E average TpsE, read hit rate, and memory usage (last 1-hour average) with three open-source flash cache solutions. The solution with the highest average TpsE also has the biggest memory usage.

  Configuration          TpsE   Read hit rate   Memory usage
  No cache               11.2   N/A             N/A
  Reference config.      87.9   N/A             N/A
  Open-source cache 1    56.9   98.66%          1,212 MiB
  Open-source cache 2    17.3   96.75%          15 MiB
  Open-source cache 3    17.5   96.52%          415 MiB

Observations:
- The best open-source solution showed ~3.3x better performance than the other two, yet only achieved 67% of the reference configuration's performance.
- The warm-up time was too long.
- The hit rate is not the problem here.

TPC-E Evaluation Results (continued)
The same results, annotated with the root cause. The problem is that cache population occurs:
- Unconditionally
- In the foreground
- At too fine a granularity

The Scalable Cache Engine
A cache engine that makes population as flash-friendly as possible:
- Selective Caching
- Coarse-grained Cache Management
- Asynchronous Cache Population
- Asynchronous Write-Through

Selective Caching
What?
- Do not populate all read cache misses; only populate data that are deemed worthy of caching.
Why?
- Cache pollution is avoided, which increases the hit rate.
- A lower rate of writes to the SSD leaves higher SSD read bandwidth and lower SSD read latency.
- Fewer total writes to the SSD mean a longer device lifetime.
How?
- The cache continuously monitors the user access patterns.
- A recency-and-frequency filter is applied to population candidates: on a cache miss, the filter decides whether the data are hot; if so, they are populated, otherwise they keep being monitored.
- Only the data that are classified as hot are promoted into the cache.
- Various algorithms are implemented; for the presented experiments we used a variation of MRU. A sketch of such a filter is shown below.
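The slides do not spell out the exact filter, so the following is a minimal sketch of a recency-and-frequency admission filter at fragment granularity, assuming a bounded monitoring table and a simple miss-count threshold. The class name, parameters, and threshold policy are illustrative, not the actual engine code.

    from collections import OrderedDict

    class SelectiveCachingFilter:
        """Illustrative recency-and-frequency admission filter (not the actual engine code)."""

        def __init__(self, monitor_capacity=4096, hot_threshold=3):
            self.monitor = OrderedDict()            # fragment id -> miss count, recency ordered
            self.monitor_capacity = monitor_capacity
            self.hot_threshold = hot_threshold

        def observe_miss(self, fragment_id):
            """Return True if the missed fragment should be populated into the flash cache."""
            count = self.monitor.pop(fragment_id, 0) + 1
            if count >= self.hot_threshold:
                return True                          # classified as hot: promote it
            self.monitor[fragment_id] = count        # otherwise keep monitoring it
            if len(self.monitor) > self.monitor_capacity:
                self.monitor.popitem(last=False)     # forget the least recently missed candidate
            return False

On a read miss the cache would call observe_miss(lba // FRAGMENT_SIZE) and enqueue a population task only when it returns True; everything else is served from the backing device and merely monitored.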

Coarse-grained Cache Management
- A fragment is 1 MB of contiguous logical space.
- Cache management occurs at fragment granularity: a fragment is the unit of population, eviction, and workload monitoring.
- User operations (reads, invalidates, write-through) occur at 4 kB page granularity.
Benefits:
- Small metadata footprint, hence scalability: only 76 bytes per 1 MB fragment. For a 2 TB cache the metadata footprint is 152 MB, compared to 12 GB and 4 GB for the alternatives shown on the slide (see the back-of-the-envelope check below).
- Exploits spatial locality in workloads and works effectively as a prefetching mechanism.
- Fast cache warm-up.
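The 152 MB figure follows directly from the per-fragment metadata size; a quick back-of-the-envelope check using only the numbers on the slide:

    FRAGMENT_SIZE = 1 << 20              # 1 MiB fragments
    METADATA_PER_FRAGMENT = 76           # bytes of DRAM metadata per fragment

    cache_size = 2 << 40                 # 2 TiB flash cache
    fragments = cache_size // FRAGMENT_SIZE
    footprint_mib = fragments * METADATA_PER_FRAGMENT / (1 << 20)
    print(f"{fragments:,} fragments -> {footprint_mib:.0f} MiB of metadata")
    # 2,097,152 fragments -> 152 MiB of metadata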

Asynchronous Background Population
- Implemented in the Linux Device Mapper framework.
- Population is a background task; the population unit is a fragment (1 MB).
- It happens outside the user data path, so it does not add latency to user reads.
- The cache has full control over how much and when to populate.
- The cache limits the population rate by limiting the number of population worker threads as the hit rate grows and as the SSD latency increases.
(Figure: the Scalable Cache Engine sits between the user read/write requests and the two devices. Cache hits are served from the caching device (SSD); cache misses are served from the backing device (HDD) and may spawn population tasks, which asynchronous population worker threads complete by reading a fragment from the HDD and writing it to the SSD; a separate AWT worker thread handles the asynchronous write-through path and invalidation.)
The result: a controlled population rate, flash-friendly writes, and minimum impact on foreground operations. A sketch of this worker structure is shown below.
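To make the background path concrete, here is a minimal sketch of an asynchronous population worker pool with hit-rate-based throttling. The worker counts at the 95% and 100% hit-rate points come from the results slides; the pool size, function names, and parking policy are illustrative assumptions rather than the real Device Mapper implementation.

    import queue
    import threading
    import time

    class AsyncPopulator:
        """Illustrative background population pool (a sketch, not the real kernel code)."""

        def __init__(self, read_fragment, write_fragment, max_workers=8):
            self.tasks = queue.Queue()
            self.read_fragment = read_fragment     # reads one 1 MiB fragment from the backing HDD
            self.write_fragment = write_fragment   # writes one 1 MiB fragment to the caching SSD
            self.max_workers = max_workers         # assumed pool size
            self.active_workers = max_workers
            for i in range(max_workers):
                threading.Thread(target=self._worker, args=(i,), daemon=True).start()

        def on_read_miss(self, fragment_id, read_hit_rate):
            # Throttling policy from the slides: two workers at a 95% hit rate, one at 100%.
            if read_hit_rate >= 1.00:
                self.active_workers = 1
            elif read_hit_rate >= 0.95:
                self.active_workers = 2
            else:
                self.active_workers = self.max_workers
            self.tasks.put(fragment_id)            # population happens later, off the user data path

        def _worker(self, index):
            while True:
                if index >= self.active_workers:   # parked workers keep SSD bandwidth free for reads
                    time.sleep(0.1)
                    continue
                fragment_id = self.tasks.get()
                self.write_fragment(fragment_id, self.read_fragment(fragment_id))
                self.tasks.task_done()

Because the foreground read only enqueues a task, a miss is still served from the backing device at its normal latency; the SSD write for population happens entirely in the background.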

Asynchronous Write-Through
1. A 4 KiB user write request arrives.
2. The page is invalidated in the page validity bitmap.
3. An AWT task is enqueued (the 4 KiB of data are memory-copied into the AWT task queue).
4. The write goes through to the backing device (HDD).
5. The write I/O is completed to the user.
6. A background AWT worker thread later dequeues the AWT task,
7. updates the flash copy on the caching device (SSD),
8. and validates the page in the bitmap again.
- Write-through follows the same principles as population: it occurs outside the user data path.
- The cache can control how much and when to write through.
- The user write latency becomes independent of the SSD write latency.
A sketch of this flow is given below.
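The numbered flow above maps naturally onto a small task queue plus a per-page validity bitmap. The sketch below follows those steps, assuming in-memory device objects with a write(page_no, data) method; it is illustrative rather than the actual driver code.

    import queue
    import threading

    PAGE_SIZE = 4096

    class AsyncWriteThrough:
        """Illustrative asynchronous write-through path following the slide's steps 1-8."""

        def __init__(self, backing_dev, caching_dev):
            self.backing_dev = backing_dev         # the HDD / iSCSI volume
            self.caching_dev = caching_dev         # the SSD
            self.valid = {}                        # page number -> is the SSD copy valid?
            self.awt_queue = queue.Queue()
            threading.Thread(target=self._worker, daemon=True).start()

        def write(self, page_no, data):            # step 1: a 4 KiB user write arrives
            assert len(data) == PAGE_SIZE
            self.valid[page_no] = False            # step 2: invalidate the page in the bitmap
            self.awt_queue.put((page_no, bytes(data)))  # step 3: enqueue an AWT task (memory copy)
            self.backing_dev.write(page_no, data)  # step 4: write through to the backing device
            return                                 # step 5: complete the write I/O to the user

        def _worker(self):
            while True:
                page_no, data = self.awt_queue.get()     # step 6: dequeue an AWT task
                self.caching_dev.write(page_no, data)    # step 7: update the flash copy
                self.valid[page_no] = True               # step 8: validate the page again
                                                         #         (only meaningful if its fragment is cached)
                self.awt_queue.task_done()

The key property is visible in write(): the user-visible latency depends only on the backing device, never on the SSD.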

TPC-E with a 200 GB Flash Cache
(Figure: (a) transactions per second over the 8-hour run, (b) average read hit rate (%), and (c) average response time (ms) over the last hour, for the Scalable Cache Engine, the open-source caches, and the no-cache baseline. The Scalable Cache Engine reaches about 7.5x the no-cache throughput, versus about 5x and 1.5x for the open-source caches; the final read hit rates are 100%, 98.7%, 96.7%, and 96.5%. Annotations: when the read hit rate becomes 95%, only two threads do population; when it becomes 100%, only one thread remains active for cache population.)


TPC-E with a 30 GB Flash Cache
The data (160 GB) do not fit in the cache, so continuous population and eviction are exercised.
(Figure: transactions per second over the 4-hour run for the Scalable Cache Engine, the open-source caches, and the no-cache baseline, with the same hit-rate annotations as before.)
Thrashing amplifies the problem; the gap grows to 3.6x - 4.9x.

TPC-E with a 30 GB Flash Cache: I/O Statistics
(Figure: I/O traffic in MiB/s over the 4-hour run, broken down into SSD reads, SSD writes, iSCSI reads, and iSCSI writes, for the compared configurations.)

Volume of Flash Writes
SSD write volume per transaction (KiB):
- 8-hour TPC-E with the 200 GiB flash cache: 142.5, 145.2, and 106.4 KiB/transaction for the open-source caches versus 60.2 KiB/transaction for the Scalable Cache Engine, i.e. 43% - 59% fewer writes.
- 4-hour TPC-E with the 30 GiB flash cache: 379.9, 330.0, and 209.4 KiB/transaction versus 164.8 KiB/transaction, i.e. 21% - 57% fewer writes.
This is increasingly important as we move towards c-MLC and TLC flash.

Conclusions
Traditional cache population schemes are bad for flash-based caches. Main conclusions:
1. Population should be selective, i.e., do not promote all cache-missed data.
2. Coarse-grained cache management is beneficial: small metadata footprint, short warm-up times, and benefits from prefetching.
3. Population should be a background operation, i.e., outside the user data path; similarly for write-through operations.
4. Population should occur in chunks on the order of 1 MB.

IBM Easy Tier Server for DS8000
Distributed host-based flash caching for DS8000 client hosts:
- Implements the algorithms and techniques described here.
- For IBM Power hosts running the IBM AIX OS.
- The storage server manages cluster cache coherence.
- Integrates the host-side flash caches with the automated tiering on the storage server (Easy Tier).
More information:
- "Software-defined just-in-time caching in an enterprise storage system", IBM Journal of Research and Development, vol. 58, no. 2/3, pp. 7:1-7:13, March-May 2014.
- IBM System Storage DS8000 Easy Tier Server, http://www.redbooks.ibm.com/redpapers/pdfs/redp513.pdf


Backup

Transactions per second (TpsE) with varying fragment size
(Figure: (a) TpsE over time, (b) average read hit rate (%), and (c) average response time (ms) over the last hour, comparing the baseline (1 MB fragments, with AWT) against 8 MB fragments, 128 kB fragments, and a configuration without AWT, alongside flashcache, EnhanceIO, and the no-cache baseline. Annotations: when the read hit rate becomes 95%, only two threads do population; when it becomes 100%, only one thread remains active for cache population.)

Memory Footprint
(Figure: memory usage in MiB of the compared cache solutions at two cache sizes. For a 200 GiB flash cache the bars are roughly 1.1 GiB, 15 MiB, 415 MiB, and 34 MiB; for a 1 TiB flash cache they are roughly 6.0 GiB, 34 MiB, 2.1 GiB, and 93 MiB.)