FlashCache. Mohan Srinivasan Mark Callaghan July 2010

Similar documents
The Btrfs Filesystem. Chris Mason

Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage

1. Creates the illusion of an address space much larger than the physical memory

CS370 Operating Systems

Cascade Mapping: Optimizing Memory Efficiency for Flash-based Key-value Caching

GPUfs: Integrating a File System with GPUs. Yishuai Li & Shreyas Skandan

STORING DATA: DISK AND FILES

CS3600 SYSTEMS AND NETWORKS

The What, Why and How of the Pure Storage Enterprise Flash Array. Ethan L. Miller (and a cast of dozens at Pure Storage)

System Characteristics

How to Fulfill the Potential of InnoDB's Performance and Scalability

Linux SMR Support Status

Persistent Storage - Datastructures and Algorithms

The Google File System

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

Caching and reliability

Open-Channel SSDs Offer the Flexibility Required by Hyperscale Infrastructure Matias Bjørling CNEX Labs

Chapter 8 Virtual Memory

MySQL and SSD: Usage Patterns

Mastering the art of indexing

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

Storage and File Structure

Conoscere e ottimizzare l'i/o su Linux. Andrea Righi -

SSD/Flash for Modern Databases. Peter Zaitsev, CEO, Percona October 08, 2014 Percona Technical Webinars

Improving Performance using the LINUX IO Scheduler Shaun de Witt STFC ISGC2016

<Insert Picture Here> Filesystem Features and Performance

Name: Instructions. Problem 1 : Short answer. [48 points] CMU / Storage Systems 23 Feb 2011 Spring 2012 Exam 1

Azor: Using Two-level Block Selection to Improve SSD-based I/O caches

Lecture 14: Cache & Virtual Memory

SQL Server 2014: In-Memory OLTP for Database Administrators

SHRD: Improving Spatial Locality in Flash Storage Accesses by Sequentializing in Host and Randomizing in Device

Optimizing MySQL performance with ZFS. Neelakanth Nadgir Allan Packer Sun Microsystems

Virtual or Logical. Logical Addr. MMU (Memory Mgt. Unit) Physical. Addr. 1. (50 ns access)

Flash-Conscious Cache Population for Enterprise Database Workloads

Linux Internals For MySQL DBAs. Ryan Lowe Marcos Albe Chris Giard Daniel Nichter Syam Purnam Emily Slocombe Le Peter Boros

SSD/Flash for Modern Databases. Peter Zaitsev, CEO, Percona November 1, 2014 Highload Moscow,Russia

ECE 571 Advanced Microprocessor-Based Design Lecture 13

Ch 11: Storage and File Structure

Chapter 11: Storage and File Structure. Silberschatz, Korth and Sudarshan Updated by Bird and Tanin

CSE506: Operating Systems CSE 506: Operating Systems

Classifying Physical Storage Media. Chapter 11: Storage and File Structure. Storage Hierarchy (Cont.) Storage Hierarchy. Magnetic Hard Disk Mechanism

Classifying Physical Storage Media. Chapter 11: Storage and File Structure. Storage Hierarchy. Storage Hierarchy (Cont.) Speed

ECE 5730 Memory Systems

Caching and Demand-Paged Virtual Memory

Storage and File Structure. Classification of Physical Storage Media. Physical Storage Media. Physical Storage Media

Foster B-Trees. Lucas Lersch. M. Sc. Caetano Sauer Advisor

Swapping. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

SmartSaver: Turning Flash Drive into a Disk Energy Saver for Mobile Computers

Disks, Memories & Buffer Management

SRM-Buffer: An OS Buffer Management SRM-Buffer: An OS Buffer Management Technique toprevent Last Level Cache from Thrashing in Multicores

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

Linux Filesystems and Storage Chris Mason Fusion-io

Managing Storage: Above the Hardware

Pipelined processors and Hazards

Section 7: Wait/Exit, Address Translation

University of Waterloo Midterm Examination Sample Solution

Swapping. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

CFLRU:A A Replacement Algorithm for Flash Memory

Ben Walker Data Center Group Intel Corporation

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

CSE 120. Translation Lookaside Buffer (TLB) Implemented in Hardware. July 18, Day 5 Memory. Instructor: Neil Rhodes. Software TLB Management

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

CPS 104 Computer Organization and Programming Lecture 20: Virtual Memory

Virtual memory. Virtual memory - Swapping. Large virtual memory. Processes

The MySQL Query Cache

Operating Systems CSE 410, Spring Virtual Memory. Stephen Wagner Michigan State University

Virtual Memory. Robert Grimm New York University

Modification and Evaluation of Linux I/O Schedulers

HashKV: Enabling Efficient Updates in KV Storage via Hashing

File System Implementation. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

CS 450 Fall xxxx Final exam solutions. 2) a. Multiprogramming is allowing the computer to run several programs simultaneously.

Getting Real: Lessons in Transitioning Research Simulations into Hardware Systems

FILE SYSTEMS, PART 2. CS124 Operating Systems Fall , Lecture 24

EECS 482 Introduction to Operating Systems

Project 3: Virtual Memory

STORAGE SYSTEMS. Operating Systems 2015 Spring by Euiseong Seo

G Virtual Memory. Robert Grimm New York University

Address Translation. Tore Larsen Material developed by: Kai Li, Princeton University

OPERATING SYSTEM. Chapter 9: Virtual Memory

Flash Memory Based Storage System

BTREE FILE SYSTEM (BTRFS)

Choosing Storage for MySQL. Peter Zaitsev CEO, Percona Inc Percona Live, Washington,DC 11 January 2012

IBM Spectrum NAS. Easy-to-manage software-defined file storage for the enterprise. Overview. Highlights

From server-side to host-side:

Chapter 3 - Memory Management

The Journey of an I/O request through the Block Layer

Chapter 8 Main Memory

10/7/13! Anthony D. Joseph and John Canny CS162 UCB Fall 2013! " (0xE0)" " " " (0x70)" " (0x50)"

Chapter 11: Implementing File Systems. Operating System Concepts 8 th Edition,

LSM-trie: An LSM-tree-based Ultra-Large Key-Value Store for Small Data

File System Internals. Jo, Heeseung

CPS104 Computer Organization and Programming Lecture 16: Virtual Memory. Robert Wagner

200 points total. Start early! Update March 27: Problem 2 updated, Problem 8 is now a study problem.

XtraDB 5.7: Key Performance Algorithms. Laurynas Biveinis Alexey Stroganov Percona

OS and Hardware Tuning

Learning Outcomes. An understanding of page-based virtual memory in depth. Including the R3000 s support for virtual memory.

Virtual Memory. CS 351: Systems Programming Michael Saelee

Memory Management Virtual Memory

Transcription:

FlashCache Mohan Srinivasan Mark Callaghan July 2010

FlashCache at Facebook What We want to use some Flash storage on existing servers We want something that is simple to deploy and use Our IO access patterns benefit from a cache Who Mohan Srinivasan design and implementation Paul Saab platform and MySQL integration Mark Callaghan - benchmarketing

Introduction Block cache for Linux - write back and write through modes Layered below the filesystem at the top of the storage stack Cache Disk Blocks on fast persistent storage (Flash, SSD) Loadable Linux Kernel module, built using the Device Mapper (DM) Primary use case InnoDB, but general purpose Based on dm-cache by Prof Ming

Caching Modes Write Back Lazy writing to disk Persistent across reboot Write Through Non-persistent Are you a pessimist? Persistent across device removal

Cache Structure Set associative hash Hash with fixed sized buckets (sets) with linear probing within a set 512-way set associative by default dbn: Disk Block Number, address of block on disk Set = (dbn / block size / set size) mod (number of sets) Sequential range of dbns map onto a single sets

Cache Structure Set 0 Cache set Block 0 Block 0 Block 511 SET i Block 0 N set s worth Of blocks Set N-1 Cache set Block 511 Block 511

Write Back Replacement policy is FIFO (default) or LRU within a set Switch on the fly between FIFO/LRU (sysctl) Metadata per cache block: 24 bytes in memory, 16 bytes on ssd On ssd metadata per-slot <dbn, block state> In memory metadata per-slot: <dbn, block state, LRU chain pointers, misc>

Write Through Replacement policy is FIFO Metadata per slot 17 bytes (memory), no metadata stored on ssd In memory metadata per-slot <dbn, block state, checksum>

Reads Compute cache set for dbn Cache Hit Verify checksums if configured Serve read out of cache Cache Miss Find free block or reclaim block based on replacement policy Read block from disk and populate cache Update block checksum if configured Return data to user

Write Through - writes Compute cache set for dbn Cache hit Get cached block Cache miss Find free block or reclaim block Write data block to disk Write data block to cache Update block checksum

Write Back - writes Compute cache set for dbn Cache Hit Write data block into cache If data block not DIRTY, synchronously update on-ssd cache metadata to mark block DIRTY Cache miss Find free block or reclaim block based on replacement policy Write data block to cache Synchronously update on-ssd cache metadata to mark block DIRTY

Small or uncacheable requests First invalidate blocks that overlap the requests There are at most 2 such blocks For Write Back, if the overlapping blocks are DIRTY they are cleaned first then invalidated Uncacheable full block reads are served from cache in case of a cache hit Perform disk IO Repeat invalidation to close races which might have caused the block to be cached while the disk IO was in progress

Write Back policy Default expiration of 30 seconds (work in progress) When dirty blocks in a set exceeds configurable threshold, clean some blocks Blocks selected for writeback based on replacement policy Default dirty threshold 20% Set higher for write heavy workloads Sort selected blocks and pickup any other blocks in set that can be contiguously merged with these Writes merged by the IO scheduler

Write Back cache metadata overhead In-Memory cache metadata memory footprint 300GB/4KB cache -> ~18GB 160GB/4KB cache -> ~960MB Cache metadata writes/file system write Worst case is 2 cache metadata updates per write (VALID->DIRTY, DIRTY->VALID) Average case is much lower because of cache write hits and batching of cache metadata updates

Write Through cache metadata overhead In-Memory Cache metadata footprint 300GB/4KB cache -> ~13GB 160GB/4KB cache -> ~700MB Cache metadata writes per file system write 1 cache data write per file system write

Write Back metadata updates Cache (on-ssd) metadata only updated on writes and block cleanings (VALID->DIRTY or DIRTY->VALID) Cache (on-ssd) metadata not updated on cache population for reads Reload after an unclean shutdown only loads DIRTY blocks Fast and Slow cache shutdowns Only metadata is written on fast shutdown Reload loads both dirty and clean blocks Slow shutdown writes all dirty blocks to disk first, then writes out metadata to the ssd Reload only loads clean blocks Metadata updates to multiple blocks in same sector are batched

Torn Page Problem Handle partial block write caused by power failure or other causes Problem exists for Flashcache in Write Back mode Detected via block checksums Checksums are disabled by default Pages with bad checksums are not used Checksums increase cache metadata writes and memory footprint Update cache metadata checksums on DIRTY->DIRTY block transitions for Write Back Each per-cache slot grows by 8 bytes to hold the checksum (a 33% increase from 24 bytes to 32 bytes for the Write Back case)

Cache controls for Write Back Work best with O_DIRECT file access Global modes Cache All or Cache Nothing Cache All has a blacklist of pids and tgids Cache Nothing has a whitelist of pids and tgids tgids can be used to tag all pthreads in the group as cacheable Exceptions for threads within a group are supported List changes done via FlashCache ioctls Cache can be read but is not written for non-cacheable tgids and pids We modified MySQL and scp to use this support

Cache Nothing policy If the thread id is whitelisted, cache all IOs for this thread If the tgid is whitelisted, cache all IOs for this thread If the thread id is blacklisted do not cache IOs

Cache control example We use Cache Nothing mode for MySQL servers The mysqld tgid is added to the whitelist All IO done by it is cacheable Writes done by other processes do not update the cache Full table scans done by mysqldump use a hint that directs mysqld to add the query s thread id to the blacklist to avoid wiping FlashCache select /* SQL_NO_FCACHE */ pk, col1, col2 from foobar

Utilities flashcache_create flashcache_create -b 4k -s 10g mysql /dev/flash /dev/disk flashcache_destroy flashcache_destory /dev/flash flashcache_load

sysctl a grep flash devflashcachecache_all = 0 devflashcachefast_remove = 0 devflashcachezero_stats = 1 devflashcachewrite_merge = 1 devflashcachereclaim_policy = 0 devflashcachepid_expiry_secs = 60 devflashcachemax_pids = 100 devflashcachedo_pid_expiry = 0 devflashcachemax_clean_ios_set = 2 devflashcachemax_clean_ios_total = 4 devflashcachedebug = 0 devflashcachedirty_thresh_pct = 20 devflashcachestop_sync = 0 devflashcachedo_sync = 0

Removing FlashCache umount /data dmesetup remove mysql flashcache_destroy /dev/flash

cat /proc/flashcache_stats reads=4 writes=0 read_hits=0 read_hit_percent=0 write_hits=0 write_hit_percent=0 dirty_write_hits=0 dirty_write_hit_percent=0 replacement=0 write_replacement=0 write_invalidates=0 read_invalidates=0 pending_enqueues=0 pending_inval=0 metadata_dirties=0 metadata_cleans=0 cleanings=0 no_room=0 front_merge=0 back_merge=0 nc_pid_adds=0 nc_pid_dels=0 nc_pid_drops=0 nc_expiry=0 disk_reads=0 disk_writes=0 ssd_reads=0 ssd_writes=0 uncached_reads=169 uncached_writes=128

Future Work Cache mirroring SW RAID 0 block device as a cache Online cache resize No shutdown and recreate Support for ATA trim Discard blocks no longer in use Fix the torn page problem Use shadow pages

Resources GitHub : facebook/flashcache Mailing list : flashcache-dev@googlegroupscom http://facebookcom/mysqlatfacebook Email : mohan@facebookcom (Mohan Srinivasan) ps@facebookcom (Paul Saab) mcallaghan@facebookcom (Mark Callaghan)

(c) 2009 Facebook, Inc or its licensors "Facebook" is a registered trademark of Facebook, Inc All rights reserved 10