The What, Why and How of the Pure Storage Enterprise Flash Array. Ethan L. Miller (and a cast of dozens at Pure Storage)


Enterprise storage: a $30B market built on disk
- Key players: EMC, NetApp, HP, etc.
- Dominated by spinning disk

Wait, isn't flash the new hotness?
- Drives today's smartphones, cameras, USB drives, even laptops
- Common to speed up desktops by installing SSDs

Why can't we swap flash into today's disk array?
- Current software systems are optimized for disk
- Flash and disk are very different

Flash vs. disk

Disk
- Moving parts: mechanical limitations
- Locality matters: faster to read data nearby on the disk
- Read and write have symmetric performance
- Data can be overwritten in place

Flash
- No moving parts: all electronic
- Much faster, much more reliable
- Reads are locality-independent and 100x faster than disk

                        Consumer Disk   Enterprise Disk   SSD
$/GB                    $0.05           $0.50             $1
Capacity                ~3-4 TB         ~0.2-2 TB         ~500 GB
Sequential read/write   150 MB/s        200 MB/s          ~400 MB/s
IOPS (4K)               80-120          200               11.5K/40K
Read latency            12 ms           6 ms              0.15 ms
Write latency           13 ms           6 ms              0.05 ms
Worst-case latency      15 ms           10 ms             100 ms

Why not simply replace disk with flash in an array?
- Traditional arrays are optimized for disk
  - Attempt to reduce seek distances: irrelevant for flash
  - Overwrite in place: makes flash work harder
  - Mix reads and writes: flash often performs worse on such workloads
- The resulting flash array would be too expensive
  - Leverage deduplication: keep only one copy of data even if it's written to multiple locations
  - Dedup and compression can reduce cost and increase performance
- Different failure characteristics
  - Flash can burn out quickly
  - Entire disks typically fail more often than flash drives
  - Flash drives lose more individual sectors

Flash isn't all rosy
- Data is written in pages, erased in blocks
  - Page: 4 KB
  - Erase block: 64-256 pages
- Need a flash translation layer (FTL) to map logical addresses to physical locations
  - Typically, manage flash with a log-structured storage system (see the sketch below)
- Limited number of erase cycles
  - Multi-Level Cell (MLC): ~3,000-5,000 erase cycles
  - Single-Level Cell (SLC): ~100,000 cycles (but more expensive)
- Performance quirks: unpredictable read and write response times
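To make the FTL point concrete, here is a minimal, hypothetical page-mapping FTL sketch in Python. The class SimpleFTL, the sizes, and the layout are illustrative, not how real SSD firmware (or Purity) works; it only shows that an "overwrite" lands at the log head and leaves a stale page behind, which is why garbage collection and erase-cycle management are needed.

```python
# A minimal, illustrative page-mapping FTL sketch (not a real SSD's FTL):
# logical pages are remapped to the next free physical page, so an "overwrite"
# is really an append plus an invalidation of the old physical page.

PAGES_PER_BLOCK = 64  # illustrative; real erase blocks hold 64-256 pages


class SimpleFTL:
    def __init__(self, num_blocks):
        self.map = {}                      # logical page -> (block, page)
        self.invalid = set()               # physical pages holding stale data
        self.cursor = (0, 0)               # next free (block, page)
        self.num_blocks = num_blocks

    def write(self, logical_page):
        block, page = self.cursor
        if block >= self.num_blocks:
            raise RuntimeError("device full: garbage collection needed")
        old = self.map.get(logical_page)
        if old is not None:
            self.invalid.add(old)          # old copy becomes garbage
        self.map[logical_page] = (block, page)
        # advance the append cursor; a full block forces us to the next one
        self.cursor = (block + 1, 0) if page + 1 == PAGES_PER_BLOCK else (block, page + 1)

    def read(self, logical_page):
        return self.map.get(logical_page)  # physical location, or None


ftl = SimpleFTL(num_blocks=4)
ftl.write(10)
ftl.write(10)                              # overwrite: remapped, old page invalidated
print(ftl.read(10), len(ftl.invalid))      # -> (0, 1) 1
```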

Pure Storage goals

10x faster...
- Provide consistent I/O latency for reads and writes: 200-400K IOPS
- Leverage flash characteristics: random reads & sequential writes
- Optimal performance independent of volume configuration and data layout

...at the same cost
- Use commodity MLC SSDs: scalable to 10s-100s of terabytes
- Leverage aggressive compression and deduplication

...always available
- Run 24x7x365: non-disruptive upgrades
- Protect against device failure and corruption

...and with standard storage features like snapshots

Pure Storage approaches

Guiding principles
- Never change information, only make it outdated (append-only structures): this makes operations idempotent
- Keep approximate answers and resolve when needed (more reads, fewer writes)

Primary structures
- Data and metadata logs, stored in segments
- Append-only data structures: new entries occlude older entries
  - A separate invalidation structure stores invalid sequence numbers
  - Garbage collection reclaims invalid data
- Deduplicate and compress on the fly
- Use RAID for both reliability and performance
  - Reconstruct data for reads during writes to flash

Purity system overview
- Purity is (currently) a block store
  - Highly efficient for both dense and sparse address ranges
  - Similar to a key-value store: the key is <medium, address>, the value is the content of a block
- Purity runs in user space
  - Heavily multithreaded
  - I/O goes through a Pure-specific driver in the kernel
(Figure: software stack showing the cache daemon, segments, data reduction, and garbage collection on top of the Flash Personality Layer for per-device customization, the Platform layer for device management, and the Linux kernel)

The basics: read and write path

Read
- Translate <volume, LBA> into a list of <medium, LBA> pairs
- Find the highest (newest) <medium, LBA> -> <segment, offset> mapping in the metadata
- Get the block at <segment, offset> (from the block cache or SSD)
- Return the block to the user

Write (see the sketch below)
- Translate <volume, LBA> -> <medium, offset>
- Identify pattern blocks and deduplicate
- Persist data to NVRAM and return success to the user
- Write data & metadata into the log
- Write metadata into the pyramid
(Figure: the read path consulting the metadata pyramid and block cache; the write path flowing through NVRAM into log segments of compressed user data and metadata entries)
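The following is a rough Python sketch of the steps listed above, with plain dicts and lists standing in for NVRAM, the log, the pyramid, and the dedup hint table. All names are made up; it is meant only to show the order of operations (pattern check, verified dedup, NVRAM commit, then log and pyramid updates), not Purity's actual implementation.

```python
import zlib

ZERO_BLOCK = b"\x00" * 4096

# Hypothetical, greatly simplified state: the real system uses NVRAM devices,
# log segments on SSD, and the metadata pyramid described later.
nvram, log, pyramid, dedup_hints, vol_to_medium = [], [], {}, {}, {"vol1": 17}

def write_block(volume, lba, data):
    medium = vol_to_medium[volume]            # translate <volume, LBA> -> <medium, LBA>
    key = (medium, lba)

    if data == ZERO_BLOCK:                    # pattern block: record the pattern, write no data
        pyramid[key] = ("pattern", "zero")
        return "ok"

    hint = zlib.crc32(data)                   # cheap non-cryptographic hint (not the real hash)
    dup = dedup_hints.get(hint)
    if dup is not None and log[dup][1] == data:   # verify byte-for-byte before deduplicating
        pyramid[key] = ("dup", dup)
        return "ok"

    nvram.append((key, data))                 # commit point: acknowledge the host here
    log.append((key, data))                   # later: (compressed) data + metadata into the log
    pyramid[key] = ("seg", len(log) - 1)      # mapping entry into the pyramid
    dedup_hints[hint] = len(log) - 1
    return "ok"

def read_block(volume, lba):
    key = (vol_to_medium[volume], lba)        # newest mapping wins (a dict models occlusion)
    kind, ref = pyramid[key]
    if kind == "pattern":
        return ZERO_BLOCK
    return log[ref][1]

write_block("vol1", 5, b"hello".ljust(4096, b"\x00"))
print(read_block("vol1", 5)[:5])              # -> b'hello'
```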

Basic data storage: segments
- All data and metadata are stored in segments
- Segment numbers are monotonically increasing sequence numbers: never reused!
- Segments are distributed across a set of devices (MLC SSDs)
  - Each segment can use a different set of SSDs
- The write unit is the atomic unit in which data goes to an SSD: a segment need not be written all at once
- Segments are protected by RAID
  - Across devices
  - Within a single device
(Figure: a segment composed of write units plus P and Q parity)

Inside a write unit
- A write unit is committed to an SSD atomically
- Contents include
  - Compressed user data
  - Checksums
  - Internal parity
- Page read errors can be detected and corrected locally (see the sketch below)
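As a hedged illustration of how a page read error can be detected and repaired locally, the sketch below builds a toy write unit with per-page checksums and a single XOR parity page. The real format, checksum, and parity code are different; only the detect-then-rebuild idea carries over.

```python
# Illustrative intra-write-unit protection: per-page CRCs plus one XOR parity
# page. A page whose checksum fails is rebuilt from the remaining pages and
# the parity, without going to other drives.
import zlib

def build_write_unit(pages):
    parity = bytearray(len(pages[0]))
    for page in pages:
        for i, b in enumerate(page):
            parity[i] ^= b
    checksums = [zlib.crc32(bytes(p)) for p in pages]
    return {"pages": [bytearray(p) for p in pages], "parity": bytes(parity), "checksums": checksums}

def read_page(unit, idx):
    page = bytes(unit["pages"][idx])
    if zlib.crc32(page) == unit["checksums"][idx]:
        return page
    # checksum mismatch: rebuild this page from the parity and the other pages
    rebuilt = bytearray(unit["parity"])
    for j, other in enumerate(unit["pages"]):
        if j != idx:
            for i, b in enumerate(other):
                rebuilt[i] ^= b
    return bytes(rebuilt)

unit = build_write_unit([b"aaaa", b"bbbb", b"cccc"])
unit["pages"][1][0] = 0            # simulate a corrupted page read
print(read_page(unit, 1))          # -> b'bbbb', reconstructed from parity
```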

Storing data
- Data and metadata are written to a (single) log
- Stored first in high-performance NVRAM: an SLC SSD or other fast block-addressable NVRAM
- Data is considered committed immediately when it hits NVRAM
- Later, written to bulk storage (MLC SSD)

Log persistence
- The system stores
  - The full log (containing data and metadata)
  - Metadata-only indexes
- Metadata indexes can be rebuilt from the log

Storing data: details
- Data is stored in tuples
  - The key can consist of multiple (fixed-length) bit-strings, e.g. <medium, LBA>
  - The newest instance of a tuple occludes older instances (see the sketch below)
- Logs are append-only
  - Data is written in sorted order in each segment
  - Consistency isn't an issue, but we need to garbage collect
- Individual tuples are located quickly using a mapping structure, the pyramid: the index allows tuples to be found quickly
(Figure: example <MEDIUM, LBA> -> SEG+PG tuples, e.g. <61, 225> -> 199, <61, 291> -> 252 and later 901, <87, 103> -> 532)
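A small sketch of the lookup rule described above, assuming each segment's tuples are sorted by key and segments are searched newest-first. The tuple values are taken loosely from the figure; everything else is illustrative.

```python
# Tuples sorted by key within each segment; segments searched newest-first,
# so the first hit occludes any older instance of the same <medium, LBA> key.
import bisect

def lookup(segments, medium, lba):
    """segments: list of sorted [(medium, lba, seg_pg), ...], oldest first."""
    key = (medium, lba)
    for tuples in reversed(segments):                 # newest segment first
        keys = [(m, l) for m, l, _ in tuples]
        i = bisect.bisect_left(keys, key)             # binary search within the segment
        if i < len(tuples) and keys[i] == key:
            return tuples[i][2]                       # newest instance wins
    return None

old = [(61, 225, 199), (61, 291, 252), (61, 307, 802), (87, 103, 532)]
new = [(61, 291, 901), (65, 220, 380)]                # rewrites medium 61, LBA 291
print(lookup([old, new], 61, 291))                    # -> 901, occluding 252
print(lookup([old, new], 61, 307))                    # -> 802, still visible
```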

The power of the pyramid
- Metadata entries are kept in a pyramid (see the sketch below)
  - New entries are written to the top and may occlude entries lower down
  - Add new levels to the top
  - All levels are read-only (once written)
  - Search proceeds top-down
- Multiple types of metadata
  - Mapping table: entries point to data locations
  - Link table: lists non-local references
  - Dedup table: locations of potential duplicates
- Segments are used for the underlying storage
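Here is a minimal pyramid sketch using plain Python dicts for the levels; it only demonstrates the top-down search and occlusion behavior, not the on-SSD representation or the different metadata table types.

```python
# Read-only levels stacked newest-on-top; lookups scan top-down, and adding a
# level never modifies an existing one.
class Pyramid:
    def __init__(self):
        self.levels = []                       # levels[-1] is the newest (top)

    def push_level(self, entries):
        self.levels.append(dict(entries))      # existing levels stay read-only

    def lookup(self, key):
        for level in reversed(self.levels):    # search proceeds top-down
            if key in level:
                return level[key]              # a newer entry occludes older ones
        return None

pyr = Pyramid()
pyr.push_level({("medium61", 225): "seg199"})        # t0
pyr.push_level({("medium61", 225): "seg901",         # t1 occludes the t0 entry
                ("medium61", 307): "seg802"})
print(pyr.lookup(("medium61", 225)))                 # -> seg901
```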

Merging (flattening) pyramid layers
- One or more adjacent layers can be merged (see the sketch below)
- Remove tuples that are
  - Invalid: the tuple refers to a no-longer-extant source
  - Occluded: a newer tuple exists for the same key
- Result: faster searches and (some) reclaimed space
- The operation is idempotent
  - A search on the merged layer has identical results to the inputs
  - Failure partway through isn't an issue! Switch over lazily
- Improves lookup speed, but doesn't reclaim data storage...
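A sketch of flattening two adjacent levels under the rules above: keep the newer entry for each key and drop entries whose source no longer exists. The is_valid callback is a stand-in for whatever invalidation check the real system performs; re-running the merge on the same inputs yields the same output, which is what makes it idempotent.

```python
# Merge an older level into a newer one, dropping occluded and invalid tuples.
def merge_levels(newer, older, is_valid=lambda key, value: True):
    merged = {}
    for key, value in older.items():
        if key not in newer and is_valid(key, value):   # occluded or invalid entries vanish
            merged[key] = value
    for key, value in newer.items():
        if is_valid(key, value):
            merged[key] = value
    return merged

t1 = {("m61", 225): "seg901", ("m61", 307): "seg802"}
t0 = {("m61", 225): "seg199", ("m87", 103): "seg532"}
flat = merge_levels(t1, t0, is_valid=lambda k, v: v != "seg532")  # pretend seg532 was freed
print(flat)   # both t1 entries survive; the occluded and invalid t0 entries are gone
```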

Identifying data: mediums
- Each block is stored in a medium
  - A medium is just an identifier
  - The medium is part of the key used to look up a block
- Mediums can refer to other mediums for part (or all!) of their data (see the sketch below)
- Information is tracked in a medium table, which typically has relatively few entries

Medium ID   Start   Length   Underlying   Offset   State
101         0       2000     0            0        Q
102         0       2000     101          0        QU
103         0       600      102          0        R
103         600     400      102          0        RU
103         1000    1000     101          -1000    Q
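The sketch below resolves a <medium, LBA> through the example table above by chasing "underlying" references and applying offsets. In the real system the pyramid is consulted for each medium before falling through to its underlying medium, so this only illustrates the range arithmetic; the function name and table layout are illustrative.

```python
# Each row maps a range of one medium onto an underlying medium at an offset;
# a lookup may chase several rows before reaching a medium that holds the data
# itself (represented here by underlying == 0). Row values come from the table above.
TABLE = [
    # (medium, start, length, underlying, offset, state)
    (101, 0, 2000, 0, 0, "Q"),
    (102, 0, 2000, 101, 0, "QU"),
    (103, 0, 600, 102, 0, "R"),
    (103, 600, 400, 102, 0, "RU"),
    (103, 1000, 1000, 101, -1000, "Q"),
]

def resolve(medium, lba):
    """Return the <medium, LBA> pair at which the block is actually addressed."""
    while True:
        row = next((r for r in TABLE
                    if r[0] == medium and r[1] <= lba < r[1] + r[2]), None)
        if row is None or row[3] == 0:       # no underlying medium: stop here
            return medium, lba
        medium, lba = row[3], lba + row[4]   # follow the reference, applying the offset

print(resolve(103, 1500))   # -> (101, 500): 103's range [1000,2000) maps onto 101 at offset -1000
print(resolve(103, 200))    # -> (101, 200): via medium 102, then 101
```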

Mediums & volumes & snapshots (oh, my!)
- Entries in the medium table make a directed graph
  - A read goes to a specific medium and may need to look up multiple <medium, LBA> keys
- A volume points at one medium
  - May create a new medium and point the volume at it instead
- Medium IDs are never reused
- Once closed, a medium is stable: its mappings never change
  - Fast snapshots
  - Fast copy offload
- Occasionally need to flatten the medium graph to reduce lookups
(Figure: a directed graph of mediums with RW/RO status and underlying references, plus a table mapping volumes to their current mediums)

Invalidating large ranges of sequence numbers
- All data structures in the system are append-only!
- It is often useful to invalidate lots of sequence numbers at once
  - A medium being removed
  - A segment being deallocated (link & mapping entries invalidated)
- Keep a separate table listing invalid sequence ranges (see the sketch below)
  - Assumes invalid sequence numbers never again become valid!
  - Merge adjacent ranges together
- The invalid sequence number table is limited in size by the number of valid sequence numbers
  - In turn, limited by actual storage space
  - In practice, this table is much smaller than the maximum possible size
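A minimal sketch of such an invalid-range table, assuming sequence numbers never become valid again: invalidations arrive as ranges, adjacent or overlapping ranges are merged on insert, and validity checks are a simple range lookup. The class and method names are illustrative.

```python
# Invalid sequence ranges kept sorted and disjoint; merging keeps the table compact.
import bisect

class InvalidRanges:
    def __init__(self):
        self.ranges = []                       # sorted, disjoint (start, end) pairs, end exclusive

    def invalidate(self, start, end):
        merged = []
        for s, e in sorted(self.ranges + [(start, end)]):
            if merged and s <= merged[-1][1]:  # touches or overlaps the previous range: merge
                merged[-1] = (merged[-1][0], max(merged[-1][1], e))
            else:
                merged.append((s, e))
        self.ranges = merged

    def is_invalid(self, seq):
        i = bisect.bisect_right(self.ranges, (seq, float("inf"))) - 1
        return i >= 0 and self.ranges[i][0] <= seq < self.ranges[i][1]

inv = InvalidRanges()
inv.invalidate(100, 200)
inv.invalidate(200, 250)                       # merges with the previous range
print(inv.ranges, inv.is_invalid(150), inv.is_invalid(250))   # [(100, 250)] True False
```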

Garbage collection
- Only operates on data log segments (GC frees unused pyramid segments too...)
- Scan metadata from one or more segments
  - Check data for liveness
  - Read & rewrite data if still live
  - Segregate deduplicated data into their own segments for efficiency
- Rewrites are done just like real writes
- A segment may be invalidated when it contains no live data
- The operation is idempotent!

Picking segments to garbage collect (see the sketch below)
- Use our own space-reporting data to estimate the occupancy of any given segment
- Based on the estimate, do a cost-benefit analysis
  - Benefit: the space we get back
  - Cost
    - Higher for more live data: more reads & writes
    - Higher if the segment has more log pages: more data to scan during GC
    - Higher for live dedup data and more incoming references: GC on deduplicated data is more expensive
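The scoring function below is a hedged illustration of this cost-benefit idea; the weights and the formula are invented for the example and are not Purity's actual heuristics. The point is only that reclaimable space is weighed against the work of scanning metadata and moving live (especially deduplicated) data.

```python
# Toy GC candidate scoring: benefit (reclaimable space) divided by estimated work.
def gc_score(segment_bytes, est_live_bytes, log_pages, live_dedup_bytes, incoming_refs,
             scan_cost_per_page=1.0, rewrite_cost_per_byte=1.0, dedup_penalty=4.0):
    benefit = segment_bytes - est_live_bytes                  # space we get back
    cost = (est_live_bytes * rewrite_cost_per_byte            # reads + rewrites of live data
            + log_pages * scan_cost_per_page                  # metadata scanning during GC
            + live_dedup_bytes * dedup_penalty                # deduplicated data is pricier to GC
            + incoming_refs * dedup_penalty)
    return benefit / (cost + 1.0)

candidates = {
    "seg_812": gc_score(8 << 20, 1 << 20, 200, 0, 0),         # mostly dead: cheap, big win
    "seg_440": gc_score(8 << 20, 6 << 20, 900, 2 << 20, 500), # mostly live and dedup'd: poor pick
}
print(max(candidates, key=candidates.get))                     # -> seg_812
```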

Tracking reference counts
- Need to know when data can be freed
  - Dedup allows incoming references from elsewhere
- Keep a link table: a list of external links to a given segment
  - No exact reference count!
  - Easy to know where to look for incoming pointers during garbage collection
- The link table is maintained as part of the pyramid
  - All links to a block look the same!
  - Need not be exact: correctness is verified at GC
  - No harm in outdated information
  - Information never becomes incorrect: segment IDs are never reused!

Data reduction: compression & deduplication

Compression
- Primary concern is fast decompression speed: lots more reads than writes
- Patterned blocks: the pattern is stored in the index
  - Improves performance: writes never done
  - Saves a lot of space
  - Allows compact representations of large zeroed ranges

Deduplication (at the 512B sector level)
- Deduplicate data as soon as possible
  - Reduce writes to SSD
  - Increase apparent capacity
- The dedup process is somewhat inexact
  - OK to miss some duplicates: find them later
  - OK to have false positives: checked against actual data

Finding duplicates
- Calculate hashes as part of finding pattern blocks
- No need for cryptographically secure hashes!
- Use an in-memory table for hints, but verify byte-by-byte before finalizing dedup (see the sketch below)
- Tables can use shorter hashes
  - Smaller tables
  - Faster hashes (non-cryptographic)
- Different tables (in-memory vs. SSD-resident) use different-size hashes to adjust the false positive rate
  - The only impact is the number of extraneous reads...
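A sketch of the hint-then-verify lookup: a deliberately short, non-cryptographic hash narrows the candidates and a byte-for-byte compare makes the final call, so a false positive costs only an extra read and a miss only means a duplicate is stored until a later pass finds it. Class, method, and parameter names are illustrative.

```python
# Short-hash hint table with byte-for-byte verification before deduplicating.
import zlib

def short_hash(block, bits=16):
    return zlib.crc32(block) & ((1 << bits) - 1)   # deliberately small: compact hint table

class DedupHints:
    def __init__(self, read_block, bits=16):
        self.read_block = read_block               # callback: location -> block contents
        self.bits = bits
        self.table = {}                            # short hash -> list of candidate locations

    def find_duplicate(self, block):
        for loc in self.table.get(short_hash(block, self.bits), []):
            if self.read_block(loc) == block:      # verify before trusting the hint
                return loc
        return None                                # no (verified) duplicate found

    def record(self, block, location):
        self.table.setdefault(short_hash(block, self.bits), []).append(location)

store = {0: b"A" * 512, 1: b"B" * 512}             # 512B sectors, matching the dedup granularity
hints = DedupHints(store.get)
for loc, blk in store.items():
    hints.record(blk, loc)
print(hints.find_duplicate(b"A" * 512), hints.find_duplicate(b"C" * 512))   # -> 0 None
```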

Deduplication and garbage collection
- It is easy to find local references to blocks in a segment; how are remote references found?
- Use the link table
  - A list of logical addresses that reference a given block in the log
  - If a lookup of the logical address no longer points here, the reference must be invalid
- When a new reference to a duplicate block is written, record a new link table entry
  - The old entry is superseded...

Segregated deduplicated data
- During dedup, segregate deduplicated blocks into their own segments
- Deduplicated data is less likely to become invalid
  - It is more expensive to GC deduplicated data: keep it separate from other data to save processing
- References to dedup blocks remain in regular segments
  - References to dedup blocks can die relatively quickly
  - Data in dedup blocks lives a long time: until all references die

Flash Personality Layer
- Goal: encapsulate SSD properties in a software layer
- Enables per-SSD-model optimization of
  - RAID / data protection
  - Data layout & striping
  - I/O timing
- Allows Purity to use multiple drive types efficiently with minimal software change

Data protection: RAID-3D
- Calculate parity in two directions
  - Across multiple SSDs (traditional RAID)
  - Within a single SSD: protects against single-page read errors
- Turn off drive reads during writes to that drive (if needed)
  - Often, writes interfere with reads, making both slower
  - The latency of reads during write operations can be unpredictable
  - During a write to a drive, reads to that drive are served with a RAID rebuild (see the sketch below)
(Figure: a segment striped across drives as D D D D D P Q, one write unit per drive)
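To illustrate "reads done with RAID rebuild", the sketch below serves a read to a busy drive by reconstructing its chunk from the other chunks plus parity. Single XOR parity is used here for brevity, whereas the array keeps both P and Q per segment; the names and stripe layout are illustrative.

```python
# Read around a busy drive: if the target is being written, rebuild its chunk
# from the other data chunks plus parity instead of waiting.
def xor_bytes(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def read_chunk(stripe, drive, busy_writing):
    """stripe: dict drive -> chunk, including a 'P' parity chunk."""
    if drive not in busy_writing:
        return stripe[drive]                     # normal read
    others = [c for d, c in stripe.items() if d != drive]
    return xor_bytes(others)                     # RAID-style rebuild avoids the busy drive

stripe = {"D0": b"\x01\x02", "D1": b"\x04\x08", "D2": b"\x10\x20"}
stripe["P"] = xor_bytes(list(stripe.values()))   # parity over the data chunks
print(read_chunk(stripe, "D1", busy_writing={"D1"}))   # -> b'\x04\x08', without touching D1
```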

Customize layout of data on drives
- Lay out segment shards at offsets that optimize drive performance
  - May vary on a per-drive-model basis
- Lay out segment data across drives to trade off parallelism against the number of drives per (large) request
- Fractal layout
(Figure: a fractal layout of write-unit numbers across drives D0-D5)

Architecture summary
- Manage SSDs with append-only data structures
  - Information is never changed, only superseded
  - Keep indexes for fast lookup
  - Use an append-only invalidation table to avoid unchecked table growth
- Run inline dedup and compression at high speed
  - Optimized algorithms
  - Best-effort algorithms at initial write; improve later
- Optimize data layout and handling for SSDs
  - RAID reconstruct during normal writes
  - Choose the optimal layout for each SSD

Questions?