Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components. J. Colgrove, J. Davis, J. Hayes, E. Miller, C. Sandvig, R. Sears, A. Tamches, N. Vachharajani, and F. Wang
Gala has already introduced HDDs. What exactly is an SSD? The heroes of this talk.
Outline: Introduction to SSD; Unique properties of SSD; The Pure Storage system; System processes; Real-world deployments
SSD 101: An SSD is composed of flash memory arrays. Block: 2-16 MB; page: 0.5-4 KB.
Write/Read/Erase Operations in SSD: A full bucket (charged cell) means 1; an empty (or almost empty) bucket means 0. We can read and write individual pages; however, we must erase an entire block at a time. [Figure: cells at the 1 level and the 0 level]
Multi-level flash memories
Write/Read/Erase Operations in MLC SSD: Data is represented by the amount of electrical charge in each cell. Cells can be written individually; however, they can only be erased by erasing an entire block. [Figure: cells ranging from the 0 level to the q-1 level]
Google's Data Centers
It even determines the location.
So why not always use SSDs? Traditionally, data-center software and design were optimized for HDDs.
Purity: an all-SSD enterprise storage system. Claims to be cheaper, with higher performance.
Outline: Introduction to SSD; Unique properties of SSD; The Pure Storage system; System processes; Real-world deployments
SSD - unique properties
SSD unique properties #1: Wear. With each program/erase (P/E) cycle the media degrades. [Y. Cai et al., "Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis, and Modeling," DATE '13]
SSD unique properties #2: Garbage collection. We can write a single page but can only erase a whole block. Typical numbers: 64-128 pages per block, ~2048 blocks per flash drive. [Figure: pages laid out within blocks]
SSD unique properties #2: Garbage collection (cont.). Suppose we want to update page 0. First option: erase the whole block and rewrite it. Second option: write the new version elsewhere, using overprovisioned spare pages. [Figure: pages 0-15 laid out across blocks, with page 0 being updated]
SSD unique properties #2: Garbage collection (cont.). Now we also want to update pages 4, 5, and 8 - but we are full! And now what? Garbage collection: reclaim blocks whose pages are mostly stale. Trade-off: space (overprovisioning) vs. lifetime (extra erase cycles). [Figure: blocks filled with live and stale pages]
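The out-of-place-update and garbage-collection mechanics above can be sketched with a toy flash translation layer. All names and sizes here are illustrative assumptions, not Purity's or any real drive's FTL:

```python
class ToyFTL:
    """Toy flash translation layer: page-granularity writes, block-granularity erases."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        # Each physical page holds a logical page number, "stale", or None (free).
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.map = {}                          # logical page -> (block, page)
        self.erase_counts = [0] * num_blocks   # wear (P/E cycles) per block

    def _find_free(self):
        for b, blk in enumerate(self.blocks):
            for p, slot in enumerate(blk):
                if slot is None:
                    return b, p
        return None

    def write(self, logical):
        loc = self._find_free()
        if loc is None:                        # device full: reclaim space first
            self._garbage_collect()
            loc = self._find_free()
        b, p = loc
        if logical in self.map:                # out-of-place update: old copy goes stale
            ob, op = self.map[logical]
            self.blocks[ob][op] = "stale"
        self.blocks[b][p] = logical
        self.map[logical] = (b, p)

    def _garbage_collect(self):
        # Victim: the block with the most stale pages (fewest live pages to copy).
        victim = max(range(len(self.blocks)),
                     key=lambda b: self.blocks[b].count("stale"))
        live = [pg for pg in self.blocks[victim] if pg not in (None, "stale")]
        self.blocks[victim] = [None] * self.pages_per_block   # erase whole block
        self.erase_counts[victim] += 1                        # one more P/E cycle
        for pg in live:
            del self.map[pg]
        for pg in live:                        # copy surviving live pages back
            self.write(pg)
```

Repeatedly updating the same four logical pages fills the toy drive with stale copies; the next write then triggers a garbage collection that erases a fully-stale block. The per-block erase counter also shows why wear leveling matters.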
Main conclusions for SSD: fast random access (unlike disks); performance heavily depends on the workload; the response of the device is not uniform. [Figure: writing 3 pages under two scenarios, A and B, landing differently across blocks]
SSD unique properties #3: Wear leveling. Let us assume we keep updating pages 0, 1, 2, 3. What will happen? The blocks holding those pages are erased over and over and wear out first; wear leveling spreads writes evenly across all blocks. [Figure: pages 0-3 being repeatedly rewritten]
The system benefits from fast access: some processes no longer need to run from cache, whereas an HDD system requires much more (expensive) DRAM for caching. SSDs can sustain system processes (many IOPS) that are hard to do on HDDs: deduplication, read speedup, log-structured file systems, etc.
Outline: Introduction to SSD; Unique properties of SSD; The Pure Storage system; System processes; Real-world deployments
The Pure Storage storage system
Comparison between Purity and a disk-based system. We will try to explain this magic.
Implementation: 12 × (10-16) Gb/s; 11-24 MLC drives. (This part is actually SLC flash.)
Basic Architecture: Each segment is striped across multiple SSDs (7+2 drives; 1 MB write block; 8 MB taken from a single SSD). Reed-Solomon coding is used to tolerate two SSD failures. The parity pages enable correcting a single corrupted page without reading the rest of the SSDs.
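As a simplified illustration of how parity lets the array rebuild data without every drive cooperating, here is a single-parity (XOR, RAID-4-style) sketch. This is an assumption-laden stand-in: the real system uses Reed-Solomon over 7+2 drives, which also survives a second failure.

```python
def stripe(data: bytes, ndata: int):
    """Split data into ndata equal shards plus one XOR parity shard."""
    shard_len = -(-len(data) // ndata)          # ceiling division
    shards = [data[i * shard_len:(i + 1) * shard_len].ljust(shard_len, b"\0")
              for i in range(ndata)]
    parity = bytearray(shard_len)
    for sh in shards:
        for i, byte in enumerate(sh):
            parity[i] ^= byte                   # parity = XOR of all shards
    return shards, bytes(parity)

def recover(shards, parity, lost: int):
    """Rebuild the lost shard: XOR the parity with all surviving shards."""
    out = bytearray(parity)
    for j, sh in enumerate(shards):
        if j != lost:
            for i, byte in enumerate(sh):
                out[i] ^= byte
    return bytes(out)
```

Losing any single shard, the XOR of the parity with the survivors reproduces it exactly; Reed-Solomon generalizes this to two (or more) lost shards.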
Outline: Introduction to SSD; Unique properties of SSD; The Pure Storage system; System processes; Real-world deployments
Processes in the storage system
Compression 101: Column-oriented database management systems. Insights: a column layout reduces seek time on HDDs (for example, if we wish to count how many people earn more than 48,000), and columns contain many repeating patterns.
Compression 101: Run-Length Encoding
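Run-length encoding replaces each run of identical symbols with a (symbol, count) pair; a minimal encoder/decoder sketch (illustrative, not any particular system's implementation):

```python
def rle_encode(data: str) -> list:
    """Encode a string as (symbol, run_length) pairs."""
    runs = []
    for ch in data:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((ch, 1))               # start a new run
    return runs

def rle_decode(runs) -> str:
    """Expand (symbol, run_length) pairs back into the original string."""
    return "".join(ch * n for ch, n in runs)
```

On column data with many repeating patterns (the previous slide's insight), long runs compress to a handful of pairs.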
Compression 101: Entropy encoding. Data centers use lossless compression. Let us consider a text file such as aabacfffeedcbaaa, with symbol frequencies (in thousands) a:45, b:13, c:12, d:16, e:9, f:5. With a fixed 3-bit code per symbol, the storage needed to store the file is (45+13+12+16+9+5)×1000×3 = 300,000 bits.
Compression 101 (cont.). But what if we give shorter codes to more frequent symbols? The storage needed becomes (45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4)×1000 = 224,000 bits: a 25% reduction in needed storage! We pay by running an encoder and decoder for each read/write operation. This is the Huffman code.
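The 224,000-bit figure can be reproduced by actually building the Huffman code for these frequencies. A sketch using Python's heapq (the frequencies, in thousands, are the slide's):

```python
import heapq

def huffman_code(freqs):
    """Build a Huffman code by repeatedly merging the two lightest subtrees."""
    # Heap entries: (weight, tiebreaker, {symbol: code-so-far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}        # left subtree gets a 0
        merged.update({s: "1" + c for s, c in c2.items()})  # right subtree gets a 1
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}  # in thousands
codes = huffman_code(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs) * 1000
```

Here total_bits comes out to 224,000: symbol a gets a 1-bit code, b, c, d get 3 bits, and e, f get 4 bits, matching the slide's arithmetic.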
Deduplication: elimination of duplicate copies of repeating data, replacing duplicated files/data blocks with pointers. [Figure: 16 logical blocks mapped onto 11 unique physical blocks] Deduplication ratio: 16/11 ≈ 1.45.
Deduplication (cont.). Real-life examples of a single file being saved many times on the same server: 1000 Dropbox accounts downloading homework 1 in algebra; mail attachments; streaming music files from the cloud; backups! What is the practical storage reduction from deduplication and compression? Typically 3-10× (and can even reach 50× for backups).
Block Deduplication: deduplication can be performed on blocks of data (and not necessarily on whole files). [Figure: example of matching blocks shared across different files]
Deduplication in Purity: Tracks deduplication blocks at 512 B granularity but keeps the hash value of only every 8th block; on a hash match, the candidate is verified byte by byte. It can therefore detect duplicates of 8×512 B = 4 KB and larger. Purity performs inline deduplication - detecting duplicates before they are written to SSD, looking only in recently written data and in frequently duplicated data - which accounts for most of the deduplication ratio, plus additional deduplication during garbage collection.
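The match-then-verify idea can be sketched as follows. This is a toy model under stated assumptions: it hashes every 512 B block rather than sampling every 8th as Purity does, and the block store and addressing are hypothetical:

```python
import hashlib

BLOCK = 512

def dedup_write(stream: bytes, store: dict, index: dict):
    """Toy inline deduplication: hash each block, verify matches byte by byte.

    store: physical address -> block bytes; index: hash -> physical address.
    Returns the list of physical addresses referenced by this write.
    """
    refs = []
    for off in range(0, len(stream), BLOCK):
        block = stream[off:off + BLOCK]
        h = hashlib.sha1(block).hexdigest()
        if h in index and store[index[h]] == block:  # byte-by-byte verification
            refs.append(index[h])                    # duplicate: just store a pointer
        else:
            addr = len(store)                        # hypothetical address scheme
            store[addr] = block
            index[h] = addr
            refs.append(addr)
    return refs
```

Writing three blocks where the first and third are identical stores only two physical blocks; the byte-by-byte check guards against the (astronomically unlikely, but possible) hash collision.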
Log-Structured File System. In a nutshell: to overcome the high seek times of disks, data is buffered and appended to a log, creating long sequential writes (File #1, File #2, update to File #1, update to File #2, File #3, ...). This structure matches the way we write to SSDs! No need to buffer the data in expensive RAM; garbage collection is an inherent SSD process; and the wear-leveling algorithms in flash memories avoid fragmentation.
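A toy append-only log illustrating the idea (a sketch only: real LFS segments, checkpoints, and cleaning are omitted):

```python
class ToyLog:
    """Append-only log: every write goes to the tail, never overwrites in place."""

    def __init__(self):
        self.log = []      # sequential records of (file name, data)
        self.index = {}    # file name -> log offset of its latest version

    def write(self, name, data):
        self.index[name] = len(self.log)   # point the index at the new tail record
        self.log.append((name, data))      # purely sequential append

    def read(self, name):
        return self.log[self.index[name]][1]
```

Updates leave the old record in place as garbage (to be reclaimed later), exactly the pattern that SSD garbage collection already handles.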
Data Read Speedup. The vast majority of slow SSD reads happen while the SSD is in the middle of a write. Purity avoids writing to more than two SSDs per ECC group at the same time, so when a read exceeds a latency threshold it can reconstruct the data from the other drives using the erasure code instead of waiting. [Figure: request latency over time, with a threshold above which reconstruction is used]
Real-world deployments
Reliability. The company collects telemetry data from its customers, including I/O request rates, request sizes, deduplication ratios, etc. By analyzing this data, the company can foresee failures and replace components before they fail, reaching 99.999% availability (about 5 minutes of downtime per year).
Reliability (cont.). SSDs have proven very reliable: in all of Purity's data centers (the total drive count is not given), only two SSDs have failed. Most customers never approach the P/E ratings of consumer MLC drives, and the company offers free SSD replacement due to wear. To fix errors, Purity uses ECC and rewrites data that has not been accessed for a long time.
Database Deployments. Customers usually deploy dozens or hundreds of database instances on top of a single Purity array. The 5-minute rule: data accessed more often than once every 5 minutes belongs in RAM; colder data belongs on disk. Several assumptions for the analysis: the deduplication ratio is usually 3-8 (for document databases it is ~10), and I/Os are 55 KB on average.
Relative Cost (based on data from customers). Disks are no good for high-performance needs. Without data reduction, store in RAM everything that you can afford to lose. With data reduction, never cache cold data (data accessed less frequently than every 30 minutes).
Summary. SSDs are fast random-access storage devices with unique properties such as wear, garbage collection, etc. By using an all-SSD storage system, it is possible to enhance system processes in ways that are game-changing. Therefore, although SSDs are more expensive, the Purity system is cost-effective.
Questions?