Gecko: Contention-Oblivious Disk Arrays for Cloud Storage
Ji-Yong Shin, Cornell University
In collaboration with Mahesh Balakrishnan (MSR SVC), Tudor Marian (Google), and Hakim Weatherspoon (Cornell)
FAST 2013
Cloud and Virtualization: What Happens to Storage?
Guest VMs issue I/O that is sequential within each VM, but the VMM multiplexes the streams onto shared disks, where the combined pattern becomes random.
[Figure: throughput (MB/s) vs. number of sequential writers on a 4-disk RAID-0, sequential vs. random access under VM+EXT4]
I/O CONTENTION causes throughput collapse
Existing Solutions for I/O Contention?
I/O scheduling: reordering I/Os
- Entails increased latency for certain workloads
- May still require seeking
Workload placement: positioning workloads to minimize contention
- Requires prior knowledge or dynamic prediction
- Predictions may be inaccurate
- Limits freedom of placing VMs in the cloud
Log-Structured File System to the Rescue? [Rosenblum et al. 91]
Log all writes at the log tail: writes to arbitrary logical addresses (0, 1, 2, ..., N) become one sequential stream on the shared disk.
[Figure: incoming writes appended at the log tail of the shared disk]
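To make the idea concrete, here is a minimal sketch of log-structured writing in Python (illustrative only: `LogDisk` and its fields are invented for this example, not taken from LFS or Gecko code):

```python
# Minimal sketch of log-structured writes: every write, whatever its logical
# address, is appended at the tail, so the physical pattern is sequential.

class LogDisk:
    def __init__(self, num_blocks):
        self.log = [None] * num_blocks  # physical blocks, filled in order
        self.tail = 0                   # next physical append position
        self.l2p = {}                   # logical address -> physical address

    def write(self, logical_addr, data):
        """Append at the tail; the previous version becomes garbage."""
        self.log[self.tail] = data
        self.l2p[logical_addr] = self.tail
        self.tail += 1

    def read(self, logical_addr):
        return self.log[self.l2p[logical_addr]]

disk = LogDisk(1024)
for addr in (7, 3, 7, 100):      # random logical writes...
    disk.write(addr, b"data")    # ...become sequential physical writes
assert disk.l2p[7] == 2          # latest version of block 7 sits near the tail
```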
Challenges of the Log-Structured File System
Garbage collection is the Achilles heel of LFS [Seltzer et al. 93, 95; Matthews et al. 97]
GC reads live data from the log head while new writes go to the log tail, so GC and application reads still cause disk seeks.
[Figure: GC reads from the log head contending with writes at the log tail on the shared disk]
Challenges of the Log-Structured File System
Garbage collection is the Achilles heel of LFS: 2-disk RAID-0 setting, LFS GC under a write-only synthetic workload.
[Figure: GC, aggregate, and application throughput (MB/s) over time (s) for RAID-0 + LFS, plotted against the max sequential throughput]
Throughput falls by 10X during GC
Problem: Increased virtualization leads to increased disk seeks and kills performance.
RAID and LFS do not solve the problem.
Rest of the Talk
- Motivation
- Gecko: contention-oblivious disk storage
- Sources of I/O contention
- New technique: chained logging
- Implementation
- Evaluation
- Conclusion
What Causes Disk Seeks?
Contention between VMs: write-write, read-read, and write-read
[Figure: reads and writes from VM 1 and VM 2 interleaving on the shared disk]
What Causes Disk Seeks?
Contention between VMs: write-write, read-read, and write-read
Logging adds: write-GC read and read-GC read
[Figure: GC reads contending with VM reads and writes under logging]
Principle: A single sequentially accessed disk is better than multiple randomly seeking disks
Gecko's Chained Logging Design
Separating the log tail from the body eliminates write-GC read contention: GC reads never interrupt the sequential write on the tail drive.
Garbage collection from log head to tail eliminates write-write and reduces write-read contention.
[Figure: physical address space chained across Disk 0, Disk 1, and Disk 2, with the log tail on the last drive and GC reading from the head]
1 uncontended drive >>faster>> N contended drives
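A minimal sketch of the chaining idea, assuming fixed-size drives and one tail drive at a time (`ChainedLog` is an invented name for illustration, not Gecko's in-kernel code):

```python
# Sketch of a log chained across several drives. Only the drive holding the
# tail receives writes; body drives serve reads (and GC reads), so the tail
# drive's sequential write stream is never interrupted.

class ChainedLog:
    def __init__(self, num_disks, blocks_per_disk):
        self.disks = [[None] * blocks_per_disk for _ in range(num_disks)]
        self.blocks_per_disk = blocks_per_disk
        self.tail = 0                    # global physical position
        self.l2p = {}                    # logical address -> physical address

    def tail_disk(self):
        return self.tail // self.blocks_per_disk

    def write(self, logical_addr, data):
        d, off = divmod(self.tail, self.blocks_per_disk)
        self.disks[d][off] = data        # sequential write, tail disk only
        self.l2p[logical_addr] = self.tail
        self.tail += 1

    def read(self, logical_addr):
        d, off = divmod(self.l2p[logical_addr], self.blocks_per_disk)
        return self.disks[d][off]        # usually a body disk, not the tail
```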
Gecko's Chained Logging Design: Smarter Compact-in-Body Garbage Collection
Instead of garbage collecting from the log head to the tail, Gecko can compact live data from the head into free space within the body, keeping GC writes off the tail drive. A sketch of both policies follows.
[Figure: GC from the log head into the body, versus GC from head to tail, across Disk 0, Disk 1, and Disk 2]
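The two GC policies could be sketched on top of the `ChainedLog` above; this is speculative pseudologic for illustration, and `move_to_tail_gc` and `compact_in_body_gc` are invented names:

```python
# Two GC policies over the ChainedLog sketch above (illustrative only).

def live_blocks(log, head, count):
    """Yield (physical, logical) pairs for live blocks near the log head."""
    for phys in range(head, head + count):
        for logical, mapped in log.l2p.items():
            if mapped == phys:
                yield phys, logical

def move_to_tail_gc(log, head, count):
    # Rewrite live data at the tail: GC writes share the tail drive with
    # application writes (still sequential, but they steal tail bandwidth).
    for phys, logical in list(live_blocks(log, head, count)):
        d, off = divmod(phys, log.blocks_per_disk)
        log.write(logical, log.disks[d][off])

def compact_in_body_gc(log, head, count, free_body_slots):
    # Move live data into already-freed slots within the body, so the tail
    # drive sees only application writes.
    for phys, logical in list(live_blocks(log, head, count)):
        d, off = divmod(phys, log.blocks_per_disk)
        data = log.disks[d][off]
        dest = free_body_slots.pop()     # a freed physical slot in the body
        dd, doff = divmod(dest, log.blocks_per_disk)
        log.disks[dd][doff] = data
        log.l2p[logical] = dest
```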
Gecko Caching
What happens to reads going to the tail drive?
Tail cache (LRU: RAM for hot data, MLC SSD for warm data) reduces write-read contention at the tail.
Body cache (flash) reduces read-read contention at the body drives.
[Figure: tail cache (RAM + MLC SSD) in front of the tail drive and a flash body cache in front of Disks 0-2]
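One plausible shape for such a two-tier cache, sketched with Python's `OrderedDict` as an LRU (tier sizes and the demotion policy are assumptions, not Gecko's actual implementation):

```python
# Sketch of a two-tier tail cache: a small RAM LRU for hot pages that
# overflows into a larger flash (SSD) tier for warm pages.

from collections import OrderedDict

class TailCache:
    def __init__(self, ram_pages, ssd_pages):
        self.ram = OrderedDict()   # hot tier
        self.ssd = OrderedDict()   # warm tier
        self.ram_pages, self.ssd_pages = ram_pages, ssd_pages

    def insert(self, addr, data):
        self.ram[addr] = data
        self.ram.move_to_end(addr)
        if len(self.ram) > self.ram_pages:        # demote coldest RAM page
            old_addr, old_data = self.ram.popitem(last=False)
            self.ssd[old_addr] = old_data          # one SSD write per demotion
            if len(self.ssd) > self.ssd_pages:
                self.ssd.popitem(last=False)       # evict from the warm tier

    def lookup(self, addr):
        if addr in self.ram:
            self.ram.move_to_end(addr)
            return self.ram[addr]
        if addr in self.ssd:                       # promote warm page to RAM
            data = self.ssd.pop(addr)
            self.insert(addr, data)
            return data
        return None                                # miss: read goes to the tail disk
```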
Gecko Properties Summary
No write-write contention, no GC-write contention, and reduced read-write and read-read contention (via the tail and body caches)
Mirroring/striping each link of the chain adds fault tolerance and read performance
Power saving without consistency concerns
[Figure: chained log with tail and body caches; Disks 0-2 mirrored/striped]
Gecko Implementation
Primary map: in-memory logical-to-physical map with 4-byte entries over 4 KB pages; less than 8 GB of RAM for 8 TB of storage
Inverse map: physical-to-logical map in flash, written every 1024 writes; 8 GB of flash for 8 TB of storage
[Figure: in-memory map pointing into the on-disk log chained across Disks 0-2, with head/tail pointers separating filled from empty regions]
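A rough sketch of the two maps (the sizes follow the slide: a 4-byte entry per 4 KB page gives under 8 GB of RAM for 8 TB of storage; the batching mechanics beyond "written every 1024 writes" are guesses):

```python
# Sketch of Gecko's two maps: a flat in-RAM primary map and an inverse map
# batched to flash. Illustrative only; not the in-kernel data structures.

from array import array

PAGE = 4096
CHECKPOINT_INTERVAL = 1024

class GeckoMaps:
    def __init__(self, capacity_bytes):
        n = capacity_bytes // PAGE
        self.l2p = array("I", [0xFFFFFFFF] * n)  # 4-byte entry per 4 KB page
        self.pending = []                        # inverse entries not yet in flash
        self.writes = 0

    def record_write(self, logical_page, physical_page):
        self.l2p[logical_page] = physical_page
        self.pending.append((physical_page, logical_page))
        self.writes += 1
        if self.writes % CHECKPOINT_INTERVAL == 0:
            self.flush_inverse_map()

    def flush_inverse_map(self):
        # In the real system this batch would go to flash so the in-memory
        # primary map can be rebuilt after a crash.
        self.pending.clear()

maps = GeckoMaps(capacity_bytes=1 << 20)  # tiny capacity for the example
maps.record_write(logical_page=5, physical_page=0)
```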
Evaluation
1. How well does Gecko handle GC?
2. Performance of Gecko under real workloads?
3. Effect of varying Gecko chain length?
4. Effectiveness of the tail cache?
5. Durability of the flash-based tail cache?
Evaluation Setup
In-kernel version: implemented as a block device for portability, similar to software RAID
User-level emulator: for fast prototyping; runs block traces; tail cache support
Hardware: WD 600 GB HDD (2.5", 10K RPM, SATA-600; 512 GB of the 600 GB used) and Intel MLC (multi-level cell) SSD (240 GB, SATA-600)
How Well Does Gecko Handle GC?
2-disk setting; write-only synthetic workload: Gecko vs. Log + RAID-0
[Figure: GC, aggregate, and application throughput (MB/s) over time (s); Gecko's aggregate stays at the max sequential throughput of one disk, while Log + RAID-0 collapses from the max sequential throughput of two disks once GC starts]
Gecko's aggregate throughput always remains high: 3X higher aggregate and 4X higher application throughput
How Well Does Gecko Handle GC?
[Figure: Gecko's GC, aggregate, and application throughput over time, and application throughput under no GC, move-to-tail GC, and compact-in-body GC]
Application throughput can be preserved using smarter (compact-in-body) GC
MS Enterprise and MSR Cambridge Traces [Narayanan et al. 08, 09]

Trace Name     Est. Addr Space (GB)  Accessed (GB)  Read (GB)  Written (GB)  Total IO Req   Read Req     Write Req
prxy           136                   2,076          1,297      779           181,157,932    110,797,984  70,359,948
src1           820                   3,107          2,224      884           85,069,608     65,172,645   19,896,963
proj           4,102                 2,279          1,937      342           65,841,031     55,809,869   10,031,162
Exchange       4,822                 760            300        460           61,179,165     26,324,163   34,855,002
usr            2,461                 2,625          2,530      96            58,091,915     50,906,183   7,185,732
LiveMapsBE     6,737                 2,344          1,786      558           44,766,484     35,310,420   9,456,064
MSNFS          1,424                 303            201        102           29,345,085     19,729,611   9,615,474
DevDivRelease  4,620                 428            252        176           18,195,701     12,326,432   5,869,269
prn            770                   271            194        77            16,819,297     9,066,281    7,753,016
What Is the Performance of Gecko under Real Workloads?
Mix of 8 workloads (prn, MSNFS, DevDivRelease, proj, Exchange, LiveMapsBE, prxy, and src1); 6-disk configuration with 200 GB of data prefilled
Gecko: mirrored chain of length 3, tail cache (2 GB RAM + 32 GB SSD), body cache (32 GB SSD)
Log + RAID-10: mirrored log over a 3-disk RAID-0, LRU cache (2 GB RAM + 64 GB SSD)
[Figure: read and write throughput (MB/s) over time (s) for Gecko and Log + RAID-10]
Gecko showed less read-write contention and a higher cache hit rate; its throughput is 2X-3X higher
What Is the Effect of Varying Gecko Chain Length?
Same 8 workloads with 200 GB of data prefilled
[Figure: read and write throughput (MB/s) for varying chain lengths and for RAID-0; error bars show standard deviation]
Even a single uncontended disk performs well; separating reads from writes gives better performance
How Effective Is the Tail Cache?
Read hit rate of the tail cache (2 GB RAM + 32 GB SSD) on a 512 GB disk; 21 combinations of 4 to 8 MSR Cambridge and MS Enterprise traces
[Figure: RAM vs. SSD hit rate (%) per application mix, and hit rate vs. amount of data in disk (GB)]
At least an 86% read hit rate; RAM handles most of the hot data
The amount of data in the disk changes the hit rate, but the average stays above 80%
The tail cache can effectively resolve read-write contention
How Durable Is the Flash-Based Tail Cache?
Static analysis of SSD lifetime based on cache hit rate; using 2 GB of RAM in front of the SSD extends its lifetime
[Figure: SSD lifetime (days) at 40 MB/s for each of the 21 application mixes, showing a 2X-8X lifetime extension]
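A back-of-the-envelope version of that analysis (all numbers below are illustrative assumptions, not the paper's): if the RAM tier absorbs a fraction f of the write traffic headed for the SSD, the SSD's lifetime scales by 1/(1-f).

```python
# Static SSD-lifetime model. Endurance = capacity x P/E cycles; lifetime is
# endurance divided by the write rate that actually reaches the SSD.

def ssd_lifetime_days(capacity_gb, pe_cycles, write_mbps, ram_hit):
    total_endurance_mb = capacity_gb * 1024 * pe_cycles   # total writable MB
    ssd_write_mbps = write_mbps * (1.0 - ram_hit)         # writes reaching SSD
    return total_endurance_mb / ssd_write_mbps / 86400    # seconds -> days

# e.g. a 32 GB MLC SSD rated for 3000 P/E cycles under 40 MB/s of cache traffic
# (both ratings are assumed values for illustration):
print(ssd_lifetime_days(32, 3000, 40, 0.0))   # ~28 days with no RAM tier
print(ssd_lifetime_days(32, 3000, 40, 0.75))  # RAM absorbs 75% -> 4X longer
```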
Conclusion
Gecko enables fast storage in the cloud: it scales with increasing virtualization and number of cores, and is oblivious to I/O contention
Gecko's technical contributions: separating the log tail from its body, separating reads from writes, and a tail cache that absorbs reads going to the tail
A single sequentially accessed disk is better than multiple randomly seeking disks
Questions?