Towards Weakly Consistent Local Storage Systems
Ji-Yong Shin (1,2), Mahesh Balakrishnan (2), Tudor Marian (3), Jakub Szefer (2), Hakim Weatherspoon (1)
(1) Cornell University, (2) Yale University, (3) Google
2 Consistency/Performance Trade-off in Distributed Systems
- Slower reads to the latest version of data (at the primary)
- Faster reads to stale data using weak consistency (at back-ups)
[Figure: clients issuing reads to a primary and its back-up replicas, connected by replication.]
3 Server Comparison

                    2006                   2016                             Change
Model (4U)          Dell PowerEdge 6850    Dell PowerEdge R930
CPU [# of cores]    4 x 2-core Xeon [8]    4 x 24-core Xeon [96]            12X
Memory              64GB                   6TB                              96X
Network             2 x 1GigE              2 x 1GigE + 2 x 10GigE           11X
Storage             8 SCSI/SAS HDD         24 SAS HDD/SSD + 10 x PCIe SSD   4.2X (175X)

A modern server looks like a distributed system.
4 Can we apply distributed system principles to local storage systems to improve performance?
- Consistency and performance trade-off
5 Why Consistency/Performance Trade-off?
Distributed Systems
- Different versions of data exist on different servers due to network delays for replication
- Older versions are faster to access when the client is closer to the server
Modern Servers
- Different versions of data exist on different storage media due to logging, caching, copy-on-write, deduplication, etc.
- Older versions are faster to access when they are on faster storage media
Reasons for different access speeds:
- RAM, SSD, HDD, hybrid drives, etc.
- Disk arm contention
- SSD under garbage collection
- Degraded mode in RAID
6 Fine-grained Log and Coarse-grained Cache
- Multiple logged objects fit in one cache block
[Figure: a memory cache holding A(3), B(4), D(5) over an SSD log of KV pairs A(3), B(4), C(4), A(5), D(5); some reads are served fast from the cache with stale versions, while reads of the latest version must go to the slower log. Key (Ver) notation.]
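The trade-off on this slide can be sketched as a lookup policy: serve from the coarse-grained cache when its possibly stale version is fresh enough, else fall back to the fine-grained log. A minimal Python sketch; the function, its bounded-staleness parameter, and the dict-based cache/log are illustrative assumptions, not Yogurt's actual code:

```python
def read_kv(key, cache, log, allowed_staleness, latest):
    """
    Serve a read from the coarse-grained cache when its (possibly stale)
    version is within the allowed staleness bound; otherwise fall back to
    the slower fine-grained log, which always holds the latest version.
    cache and log map key -> (value, version).
    """
    hit = cache.get(key)
    if hit is not None:
        value, version = hit
        if latest - version <= allowed_staleness:
            return value, version, "cache"   # fast, possibly stale
    return log[key] + ("log",)               # slow, latest
```

With the slide's example, a read of A can return the cached version 3 when staleness up to 2 updates is allowed, but must pay the slow log read for version 5 under a tighter bound.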
7 Goal
Speed up local storage systems using stale data (consistency/performance trade-off)
- How should storage systems access older versions?
- Which version should be returned?
- What should be the interface?
- What are the target applications?
ACM Symposium on Cloud Computing, Oct 6, 2016
8 Rest of the Talk
- StaleStore
- Yogurt: An Instance of StaleStore
- Evaluation
- Conclusion
9 StaleStore
A local storage system that can trade off consistency and performance
- Can be in any form: KV-store, filesystem, block store, DB, etc.
- Maintains multiple versions of data
  - Should have an interface to access older versions
  - Can estimate the cost of accessing each version
  - Aware of data locations and storage device conditions
- Aware of consistency semantics
  - Ordered writes and notions of timestamps and snapshots
  - Distributed weak (client-centric) consistency semantics
10 StaleStore: Consistency Model
Distributed (client-centric) consistency semantics
- Per-client, per-object guarantees for reads
  - Bounded staleness
  - Read-my-writes
  - Monotonic reads: a client reads an object version that is the same as or later than the version last read by the same client
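The three per-client guarantees above can be sketched as a small session filter that decides whether a candidate version is acceptable. A hypothetical Python sketch; the class, method names, and integer-version model are assumptions, not the paper's API:

```python
class Session:
    """Tracks per-client, per-object read/write state for weak consistency checks."""

    def __init__(self, staleness_bound=None):
        self.last_read = {}    # object -> last version returned to this client
        self.my_writes = {}    # object -> latest version written by this client
        self.staleness_bound = staleness_bound

    def record_read(self, obj, version):
        self.last_read[obj] = version

    def record_write(self, obj, version):
        self.my_writes[obj] = version

    def acceptable(self, obj, version, latest):
        # Monotonic reads: never return an older version than previously read.
        if version < self.last_read.get(obj, 0):
            return False
        # Read-my-writes: never return a version older than the client's own write.
        if version < self.my_writes.get(obj, 0):
            return False
        # Bounded staleness: stay within k versions of the latest.
        if self.staleness_bound is not None and latest - version > self.staleness_bound:
            return False
        return True
```

A StaleStore can then scan the acceptable versions and return whichever is cheapest to access.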
11 StaleStore: Target Applications
- Distributed applications
  - Aware of distributed consistency
  - Can deal with data staleness
- Server applications
  - Can provide per-client guarantees
12 Rest of the Talk
- StaleStore
- Yogurt: An Instance of StaleStore
- Evaluation
- Conclusion
13 Yogurt: A Block-Level StaleStore
A log-structured disk array with cache [Shin et al., FAST '13] (Linux kernel module)
Reads:
- Prefer to read from non-logging disks
- Prefer to read from the most idle disk
[Figure: fast (stale) reads served from the cache or the idle disks 0-1; a slow read of the latest version goes to the logging disk 2.]
14 Yogurt: Basic APIs
- Write(Address, Data, Version #)
  - Versioned (time-stamped) write
  - Version #s constitute snapshots
- Read(Address, Version #)
  - Versioned (time-stamped) read
- GetCost(Address, Version #)
  - Cost estimation for each version
15 Yogurt Cost Estimation
GetCost(Address, Version) returns an integer
- Disk vs memory cache
  - Cache always has lower cost (e.g. cache = -1, disk = positive int)
- Disk vs disk
  - Number of queued I/Os with weights
  - Queued writes have higher weight than reads
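The cost rule above might look like the following sketch, assuming per-disk queue counts are available. The weights, constants, and data structures are illustrative, not Yogurt's actual values:

```python
from dataclasses import dataclass

CACHE_COST = -1       # cache hits always beat any disk (slide's example value)
WRITE_WEIGHT = 2      # queued writes are penalized more than queued reads (assumed weight)
READ_WEIGHT = 1

@dataclass
class Disk:
    queued_reads: int = 0
    queued_writes: int = 0

@dataclass
class Location:
    in_cache: bool
    disk: Disk = None

def get_cost(location):
    """Estimate the access cost of one version, in the spirit of GetCost()."""
    if location.in_cache:
        return CACHE_COST
    d = location.disk
    return WRITE_WEIGHT * d.queued_writes + READ_WEIGHT * d.queued_reads
```

Comparing costs across versions then reduces to comparing integers, with cached versions always winning.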
16 Reading Blocks from Yogurt
Monotonic-reads example: the client session tracks a lowest allowed version (Ver = 3) and records the version last read per block (e.g. [Blk 1: Ver 5]).
Reading block 1:
1. Check the current timestamp: highest Ver = 8
2. Issue GetCost() for block 1 between versions 3 and 8 (N queries with uniform distance)
3. Read the cheapest, e.g. version 5: Read(1, 5)
4. Record the version read for block 1
[Figure: global timestamp 8; cache and log holding block (version) entries 3(3), 1(4), 2(4), 1(5), 3(5), 1(6), 2(6), 3(7), 2(8).]
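The four read steps above can be sketched as follows. `store.get_cost`/`store.read` stand in for Yogurt's GetCost/Read APIs, and the candidate sampling implements the slide's "N queries with uniform distance"; the store interface itself is an assumption:

```python
def monotonic_read(store, session, block, n_queries=4):
    """Read `block` from the cheapest version that preserves monotonic reads."""
    lo = session.last_read.get(block, 0)     # lowest version this client may see
    hi = store.current_timestamp()           # step 1: highest version that exists
    # Step 2: probe N candidate versions at uniform distance between lo and hi.
    step = max(1, (hi - lo) // n_queries)
    candidates = list(range(lo, hi + 1, step))
    # Step 3: read the cheapest candidate.
    cheapest = min(candidates, key=lambda v: store.get_cost(block, v))
    data = store.read(block, cheapest)
    # Step 4: record the version so future reads stay monotonic.
    session.last_read[block] = cheapest
    return data
```

In the slide's example (lo = 3, hi = 8), the probe finds that version 5 of block 1 is cheapest, issues Read(1, 5), and records version 5 for the session.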
17 Data Constructs on Yogurt
- High-level data constructs span multiple blocks
- Blocks should be read from a consistent snapshot
- Later reads depend on prior reads: GetVersionRange()
[Figure: a three-block string evolving over timestamps 0-3, e.g. "Ithaca College in Ithaca" (ts 0), "Cornell University in Ithaca" (ts 1), "Cornell University in NYC" (ts 3); mixing block versions across snapshots would yield inconsistent strings.]
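GetVersionRange() is not fully specified on the slide; one way to read a multi-block construct from a single consistent snapshot is to choose the snapshot timestamp whose per-block versions have the lowest total estimated cost. A hypothetical sketch (the brute-force scan and all names are assumptions):

```python
import bisect

def cheapest_snapshot(block_writes, cost, lo, hi):
    """
    Choose a snapshot timestamp in [lo, hi] minimizing total read cost.
    block_writes: block -> sorted write timestamps (versions) of that block.
    cost(block, version): estimated access cost for that version.
    Returns (snapshot_ts, {block: version to read}).
    """
    def version_at(writes, ts):
        # The version of a block current at time ts is its last write <= ts.
        i = bisect.bisect_right(writes, ts) - 1
        return writes[i]

    best = None
    for ts in range(lo, hi + 1):
        picks = {b: version_at(w, ts) for b, w in block_writes.items()}
        total = sum(cost(b, v) for b, v in picks.items())
        if best is None or total < best[0]:
            best = (total, ts, picks)
    return best[1], best[2]
```

Because every block is resolved against the same timestamp, the returned versions always form a consistent cut, never a mix like "Cornell College in NYC".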
18 Rest of the Talk
- StaleStore
- Yogurt: An Instance of StaleStore
- Evaluation
- Conclusion
19 Evaluation
- Yogurt: 3-disk setting with memory cache
- Focus on read latency while using monotonic reads
- Clients simultaneously access servers in a primary-backup setting
  - Baseline 1: read the latest data from the primary server (100ms delay)
  - Baseline 2: read the latest data from a local server
  - Yogurt: utilize stale data in a local server
[Figure: clients, a primary, and a back-up running Yogurt that receives a stream of versioned writes.]
20 Evaluation: Block Access
- Uniform random workload
- 8 clients access one block at a time
- X-axis: # of available older versions built up during warm-up
[Plot: average read latency (ms, 0-200) vs. number of stale versions available at start time (1-8), comparing Primary, Local latest, and Yogurt MR.]
21 Evaluation: Key-Value Store
- YCSB Workload-A (Zipf with 50% reads, 50% writes)
- 16 clients access multiple blocks of key-value pairs
- The KV store greedily searches for the cheapest version using Yogurt APIs
- KV pairs can be partially updated
[Plot: average read latency (ms, 0-180) vs. key-value pair size (4KB-20KB), comparing Primary, Local latest, and Yogurt MR.]
22 Conclusion
- Modern servers are similar to distributed systems
- Local storage systems can trade off consistency and performance: we call them StaleStores
- Many existing systems have the potential to use this trade-off
- Yogurt, a block-level StaleStore
  - Effectively trades off consistency and performance
  - Supports high-level constructs that span multiple blocks
23 Thank you! Questions?
24 Extra slides
25 Fine-grained log and coarse-grained cache
- Multiple logged objects fit in one cache block
[Figure: a memory cache holding A(3), B(1), C(1), D(2) over an SSD log of KV pairs A(3), B(1), C(0), A(4), C(1), D(2); some reads are served fast from the cache with stale versions, while reads of the latest version must go to the slower log. Key (Ver) notation.]
26 Fine-grained log and coarse-grained cache: evaluation
- 8 threads reading and writing at a 9:1 ratio
- KV pairs per cache block varied from 2 to 16
- Allowed staleness from 0 to 15 updates (bounded staleness)
[Plot: average read latency (µs, 0-1200) vs. allowed staleness (0-15 updates) for 2/4/8/16 items per cache block; annotated "Max 60%".]
27 Deduplicated system with read cache
- Systems that cache deduplicated chunks
- Logical block to physical chunk map
[Figure: logical blocks B0-B2 mapped to physical chunks; reading B1's latest chunk C3 from disk is a slow read, while a stale chunk C2 already in the memory cache is a fast read.]
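The dedup read path above can be sketched as follows; the chunk-history representation and the staleness parameter are assumptions for illustration:

```python
def dedup_read(block, block_map, chunk_cache, disk_chunks, allowed_staleness):
    """
    Read a logical block in a dedup system with a chunk cache.
    block_map gives the block's chunk history as (chunk id, version) pairs,
    newest last. Serve a cached older chunk when it is within the staleness
    bound; otherwise read the latest chunk from disk.
    """
    history = block_map[block]                 # [(chunk, version), ...]
    latest_chunk, latest_ver = history[-1]
    # Prefer the newest cached chunk of this block within the bound.
    for chunk, ver in reversed(history):
        if chunk in chunk_cache and latest_ver - ver <= allowed_staleness:
            return chunk_cache[chunk], "cache"
    return disk_chunks[latest_chunk], "disk"   # slow read of the latest chunk
```

With the slide's example, a read of B1 can return the cached stale chunk C2 under a relaxed bound, but must fetch C3 from disk when the latest version is required.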
28 Deduplicated system with read cache: evaluation
- 8 threads reading and writing at a 9:1 ratio
- Deduplication ratio controlled from 30% to 90%
- Allowed staleness from 0 to 15 updates (bounded staleness)
[Plot: average read latency (µs, 0-25000) vs. allowed staleness (0-15 updates) for 30/50/70/90% dedup ratios; annotated "Max 30%".]
29 Write cache that is slow for reads
- Griffin: a disk write cache over SSD to extend SSD lifetime
- The disk cache logs incoming blocks and flushes the latest to the SSD
[Figure: reading Addr 1's latest logged version from the disk cache is a slow read; reading a stale flushed copy from the SSD's linear block address space is a fast read. Addr (Ver) notation.]
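The Griffin-style read path can be sketched as follows; the dict-based stores and the staleness bound are illustrative assumptions:

```python
def griffin_read(addr, disk_cache, ssd, allowed_staleness):
    """
    Read in a Griffin-style setup: the disk write cache holds the latest
    logged versions (slow to read); the SSD holds older flushed versions
    (fast to read). Return the SSD copy when it is fresh enough.
    disk_cache and ssd map addr -> (value, version).
    """
    latest = disk_cache.get(addr)
    flushed = ssd.get(addr)
    if latest is None:
        return flushed + ("ssd",)        # nothing newer than the flushed copy
    if flushed is not None and latest[1] - flushed[1] <= allowed_staleness:
        return flushed + ("ssd",)        # fast read, stale within the bound
    return latest + ("disk",)            # slow read of the latest version
```

This is the inverse of a conventional cache: here the *write* cache is the slow tier for reads, so stale reads from the backing SSD are the fast path.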
30 Write cache that is slow for reads: evaluation
- 8 threads reading and writing at a 9:1 ratio
- Data flushed from disk to SSD every 128MB to 1GB of writes
- Allowed staleness from 0 to 7 updates (bounded staleness)
[Plot: average read latency (µs, 0-8000) vs. allowed staleness (0-7 updates) for 128MB/256MB/512MB/1024MB flush intervals; annotated "Max 95%".]