MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing

Size: px

Start display at page:

Download "MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing"

Mae Holland
6 years ago
Views:

1 MemC3: MemCache with CLOCK and Concurrent Cuckoo Hashing Bin Fan (CMU), Dave Andersen (CMU), Michael Kaminsky (Intel Labs) NSDI

2 Goal: Improve Memcached 1. Reduce space overhead (bytes/key) 2. Improve performance (queries/sec) 2

Our Approach: Algorithm Engineering Apply concurrent / space-efficient data

3 Overview Previous Work: Sharding Avoid inter-thread synchronization e.g., dedicated cores [Berezecki11] Hotspot? Memory Efficiency? Our Approach: Algorithm Engineering Apply concurrent / space-efficient data structures MemC3 vs. Memcached 3x throughput 30% more small key-value items 3

4 Memcached Overview A DRAM-based key-value store GET(key) SET(key, value) LRU eviction for high hit rate Webserver Webserver Webserver Typical use: Speed up webservers Alleviate db load get(x) set(y,"123") get(z) Memcached server on miss Memcached server on miss Database 4

5 Watch the next talk! Typical Workloads Often used for small objects (Facebook [Atikoglu12] ) 90% keys < 31 bytes Some apps only use 2-byte values Tens of millions of queries per second for large memcached clusters (Facebook [Nishtala13] ) Small Objects, High Rate 5

6 Memcached: Core Data Structures Key-Value Index: Chaining hash table Hash table w/ chaining K V K V K V K V K V 6 Bin Fan April 13!!

7 Memcached: Core Data Structures Key-Value Index: Chaining hash table Hash table w/ chaining K V LRU Eviction: Doubly-linked lists Doubly-linked-list (for each slab) LRU header K V K V K V K V 7

8 Problems We Solve Single-node scalability Accessing hash table and updating LRU are serialized Space overhead 56-byte header per object Including 3 pointers and 1 refcount For a 100B object, overhead > 50% 8

Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk

9 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency Additional algo & tuning improvements described in paper 9

10 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency Additional algo & tuning improvements 10

11 Memcached Default Hash Table Chaining items hashed in same bucket: lookup K V K V K V Good: simple (Data Structure 101) Bad: low cache locality: (dependent pointer dereference) Bad: pointer costs space 11

12 Linear Probing Probing consecutive buckets for vacancy Good: simple lookup Good: cache friendly Bad: poor memory efficiency: ( if occupancy > 50%, lookup needs to search a long chain) 12

13 Cuckoo Hashing [Pagh04] Each key has two candidate buckets Assigned by hash 1 (key), hash 2 (key) Stored in one of its candidate buckets Lookup: read 2 buckets in parallel hash 1 (x) lookup x hash 2 (x) 0! 1! 2! 3! 4! 5! 6! 7! 13

14 Cuckoo Hashing [Pagh04] Each key has two candidate buckets Assigned by hash 1 (key), hash 2 (key) Stored in one of its candidate buckets Lookup: read 2 buckets in parallel Insert: Perform key displacement recursively Still O(1) on average [Pagh04] hash 1 (x) hash 1 (b) 0! 1! 2! 3! 4! insert x 5! hash 2 (c) hash 6! 2 (x) 7! a c b 14

15 Increase Set-Associativity a b Each bucket still fits in 1 cacheline b e f g h x 6 7 a x c d 2 cacheline-sized reads per lookup 50% space utilized 2 cacheline-sized reads per lookup 95% space utilized! 15

16 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency Additional algo & tuning improvements 16

17 During insertion: False Miss Problem always a floating item during insertion a reader may miss this floating item 0! 1! 2! 3! Floating Insert Item x 4! 5! b 6! a 7! 17

18 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! 2! 3! Insert x 4! 5! 6! 7! a b 18

19 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: Insert x 2! 3! 4! 5! 6! 7! a b 19

20 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: Insert x 2! 3! 4! 5! 6! 7! b a 20

21 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: 2! 3! 4! b a Insert x 5! 6! 7! 21

22 Our Solution: 2-Step Insert Step1: Find a cuckoo path to an empty slot without editing buckets 0! 1! Step2: Move hole backwards: 2! 3! 4! b a 5! Only need to ensure each move is atomic w.r.t. reader 6! 7! x 22

23 How to Ensure Atomic Move e.g., move key b from bucket 4 to bucket 2 A simple implementation: Lock bucket 2 and 4 Move key Unlock bucket 2 and 4 Our approach: Optimistic locking 0! 1! 2! 3! 4! Optimized for read-heavy workloads Each key mapped to a version counter 5! Reader detects version change 6! (described in paper) 7! b 23

24 How to Coordinate Writers Simple (current) solution: Serialize inserts Works fine with read-heavy workload Ongoing work: allow multiple writers 24

25 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency 2ptr/key => 1bit/key, concurrent update Additional algo & tuning improvements 25

26 Solutions Optimistic cuckoo hashing Better memory efficiency: 95% table occupancy Higher concurrency: single-writer/multi-reader focus of this talk CLOCK-based LRU eviction Better space efficiency and concurrency 2ptr/key => 1bit/key, concurrent update Additional algo & tuning improvements Avoid unnecessary full-key comparisons on hash collision 26

27 Problems We Solve Single-node scalability Accessing hash table and updating LRU are serialized GET requires no mutex Single-writer/multiple-reader Space overhead 56-byte header per object 3 pointers + 1 refcount => 1 pointer + 1 refbit 27

28 Hash Table Microbenchmark Million Lookups/sec Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache 6 local threads reading hash tables of ~1GB 60% 50% 40% 30% 20% 10% 0% Lookups all hit Base 21.54% 12.79% 68% 14.28% 1.74% 1.9% Chaining w/ Bkt Lock Optimistic Cuckoo Base Lookups all miss Chaining w/ Bkt Lock 47.89% Optimistic Cuckoo 235% 28

29 Hash Table Microbenchmark Million Lookups/sec Server: Low Power Xeon CPU w/ 12 cores, 12 MB L3 cache 6 local threads reading hash tables of ~1GB 60% 50% 40% 30% 20% 10% 0% Lookups all hit Base 21.54% 12.79% 68% 14.28% 1.74% 1.9% Chaining w/ Bkt Lock Optimistic Cuckoo Base Lookups all miss Chaining w/ Bkt Lock 47.89% Optimistic Cuckoo 235% 29

End-to-end Performance 16-Byte key, 32-Byte Value,

5 MOPS Memcached MemC3 Sharding 0" 1" 2" 4" 6" 8"

30 End-to-end Performance 16-Byte key, 32-Byte Value, 95% GET, 5% SET, zipf distributed 50 remote clients generate workloads 5" max tput 4.3 MOPS MQPS% 4" 3" 2" 1" max tput 1.5 MOPS Memcached MemC3 Sharding 0" 1" 2" 4" 6" 8" 10" 12" 14" 16" Number%of%Server%Threads%% max tput 0.6 MOPS 30

31 Conclusion Optimistic cuckoo hashing High space efficiency Optimized for read-heavy workloads Source Code available: github.com/efficient/libcuckoo MemC3 improves Memcached 3x throughput, 30% more (small) objects Optimistic Cuckoo Hashing, CLOCK, other system tuning 31

32 References [Atikoglu12] Workload analysis of a large-scale key- value store. [Berezecki11] Many-core key-value store [Nishtala13] Scaling Memcache at Facebook [Pagh04] Cuckoo hashing 32

33 Q & A Thanks! 33

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing

To appear in Proceedings of the 1th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), Lombard, IL, April 213 MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter