Large-scale Caching. CS6450: Distributed Systems Lecture 18. Ryan Stutsman

1 Large-scale Caching
CS6450: Distributed Systems, Lecture 18
Ryan Stutsman
Material taken/derived from Princeton COS-418 materials created by Michael Freedman and Kyle Jamieson at Princeton University. Licensed for use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Some material taken/derived from MIT by Robert Morris, Frans Kaashoek, and Nickolai Zeldovich.

2 Move Fast & Break Things
Feb 2004: open thefacebook.com to Harvard
PHP + MySQL
Jun 2004: expand to Columbia, Yale, Stanford
Dec 2005: 6 million users
2008: 100 million
2009: 200 million
2010: 400 million
2011: 800 million
Today: 1.4 billion?
PHP + MySQL

3 Big Picture
Billions of users
Inherent wide fan-outs, poor locality
Near real-time communication, content sharing
Spans the globe
Billions of data requests per second
Trillions of items
Combine simple off-the-shelf stuff to do it

4 Simple
[Diagram: www (PHP frontend) -> DB (MySQL backend)]

5 PHP is slow, so scale
[Diagram: LB -> many www servers (PHP frontend) -> one DB (MySQL backend)]
Too many DNS entries; add a load balancer? (HTTP proxy)

6 DB is slow, so shard
[Diagram: LB -> many www servers -> many DB shards]
Problem: DBs short-stroked, only 100 IOPS each
Enter memcached

7 Memcached
Get(k) -> v
Set(k, v)
CAS(k, v, v')
All data in DRAM
LRU + slab allocator
Couple thousand lines of C
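The interface on this slide is small enough to sketch in a few lines. The block below is a toy illustration, not memcached's code: the class name `TinyCache` and its `capacity` parameter are hypothetical, eviction is plain LRU with no slab allocator, and `cas` compares values as the slide's signature suggests (real memcached compares a version token obtained from a prior read).

```python
from collections import OrderedDict


class TinyCache:
    """Toy in-memory cache with LRU eviction (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used entry comes first

    def get(self, k):
        if k not in self.items:
            return None  # miss
        self.items.move_to_end(k)  # mark as most recently used
        return self.items[k]

    def set(self, k, v):
        self.items[k] = v
        self.items.move_to_end(k)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the LRU entry

    def cas(self, k, expected, new):
        """Store new only if k currently holds expected."""
        if k not in self.items or self.items[k] != expected:
            return False
        self.set(k, new)
        return True
```

A `cas` that fails tells the caller that someone else changed the key in between, so the caller can re-read and retry.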

8 +Memcache
[Diagram: LB -> www servers -> mc servers and DB shards]
Look-aside; ~1M IOPS per memcached box
All writes go through the DB; a read miss goes to the DB
Even this won't be enough...
Reads >> writes; decouple read/write capacity scaling

9 Cache Workload
Long tail; some values are KBs or 10s of KBs
Median a few tens of bytes? Paper says 135 B
Caching looks promising
Lots of small values: hard for disks
Reads >> writes
[Figure 10 from the paper: cumulative distribution of value sizes fetched; axes: bytes, percentile of requests]

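10 Look Aside
Read path (web server): 1. get k from memcache; 2. SELECT from the database on a miss; 3. set (k,v) in memcache
Write path (web server): 1. UPDATE the database; 2. delete k from memcache
[Figure 1 from the paper: memcache as a demand-filled look-aside cache]
Why forward deletes and not sets?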
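The two paths on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: plain dicts stand in for memcache and the database, and the class name `LookAsideClient` is hypothetical. On the slide's closing question, one reason the paper gives for deleting rather than setting on a write is that deletes are idempotent.

```python
class LookAsideClient:
    """Web-server-side logic for a demand-filled look-aside cache (sketch)."""

    def __init__(self, cache, db):
        self.cache = cache  # dict standing in for memcache
        self.db = db        # dict standing in for the database

    def read(self, k):
        v = self.cache.get(k)          # 1. get k
        if v is None:
            v = self.db.get(k)         # 2. SELECT on a miss
            if v is not None:
                self.cache[k] = v      # 3. set (k, v)
        return v

    def write(self, k, v):
        self.db[k] = v                 # 1. UPDATE
        self.cache.pop(k, None)        # 2. delete k; safe to repeat
```

The cache never sees new values on the write path; the next reader refills it from the database.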

11 Interesting Points
Shard with consistent hashing to distribute load
Different hash function than DB sharding, why?
Scale DB capacity and throughput independently
DB provisioning determined by write throughput and peak miss rate
50% miss: get rid of ½ of the DBs
1% miss: get rid of 99% of the DBs
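A minimal sketch of consistent hashing, assuming a hash ring with virtual nodes; the names `HashRing`, `server_for`, and `vnodes`, and the choice of MD5 as the hash, are illustrative and not taken from the lecture or from any memcached client.

```python
import bisect
import hashlib


def _h(s):
    """Map a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server owns many points so load spreads evenly.
        self.ring = sorted((_h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def server_for(self, key):
        # First point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self.points, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

The property that matters for a cache tier: when a server leaves the ring, only the keys it owned move to other servers, so the rest of the cache stays warm.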

12 Communications Issues
All-to-all communication pattern
TCP connection state isn't free
Communications scheduling, connection scaling
# www servers >> # mc servers, and > 100 threads per www
O(nm) connections needed
Also, flow control is per connection
Possible to DoS yourself by issuing parallel requests on 100 threads ("incast")
Incast can happen further up too, even in a non-blocking network

13 Connection Aggregation
mcrouter
One TCP connection per machine to each memcached
100x reduction in memcached-side connection state
For Gets, each client thread uses UDP and skips mcrouter
[Chart: max sustained items/second (0 to 2M); labels: fb www, TCP, UDP, Get, mcrouter, 10-key multiget]
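The 100x figure follows from simple counting. A small sketch, assuming 100 threads per web server as on the previous slide; the function name and the web-server count used in the example are hypothetical.

```python
def connections_per_memcached(web_servers, threads_per_web, aggregated):
    """TCP connections terminating at ONE memcached server.

    Without aggregation every web thread holds its own connection;
    with mcrouter there is one connection per web machine.
    """
    per_machine = 1 if aggregated else threads_per_web
    return web_servers * per_machine


# Hypothetical fleet size, chosen only to make the ratio visible.
direct = connections_per_memcached(1000, 100, aggregated=False)
via_mcrouter = connections_per_memcached(1000, 100, aggregated=True)
reduction = direct // via_mcrouter
```

The reduction factor equals the thread count per web server, independent of how many web servers there are.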

14 Latency
[Chart: latency in microseconds, UDP direct vs. by mcrouter (TCP); bars: average of medians, average of 95th percentiles]

15 App-level Flow Control
Paces requests
Across all targets, unlike TCP
One TCP connection per machine to each memcached
100x reduction in memcached-side connection state
[Chart: milliseconds vs. window size; series: median and a tail percentile]
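A minimal sketch of application-level pacing, assuming a fixed window of outstanding requests shared across all destinations. The class and method names are hypothetical, and a production client would also adapt the window size over time, which this sketch leaves out.

```python
from collections import deque


class SlidingWindow:
    """Caps outstanding requests across ALL targets, not per connection."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.outstanding = 0
        self.queued = deque()
        self.sent = []  # requests put on the wire, in order

    def submit(self, request):
        if self.outstanding < self.window_size:
            self._send(request)
        else:
            self.queued.append(request)  # wait for a slot

    def on_response(self):
        self.outstanding -= 1
        if self.queued:
            self._send(self.queued.popleft())

    def _send(self, request):
        self.outstanding += 1
        self.sent.append(request)
```

A small window leaves the network idle and adds queueing delay; a large one lets many responses arrive at once and triggers incast, which is the trade-off the slide's chart explores.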

16 Why Separate Cache?
High fanout and multiple rounds of data fetching
Batching: can we use multiget well?
Interstitial slide
[Figure: data dependency DAG for a small request]
Can amortize dispatch/remote call cost, but...
Collect groups of keys to request?
Coroutines

17 Fan out
[Chart: percentile of requests vs. distinct memcached servers; series: all requests, a popular data-intensive page]

18 Two Problems
Stale sets: what if the DB value changes before the cache value can be installed?
Thundering herds: cache misses on hot keys cause runs on the DB
Leases: lock key on get miss while fetching

19 Stale Set
[Sequence diagram; participants: C1, MC, DB, C2]
C1: Get(k) at MC -> Miss; Get(k) at DB -> v1; Set(k, v1) at MC -> Ok

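20 Stale Set
[Sequence diagram; participants: C1, MC, DB, C2]
C1: Get(k) at MC -> Miss; Get(k) at DB -> v1; Set(k, v1) at MC -> Ok
C2: Set(k, v2) at DB -> Ok; Del(k) at MC
C1's Set(k, v1) lands after C2's Del(k), so MC keeps the stale v1 while the DB holds v2.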

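21 Stale Set Fix: Leases
[Sequence diagram; participants: C1, MC, DB, C2]
C1: Get(k) at MC -> Miss (lease granted); Get(k) at DB -> v1; Set(k, v1) at MC -> Reject! (set rejected)
C2: Set(k, v2) at DB -> Ok; Del(k) at MC (lease cleared)
Idea: ensure all sets induced by a get miss that came before a DB update are invalidated.
LL/SC (load-link/store-conditional)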
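The lease mechanism for stale sets can be sketched as a token check. This is a minimal illustration, assuming an in-process dict for the cache; the class name `LeasingCache` and the way tokens are generated are hypothetical.

```python
import itertools


class LeasingCache:
    def __init__(self):
        self.data = {}
        self.leases = {}                  # key -> outstanding lease token
        self._tokens = itertools.count(1)

    def get(self, k):
        """Return (value, None) on a hit or (None, lease_token) on a miss."""
        if k in self.data:
            return self.data[k], None
        token = next(self._tokens)
        self.leases[k] = token            # lease granted
        return None, token

    def set(self, k, v, token):
        """Install v only if the lease from the miss is still valid."""
        if self.leases.get(k) != token:
            return False                  # lease cleared by a delete: reject
        del self.leases[k]
        self.data[k] = v
        return True

    def delete(self, k):
        self.data.pop(k, None)
        self.leases.pop(k, None)          # invalidate any in-flight fill
```

Replaying the slide: C1 misses and gets a lease, C2 updates the DB and deletes k, and C1's late set is rejected because its lease no longer exists.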

22 Thundering Herd
[Sequence diagram; participants: MC, DB, C2]

23 Thundering Herd
[Sequence diagram; participants: MC, DB, C2; messages: Ok, Del(k)]

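24 Thundering Herd Fix: Leases
[Sequence diagram; participants: C1, C3, MC, DB, C2]
C1: Get(k) at MC -> Miss (lease granted); Get(k) at DB -> v1; Set(k, v1) at MC
C3: Get(k) at MC -> Miss: Retry Soon
C2: Set(k, v2) at DB -> Ok; Del(k) at MC (lease cleared)
Idea: only let one miss handle the DB fetch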
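A minimal sketch of the herd fix, assuming the cache grants at most one fill lease per key per interval and tells everyone else to retry shortly. The class name, the status strings, and the `lease_interval` parameter are hypothetical; the 10-second default mirrors the rate the paper describes. Time is passed in explicitly to keep the example deterministic.

```python
class HerdGuard:
    """Hands out at most one fill lease per key per interval (sketch)."""

    def __init__(self, lease_interval=10.0):
        self.lease_interval = lease_interval
        self.data = {}
        self.lease_issued_at = {}        # key -> time the last lease was granted

    def get(self, k, now):
        if k in self.data:
            return "hit", self.data[k]
        issued = self.lease_issued_at.get(k)
        if issued is not None and now - issued < self.lease_interval:
            return "retry_soon", None    # another client is already fetching
        self.lease_issued_at[k] = now
        return "miss_with_lease", None   # this client goes to the DB

    def set(self, k, v):
        self.data[k] = v
        self.lease_issued_at.pop(k, None)
```

Clients told to retry usually find the value in the cache a few milliseconds later, so the DB sees one fetch instead of one per waiting client.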

25 Pools
Mixing high-churn and low-churn apps causes negative interference in eviction policy
Solution: separate apps physically
[Figure 5 from the paper: daily and weekly working sets, low churn vs. high churn; y-axis: terabytes; minimum, mean, and maximum]

26 Partitioned Memory Over Time (Memshare Detour)
[Chart: static partition vs. no partition; series include App B, App C]

27 Estimate Hit Rate Curve Gradient to Optimize Hit Rate (Memshare Detour)
[Chart: hit rate vs. cache allocation for Workload 1 and Workload 2]

28 Estimate Hit Rate Curve Gradient to Optimize Hit Rate (Memshare Detour)
[Chart: hit rate vs. cache allocation for Workload 1 and Workload 2]
Gradient of Workload 1 < gradient of Workload 2: keep items from Workload 2

29 Estimating Hit Rate Gradient (Memshare Detour)
Track access frequency to recently evicted objects to determine gradient at working point
Can be further improved with full hit rate curve estimation
SHARDS [Waldspurger 2015, 2017]
AET [Hu 2016]
[Chart: hit rate vs. cache allocation]

30 Too Hot to Scale: Replicate
Some keys have high locality and are hot
Not amenable to sharding
[Diagram: mcd servers; capacity 500k get/s; incoming 1M get/s in 100-key multigets]

31 Too Hot to Scale: Replicate
Some keys have high locality and are hot
Not amenable to sharding
Interesting angle to paper: sharding in some cases, replication in others
Depends on workload and resources, e.g. network topology
[Diagram: two mcd servers; capacity 500k get/s each; if sharded, incoming 2M get/s in 50-key multigets (1M/s to each server)]
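The slide's numbers can be checked with a short calculation. The sketch assumes, as the slide does, that server capacity is counted in requests (multigets) per second rather than keys, and that a sharded multiget touches every server; the function name is hypothetical.

```python
def per_server_request_rate(multigets_per_sec, servers, strategy):
    """Requests per second arriving at each server."""
    if strategy == "shard":
        # Every multiget is split into one smaller multiget per server,
        # so each server still sees every incoming request.
        return multigets_per_sec
    if strategy == "replicate":
        # Each multiget goes whole to one replica.
        return multigets_per_sec / servers
    raise ValueError(strategy)


CAPACITY = 500_000   # get/s per server, from the slide
INCOMING = 1_000_000  # 100-key multigets per second, from the slide

sharded = per_server_request_rate(INCOMING, 2, "shard")
replicated = per_server_request_rate(INCOMING, 2, "replicate")
```

Sharding halves the keys per request but not the request count, so each server stays at twice its capacity; replication halves the request count and fits.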

32 Handling Failures: Gutter
Reroute Gets aimed at unresponsive nodes to a small mcd cluster
No deletes, just short lifetimes
Recovers hot part of crashed node's LRU chain quickly
But, big cut in hit rate on that shard
Hit rate > 35% in 4 min
[Diagram: frontend mcd cluster and a small Gutter mcd pool]
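A minimal sketch of the Gutter read path, assuming the client detects an unresponsive node through an exception. The names `Unresponsive` and `get_with_gutter` and the `ttl` value are hypothetical, and the primary path is simplified to a single call.

```python
class Unresponsive(Exception):
    """Raised by the primary lookup when the node does not answer."""


def get_with_gutter(k, primary_get, gutter, db, now, ttl=10.0):
    """Read k, falling back to the gutter pool if the primary is down.

    gutter maps key -> (value, expires_at).
    """
    try:
        return primary_get(k)
    except Unresponsive:
        entry = gutter.get(k)
        if entry is not None and entry[1] > now:
            return entry[0]              # gutter hit, not yet expired
        v = db[k]                        # gutter miss: go to the DB
        gutter[k] = (v, now + ttl)       # short lifetime instead of deletes
        return v
```

Because gutter entries are never invalidated, a reader can see a value that is stale by up to `ttl`; the short lifetime bounds that staleness while still shielding the DB.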

33 Replication versus Partitioning
Partition: frontend i, key k -> cache server hash(k)
  Memory efficient
  Max per-key throughput equal to single-server throughput
  Multiplier on number of servers each frontend talks to
Replication: frontend i, key k -> cache server hash(i)
  Redundant data
  Works well if a few keys are extremely popular

34 Regions
mcrouter helps connection scaling, but only by a constant factor
Want some failure-independent clusters
Inter-cluster links likely to be less well-provisioned
Want low latency to local DCs

35 Bigger Picture

36 Shootdowns
[Sequence diagram; participants: MC, DB, C1; messages: Set(k, v2), Ok, Del(k)]

37 Regional Invalidations
[Diagram: on the storage server, MySQL records update operations in its commit log; McSqueal reads the commit log and sends invalidations through Mcrouter to Memcache]

38 How Bobby was able to sleep at night
Problem: what if the power goes off?
Lose 100 TB of 100 B objects in DRAM
Disks: 100 IOPS
Need 1 trillion disk accesses to refill cache
To recover in 1 s, just need 10 billion disks
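The slide's arithmetic, worked through as a short calculation. It assumes decimal units (1 TB = 10^12 bytes) and one disk access per object; the function name is hypothetical.

```python
def disks_needed(cache_bytes, object_bytes, disk_iops, seconds):
    """Disks required to refill the cache within the given time."""
    objects = cache_bytes // object_bytes          # one disk access per object
    accesses_per_disk = disk_iops * seconds
    return objects // accesses_per_disk


CACHE_BYTES = 100 * 10**12   # 100 TB
OBJECT_BYTES = 100           # 100 B objects
DISK_IOPS = 100

objects = CACHE_BYTES // OBJECT_BYTES              # 10**12, one trillion
disks = disks_needed(CACHE_BYTES, OBJECT_BYTES, DISK_IOPS, seconds=1)
```

Here `objects` is one trillion and `disks` is ten billion, matching the slide; the point is that refilling from disk is hopeless, which motivates warming a cold cluster from another cluster's DRAM.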

39 Cold Cluster Warmup
Use Region 1's DRAM cache to warm Region 2

40 Cold Cluster Warmup
[Diagram: Cluster 1, Cluster 2, Storage (DB) tier]
1. Get Miss
2. Get Hit
3. Set

41 Cold Cluster Warmup
[Diagram: Cluster 1, Cluster 2, Storage (DB) tier]
1. Get Miss
2. Get Miss
3. Get from DB

42 Cold Cluster
[Sequence diagram; participants: C1, MC2, DB, MC1, C2]
C1: Set(k, v2); Dels propagate to the caches
C2: Get(k) -> Miss; Get(k) -> v1; Set(k, v1) -> Ok

43 Cold Cluster: Fix Hold-Off
[Sequence diagram; participants: C1, MC2, DB, MC1, C2]
C1: Set(k, v2); Dels propagate to the caches and open a hold-off window
C2: Get(k) -> Miss; Get(k) -> v1; Set(k, v1) -> Reject
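A minimal sketch of the hold-off idea, assuming a delete opens a window during which fills for that key are rejected. The class name `ColdCache` and its method names are hypothetical; the two-second default follows the paper's description of deletes sent to a cold cluster. Time is passed in explicitly to keep the example deterministic.

```python
class ColdCache:
    def __init__(self, hold_off=2.0):
        self.hold_off = hold_off
        self.data = {}
        self.deleted_at = {}             # key -> time of the last delete

    def delete(self, k, now):
        self.data.pop(k, None)
        self.deleted_at[k] = now         # opens the hold-off window

    def add(self, k, v, now):
        """Fill after a miss; rejected while the hold-off window is open."""
        t = self.deleted_at.get(k)
        if t is not None and now - t < self.hold_off:
            return False                 # value may predate the delete
        self.data[k] = v
        return True
```

A rejected fill signals that the value fetched from the warm cluster may be stale, so the client should fetch from the database instead.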

44 Non-local Writes
Set in non-master cluster
Invalidate the local cache
Send write to master region
Fetch value from non-master region...
Could still get the old value...
So not even read-your-own-writes

45 Non-local Writes
[Sequence diagram; participants: C1, MC, Slave DB, Master DB]
Messages: Set(k, v2); Del; Ok; Get(k) -> Miss; Get(k) -> v1; Del; Set(k, v2)

46 Non-local Writes: Remote Marker
[Sequence diagram; participants: C1, MC, Slave DB, Master DB]
Messages: Set(rk); Set(k, v2); Del k; Ok; Get(rk); Miss; Get(k) -> v2; Del k; Del rk; Set(k, v2)
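A minimal sketch of the remote-marker protocol, assuming dicts for the local cache, the local replica DB, and the master DB. All function names are hypothetical, and replication is modeled as an explicit callback rather than a real log-shipping pipeline.

```python
def marker(k):
    """Cache key for the remote marker rk of k."""
    return ("marker", k)


def write_from_replica_region(k, v, local_cache, master_db):
    local_cache[marker(k)] = True        # 1. set remote marker rk
    master_db[k] = v                     # 2. send the write to the master
    local_cache.pop(k, None)             # 3. delete k in the local cache


def read_in_replica_region(k, local_cache, replica_db, master_db):
    if k in local_cache:
        return local_cache[k]
    if marker(k) in local_cache:
        return master_db[k]              # replica may lag: read the master
    v = replica_db[k]
    local_cache[k] = v
    return v


def on_replication_applied(k, v, local_cache, replica_db):
    replica_db[k] = v                    # replica catches up
    local_cache.pop(k, None)             # Del k
    local_cache.pop(marker(k), None)     # Del rk
```

While the marker exists, misses pay a cross-region round trip in exchange for not reading the lagging replica; once replication catches up the marker is cleared and reads go local again.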

47 Busted, but my boss told me to break things
What if two clients set markers?
Marker will get cleared by the first Set handled by mcsqueal
Filled cache value may miss the second update
Cache state diverges for an unbounded period of time
"In practice, we find both the eviction of remote markers and situations of concurrent modification to be rare."

48 So, consistency?
[Figure 11 from the paper: Latency of the Delete Pipeline; y-axis: fraction of deletes that failed (1e-06 to 1e-03); x-axis: seconds of delay (1s to 1d); panels: master region, replica region]
1e-3: 1 in 1,000; 1e-4: 1 in 10,000
1 in 10,000 Gets of cross-regional writes will return the incorrect value for > 1 day...
Follow-on work finds consistency is pretty pretty pretty good
[Paper excerpt shown on the slide: "... of a million deletes and record the time the delete was issued. We subsequently query the contents of memcache across all frontend clusters at regular intervals for the sampled keys and log an error if an item remains cached despite a delete that should have invalidated it. In Figure 11, we use this monitoring mechanism to report our invalidation latencies across a 30 day span. ..."]

49 Takeaways
Cache is crucial for survival at FB; caches go down, site goes down
Partitioning and replication have different nuances for increasing performance
How much does consistency matter?

51 Discussion
Why not have DB send new values to memcached, so clients only read memcached? Then, no racing client updates. All writes ordered.

52 Discussion
Why not have DB send new values to memcached, so clients only read memcached? Then, no racing client updates. All writes ordered.
1. DB doesn't know how to compute values for memcached (cache isn't a literal DB record)
2. Would increase read-your-writes delay (probably need a Spanner-like mechanism?)
3. DB doesn't know what is cached; would have to send values for uncached items

