Goals. Facebook's Scaling Problem. Scaling Strategy. Facebook's Three-Layer Architecture. Workload. Memcache as a Service.

Memcache as a Service
Tom Anderson

Goals
- Rapid application development: speed of adding new features is paramount
- Scale: billions of users; every user on FB all the time
- Performance: low latency for every user, everywhere
- Fault tolerance: scale implies failures
- Consistency model: best-effort eventual consistency

Facebook's Scaling Problem
- Rapidly increasing user base
  - Small initial user base, 2x every 9 months
  - 2013: 1B users globally
- Users read/update many times per day
- Increasingly intensive app logic per user: 2x I/O every 4-6 months
- Infrastructure has to keep pace

Scaling Strategy
- Adapt off-the-shelf components where possible
- Fix as you go: no overarching plan
- Rule of thumb: every order of magnitude requires a rethink

Facebook's Three-Layer Architecture
- Application front end: stateless, rapidly changing program logic; if an app server fails, redirect the client to a new app server
- Memcache: lookaside key-value cache; keys defined by app logic (can be computed results)
- Fault-tolerant storage backend: stateful; careful engineering to provide safety and performance; both SQL and NoSQL

Workload
- Each user's page is unique: it draws on events posted by other users
- Users are not in cliques (for the most part)
- User popularity is Zipf-distributed
  - Some users' posts affect very large numbers of other pages
  - Most affect a much smaller number

Workload
- Many small lookups, many dependencies
- App logic: many diffuse, chained reads; the latency of each read is crucial
- Much smaller update rate, but still large in absolute terms

Scaling
- A few servers -> many servers -> an entire data center -> many data centers
- Each step is 10-100x the previous one
- Facebook: scale by hashing to partitioned servers, by caching, by replicating popular keys, by replicating clusters, and by replicating data centers

Scale by Consistent Hashing
- Hash users to front-end web servers
- Hash keys to memcache servers
- Hash files to storage servers
- The result of consistent hashing is an all-to-all communication pattern: each web server pulls data from all memcache servers and all storage servers (see the hashing sketch below)

Scale by Caching: Memcache
- Sharded in-memory key-value cache
- Keys and values assigned by application code; values can be data or the result of a computation
- Independent of the backend storage architecture (SQL, NoSQL) or format
- Designed for high volume, low latency
- Lookaside architecture
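To make the consistent-hashing step concrete, here is a minimal Python sketch of a hash ring that maps keys to memcache servers. The class name, virtual-node count, and server names are illustrative; this is not Facebook's actual routing code.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to memcache servers so that adding or removing a server
    only remaps a small fraction of keys. Illustrative sketch; the class
    name, virtual-node count, and server names are made up."""

    def __init__(self, servers, vnodes=100):
        self._ring = []                                   # sorted (hash, server) points
        for server in servers:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{server}#{v}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def server_for(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["mc0", "mc1", "mc2", "mc3"])
print(ring.server_for("user:42:recent_posts"))            # e.g. "mc2"
```

Because every web server hashes over the full set of memcache (and storage) servers, each front end ends up talking to all of them, which is the all-to-all pattern noted above.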

Lookaside Read
[Diagram: webserver sends get k to memcache (1): miss ("nope!"); gets k from the storage layer (2): data ("ok!"); puts k back into memcache (3)]

Lookaside Operation (Read)
- Webserver needs a key's value
- Webserver requests it from memcache
- Memcache: if the key is in the cache, return it; if not, return an error
  - Webserver gets the data from the storage server (possibly an SQL query or a complex computation)
  - Webserver stores the result back into memcache

Question
- What if a swarm of users reads the same key at the same time?

Lookaside Write
[Diagram: webserver updates the storage server (1): ok; deletes k in memcache (2): ok]

Lookaside Operation (Write)
- Webserver changes a value that would invalidate a memcache entry
  - Could be an update to a key
  - Could be an update to an SQL table
  - Could be an update to a value used to derive some key's value
- Client puts the new data on the storage server
- Client invalidates the entry in memcache
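The read and write paths above fit in a few lines. The sketch below assumes a memcache-style client with get/set/delete and a storage layer with read/write; the names are illustrative, not a specific client library.

```python
def lookaside_read(cache, storage, key):
    value = cache.get(key)            # (1) try memcache first
    if value is not None:
        return value
    value = storage.read(key)         # (2) miss: fetch from the storage layer
    cache.set(key, value)             # (3) fill the cache for later readers
    return value

def lookaside_write(cache, storage, key, new_value):
    storage.write(key, new_value)     # (1) update the authoritative copy
    cache.delete(key)                 # (2) invalidate (don't update) the cached copy
```

Note that the write path deletes rather than updates the cached value: invalidation keeps the cache from holding values nobody will read again and avoids racing concurrent writers on the cached copy.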

Why Not Delete then Update?
[Diagram: webserver deletes k in memcache (1): ok; then updates the storage server (2): ok]
- A read miss might reload the data before it is updated.

Memcache Consistency
- Webserver (reader): read the cache; if missing, fetch from the database and store back into the cache
- Webserver (writer): change the database; delete the cache entry
- Interleave any number of readers and writers: is memcache linearizable?

Memcache Consistency
- What if we delete the cache entry, then change the database?
- Webserver (reader): read the cache
- Webserver (writer): delete the cache entry; change the database
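To see why the answer to the linearizability question is no, the following snippet replays the problematic interleaving, with plain Python dicts standing in for memcache and the database (illustrative only).

```python
# Plain dicts stand in for memcache and the database (illustrative only).
cache, db = {}, {"k": "v1"}

# Reader: cache miss, reads the old value from the database...
reader_value = db["k"]          # sees "v1"

# ...but before it refills the cache, a writer runs the full write path:
db["k"] = "v2"                  # writer changes the database
cache.pop("k", None)            # writer deletes the cache entry

# Now the reader's delayed fill lands:
cache["k"] = reader_value       # stale "v1" is cached

assert cache["k"] == "v1" and db["k"] == "v2"
```

The stale value can persist until the key is next written or evicted, which is why the leases introduced below aim to reduce per-key inconsistencies as well as miss swarms.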

Memcache Consistency
- Webserver (reader): read the cache; read the database; store back into the cache
- Webserver (writer): change the database; delete the entry
- Is the lookaside protocol eventually consistent?

Lookaside with Leases
- Goals: reduce (eliminate?) per-key inconsistencies; reduce cache-miss swarms
- On a read miss (sketched in code below):
  - Leave a marker in the cache ("fetch in progress") and return a timestamp
  - Check the timestamp when filling the cache; if it has changed, the value has (likely) changed: don't overwrite
- If another thread read-misses, it finds the marker and waits for the update (retries later)

Question
- What if the web server crashes while holding a lease?

Question
- Is lookaside with leases linearizable?
- Webserver (reader): read the cache
- Webserver (writer): change the database; delete the cache entry
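A minimal sketch of the lease idea, assuming a hypothetical cache client with atomic add, get, delete, and a set_if_token_matches operation; real memcached leases use a token returned on the miss, but the structure of the protocol is the same.

```python
import time
import uuid

class Lease:
    """Marker left in the cache on a miss ('fetch in progress')."""
    def __init__(self):
        self.token = uuid.uuid4().hex

def read_with_lease(cache, storage, key, retry_delay=0.05):
    while True:
        hit = cache.get(key)
        if isinstance(hit, Lease):
            # Someone else is already fetching this key: wait and retry
            # instead of adding to the miss swarm.
            time.sleep(retry_delay)
            continue
        if hit is not None:
            return hit                              # ordinary cache hit
        lease = Lease()
        if not cache.add(key, lease):               # atomic add: only one miss wins
            continue                                # lost the race; loop and wait
        value = storage.read(key)
        # Fill only if our lease token is still in place. A delete issued by a
        # writer since the miss removes the token, so a stale fill is dropped.
        # (set_if_token_matches is a hypothetical operation standing in for
        # memcached's lease-token check.)
        cache.set_if_token_matches(key, lease.token, value)
        return value
```

A writer's delete revokes the outstanding token, so a fill computed from pre-write data is silently discarded, and concurrent misses wait on the marker instead of stampeding the storage layer.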

Question
- Is lookaside with leases eventually consistent?
- Webserver (reader): read the cache
- Webserver (writer): change the database; CRASH! (before deleting the cache entry)

Question
- Would this be linearizable?
  - Read misses obtain a lease
  - Writes obtain a lease (preventing reads during the update)
- Except that FB replicates popular keys (need a lease on every copy?)
- A memcache server might fail, or appear to fail by being slow (e.g., to some nodes but not others)

Latency Optimizations
- Concurrent lookups: issue many lookups concurrently; prioritize those that have chained dependencies
- Batching: batch multiple requests (e.g., for different end users) to the same memcache server
- Incast control: limit concurrency to avoid collisions among RPC responses

More Optimizations
- Return stale data to the web server if a lease is held
  - No guarantee that concurrent requests returning stale data will be consistent with each other
- Partitioned memory pools
  - One pool for infrequently accessed, expensive-to-recompute data; another for frequently accessed, cheap-to-recompute data
  - If mixed, the frequent accesses will evict everything else
- Replicate keys if the access rate is too high
  - Implication for consistency?

Gutter
- When a memcache server fails, a flood of requests goes to fetch data from the storage layer
  - Slows users needing any key on the failed server
  - Slows other users due to storage-server contention
- Solution: a backup (gutter) cache (sketched below)
  - Time-to-live invalidation (OK if clients disagree about whether the memcache server is still alive)
  - TTL expiry means the gutter is only eventually consistent
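A sketch of the gutter fallback, assuming hypothetical regular_pool and gutter_pool clients whose get can time out; the 50 ms timeout and 10-second TTL are illustrative values.

```python
GUTTER_TTL = 10  # seconds; gutter entries expire rather than being invalidated

def gutter_read(regular_pool, gutter_pool, storage, key):
    regular_alive = True
    try:
        # Hypothetical client API: get() raises TimeoutError if the
        # memcache server is down or too slow to answer.
        value = regular_pool.get(key, timeout=0.05)
        if value is not None:
            return value
    except TimeoutError:
        regular_alive = False
        value = gutter_pool.get(key)        # failed server: try the gutter pool
        if value is not None:
            return value
    value = storage.read(key)               # true miss: go to the storage layer
    if regular_alive:
        regular_pool.set(key, value)
    else:
        gutter_pool.set(key, value, ttl=GUTTER_TTL)
    return value
```

Because gutter entries expire by TTL rather than being explicitly invalidated, they can serve slightly stale data, consistent with the eventual-consistency caveat above.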

Scaling Within a Cluster
- What happens as we increase the number of memcache servers to handle more load?
- Recall the all-to-all communication pattern:
  - Less data between any pair of nodes: less batching
  - Need even more replication of popular keys
  - More failures: need a bigger gutter cache

Multi-Cluster Scaling
- Multiple independent clusters within a data center, each with its own front ends and memcache servers
- Shared storage backend; data is replicated in the caches of each cluster (inefficient?)
- Web-server-driven invalidation would need to invalidate every cluster on every update
- Instead (mcsqueal): invalidate the local cluster on update (read-my-writes); background invalidation of the other clusters driven off the database update log; temporary inconsistency!

mcsqueal
- Web servers talk only to their local memcache; on an update they:
  - Acquire the local lease
  - Tell the storage layer which keys to invalidate
  - Invalidate the local memcache
- The storage layer sends invalidations to the other clusters (sketched below):
  - Scan the database log for updates/invalidations
  - Batch invalidations to each cluster (mcrouter)
  - Forward/batch invalidations to the remote memcache servers

Per-Cluster vs. Multi-Cluster
- Per-cluster memcache servers: frequently accessed data; inexpensive-to-compute data; lower latency, less efficient use of memory
- Shared multi-cluster memcache servers: infrequently accessed data; hard-to-compute data; higher latency, more memory-efficient

Cold Start Consistency
- During new-cluster startup: many cache misses! Lots of extra load on the storage servers
- Instead of going to the storage server on a cache miss:
  - The webserver gets the data from a warm memcache cluster
  - Puts the data into the local cluster
  - Subsequent requests hit in the local cluster
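A rough sketch of the mcsqueal fan-out described above: tail the committed database log, batch deletes per destination cluster, and hand each batch to that cluster's mcrouter. The interfaces used here (db_log.tail(), record.invalidated_keys, send_deletes) are assumed names, not real APIs.

```python
from collections import defaultdict

def squeal_loop(db_log, cluster_routers, batch_size=100):
    """Tail committed updates and fan out deletes to every cluster's router.
    cluster_routers maps a cluster name to its mcrouter-like endpoint."""
    pending = defaultdict(list)                     # cluster name -> queued deletes
    for record in db_log.tail():                    # one committed transaction at a time
        for key in record.invalidated_keys:
            for cluster in cluster_routers:
                pending[cluster].append(key)
        for cluster, keys in pending.items():
            if len(keys) >= batch_size:             # amortize RPCs by batching
                cluster_routers[cluster].send_deletes(keys)
                pending[cluster] = []
```

Batching at the storage layer keeps the invalidation rate per packet high, instead of having every web server send tiny invalidation RPCs to every remote cluster.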

Cold Start Consistency (race example)
- B: local update: B changes the database, queues a remote invalidation, and deletes its local memcache entry
- A: local cache miss: A reads from the remote (warm) cluster and puts the now-stale data into the local cache
- Only afterward is the remote invalidation applied
- Solution: prevent local cache fills within 2 seconds of a delete

Multi-Region Scaling
- Storage-layer consistency:
  - Storage at one data center is designated as the primary
  - All updates are applied at the primary
  - Updates propagate in the background to the other data centers
  - Invalidations to the memcache layer are delayed until after the update reaches that site
- However: webservers may read stale data, even data that they just wrote

Multi-Region Consistency
- To perform an update to a key (sketched below):
  - Put a marker into the local region
  - Send the write to the primary region
  - Delete the local copy
- On a cache miss:
  - Check for a local marker
  - If present, fetch the data from the primary region
  - Fill the local copy

Data Centers without Data
- Tradeoff in increasing the number of data centers:
  - Lower latency when data is near clients
  - More consistency overhead; more opportunity for inconsistency
- Mini data centers: front-end web servers and memcache servers, but no backend storage; remote access on cache misses
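A sketch of the remote-marker protocol from the Multi-Region Consistency slide, as seen by a web server in a non-primary region. The object names (local_cache, primary_db, replica_db) and the marker key format are assumptions; the structure follows the write and miss rules listed above.

```python
def regional_write(local_cache, primary_db, key, value):
    marker = "remote-marker:" + key
    local_cache.set(marker, True)                  # (1) note that local data is stale
    primary_db.write(key, value, marker=marker)    # (2) write at the primary; assume the
                                                   #     marker is cleared once replication
                                                   #     of this write reaches the region
    local_cache.delete(key)                        # (3) drop the local cached copy

def regional_read(local_cache, primary_db, replica_db, key):
    value = local_cache.get(key)
    if value is not None:
        return value
    if local_cache.get("remote-marker:" + key):
        value = primary_db.read(key)               # marker set: our recent write may not
                                                   # have replicated yet, so read the primary
    else:
        value = replica_db.read(key)               # no marker: the local replica is fine
    local_cache.set(key, value)
    return value
```

The marker trades extra cross-region read latency for read-your-writes on recently updated keys; once replication catches up and the marker is cleared, misses go back to the local replica.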