EVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield, Vu Nguyen. EVCache.

90 seconds

What do caches touch? Signing up*, logging in, choosing a profile, picking liked videos, personalization*, loading the home page*, scrolling the home page*, A/B tests, video image selection, searching*, viewing title details, playing a title*, subtitle / language preferences, rating a title, My List, video history*, UI strings, video production*. (* multiple caches involved)

Home Page Request

Ephemeral Volatile Cache: a key-value store optimized for AWS and tuned for Netflix use cases.

What is EVCache? A distributed, sharded, replicated key-value store. Tunable in-region and global replication. Based on Memcached. Resilient to failure. Topology aware. Linearly scalable. Seamless deployments.

Why Optimize for AWS? Instances disappear. Zones fail. Regions become unstable. The network is lossy. Customer requests bounce between regions. Failures happen, and we test all the time.

EVCache Use @ Netflix: hundreds of terabytes of data, trillions of ops per day, tens of billions of items stored, tens of millions of ops per second, millions of replications per second, thousands of servers, hundreds of instances per cluster, hundreds of microservice clients, tens of distinct clusters, 3 regions, 4 engineers.

Architecture (diagram): the application server embeds the client library and client; each cache node runs Memcached plus the EVCar sidecar; Eureka provides service discovery.

Architecture (diagram): clients and cache replicas in each availability zone (us-west-2a, us-west-2b, us-west-2c).

Reading (get) (diagram): a client reads from a primary copy, typically in its own zone, and falls back to a secondary copy in another zone (us-west-2a / 2b / 2c).

Writing (set, delete, add, etc.) (diagram): each client writes to the replicas in all three zones (us-west-2a, us-west-2b, us-west-2c).
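A minimal sketch of the zone-aware behavior the two diagrams above describe: writes fan out to the replica in every zone, while reads try the local (primary) copy first and fall back to another zone. ZoneClient and its in-memory map are illustrative stand-ins, not the EVCache client API.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the memcached replica in one availability zone.
struct ZoneClient {
  std::unordered_map<std::string, std::string> store;  // illustrative only
  void Set(const std::string& key, const std::string& value) { store[key] = value; }
  std::optional<std::string> Get(const std::string& key) {
    auto it = store.find(key);
    if (it == store.end()) return std::nullopt;
    return it->second;
  }
};

// Writes (set, delete, add, ...) go to every zone's replica in the region.
void SetEverywhere(std::vector<ZoneClient>& zones,
                   const std::string& key, const std::string& value) {
  for (auto& zone : zones) zone.Set(key, value);
}

// Reads try the caller's own zone first, then fall back to the other zones.
std::optional<std::string> GetZoneAware(std::vector<ZoneClient>& zones,
                                        std::size_t local_zone,
                                        const std::string& key) {
  if (auto v = zones[local_zone].Get(key)) return v;  // primary copy
  for (std::size_t i = 0; i < zones.size(); ++i) {    // secondary copies
    if (i == local_zone) continue;
    if (auto v = zones[i].Get(key)) return v;
  }
  return std::nullopt;
}
```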

Use Case: Lookaside Cache (diagram): the application uses the client library to check the cache first and a Ribbon client to reach the backing services on a miss, with data flowing between services (S) and caches (C).
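A small sketch of the lookaside (cache-aside) read path shown in the diagram: check the cache first, and on a miss load from the backing service and write the result back. The std::function hooks are placeholders, not the EVCache or Ribbon client APIs.

```cpp
#include <functional>
#include <optional>
#include <string>

// Lookaside read: a cache hit returns immediately; a miss loads from the
// service (e.g. via a Ribbon client) and repopulates the cache.
std::string LookasideGet(
    const std::string& key,
    const std::function<std::optional<std::string>(const std::string&)>& cache_get,
    const std::function<void(const std::string&, const std::string&)>& cache_set,
    const std::function<std::string(const std::string&)>& load_from_service) {
  if (auto cached = cache_get(key)) return *cached;  // cache hit
  std::string value = load_from_service(key);        // cache miss: call the service
  cache_set(key, value);                             // populate for the next reader
  return value;
}
```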

Use Case: Transient Data Store (diagram): multiple applications, each with the client library, share short-lived data through the cache over time.

Use Case: Primary Store (diagram): offline / nearline services precompute data (e.g. for recommendations) and write it to the cache; online applications read it through the client library.

Use Case: Versioned Primary Store (diagram): offline compute writes new versions of the data, a control system (Valhalla) switches the active version via Archaius dynamic properties, and the online application reads it through the client library.

Use Case: High Volume and High Availability (diagram): a remote system computes and publishes data on a schedule; the application reads it through the client library, with an optional in-memory layer, using a Ribbon client to reach services (S) and caches (C).

Pipeline of Personalization (diagram): a chain of offline compute stages (Compute A through E) passes data downstream to the online services (Online 1, Online 2).

Additional Features: Kafka (global data replication, consistency metrics); key iteration (cache warming, lost instance recovery, backup and restore).

Additional Features (Kafka): global data replication, consistency metrics.

Cross-Region Replication (diagram): (1) the app in Region A mutates data through the client; (2) the client sends metadata to Kafka; (3) the Replication Relay polls the messages; (4) for sets it gets the data from the local cache; (5) it sends the messages over HTTPS to the Replication Proxy in Region B; (6) the proxy applies the mutation there; (7) apps in Region B read the replicated data. The setup is symmetric, with a relay, Kafka, and proxy in each region.
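A compact sketch of the relay side of this flow. PollKafka, GetLocalValue, and SendToRemoteProxy are stubbed placeholders standing in for the real Kafka consumer, local cache read, and HTTPS call to the remote proxy; the sketch only illustrates the ordering of steps 3 through 5.

```cpp
#include <optional>
#include <string>
#include <vector>

// Step 2 publishes only metadata about each mutation to Kafka; the value
// itself stays in the local cache until the relay needs it.
struct ReplicationEvent {
  std::string key;
  std::string op;  // "set", "delete", ...
  long ttl = 0;
};

// Stubs for the real components (Kafka consumer, local cache, remote proxy).
std::vector<ReplicationEvent> PollKafka() { return {}; }                                // step 3
std::optional<std::string> GetLocalValue(const std::string&) { return std::nullopt; }   // step 4
void SendToRemoteProxy(const ReplicationEvent&, const std::optional<std::string>&) {}   // step 5

// One pass of the relay loop in the source region.
void RelayOnce() {
  for (const auto& event : PollKafka()) {
    std::optional<std::string> value;
    if (event.op == "set") value = GetLocalValue(event.key);  // fetch the payload for sets only
    SendToRemoteProxy(event, value);  // the remote proxy applies the mutation (step 6)
  }
}
```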

Additional Features (Key Iteration): cache warming, lost instance recovery, backup (and restore).

Cache Warming (diagram): a control process coordinates the application's client library, S3, and a Spark-based cache warmer, with separate data, control, and metadata flows.

Moneta: the next-generation EVCache server.

Moneta. Moneta: the goddess of memory; Juno Moneta: the protectress of funds for Juno. An evolution of the EVCache server focused on cost optimization: EVCache on SSD, an ongoing effort to lower EVCache cost per stream, taking advantage of global request patterns.

Old Server: stock Memcached and EVCar (sidecar). All data stored in RAM in Memcached. Expensive with global expansion / N+1 architecture.

Optimization: global data means many copies, and access patterns are heavily region-oriented. In one region, hot data is used often while cold data is almost never touched. Keep hot data in RAM and cold data on SSD; size RAM for the working set and SSD for the overall dataset.

New Server: adds Rend and Mnemonic. Still looks like Memcached. Unlocks cost-efficient storage and server-side intelligence.

Rend go get github.com/netflix/rend

Rend: a high-performance Memcached proxy and server, written in Go (powerful concurrency primitives, productive and fast). It manages the L1/L2 relationship and handles tens of thousands of connections.

Rend is a modular set of libraries and an example main(): it manages connections, request orchestration, and communication, and includes a low-overhead metrics library. Modules (diagram): connection management, server loop, protocol, request orchestration (multiple orchestrators, parallel locking for data integrity), backend handlers (efficient connection pool), and metrics.
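A sketch of the L1/L2 read orchestration described above, using hypothetical std::function hooks in place of Rend's actual Go interfaces. Backfilling L1 on an L2 hit is one plausible policy, not necessarily Rend's.

```cpp
#include <functional>
#include <optional>
#include <string>

// Tiered read: L1 is the in-RAM memcached, L2 is the SSD-backed store.
struct TieredRead {
  std::function<std::optional<std::string>(const std::string&)> l1_get;
  std::function<std::optional<std::string>(const std::string&)> l2_get;
  std::function<void(const std::string&, const std::string&)> l1_set;

  std::optional<std::string> Get(const std::string& key) const {
    if (auto v = l1_get(key)) return v;  // L1 hit: served from RAM
    if (auto v = l2_get(key)) {          // L1 miss: fall back to SSD
      l1_set(key, *v);                   // one plausible policy: backfill L1
      return v;
    }
    return std::nullopt;                 // miss in both tiers
  }
};
```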

Mnemonic

Mnemonic manages data storage on SSD. It uses the Rend server libraries, handles the Memcached protocol, and maps Memcached ops to RocksDB ops. Layers (diagram): Rend Server Core Lib (Go), Mnemonic Op Handler (Go), Mnemonic Core (C++), RocksDB (C++).
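A minimal sketch of the Memcached-to-RocksDB op mapping; the handler names are hypothetical and TTL, flags, and error handling are omitted, but the RocksDB calls are the standard C++ API.

```cpp
#include <rocksdb/db.h>
#include <string>

bool HandleSet(rocksdb::DB* db, const std::string& key, const std::string& value) {
  // memcached "set" becomes a RocksDB Put.
  return db->Put(rocksdb::WriteOptions(), key, value).ok();
}

bool HandleGet(rocksdb::DB* db, const std::string& key, std::string* value) {
  // memcached "get" becomes a RocksDB Get; a NotFound status is a cache miss.
  return db->Get(rocksdb::ReadOptions(), key, value).ok();
}

bool HandleDelete(rocksdb::DB* db, const std::string& key) {
  // memcached "delete" becomes a RocksDB Delete.
  return db->Delete(rocksdb::WriteOptions(), key).ok();
}
```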

Why RocksDB? Fast at medium-to-high write load; the disk write load is higher than the read load because Memcached absorbs most reads. Predictable RAM usage. (Diagram: records are written to memtables, which are flushed to SST files. SST: Static Sorted Table.)

How we use RocksDB: no level compaction (it generated too much traffic to the SSD and produced high, unpredictable read latencies); no block cache (we rely on the local Memcached); no compression.

How we use RocksDB: FIFO compaction. SSTs are ordered by time, the oldest SST is deleted when the store is full, and reads access every SST until the record is found.

How we use RocksDB: full-file Bloom filters. The full filter reduces unnecessary SSD reads, and Bloom filters and indices are pinned in memory to minimize SSD accesses per request.
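A sketch consolidating the RocksDB settings from the three "How we use RocksDB" slides above: FIFO compaction, no compression, no block cache, and full-file Bloom filters with indices and filters kept in memory. The size values are illustrative, and the two-argument NewBloomFilterPolicy form assumes a RocksDB release of that era (newer releases build full filters by default).

```cpp
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options BuildMonetaLikeOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // FIFO compaction: SSTs stay in time order and the oldest files are
  // dropped once the total size limit is reached.
  options.compaction_style = rocksdb::kCompactionStyleFIFO;
  options.compaction_options_fifo.max_table_files_size = 100ull << 30;  // illustrative: 100 GiB per shard

  // No compression; the local memcached absorbs the read-heavy traffic.
  options.compression = rocksdb::kNoCompression;

  rocksdb::BlockBasedTableOptions table_options;
  table_options.no_block_cache = true;  // rely on the local memcached, not RocksDB's cache
  // Full (not block-based) Bloom filter per SST file, ~10 bits per key.
  table_options.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10, /*use_block_based_builder=*/false));
  // Leaving this false keeps indices and filters resident in table-reader
  // memory rather than in the (disabled) block cache.
  table_options.cache_index_and_filter_blocks = false;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

  return options;
}
```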

How we use RocksDB: records are sharded across multiple RocksDB instances per node, which reduces the number of files checked and decreases latency. (Diagram: Mnemonic Core routes keys such as ABC and XYZ to one of several RocksDB instances.)
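A sketch of per-node sharding across several RocksDB instances; std::hash and the modulo routing are assumptions standing in for whatever hashing Mnemonic actually uses.

```cpp
#include <functional>
#include <string>
#include <vector>

#include <rocksdb/db.h>

// Routes each key to one RocksDB instance, so a read only consults that
// shard's SST files (and their Bloom filters).
class ShardedStore {
 public:
  explicit ShardedStore(std::vector<rocksdb::DB*> shards) : shards_(std::move(shards)) {}

  rocksdb::DB* ShardFor(const std::string& key) const {
    return shards_[std::hash<std::string>{}(key) % shards_.size()];
  }

 private:
  std::vector<rocksdb::DB*> shards_;
};
```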

Region-Locality Optimizations: replication and batch updates write only to RocksDB* , which keeps region-local, hot data in memory; a separate network port serves off-line requests. (* any corresponding Memcached data is replaced.)

FIFO Limitations: FIFO compaction is not suitable for all use cases. Very frequently updated records may push out valid records, expired records still exist on disk, and larger Bloom filters are required. (Diagram: successive SSTs over time accumulate multiple versions of the same records, e.g. A1/A2/A3 and B1/B2/B3, alongside live records.)

AWS Instance Type: i2.xlarge (4 vCPUs, 30 GB RAM, 800 GB SSD, 32K IOPS at 4 KB pages, ~130 MB/sec).

Moneta Perf Benchmark (High-Vol Online Requests)

Moneta Perf Benchmark (cont)

Moneta Perf Benchmark (cont)

Moneta Perf Benchmark (cont)

Moneta Performance in Production (Batch Systems): request get 95th/99th = 729 μs / 938 μs; L1 get 95th/99th = 153 μs / 191 μs; L2 get 95th/99th = 1005 μs / 1713 μs. ~20 KB records, ~99% overall hit rate, ~90% L1 hit rate.

Moneta Performance in Production (High-Volume Online Requests): request get 95th/99th = 174 μs / 588 μs; L1 get 95th/99th = 145 μs / 190 μs; L2 get 95th/99th = 770 μs / 1330 μs. ~1 KB records, ~98% overall hit rate, ~97% L1 hit rate.

Moneta Performance in Production (High-Volume Online Requests), latencies shown as peak (trough).
Get percentiles: 50th: 102 μs (101 μs), 75th: 120 μs (115 μs), 90th: 146 μs (137 μs), 95th: 174 μs (166 μs), 99th: 588 μs (427 μs), 99.5th: 733 μs (568 μs), 99.9th: 1.39 ms (979 μs).
Set percentiles: 50th: 97.2 μs (87.2 μs), 75th: 107 μs (101 μs), 90th: 125 μs (115 μs), 95th: 138 μs (126 μs), 99th: 177 μs (152 μs), 99.5th: 208 μs (169 μs), 99.9th: 1.19 ms (318 μs).

70% Reduction in cost*

Challenges/Concerns: less visibility (the overall data size is unclear because of duplicates and expired records), so we restrict the unique data set to half of the maximum for precomputed batch data; lower max throughput than the Memcached-based server and higher CPU usage, so planning must be better for us to handle unusually high request spikes.

Current/Future Work: investigate the blob storage feature (less data read from and written to SSD during level compaction, lower latency, higher throughput, a better view of total data size); purge expired SSTs earlier (useful for short-TTL use cases, may purge 60%+ of SSTs earlier than FIFO compaction, reduces worst-case latency, better visibility of overall data size); inexpensive deduping for batch data.

Open Source https://github.com/netflix/evcache https://github.com/netflix/rend

Thank You smansfield@netflix.com (@sgmansfield) nguyenvu@netflix.com techblog.netflix.com

Failure Resilience in the Client: fast failure of operations, tunable retries, operation queues, a tunable latch for mutations, async replication through Kafka.
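A sketch of what a "tunable latch" for mutations could look like: the caller waits for a configurable number of replica acknowledgements before declaring the write successful. WaitForAcks and the use of std::future are illustrative assumptions, not the EVCache client's implementation.

```cpp
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// Waits for a tunable number of replica acknowledgements (e.g. "1 of 3" or
// "all zones") before declaring a mutation successful, within a deadline.
bool WaitForAcks(std::vector<std::future<bool>>& acks,
                 std::size_t required,
                 std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  std::size_t ok = 0;
  for (auto& ack : acks) {
    if (ack.wait_until(deadline) == std::future_status::ready && ack.get()) {
      if (++ok >= required) return true;  // enough replicas acknowledged
    }
  }
  return false;  // the latch did not trip before the deadline: fail fast
}
```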

Consistency Metrics (diagram): (1) the app mutates data through the client; (2) the client sends metadata to Kafka; (3) the Consistency Checker polls the messages; (4) it pulls the data from the cache; (5) it reports to Atlas (the metrics backend), which feeds dashboards.

Lost Instance Recovery (diagram): the same components as cache warming (client library, control process, S3, and the Spark cache warmer) repopulate replacement instances, here spanning Zone A and Zone B with full and partial data flows plus control and metadata flows.

Backup (and Restore) (diagram): the Spark cache warmer and S3 provide backup and restore, coordinated through the client library with control and data flows.

Moneta in Production: serving all of our personalization data. Rend runs with two ports: one for standard users (read-heavy or active data management) and another for async and batch users (replication and precompute). It maintains the working set in RAM, is optimized for precomputes, and smartly replaces data in L1. (Diagram: the standard/external port fronts Memcached in RAM, the batch/internal port fronts Mnemonic on SSD, with the EVCar sidecar alongside.)

Rend batching backend