NetCache: Balancing Key-Value Stores with Fast In-Network Caching

NetCache: Balancing Key-Value Stores with Fast In-Network Caching Xin Jin, Xiaozhou Li, Haoyu Zhang, Robert Soulé Jeongkeun Lee, Nate Foster, Changhoon Kim, Ion Stoica

NetCache is a rack-scale key-value store that leverages in-network data plane caching to achieve billions of QPS throughput and ~10 μs latency, even under highly-skewed and rapidly-changing workloads. It is part of a new generation of systems enabled by programmable switches.

Goal: fast and cost-efficient rack-scale key-value storage
- Store, retrieve, and manage key-value objects
  - Critical building block for large-scale cloud services
  - Need to meet aggressive latency and throughput objectives efficiently
- Target workloads
  - Small objects
  - Read-intensive
  - Highly skewed and dynamic key popularity

Key challenge: highly-skewed and rapidly-changing workloads
- Skewed key popularity overloads the servers holding the hottest keys, causing low throughput and high tail latency
- Q: How to provide effective dynamic load balancing?

Opportunity: a fast, small cache can ensure load balancing
- The cache absorbs the hottest queries, leaving the load balanced across storage servers

Opportunity: a fast, small cache can ensure load balancing [B. Fan et al., SoCC '11; X. Li et al., NSDI '16]
- Cache the O(N log N) hottest items, where N is the number of servers
  - E.g., 10,000 hot objects for 100 backends holding 100 billion items
- Requirement: cache throughput >= aggregate backend throughput
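The effect of a small cache on load balance can be sanity-checked with a toy simulation (illustrative Python, not part of the paper): draw queries from a heavy-tailed key distribution, partition keys across servers, and compare the max/average server load with and without a small cache of the hottest keys.

```python
import random
from collections import Counter

def server_loads(queries, n_servers, cached=frozenset()):
    """Per-server query counts; queries for cached keys never reach a server."""
    loads = Counter()
    for key in queries:
        if key not in cached:
            loads[key % n_servers] += 1  # keys hash-partitioned across servers
    return loads

random.seed(0)
N_SERVERS = 100
# Heavy-tailed (Zipf-like) key popularity: a few keys dominate the workload.
queries = [int(random.paretovariate(1.1)) for _ in range(200_000)]

hot = {k for k, _ in Counter(queries).most_common(100)}  # small cache of hottest keys
before = server_loads(queries, N_SERVERS)
after = server_loads(queries, N_SERVERS, cached=hot)

imbalance = lambda loads: max(loads.values()) * N_SERVERS / sum(loads.values())
print("max/avg load, no cache:  ", round(imbalance(before), 1))
print("max/avg load, with cache:", round(imbalance(after), 1))
```

Without the cache, the server owning the most popular key sees load tens of times the average; with the hottest keys absorbed, the residual tail spreads far more evenly.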

NetCache: towards a billion-QPS key-value storage rack
The cache needs to provide the aggregate throughput of the storage layer:
- Flash/disk storage layer: O(100) KQPS per server, O(10) MQPS total -> a single in-memory cache (O(10) MQPS) suffices
- In-memory storage layer: O(10) MQPS per server, O(1) BQPS total -> the cache layer must sustain O(1) BQPS -> an in-network cache
- Small on-chip memory? Only need to cache O(N log N) small items

Key-value caching in network ASIC at line rate?!
- How to identify application-level packet fields?
- How to store and serve variable-length data?
- How to efficiently keep the cache up-to-date?

PISA: Protocol Independent Switch Architecture
- Programmable parser: converts packet data into metadata
- Programmable match-action pipeline: operates on metadata and updates memory states (match + action, memory, ALU)

PISA: Protocol Independent Switch Architecture (as used by NetCache)
- Programmable parser: parses custom key-value fields in the packet
- Programmable match-action pipeline: reads and updates key-value data; provides query statistics for cache updates

PISA: Protocol Independent Switch Architecture
- Control plane (CPU): network management and network functions; exposes a run-time API to the data plane over PCIe
- Data plane (ASIC): programmable parser and programmable match-action pipeline (match + action, memory, ALU)

NetCache rack-scale architecture
The top-of-rack switch sits between clients and storage servers; its control plane (cache management, run-time API over PCIe) manages the data plane's key-value cache and query statistics.
- Switch data plane
  - Key-value store to serve queries for cached keys
  - Query statistics to enable efficient cache updates
- Switch control plane
  - Inserts hot items into the cache and evicts less popular items
  - Manages memory allocation for the on-chip key-value store

Data plane query handling
- Read query (cache hit): (1) client -> switch; (2) switch serves the value from the cache, updates statistics, and replies directly to the client
- Read query (cache miss): (1) client -> switch; (2) switch updates statistics and forwards to the server; (3) server replies; (4) switch forwards the reply to the client
- Write query: (1) client -> switch; (2) switch invalidates the cached entry and forwards to the server; (3) server applies the write and replies; (4) switch forwards the reply to the client
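These three paths can be mirrored in a toy Python model (a dict stands in for the switch cache; the real logic runs in the switch pipeline, so this is purely illustrative):

```python
def handle_query(pkt, cache, stats):
    """Toy model of NetCache query handling at the top-of-rack switch.

    Returns a reply dict on a cache hit, or "forward" when the storage
    server must handle the query. `stats` stands in for the on-switch
    query statistics used for cache updates.
    """
    key, op = pkt["key"], pkt["op"]
    if op == "read":
        stats[key] = stats.get(key, 0) + 1        # update query statistics
        if key in cache:                          # hit: reply from the switch
            return {"op": "reply", "key": key, "value": cache[key]}
        return "forward"                          # miss: server answers
    if op in ("write", "delete"):
        cache.pop(key, None)                      # invalidate stale cached copy
        return "forward"                          # server owns the data

cache, stats = {"A": "A-value"}, {}
print(handle_query({"op": "read", "key": "A"}, cache, stats))   # served by switch
print(handle_query({"op": "write", "key": "A"}, cache, stats))  # invalidates, forwards
```

Note the invariant the write path preserves: the switch never holds a value newer than the server's, because writes always invalidate before they are forwarded.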

Key-value caching in network ASIC at line rate
- How to identify application-level packet fields?
- How to store and serve variable-length data?
- How to efficiently keep the cache up-to-date?

NetCache packet format
ETH | IP | TCP/UDP | OP | SEQ | KEY | VALUE
- Existing protocols (ETH, IP, TCP/UDP): used for L2/L3 routing; a reserved port number identifies NetCache packets
- NetCache protocol: OP (read, write, delete, etc.), SEQ, KEY, VALUE
- Application-layer protocol: compatible with existing L2-L4 layers
- Only the top-of-rack switch needs to parse NetCache fields
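For concreteness, such a header could be serialized as below. This is a sketch with assumed field widths (1-byte OP, 4-byte SEQ, fixed 16-byte key as in the prototype); the actual NetCache wire format may differ.

```python
import struct

OPS = {"read": 0, "write": 1, "delete": 2}
HEADER = struct.Struct("!B I 16s")  # OP, SEQ, KEY in network byte order

def pack_netcache(op, seq, key, value=b""):
    """Build the NetCache payload carried after the UDP header."""
    return HEADER.pack(OPS[op], seq, key.ljust(16, b"\x00")) + value

def unpack_netcache(payload):
    op, seq, key = HEADER.unpack_from(payload)
    return op, seq, key.rstrip(b"\x00"), payload[HEADER.size:]

pkt = pack_netcache("read", 7, b"user123")
print(unpack_netcache(pkt))  # -> (0, 7, b'user123', b'')
```

Placing these fields at a fixed offset after the UDP header is what lets the switch parser extract them without any loops or string handling.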

Key-value caching in network ASIC at line rate
- How to identify application-level packet fields?
- How to store and serve variable-length data?
- How to efficiently keep the cache up-to-date?

Key-value store using a register array in the network ASIC
Match-action table:
  Match: pkt.key == A -> Action: process_array(0)
  Match: pkt.key == B -> Action: process_array(1)

  action process_array(idx):
      if pkt.op == read:
          pkt.value <- array[idx]
      elif pkt.op == cache_update:
          array[idx] <- pkt.value

The register array holds one value per slot (e.g., A's value at slot 0, B's value at slot 1).
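The table and action above can be modeled in plain Python to make the behavior concrete (a dict stands in for the exact-match table and a list for the register array; names are illustrative):

```python
# Exact-match table: key -> register array slot.
lookup = {"A": 0, "B": 1}
register_array = ["A-value", "B-value", None, None]

def process_array(pkt, idx):
    """Mirror of the slide's process_array action."""
    if pkt["op"] == "read":
        pkt["value"] = register_array[idx]   # pkt.value <- array[idx]
    elif pkt["op"] == "cache_update":
        register_array[idx] = pkt["value"]   # array[idx] <- pkt.value

pkt = {"op": "read", "key": "A"}
if pkt["key"] in lookup:                     # table match
    process_array(pkt, lookup[pkt["key"]])
print(pkt["value"])  # -> A-value
```

The key difference in hardware is that the "dict lookup" is a TCAM/SRAM exact-match stage and the "list" is stateful register memory, both touched exactly once per packet.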

Variable-length key-value store in network ASIC?
Key challenges:
- No loops or strings, due to strict timing requirements
- Need to minimize hardware resource consumption
  - Number of table entries
  - Size of action data from each entry
  - Size of intermediate metadata across tables

Combine outputs from multiple arrays
- A lookup table maps each key to a bitmap and an index (e.g., pkt.key == A -> bitmap = 111, index = 0)
- The bitmap indicates which register arrays store chunks of the key's value
- The index indicates the slot within those arrays holding the value
- Value table i (match: bitmap[i] == 1 -> action: process_array_i(index)) reads its chunk, e.g., A0, A1, A2 from register arrays 0, 1, 2
- Minimal hardware resource overhead

Combine outputs from multiple arrays (example)
Lookup table:
  pkt.key == A -> bitmap = 111, index = 0   (value A0 A1 A2)
  pkt.key == B -> bitmap = 110, index = 1   (value B0 B1)
  pkt.key == C -> bitmap = 010, index = 2   (value C0)
  pkt.key == D -> bitmap = 101, index = 2   (value D0 D1)
Register arrays:
  array 0: [A0, B0, D0, -]
  array 1: [A1, B1, C0, -]
  array 2: [A2, -,  D1, -]
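A small Python model of this example (strings stand in for bitmaps, and value chunks like "A0" are just labels; the switch does this with one value table per register array):

```python
# Lookup table: key -> (bitmap over register arrays, slot index).
lookup = {
    "A": ("111", 0),   # value A0 A1 A2 spans arrays 0, 1, 2 at slot 0
    "B": ("110", 1),   # value B0 B1 spans arrays 0, 1 at slot 1
    "C": ("010", 2),   # value C0 lives in array 1 at slot 2
    "D": ("101", 2),   # value D0 D1 spans arrays 0, 2 at slot 2
}
arrays = [
    ["A0", "B0", "D0", None],   # register array 0
    ["A1", "B1", "C0", None],   # register array 1
    ["A2", None, "D1", None],   # register array 2
]

def read_value(key):
    """Value table i contributes arrays[i][index] iff bitmap[i] == 1."""
    bitmap, index = lookup[key]
    return [arrays[i][index] for i in range(len(arrays)) if bitmap[i] == "1"]

print(read_value("A"))  # -> ['A0', 'A1', 'A2']
print(read_value("D"))  # -> ['D0', 'D1']
```

Because each value table only ever does one fixed-slot array access, variable-length values are assembled without loops, and short values (C, D) can pack around long ones (A) in the same arrays.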

Key-value caching in network ASIC at line rate
- How to identify application-level packet fields?
- How to store and serve variable-length data?
- How to efficiently keep the cache up-to-date?

Cache insertion and eviction
- Challenge: cache the hottest O(N log N) items with a limited insertion rate
- Goal: react quickly and effectively to workload changes with minimal updates
Update protocol (switch data plane <-> control plane over PCIe):
1. The data plane reports hot keys
2. The control plane compares the loads of new hot keys and sampled cached keys
3. The control plane fetches values for keys to be inserted into the cache
4. The control plane inserts and evicts keys
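Steps 2-4 amount to a compare-and-replace decision, which might look like the following simplified sketch (the paper's actual sampling and eviction policy is more involved; `fetch_value` is a hypothetical stand-in for fetching from the storage server):

```python
def update_cache(hot_key, cache, counters, estimate, fetch_value):
    """Insert a reported hot key if it beats the least-popular sampled cached key.

    hot_key:  key reported hot by the data plane (step 1)
    counters: per-key hit counters for cached keys
    estimate: approximate query count for the uncached key (count-min sketch)
    """
    victim = min(counters, key=counters.get)         # step 2: compare loads
    if estimate > counters[victim]:
        cache.pop(victim)                            # step 4: evict the victim
        counters.pop(victim)
        cache[hot_key] = fetch_value(hot_key)        # step 3: fetch the value
        counters[hot_key] = 0                        # step 4: insert, reset stats

cache = {"x": "xv", "y": "yv"}
counters = {"x": 50, "y": 3}
update_cache("z", cache, counters, estimate=40, fetch_value=lambda k: k + "-value")
print(sorted(cache))  # -> ['x', 'z']  ("y" evicted, since 40 > 3)
```

Gating insertions on beating a sampled victim is what keeps the update rate minimal: keys churn only when the workload actually shifts.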

Query statistics in the data plane
- Cached keys: a per-key counter array, updated on each cache lookup
- Uncached keys
  - Count-Min sketch: reports new hot keys
  - Bloom filter: removes duplicate hot-key reports
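A count-min sketch fits in a few lines (illustrative Python with arbitrary rows/width; the switch realizes it with register arrays and hardware hash units):

```python
import hashlib

class CountMin:
    """Minimal count-min sketch: add() increments, estimate() never undercounts."""
    def __init__(self, rows=4, width=1024):
        self.rows, self.width = rows, width
        self.table = [[0] * width for _ in range(rows)]

    def _col(self, key, row):
        # One independent hash per row (salted BLAKE2b for illustration).
        digest = hashlib.blake2b(key.encode(), salt=bytes([row])).digest()
        return int.from_bytes(digest[:4], "big") % self.width

    def add(self, key):
        for r in range(self.rows):
            self.table[r][self._col(key, r)] += 1

    def estimate(self, key):
        return min(self.table[r][self._col(key, r)] for r in range(self.rows))

cm = CountMin()
for _ in range(50):
    cm.add("hot-key")
cm.add("cold-key")
print(cm.estimate("hot-key") >= 50)  # -> True (collisions only overestimate)
```

A Bloom filter over already-reported keys (not shown) plays the complementary role: once a key crosses the hot threshold and is reported, it is filtered out of subsequent reports to the control plane.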

Evaluation
- Can NetCache run on programmable switches at line rate?
- Can NetCache provide significant overall performance improvements?
- Can NetCache efficiently handle workload dynamics?

Prototype implementation and experimental setup
- Switch
  - P4 program (~2K LOC)
  - Routing: basic L2/L3 routing
  - Key-value cache: 64K items with 16-byte keys and up to 128-byte values
  - Evaluation platform: one 6.5 Tbps Barefoot Tofino switch
- Server
  - 16-core Intel Xeon E5-2630, 128 GB memory, 40 Gbps Intel XL710 NIC
  - TommyDS for the in-memory key-value store
  - Per-server baseline: 10 MQPS throughput, 7 us latency

Single switch benchmark: the boring life of a NetCache switch
[Figure 9: switch microbenchmark (read and update) - (a) throughput (BQPS) vs. value size (0-128 bytes); (b) throughput (BQPS) vs. cache size (0-64K items). The switch sustains line rate (~2.5 BQPS) regardless of value size or cache size.]

... and its not-so-boring benefits: 1 switch + 128 storage servers
[Figure: system throughput (BQPS) of NoCache, NetCache (servers), and NetCache (cache) under uniform, zipf-0.9, zipf-0.95, and zipf-0.99 workload distributions.]
NetCache delivers 3-10x throughput improvements.

Impact of workload dynamics
[Figure: throughput (MQPS) over 100 seconds, averaged per second and per 10 seconds, for a hot-in workload (radical change) and a random workload (moderate change).]
NetCache quickly and effectively reacts to a wide range of workload dynamics. (2 physical servers emulate 128 storage servers; performance is scaled down by 64x.)

NetCache is a rack-scale key-value store that leverages in-network data plane caching to achieve billions of QPS throughput and ~10 μs latency, even under highly-skewed and rapidly-changing workloads.

Conclusion: programmable switches beyond networking
- Cloud datacenters are moving towards
  - Rack-scale disaggregated architectures
  - In-memory storage systems
  - Task scheduling at microsecond granularity
- Programmable switches can do more than packet forwarding
  - Cross-layer co-design of compute, storage, and network stacks
  - Switches help with caching, coordination, scheduling, etc.
- New generations of systems enabled by programmable switches