Be Fast, Cheap and in Control with SwitchKV. Xiaozhou Li

Goal: fast and cost-efficient key-value store
- Store, retrieve, and manage key-value objects: Get(key) / Put(key, value) / Delete(key)
- Target: cluster-level storage for modern cloud services
- Need both high-performance and low-cost operation
- Workload characteristics: large datasets, small objects, random access, massively parallel, deeply skewed, rapidly changing

Goal: fast and cost-efficient key-value store
- Use DRAM and flash efficiently: meet performance and capacity goals with minimal cost
  -> SwitchKV: hash-partitioned, flash-based storage cluster
- Ensure dynamic load balancing: always fully utilize all the storage servers
- Achieve maximum parallelism in multi-core systems: fully utilize the processing power of modern CPUs
  -> SwitchKV: exploit fast programmable networking

Key challenge: dynamic load balancing
[Figure: per-server load at time t vs. time t+x]
- Most servers are either over- or underutilized: low throughput and high tail latency
- Today's solutions: data migration / replication, with high system overhead and consistency challenges

Fast, small cache can ensure load balancing
- Need to cache only O(n log n) items to provide good load balance, where n is the number of backend nodes [Fan, SOCC'11]
[Figure: clients' hottest queries are served by the frontend cache node; the less-hot queries go to the flash-based backend nodes with better-balanced loads]
- E.g., 100 backends with hundreds of billions of items + a cache with 10,000 entries
- How to efficiently serve queries with cache and backend nodes?
- How to efficiently update the cache under dynamic workloads?
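
A quick back-of-the-envelope check of that bound, as a minimal Python sketch: with an assumed constant factor (not from the paper), n = 100 backends needs a cache of only a few thousand entries, which is why a 10,000-entry cache is enough regardless of how many billions of items the backends hold.

```python
import math

# Sketch of the O(n log n) cache-size rule [Fan, SOCC'11]: a front-end
# cache needs only on the order of n*log(n) entries to balance load over
# n backends, independent of total dataset size. The constant factor c
# below is an illustrative assumption, not a value from the paper.
def cache_entries_needed(n_backends: int, c: float = 20.0) -> int:
    return math.ceil(c * n_backends * math.log(n_backends))

print(cache_entries_needed(100))   # ~9,211 -- same order as the 10,000-entry cache
```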

High overheads with traditional caching architectures
[Figure: look-aside (clients query the cache, then go to the backends on a miss) vs. look-through (the cache sits between clients and backends as a load balancer)]
- The cache must process all queries and handle misses; in our case, the cache is small and the hit ratio could be low
- Throughput is bounded by the cache I/O
- High latency for queries on uncached keys
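
For contrast, a minimal look-aside sketch (with hypothetical cache and backend objects) shows why the cache sits on every query's path: each get goes through the cache first, so cache I/O bounds throughput and a miss costs an extra round trip.

```python
# Minimal look-aside sketch with hypothetical cache/backend objects.
# Every query touches the cache, and a miss costs an extra round trip
# to the backend -- the overheads this slide argues against.
def lookaside_get(key, cache, backend):
    value = cache.get(key)
    if value is None:                  # miss: extra round trip
        value = backend.get(key)
        cache.put(key, value)          # populate for next time
    return value
```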

SwitchKV: content-aware routing
[Figure: clients connect through OpenFlow switches, managed by a controller, to the cache and the backends]
- Switches route requests directly to the appropriate nodes
- Latency can be minimized for all queries
- Throughput can scale out with the number of backends
- Availability would not be affected by cache node failures

Exploit SDN and switch hardware (NSDI'16)
- Clients encode key information in packet headers
  - Encode the key hash in the MAC address for read queries
  - Use the destination backend's IP for all queries
- Switches maintain forwarding rules and route query packets
[Figure: packet in -> L2 table with one exact-match rule per cached key; on a hit the packet goes out to the cache, on a miss it falls through to the TCAM table with one match rule per physical machine and goes out to that backend]
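
As a rough illustration of the client-side encoding (not the paper's exact code), the sketch below maps a key to a 48-bit hash and writes it into the destination MAC of a read query; the hash function and the MAC-bit handling are assumptions.

```python
import hashlib

# Illustrative sketch of the NSDI'16 encoding: the client places a hash of
# the key in the destination MAC so the switch's L2 exact-match table can
# redirect cached keys to the cache node. The hash function and the bit
# handling below are assumptions, not the paper's exact scheme.
def read_query_dst_mac(key: bytes) -> str:
    h = bytearray(hashlib.sha1(key).digest()[:6])   # 48-bit key hash
    h[0] = (h[0] | 0x02) & 0xFE   # locally administered, unicast (assumed mapping)
    return ":".join(f"{b:02x}" for b in h)

print(read_query_dst_mac(b"user:1234"))
```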

Wait, modifying MAC addresses seems scary
- Frequently asked questions at NSDI'16:
  - L2 is so fundamental, can you really change it?
  - Is it realistic for production systems?
  - So I cannot use a regular UDP socket any more?
  - Any side effects on other network apps?
- Good news: P4 switches
  - Fully user-programmable data plane
  - No dangerous L2/L3 header changes by the application

P4 switch pipeline
[Figure: ingress pipeline (Parser -> MAUs connected by a metadata bus -> Deparser), Traffic Manager, then an egress pipeline with the same stages]
- Parser: converts packet data into metadata (Parsed Representation)
- Match-Action Units (MAU): operate on metadata
- Metadata Bus: carries information within the pipeline
- Deparser: converts metadata back into a serialized packet

New SwitchKV packet format for the P4 switch
[Packet layout: ETH | IP | UDP | TYPE | KEY HASH | PAYLOAD]
- IP: all queries use the backend as the destination
- UDP: a reserved port number for the key-value storage service
- Type: read, write, delete, or miss
- Key Hash: 64-bit or 128-bit hash value of the key
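
A hedged sketch of how a client might build and send such a query over a plain UDP socket (which is exactly what the P4 design allows): the type encoding, the 64-bit hash choice, and the port and IP values are illustrative assumptions.

```python
import hashlib
import socket
import struct

# Illustrative values -- the reserved UDP port and the backend IP are assumptions.
UDP_PORT_KV = 8913
TYPE_READ, TYPE_WRITE, TYPE_DELETE, TYPE_MISS = 0, 1, 2, 3

def switchkv_header(query_type: int, key: bytes) -> bytes:
    """TYPE (1 byte, assumed width) followed by a 64-bit KEY HASH."""
    key_hash = struct.unpack(">Q", hashlib.sha1(key).digest()[:8])[0]
    return struct.pack(">BQ", query_type, key_hash)

# A regular UDP socket suffices: the switch parses the fields that follow UDP.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(switchkv_header(TYPE_READ, b"user:1234"), ("10.0.0.5", UDP_PORT_KV))
```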

P4 program for SwitchKV

    header_type kv_t {
        fields {
            type:     KV_TYPE_BITWIDTH;
            key_hash: KV_HASH_BITWIDTH;
        }
    }
    header kv_t kv;

    parser parse_udp {
        extract(udp);
        return select(latest.dstport) {
            UDP_PORT_KV: parse_kv;
            default: ingress;
        }
    }

    parser parse_kv {
        extract(kv);
        return ingress;
    }

    table cache_lookup {
        reads   { kv.key_hash: exact; }
        actions {
            cache_hit;   // change dst Eth & IP
            _no_op;
        }
        size: CACHE_SIZE;
    }

    control ingress {
        ...
        if (valid(kv) and kv.type == read) {
            apply(cache_lookup);
            ...
        }
        ...
    }

Keep cache and switch rules updated
- New challenges for cache updates:
  - Only cache the hottest O(n log n) items
  - Limited switch rule update rate
- Goal: react quickly to workload changes with minimal updates
[Figure: backends send a periodic top-k <key, load> list and instant bursty-hot <key, value> reports to the cache; the cache fetches <key, value> pairs from the backends, and the controller installs the corresponding switch rule updates]
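
To make the two update paths concrete, here is a minimal backend-side sketch under assumed names and thresholds (none of which are from the paper): a periodic top-k <key, load> report plus an instant report when a key's load bursts past a threshold.

```python
import heapq
from collections import Counter

# Backend-side sketch of the two update paths on this slide. TOP_K, the
# burst threshold, and all names are illustrative assumptions.
TOP_K = 100
BURST_THRESHOLD = 5000     # queries within the current reporting period

class BackendLoadTracker:
    def __init__(self, report_bursty_hot, report_top_k):
        self.counts = Counter()
        self.report_bursty_hot = report_bursty_hot   # instant path
        self.report_top_k = report_top_k             # periodic path

    def record_query(self, key):
        self.counts[key] += 1
        if self.counts[key] == BURST_THRESHOLD:
            self.report_bursty_hot(key)              # promote immediately

    def end_of_period(self):
        top = heapq.nlargest(TOP_K, self.counts.items(), key=lambda kv: kv[1])
        self.report_top_k(top)                       # periodic <key, load> list
        self.counts.clear()
```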

Goal: fast and cost-efficient key-value store
- Use DRAM and flash efficiently: meet performance and capacity goals with minimal cost
  -> SwitchKV: hash-partitioned, flash-based storage cluster
- Ensure dynamic load balancing: always fully utilize all the storage servers
- Achieve maximum parallelism in multi-core systems: fully utilize the processing power of modern CPUs
  -> SwitchKV: exploit fast programmable networking

Multicore parallelism
- Concurrent access: more generic abstractions, but requires careful algorithm engineering
- Exclusive access: no inter-core communication, but load imbalance may hurt performance and requests must be delivered to the right cores

SwitchKV servers: exclusive access
- Use Intel DPDK to deliver each query to the right core
  - Match on (self-defined) packet header fields
[Figure: the programmable NIC matches on packet headers and steers each query into a per-core Rx queue (Rx 0 ... Rx n) for Core 0 ... Core n, bypassing the host stack]

SwitchKV servers: exclusive access
- Use Intel DPDK to deliver each query to the right core
  - Match on (self-defined) packet header fields
- Each core runs its own load tracker and counter
  - Easy to ensure high performance and correctness
- Avoid performance drops due to load imbalance
  - Backends do not face high skew in key popularity
  - Cache nodes exploit CPU caches and packet burst I/O
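
A minimal software model of the steering policy (the real system does this in the NIC by matching self-defined header fields; the modulo policy here is an assumption): each query is sent to the core that exclusively owns its key range, so no locking or inter-core communication is needed.

```python
# Software model of per-core steering: pick the serving core from the key
# hash carried in the packet header so each core owns a disjoint key range.
# The modulo policy is an assumption; the real system matches header fields
# in the programmable NIC and uses per-core Rx queues.
NUM_CORES = 8
per_core_queues = [[] for _ in range(NUM_CORES)]

def dispatch(packet, key_hash: int):
    core = key_hash % NUM_CORES        # exclusive owner of this key's range
    per_core_queues[core].append(packet)
```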

Evaluation (NSDI'16)
- How well does a fast, small cache improve the system's load balance and throughput?
- Does SwitchKV improve system performance compared to traditional architectures?
- Can SwitchKV react quickly to workload changes?

Evaluation platform: reference backend
- 1 Gb link
- Intel Atom C2750 processor
- Intel DC P3600 PCIe-based SSD
- RocksDB with 120 million 1 KB objects
- 99.4K queries per second

Evaluation platform
[Figure: four Xeon servers connected by 40 GbE links to a Pica8 P-3922 switch running OVS 2.3, with a Ryu controller; Xeon Server 1 acts as the client, Xeon Server 2 as the cache, and Xeon Servers 3 and 4 host the backends]
Default settings in this talk:
  # of backends     128
  backend tput      100 KQPS
  keyspace size     10 billion
  key size          16 bytes
  value size        128 bytes
  query skewness    Zipf 0.99
  cache size        10,000 entries
- Use Intel DPDK to efficiently transfer packets and modify headers
- The client adjusts its sending rate to keep the loss rate between 0.5% and 1%

Throughput with and without caching
[Figure: aggregate backend throughput without the cache, aggregate backend throughput with the cache, and the throughput of the cache itself (10,000 entries)]

Throughput vs. number of backends
[Figure: backend rate limit 50 KQPS, cache rate limit 5 MQPS]

End-to-end latency vs. throughput

Throughput with workload changes
[Figure: throughput over time for the traditional cache update method, periodic top-k updates only, and periodic top-k updates + instant bursty-hot key updates]
- Workload: 200 cold keys become the hottest keys every 10 seconds

SwitchKV: fast and cost-efficient key-value store
- Fast, small cache guarantees backend load balancing
- Efficient content-aware programmable network switching
- Low (tail) latency
- Scalable throughput
- High availability
- Keeps high performance under highly dynamic workloads

Thanks!