BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding. Shenglong Li, Quanlu Zhang, Zhi Yang, and Yafei Dai, Peking University

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

In-memory KV-Store: a crucial building block for many systems, serving as a data cache (e.g., Memcached and Redis at Facebook and Twitter) and as an in-memory database. Availability is important for in-memory KV-stores: Facebook reports that it takes 2.5-3 hours to recover 120 GB of in-memory database data from disk into memory. Data redundancy in distributed memory is therefore essential for fast failover.

Two redundancy schemes. Replication is the classical way to provide data availability (e.g., Repcached, Redis), but it incurs both high bandwidth cost and high memory cost. (Diagram: a client write request is applied on the data node and propagated as updates to each backup node.)

Two redundancy schemes. Erasure coding is a space-efficient redundancy scheme, and the increase of CPU speed enables fast data recovery: encoding/decoding rates can reach 40 Gb/s on a single core [1]. It has low memory cost, but updates still incur high bandwidth cost. (Diagram: a client write request updates a data node, which must in turn update every parity node.) [1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16
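
A minimal sketch of the memory-for-bandwidth trade-off behind erasure coding, using a single XOR parity block rather than the Reed-Solomon codes (e.g., RS(3,2)) that Cocytus and BCStore actually use; helper names are illustrative:

```python
# Minimal sketch of erasure coding's space savings, assuming a single XOR
# parity block instead of the Reed-Solomon codes used in practice.
# k data blocks plus one parity block tolerate one lost block, versus
# k extra full copies under replication.

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode(data_blocks):
    """Return the parity block for a stripe of k data blocks."""
    return xor_blocks(data_blocks)

def recover(surviving_blocks):
    """Reconstruct the single missing block from the surviving blocks
    (any k of the k data blocks plus the parity block)."""
    return xor_blocks(surviving_blocks)

if __name__ == "__main__":
    stripe = [b"obj1", b"obj2", b"obj3"]          # k = 3 data blocks
    parity = encode(stripe)
    # Lose data block 1; recover it from the rest of the stripe.
    assert recover([stripe[0], stripe[2], parity]) == stripe[1]
```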

In-place Update: a traditional mechanism for encoding small objects. To update an object (e.g., obj4 -> obj4'), the data node computes Delta(obj4, obj4') and sends it to every parity node so that each parity block can be patched in place. With two parity nodes, the bandwidth cost is the same as 3-replication. Our goal: both memory efficiency and bandwidth efficiency. (Diagram: objects obj1-obj9 spread over three data nodes with parity blocks on two parity nodes; updates to obj4, obj3, and obj8 each propagate a delta to both parity nodes.)
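
A minimal sketch of delta-based in-place update, again with XOR parity rather than Reed-Solomon (a real RS code would scale the delta by each parity's coding coefficient); it shows why one update costs one data-node write plus m parity-node writes, matching 3-replication when m = 2:

```python
# Sketch of in-place update with delta patching under XOR parity. Updating
# one object sends the new value to its data node and a delta to each of the
# m parity nodes, so with m = 2 the bandwidth matches 3-replication.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def in_place_update(data_nodes, parity_nodes, node_id, slot, new_value):
    old_value = data_nodes[node_id][slot]
    delta = xor(old_value, new_value)
    data_nodes[node_id][slot] = new_value          # 1 transfer to the data node
    for parity in parity_nodes:                    # m transfers, one per parity node
        parity[slot] = xor(parity[slot], delta)

if __name__ == "__main__":
    data_nodes = [[b"obj1"], [b"obj2"], [b"obj3"]]
    parity = xor(xor(b"obj1", b"obj2"), b"obj3")
    # Two identical XOR parities, purely for illustration (real RS parities differ).
    parity_nodes = [[parity], [parity]]
    in_place_update(data_nodes, parity_nodes, node_id=0, slot=0, new_value=b"objA")
    # Parity stays consistent with the updated stripe.
    assert parity_nodes[0][0] == xor(xor(b"objA", b"obj2"), b"obj3")
```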

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

Our Design: aggregate write requests and encode the objects into a new coding stripe (batch coding). Updated values (e.g., obj4', obj8', obj3') are appended as a fresh stripe with its own parity blocks, and the superseded blocks are simply marked invalid. (Diagram: three data nodes, a batch node, and two parity nodes; the batched stripe is appended after the original stripes.)
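
A minimal sketch of this batch-coding write path, assuming XOR parity and in-memory lists standing in for the data and parity nodes; the class and field names are illustrative, not BCStore's actual code:

```python
# Sketch of the batch-coding write path: new values are appended as a fresh
# stripe; superseded blocks are only marked invalid and reclaimed later by GC.
K = 3  # data blocks per stripe

class BatchCoder:
    def __init__(self):
        self.pending = []                  # (key, value) waiting to be coded
        self.stripes = []                  # appended stripes: (values, parity)
        self.location = {}                 # key -> (stripe_id, offset)
        self.invalid = set()               # superseded (stripe_id, offset) blocks

    def put(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) == K:
            self._flush()

    def _flush(self):
        keys, values = zip(*self.pending)
        size = max(len(v) for v in values)
        padded = [v.ljust(size, b"\0") for v in values]
        parity = bytes(a ^ b ^ c for a, b, c in zip(*padded))   # XOR parity over K = 3 blocks
        stripe_id = len(self.stripes)
        self.stripes.append((list(values), parity))
        for offset, key in enumerate(keys):
            if key in self.location:       # old version becomes garbage
                self.invalid.add(self.location[key])
            self.location[key] = (stripe_id, offset)
        self.pending.clear()

if __name__ == "__main__":
    bc = BatchCoder()
    for key, value in [("k1", b"v1"), ("k2", b"v2"), ("k3", b"v3"),
                       ("k1", b"v1-new"), ("k4", b"v4"), ("k5", b"v5")]:
        bc.put(key, value)
    print(bc.location["k1"], bc.invalid)   # k1 now lives in stripe 1; (0, 0) is garbage
```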

Latency Analysis: batch coding induces extra request waiting time. We formalize the waiting time as W = f(t, k), where t is the request throughput and k is the number of data nodes, and require W to stay within a latency bound ε. (Plot: waiting time versus request throughput for k = 3, with the latency bound ε marked.)
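
A sketch of one plausible way to enforce such a bound: close the current batch once k objects have arrived or the oldest pending request has waited ε seconds (a timed-out partial batch would need padding before encoding). This mechanism is an assumption for illustration; the paper derives the exact form of W = f(t, k).

```python
# Latency-bounded batching policy sketch: flush when the batch is full or the
# oldest pending request has waited epsilon seconds, so the extra waiting time
# introduced by batch coding stays below the latency bound.
import time

class BoundedBatcher:
    def __init__(self, k, epsilon, flush):
        self.k = k                  # objects per coding stripe
        self.epsilon = epsilon      # latency bound in seconds
        self.flush = flush          # callback that encodes and appends a stripe
        self.pending = []
        self.oldest = None

    def put(self, key, value):
        if not self.pending:
            self.oldest = time.monotonic()
        self.pending.append((key, value))
        self._maybe_flush()

    def tick(self):
        """Called periodically by a timer thread."""
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.pending) >= self.k
        expired = bool(self.pending) and time.monotonic() - self.oldest >= self.epsilon
        if full or expired:
            self.flush(self.pending)   # a partial batch may need padding before encoding
            self.pending = []
```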

Garbage Collection: recycle updated or deleted blocks and release the extra parity blocks. A naive move-based garbage collection moves the remaining valid blocks out of the original stripes into batched stripes, which incurs much bandwidth cost for updating parity blocks. (Diagram: valid blocks are moved from original stripes into batched stripes on the data nodes, and the parity nodes must be updated for every GC'd stripe.)

Garbage Collection: how do we reduce the GC bandwidth cost? Intuition: GC the stripes with the most invalid blocks first (greedy block moving). (Diagram: with greedy selection, two block moves are enough to release two coding stripes.)
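
A minimal sketch of that greedy victim selection; the data layout (a map from stripe id to its invalid block offsets) is an illustrative stand-in:

```python
# Greedy stripe selection for garbage collection: reclaim the stripes with
# the most invalid blocks first, so the fewest valid blocks have to be moved
# (and re-encoded) per stripe released. `stripes` maps a stripe id to its set
# of invalid block offsets; `k` is the number of blocks per stripe.

def pick_gc_victims(stripes, k, stripes_to_release):
    # Sort stripes by number of invalid blocks, most-invalid first.
    order = sorted(stripes, key=lambda sid: len(stripes[sid]), reverse=True)
    victims = order[:stripes_to_release]
    moves = sum(k - len(stripes[sid]) for sid in victims)  # valid blocks to move
    return victims, moves

if __name__ == "__main__":
    # Stripe 2 has two invalid blocks, stripe 0 has one, stripe 1 has none.
    stripes = {0: {1}, 1: set(), 2: {0, 2}}
    victims, moves = pick_gc_victims(stripes, k=3, stripes_to_release=2)
    print(victims, moves)   # [2, 0] and 1 + 2 = 3 block moves
```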

Garbage Collection: how do we further reduce block moves? Intuition: make updates concentrate on a few stripes, using popularity-based data arrangement. (Diagram: hot and cold objects are separated into different stripes; now only one block move is needed to release two coding stripes.)

Bandwidth Analysis. Theorem: GC bandwidth + coding bandwidth <= in-place update bandwidth. The detailed proof can be found in our paper.

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

System Architecture. Clients send requests to a batch process, which handles preprocessing, batch coding, garbage collection, and metadata management; the coded blocks are stored in a storage group consisting of data processes and parity processes. (Diagram: multiple clients -> batch process -> data processes and parity processes in a storage group.)

Handle Write Requests. Clients issue set(k1, v1), set(k2, v2), and set(k3, v3). The batch process collects the values, batch-codes them into a stripe (v1, v2, v3 plus parities P1, P2) with stripe id b1, updates the hash table and stripe index, and distributes the blocks: the values go to data processes 1-3 and the parities to parity processes 1-2.
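
A minimal sketch of that distribution step, assuming the stripe's blocks have already been produced (as in the batch-coding sketch above); all structure names here are illustrative stand-ins:

```python
# Distributing one freshly coded stripe: each block goes to one data or
# parity process, and the hash table is updated to map every key in the
# batch to the new stripe id.

def dispatch_stripe(stripe_id, keys, values, parities,
                    data_processes, parity_processes, hash_table):
    for process, value in zip(data_processes, values):
        process[stripe_id] = value          # i-th value -> i-th data process
    for process, parity in zip(parity_processes, parities):
        process[stripe_id] = parity         # P1, P2 -> parity processes
    for key in keys:
        hash_table[key] = stripe_id         # k1, k2, k3 -> b1

if __name__ == "__main__":
    data_processes = [{}, {}, {}]
    parity_processes = [{}, {}]
    hash_table = {}
    dispatch_stripe("b1", ["k1", "k2", "k3"], [b"v1", b"v2", b"v3"],
                    [b"P1", b"P2"], data_processes, parity_processes, hash_table)
    assert hash_table["k1"] == "b1" and data_processes[0]["b1"] == b"v1"
```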

Handle Read Requests. A client issues get(k1). The batch process looks up the hash table, which maps each key to its stripe id (k1, k2, k3 all map to stripe b1), consults the stripe index, and issues get(b1) to the data process holding the value (here, v1 on data process 2).
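
A minimal sketch of this read path; the hash table and a hypothetical stripe index map a key to its stripe and to the data process that stores its block:

```python
# Read path under batch coding: a get() is a metadata lookup in the batch
# process followed by one fetch from a data process. The stripe_index and
# data_processes structures below are illustrative, not BCStore's actual ones.

def get(key, hash_table, stripe_index, data_processes):
    stripe_id = hash_table[key]                        # e.g. k1 -> b1
    process_id, offset = stripe_index[(stripe_id, key)]
    return data_processes[process_id][stripe_id][offset]

if __name__ == "__main__":
    hash_table = {"k1": "b1", "k2": "b1", "k3": "b1"}
    stripe_index = {("b1", "k1"): (2, 0), ("b1", "k2"): (1, 0), ("b1", "k3"): (3, 0)}
    data_processes = {1: {"b1": [b"v2"]}, 2: {"b1": [b"v1"]}, 3: {"b1": [b"v3"]}}
    assert get("k1", hash_table, stripe_index, data_processes) == b"v1"
```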

Recovery: recover the requested data first. For a client get(k1) whose block is lost: 1. fetch the stripe's blocks from any k surviving storage processes according to the stripe id; 2. decode to recover the lost block (e.g., reconstruct v1 from the surviving values and parities). (Diagram: the batch process's decoder reads the surviving blocks of the stripe and returns the reconstructed value.)
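
A minimal sketch of such a degraded read, reusing the single-XOR-parity layout from the earlier sketches (a real RS(k, m) decode would take any k of the k + m blocks):

```python
# Degraded read sketch: if the data process holding the requested value is
# down, fetch the stripe's surviving blocks and decode the missing one.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def degraded_read(stripe_blocks):
    """stripe_blocks: the stripe's data blocks plus parity, None at the lost slot."""
    survivors = [b for b in stripe_blocks if b is not None]
    return xor_blocks(survivors)            # reconstructs the single missing block

if __name__ == "__main__":
    v1, v2, v3 = b"v1", b"v2", b"v3"
    parity = xor_blocks([v1, v2, v3])
    # The data process holding v1 has failed.
    assert degraded_read([None, v2, v3, parity]) == v1
```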

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

Evaluation. Cluster configuration: 10 machines running SUSE Linux 11, with 12 AMD Opteron Processor 4180 CPUs and 1 Gb/s Ethernet. Targets of comparison: in-place update EC (Cocytus [1]) and replication (Rep). Workload: YCSB with different key distributions and a 50%:50% read/write ratio. [1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16

Bandwidth Cost: BCStore saves up to 51% of the bandwidth cost. (Figure: bandwidth cost for different coding schemes.)

Throughput: up to 2.4x improvement. (Figure: throughput for different coding schemes.)

Memory: BCStore saves up to 41% of the memory cost. (Figure: memory consumption for different redundancy schemes.)

Latency. (Figures: read latency and write latency.)

Conclusion. Efficiency and availability are two crucial features for in-memory KV-stores. We build BCStore, an in-memory KV-store that applies erasure coding for data availability. We design a batch coding mechanism to achieve high bandwidth efficiency for write workloads, and we propose a heuristic garbage collection algorithm to improve memory efficiency.

Thanks! Q&A

Severity of Bandwidth Cost. Write requests are prevalent in large-scale web services, and peak load can easily exhaust network bandwidth and degrade service performance. The monetary cost of bandwidth becomes several times higher, especially under the commonly used peak-load pricing model. Bandwidth amplification becomes more serious as m (the number of parity servers) increases, and the bandwidth budget is usually limited in a workload-sharing cluster. Our goal: high memory efficiency and high bandwidth efficiency.

Our Design: batch write requests and append a new coding stripe (batch coding). (Diagram: updated objects obj4', obj8', obj3' are encoded together and appended as a new stripe with new parity blocks on the two parity nodes.)

Challenges: (1) recycle the memory space of data blocks that are deleted or updated; since data blocks and parity blocks are appended to storage, updated blocks cannot be deleted directly. (2) Encode variable-sized data efficiently; variable-sized data cannot be appended directly into the previously allocated storage space.

Garbage Collection: popularity-based data arrangement. Batched objects are sorted by popularity before encoding, so hot objects are grouped into the same coding stripes and cold objects into others. (Diagram: batched objects sorted from hot to cold across data nodes 1-3, with parities on parity nodes 1-2.)
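
A minimal sketch of this arrangement step; the popularity metric here is a simple access counter, which is an assumption for illustration (the paper's exact metric may differ):

```python
# Popularity-based data arrangement: sort batched objects by an access-
# frequency estimate before encoding, so hot objects land in the same stripes.
# Future updates then invalidate whole "hot" stripes, which the greedy GC can
# release with few block moves.

def arrange_by_popularity(batched, popularity, k):
    """Group batched (key, value) pairs into stripes of k objects, hottest first."""
    ordered = sorted(batched, key=lambda kv: popularity.get(kv[0], 0), reverse=True)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]

if __name__ == "__main__":
    batched = [("a", b"1"), ("b", b"2"), ("c", b"3"),
               ("d", b"4"), ("e", b"5"), ("f", b"6")]
    popularity = {"a": 90, "d": 80, "f": 75, "b": 3, "c": 2, "e": 1}
    for stripe in arrange_by_popularity(batched, popularity, k=3):
        print([key for key, _ in stripe])   # ['a', 'd', 'f'] then ['b', 'c', 'e']
```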

Encoding Variable-sized Data: virtual coding stripes (vcs). Each virtual coding stripe has a large fixed-length space and is aligned in the virtual address space, while the physical space stores only the actual data. (Diagram: virtual stripes vcs1-vcs3 aligned across data nodes 1-3 and parity nodes 1-2, next to the packed physical space on data node 1.)
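
A minimal sketch of the idea under a single XOR parity: parity is computed over fixed-length virtual slots (zero-padded), while each node keeps only the bytes actually written. The slot size and helper names are illustrative assumptions:

```python
# Virtual coding stripes for variable-sized data: each stripe reserves a
# fixed-length slot per node in the *virtual* address space, so slots stay
# aligned and parity is computed over equal-length buffers, while each node
# physically stores only the bytes actually written.
VCS_SLOT = 16          # fixed-length virtual slot per node (bytes)

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode_virtual_stripe(values):
    """Pad each variable-sized value to the virtual slot length, compute
    parity over the padded slots, but record only the raw values physically."""
    padded = [v.ljust(VCS_SLOT, b"\0") for v in values]
    parity = xor_blocks(padded)
    physical = [(v, len(v)) for v in values]     # what the data nodes keep
    return physical, parity

if __name__ == "__main__":
    physical, parity = encode_virtual_stripe([b"short", b"a bit longer", b"x"])
    print([size for _, size in physical], len(parity))   # [5, 12, 1] 16
```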

Bandwidth Cost. (Figure: bandwidth cost for a moderately skewed Zipfian workload with RS(3,2).)

Throughput. (Figure: throughput for a moderately skewed Zipfian workload.)

Throughput. (Figure: throughput during recovery.)

In-place Update: a traditional mechanism for coding small objects. (Diagram: objects obj1-obj9 laid out across three data nodes, with parity blocks on two parity nodes.)

Garbage Collection: how do we further reduce block moves? Intuition: make updates concentrate on a few stripes, using popularity-based data arrangement. (Diagram: hot/cold separation across original and batched stripes during GC.)

Bandwidth Analysis. Theorem: GC bandwidth + coding bandwidth <= in-place update bandwidth. (Diagram: the worst case of GC bandwidth, where every original stripe has to be garbage-collected into batched stripes.)

Bandwidth Cost. (Figure: bandwidth cost at different throughput levels with RS(5,4).)

Recovery of the batch process: its metadata M is replicated to a standby batch process. 1. The standby gets the latest batch id from the storage processes; 2. it updates the latest stable batch id and reconstructs the metadata; 3. it serves client requests. (Diagram: clients, the failed and standby batch processes, data processes 1-3, and parity processes 1-2.)