BCStore: Bandwidth-Efficient In-memory KV-Store with Batch Coding. Shenglong Li, Quanlu Zhang, Zhi Yang, and Yafei Dai, Peking University

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

In-memory KV-Store: a crucial building block for many systems, serving as a data cache (e.g., Memcached and Redis at Facebook and Twitter) and as an in-memory database. Availability is important for in-memory KV-stores: Facebook reports that it takes 2.5-3 hours to recover 120 GB of in-memory database data from disk into memory. Data redundancy in distributed memory is therefore essential for fast failover.

Two redundancy schemes. Replication is the classical way to provide data availability (e.g., Repcached, Redis), but it incurs both high bandwidth cost and high memory cost. (Diagram: a client write request is applied on the data node and propagated as updates to each backup node.)

Two redundancy schemes. Erasure coding is a space-efficient redundancy scheme, and the increase of CPU speed enables fast data recovery: encoding/decoding rates can reach 40 Gb/s on a single core [1]. It has low memory cost, but updates still incur high bandwidth cost. (Diagram: a client write request updates a data node, which must in turn update every parity node.) [1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16
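
A minimal sketch of the memory-for-bandwidth trade-off behind erasure coding, using a single XOR parity block rather than the Reed-Solomon codes (e.g., RS(3,2)) that Cocytus and BCStore actually use; helper names are illustrative:

```python
# Minimal sketch of erasure coding's space savings, assuming a single XOR
# parity block instead of the Reed-Solomon codes used in practice.
# k data blocks plus one parity block tolerate one lost block, versus
# k extra full copies under replication.

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode(data_blocks):
    """Return the parity block for a stripe of k data blocks."""
    return xor_blocks(data_blocks)

def recover(surviving_blocks):
    """Reconstruct the single missing block from the surviving blocks
    (any k of the k data blocks plus the parity block)."""
    return xor_blocks(surviving_blocks)

if __name__ == "__main__":
    stripe = [b"obj1", b"obj2", b"obj3"]          # k = 3 data blocks
    parity = encode(stripe)
    # Lose data block 1; recover it from the rest of the stripe.
    assert recover([stripe[0], stripe[2], parity]) == stripe[1]
```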

In-place Update: a traditional mechanism for encoding small objects. To update an object (e.g., obj4 -> obj4'), the data node computes Delta(obj4, obj4') and sends it to every parity node so that each parity block can be patched in place. With two parity nodes, the bandwidth cost is the same as 3-replication. Our goal: both memory efficiency and bandwidth efficiency. (Diagram: objects obj1-obj9 spread over three data nodes with parity blocks on two parity nodes; updates to obj4, obj3, and obj8 each propagate a delta to both parity nodes.)
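
A minimal sketch of delta-based in-place update, again with XOR parity rather than Reed-Solomon (a real RS code would scale the delta by each parity's coding coefficient); it shows why one update costs one data-node write plus m parity-node writes, matching 3-replication when m = 2:

```python
# Sketch of in-place update with delta patching under XOR parity. Updating
# one object sends the new value to its data node and a delta to each of the
# m parity nodes, so with m = 2 the bandwidth matches 3-replication.

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def in_place_update(data_nodes, parity_nodes, node_id, slot, new_value):
    old_value = data_nodes[node_id][slot]
    delta = xor(old_value, new_value)
    data_nodes[node_id][slot] = new_value          # 1 transfer to the data node
    for parity in parity_nodes:                    # m transfers, one per parity node
        parity[slot] = xor(parity[slot], delta)

if __name__ == "__main__":
    data_nodes = [[b"obj1"], [b"obj2"], [b"obj3"]]
    parity = xor(xor(b"obj1", b"obj2"), b"obj3")
    # Two identical XOR parities, purely for illustration (real RS parities differ).
    parity_nodes = [[parity], [parity]]
    in_place_update(data_nodes, parity_nodes, node_id=0, slot=0, new_value=b"objA")
    # Parity stays consistent with the updated stripe.
    assert parity_nodes[0][0] == xor(xor(b"objA", b"obj2"), b"obj3")
```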

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

Our Design: aggregate write requests and encode the objects into a new coding stripe (batch coding). Updated values (e.g., obj4', obj8', obj3') are appended as a fresh stripe with its own parity blocks, and the superseded blocks are simply marked invalid. (Diagram: three data nodes, a batch node, and two parity nodes; the batched stripe is appended after the original stripes.)
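
A minimal sketch of this batch-coding write path, assuming XOR parity and in-memory lists standing in for the data and parity nodes; the class and field names are illustrative, not BCStore's actual code:

```python
# Sketch of the batch-coding write path: new values are appended as a fresh
# stripe; superseded blocks are only marked invalid and reclaimed later by GC.
K = 3  # data blocks per stripe

class BatchCoder:
    def __init__(self):
        self.pending = []                  # (key, value) waiting to be coded
        self.stripes = []                  # appended stripes: (values, parity)
        self.location = {}                 # key -> (stripe_id, offset)
        self.invalid = set()               # superseded (stripe_id, offset) blocks

    def put(self, key, value):
        self.pending.append((key, value))
        if len(self.pending) == K:
            self._flush()

    def _flush(self):
        keys, values = zip(*self.pending)
        size = max(len(v) for v in values)
        padded = [v.ljust(size, b"\0") for v in values]
        parity = bytes(a ^ b ^ c for a, b, c in zip(*padded))   # XOR parity over K = 3 blocks
        stripe_id = len(self.stripes)
        self.stripes.append((list(values), parity))
        for offset, key in enumerate(keys):
            if key in self.location:       # old version becomes garbage
                self.invalid.add(self.location[key])
            self.location[key] = (stripe_id, offset)
        self.pending.clear()

if __name__ == "__main__":
    bc = BatchCoder()
    for key, value in [("k1", b"v1"), ("k2", b"v2"), ("k3", b"v3"),
                       ("k1", b"v1-new"), ("k4", b"v4"), ("k5", b"v5")]:
        bc.put(key, value)
    print(bc.location["k1"], bc.invalid)   # k1 now lives in stripe 1; (0, 0) is garbage
```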

Latency Analysis: batch coding induces extra request waiting time. We formalize the waiting time as W = f(t, k), where t is the request throughput and k is the number of data nodes, and require W to stay within a latency bound ε. (Plot: waiting time versus request throughput for k = 3, with the latency bound ε marked.)
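
A sketch of one plausible way to enforce such a bound: close the current batch once k objects have arrived or the oldest pending request has waited ε seconds (a timed-out partial batch would need padding before encoding). This mechanism is an assumption for illustration; the paper derives the exact form of W = f(t, k).

```python
# Latency-bounded batching policy sketch: flush when the batch is full or the
# oldest pending request has waited epsilon seconds, so the extra waiting time
# introduced by batch coding stays below the latency bound.
import time

class BoundedBatcher:
    def __init__(self, k, epsilon, flush):
        self.k = k                  # objects per coding stripe
        self.epsilon = epsilon      # latency bound in seconds
        self.flush = flush          # callback that encodes and appends a stripe
        self.pending = []
        self.oldest = None

    def put(self, key, value):
        if not self.pending:
            self.oldest = time.monotonic()
        self.pending.append((key, value))
        self._maybe_flush()

    def tick(self):
        """Called periodically by a timer thread."""
        self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.pending) >= self.k
        expired = bool(self.pending) and time.monotonic() - self.oldest >= self.epsilon
        if full or expired:
            self.flush(self.pending)   # a partial batch may need padding before encoding
            self.pending = []
```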

Garbage Collection: recycle updated or deleted blocks and release the extra parity blocks. A naive move-based garbage collection moves the remaining valid blocks out of the original stripes into batched stripes, which incurs much bandwidth cost for updating parity blocks. (Diagram: valid blocks are moved from original stripes into batched stripes on the data nodes, and the parity nodes must be updated for every GC'd stripe.)

Garbage Collection: how do we reduce the GC bandwidth cost? Intuition: GC the stripes with the most invalid blocks first (greedy block moving). (Diagram: with greedy selection, two block moves are enough to release two coding stripes.)
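
A minimal sketch of that greedy victim selection; the data layout (a map from stripe id to its invalid block offsets) is an illustrative stand-in:

```python
# Greedy stripe selection for garbage collection: reclaim the stripes with
# the most invalid blocks first, so the fewest valid blocks have to be moved
# (and re-encoded) per stripe released. `stripes` maps a stripe id to its set
# of invalid block offsets; `k` is the number of blocks per stripe.

def pick_gc_victims(stripes, k, stripes_to_release):
    # Sort stripes by number of invalid blocks, most-invalid first.
    order = sorted(stripes, key=lambda sid: len(stripes[sid]), reverse=True)
    victims = order[:stripes_to_release]
    moves = sum(k - len(stripes[sid]) for sid in victims)  # valid blocks to move
    return victims, moves

if __name__ == "__main__":
    # Stripe 2 has two invalid blocks, stripe 0 has one, stripe 1 has none.
    stripes = {0: {1}, 1: set(), 2: {0, 2}}
    victims, moves = pick_gc_victims(stripes, k=3, stripes_to_release=2)
    print(victims, moves)   # [2, 0] and 1 + 2 = 3 block moves
```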

Garbage Collection: how do we further reduce block moves? Intuition: make updates concentrate on a few stripes, using popularity-based data arrangement. (Diagram: hot and cold objects are separated into different stripes; now only one block move is needed to release two coding stripes.)

Bandwidth Analysis. Theorem: GC bandwidth + coding bandwidth <= in-place update bandwidth. The detailed proof can be found in our paper.

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

System Architecture. Clients send requests to a batch process, which handles preprocessing, batch coding, garbage collection, and metadata management; the coded blocks are stored in a storage group consisting of data processes and parity processes. (Diagram: multiple clients -> batch process -> data processes and parity processes in a storage group.)

Handle Write Requests. Clients issue set(k1, v1), set(k2, v2), and set(k3, v3). The batch process collects the values, batch-codes them into a stripe (v1, v2, v3 plus parities P1, P2) with stripe id b1, updates the hash table and stripe index, and distributes the blocks: the values go to data processes 1-3 and the parities to parity processes 1-2.
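
A minimal sketch of that distribution step, assuming the stripe's blocks have already been produced (as in the batch-coding sketch above); all structure names here are illustrative stand-ins:

```python
# Distributing one freshly coded stripe: each block goes to one data or
# parity process, and the hash table is updated to map every key in the
# batch to the new stripe id.

def dispatch_stripe(stripe_id, keys, values, parities,
                    data_processes, parity_processes, hash_table):
    for process, value in zip(data_processes, values):
        process[stripe_id] = value          # i-th value -> i-th data process
    for process, parity in zip(parity_processes, parities):
        process[stripe_id] = parity         # P1, P2 -> parity processes
    for key in keys:
        hash_table[key] = stripe_id         # k1, k2, k3 -> b1

if __name__ == "__main__":
    data_processes = [{}, {}, {}]
    parity_processes = [{}, {}]
    hash_table = {}
    dispatch_stripe("b1", ["k1", "k2", "k3"], [b"v1", b"v2", b"v3"],
                    [b"P1", b"P2"], data_processes, parity_processes, hash_table)
    assert hash_table["k1"] == "b1" and data_processes[0]["b1"] == b"v1"
```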

Handle Read Requests. A client issues get(k1). The batch process looks up the hash table, which maps each key to its stripe id (k1, k2, k3 all map to stripe b1), consults the stripe index, and issues get(b1) to the data process holding the value (here, v1 on data process 2).
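
A minimal sketch of this read path; the hash table and a hypothetical stripe index map a key to its stripe and to the data process that stores its block:

```python
# Read path under batch coding: a get() is a metadata lookup in the batch
# process followed by one fetch from a data process. The stripe_index and
# data_processes structures below are illustrative, not BCStore's actual ones.

def get(key, hash_table, stripe_index, data_processes):
    stripe_id = hash_table[key]                        # e.g. k1 -> b1
    process_id, offset = stripe_index[(stripe_id, key)]
    return data_processes[process_id][stripe_id][offset]

if __name__ == "__main__":
    hash_table = {"k1": "b1", "k2": "b1", "k3": "b1"}
    stripe_index = {("b1", "k1"): (2, 0), ("b1", "k2"): (1, 0), ("b1", "k3"): (3, 0)}
    data_processes = {1: {"b1": [b"v2"]}, 2: {"b1": [b"v1"]}, 3: {"b1": [b"v3"]}}
    assert get("k1", hash_table, stripe_index, data_processes) == b"v1"
```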

Recovery: recover the requested data first. For a client get(k1) whose block is lost: 1. fetch the stripe's blocks from any k surviving storage processes according to the stripe id; 2. decode to recover the lost block (e.g., reconstruct v1 from the surviving values and parities). (Diagram: the batch process's decoder reads the surviving blocks of the stripe and returns the reconstructed value.)
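
A minimal sketch of such a degraded read, reusing the single-XOR-parity layout from the earlier sketches (a real RS(k, m) decode would take any k of the k + m blocks):

```python
# Degraded read sketch: if the data process holding the requested value is
# down, fetch the stripe's surviving blocks and decode the missing one.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def degraded_read(stripe_blocks):
    """stripe_blocks: the stripe's data blocks plus parity, None at the lost slot."""
    survivors = [b for b in stripe_blocks if b is not None]
    return xor_blocks(survivors)            # reconstructs the single missing block

if __name__ == "__main__":
    v1, v2, v3 = b"v1", b"v2", b"v3"
    parity = xor_blocks([v1, v2, v3])
    # The data process holding v1 has failed.
    assert degraded_read([None, v2, v3, parity]) == v1
```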

Outline: Introduction and Motivation; Our Design; System and Implementation; Evaluation

Evaluation. Cluster configuration: 10 machines running SUSE Linux 11, with 12 AMD Opteron Processor 4180 CPUs and 1 Gb/s Ethernet. Targets of comparison: in-place update EC (Cocytus [1]) and replication (Rep). Workload: YCSB with different key distributions and a 50%:50% read/write ratio. [1] Efficient and Available In-memory KV-Store with Hybrid Erasure Coding and Replication, FAST '16

Bandwidth Cost: BCStore saves up to 51% of the bandwidth cost. (Figure: bandwidth cost for different coding schemes.)

Throughput: up to 2.4x improvement. (Figure: throughput for different coding schemes.)

Memory: BCStore saves up to 41% of the memory cost. (Figure: memory consumption for different redundancy schemes.)

Latency. (Figures: read latency and write latency.)

Conclusion. Efficiency and availability are two crucial features for in-memory KV-stores. We build BCStore, an in-memory KV-store that applies erasure coding for data availability. We design a batch coding mechanism to achieve high bandwidth efficiency for write workloads, and we propose a heuristic garbage collection algorithm to improve memory efficiency.

Thanks! Q&A

Severity of Bandwidth Cost. Write requests are prevalent in large-scale web services, and peak load can easily exhaust network bandwidth and degrade service performance. The monetary cost of bandwidth becomes several times higher, especially under the commonly used peak-load pricing model. Bandwidth amplification becomes more serious as m (the number of parity servers) increases, and the bandwidth budget is usually limited in a workload-sharing cluster. Our goal: high memory efficiency and high bandwidth efficiency.

Our Design: batch write requests and append a new coding stripe (batch coding). (Diagram: updated objects obj4', obj8', obj3' are encoded together and appended as a new stripe with new parity blocks on the two parity nodes.)

Challenges: (1) recycle the memory space of data blocks that are deleted or updated; since data blocks and parity blocks are appended to storage, updated blocks cannot be deleted directly. (2) Encode variable-sized data efficiently; variable-sized data cannot be appended directly into the previously allocated storage space.

Garbage Collection: popularity-based data arrangement. Batched objects are sorted by popularity before encoding, so hot objects are grouped into the same coding stripes and cold objects into others. (Diagram: batched objects sorted from hot to cold across data nodes 1-3, with parities on parity nodes 1-2.)
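
A minimal sketch of this arrangement step; the popularity metric here is a simple access counter, which is an assumption for illustration (the paper's exact metric may differ):

```python
# Popularity-based data arrangement: sort batched objects by an access-
# frequency estimate before encoding, so hot objects land in the same stripes.
# Future updates then invalidate whole "hot" stripes, which the greedy GC can
# release with few block moves.

def arrange_by_popularity(batched, popularity, k):
    """Group batched (key, value) pairs into stripes of k objects, hottest first."""
    ordered = sorted(batched, key=lambda kv: popularity.get(kv[0], 0), reverse=True)
    return [ordered[i:i + k] for i in range(0, len(ordered), k)]

if __name__ == "__main__":
    batched = [("a", b"1"), ("b", b"2"), ("c", b"3"),
               ("d", b"4"), ("e", b"5"), ("f", b"6")]
    popularity = {"a": 90, "d": 80, "f": 75, "b": 3, "c": 2, "e": 1}
    for stripe in arrange_by_popularity(batched, popularity, k=3):
        print([key for key, _ in stripe])   # ['a', 'd', 'f'] then ['b', 'c', 'e']
```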

Encoding Variable-sized Data: virtual coding stripes (vcs). Each virtual coding stripe has a large fixed-length space and is aligned in the virtual address space, while the physical space stores only the actual data. (Diagram: virtual stripes vcs1-vcs3 aligned across data nodes 1-3 and parity nodes 1-2, next to the packed physical space on data node 1.)
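
A minimal sketch of the idea under a single XOR parity: parity is computed over fixed-length virtual slots (zero-padded), while each node keeps only the bytes actually written. The slot size and helper names are illustrative assumptions:

```python
# Virtual coding stripes for variable-sized data: each stripe reserves a
# fixed-length slot per node in the *virtual* address space, so slots stay
# aligned and parity is computed over equal-length buffers, while each node
# physically stores only the bytes actually written.
VCS_SLOT = 16          # fixed-length virtual slot per node (bytes)

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode_virtual_stripe(values):
    """Pad each variable-sized value to the virtual slot length, compute
    parity over the padded slots, but record only the raw values physically."""
    padded = [v.ljust(VCS_SLOT, b"\0") for v in values]
    parity = xor_blocks(padded)
    physical = [(v, len(v)) for v in values]     # what the data nodes keep
    return physical, parity

if __name__ == "__main__":
    physical, parity = encode_virtual_stripe([b"short", b"a bit longer", b"x"])
    print([size for _, size in physical], len(parity))   # [5, 12, 1] 16
```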

Bandwidth Cost. (Figure: bandwidth cost for a moderately skewed Zipfian workload with RS(3,2).)

Throughput. (Figure: throughput for a moderately skewed Zipfian workload.)

Throughput. (Figure: throughput during recovery.)

In-place Update: a traditional mechanism for coding small objects. (Diagram: objects obj1-obj9 laid out across three data nodes, with parity blocks on two parity nodes.)

Garbage Collection: how do we further reduce block moves? Intuition: make updates concentrate on a few stripes, using popularity-based data arrangement. (Diagram: hot/cold separation across original and batched stripes during GC.)

Bandwidth Analysis. Theorem: GC bandwidth + coding bandwidth <= in-place update bandwidth. (Diagram: the worst case of GC bandwidth, where every original stripe has to be garbage-collected into batched stripes.)

Bandwidth Cost. (Figure: bandwidth cost at different throughput levels with RS(5,4).)

Recovery of the batch process: its metadata M is replicated to a standby batch process. 1. The standby gets the latest batch id from the storage processes; 2. it updates the latest stable batch id and reconstructs the metadata; 3. it serves client requests. (Diagram: clients, the failed and standby batch processes, data processes 1-3, and parity processes 1-2.)