RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store

RAMCube: Exploiting Network Proximity for RAM-Based Key-Value Store
Yiming Zhang, Rui Chu @ NUDT
Chuanxiong Guo, Guohan Lu, Yongqiang Xiong, Haitao Wu @ MSRA
June 2012

Background
- Disk-based storage is problematic: I/O latency and bandwidth
- Current typical storage systems use RAM as a cache: app servers + storage servers + cache servers
- Facebook keeps more than 75% of its data in memcached

Why a Cache is NOT Preferred
- Consistency: the application is responsible for consistency (e.g., memcached), which is error-prone (e.g., the failure of Facebook's Automated Verifying System)
- Efficiency: RAM is 1000X faster than disk

Using RAM as a Persistent Store
- RAMCloud: a RAM-based key-value store; data is kept entirely in the RAM of servers
- Primary-backup for durability: multiple copies of each key-value pair (primary node + backup node + recovery node)
- Write/read/recover procedure
[figure: write/read/recovery paths across the data center network]

RAMCloud (cont.)
- Fast failure recovery; availability = MTTF / (MTTF + MTTR)
- Not specifically designed for the DCN:
  - Temporary network problems cause false failure detection
  - Parallel, unarranged recovery flows
  - ToR switch failures
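
To make the availability point concrete, here is the identity from the slide with illustrative numbers plugged in (the 30-day MTTF is an assumption for illustration, not a figure from the talk):

\[ A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} \]

With an assumed MTTF of 30 days (about 2.6 * 10^6 s), MTTR = 1 hour gives A ≈ 0.9986, while MTTR = 1.5 s gives A ≈ 0.999999, which is why 1~2 s recovery matters so much for availability.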

RAMCube: Exploiting DCN Proximity
Goals of RAM-based storage:
- High performance: low latency, 10+ μs read/write latency
- Large scale: hundreds of TB of data
- High throughput: millions of read ops per server per sec; hundreds of thousands of write ops per server per sec
- Fast failure recovery for high availability: 1~2 sec

RAMCube Design Choices
- Network hardware: InfiniBand is high-bandwidth and low-latency, but expensive and not common in DCNs. We use commodity Ethernet-based BCube.
- Primary-backup vs. symmetric replication: symmetric replication easily achieves high availability, but uses more RAM. We use primary-recovery-backup, like RAMCloud.

Basic Idea of RAMCube

Using BCube

Mapping Key Space onto BCube
- Multi-layer logical rings
- Primary ring:
  - All servers in BCube are primary servers
  - The whole key space is mapped onto the primary ring; each primary server owns a subset of the key space
[figure: primary ring over the BCube(4,1) servers 00-33]
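
The talk does not spell out how a key is assigned to its primary server; the sketch below is only an illustration, assuming a hashed key space split into equal ranges across the primary ring (both the hash and the range split are my assumptions, not the stated scheme):

#include <cstdint>
#include <functional>
#include <string>

// Map a key to the index of its primary server on a ring of num_primaries
// servers by splitting a 64-bit hashed key space into equal-width subsets.
uint64_t primary_for(const std::string& key, uint64_t num_primaries) {
    uint64_t h = std::hash<std::string>{}(key);       // position in key space
    uint64_t range = UINT64_MAX / num_primaries + 1;  // width of each subset
    return h / range;                                 // index on the primary ring
}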

Multiple Rings in RAMCube (cont.)
- Each primary server P has a recovery ring, composed of P's 1-hop neighbors; P's key space is mapped onto its recovery servers
- Each recovery server R has a backup ring, composed of servers 1-hop from R and 2-hop from P; R's key space is mapped onto its backup servers
[figure: recovery and backup rings of one primary server in BCube(4,1)]

RAMCube Property
MultiRing of RAMCube for BCube(n, k):
- # of primary servers: P = n^(k+1)
- # of recovery servers for each primary: R = (n-1)(k+1)
- # of backup servers for each primary: B = (n-1)^2 * k(k+1)/2
- Each primary node has plenty of recovery and backup servers:
  - BCube(16,2): P = 4096, R = 45, B = 675
  - BCube(4,1): P = 16, R = 6, B = 9
[figure: BCube(4,1) topology with server IDs]
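
These counts follow from BCube's addressing: servers are (k+1)-digit base-n IDs, 1-hop neighbors differ in exactly one digit, and 2-hop servers differ in exactly two. The small counting sketch below (mine, not from the talk) reproduces the slide's numbers:

#include <cstdio>

// For one primary server P in BCube(n,k), count servers at digit distance 1
// (its recovery ring) and distance 2 (its backup ring), and compare with
// P = n^(k+1), R = (n-1)(k+1), B = (n-1)^2 * k(k+1)/2.
static int digit_distance(long long a, long long b, int n, int k) {
    int d = 0;
    for (int i = 0; i <= k; ++i) {
        if (a % n != b % n) ++d;
        a /= n;
        b /= n;
    }
    return d;
}

static void count_rings(int n, int k) {
    long long total = 1;
    for (int i = 0; i <= k; ++i) total *= n;   // P = n^(k+1) servers
    const long long primary = 0;               // pick server 00...0 as P
    long long R = 0, B = 0;
    for (long long s = 0; s < total; ++s) {
        int d = digit_distance(primary, s, n, k);
        if (d == 1) ++R;                       // 1-hop neighbor: recovery ring
        if (d == 2) ++B;                       // 2-hop server: backup ring
    }
    std::printf("BCube(%d,%d): P=%lld R=%lld B=%lld\n", n, k, total, R, B);
}

int main() {
    count_rings(4, 1);    // expect P=16,   R=6,  B=9
    count_rings(16, 2);   // expect P=4096, R=45, B=675
    return 0;
}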

Failure Detection
- Heartbeats: each primary node periodically sends heartbeats to each of its recovery nodes
- Confirmation of failure: a recovery node uses source routing to issue additional pings to a suspicious primary node
[figure: heartbeat and ping paths in BCube(4,1)]
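
A hedged sketch of this detection logic as a recovery node might run it; the timeout values and the now_ms()/send_ping_via() helpers are assumptions for illustration, not RAMCube's real API:

#include <chrono>
#include <cstdint>
#include <vector>

// A source route to the primary through a particular set of BCube switches.
struct Path { std::vector<int> hops; };

// Stub helpers (assumed for illustration only).
static int64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
}
static bool send_ping_via(const Path&, int /*timeout_ms*/) {
    return false;  // stub: a real system would send a source-routed ping
}

// State a recovery node keeps about the primary it watches.
struct PrimaryWatcher {
    int64_t last_heartbeat_ms = 0;
    int64_t heartbeat_timeout_ms = 300;   // assumed value
    std::vector<Path> alternate_paths;    // other source routes to the primary

    void on_heartbeat() { last_heartbeat_ms = now_ms(); }

    // Declare the primary dead only when heartbeats stop AND pings over
    // alternate source-routed paths also fail, so a temporary problem on one
    // path or switch does not trigger a false recovery.
    bool primary_declared_dead() const {
        if (now_ms() - last_heartbeat_ms < heartbeat_timeout_ms) return false;
        for (const Path& p : alternate_paths)
            if (send_ping_via(p, /*timeout_ms=*/100)) return false;
        return true;
    }
};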

Failure Recovery
- Primary node failure:
  - Recovery nodes fetch dominant copies into their RAM from directly connected backup nodes
  - Given 10Gbps network bandwidth and 100MB/s disk bandwidth, BCube(16,2) can easily recover 64GB of data in 1~2 sec
- Recovery/backup node failure: not as critical as primary node failure
[figure: recovery traffic between backup and recovery nodes in BCube(4,1)]
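
A back-of-the-envelope check of the 1~2 s claim, assuming recovery parallelizes across the failed primary's 45 recovery servers and 675 backup servers in BCube(16,2), and taking 10 Gbps ≈ 1.25 GB/s:

\[ t \approx \frac{64\ \mathrm{GB}}{\min(675 \times 0.1\ \mathrm{GB/s},\ 45 \times 1.25\ \mathrm{GB/s})} = \frac{64\ \mathrm{GB}}{56.25\ \mathrm{GB/s}} \approx 1.1\ \mathrm{s} \]

so under these assumptions the network links into the recovery nodes, not the backup disks, are the bottleneck.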

Prototype Implementation
- Platform & language: Windows Server 2008 R2, C++, libevent
- Code size: 3000+ lines
- RAMCube components:
  - Connection manager: maintains the status of directly connected neighbors and handles network events
  - Memory manager: uses a slab-based mechanism and handles set/get/delete requests
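
The talk only says the memory manager is slab-based; the block below is a generic sketch of that idea (size classes with per-class free lists), with all names and sizes my own, not RAMCube's actual allocator:

#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Each set() takes a slot from the smallest class that fits the item, so
// small items (e.g., a 15 B key plus a 100 B value) are packed into
// fixed-size slots without per-item heap fragmentation.
struct SlabClass {
    size_t item_size;                  // fixed slot size served by this class
    std::vector<uint8_t*> free_list;   // recycled slots awaiting reuse
};

class SlabAllocator {
public:
    explicit SlabAllocator(std::initializer_list<size_t> sizes) {
        for (size_t s : sizes) classes_.push_back(SlabClass{s, {}});
    }
    uint8_t* alloc(size_t need) {
        for (auto& c : classes_) {
            if (c.item_size < need) continue;       // too small, try next class
            if (!c.free_list.empty()) {             // reuse a freed slot
                uint8_t* p = c.free_list.back();
                c.free_list.pop_back();
                return p;
            }
            return new uint8_t[c.item_size];        // grow the class
        }
        return nullptr;                             // larger than any class
    }
    void release(uint8_t* p, size_t need) {
        for (auto& c : classes_) {
            if (c.item_size >= need) { c.free_list.push_back(p); return; }
        }
        delete[] p;                                 // not ours to recycle
    }
private:
    std::vector<SlabClass> classes_;
};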

Evaluation
- Configuration:
  - BCube(4,1) with 16 servers
  - Eight 2.4GHz cores & 16GB RAM per node
  - Single-threaded server and client, both with a busy loop
  - 15B key + 100B value
- Experiments: throughput, server failure recovery, switch failures

Experimental Results - 1

Experimental Results - 2
- Total disk I/O (backup): 9 × 100MB/s = 900MB/s
- Total net I/O (recovery): 6 × 120MB/s = 720MB/s
- Ideal recovery throughput: 720MB/s; achieved: 456MB/s
[figure: recovery throughput]

Experimental Results - 3

Future Work
- SSD & low-latency Ethernet
- Rich data model: table/tablet support, column families
- Utilizing multi-core to improve read/write throughput

Summary
- RAMCube: RAM-based persistent key-value store exploiting the network proximity of BCube
- Failure detection is 1-hop: recovery nodes directly watch their primary nodes
- Recovery traffic is 1-hop: no unexpected traffic overlap
- Switch failures: multi-path routing gives graceful performance degradation
[figure: BCube(4,1) topology]

Thank You!