Atlas: Baidu's Key-value Storage System for Cloud Data

Atlas: Baidu's Key-value Storage System for Cloud Data. Song Jiang, Chunbo Lai, Shiding Lin, Liqiong Yang, Guangyu Sun, Jason Cong, Zhenyu Hou, Can Cui. Wayne State University, Peking University, University of California Los Angeles, Baidu Inc.

Cloud Storage Service. Cloud storage services are becoming increasingly popular: Baidu Cloud has over 200 million users and 200 PB of user data. To be attractive and competitive, providers often offer large free space and price the service modestly; Baidu offers 2 TB of free space to each user. The challenge is how to provision resources economically while still achieving good service quality: a large number of servers, each with large local storage space; data that must be stored reliably with high availability; and requests for any data in the system that should be served reasonably fast.

The Workload and Challenges on Baidu's System. The request size is capped at 256 KB for system efficiency, and the majority of requests are for data between 128 KB and 256 KB (based on the distribution of requests on a typical day in 2014). The challenges: Can the x86 processors be used efficiently? Can a file system be used to store the data at each server? Can an LSM-tree-based key-value store be used to store the data?

Challenge on Processor Efficiency. The x86 processors (two 4-core 2.4 GHz E5620s) were consistently under-utilized, with less than a 20% utilization rate even with nine hard disks installed on a server, and adding more disks is not an ultimate solution. An ARM processor (one 4-core 1.6 GHz Cortex-A9) can provide similar I/O performance while being more than 10X cheaper and more energy-efficient. Baidu's customized ARM-based server: each 2U chassis has six 4-core Cortex-A9 processors, and each processor comes with four 3 TB SATA disks. However, each processor can support only 4 GB of memory, so each chassis has only 24 GB of memory available for accessing as much as 72 TB of data.

Challenge on Using a File System. Memory cannot hold all of the metadata: most files would be 128-256 KB, accesses to the storage have little locality, and more than one disk access is often required to access a file. The approach used in Facebook's Haystack is not sufficient either: there are 3.3 GB of metadata for 16 TB of 128 KB data, while system software and the buffer cache also compete for the 4 GB of memory.

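As a back-of-envelope check of the numbers above (a sketch: the per-object figure below is derived purely from the slide's 3.3 GB and 16 TB values, not from Haystack's published index format):

    # Rough check of the in-memory metadata footprint implied above.
    TB = 2**40
    GB = 2**30
    KB = 2**10

    data_size   = 16 * TB                      # data to index per processor
    object_size = 128 * KB                     # typical object size in this workload
    num_objects = data_size // object_size     # 134,217,728 objects (~134 million)

    metadata_total = 3.3 * GB                  # figure quoted on the slide
    bytes_per_obj  = metadata_total / num_objects

    print(num_objects)                         # 134217728
    print(round(bytes_per_obj, 1))             # 26.4 bytes of index per object

With roughly 134 million objects per processor, even a compact in-memory index of a few tens of bytes per object leaves little of the 4 GB for system software and the buffer cache.
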
Challenge on Using an LSM-tree-based Key-value Store. An LSM-tree-based KV store, represented by Google's LevelDB, is designed for storing many small key-value items. The store is memory-efficient: the metadata is only about 320 MB for 16 TB of 128 KB data. However, in exchange for such a small metadata footprint, the store needs constant compaction operations to sort the data distributed across its levels. For a store of 7 levels, the write amplification can be over 70, leaving very limited I/O bandwidth for serving front-end user requests.

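The "over 70" write amplification follows from a standard back-of-envelope estimate for leveled LSM-trees (the 10x size ratio between adjacent levels is an assumed LevelDB-style default, not a figure measured on Atlas):

    # Back-of-envelope write amplification for a leveled LSM-tree.
    # Pushing data from one level into the next rewrites roughly `size_ratio`
    # bytes of the lower level per byte moved, and an item passes through
    # every level on its way down.
    size_ratio = 10     # assumed growth factor between adjacent levels
    levels     = 7      # level count quoted on the slide

    write_amplification = size_ratio * levels
    print(write_amplification)   # 70: each user byte can cause ~70 bytes of device writes
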
Reducing Compaction Cost. In a KV item, the value is usually much larger than the key, and values do not need to be involved in compactions. Atlas therefore moves the values into fixed-size containers, 64 MB patches, and replaces each value in its KV item with a pointer, so that only keys and pointers pass through compaction.

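A minimal sketch of this key/value separation, assuming a simple in-memory model: the PatchStore class and the plain dict index are illustrative stand-ins for Atlas's PIS slice and its LSM-tree of (key, pointer) items, not Atlas's actual interfaces.

    import io

    PATCH_SIZE = 64 * 2**20   # 64 MB patch/block size used by Atlas

    class PatchStore:
        """Keep values in 64 MB patches; keep only small (key -> pointer) items."""

        def __init__(self):
            self.patch = io.BytesIO()    # the currently open patch
            self.index = {}              # stand-in for the LSM-tree of key -> pointer
            self.sealed = []             # full patches, sealed into immutable blocks

        def put(self, key: bytes, value: bytes) -> None:
            offset = self.patch.tell()
            if offset + len(value) > PATCH_SIZE:
                self._seal()             # patch is full: seal it into a block
                offset = 0
            self.patch.write(value)
            block_id = len(self.sealed)  # id of the patch/block the value lives in
            self.index[key] = (block_id, offset, len(value))   # pointer, not value

        def get(self, key: bytes) -> bytes:
            block_id, offset, size = self.index[key]
            if block_id == len(self.sealed):       # value is still in the open patch
                data = self.patch.getvalue()
            else:                                  # value is in a sealed block
                data = self.sealed[block_id]
            return data[offset:offset + size]

        def _seal(self) -> None:
            # In Atlas the sealed 64 MB block would be RS-coded and handed to RBS.
            self.sealed.append(self.patch.getvalue())
            self.patch = io.BytesIO()

Only the small (key, block ID, offset, length) items ever enter LSM-tree compaction; each 64 MB patch is written sequentially once, sealed, and never rewritten by compaction.
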
Features of Baidu's Cloud Storage System (Atlas). A hardware and software co-design with customized low-power servers for high resource utilization. Separate management systems for metadata (keys and offsets) and data (value blocks). Data are efficiently protected by erasure coding. (Figure: the storage of metadata (keys) and the storage of data (values in 64 MB blocks).)

Big Picture of the Atlas System. (Figure: keys and values enter the PIS (Patch and Index System), where values are accumulated in 64 MB patches; sealed patches become 64 MB blocks stored in the RBS (RAID-like Block System).)

Distribution of User Requests. (Figure: Atlas clients (applications) use key hashing to direct requests to PIS slices; each PIS slice manages keys, values, and a 64 MB patch.)

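A sketch of the key-hashing step shown here; the hash function and the slice count are illustrative assumptions rather than Atlas's actual placement scheme.

    import hashlib

    NUM_PIS_SLICES = 1024    # assumed number of PIS slices in the cluster

    def pick_pis_slice(key: bytes) -> int:
        """Hash a key to the PIS slice that will handle its requests."""
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:8], "big") % NUM_PIS_SLICES

    print(pick_pis_slice(b"user123/photos/img_0001.jpg"))
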
RBS (RAID-like Block System): Redundancy for Protecting KV Items. (Figure: a PIS slice consists of three PIS slice units, each with keys, values, and a 64 MB patch; a sealed 64 MB block is partitioned into eight 8 MB parts plus four RS-coded parts.)

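A sketch of how a value's (offset, length) within a 64 MB block maps onto the eight 8 MB data parts. It assumes, for illustration, that a value never straddles a part boundary; the four RS-coded parity parts are used only for recovery and are not touched on the normal read path.

    MB = 2**20
    PART_SIZE    = 8 * MB    # each 64 MB block is split into eight 8 MB data parts
    DATA_PARTS   = 8
    PARITY_PARTS = 4         # four additional RS-coded parts for redundancy

    def locate(offset: int, size: int):
        """Map an offset inside a 64 MB block to (data part index, offset in part)."""
        part_index = offset // PART_SIZE
        part_offset = offset % PART_SIZE
        assert part_index < DATA_PARTS, "offset beyond the 64 MB block"
        assert part_offset + size <= PART_SIZE, "value assumed not to straddle parts"
        return part_index, part_offset

    print(locate(offset=17 * MB, size=256 * 2**10))   # -> (2, 1048576)
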
The Architecture of Atlas. (Figure: applications issue requests to PIS slices, which use an LSM-tree KV store to map keys to logical parts/blocks; the RBS master, backed by a shadow RBS master, maps logical parts/blocks to physical RBS partservers.)

Serving a Write Request. (1) The application sends the request to a PIS slice. (2) The slice writes the KV item into its patch and acknowledges the client. (3) If the patch is full, it is converted into a block and partitioned and encoded into 8+4 parts. (4) The slice records (key, block ID, offset) into its index and obtains 12 + 3 partserver IPs from the RBS master. (5) The parts are written to the partservers. (6) The RBS master records (block ID, list of partserver IPs).

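The steps above can be condensed into a sketch that uses plain dicts and lists as stand-ins for the PIS slice, the RBS master, and the partservers (all field names are hypothetical, acknowledgements are only noted in comments, and the Reed-Solomon parity computation is omitted):

    PATCH_CAPACITY = 64 * 2**20    # a patch is sealed into a block at 64 MB
    PART_SIZE = 8 * 2**20          # eight 8 MB data parts per block

    def handle_put(key, value, pis_slice, rbs_master, partservers):
        """Write path sketch.
           pis_slice   = {"patch": bytearray(), "block_id": 0, "index": {}}
           rbs_master  = {"blocks": {}}                  # block_id -> partserver ids
           partservers = {sid: {} for sid in range(12)}  # (block_id, part#) -> bytes
        """
        # (2) Append the value to the open patch and record the pointer
        #     (the client would be acknowledged at this point).
        offset = len(pis_slice["patch"])
        pis_slice["patch"] += value
        pis_slice["index"][key] = (pis_slice["block_id"], offset, len(value))

        # (3) When the patch reaches 64 MB, seal it into an immutable block and
        #     split it into eight 8 MB data parts (the four RS-coded parity parts
        #     that Atlas also writes are omitted here).
        if len(pis_slice["patch"]) >= PATCH_CAPACITY:
            block_id = pis_slice["block_id"]
            block = bytes(pis_slice["patch"])
            parts = [block[i * PART_SIZE:(i + 1) * PART_SIZE] for i in range(8)]

            # (4)-(5) Get partservers from the RBS master and write one part to each.
            chosen = sorted(partservers)[:len(parts)]     # trivial stand-in placement
            for part_no, (sid, part) in enumerate(zip(chosen, parts)):
                partservers[sid][(block_id, part_no)] = part

            # (6) The RBS master records where the block's parts live.
            rbs_master["blocks"][block_id] = chosen

            # Start a fresh patch for subsequent writes.
            pis_slice["patch"] = bytearray()
            pis_slice["block_id"] += 1
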
Serving a Read Request. (1) The application sends the request to a PIS slice. (2) If the KV item is in the patch, the value is returned. (3) Otherwise, the slice gets the block ID and offset from its index. (4) It gets the partserver IP for the block ID from the RBS master. (5) It retrieves the value from the partserver. (6) Part recovery is initiated if there is a failure.

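A matching sketch of the read path, over the same dict stand-ins as the write sketch above (again assuming a value does not straddle part boundaries; part recovery on failure is only noted in a comment):

    PART_SIZE = 8 * 2**20

    def handle_get(key, pis_slice, rbs_master, partservers):
        """Read path sketch over the same stand-in structures as handle_put."""
        # (2)-(3) Look up (block ID, offset, size); if the block ID refers to the
        # still-open patch, serve the value straight from the patch.
        block_id, offset, size = pis_slice["index"][key]
        if block_id == pis_slice["block_id"]:
            return bytes(pis_slice["patch"][offset:offset + size])

        # (4) Ask the RBS master which partservers hold this block's parts.
        servers = rbs_master["blocks"][block_id]

        # (5) Read only the 8 MB data part that contains the value and slice it out.
        part_no, part_offset = offset // PART_SIZE, offset % PART_SIZE
        part = partservers[servers[part_no]][(block_id, part_no)]

        # (6) If this read had failed, recovery would rebuild the missing part from
        #     the surviving data and parity parts (not shown).
        return part[part_offset:part_offset + size]
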
Serving Delete/Overwrite Requests. KV pairs stored in Atlas are immutable, and blocks in Atlas are also immutable. A new KV item is written into the system to service a delete/overwrite request, and the space occupied by obsolete items is reclaimed by a garbage collection (GC) process. Periodically, two questions are asked about each block in the RBS subsystem, and positive answers to both lead to a GC: 1) Was the block created earlier than a threshold (such as one week ago)? 2) Is the ratio of valid data in the block smaller than a threshold (such as 80%)?

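A sketch of the per-block GC test described above; the thresholds are the examples given on the slide, and the block metadata fields are hypothetical names:

    import time

    AGE_THRESHOLD_SECONDS = 7 * 24 * 3600   # "created earlier than one week ago"
    VALID_RATIO_THRESHOLD = 0.80            # "ratio of valid data smaller than 80%"

    def should_garbage_collect(block, now=None):
        """Return True if a block is both old enough and sparse enough to collect.
        `block` is assumed to carry creation_time (epoch seconds), valid_bytes,
        and total_bytes."""
        now = time.time() if now is None else now
        old_enough    = (now - block["creation_time"]) > AGE_THRESHOLD_SECONDS
        sparse_enough = (block["valid_bytes"] / block["total_bytes"]) < VALID_RATIO_THRESHOLD
        return old_enough and sparse_enough
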
Atlas's Advantages on Hardware Cost and Power. Atlas saves about 70% of the hardware cost per GB of storage by using ARM servers to replace x86 servers and by using erasure coding to replace 3-copy replication. Power consumption is reduced by about 53% per GB of storage: the ARM processors are more power-efficient, and the ARM server racks are more space-efficient, reducing the energy cost of power supply and thermal dissipation.

Comparison with the Prior System. The reference (pre-Atlas) system has a similar PIS subsystem, but all data are managed solely by the LSM-tree-based KV store; it runs on a 12-server x86 cluster. (Figure: Atlas's throughput at one node, for an all-write workload and a 3:1 read:write workload.)

Atlas on a Customized ARM Cluster. A cluster of 12 ARM servers, each hosting multiple PIS slices and RBS partservers. Each server has a 4-core Marvell processor, 4 GB of memory, four 3 TB disks, and a 1 Gbps full-duplex Ethernet adapter. The request size is 256 KB.

Throughput at One Node with Different Request Types. (Figure: throughput for all-write, all-read, and 3:1 read:write workloads; annotation: more I/O and network bandwidth consumed.)

Latencies with Different Request Types. (Figure: latencies for all-write and all-read workloads.)

Throughput at One Node of a Production System. (Figure: write and read throughput.)

Disk Bandwidth at One Node of a Production System. (Figure: write and read disk bandwidth.)

Summary. Atlas is an object store with a two-tier design that separates the management of keys and values. Atlas uses a hardware-software co-design for high cost-effectiveness and energy efficiency. Atlas adopts erasure coding for space-efficient data protection.