Ambry: LinkedIn's Scalable Geo-Distributed Object Store

Ambry: LinkedIn's Scalable Geo-Distributed Object Store. Shadi A. Noghabi*, Sriram Subramanian+, Priyesh Narayanan+, Sivabalan Narayanan+, Gopalakrishna Holla+, Mammad Zadeh+, Tianwei Li+, Indranil Gupta*, Roy H. Campbell*. +LinkedIn Corporation, Data Infrastructure team. *University of Illinois at Urbana-Champaign. SRG: http://srg.cs.illinois.edu and DPRG: http://dprg.cs.uiuc.edu

Massive Media Objects are Everywhere: 100s of millions of users, 100s of TBs to PBs of data, from all around the world.

Media has a unique access pattern: uploaded once, rarely deleted, never modified (*immutable*); frequently accessed from all around the world (*read-heavy*); skewed toward recent data.

How to handle all these objects? We need a geographically distributed system that stores and retrieves these read-heavy, immutable objects in an efficient and scalable manner: Ambry. In production for >2 years and >400 M users! Open source: https://github.com/linkedin/ambry

Challenges
- Wide diversity: 10s of KBs to a few GBs.
- Evergrowing data and requests: data rarely gets deleted; >800 M req/day (~120 TB); rate doubled in 12 months.
- Fast, durable, and highly available processing: geo-replication with low latency.
- Workload variability (req/s over time) and cluster expansions: load imbalance.

Related Work
- Distributed file systems (NFS, HDFS, Ceph, etc.): high metadata overhead; unnecessary additional capabilities.
- Key-value stores (Cassandra, Dynamo, Bigtable, etc.): not designed for very large objects.
- Blob stores (Haystack, f4, Twitter's blob store): do not resolve load imbalance; not designed specifically for a geo-distributed setting.

Design Goals (one per challenge)
- Wide diversity (10s of KBs to a few GBs) -> low latency and high throughput for all objects.
- Evergrowing data and requests (data rarely gets deleted; >800 M req/day, ~120 TB; rate doubled in 12 months) -> scalable.
- Fast, durable, and highly available processing with geo-replication and low latency -> geo-distributed.
- Workload variability (req/s over time), load imbalance, and cluster expansions -> load balance.

Ambry's Design (techniques per goal)
- Low latency and high throughput for the huge diversity of objects (10s of KBs to a few GBs): logical grouping of objects (partitions), segmented indexing, exploiting OS caches, zero-cost failure detection, chunking and parallel read/write.
- Scalable under evergrowing data and requests (data rarely gets deleted; >800 M req/day, ~120 TB; rate doubled in 12 months): decentralized design, independent components with little interaction, separation of logical and physical placement.
- Geo-distributed, with fast, durable, and highly available processing: asynchronous writes, proxy requests, active-active design, 2-phase background replication with low latency, journaling.
- Load balance under load imbalance: chunking and random placement, rebalancing mechanism with minimal data movement.

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

High Throughput and Low Latency
Small objects: low metadata overhead per object. Large objects: sequential writes/reads of objects.
Group objects together into virtual units called partitions: a large pre-allocated file (e.g., 100 GB) where objects are appended to the end. Partitions are replicated on multiple nodes, in a separate procedure.
(Figure: partition p as an append-only log with objects obj id 20, 30, 40, 60, 70, 90 stored at increasing offsets.)
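To make the append-only partition idea concrete, here is a minimal Java sketch of a partition as a pre-allocated log with an in-memory offset index. The class and method names are illustrative, not Ambry's actual code; record framing and error handling are simplified.

```java
// Minimal sketch of a partition as a pre-allocated append-only log.
// Names are illustrative, not Ambry's actual API.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class PartitionSketch {
    private final RandomAccessFile log;                        // large pre-allocated file
    private final Map<String, Long> index = new HashMap<>();   // blob id -> start offset
    private long writeOffset = 0;                              // end of the used region

    public PartitionSketch(String path, long preallocateBytes) throws IOException {
        log = new RandomAccessFile(path, "rw");
        log.setLength(preallocateBytes);                       // pre-allocate, e.g. 100 GB in production
    }

    // Objects are immutable, so a put is a single sequential append.
    public synchronized long put(String blobId, byte[] data) throws IOException {
        long offset = writeOffset;
        log.seek(offset);
        log.writeInt(data.length);
        log.write(data);
        writeOffset = offset + Integer.BYTES + data.length;
        index.put(blobId, offset);
        return offset;
    }

    // A get is a single seek to the recorded offset followed by a sequential read.
    public synchronized byte[] get(String blobId) throws IOException {
        Long offset = index.get(blobId);
        if (offset == null) return null;
        log.seek(offset);
        byte[] data = new byte[log.readInt()];
        log.readFully(data);
        return data;
    }
}
```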

High Throughput and Low Latency: other optimizations
- Segmented indexing: find the location of an object with low latency. The latest index segment is kept in memory; older segments are on disk, with a Bloom filter per segment.
- Exploiting OS caches.
- Zero-cost failure detection.
- Zero-copy reads.
- Chunking.
(Figure: partition p with index segments 1, 2, and the latest index segment covering start offset 700 to end offset 960, mapping blob ids 40, 70, 90 to offsets 850, 700, 900.)
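A rough sketch of the segmented-index lookup path, assuming an in-memory latest segment and a Bloom filter per sealed segment. In Ambry the sealed segments live on disk; here they are plain in-memory maps for brevity, and the BitSet-based filter is a tiny stand-in for a real Bloom filter.

```java
// Minimal sketch of segmented index lookup; names are illustrative.
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SegmentedIndexSketch {
    static class Segment {
        final Map<String, Long> entries = new HashMap<>();  // blob id -> offset in partition
        final BitSet bloom = new BitSet(1 << 16);            // stand-in for a real Bloom filter

        void put(String blobId, long offset) {
            entries.put(blobId, offset);
            bloom.set(Math.floorMod(blobId.hashCode(), 1 << 16));
        }
        boolean mightContain(String blobId) {
            return bloom.get(Math.floorMod(blobId.hashCode(), 1 << 16));
        }
    }

    private Segment latest = new Segment();                    // kept in memory
    private final Deque<Segment> sealed = new ArrayDeque<>();  // older segments (on disk in Ambry)

    public void put(String blobId, long offset) {
        latest.put(blobId, offset);
    }

    public void sealLatest() {                                 // called when the segment fills up
        sealed.addFirst(latest);
        latest = new Segment();
    }

    public Long find(String blobId) {
        Long offset = latest.entries.get(blobId);              // 1) check the in-memory segment
        if (offset != null) return offset;
        for (Segment s : sealed) {                             // 2) newest to oldest
            if (s.mightContain(blobId)) {                      //    filter skips most disk reads
                offset = s.entries.get(blobId);
                if (offset != null) return offset;
            }
        }
        return null;                                           // not in this partition
    }
}
```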

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

Scalable Design
Decentralized design: independent components with little interaction (shown for Data Center 1).
- Frontend Layer (frontend nodes, each with a Router Library): receive, verify, and route requests.
- Data Layer (datanodes, each with disks Disk 1 ... Disk k): store/retrieve the actual data.
- Cluster state is maintained by the cluster manager (see the backup slides).
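A hypothetical sketch of how a stateless frontend could route a request using the cluster state: the partition id is taken from the blob id and the request is forwarded to one of that partition's replicas. The "partitionId:uuid" blob-id format and the class names below are assumptions for illustration, not Ambry's actual router code.

```java
// Illustrative frontend routing sketch: blob id -> partition id -> replica.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class FrontendRouterSketch {
    // Cluster state: partition id -> replica addresses (kept up to date by the cluster manager).
    private final Map<Long, List<String>> replicasByPartition;

    public FrontendRouterSketch(Map<Long, List<String>> replicasByPartition) {
        this.replicasByPartition = replicasByPartition;
    }

    // Assumed blob-id format "partitionId:uuid".
    public String routeGet(String blobId) {
        long partitionId = Long.parseLong(blobId.split(":", 2)[0]);
        List<String> replicas = replicasByPartition.get(partitionId);
        if (replicas == null || replicas.isEmpty()) {
            throw new IllegalStateException("no replicas known for partition " + partitionId);
        }
        // Pick any replica; a real router would prefer the local datacenter and retry on failure.
        return replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
    }

    public static void main(String[] args) {
        FrontendRouterSketch router = new FrontendRouterSketch(
            Map.of(10L, List.of("dc1-node5:6667", "dc3-node10:6667")));
        System.out.println("GET 10:abc -> " + router.routeGet("10:abc"));
    }
}
```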

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

Geo-Distribution
Asynchronous writes: write synchronously only to the datacenter serving the request, and asynchronously replicate the data to the other datacenters. This reduces latency and minimizes cross-DC traffic.
Other techniques: proxy requests, active-active design, 2-phase background replication, journaling.
(Figure: Data Center 1 through Data Center n, each with its own frontend and data layers serving local requests, with replication running between the data layers.)
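A minimal sketch of the asynchronous-write idea: acknowledge the client once the local datacenter has the blob, and replicate to remote datacenters in the background. The Datacenter interface and executor-based replication below are illustrative stand-ins, not Ambry's actual replication protocol.

```java
// Illustrative async geo-write sketch.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncGeoWriteSketch {
    interface Datacenter { void store(String blobId, byte[] data); }

    private final Datacenter local;
    private final List<Datacenter> remotes;
    private final ExecutorService replicator = Executors.newSingleThreadExecutor();

    public AsyncGeoWriteSketch(Datacenter local, List<Datacenter> remotes) {
        this.local = local;
        this.remotes = remotes;
    }

    public void put(String blobId, byte[] data) {
        local.store(blobId, data);             // synchronous write in the serving datacenter
        for (Datacenter dc : remotes) {        // cross-DC replication happens off the write path
            replicator.submit(() -> dc.store(blobId, data));
        }
        // The client is acknowledged here, before remote datacenters have the blob.
        // If a read arrives at a remote DC before replication finishes, it is proxied
        // to a datacenter that already has the blob (the "proxy request" technique).
    }
}
```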

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

Load Balance
Static cluster: we use three techniques: random placement of objects on large partitions, a CDN layer above Ambry, and chunking. Using these techniques, load imbalance is <5% among nodes.
Dynamic cluster: cluster expansions plus a workload skewed toward recent data -> imbalance. We built a rebalancing mechanism that moves popular data around with minimal data movement, giving a 6-10x improvement in request balance and a 9-10x improvement in disk-usage balance.
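A small sketch of chunking plus random placement: a large blob is split into chunks and each chunk lands on a randomly chosen writable partition, which spreads load across a static cluster. The 4 MB chunk size and the partition-selection detail are assumptions for illustration.

```java
// Illustrative chunking + random placement sketch.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class RandomPlacementSketch {
    static final int CHUNK_SIZE = 4 * 1024 * 1024;   // assumed 4 MB chunks

    // Returns {partitionId, startOffset, endOffset} per chunk; a real system would also
    // write a metadata chunk listing the data chunks so the blob can be reassembled on read.
    static List<long[]> place(byte[] blob, List<Long> writablePartitions) {
        List<long[]> placement = new ArrayList<>();
        for (int start = 0; start < blob.length; start += CHUNK_SIZE) {
            int end = Math.min(start + CHUNK_SIZE, blob.length);
            long partition = writablePartitions.get(
                ThreadLocalRandom.current().nextInt(writablePartitions.size()));
            placement.add(new long[]{partition, start, end});  // chunk = blob[start, end)
        }
        return placement;
    }

    public static void main(String[] args) {
        byte[] blob = new byte[10 * 1024 * 1024];               // a 10 MB blob -> 3 chunks
        System.out.println(place(blob, Arrays.asList(1L, 2L, 3L, 4L)).size() + " chunks placed");
    }
}
```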

EVALUATION

Evaluation - Setup
Small cluster: a beefy single datanode (24-core CPU, 64 GB of RAM, 14 1 TB HDDs, and a full-duplex 1 Gb/s Ethernet network). Workload: read-only, 50-50 read-write, and write-only; fixed-size objects in each test; objects read at random (worst-case scenario).
Production cluster: 3 datacenters located across the US, with real-world traffic.
Simulations: long periods and very large scale.

Throughput
(Figure: max throughput (MB/s) vs. blob size (KB) for write, read, and read-write workloads, compared against the network limit.)
Large objects: all cases saturate the network; read-write saturates both the inbound and outbound links.
Small objects: writes saturate the network; read throughput is reduced due to frequent disk seeks (the random workload is the worst case).

Latency
(Figure: average latency (ms) vs. blob size (KB) for write, read, and read-write workloads.)
Low overhead for large objects; Ambry is mainly intended for large objects!
Small objects: reads are dominated by disk seeks; for 50 KB objects, >94% of the latency is due to disk seeks.

Conclusion
- Low latency and high throughput: partitions + indexing.
- Scalable: decentralized design, independent components.
- Geo-distributed: asynchronous writes.
- Load balance: random placement, load-balancing mechanism.
Open source: https://github.com/linkedin/ambry

Backup slides

Zero-Cost Failure Recovery
No extra messages: failure detection leverages the ordinary request messages. Effective, simple, and consumes very little bandwidth.
(Figure: a replica's state over time: available, temp down, then temp available after a wait period, repeating, until it becomes available again.)
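A hedged sketch of the zero-cost failure detection idea: replica health is inferred from the responses to ordinary requests (no heartbeats), a replica is marked temporarily down after a few consecutive failures, and it is retried after a wait period. The threshold and wait-period values are assumptions.

```java
// Illustrative zero-cost failure detection sketch.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FailureDetectorSketch {
    private static final int FAILURE_THRESHOLD = 3;      // assumed value
    private static final long WAIT_PERIOD_MS = 30_000;   // assumed value

    private final ConcurrentHashMap<String, AtomicInteger> failures = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> downSince = new ConcurrentHashMap<>();

    // Called with the outcome of every normal request sent to a replica.
    public void recordResponse(String replica, boolean success) {
        if (success) {
            failures.remove(replica);
            downSince.remove(replica);
        } else if (failures.computeIfAbsent(replica, r -> new AtomicInteger())
                           .incrementAndGet() >= FAILURE_THRESHOLD) {
            downSince.putIfAbsent(replica, System.currentTimeMillis());  // mark temp down
        }
    }

    // Routers skip replicas that are temporarily down until the wait period expires,
    // after which the replica is tried again ("temp available").
    public boolean isAvailable(String replica) {
        Long since = downSince.get(replica);
        if (since == null) return true;
        if (System.currentTimeMillis() - since >= WAIT_PERIOD_MS) {
            downSince.remove(replica);
            failures.remove(replica);
            return true;
        }
        return false;
    }
}
```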

Cluster Manager
Maintains the state of the cluster.
Logical layout (partition id -> replica placement):
- id 10: DC 1, node 5, disk 3; DC 3, node 10, disk 1
- id 15: DC 1, node 5, disk 3; DC 1, node 2, disk 3
Physical layout (datacenter -> datanode -> disks):
- DC 1, node 1: disk 1, disk 2
- DC 1, node 2: disk 1
- DC n, node j
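The two layouts can be thought of as two maps. Below is an illustrative data-structure sketch, not Ambry's actual cluster-map classes, populated with the example rows from this slide.

```java
// Illustrative cluster-map sketch: logical and physical layouts.
import java.util.List;
import java.util.Map;

public class ClusterMapSketch {
    record Replica(String datacenter, String datanode, String disk) {}

    // Logical layout: partition id -> replica placements.
    private final Map<Long, List<Replica>> logicalLayout;
    // Physical layout: datacenter -> datanode -> disks.
    private final Map<String, Map<String, List<String>>> physicalLayout;

    public ClusterMapSketch(Map<Long, List<Replica>> logical,
                            Map<String, Map<String, List<String>>> physical) {
        this.logicalLayout = logical;
        this.physicalLayout = physical;
    }

    public List<Replica> replicasOf(long partitionId) {
        return logicalLayout.getOrDefault(partitionId, List.of());
    }

    public static void main(String[] args) {
        // Example rows from the slide's logical- and physical-layout tables.
        ClusterMapSketch map = new ClusterMapSketch(
            Map.of(10L, List.of(new Replica("DC 1", "node 5", "disk 3"),
                                new Replica("DC 3", "node 10", "disk 1")),
                   15L, List.of(new Replica("DC 1", "node 5", "disk 3"),
                                new Replica("DC 1", "node 2", "disk 3"))),
            Map.of("DC 1", Map.of("node 1", List.of("disk 1", "disk 2"),
                                  "node 2", List.of("disk 1"))));
        System.out.println(map.replicasOf(10L));
    }
}
```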

Replication Bandwidth - Production
(Figure: CDF of average replication bandwidth per datanode (B/s) for intra- and inter-datacenter replication in DC1, DC2, and DC3.)
Inter-datacenter: short tail; the 95th-to-5th-percentile ratio is ~3x -> load-balanced design.
Intra-datacenter: longer tail, many zero values (omitted), and very small values (<1 KB/s).

Latency - Variance
(Figure: CDF of latency (ms) for write, read, and read-write workloads with 50 KB and 5 MB objects.)