Giza: Erasure Coding Objects across Global Data Centers

Transcription:

Giza: Erasure Coding Objects across Global Data Centers. Yu Lin Chen*, Shuai Mu, Jinyang Li, Cheng Huang*, Jin Li*, Aaron Ogus*, and Douglas Phillips*. New York University; *Microsoft Corporation. USENIX ATC, Santa Clara, CA, July 2017. 1

Azure data centers span the globe 2

Data center fault tolerance: geo-replication (a full copy of the blob in each data center). 3

Data center fault tolerance: erasure coding (Project Giza) splits the blob into data fragments a, b and a parity fragment p, spread across data centers. 4

The cost of erasure coding (Project Giza): reading or reconstructing the blob requires fetching fragments a and b from remote data centers. 5
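
To make the fragment labels a, b, p concrete, here is a minimal 2+1 erasure-coding sketch using XOR parity. It is illustrative only; the function names and the even-split assumption are mine, and Giza's actual coding scheme is more general (k data fragments plus m parity fragments).

    # Minimal 2+1 erasure-coding sketch: split a blob into two data
    # fragments (a, b) and one XOR parity fragment (p). Any single
    # missing fragment can be rebuilt from the other two.
    # Illustrative only; the original blob length must be recorded
    # separately to strip the padding on decode.

    def encode(blob: bytes):
        half = (len(blob) + 1) // 2
        a = blob[:half]
        b = blob[half:].ljust(half, b"\0")          # pad b to equal length
        p = bytes(x ^ y for x, y in zip(a, b))      # parity = a XOR b
        return a, b, p

    def reconstruct_a(b: bytes, p: bytes) -> bytes:
        # If the data center holding fragment a is lost: a = b XOR p.
        return bytes(x ^ y for x, y in zip(b, p))

    if __name__ == "__main__":
        a, b, p = encode(b"hello giza!!")
        assert reconstruct_a(b, p) == a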

When is cross-DC coding a win? 1. Inexpensive cross-DC communication. 2. A coding-friendly workload that (a) stores most of its data in large objects (less coding overhead) and (b) requires reconstruction infrequently, whether due to reads or failures. 6
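
As a rough illustration of why these conditions matter, the sketch below compares the storage overhead of full geo-replication with that of a k+m cross-DC code. The specific numbers (3 replicas, 2+1 and 6+1 codes) are assumptions chosen for illustration, not figures from the talk.

    # Back-of-the-envelope storage overhead, assuming:
    #   - geo-replication keeps one full copy of the blob in each of R data centers
    #   - cross-DC erasure coding stores k data + m parity fragments, each 1/k of the blob

    def replication_overhead(replicas: int) -> float:
        return float(replicas)            # bytes stored per byte of user data

    def coding_overhead(k: int, m: int) -> float:
        return (k + m) / k                # e.g. a 2+1 code stores 1.5x

    if __name__ == "__main__":
        print("3-way geo-replication:", replication_overhead(3))   # 3.0x
        print("2+1 cross-DC coding:  ", coding_overhead(2, 1))     # 1.5x
        print("6+1 cross-DC coding:  ", coding_overhead(6, 1))     # ~1.17x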

Cross-DC communication is becoming less expensive: the largest links today offer 160 Tbps of bandwidth capacity. 7

In search of a coding-friendly workload: a 3-month trace, January through March 2016. 8

OneDrive is occupied mostly by large objects 9

OneDrive is occupied mostly by large objects: objects > 1 GB occupy 53% of total storage capacity. 10

OneDrive is occupied mostly by large objects: objects < 4 MB occupy only 0.9% of total storage capacity. 11

OneDrive reads are infrequent after a month 12

OneDrive reads are infrequent after a month: 47% of the total bytes read occur within the first day after object creation. 13

OneDrive reads are infrequent after a month: less than 2% of the total bytes read occur more than a month after creation. 14

Outline 1. The case for erasure coding across data centers. 2. Giza design and challenges 3. Experimental Results 15

Giza: a strongly consistent, versioned object store. Erasure-codes each object individually. Optimizes latency for Put and Get. Leverages the existing single-DC cloud storage API. 16

Giza's architecture: each data center (DC1, DC2, DC3) runs standard Azure Table and Blob storage, which Giza uses for metadata and data. 17

Giza's workflow: a client issues Put(Blob1, Data) to Giza. 18

Giza's workflow: Giza erasure-codes the data and stores each fragment as a blob in a different data center, e.g., Put(GUID1, Fragment A). 19

Giza's workflow: Giza records the fragment locations as metadata, {DC1: GUID1, DC2: GUID2, DC3: GUID3}, in each data center's table store. 20

Giza metadata versioning: let md_this = {DC1: GUID4, DC2: GUID5, DC3: GUID6}. The metadata row for key Blob1 maps version numbers to metadata, and a new Put extends the row from (v1: md1) to (v1: md1, v2: md_this). 21
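
A possible in-memory shape for the versioned metadata row just described, as a hedged sketch: the names (GizaMetadataRow, versions, committed, add_version) are illustrative, not Giza's actual table schema. The committed field anticipates the committed-version number that appears later in the Put slides.

    # Sketch of a versioned metadata row keyed by blob name, assuming each
    # version maps data-center names to the GUIDs of the fragments stored there.
    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class GizaMetadataRow:
        key: str                                        # e.g. "Blob1"
        versions: Dict[int, Dict[str, str]] = field(default_factory=dict)
        committed: int = 0                              # highest committed version

        def add_version(self, version: int, locations: Dict[str, str]) -> None:
            self.versions[version] = locations          # e.g. {"DC1": "GUID4", ...}

    row = GizaMetadataRow("Blob1")
    row.add_version(1, {"DC1": "GUID1", "DC2": "GUID2", "DC3": "GUID3"})   # md1
    row.add_version(2, {"DC1": "GUID4", "DC2": "GUID5", "DC3": "GUID6"})   # md_this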

Inconsistency during concurrent Put operations: two clients concurrently issue Put(Blob1, Data1) and Put(Blob1, Data2). 22

Inconsistency during concurrent Put operations: depending on the order of arrival, DC1 and DC3 record v1: md1 while DC2 records v1: md2, leaving the metadata in an inconsistent state. 23

Inconsistency during concurrent Put operations: if a data center then fails on top of the conflicting writes, the surviving replicas disagree (v1: md1 vs. v1: md2) and cannot tell which write won. 24

Choosing the version value via consensus: all data centers must agree on the same ordering, either (v1: md1, v2: md2) or (v1: md2, v2: md1), never a mix. 25

Paxos and Fast Paxos properties 26

Paxos and Fast Paxos properties 27

Implementing Fast Paxos with the Azure Table API: let md_this = ({DC1: GUID1, DC2: GUID2, DC3: GUID3}, Paxos state). 28

Implementing Fast Paxos with the Azure Table API: the metadata row, including the Paxos state, is stored in the table of every data center. 29

Implementing Fast Paxos with the Azure Table API: read the metadata row from the remote DC's table (etag = 1) and decide whether to accept or reject the proposal (v2 = md_this). 30

Implementing Fast Paxos with the Azure Table API: write the metadata row back to the table if the proposal is accepted; the conditional write fails when the etag has changed (row read at etag = 1, table now at etag = 2). 31

Implementing Fast Paxos with the Azure Table API: the conditional write succeeds when the etag has not changed (still etag = 1), so the accepted proposal is applied atomically. 32
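
The etag trick in slides 30 to 32 is essentially optimistic concurrency control: read the row together with its etag, decide locally, then write back only if the etag is unchanged. The sketch below assumes a hypothetical table client with read_row and conditional_update; it mirrors the accept step in spirit but is not Giza's actual implementation of Fast Paxos.

    # Sketch of the accept step against one remote DC's table, assuming a
    # hypothetical client where read_row(key) -> (row, etag) and
    # conditional_update(key, row, etag) raises EtagMismatch if the row
    # changed since it was read (the table API's conditional write).

    class EtagMismatch(Exception):
        pass

    def try_accept(table, key, version, proposal) -> bool:
        row, etag = table.read_row(key)                     # e.g. etag = 1
        if not row.paxos_state.can_accept(version, proposal):
            return False                                    # reject the proposal
        row.paxos_state.accept(version, proposal)
        row.versions[version] = proposal.metadata           # v2 = md_this
        try:
            table.conditional_update(key, row, etag)        # fails if etag changed
            return True
        except EtagMismatch:
            return False                                    # lost the race; retry via Paxos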

Parallelizing the metadata and data paths: a Put issues both in parallel across the data centers. 33

Data path fails, metadata path succeeds: the metadata row for Blob1 may contain (v1: md1, v2: md2) with v2 marked ongoing because its data path failed. During a read, Giza reverts to the latest version whose data is available. 34

Giza Put: execute the metadata and data paths in parallel; upon success, return an acknowledgement to the client; in the background, update the highest committed version number in the metadata row. Example row: Blob1 with committed = 1 and metadata (v1: md1, v2: md2), where v2 is ongoing and not yet committed. 35
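
A hedged sketch of the Put flow just described, running the data and metadata paths concurrently and committing the version number off the critical path. The helper names (encode, write_fragments, write_metadata, mark_committed) are assumptions, not Giza's API.

    # Sketch of Giza Put: data path and metadata path in parallel,
    # acknowledge on success of both, commit in the background.
    import asyncio

    async def giza_put(key: str, data: bytes, version: int, store) -> None:
        fragments = store.encode(data)                       # erasure-code the object
        data_path = store.write_fragments(key, version, fragments)   # blobs in each DC
        meta_path = store.write_metadata(key, version, fragments)    # consensus on the row
        await asyncio.gather(data_path, meta_path)           # both must succeed
        # Acknowledge the client here, then finish the commit off the critical path.
        asyncio.create_task(store.mark_committed(key, version))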

Giza Get: optimistically read the latest committed version locally, then execute the data path and metadata path in parallel. The data path retrieves the data fragments indicated by the local latest committed version; the metadata path retrieves the metadata row from all data centers and validates that committed version. 36
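
A corresponding sketch of Get, assuming the same hypothetical store interface: fetch fragments for the locally known committed version while validating that version against every data center's metadata in parallel, and fall back if the local view turns out to be stale.

    # Sketch of Giza Get: optimistic local read of the committed version,
    # with data path and metadata validation running in parallel.
    # read_fragments / read_metadata_all / decode are hypothetical helpers.
    import asyncio

    async def giza_get(key: str, store) -> bytes:
        version = store.local_committed_version(key)          # optimistic local read
        fragments, rows = await asyncio.gather(
            store.read_fragments(key, version),               # data path
            store.read_metadata_all(key),                     # metadata path, all DCs
        )
        latest = max(row.committed for row in rows)
        if latest != version:                                 # local view was stale
            fragments = await store.read_fragments(key, latest)
        return store.decode(fragments)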

Outline 1. The case for erasure coding across data centers. 2. Giza design and challenges 3. Experimental Results 37

Giza deployment configurations US-2-1 38

Giza deployment configurations US-6-1 39

Giza deployment configurations World-2-1 40

Giza deployment configurations World-6-1 41

Metadata path, Fast vs. Classic Paxos: measured latencies of 23 ms, 32 ms, and 64 ms. 42

Metadata path, Fast vs. Classic Paxos: measured latencies of 102 ms, 240 ms, and 139 ms. 43

Giza Put Latency (World-2-1) 44

Giza Get Latency (World-2-1) 45

Summary: recent trends in data centers support the economic feasibility of erasure coding across data centers. Giza is a strongly consistent, versioned object store built on top of Azure storage. Giza optimizes for latency and is fast in the common case. 46