Giza: Erasure Coding Objects across Global Data Centers
Yu Lin Chen*, Shuai Mu, Jinyang Li, Cheng Huang*, Jin Li*, Aaron Ogus*, and Douglas Phillips*
New York University, *Microsoft Corporation
USENIX ATC, Santa Clara, CA, July 2017
Azure data centers span the globe
Data center fault tolerance: geo-replication (figure: each blob fully replicated in multiple data centers)
Data center fault tolerance: erasure coding (figure: Project Giza splits a blob into data fragments a, b and a parity fragment p, one per data center)
The cost of erasure coding (figure: reconstructing the blob requires reading fragments from remote data centers)
When is cross-DC coding a win?
1. Inexpensive cross-DC communication.
2. A coding-friendly workload:
a) stores most of its data in large objects (lower coding overhead), and
b) reconstructs rarely, i.e., few reads or failures after the data goes cold.
(See the worked overhead comparison below.)
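A quick worked comparison of storage overhead (a sketch assuming the 3-DC layout above, with k = 2 data fragments and m = 1 parity fragment; the 3-copy replication factor is illustrative, not a number from the talk):

\[
\text{overhead}_{\text{geo-replication}} = \frac{3S}{S} = 3\times,
\qquad
\text{overhead}_{\text{erasure coding}} = \frac{(k+m)\cdot S/k}{S} = \frac{k+m}{k} = \frac{3}{2} = 1.5\times .
\]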
Cross-DC communication is becoming less expensive: the largest-capacity link today offers 160 Tbps of bandwidth.
In search of a coding-friendly workload: a three-month OneDrive trace (January to March 2016)
OneDrive is occupied mostly by large objects:
Objects > 1 GB occupy 53% of total storage capacity.
Objects < 4 MB occupy only 0.9% of total storage capacity.
OneDrive reads are infrequent after a month:
47% of total bytes read occur within the first day after object creation.
Less than 2% of total bytes read occur more than a month after creation.
Outline
1. The case for erasure coding across data centers
2. Giza design and challenges
3. Experimental results
Giza: a strongly consistent, versioned object store
Erasure-codes each object individually.
Optimizes latency for Put and Get.
Leverages the existing single-DC cloud storage API.
Giza's architecture (figure: three data centers, DC 1-3, each running existing Azure table and blob storage)
Giza's workflow (figure: a client issues Put(Blob1, Data) to Giza; sketch below):
Giza erasure-codes the data and issues Put(GUID1, Fragment A) and the like to the blob store of each DC, storing each fragment under a fresh GUID.
Giza then records the fragment locations {DC1: GUID1, DC2: GUID2, DC3: GUID3} as the object's metadata.
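A minimal sketch of this Put flow, assuming hypothetical per-DC clients with blob_put(name, bytes) and table_put(key, metadata) methods and a toy 2+1 XOR code (none of these names are from the talk):

import uuid
from concurrent.futures import ThreadPoolExecutor

def encode_2_plus_1(data: bytes):
    """Toy 2 data + 1 parity code: split into two halves plus their XOR."""
    half = (len(data) + 1) // 2
    a, b = data[:half], data[half:].ljust(half, b"\0")
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]

def giza_put(key, data, dcs):
    """dcs: one client per data center, each exposing blob_put and table_put."""
    fragments = encode_2_plus_1(data)
    guids = [str(uuid.uuid4()) for _ in fragments]  # fresh GUID per fragment
    # Data path: write one fragment to each DC's blob store, in parallel.
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda dgf: dgf[0].blob_put(dgf[1], dgf[2]),
                      zip(dcs, guids, fragments)))
    # Metadata path: record the DC -> GUID mapping in every DC's table.
    metadata = {dc.name: guid for dc, guid in zip(dcs, guids)}
    for dc in dcs:
        dc.table_put(key, metadata)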
Giza metadata versioning (example row below). Let md_this = {DC1: GUID4, DC2: GUID5, DC3: GUID6}. A new Put of an existing key appends a new version to its metadata row:

Key   | Metadata (version #: actual metadata)
Blob1 | v1: md_1                  (before the Put)
Blob1 | v1: md_1, v2: md_this     (after the Put)
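Concretely, a versioned metadata row might look like this (a sketch of the structure, not Giza's actual wire format):

# One metadata row per object key; each version maps DC -> fragment GUID.
metadata_row = {
    "key": "Blob1",
    "versions": {
        1: {"DC1": "GUID1", "DC2": "GUID2", "DC3": "GUID3"},  # md_1
        2: {"DC1": "GUID4", "DC2": "GUID5", "DC3": "GUID6"},  # md_this
    },
}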
Inconsistency during concurrent Put operations (figure: two clients concurrently issue Put(Blob1, Data1) and Put(Blob1, Data2))
Inconsistency during concurrent Put operations: the two Puts arrive in different orders at different DCs. If each DC assigns version numbers by arrival order:

Key   | Metadata
Blob1 | v1: md_1   (DC 1)
Blob1 | v1: md_2   (DC 2)
Blob1 | v1: md_1   (DC 3)

WARNING: INCONSISTENT STATE. The DCs disagree on what v1 refers to, and if a DC (e.g., DC 3) then fails, the disagreement may be unresolvable.
Choosing version values via consensus: every DC must agree on the same assignment, either

Blob1 | v1: md_1, v2: md_2   (at DC 1, DC 2, and DC 3)
    OR
Blob1 | v1: md_2, v2: md_1   (at DC 1, DC 2, and DC 3)
Paxos and Fast Paxos properties: classic Paxos needs two cross-DC round trips to commit (prepare, then accept), while Fast Paxos commits in a single round trip when there is no contention and falls back to recovery on conflict. Giza therefore uses Fast Paxos to optimize the common, contention-free case.
Implementing Fast Paxos with the Azure table API. Let md_this = ({DC1: GUID1, DC2: GUID2, DC3: GUID3}, Paxos state): the Paxos state is stored alongside the metadata, in the same table row in each DC.
Step 1: read the metadata row (etag = 1) from the remote DC's table and decide whether to accept or reject the proposal (v2 = md_this).
Step 2: if the proposal is accepted, write the metadata row back conditioned on the etag. If the etag has changed (now 2), the conditional write fails and the proposal must retry from a fresh read; if the etag is unchanged (still 1), the write succeeds and the acceptance is durable. (A sketch of this etag-guarded update follows.)
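A minimal sketch of the etag-guarded read-modify-write, assuming a hypothetical table client whose get returns (row, etag) and whose update takes if_match=etag and raises PreconditionFailed on a stale etag; the real system uses the Azure table API's conditional-update support:

class PreconditionFailed(Exception):
    """The row's etag changed between our read and our write."""

def try_accept(table, key, version, ballot, value):
    """One Fast Paxos accept attempt against one DC's metadata row.

    Returns True iff the proposal (ballot, value) was durably accepted.
    """
    row, etag = table.get(key)                     # read row, remember etag
    state = row["paxos"].setdefault(version, {})
    if state.get("promised_ballot", -1) > ballot:
        return False                               # reject: promised a higher ballot
    state["promised_ballot"] = ballot
    state["accepted_value"] = value                # e.g. md_this
    try:
        table.update(key, row, if_match=etag)      # conditional write on etag
        return True
    except PreconditionFailed:
        return False                               # lost a race: retry from read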
Parallelizing the metadata path and the data path (figure: Giza issues the cross-DC table writes and blob fragment writes concurrently)
Data path fails, metadata path succeeds:

Key   | Metadata (version #: actual metadata)
Blob1 | v1: md_1, v2: md_2   (v2 ongoing; its data path failed)

During a read, Giza falls back to the latest version whose data is actually available.
Giza Put (sketch below)
Execute the metadata and data paths in parallel.
Upon success of both, return an acknowledgement to the client.
In the background, update the highest committed version number in the metadata row.

Key   | Committed | Metadata (version #: actual metadata)
Blob1 | 1         | v1: md_1, v2: md_2   (v2 ongoing, not yet committed)
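A sketch of the parallel Put, with metadata_path, data_path, and commit_version as caller-supplied callables standing in for the Fast Paxos version write, the fragment uploads, and the background commit (all hypothetical names):

from concurrent.futures import ThreadPoolExecutor

_background = ThreadPoolExecutor(max_workers=1)  # off-critical-path work

def giza_put_parallel(key, data, metadata_path, data_path, commit_version):
    """Run the metadata path and data path concurrently; ack once both succeed."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        meta = pool.submit(metadata_path, key, data)  # choose a version via Fast Paxos
        frag = pool.submit(data_path, key, data)      # write coded fragments to each DC
        version = meta.result()                       # .result() re-raises on failure
        frag.result()
    # Both paths succeeded: bump the highest committed version number
    # in the metadata row without blocking the client's acknowledgement.
    _background.submit(commit_version, key, version)
    return version                                    # returning == acking the client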
Giza Get (sketch below)
Optimistically read the latest committed version from the local metadata row.
Execute the data path and metadata path in parallel:
- Data path: retrieve the fragments named by the local latest committed version.
- Metadata path: read the metadata row from all data centers and validate that the local committed version really is the latest.
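A corresponding Get sketch under the same assumptions (local_committed, check_latest, fetch_fragments, and decode are hypothetical helpers):

from concurrent.futures import ThreadPoolExecutor

def giza_get(key, local_committed, check_latest, fetch_fragments, decode):
    """Optimistic Get: fetch data for the locally known committed version
    while validating that version against all DCs' metadata in parallel.
    check_latest returns a newer (version, metadata) pair if the local
    view is stale, else None."""
    version, metadata = local_committed(key)              # local table read
    with ThreadPoolExecutor(max_workers=2) as pool:
        frag_f = pool.submit(fetch_fragments, metadata)   # data path
        newer_f = pool.submit(check_latest, key, version) # metadata path
        fragments, newer = frag_f.result(), newer_f.result()
    if newer is not None:
        # Slow path: the local committed version was stale.
        version, metadata = newer
        fragments = fetch_fragments(metadata)
    return decode(fragments)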
Outline
1. The case for erasure coding across data centers
2. Giza design and challenges
3. Experimental results
Giza deployment configurations (the names give the DC footprint and the coding parameters, k data + m parity fragments):
US-2-1: 2 data + 1 parity fragments across US data centers
US-6-1: 6 data + 1 parity fragments across US data centers
World-2-1: 2 data + 1 parity fragments across worldwide data centers
World-6-1: 6 data + 1 parity fragments across worldwide data centers
Metadata path: Fast vs Classic Paxos (figures: latencies of 23ms, 32ms, and 64ms in one plot; 102ms, 240ms, and 139ms in the other)
Giza Put Latency (World-2-1)
Giza Get Latency (World-2-1)
Summary
Recent data center trends make erasure coding across data centers economically feasible.
Giza is a strongly consistent, versioned object store built on top of Azure storage.
Giza optimizes for latency and is fast in the common case.