Ambry: LinkedIn's Scalable Geo-Distributed Object Store

Ambry: LinkedIn's Scalable Geo-Distributed Object Store. Shadi A. Noghabi*, Sriram Subramanian+, Priyesh Narayanan+, Sivabalan Narayanan+, Gopalakrishna Holla+, Mammad Zadeh+, Tianwei Li+, Indranil Gupta*, Roy H. Campbell*. +LinkedIn Corporation, Data Infrastructure team. *University of Illinois at Urbana-Champaign. SRG: http://srg.cs.illinois.edu and DPRG: http://dprg.cs.uiuc.edu

Massive Media Objects are Everywhere: 100s of millions of users, 100s of TBs to PBs of data, from all around the world.

Media has a unique access pattern: uploaded once, rarely deleted, never modified (*immutable*); frequently accessed from all around the world (*read-heavy*); skewed toward recent data.

How to handle all these objects? We need a geographically distributed system that stores and retrieves these read-heavy, immutable objects in an efficient and scalable manner: Ambry. In production for >2 years and >400 M users! Open source: https://github.com/linkedin/ambry

Challenges
- Wide diversity: 10s of KBs to a few GBs.
- Evergrowing data and requests: data rarely gets deleted; >800 M req/day (~120 TB); rate doubled in 12 months.
- Fast, durable, and highly available processing: geo-replication with low latency.
- Workload variability (req/s over time) and cluster expansions: load imbalance.

Related Work
- Distributed file systems (NFS, HDFS, Ceph, etc.): high metadata overhead; unnecessary additional capabilities.
- Key-value stores (Cassandra, Dynamo, Bigtable, etc.): not designed for very large objects.
- Blob stores (Haystack, f4, Twitter's blob store): do not resolve load imbalance; not designed specifically for a geo-distributed setting.

Design Goals (one per challenge)
- Wide diversity (10s of KBs to a few GBs) -> low latency and high throughput for all objects.
- Evergrowing data and requests (data rarely gets deleted; >800 M req/day, ~120 TB; rate doubled in 12 months) -> scalable.
- Fast, durable, and highly available processing with geo-replication and low latency -> geo-distributed.
- Workload variability (req/s over time), load imbalance, and cluster expansions -> load balance.

Ambry's Design (techniques per goal)
- Low latency and high throughput for the huge diversity of objects (10s of KBs to a few GBs): logical grouping of objects (partitions), segmented indexing, exploiting OS caches, zero-cost failure detection, chunking and parallel read/write.
- Scalable under evergrowing data and requests (data rarely gets deleted; >800 M req/day, ~120 TB; rate doubled in 12 months): decentralized design, independent components with little interaction, separation of logical and physical placement.
- Geo-distributed, with fast, durable, and highly available processing: asynchronous writes, proxy requests, active-active design, 2-phase background replication with low latency, journaling.
- Load balance under load imbalance: chunking and random placement, rebalancing mechanism with minimal data movement.

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

High Throughput and Low Latency
Small objects: low metadata overhead per object. Large objects: sequential writes/reads of objects.
Group objects together into virtual units called partitions: a large pre-allocated file (e.g., 100 GB) where objects are appended to the end. Partitions are replicated on multiple nodes, in a separate procedure.
(Figure: partition p as an append-only log with objects obj id 20, 30, 40, 60, 70, 90 stored at increasing offsets.)
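To make the append-only partition idea concrete, here is a minimal Java sketch of a partition as a pre-allocated log with an in-memory offset index. The class and method names are illustrative, not Ambry's actual code; record framing and error handling are simplified.

```java
// Minimal sketch of a partition as a pre-allocated append-only log.
// Names are illustrative, not Ambry's actual API.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

public class PartitionSketch {
    private final RandomAccessFile log;                        // large pre-allocated file
    private final Map<String, Long> index = new HashMap<>();   // blob id -> start offset
    private long writeOffset = 0;                              // end of the used region

    public PartitionSketch(String path, long preallocateBytes) throws IOException {
        log = new RandomAccessFile(path, "rw");
        log.setLength(preallocateBytes);                       // pre-allocate, e.g. 100 GB in production
    }

    // Objects are immutable, so a put is a single sequential append.
    public synchronized long put(String blobId, byte[] data) throws IOException {
        long offset = writeOffset;
        log.seek(offset);
        log.writeInt(data.length);
        log.write(data);
        writeOffset = offset + Integer.BYTES + data.length;
        index.put(blobId, offset);
        return offset;
    }

    // A get is a single seek to the recorded offset followed by a sequential read.
    public synchronized byte[] get(String blobId) throws IOException {
        Long offset = index.get(blobId);
        if (offset == null) return null;
        log.seek(offset);
        byte[] data = new byte[log.readInt()];
        log.readFully(data);
        return data;
    }
}
```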

High Throughput and Low Latency: other optimizations
- Segmented indexing: find the location of an object with low latency. The latest index segment is kept in memory; older segments are on disk, with a Bloom filter per segment.
- Exploiting OS caches.
- Zero-cost failure detection.
- Zero-copy reads.
- Chunking.
(Figure: partition p with index segments 1, 2, and the latest index segment covering start offset 700 to end offset 960, mapping blob ids 40, 70, 90 to offsets 850, 700, 900.)
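A rough sketch of the segmented-index lookup path, assuming an in-memory latest segment and a Bloom filter per sealed segment. In Ambry the sealed segments live on disk; here they are plain in-memory maps for brevity, and the BitSet-based filter is a tiny stand-in for a real Bloom filter.

```java
// Minimal sketch of segmented index lookup; names are illustrative.
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SegmentedIndexSketch {
    static class Segment {
        final Map<String, Long> entries = new HashMap<>();  // blob id -> offset in partition
        final BitSet bloom = new BitSet(1 << 16);            // stand-in for a real Bloom filter

        void put(String blobId, long offset) {
            entries.put(blobId, offset);
            bloom.set(Math.floorMod(blobId.hashCode(), 1 << 16));
        }
        boolean mightContain(String blobId) {
            return bloom.get(Math.floorMod(blobId.hashCode(), 1 << 16));
        }
    }

    private Segment latest = new Segment();                    // kept in memory
    private final Deque<Segment> sealed = new ArrayDeque<>();  // older segments (on disk in Ambry)

    public void put(String blobId, long offset) {
        latest.put(blobId, offset);
    }

    public void sealLatest() {                                 // called when the segment fills up
        sealed.addFirst(latest);
        latest = new Segment();
    }

    public Long find(String blobId) {
        Long offset = latest.entries.get(blobId);              // 1) check the in-memory segment
        if (offset != null) return offset;
        for (Segment s : sealed) {                             // 2) newest to oldest
            if (s.mightContain(blobId)) {                      //    filter skips most disk reads
                offset = s.entries.get(blobId);
                if (offset != null) return offset;
            }
        }
        return null;                                           // not in this partition
    }
}
```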

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

Scalable Design
Decentralized design: independent components with little interaction (shown for Data Center 1).
- Frontend Layer (frontend nodes, each with a Router Library): receive, verify, and route requests.
- Data Layer (datanodes, each with disks Disk 1 ... Disk k): store/retrieve the actual data.
- Cluster state is maintained by the cluster manager (see the backup slides).
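A hypothetical sketch of how a stateless frontend could route a request using the cluster state: the partition id is taken from the blob id and the request is forwarded to one of that partition's replicas. The "partitionId:uuid" blob-id format and the class names below are assumptions for illustration, not Ambry's actual router code.

```java
// Illustrative frontend routing sketch: blob id -> partition id -> replica.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class FrontendRouterSketch {
    // Cluster state: partition id -> replica addresses (kept up to date by the cluster manager).
    private final Map<Long, List<String>> replicasByPartition;

    public FrontendRouterSketch(Map<Long, List<String>> replicasByPartition) {
        this.replicasByPartition = replicasByPartition;
    }

    // Assumed blob-id format "partitionId:uuid".
    public String routeGet(String blobId) {
        long partitionId = Long.parseLong(blobId.split(":", 2)[0]);
        List<String> replicas = replicasByPartition.get(partitionId);
        if (replicas == null || replicas.isEmpty()) {
            throw new IllegalStateException("no replicas known for partition " + partitionId);
        }
        // Pick any replica; a real router would prefer the local datacenter and retry on failure.
        return replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
    }

    public static void main(String[] args) {
        FrontendRouterSketch router = new FrontendRouterSketch(
            Map.of(10L, List.of("dc1-node5:6667", "dc3-node10:6667")));
        System.out.println("GET 10:abc -> " + router.routeGet("10:abc"));
    }
}
```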

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

Geo-Distribution
Asynchronous writes: write synchronously only to the datacenter serving the request, and asynchronously replicate the data to the other datacenters. This reduces latency and minimizes cross-DC traffic.
Other techniques: proxy requests, active-active design, 2-phase background replication, journaling.
(Figure: Data Center 1 through Data Center n, each with its own frontend and data layers serving local requests, with replication running between the data layers.)
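A minimal sketch of the asynchronous-write idea: acknowledge the client once the local datacenter has the blob, and replicate to remote datacenters in the background. The Datacenter interface and executor-based replication below are illustrative stand-ins, not Ambry's actual replication protocol.

```java
// Illustrative async geo-write sketch.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncGeoWriteSketch {
    interface Datacenter { void store(String blobId, byte[] data); }

    private final Datacenter local;
    private final List<Datacenter> remotes;
    private final ExecutorService replicator = Executors.newSingleThreadExecutor();

    public AsyncGeoWriteSketch(Datacenter local, List<Datacenter> remotes) {
        this.local = local;
        this.remotes = remotes;
    }

    public void put(String blobId, byte[] data) {
        local.store(blobId, data);             // synchronous write in the serving datacenter
        for (Datacenter dc : remotes) {        // cross-DC replication happens off the write path
            replicator.submit(() -> dc.store(blobId, data));
        }
        // The client is acknowledged here, before remote datacenters have the blob.
        // If a read arrives at a remote DC before replication finishes, it is proxied
        // to a datacenter that already has the blob (the "proxy request" technique).
    }
}
```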

Outline: low latency and high throughput for all objects; geo-distributed; scalable; load balance.

Load Balance
Static cluster: we use three techniques: random placement of objects on large partitions, a CDN layer above Ambry, and chunking. Using these techniques, load imbalance is <5% among nodes.
Dynamic cluster: cluster expansions plus a workload skewed toward recent data -> imbalance. We built a rebalancing mechanism that moves popular data around with minimal data movement, giving a 6-10x improvement in request balance and a 9-10x improvement in disk-usage balance.
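A small sketch of chunking plus random placement: a large blob is split into chunks and each chunk lands on a randomly chosen writable partition, which spreads load across a static cluster. The 4 MB chunk size and the partition-selection detail are assumptions for illustration.

```java
// Illustrative chunking + random placement sketch.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class RandomPlacementSketch {
    static final int CHUNK_SIZE = 4 * 1024 * 1024;   // assumed 4 MB chunks

    // Returns {partitionId, startOffset, endOffset} per chunk; a real system would also
    // write a metadata chunk listing the data chunks so the blob can be reassembled on read.
    static List<long[]> place(byte[] blob, List<Long> writablePartitions) {
        List<long[]> placement = new ArrayList<>();
        for (int start = 0; start < blob.length; start += CHUNK_SIZE) {
            int end = Math.min(start + CHUNK_SIZE, blob.length);
            long partition = writablePartitions.get(
                ThreadLocalRandom.current().nextInt(writablePartitions.size()));
            placement.add(new long[]{partition, start, end});  // chunk = blob[start, end)
        }
        return placement;
    }

    public static void main(String[] args) {
        byte[] blob = new byte[10 * 1024 * 1024];               // a 10 MB blob -> 3 chunks
        System.out.println(place(blob, Arrays.asList(1L, 2L, 3L, 4L)).size() + " chunks placed");
    }
}
```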

EVALUATION

Evaluation - Setup
Small cluster: a beefy single datanode (24-core CPU, 64 GB of RAM, 14 1 TB HDDs, and a full-duplex 1 Gb/s Ethernet network). Workload: read-only, 50-50 read-write, and write-only; fixed-size objects in each test; objects read at random (worst-case scenario).
Production cluster: 3 datacenters located across the US, with real-world traffic.
Simulations: long periods and very large scale.

Throughput
(Figure: max throughput (MB/s) vs. blob size (KB) for write, read, and read-write workloads, compared against the network limit.)
Large objects: all cases saturate the network; read-write saturates both the inbound and outbound links.
Small objects: writes saturate the network; read throughput is reduced due to frequent disk seeks (the random workload is the worst case).

Latency
(Figure: average latency (ms) vs. blob size (KB) for write, read, and read-write workloads.)
Low overhead for large objects; Ambry is mainly intended for large objects!
Small objects: reads are dominated by disk seeks; for 50 KB objects, >94% of the latency is due to disk seeks.

Conclusion
- Low latency and high throughput: partitions + indexing.
- Scalable: decentralized design, independent components.
- Geo-distributed: asynchronous writes.
- Load balance: random placement, load-balancing mechanism.
Open source: https://github.com/linkedin/ambry

Backup slides

Zero-Cost Failure Recovery
No extra messages: failure detection leverages the ordinary request messages. Effective, simple, and consumes very little bandwidth.
(Figure: a replica's state over time: available, temp down, then temp available after a wait period, repeating, until it becomes available again.)
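A hedged sketch of the zero-cost failure detection idea: replica health is inferred from the responses to ordinary requests (no heartbeats), a replica is marked temporarily down after a few consecutive failures, and it is retried after a wait period. The threshold and wait-period values are assumptions.

```java
// Illustrative zero-cost failure detection sketch.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class FailureDetectorSketch {
    private static final int FAILURE_THRESHOLD = 3;      // assumed value
    private static final long WAIT_PERIOD_MS = 30_000;   // assumed value

    private final ConcurrentHashMap<String, AtomicInteger> failures = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> downSince = new ConcurrentHashMap<>();

    // Called with the outcome of every normal request sent to a replica.
    public void recordResponse(String replica, boolean success) {
        if (success) {
            failures.remove(replica);
            downSince.remove(replica);
        } else if (failures.computeIfAbsent(replica, r -> new AtomicInteger())
                           .incrementAndGet() >= FAILURE_THRESHOLD) {
            downSince.putIfAbsent(replica, System.currentTimeMillis());  // mark temp down
        }
    }

    // Routers skip replicas that are temporarily down until the wait period expires,
    // after which the replica is tried again ("temp available").
    public boolean isAvailable(String replica) {
        Long since = downSince.get(replica);
        if (since == null) return true;
        if (System.currentTimeMillis() - since >= WAIT_PERIOD_MS) {
            downSince.remove(replica);
            failures.remove(replica);
            return true;
        }
        return false;
    }
}
```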

Cluster Manager
Maintains the state of the cluster.
Logical layout (partition id -> replica placement):
- id 10: DC 1, node 5, disk 3; DC 3, node 10, disk 1
- id 15: DC 1, node 5, disk 3; DC 1, node 2, disk 3
Physical layout (datacenter -> datanode -> disks):
- DC 1, node 1: disk 1, disk 2
- DC 1, node 2: disk 1
- DC n, node j
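The two layouts can be thought of as two maps. Below is an illustrative data-structure sketch, not Ambry's actual cluster-map classes, populated with the example rows from this slide.

```java
// Illustrative cluster-map sketch: logical and physical layouts.
import java.util.List;
import java.util.Map;

public class ClusterMapSketch {
    record Replica(String datacenter, String datanode, String disk) {}

    // Logical layout: partition id -> replica placements.
    private final Map<Long, List<Replica>> logicalLayout;
    // Physical layout: datacenter -> datanode -> disks.
    private final Map<String, Map<String, List<String>>> physicalLayout;

    public ClusterMapSketch(Map<Long, List<Replica>> logical,
                            Map<String, Map<String, List<String>>> physical) {
        this.logicalLayout = logical;
        this.physicalLayout = physical;
    }

    public List<Replica> replicasOf(long partitionId) {
        return logicalLayout.getOrDefault(partitionId, List.of());
    }

    public static void main(String[] args) {
        // Example rows from the slide's logical- and physical-layout tables.
        ClusterMapSketch map = new ClusterMapSketch(
            Map.of(10L, List.of(new Replica("DC 1", "node 5", "disk 3"),
                                new Replica("DC 3", "node 10", "disk 1")),
                   15L, List.of(new Replica("DC 1", "node 5", "disk 3"),
                                new Replica("DC 1", "node 2", "disk 3"))),
            Map.of("DC 1", Map.of("node 1", List.of("disk 1", "disk 2"),
                                  "node 2", List.of("disk 1"))));
        System.out.println(map.replicasOf(10L));
    }
}
```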

Replication Bandwidth - Production
(Figure: CDF of average replication bandwidth per datanode (B/s) for intra- and inter-datacenter replication in DC1, DC2, and DC3.)
Inter-datacenter: short tail; the 95th-to-5th-percentile ratio is ~3x -> load-balanced design.
Intra-datacenter: longer tail, many zero values (omitted), and very small values (<1 KB/s).

Latency - Variance
(Figure: CDF of latency (ms) for write, read, and read-write workloads with 50 KB and 5 MB objects.)