
/events @ Box: Using HBase as a message queue
David MacKenzie, Staff Software Engineer

Share, manage and access your content from any device, anywhere

What is the /events API?
- Real-time stream of all activity happening within a user's account
- GET /events?stream_position=234&stream_type=all
- Persistent and re-playable
[Diagram: a client consuming sequential stream positions; see the example request below]
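For illustration only, a minimal sketch of polling this endpoint from Java. The https://api.box.com/2.0 base URL, the BOX_ACCESS_TOKEN environment variable, and the class name are assumptions, not taken from the slides:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EventsPoll {
    public static void main(String[] args) throws Exception {
        String token = System.getenv("BOX_ACCESS_TOKEN");   // hypothetical credential source
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.box.com/2.0/events?stream_position=234&stream_type=all"))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();
        // The response carries the events plus a next_stream_position to use on
        // the next poll, which is what makes the stream persistent and re-playable.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```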

Why did we build it?
- Main use case: desktop sync → switch from batch to incremental diffs
- Several requirements arose from the sync use case:
  - Guaranteed delivery (clients can be offline for days at a time)
  - Arbitrary number of clients consuming each user's stream
  - Persistence
  - Re-playability

Ingest pipeline (architecture diagram): Clients → MySQL → Processing Pool → Dispatcher → HBase
- Events are logged transactionally in MySQL with their associated DB modifications: ~500 events/sec at peak
- The processing pool and dispatcher write ~25,000 events/sec (800 Mb/sec) into HBase

Storing message queues in HBase
The HBase data model:
- Data is organized into rows, each identified by a unique row key
- Rows are organized into tables, ordered lexicographically by row key
- Tables are split into regions, distributed across the cluster
[Diagram: the HBase key space mapped onto HBase RegionServers]

Storing message queues in HBase
- Each user is assigned a separate section of the HBase key space
- Messages are stored in order from oldest to newest within a user's section of the key space
- Reads map directly to scans from the provided position to the user's end key
- Row key structure: <pseudo-random prefix>_<user_id>_<position>, where the prefix is 2 bytes of the user_id SHA and the position is a millisecond timestamp (see the sketch below)
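A minimal sketch of how such a row key could be assembled. The slide shows an underscore-delimited key; this sketch uses fixed-width big-endian fields, which sort the same way lexicographically, and the class and method names are hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class EventRowKey {

    /** Build <2-byte sha(user_id)> + <user_id> + <position> as raw bytes. */
    static byte[] rowKey(long userId, long positionMs) throws Exception {
        // 2-byte pseudo-random prefix derived from the user's id keeps each
        // user's queue contiguous while spreading users across regions.
        byte[] sha = MessageDigest.getInstance("SHA-1")
                .digest(Long.toString(userId).getBytes(StandardCharsets.UTF_8));
        ByteBuffer key = ByteBuffer.allocate(2 + 8 + 8);
        key.put(sha, 0, 2);     // salt: first 2 bytes of the user_id hash
        key.putLong(userId);    // all of a user's rows sort together
        key.putLong(positionMs); // millisecond timestamp => oldest-to-newest order
        return key.array();
    }
}
```

Because the salt is a function of the user_id, every row for a given user shares the same prefix, so a read is a single contiguous scan from the client's stream position to the user's end key.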

Using a timestamp as a queue position
- Pro: allows allocating roughly monotonically increasing positions with no coordination between write requests
- Con: isn't sufficient to guarantee append-only semantics in the presence of parallel writes
[Diagram: parallel writes interleaved with reads, so a reader can advance past a position where a slower write later lands]

Time-bounding and back-scanning
- Need to ensure that clients don't advance their stream positions past writes that will eventually succeed, but clients do need to advance their positions eventually. How do we know when it's safe?
- Solution: time-bound writes and back-scan reads
- Time-bounding: every write to HBase must complete within a fixed time-bound to be considered successful
  - No guaranteed delivery for unsuccessful writes; clients should retry failed writes at higher stream positions
- Back-scanning: clients cannot advance their stream positions further than (current time - back-scan interval)
  - Back-scan interval >= write time-bound
  - Provides guaranteed delivery, but at the cost of duplicate events (see the sketch below)
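A minimal sketch of the read-side rule, assuming a small policy object that readers consult before advancing; the names and structure are hypothetical, and only the invariant back-scan interval >= write time-bound comes from the slide:

```java
import java.time.Clock;

public class BackScanPolicy {
    private final long writeTimeBoundMs;  // writes must finish within this bound
    private final long backScanMs;        // must be >= writeTimeBoundMs
    private final Clock clock;

    BackScanPolicy(long writeTimeBoundMs, long backScanMs, Clock clock) {
        if (backScanMs < writeTimeBoundMs) {
            throw new IllegalArgumentException("back-scan interval must cover the write time-bound");
        }
        this.writeTimeBoundMs = writeTimeBoundMs;
        this.backScanMs = backScanMs;
        this.clock = clock;
    }

    /** Highest stream position a reader may be advanced to right now. */
    long maxAdvancePosition() {
        // Any write stamped before this instant has either completed (and is visible)
        // or exceeded its time-bound and will be retried at a higher position,
        // so it is safe to move readers past it.
        return clock.millis() - backScanMs;
    }

    /** Next position to hand back after scanning up to lastSeenPosition. */
    long nextStreamPosition(long lastSeenPosition) {
        return Math.min(lastSeenPosition + 1, maxAdvancePosition());
    }
}
```

Capping the returned position at (now - back-scan interval) is what trades duplicate delivery for guaranteed delivery: the next fetch re-scans the tail window, so late-arriving writes inside it are seen again.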

[Diagram: time-bounded writes and back-scanned reads over a sequence of queue positions]

Replication
- Need to remain available if a cluster or data center is taken offline
- Can't drop messages when clients issue requests from their previous stream positions against a new cluster
- Some system of replication is required to ensure that messages not yet picked up from the old cluster are available to be picked up in the new cluster

Replication
- Master/slave architecture: the master cluster handles all reads and writes; slave clusters are passive replicas
- Asynchronous replication of messages and their stream positions between clusters
- Each cluster copies the messages it receives from the other clusters to the exact positions initially allocated
- On promotion, clients transparently fail over to the new master cluster, re-using their existing stream positions
- Absent replication lag, all messages will be in the same positions in the new cluster as in the original cluster, so reads against the new cluster behave exactly as reads against the old cluster would

Why master/slave?
- Delivery guarantees rely on the strong consistency guarantees of the underlying HBase cluster, specifically that writes are immediately visible after successful completion
- This allows the cluster to know it has delivered all of the messages successfully written to positions below the next_stream_position returned to the client
- Writing and reading from multiple clusters breaks this guarantee
[Diagram: a write against one cluster racing a read against another]

Handling replication lag
- From the client's perspective, failing over to a lagging cluster can look exactly the same as allowing writes and reads to occur against different clusters
[Diagram: a failover to a lagging cluster while replication of earlier writes is still in flight]

Handling replication lag
- The replication system needs to be aware of master/slave failovers: stop exactly replicating messages and start appending messages to the current ends of the queues
[Diagram: after a failover, lagging messages are appended to the end of the queue rather than inserted at their original positions]
- Trades duplicate delivery for some clients for guaranteed delivery to all clients
- Modified replication algorithm:
  - Slave clusters exactly replicate messages to their original, master-allocated positions
  - The master cluster appends replicated messages to the current ends of its queues

Handling replication lag
- Not sufficient if we allow mastership to fail back before replication has caught up
- Even if a cluster has become a slave again, it needs to re-append messages that it didn't have while it was master
[Diagram: failover followed by failback before replication catches up, leaving messages that must be re-appended]

Handling replication lag
- The core problem with replication lag: whenever a cluster hands out a new stream position to a reading client, it is making a promise that the client has read all of the messages below that stream position
- The cluster can't guarantee the validity of this promise for all clients if there are messages written to lower positions that hadn't yet replicated to the cluster at the time of the read
- To guarantee delivery, any such messages need to be re-appended to the queue so that clients have another chance to pick them up
- How does the cluster identify every such message, without needlessly re-appending messages for which delivery was already guaranteed?

Handling replication lag
- The cluster could just keep track of the highest stream position it has handed out to reading clients; any replicated messages with lower positions would need to be re-appended
  - Turns all reads into (potentially contentious) write operations
  - Has pathological behavior if we end up in a prolonged split-brain, master/master scenario
[Diagram: a split-brain master/master scenario after failover]

Handling replication lag
- Solution: introduce a replication epoch/generation ID
  - Incremented every time a new cluster becomes master
  - Incorporated into the stream positions used by the current master cluster
  - A stream position is a 64-bit millisecond timestamp → the first two bytes are co-opted to store the current replication epoch (see the sketch below)
- Ensures global ordering of messages across master cluster flips: master cluster 1 positions < master cluster 2 positions < master cluster 3 positions
  - Reads against an old master cluster can never require us to re-append messages successfully written to the current master cluster
- Each slave cluster keeps track of the last replication epoch during which it was master
  - Any replicated message from a prior epoch needs to be appended
  - Any replicated message from a subsequent epoch can be safely replicated to its original position
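A sketch of that position encoding. The slide only says the top two bytes hold the epoch; the 16/48-bit split, masks, and helper names here are assumptions (48 bits of milliseconds covers roughly 8,900 years):

```java
public final class EpochPosition {

    private static final long TIMESTAMP_MASK = (1L << 48) - 1;

    /** Pack a replication epoch into the top two bytes of a 64-bit stream position. */
    static long encode(int epoch, long timestampMs) {
        return ((long) (epoch & 0xFFFF) << 48) | (timestampMs & TIMESTAMP_MASK);
    }

    /** Epoch stored in the most significant two bytes. */
    static int epochOf(long position) {
        return (int) (position >>> 48);
    }

    /** Millisecond timestamp stored in the low 48 bits. */
    static long timestampOf(long position) {
        return position & TIMESTAMP_MASK;
    }
}
```

Because the epoch sits in the most significant bits, every position allocated under epoch N+1 sorts after any position allocated under epoch N, which is what gives the global ordering across failovers.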

[Diagram: stream positions tagged with replication epochs across a failover]

Handling replication lag
[Diagram: epoch-tagged positions across a failover and failback, showing which messages get re-appended]

Replication algorithm
- Each cluster asynchronously ships the messages written to it, and their corresponding stream positions, to the other clusters
- Slave clusters process each replicated message by comparing the message's replication epoch against the cluster's last-master epoch, and:
  - Replicating the message locally to its original position if the replication epoch is higher
  - Re-appending it to the master cluster if the replication epoch is lower
- The master cluster processes each replicated message by comparing the message's replication epoch against the cluster's current epoch, and:
  - Re-appending the message if its replication epoch is lower
  - Failing and retrying if the replication epoch is higher (split-brain)
  (a sketch of this decision logic follows below)
- How do we generate the asynchronous replication stream?
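A sketch of that per-message decision, assuming the epoch encoding shown earlier; the Role and Action names are hypothetical, only the epoch comparisons come from the talk:

```java
public class ReplicatedMessageHandler {

    enum Role { MASTER, SLAVE }
    enum Action { REPLICATE_TO_ORIGINAL_POSITION, RE_APPEND, FAIL_AND_RETRY }

    private final Role role;
    private final int lastMasterEpoch;  // slaves: last epoch during which this cluster was master
    private final int currentEpoch;     // master: the epoch it is currently serving

    ReplicatedMessageHandler(Role role, int lastMasterEpoch, int currentEpoch) {
        this.role = role;
        this.lastMasterEpoch = lastMasterEpoch;
        this.currentEpoch = currentEpoch;
    }

    /** Decide what to do with a message received over the async replication stream. */
    Action handle(long replicatedPosition) {
        int messageEpoch = (int) (replicatedPosition >>> 48);  // top two bytes of the position
        if (role == Role.SLAVE) {
            // Messages written after our last mastership are safe to copy in place;
            // anything older has to be re-appended on the current master instead.
            return messageEpoch > lastMasterEpoch
                    ? Action.REPLICATE_TO_ORIGINAL_POSITION
                    : Action.RE_APPEND;
        }
        // Master: messages from earlier epochs get appended to the ends of the queues;
        // an equal or higher epoch means another cluster also wrote as master (split-brain).
        return messageEpoch < currentEpoch ? Action.RE_APPEND : Action.FAIL_AND_RETRY;
    }
}
```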

Replication pipeline (architecture diagram)
- Master datacenter: MySQL → Processing Pool → Dispatcher → HBase. The processing pool allocates a position for each event and records the position used in the MySQL DB.
- Slave datacenter: its processing pool queries for events with the position allocated by the master and reuses the master's position when writing events through its own dispatcher into HBase.

What are the problems with this approach?
- Only one position can be allocated for an event, regardless of how many users it's sent to
  - Some events need to be sent to 100K+ users
  - It's impossible to send events to an arbitrarily large number of users within the system's fixed time-bounds
  - We added a second MySQL table post-fanout to chunk results, but it heavily increased our MySQL write-amplification factor
- Replication is implemented at the client level
  - Either duplicate the replication logic across all clients or restrict write access to a single client

[Diagram: Clients and MySQL feed the Processing Pool and Dispatcher, which write to a master HBase queue cluster; replication flows from the master queue cluster to a slave queue cluster]

Can we leverage HBase replication?
- HBase replication employs a master-push model → the master cluster ships changes to the configured slave servers
- If our queue service can speak the native HBase replication API, we can configure it as the replication target for the master HBase cluster
- This gives us an opportunity to enforce master/slave cluster state when processing the replication stream
- We are currently rolling this HBase-backed replication system out in production

What's next?
- Our initial firehose of all user activity is still locked inside MySQL
- Expensive to add new subscribers onto the stream:
  - Every client requires its own column in the table to track its processing status
  - Every additional client adds additional write load onto MySQL to track its processing status
  - If a client goes offline, either sacrifice delivery guarantees or churn through storage on the main application DB tier
- Expensive to add new events to the stream:
  - Especially for non-DB-transactional events (such as downloads, logins, etc.), which would otherwise be read-only → it turns them into DB write operations
- Plan: keep MySQL for the initial transactional recording of events, but move to an alternate system for client interaction and for recording non-DB-transactional events

Can we leverage our existing HBase queuing system?
- Problem: much higher throughput than our existing user queues
- Would have to add support for partitioning topics to spread the load across multiple HBase regionservers
  - Conceptually simple → incorporate the partition ID into the row key: <pseudo-random prefix>_<topic_id>_<partition_id>_<position>
  - Make sure the pseudo-random prefix is distinct between partitions of the same topic (see the sketch after this list)
- May have to change our queue layout in HBase to remove timestamps as the queue position
  - The backscan algorithm causes the rate of duplicate events to scale linearly with throughput: 1,500 events/sec * 5-second backscan = 7,500 duplicate events per fetch across all partitions
  - Likely need to substantially decrease the time-bounds and backscan windows to be viable
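A sketch of the partitioned row key described above, again using fixed-width big-endian fields in place of the underscore-delimited form; the field widths and helper names are assumptions:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class TopicPartitionRowKey {

    /** Build <2-byte salt> + <topic_id> + <partition_id> + <position>, salt distinct per partition. */
    static byte[] rowKey(long topicId, int partitionId, long positionMs) throws Exception {
        // Salting on the (topic, partition) pair rather than the topic alone lets the
        // partitions of one hot topic land on different regionservers.
        byte[] sha = MessageDigest.getInstance("SHA-1")
                .digest((topicId + ":" + partitionId).getBytes(StandardCharsets.UTF_8));
        ByteBuffer key = ByteBuffer.allocate(2 + 8 + 4 + 8);
        key.put(sha, 0, 2);      // per-partition salt
        key.putLong(topicId);
        key.putInt(partitionId);
        key.putLong(positionMs); // queue position, still ordered within the partition
        return key.array();
    }
}
```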

Open source alternaoves? Closest off- the- rack queuing system is Kava Developed at LinkedIn. Open sourced in 20. Originally built to power LinkedIn s analyocs pipeline Very similar model built around ordered commit logs Allow for easy addioon of new subscribers Allow for varying subscriber consumpoon paperns à slow subscribers don t back up the pipeline As a dedicated queuing system, much more fully featured than what we ve built and tuned for much higher throughput 29

Why not Kafka?
- It would be a second system to maintain, as it can't replace our existing HBase user queues
  - It can't scale to millions of topics; for our HBase user queues, we currently have 3 queues for each of our 30+ million users
  - Kafka currently tops out in the tens of thousands of topics/partitions per cluster
  - Its design requires very granular topic/partition tracking, which is a barrier to scale
- We may need to build much of the higher-throughput support into our HBase queuing system anyhow in order to support enterprise queues
  - Would require 50K+ topics
  - Throughput for our larger enterprises might be higher than we'd be comfortable running against a single regionserver

Why not Kafka?
- Inter-cluster replication support: there is not enough control over Kafka queue positions to implement transparent client failovers between replica clusters, especially in the presence of replication lag
[Diagram: a client failing over between replica clusters whose positions have diverged under replication lag]

In conclusion
- We were able to leverage HBase to store millions of guaranteed-delivery message queues, each of which is:
  - replicated between data centers
  - independently consumable by an arbitrary number of clients
- We're currently working on building a cleaner abstraction around these queues with native replication support
- We still need to decide whether enhancing Kafka or continuing to build on top of HBase is the right strategy for our higher-throughput queues

Questions?
Email: dmackenzie@box.com
Engineering Blog: tech.blog.box.com
Platform: developers.box.com
Open Source: opensource.box.com