/events @ Box: Using HBase as a message queue
David MacKenzie, Staff Software Engineer
Share, manage and access your content from any device, anywhere
What is the /events API?
- Real-time stream of all activity happening within a user's account
- GET /events?stream_position=234&stream_type=all
- Persistent and re-playable
Why did we build it?
Main use case: desktop sync → switch from batch to incremental diffs.
Several requirements arose from the sync use case:
- Guaranteed delivery (clients can be offline for days at a time)
- Arbitrary number of clients consuming each user's stream
- Persistence
- Re-playability
[Architecture diagram: Clients → MySQL → Processing Pool → Dispatcher → HBase. Events are logged transactionally with their associated DB modifications, ~500 events/sec at peak; the dispatcher writes ~25,000 events/sec, 800 Mb/sec into HBase.]
Storing message queues in HBase
HBase data model:
- Data organized into rows, each identified by a unique row key
- Rows organized into tables, ordered lexicographically by row key
- Tables split into regions, distributed across the cluster
[Diagram: the HBase key space mapped across HBase RegionServers]
Storing message queues in HBase
- Each user assigned a separate section of the HBase key space
- Messages are stored in order from oldest to newest within a user's section of the key space
- Reads map directly to scans from the provided position to the user's end key
Row key structure: <pseudo-random prefix>_<user_id>_<position>
(pseudo-random prefix = 2 bytes of the user_id's SHA hash; position = millisecond timestamp)
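The row-key layout above can be sketched as follows. This is a minimal illustration, not Box's actual code: the exact field widths, the `_` separators, and the choice of SHA-1 are assumptions.

```python
import hashlib
import struct

def row_key(user_id: int, position_ms: int) -> bytes:
    """Build a queue row key: <pseudo-random prefix>_<user_id>_<position>.

    The prefix is the first 2 bytes of a SHA hash of the user_id, which
    spreads users pseudo-randomly across the key space while keeping all
    of one user's messages contiguous and ordered by position.
    """
    prefix = hashlib.sha1(str(user_id).encode()).digest()[:2]
    # Big-endian fixed-width encodings so that lexicographic byte order
    # matches numeric order within a user's section of the key space.
    return (prefix + b"_"
            + struct.pack(">Q", user_id) + b"_"
            + struct.pack(">Q", position_ms))
```

Because keys for one user share the same prefix and encode the position big-endian, an HBase scan from a client's stream position to the user's end key returns messages oldest-to-newest.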
Using a timestamp as a queue position
- Pro: allows allocating roughly monotonically increasing positions with no coordination between write requests
- Con: isn't sufficient to guarantee append-only semantics in the presence of parallel writes
[Diagram: reads racing with parallel writes, observing positions before in-flight writes complete]
Time-bounding and back-scanning
- Need to ensure that clients don't advance their stream positions past writes that will eventually succeed
- But clients do need to advance their position eventually. How do we know when it's safe?
Solution: time-bound writes and back-scan reads.
- Time-bounding: every write to HBase must complete within a fixed time bound to be considered successful. No guaranteed delivery for unsuccessful writes; clients should retry failed writes at higher stream positions.
- Back-scanning: clients cannot advance their stream positions further than (current time - back-scan interval), where back-scan interval >= write time bound.
Provides guaranteed delivery, but at the cost of duplicate events.
[Diagram: time-bounding and back-scanning in action — reads stop short of the back-scan horizon while parallel writes complete within the time bound]
Replication
- Need to remain available if a cluster or data center is taken offline
- Can't drop messages when clients issue requests from their previous stream positions against a new cluster
- Some system of replication required to ensure that messages not yet picked up from the old cluster are available to be picked up in the new cluster
Replication: master/slave architecture
- Master cluster handles all reads and writes; slave clusters are passive replicas
- Asynchronous replication of messages and their stream positions between clusters
- Each cluster copies the messages it receives from the other clusters to the exact positions initially allocated
- On promotion, clients transparently fail over to the new master cluster, re-using their existing stream positions
Absent replication lag, all messages will be in the same positions in the new cluster as in the original cluster, so reads against the new cluster behave exactly as reads against the old cluster would.
Why master/slave?
- Delivery guarantees rely on the strong consistency guarantees of the underlying HBase cluster — specifically, that writes are immediately visible after successful completion
- This allows the cluster to know it has delivered all of the messages successfully written to positions below the next_stream_position returned to the client
- Writing to and reading from multiple clusters breaks this guarantee
[Diagram: a read against one cluster missing a concurrent write made to another cluster]
Handling replication lag
From the client's perspective, failing over to a lagging cluster can look exactly the same as allowing writes and reads to occur against different clusters.
[Diagram: a failover to a lagging replica — the client's read against the new master misses a write still in flight from the old master]
Handling replication lag
- The replication system needs to be aware of master/slave failovers: stop exactly replicating messages; start appending messages to the current ends of the queues
- Trades off duplicate delivery for some clients for guaranteed delivery to all clients
Modified replication algorithm:
- Slave clusters exactly replicate messages to their original master-allocated positions
- The master cluster appends replicated messages to the current ends of its queues
Handling replication lag
- Not sufficient if we allow mastership to fail back before replication has caught up
- Even if a cluster has become a slave again, it needs to re-append messages that it didn't have while it was master
[Diagram: a failover followed by a failback while replication is still lagging]
Handling replication lag
Core problem with replication lag:
- Whenever a cluster hands out a new stream position to a reading client, it's making a promise that the client has read all of the messages below that stream position
- The cluster can't guarantee the validity of this promise for all clients if there are messages written to lower positions that hadn't yet replicated to the cluster at the time of the read
- To guarantee delivery, any such messages need to be re-appended to the queue to ensure that clients have another chance to pick them up
How does the cluster identify every such message, without needlessly re-appending messages for which delivery was already guaranteed?
Handling replication lag
- The cluster could just keep track of the highest stream position it's handed out to reading clients; any replicated messages with lower positions would need to be re-appended
- But this turns all reads into (potentially contentious) write operations
- And it has pathological behavior if we end up in a prolonged split-brain, master/master scenario
[Diagram: a split-brain scenario after failover, with both clusters handing out positions]
Handling replication lag
Solution: introduce a replication epoch/generation ID.
- Incremented every time a new cluster becomes master
- Incorporated into the stream positions used by the current master cluster: the stream position is a 64-bit millisecond timestamp → the first two bytes are co-opted to store the current replication epoch
- Ensures global ordering of messages between master cluster flips: master cluster 1 positions < master cluster 2 positions < master cluster 3 positions
- Reads against an old master cluster can never require us to re-append messages successfully written to the current master cluster
- Each slave cluster keeps track of the last replication epoch during which it was master: any replicated message from a prior epoch needs to be appended; any replicated message from a subsequent epoch can be safely replicated to its original position
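The epoch-in-the-position encoding above can be sketched with plain bit arithmetic. This is an illustrative assumption about the exact packing; the slide only states that the first two bytes of the 64-bit position hold the epoch.

```python
EPOCH_BITS = 16  # the first two bytes of the 64-bit position

def make_position(epoch: int, timestamp_ms: int) -> int:
    """Pack a replication epoch into the top 16 bits of a 64-bit position.

    Positions allocated under a later epoch always compare greater than
    positions from any earlier epoch, regardless of timestamp, which is
    what gives global ordering across master flips.
    """
    assert 0 <= epoch < (1 << EPOCH_BITS)
    assert 0 <= timestamp_ms < (1 << (64 - EPOCH_BITS))  # 48-bit ms timestamp
    return (epoch << (64 - EPOCH_BITS)) | timestamp_ms

def epoch_of(position: int) -> int:
    """Recover the replication epoch from a packed position."""
    return position >> (64 - EPOCH_BITS)
```

48 bits of millisecond timestamp is ample (thousands of years), so sacrificing the top two bytes for the epoch costs nothing in practice.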
[Diagram: stream positions before and after a failover, with the replication epoch prefixed onto each position]
Handling replication lag
[Diagram: a failover and failback with epoch-prefixed positions — messages written under the prior epoch are re-appended under the new epoch]
Replication algorithm
Each cluster asynchronously ships the messages written to it, and their corresponding stream positions, to the other clusters.
Slave clusters process each replicated message by comparing the replication epoch of the message against the cluster's last-master epoch and:
- Replicating the message locally to its original position if the replication epoch is higher
- Re-appending it to the master cluster if the replication epoch is lower
The master cluster processes each replicated message by comparing the replication epoch of the message against the cluster's current epoch and:
- Re-appending the message if its replication epoch is lower
- Failing and retrying if the replication epoch is higher (split-brain)
How do we generate the asynchronous replication stream?
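The per-message decision above can be sketched as a small function. The function name and the string labels are invented for illustration; only the epoch comparisons come from the algorithm described here.

```python
def process_replicated(msg_epoch: int, is_master: bool, cluster_epoch: int) -> str:
    """Decide how a cluster handles one replicated message.

    For a slave, cluster_epoch is the last epoch during which it was
    master; for the master, it is the current replication epoch.
    """
    if is_master:
        if msg_epoch < cluster_epoch:
            return "re_append"       # written under an older master: append to queue end
        return "fail_retry"          # same/newer epoch while we are master: split-brain
    else:
        if msg_epoch > cluster_epoch:
            return "replicate_in_place"  # written after we were last master: position is safe
        return "re_append"               # prior-epoch message: hand to master for re-append
```

The key invariant: a message only keeps its original position on a cluster that provably never handed out read positions above it.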
[Diagram: master and slave datacenters, each running MySQL → Processing Pool → Dispatcher → HBase. The master allocates a position for each event and records the position used in its MySQL DB; the slave queries for events with positions allocated by the master and reuses the master position when writing events.]
What are the problems with this approach?
- Only one position can be allocated for an event, regardless of how many users it's sent to. Some events need to be sent to 100K+ users, making it impossible to send events to an arbitrarily large number of users within the system's fixed time bounds. We added a second MySQL table post-fanout to chunk results, but it heavily increased our MySQL write amplification factor.
- Replication is implemented at the client level: either duplicate the replication logic across all clients, or else restrict write access to a single client.
[Diagram: Clients and MySQL feeding a Processing Pool and Dispatcher, which writes to the master queue cluster (HBase); replication flows from the master queue cluster to the slave queue cluster.]
Can we leverage HBase replication?
- HBase replication employs a master-push model → the master cluster ships changes to configured slave servers
- If our queue service can speak the native HBase replication API, we can configure it as the replication target for the master HBase cluster
- This gives us an opportunity to enforce master/slave cluster state when processing the replication stream
- Currently rolling this HBase-backed replication system out in production
What's next?
Our initial firehose of all user activity is still locked inside MySQL.
Expensive to add new subscribers to the stream:
- Every client requires its own column in the table to track its processing status
- Every additional client adds additional write load onto MySQL to track its processing status
- If a client goes offline, either sacrifice delivery guarantees or churn through storage on the main application DB tier
Expensive to add new events to the stream:
- Especially for non-DB-transactional events (such as downloads, logins, etc.), which would otherwise be read-only → turns them into DB write operations
Plan: keep MySQL for the initial transactional recording of events, but move to an alternate system for client interaction and for recording non-DB-transactional events.
Can we leverage our existing HBase queuing system?
Problem: much higher throughput than our existing user queues.
- Would have to add support for partitioning topics to spread the load across multiple HBase regionservers. Conceptually simple → incorporate a partition ID into the row key: <pseudo-random prefix>_<topic_id>_<partition_id>_<position>. Make sure the pseudo-random prefix is distinct between partitions for the same topic.
- May have to change our queue layout in HBase to remove timestamps as the queue position. The backscan algorithm causes the rate of duplicate events to scale linearly with throughput: 1500 events/sec * 5 second backscan = 7500 duplicate events per fetch across all partitions. Likely need to substantially decrease time bounds and backscan windows to be viable.
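The partitioned row key above can be sketched by extending the per-user key scheme. As before, the field widths, separators, and hash choice are illustrative assumptions; the point shown is deriving the prefix from both topic and partition so partitions of one topic land on different regions.

```python
import hashlib
import struct

def topic_row_key(topic_id: int, partition_id: int, position: int) -> bytes:
    """Partitioned-topic row key: <prefix>_<topic_id>_<partition_id>_<position>.

    Hashing topic_id together with partition_id makes the pseudo-random
    prefix distinct per partition, so partitions of the same topic spread
    across regionservers instead of hotspotting one.
    """
    prefix = hashlib.sha1(f"{topic_id}:{partition_id}".encode()).digest()[:2]
    return (prefix + b"_"
            + struct.pack(">Q", topic_id) + b"_"
            + struct.pack(">I", partition_id) + b"_"
            + struct.pack(">Q", position))
```

Within one partition the keys still sort by position, so per-partition reads remain simple scans.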
Open source alternatives?
The closest off-the-rack queuing system is Kafka.
- Developed at LinkedIn; open sourced in 2011. Originally built to power LinkedIn's analytics pipeline
- Very similar model built around ordered commit logs
- Allows for easy addition of new subscribers
- Allows for varying subscriber consumption patterns → slow subscribers don't back up the pipeline
- As a dedicated queuing system, much more fully featured than what we've built, and tuned for much higher throughput
Why not Kafka?
- Would be a second system to maintain, as it can't replace our existing HBase user queues
- Can't scale to millions of topics: for our HBase user queues, we currently have 3 queues for each of our 30+ million users, while Kafka currently tops out in the tens of thousands of topics/partitions per cluster. Its design requires very granular topic/partition tracking — a barrier to scale.
- We may need to build much of the higher-throughput support into our HBase queuing system anyhow in order to support enterprise queues: these would require 50K+ topics, and throughput for our larger enterprises might be higher than we'd be comfortable running against a single regionserver
Why not Kafka?
- Inter-cluster replication support: not enough control over Kafka queue positions to implement transparent client failovers between replica clusters, especially in the presence of replication lag
[Diagram: a client failing over between Kafka replicas while replication lags, reading past positions of messages not yet replicated]
In conclusion
We were able to leverage HBase to store millions of guaranteed-delivery message queues, each of which is:
- replicated between data centers
- independently consumable by an arbitrary number of clients
We're currently working on building a cleaner abstraction around these queues with native replication support.
We still need to decide whether enhancing Kafka or continuing to build on top of HBase is the right strategy for our higher-throughput queues.
Questions?
Email: dmackenzie@box.com
Engineering Blog: tech.blog.box.com
Platform: developers.box.com
Open Source: opensource.box.com