Box: Using HBase as a message queue. David MacKenzie, Staff Software Engineer

1 Box: Using HBase as a message queue. David MacKenzie, Staff Software Engineer

2 Share, manage and access your content from any device, anywhere

3 What is the /events API?
- Realtime stream of all activity happening within a user's account
- GET /events?stream_position=234&stream_type=all
- Persistent and re-playable
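A client's interaction with this endpoint might look like the following minimal Python sketch. The base URL, the `fetch` callable and the `entries` response field are assumptions made for illustration; only the two query parameters come from the slide, and `next_stream_position` is the name used later in the talk for the position returned to the client.

```python
from urllib.parse import urlencode


def events_url(base_url, stream_position, stream_type="all"):
    """Build the GET /events URL for a given stream position."""
    query = urlencode({"stream_position": stream_position,
                       "stream_type": stream_type})
    return f"{base_url}/events?{query}"


def poll_once(fetch, base_url, stream_position):
    """Fetch one batch of events.

    `fetch` is a hypothetical callable mapping a URL to a parsed response
    dict; the response shape used here is assumed, not documented.
    """
    response = fetch(events_url(base_url, stream_position))
    # The client stores the returned position and sends it on the next call.
    return response["entries"], response["next_stream_position"]
```

Because the client only ever holds a position, it can be offline for an arbitrary period and resume by replaying from the last position it stored.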

4 Why did we build it?
- Main use case: desktop sync -> switch from batch to incremental diffs
- Several requirements arose from the sync use case:
  - Guaranteed delivery
  - Clients can be offline for days at a time
  - Arbitrary number of clients consuming each user's stream
  - Persistence
  - Re-playability

5 [Architecture diagram: Clients, MySQL, Processing Pool, Dispatcher, HBase]
- MySQL: events logged transactionally with their associated DB modifications. ~500 events/sec at peak.
- Processing Pool / Dispatcher: ~25,000 events/sec, 800 Mb/sec

6 Storing message queues in HBase
- HBase data model:
  - Data organized into rows, each identified by a unique row key
  - Rows organized into tables, ordered lexicographically by row key
  - Tables split into regions, distributed across the cluster
[Diagram: HBase key space mapped onto HBase RegionServers]

7 Storing message queues in HBase
- Each user is assigned a separate section of the HBase key space
- Messages are stored in order from oldest to newest within a user's section of the key space
- Reads map directly to scans from the provided position to the user's end key
- Row key structure: <pseudo-random prefix>_<user_id>_<position>
  - Pseudo-random prefix: 2 bytes of user_id sha
  - Position: millisecond timestamp
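The row key layout above can be sketched in Python as follows. This is a minimal illustration, not the production encoding: the use of SHA-1, the hex and zero-padded decimal encodings, and the function names are all assumptions. The point it shows is that a fixed-width position makes lexicographic row order match queue order, so a read becomes a range scan.

```python
import hashlib

POSITION_WIDTH = 20  # assumed fixed width so string order equals numeric order


def make_row_key(user_id: int, position_ms: int) -> bytes:
    """Build <pseudo-random prefix>_<user_id>_<position> as bytes."""
    # Assumption: the prefix is the first 2 bytes of SHA-1(user_id), hex-encoded.
    # It spreads users across the key space while keeping each user's rows together.
    prefix = hashlib.sha1(str(user_id).encode()).digest()[:2].hex()
    return f"{prefix}_{user_id:012d}_{position_ms:0{POSITION_WIDTH}d}".encode()


def scan_range(user_id: int, from_position_ms: int):
    """Return (start_key, stop_key) covering the user's queue from a position."""
    start = make_row_key(user_id, from_position_ms)
    # The user's end key: the largest position representable in this encoding.
    stop = make_row_key(user_id, 10 ** POSITION_WIDTH - 1)
    return start, stop
```

A reader would pass the returned pair to a scan as its start and stop rows; every row for that user at or after the requested position falls inside the range.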

8 Using a timestamp as a queue position
- Pro: Allows for allocating roughly monotonically increasing positions with no coordination between write requests
- Con: Isn't sufficient to guarantee append-only semantics in the presence of parallel writes
[Diagram: parallel writes completing out of order relative to reads]

9 Time-bounding and Back-scanning
- Need to ensure that clients don't advance their stream positions past writes that will eventually succeed
- But clients do need to advance position eventually
- How do we know when it's safe?
- Solution: time-bound writes and back-scan reads
  - Time-bounding: every write to HBase must complete within a fixed time-bound to be considered successful
    - No guaranteed delivery for unsuccessful writes. Clients should retry failed writes at higher stream positions.
  - Back-scanning: clients cannot advance their stream positions further than (current time - back-scan interval)
    - Back-scan interval >= write time-bound
- Provides guaranteed delivery but at the cost of duplicate events
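The back-scan rule can be sketched with an in-memory queue standing in for HBase. This is an illustration under stated assumptions: the 5-second interval, the dict-based queue, the inclusive read boundary and the function names are hypothetical. It shows why a write that lands late at a lower timestamp is still picked up, and why duplicates follow.

```python
BACKSCAN_INTERVAL_MS = 5_000  # hypothetical value; must be >= the write time-bound


def next_stream_position(newest_seen_ms, now_ms, backscan_ms=BACKSCAN_INTERVAL_MS):
    """Never hand out a position later than (current time - back-scan interval)."""
    return min(newest_seen_ms, now_ms - backscan_ms)


def read_queue(queue, position_ms, now_ms):
    """Return (messages, next position) for a dict of position -> message."""
    events = sorted((p, m) for p, m in queue.items() if p >= position_ms)
    # Just past the newest event seen, or "now" if nothing was returned.
    newest = events[-1][0] + 1 if events else now_ms
    return [m for _, m in events], next_stream_position(newest, now_ms)


queue = {1000: "a", 4000: "b"}
messages, position = read_queue(queue, 0, now_ms=6000)
# The position is held back to 1000 because 4000 is still inside the window.
queue[3000] = "late"  # a write allocated at 3000 that completed after the read
messages_again, position_again = read_queue(queue, position, now_ms=9500)
```

In this run the second read returns "a" and "b" again alongside "late": the late write is delivered, and the price is the duplicate delivery described on the slide.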

10 [Diagram: sequence of writes and reads illustrating time-bounding and back-scanning]

11 Replication
- Need to remain available if a cluster or data center is taken offline
- Can't drop messages when clients issue requests from their previous stream positions against a new cluster
- Some system of replication is required to ensure that messages not yet picked up from the old cluster are available to be picked up in the new cluster

12 Replication
- Master/slave architecture
  - Master cluster handles all reads and writes; slave clusters are passive replicas
- Asynchronous replication of messages and their stream positions between clusters
  - Each cluster copies the messages it receives from the other clusters to the exact positions initially allocated
- On promotion, clients transparently fail over to the new master cluster, re-using their existing stream positions
  - Absent replication lag, all messages will be in the same positions in the new cluster as in the original cluster. Reads against the new cluster behave exactly as reads against the old cluster would.

13 Why Master/Slave?
- Delivery guarantees rely on the strong consistency guarantees of the underlying HBase cluster
  - Specifically, that writes are immediately visible after successful completion
  - Allows the cluster to know it has delivered all of the messages successfully written to positions below the next_stream_position returned to the client
- Writing and reading from multiple clusters breaks this guarantee
[Diagram: a write and a read issued against different clusters]

14 Handling Replication Lag
- From the client's perspective, failing over to a lagging cluster can look exactly the same as allowing writes and reads to occur against different clusters
[Diagram: replication, failover, write and read against a lagging cluster]

15 Handling Replication Lag
- Replication system needs to be aware of master/slave failovers
  - Stop exactly replicating messages. Start appending messages to the current ends of the queues.
- Trades off duplicate delivery for some clients for guaranteed delivery to all clients
- Modified replication algorithm:
  - Slave clusters exactly replicate messages to their original master-allocated positions
  - Master cluster appends replicated messages to the current ends of its queues
[Diagram: failover followed by a read]

16 Handling Replication Lag
- Not sufficient if we allow mastership to fail back before replication has caught up
- Even if a cluster has become a slave again, it needs to re-append messages that it didn't have while it was master
[Diagram: failover, read, then failback]

17 Handling Replication Lag
- Core problem with replication lag:
  - Whenever a cluster hands out a new stream position to a reading client, it's making a promise that the client has read all of the messages below that stream position
  - The cluster can't guarantee the validity of this promise for all clients if there are messages written to lower positions that hadn't yet replicated to the cluster at the time of the read
  - To guarantee delivery, any such messages need to be re-appended to the queue to ensure that clients have another chance to pick them up
- How does the cluster identify every such message?
  - Without needlessly re-appending messages for which delivery was already guaranteed

18 Handling Replication Lag
- Cluster could just keep track of the highest stream position it's handed out to reading clients
  - Any replicated messages with lower positions would need to be re-appended
- Turns all reads into (potentially contentious) write operations
- Has pathological behavior if we end up in a prolonged split-brain, master/master scenario
[Diagram: failover]

19 Handling Replication Lag
- Solution: Introduce a replication epoch/generation ID
  - Incremented every time a new cluster becomes master
  - Incorporated into the stream positions used by the current master cluster
    - Stream position is a 64-bit millisecond timestamp -> first two bytes co-opted to store the current replication epoch
- Ensures global ordering of messages between master cluster flips
  - Master cluster 1 positions < Master cluster 2 positions < Master cluster 3 positions
  - Reads against an old master cluster can never require us to re-append messages successfully written to the current master cluster
- Each slave cluster keeps track of the last replication epoch during which it was master
  - Any replicated message from a prior epoch needs to be appended
  - Any replicated message from a subsequent epoch can be safely replicated to its original position
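Storing the epoch in the first two bytes of a 64-bit position can be sketched as simple bit packing. The function names are hypothetical and the 16/48-bit split is an assumption that follows from "first two bytes"; the sketch shows why every position from a later epoch compares greater than every position from an earlier one, whatever the timestamps are.

```python
EPOCH_BITS = 16                      # "first two bytes" of the 64-bit position
TIMESTAMP_BITS = 64 - EPOCH_BITS     # remaining 48 bits hold the millisecond timestamp
TIMESTAMP_MASK = (1 << TIMESTAMP_BITS) - 1


def pack_position(epoch: int, timestamp_ms: int) -> int:
    """Combine a replication epoch and a millisecond timestamp into one position."""
    if not 0 <= epoch < (1 << EPOCH_BITS):
        raise ValueError("epoch does not fit in two bytes")
    if not 0 <= timestamp_ms <= TIMESTAMP_MASK:
        raise ValueError("timestamp does not fit in the remaining 48 bits")
    return (epoch << TIMESTAMP_BITS) | timestamp_ms


def unpack_position(position: int):
    """Split a position back into (epoch, timestamp_ms)."""
    return position >> TIMESTAMP_BITS, position & TIMESTAMP_MASK
```

Because the epoch occupies the most significant bits, a plain integer comparison of two positions orders them by epoch first and by timestamp second, which is the global ordering across master flips that the slide describes.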

20 [Diagram: failover with replication epochs]

21 Handling Replication Lag
[Diagram: failover, read, then failback]

22 Replication Algorithm
- Each cluster asynchronously ships the messages written to it and their corresponding stream positions to the other clusters
- Slave clusters process each replicated message by:
  - Comparing the replication epoch of the message against the cluster's last-master epoch and:
    - Replicating the message locally to its original position if the replication epoch is higher
    - Re-appending to the master cluster if the replication epoch is lower
- Master cluster processes each replicated message by:
  - Comparing the replication epoch of the message against the cluster's current epoch and:
    - Re-appending the message if its replication epoch is lower
    - Failing and retrying if the replication epoch is higher (split-brain)
- How do we generate the asynchronous replication stream?
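The per-message decision can be sketched as two small functions. This reads the master's re-append case as applying to messages from a lower (prior) epoch, consistent with the epoch rules stated earlier in the talk. The action names are hypothetical, and the handling of an epoch that is exactly equal is an assumption, since the slides do not cover that case.

```python
REPLICATE_IN_PLACE = "replicate_to_original_position"
REAPPEND = "re_append_to_end_of_master_queue"
RETRY = "fail_and_retry"


def slave_action(message_epoch: int, last_master_epoch: int) -> str:
    """Decide how a slave cluster handles one replicated message."""
    if message_epoch > last_master_epoch:
        # Written after this cluster last handed out positions: safe to copy as-is.
        return REPLICATE_IN_PLACE
    # Lower epoch: readers here may already have moved past the original position.
    # Assumption: an equal epoch is treated the same way, as the conservative choice.
    return REAPPEND


def master_action(message_epoch: int, current_epoch: int) -> str:
    """Decide how the master cluster handles one replicated message."""
    if message_epoch < current_epoch:
        # Written under a prior master: append at the current end of the queue.
        return REAPPEND
    # Higher epoch means another cluster believes it is master (split-brain).
    # Assumption: an equal epoch from another cluster is handled the same way.
    return RETRY
```

Re-appending is the branch that causes duplicate delivery for some clients; replicating in place is the branch that keeps positions identical across clusters.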

23 [Diagram: Master Datacenter and Slave Datacenter, each with a Processing Pool, Dispatcher and HBase, alongside MySQL]
- Master datacenter Processing Pool: allocates a position for each event. Records the position used in the MySQL DB.
- Slave datacenter Processing Pool: queries for events with a position allocated by the master. Reuses the master position when writing events.

24 What are the problems with this approach?
- Only one position can be allocated for an event, regardless of how many users it's sent to
  - Some events need to be sent to 100K+ users
  - Impossible to send events to an arbitrarily large number of users within the system's fixed time-bounds
  - Added a second MySQL table post-fanout to chunk results, but it heavily increased our MySQL write amplification factor
- Replication implemented at the client level
  - Either duplicate replication logic across all clients or else restrict write access to a single client

25 [Diagram: MySQL, Clients, Processing Pool, Dispatcher, Master Queue Cluster (HBase), Replication, Slave Queue Cluster]

26 Can we leverage HBase replication?
- HBase replication employs a master-push model -> master cluster ships changes to configured slave servers
- If our queue service can talk the native HBase replication API, we can configure it to be the replication target for the master HBase cluster
- Provides us an opportunity to enforce master/slave cluster state when processing the replication stream
- Currently rolling this HBase-backed replication system out in production

27 What's next?
- Our initial firehose of all user activity is still locked inside MySQL
- Expensive to add new subscribers onto the stream
  - Every client requires its own column in the table to track its processing status
  - Every additional client adds additional write load onto MySQL to track its processing status
  - If a client goes offline, either sacrifice delivery guarantees or churn through storage on the main application DB tier
- Expensive to add new events to the stream
  - Especially for non-DB-transactional events (such as downloads, logins, etc.), which would otherwise be read-only -> turns them into DB write operations
- Keep MySQL for initial transactional recording of events but move to an alternate system for client interaction and recording non-DB-transactional events

28 Can we leverage our existing HBase queuing system?
- Problem: Much higher throughput than our existing user queues
- Would have to add support for partitioning topics to spread the load across multiple HBase regionservers
  - Conceptually simple -> incorporate partition ID into row key: <pseudo-random prefix>_<topic_id>_<partition_id>_<position>
  - Make sure pseudo-random prefix is distinct between partitions for the same topic
- May have to change our queue layout in HBase to remove timestamps as the queue position
  - Backscan algorithm causes rate of duplicate events to scale linearly with throughput
  - 500 events/sec * 5 second backscan = 7500 duplicate events per fetch across all partitions
  - Likely need to substantially decrease time-bounds and backscan windows to be viable
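A partitioned row key along these lines might be built as in the Python sketch below. Deriving the prefix from a hash of the topic and partition together, choosing a partition by hashing an event key, and the field widths are all assumptions for illustration. Note that a 2-byte hash only makes distinct prefixes likely, not certain, so a real scheme would need to check for collisions between partitions of the same topic.

```python
import hashlib


def partition_prefix(topic_id: str, partition_id: int) -> str:
    """Prefix derived from topic and partition so partitions land in different key ranges."""
    digest = hashlib.sha1(f"{topic_id}:{partition_id}".encode()).digest()
    return digest[:2].hex()


def partition_row_key(topic_id: str, partition_id: int, position: int) -> str:
    """Build <pseudo-random prefix>_<topic_id>_<partition_id>_<position>."""
    prefix = partition_prefix(topic_id, partition_id)
    return f"{prefix}_{topic_id}_{partition_id:04d}_{position:020d}"


def pick_partition(event_key: str, num_partitions: int) -> int:
    """Hypothetical partitioner: spread events across partitions by hashing a key."""
    value = int.from_bytes(hashlib.sha1(event_key.encode()).digest()[:8], "big")
    return value % num_partitions
```

Within one partition the keys still sort by position, so each partition remains an independently scannable queue that can live on a different regionserver.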

29 Open source alternatives?
- Closest off-the-rack queuing system is Kafka
  - Developed at LinkedIn. Open sourced in 2011.
  - Originally built to power LinkedIn's analytics pipeline
- Very similar model built around ordered commit logs
  - Allow for easy addition of new subscribers
  - Allow for varying subscriber consumption patterns -> slow subscribers don't back up the pipeline
- As a dedicated queuing system, much more fully featured than what we've built and tuned for much higher throughput

30 Why not Kafka?
- Would be a second system to maintain, as it can't replace our existing HBase user queues
  - Can't scale to millions of topics
    - For our HBase user queues, we currently have 3 queues for each of our 30+ million users
    - Kafka currently tops out in the tens of thousands of topics/partitions per cluster
    - Design requires very granular topic/partition tracking. Barrier to scale.
- We may need to build much of the higher-throughput support into our HBase queuing system anyhow in order to support enterprise queues
  - Would require 50K+ topics
  - Throughput for our larger enterprises might be higher than we'd be comfortable running against a single regionserver

31 Why not Kafka?
- Inter-cluster replication support
  - Not enough control over Kafka queue positions to implement transparent client failovers between replica clusters, especially in the presence of replication lag
[Diagram: reads, writes, replication and failover between replica clusters]

32 In conclusion
- We were able to leverage HBase to store millions of guaranteed-delivery message queues, each of which was:
  - replicated between data centers
  - independently consumable by an arbitrary number of clients
- We're currently working on building a cleaner abstraction around these queues with native replication support
- We still need to decide whether enhancing Kafka or continuing to build on top of HBase is the right strategy for our higher-throughput queues

33 Questions?
- Engineering Blog: tech.blog.box.com
- Platform: developers.box.com
- Open Source: opensource.box.com


More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information

MINIMIZING TRANSACTION LATENCY IN GEO-REPLICATED DATA STORES

MINIMIZING TRANSACTION LATENCY IN GEO-REPLICATED DATA STORES MINIMIZING TRANSACTION LATENCY IN GEO-REPLICATED DATA STORES Divy Agrawal Department of Computer Science University of California at Santa Barbara Joint work with: Amr El Abbadi, Hatem Mahmoud, Faisal

More information

Performance and Scalability with Griddable.io

Performance and Scalability with Griddable.io Performance and Scalability with Griddable.io Executive summary Griddable.io is an industry-leading timeline-consistent synchronized data integration grid across a range of source and target data systems.

More information

MySQL Replication Update

MySQL Replication Update MySQL Replication Update Lars Thalmann Development Director MySQL Replication, Backup & Connectors OSCON, July 2011 MySQL Releases MySQL 5.1 Generally Available, November 2008 MySQL

More information

Migrating to the P8 5.2 Component Manager Framework

Migrating to the P8 5.2 Component Manager Framework Migrating to the P8 5.2 Component Manager Framework Contents Migrating to the P8 5.2 Component Manager Framework... 1 Introduction... 1 Revision History:... 2 Comparing the Two Component Manager Frameworks...

More information

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit)

CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring Lecture 21: Network Protocols (and 2 Phase Commit) CS 162 Operating Systems and Systems Programming Professor: Anthony D. Joseph Spring 2003 Lecture 21: Network Protocols (and 2 Phase Commit) 21.0 Main Point Protocol: agreement between two parties as to

More information

Oracle Database 12c: JMS Sharded Queues

Oracle Database 12c: JMS Sharded Queues Oracle Database 12c: JMS Sharded Queues For high performance, scalable Advanced Queuing ORACLE WHITE PAPER MARCH 2015 Table of Contents Introduction 2 Architecture 3 PERFORMANCE OF AQ-JMS QUEUES 4 PERFORMANCE

More information

More reliability and support for PostgreSQL 10: Introducing Pgpool-II 3.7

More reliability and support for PostgreSQL 10: Introducing Pgpool-II 3.7 More reliability and support for PostgreSQL 10: Introducing Pgpool-II 3.7 PGConf.ASIA 2017 SRA OSS, Inc Japan Tatsuo Ishii Who am I? Working on OSS activities and businesses OSS activities PostgreSQL committer

More information

Mix n Match Async and Group Replication for Advanced Replication Setups. Pedro Gomes Software Engineer

Mix n Match Async and Group Replication for Advanced Replication Setups. Pedro Gomes Software Engineer Mix n Match Async and Group Replication for Advanced Replication Setups Pedro Gomes (pedro.gomes@oracle.com) Software Engineer 4th of February Copyright 2017, Oracle and/or its affiliates. All rights reserved.

More information

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL CISC 7610 Lecture 5 Distributed multimedia databases Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL Motivation YouTube receives 400 hours of video per minute That is 200M hours

More information

Building LinkedIn s Real-time Data Pipeline. Jay Kreps

Building LinkedIn s Real-time Data Pipeline. Jay Kreps Building LinkedIn s Real-time Data Pipeline Jay Kreps What is a data pipeline? What data is there? Database data Activity data Page Views, Ad Impressions, etc Messaging JMS, AMQP, etc Application and

More information

Scale out Read Only Workload by sharing data files of InnoDB. Zhai weixiang Alibaba Cloud

Scale out Read Only Workload by sharing data files of InnoDB. Zhai weixiang Alibaba Cloud Scale out Read Only Workload by sharing data files of InnoDB Zhai weixiang Alibaba Cloud Who Am I - My Name is Zhai Weixiang - I joined in Alibaba in 2011 and has been working on MySQL since then - Mainly

More information

Optimizing RDM Server Performance

Optimizing RDM Server Performance TECHNICAL WHITE PAPER Optimizing RDM Server Performance A Raima Inc. Technical Whitepaper Published: August, 2008 Author: Paul Johnson Director of Marketing Copyright: Raima Inc., All rights reserved Abstract

More information

LazyBase: Trading freshness and performance in a scalable database

LazyBase: Trading freshness and performance in a scalable database LazyBase: Trading freshness and performance in a scalable database (EuroSys 2012) Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair Veitch PARALLEL DATA LABORATORY

More information

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems GFS (The Google File System) 1 Filesystems

More information

MySQL High Availability. Michael Messina Senior Managing Consultant, Rolta-AdvizeX /

MySQL High Availability. Michael Messina Senior Managing Consultant, Rolta-AdvizeX / MySQL High Availability Michael Messina Senior Managing Consultant, Rolta-AdvizeX mmessina@advizex.com / mike.messina@rolta.com Introduction Michael Messina Senior Managing Consultant Rolta-AdvizeX, Working

More information

real-time delivery architecture

real-time delivery architecture real-time delivery architecture @raffi uc berkeley - 27 august 2012 designing twitter what are the goals? evolve from being solely a web stack ROUTING PRESENTATION LOGIC STORAGE & RETRIEVAL T-Bird T-Flock

More information

Market Data Publisher In a High Frequency Trading Set up

Market Data Publisher In a High Frequency Trading Set up Market Data Publisher In a High Frequency Trading Set up INTRODUCTION The main theme behind the design of Market Data Publisher is to make the latest trade & book data available to several integrating

More information

High Noon at AWS. ~ Amazon MySQL RDS versus Tungsten Clustering running MySQL on AWS EC2

High Noon at AWS. ~ Amazon MySQL RDS versus Tungsten Clustering running MySQL on AWS EC2 High Noon at AWS ~ Amazon MySQL RDS versus Tungsten Clustering running MySQL on AWS EC2 Introduction Amazon Web Services (AWS) are gaining popularity, and for good reasons. The Amazon Relational Database

More information

Identifying Workloads for the Cloud

Identifying Workloads for the Cloud Identifying Workloads for the Cloud 1 This brief is based on a webinar in RightScale s I m in the Cloud Now What? series. Browse our entire library for webinars on cloud computing management. Meet our

More information

MySQL HA Solutions. Keeping it simple, kinda! By: Chris Schneider MySQL Architect Ning.com

MySQL HA Solutions. Keeping it simple, kinda! By: Chris Schneider MySQL Architect Ning.com MySQL HA Solutions Keeping it simple, kinda! By: Chris Schneider MySQL Architect Ning.com What we ll cover today High Availability Terms and Concepts Levels of High Availability What technologies are there

More information

The HAMMER Filesystem DragonFlyBSD Project Matthew Dillon 11 October 2008

The HAMMER Filesystem DragonFlyBSD Project Matthew Dillon 11 October 2008 The HAMMER Filesystem DragonFlyBSD Project Matthew Dillon 11 October 2008 HAMMER Quick Feature List 1 Exabyte capacity (2^60 = 1 million terrabytes). Fine-grained, live-view history retention for snapshots

More information

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc.

High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. High Availability through Warm-Standby Support in Sybase Replication Server A Whitepaper from Sybase, Inc. Table of Contents Section I: The Need for Warm Standby...2 The Business Problem...2 Section II:

More information

Apache BookKeeper. A High Performance and Low Latency Storage Service

Apache BookKeeper. A High Performance and Low Latency Storage Service Apache BookKeeper A High Performance and Low Latency Storage Service Hello! I am Sijie Guo - PMC Chair of Apache BookKeeper Co-creator of Apache DistributedLog Twitter Messaging/Pub-Sub Team Yahoo! R&D

More information

Distributed PostgreSQL with YugaByte DB

Distributed PostgreSQL with YugaByte DB Distributed PostgreSQL with YugaByte DB Karthik Ranganathan PostgresConf Silicon Valley Oct 16, 2018 1 CHECKOUT THIS REPO: github.com/yugabyte/yb-sql-workshop 2 About Us Founders Kannan Muthukkaruppan,

More information

Jailbreaking MySQL Replication Featuring Tungsten Replicator. Robert Hodges, CEO, Continuent

Jailbreaking MySQL Replication Featuring Tungsten Replicator. Robert Hodges, CEO, Continuent Jailbreaking MySQL Replication Featuring Tungsten Robert Hodges, CEO, Continuent About Continuent / Continuent is the leading provider of data replication and clustering for open source relational databases

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

THE ZADARA CLOUD. An overview of the Zadara Storage Cloud and VPSA Storage Array technology WHITE PAPER

THE ZADARA CLOUD. An overview of the Zadara Storage Cloud and VPSA Storage Array technology WHITE PAPER WHITE PAPER THE ZADARA CLOUD An overview of the Zadara Storage Cloud and VPSA Storage Array technology Zadara 6 Venture, Suite 140, Irvine, CA 92618, USA www.zadarastorage.com EXECUTIVE SUMMARY The IT

More information

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.

Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Presented by Kewei Li The Problem db nosql complex legacy tuning expensive

More information

MySQL Replication Options. Peter Zaitsev, CEO, Percona Moscow MySQL User Meetup Moscow,Russia

MySQL Replication Options. Peter Zaitsev, CEO, Percona Moscow MySQL User Meetup Moscow,Russia MySQL Replication Options Peter Zaitsev, CEO, Percona Moscow MySQL User Meetup Moscow,Russia Few Words About Percona 2 Your Partner in MySQL and MongoDB Success 100% Open Source Software We work with MySQL,

More information

The course modules of MongoDB developer and administrator online certification training:

The course modules of MongoDB developer and administrator online certification training: The course modules of MongoDB developer and administrator online certification training: 1 An Overview of the Course Introduction to the course Table of Contents Course Objectives Course Overview Value

More information

A Guide to Architecting the Active/Active Data Center

A Guide to Architecting the Active/Active Data Center White Paper A Guide to Architecting the Active/Active Data Center 2015 ScaleArc. All Rights Reserved. White Paper The New Imperative: Architecting the Active/Active Data Center Introduction With the average

More information

MySQL Replication : advanced features in all flavours. Giuseppe Maxia Quality Assurance Architect at

MySQL Replication : advanced features in all flavours. Giuseppe Maxia Quality Assurance Architect at MySQL Replication : advanced features in all flavours Giuseppe Maxia Quality Assurance Architect at VMware @datacharmer 1 About me Who s this guy? Giuseppe Maxia, a.k.a. "The Data Charmer" QA Architect

More information

Data Infrastructure at LinkedIn. Shirshanka Das XLDB 2011

Data Infrastructure at LinkedIn. Shirshanka Das XLDB 2011 Data Infrastructure at LinkedIn Shirshanka Das XLDB 2011 1 Me UCLA Ph.D. 2005 (Distributed protocols in content delivery networks) PayPal (Web frameworks and Session Stores) Yahoo! (Serving Infrastructure,

More information

Conceptual Modeling on Tencent s Distributed Database Systems. Pan Anqun, Wang Xiaoyu, Li Haixiang Tencent Inc.

Conceptual Modeling on Tencent s Distributed Database Systems. Pan Anqun, Wang Xiaoyu, Li Haixiang Tencent Inc. Conceptual Modeling on Tencent s Distributed Database Systems Pan Anqun, Wang Xiaoyu, Li Haixiang Tencent Inc. Outline Introduction System overview of TDSQL Conceptual Modeling on TDSQL Applications Conclusion

More information

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University

CPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network

More information