Building a Durable Real-time Data Pipeline

Building a Durable Real-time Data Pipeline: Apache BookKeeper at Twitter. @sijieg, Twitter

Agenda: Background, Layered Architecture, Design Details, Performance, Scale @Twitter, Q & A

Publish-Subscribe: online services - 10s of milliseconds (transaction log, queues, RPC); near real-time processing - 100s of milliseconds (change propagation, stream computing); data delivery for analytics - seconds to minutes (log collection, aggregation)

Twitter Messaging in 2012: core business logic (tweets, fanouts), deferred RPC, Gizzard, databases, and search - spread across Scribe, Kestrel, BookKeeper, MySQL, Kafka, and HDFS

Online Services - Kestrel: simple; performs well (as long as the queue fits in memory); fan-out queues - one per subscriber; reliable reads - per-item transactions; cross-DC replication

Kestrel Limitations: durability is hard to achieve - each queue is a separate file; adding subscribers is expensive - a separate physical copy of the queue for each fanout; read-behind degrades performance - too many random I/Os; scales poorly as the number of queues increases

Stream Computing - Kafka: throughput and latency via sequential I/O with a small number of topics; avoids data copying - direct network I/O (sendfile); batch compression; cross-DC replication (mirroring)

Kafka Limitations: relies on the filesystem page cache; limits the number of topics - ideally one or a handful of topics per disk; performance degrades when a subscriber falls behind - too much random I/O; no durability and replication (as of 0.7)

Problems: each of these systems came with its own maintenance overhead. Software components - backend, clients, and interop with the rest of the Twitter stack. Manageability and supportability - deployment, upgrades, hardware maintenance and optimization. Technical know-how.

Rethinking the messaging architecture: a unified stack with tradeoffs for various workloads; durable writes, intra-cluster and geo-replication; multi-tenancy; scaling resources independently - cost efficiency; ease of manageability

Layered Architecture: Data Model, Software Stack, Data Flow

Log Stream: a sequence of bytes, stored as a series of log segments (Log Segment X, Log Segment X+1, ..., Log Segment Y). Entry: a batch of records. Each record is addressed by a DLSN (LSSN, Eid, Sid), along with a Sequence ID and a Transaction ID.
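
To make the addressing concrete, here is a minimal Java sketch of a DLSN-style position with illustrative field names (not the actual DistributedLog class); it only shows that positions order by log segment, then entry, then slot.

// Hypothetical sketch of a DLSN: the position of a record in a log stream.
// Ordering is lexicographic: log segment, then entry within the segment,
// then slot (record) within the batched entry.
public final class DLSN implements Comparable<DLSN> {
    private final long logSegmentSequenceNo; // LSSN: which log segment
    private final long entryId;              // Eid: which batched entry in the segment
    private final long slotId;               // Sid: which record within the entry

    public DLSN(long lssn, long entryId, long slotId) {
        this.logSegmentSequenceNo = lssn;
        this.entryId = entryId;
        this.slotId = slotId;
    }

    @Override
    public int compareTo(DLSN other) {
        int c = Long.compare(logSegmentSequenceNo, other.logSegmentSequenceNo);
        if (c != 0) return c;
        c = Long.compare(entryId, other.entryId);
        if (c != 0) return c;
        return Long.compare(slotId, other.slotId);
    }
}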

Layered Architecture: application layer - write clients (with an ownership tracker) and read clients (with a routing service); stateless serving layer - write proxies and read proxies; core - writer and reader; persistent storage - metadata store (ZooKeeper), BookKeeper bookies, and cold storage (HDFS)

Messaging Flow: 1. The write client writes records to the write proxy. 2. The proxy places them in a transmit buffer. 3. Flush - the proxy writes a batched entry to the bookies. 4. The bookies acknowledge. 5. Commit - the proxy writes a control record. 6. The read proxy long-poll reads from the bookies. 7. Speculative reads across bookies. 8. The read proxy caches records. 9. Read clients long-poll read from the read proxy.
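
A rough Java sketch of the buffer/flush/commit ordering on the write-proxy side, using hypothetical interfaces (BookieEnsemble, ReadNotifier) rather than the actual DistributedLog classes:

// Hypothetical sketch of the batch/flush/commit ordering in the write proxy.
// All type and method names here are illustrative.
class WriteProxySketch {
    interface BookieEnsemble { long addEntry(byte[] entry); } // returns once the ack quorum persisted the entry
    interface ReadNotifier { void notifyNewData(long entryId); } // wakes up long-poll readers

    private final java.util.List<byte[]> buffer = new java.util.ArrayList<>(); // 2: transmit buffer
    private final BookieEnsemble bookies;
    private final ReadNotifier readers;
    private final int batchSize;

    WriteProxySketch(BookieEnsemble bookies, ReadNotifier readers, int batchSize) {
        this.bookies = bookies;
        this.readers = readers;
        this.batchSize = batchSize;
    }

    // 1: the write client sends a record to the proxy
    synchronized void write(byte[] record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    private void flush() {
        byte[] entry = serialize(buffer);          // 3: batch buffered records into one entry
        buffer.clear();
        long entryId = bookies.addEntry(entry);    // 3-4: write to bookies, acknowledged on quorum
        bookies.addEntry(controlRecord(entryId));  // 5: commit - write a control record
        readers.notifyNewData(entryId);            // 6: long-poll readers see the new data
    }

    private byte[] serialize(java.util.List<byte[]> records) { return new byte[0]; } // placeholder
    private byte[] controlRecord(long entryId) { return new byte[0]; }               // placeholder
}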

Design Details: Consistency, Global Replicated Log

Consistency: LastAddConfirmed => consistent views among readers; Fencing => consistent views among writers

Consistency - LastAddPushed: the writer adds entries (0, 1, 2, ...); LastAddPushed is the id of the latest entry the writer has pushed to the bookies.

Consistency - LastAddConfirmed: LastAddConfirmed is the id of the latest entry whose add has been acknowledged; it trails LastAddPushed, and readers only read entries up to LastAddConfirmed. When ownership changes, the new writer fences the log before taking over.
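
A minimal sketch of how LastAddConfirmed trails LastAddPushed, assuming a simple in-memory tracker (not the BookKeeper client code): LAC only advances over the contiguous prefix of acknowledged entries, so readers never see unacknowledged data.

// Illustrative tracker: the writer pushes entries and advances LAC only when a
// contiguous prefix of entries has been acknowledged by the bookies.
class LacTracker {
    private long lastAddPushed = -1;     // highest entry id sent to bookies
    private long lastAddConfirmed = -1;  // highest entry id safe for readers
    private final java.util.TreeSet<Long> acked = new java.util.TreeSet<>();

    synchronized long push() {
        return ++lastAddPushed;          // writer assigns the next entry id
    }

    synchronized void onAck(long entryId) {
        acked.add(entryId);
        // advance LAC across the contiguous prefix of acknowledged entries
        while (acked.contains(lastAddConfirmed + 1)) {
            acked.remove(lastAddConfirmed + 1);
            lastAddConfirmed++;
        }
    }

    synchronized long readableUpTo() {
        return lastAddConfirmed;         // readers never read past LAC
    }
}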

Consistency - Fencing: 0. Ownership changes to a new writer. 1. The new writer gets the list of log segments (completed Log Segment X, inprogress Log Segment X+1). 2.1 It fences the inprogress log segment on the bookies. 2.2 Writes from the old writer are rejected. 2.3 It completes the inprogress log segment. 3. It starts a new inprogress log segment (Log Segment X+2).
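
A sketch of that recovery sequence as a new writer might run it, with hypothetical MetadataStore and Bookies interfaces standing in for the ZooKeeper metadata and the bookie protocol:

// Illustrative recovery run by the new writer after taking ownership.
class FencingSketch {
    interface MetadataStore {
        java.util.List<String> getLogSegments(String stream);                  // 1: list segments
        void completeSegment(String stream, String segment, long lastEntryId); // 2.3: seal it
        String startNewSegment(String stream);                                 // 3: new inprogress
    }
    interface Bookies {
        // 2.1: fence the segment so further writes from the old writer are rejected (2.2);
        // returns the last entry id that was actually persisted.
        long fence(String segment);
    }

    void recover(String stream, MetadataStore metadata, Bookies bookies) {
        for (String segment : metadata.getLogSegments(stream)) {        // 1
            if (segment.endsWith(":inprogress")) {                      // only the open segment
                long lastEntryId = bookies.fence(segment);              // 2.1 / 2.2
                metadata.completeSegment(stream, segment, lastEntryId); // 2.3
            }
        }
        String newSegment = metadata.startNewSegment(stream);           // 3
        System.out.println("writing to " + newSegment);
    }
}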

Consistency - Ownership Tracking (leader election): ZooKeeper ephemeral znodes (leases); aggressive failure detection (within a second): tickTime = 500 ms, session timeout = 1000 ms
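
A minimal sketch of lease-style ownership using the standard ZooKeeper Java client: the owner holds an ephemeral znode, and the 1000 ms session timeout bounds how long a failed owner can hold the lease (tickTime = 500 ms is a server-side zoo.cfg setting). The znode path and connect string are illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Whoever creates the ephemeral znode owns the stream; the znode (and the lease)
// disappears when the owner's session expires.
public class OwnershipSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 1000 /* session timeout, ms */,
                (WatchedEvent event) -> { /* react to connection/session events */ });
        try {
            // A production client would wait for the connection to be established first.
            zk.create("/ownership/stream-1", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("acquired ownership of stream-1");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("another proxy already owns stream-1");
        } finally {
            zk.close();
        }
    }
}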

Global Replicated Log: Region Aware Data Placement, Cross Region Speculative Reads

Global Replicated Log: write proxies in each region (Region 1, Region 2, Region 3), each region with its own ZooKeeper; an ownership tracker for the writer; a region-aware placement policy spreading data across bookies in all regions; readers use speculative reads across regions.

Region Aware Data Placement Policy: hierarchical data placement - data is spread uniformly across the available regions; each region uses a rack-aware placement policy; writes are acknowledged only when the data is persisted in a majority of regions.
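
A sketch of the hierarchical idea under assumed types: pick bookies region by region, and only acknowledge once a majority of regions have persisted the entry.

// Illustrative hierarchical placement: one bookie per region (rack-aware choice
// within each region elided), with the ack condition expressed separately.
class RegionAwarePlacementSketch {
    record Bookie(String id, String region, String rack) {}

    // Choose one bookie per region, preferring rack diversity inside the region.
    static java.util.List<Bookie> chooseEnsemble(java.util.Map<String, java.util.List<Bookie>> byRegion) {
        java.util.List<Bookie> ensemble = new java.util.ArrayList<>();
        for (java.util.List<Bookie> regionBookies : byRegion.values()) {
            ensemble.add(regionBookies.get(0)); // stand-in for the rack-aware choice within the region
        }
        return ensemble;
    }

    // A write is acknowledged only when a majority of regions have persisted it.
    static boolean canAcknowledge(java.util.Set<String> regionsPersisted, int totalRegions) {
        return regionsPersisted.size() > totalRegions / 2;
    }
}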

Cross Region Speculative Reads: the reader consults the data placement policy for the read order. First: the bookie node closest to the client. Second: the closest node in a different failure domain - a different rack. Third: a bookie node in the closest other region...
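
A sketch of that ordering with an illustrative distance function: replicas are sorted by proximity, the closest is tried first, and the next candidate is only contacted speculatively after a timeout.

// Illustrative ordering of read candidates: same rack, then different rack in
// the same region, then a different region.
class SpeculativeReadOrderSketch {
    record Bookie(String id, String region, String rack) {}

    static java.util.List<Bookie> readOrder(java.util.List<Bookie> replicas,
                                            String clientRegion, String clientRack) {
        java.util.List<Bookie> ordered = new java.util.ArrayList<>(replicas);
        ordered.sort(java.util.Comparator.comparingInt(
                (Bookie b) -> distance(b, clientRegion, clientRack)));
        return ordered; // try ordered.get(0) first; speculate on the next one after a timeout
    }

    private static int distance(Bookie b, String clientRegion, String clientRack) {
        if (!b.region().equals(clientRegion)) return 2; // different region: last resort
        if (!b.rack().equals(clientRack)) return 1;     // same region, different rack
        return 0;                                       // closest: same region and rack
    }
}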

Performance: Latency vs Throughput, Scalability, Efficiency

Supports various workloads with latency/throughput tradeoffs: latency and throughput under different flush policies.

Scaling with multiple streams (single node vs multiple nodes): under 100k rps, latency increased as the number of streams increased on a single hybrid proxy. With each stream writing 100 rps, throughput increased linearly with the number of streams.

Scaling with a large number of fan-out readers: an analytics application writes 2.45 GB per second, and the data is fanned out 40x to the readers.

DistributedLog @ Twitter: Use Cases, Deployment, Scale

Applications at Twitter: Manhattan key-value store; durable deferred RPC; real-time search indexing; self-served pub-sub system / stream computing; reliable cross-datacenter replication...

Scale at Twitter: one global cluster and a few local clusters in each DC; O(10^3) bookie nodes; O(10^3) global log streams and O(10^4) local log streams; O(10^6) live log segments; data is kept from hours to days, and even up to a year; pub-sub delivers O(1) trillion records per day, roughly O(10) PB per day.

Lessons we learned: make the foundation durable and consistent; don't trust the filesystem; think about workloads and I/O isolation; keep persistent state as simple as possible...

DistributedLog is the new messaging foundation. Layered architecture: stateless serving is separated from stateful storage, so CPU/memory/network (shared Mesos) scale independently of storage (hybrid Mesos). Messaging design: writes/reads isolation - writes (fan-in) scale independently of reads (fan-out). Global replicated log.

Future: it is on GitHub!! https://github.com/twitter/distributedlog (Apache incubating). Twitter Messaging Team: @l4stewar @sijieg. https://about.twitter.com/careers