HBase @ Facebook: The Technology Behind Messages (and more). Kannan Muthukkaruppan, Software Engineer, Facebook. March 11, 2011

Talk Outline
- The new Facebook Messages, and how we got started with HBase
- Quick overview of HBase
- Why we picked HBase
- Our work with and contributions to HBase
- A few other/emerging use cases within Facebook
- Future plans
- Q&A

The New Facebook Messages: Messages + Chats + Emails + SMS

Storage

Monthly data volume prior to launch:
- 15B x 1,024 bytes = 14TB
- 120B x 100 bytes = 11TB
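
As a quick sanity check of those figures (assuming the TB values are binary terabytes, i.e. TiB):

\[
15 \times 10^{9} \times 1{,}024\,\text{bytes} \approx 15.4 \times 10^{12}\,\text{bytes} \approx 14\,\text{TiB}
\]
\[
120 \times 10^{9} \times 100\,\text{bytes} = 12 \times 10^{12}\,\text{bytes} \approx 11\,\text{TiB}
\]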

Messaging Data
- Small/medium sized data and indices in HBase: message metadata & indices, search index, small message bodies
- Attachments and large messages in Haystack (our photo store)

Our architecture (diagram): Clients (Front End, MTA, etc.) ask the User Directory Service "What's the cell for this user?"; each cell (Cell 1, Cell 2, Cell 3) contains Application Servers backed by an HBase/HDFS/ZooKeeper cluster holding messages, metadata, and the search index; attachments are stored in Haystack.

About HBase

HBase in a nutshell
- Distributed, large-scale data store
- Efficient at random reads/writes
- Open source project modeled after Google's BigTable

When to use HBase?
- Storing large amounts of data (100s of TBs)
- Need high write throughput
- Need efficient random access (key lookups) within large data sets
- Need to scale gracefully with data
- For structured and semi-structured data
- Don't need full RDBMS capabilities (cross-row/cross-table transactions, joins, etc.)

HBase Data Model
- An HBase table is:
  - a sparse, three-dimensional array of cells, indexed by RowKey, ColumnKey, Timestamp/Version
  - sharded into regions along an ordered RowKey space
- Within each region:
  - data is grouped into column families
  - sort order within each column family: Row Key (asc), Column Key (asc), Timestamp (desc)
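
To make the (row key, column key, timestamp) -> value model concrete, here is a minimal sketch using the HBase Java client; the table and column-family names are made up for illustration, and the API shown is the modern Put/Table style rather than the 2011-era client:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("messages"))) {   // hypothetical table
            // One cell = (row key, column family:qualifier, timestamp) -> value.
            Put put = new Put(Bytes.toBytes("user123"));                     // row key
            put.addColumn(Bytes.toBytes("meta"),                             // column family
                          Bytes.toBytes("subject"),                          // column qualifier
                          1300000000000L,                                    // explicit timestamp/version
                          Bytes.toBytes("Hello!"));                          // cell value
            table.put(put);
        }
    }
}
```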

Example: Inbox Search Schema
- Key: RowKey: userid, Column: word, Version: MessageID
- Value: auxiliary info (like offset of word in message)
- Data is stored sorted by <userid, word, messageid>:
  User1:hi:17 -> offset1
  User1:hi:16 -> offset2
  User1:hello:16 -> offset3
  User1:hello:2 -> offset4
  ...
  User2:...
  User2:...
- Can efficiently handle queries like:
  - Get top N messageids for a specific user & word
  - Typeahead query: for a given user, get words that match a prefix
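
A rough sketch of how those two queries could be expressed with the HBase Java client; the table/family names are hypothetical and this is not the actual Facebook search code, just the general pattern of "versions carry messageIds" and "column-prefix scan for typeahead":

```java
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class InboxSearchQueries {
    static final byte[] CF = Bytes.toBytes("index");   // hypothetical column family

    // Top-N messageIds for a (user, word): the versions of one cell, newest first.
    static Result topN(Table table, String userId, String word, int n) throws Exception {
        Get get = new Get(Bytes.toBytes(userId));
        get.addColumn(CF, Bytes.toBytes(word));
        get.setMaxVersions(n);                          // newest N versions = newest N messageIds
        return table.get(get);
    }

    // Typeahead: all columns (words) in the user's row that match a prefix.
    static void typeahead(Table table, String userId, String prefix) throws Exception {
        Get get = new Get(Bytes.toBytes(userId));
        get.addFamily(CF);
        get.setFilter(new ColumnPrefixFilter(Bytes.toBytes(prefix)));
        Result result = table.get(get);
        if (result.listCells() == null) return;          // no matching words
        for (Cell c : result.listCells()) {
            System.out.println(Bytes.toString(CellUtil.cloneQualifier(c)));
        }
    }
}
```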

HBase System Overview (diagram):
- HBase (database layer): Master, Backup Master, and many Region Servers
- HDFS (storage layer): Namenode, Secondary Namenode, and many Datanodes
- ZooKeeper quorum (coordination service): multiple ZK peers

HBase Overview (diagram): a Region Server hosts many regions (Region #1, Region #2, ...); each region holds its column families (ColumnFamily #1, ColumnFamily #2, ...); each column family has a Memstore (in-memory data structure) that is flushed to HFiles in HDFS; all writes also go to a Write Ahead Log in HDFS.

HBase Overview
- Very good at random reads/writes
- Write path: sequential write/sync to the commit log, then update the memstore
- Read path: look up the memstore & persistent HFiles; HFile data is sorted and has a block index for efficient retrieval
- Background chores: flushes (memstore -> HFile), compactions (group of HFiles merged into one)
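
The write/read path can be illustrated with a deliberately simplified model. This is not HBase's actual implementation, only a sketch of the log-structured idea the slide describes: append to a WAL, update a sorted in-memory memstore, flush to immutable sorted files, and read by consulting the memstore plus the flushed files:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy log-structured store illustrating the HBase write/read path.
public class ToyLsmStore {
    private final List<String> writeAheadLog = new ArrayList<>();            // stands in for the WAL in HDFS
    private TreeMap<String, String> memstore = new TreeMap<>();              // sorted in-memory data structure
    private final List<TreeMap<String, String>> hfiles = new ArrayList<>();  // immutable, sorted "HFiles"

    public void put(String key, String value) {
        writeAheadLog.add(key + "=" + value);   // 1. sequential append/sync to the commit log
        memstore.put(key, value);               // 2. update the memstore
    }

    public String get(String key) {
        String v = memstore.get(key);           // newest data first: the memstore
        if (v != null) return v;
        for (int i = hfiles.size() - 1; i >= 0; i--) {   // then persistent files, newest to oldest
            v = hfiles.get(i).get(key);
            if (v != null) return v;
        }
        return null;
    }

    public void flush() {                        // background chore: memstore -> HFile
        hfiles.add(memstore);
        memstore = new TreeMap<>();
    }

    public void compact() {                      // background chore: merge HFiles into one
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> f : hfiles) merged.putAll(f);  // newer files override older entries
        hfiles.clear();
        hfiles.add(merged);
    }
}
```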

Why HBase? Performance is great, but what else?

Horizontal scalability
- HBase & HDFS are elastic by design
- Multiple table shards (regions) per physical server
- On node additions:
  - the load balancer automatically reassigns shards from overloaded nodes to new nodes
  - because the filesystem underneath is itself distributed, data for reassigned regions is instantly servable from the new nodes
- Regions can be dynamically split into smaller regions
  - pre-sharding is not necessary
  - splits are near instantaneous!

Automatic Failover
- Node failures are automatically detected by the HBase Master
- Regions on a failed node are distributed evenly among surviving nodes
  - the multiple-regions-per-server model avoids the need for substantial over-provisioning
- HBase Master failover: 1 active, rest standby; when the active master fails, a standby automatically takes over

HBase uses HDFS
- We get the benefits of HDFS as a storage system for free:
  - fault tolerance (block-level replication for redundancy)
  - scalability
  - end-to-end checksums to detect and recover from corruptions
  - MapReduce for large-scale data processing
- HDFS is already battle tested inside Facebook:
  - running petabyte-scale clusters
  - lots of in-house development and operational experience

Simpler Consistency Model
- HBase's strong consistency model is simpler for a wide variety of applications to deal with: a client gets the same answer no matter which replica the data is read from
- Eventual consistency is tricky for applications fronted by a cache: replicas may heal eventually after failures, but stale data could remain stuck in the cache

Other Goodies
- Block-level compression: saves disk space and network bandwidth
- Block cache
- Read-modify-write operation support, like counter increment
- Bulk import capabilities

HBase Enhancements

Goal of Zero Data Loss/Correctness
- sync support added to the hadoop-20 branch, for keeping the transaction log (WAL) in HDFS, to guarantee:
  - durability of transactions
  - atomicity of transactions involving multiple column families
- Fixed several critical bugs, e.g.:
  - race conditions causing regions to be assigned to multiple servers
  - region name collisions on disk (due to crc32-encoded names)
  - errors during log recovery that could cause transactions to be incorrectly skipped during log replay, or deleted items to be resurrected

Zero data loss (contd.)
- Enhanced HDFS's block placement policy:
  - default policy: rack aware, but minimally constrained; non-local block replicas can be on any other rack, and on any nodes within that rack
  - new: placement of replicas constrained to configurable node groups
- Result: data loss probability reduced by orders of magnitude

Availability/Stability improvements
- HBase Master rewrite: region assignments using ZK
- Rolling restarts: doing software upgrades without downtime
- Interruptible compactions: being able to restart the cluster, make schema changes, and load-balance regions quickly without waiting on compactions
- Timeouts on client-server RPCs
- Staggered major compactions to avoid compaction storms

Performance Improvements
- Compactions are critical for read performance
  - improved compaction algorithm
  - delete/TTL/overwrite processing in minor compactions
- Read optimizations:
  - seek optimizations for rows with a large number of cells
  - Bloom filters to minimize HFile lookups
  - time-range hints on HFiles (great for temporal data)
  - improved handling of compressed HFiles
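
As an illustration of the kind of per-column-family knobs involved here (Bloom filters, compression, block cache), a minimal sketch using the HBase 1.x-style admin descriptors; the table/family names and chosen values are hypothetical, not Facebook's actual settings:

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class ColumnFamilyTuning {
    public static HTableDescriptor messagesTable() {
        HColumnDescriptor cf = new HColumnDescriptor("meta");     // hypothetical column family
        cf.setCompressionType(Compression.Algorithm.LZO);         // block-level compression (needs native LZO libs)
        cf.setBlockCacheEnabled(true);                            // cache data blocks for reads
        cf.setBloomFilterType(BloomType.ROW);                     // skip HFiles that cannot contain the row
        cf.setMaxVersions(Integer.MAX_VALUE);                     // e.g. keep one version per messageId

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("messages"));
        table.addFamily(cf);
        return table;
    }
}
```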

Performance Improvements (contd.)
- Improvements for large objects:
  - a threshold size after which a file is no longer compacted; rely on Bloom filters instead for efficiently looking up the object
  - a safety mechanism to never compact more than a certain number of files in a single pass
- To fix potential out-of-memory errors: minimize the number of data copies on the RPC response

Working within the Apache community
- Growing with the community
  - started with a stable, healthy project
  - in-house expertise in both HDFS and HBase
  - increasing community involvement
- Undertook massive feature improvements with community help:
  - HDFS 0.20-append branch
  - HBase Master rewrite
- Continually interacting with the community to identify and fix issues, e.g., large responses (2GB RPC)

Operational Experiences
- Darklaunch: shadow traffic on test clusters for continuous, at-scale testing
  - experiment/tweak knobs
  - simulate failures, test rolling upgrades
- Constant (pre-sharding) region count & controlled rolling splits
- Administrative tools and monitoring
  - alerts (HBCK, memory alerts, perf alerts, health alerts)
  - auto-detecting/decommissioning misbehaving machines
  - dashboards
- Application-level backup/recovery pipeline

Typical Cluster Layout (diagram)
- Multiple clusters/cells for messaging
- 20 servers per rack; 5 or more racks per cluster
- Controllers (master/ZooKeeper) spread across racks: each rack hosts a ZooKeeper peer plus one controller role (HDFS Namenode, Backup Namenode, Job Tracker, HBase Master, Backup Master)
- The remaining 19 nodes on each rack each run a Region Server, Data Node, and Task Tracker

Data migration: another place where we used HBase heavily

Move messaging data from MySQL to HBase
- In MySQL, inbox data was kept normalized
  - a user's messages are stored across many different machines
  - migrating a user is basically one big join across tables spread over many different machines
- Multiple terabytes of data (for over 500M users)
- Cannot pound 1000s of production UDBs to migrate users

How we migrated
- Periodically, get a full export of all the users' inbox data from MySQL, and use the bulk loader to import it into a migration HBase cluster
- To migrate a user (since users may continue to receive messages during migration, we double-write to the old and new system during the migration period):
  - get a list of all recent messages (since the last MySQL export) for the user
  - load the new messages into the migration HBase cluster
  - perform the join operations to generate the new data
  - export it and upload it into the final cluster
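
The slides don't show the exact tooling, but HBase's standard bulk-import path looks roughly like this (a hedged sketch using the HBase 1.x-style bulk-load API: a MapReduce job first writes sorted HFiles, which are then handed directly to the region servers; the table name and path are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Path hfileDir = new Path("/migration/hfiles");          // hypothetical output of an HFile-writing MR job
        TableName tn = TableName.valueOf("migration_inbox");    // hypothetical migration table

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tn);
             RegionLocator locator = conn.getRegionLocator(tn);
             Admin admin = conn.getAdmin()) {
            // Hand the pre-built, sorted HFiles directly to the region servers;
            // this avoids pushing every row through the normal write path.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}
```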

Facebook Insights: Real-time Analytics using HBase

Facebook Insights Goes Real-Time
- Recently launched real-time analytics for social plugins on top of HBase
- Publishers get real-time distribution/engagement metrics:
  - # of impressions, likes
  - analytics by domain, URL, demographics
  - over various time periods (the last hour, day, all-time)
- Makes use of HBase capabilities like:
  - efficient counters (read-modify-write increment operations)
  - TTL for purging old data
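
The counter and TTL features mentioned above map onto the HBase client API roughly as follows; a sketch with made-up table/family/qualifier names, not the actual Insights schema:

```java
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RealtimeCounters {
    // Atomically bump an impression counter: a server-side read-modify-write,
    // so concurrent increments from many front-ends don't lose updates.
    static long recordImpression(Table countersTable, String url) throws Exception {
        return countersTable.incrementColumnValue(
                Bytes.toBytes(url),               // row key, e.g. the plugin URL
                Bytes.toBytes("hourly"),          // column family
                Bytes.toBytes("impressions"),     // counter qualifier
                1L);                              // delta
    }

    // TTL is a per-column-family setting: cells older than the TTL are purged
    // automatically during compactions, which handles expiring old time buckets.
    static HColumnDescriptor hourlyFamily() {
        HColumnDescriptor cf = new HColumnDescriptor("hourly");
        cf.setTimeToLive(60 * 60 * 24 * 30);      // keep 30 days of hourly buckets (illustrative)
        return cf;
    }
}
```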

Future Work
It is still early days!
- Namenode HA (AvatarNode)
- Fast hot-backups (Export/Import)
- Online schema & config changes
- Running HBase as a service (multi-tenancy)
- Features (like secondary indices, batching hybrid mutations)
- Cross-DC replication
- Lots more performance/availability improvements

Thanks! Questions? facebook.com/engineering