Distributed Transactions in Hadoop's HBase and Google's Bigtable

Distributed Transactions in Hadoop's HBase and Google's Bigtable. Hans De Sterck, Department of Applied Mathematics, University of Waterloo; Chen Zhang, David R. Cheriton School of Computer Science, University of Waterloo

this talk: how to deal with very large amounts of data in a distributed environment ("cloud"); in particular, data consistency for very large sets of data items that get modified concurrently

example: Google search and ranking continuously crawl all webpages and store them all on 10,000s of commodity machines (CPU+disk), petabytes of data; every 3 days (*the old way...), build the search index and ranking, which involves inverting outlinks to inlinks, the PageRank calculation, etc., via the MapReduce distributed processing system, on the same set of machines; a very large database-like storage system is used (Bigtable); build a $190 billion internet empire on fast and useful search results

outline 1. cloud computing 2. Google/Hadoop cloud frameworks 3. motivation: large-scale scientific data processing 4. Bigtable and HBase 5. transactions 6. our transactional system for HBase 7. protocol 8. advanced features 9. performance 10. comparison with Google's Percolator 11. conclusions

1. cloud computing cloud = a set of (networked) resources for distributed computing and storage, characterized by 1. a homogeneous environment (often through virtualization) 2. dedicated (root) access (no batch queues, not shared, no reservations) 3. scalable on demand (or large enough). cloud = grid, simplified so that it becomes more easily doable (e.g., it is hard to combine two (different) clouds!)

cloud computing a cloud is 1. homogeneous 2. dedicated access 3. scalable on demand. we are interested in the cloud for large-scale, serious computing (not: email and word processing, ...) (but: the cloud is also useful for business computing, e-commerce, storage, ...). there are private (Google internal) and public (Amazon) clouds (it is hard to do hybrid clouds!)

2. Google/Hadoop cloud frameworks a cloud (homogeneous, dedicated access, scalable on demand) used for large-scale computing/data processing needs a cloud framework. Google: the first, and most successful, (private) cloud framework for large-scale computing to date: Google File System, Bigtable, MapReduce

Google cloud framework the first, and most successful, (private) cloud framework for large-scale computing to date. Google File System: fault-tolerant, scalable distributed file system; Bigtable: fault-tolerant, scalable sparse semi-structured data store; MapReduce: fault-tolerant, scalable parallel processing system. used for search/ranking, maps, analytics, ...

Hadoop clones Google's system Hadoop is an open-source clone of Google's cloud computing framework for large-scale data processing: Hadoop Distributed File System (HDFS) (Google File System), HBase (Bigtable), MapReduce. Hadoop is used by Yahoo, Facebook, Amazon, ... (and developed/controlled by them)

Hadoop usage (from wiki.apache.org/hadoop/poweredby)

Hadoop usage

Hadoop usage

3. motivation: large-scale scientific data processing my area of research is scientific computing (scientific simulation methods and applications), i.e. large-scale computing / data processing; we have used the grid for distributed task farming of bioinformatics problems (BLSC 2007)

motivation: large-scale scientific data processing use Hadoop/cloud for large-scale processing of biomedical images (HPCS 2009): process 260 GB of image data per day

motivation: large-scale scientific data processing workflow system using HBase (CloudCom 2009); problem: HBase does not have multi-row transactions; batch system using HBase (CloudCom 2010)

motivation: large-scale scientific data processing cloud computing (processing) will take off for biomedical data (images, experiments), bioinformatics, particle physics, astronomy, etc.

4. Bigtable and HBase relational DBMSs (SQL) are not (currently) suitable for very large distributed data sets: not parallel, not scalable. relational sophistication is not necessary for many applications, and these applications require efficiency in other respects (scalability, fault tolerance, throughput versus response time, ...)

(OSDI 2006)... Google invents Bigtable

Bigtable sparse tables [diagram: a table with rows row1-row3 and columns col1-col6, with only a few cells populated]; multiple data versions, timestamps; scalable, fault-tolerant (tens of petabytes of data, tens of thousands of machines); (no SQL)

HBase HBase is a clone of Google's Bigtable [diagram: the same sparse table with rows row1-row3 and columns col1-col6]: sparse tables, rows sorted by row keys; multiple data versions, timestamps; random access performance on par with open-source relational databases such as MySQL; a single global database-like table view (cloud scale)
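
Since the protocol presented later leans on HBase's multiple data versions, here is a minimal background sketch of that feature, assuming the classic (0.90-era) HBase Java client API and illustrative names ("usertable", family "f", column "colx") that are not from the talk:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");   // illustrative table name
        byte[] row = Bytes.toBytes("row1");
        byte[] fam = Bytes.toBytes("f");
        byte[] col = Bytes.toBytes("colx");

        // two versions of the same cell, distinguished by client-chosen timestamps
        Put v1 = new Put(row);
        v1.add(fam, col, 1L, Bytes.toBytes("m"));
        table.put(v1);
        Put v2 = new Put(row);
        v2.add(fam, col, 2L, Bytes.toBytes("m-updated"));
        table.put(v2);

        // read back a specific version by timestamp; the transactional protocol
        // below exploits exactly this by using write labels as HBase timestamps
        Get get = new Get(row);
        get.setTimeStamp(1L);
        Result r = table.get(get);
        System.out.println(Bytes.toString(r.getValue(fam, col)));   // prints "m"
        table.close();
    }
}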

5. transactions transaction = a grouping of read and write operations. T_i has a start time label s_i and a commit time label c_i (all read and write operations happen after the start time and before the commit time)

transactions we need globally well-ordered time labels s_i and c_i (for our distributed system). T_1 and T_2 are concurrent if [s_1, c_1] and [s_2, c_2] overlap. transaction T_i: either all write operations commit, or none (atomic)

snapshot isolation define: (strong) snapshot isolation 1. T_i reads from the last transaction committed before s_i (that is T_i's snapshot) 2. concurrent transactions have disjoint write sets (concurrent transactions are allowed for efficiency, but T_1 and T_2 cannot take the same $100 from a bank account) [timeline diagram: T_1, T_2, T_3 overlapping in time]

snapshot isolation 1. T_i reads from the last transaction committed before s_i 2. concurrent transactions have disjoint write sets [timeline diagram: T_1, T_2, T_3] T_2 does not see T_1's writes; T_3 sees T_1's and T_2's writes; if T_1 and T_2 have overlapping write sets, at least one aborts
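
A tiny illustration of the second rule (plain Java, made-up row names): two concurrent transactions conflict exactly when their write sets intersect, so at least one of them must abort:

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class WriteSetRule {
    public static void main(String[] args) {
        Set<String> writesT1 = new HashSet<String>(Arrays.asList("account:42", "account:7"));
        Set<String> writesT2 = new HashSet<String>(Arrays.asList("account:42"));   // the same $100
        Set<String> writesT3 = new HashSet<String>(Arrays.asList("account:99"));

        // T1 and T2 overlap in time and both write account:42 -> one must abort
        System.out.println("T1/T2 conflict: " + !Collections.disjoint(writesT1, writesT2));
        // T1 and T3 write different rows -> both may commit
        System.out.println("T1/T3 conflict: " + !Collections.disjoint(writesT1, writesT3));
    }
}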

snapshot isolation desirables for implementation: first committer wins; reads are not blocked. implemented in mainstream DBMSs (Oracle, ...), but scalable distributed transactions do not exist at the scale of clouds [timeline diagram: T_1, T_2, T_3]

6. our transactional system for HBase Grid 2010 conference, Brussels, October 2010

our transactional system for HBase design principles: a central mechanism for dispensing globally well-ordered time labels; use HBase's multiple data versions; clients decide whether they can commit (no centralized commit engine); two phases in the commit; store transactional meta-information in HBase tables; use HBase as it is (bare-bones)

7. protocol transaction T_i has: a start label s_i (ordered, not unique), a commit label c_i (ordered with the s_i, unique), a write label w_i (unique), a precommit label p_i (ordered, unique)

protocol summary [diagram: the user table holds data for row1/row2 in columns colx/coly at locations L_a, L_b; the committed table maps commit labels c_1, c_2 to write labels w_1, w_2 and the written locations L_a, L_b, L_c; the counter table holds the w-counter, p-counter, and c-counter; the precommit queue table (queue up for conflict checking) maps write labels to precommit labels p_1, p_2 and write sets; the commit queue table (queue up for committing) maps write labels to commit labels c_1, c_2]

protocol [diagram: the user data table (user, many) stores values m, n, q for row1/row2 in columns colx/coly, at locations L_a and L_b; the committed table (system, one) records for each commit label c_1, c_2 the write label w_1, w_2 and the locations L_a, L_b, L_c that were written]

transaction T_i: at start, reads the committed table, s_i = c_last + 1 (no wait); obtains w_i from the central w counter; reads L_a by scanning the L_a column in the committed table, and reads the version written by the last c_j with c_j < s_i; writes preliminarily to the user data table with HBase write timestamp w_i; after the reads/writes, queues up for conflict checking (gets p_i from the central p counter); after conflict checking, queues up for committing (gets c_i from the central c counter); commits by writing c_i and its write set into the committed table
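
A minimal sketch of the first few of these steps against the classic HBase Java client, assuming illustrative table names ("counters", "committed", "usertable"), a family "f", and committed-table row keys stored as 8-byte longs; none of these names are taken from the paper:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TxStartSketch {
    static final byte[] FAM = Bytes.toBytes("f");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable counters  = new HTable(conf, "counters");
        HTable committed = new HTable(conf, "committed");
        HTable userData  = new HTable(conf, "usertable");

        // 1. start: read the committed table and set s_i = c_last + 1 (no waiting)
        long sLabel = lastCommitLabel(committed) + 1;

        // 2. obtain a unique write label w_i from the central w-counter (atomic)
        long wLabel = counters.incrementColumnValue(
                Bytes.toBytes("row"), FAM, Bytes.toBytes("w-counter"), 1L);

        // 3. snapshot read of a location: in the real protocol the version to read
        //    (the write label of the last c_j < s_i) comes from scanning that
        //    location's column in the committed table; a placeholder is used here
        long wSnapshot = 0L;
        Get get = new Get(Bytes.toBytes("row1"));
        get.setTimeStamp(wSnapshot);
        Result snapshot = userData.get(get);

        // 4. write preliminarily to the user table, using w_i as the HBase timestamp
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(FAM, Bytes.toBytes("colx"), wLabel, Bytes.toBytes("new value"));
        userData.put(put);

        System.out.println("s_i=" + sLabel + " w_i=" + wLabel
                + " snapshot empty=" + snapshot.isEmpty() + " (placeholder read)");
    }

    // Illustrative helper: find the highest commit label currently in the
    // committed table, assuming its row keys are commit labels stored as longs.
    static long lastCommitLabel(HTable committed) throws Exception {
        long last = 0;
        ResultScanner scanner = committed.getScanner(new Scan());
        try {
            for (Result r : scanner) {
                last = Math.max(last, Bytes.toLong(r.getRow()));
            }
        } finally {
            scanner.close();
        }
        return last;
    }
}

The remaining steps (queueing for conflict checking, the conflict check itself, queueing for committing, and the final commit) are sketched after the corresponding slides below.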

central counters dispense unique, well-ordered w_i, p_i, c_i labels. use HBase's built-in atomic IncrementColumnValue method on a fixed location in an additional system table (a separate counter for the w_i, p_i and c_i), taking advantage of the single global database-like table view [diagram: c counter table (system, one), a single row holding the c-counter value]
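
A minimal sketch of the central counters, assuming a single-row system table named "counters" with family "f" and illustrative qualifier names; HTable.incrementColumnValue is the built-in atomic increment the slide refers to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable counters = new HTable(conf, "counters");
        byte[] row = Bytes.toBytes("row");   // one fixed row holds all three counters
        byte[] fam = Bytes.toBytes("f");

        // each call is atomic on the region server, so every client obtains a
        // unique, monotonically increasing label
        long w = counters.incrementColumnValue(row, fam, Bytes.toBytes("w-counter"), 1L);
        long p = counters.incrementColumnValue(row, fam, Bytes.toBytes("p-counter"), 1L);
        long c = counters.incrementColumnValue(row, fam, Bytes.toBytes("c-counter"), 1L);
        System.out.printf("w=%d p=%d c=%d%n", w, p, c);
        counters.close();
    }
}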

queue up for conflict checking [diagram: precommit queue table (system, one) with columns write label, precommit label, and write set; entries for w_1 (with p_1 and locations L_a, L_b) and w_2]. how to make sure that the T_i get processed in the order of the p_i they receive ("first committer wins"): use a distributed queue mechanism. T_i puts w_i in the table, gets p_i from the p counter, reads {w_j} from the table, then puts p_i and its write set in the table, waits until all entries in {w_j} get a p_j or disappear, waits for all p_j < p_i (with write conflicts) to disappear, and then goes on to conflict checking
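
A minimal sketch of this queue step under the same illustrative schema as before (a precommit-queue table keyed by write label, with columns "w", "p", and "writeset" in family "f"); for brevity it waits on every earlier p_j rather than only on those with write conflicts:

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrecommitQueueSketch {
    static final byte[] FAM = Bytes.toBytes("f");

    // Blocks until it is our turn to do conflict checking, and returns p_i.
    static long queueForConflictChecking(HTable queue, HTable counters,
                                         long wLabel, byte[] writeSet) throws Exception {
        // 1. announce ourselves in the queue under our write label
        Put announce = new Put(Bytes.toBytes(wLabel));
        announce.add(FAM, Bytes.toBytes("w"), Bytes.toBytes(wLabel));
        queue.put(announce);

        // 2. take a precommit label p_i from the central p-counter
        long pLabel = counters.incrementColumnValue(
                Bytes.toBytes("row"), FAM, Bytes.toBytes("p-counter"), 1L);

        // 3. remember the write labels {w_j} already queued when we got p_i
        Set<Long> seen = queuedWriteLabels(queue);

        // 4. record p_i and our write set in the queue
        Put record = new Put(Bytes.toBytes(wLabel));
        record.add(FAM, Bytes.toBytes("p"), Bytes.toBytes(pLabel));
        record.add(FAM, Bytes.toBytes("writeset"), writeSet);
        queue.put(record);

        // 5. wait until every entry in {w_j} has a p_j or has disappeared, and
        //    until every entry with p_j < p_i has left the queue
        while (!safeToProceed(queue, seen, pLabel)) {
            Thread.sleep(50);
        }
        return pLabel;
    }

    static Set<Long> queuedWriteLabels(HTable queue) throws Exception {
        Set<Long> labels = new HashSet<Long>();
        ResultScanner scanner = queue.getScanner(new Scan());
        try {
            for (Result r : scanner) {
                labels.add(Bytes.toLong(r.getRow()));
            }
        } finally {
            scanner.close();
        }
        return labels;
    }

    static boolean safeToProceed(HTable queue, Set<Long> seen, long pLabel) throws Exception {
        ResultScanner scanner = queue.getScanner(new Scan());
        try {
            for (Result r : scanner) {
                byte[] p = r.getValue(FAM, Bytes.toBytes("p"));
                long w = Bytes.toLong(r.getRow());
                if (seen.contains(w) && p == null) return false;           // not labeled yet
                if (p != null && Bytes.toLong(p) < pLabel) return false;   // still ahead of us
            }
        } finally {
            scanner.close();
        }
        return true;
    }
}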

conflict checking [diagram: committed table with commit label c_1 (write label w_1, locations L_a, L_b) and c_2 (write label w_2, location L_c)]. T_i checks for conflicts in the committed table: check for write conflicts with all transactions that have s_i <= c_j; go on to commit if there are no conflicts, otherwise abort (and remove w_i from the queue)
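
A minimal sketch of that check, assuming the committed table is keyed by the commit label (as an 8-byte long) and stores one column per written location in family "f"; the names are illustrative:

import java.util.Set;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ConflictCheckSketch {
    static final byte[] FAM = Bytes.toBytes("f");

    // Returns true if T_i may commit, false if it must abort.
    static boolean noWriteConflicts(HTable committed, long sLabel,
                                    Set<String> myWriteSet) throws Exception {
        Scan scan = new Scan(Bytes.toBytes(sLabel));   // only commit labels c_j >= s_i
        ResultScanner scanner = committed.getScanner(scan);
        try {
            for (Result r : scanner) {
                for (String location : myWriteSet) {
                    if (r.getValue(FAM, Bytes.toBytes(location)) != null) {
                        return false;   // a transaction committed a write to this location after s_i
                    }
                }
            }
        } finally {
            scanner.close();
        }
        return true;
    }
}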

queue up for committing issue: make sure that committing transactions end up in the committed table in the order in which they get their c_i labels (because T_j gets its s_j from the committed table, and a gap in the c_i in the committed table that gets filled up later may lead to inconsistent snapshots) [diagram: a committed table that already contains c_2 but not yet c_1, next to the correctly filled committed table containing both c_1 and c_2]

queue up for committing issue: make sure that committing transactions end up in the committed table in the order in which they get their c_i label: use a distributed queue mechanism! [diagram: commit queue table (system, one) with columns write label and commit label; entries for w_1 (with c_1) and w_2]. T_i puts w_i in the table, gets c_i from the counter, reads {w_j} from the table, then puts c_i in the table, waits until all entries in {w_j} get a c_j or disappear, and then goes on to commit: write c_i and the write set into the committed table, and remove the w_i records from the two queues
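
The commit queue works like the precommit queue sketched above; a minimal sketch of the final commit itself, under the same illustrative schema (tables "committed", "precommit_queue", "commit_queue", family "f"):

import java.util.Set;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FinalCommitSketch {
    static final byte[] FAM = Bytes.toBytes("f");

    static void commit(HTable committed, HTable precommitQueue, HTable commitQueue,
                       long cLabel, long wLabel, Set<String> writeSet) throws Exception {
        // one row per commit label; one column per location in the write set
        Put record = new Put(Bytes.toBytes(cLabel));
        record.add(FAM, Bytes.toBytes("w"), Bytes.toBytes(wLabel));
        for (String location : writeSet) {
            record.add(FAM, Bytes.toBytes(location), Bytes.toBytes(wLabel));
        }
        committed.put(record);

        // leaving the two queues signals waiting transactions that they may proceed
        precommitQueue.delete(new Delete(Bytes.toBytes(wLabel)));
        commitQueue.delete(new Delete(Bytes.toBytes(wLabel)));
    }
}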

protocol summary [diagram as before: user table, committed table, counter table (w-counter, p-counter, c-counter), precommit queue table (queue up for conflict checking), commit queue table (queue up for committing)] (strong) global snapshot isolation!

8. advanced features version table (system, one): row L_a contains the commit label of the transaction that last updated location L_a, lazily updated by reading transactions [diagram: version table mapping L_a to c_1 and L_b to c_2, next to the committed table (which can be long)]. a transaction T_i that wants to read L_a first checks the version table: if c_1 < s_i, scan [c_1+1, s_i-1] in the committed table; if s_i <= c_1, scan [-inf, s_i-1]
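
A minimal sketch of how the version-table hint narrows the committed-table scan, assuming a "version" table keyed by location with a column "c" in family "f" holding the last known commit label; the table and column names are assumptions, and only the choice of scan range is shown:

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionTableSketch {
    static final byte[] FAM = Bytes.toBytes("f");

    // Returns {from, to}: the commit-label range of the committed table that a
    // reader with start label sLabel still has to scan for location "loc".
    static long[] scanRangeFor(HTable versionTable, String loc, long sLabel) throws Exception {
        byte[] hint = versionTable.get(new Get(Bytes.toBytes(loc)))
                                  .getValue(FAM, Bytes.toBytes("c"));
        long c1 = (hint == null) ? 0L : Bytes.toLong(hint);
        if (c1 < sLabel) {
            // the recorded writer is inside our snapshot; only commits after it
            // (and before s_i) could have written loc more recently
            return new long[] { c1 + 1, sLabel - 1 };
        } else {
            // the hint comes from a transaction outside our snapshot: fall back
            // to scanning everything up to s_i - 1
            return new long[] { Long.MIN_VALUE, sLabel - 1 };
        }
    }
}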

advanced features deal with straggling/failing processes: add a timeout mechanism; waiting processes can kill and remove straggling/failed processes from the queues, based on their own clocks. the final commit does a CheckAndPut on two rows (at once) in the committed table [diagram: committed table with a timeout column, value N, for the w_1 and w_2 rows]
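
A minimal sketch of a timeout-aware commit using HBase's atomic CheckAndPut, assuming the committed table carries a "timeout" column (value "N" while the transaction is alive, as in the slide's diagram) on the row keyed by the write label; the rest of the layout is an illustrative assumption, and only one of the two rows mentioned on the slide is shown:

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeoutCommitSketch {
    static final byte[] FAM = Bytes.toBytes("f");

    // Returns true if the commit record was written, false if another client has
    // meanwhile flagged this transaction as a straggler and removed it.
    static boolean commitUnlessTimedOut(HTable committed, long wLabel, long cLabel)
            throws Exception {
        Put record = new Put(Bytes.toBytes(wLabel));
        record.add(FAM, Bytes.toBytes("commit"), Bytes.toBytes(cLabel));
        // atomic: only applies the Put if the timeout flag for our row still reads "N"
        return committed.checkAndPut(Bytes.toBytes(wLabel), FAM,
                Bytes.toBytes("timeout"), Bytes.toBytes("N"), record);
    }
}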

9. performance

performance

performance

performance

10. comparison with Google's Percolator OSDI 2010, October 2010. goal: update Google's search/rank index incrementally (fresher results, don't wait 3 days); replace MapReduce by an incremental update system, but this needs concurrent changes to data

comparison with Google's Percolator snapshot isolation for Bigtable! Percolator has managed Google's index/ranking since April 2010 (a very, very large dataset, tens of petabytes): it works and is very useful!

comparison with Google's Percolator similarities with our HBase solution: a central mechanism for dispensing globally well-ordered time labels; use of built-in multiple data versions; clients decide whether they can commit (no centralized commit engine); two phases in the commit; transactional meta-information stored in tables; clients remove straggling processes

comparison with Google's Percolator differences with our HBase solution: Percolator adds snapshot isolation metadata to the user tables (more intrusive, but less centralized, with no central system tables); Percolator may block some reads; Percolator does not have strict first-committer-wins (it may abort both concurrent T_i's); different tradeoffs, different performance characteristics (Percolator is likely more scalable and throughput-friendly, but less responsive in some cases). (note: Percolator cannot be implemented directly on HBase because HBase lacks row transactions)

11. conclusions we have described the first (global, strong) snapshot isolation mechanism for HBase. independently and at the same time, Google has developed a snapshot isolation mechanism for Bigtable that uses design principles that are in many ways very similar to ours. snapshot isolation is now available for distributed sparse data stores: scalable distributed transactions now exist at the scale of clouds