HBase. Леонид Налчаджи


HBase Леонид Налчаджи leonid.nalchadzhi@gmail.com

Agenda: Overview, Table layout, Architecture, Client API, Key design

Overview

Overview: NoSQL, column-oriented, versioned.

Overview: all rows are ordered by row key; all cells are uninterpreted arrays of bytes; auto-sharding.

Overview: distributed, built on HDFS, highly fault tolerant; CP in terms of the CAP theorem.

Overview: a cell is addressed by the coordinates <RowKey, CF:CN, Version>; data is grouped by column family; null values come for free (absent cells occupy no space).

Overview: no in-place deletes (delete markers, called tombstones, are written instead); cells can expire via TTL.

Disadvantages: no secondary indexes out of the box (they can be built with an additional table and coprocessors); no transactions (only CAS support, row locks, and counters); not well suited to multiple data centers (only master-slave replication); no complex query processing.

Table layout

[Table layout diagrams]

Architecture

Terms: region, region server, WAL, Memstore, StoreFile/HFile.

[Architecture diagrams]

Architecture: one master (with a backup) and many slaves (region servers); very tight integration with ZooKeeper; a client talks to ZooKeeper, region servers, and the master.

Master: handles schema changes; not a single point of failure (a backup master can take over); performs cluster housekeeping operations; tracks region-server status through ZooKeeper.

Region server: stores data in regions, typically co-located with an HDFS DataNode; a region is a container of data; the region server does the actual data manipulation.

ZooKeeper: a distributed synchronization service; keys are kept in a filesystem-like tree; provides primitives from which many coordination tasks can be built.

Storage: handled by the region server; two types of files: the WAL and the data files; the Memstore is a sorted write buffer.

Client connection: ask ZooKeeper for the address of the -ROOT- region; from -ROOT- get the region server holding .META.; from .META. get the address of the region server serving the target region; all of this data is cached.

Write path: the mutation is written to the HLog (one per region server), then to the Memstore (one per column family); if the Memstore is full, a flush is requested, and that request is processed asynchronously by another thread.
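The write path above can be sketched as a toy model in plain Java (my simplification, not HBase code; the class name, the cell-count threshold, and the synchronous flush call are illustrative — in HBase the flush is handed to a background thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Simplified model of the HBase write path: every mutation goes to the
// WAL first, then to the sorted Memstore; a full Memstore is
// snapshotted, cleared, and flushed to a store file.
public class WritePathSketch {
    static final int FLUSH_THRESHOLD = 3;       // cells, not bytes, for brevity
    final List<String> wal = new ArrayList<>();                 // HLog, one per region server
    final TreeMap<String, String> memstore = new TreeMap<>();   // sorted buffer, one per CF
    final List<TreeMap<String, String>> storeFiles = new ArrayList<>();

    void put(String rowKey, String value) {
        wal.add(rowKey + "=" + value);          // 1. durability first: append to the WAL
        memstore.put(rowKey, value);            // 2. then the sorted write buffer
        if (memstore.size() >= FLUSH_THRESHOLD) {
            TreeMap<String, String> snapshot = new TreeMap<>(memstore);
            memstore.clear();                   // 3. Memstore is cleared on flush
            flush(snapshot);                    //    (asynchronous in real HBase)
        }
    }

    void flush(TreeMap<String, String> snapshot) {
        storeFiles.add(snapshot);               // stands in for writing an HFile to HDFS
    }
}
```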

On Memstore flush: the Memstore is written to an HDFS file, the sequence number of the last operation is recorded, and the Memstore is cleared.

Region splits: the region is closed; the directory structure for the new regions is created (multithreaded); .META. is updated (parent and daughter regions).

Region splits: the master is notified for load balancing; all steps are tracked in ZooKeeper; the parent region is eventually cleaned up.

Compaction: each Memstore flush creates a new store file; compaction is a housekeeping mechanism that merges store files; it comes in two varieties, minor and major, and the type is determined when the compaction check is executed.

Minor compaction: fast; selects which files to include based on parameters (maximum file size and minimum number of files), going from the oldest files to the newest.

[Minor compaction diagram]

Major compaction: compacts all files into a single file; a minor compaction may be promoted to a major one; runs periodically (every 24 hours by default); honors tombstone markers and actually removes the deleted data.
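What a major compaction does to tombstones can be shown with a plain-Java sketch (not HBase code; the TOMBSTONE constant and the one-map-per-store-file representation are illustrative assumptions):

```java
import java.util.List;
import java.util.TreeMap;

// Simplified model of a major compaction: merge several sorted store
// files into one, let newer files shadow older ones per key, and drop
// tombstoned cells from the merged result.
public class CompactionSketch {
    static final String TOMBSTONE = "__DELETED__"; // stand-in for a delete marker

    // `files` are ordered oldest first; later files override earlier ones
    static TreeMap<String, String> majorCompact(List<TreeMap<String, String>> files) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            merged.putAll(file);                     // newer values shadow older ones
        }
        merged.values().removeIf(TOMBSTONE::equals); // only a major compaction drops data
        return merged;
    }
}
```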

HFile format

HFile format: immutable (the trailer is written once); block size is 64 KB by default; the trailer holds pointers to the other blocks; the indexes store offsets into data and metadata; each block contains a magic value and the number of key-values.

KeyValue format [diagram]

Write-Ahead log

Write-Ahead log: one per region server; contains data from several regions; stores data sequentially.

Write-Ahead log: rolled periodically (every hour by default); replayed on cluster restart and after a server failure; log splitting is distributed.

Read path: there is no single index of logical rows; data lives in several store files plus a Memstore per column family; a Get is internally the same as a Scan.

[Read path diagram]

Block cache: each region server has an LRU block cache with priority levels: 1. single access; 2. multi access; 3. catalog column families (-ROOT- and .META.).

Block cache: total cache size = number of region servers * heap size * hfile.block.cache.size * 0.85. Examples: 1 RS with a 1 GB heap = 217 MB; 20 RS with 8 GB heaps = 34 GB; 100 RS with 24 GB heaps and the ratio raised to 0.5 = 1 TB.
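The sizing formula can be checked with a few lines of Java (assuming the 0.94-era default hfile.block.cache.size = 0.25 for the first two examples, as the slide does):

```java
// Reproduces the block-cache sizing arithmetic from the slide:
//   cache = regionServers * heapMb * hfile.block.cache.size * 0.85
// where 0.85 is the usable fraction after eviction headroom.
public class BlockCacheSizing {
    static double cacheMb(int regionServers, double heapMb, double blockCacheRatio) {
        return regionServers * heapMb * blockCacheRatio * 0.85;
    }
}
```

For example, 20 region servers * 8192 MB * 0.25 * 0.85 = 34816 MB, i.e. the 34 GB on the slide.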

Block cache: always kept in cache: catalog tables, HFile indexes, and keys (row key, column family, timestamp).

Client API

CRUD: all operations go through an HTable object; every operation is atomic per row; row locks are available; batching is available. Looks like this:
HTable table = new HTable("table");
table.put(new Put(...));

Put constructors:
Put(byte[] row)
Put(byte[] row, RowLock rowLock)
Put(byte[] row, long ts)
Put(byte[] row, long ts, RowLock rowLock)

Put methods:
Put add(byte[] family, byte[] qualifier, byte[] value)
Put add(byte[] family, byte[] qualifier, long ts, byte[] value)
Put add(KeyValue kv)

Put: also setWriteToWAL() and heapSize().

Write buffer: there is a client-side write buffer; it sorts data by region server; 2 MB by default; controlled by:
table.setAutoFlush(false);
table.flushCommits();

[Write buffer diagram]

What else. CAS:
boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Batch:
void put(List<Put> puts) throws IOException

Get: returns strictly one row:
Result res = table.get(new Get(row));
Can check whether a row exists:
boolean e = table.exists(new Get(row));
A filter can be applied.

Row lock: the region server provides row locks; a client can ask for an explicit lock:
RowLock lockRow(byte[] row) throws IOException;
void unlockRow(RowLock rl) throws IOException;
Do not use explicit locks unless you have to.

Scan: an iterator over a number of rows; leased for a limited amount of time; batching and caching are available; start and stop rows can be defined.

Scan example:
final Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("colfam1"));
scan.setBatch(5);
scan.setCaching(5);
scan.setMaxVersions(1);
scan.setTimeStamp(0);
scan.setStartRow(Bytes.toBytes(0));
scan.setStopRow(Bytes.toBytes(1000));
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
  ...
}
scanner.close();

Scan: caching is for rows, batching is for columns; ResultScanner.next() returns the next set of batched columns (possibly more than one Result per row).

[Scan diagram]
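The rows-vs-columns distinction can be illustrated with a toy model (plain Java, not the HBase API; batchColumns is a hypothetical helper): a row with 10 columns scanned with setBatch(3) comes back as four Results.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of scanner batching: one row's columns are split into
// Result-sized chunks of at most `batch` columns each.
public class BatchingSketch {
    static List<List<String>> batchColumns(List<String> columns, int batch) {
        List<List<String>> results = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += batch) {
            // each chunk stands in for one Result returned by next()
            results.add(columns.subList(i, Math.min(i + batch, columns.size())));
        }
        return results;
    }
}
```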

Counters: any column can be treated as a counter; atomic increment without locking; can be applied to multiple rows. Looks like this:
long cnt = table.incrementColumnValue(Bytes.toBytes("20131028"), Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);

HTable pooling: creating an HTable instance is expensive; HTable is not thread-safe; instances should be created at startup and closed on application exit.

HTable pooling: a pool is provided: HTablePool; it is thread-safe and shares common resources.

HTablePool API:
HTablePool(Configuration config, int maxSize)
HTableInterface getTable(String tableName)
void putTable(HTableInterface table)
void closeTablePool(String tableName)
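HTablePool is essentially an object pool around non-thread-safe HTable instances; the idea can be sketched with a generic pool (plain Java; PoolSketch is a hypothetical stand-in, and the real class also shares underlying resources such as the cluster connection):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Supplier;

// Minimal object pool in the spirit of HTablePool: hand out an idle
// instance if one is available, create a new one otherwise, and keep
// at most `maxSize` idle instances for reuse.
public class PoolSketch<T> {
    private final ArrayBlockingQueue<T> idle;
    private final Supplier<T> factory;

    PoolSketch(int maxSize, Supplier<T> factory) {
        this.idle = new ArrayBlockingQueue<>(maxSize);
        this.factory = factory;
    }

    T getTable() {           // cf. HTablePool.getTable(tableName)
        T t = idle.poll();
        return t != null ? t : factory.get();
    }

    void putTable(T t) {     // cf. HTablePool.putTable(table)
        idle.offer(t);       // silently dropped when the pool is already full
    }
}
```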

Key design

Key design: the only index is the row key; row keys are sorted; a key is an uninterpreted array of bytes.

[Key design diagram]

Email example. Approach #1: user id as the row key, all messages in one cell.
<userid> : <cf> : <col> : <ts> : <messages>
12345 : data : msgs : 1307097848 : ${msg1}, ${msg2}...

Email example. Approach #1 trade-offs: everything arrives in one request; it is hard to find anything inside the cell; it breaks down for users with a lot of messages.

Email example. Approach #2: user id as the row key, message id as the column qualifier.
<userid> : <cf> : <msgid> : <timestamp> : <email-message>
12345 : data : 5fc38314-e290-ae5da5fc375d : 1307097848 : "Hi,..."
12345 : data : 725aae5f-d72e-f90f3f070419 : 1307099848 : "Welcome, and..."
12345 : data : cc6775b3-f249-c6dd2b1a7467 : 1307101848 : "To Whom It..."
12345 : data : dcbee495-6d5e-6ed48124632c : 1307103848 : "Hi, how are..."

Email example. Approach #2 trade-offs: data can be loaded on demand; a particular message can be found only with a filter; still heavy for users with a lot of messages.

Email example. Approach #3: the row key is <userid>-<msgid>.
<userid>-<msgid> : <cf> : <qualifier> : <timestamp> : <email-message>
12345-5fc38314-e290-ae5da5fc375d : data : : 1307097848 : "Hi,..."
12345-725aae5f-d72e-f90f3f070419 : data : : 1307099848 : "Welcome, and..."
12345-cc6775b3-f249-c6dd2b1a7467 : data : : 1307101848 : "To Whom It..."
12345-dcbee495-6d5e-6ed48124632c : data : : 1307103848 : "Hi, how are..."

Email example. Approach #3 trade-offs: data can be loaded on demand; a message can be found directly by row key; works even for users with a lot of messages.

Email example. Going further: <userid>-<date>-<messageid>-<attachmentid>

Email example. Going further: <userid>-<date>-<messageid>-<attachmentid>, or, to get newest-first ordering, <userid>-(MAX_LONG - <date>)-<messageid>-<attachmentid>
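The (MAX_LONG - <date>) trick can be demonstrated in plain Java (rowKey is a hypothetical helper, not from the slides): zero-padding the reversed timestamp to a fixed width makes lexicographic row-key order return newer messages first.

```java
// Builds a composite row key <userId>-<reversedDate>-<msgId> where the
// date is stored as (Long.MAX_VALUE - timestamp), zero-padded to 19
// digits so that lexicographic ordering puts newer entries first.
public class KeyDesign {
    static String rowKey(String userId, long timestampMillis, String msgId) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return String.format("%s-%019d-%s", userId, reversed, msgId);
    }
}
```

A scan starting at prefix "12345-" then yields that user's messages newest first, which is usually what an inbox wants.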

Performance

Performance. Setup: 10 GB of data, 10 parallel clients, 1 KB per entry, HDFS replication = 3, 3 region servers (2 CPUs / 24 cores, 48 GB RAM, 10 GB for the application).
no buf, no WAL: 12 MB/s, 11k rows/s
buf:100, no WAL: 53 MB/s, 50k rows/s
buf:1000, no WAL: 48 MB/s, 45k rows/s
no buf, WAL: 5 MB/s, 4.7k rows/s
buf:100, WAL: 11 MB/s, 10k rows/s
buf:1000, WAL: 31 MB/s, 30k rows/s

Sources: HBase: The Definitive Guide (O'Reilly); HBase in Action (Manning); the official documentation; Cloudera articles.
http://hbase.apache.org/0.94/book.html
http://blog.cloudera.com/blog/category/hbase/
