Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Similar documents
Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Scaling Up Pig. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Comparing SQL and NOSQL databases

Classification Key Concepts

COSC 6339 Big Data Analytics. NoSQL (II) HBase. Edgar Gabriel Fall HBase. Column-Oriented data store Distributed designed to serve large tables

Scaling Up 1 CSE 6242 / CX Duen Horng (Polo) Chau Georgia Tech. Hadoop, Pig

Analytics Building Blocks

Typical size of data you deal with on a daily basis

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Fattane Zarrinkalam کارگاه ساالنه آزمایشگاه فناوری وب

Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612

Big Data Analytics. Rasoul Karimi

Data Informatics. Seon Ho Kim, Ph.D.

CS November 2018

CS November 2017

Data Integration. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

CA485 Ray Walshe NoSQL

CSE-E5430 Scalable Cloud Computing Lecture 9

Text Analytics (Text Mining)

10 Million Smart Meter Data with Apache HBase

Text Analytics (Text Mining)

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Data Collection, Simple Storage (SQLite) & Cleaning

BigTable: A Distributed Storage System for Structured Data

CIB Session 12th NoSQL Databases Structures

Graphs / Networks CSE 6242/ CX Basics, how to build & store graphs, laws, etc. Centrality, and algorithms you should know

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Processing of big data with Apache Spark

Introduction to Machine Learning. Duen Horng (Polo) Chau Associate Director, MS Analytics Associate Professor, CSE, College of Computing Georgia Tech

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

Distributed File Systems II

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

HBase: column-oriented database

Introduction to BigData, Hadoop:-

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Presented by Sunnie S Chung CIS 612

Data Mining Concepts & Tasks

Distributed Systems [Fall 2012]

Distributed Database Case Study on Google s Big Tables

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Data Mining Concepts & Tasks

Graphs / Networks CSE 6242/ CX Interactive applications. Duen Horng (Polo) Chau Georgia Tech

Rule 14 Use Databases Appropriately

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

Big Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla

CISC 7610 Lecture 2b The beginnings of NoSQL

Integration of Apache Hive

"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

CSE 444: Database Internals. Lecture 23 Spark

Shen PingCAP 2017

Google big data techniques (2)

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

BigTable: A Distributed Storage System for Structured Data (2006) Slides adapted by Tyler Davis

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Stages of Data Processing

Ghislain Fourny. Big Data 5. Column stores

Structured Big Data 1: Google Bigtable & HBase Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC

PROFESSIONAL. NoSQL. Shashank Tiwari WILEY. John Wiley & Sons, Inc.

Graphs III. CSE 6242 A / CS 4803 DVA Feb 26, Duen Horng (Polo) Chau Georgia Tech. (Interactive) Applications

Database Evolution. DB NoSQL Linked Open Data. L. Vigliano

Column Stores and HBase. Rui LIU, Maksim Hrytsenia

HBASE INTERVIEW QUESTIONS

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Design & Implementation of Cloud Big table

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

7680: Distributed Systems

Cassandra- A Distributed Database

A NoSQL Introduction for Relational Database Developers. Andrew Karcher Las Vegas SQL Saturday September 12th, 2015

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Hadoop Development Introduction

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

A Glimpse of the Hadoop Echosystem

CSE 530A. Non-Relational Databases. Washington University Fall 2013

Distributed Data Store

Column-Family Databases Cassandra and HBase

Time Series Storage with Apache Kudu (incubating)

Top 25 Hadoop Admin Interview Questions and Answers

Challenges for Data Driven Systems

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

MapReduce & BigTable

Database Systems CSE 414

Analysis of HBase Read/Write

/ Cloud Computing. Recitation 10 March 22nd, 2016

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Study of NoSQL Database Along With Security Comparison

Exploring Cassandra and HBase with BigTable Model

5/1/17. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Big Data with Hadoop Ecosystem

Using space-filling curves for multidimensional

Intro Cassandra. Adelaide Big Data Meetup.

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Transcription:

http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Scaling Up HBase Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray

What if you need real-time read/write for large datasets? 2

Lecture based on these two books. http://goo.gl/yncwn http://goo.gl/svztv 3

http://hbase.apache.org Built on top of HDFS Supports real-time read/write random access Scale to very large datasets, many machines Not relational, does NOT support SQL ( NoSQL = not only SQL ) http://en.wikipedia.org/wiki/nosql Supports billions of rows, millions of columns (e.g., serving Facebook s Messaging Platform) Written in Java; works with other APIs/languages (REST, Thrift, Scala) Where does HBase come from? http://radar.oreilly.com/2014/04/5-fun-facts-about-hbase-that-you-didnt-know.html http://wiki.apache.org/hadoop/hbase/poweredby 4

HBase s history Hadoop & HDFS based on... Designed for batch processing 2003 Google File System (GFS) paper http://cracking8hacking.com/cracking-hacking/ebooks/misc/pdf/the%20google%20filesystem.pdf 2004 Google MapReduce paper http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf HBase based on... 2006 Google Bigtable paper http://static.googleusercontent.com/media/research.google.com/en/us/archive/bigtable-osdi06.pdf Designed for random access 5

How does HBase work? Column-oriented Column is the most basic unit (instead of row) Multiple columns form a row A column can have multiple versions, each version stored in a cell Rows form a table Row key locates a row Rows sorted by row key lexicographically (~= alphabetically) 6

Row key is unique Think of row key as the index of an HBase table You look up a row using its row key Only one index per table (via row key) HBase does not have built-in support for multiple indices; support enabled via extensions 7

Rows sorted lexicographically (=alphabetically) hbase(main):001:0> scan 'table1' ROW COLUMN+CELL row-1 column=cf1:, timestamp=1297073325971... row-10 column=cf1:, timestamp=1297073337383... row-11 column=cf1:, timestamp=1297073340493... row-2 column=cf1:, timestamp=1297073329851... row-22 column=cf1:, timestamp=1297073344482... row-3 column=cf1:, timestamp=1297073333504... row-abc column=cf1:, timestamp=1297073349875... 7 row(s) in 0.1100 seconds row-10 comes before row-2. How to fix? 8

Rows sorted lexicographically (=alphabetically) hbase(main):001:0> scan 'table1' ROW COLUMN+CELL row-1 column=cf1:, timestamp=1297073325971... row-10 column=cf1:, timestamp=1297073337383... row-11 column=cf1:, timestamp=1297073340493... row-2 column=cf1:, timestamp=1297073329851... row-22 column=cf1:, timestamp=1297073344482... row-3 column=cf1:, timestamp=1297073333504... row-abc column=cf1:, timestamp=1297073349875... 7 row(s) in 0.1100 seconds row-10 comes before row-2. How to fix? Pad row-2 with a 0. i.e., row-02 8

Columns grouped into column families Why? Helps with organization, understanding, optimization, etc. In details... Columns in the same family stored in same file called HFile inspired by Google s SSTable = large map whose keys are sorted Apply compression on the whole family... 9

More on column family, column Column family An HBase table supports only few families (e.g., <10) Due to limitations in implementation Family name must be printable Should be defined when table is created Shouldn not be changed often Each column referenced as family:qualifier Can have millions of columns Values can be anything that s arbitrarily long 10

Cell Value Timestamped Implicitly by system Or set explicitly by user Let you store multiple versions of a value = values over time Values stored in decreasing time order Most recent value can be read first 11

Time-oriented view of a row 12

Concise way to describe all these? HBase data model (= Bigtable s model) Sparse, distributed, persistent, multidimensional map Indexed by row key + column key + timestamp (Table, RowKey, Family, Column, Timestamp)! Value 13

An exercise How would you use HBase to create a webtable store snapshots of every webpage on the planet, over time? 14

Details: How does HBase scale up storage & balance load? Automatically divide contiguous ranges of rows into regions Start with one region, split into two when getting too large, and so on. 15

Details: How does HBase scale up storage & balance load? Excellent Summary: http://blog.cloudera.com/blog/2013/04/how-scaling-really-works-in-apache-hbase/ 16

How to use HBase Interactive shell Will show you an example, locally (on your computer, without using HDFS) Programmatically e.g., via Java, Python, etc. 17

Example, using interactive shell Start HBase Start Interactive Shell Check HBase is running 18

Example: Create table, add values 19

Example: Scan (show all cell values) 20

Example: Get (look up a row) Can also look up a particular cell value with a certain timestamp, etc. 21

Example: Delete a value 22

Example: Deleting a table Why need to disable a table before dropping it? http://stackoverflow.com/questions/35441342/hbase-why-do-i-need-to-disable-a-table-before-dropping-it 23

RDBMS vs HBase RDBMS (=Relational Database Management System) MySQL, Oracle, SQLite, Teradata, etc. Really great for many applications Ensure strong data consistency, integrity Supports transactions (ACID guarantees)... 24

RDBMS vs HBase How are they different? When to use what? 25

RDBMS vs HBase How are they different? Hbase when you don t know the structure/schema HBase supports sparse data many columns, values can be absent Relational databases good for getting whole rows HBase: keeps multiple versions of data RDBMS support multiple indices, minimize duplications Generally a lot cheaper to deploy HBase, for same size of data (petabytes) 26

More topics to learn about Other ways to get, put, delete... (e.g., programmatically via Java) Doing them in batch A lot more to read about cluster adminstration Configurations, specs for master (name node) and workers (region servers) Monitoring cluster s health Bad key design (http://hbase.apache.org/book/rowkey.design.html) monotonically increasing keys can decrease performance Integrating with MapReduce Cassandra, MongoDB, etc. http://db-engines.com/en/system/cassandra%3bhbase%3bmongodb http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis 27