Comparing SQL and NoSQL databases


COSC 6397 Big Data Analytics
Data Formats (II): HBase
Edgar Gabriel, Spring 2014

Comparing SQL and NoSQL databases

Types
- SQL: one type (the SQL database), with minor variations.
- NoSQL: many different types, including key-value stores, document databases, wide-column stores, and graph databases.

Development history
- SQL: developed in the 1970s to deal with the first wave of data storage applications.
- NoSQL: developed in the 2000s to deal with scale, replication, and unstructured data storage.

Data storage model
- SQL: individual records (e.g., "employees") are stored as rows in tables, with each column storing a specific piece of data about that record. Separate data types are stored in separate tables and then joined together when more complex queries are executed.
- NoSQL: varies based on database type. Key-value stores function similarly to SQL databases but have only two columns. Document databases do away with the table-and-row model altogether, e.g., by nesting values hierarchically.

Source: http://www.mongodb.com/learn/nosql

Comparing SQL and NoSQL databases (continued)

Schemas
- SQL: structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
- NoSQL: typically dynamic. Records can add new information on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary.

Scaling
- SQL: vertically, meaning a single server must be made increasingly powerful in order to deal with increased demand. It is possible to spread SQL databases over many servers, but significant additional engineering is generally required.
- NoSQL: horizontally, meaning that to add capacity, a database administrator can simply add more commodity servers or cloud instances. The database automatically spreads data across servers as necessary.

Development model
- SQL: mix of open source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database).
- NoSQL: open source.

Supports transactions
- SQL: yes, updates can be configured to complete entirely or not at all.
- NoSQL: in certain circumstances and at certain levels (e.g., document level vs. database level).

Data manipulation
- SQL: specific language using SELECT, INSERT, and UPDATE statements.
- NoSQL: through object-oriented APIs.

Consistency
- SQL: can be configured for strong consistency.
- NoSQL: depends on the product. Some provide strong consistency (e.g., MongoDB) whereas others offer eventual consistency (e.g., Cassandra).

Source: http://www.mongodb.com/learn/nosql

HBase
- Column-oriented data store, known as the "Hadoop database".
- Distributed, designed to serve large tables: billions of rows and millions of columns.
- Runs on a cluster of commodity hardware (server hardware, not laptops/desktops).
- Open source, written in Java.
- A type of NoSQL database: does not provide SQL-based access and does not adhere to the relational model for storage.
- Automatic fail-over.
- Simple Java API.
- Integration with the Map/Reduce framework.
- Based on Google's Bigtable. Recommended literature: http://labs.google.com/papers/bigtable.html

HBase is good for:
- Lots and lots of data.
- A large number of clients/requests.
- Single random selects and range scans by key.
- Variable schemas, e.g., rows may differ drastically.

HBase is bad for:
- Traditional RDBMS retrieval, e.g., transactional applications.
- Relational analytics (e.g., group by, join, etc.).
- Text-based search access.

HBase data model
- Data is stored in tables; tables contain rows.
- Rows are referenced by a unique key. The key is an array of bytes; the good news is that anything can be a key: a string, a long, or your own serialized data structures.
- Rows are made of columns, which are grouped into column families.
- Data is stored in cells, identified by row, column family, and column.
- A cell's content is also an array of bytes.
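The addressing scheme above can be sketched in a few lines of plain Python. This is a toy model, not the HBase client API; the table, rows, and values are hypothetical, and both keys and values are bytes, as in HBase:

```python
# Toy model of HBase cell addressing: a value is located by
# row key + family:column. Keys and values are arrays of bytes.

table = {}  # row key (bytes) -> {b"family:column": value (bytes)}

def put(row, family, column, value):
    table.setdefault(row, {})[family + b":" + column] = value

def get(row, family, column):
    return table.get(row, {}).get(family + b":" + column)

put(b"Matt-001", b"info", b"title", b"Elephant")
put(b"Matt-001", b"info", b"author", b"Matt")

print(get(b"Matt-001", b"info", b"title"))   # b'Elephant'
```

Note how nothing constrains which columns a row has: a second row could carry entirely different `family:column` entries, which is the "variable schema" property above.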

HBase families
- Columns are grouped into families, labeled as family:column, e.g., user:first_name.
- Families are a way to organize your data; various features are applied per family: compression, the in-memory option, and storage (a family is stored together in a file called an HFile/StoreFile).
- Family definitions are static: they are created with the table and should rarely be added or changed. Tables are limited to a small number of families, unlike columns, of which you can have millions.
- A family name must be composed of printable characters (not arbitrary bytes, unlike keys and values).
- Think of family:column as a tag for a cell value, NOT as a spreadsheet coordinate.
- Columns, on the other hand, are NOT static: new columns can be created at run-time, and a family can scale to millions of columns.

HBase timestamps
- Cell values are versioned: for each cell, multiple versions are kept (3 by default).
- Timestamps are another dimension to identify your data; they are either assigned implicitly by the region server or provided explicitly by the client.
- Versions are stored in decreasing timestamp order, so the latest is read first: an optimization for reading the current value.
- You can specify how many versions are kept.

HBase values and row keys
- Value = Table + RowKey + Family + Column + Timestamp.
- Rows are sorted lexicographically by key, compared on a binary level from left to right. For example, the keys 1, 2, 3, 10, 15 will get sorted as 1, 10, 15, 2, 3.
- The row key is somewhat similar to a relational database's primary index, and is always unique.
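The sorting example above can be checked directly. This plain-Python sketch (not the HBase API) treats row keys as raw bytes, which is how HBase compares them; the zero-padding workaround in the second half is a common schema-design trick, not an HBase feature:

```python
# Row keys compare byte-by-byte from left to right, so b"10" sorts
# before b"2" (the byte '1' is smaller than the byte '2').
keys = [b"1", b"2", b"3", b"10", b"15"]
print(sorted(keys))  # [b'1', b'10', b'15', b'2', b'3']

# Zero-padding keys restores numeric ordering under byte comparison:
padded = sorted(k.zfill(4) for k in keys)
print(padded)  # [b'0001', b'0002', b'0003', b'0010', b'0015']
```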

HBase architecture
- A table is made of regions. A region is a range of rows stored together; regions are used for scaling, being dynamically split as they become too big and merged if too small.
- A Region Server serves one or more regions; a region is served by only one Region Server.
- The Master Server is the daemon responsible for managing the HBase cluster, i.e., the Region Servers.
- HBase stores its data in HDFS and relies on HDFS's high availability and fault-tolerance features.
- HBase utilizes ZooKeeper for distributed coordination.

HBase components: (the original slide shows a diagram of these components at this point.)

HBase regions
- A region is a range of keys, from a start key to a stop key (e.g., k3cod to odiekd); the start key is inclusive and the stop key exclusive.
- Addition of data: at first there is only one region. When added data eventually exceeds the configured maximum region size (default 256 MB), the region is split into two regions at the middle key.
- The number of regions per server depends on the hardware specs; with today's hardware it is common to have 10 to 1000 regions per Region Server, each managing as much as 1 GB to 2 GB of data.
- Splitting data into regions allows fast recovery when a region fails and load balancing when a server is overloaded; regions may be moved between servers.
- Splitting is fast: the new regions read from the original file while an asynchronous process performs the actual split.
- All of this happens automatically, without user involvement.
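A split at "the middle key" can be sketched as follows. This is a deliberately simplified toy (real HBase chooses the split point from HFile index blocks, and region boundaries are byte ranges); the key values are made up for illustration:

```python
# Toy region split: divide a sorted range of row keys into two
# half-open ranges [start, mid) and [mid, None) at the middle key.

def split_region(sorted_keys):
    """Return two (start_key, stop_key) ranges; None = open-ended."""
    mid = sorted_keys[len(sorted_keys) // 2]
    return (sorted_keys[0], mid), (mid, None)

keys = sorted([b"a1", b"b2", b"c3", b"d4", b"e5", b"f6"])
left, right = split_region(keys)
print(left, right)  # (b'a1', b'd4') (b'd4', None)
```

The half-open convention mirrors the slide's rule that a region's start key is inclusive and its stop key exclusive, so every key lands in exactly one region.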

Data storage
- Data is stored in files called HFiles/StoreFiles, usually saved in HDFS.
- An HFile is basically a key-value map, with keys sorted lexicographically.
- When data is added, it is written to a log called the Write Ahead Log (WAL) and also stored in memory (the memstore).
- Flush: when the in-memory data exceeds a maximum value, it is flushed to an HFile. Data persisted to an HFile can then be removed from the WAL.
- The Region Server continues serving reads and writes during the flush operation, writing values to the WAL and memstore.
- Recall that HDFS doesn't support updates to an existing file; therefore HFiles are immutable, and key-values cannot be removed from an HFile. Over time, more and more HFiles are created.
- A delete marker is saved to indicate that a record was removed; these markers are used to filter the data, i.e., to hide the deleted records.
- At read time, data is merged from the contents of the memstore and the HFiles.
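The write path above (WAL + memstore, flush to immutable files, tombstones hiding deleted records) can be simulated in miniature. This is a hypothetical sketch, far simpler than the real implementation: the flush threshold counts keys instead of bytes, and the WAL is truncated rather than rolled:

```python
# Toy HBase write path: puts go to the WAL and the memstore; when the
# memstore is full it is flushed to an immutable "HFile". Deletes are
# tombstone markers that hide older values at read time.

TOMBSTONE = object()
wal, memstore, hfiles = [], {}, []   # hfiles: list of immutable dicts

def put(key, value, flush_limit=3):
    wal.append((key, value))
    memstore[key] = value
    if len(memstore) >= flush_limit:   # flush: memstore -> new HFile
        hfiles.append(dict(memstore))
        memstore.clear()
        wal.clear()                    # flushed data may leave the WAL

def delete(key):
    put(key, TOMBSTONE)                # write a delete marker

def get(key):
    # Merge newest-first: memstore, then HFiles from newest to oldest.
    for store in [memstore] + hfiles[::-1]:
        if key in store:
            v = store[key]
            return None if v is TOMBSTONE else v
    return None

put(b"row1", b"v1"); put(b"row2", b"v2"); put(b"row3", b"v3")  # flush
put(b"row1", b"v1-new")
delete(b"row2")
print(get(b"row1"), get(b"row2"), get(b"row3"))
```

Note that neither the update to row1 nor the delete of row2 touched the flushed file; both are newer entries that win the newest-first merge, which is exactly why HFile immutability is workable.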

HBase Master
- Responsible for managing regions and their locations: assigns regions to Region Servers, re-balances to accommodate workloads, and recovers if a Region Server becomes unavailable.
- Uses the ZooKeeper distributed coordination service.
- Doesn't actually store or read data; clients communicate directly with Region Servers, so the Master is usually lightly loaded.
- Responsible for schema management and changes: adding/removing tables and column families.

ZooKeeper
- HBase uses ZooKeeper extensively for region assignment.
- ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
- HBase can manage the ZooKeeper daemons for you, or you can install and manage them separately.
- Learn more at http://zookeeper.apache.org

HBase access
- HBase shell.
- Native Java API: the fastest and most capable option.
- Avro server: Apache Avro is also a cross-language schema compiler (http://avro.apache.org); requires running an Avro server.
- HBql: SQL-like syntax for HBase (http://www.hbql.com).

HBase shell

> hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4-cdh3u2, r, Thu Oct 13 20:32:26 PDT 2011

hbase(main):001:0> list
TABLE
0 row(s) in 0.4070 seconds

HBase shell: quoting
- Quote all names (table and column names).
- Single quotes for text:
  hbase> get 't1', 'myrowid'
- Double quotes for binary; use the hexadecimal representation of the binary value:
  hbase> get 't1', "key\x03\x3f\xcd"
- The shell uses Ruby hashes to specify parameters: {'key1' => 'value1', 'key2' => 'value2', ...}
  Example: hbase> get 'UserTable', 'userid1', {COLUMN => 'address:str'}

HBase shell commands
- General: status, version
- Data Definition Language (DDL): alter, create, describe, disable, drop, enable, exists, list, ...
- Data Manipulation Language (DML): count, delete, deleteall, get, incr, put, scan, truncate, ...
- Cluster administration: balancer, close_region, compact, flush, major_compact, move, split, unassign, zk_dump, add_peer, disable_peer, enable_peer, ...
- Learn more about each command: hbase> help "<command>"

Create a table with two families (info, content):

hbase> create 'Blog', {NAME=>'info'}, {NAME=>'content'}
0 row(s) in 1.3580 seconds

Populate the table with data records. Format: put 'table', 'row_id', 'family:column', 'value'

hbase> put 'Blog', 'Matt-001', 'info:title', 'Elephant'
hbase> put 'Blog', 'Matt-001', 'info:author', 'Matt'

Access data
- count: display the total number of records
- get: retrieve a single row
- scan: retrieve a range of rows

count. Format: count 'table_name' (will scan the entire table!)

hbase> count 'Blog', {INTERVAL=>2}
Current count: 2, row: John-005
Current count: 4, row: Matt-002
5 row(s) in 0.0220 seconds

Access data: get
- Select a single row with the 'get' command. Format: get 'table', 'row_id'
- Returns an entire row; requires the table name and row id. Optional: timestamp or time-range, and versions.
- Select specific columns:
  hbase> get 't1', 'r1', {COLUMN => 'c1'}
  hbase> get 't1', 'r1', {COLUMN => ['c1', 'c2', 'c3']}
- Select a specific timestamp or time-range:
  hbase> get 't1', 'r1', {TIMERANGE => [ts1, ts2]}
  hbase> get 't1', 'r1', {COLUMN => 'c1', TIMESTAMP => ts1}
- Select more than one version:
  hbase> get 't1', 'r1', {VERSIONS => 4}

Access data: scan
- Scans an entire table or a portion of it; loads entire rows or explicitly retrieves column families, columns, or specific cells.
- To scan an entire table. Format: scan 'table_name'
- Limit the number of results. Format: scan 'table_name', {LIMIT=>1}
- Scan a range. Format: scan 'table_name', {STARTROW=>'startRow', STOPROW=>'stopRow'}
  The start row is inclusive, the stop row exclusive; you can provide just the start row or just the stop row.
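The scan semantics (inclusive start row, exclusive stop row, optional bounds and limit) can be sketched over a sorted key set. A plain-Python toy using the row ids from the earlier shell examples, not the HBase API:

```python
# Toy scan over lexicographically sorted row keys: STARTROW is
# inclusive, STOPROW is exclusive, and either bound may be omitted.

rows = {b"John-005": b"...", b"Matt-001": b"...",
        b"Matt-002": b"...", b"Michelle-004": b"..."}

def scan(startrow=None, stoprow=None, limit=None):
    keys = [k for k in sorted(rows)
            if (startrow is None or k >= startrow)
            and (stoprow is None or k < stoprow)]
    return keys if limit is None else keys[:limit]

print(scan(startrow=b"Matt-001", stoprow=b"Michelle-004"))
# [b'Matt-001', b'Matt-002']
print(scan(limit=1))  # [b'John-005']
```

Because the stop row is excluded, adjacent scan ranges never overlap, which matches how region boundaries partition the key space.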

Edit data
- The put command inserts a new value if the row id doesn't exist, and updates the value if it does.
- But does it really update? No: it inserts a new version for the cell, and only the latest version is selected by default.
- N versions are kept per cell, configured per family at table creation (3 by default). Format: create 'table', {NAME => 'family', VERSIONS => 7}

hbase> put 'Blog', 'Michelle-004', 'info:date', '1990.07.06'
0 row(s) in 0.0520 seconds
hbase> put 'Blog', 'Michelle-004', 'info:date', '1990.07.07'
0 row(s) in 0.0080 seconds
hbase> put 'Blog', 'Michelle-004', 'info:date', '1990.07.08'
0 row(s) in 0.0060 seconds
hbase> get 'Blog', 'Michelle-004', {COLUMN=>'info:date', VERSIONS=>3}
COLUMN CELL
info:date timestamp=1326071670471, value=1990.07.08
info:date timestamp=1326071670442, value=1990.07.07
info:date timestamp=1326071670382, value=1990.07.06
3 row(s) in 0.0170 seconds
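The versioning behavior of a single cell can be modeled directly: each put adds a (timestamp, value) pair, versions are kept newest-first, and anything beyond the version limit is dropped. A hypothetical sketch using an integer counter in place of a real timestamp:

```python
# Toy HBase cell versioning: keep at most MAX_VERSIONS versions,
# stored in decreasing timestamp order (newest first).

MAX_VERSIONS = 3
cell, clock = [], 0   # cell: list of (timestamp, value), newest first

def put(value):
    global clock
    clock += 1
    cell.insert(0, (clock, value))
    del cell[MAX_VERSIONS:]          # drop versions beyond the limit

for date in [b"1990.07.05", b"1990.07.06", b"1990.07.07", b"1990.07.08"]:
    put(date)

print(cell[0][1])   # newest value: b'1990.07.08'
print(len(cell))    # 3 versions kept; 1990.07.05 was dropped
```

This mirrors the shell transcript above: a default get returns only `cell[0]`, while `VERSIONS => 3` returns the whole list in newest-first order.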

Delete data
- Delete a cell by providing the table, row id, and column coordinates. Format: delete 'table', 'rowid', 'column'
- This deletes all versions of that cell. Optionally add a timestamp to delete only the versions before the provided timestamp. Format: delete 'table', 'rowid', 'column', timestamp

Drop a table
- You must disable a table before dropping it; disabling puts the table offline so schema-based operations can be performed.
  hbase> disable 'table_name'
  hbase> drop 'table_name'
- For a large table this may take a long time.

hbase> list
TABLE
Blog
1 row(s) in 0.0120 seconds

hbase> disable 'Blog'
0 row(s) in 2.0510 seconds

hbase> drop 'Blog'
0 row(s) in 0.0940 seconds

hbase> list
TABLE
0 row(s) in 0.0200 seconds