Apache HBase Andrew Purtell Committer, Apache HBase, Apache Software Foundation Big Data US Research And Development, Intel

Apache HBase 0.98
Andrew Purtell
Committer, Apache HBase, Apache Software Foundation
Big Data US Research And Development, Intel

Who am I?
- Committer on the Apache HBase project
- Member of the Big Data Research And Development Group at Intel
- Release manager for Apache HBase 0.98

What's in Apache HBase 0.98?
- ~230 resolved JIRAs
- New features
  - Reverse scans (HBASE-4811)
  - EXEC access checks for Endpoints (HBASE-6104)
  - Transparent server side encryption (HBASE-7544)
  - Per-cell ACLs (HBASE-7662)
  - Visibility labels (HBASE-7663)
  - Stripe compactions (HBASE-7667)
  - MapReduce over snapshots (HBASE-8369)
  - REST streaming scans (HBASE-9343)
- Performance improvements
  - Improved WAL write threading model (HBASE-8755)
- API cleanups and many bug fixes

Branch Release Criteria
- Wire compatibility with HBase 0.96
  - Mixed client-server and server-server operation with 0.96 is possible as long as no 0.98-specific features are enabled
- Compatible with earlier on-disk data formats
  - Direct upgrade from 0.94 to 0.98 is possible using the same offline data migration procedure necessary for 0.94 to 0.96
- No significant performance regression from 0.96 using defaults
- Binary API compatibility with versions < 0.98 is not guaranteed; code that directly references HBase JARs may need to be recompiled

Reverse Scans (HBASE-4811)
- Introduces a new internal scanner type that seeks to the end of a range and then steps backwards
- No longer necessary to maintain tables of keys in reverse sort order for scanning
- Exposed at the client with a new Scan method
  - Scan#setReversed(boolean reversed)
- A few % slower than forward scanning in CPU bound tests (server side, filters)
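The "seek to the end of the range, then step backwards" idea can be pictured with a toy model over a sorted list of row keys. This is only a sketch in plain Python, not HBase code: the function name reverse_scan and its parameters are hypothetical, and the convention that the start row is the larger, inclusive bound while the stop row is the smaller, exclusive bound is an assumption made for the illustration.

```python
from bisect import bisect_right


def reverse_scan(sorted_keys, start_row=None, stop_row=None):
    """Yield keys in descending order.

    Toy model only (hypothetical helper, not the HBase scanner):
    start_row is the larger, inclusive bound; stop_row is the smaller,
    exclusive bound. Either may be None to leave that side open.
    """
    # Seek to the end of the range: one past the last key <= start_row.
    hi = len(sorted_keys) if start_row is None else bisect_right(sorted_keys, start_row)
    # Lowest index to visit: the first key strictly greater than stop_row.
    lo = 0 if stop_row is None else bisect_right(sorted_keys, stop_row)
    # Step backwards through the range.
    for i in range(hi - 1, lo - 1, -1):
        yield sorted_keys[i]


keys = [b"row-01", b"row-02", b"row-03", b"row-04", b"row-05"]
newest_first = list(reverse_scan(keys, start_row=b"row-04", stop_row=b"row-01"))
```

Because the keys stay in their natural sort order, no second table of reversed keys is needed; the same data serves both scan directions.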

Endpoint EXEC Grants (HBASE-6104)
- HBase ACLs can grant a familiar set of privileges to users (and groups):
  - (R)ead
  - (W)rite
  - E(X)ecute
  - (C)reate
  - (A)dmin
- AccessController versions prior to 0.98 ignored X
- Now access to coprocessor Endpoint invocations can be controlled on a global, per-table, or per-cf basis
  - Enable the AccessController
  - Set hbase.security.exec.permission.checks to true
  - Grant or revoke permissions as appropriate
  - Deploy the coprocessor application

Cell Tags
- All values written to HBase are stored into cells
  - Cell is used interchangeably with key-value or KeyValue for legacy reasons
- Cells can now also carry an arbitrary number of tags
  - Metadata, considered distinct from the key and the value
- Only available server side
  - Coprocessors can manage their own user defined tags

HFile Version 3
- HFile version 2 plus
  - The ability to persist cell tags
  - Support for optional file block encryption
- Enabled via a site file change
  - hfile.format.version -> 3
- Once enabled, all data is transparently migrated over time as new files are written by flushes and compactions
- Required for:
  - Transparent Encryption (HBASE-7544)
  - Per-cell ACLs (HBASE-7662)
  - Visibility labels (HBASE-7663)

Transparent Encryption (HBASE-7544)
- Introduces a new generic cryptographic codec and key management framework into hbase-common
- Provides transparent encryption of HBase on disk data
  - Optional per-file HFile block encryption (requires HFile v3)
  - Optional secure WAL reader and writer
- Block encryption is enabled on a per-cf basis
  - Supports schema design that places sensitive information in only a subset of column families
- Provides simple key management
  - Flexible and non-intrusive key rotation
  - Key provider supports secure local key storage or any network or hardware key storage with Java KeyStore support
- Simple shell support for testing


Per-Cell ACLs (HBASE-7662)
- Extends the AccessController with support for persisting and checking ACL data in cell tags
- Uses existing API facilities to transmit per cell ACLs
- Backward compatible with existing installs and code
- We treat ACLs on a cell as scoped only to the cell for straightforward policy evolution
- All mutations must have covering permission in a dominating grant

Visibility Labels (HBASE-7663)
- Introduces a new VisibilityController coprocessor
- Introduces per-cell visibility expressions, client API extensions for setting visibility and authorizations, and new shell commands for label management
- The maximal set of labels for a user is defined with the new shell command setauths or equivalent admin API
- Users specify visibility expressions on cells
- Users submit authorizations on Gets and Scans
- The effective label set for the request is built in the RPC context from authorizations; those not in the maximal set are dropped
  - How this is done is pluggable, e.g. integration with enterprise identity management solutions
- Scan results are filtered with (label) set membership tests

Visibility Labels (HBASE-7663)
- Visibility expressions
  - Labels: arbitrary strings (converted into ordinals with an internal dictionary)
  - Expressions: labels joined in boolean expressions
  - Operators: &, |, !
  - Parentheses for precedence
- Examples:
  - secret
  - secret | topsecret
  - ( secret | topsecret ) & !probationary
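To make the filtering model concrete, here is a small sketch in plain Python of how a visibility expression might be evaluated against a request's effective label set. It is not the VisibilityController implementation: the names tokenize, is_visible and effective_auths are hypothetical, labels are restricted to letters, digits and underscores for simplicity, and giving & higher precedence than | is an assumption of the sketch (the slides only say that parentheses control precedence).

```python
import re

# Operators and parentheses, or a label made of letters, digits, underscores.
TOKEN = re.compile(r"\s*([&|!()]|[A-Za-z0-9_]+)")


def tokenize(expression):
    """Split a visibility expression into operator and label tokens."""
    expression = expression.rstrip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, auths):
    """Evaluate the expression; a label is true when it is in auths."""
    tokens = tokenize(expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        tok = take()
        if tok == "!":
            return not parse_not()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok in {"&", "|", ")"}:
            raise ValueError(f"unexpected token {tok!r}")
        return tok in auths  # set membership test on a label

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


def effective_auths(requested, maximal):
    """Drop requested authorizations that are not in the user's maximal set."""
    return set(requested) & set(maximal)


cells = [
    ("a", "secret"),
    ("b", "secret | topsecret"),
    ("c", "( secret | topsecret ) & !probationary"),
    ("d", "topsecret"),
]
auths = effective_auths({"secret", "probationary"}, {"secret", "topsecret"})
visible = [value for value, expr in cells if is_visible(expr, auths)]
```

In this example the request asks for "probationary", but that label is outside the user's maximal set and is dropped, so only "secret" remains in the effective set when the cells are filtered.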

Improved WAL Write Throughput (HBASE-8755)
- Introduces a new threading model for WAL writes that reduces lock contention
- Provides better write throughput when under load
  - A ~15% improvement in write ops/sec at high write concurrency
- Lays groundwork for multiple WALs
  - Will provide further write throughput increase
- Also important for limiting the impact of encrypting WAL entries

Stripe Compactions (HBASE-7667)
- Stripe compactions split the data inside the region by row key and create sub-ranges of data
- The sub-ranges are compacted independently
- Depending on ingest and access patterns, using stripe compactions can reduce read latency variability and reduce compaction data volume (write amplification)
- Two use cases in particular may benefit
  1. Approximately uniform keys and large regions
  2. Non-uniform data with sequential row keys (e.g. log data)
- Can be complex to configure and tune, consult the documentation for detail
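The core idea, splitting a region's data into row-key sub-ranges that are compacted independently, can be sketched in a few lines of plain Python. This is a toy model, not the HBase compaction policy: the helper names, the fixed boundary list, and the representation of a store file as a sorted list of (row key, value) pairs are all assumptions for illustration, and details such as versions, deletes, and how flushed files are first staged are ignored.

```python
from bisect import bisect_right
from heapq import merge


def stripe_index(boundaries, row_key):
    """Return the index of the stripe whose key range contains row_key."""
    return bisect_right(boundaries, row_key)


def split_into_stripes(store_file, boundaries):
    """Partition one sorted file of (row_key, value) pairs by stripe."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row_key, value in store_file:
        parts[stripe_index(boundaries, row_key)].append((row_key, value))
    return parts


def compact_stripe(files):
    """Merge the sorted files of a single stripe into one sorted file."""
    return list(merge(*files))


# Two boundaries give three stripes: keys < "h", "h" <= keys < "p", keys >= "p".
boundaries = ["h", "p"]
flushes = [
    [("apple", 1), ("kiwi", 2), ("pear", 3)],
    [("banana", 4), ("lemon", 5)],
]

# Each stripe holds its own list of files.
region = [[] for _ in range(len(boundaries) + 1)]
for flushed in flushes:
    for i, part in enumerate(split_into_stripes(flushed, boundaries)):
        if part:
            region[i].append(part)

# Compact only the first stripe; the other stripes are not rewritten.
region[0] = [compact_stripe(region[0])]
```

Only the files in the compacted stripe are rewritten, which is where the reduction in compaction data volume is meant to come from in this picture.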

MapReduce Over Snapshots (HBASE-8369)
- Introduces MapReduce utilities supporting MR jobs over snapshots of table data
- Similar to TableInputFormat, but instead of running over an online table using the HBase API it runs directly over HFiles on disk collected from a table snapshot
- For performance-dominant use cases where the HBase API cannot provide sufficient throughput
  - Can increase throughput of bulk scanning ~5x by streaming HDFS reads directly to the client
- Caveat: Not recommended from a security perspective
  - Built in access control is completely bypassed
  - It is a risk to open direct access to HFile data in HDFS

REST Streaming Scans (HBASE-9343)
- The REST gateway provides stateful scanners to be consistent with the HBase API, but this is not RESTful
  - Scanner state is not shared across multiple gateways
  - Scanner state will be lost if the gateway fails
- Introduces a new scanning mode to the REST API for stateless scanning
  - The client manages paging and limits
  - Instead of batching results from the RegionServers into multiple HTTP transactions, the stateless scanner can stream all results back to the client over one HTTP connection

Going Forward with branch-0.98
- Bug fixes
- Performance improvements
- Further deprecations and API changes on the way to HBase 1.0
- No more breaking binary API changes allowed
- Tag compression in HFile (HBASE-10451)
- Performance improvements for encryption
- Per family WAL encryption (HBASE-10077)
- Optional native accelerated cryptographic functions

End
Questions?