Data Informatics. Seon Ho Kim, Ph.D.


1 Data Informatics Seon Ho Kim, Ph.D.

2 HBase

3 HBase is...
- A distributed data store that can scale horizontally to thousands of commodity servers and petabytes of indexed storage
- Designed to operate on top of the Hadoop Distributed File System (HDFS)
- Distributed storage
- Table-like data structure
- High scalability
- High availability
- High performance

4 HBase: Part of Hadoop's Ecosystem
- HBase is built on top of HDFS
- HBase files are internally stored in HDFS

5 HBase vs. HDFS
- Both are distributed systems that scale to hundreds or thousands of nodes
- HDFS is good for batch processing (scans over big files)
  - Not good for record lookup
  - Not good for incremental addition of small batches
  - Not good for updates

6 HBase vs. HDFS (Cont'd)
- HBase is designed to efficiently address the above points
  - Fast record lookup
  - Support for record-level insertion
  - Support for updates (not in place)
- HBase updates are done by creating new versions of values

7 HBase vs. HDFS (Cont'd)
- If the application needs neither random reads nor random writes, stick to HDFS

8 HBase Is Not
- Tables have one primary index, the row key.
- No join operators.
- Scans and queries can select a subset of available columns, perhaps by using a wildcard.
- There are three types of lookups:
  - Fast lookup using row key and optional timestamp
  - Full table scan
  - Range scan from region start to end
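The three lookup types can be illustrated with a toy in-memory model. This is only a sketch in plain Python, not the HBase client API: the `ToyTable` class and its method names are hypothetical, row keys are plain strings, and timestamps are left out for brevity.

```python
import bisect


class ToyTable:
    """Toy table sorted by row key (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []  # row keys kept in sorted (lexicographic) order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Fast lookup by row key; returns None if the row does not exist."""
        return self._rows.get(row_key)

    def scan(self, start=None, stop=None):
        """Full table scan (no bounds) or range scan over [start, stop)."""
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = ToyTable()
for key in ["row3", "row1", "row2", "row4"]:
    table.put(key, "cf:a", key.upper())

single = table.get("row2")                           # lookup by row key
full = [k for k, _ in table.scan()]                  # full table scan
ranged = [k for k, _ in table.scan("row2", "row4")]  # range scan
```

Because the only index is the sorted row key, every access pattern in this model reduces to a point lookup or a contiguous slice of the key space.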

9 HBase Is Not (2)
- Limited atomicity and transaction support.
  - HBase supports multiple batched mutations of single rows only.
- Data is unstructured and untyped.
- Not accessed or manipulated via SQL.
  - Programmatic access via Java, REST, or Thrift APIs.
  - Scripting via JRuby.

10 Why HBase?
- It is open source
- It has a good community and promise for the future
- It is developed on top of Hadoop and integrates well with the Hadoop platform, if you are using Hadoop already
- It has a Cascading connector
  - Cascading is a software abstraction layer for Apache Hadoop
  - Cascading is used to create and execute complex data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, Clojure, etc.), hiding the underlying complexity of MapReduce jobs

11 HBase Benefits over RDBMS
- Automatic partitioning
- Scales linearly and automatically with new nodes
- Commodity hardware
- Fault tolerance
- Batch processing
Who is using HBase?
- Tech companies such as Facebook, Yahoo, eBay, etc.
- Analytics companies such as RocketFuel, Flurry, etc.

12 HBase Data Model

13 HBase Data Model
- HBase is based on Google's Bigtable model: key-value pairs

14 HBase: Keys and Column Families
- Each row has a Key
- Each record is divided into Column Families
- Each column family consists of one or more Columns

15
Key
- Byte array
- Serves as the primary key for the table
- Indexed for fast lookup
Column Family
- Has a name (string)
- Contains one or more related columns
Column
- Belongs to one column family
- Included inside the row, written as familyname:columnname

Row key        | Time Stamp | Column contents: | Column anchor:
com.apache.www | t12        | <html>           |
               | t11        | <html>           |
               | t10        |                  | anchor:apache.com = APACHE
com.cnn.www    | t15        |                  | anchor:cnnsi.com = CNN
               | t13        |                  | anchor:my.look.ca = CNN.com
               | t6         | <html>           |
               | t5         | <html>           |
               | t3         | <html>           |

Figure callouts: "contents" and "anchor" are column families; anchor:apache.com is the column named apache.com in the anchor family.

16 Version number for each row
Version Number
- Unique within each key
- By default, the system's timestamp
- Data type is Long
Value (Cell)
- Byte array
[Figure: the example table from slide 15 (rows com.apache.www and com.cnn.www), with callouts on the Time Stamp column and on the cell values]

17 Notes on Data Model
- HBase schema consists of several Tables
- Each table consists of a set of Column Families
  - Columns are not part of the schema
- HBase has Dynamic Columns
  - Because column names are encoded inside the cells
  - Different cells can have different columns
- Figure callout: the "Roles" column family has different columns in different cells

18 Notes on Data Model (Cont'd)
- The version number can be user-supplied
  - Versions do not even have to be inserted in increasing order
  - Version numbers are unique within each key
- Tables can be very sparse
  - Many cells are empty
- Keys are indexed as the primary key
- Figure callout: the anchor family of row com.cnn.www has two columns [cnnsi.com & my.look.ca]
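The data model described on these slides can be pictured as a nested map: row key, then familyname:columnname, then version, then value. The sketch below is a toy model in plain Python, not HBase code; the `ToyCellStore` class is hypothetical, and the integer versions 3, 5, 6, 10, 13, and 15 stand in for the timestamps t3 to t15 of the example table.

```python
import time


class ToyCellStore:
    """row key -> 'family:column' -> {version: value}. Toy model only."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, value, version=None):
        # By default the version is the system timestamp (milliseconds, an integer).
        if version is None:
            version = int(time.time() * 1000)
        # A write never overwrites an older version; it adds a new one.
        self._rows.setdefault(row, {}).setdefault(column, {})[version] = value

    def get(self, row, column, version=None):
        versions = self._rows.get(row, {}).get(column, {})
        if not versions:
            return None  # empty cell: nothing is stored for it
        if version is None:
            version = max(versions)  # newest version wins
        return versions.get(version)

    def columns(self, row):
        # Columns are dynamic: each row carries its own set of column names.
        return sorted(self._rows.get(row, {}))


store = ToyCellStore()
store.put("com.cnn.www", "contents:", b"<html>a", version=3)
store.put("com.cnn.www", "contents:", b"<html>c", version=6)
store.put("com.cnn.www", "contents:", b"<html>b", version=5)  # out of order is fine
store.put("com.cnn.www", "anchor:cnnsi.com", b"CNN", version=15)
store.put("com.cnn.www", "anchor:my.look.ca", b"CNN.com", version=13)
store.put("com.apache.www", "anchor:apache.com", b"APACHE", version=10)

latest = store.get("com.cnn.www", "contents:")
oldest = store.get("com.cnn.www", "contents:", version=3)
cnn_columns = store.columns("com.cnn.www")
apache_columns = store.columns("com.apache.www")
```

Only the column families would be declared up front in a real schema; here the two rows end up with different anchor columns simply because different columns were written to them.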

19 HBase Physical Model

20 HBase Physical Model
- Each column family is stored in separate files (HFiles, also called StoreFiles)
- Key and version numbers are replicated with each column family
- Empty cells are not stored
- HBase maintains a multi-level index on values: <key, column family, column name, timestamp>
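A rough way to picture this physical model is to split a sparse logical table into one sorted store per column family. The following is a minimal sketch under stated assumptions: the rows, families, and values are made up for illustration, `split_by_family` is a hypothetical helper, and real store files are binary and considerably more involved.

```python
from collections import defaultdict

# Logical view: (row key, column family, column name, timestamp, value).
# value=None stands for an unset (empty) cell. All data here is made up.
logical_cells = [
    ("row1", "info", "name", 5, b"alice"),
    ("row1", "roles", "admin", 5, b"1"),
    ("row2", "info", "name", 7, b"bob"),
    ("row2", "info", "email", 7, None),  # unset cell
    ("row2", "roles", "editor", 8, b"1"),
]


def split_by_family(cells):
    """Group cells into one sorted store per column family, skipping unset cells."""
    stores = defaultdict(list)
    for row, family, column, ts, value in cells:
        if value is None:
            continue  # empty cells are not stored
        # The row key and the version are repeated in every stored cell.
        stores[family].append((row, family, column, ts, value))
    for entries in stores.values():
        # Order by row key, then column name, then newest timestamp first.
        entries.sort(key=lambda e: (e[0], e[2], -e[3]))
    return dict(stores)


stores = split_by_family(logical_cells)
```

Each stored entry is addressed by the full tuple <key, column family, column name, timestamp>, which is why sparse tables cost nothing for the cells that were never written.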

21 Example

22 HBase Architecture

23 Three Major Components
- The HBaseMaster
  - One master
- The HRegionServer
  - Many region servers
- The HBase client

24 Members
Region
- A subset of a table's rows, like horizontal range partitioning
Master
- Responsible for monitoring and coordinating region servers
- Load balancing for regions
RegionServer (slaves)
- Serves client requests (write/read/scan)
- Sends heartbeats to the Master
- Throughput and the number of regions scale with the number of region servers

25 HBase Components
Region
- A subset of a table's rows, like horizontal range partitioning
- Automatically done
RegionServer (many slaves)
- Manages data regions
- Serves data for reads and writes (using a log)
Master
- Responsible for coordinating the slaves
- Assigns regions, detects failures
- Admin functions
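Since a region is a contiguous range of row keys, finding the region (and region server) responsible for a row amounts to a search over sorted region start keys. The sketch below illustrates that idea only; the region boundaries and server names are hypothetical, and this is not how the HBase client is actually implemented.

```python
import bisect

# Each region covers [start_key, next region's start_key).
# The first region starts at the empty key. Names and boundaries are made up.
REGIONS = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-1"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate(row_key):
    """Return (region start key, server) for the region containing row_key."""
    # The owning region is the last one whose start key is <= row_key.
    idx = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[idx]


first = locate("apple")
boundary = locate("g")
last = locate("zebra")
```

In this picture, load balancing by the master corresponds to changing which server name is attached to a region, without touching the key ranges themselves.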

26 Architecture

27 ZooKeeper
- HBase depends on ZooKeeper
  - ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
- By default HBase manages the ZooKeeper instance
  - E.g., starts and stops ZooKeeper
- Master and RegionServers register themselves with ZooKeeper

28 HBase Tables

36 Key Design
- HBase has two fundamental key structures
  - Row key
  - Column key
- Both can be used to convey meaning
  - Because they store particularly meaningful data
  - Because their sorting order is important

37 Logical vs. On-disk Layout of a Table
- The main unit of separation within a table is the column family
- The actual columns (unlike in other column-oriented DBs) are not used to separate data
- Although cells are stored logically in a table format, rows are stored as linear sets of cells
- Cells contain all the vital information inside them

39 Logical Layout (Top-Left)
- A table consists of rows and columns
- Columns are the combination of a column family name and a column qualifier
  - <cf name: qualifier> is the column key
- Rows have a row key to address all columns of a single logical row
Folding the Logical Layout (Top-Right)
- The cells of each row are stored one after the other
- Each column family is stored separately
  - On disk, all cells of one family reside in an individual StoreFile
- HBase does not store unset cells
  - Row and column key are required to address every cell

40 Versioning
- Multiple versions of the same cell are stored consecutively, together with the timestamp
- Cells are sorted in descending order of timestamp
  - Newest value first
KeyValue object
- The entire cell, with all the structural information, is a KeyValue object
- Contains: row key, <column family: qualifier> column key, timestamp and value
- Sorted by row key first, then by column key
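The ordering rule described on this slide (row key first, then column key, then newest timestamp first) can be written down as a sort key. This is an illustrative sketch only: `ToyKeyValue` is a hypothetical stand-in for the real KeyValue structure, which is a binary format with more fields than shown here.

```python
from typing import NamedTuple


class ToyKeyValue(NamedTuple):
    """Simplified cell: every cell carries its full coordinates."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    value: bytes


def sort_key(kv):
    # Row key ascending, column key ascending, timestamp descending (newest first).
    return (kv.row, kv.family, kv.qualifier, -kv.timestamp)


cells = [
    ToyKeyValue(b"row2", b"cf", b"a", 1, b"x"),
    ToyKeyValue(b"row1", b"cf", b"b", 2, b"old"),
    ToyKeyValue(b"row1", b"cf", b"b", 7, b"new"),
    ToyKeyValue(b"row1", b"cf", b"a", 5, b"y"),
]

ordered = sorted(cells, key=sort_key)
order = [(kv.row, kv.qualifier, kv.timestamp) for kv in ordered]
```

With this ordering, the first cell found for a given row and column is always its newest version, so a read for the current value can stop at the first match.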

41 Physical Layout (Lower-Right)
- Select data by row key
  - This reduces the amount of data to scan for a row or a range of rows
- Select data by row key and column key
  - This focuses the system on an individual storage file
- Select data by column qualifier
  - Exact lookups, including filters to omit useless data

42 Tall-Narrow vs. Flat-Wide Tables
- Tall-Narrow Tables
  - Few columns
  - Many rows
- Flat-Wide Tables
  - Many columns
  - Few rows
- Given the query granularity explained before
  - Store parts of the cell data in the row key
- Furthermore, HBase splits at row boundaries
- It is recommended to go for Tall-Narrow Tables

43 Tall-Narrow vs. Flat-Wide Tables
- Example: email data - version 1
  - You have all emails of a user in a single row (e.g. userid is the row key)
  - There will be some outliers with orders of magnitude more emails than others
  - A single row could outgrow the maximum file/region size and work against the split facility

44 Tall-Narrow vs. Flat-Wide Tables
- Example: email data - version 2
  - Each email of a user is stored in a separate row (e.g. userid:messageid is the row key)
  - On disk this makes no difference (see the disk layout figure)
  - Whether the messageid is in the column qualifier or in the row key, each cell still contains a single message
  - The table can be split easily and the query granularity is more fine-grained
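The two versions of the example differ only in where the messageid lives. The sketch below produces both layouts from the same made-up inbox data; the column family name `data`, the qualifier `body`, and the helper names are assumptions chosen for illustration.

```python
# Hypothetical inbox data: userid -> {messageid: message body}.
inbox = {
    "user1": {"m001": "hi", "m002": "hello"},
    "user2": {"m001": "hey"},
}


def wide_cells(inbox):
    """Version 1 (flat-wide): row key = userid, column qualifier = messageid."""
    return [
        (user, "data:" + msg_id, body)
        for user, messages in inbox.items()
        for msg_id, body in messages.items()
    ]


def tall_cells(inbox):
    """Version 2 (tall-narrow): row key = userid:messageid, one fixed column."""
    return [
        (user + ":" + msg_id, "data:body", body)
        for user, messages in inbox.items()
        for msg_id, body in messages.items()
    ]


wide = wide_cells(inbox)
tall = tall_cells(inbox)
wide_rows = {row for row, _, _ in wide}
tall_rows = {row for row, _, _ in tall}
```

Both layouts yield the same number of cells, one per message, but the tall layout has one row per message, which gives the table many more row boundaries at which it can split.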

45 Partial Key Scans
- Partial key scans reinforce the concept of Tall-Narrow Tables
- From the example: assume you have a separate row per message, across all users
  - If you don't have the exact combination of user and message ID, you cannot access a particular message
- A partial key scan solves the problem
  - Specify a start and end key
  - The start key is set to the exact userid only, with the end key set at userid+1
  - This triggers the internal lexicographic comparison mechanism
  - Since the table does not have an exact match, it positions the scan at: <userid>:<lowest-messageid>
  - The scan will then iterate over all the messages of that exact user, parse the row key, and get the messageid
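A partial key scan can be sketched over a sorted list of row keys. This is a toy illustration in plain Python rather than the HBase scan API; `next_prefix` and `prefix_scan` are hypothetical helpers, and the sample keys are made up. One assumption worth stating: when user IDs have variable length, the delimiter should be part of the start key, otherwise `user1` would also match `user10`.

```python
import bisect


def next_prefix(prefix: bytes) -> bytes:
    """Smallest key greater than every key starting with prefix ("userid+1").

    Assumes prefix is non-empty and its last byte is not 0xFF.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


def prefix_scan(sorted_keys, prefix):
    """Return all keys in [prefix, next_prefix(prefix)), found by binary search."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, next_prefix(prefix))
    return sorted_keys[lo:hi]


# Row keys are compared lexicographically, byte by byte.
keys = sorted([b"user1:m001", b"user1:m002", b"user10:m001", b"user2:m001"])

# Including the ":" delimiter keeps user10 out of user1's results.
user1_rows = prefix_scan(keys, b"user1:")
message_ids = [key.split(b":", 1)[1] for key in user1_rows]

# Without the delimiter the scan range [user1, user2) also covers user10.
loose_rows = prefix_scan(keys, b"user1")
```

The scan never needs an exact match for the start key: it positions itself at the first key that sorts at or after it, which is the lowest messageid of that user.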

46 Partial Key Scans
- Composite keys and atomicity
  - Following the example: a single user inbox now spans many rows
  - It is no longer possible to modify a single user inbox in one atomic operation
  - Whether this is acceptable depends on the application at hand
