Using IBM Big SQL over HBase, Part 1: Creating tables and loading data


Piotr Pruski, Benjamin Leonhardi, Deepa Remesh, Bruce Brown
February 18, 2014
Information On Demand Session 1687

With IBM's Big SQL technology, you can use InfoSphere BigInsights to query HBase using industry-standard SQL. This two-part series focuses on creating tables, data-loading methods, and query handling. Here in Part 1, learn the fundamentals of IBM's Big SQL technology for Hadoop over HBase by creating tables and examining ways to load data. Follow a basic storyline of migrating a relational table to HBase using Big SQL. Part 2 explores query handling and how to connect to Big SQL via JDBC to run business intelligence and reporting tools, such as BIRT and Cognos.

Introduction

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos and download InfoSphere BigInsights Quick Start Edition now.

This series walks you through using IBM's Big SQL technology with InfoSphere BigInsights to query HBase using standard SQL. Here, you'll see how to migrate a table from a relational database to InfoSphere BigInsights using Big SQL over HBase. You'll also explore how HBase handles row keys and learn about some pitfalls you might encounter. We'll try some useful options, such as pre-creating regions, to see how they can help with data loading and queries, and we'll cover various ways to load data.

This series covers extensive ground, so we've omitted some fundamental information. At least a rudimentary understanding of InfoSphere BigInsights, HBase, and Jaql is assumed (see Related topics for more information about these technologies). You can also download the sample data used in this series.

Background

This exercise uses one table from the Great Outdoors Sales Data Warehouse model (GOSALESDW): SLS_SALES_FACT. Figure 1 shows the details of the table and its primary key information.

Figure 1. SLS_SALES_FACT table

Assume there is an available instance of DB2 that contains the following table, with data preloaded for our migration.

Table/View              Schema      Type  Creation time
SLS_SALES_FACT_10P      DB2INST1    T     ...

  1 record(s) selected.

By issuing the select statement shown below, you can examine how many rows are in the table to ensure that everything is migrated properly later.

db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"

You should expect 44,603 rows in this table, as shown below.

1
-----------
      44603

  1 record(s) selected.

Using the describe command below, examine all the columns and data types contained within this table.

db2 "DESCRIBE TABLE sls_sales_fact_10p"

Listing 1. Data types within this table

Column name          Data type schema  Data type name  Length  Scale  Nulls
ORDER_DAY_KEY        SYSIBM            INTEGER         4       0      Yes
ORGANIZATION_KEY     SYSIBM            INTEGER         4       0      Yes
EMPLOYEE_KEY         SYSIBM            INTEGER         4       0      Yes
RETAILER_KEY         SYSIBM            INTEGER         4       0      Yes
RETAILER_SITE_KEY    SYSIBM            INTEGER         4       0      Yes
PRODUCT_KEY          SYSIBM            INTEGER         4       0      Yes
PROMOTION_KEY        SYSIBM            INTEGER         4       0      Yes
ORDER_METHOD_KEY     SYSIBM            INTEGER         4       0      Yes
SALES_ORDER_KEY      SYSIBM            INTEGER         4       0      Yes
SHIP_DAY_KEY         SYSIBM            INTEGER         4       0      Yes
CLOSE_DAY_KEY        SYSIBM            INTEGER         4       0      Yes
QUANTITY             SYSIBM            INTEGER         4       0      Yes
UNIT_COST            SYSIBM            DECIMAL         19      2      Yes
UNIT_PRICE           SYSIBM            DECIMAL         19      2      Yes
UNIT_SALE_PRICE      SYSIBM            DECIMAL         19      2      Yes
GROSS_MARGIN         SYSIBM            DOUBLE          8       0      Yes
SALE_TOTAL           SYSIBM            DECIMAL         19      2      Yes
GROSS_PROFIT         SYSIBM            DECIMAL         19      2      Yes

  18 record(s) selected.

One-to-one mapping

In this section, we use Big SQL to do a one-to-one mapping of the columns in the relational DB2 table to an HBase table row key and columns. This is not a recommended approach; the goal of the exercise is to demonstrate the inefficiency and pitfalls that can occur with such a mapping.

Big SQL supports both one-to-one and many-to-one mappings. In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single SQL column. In Figure 2, the HBase row key is mapped to the SQL column id. Similarly, the cq_name column within the cf_data column family is mapped to the SQL column name, and so on.

Figure 2. One-to-one mapping

To begin, you can optionally create a schema to keep tables organized. Within the Big SQL (JSQSH) shell, use the create schema command to create a schema named gosalesdw, as shown below.

CREATE SCHEMA gosalesdw;
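To make the Figure 2 mapping concrete before tackling the full fact table, here is a minimal sketch of a one-to-one definition using the figure's illustrative names (the table name demo_customer and the varchar length are our own assumptions, not part of the sample schema):

CREATE HBASE TABLE demo_customer (
  id   int,
  name varchar(30)
)
COLUMN MAPPING (
  key             mapped by (id),    -- HBase row key maps to the single SQL column id
  cf_data:cq_name mapped by (name)   -- one HBase column maps to the single SQL column name
);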

Issue the command shown below in the same Big SQL shell. This DDL statement creates the SQL table with a one-to-one mapping of what is in our relational DB2 source. Notice that all the column names and data types match the source. The column mapping section requires a mapping for the row key; HBase columns are identified using family:qualifier.

Listing 2. HBase columns identified using family:qualifier

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                          mapped by (ORDER_DAY_KEY),
  cf_data:cq_organization_key  mapped by (ORGANIZATION_KEY),
  cf_data:cq_employee_key      mapped by (EMPLOYEE_KEY),
  cf_data:cq_retailer_key      mapped by (RETAILER_KEY),
  cf_data:cq_retailer_site_key mapped by (RETAILER_SITE_KEY),
  cf_data:cq_product_key       mapped by (PRODUCT_KEY),
  cf_data:cq_promotion_key     mapped by (PROMOTION_KEY),
  cf_data:cq_order_method_key  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_sales_order_key   mapped by (SALES_ORDER_KEY),
  cf_data:cq_ship_day_key      mapped by (SHIP_DAY_KEY),
  cf_data:cq_close_day_key     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_quantity          mapped by (QUANTITY),
  cf_data:cq_unit_cost         mapped by (UNIT_COST),
  cf_data:cq_unit_price        mapped by (UNIT_PRICE),
  cf_data:cq_unit_sale_price   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_gross_margin      mapped by (GROSS_MARGIN),
  cf_data:cq_sale_total        mapped by (SALE_TOTAL),
  cf_data:cq_gross_profit      mapped by (GROSS_PROFIT)
);

Big SQL supports a load from source command that can load data from warehouse sources, which we'll use first. Big SQL also supports loading from delimited files using a load hbase command, which we'll use later.

Adding new JDBC drivers

The load from source command uses Sqoop internally to do the load. Therefore, before using the load command from a Big SQL shell, you need to add the driver for the JDBC source to the Sqoop library directory, and then to the JSQSH terminal shared directory. From a Linux terminal, issue the following command (as the InfoSphere BigInsights administrator) to add the JDBC driver JAR file used to access the database to the $SQOOP_HOME/lib directory.

cp /opt/ibm/db2/v10.5/java/db2jcc.jar $SQOOP_HOME/lib
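Before moving on, it's worth confirming that the JAR actually landed in the Sqoop library directory. A quick check from the same Linux terminal:

ls -l $SQOOP_HOME/lib/db2jcc.jar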

From the Big SQL shell, you can examine the drivers loaded for the JSQSH terminal, as shown below.

\drivers

Copy the same DB2 driver to the JSQSH share directory with the following command.

cp /opt/ibm/db2/v10.5/java/db2jcc.jar $BIGINSIGHTS_HOME/bigsql/jsqsh/share/

When a user adds drivers, the Big SQL server must be restarted. You can do this from the web console or by using the following command from the Linux terminal.

stop.sh bigsql && start.sh bigsql

You can verify that the driver was loaded into JSQSH by using the \drivers command, as shown above.

Now that the drivers have been set up, the load can finally take place. The load from source statement extracts data from a source outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table. Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the SLS_SALES_FACT table we have defined in Big SQL.

LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/gosales'
WITH PARAMETERS (user = 'db2inst1', password = 'password')
FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY
INTO HBASE TABLE gosalesdw.sls_sales_fact APPEND;

You should expect to load 44,603 rows, the same number of rows that the select count statement on the original DB2 table verified.

44603 rows affected (total: 1m37.74s)

Try to verify this in Big SQL with a select count statement, as shown below.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;

Notice there is a discrepancy between the results from the load operation and the select count statement.

33
1 row in results(first row: 3.13s; total: 3.13s)

You should also verify from an HBase shell. Issue the count command, as shown below, to verify the number of rows.

count 'gosalesdw.sls_sales_fact'

It should be apparent that the results from the Big SQL statement and the HBase command conform to one another.

33 row(s) in ... seconds

However, this doesn't yet explain why there is a mismatch between the number of loaded rows and the number of retrieved rows when you query the table. The load command (and the insert command, examined later) behaves like an upsert. If a row with the same row key exists, HBase writes the new value as a new version for that column or cell. When querying the table, only the latest value is returned by Big SQL.

In many cases, this behavior can be confusing. In our case, we loaded data with repeating values for the row key from a DB2 table with 44,603 rows, and the load reported 44,603 rows affected. However, the select count(*) showed fewer rows (33, one for each distinct ORDER_DAY_KEY value). No errors are thrown in such scenarios, so it is always recommended to cross-check the number of rows by querying the table, as in our example.

Now that you understand that all the rows are actually versioned in HBase, we can examine a possible way to retrieve all versions of a particular row. First, from the Big SQL shell, issue the following select query with a predicate on the order day key. In the original table, there are most likely many tuples with the same order day key.

SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE order_day_key = ...;

As expected, you retrieve only one row, which is the latest or newest version of the row inserted into HBase with the specified order day key.

organization_key
----------------
           11171

1 row in results

Now, using the HBase shell, you can retrieve previous versions for a row key. Use the following get command to get the top four versions of the row.

get 'gosalesdw.sls_sales_fact', '...', {COLUMN => 'cf_data:cq_organization_key', VERSIONS => 4}

Because the previous command specified only four versions (VERSIONS => 4), you retrieve only four rows in the output, as shown below.

COLUMN                       CELL
cf_data:cq_organization_key  timestamp=..., value=11171
cf_data:cq_organization_key  timestamp=..., value=11171
cf_data:cq_organization_key  timestamp=..., value=11171
cf_data:cq_organization_key  timestamp=..., value=11171
4 row(s) in ... seconds

Optionally, try the same command again, specifying a larger version number (VERSIONS => 100, for example). Either way, this is most likely not the behavior users expect when performing such a migration. Users probably wanted to get all the data into the HBase table without versioned cells.

There are a couple of solutions to this. One is to define the table with a composite row key to enforce uniqueness, which is covered later. Another option, outlined in the next section, is to force each row key to be unique by appending a universally unique identifier (UUID).

One-to-one mapping with a unique clause

Another approach to the migration is to use the force key unique option when creating the table using Big SQL syntax. This option forces the load to add a UUID to the row key, which prevents versioning of cells. However, this method is quite inefficient: it stores more data and also makes queries slower.

Issue the following command in the Big SQL shell. This statement creates the SQL table with a one-to-one mapping of what we have in our relational DB2 source. The DDL is almost identical to that of the previous section (One-to-one mapping), with one exception: the force key unique clause is specified in the column mapping of the row key.

Listing 3. DDL statement

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                          mapped by (ORDER_DAY_KEY) force key unique,
  cf_data:cq_organization_key  mapped by (ORGANIZATION_KEY),
  cf_data:cq_employee_key      mapped by (EMPLOYEE_KEY),
  cf_data:cq_retailer_key      mapped by (RETAILER_KEY),
  cf_data:cq_retailer_site_key mapped by (RETAILER_SITE_KEY),
  cf_data:cq_product_key       mapped by (PRODUCT_KEY),
  cf_data:cq_promotion_key     mapped by (PROMOTION_KEY),
  cf_data:cq_order_method_key  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_sales_order_key   mapped by (SALES_ORDER_KEY),
  cf_data:cq_ship_day_key      mapped by (SHIP_DAY_KEY),
  cf_data:cq_close_day_key     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_quantity          mapped by (QUANTITY),
  cf_data:cq_unit_cost         mapped by (UNIT_COST),
  cf_data:cq_unit_price        mapped by (UNIT_PRICE),
  cf_data:cq_unit_sale_price   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_gross_margin      mapped by (GROSS_MARGIN),
  cf_data:cq_sale_total        mapped by (SALE_TOTAL),
  cf_data:cq_gross_profit      mapped by (GROSS_PROFIT)
);

In One-to-one mapping, you used the load from source command to get the data from the table in the DB2 source into HBase. This may not always be feasible, so we'll explore the load hbase statement instead. The load hbase command loads data into HBase from flat files, perhaps an export of the data from the relational source. Issue the following statement to load data from a file into an InfoSphere BigInsights HBase table.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_unique;

Note that the load hbase command can take an optional list of columns. If no column list is specified, it uses the column ordering in the table definition. The input file can be on DFS or on the local file system where the Big SQL server is running.
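If a load like this ever reports errors or zero rows, a useful first sanity check is to confirm the delimited file really is at the DFS path the command references, for example:

hadoop fs -ls /user/biadmin/gosalesdw/sls_sales_fact.10p.txt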

Once again, you should expect to load 44,603 rows (the same number of rows that the select count statement on the original DB2 table verified).

44603 rows affected (total: 26.95s)

Verify the number of rows loaded with a select count statement, as shown below.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;

This time, there is no discrepancy between the results from the load operation and the select count statement.

44603
1 row in results(first row: 1.61s; total: 1.61s)

Issue the same count from the HBase shell, as shown below, to be sure.

count 'gosalesdw.sls_sales_fact_unique'

The values are consistent across the load, select, and count.

44603 row(s) in ... seconds

As in the previous section, from the Big SQL shell, issue the following select query with a predicate on the order day key.

SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE order_day_key = ...;

In One-to-one mapping, only one row was returned for the specified date. This time, expect to see 1,405 rows: the rows are now forced to be unique by the clause in the create statement, so no versioning is applied.

1405 rows in results(first row: 0.47s; total: 0.58s)

Once again, you can check from the HBase shell whether there are multiple versions of the cells. Issue the following get statement to try to retrieve the top four versions of the row.

get 'gosalesdw.sls_sales_fact_unique', '...', {COLUMN => 'cf_data:cq_organization_key', VERSIONS => 4}

Zero rows are returned, because the plain row key no longer exists: the load appended a UUID to each row key.

COLUMN    CELL
0 row(s) in ... seconds

Therefore, you should instead issue the following HBase command to do a scan instead of a get. It scans the table using the first part of the row key, with scanner specifications of start and stop row values to return only the results we're interested in.

scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '...', STOPROW => '...'}

Notice there are no discrepancies between the results from the Big SQL select and the HBase scan.

1405 row(s) in ... seconds

Many-to-one mapping (composite keys and dense columns)

This section discusses the other option for enforcing uniqueness of the cells: defining a table with a composite row key, also known as many-to-one mapping. In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (a row key or a column). Two terms come up frequently here: composite key and dense column.

A composite key is an HBase row key mapped to multiple SQL columns. A dense column is an HBase column mapped to multiple SQL columns. In Figure 3, the row key contains two parts, userid and account number, and each part corresponds to an SQL column. Similarly, the HBase columns are mapped to multiple SQL columns. Note that you can have a mix: for example, a composite key, a dense column, and a non-dense column in the same table.

Figure 3. Many-to-one mapping

Issue the following DDL statement from the Big SQL shell. It represents all entities from our relational table using a many-to-one mapping. Notice the column mapping section, where multiple columns can be mapped to a single family:qualifier.

Listing 4. DDL statement from Big SQL shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                       mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                                       RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                                       PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_other_keys     mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_quantity       mapped by (QUANTITY),
  cf_data:cq_dollar_values  mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                       GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);

Why do we need many-to-one mapping? HBase stores a lot of information for each value. For each value stored, a key consisting of the row key, column family name, column qualifier, and timestamp is also stored, so a lot of duplicate information is kept.

HBase is verbose and primarily intended for sparse data. In most cases, data in the relational world is not sparse. If you were to store each SQL column individually in HBase, as previously done in this article, the required storage space would grow dramatically: the full row key, family name, and qualifier are repeated for each of the mapped column values in every row. When querying that data back, the query also returns the entire key (row key, column family, and column qualifier) for each value.

For illustration, after loading into this table, we'll examine the storage space for each of the three tables created thus far. Issue the following statement, which loads data from a file into the InfoSphere BigInsights HBase table.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense;

The number of rows loaded into a table with many-to-one mapping remains the same even though we're storing less data. The statement also executes much faster than the previous load for this exact reason.

44603 rows affected (total: 3.42s)

Issue the same statements and commands from the Big SQL and HBase shells as in the previous two sections to verify that the number of rows is the same as in the original dataset. All the results should be the same as before.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;

44603
1 row in results(first row: 0.93s; total: 0.93s)

SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE order_day_key = ...;

1405 rows in results(first row: 0.65s; total: 0.68s)

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '...', STOPROW => '...'}

1405 row(s) in ... seconds

As mentioned, one-to-one mapping uses far more storage space than the same data mapped using composite keys or dense columns, where the HBase row key or HBase columns are made up of multiple relational table columns. HBase repeats the row key, column family name, column name, and timestamp for each column value. For relational data, which is usually dense, this causes an explosion in the required storage space.
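One caveat before comparing directory sizes: freshly loaded rows may still be sitting in the region servers' memstores rather than on disk. Flushing the three tables from the HBase shell first (a standard HBase shell command, not part of the original steps) makes the on-disk numbers more representative:

flush 'gosalesdw.sls_sales_fact'
flush 'gosalesdw.sls_sales_fact_unique'
flush 'gosalesdw.sls_sales_fact_dense'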

Issue the following command as the InfoSphere BigInsights administrator from a Linux terminal to check the directory sizes for the three tables you created.

hadoop fs -du /hbase/

...     hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188    hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
...     hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique

Data collation

All data represented thus far has been stored as strings, the default encoding for HBase tables created by Big SQL. Therefore, numeric data is not collated correctly. HBase uses lexicographic ordering, so you might encounter cases where a query returns wrong results. The following scenario walks through a situation where data is not collated correctly.

Using the Big SQL insert into hbase statement, add the following row to the sls_sales_fact_dense table (previously defined, with data loaded). The value specified for the ORDER_DAY_KEY column, which has data type int, is a larger numerical value that does not conform to any date standard because it contains an extra digit.

INSERT INTO gosalesdw.sls_sales_fact_dense (
  ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
  RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY )
VALUES ( ..., 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue a scan on the table with the following start and stop criteria.

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '...', STOPROW => '...'}

Notice the last three rows/cells returned in the output of the scan. The newly added row shows up even though its integer value is not between the start and stop row values.

...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_dollar_values, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_other_keys, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_quantity, ...
... row(s) in ... seconds

Insert another row into the table with the following command. This time, we're conforming to the date format of YYYYMMDD and incrementing the day by one from the last value returned in the table.

INSERT INTO gosalesdw.sls_sales_fact_dense (
  ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
  RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY )
VALUES ( ..., 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue another scan on the table. Remember to increase the stoprow criteria by one day.

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '...', STOPROW => '...'}

The newly added row is included as part of the result set, and the two inserted rows are ordered by byte-wise comparison of their keys rather than by the numeric value of ORDER_DAY_KEY. This is an example of numeric data that is not collated properly: the rows are stored not in numerical order, as you might expect, but in byte lexicographical order.

...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_dollar_values, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_other_keys, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_quantity, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_dollar_values, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_other_keys, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_quantity, ...
... row(s) in ... seconds
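Because string encoding reduces ordering to byte-wise comparison, you can reproduce the effect entirely outside HBase. A quick illustration with the sort utility (the numbers here are made up for the demonstration):

printf '9\n10\n20070721\n200707211\n' | LC_ALL=C sort    # byte order: 10, 20070721, 200707211, 9
printf '9\n10\n20070721\n200707211\n' | sort -n          # numeric order: 9, 10, 20070721, 200707211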

Many-to-one mapping with binary encoding

Big SQL supports two types of data encoding: string and binary. Each HBase entity can also have its own encoding. For example, a row key can be encoded as a string, while one HBase column is encoded as binary and another as string.

String is the default encoding used in Big SQL HBase tables. The value is converted to a string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity, separators are used to delimit the data. The default separator is the null byte; as it is the lowest byte, it maintains data collation and allows range queries and partial row scans to work correctly.

Binary encoding in Big SQL is sortable, so numeric data, including negative numbers, collates properly. It handles separators internally and avoids the issue of separators occurring within the data by escaping them.

Issue the following DDL statement from the Big SQL shell to create a dense table, as you did in Many-to-one mapping (composite keys and dense columns), but this time override the default encoding to binary.

Listing 5. Override default encoding to binary

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                       mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                                       RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                                       PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_other_keys     mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_quantity       mapped by (QUANTITY),
  cf_data:cq_dollar_values  mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                       GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;

Once again, use the load hbase data command to load the data into the table. This time, we're adding the DISABLE WAL clause. Disabling the write-ahead log (WAL) can speed up writes into HBase. However, it is not a safe option: turning off the WAL can result in data loss if a region server crashes. Another option to speed up the load is to increase the write buffer size.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense_binary
DISABLE WAL;

44603 rows affected (total: 5.54s)

Issue a select statement on the newly created and loaded table with binary encoding, sls_sales_fact_dense_binary.

SELECT * FROM gosalesdw.sls_sales_fact_dense_binary
go -m discard

Note that the go -m discard option is used so the results of the command are not displayed in the terminal.

44603 rows in results(first row: 0.35s; total: 2.89s)

Issue the same select statement on the previous table, sls_sales_fact_dense, which uses string encoding.

SELECT * FROM gosalesdw.sls_sales_fact_dense
go -m discard

44603 rows in results(first row: 0.31s; total: 3.1s)

The key point here is that the query against the binary-encoded table can return faster. (Numeric types are also collated properly.) You will probably not see much, if any, performance difference when working with small datasets.

There is no custom serialization/deserialization logic required for string encoding, making it portable if you want to use another application to read data in HBase tables. A primary use case for string encoding is mapping existing data. Delimited data is a common form of storing data, and it can be easily mapped using Big SQL string encoding. However, parsing strings is expensive, and queries with data encoded as strings are slow. And numeric data is not collated correctly, as shown in the example.

Queries on data encoded as binary have faster response times, and numeric data, including negative numbers, is collated correctly with binary encoding. The downside is that the data is encoded by Big SQL logic, so it might not be portable as-is.

Many-to-one mapping with HBase pre-created regions and external tables

HBase automatically splits regions when they reach a set size limit. In some scenarios, like bulk loading, it is more efficient to pre-create regions so the load operation can take place in parallel. In our example, the sales data covers four months, April through July 2007. You can pre-create regions by specifying splits in the create table command.

In this section, we create a table within the HBase shell with pre-defined splits, at first without using any Big SQL features. Then we'll show how users can map existing data in HBase to Big SQL, which can prove to be a common practice. Creating external tables makes this possible.

Start by issuing the following statement in the HBase shell. The sls_sales_fact_dense_split table will be created with pre-defined region splits for April through July 2007.

Listing 6. sls_sales_fact_dense_split table created with pre-defined region splits

create 'gosalesdw.sls_sales_fact_dense_split',
  { NAME => 'cf_data', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false',
    COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true',
    MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false',
    BLOOMFILTER => 'NONE', TTL => '...', VERSIONS => '...', BLOCKSIZE => '65536'},
  {SPLITS => ['200704', '200705', '200706', '200707']}

Issue the following list command in the HBase shell to verify the newly created table.

list

If you were to list the tables from the Big SQL shell, you would not see this table, because we haven't yet made any association to Big SQL.
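You can confirm this from the Big SQL (JSQSH) shell with its table metadata command (assuming your JSQSH build provides it, as it does for \drivers); the newly created HBase table should be absent from the list:

\tables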

Open a browser and point it to the HBase master web interface. Scroll down and click on the table you just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split, as shown in Figure 4.

Figure 4. Splits

Figure 5 shows the pre-created regions that we defined when creating the table.

Figure 5. Pre-created regions

Execute the create external hbase command below to map the existing table you just created in HBase to Big SQL. With the create external hbase command, the create table statement lets you specify a different name for the SQL table through the hbase table name clause. Using external tables, you can also create multiple views of the same HBase table; for example, one table can map to a few columns and another table to another set of columns.

The column mapping section of the create table statement also allows you to specify a different separator for each column and for the row key. Finally, you can map tables created using the Hive HBase storage handler, which cannot be directly read using the Big SQL storage handler.

Listing 7. Mapping the existing HBase table as an external table

CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                       mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                                       RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                                       PROMOTION_KEY, ORDER_METHOD_KEY) SEPARATOR '-',
  cf_data:cq_other_keys     mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY)
                                       SEPARATOR '/',
  cf_data:cq_quantity       mapped by (QUANTITY),
  cf_data:cq_dollar_values  mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                       GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) SEPARATOR ' '
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';

The data in external tables is not validated at creation time. For example, if a column in the external table contains data whose separators are incorrectly defined, the query results will be unpredictable. Note that external tables are not owned by Big SQL and, hence, cannot be dropped via Big SQL. Also, secondary indices cannot be created via Big SQL on external tables.

Use the following command to load the external table we have defined.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split;

44603 rows affected (total: 1m57.2s)

Verify that the number of rows loaded is the same number of rows returned by querying the external SQL table.

SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;

44603
1 row in results(first row: 6.44s; total: 6.46s)

Verify the same from the HBase shell, directly on the underlying HBase table.

count 'gosalesdw.sls_sales_fact_dense_split'

44603 row(s) in ... seconds

Issue a get command from the HBase shell, specifying the row key as follows. Notice the separator between each part of the row key is a hyphen (-), as we defined when originally creating the external table.

get 'gosalesdw.sls_sales_fact_dense_split', '...'

In the following output, you can also see the other separators we defined for the external table, such as the / within cq_other_keys.

COLUMN                    CELL
cf_data:cq_dollar_values  timestamp=..., value=...
cf_data:cq_other_keys     timestamp=..., value=481896/.../...
cf_data:cq_quantity       timestamp=..., value=25
3 row(s) in ... seconds

Of course, in Big SQL, you don't need to specify the separators, such as -, when querying against the table, as in the command below.

SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split
WHERE ORDER_DAY_KEY = ... AND ORGANIZATION_KEY = ... AND EMPLOYEE_KEY = 4428
AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND PRODUCT_KEY = ...
AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY = 605;

Handling errors with load data

How do you handle errors during the load operation? The load hbase command has an option to continue past errors: you can use the LOG ERROR ROWS IN FILE clause to specify a file in which to log any rows that could not be loaded because of errors. A few common errors are invalid numeric types and, for string encoding, a separator occurring within the data.

Examine the deliberately malformed input file, which contains rows with non-numeric values (such as a and b) in integer columns, as well as a separator character appearing within the data, which is an issue for string encoding.

hadoop fs -cat /user/biadmin/gosalesdw/sls_sales_fact_badload.txt

Knowing there are errors in the input data, go ahead and issue the following load command, specifying a directory and file in which to put the bad rows.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact_badload.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split
LOG ERROR ROWS IN FILE '/tmp/sls_sales_fact_load.err';

In this example, four rows did not get loaded because of errors. The load command reports only the rows that passed through it.

1 row affected (total: 2.74s)

Examine the file specified in the load command to view the rows that were not loaded.

hadoop fs -cat /tmp/sls_sales_fact_load.err

" a","11171",...
"b ","11171",...
"...","11171",...
"...","11-71",...

Summary

The examples in this article have shown how to create tables and various ways to load data. We covered different types of one-to-one mapping and many-to-one mapping. Part 2 of this series covers query handling and how to connect to Big SQL via JDBC to run business reports with tools such as BIRT or Cognos.

Acknowledgments

Thanks to Uttam Jain for his contributions to this series.

Downloadable resources

Description                                 Name                   Size
Data samples                                IBD-1687A_Data.zip     6KB
Pre-created BIRT Report                     Orders.rptdesign.zip   8KB
Presentation on Big SQL over HBase (1)      IBD-1687A.pdf          3MB

Note 1. This article is derived from a presentation at Information On Demand Session 1687, Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL.

Related topics

Learn more about BigInsights 2.1 from the BigInsights Information Center.
Check out "What's the big deal about Big SQL?" for an introduction to Big SQL.
Read "Understanding InfoSphere BigInsights" to learn more about the product's architecture and underlying technologies.
Get a technical introduction to Big SQL on Slideshare.
Learn more about HBase at Apache.org.
Get familiar with the Cognos sample GOSALES databases by accessing the product's Information Center.
Download a free native software installation copy of InfoSphere BigInsights 2.1 Quick Start Edition (sign-in required).

Copyright IBM Corporation 2014


More information

Because databases are not easily accessible by Hadoop, Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external

Because databases are not easily accessible by Hadoop, Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external Because databases are not easily accessible by Hadoop, Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external structured datastores. The popularity of Sqoop in enterprise

More information

OKC MySQL Users Group

OKC MySQL Users Group OKC MySQL Users Group OKC MySQL Discuss topics about MySQL and related open source RDBMS Discuss complementary topics (big data, NoSQL, etc) Help to grow the local ecosystem through meetups and events

More information

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Scaling Up HBase Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials

More information

Mastering phpmyadmiri 3.4 for

Mastering phpmyadmiri 3.4 for Mastering phpmyadmiri 3.4 for Effective MySQL Management A complete guide to getting started with phpmyadmin 3.4 and mastering its features Marc Delisle [ t]open so 1 I community experience c PUBLISHING

More information

7. Query Processing and Optimization

7. Query Processing and Optimization 7. Query Processing and Optimization Processing a Query 103 Indexing for Performance Simple (individual) index B + -tree index Matching index scan vs nonmatching index scan Unique index one entry and one

More information

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1 Basic Concepts :- 1. What is Data? Data is a collection of facts from which conclusion may be drawn. In computer science, data is anything in a form suitable for use with a computer. Data is often distinguished

More information

Make sure you have the latest Hive trunk by running svn up in your Hive directory. More detailed instructions on downloading and setting up

Make sure you have the latest Hive trunk by running svn up in your Hive directory. More detailed instructions on downloading and setting up GenericUDAFCaseStudy Writing GenericUDAFs: A Tutorial User-Defined Aggregation Functions (UDAFs) are an excellent way to integrate advanced data-processing into Hive. Hive allows two varieties of UDAFs:

More information

New Features Summary. SAP Sybase Event Stream Processor 5.1 SP02

New Features Summary. SAP Sybase Event Stream Processor 5.1 SP02 Summary SAP Sybase Event Stream Processor 5.1 SP02 DOCUMENT ID: DC01616-01-0512-01 LAST REVISED: April 2013 Copyright 2013 by Sybase, Inc. All rights reserved. This publication pertains to Sybase software

More information

File Structures and Indexing

File Structures and Indexing File Structures and Indexing CPS352: Database Systems Simon Miner Gordon College Last Revised: 10/11/12 Agenda Check-in Database File Structures Indexing Database Design Tips Check-in Database File Structures

More information

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 Your database Holds a lot of really valuable data! Many structured tables of several hundred GB Provides fast access

More information

Installing IBM InfoSphere BigInsights Quick Start Edition

Installing IBM InfoSphere BigInsights Quick Start Edition Installing IBM InfoSphere BigInsights Quick Start Edition 1. System requirements Pr. Imade Benelallam Imade.benelallam@ieee.org Before you download, ensure that your system meets the minimum requirements:

More information

Release Notes. Installing. Upgrading. Registration Keys

Release Notes. Installing. Upgrading. Registration Keys Release Notes The SmartList Builder upgrade follows the supported upgrade paths from 2010 and 2013 that are available for Microsoft Dynamics GP. You can find these on PartnerSource or CustomerSource in

More information

Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612

Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612 Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612 Google Bigtable 2 A distributed storage system for managing structured data that is designed to scale to a very

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Column Stores and HBase. Rui LIU, Maksim Hrytsenia

Column Stores and HBase. Rui LIU, Maksim Hrytsenia Column Stores and HBase Rui LIU, Maksim Hrytsenia December 2017 Contents 1 Hadoop 2 1.1 Creation................................ 2 2 HBase 3 2.1 Column Store Database....................... 3 2.2 HBase

More information

DB2 for z/os Stored Procedure support in Data Server Manager

DB2 for z/os Stored Procedure support in Data Server Manager DB2 for z/os Stored Procedure support in Data Server Manager This short tutorial walks you step-by-step, through a scenario where a DB2 for z/os application developer creates a query, explains and tunes

More information

Set up and use federation in InfoSphere BigInsights Big SQL V3.0

Set up and use federation in InfoSphere BigInsights Big SQL V3.0 Set up and use federation in InfoSphere BigInsights Big SQL Mara Elisa de Paiva Fernandes Matias February 04, 2015 (First published July 08, 2014) Big SQL supports federation to many data sources, including

More information

Sqoop In Action. Lecturer:Alex Wang QQ: QQ Communication Group:

Sqoop In Action. Lecturer:Alex Wang QQ: QQ Communication Group: Sqoop In Action Lecturer:Alex Wang QQ:532500648 QQ Communication Group:286081824 Aganda Setup the sqoop environment Import data Incremental import Free-Form Query Import Export data Sqoop and Hive Apache

More information

IBM Campaign Version-independent Integration with IBM Watson Campaign Automation Version 1 Release 1.5 February, Integration Guide IBM

IBM Campaign Version-independent Integration with IBM Watson Campaign Automation Version 1 Release 1.5 February, Integration Guide IBM IBM Campaign Version-independent Integration with IBM Watson Campaign Automation Version 1 Release 1.5 February, 2018 Integration Guide IBM Note Before using this information and the product it supports,

More information

HBase. Леонид Налчаджи

HBase. Леонид Налчаджи HBase Леонид Налчаджи leonid.nalchadzhi@gmail.com HBase Overview Table layout Architecture Client API Key design 2 Overview 3 Overview NoSQL Column oriented Versioned 4 Overview All rows ordered by row

More information

Enable Spark SQL on NoSQL Hbase tables with HSpark IBM Code Tech Talk. February 13, 2018

Enable Spark SQL on NoSQL Hbase tables with HSpark IBM Code Tech Talk. February 13, 2018 Enable Spark SQL on NoSQL Hbase tables with HSpark IBM Code Tech Talk February 13, 2018 https://developer.ibm.com/code/techtalks/enable-spark-sql-onnosql-hbase-tables-with-hspark-2/ >> MARC-ARTHUR PIERRE

More information

Representing Data Elements

Representing Data Elements Representing Data Elements Week 10 and 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 18.3.2002 by Hector Garcia-Molina, Vera Goebel INF3100/INF4100 Database Systems Page

More information

Andreas Weininger,

Andreas Weininger, External Tables: New Options not just for Loading Data Andreas Weininger, IBM Andreas.Weininger@de.ibm.com @aweininger Agenda Why external tables? What are the alternatives? How to use external tables

More information

Designing dashboards for performance. Reference deck

Designing dashboards for performance. Reference deck Designing dashboards for performance Reference deck Basic principles 1. Everything in moderation 2. If it isn t fast in database, it won t be fast in Tableau 3. If it isn t fast in desktop, it won t be

More information

In this exercise, you will import orders table from MySQL database. into HDFS. Get acquainted with some of basic commands of Sqoop

In this exercise, you will import orders table from MySQL database. into HDFS. Get acquainted with some of basic commands of Sqoop Practice Using Sqoop Data Files: ~/labs/sql/retail_db.sql MySQL database: retail_db In this exercise, you will import orders table from MySQL database into HDFS. Get acquainted with some of basic commands

More information

Sepand Gojgini. ColumnStore Index Primer

Sepand Gojgini. ColumnStore Index Primer Sepand Gojgini ColumnStore Index Primer SQLSaturday Sponsors! Titanium & Global Partner Gold Silver Bronze Without the generosity of these sponsors, this event would not be possible! Please, stop by the

More information

Tuning the Hive Engine for Big Data Management

Tuning the Hive Engine for Big Data Management Tuning the Hive Engine for Big Data Management Copyright Informatica LLC 2017. Informatica, the Informatica logo, Big Data Management, PowerCenter, and PowerExchange are trademarks or registered trademarks

More information

Business Intelligence Exchange (BIX)

Business Intelligence Exchange (BIX) Business Intelligence Exchange (BIX) Release Notes Version 2.3 And Version 2.3 SP1 February, 2012 Framework Overview The Business Intelligence Exchange (BIX) extracts information from a PRPC database into

More information

Installing Data Sync Version 2.3

Installing Data Sync Version 2.3 Oracle Cloud Data Sync Readme Release 2.3 DSRM-230 May 2017 Readme for Data Sync This Read Me describes changes, updates, and upgrade instructions for Data Sync Version 2.3. Topics: Installing Data Sync

More information

DEC 31, HareDB HBase Client Web Version ( X & Xs) USER MANUAL. HareDB Team

DEC 31, HareDB HBase Client Web Version ( X & Xs) USER MANUAL. HareDB Team DEC 31, 2016 HareDB HBase Client Web Version (1.120.02.X & 1.120.02.Xs) USER MANUAL HareDB Team Index New features:... 3 Environment requirements... 3 Download... 3 Overview... 5 Connect to a cluster...

More information

Chapter 8: Working With Databases & Tables

Chapter 8: Working With Databases & Tables Chapter 8: Working With Databases & Tables o Working with Databases & Tables DDL Component of SQL Databases CREATE DATABASE class; o Represented as directories in MySQL s data storage area o Can t have

More information

Db2 Alter Table Alter Column Set Data Type Char

Db2 Alter Table Alter Column Set Data Type Char Db2 Alter Table Alter Column Set Data Type Char I am trying to do 2 alters to a column in DB2 in the same alter command, and it doesn't seem to like my syntax alter table tbl alter column col set data

More information

Module 1: Information Extraction

Module 1: Information Extraction Module 1: Information Extraction Introduction to GATE Developer The University of Sheffield, 1995-2014 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence About

More information