Using IBM Big SQL over HBase, Part 1: Creating tables and loading data


Piotr Pruski, Benjamin Leonhardi, Deepa Remesh, Bruce Brown
February 18, 2014
Information On Demand Session 1687

With IBM's Big SQL technology, you can use InfoSphere BigInsights to query HBase using industry-standard SQL. This two-part series focuses on creating tables, data-loading methods, and query handling. Here in Part 1, learn the fundamentals of IBM's Big SQL technology for Hadoop over HBase by creating tables and examining ways to load data. Follow a basic storyline of migrating a relational table to HBase using Big SQL. Part 2 explores query handling and how to connect to Big SQL via JDBC to run business intelligence and reporting tools, such as BIRT and Cognos.

Introduction

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos and download InfoSphere BigInsights Quick Start Edition now.

This series walks you through using IBM's Big SQL technology with InfoSphere BigInsights to query HBase using standard SQL. Here, you'll see how to migrate a table from a relational database to InfoSphere BigInsights using Big SQL over HBase. You'll also explore how HBase handles row keys and learn about some pitfalls you might encounter. We'll try some useful options, such as pre-creating regions, to see how they can help with data loading and queries, and we'll cover various ways to load data.

This series covers extensive ground, so we've omitted some fundamental information. At least a rudimentary understanding of InfoSphere BigInsights, HBase, and Jaql is assumed (see Related topics for more information about these technologies). You can also download the sample data used in this series.

Background

This exercise uses one table from the Great Outdoors Sales Data Warehouse model (GOSALESDW): SLS_SALES_FACT. Figure 1 shows the details of the table and its primary key information.

Figure 1. SLS_SALES_FACT table

Assume there is an available instance of DB2 that contains the following table, with data preloaded for our migration.

Table/View              Schema      Type  Creation time
SLS_SALES_FACT_10P      DB2INST1    T     ...

  1 record(s) selected.

By issuing the select statement shown below, you can examine how many rows are in the table to ensure that everything is migrated properly later.

db2 "SELECT COUNT(*) FROM sls_sales_fact_10p"

You should expect 44,603 rows in this table, as shown below.

1
-----------
      44603

  1 record(s) selected.

Using the describe command below, examine all the columns and data types contained within this table.

db2 "DESCRIBE TABLE sls_sales_fact_10p"

Listing 1. Data types within this table

Column name          Data type schema  Data type name  Length  Scale  Nulls
ORDER_DAY_KEY        SYSIBM            INTEGER         4       0      Yes
ORGANIZATION_KEY     SYSIBM            INTEGER         4       0      Yes
EMPLOYEE_KEY         SYSIBM            INTEGER         4       0      Yes
RETAILER_KEY         SYSIBM            INTEGER         4       0      Yes
RETAILER_SITE_KEY    SYSIBM            INTEGER         4       0      Yes
PRODUCT_KEY          SYSIBM            INTEGER         4       0      Yes
PROMOTION_KEY        SYSIBM            INTEGER         4       0      Yes
ORDER_METHOD_KEY     SYSIBM            INTEGER         4       0      Yes
SALES_ORDER_KEY      SYSIBM            INTEGER         4       0      Yes
SHIP_DAY_KEY         SYSIBM            INTEGER         4       0      Yes
CLOSE_DAY_KEY        SYSIBM            INTEGER         4       0      Yes
QUANTITY             SYSIBM            INTEGER         4       0      Yes
UNIT_COST            SYSIBM            DECIMAL         19      2      Yes
UNIT_PRICE           SYSIBM            DECIMAL         19      2      Yes
UNIT_SALE_PRICE      SYSIBM            DECIMAL         19      2      Yes
GROSS_MARGIN         SYSIBM            DOUBLE          8       0      Yes
SALE_TOTAL           SYSIBM            DECIMAL         19      2      Yes
GROSS_PROFIT         SYSIBM            DECIMAL         19      2      Yes

  18 record(s) selected.

One-to-one mapping

In this section, we use Big SQL to do a one-to-one mapping of the columns in the relational DB2 table to an HBase table row key and columns. This is not a recommended approach; the goal of the exercise is to demonstrate the inefficiency and pitfalls that can occur with such a mapping.

Big SQL supports both one-to-one and many-to-one mappings. In a one-to-one mapping, the HBase row key and each HBase column are mapped to a single SQL column. In Figure 2, the HBase row key is mapped to the SQL column id. Similarly, the cq_name column within the cf_data column family is mapped to the SQL column name, and so on.

Figure 2. One-to-one mapping

To begin, you can optionally create a schema to keep tables organized. Within the Big SQL (JSQSH) shell, use the create schema command to create a schema named gosalesdw, as shown below.

CREATE SCHEMA gosalesdw;
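To make the Figure 2 mapping concrete before tackling the full fact table, here is a minimal sketch of a one-to-one definition using the figure's illustrative names (the table name demo_customer and the varchar length are our own assumptions, not part of the sample schema):

CREATE HBASE TABLE demo_customer (
  id   int,
  name varchar(30)
)
COLUMN MAPPING (
  key             mapped by (id),    -- HBase row key maps to the single SQL column id
  cf_data:cq_name mapped by (name)   -- one HBase column maps to the single SQL column name
);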

Issue the command shown below in the same Big SQL shell. This DDL statement creates the SQL table with a one-to-one mapping of what is in our relational DB2 source. Notice that all the column names and data types match the source. The column mapping section requires a mapping for the row key; HBase columns are identified using family:qualifier.

Listing 2. HBase columns identified using family:qualifier

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                          mapped by (ORDER_DAY_KEY),
  cf_data:cq_organization_key  mapped by (ORGANIZATION_KEY),
  cf_data:cq_employee_key      mapped by (EMPLOYEE_KEY),
  cf_data:cq_retailer_key      mapped by (RETAILER_KEY),
  cf_data:cq_retailer_site_key mapped by (RETAILER_SITE_KEY),
  cf_data:cq_product_key       mapped by (PRODUCT_KEY),
  cf_data:cq_promotion_key     mapped by (PROMOTION_KEY),
  cf_data:cq_order_method_key  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_sales_order_key   mapped by (SALES_ORDER_KEY),
  cf_data:cq_ship_day_key      mapped by (SHIP_DAY_KEY),
  cf_data:cq_close_day_key     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_quantity          mapped by (QUANTITY),
  cf_data:cq_unit_cost         mapped by (UNIT_COST),
  cf_data:cq_unit_price        mapped by (UNIT_PRICE),
  cf_data:cq_unit_sale_price   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_gross_margin      mapped by (GROSS_MARGIN),
  cf_data:cq_sale_total        mapped by (SALE_TOTAL),
  cf_data:cq_gross_profit      mapped by (GROSS_PROFIT)
);

Big SQL supports a load from source command that can load data from warehouse sources, which we'll use first. Big SQL also supports loading from delimited files using a load hbase command, which we'll use later.

Adding new JDBC drivers

The load from source command uses Sqoop internally to do the load. Therefore, before using the load command from a Big SQL shell, you need to add the driver for the JDBC source to the Sqoop library directory, and then to the JSQSH terminal shared directory. From a Linux terminal, issue the following command (as the InfoSphere BigInsights administrator) to add the JDBC driver JAR file used to access the database to the $SQOOP_HOME/lib directory.

cp /opt/ibm/db2/v10.5/java/db2jcc.jar $SQOOP_HOME/lib
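Before moving on, it's worth confirming that the JAR actually landed in the Sqoop library directory. A quick check from the same Linux terminal:

ls -l $SQOOP_HOME/lib/db2jcc.jar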

From the Big SQL shell, you can examine the drivers loaded for the JSQSH terminal, as shown below.

\drivers

Copy the same DB2 driver to the JSQSH share directory with the following command.

cp /opt/ibm/db2/v10.5/java/db2jcc.jar $BIGINSIGHTS_HOME/bigsql/jsqsh/share/

When a user adds drivers, the Big SQL server must be restarted. You can do this from the web console or by using the following command from the Linux terminal.

stop.sh bigsql && start.sh bigsql

You can verify that the driver was loaded into JSQSH by using the \drivers command, as shown above.

Now that the drivers have been set up, the load can finally take place. The load from source statement extracts data from a source outside of an InfoSphere BigInsights cluster (DB2 in this case) and loads that data into an InfoSphere BigInsights HBase (or Hive) table. Issue the following command to load the SLS_SALES_FACT_10P table from DB2 into the SLS_SALES_FACT table we have defined in Big SQL.

LOAD USING JDBC CONNECTION URL 'jdbc:db2://localhost:50000/gosales'
WITH PARAMETERS (user = 'db2inst1', password = 'password')
FROM TABLE SLS_SALES_FACT_10P SPLIT COLUMN ORDER_DAY_KEY
INTO HBASE TABLE gosalesdw.sls_sales_fact APPEND;

You should expect to load 44,603 rows, the same number of rows that the select count statement on the original DB2 table verified.

44603 rows affected (total: 1m37.74s)

Try to verify this in Big SQL with a select count statement, as shown below.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact;

Notice there is a discrepancy between the results from the load operation and the select count statement.

33
1 row in results(first row: 3.13s; total: 3.13s)

You should also verify from an HBase shell. Issue the count command, as shown below, to verify the number of rows.

count 'gosalesdw.sls_sales_fact'

It should be apparent that the results from the Big SQL statement and the HBase command conform to one another.

33 row(s) in ... seconds

However, this doesn't yet explain why there is a mismatch between the number of loaded rows and the number of retrieved rows when you query the table. The load command (and the insert command, examined later) behaves like an upsert. If a row with the same row key exists, HBase writes the new value as a new version for that column or cell. When querying the table, only the latest value is returned by Big SQL.

In many cases, this behavior can be confusing. In our case, we loaded data with repeating values for the row key from a DB2 table with 44,603 rows, and the load reported 44,603 rows affected. However, the select count(*) showed fewer rows (33, one for each distinct ORDER_DAY_KEY value). No errors are thrown in such scenarios, so it is always recommended to cross-check the number of rows by querying the table, as in our example.

Now that you understand that all the rows are actually versioned in HBase, we can examine a possible way to retrieve all versions of a particular row. First, from the Big SQL shell, issue the following select query with a predicate on the order day key. In the original table, there are most likely many tuples with the same order day key.

SELECT organization_key FROM gosalesdw.sls_sales_fact WHERE order_day_key = ...;

As expected, you retrieve only one row, which is the latest or newest version of the row inserted into HBase with the specified order day key.

organization_key
----------------
           11171

1 row in results

Now, using the HBase shell, you can retrieve previous versions for a row key. Use the following get command to get the top four versions of the row.

get 'gosalesdw.sls_sales_fact', '...', {COLUMN => 'cf_data:cq_organization_key', VERSIONS => 4}

Because the previous command specified only four versions (VERSIONS => 4), you retrieve only four rows in the output, as shown below.

COLUMN                       CELL
cf_data:cq_organization_key  timestamp=..., value=11171
cf_data:cq_organization_key  timestamp=..., value=11171
cf_data:cq_organization_key  timestamp=..., value=11171
cf_data:cq_organization_key  timestamp=..., value=11171
4 row(s) in ... seconds

Optionally, try the same command again, specifying a larger version number (VERSIONS => 100, for example). Either way, this is most likely not the behavior users expect when performing such a migration. Users probably wanted to get all the data into the HBase table without versioned cells.

There are a couple of solutions to this. One is to define the table with a composite row key to enforce uniqueness, which is covered later. Another option, outlined in the next section, is to force each row key to be unique by appending a universally unique identifier (UUID).

One-to-one mapping with a unique clause

Another approach to the migration is to use the force key unique option when creating the table using Big SQL syntax. This option forces the load to add a UUID to the row key, which prevents versioning of cells. However, this method is quite inefficient: it stores more data and also makes queries slower.

Issue the following command in the Big SQL shell. This statement creates the SQL table with a one-to-one mapping of what we have in our relational DB2 source. The DDL is almost identical to that of the previous section (One-to-one mapping), with one exception: the force key unique clause is specified in the column mapping of the row key.

Listing 3. DDL statement

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_UNIQUE (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                          mapped by (ORDER_DAY_KEY) force key unique,
  cf_data:cq_organization_key  mapped by (ORGANIZATION_KEY),
  cf_data:cq_employee_key      mapped by (EMPLOYEE_KEY),
  cf_data:cq_retailer_key      mapped by (RETAILER_KEY),
  cf_data:cq_retailer_site_key mapped by (RETAILER_SITE_KEY),
  cf_data:cq_product_key       mapped by (PRODUCT_KEY),
  cf_data:cq_promotion_key     mapped by (PROMOTION_KEY),
  cf_data:cq_order_method_key  mapped by (ORDER_METHOD_KEY),
  cf_data:cq_sales_order_key   mapped by (SALES_ORDER_KEY),
  cf_data:cq_ship_day_key      mapped by (SHIP_DAY_KEY),
  cf_data:cq_close_day_key     mapped by (CLOSE_DAY_KEY),
  cf_data:cq_quantity          mapped by (QUANTITY),
  cf_data:cq_unit_cost         mapped by (UNIT_COST),
  cf_data:cq_unit_price        mapped by (UNIT_PRICE),
  cf_data:cq_unit_sale_price   mapped by (UNIT_SALE_PRICE),
  cf_data:cq_gross_margin      mapped by (GROSS_MARGIN),
  cf_data:cq_sale_total        mapped by (SALE_TOTAL),
  cf_data:cq_gross_profit      mapped by (GROSS_PROFIT)
);

In One-to-one mapping, you used the load from source command to get the data from the table in the DB2 source into HBase. This may not always be feasible, so we'll explore the load hbase statement instead. The load hbase command loads data into HBase from flat files, perhaps an export of the data from the relational source. Issue the following statement to load data from a file into an InfoSphere BigInsights HBase table.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_unique;

Note that the load hbase command can take an optional list of columns. If no column list is specified, it uses the column ordering in the table definition. The input file can be on DFS or on the local file system where the Big SQL server is running.
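If a load like this ever reports errors or zero rows, a useful first sanity check is to confirm the delimited file really is at the DFS path the command references, for example:

hadoop fs -ls /user/biadmin/gosalesdw/sls_sales_fact.10p.txt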

Once again, you should expect to load 44,603 rows (the same number of rows that the select count statement on the original DB2 table verified).

44603 rows affected (total: 26.95s)

Verify the number of rows loaded with a select count statement, as shown below.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_unique;

This time, there is no discrepancy between the results from the load operation and the select count statement.

44603
1 row in results(first row: 1.61s; total: 1.61s)

Issue the same count from the HBase shell, as shown below, to be sure.

count 'gosalesdw.sls_sales_fact_unique'

The values are consistent across the load, select, and count.

44603 row(s) in ... seconds

As in the previous section, from the Big SQL shell, issue the following select query with a predicate on the order day key.

SELECT organization_key FROM gosalesdw.sls_sales_fact_unique WHERE order_day_key = ...;

In One-to-one mapping, only one row was returned for the specified date. This time, expect to see 1,405 rows: the rows are now forced to be unique by the clause in the create statement, so no versioning is applied.

1405 rows in results(first row: 0.47s; total: 0.58s)

Once again, you can check from the HBase shell whether there are multiple versions of the cells. Issue the following get statement to try to retrieve the top four versions of the row.

get 'gosalesdw.sls_sales_fact_unique', '...', {COLUMN => 'cf_data:cq_organization_key', VERSIONS => 4}

Zero rows are returned, because the plain row key no longer exists: the load appended a UUID to each row key.

COLUMN    CELL
0 row(s) in ... seconds

Therefore, you should instead issue the following HBase command to do a scan instead of a get. It scans the table using the first part of the row key, with scanner specifications of start and stop row values to return only the results we're interested in.

scan 'gosalesdw.sls_sales_fact_unique', {STARTROW => '...', STOPROW => '...'}

Notice there are no discrepancies between the results from the Big SQL select and the HBase scan.

1405 row(s) in ... seconds

Many-to-one mapping (composite keys and dense columns)

This section discusses the other option for enforcing uniqueness of the cells: defining a table with a composite row key, also known as many-to-one mapping. In a many-to-one mapping, multiple SQL columns are mapped to a single HBase entity (a row key or a column). Two terms come up frequently here: composite key and dense column.

A composite key is an HBase row key mapped to multiple SQL columns. A dense column is an HBase column mapped to multiple SQL columns. In Figure 3, the row key contains two parts, userid and account number, and each part corresponds to an SQL column. Similarly, the HBase columns are mapped to multiple SQL columns. Note that you can have a mix: for example, a composite key, a dense column, and a non-dense column in the same table.

Figure 3. Many-to-one mapping

Issue the following DDL statement from the Big SQL shell. It represents all entities from our relational table using a many-to-one mapping. Notice the column mapping section, where multiple columns can be mapped to a single family:qualifier.

Listing 4. DDL statement from Big SQL shell

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                       mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                                       RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                                       PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_other_keys     mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_quantity       mapped by (QUANTITY),
  cf_data:cq_dollar_values  mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                       GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
);

Why do we need many-to-one mapping? HBase stores a lot of information for each value. For each value stored, a key consisting of the row key, column family name, column qualifier, and timestamp is also stored, so a lot of duplicate information is kept.

HBase is verbose and primarily intended for sparse data. In most cases, data in the relational world is not sparse. If you were to store each SQL column individually in HBase, as previously done in this article, the required storage space would grow dramatically: the full row key, family name, and qualifier are repeated for each of the mapped column values in every row. When querying that data back, the query also returns the entire key (row key, column family, and column qualifier) for each value.

For illustration, after loading into this table, we'll examine the storage space for each of the three tables created thus far. Issue the following statement, which loads data from a file into the InfoSphere BigInsights HBase table.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense;

The number of rows loaded into a table with many-to-one mapping remains the same even though we're storing less data. The statement also executes much faster than the previous load for this exact reason.

44603 rows affected (total: 3.42s)

Issue the same statements and commands from the Big SQL and HBase shells as in the previous two sections to verify that the number of rows is the same as in the original dataset. All the results should be the same as before.

SELECT COUNT(*) FROM gosalesdw.sls_sales_fact_dense;

44603
1 row in results(first row: 0.93s; total: 0.93s)

SELECT organization_key FROM gosalesdw.sls_sales_fact_dense WHERE order_day_key = ...;

1405 rows in results(first row: 0.65s; total: 0.68s)

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '...', STOPROW => '...'}

1405 row(s) in ... seconds

As mentioned, one-to-one mapping uses far more storage space than the same data mapped using composite keys or dense columns, where the HBase row key or HBase columns are made up of multiple relational table columns. HBase repeats the row key, column family name, column name, and timestamp for each column value. For relational data, which is usually dense, this causes an explosion in the required storage space.
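One caveat before comparing directory sizes: freshly loaded rows may still be sitting in the region servers' memstores rather than on disk. Flushing the three tables from the HBase shell first (a standard HBase shell command, not part of the original steps) makes the on-disk numbers more representative:

flush 'gosalesdw.sls_sales_fact'
flush 'gosalesdw.sls_sales_fact_unique'
flush 'gosalesdw.sls_sales_fact_dense'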

Issue the following command as the InfoSphere BigInsights administrator from a Linux terminal to check the directory sizes for the three tables you created.

hadoop fs -du /hbase/

...     hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact
3188    hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_dense
...     hdfs://bivm:9000/hbase/gosalesdw.sls_sales_fact_unique

Data collation

All data represented thus far has been stored as strings, the default encoding for HBase tables created by Big SQL. Therefore, numeric data is not collated correctly. HBase uses lexicographic ordering, so you might encounter cases where a query returns wrong results. The following scenario walks through a situation where data is not collated correctly.

Using the Big SQL insert into hbase statement, add the following row to the sls_sales_fact_dense table (previously defined, with data loaded). The value specified for the ORDER_DAY_KEY column, which has data type int, is a larger numerical value that does not conform to any date standard because it contains an extra digit.

INSERT INTO gosalesdw.sls_sales_fact_dense (
  ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
  RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY )
VALUES ( ..., 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue a scan on the table with the following start and stop criteria.

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '...', STOPROW => '...'}

Notice the last three rows/cells returned in the output of the scan. The newly added row shows up even though its integer value is not between the start and stop row values.

...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_dollar_values, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_other_keys, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_quantity, ...
... row(s) in ... seconds

Insert another row into the table with the following command. This time, we're conforming to the date format of YYYYMMDD and incrementing the day by one from the last value returned in the table.

INSERT INTO gosalesdw.sls_sales_fact_dense (
  ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY, RETAILER_KEY,
  RETAILER_SITE_KEY, PRODUCT_KEY, PROMOTION_KEY, ORDER_METHOD_KEY )
VALUES ( ..., 11171, 4428, 7109, 5588, 30265, 5501, 605);

Issue another scan on the table. Remember to increase the stoprow criteria by one day.

scan 'gosalesdw.sls_sales_fact_dense', {STARTROW => '...', STOPROW => '...'}

The newly added row is included as part of the result set, and the two inserted rows are ordered by byte-wise comparison of their keys rather than by the numeric value of ORDER_DAY_KEY. This is an example of numeric data that is not collated properly: the rows are stored not in numerical order, as you might expect, but in byte lexicographical order.

...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_dollar_values, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_other_keys, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_quantity, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_dollar_values, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_other_keys, ...
...\x0011171\x004428\x007109\x005588\x0030265...  column=cf_data:cq_quantity, ...
... row(s) in ... seconds
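Because string encoding reduces ordering to byte-wise comparison, you can reproduce the effect entirely outside HBase. A quick illustration with the sort utility (the numbers here are made up for the demonstration):

printf '9\n10\n20070721\n200707211\n' | LC_ALL=C sort    # byte order: 10, 20070721, 200707211, 9
printf '9\n10\n20070721\n200707211\n' | sort -n          # numeric order: 9, 10, 20070721, 200707211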

Many-to-one mapping with binary encoding

Big SQL supports two types of data encoding: string and binary. Each HBase entity can also have its own encoding. For example, a row key can be encoded as a string, while one HBase column is encoded as binary and another as string.

String is the default encoding used in Big SQL HBase tables. The value is converted to a string and stored as UTF-8 bytes. When multiple parts are packed into one HBase entity, separators are used to delimit the data. The default separator is the null byte; as it is the lowest byte, it maintains data collation and allows range queries and partial row scans to work correctly.

Binary encoding in Big SQL is sortable, so numeric data, including negative numbers, collates properly. It handles separators internally and avoids the issue of separators occurring within the data by escaping them.

Issue the following DDL statement from the Big SQL shell to create a dense table, as you did in Many-to-one mapping (composite keys and dense columns), but this time override the default encoding to binary.

Listing 5. Override default encoding to binary

CREATE HBASE TABLE GOSALESDW.SLS_SALES_FACT_DENSE_BINARY (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                       mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                                       RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                                       PROMOTION_KEY, ORDER_METHOD_KEY),
  cf_data:cq_other_keys     mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY),
  cf_data:cq_quantity       mapped by (QUANTITY),
  cf_data:cq_dollar_values  mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                       GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT)
)
default encoding binary;

Once again, use the load hbase data command to load the data into the table. This time, we're adding the DISABLE WAL clause. Disabling the write-ahead log (WAL) can speed up writes into HBase. However, it is not a safe option: turning off the WAL can result in data loss if a region server crashes. Another option to speed up the load is to increase the write buffer size.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.sls_sales_fact_dense_binary
DISABLE WAL;

44603 rows affected (total: 5.54s)

Issue a select statement on the newly created and loaded table with binary encoding, sls_sales_fact_dense_binary.

SELECT * FROM gosalesdw.sls_sales_fact_dense_binary
go -m discard

Note that the go -m discard option is used so the results of the command are not displayed in the terminal.

44603 rows in results(first row: 0.35s; total: 2.89s)

Issue the same select statement on the previous table, sls_sales_fact_dense, which uses string encoding.

SELECT * FROM gosalesdw.sls_sales_fact_dense
go -m discard

44603 rows in results(first row: 0.31s; total: 3.1s)

The key point here is that the query against the binary-encoded table can return faster. (Numeric types are also collated properly.) You will probably not see much, if any, performance difference when working with small datasets.

There is no custom serialization/deserialization logic required for string encoding, making it portable if you want to use another application to read data in HBase tables. A primary use case for string encoding is mapping existing data. Delimited data is a common form of storing data, and it can be easily mapped using Big SQL string encoding. However, parsing strings is expensive, and queries with data encoded as strings are slow. And numeric data is not collated correctly, as shown in the example.

Queries on data encoded as binary have faster response times, and numeric data, including negative numbers, is collated correctly with binary encoding. The downside is that the data is encoded by Big SQL logic, so it might not be portable as-is.

Many-to-one mapping with HBase pre-created regions and external tables

HBase automatically splits regions when they reach a set size limit. In some scenarios, like bulk loading, it is more efficient to pre-create regions so the load operation can take place in parallel. In our example, the sales data covers four months, April through July 2007. You can pre-create regions by specifying splits in the create table command.

In this section, we create a table within the HBase shell with pre-defined splits, at first without using any Big SQL features. Then we'll show how users can map existing data in HBase to Big SQL, which can prove to be a common practice. Creating external tables makes this possible.

Start by issuing the following statement in the HBase shell. The sls_sales_fact_dense_split table will be created with pre-defined region splits for April through July 2007.

Listing 6. sls_sales_fact_dense_split table created with pre-defined region splits

create 'gosalesdw.sls_sales_fact_dense_split',
  { NAME => 'cf_data', REPLICATION_SCOPE => '0', KEEP_DELETED_CELLS => 'false',
    COMPRESSION => 'NONE', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true',
    MIN_VERSIONS => '0', DATA_BLOCK_ENCODING => 'NONE', IN_MEMORY => 'false',
    BLOOMFILTER => 'NONE', TTL => '...', VERSIONS => '...', BLOCKSIZE => '65536'},
  {SPLITS => ['200704', '200705', '200706', '200707']}

Issue the following list command in the HBase shell to verify the newly created table.

list

If you were to list the tables from the Big SQL shell, you would not see this table, because we haven't yet made any association to Big SQL.
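You can confirm this from the Big SQL (JSQSH) shell with its table metadata command (assuming your JSQSH build provides it, as it does for \drivers); the newly created HBase table should be absent from the list:

\tables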

Open a browser and point it to the HBase master web interface. Scroll down and click on the table you just defined in the HBase shell, gosalesdw.sls_sales_fact_dense_split, as shown in Figure 4.

Figure 4. Splits

Figure 5 shows the pre-created regions that we defined when creating the table.

Figure 5. Pre-created regions

Execute the create external hbase command below to map the existing table you just created in HBase to Big SQL. With the create external hbase command, the create table statement lets you specify a different name for the SQL table through the hbase table name clause. Using external tables, you can also create multiple views of the same HBase table; for example, one table can map to a few columns and another table to another set of columns.

The column mapping section of the create table statement also allows you to specify a different separator for each column and for the row key. Finally, you can map tables created using the Hive HBase storage handler, which cannot be directly read using the Big SQL storage handler.

Listing 7. Mapping the existing HBase table as an external table

CREATE EXTERNAL HBASE TABLE GOSALESDW.EXTERNAL_SLS_SALES_FACT_DENSE_SPLIT (
  ORDER_DAY_KEY int, ORGANIZATION_KEY int, EMPLOYEE_KEY int,
  RETAILER_KEY int, RETAILER_SITE_KEY int, PRODUCT_KEY int,
  PROMOTION_KEY int, ORDER_METHOD_KEY int, SALES_ORDER_KEY int,
  SHIP_DAY_KEY int, CLOSE_DAY_KEY int, QUANTITY int,
  UNIT_COST decimal(19,2), UNIT_PRICE decimal(19,2), UNIT_SALE_PRICE decimal(19,2),
  GROSS_MARGIN double, SALE_TOTAL decimal(19,2), GROSS_PROFIT decimal(19,2)
)
COLUMN MAPPING (
  key                       mapped by (ORDER_DAY_KEY, ORGANIZATION_KEY, EMPLOYEE_KEY,
                                       RETAILER_KEY, RETAILER_SITE_KEY, PRODUCT_KEY,
                                       PROMOTION_KEY, ORDER_METHOD_KEY) SEPARATOR '-',
  cf_data:cq_other_keys     mapped by (SALES_ORDER_KEY, SHIP_DAY_KEY, CLOSE_DAY_KEY)
                                       SEPARATOR '/',
  cf_data:cq_quantity       mapped by (QUANTITY),
  cf_data:cq_dollar_values  mapped by (UNIT_COST, UNIT_PRICE, UNIT_SALE_PRICE,
                                       GROSS_MARGIN, SALE_TOTAL, GROSS_PROFIT) SEPARATOR ' '
)
HBASE TABLE NAME 'gosalesdw.sls_sales_fact_dense_split';

The data in external tables is not validated at creation time. For example, if a column in the external table contains data whose separators are incorrectly defined, the query results will be unpredictable. Note that external tables are not owned by Big SQL and, hence, cannot be dropped via Big SQL. Also, secondary indices cannot be created via Big SQL on external tables.

Use the following command to load the external table we have defined.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact.10p.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split;

44603 rows affected (total: 1m57.2s)

Verify that the number of rows loaded is the same number of rows returned by querying the external SQL table.

SELECT COUNT(*) FROM gosalesdw.external_sls_sales_fact_dense_split;

44603
1 row in results(first row: 6.44s; total: 6.46s)

Verify the same from the HBase shell, directly on the underlying HBase table.

count 'gosalesdw.sls_sales_fact_dense_split'

44603 row(s) in ... seconds

Issue a get command from the HBase shell, specifying the row key as follows. Notice the separator between each part of the row key is a hyphen (-), as we defined when originally creating the external table.

get 'gosalesdw.sls_sales_fact_dense_split', '...'

In the following output, you can also see the other separators we defined for the external table, such as the / within cq_other_keys.

COLUMN                    CELL
cf_data:cq_dollar_values  timestamp=..., value=...
cf_data:cq_other_keys     timestamp=..., value=481896/.../...
cf_data:cq_quantity       timestamp=..., value=25
3 row(s) in ... seconds

Of course, in Big SQL, you don't need to specify the separators, such as -, when querying against the table, as in the command below.

SELECT * FROM gosalesdw.external_sls_sales_fact_dense_split
WHERE ORDER_DAY_KEY = ... AND ORGANIZATION_KEY = ... AND EMPLOYEE_KEY = 4428
AND RETAILER_KEY = 7109 AND RETAILER_SITE_KEY = 5588 AND PRODUCT_KEY = ...
AND PROMOTION_KEY = 5501 AND ORDER_METHOD_KEY = 605;

Handling errors with load data

How do you handle errors during the load operation? The load hbase command has an option to continue past errors: you can use the LOG ERROR ROWS IN FILE clause to specify a file in which to log any rows that could not be loaded because of errors. A few common errors are invalid numeric types and, for string encoding, a separator occurring within the data.

Examine the deliberately malformed input file, which contains rows with non-numeric values (such as a and b) in integer columns, as well as a separator character appearing within the data, which is an issue for string encoding.

hadoop fs -cat /user/biadmin/gosalesdw/sls_sales_fact_badload.txt

Knowing there are errors in the input data, go ahead and issue the following load command, specifying a directory and file in which to put the bad rows.

LOAD HBASE DATA INPATH '/user/biadmin/gosalesdw/sls_sales_fact_badload.txt'
DELIMITED FIELDS TERMINATED BY '\t'
INTO TABLE gosalesdw.external_sls_sales_fact_dense_split
LOG ERROR ROWS IN FILE '/tmp/sls_sales_fact_load.err';

In this example, four rows did not get loaded because of errors. The load command reports only the rows that passed through it.

1 row affected (total: 2.74s)

Examine the file specified in the load command to view the rows that were not loaded.

hadoop fs -cat /tmp/sls_sales_fact_load.err

" a","11171",...
"b ","11171",...
"...","11171",...
"...","11-71",...

Summary

The examples in this article have shown how to create tables and various ways to load data. We covered different types of one-to-one mapping and many-to-one mapping. Part 2 of this series covers query handling and how to connect to Big SQL via JDBC to run business reports with tools such as BIRT or Cognos.

Acknowledgments

Thanks to Uttam Jain for his contributions to this series.

Downloadable resources

Description                                 Name                   Size
Data samples                                IBD-1687A_Data.zip     6KB
Pre-created BIRT Report                     Orders.rptdesign.zip   8KB
Presentation on Big SQL over HBase (1)      IBD-1687A.pdf          3MB

Note 1. This article is derived from a presentation at Information On Demand Session 1687, Adding Value to HBase with IBM InfoSphere BigInsights and Big SQL.

Related topics

Learn more about BigInsights 2.1 from the BigInsights Information Center.
Check out "What's the big deal about Big SQL?" for an introduction to Big SQL.
Read "Understanding InfoSphere BigInsights" to learn more about the product's architecture and underlying technologies.
Get a technical introduction to Big SQL on Slideshare.
Learn more about HBase at Apache.org.
Get familiar with the Cognos sample GOSALES databases by accessing the product's Information Center.
Download a free native software installation copy of InfoSphere BigInsights 2.1 Quick Start Edition (sign-in required).

Copyright IBM Corporation 2014


More information

Because databases are not easily accessible by Hadoop, Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external

Because databases are not easily accessible by Hadoop, Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external Because databases are not easily accessible by Hadoop, Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external structured datastores. The popularity of Sqoop in enterprise

More information

OKC MySQL Users Group

OKC MySQL Users Group OKC MySQL Users Group OKC MySQL Discuss topics about MySQL and related open source RDBMS Discuss complementary topics (big data, NoSQL, etc) Help to grow the local ecosystem through meetups and events

More information

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Scaling Up HBase Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech Partly based on materials

More information

Mastering phpmyadmiri 3.4 for

Mastering phpmyadmiri 3.4 for Mastering phpmyadmiri 3.4 for Effective MySQL Management A complete guide to getting started with phpmyadmin 3.4 and mastering its features Marc Delisle [ t]open so 1 I community experience c PUBLISHING

More information

7. Query Processing and Optimization

7. Query Processing and Optimization 7. Query Processing and Optimization Processing a Query 103 Indexing for Performance Simple (individual) index B + -tree index Matching index scan vs nonmatching index scan Unique index one entry and one

More information

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1

B.H.GARDI COLLEGE OF MASTER OF COMPUTER APPLICATION. Ch. 1 :- Introduction Database Management System - 1 Basic Concepts :- 1. What is Data? Data is a collection of facts from which conclusion may be drawn. In computer science, data is anything in a form suitable for use with a computer. Data is often distinguished

More information

Make sure you have the latest Hive trunk by running svn up in your Hive directory. More detailed instructions on downloading and setting up

Make sure you have the latest Hive trunk by running svn up in your Hive directory. More detailed instructions on downloading and setting up GenericUDAFCaseStudy Writing GenericUDAFs: A Tutorial User-Defined Aggregation Functions (UDAFs) are an excellent way to integrate advanced data-processing into Hive. Hive allows two varieties of UDAFs:

More information

New Features Summary. SAP Sybase Event Stream Processor 5.1 SP02

New Features Summary. SAP Sybase Event Stream Processor 5.1 SP02 Summary SAP Sybase Event Stream Processor 5.1 SP02 DOCUMENT ID: DC01616-01-0512-01 LAST REVISED: April 2013 Copyright 2013 by Sybase, Inc. All rights reserved. This publication pertains to Sybase software

More information

File Structures and Indexing

File Structures and Indexing File Structures and Indexing CPS352: Database Systems Simon Miner Gordon College Last Revised: 10/11/12 Agenda Check-in Database File Structures Indexing Database Design Tips Check-in Database File Structures

More information

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010

sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 sqoop Easy, parallel database import/export Aaron Kimball Cloudera Inc. June 8, 2010 Your database Holds a lot of really valuable data! Many structured tables of several hundred GB Provides fast access

More information

Installing IBM InfoSphere BigInsights Quick Start Edition

Installing IBM InfoSphere BigInsights Quick Start Edition Installing IBM InfoSphere BigInsights Quick Start Edition 1. System requirements Pr. Imade Benelallam Imade.benelallam@ieee.org Before you download, ensure that your system meets the minimum requirements:

More information

Release Notes. Installing. Upgrading. Registration Keys

Release Notes. Installing. Upgrading. Registration Keys Release Notes The SmartList Builder upgrade follows the supported upgrade paths from 2010 and 2013 that are available for Microsoft Dynamics GP. You can find these on PartnerSource or CustomerSource in

More information

Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612

Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612 Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612 Google Bigtable 2 A distributed storage system for managing structured data that is designed to scale to a very

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Column Stores and HBase. Rui LIU, Maksim Hrytsenia

Column Stores and HBase. Rui LIU, Maksim Hrytsenia Column Stores and HBase Rui LIU, Maksim Hrytsenia December 2017 Contents 1 Hadoop 2 1.1 Creation................................ 2 2 HBase 3 2.1 Column Store Database....................... 3 2.2 HBase

More information

DB2 for z/os Stored Procedure support in Data Server Manager

DB2 for z/os Stored Procedure support in Data Server Manager DB2 for z/os Stored Procedure support in Data Server Manager This short tutorial walks you step-by-step, through a scenario where a DB2 for z/os application developer creates a query, explains and tunes

More information

Set up and use federation in InfoSphere BigInsights Big SQL V3.0

Set up and use federation in InfoSphere BigInsights Big SQL V3.0 Set up and use federation in InfoSphere BigInsights Big SQL Mara Elisa de Paiva Fernandes Matias February 04, 2015 (First published July 08, 2014) Big SQL supports federation to many data sources, including

More information

Sqoop In Action. Lecturer:Alex Wang QQ: QQ Communication Group:

Sqoop In Action. Lecturer:Alex Wang QQ: QQ Communication Group: Sqoop In Action Lecturer:Alex Wang QQ:532500648 QQ Communication Group:286081824 Aganda Setup the sqoop environment Import data Incremental import Free-Form Query Import Export data Sqoop and Hive Apache

More information

IBM Campaign Version-independent Integration with IBM Watson Campaign Automation Version 1 Release 1.5 February, Integration Guide IBM

IBM Campaign Version-independent Integration with IBM Watson Campaign Automation Version 1 Release 1.5 February, Integration Guide IBM IBM Campaign Version-independent Integration with IBM Watson Campaign Automation Version 1 Release 1.5 February, 2018 Integration Guide IBM Note Before using this information and the product it supports,

More information

HBase. Леонид Налчаджи

HBase. Леонид Налчаджи HBase Леонид Налчаджи leonid.nalchadzhi@gmail.com HBase Overview Table layout Architecture Client API Key design 2 Overview 3 Overview NoSQL Column oriented Versioned 4 Overview All rows ordered by row

More information

Enable Spark SQL on NoSQL Hbase tables with HSpark IBM Code Tech Talk. February 13, 2018

Enable Spark SQL on NoSQL Hbase tables with HSpark IBM Code Tech Talk. February 13, 2018 Enable Spark SQL on NoSQL Hbase tables with HSpark IBM Code Tech Talk February 13, 2018 https://developer.ibm.com/code/techtalks/enable-spark-sql-onnosql-hbase-tables-with-hspark-2/ >> MARC-ARTHUR PIERRE

More information

Representing Data Elements

Representing Data Elements Representing Data Elements Week 10 and 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 18.3.2002 by Hector Garcia-Molina, Vera Goebel INF3100/INF4100 Database Systems Page

More information

Andreas Weininger,

Andreas Weininger, External Tables: New Options not just for Loading Data Andreas Weininger, IBM Andreas.Weininger@de.ibm.com @aweininger Agenda Why external tables? What are the alternatives? How to use external tables

More information

Designing dashboards for performance. Reference deck

Designing dashboards for performance. Reference deck Designing dashboards for performance Reference deck Basic principles 1. Everything in moderation 2. If it isn t fast in database, it won t be fast in Tableau 3. If it isn t fast in desktop, it won t be

More information

In this exercise, you will import orders table from MySQL database. into HDFS. Get acquainted with some of basic commands of Sqoop

In this exercise, you will import orders table from MySQL database. into HDFS. Get acquainted with some of basic commands of Sqoop Practice Using Sqoop Data Files: ~/labs/sql/retail_db.sql MySQL database: retail_db In this exercise, you will import orders table from MySQL database into HDFS. Get acquainted with some of basic commands

More information

Sepand Gojgini. ColumnStore Index Primer

Sepand Gojgini. ColumnStore Index Primer Sepand Gojgini ColumnStore Index Primer SQLSaturday Sponsors! Titanium & Global Partner Gold Silver Bronze Without the generosity of these sponsors, this event would not be possible! Please, stop by the

More information

Tuning the Hive Engine for Big Data Management

Tuning the Hive Engine for Big Data Management Tuning the Hive Engine for Big Data Management Copyright Informatica LLC 2017. Informatica, the Informatica logo, Big Data Management, PowerCenter, and PowerExchange are trademarks or registered trademarks

More information

Business Intelligence Exchange (BIX)

Business Intelligence Exchange (BIX) Business Intelligence Exchange (BIX) Release Notes Version 2.3 And Version 2.3 SP1 February, 2012 Framework Overview The Business Intelligence Exchange (BIX) extracts information from a PRPC database into

More information

Installing Data Sync Version 2.3

Installing Data Sync Version 2.3 Oracle Cloud Data Sync Readme Release 2.3 DSRM-230 May 2017 Readme for Data Sync This Read Me describes changes, updates, and upgrade instructions for Data Sync Version 2.3. Topics: Installing Data Sync

More information

DEC 31, HareDB HBase Client Web Version ( X & Xs) USER MANUAL. HareDB Team

DEC 31, HareDB HBase Client Web Version ( X & Xs) USER MANUAL. HareDB Team DEC 31, 2016 HareDB HBase Client Web Version (1.120.02.X & 1.120.02.Xs) USER MANUAL HareDB Team Index New features:... 3 Environment requirements... 3 Download... 3 Overview... 5 Connect to a cluster...

More information

Chapter 8: Working With Databases & Tables

Chapter 8: Working With Databases & Tables Chapter 8: Working With Databases & Tables o Working with Databases & Tables DDL Component of SQL Databases CREATE DATABASE class; o Represented as directories in MySQL s data storage area o Can t have

More information

Db2 Alter Table Alter Column Set Data Type Char

Db2 Alter Table Alter Column Set Data Type Char Db2 Alter Table Alter Column Set Data Type Char I am trying to do 2 alters to a column in DB2 in the same alter command, and it doesn't seem to like my syntax alter table tbl alter column col set data

More information

Module 1: Information Extraction

Module 1: Information Extraction Module 1: Information Extraction Introduction to GATE Developer The University of Sheffield, 1995-2014 This work is licenced under the Creative Commons Attribution-NonCommercial-ShareAlike Licence About

More information