INITIAL EVALUATION BIGSQL FOR HORTONWORKS (Homerun or merely a major bluff?)

PER STRICKER, THOMAS KALB
07.02.2017, HEART OF TEXAS DB2 USER GROUP, AUSTIN
08.02.2017, DB2 FORUM USER GROUP, DALLAS
Copyright 2016 ITGAIN GmbH

Agenda
- Introduction
- The MPP Architecture
  - DB2 DPF and Hadoop (HDFS)
- Installation stumbling blocks
  - Red Hat or CentOS
  - The HDP Installation with Ambari (see Appendix)
  - The BigSQL Installation
- Working with BigSQL: the Familiar and the New
  a. DB2 - Interface
  b. HDFS - Interface
- The Big Data Deployment (SQL for unstructured Data)
- DB2 Engine versus HDFS-Engine
  - Functional Differences
  - Performance Differences
- BIG SQL and Hive
- Conclusion: Sham or Masterstroke?
- Questions and Discussions

Hadoop Distributions: Cloudera / Hortonworks / MapR / IOP (worldwide market share)
- Cloudera: 53 %
- Hortonworks: 16 %
- MapR: 11 %
- Others: 20 %
Source: https://www.dezyre.com/article/top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93


Big SQL and MPP Architecture
- IBM Big SQL is a high-performance SQL-on-Apache-Hadoop engine
- The IBM MPP engine (C++) replaces the MapReduce layer (Java)
- Big SQL is an MPP (Massively Parallel Processing) SQL engine
- Hive extends Hadoop with data warehouse features
- HBase is a distributed column-oriented database
- HDFS is a high-availability file system for storing very large volumes of data distributed across many nodes
Source: Big SQL: A Technical Introduction (2016 IBM Corporation)

Horizontal Scaling


Installation Stumbling Blocks: ITGAIN Test Environment
- Installing two nodes
- Hardware: 2 virtual servers with 8 cores / 10 GB RAM / SSDs
- Software:
  - Red Hat Enterprise Linux 7.2 / CentOS 7.2
  - Ambari 2.2.2.0
  - Hortonworks Data Platform (HDP) 2.4.2
  - BETA: Big SQL 4.2 for Hortonworks Data Platform
- Extending with two additional identical nodes (DataNode / WorkerNode)

Installation Stumbling Blocks: Red Hat or CentOS?
- IBM BigInsights for Apache Hadoop 4.2 only supports:
  - Red Hat Enterprise Linux (RHEL) Server 6.7
  - Red Hat Enterprise Linux (RHEL) Server 7.2
- Hortonworks Data Platform (HDP) 2.4.2 supports:
  - Red Hat Enterprise Linux (RHEL) 6.x - 7.x
  - CentOS 6.x - 7.x
  - Debian 7.x
  - Oracle Linux 6.x - 7.x
  - SUSE Linux Enterprise Server (SLES) v11 SP3 / SP4
  - Ubuntu Precise v12.04
  - Ubuntu Trusty v14.04

Installation Stumbling Blocks: Red Hat or CentOS?
- Recommendation for the BETA on Hortonworks:
  - Red Hat Enterprise Linux (RHEL) Server 7.2
- Test cluster on:
  - Red Hat Enterprise Linux (RHEL) Server 7.2
  - CentOS 7.2
- Installation on both operating systems was successful

Installation Stumbling Blocks: The HDP Installation with Ambari
Tips and tricks:
- The installation with Ambari is very simple, provided there are no errors
- Therefore: before installing, take the time to clear any warnings reported by Confirm Hosts and the check scripts
- In case of errors, check the output written to stderr:
  - Often stderr is empty; the typical cause is then a timeout
  - If stderr contains errors, attempt to correct them and retry
- If the installation crashes, it is often easier to retry on a freshly installed OS than to repair the existing OS and retry the installation

Installation Stumbling Blocks: The BigSQL Installation
Recommendations:
- Execute the Big SQL Pre-Checker before the installation
- The Pre-Checker scripts are included in the installation package but need to be extracted first:
    rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd ./var/lib/ambari-server/resources/stacks/hdp/2.4/services/bigsql/package/scripts/bigsql-precheck.sh
    rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd ./var/lib/ambari-server/resources/stacks/hdp/2.4/services/bigsql/package/scripts/bigsql-util.sh
- All errors should be cleared before starting the installation

Installation Stumbling Blocks: The BigSQL Installation
- Additional Big SQL Workers can always be added to an individual host via the Add Services option under Hosts
- However, this is not possible on a Big SQL Head Node!

Working with BigSQL: The New and the Familiar
- There is also a command line for BigSQL: JSqsh (Java SQL Shell), pronounced "jay-skwish"
- According to the docs it should be found in: /usr/ibmpacks/common-utils/current/jsqsh
- BUT:

Working with BigSQL: The New and the Familiar
Requesting the table list with JSqsh
- JSqsh command help is available via \help
- For example:
  - Defining the current schema: use BIGSQL
  - Requesting a table list in a given schema: \show tables

Working with BigSQL: The New and the Familiar
Starting point: load data into the tables
- Tip: for better performance, first copy the load file into HDFS:
    hdfs dfs -copyFromLocal /tmp/firsttable.csv /tmp/
    hdfs dfs -chmod 777 /tmp/firsttable.csv


The Big Data Deployment (SQL for unstructured Data)
- Working with datatypes for complex (partially structured) data:
  - ARRAY: collection of data of the same datatype
  - MAP: collection of key-value pairs
  - STRUCT: collection of data with different datatypes
- Working with unstructured data is possible via the Serializer and Deserializer (SerDe)
  - The SerDe interface specifies how data blocks should be processed
  - There are many built-in SerDes, for example for JSON, Avro, Parquet, regular expressions, etc.
  - Many SerDes are available in the public domain
  - Specific SerDes that may be required can be developed in Java
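To make the three complex datatypes and the SerDe idea more concrete, the following is a minimal conceptual sketch in Python. It is not Big SQL or Hive code and does not use any Big SQL API; the record layout and the names CustomerRow, Address, deserialize and serialize are hypothetical and chosen only for illustration. It shows a raw JSON record being turned into a typed row (deserialization) and back (serialization), with one field for each of ARRAY, MAP and STRUCT.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Address:
    # STRUCT: a fixed set of named fields with different datatypes
    city: str
    zip_code: int


@dataclass
class CustomerRow:
    # Hypothetical row layout, assumed only for this illustration
    name: str
    phones: List[str]            # ARRAY: all elements share one datatype
    attributes: Dict[str, str]   # MAP: key-value pairs
    address: Address             # STRUCT


def deserialize(raw: str) -> CustomerRow:
    """Turn one raw JSON record into a typed row (the 'De' in SerDe)."""
    doc = json.loads(raw)
    return CustomerRow(
        name=doc["name"],
        phones=list(doc["phones"]),
        attributes=dict(doc["attributes"]),
        address=Address(**doc["address"]),
    )


def serialize(row: CustomerRow) -> str:
    """Turn a typed row back into a raw JSON record (the 'Ser' in SerDe)."""
    return json.dumps({
        "name": row.name,
        "phones": row.phones,
        "attributes": row.attributes,
        "address": {"city": row.address.city, "zip_code": row.address.zip_code},
    })
```

A real SerDe does the same kind of translation, but record by record inside the engine and, as noted on the slide, implemented in Java.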

Big Data: Unstructured Data
Using SerDes in BigSQL
- Before a SerDe .jar file can be used, it needs to be registered in BigSQL
- Only when the .jar file has been successfully registered will it be available to BigSQL
- 3 steps to register:
  1. Hive servers: copy the SerDe .jar file into the /lib/ directory
  2. Big SQL nodes: copy the SerDe .jar file into the /userlib/ directory of each individual node
  3. Restart all BigSQL services

Big Data: Example of Unstructured Data
    select * from apache_log fetch first 5 rows only
- Use case, for example: correlating client data with web browser data to analyze user behavior
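The slides do not show how the apache_log table is defined, so the following Python sketch only illustrates the general idea behind a regular-expression SerDe: each raw log line is split into named columns by a pattern. The pattern, the column names (host, status, agent, ...) and the helper names are assumptions based on the common Apache combined log format, not the definition used in the presentation.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

# Assumed layout: Apache combined log format (referer and agent optional)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)


def parse_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Split one raw log line into named columns; None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None


def agents_by_host(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Correlate each client host with the browser strings it used."""
    result: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        row = parse_line(line)
        if row and row["agent"]:
            result[row["host"]].add(row["agent"])
    return dict(result)
```

Calling parse_line(SAMPLE) yields one dictionary per log line, which corresponds to one row of the table. agents_by_host mirrors the client-to-browser correlation mentioned on the slide; lines that do not match the pattern are skipped.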

BIGSQL: Sham or Masterstroke?
Sham
- DB2 DPF for HDFS
Masterstroke
- The right strategy at the right time
- Reuse of existing investments
- Increased acceptance via the reuse of SQL
- Simple integration of Big Data into an existing infrastructure

The Big Data Solution
- Big SQL Hadoop tables are not a replacement for OLTP DBMS technology
- Big SQL makes it possible to run SQL requests against existing Hadoop data (no proprietary storage formats)
- All the data are Hadoop files in HDFS
- Big SQL was developed to make effective and efficient use of the Hadoop infrastructure
- Most organizations possess experienced SQL developers
- No UPDATE or DELETE is possible on a Hadoop table
- Much lower license costs than DPF
- Good SQL compatibility
- Great monitoring with Speedgain for BIGSQL is available

The Big Data Solution
Primary use cases would be:
- To move rarely referenced data out of the data warehouse and onto cheaper hardware while maintaining the ability to query the data via SQL
- To set up a new data warehouse
- To filter and analyze unstructured data (such as log files, sensor data and social media) and to connect this data to existing structured data (for example via federation)