Initial Evaluation: BigSQL for Hortonworks (Homerun or merely a major bluff?)
  Per Stricker, Thomas Kalb
  07.02.2017, Heart of Texas DB2 User Group, Austin
  08.02.2017, DB2 Forum User Group, Dallas
  Copyright 2016 ITGAIN GmbH
Agenda
  Introduction
    The MPP Architecture
    DB2 DPF and Hadoop (HDFS)
  Installation stumbling blocks
    Red Hat or CentOS
    The HDP Installation with Ambari (See Appendix)
    The BigSQL Installation
  Working with BigSQL
    Familiar and the New
      a. DB2 - Interface
      b. HDFS - Interface
  The Big Data Deployment (SQL for unstructured Data)
  DB2 Engine versus HDFS-Engine
    Functional Differences
    Performance Differences
  BIG SQL and Hive
  Conclusion
    Sham or Masterstroke?
  Questions and Discussions
Hadoop (HDFS)
  Image source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/hadoop-cluster.png
Hadoop Distributions: Cloudera / Hortonworks / MapR / IOP (worldwide market share)
  Cloudera      53 %
  Hortonworks   16 %
  MapR          11 %
  Others        20 %
  Source: https://www.dezyre.com/article/top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
Hadoop Appraisal
  Source: https://www.cloudera.com/content/dam/www/static/documents/analyst-reports/forrester-wave-big-data-hadoop-distributions.pdf
Hadoop SQL Engines
  Source: IBM Big SQL Vendor Landscape, 2014 IBM Corporation
Big SQL and MPP Architecture
  - IBM Big SQL is a high-performance SQL-on-Apache-Hadoop engine
  - The IBM MPP engine (C++) replaces the MapReduce layer (Java)
  - Big SQL is an MPP (Massively Parallel Processing) SQL engine
  - Hive extends Hadoop with data warehouse features
  - HBase is a distributed column-oriented database
  - HDFS is a high-availability filesystem for storing very large volumes of data distributed across many nodes
  Source: Big SQL: A Technical Introduction, 2016 IBM Corporation
SMP vs. MPP Architecture
  SMP: Dynamically distributes running processes across all available processors, which share the system resources (multiprocessor systems)
  MPP: Distributes a task across multiple independent nodes, each with its own processors, RAM and I/O (shared-nothing architecture)
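To make the shared-nothing idea concrete, here is a minimal sketch in Python. It is not DB2 or Big SQL code. It assumes rows are assigned to nodes by hashing a distribution key; the function names, the CRC32 hash and the four-node setup are illustrative assumptions. Each node counts only its own rows, and a coordinator adds up the partial results.

```python
import zlib
from collections import defaultdict


def node_for_key(key, num_nodes):
    """Pick a node for a distribution key using a stable hash (CRC32, illustrative only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_column, num_nodes):
    """Split rows into per-node partitions; no node sees another node's rows."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[node_for_key(row[key_column], num_nodes)].append(row)
    return partitions


def parallel_count(partitions):
    """Each node counts locally; the coordinator only sums the partial counts."""
    partial_counts = [len(node_rows) for node_rows in partitions.values()]
    return sum(partial_counts)


if __name__ == "__main__":
    rows = [{"id": i, "value": i * 10} for i in range(1000)]
    partitions = distribute(rows, "id", num_nodes=4)
    for node in sorted(partitions):
        print(f"node {node}: {len(partitions[node])} rows")
    print("total:", parallel_count(partitions))
```

Adding a node in this model means changing `num_nodes` and redistributing the rows, which is the horizontal-scaling trade-off of an MPP system.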
SMP Scaling: Vertical Scaling
MPP Scaling: Horizontal Scaling
DB2 DPF versus Hadoop (HDFS)
  [Figure panels: DB2 DPF; Hadoop Cluster (Diploma Thesis)]
DB2 DPF
  Source: toadworld.com
Big SQL: IBM Slide
  Source: Big SQL: A Technical Introduction, 2016 IBM Corporation
BIG SQL: ITGAIN Slide
Installation Stumbling Blocks: ITGAIN Test Environment
  Installing two nodes
  Hardware:
  - 2 virtual servers with 8 cores / 10 GB RAM / SSDs
  Software:
  - Linux Red Hat 7.2 / CentOS 7.2
  - Ambari 2.2.2.0
  - Hortonworks Data Platform (HDP) 2.4.2
  - BETA: Big SQL 4.2 for Hortonworks Data Platform
  Extending with two additional identical nodes (DataNode / WorkerNode)
Installation Stumbling Blocks: Red Hat or CentOS?
  IBM BigInsights for Apache Hadoop 4.2 only supports:
  - Red Hat Enterprise Linux (RHEL) Server 6.7
  - Red Hat Enterprise Linux (RHEL) Server 7.2
  Hortonworks Data Platform (HDP) 2.4.2 supports:
  - Red Hat Enterprise Linux (RHEL) 6.x - 7.x
  - CentOS 6.x - 7.x
  - Debian 7.x
  - Oracle Linux 6.x - 7.x
  - SUSE Linux Enterprise Server (SLES) v11 SP3 / SP4
  - Ubuntu Precise v12.04
  - Ubuntu Trusty v14.04
Installation Stumbling Blocks: Red Hat or CentOS?
  Recommendation for the BETA on Hortonworks:
  - Red Hat Enterprise Linux (RHEL) Server 7.2
  Test cluster on:
  - Red Hat Enterprise Linux (RHEL) Server 7.2
  - CentOS 7.2
  Installation on both OSes was successful
Installation Stumbling Blocks: The HDP Installation with Ambari
  Tips and tricks:
  - The installation with Ambari is very simple, provided there are no errors
  - Therefore: before the installation, take the time to clear any warnings in the Confirm Hosts and Check Scripts
  In case of errors:
  - Check the errors output to stderr
  - Often stderr is empty; the typical cause is a timeout
  - If stderr contains errors, attempt to correct the error and retry
  - If the installation crashes, it is often easier to retry with a fresh OS than to modify the existing OS and retry the installation
Installation Stumbling Blocks: The BigSQL Installation
  Recommendations:
  - Execute the Big SQL Pre-Checker before the installation
  - The Pre-Checker scripts are available in the installation package but need to be extracted:
      rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd ./var/lib/ambari-server/resources/stacks/hdp/2.4/services/bigsql/package/scripts/bigsql-precheck.sh
      rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd ./var/lib/ambari-server/resources/stacks/hdp/2.4/services/bigsql/package/scripts/bigsql-util.sh
  - All errors should be cleared before starting the installation
  - Execute the Pre-Checker on ALL servers! Start the installation only when it succeeds everywhere
  Add the Service to a Cluster
  - It is always possible to add additional Big SQL Workers to an individual host via the Add Services option under Hosts
  - However, this is not possible on a Big SQL Head Node!
Installation Stumbling Blocks: Extending the Cluster with Ambari
  - Additional hosts can easily be added with the Add New Hosts Wizard
  - Data must be redistributed after the extension
Working with BigSQL: The New and the Familiar
  DB2 Interface
  Where does one find the tables in HDFS?
    /apps/hive/warehouse/bigsql.db/firsttable
  Or via the command line (HDFS Browse)
  Not everything works with the DB2 command line, for example loading data into a Hadoop table. What now?
  There is also a command line for BigSQL: JSqsh (Java SQL Shell), pronounced "jay-skwish"
  - According to the docs it should be found in /usr/ibmpacks/common-utils/current/jsqsh
  - BUT: JSqsh isn't part of the BigSQL installation
  - SOLUTION: JSqsh appears in the list of installed clients; it can also be installed via the open-source GitHub project
  JSqsh setup:
  - Driver selection
  - Customize the connection details and save
  Requesting the table list with JSqsh:
  - JSqsh command help via \help
  - Defining the current schema: use BIGSQL
  - Requesting a table list in a given schema: \show tables
  Starting point: load data into the tables
  - Tip: for better performance, copy the load file into HDFS first:
      hdfs dfs -copyFromLocal /tmp/firsttable.csv /tmp/
      hdfs dfs -chmod 777 /tmp/firsttable.csv
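The two staging commands can be wrapped in a small helper. This is a sketch in Python that only assumes the `hdfs` client is on the PATH; the function names and the dry-run default are illustrative and not part of Big SQL or Hadoop. Note that the HDFS shell option is spelled `-copyFromLocal` (the option names are case-sensitive).

```python
import posixpath
import subprocess


def staging_commands(local_path, hdfs_dir):
    """Build the argument lists for copying a load file into HDFS and opening its permissions."""
    target = posixpath.join(hdfs_dir, posixpath.basename(local_path))
    return [
        ["hdfs", "dfs", "-copyFromLocal", local_path, hdfs_dir],
        ["hdfs", "dfs", "-chmod", "777", target],
    ]


def stage_load_file(local_path, hdfs_dir, dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    for command in staging_commands(local_path, hdfs_dir):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    # Dry run: shows what would be executed for the load file from the slide.
    stage_load_file("/tmp/firsttable.csv", "/tmp/")
```

Keeping the dry run as the default makes it easy to review the commands before anything is written to the cluster.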
Working with BigSQL: The New and the Familiar
  What happened in the HDFS filesystem?
  - A new file has appeared
  db2top also works, for example for LOAD
  Even db2pd works, for example for LOAD
  - However, LIST UTILITIES does not work
Loading the Benchmark: BIGSQL HDFS Table
The HDFS (DB2-) Blocks
BIGSQL HDFS versus DB2 DPF
DB2 DPF Restrictions
Performance Differences: DB2 DPF versus DB2 HDFS
  Loading 10 million rows:
  - DB2 DPF: 22 s
  - DB2 HDFS: 64 s
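As a quick sanity check on these two timings, the load throughput and the slowdown factor can be worked out directly. The sketch below is plain Python arithmetic on the figures from the slide; the function and constant names are illustrative choices.

```python
ROWS_LOADED = 10_000_000
LOAD_SECONDS = {"DB2 DPF": 22, "DB2 HDFS": 64}


def rows_per_second(rows, seconds):
    """Average load throughput for one run."""
    return rows / seconds


if __name__ == "__main__":
    for engine, seconds in LOAD_SECONDS.items():
        print(f"{engine}: {rows_per_second(ROWS_LOADED, seconds):,.0f} rows/s")
    slowdown = LOAD_SECONDS["DB2 HDFS"] / LOAD_SECONDS["DB2 DPF"]
    print(f"The HDFS load took {slowdown:.1f}x as long as the DPF load")
```

With the slide's numbers this comes to roughly 455,000 rows/s for DPF and 156,000 rows/s for HDFS, a factor of about 2.9 on this two-node test system.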
Performance Differences: DB2 DPF versus DB2 HDFS
  Random I/O benchmark (reading 1023 rows)
  [Chart: cold and warm timings for DB2 DPF and DB2 HDFS]
Performance Differences: DB2 DPF versus DB2 HDFS
  Read-ahead I/O benchmark (reading 10 million rows)
  [Chart: cold and warm timings for DB2 DPF and DB2 HDFS]
The Big Data Deployment (SQL for unstructured Data)
  Working with datatypes for complex (partially structured) data:
  - ARRAY: collection of data of the same datatype
  - MAP: collection of key-value pairs
  - STRUCT: collection of data with different datatypes
  Working with unstructured data is possible via the Serializer and Deserializer (SerDe):
  - The SerDe interface is told how it should process the data blocks
  - There are many built-in SerDes, for example for JSON, Avro, Parquet, regular expressions, etc.
  - Many SerDes are available in the public domain
  - Specific SerDes that may be required can be developed in Java
Big Data: Working with ARRAY Data Types
  Collection of data of the same datatype
Big Data: Working with MAP Types
  Collection of key-value pairs
Big Data: Working with STRUCTs
  Collection of data with different datatypes
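For readers more at home in a programming language than in Hive-style DDL, the three complex types map onto familiar containers. The sketch below uses Python purely as an analogy; the `Customer` and `Address` records and their fields are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Address:
    """STRUCT analogue: a fixed set of named fields with different types."""
    street: str
    city: str
    zip_code: int


@dataclass
class Customer:
    name: str
    # ARRAY analogue: an ordered collection whose elements all share one type.
    phones: List[str] = field(default_factory=list)
    # MAP analogue: a collection of key-value pairs.
    preferences: Dict[str, str] = field(default_factory=dict)
    # STRUCT analogue: a nested record.
    address: Optional[Address] = None


if __name__ == "__main__":
    customer = Customer(
        name="Example Customer",
        phones=["555-0100", "555-0101"],
        preferences={"language": "en"},
        address=Address("1 Main St", "Austin", 78701),
    )
    print(customer.phones[0], customer.preferences["language"], customer.address.city)
```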
Big Data: Unstructured Data, Using SerDes in BigSQL
  Before a SerDe.jar file can be used, it must be registered in BigSQL. Only when the jar file has been successfully registered is it available to BigSQL.
  3 steps to register:
  - Hive servers: copy the SerDe.jar file into the /lib/ directory
  - Big SQL nodes: copy the SerDe.jar file into the /userlib/ directory of each individual node
  - Restart all BigSQL services
Big Data: Example of Unstructured Data
  Example: parsing log files with regular expressions (RegexSerDe)
  select * from apache_log fetch first 5 rows only
  Use case: for example, correlating client data with web browser data to analyze user behavior
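Conceptually, a regex-based SerDe applies one regular expression to every line and turns each capture group into a column. The following Python sketch mimics that idea outside of Big SQL; the pattern, the column names and the sample log line are assumptions for illustration and are not the table definition used on the slide.

```python
import re

# One capture group per column, loosely modelled on the Apache combined log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of column name to value, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


SAMPLE_LINE = (
    '192.0.2.10 - - [07/Feb/2017:10:15:32 +0100] '
    '"GET /index.html HTTP/1.1" 200 5120 '
    '"http://example.com/start" "Mozilla/5.0 (X11; Linux x86_64)"'
)

if __name__ == "__main__":
    row = parse_line(SAMPLE_LINE)
    print(row["host"], row["status"], row["agent"])
```

Once the lines are split into columns like this, the client address and the browser string become ordinary columns that can be filtered and joined with SQL.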
Big SQL versus Hive
  SQLReplayer
Hive and Big SQL: Object Synchronization
  - Create a table in Hive
  - Synchronize the Hive tables
  - Test the Big SQL table
Hive and Big SQL: Data Synchronization (Refresh)
  - Edit the HDFS file
  - Select from the Hive table
  - Synchronization (refresh)
BIGSQL: Sham or Masterstroke?
  Sham:
  - DB2 DPF for HDFS
  Masterstroke:
  - The right strategy at the right time
  - Reuse of existing investments
  - Increased acceptance via the reuse of SQL
  - Simple integration of Big Data into an existing infrastructure
The Big Data Solution
  - Big SQL Hadoop tables are not a replacement for OLTP DBMS technology
  - Big SQL makes it possible to run SQL requests against existing Hadoop data (no proprietary storage formats)
  - All the data are Hadoop files in HDFS
  - Big SQL was developed to make effective and efficient use of the Hadoop infrastructure
  - Most organizations possess experienced SQL developers
  - No UPDATE or DELETE is possible on a Hadoop table
  - Much lower license costs than DPF
  - Good SQL compatibility
  - Great monitoring is available with Speedgain for BIGSQL
The Big Data Solution
  Primary use cases would be:
  - To move rarely referenced data out of the data warehouse and onto cheaper hardware while maintaining the ability to query the data via SQL
  - To set up a new data warehouse
  - To filter and analyze unstructured data (such as log files, sensor data and social media) and to connect this data to existing structured data (for example via federation)
Conclusion
  Bluff = Homerun
Q & A