The Reality of Qlik and Big Data. Chris Larsen Q3 2016



1 The Reality of Qlik and Big Data Chris Larsen Q3 2016

2 Introduction
Chris Larsen, Sr Solutions Architect, Partner Engineering @Qlik
- Based in Lund, Sweden
- Primary responsibility: Advanced Analytics (and formerly Big Data as well)
- Data modeling, JavaScript, *NIX, Architecture, Strategy, Business Analysis, Implementations
- 9 years Qlik dev, 10 years BI/DW, 16 years web dev
- LinkedIn: linkedin.com/in/chrislarsen

3 Challenge: Deliver Value Rapidly
- Easy-to-use integrated tools for data integration and visual analysis
- Fail fast, learn fast: users pick their own path to explore. If it's a dead end, they can easily try another path
- Ensure that everyone can work in harmony to contribute their own expertise
- Reuse data across multiple applications and share data and expressions in managed, governed libraries
"In four months we developed more applications and dashboards than we had in two years with other BI tools." Director of Business Intelligence

4 Law of Rabbitization (chart labels: Amount of Data, Technologies)

5 Big Data Landscape Beyond the Hype
- Unclear Strategy & Direction
- Big & Wide Data
- Data Scientist Bottleneck
- Skills Gap
- Rapidly Evolving Technology Landscape
- Data Warehouse Augmentation

6 The Big Data User
(Diagram: user types plotted by Depth of Coverage against Breadth of Coverage)
- Data Scientists: advanced analytics, deep drilling
- Data Experts: mostly drilling, some exploration
- Data Explorers: mostly exploration, some drilling; explore the Data Lake during preparation, quickly visualize

7 Hype Cycle
(Figure labels: Internet of Things, Pain, Big Data, In-Memory Analytics)
Source: Gartner, August 2014

8 Expanding on 3 fronts
- Data Velocity: Batch, Periodic, Near Real Time, Real Time
- Data Volume: MB, GB, TB, PB
- Data Variety: Table, Database, Web, XML, Audio, Video, Social

9 Partner for Success
A strong culture of partnership
- Broad range of technology partnerships
- Dedicated staff ensures continued focus
- Continuous evaluation of new market entrants
Partner categories: Hadoop, Databases, Accelerators, Domain, and more

10 Qlik Big Data Methodologies
- Different data volumes and varieties are best met using different methods
- Different methods ensure an optimized experience for the user in every situation
- Methods can be combined to meet different use cases
- Methods vary in deployment complexity
Methods: In-Memory, Segmentation, Chaining, Direct Discovery, On Demand App Generation
Data Volume factors:
- Size (rows)
- Dimensions (columns)
- Cardinality (uniqueness)
App Complexity factors:
- Computational complexity, such as set analysis
- Object density

11 Qlik In-Memory Approach
- Loads a highly compressed data index into memory
- Supports a highly interactive user experience
- Enables associative search and analysis across the entire data set
- Supports hundreds of millions to billions of rows of data

12 Qlik Segmentation & Chaining Approach
- Divides information into multiple related apps
- Users switch apps without noticing when drilling down
- Still utilizes a highly compressed data index in memory
- Supports hundreds of millions to billions of rows per segmented app
Apps in the example: Summary - Global; Detail - North America; Detail - Europe; Detail - Asia
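The chaining idea on this slide can be sketched in a few lines of Python. This is a minimal illustration only, not Qlik's implementation: the `Session` class, the `drill_down` function and the `Region` field name are hypothetical, and the app names simply mirror the slide's Summary/Detail layout.

```python
from dataclasses import dataclass, field

# App names mirror the slide; the routing table itself is an assumption.
SUMMARY_APP = "Summary - Global"
DETAIL_APPS = {
    "North America": "Detail - North America",
    "Europe": "Detail - Europe",
    "Asia": "Detail - Asia",
}


@dataclass
class Session:
    """Hypothetical user session: the open app plus current selections."""
    app: str = SUMMARY_APP
    selections: dict = field(default_factory=dict)


def drill_down(session: Session, region: str) -> Session:
    """Chain from the current app to the detail app for one region,
    carrying the user's existing selections into the new app."""
    if region not in DETAIL_APPS:
        raise KeyError(f"no detail app for region {region!r}")
    carried = dict(session.selections, Region=region)
    return Session(app=DETAIL_APPS[region], selections=carried)
```

Carrying the selections across is what makes the app switch feel seamless: the user lands in the regional detail app with the same filters already applied.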

13 Qlik Direct Discovery Approach
Associative Discovery for Big Data
- Seamlessly combines Big Data with other data sources
- Complements the high-frequency exploration done by most users
- Massive data scalability
Comparison:
- In-memory: new dimensions require reload
- Direct Discovery: supports billions of rows

14 On Demand App Generation
- A shopping cart approach to analytics
- Dimension selection to generate filtered analytics
- On demand data slices
- Converting Big Data to small data analytics
- Driven by business users, governed by IT

15 Qlik Sense and SAP HANA ODAG Example
1. Selection App is populated with dimensional data on schedule
2. User selects dimensional criteria in a mashup page. After the governed limit is reached, an extension object button appears
3. A NodeJS container, using the Sense Proxy, Engine and Repository APIs, indexes the analysis app with the most recent data from the SAP HANA database, with only the data slice relevant to the user
4. Analysis app is updated in the web mashup page
Diagram components: Web Mashup; Analysis App; APIs (Proxy, Engine and Repository); Selection App; Qlik Sense Server; Dimensional Load
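The selection-then-slice flow in steps 2 and 3 can be sketched as follows. This is a hedged illustration under stated assumptions, not the ODAG implementation: the function names, the `GOVERNED_ROW_LIMIT` value and the shape of the pre-aggregated `counts` table are all hypothetical, and the generated SQL is generic rather than SAP HANA specific.

```python
GOVERNED_ROW_LIMIT = 1_000_000  # assumed value; the slides do not state one


def estimate_rows(counts, selections):
    """Sum pre-aggregated row counts for every dimension combination that
    matches the user's selections.

    counts maps a tuple of (dimension, value) pairs to a row count.
    selections maps a dimension name to the set of selected values.
    """
    total = 0
    for combo, rows in counts.items():
        combo_dict = dict(combo)
        if all(combo_dict.get(dim) in values
               for dim, values in selections.items()):
            total += rows
    return total


def build_slice_query(table, selections):
    """Return (sql, params) restricting the table to the selected values.

    Values are bound as parameters. The table and dimension names are
    interpolated, so they must come from a trusted list, never user input.
    """
    clauses, params = [], []
    for dim in sorted(selections):
        values = sorted(selections[dim])
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f"{dim} IN ({placeholders})")
        params.extend(values)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT * FROM {table} WHERE {where}", params


def request_analysis_app(table, counts, selections, limit=GOVERNED_ROW_LIMIT):
    """Return the slice query once the selection fits the governed limit,
    or None while the selection is still too broad."""
    if estimate_rows(counts, selections) > limit:
        return None
    return build_slice_query(table, selections)
```

The design choice worth noting is that the row estimate comes from the small dimensional data in the selection app, so the large source table is only queried after the slice is known to be within the limit.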

16 Case Study - Belgacom
1. Selection App is populated with dimensional data on schedule
2. User selects dimensional criteria. After the governed limit is reached, an ASPX page is invoked
3. QMS API and EDX index the analysis app with the most recent data from the Teradata database, with only the data slice relevant to the user
4. Analysis app is deployed to the Access Point with user security
Diagram components: Analysis App; Access Point; Selection App; QlikView Server; IIS; QlikView Management Service; ASPX; QMS API; EDX; Publisher

17 QIX Performance

18 QIX Performance: Large Sales Apps, 200M Rows
This test represents a common sales force scenario, where a sales organization, including a large sales force, needs access to sales information at several levels of detail. Corporate users enter through a dashboard (7M) and chain to either the detailed app (200M) or Division app (4M) or Territory app (1M). All sales force users (1,000) use pre-sliced Territory apps (1M).
Test Inputs
- Number of rows: 7M, 200M, 4M & 1M
- Duration: 60 min.
- Server: IBM x3650 (2) E GHz (8-core)
- RAM: 192 GB
Test Results
- Total sessions: 6,540
- Concurrent users: 1,120
- RAM used: 138 GB
- Avg CPU: 30%
- Avg response time: 0.48 seconds
RAM never reached working set minimum for this 45 minute test. CPU stayed low and peaked at 50%, averaging 30%. Response times averaged 0.48 seconds across the 126,739 user selections.
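Two derived figures help put these results in context. The short calculation below uses only the numbers reported on the slide; the variable names are illustrative.

```python
# Figures as reported in the test results above.
total_sessions = 6_540
user_selections = 126_739
ram_used_gb = 138
ram_total_gb = 192

# Share of server RAM in use during the test.
ram_utilization = ram_used_gb / ram_total_gb

# Average number of selections (clicks) made per session.
selections_per_session = user_selections / total_sessions

print(f"RAM utilization: {ram_utilization:.1%}")                # 71.9%
print(f"Selections per session: {selections_per_session:.2f}")  # 19.38
```

So the server ran at roughly 72% of its RAM, with each session averaging about 19 selections.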

19 QIX Performance
- Performance is predictable and has truly linear scalability when adding resources
- Number of concurrent users and complexity of security, data model, data sources and business requirements have very little impact on performance
- Tightly coupled UI with engine access provides a robust, proven high-performance architecture
- Essentially provides the equivalent of a full outer join in SQL, which enables drill-anywhere and the gray in green, white and gray. Other products cannot do this
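The full outer join comparison can be made concrete with a small sketch. This is a simplified illustration of the idea, not how the QIX engine is implemented: the function names and sample data are hypothetical, and only the white (associated) and gray (excluded) states are modelled.

```python
def full_outer_join(left, right, key):
    """Join two lists of dicts on key, keeping unmatched rows from both sides."""
    keys = {row[key] for row in left} | {row[key] for row in right}
    joined = []
    for k in sorted(keys):
        left_rows = [row for row in left if row[key] == k] or [{}]
        right_rows = [row for row in right if row[key] == k] or [{}]
        for l_row in left_rows:
            for r_row in right_rows:
                joined.append({key: k, **l_row, **r_row})
    return joined


def selection_states(rows, selected_field, selected_value, target_field):
    """Mark each value of target_field as 'white' (associated with the
    selection) or 'gray' (excluded by it)."""
    all_values = {row[target_field] for row in rows if target_field in row}
    possible = {
        row[target_field]
        for row in rows
        if row.get(selected_field) == selected_value and target_field in row
    }
    return {v: ("white" if v in possible else "gray") for v in all_values}


customers = [
    {"cust": "A", "country": "SE"},
    {"cust": "B", "country": "US"},
    {"cust": "C", "country": "DE"},  # customer with no orders
]
orders = [
    {"cust": "A", "product": "X"},
    {"cust": "B", "product": "Y"},
    {"cust": "D", "product": "Z"},  # order with no known customer
]
joined = full_outer_join(customers, orders, "cust")
states = selection_states(joined, "country", "SE", "product")
```

Because unmatched rows survive the join, product Z still appears as a gray value after selecting country SE. An inner join would have dropped it, and the user could not see what the selection excludes.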

20 Qlik Performance BARC (Business Application Research Center) BI Survey 2015 reports top performance among all BI products

21 Hadoop and Other Open Source Big Data Platforms

22 Apache Hadoop v2
(Stack diagram, top to bottom)
- Hive (SQL), Pig (ETL), Mahout (ML), Giraph (Graph)
- Hive 1.2.0, Tez, Pig, Cascading, etc.
- HBase
- MapReduce; Other Compute Engines (Tez, Spark, etc.)
- YARN (Cluster Resource Manager)
- HDFS2

23 Hive on Tez
- YARN integration
- Distributed execution framework
- Eliminate extra map reads
- Dataflow model on DAG of nodes
Stinger Initiative
- Query optimisations
- Vectorised query execution
- Filter at storage layer vs SQL engine
- SQL cost-based optimiser
ORCFile format
- Higher compression
- Columnar
- Ideal for frequent fact filters
Certified. Connect via ODBC
Source: developers 44 companies

24 Impala
Parquet file format
- Driven from Twitter use cases
- Columnar data storage
- Limits IO to data needed
- Space saving
Impala SQL
- Cost-based optimisations
- Authentication, AD/Kerberos
- YARN integration
- In-memory caching
Metastore
- Can be same DB as Hive metastore, e.g. MySQL
- Query optimiser can use table/column stats
- Can use HBase/HDFS with several file formats, e.g. RCFile/Parquet
Impala Roadmap
- Additional SQL support
- S3 integration
- Nested data
Certified. Connect via ODBC
Source: http://blog.cloudera.
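The "limits IO to data needed" point about columnar storage can be illustrated with a toy comparison. This is a minimal in-memory sketch, not Parquet's actual file layout: the function names and sample rows are hypothetical, and the count of values touched stands in for IO.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, field):
    """Sum one field from row-oriented data.
    Every value of every row is read to reach the field."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[field]
    return total, values_read


def sum_column_store(columns, field):
    """Sum one field from column-oriented data.
    Only the requested column is read."""
    column = columns[field]
    return sum(column), len(column)


rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 20.0},
    {"id": 3, "region": "EU", "amount": 5.0},
]
row_total, row_reads = sum_row_store(rows, "amount")
col_total, col_reads = sum_column_store(to_columnar(rows), "amount")
```

Both layouts return the same total, but the row layout touches nine values and the column layout three. The gap grows with the number of columns, which is why wide fact tables benefit most.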

25 Apache Drill
Dynamic Schema Discovery
- Does not require schema/type spec to start query execution
- Leverages self-describing data formats, e.g. Parquet, Avro, JSON and NoSQL DBs
- Flexible data model built for complex/semi-structured data
Performance
- Distributed execution engine for query processing
- Columnar execution avoids disk access for columns not in the query
- Vectorisation allows the CPU to operate on record batches
- Optimistic and pipelined query execution
Data sources
- Drill creates a virtual view in JSON
- Nested data support
- MapR-FS, HDFS, HBase
- Can use Hive Metastore
- JSON, MongoDB/NoSQL
- Reuse Hive UDFs
Certified. Connect via ODBC
Diagram components: ODBC/JDBC; Drill SQL Query Layer & Execution Engine; Files; HBase; Hive (UDFs, Metastore, SerDes)
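The idea behind dynamic schema discovery is that self-describing records carry their own field names and types, so a reader can work out the schema while scanning. The sketch below shows that idea on JSON lines using only the Python standard library. It is an illustration of the concept, not Drill's implementation, and `discover_schema` is a hypothetical name.

```python
import json


def discover_schema(json_lines):
    """Infer field names and observed value types from JSON records
    as they are read, with no schema declared up front."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for name, value in record.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


lines = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "tags": ["x"]}',  # new field appears mid-stream
]
schema = discover_schema(lines)
```

The second record introduces a `tags` field that the first one lacks, and the discovered schema simply grows to include it. That is the property that suits semi-structured data.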

26 Resilient Distributed Datasets (RDD)
- In memory
- MR does not lend itself to interactive/ad-hoc analytical queries
- Logical collection of data partitioned across machines
- Can reference external datasets
Market
- Spark distributed with all Hadoop distros
- Not just Big Data use cases
Spark SQL
- SparkSQL DataFrame: distributed collection of data in named columns
- Supported on RDDs, Parquet, JSON, Hive and JDBC sources
- YARN integration
Connect via ODBC
Stack diagram: Spark SQL; Spark Streaming; MLlib (Machine Learning); GraphX (Graph Comp.); SparkR (R on Spark); Spark Core Engine. Status label: Alpha/Pre-alpha

27 Search Technology
- Search-centric indexing storage
- Based on Apache Lucene
- Enables full-text search
- REST API driven
- Optimized for high volume of traffic

28 Thank you
