HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS. Mark Brooks - Principal System Engineer, Kinetica. May 09, 2017

Transcription:

HOW TO ACHIEVE REAL-TIME ANALYTICS ON A DATA LAKE USING GPUS Mark Brooks - Principal System Engineer @ Kinetica May 09, 2017

The Challenge
How to maintain analytic performance while dealing with:
- Larger data volumes
- Streaming data with minimal end-to-end latency
- Ad-hoc drill-down (you can't pre-aggregate everything)

Architectural and Design Approaches
1. One database to rule them all
2. SQL on Hadoop (or directly on the Data Lake)
3. Data Lake + NoSQL + Spark + Search + Cache +
4. Lambda Architecture
5. Kappa Architecture
6. Next generation hardware acceleration

One Database To Rule Them All

SQL on a Data Lake
Credit: https://www.slideshare.net/bigdatapump/sql-on-hadoop-49494494

Hadoop + NoSQL + Search + Memory Cache +
Credit: Matt Turck - https://www.slideshare.net/mjft01/big-data-landscape-matt-turck-may-2014

Lambda Architecture
Credit: Nathan Marz http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
Credit: James Kinley http://jameskinley.tumblr.com/tagged/lambda

Kappa Architecture
"Stream processing systems already have a notion of parallelism; why not just handle reprocessing by increasing the parallelism and replaying history very, very fast?"
Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture
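
The reprocessing idea in the Kappa slide can be illustrated with a minimal Python sketch. It assumes a retained, replayable event log (here a plain list standing in for something like a Kafka topic); the names LOG, job_v1, job_v2 and replay are hypothetical and not taken from the talk. When the processing logic changes, the whole log is replayed through the new job, and partitioning by key is what would let the replay run in parallel.

```python
from collections import defaultdict

# Hypothetical retained event log; in practice this would be a replayable stream.
LOG = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]


def job_v1(events):
    """Original job: count events per user."""
    out = defaultdict(int)
    for e in events:
        out[e["user"]] += 1
    return dict(out)


def job_v2(events):
    """Revised job: sum amounts per user."""
    out = defaultdict(int)
    for e in events:
        out[e["user"]] += e["amount"]
    return dict(out)


def replay(log, job, partitions=1):
    """Rebuild an output table by replaying the full log through `job`.

    Events are split by key into `partitions` slices. Each slice is processed
    independently (sequentially here; a real system would run them in parallel)
    and the per-slice results are merged.
    """
    slices = [[] for _ in range(partitions)]
    for e in log:
        slices[sum(map(ord, e["user"])) % partitions].append(e)
    merged = {}
    for s in slices:
        merged.update(job(s))  # keys never overlap across slices
    return merged


table_v1 = replay(LOG, job_v1)                 # table built by the original logic
table_v2 = replay(LOG, job_v2, partitions=2)   # rebuilt from history with new logic
```

Because each key always lands in the same slice, raising the partition count changes only how much work can proceed at once, not the result.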

Next Generation Hardware Acceleration
Consider a system with these characteristics:
- Horizontally scalable
- Low end-to-end latency
- Powerful enough to not require pre-aggregation
This is now possible.
Credit: Jay Kreps https://www.oreilly.com/ideas/questioning-the-lambda-architecture

GPU Accelerated Compute
At scale, processing becomes the bottleneck.
- 1990-2000s, DATA WAREHOUSE: RDBMS & Data Warehouse technologies enable organizations to store and analyze growing volumes of data on high performance machines, but at high cost.
- 2005, DISTRIBUTED STORAGE: Hadoop and MapReduce enable distributed storage and processing across multiple machines. Storing massive volumes of data becomes more affordable, but performance is slow.
- 2010, AFFORDABLE MEMORY: Affordable memory allows for faster data read and write. HANA, MemSQL, & Exadata provide faster analytics.
- 2017, GPU ACCELERATED COMPUTE: GPU cores bulk process tasks in parallel - far more efficient for many data-intensive tasks than CPUs, which process those tasks linearly.

Kinetica: Core
ANALYTICS DATABASE ACCELERATED BY GPUs
- Columnar in-memory database
- Data available much like a traditional RDBMS: rows, columns
- Data held in memory; persisted to disk
- Interact with Kinetica through its native REST API, Java, Python, JavaScript, NodeJS, C++, SQL, etc., as well as with various connectors
- Native GIS & IP address object support
- VERY FAST: ideal for OLAP workloads
- Typical hardware setup: 256GB - 1TB memory with 2-4 GPUs per node
[Diagram: HTTP head node over a GPU-accelerated columnar in-memory database (columns A, B, C; rows 1-4), persisted to disk, on commodity hardware with GPUs]
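
Why a columnar layout suits OLAP-style aggregation can be shown with a small, generic Python sketch. It does not use or represent Kinetica's API or storage format; the table contents and the helper names to_columns and sum_where are made up for illustration. The point is that an aggregate touches only the columns it needs.

```python
# Hypothetical row-oriented table: one dict per record.
rows = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 80},
    {"id": 3, "region": "east", "sales": 50},
]


def to_columns(records):
    """Pivot row records into a column store: one list per attribute."""
    cols = {}
    for rec in records:
        for name, value in rec.items():
            cols.setdefault(name, []).append(value)
    return cols


def sum_where(cols, value_col, filter_col, wanted):
    """Sum `value_col` over rows whose `filter_col` equals `wanted`.

    Only two columns are scanned; the others are never read.
    """
    return sum(v for v, f in zip(cols[value_col], cols[filter_col]) if f == wanted)


columns = to_columns(rows)
total_sales = sum(columns["sales"])                          # scans one column
east_sales = sum_where(columns, "sales", "region", "east")   # scans two columns
```

Each column is a contiguous array of one type, which is also the shape of data that bulk-parallel hardware such as a GPU processes well.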

Multi-Head Ingest and Scale-Out Architecture
[Diagram: three nodes shown side by side, each with an HTTP head node over a columnar in-memory store persisted to disk, on commodity hardware with GPUs. Labels: MULTI-HEAD INGEST, ON-DEMAND SCALE OUT (+)]

Real-Time Data Handlers for Structured & Unstructured Data
- APIs: Java API, C++ API, JavaScript API, Node.js API, REST API, Python API
- VISUALIZATION via ODBC/JDBC
- GEOSPATIAL CAPABILITIES: geometric objects, tracks, WMS, WKT, geospatial endpoints
- OPEN SOURCE INTEGRATION: Apache NiFi, Apache Kafka, Apache Spark, Apache Storm
- OTHER INTEGRATION: message queues, ETL tools, streaming tools
[Diagram: KINETICA CLUSTER of four nodes, each with an HTTP head node over a columnar in-memory store persisted to disk, on commodity hardware with GPUs. Label: On-Demand Scale]

Parallel Ingest Provides High Performance Streaming
PARALLEL INGEST
- Every node of the system can share the task of data ingest, which provides higher throughput.
- Ingest can be made faster simply by adding more nodes.
- No compute is used on ingest!
[Diagram: three nodes, each 1TB / 2 GPU]
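
One common way to let every node take part in ingest is for the client to route each record by a hash of its key. The Python sketch below illustrates that idea only; it is an assumption for illustration, not a description of how Kinetica routes records, and node_for, route and the device_id field are hypothetical names.

```python
import zlib


def node_for(key: str, node_count: int) -> int:
    """Map a record key to a node index with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def route(records, node_count, key_field="device_id"):
    """Group records into one batch per node so that each batch can be
    sent straight to its node instead of through a single entry point."""
    batches = [[] for _ in range(node_count)]
    for rec in records:
        batches[node_for(rec[key_field], node_count)].append(rec)
    return batches


# Hypothetical stream of 1,000 device readings spread over 3 nodes.
records = [{"device_id": f"dev-{i}", "reading": i} for i in range(1000)]
batches = route(records, node_count=3)
```

Adding a node adds another batch target, which is the sense in which throughput grows with node count. A plain modulo scheme does move keys when the node count changes, so a production system would need a rebalancing strategy.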

Speed Layer for the Data Lake
- Parallel ingestion of events
- Kinetica is the speed layer, with real-time analytic capabilities
- HDFS for archival store
- Much looser coupling than traditional lambda architecture
- Batch mode Spark or MR jobs can push data to Kinetica as needed for fast query on data loaded from the data lake
[Diagram: EVENTS flow through MESSAGE BROKERS (Amazon Kinesis), Kinetica Connectors and STREAM PROCESSING into Kinetica via parallel ingestion (put, get, scan). Kinetica executes complex analytics on the fly for ANALYSTS, MOBILE USERS, DASHBOARDS & APPLICATIONS and ALERTING SYSTEMS. Archival store: HDFS / AWS S3 / GCS / Azure Data Lake]

Real-Time, Advanced Analytics, Speed Layer for Teradata or Oracle
- Parallel ingestion of events
- Lambda-type architecture for Teradata or Oracle
- Kinetica is the speed layer, with near-real-time analytic capabilities
- Converge machine learning, streaming and location analytics, and fast query and analytics with Kinetica and RDBMS
[Diagram: DATA IN MOTION AND AT REST flows through Amazon Kinesis, Kinetica Connectors and STREAM / ETL PROCESSING into a fast, GPU-accelerated, in-memory database (converge ML, AI, streaming) serving MOBILE USERS, ANALYSTS, DASHBOARDS & APPLICATIONS and ALERTING SYSTEMS, alongside the DATA WAREHOUSE / TRANSACTIONAL systems]

Advanced In-Database Analytics
ORCHESTRATION LAYER WITH USER-DEFINED FUNCTIONS (UDFs)
1. User-defined functions (UDFs) can receive table data, do arbitrary computations, and save output to a separate table in a distributed manner.
2. UDFs have direct access to CUDA APIs, which enables compute-to-grid analytics for logic deployed within Kinetica.
3. Works with custom code or packaged code. Opens the way for machine learning/artificial intelligence libraries such as TensorFlow, BIDMach, Caffe and Torch to work on data directly within Kinetica.
4. Available now with C++ & Java bindings.
[Diagram: on each physical/virtual server, a proc server runs UDF_A, UDF_B ... UDF_n against Table A, Table B ... Table n using CUDA libraries on the GPU, across n Kinetica servers. Data is returned to an output table (Table C) for further analysis. UDFs are exposed from a RESTful endpoint, e.g. /exec/proc/udf_a/]
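
The UDF flow in point 1 (receive table data, compute, write to a separate output table, in a distributed manner) can be sketched generically in Python. This is not Kinetica's UDF API: run_udf, udf_score and the shard layout are hypothetical, and a thread pool stands in for the per-server proc processes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_udf(udf, shards, workers=4):
    """Apply `udf` to every shard of an input table in parallel and
    concatenate the per-shard results into one output table."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(udf, shards))  # map() preserves shard order
    return [row for part in parts for row in part]


def udf_score(shard):
    """Example user-defined function: derive a score for each input row."""
    return [{"id": r["id"], "score": r["x"] * 2 + r["y"]} for r in shard]


# Hypothetical input table, already split into two shards (one per server).
table_a_shards = [
    [{"id": 1, "x": 1, "y": 0}, {"id": 2, "x": 2, "y": 1}],
    [{"id": 3, "x": 3, "y": 2}],
]

output_table = run_udf(udf_score, table_a_shards)
```

The function only ever sees its own shard, which is what allows the same code to run unchanged on every server while results accumulate in a separate table.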

Kinetica Architecture
[Diagram: STREAMING DATA and ERP / CRM / TRANSACTIONAL DATA enter through ETL / STREAM PROCESSING and PARALLEL INGEST into a Kinetica cluster (ON DEMAND SCALE OUT, + 1TB MEM / 2 GPU CARDS). In-database processing runs UDFs with CUSTOM LOGIC and ML libraries such as BIDMach. Access paths: Native APIs, SQL, Geospatial WMS and Custom Connectors, serving BI / GIS / APPS, KINETICA REVEAL, BI DASHBOARDS, and CUSTOM APPS & GEOSPATIAL]

AI & BI on One GPU-Accelerated Database
[Diagram: a HIGH PERFORMANCE ANALYTICS DATABASE at the center. BUSINESS USERS reach it for BUSINESS INTELLIGENCE (SQL, ODBC / JDBC), CUSTOM APPLICATIONS (native REST API) and the HIGH FIDELITY GEOSPATIAL PIPELINE (WMS). DATA SCIENTISTS / DEVELOPERS use UDFs and BIDMach for MACHINE LEARNING & DEEP LEARNING and GPU-ACCELERATED DATA SCIENCE. PREDICTIVE MODELS: e.g. risk management, sales volume, fraud]

50-100x Faster on Queries with Large Datasets
WHEN COMPARED TO LEADING IN-MEMORY ALTERNATIVES
- Large retailer tested complex SQL queries on 3 years of retail data (150bn rows)
- 10 node Kinetica cluster against 30TB+ cluster from next best alternative
- GPU is able to perform many instructions in parallel. Huge performance gains on aggregations, group bys, joins, etc.
- Kinetica sustained ingest of 1.3bn objects/minute with 70 attributes per row
[Chart: Kinetica vs. leading in-memory DB for SELECT (Q10), GROUP BY (Q5) and SUM (Q1); horizontal axis 0 to 50]

Distributed Geospatial Pipeline
NATIVE VISUALIZATION IS DESIGNED FOR FAST MOVING, LOCATION-BASED DATA
Native geospatial object types:
- Points, shapes, tracks, labels
Native geospatial functions:
- Filters (by area, by series, by geometry, etc.)
- Aggregation (histograms)
- Geofencing - triggers
- Video generation (based on dates/times)
Generate map overlay imagery (via WMS):
- Rasterize points
- Style based on attributes (class-break)
- Heat maps
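
As an illustration of what a geofencing trigger does, the Python sketch below tests each point of a track against a polygon with the standard ray-casting method and emits an event whenever the track crosses the fence. It is a generic example, not Kinetica's implementation; point_in_polygon, geofence_events, FENCE and TRACK are hypothetical names and data.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from
    (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def geofence_events(track, polygon):
    """Return (timestamp, "enter" | "exit") for each fence crossing."""
    events = []
    was_inside = None
    for t, x, y in track:
        now_inside = point_in_polygon(x, y, polygon)
        if was_inside is not None and now_inside != was_inside:
            events.append((t, "enter" if now_inside else "exit"))
        was_inside = now_inside
    return events


FENCE = [(0, 0), (10, 0), (10, 10), (0, 10)]               # square geofence
TRACK = [(1, -5, 5), (2, 5, 5), (3, 8, 8), (4, 15, 5)]     # (time, x, y)
events = geofence_events(TRACK, FENCE)
```

The same point-in-polygon test, applied to every row rather than to consecutive track points, is the basis of a "filter by area" query.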

Full-Text Search
Kinetica includes powerful text search functionality, including:
- Exact phrases
- Boolean AND / OR
- Wildcards
- Grouping
- Fuzzy search (Damerau-Levenshtein optimal string alignment algorithm)
- N-gram term proximity search
- Term boosting, relevance prioritization
Example query fragments shown on the slide: Rain Tire ~5, "Union Tranquility"~10, [100 TO 200]
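
The optimal string alignment (OSA) variant of Damerau-Levenshtein distance named on the slide can be written as a short dynamic program. The Python sketch below is a textbook version for illustration, not Kinetica's code; fuzzy_match and its max_edits parameter are hypothetical helpers.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: the minimum number of insertions,
    deletions, substitutions and adjacent transpositions needed to turn
    `a` into `b`, where no substring is edited more than once."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent transposition, e.g. "re" <-> "er"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(term, candidates, max_edits=1):
    """Return the candidates within `max_edits` OSA edits of `term`."""
    return [c for c in candidates if osa_distance(term, c) <= max_edits]


hits = fuzzy_match("tire", ["tier", "tired", "rain", "tyre"])
```

Counting a swapped pair of letters as one edit rather than two is what makes this measure a good fit for typo-tolerant search.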

CASE STUDY: LOCATION BASED ANALYTICS
INTELLIGENCE: US Army - INSCOM
US Army's in-memory computational engine for any data with a geospatial or temporal attribute, for a major joint cloud initiative within the Intelligence Community (IC ITE).
U.S. Army INSCOM shift from Oracle to GPUdb:
- Intel analysts are able to conduct near real-time analytics, fuse SIGINT, ISR, and GEOINT streaming big data feeds, and visualize them in a web browser.
- First time in history military analysts are able to query and visualize billions to trillions of near real-time objects in a production environment.
- Major executive military and congressional visibility.
- 1 GPUdb server vs 42 servers with Oracle 10gR2 (2011): 42x lower space, 28x lower cost, 38x lower power cost.
[Chart: GPUdb (20ms) vs. Oracle Spatial (92 minutes)]

CASE STUDY: LOCATION BASED ANALYTICS
LOGISTICS: Workforce optimization
USPS is the single largest logistics entity in the country, moving more individual items in four hours than UPS, FedEx, and DHL combined move all year.
DISTRIBUTED ANALYSIS: The USPS parallel cluster is able to serve up to 15,000 simultaneous sessions, providing the service's managers and analysts with the capability to instantly analyze their areas of responsibility via dashboards.
AT SCALE: With 200,000 USPS devices emitting location once every minute, that amounts to more than a quarter billion events captured and analyzed daily, tracked on 10 nodes.
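
The "quarter billion events daily" figure follows from the numbers on the slide. The short Python check below works it through, assuming (as an upper bound not stated in the talk) that every device reports around the clock and that load is spread evenly over the 10 nodes.

```python
DEVICES = 200_000            # from the slide
EVENTS_PER_DEVICE_MIN = 1    # one location event per device per minute
MINUTES_PER_DAY = 24 * 60    # assumption: devices report 24 hours a day
NODES = 10                   # from the slide

events_per_day = DEVICES * EVENTS_PER_DEVICE_MIN * MINUTES_PER_DAY
events_per_second = events_per_day / (24 * 60 * 60)
events_per_node_per_second = events_per_second / NODES  # assumes even spread

print(f"{events_per_day:,} events/day")
print(f"{events_per_second:,.0f} events/s across the cluster")
print(f"{events_per_node_per_second:,.0f} events/s per node")
```

Under those assumptions the total is 288 million events per day, consistent with "more than a quarter billion".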

CASE STUDY: LOCATION BASED ANALYTICS
LOGISTICS & FLEET MANAGEMENT: LARGE RETAILER
Kinetica enables agile tracking of shipments to help store managers track inventory and arrival times.
- Visibility and tracking of deliveries & trucks for store managers
- ETA & notifications: provide estimated time of delivery, notifications and custom location-based alerting
- Route optimization based on truck size, and on whether cargo is perishable or contains hazardous materials

CASE STUDY: ADVANCED IN-DATABASE ANALYTICS
RISK MANAGEMENT: MULTINATIONAL BANK
Large financial institution moves counterparty risk analysis from overnight to real-time.
- Data is collected by an XVA library, which computes risk metrics for each trade.
- Risk computations are becoming more complex and computationally heavy.
- XVA analysis needs to project years into the future.
- Kinetica enables banks to move from batch/overnight analysis to a streaming/real-time system for flexible real-time monitoring by traders, auditors and management.

Scale Out on Industry Standard Hardware
- Kinetica typically results in 1/10 the hardware costs of standard in-memory databases.
- Runs on industry standard servers, 512GB memory with GPUs (e.g. NVIDIA K80)
IN THE CLOUD WITH: / COMING SOON: / CERTIFIED ON PREMISE WITH: [vendor logos not captured in the transcript]

Stop by Booth #431 and Get Your Free T-shirt www.kinetica.com