Big Data Technology Ecosystem. Mark Burnette, Pentaho Director of Sales Engineering, Hitachi Vantara


Agenda: End-to-End Data Delivery Platform; Ecosystem of Data Technologies; Mapping an End-to-End Solution; Case Studies; Pentaho Key Capabilities; Summary; Q&A

End-to-End Data Delivery Platform
Stages: Ingest, Process, Publish, Report.
Capabilities: Data Agnostic; Metadata-Driven Ingestion; Data Orchestration; Native Hadoop Integration; Scale Up & Scale Out; Blend Unstructured Data; Streamlined Data Refinery; Data Virtualization; Machine Learning; Production Reporting; Custom Dashboards; Self-Service Dashboards; Interactive Analysis; Embedded Analytics.

Delivering Insight
Pipeline: Ingest, Process, Publish, Report, driven by Data Integration & Orchestration.
Outputs: Custom Dashboards, Self-Service Dashboards, Interactive Analysis, Production Reporting.
Audiences: Data Engineers, Data Scientists, Data Analysts, Consumers.

Big Data Ecosystem
1. Relational Database
2. Analytical Database
3. NoSQL Database
4. HDFS MapReduce
5. SQL on Hadoop
6. Distributed Search
7. Message Streaming
8. Event Stream Processing (ESP)
9. Complex Event Processing (CEP)

Data Source Attributes
Each of the nine technologies is rated along four attributes:
Volume (Data Size): Small / Medium / Large
Variety (Data Type): Structured / Semi-Structured / Unstructured
Velocity (Processing): Batch / Micro-Batch / Real-Time Streaming
Latency (Reporting): Scheduled / Prompted / Interactive

Core Competency: Relational Database (rating legend: Core Competency / Good Fit / Not Optimal / Not Recommended)
Examples: MSFT SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2
Volume (Data Size): Operational databases for OLTP apps that require high transaction loads and user concurrency. They can scale up to larger data volumes but lack the ability to easily scale out for large data processing.
Variety (Data Type): Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.
Velocity (Processing): Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.
Latency (Reporting): Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.
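The small CRUD workload described above can be sketched with Python's built-in sqlite3 module, standing in for the OLTP engines named on the slide (illustrative only, not a production setup):

```python
import sqlite3

# In-memory database standing in for an OLTP store such as
# PostgreSQL or SQL Server; sqlite3 is used here purely to illustrate
# the create/read/update/delete pattern relational engines optimize for.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

# Create
conn.execute("INSERT INTO orders (customer, total) VALUES (?, ?)", ("acme", 120.0))
# Read
row = conn.execute("SELECT customer, total FROM orders WHERE id = 1").fetchone()
# Update
conn.execute("UPDATE orders SET total = ? WHERE id = 1", (99.5,))
updated = conn.execute("SELECT total FROM orders WHERE id = 1").fetchone()[0]
# Delete would be: conn.execute("DELETE FROM orders WHERE id = 1")
conn.commit()
```

Each statement touches one row by primary key, which is exactly the access pattern these engines handle well and analytic scans do not.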

Core Competency: Analytical Database
Types: Columnar, In-Memory, MPP, OLAP. Examples: Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica
Volume (Data Size): Data warehouse/mart databases that support BI and advanced analytics workloads. The MPP architecture provides the ability to scale out to large data volumes, at a financial cost.
Variety (Data Type): Structured schema of tables containing rows and columns of data, offering improved speed and scalability over an RDBMS but still limited to structured data.
Velocity (Processing): Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.
Latency (Reporting): All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance on analytic and interactive query workloads over large data.
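A toy contrast in pure Python shows why the columnar layout mentioned above suits analytic queries: an aggregate like SELECT SUM(amount) only needs one column, and a column store reads just that array instead of every full row.

```python
# Row-oriented layout: each record stored together (OLTP-friendly).
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 20.0},
    {"id": 3, "region": "EU", "amount": 30.0},
]

# Column-oriented layout: each attribute stored contiguously (OLAP-friendly).
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [10.0, 20.0, 30.0],
}

row_total = sum(r["amount"] for r in rows)  # must visit every field of every row
col_total = sum(columns["amount"])          # touches a single contiguous array
```

Both totals match, but the columnar scan does proportionally less I/O as tables grow wide, which is the core of the query-performance claim on the slide.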

Core Competency: NoSQL Database
Examples: MongoDB, HBase, Cassandra, MarkLogic, Couchbase
Volume (Data Size): Good for web applications: less web app code to write, debug, and maintain. Scales out horizontally with auto-sharding of data to support millions of web app users. Compromises on consistency (ACID transactions) in favor of scale and up-time.
Variety (Data Type): Hierarchical, key-value, or document designs capture all types of data in a single location. Schema-less design allows rapid or continuous ingest at scale.
Velocity (Processing): A good storage option for the high-throughput, low-latency requirements of streaming applications that need real-time views of data; seen as a key component of the Lambda architecture.
Latency (Reporting): Low-level query languages, scarce skills, and lack of SQL support make NoSQL less appealing for reporting and analysis.
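A minimal sketch of the schema-less, auto-sharded design described above, written with plain dictionaries (this is an illustration of the idea, not the real MongoDB or Cassandra API; the function names are invented here):

```python
# Hash-based auto-sharding: each document key is routed to one of a
# fixed set of shards, so load spreads horizontally across nodes.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> dict:
    """Route a document key to a shard by hashing."""
    return shards[hash(key) % NUM_SHARDS]

def put(key: str, doc: dict) -> None:
    shard_for(key)[key] = doc  # documents need no shared schema

def get(key: str) -> dict:
    return shard_for(key)[key]

# Documents of entirely different shapes coexist in one "collection".
put("user:1", {"name": "Ada", "tags": ["admin"]})
put("evt:77", {"sensor": "t-9", "reading": 21.4, "unit": "C"})
```

Note what is missing: there is no cross-shard transaction, which mirrors the consistency trade-off the slide describes.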

Core Competency: HDFS MapReduce
Examples: Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, MSFT HDInsight
Volume (Data Size): The Hadoop Distributed File System distributes and replicates file blocks horizontally across multiple commodity data nodes. MapReduce programming takes the compute to the data for batch processing of large data volumes.
Variety (Data Type): The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.
Velocity (Processing): HDFS and MapReduce are designed for distributed batch processing workloads on large datasets, not for micro-batch or streaming use cases.
Latency (Reporting): MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.
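The MapReduce programming model mentioned above can be sketched with the classic word count, collapsed into a single process: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. Real MapReduce runs the same three phases distributed across HDFS data nodes.

```python
from collections import defaultdict

def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insight", "data at scale"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```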

Core Competency: SQL on Hadoop
Types: Batch-oriented, Interactive, and In-Memory. Examples: Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL
Volume (Data Size): SQL queries run against a metadata layer (HCatalog) in Hadoop. Queries are converted to MapReduce, Apache Tez, Impala MPP, or Spark jobs and run over storage formats such as HDFS and HBase.
Variety (Data Type): SQL was designed for structured data, while Hadoop files may contain nested, variable, or schema-less data. A SQL-on-Hadoop engine must translate all of these forms into flat relational data and optimize the queries (e.g., Impala, Drill).
Velocity (Processing): SQL-on-Hadoop engines require smart, advanced workload managers for multi-user workloads; they are designed for query processing, not stream processing.
Latency (Reporting): Suited to ad-hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s versus 1.6 minutes or more.
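The translation step described above, from nested, variable-shape records to flat relational rows, can be illustrated with a small hypothetical flattener (the function and the dotted-column naming are assumptions for illustration, not any engine's actual internals):

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Project a nested record into flat columns with dotted names.
    Fields missing from a record simply stay absent, mirroring how
    schema varies across semi-structured Hadoop files."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

records = [
    {"id": 1, "user": {"name": "Ada", "geo": {"country": "UK"}}},
    {"id": 2, "user": {"name": "Lin"}},  # shape varies record to record
]
rows = [flatten(r) for r in records]
```

Once every record is a flat row of named columns, an ordinary SQL planner can run over the data, which is the job Impala and Drill do at scale.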

Core Competency: Distributed Search
Examples: Elasticsearch, Solr (both based on Apache Lucene), Amazon CloudSearch
Volume (Data Size): Search engines deal with large systems holding millions of documents and are designed for indexing and search query processing at scale, with clustered, distributed architectures.
Variety (Data Type): Connectors cover XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), FileSystem, Git, JDBC, JMS, Kafka, LDAP, MongoDB, Neo4j, RabbitMQ, Redis, and Twitter.
Velocity (Processing): Elasticsearch scales to very large clusters with near real-time search. Real-time web applications demand search results in near real time as new content is generated by users, though there can be some contention when handling concurrent search and index requests.
Latency (Reporting): Both have their own query languages; Solr is oriented more toward text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries, but not for interactive analytical reporting.
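The core structure behind the Lucene-based engines named above is the inverted index: each term maps to the set of documents containing it, so a search is a lookup plus an intersection rather than a scan. A minimal single-node sketch:

```python
from collections import defaultdict

docs = {
    1: "flash storage for real time insight",
    2: "real time search at scale",
    3: "batch processing of large data",
}

# Build the inverted index: term -> set of document ids (a posting list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query: str) -> set:
    """AND-match every query term by intersecting posting lists."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

hits = search("real time")
```

A distributed engine shards this index across nodes and merges per-shard results, which is what lets it stay interactive over millions of documents.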

Core Competency: Message Streaming
Examples: Kafka, JMS, AMQP
Volume (Data Size): Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.
Variety (Data Type): Data sources include the internet of things, sensors, clickstreams, and transactional systems.
Velocity (Processing): Real-time streaming with high throughput for both publishing and subscribing, and constant performance even with many terabytes of stored messages. Designed for streaming; batch size can be configured to broker micro-batches of messages.
Latency (Reporting): Stream topics need to be processed by additional technology (such as PDI, ESP, CEP, or query processing engines) before reporting.
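The publish/subscribe model described above can be sketched as an append-only topic log where each consumer group tracks its own offset, which is why many subscribers can read the same retained stream independently (a toy illustration of the idea, not the real Kafka client API):

```python
class TopicLog:
    """Toy Kafka-style topic: an append-only log with per-group offsets."""

    def __init__(self):
        self.messages = []   # append-only, retained for replays
        self.offsets = {}    # one read position per consumer group

    def publish(self, message: dict) -> None:
        self.messages.append(message)

    def poll(self, group: str, max_records: int = 10) -> list:
        start = self.offsets.get(group, 0)
        batch = self.messages[start:start + max_records]
        self.offsets[group] = start + len(batch)  # commit the new offset
        return batch

topic = TopicLog()
for i in range(3):
    topic.publish({"sensor": "t-1", "reading": i})

first = topic.poll("hadoop-ingest", max_records=2)  # micro-batch of 2
second = topic.poll("dashboard")                    # same data, own offset
```

The max_records knob mirrors the slide's point that batch size can be tuned to broker micro-batches.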

Core Competency: Event Stream Processing (ESP)
Example: Apache Storm
Volume (Data Size): Apache Storm is a distributed, event-at-a-time stream processing system for processing large volumes in parallel with sub-second latency.
Variety (Data Type): Storm applications process one incoming event at a time as tuples of data; a tuple may contain objects of any type, from sources such as the internet of things, sensors, and transactional systems.
Velocity (Processing): Storm is extremely fast, able to process over a million messages per second per node. It compromises on fault tolerance by offering at-least-once semantics in favor of speed.
Latency (Reporting): ESP provides the most recently processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or a red down arrow in real time.
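The ticker use case above can be sketched event-at-a-time: each incoming tick is handled individually, the way Storm processes one tuple at a time, and is immediately mapped to an up/down indicator (a single-process illustration, not Storm's actual spout/bolt API):

```python
last_price = {}  # small piece of state carried between events

def process_tick(tick: dict) -> str:
    """Handle one event and emit a signal relative to the prior price."""
    symbol, price = tick["symbol"], tick["price"]
    prev = last_price.get(symbol)
    last_price[symbol] = price
    if prev is None:
        return f"{symbol} -"                      # first tick: no direction yet
    return f"{symbol} {'UP' if price > prev else 'DOWN'}"

ticks = [
    {"symbol": "GOOG", "price": 100.0},
    {"symbol": "GOOG", "price": 101.5},
    {"symbol": "GOOG", "price": 99.0},
]
signals = [process_tick(t) for t in ticks]
```

Because each event is processed the moment it arrives, the output is always current, which is the sub-second-latency property the slide highlights.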

Core Competency: Complex Event Processing (CEP)
Examples: Spark, Flink
Volume (Data Size): Spark and Flink are distributed micro-batch stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.
Variety (Data Type): Complex event processing for the internet of things, sensors, and transactional systems. An aggregation-oriented CEP solution focuses on executing online algorithms in response to event data entering the system; a detection-oriented CEP solution focuses on detecting combinations of events, called event patterns or situations.
Velocity (Processing): Micro-batch processing with a few seconds of latency is not as fast as Storm, but it offers better fault tolerance, guaranteeing exactly-once semantics for stateful computations. Great for machine learning computations.
Latency (Reporting): CEP provides the most recently processed data for all types of reporting. Example CEP use case: a user sets up a stock market alert saying "let me know if GOOG went up by 10% and stayed up for 3 hours or more".
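The alert above is a detection-oriented pattern: fire only when a 10% rise is sustained for 3 hours. A minimal sketch, assuming simplified (hour, price) events and a fixed baseline rather than a real sliding window:

```python
def sustained_rise(events, threshold=1.10, hold_hours=3):
    """Detect the pattern 'price rose >= 10% over the baseline and
    stayed there for hold_hours or more'. Events are (hour, price)
    pairs in time order; the first price serves as the baseline."""
    baseline = events[0][1]
    rise_start = None
    for hour, price in events:
        if price >= baseline * threshold:
            if rise_start is None:
                rise_start = hour            # pattern begins
            if hour - rise_start >= hold_hours:
                return True                  # rise held long enough: alert
        else:
            rise_start = None                # rise broken: reset the pattern
    return False

steady = [(0, 100), (1, 111), (2, 112), (3, 113), (4, 115)]
broken = [(0, 100), (1, 111), (2, 95), (3, 112), (4, 113)]
```

This is the combination-of-events idea in miniature: no single event triggers the alert; only a sequence matching the pattern does.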

Big Data Ecosystem (recap)
1. Relational Database
2. Analytical Database
3. NoSQL Database
4. HDFS MapReduce
5. SQL on Hadoop
6. Distributed Search
7. Message Streaming
8. Event Stream Processing (ESP)
9. Complex Event Processing (CEP)

Mapping A Solution

Matrix for Analytics Performance (MAP)
Rates each technology (Relational Database, Analytical Database, NoSQL Database, Hadoop File System (HDFS MapReduce), SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)) as Core Competency, Good Fit, Not Optimal, or Not Recommended across the four attributes: Volume (Small/Medium/Large), Variety (Structured/Semi-Structured/Unstructured), Velocity (Batch/Micro-Batch/RT Streaming), and Latency (Scheduled/Prompted/Interactive).

Big Data Projects
Big data sources flow through Pentaho Data Integration into the Hadoop data lake, then through Pentaho Data Integration again into analytic datasets for the lines of business. Traditional data flows through Pentaho Data Integration into the data warehouse and on to data marts. Delivery patterns include centralized analytics at scale, self-service analytics, on-demand data marts, embedded analytics, and extranet deployments (PDI + Analytics).

A Single Flow
Stages: Data Engineering, Data Prep, Analytics, spanning Ingestion, Processing, Blending, Data Delivery, Data Discovery / Analysis, and Analysis & Dashboards.
Cross-cutting capabilities: Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline Monitoring, Automation.

Key Takeaways: Data architecture modernization involves many technologies; understanding the ecosystem of data technologies; mapping an end-to-end solution; Pentaho key capabilities.