The Technology of the Business Data Lake. Appendix

Pivotal data products

Term: Description
Greenplum Database: A massively parallel platform for large-scale data analytics to manage and analyze petabytes of data, also available with Hadoop HDFS storage tier integration (HAWQ, an add-on for PHD). HAWQ brings mature MPP technology for SQL on Hadoop. MADlib, an in-database parallel implementation of common analytics functions, will also work with HAWQ soon.
GemFire: A real-time distributed data store with linear scalability and continuous uptime capabilities, now available with a storage tier integrated on Hadoop HDFS (GemFire XD).
Pivotal HD: Commercially supported Apache Hadoop. HAWQ brings mature enterprise-class SQL capabilities to Hadoop and GemFire XD brings real-time data access to Hadoop.
Spring XD: Simplifies the process of creating real-world big data solutions. Simplifies high-throughput data ingestion and export, along with the ability to create cross-platform workflows.
Pivotal Data Dispatch: On-demand big data access across and beyond the enterprise. PDD provides data workers with security-controlled, self-service access to data. IT manages data modeling, access, compliance, and data lifecycle policies for all data provided through Pivotal DD.
Pivotal Analytics: Provides the business community with visualizations and insights from big data. It provides the ability to join data from different sources to quickly create visualizations and dashboards. Pivotal Analytics can infer schemas from data sources and automatically create insights as it ingests data from various sources, freeing up business analysts to focus on analyzing data and generating insights rather than manipulating data.

Appendix 2: Terminology

Terminology

Term: Description
Synchronous path: Processing that happens while the user is waiting for the results from an action (usually a click). The results are usually returned from information stored in the real-time systems.
Asynchronous path: Processing that happens in the background while no user is waiting for the results of the analysis. The results of the processing influence the synchronous processing by refreshing the information that synchronous path processing relies on.
Streaming: Processing (collection, scoring, aggregation, deposition) of a single event as it happens. Streaming is usually associated with synchronous path processing.
Micro batch: Processing of a group of events as they arrive frequently in a compact package, usually every few seconds or minutes.
Batch: Processing of a large group of events arriving in a package, usually hourly, daily or monthly.
Mega batch: Infrequent processing of all or most of the data (a very large amount). Although repeatable, it is usually done once a quarter or even less frequently.
Frequency: Rate at which the events are generated, also known as the event rate.
Latency: Time delay between the event generation (resulting from a business activity) and receiving it.
SLA: Service level agreement with the data consumer on the latency, quality and completeness of the data, along with uptime guarantees.
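The micro batch and latency definitions above can be illustrated with a minimal Python sketch. The event shape (a generation timestamp in seconds plus a payload), the function names and the five-second window are assumptions chosen for illustration; they are not part of the Business Data Lake material or of any Pivotal product API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (generation time in seconds, payload).
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], window_seconds: float) -> Dict[int, List[str]]:
    """Group events into fixed time windows, keyed by window index."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    batches: Dict[int, List[str]] = defaultdict(list)
    for generated_at, payload in events:
        # Events generated within the same window land in the same micro batch.
        batches[int(generated_at // window_seconds)].append(payload)
    return dict(batches)


def latency(generated_at: float, received_at: float) -> float:
    """Delay between event generation and receipt, in seconds."""
    return received_at - generated_at


# Illustrative data only.
events = [(0.5, "click-a"), (3.2, "click-b"), (5.1, "click-c"), (11.9, "click-d")]
batches = micro_batches(events, window_seconds=5.0)
```

Streaming would correspond to handling each event individually as it arrives, while batch and mega batch would use the same grouping idea with much larger windows.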

Terminology (cont'd.)

Term: Description
Real-time response time: Very low latency between the event occurrence and insight generation, usually within seconds of the event occurrence.
Interactive response time: Time a user has to wait for the results. If it is within minutes, it is considered interactive. If the user needs to take a coffee break, it is batch.
Near real-time response time: Slightly higher latency than real time, usually within a few minutes of the event occurrence.
Analytics: Algorithms that run on the data. They range widely, from simple pre-computed aggregation to complex algorithms looking for patterns in data.
Insights: Results from the analytical algorithms made available to applications or business users.
Actions: The activities that a business or an application performs in response to the information from the insights.
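As a rough illustration of the response-time terms, the sketch below classifies a measured delay. The terminology only says "within seconds" and "within a few minutes", so the numeric thresholds (10 seconds and 300 seconds) and the function names are assumptions made for this example, not values from the source.

```python
# Assumed thresholds; the terminology gives no exact numbers.
REAL_TIME_LIMIT_S = 10.0        # "within seconds" (assumption)
NEAR_REAL_TIME_LIMIT_S = 300.0  # "within a few minutes" (assumption)
INTERACTIVE_LIMIT_S = 300.0     # user wait "within minutes" (assumption)


def classify_event_latency(seconds: float) -> str:
    """Classify the delay between event occurrence and insight generation."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds <= REAL_TIME_LIMIT_S:
        return "real-time"
    if seconds <= NEAR_REAL_TIME_LIMIT_S:
        return "near real-time"
    return "slower than near real-time"


def classify_user_wait(seconds: float) -> str:
    """Classify how long a user waits for query results."""
    if seconds < 0:
        raise ValueError("wait time cannot be negative")
    return "interactive" if seconds <= INTERACTIVE_LIMIT_S else "batch"
```

Under these assumed limits, a 2-second event-to-insight delay counts as real-time, and a 30-minute wait for a query result counts as batch.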

Appendix 3: Components of Business Data Lake

How is Business Data Lake different?

Criteria: Business Data Lake / EDW
Common data model: Base class = standard data, derived classes = local data / Single class = single view across the enterprise
Data quality: Full spectrum
Data integration: Multiple interfaces (SQL, SAS, R, ..., NoSQL) / SQL access, integration with SAS, R and other analytical interfaces
Mixed workload with varying QoS: Support low latency, interactive and batch / Limited QoS separation required

Generic Business Data Lake architecture

[Architecture diagram. Labels shown:]
Sources
Ingestion tier: Real-time ingestion, Micro batch ingestion, Batch ingestion
Unified operations tier: System monitoring, System management
Unified data management tier: Data mgmt. services, MDM, RDM, Audit and policy mgmt.
Processing tier: In-memory, MPP database
Distillation tier
HDFS storage: Unstructured and structured data
Insights tier: Real-time insights, Interactive insights, Batch insights
Action tier
Other labels: Real time, Micro batch, Mega batch, SQL, NoSQL, Workflow management, Query interfaces

Components of Business Data Lake

Term: Description
Storage: Ability to store ALL (structured, unstructured) data cost-efficiently in the Business Data Lake.
Ingestion: Ability to bring data from multiple data sources across all timelines with varying QoS.
Distillation: Ability to take the data stored in the storage tier and convert it to structured data for easier analysis by downstream applications.
Processing: Ability to run analytical algorithms and user queries with varying QoS (real time, interactive, batch) to generate structured data for easier analysis by downstream applications.
Insights: Ability to analyze all the data with varying QoS (real time, interactive and batch) to generate insights for business decisioning.
Action: Ability to integrate the insights with the business decisioning systems.
Unified data management: Ability to manage the data lifecycle, access policy definition, and master data management and reference data management services.
Unified operations: Ability to monitor, configure and manage the whole Data Lake from a single operations environment.
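The distillation step described above (turning stored raw data into structured data) might look like the following minimal Python sketch. The raw line format, the field names and the function names are hypothetical and chosen only for illustration; in the Business Data Lake this role is played by connectors, programs and models rather than by this code.

```python
import re
from typing import Dict, List, Optional

# Hypothetical raw format: "<timestamp> <LEVEL> <free-text message>".
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def distill_line(raw_line: str) -> Optional[Dict[str, str]]:
    """Convert one raw line into a structured record, or None if it does not parse."""
    match = LINE_PATTERN.match(raw_line.strip())
    return match.groupdict() if match else None


def distill(raw_lines: List[str]) -> List[Dict[str, str]]:
    """Keep only the lines that can be converted to structured records."""
    records = []
    for line in raw_lines:
        record = distill_line(line)
        if record is not None:
            records.append(record)
    return records


# Illustrative data only.
raw = [
    "2014-01-01T10:00:00Z INFO user 42 logged in",
    "garbled ### bytes",
    "2014-01-01T10:00:05Z ERROR payment failed",
]
records = distill(raw)
```

Lines that do not match the assumed format are skipped here; a real pipeline would more likely route them to a quarantine area so that no raw data is lost.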

Pivotal components for the tiers

Term: Description
Storage: Pivotal HD.
Ingestion: GemFire XD, HAWQ, Pivotal HD and Spring XD.
Distillation: Pivotal Data Dispatch.
Processing: Pivotal HD, HAWQ and GemFire XD queries, optionally managed via Spring XD workflows.
Insights: Pivotal HD, HAWQ and GemFire XD queries from user applications.
Action: Big data applications, also known as business decisioning systems.
Unified data management: Pivotal Data Dispatch, master data management and reference data management services.
Unified operations: Pivotal Command Center (component of Pivotal HD to manage HAWQ and GemFire XD*), Spring XD monitoring and Pivotal Data Dispatch monitoring.

Data Lake interfaces

Ingestion (columns: Streaming, Micro batch, Batch, Mega batch)
Data Loader: Yes Yes Yes
GemFire XD: Yes
PDD
Spring XD: Yes Yes Yes Yes
Sqoop: Yes Yes
Distcp: Yes Yes
Flume: Yes Yes Yes
HDFS put: Yes Yes
Talend: Yes Yes
Informatica: Yes Yes

Interface (columns: Real time, Interactive, Batch)
GemFire XD (SQL): Yes Yes
HAWQ (SQL): Yes Yes Yes
Hive (HiveQL): Yes
HBase (NoSQL): Yes Yes Yes
Pig: Yes
Impala (SQL): Yes Yes

BI Tools (columns: GemFire XD, HAWQ, Hive)
MicroStrategy: Yes Yes
BusinessObjects: Yes Yes
Spotfire: Yes Yes
Tableau: Yes Yes
Microsoft Excel: Yes Yes
Datameer: Yes Yes
Karmasphere: Yes Yes

Monitoring, data management: Pivotal Command Center, Pivotal Data Dispatch
Configuration, install: Pivotal Command Center
Groupings: Ingestion + Analytics, Analytics, Data access
Legend: Pivotal, Apache, Partner, Competition

[Diagram: data ingestion. Labels shown: Files, Events, Low throughput, High throughput, Event collection, Event processing, Streaming, Micro batch, Mega batch, Real time, Batch, GemFire XD, Spring XD, Data loader, N/A]

GemFire XD: SQL insert of data into GemFire XD and an API to send data to GemFire XD.
Spring XD: Out-of-the-box support for HTTP, Tail, Mail, Twitter, GemFire, TCP, JMS, RabbitMQ, Time, MQTT, ...
Data loader: Move massive amounts of data at wire speed with throttling capabilities.

[Diagram: data access. Labels shown: Structured, Unstructured, Lookup, Query, Analytics, Event storage, Data access, Data distillation (use connectors, programs, models to convert to structured data), Structured interfaces, Event access methods, Real time, Interactive, Batch, GemFire XD, HAWQ, Hive, HBase, Pig, SQL, HiveQL, HBase APIs]

GemFire XD: SQL queries, NoSQL and alerting APIs for real-time data. Data persisted on HDFS is immediately available for interactive queries.
HAWQ: SQL query for interactive data access. Connectivity with industry-standard BI tools.
Hive, HBase: HiveQL for batch data access. HBase for real-time lookup and simple data queries.

[Diagram: processing platform. Labels shown: Lookup, Query, Analytics, HDFS, Data storage, Native, Data distillation, Connectors from Hadoop, PXF connectors, Greenplum database, GemFire/SQL Fire, GemFire XD, HAWQ, Hive, Pig, HBase, Hadoop, Real time, Interactive, Batch]

GemFire XD: SQL queries, NoSQL and alerting APIs for real-time data. Data persisted on HDFS is immediately available for interactive queries.
HAWQ: SQL query for interactive data access. Connectivity with industry-standard BI tools.
Hive, HBase: HiveQL for batch data access. HBase for real-time lookup and simple data queries.

Unified data management: Pivotal Data Dispatch

All data stored on HDFS:
Pivotal: GemFire XD/HAWQ
Hadoop data: Hive/HBase
Raw ingested data

IT managed:
Data registered in PDD
Data source connected and automated
Target support for sandbox creation
Auditable data access policy definition

Data worker:
Self-serve ability to access data on demand on a target sandbox from various sources while conforming to the data access policies.

Action tier: Decision maker expectations

Informational: Ability to get information in a dashboard. Integration with business intelligence tools: Tableau, MicroStrategy, BusinessObjects, Pentaho.
Alerting: Ability to alert the decision maker. Integration with the alert systems: dashboard, alarms, emails, pagers, phones, etc.
Automation: Ability to integrate with business decisioning systems. Integration with the applications to take automated actions: MessageMQ, Rabbit, Spring, and other technologies.

About Capgemini

With more than 130,000 people in 44 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2012 global revenues of EUR 10.3 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience, and draws on Rightshore, its worldwide delivery model.

www.capgemini.com/bim

The information contained in this presentation is proprietary. Rightshore is a trademark belonging to Capgemini.