The Technology of the Business Data Lake. Appendix


Pivotal data products

Greenplum Database: A massively parallel platform for large-scale data analytics that manages and analyzes petabytes of data. It is also available with Hadoop HDFS storage tier integration (HAWQ, an add-on for PHD). HAWQ brings mature MPP technology for SQL on Hadoop. MADlib, an in-database parallel implementation of common analytics functions, will also work with HAWQ soon.
GemFire: A real-time distributed data store with linear scalability and continuous uptime capabilities, now available with a storage tier integrated on Hadoop HDFS (GemFire XD).
Pivotal HD: Commercially supported Apache Hadoop. HAWQ brings mature enterprise-class SQL capabilities to Hadoop, and GemFire XD brings real-time data access to Hadoop.
Spring XD: Simplifies the process of creating real-world big data solutions. It simplifies high-throughput data ingestion and export, and provides the ability to create cross-platform workflows.
Pivotal Data Dispatch: On-demand big data access across and beyond the enterprise. PDD provides data workers with security-controlled, self-service access to data. IT manages data modeling, access, compliance, and data lifecycle policies for all data provided through PDD.
Pivotal Analytics: Provides the business community with visualizations and insights from big data. It can join data from different sources to quickly create visualizations and dashboards. Pivotal Analytics can infer schemas from data sources and automatically create insights as it ingests data, freeing business analysts to focus on analyzing data and generating insights rather than manipulating data.

Appendix 2: Terminology

Terminology

Synchronous path: Processing that happens while the user is waiting for the results of an action (usually a click). The results are usually returned from information stored in the real-time systems.
Asynchronous path: Processing that happens in the background, with no user waiting for the results of the analysis. The results influence synchronous processing by refreshing the information that synchronous path processing relies on.
Streaming: Processing (collection, scoring, aggregation, deposition) of a single event as it happens. Streaming is usually associated with synchronous path processing.
Micro batch: Processing of a group of events that arrive frequently in a compact package, usually every few seconds or minutes.
Batch: Processing of a large group of events that arrive in a package, usually hourly, daily or monthly.
Mega batch: Infrequent processing of all or most of the data (a very large amount). Although repeatable, it is usually done once a quarter or even less frequently.
Frequency: Rate at which events are generated, also known as the event rate.
Latency: Time delay between the generation of an event (resulting from a business activity) and its receipt.
SLA: Service level agreement with the data consumer on the latency, quality and completeness of the data, along with uptime guarantees.
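The batch-related terms above differ mainly in how often a package of events arrives, and latency is a simple time difference. The Python sketch below illustrates both ideas. The numeric cut-offs and the names `ProcessingMode`, `classify_by_interval` and `latency` are assumptions made for illustration; the definitions themselves give only rough ranges.

```python
from datetime import datetime, timedelta
from enum import Enum


class ProcessingMode(Enum):
    STREAMING = "streaming"
    MICRO_BATCH = "micro batch"
    BATCH = "batch"
    MEGA_BATCH = "mega batch"


# Assumed cut-offs; the terminology only gives rough ranges.
MICRO_BATCH_MAX = timedelta(minutes=30)  # "every few seconds or minutes"
BATCH_MAX = timedelta(days=31)           # "hourly, daily or monthly"


def classify_by_interval(interval: timedelta) -> ProcessingMode:
    """Map the time between deliveries to a processing mode."""
    if interval <= timedelta(0):
        # No accumulation interval: each event is handled as it happens.
        return ProcessingMode.STREAMING
    if interval <= MICRO_BATCH_MAX:
        return ProcessingMode.MICRO_BATCH
    if interval <= BATCH_MAX:
        return ProcessingMode.BATCH
    # Anything rarer than monthly is treated as a mega batch here.
    return ProcessingMode.MEGA_BATCH


def latency(generated_at: datetime, received_at: datetime) -> timedelta:
    """Delay between event generation and receipt."""
    return received_at - generated_at
```

Under these assumed cut-offs, a package that arrives every 30 seconds is a micro batch and a quarterly reload is a mega batch. A real deployment would take the thresholds from the agreed SLA rather than from constants.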

Terminology (cont'd)

Real-time response time: Very low latency between the occurrence of an event and the generation of an insight, usually within seconds of the event.
Interactive response time: Time a user has to wait for the results. If the wait is within minutes, it is considered interactive; if the user needs to take a coffee break, it is batch.
Near real-time response time: Slightly higher latency than real time, usually within a few minutes of the event.
Analytics: Algorithms that run on the data. They range widely, from simple pre-computed aggregation to complex algorithms looking for patterns in data.
Insights: Results from the analytical algorithms, made available to applications or business users.
Actions: The activities that a business or an application performs in response to the information from the insights.
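The response-time terms above can be sketched as two small classifiers, one for the delay between an event and its insight and one for how long a user waits. The thresholds and function names below are assumptions for illustration only, since the definitions say just "within seconds" and "within minutes".

```python
from datetime import timedelta

# Assumed thresholds; the source gives only "seconds" and "a few minutes".
REAL_TIME_MAX = timedelta(seconds=10)
NEAR_REAL_TIME_MAX = timedelta(minutes=5)
INTERACTIVE_MAX = timedelta(minutes=5)


def insight_latency_class(event_to_insight: timedelta) -> str:
    """Classify the delay between an event and the insight derived from it."""
    if event_to_insight <= REAL_TIME_MAX:
        return "real time"
    if event_to_insight <= NEAR_REAL_TIME_MAX:
        return "near real time"
    return "not real time"


def user_wait_class(wait: timedelta) -> str:
    """Classify how long a user waits for results."""
    # Within minutes counts as interactive; a coffee-break wait is batch.
    return "interactive" if wait <= INTERACTIVE_MAX else "batch"
```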

Appendix 3: Components of Business Data Lake

How is Business Data Lake different?

Common data model. Business Data Lake: base class = standard data, derived classes = local data. EDW: single class = single view across the enterprise.
Data quality. Business Data Lake: full spectrum. EDW: [graphic of 0s and 1s; no text].
Data integration. Business Data Lake: multiple interfaces (SQL, SAS, R, NoSQL). EDW: SQL access; integration with SAS, R and other analytical interfaces.
Mixed workload with varying QoS. Business Data Lake: supports low latency, interactive and batch. EDW: limited QoS separation required.
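The "base class / derived classes" comparison above can be read literally as an inheritance relationship. The Python sketch below shows one possible reading. The `Customer` entity, its fields and the two local variants are hypothetical and do not come from the source.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    """Base class: the standard data every business unit agrees on."""
    customer_id: str
    name: str


@dataclass
class RetailCustomer(Customer):
    """Derived class: local data that only a (hypothetical) retail unit needs."""
    loyalty_tier: str = "none"


@dataclass
class WholesaleCustomer(Customer):
    """Derived class: local data that only a (hypothetical) wholesale unit needs."""
    credit_limit: float = 0.0


def standard_view(record: Customer) -> dict:
    """Project any local record onto the shared, enterprise-wide fields."""
    return {"customer_id": record.customer_id, "name": record.name}
```

In this reading, each unit keeps its own local fields while any consumer that needs the enterprise view works against `Customer` alone. The single-class EDW model would instead force every field into one shared schema.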

Generic Business Data Lake architecture

[Figure: architecture diagram. Labels recoverable from the diagram:
Sources: real time, micro batch, mega batch.
Ingestion tier: real-time ingestion, micro batch ingestion, batch ingestion.
Distillation tier.
Processing tier: in-memory, MPP database.
HDFS storage: unstructured and structured data.
Insights tier: SQL, NoSQL, query interfaces; real-time insights, interactive insights, batch insights.
Action tier.
Unified operations tier: system monitoring, system management, workflow management.
Unified data management tier: data mgmt. services, MDM, RDM, audit and policy mgmt.]

Components of Business Data Lake

Storage: Ability to store all data (structured and unstructured) cost-efficiently in the Business Data Lake.
Ingestion: Ability to bring in data from multiple data sources across all timelines with varying QoS.
Distillation: Ability to take the data stored in the storage tier and convert it to structured data for easier analysis by downstream applications.
Processing: Ability to run analytical algorithms and user queries with varying QoS (real time, interactive, batch) to generate structured data for easier analysis by downstream applications.
Insights: Ability to analyze all the data with varying QoS (real time, interactive and batch) to generate insights for business decisioning.
Action: Ability to integrate the insights with the business decisioning systems.
Unified data management: Ability to manage the data lifecycle, access policy definition, and master data management and reference data management services.
Unified operations: Ability to monitor, configure and manage the whole Data Lake from a single operations environment.

Pivotal components for the tiers

Storage: Pivotal HD.
Ingestion: GemFire XD, HAWQ, Pivotal HD and Spring XD.
Distillation: Pivotal Data Dispatch.
Processing: Pivotal HD, HAWQ and GemFire XD queries, optionally managed via Spring XD workflows.
Insights: Pivotal HD, HAWQ and GemFire XD queries from user applications.
Action: Big data applications, also known as business decisioning systems.
Unified data management: Pivotal Data Dispatch, master data management and reference data management services.
Unified operations: Pivotal Command Center (component of Pivotal HD to manage HAWQ and GemFire XD*), Spring XD monitoring and Pivotal Data Dispatch monitoring.
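The tier-to-product mapping above lends itself to a simple lookup table. The sketch below encodes it as a Python dictionary, assuming the descriptions pair with the tiers in the order listed. The names `PIVOTAL_COMPONENTS` and `tiers_using` are illustrative, and non-product entries (user applications, MDM and RDM services) are left out.

```python
# Tier -> Pivotal products, read from the table above in source order.
# Assumption: the eight descriptions line up one-to-one with the eight tiers.
PIVOTAL_COMPONENTS = {
    "Storage": ["Pivotal HD"],
    "Ingestion": ["GemFire XD", "HAWQ", "Pivotal HD", "Spring XD"],
    "Distillation": ["Pivotal Data Dispatch"],
    "Processing": ["Pivotal HD", "HAWQ", "GemFire XD", "Spring XD"],
    "Insights": ["Pivotal HD", "HAWQ", "GemFire XD"],
    "Action": [],  # served by big data applications, not a Pivotal product
    "Unified data management": ["Pivotal Data Dispatch"],
    "Unified operations": [
        "Pivotal Command Center",
        "Spring XD",
        "Pivotal Data Dispatch",
    ],
}


def tiers_using(product: str) -> list:
    """Return the tiers, in table order, that list the given product."""
    return [tier for tier, products in PIVOTAL_COMPONENTS.items()
            if product in products]
```

A lookup such as `tiers_using("Spring XD")` then shows at a glance which tiers depend on a given product.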

Data Lake interfaces

Legend categories: Pivotal, Apache, Partner, Competition.
Diagram labels: Ingestion; Ingestion + Analytics; Analytics; Data access.
Monitoring and data management: Pivotal Command Center, Pivotal Data Dispatch.
Configuration and install: Pivotal Command Center.

(In the three tables below, the extraction preserves how many "Yes" marks each row has but not their columns. Rows with a "Yes" in every column are unambiguous.)

Ingestion (columns: Streaming, Micro batch, Batch, Mega batch)
Data Loader: Yes, Yes, Yes
GemFire XD: Yes
PDD:
Spring XD: Yes, Yes, Yes, Yes
Sqoop: Yes, Yes
Distcp: Yes, Yes
Flume: Yes, Yes, Yes
HDFS put: Yes, Yes
Talend: Yes, Yes
Informatica: Yes, Yes

Interface (columns: Real time, Interactive, Batch)
GemFire XD (SQL): Yes, Yes
HAWQ (SQL): Yes, Yes, Yes
Hive (HiveQL): Yes
HBase (NoSQL): Yes, Yes, Yes
Pig: Yes
Impala (SQL): Yes, Yes

BI tools (columns: GemFire XD, HAWQ, Hive)
MicroStrategy: Yes, Yes
BusinessObjects: Yes, Yes
Spotfire: Yes, Yes
Tableau: Yes, Yes
Microsoft Excel: Yes, Yes
Datameer: Yes, Yes
Karmasphere: Yes, Yes

[Figure: Data ingestion. Diagram labels: event collection and event processing; files and events; low throughput and high throughput; streaming, micro batch and mega batch; real time and batch. The cells name GemFire XD, Spring XD and Data loader, with one cell marked N/A; the cell layout is not recoverable.]

GemFire XD: SQL insert of data into GemFire XD, and an API to send data to GemFire XD.
Spring XD: Out-of-the-box support for HTTP, Tail, Mail, Twitter, GemFire, TCP, JMS, RabbitMQ, Time, MQTT, ...
Data loader: Moves massive amounts of data at wire speed, with throttling capabilities.

[Figure: Event storage, data distillation and data access. Diagram labels: structured and unstructured data; lookup, query and analytics; data distillation (use connectors, programs, models to convert to structured data); Pig, GemFire XD, HAWQ, Hive, HBase; SQL, HiveQL, HBase APIs, Pig; structured interfaces; real time, interactive, batch; event access methods.]

GemFire XD: SQL queries, NoSQL and alerting APIs for real-time data. Data persisted on HDFS is immediately available for interactive queries.
HAWQ: SQL queries for interactive data access. Connectivity with industry-standard BI tools.
Hive, HBase: HiveQL and for batch data access. HBase for real-time lookups and simple data queries.

[Figure: Data storage and processing platform. Diagram labels: HDFS data storage; lookup, query and analytics; native; data distillation; connectors from Hadoop; Greenplum database, GemFire/SQLFire; GemFire XD, HAWQ, Hive, Pig, HBase, Hadoop; PXF connectors; real time, interactive, batch.]

GemFire XD: SQL queries, NoSQL and alerting APIs for real-time data. Data persisted on HDFS is immediately available for interactive queries.
HAWQ: SQL queries for interactive data access. Connectivity with industry-standard BI tools.
Hive, HBase: HiveQL and for batch data access. HBase for real-time lookups and simple data queries.

Unified data management: Pivotal Data Dispatch

All data stored on HDFS: Pivotal data (GemFire XD/HAWQ); Hadoop data (Hive/HBase); raw ingested data.
IT managed: data registered in PDD; data source connected and automated; target support for sandbox creation; auditable data access policy definition.
Data work: Self-serve ability to access data on demand on a target sandbox from various sources while conforming to the data access policies.
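The division of labour described above (IT registers data and defines auditable access policies, while data workers request data into a sandbox) can be sketched in a few lines of Python. This is not the Pivotal Data Dispatch API; every class, method and field name below is hypothetical and exists only to illustrate the flow.

```python
from dataclasses import dataclass, field


@dataclass
class RegisteredDataset:
    """A dataset that IT has registered, with the roles allowed to use it."""
    name: str
    allowed_roles: set


@dataclass
class Catalog:
    """Hypothetical registry that enforces policy and keeps an audit trail."""
    datasets: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)

    def register(self, dataset: RegisteredDataset) -> None:
        # IT-managed step: make the dataset available under a policy.
        self.datasets[dataset.name] = dataset

    def request(self, user: str, role: str, dataset_name: str,
                sandbox: str) -> bool:
        # Self-service step: grant only if the policy allows this role.
        dataset = self.datasets.get(dataset_name)
        granted = dataset is not None and role in dataset.allowed_roles
        # Every request is logged, granted or not, so access is auditable.
        self.audit_log.append((user, dataset_name, sandbox, granted))
        return granted
```

A data worker whose role is on the dataset's list gets access to the requested sandbox; any other request is refused, and both outcomes appear in `audit_log`.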

Action tier: Decision maker expectations

Informational: Ability to get information in a dashboard. Integration with business intelligence tools: Tableau, MicroStrategy, BusinessObjects, Pentaho.
Alerting: Ability to alert the decision maker. Integration with the alert systems: dashboard, alarms, emails, pagers, phones, etc.
Automation: Ability to integrate with business decisioning systems. Integration with the applications to take automated actions: MessageMQ, Rabbit, Spring and other technologies.
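The three expectations above amount to routing one insight to different kinds of consumers. The Python sketch below shows that routing with stand-in handlers. The handler functions, the `HANDLERS` table and the insight dictionary shape are all assumptions; in practice each handler would wrap a BI tool, an alerting system or a message queue.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], str]


def to_dashboard(insight: dict) -> str:
    """Informational: stand-in for publishing to a BI dashboard."""
    return f"dashboard: {insight['summary']}"


def to_alert(insight: dict) -> str:
    """Alerting: stand-in for an email, pager or phone notification."""
    return f"alert: {insight['summary']}"


def to_automation(insight: dict) -> str:
    """Automation: stand-in for a message sent to a decisioning system."""
    return f"automated action: {insight['summary']}"


HANDLERS: Dict[str, Handler] = {
    "informational": to_dashboard,
    "alerting": to_alert,
    "automation": to_automation,
}


def act_on(insight: dict, expectations: List[str]) -> List[str]:
    """Send one insight to every requested kind of consumer, in order."""
    return [HANDLERS[kind](insight) for kind in expectations]
```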

About Capgemini

With more than 130,000 people in 44 countries, Capgemini is one of the world's foremost providers of consulting, technology and outsourcing services. The Group reported 2012 global revenues of EUR 10.3 billion. Together with its clients, Capgemini creates and delivers business and technology solutions that fit their needs and drive the results they want. A deeply multicultural organization, Capgemini has developed its own way of working, the Collaborative Business Experience, and draws on Rightshore, its worldwide delivery model.

www.capgemini.com/bim

The information contained in this presentation is proprietary. Rightshore is a trademark belonging to Capgemini.