Big Data Technology Ecosystem
Mark Burnette, Director of Sales Engineering, Pentaho, Hitachi Vantara
Agenda
- End-to-End Data Delivery Platform
- Ecosystem of Data Technologies
- Mapping an End-to-End Solution
- Case Studies
- Pentaho Key Capabilities
- Summary
- Q&A
End-to-End Data Delivery Platform
Stages: Ingest, Process, Publish, Report
Capabilities: Data Agnostic, Metadata-Driven Ingestion, Data Orchestration, Native Hadoop Integration, Scale Up & Scale Out, Blend Unstructured Data, Streamlined Data Refinery, Data Virtualization, Machine Learning, Production Reporting, Custom Dashboards, Self-Service Dashboards, Interactive Analysis, Embedded Analytics
Delivering Insight
Stages: Ingest, Process, Publish, Report
Capabilities: Data Integration & Orchestration, Custom Dashboards, Self-Service Dashboards, Interactive Analysis, Production Reporting
Personas: Data Engineers, Data Scientists, Data Analysts, Consumers
Big Data Ecosystem
1. Relational Database
2. Analytical Databases
3. NoSQL Database
4. HDFS MapReduce
5. SQL on Hadoop
6. Distributed Search
7. Message Streaming
8. Event Stream Processing (ESP)
9. Complex Event Processing (CEP)
Data Source Attributes
Volume (Data Size): Small, Medium, Large
Variety (Data Type): Structured, Semi-Structured, Unstructured
Velocity (Processing): Batch, Micro-Batch, Real-Time Streaming
Latency (Reporting): Scheduled, Prompted, Interactive
Technologies mapped against these attributes: Relational Database, Analytical Databases, NoSQL Database, HDFS MapReduce, SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)
Core Competency: Relational Database (legend: Good Fit / Not Optimal / Not Recommended)
Examples: Microsoft SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2
Volume (Data Size): Operational databases for OLTP apps that require high transaction loads and user concurrency. Can scale up to larger data volumes but lacks the ability to easily scale out for large data processing.
Variety (Data Type): Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.
Velocity (Processing): Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.
Latency (Reporting): Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.
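The OLTP pattern described above can be sketched in a few lines. This is an illustrative example using Python's built-in `sqlite3`, not a production system; the `accounts` table and values are invented. The point is the small, atomic CRUD transaction that relational engines are optimized for.

```python
import sqlite3

# An in-memory database standing in for an OLTP system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

# Transfer funds atomically: either both updates commit or neither does.
# The connection context manager commits on success, rolls back on error.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 75.0, 2: 75.0}
```

Note that the workload is many such small transactions with full ACID guarantees, which is exactly the profile that makes scale-out and streaming ingestion hard for this class of engine.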
Core Competency: Analytical Database (Columnar, In-Memory, MPP, OLAP)
Examples: Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica
Volume (Data Size): Data warehouse/mart databases that support BI and advanced analytics workloads. The MPP architecture gives the ability to scale out to large data volumes, at a financial cost.
Variety (Data Type): Structured schema of tables containing rows and columns of data, offering improved speed and scalability over RDBMS but still limited to structured data.
Velocity (Processing): Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.
Latency (Reporting): All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance on analytic or interactive query workloads over large data.
Core Competency: NoSQL Database
Examples: MongoDB, HBase, Cassandra, MarkLogic, Couchbase
Volume (Data Size): Good for web applications, with less web app code to write, debug, and maintain. Scales out horizontally with auto-sharding of data to support millions of web app users. Compromises on consistency (ACID transactions) in favor of scale and uptime.
Variety (Data Type): Hierarchical, key-value, or document design captures all types of data in a single location. Schema-less design allows rapid or continuous ingest at scale.
Velocity (Processing): A good storage option for the high-throughput, low-latency requirements of streaming applications that need real-time views of data. Seen as a key component of the Lambda architecture.
Latency (Reporting): Low-level query languages, scarce skills, and the lack of SQL support make NoSQL less appealing for reporting and analysis.
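The schema-less document model above can be illustrated with a plain-Python stand-in for a document store (a MongoDB-style collection, without the real client API). All document shapes and field names here are invented: the point is that records of different shapes land in one collection with no schema migration, and queries reach into nested fields.

```python
import json

# Illustrative stand-in for a document-store collection.
collection = []

def insert(doc):
    # Store a deep copy, as a document database would serialize the record.
    collection.append(json.loads(json.dumps(doc)))

insert({"user": "ana", "clicks": 3})
insert({"user": "bo", "clicks": 7, "geo": {"city": "Austin"}})  # extra nested field
insert({"sensor": "t-17", "temp_c": 21.4})                      # entirely different shape

# Query by a nested path ("geo.city" in MongoDB dot-notation spirit).
austin = [d for d in collection if d.get("geo", {}).get("city") == "Austin"]
print([d["user"] for d in austin])  # ['bo']
```

The flexibility that makes ingest easy is the same property that makes SQL-style reporting hard: there is no fixed schema for a BI tool to bind to, which is the Latency weakness noted above.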
Core Competency: HDFS MapReduce
Examples: Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, Microsoft HDInsight
Volume (Data Size): The Hadoop Distributed File System distributes and replicates file blocks horizontally across multiple commodity data nodes. MapReduce programming takes the compute to the data for batch processing of large data volumes.
Variety (Data Type): The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.
Velocity (Processing): HDFS and MapReduce are designed for distributing batch processing workloads over large datasets, not for micro-batch or streaming use cases.
Latency (Reporting): MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.
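The MapReduce model is easiest to see in the canonical word-count example: map emits (word, 1) pairs, the shuffle groups pairs by key, and reduce sums each group. On a real cluster the map and reduce phases run in parallel across data nodes next to the HDFS blocks; this is a single-process sketch of the same dataflow.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit one (word, 1) pair per word in the input split.
    return [(w, 1) for w in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big pipelines", "big data"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(l) for l in lines)))
print(counts)  # {'big': 3, 'data': 2, 'pipelines': 1}
```

Because every phase is a full pass over the data with a barrier in between, the model batch-processes very large volumes well but cannot serve the low-latency or streaming cases called out above.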
Core Competency: SQL on Hadoop (batch-oriented, interactive, and in-memory)
Examples: Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL
Volume (Data Size): SQL queries run against a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, or Spark jobs and run over different storage formats such as HDFS and HBase.
Variety (Data Type): SQL was designed for structured data, while Hadoop files may contain nested, variable, or schema-less data. A SQL-on-Hadoop engine must translate all these forms of data into flat relational data and optimize the queries (Impala/Drill).
Velocity (Processing): SQL-on-Hadoop engines require smart, advanced workload managers for multi-user workloads; they are designed for query processing, not stream processing.
Latency (Reporting): Supports ad hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8s compared to 1.6 minutes or more.
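The key translation step noted above, flattening nested records into relational rows so that SQL can run over them, can be sketched without a cluster. This uses `sqlite3` as a stand-in for the SQL engine and invented JSON click events as the "Hadoop files"; real engines like Hive or Impala do this against HCatalog-registered tables instead.

```python
import json
import sqlite3

# Nested, schema-on-read records as they might sit in HDFS (illustrative).
events = [
    '{"user": "ana", "device": {"os": "ios"}, "ms": 120}',
    '{"user": "bo",  "device": {"os": "android"}, "ms": 340}',
    '{"user": "ana", "device": {"os": "ios"}, "ms": 90}',
]

# Flatten each nested record into a flat relational row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, os TEXT, ms INTEGER)")
for raw in events:
    e = json.loads(raw)
    conn.execute("INSERT INTO clicks VALUES (?, ?, ?)",
                 (e["user"], e["device"]["os"], e["ms"]))

# The kind of aggregate a BI tool would push down to the engine.
rows = conn.execute(
    "SELECT os, COUNT(*), AVG(ms) FROM clicks GROUP BY os ORDER BY os").fetchall()
print(rows)  # [('android', 1, 340.0), ('ios', 2, 105.0)]
```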
Core Competency: Distributed Search
Examples: Elasticsearch, Solr (both based on Apache Lucene), Amazon CloudSearch
Volume (Data Size): Search engines deal with large systems holding millions of documents and are designed for index and search query processing at scale, with clustering and a distributed architecture.
Variety (Data Type): Connectors ingest many formats and sources: XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), file systems, Git, JDBC, JMS, Kafka, LDAP, MongoDB, Neo4j, RabbitMQ, Redis, and Twitter.
Velocity (Processing): Elasticsearch scales to very large clusters with near-real-time search. Real-time web applications demand search results in near real time as users generate new content. There can be some contention when handling concurrent search and index requests.
Latency (Reporting): Both use a key-value pair query language. Solr is more oriented toward text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not for interactive analytical reporting.
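The data structure at the heart of Lucene-based engines is the inverted index: a map from each term to the set of documents containing it. The sketch below shows that core idea on three invented documents; real engines add tokenization/analysis, relevance scoring (TF-IDF/BM25), and shard the index across the cluster for the scale-out described above.

```python
from collections import defaultdict

# Illustrative corpus: doc id -> text.
docs = {
    1: "hadoop distributed file system",
    2: "distributed search with lucene",
    3: "streaming data with kafka",
}

# Build the inverted index: term -> set of doc ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """AND query: return ids of documents containing every term."""
    results = set(docs)
    for t in terms:
        results &= index.get(t, set())
    return sorted(results)

print(search("distributed"))            # [1, 2]
print(search("distributed", "search"))  # [2]
```

Lookups are set intersections over precomputed term postings, which is why interactive search is fast while ad hoc numeric aggregation (interactive analytical reporting) is not the natural workload.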
Core Competency: Message Streaming
Examples: Kafka, JMS, AMQP
Volume (Data Size): Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.
Variety (Data Type): Data sources include the internet of things, sensors, clickstream, and transactional systems.
Velocity (Processing): Real-time streaming with high throughput for both publishing and subscribing, and constant performance even with many terabytes of stored messages. Designed for streaming, with configurable batch sizes for brokering micro-batches of messages.
Latency (Reporting): Stream topics need to be processed by additional technology, such as PDI, ESP, CEP, or query processing engines, for reporting.
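The broker pattern can be sketched in plain Python. This is a toy stand-in for a Kafka-style broker, not the Kafka client API: producers append to a topic's ordered log, each consumer group tracks its own offset into that log, and consumers pull configurable micro-batches. Topic and group names are invented.

```python
from collections import defaultdict

class Broker:
    """Toy log-based message broker (Kafka-style, single partition per topic)."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only message log
        self.offsets = defaultdict(int)   # (topic, group) -> next offset to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, group, max_batch=2):
        # Pull up to max_batch messages from the group's current offset.
        log, start = self.topics[topic], self.offsets[(topic, group)]
        batch = log[start:start + max_batch]
        self.offsets[(topic, group)] = start + len(batch)
        return batch

broker = Broker()
for reading in [21.1, 21.3, 21.9, 22.4]:       # e.g. sensor readings
    broker.publish("sensors", reading)

print(broker.poll("sensors", "hadoop-ingest"))  # [21.1, 21.3]
print(broker.poll("sensors", "hadoop-ingest"))  # [21.9, 22.4]
print(broker.poll("sensors", "dashboard"))      # [21.1, 21.3]  independent offset
```

Because the log is retained and each group keeps its own offset, many downstream consumers (Hadoop ingestion, ESP/CEP engines, dashboards) can read the same stream in parallel, which is the ingestion pattern the slide describes.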
Core Competency: Event Stream Processing (ESP)
Examples: Apache Storm
Volume (Data Size): Apache Storm is a distributed event-at-a-time stream processing system for processing large volumes in parallel with sub-second latency.
Variety (Data Type): Storm applications process one incoming event at a time as tuples of data; a tuple may contain objects of any type, from sources such as the internet of things, sensors, and transactional systems.
Velocity (Processing): Storm is extremely fast, able to process over a million messages per second per node. It compromises on fault tolerance, offering at-least-once semantics in favor of speed.
Latency (Reporting): ESP provides the most recently processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or red down arrow in real time.
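The slide's ticker example maps naturally onto Storm's spout-and-bolt vocabulary. Below is a single-process sketch of event-at-a-time processing, not the real Storm API: the spout emits one tuple as it arrives, and the bolt immediately turns it into an up/down signal, so latency is per-event rather than per-batch. The ticker data is invented.

```python
def spout(ticks):
    """Source: emits one (symbol, price) tuple at a time, as a Storm spout does."""
    yield from ticks

def direction_bolt(stream):
    """Bolt: compares each tick to the previous price seen for that symbol."""
    last = {}
    for symbol, price in stream:
        if symbol in last:
            # Emit downstream the instant the event is processed.
            yield (symbol, "UP" if price > last[symbol] else "DOWN")
        last[symbol] = price

ticks = [("GOOG", 100.0), ("GOOG", 101.5), ("GOOG", 101.0)]
arrows = list(direction_bolt(spout(ticks)))
print(arrows)  # [('GOOG', 'UP'), ('GOOG', 'DOWN')]
```

In Storm proper, many spout and bolt instances run in parallel across the cluster, and the at-least-once guarantee means a bolt may occasionally see a replayed tuple, a trade-off the slide notes in favor of speed.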
Core Competency: Complex Event Processing (CEP)
Examples: Spark, Flink
Volume (Data Size): Spark and Flink are distributed micro-batch stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.
Variety (Data Type): Complex event processing for the internet of things, sensors, and transactional systems. An aggregation-oriented CEP solution focuses on executing online algorithms in response to event data entering the system; a detection-oriented CEP solution focuses on detecting combinations of events, called event patterns or situations.
Velocity (Processing): Micro-batch processing with a few seconds of latency is not as fast as Storm, but offers better fault tolerance, guaranteeing exactly-once semantics for stateful computations. Great for machine learning computations.
Latency (Reporting): CEP provides the most recently processed data for all types of reporting. Example CEP use case: a user sets up a stock market alert saying "let me know if GOOG stock went up by 10% and stayed up for 3 hours or more."
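The slide's detection-oriented example ("up 10% and stayed up") can be sketched as a simple stateful window check. This is an illustrative single-process version, with invented prices and a three-tick window standing in for three hours; Spark and Flink express the same pattern with distributed windowed stateful operators and checkpointed state for exactly-once results.

```python
def sustained_rise(prices, baseline, pct=0.10, window=3):
    """Return the tick index at which the price has been >= baseline*(1+pct)
    for `window` consecutive ticks, or None if the pattern never occurs."""
    threshold = baseline * (1 + pct)
    streak = 0
    for t, price in enumerate(prices):
        streak = streak + 1 if price >= threshold else 0
        if streak >= window:
            return t  # pattern detected: rise sustained for the full window
    return None

prices = [100, 112, 111, 113, 115]  # one tick per hour, say
print(sustained_rise(prices, baseline=100))  # 3  (ticks 1-3 all >= 110)
```

The defining CEP feature is that the alert fires on a *combination* of events over time, not on any single event, which is why the engine must keep state across micro-batches.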
Big Data Ecosystem
1. Relational Database
2. Analytical Databases
3. NoSQL Database
4. HDFS MapReduce
5. SQL on Hadoop
6. Distributed Search
7. Message Streaming
8. Event Stream Processing (ESP)
9. Complex Event Processing (CEP)
Mapping A Solution
Matrix for Analytics Performance (MAP)
Legend: Good Fit / Not Optimal / Not Recommended
Technologies: Relational Database, Analytical Database, NoSQL Database, Hadoop File System (HDFS MR), SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)
Attributes: Volume (Data Size: Small, Medium, Large), Variety (Data Type: Structured, Semi-Structured, Unstructured), Velocity (Processing: Batch, Micro-Batch, RT Streaming), Latency (Reporting: Scheduled, Prompted, Interactive)
Big Data Projects
Big data path: Big Data Sources → Pentaho Data Integration → Hadoop/Data Lake → Pentaho Data Integration → Analytic Datasets → Line of Business
Traditional path: Traditional Data → Pentaho Data Integration → Data Warehouse → Pentaho Data Integration → Data Marts
Delivery patterns: Centralized Analytics at Scale, Self-Service Analytics, On-Demand Datamart, Embedded Analytics, Extranet Deployments (PDI + Analytics)
A Single Flow
Phases: Data Engineering, Data Prep, Analytics
Pipeline: Ingestion, Processing, Blending, Data Delivery, Data Discovery/Analysis, Analysis & Dashboards
Administration: Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation
Key Takeaways
- Data architecture modernization involves many technologies
- Understanding the ecosystem of data technologies
- Mapping an end-to-end solution
- Pentaho key capabilities