Performance and Scalability Overview


This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Analytics platform.

PENTAHO PERFORMANCE ENGINEERING TEAM

Pentaho Scalability and High-Performance Architecture

Business analytics solutions are only valuable when they can be accessed and used by anyone, from anywhere, at any time. When selecting a business analytics platform, it is critical to assess its underlying architecture to ensure that it not only scales to the number of users and amount of data an organization has today, but also supports growing numbers of users and increased data sizes in the future. By tightly coupling high-performance business intelligence with data integration in a single platform, Pentaho Business Analytics provides a scalable solution that can address enterprise requirements in organizations of all sizes. This guide provides an overview of just some of the available performance tuning and scalability options.

The Pentaho Business Analytics Server is a web application for creating, accessing and sharing reports, analysis and dashboards. It can be deployed in different configurations, from a single server node to a cluster of nodes distributed across multiple servers. There are a number of ways to increase performance and scalability:

- Deployment on 64-bit operating systems
- Clustering multiple server nodes
- Optimizing the configuration of the Reporting and Analysis engines

[Figure: the Pentaho Business Analytics platform serving DBA/ETL/BI developers, business users and data analysts with enterprise and interactive reporting, interactive dashboards, predictive analytics, direct access, data integration and data quality, and visual MapReduce over operational data, big data, data streams, and public/private clouds.]

Deployment on 64-bit Operating Systems

The Pentaho Business Analytics Server supports 64-bit operating systems, allowing larger amounts of server memory and vertical scalability for higher user and data volumes on a single server.

Clustering the Business Analytics Server

The Pentaho Business Analytics Server can effectively scale out to a cluster, or further to a cloud environment. Clusters are excellent for permanently expanding resources commensurate with increasing load; cloud computing is particularly useful when scaling out is only needed for specific periods of increased activity.

[Figure: client requests, typically from a web browser, pass through a load balancer (for example Apache HTTPD, which requires sticky sessions) to a Pentaho BA Server cluster deployed in Tomcat or JBoss, backed by the Business Analytics Repository.]

Optimizing the Configuration of the Reporting and Analysis Engines

Pentaho Reporting

The Pentaho Reporting engine retrieves, formats and processes information from a data source to generate user-readable output. One way to increase the performance and scalability of Pentaho Reporting solutions is to take advantage of result set caching. When rendered, a parameterized report must account for every dataset required for every parameter, and every time a parameter field changes, every dataset is recalculated. This can negatively impact performance. Caching parameterized report result sets improves performance for larger datasets.

Pentaho Analysis

The Pentaho Analysis engine (Mondrian) creates an analysis schema and forms data sets from that schema using MDX queries. Maximizing performance and scalability always begins with the proper design and tuning of the source data. Once the database has been optimized, there are some additional areas within the Pentaho Analysis engine that can be tuned.

IN-MEMORY CACHING CAPABILITIES

Pentaho's in-memory caching capability enables ad hoc analysis of millions of rows of data in seconds. Pentaho's pluggable, in-memory architecture is integrated with popular open source caching platforms such as Infinispan and Memcached, which are used by many of the world's most popular social, ecommerce and multimedia websites.
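The idea behind result set caching can be sketched as a cache keyed by the report's parameter values, so re-rendering with unchanged parameters skips the database round trip entirely. A minimal illustration (not Pentaho's actual API; the query function and its parameters are hypothetical):

```python
import time
from functools import lru_cache

# Hypothetical stand-in for a slow parameterized report query.
def run_report_query(region, year):
    time.sleep(0.01)  # simulate a database round trip
    return [(region, year, 42)]

# Cache keyed by the parameter tuple: repeated renders with the
# same parameter values are served from memory, not the database.
@lru_cache(maxsize=128)
def cached_report_query(region, year):
    return tuple(run_report_query(region, year))

cached_report_query("EMEA", 2015)  # first call hits the "database"
cached_report_query("EMEA", 2015)  # second call is served from cache
print(cached_report_query.cache_info().hits)  # 1
```

Changing any parameter value produces a new cache key, which mirrors the behavior described above: only identical parameter combinations reuse a cached result set.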

[Figure: Mondrian's pluggable, in-memory caching architecture. A thin client for ad hoc data discovery sends MDX to the Mondrian server (MDX parser, query optimizer, SQL generation), which sits on an in-memory, pluggable cache (Infinispan, Memcached) and issues SQL over JDBC to a relational, MPP or columnar database containing the sales fact table and a sales aggregate table over the Product, Time and Customer dimensions.]

"We have operational metrics for six different businesses running in each of our senior care facilities that need to be retrieved and accessed every day, in a matter of seconds, by our corporate management, the individual facilities' managers, and the line-of-business managers. Now, with the high-performance in-memory analysis capabilities in the latest release of Pentaho Business Analytics, we can be more aggressive in rollouts, adding more metrics to dashboards, giving dashboards and data analysis capabilities to more users, and seeing greater usage rates and more adoption of business analytics solutions."
BRANDON JACKSON, DIR. OF ANALYTICS AND FINANCE, STONEGATE SENIOR LIVING LLC

In addition, Pentaho allows in-memory aggregation of data, where granular data can be rolled up to higher-level summaries entirely in memory, reducing the need to send new queries to the database. This results in even faster performance for more complex analytic queries.

AGGREGATE TABLE SUPPORT

When working with large data sets, properly creating and using aggregate tables greatly improves performance. An aggregate table coexists with the base fact table and contains pre-aggregated measures built from the fact table. Once registered in the schema, Pentaho can choose to use an aggregate table rather than the fact table, resulting in faster query performance.
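The aggregate table idea can be sketched in a few lines: roll the fact rows up to a coarser grain once, then answer summary queries from the much smaller table. (The schema and column names below are invented for illustration; in Pentaho the aggregate is a real database table registered in the Mondrian schema.)

```python
from collections import defaultdict

# Fact table at the finest grain: (product, month, customer, quantity).
fact = [
    ("widget", "2015-01", "c1", 5),
    ("widget", "2015-01", "c2", 3),
    ("gadget", "2015-02", "c1", 7),
]

# Build the aggregate table once: roll up to (product, month),
# dropping the customer dimension and pre-summing the measure.
agg = defaultdict(int)
for product, month, customer, qty in fact:
    agg[(product, month)] += qty

# A query at the (product, month) grain now reads the small
# aggregate instead of scanning the full fact table.
print(agg[("widget", "2015-01")])  # 8
```

The trade-off is the classic one: the aggregate must be maintained as the fact table grows, in exchange for queries that touch far fewer rows.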

PARTITIONING SUPPORT FOR HIGH-CARDINALITY DIMENSIONS

Large enterprise data warehouse deployments often contain attributes with tens or hundreds of thousands of unique members. For these use cases, the Pentaho Analysis engine can be configured to properly address a partitioned, high-cardinality dimension. This streamlines SQL generation for partitioned tables so that only the relevant partitions are queried, which can greatly increase query performance.

Pentaho Data Integration

Pentaho Data Integration (PDI) is an extract, transform and load (ETL) solution that uses an innovative metadata-driven approach. It includes an easy-to-use, graphical design environment for building ETL jobs and transformations, resulting in faster development, lower maintenance costs, interactive debugging and simplified deployment. PDI's multi-threaded, scale-out architecture provides performance tuning and scalability options for handling even the most demanding ETL workloads.

MULTI-THREADED ARCHITECTURE

PDI's streaming engine architecture provides the ability to work with extremely large data volumes, and provides enterprise-class performance and scalability with a broad range of deployment options, including dedicated, clustered and/or cloud-based ETL servers. The architecture allows both vertical and horizontal scaling. The engine executes tasks in parallel across multiple CPUs on a single machine, as well as across multiple servers via clustering and partitioning.

[Figure: a data integration flow with multiple threads for a single step (Row Denormalizer). Two parallel lanes each run Import, Sort, Row Denormalizer and Group steps.]

TRANSFORMATION PROCESSING ENGINE

PDI's transformation processing engine starts and executes all steps within a transformation in parallel (multi-threaded), allowing maximum usage of available CPU resources. Done by default, this allows processing of an unlimited number of rows and columns in a streaming fashion. Furthermore, the engine is 100% metadata-driven (no code generation), resulting in reduced deployment complexity. PDI also provides different processing engines that can be used to influence thread priority or limit execution to a single thread, which is useful for parallel performance tuning of large transformations. Additional tuning options include the ability to configure streaming buffer sizes, reduce internal data type conversions (lazy conversion), leverage high-performance non-blocking I/O (NIO) to read large blocks at a time and read files in parallel, and run multiple copies of a step, allowing optimization of Java Virtual Machine multi-thread usage.
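The streaming, multi-threaded step model described above can be sketched as threads connected by bounded row buffers: each step runs concurrently, and rows flow downstream as soon as they are produced, so no step waits for the whole data set. (The step functions are invented for illustration; PDI steps are configured graphically, not coded like this.)

```python
import threading
import queue

def import_step(out_q):
    # Produce rows one at a time, streaming them downstream.
    for i in range(100):
        out_q.put(i)
    out_q.put(None)  # end-of-stream marker

def transform_step(in_q, out_q):
    # Consume and transform rows as they arrive.
    while (row := in_q.get()) is not None:
        out_q.put(row * 2)
    out_q.put(None)

# Bounded queues between steps play the role of PDI's row buffers:
# a small maxsize throttles fast producers (backpressure).
q1, q2 = queue.Queue(maxsize=10), queue.Queue(maxsize=10)
threads = [
    threading.Thread(target=import_step, args=(q1,)),
    threading.Thread(target=transform_step, args=(q1, q2)),
]
for t in threads:
    t.start()

total = 0
while (row := q2.get()) is not None:
    total += row
for t in threads:
    t.join()
print(total)  # 9900
```

The bounded buffer size corresponds to the configurable streaming buffer sizes mentioned above, and running extra copies of a step would amount to adding more consumer threads on the same queue.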

CLUSTERING AND PARTITIONING

[Figure: clustering in Pentaho Data Integration. Source data from flat files, applications and databases flows to a master node, which distributes the workload across parallel slave workers that load the target database.]

Pentaho Data Integration provides advanced clustering and partitioning capabilities that allow organizations to scale out their data integration deployments. PDI clusters are built to increase the performance and throughput of data transformations; in particular, they perform classic divide-and-conquer processing of data sets in parallel. PDI clusters have a master/slave topology: there is one master in the cluster, but there can be many slaves. This scheme can be used to distribute the ETL workload in parallel across multiple systems. Transformations are broken up according to this topology and deployed to all servers in the cluster, where each server runs a PDI engine that listens for, receives, executes and monitors transformations. It is also possible to define dynamic clusters, where the slave servers are only known at run time; this is very useful in cloud computing scenarios where hosts are added or removed at will. More information on this topic, including load statistics, can be found in an independent consulting white paper by Nick Goodman of Bayon Technologies, "Scaling Out Large Data Volume Processing in the Cloud or on Premise."
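The divide-and-conquer pattern behind PDI clustering can be sketched by hash-partitioning rows across workers and processing the partitions in parallel, with the master re-combining the partial results. (The worker logic is invented for illustration; in PDI the master and slaves are separate server processes.)

```python
from concurrent.futures import ThreadPoolExecutor

rows = list(range(1000))
NUM_WORKERS = 4

# Master: hash-partition the rows so each worker gets a disjoint slice.
partitions = [[] for _ in range(NUM_WORKERS)]
for row in rows:
    partitions[row % NUM_WORKERS].append(row)

# Slaves: each worker processes its own partition independently.
def worker(partition):
    return sum(r * r for r in partition)

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    partial_results = list(pool.map(worker, partitions))

# Master: combine the partial results, as when clustered
# streams are merged back together.
print(sum(partial_results))  # equals sum(r * r for r in rows)
```

Because the partitions are disjoint, the workers never coordinate mid-flight; the only synchronization points are the initial split and the final merge, which is what makes the pattern scale out well.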

EXECUTING IN HADOOP (PENTAHO MAPREDUCE)

[Figure: executing Pentaho Data Integration inside a Hadoop cluster. A mapper transformation reads Map/Reduce input, parses web logs and combines year and month into the output key; a reducer transformation groups on the key field and writes Map/Reduce output. The PDI engine is shipped as a JAR to the Hadoop cluster.]

Pentaho's Java-based data integration engine integrates with the Hadoop distributed cache for automatic deployment as a MapReduce task across every data node in a Hadoop cluster, leveraging Hadoop's massively parallel processing and high availability.

NATIVE SUPPORT FOR BIG DATA SOURCES, INCLUDING HADOOP, NOSQL AND HIGH-PERFORMANCE ANALYTICAL DATABASES

Pentaho supports native access, bulk loading and querying of a large number of databases, including:

NoSQL data sources such as:
- MongoDB
- Cassandra
- HBase
- HPCC Systems
- ElasticSearch

Analytic databases such as:
- HP Vertica
- EMC Greenplum
- HP NonStop SQL/MX
- IBM Netezza
- Infobright
- Actian Vectorwise
- LucidDB
- MonetDB
- Teradata

Transactional databases such as:
- MySQL
- Postgres
- Oracle
- DB2
- SQL Server
- Teradata
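The mapper/reducer flow in the web-log example can be sketched in plain Python: the map phase parses each log line and emits a year-month key, the shuffle groups intermediate pairs by key, and the reduce phase aggregates each group. (The log format and field positions are invented; in Pentaho MapReduce, PDI transformations play the mapper and reducer roles inside Hadoop.)

```python
from collections import defaultdict

log_lines = [
    "2015-01-15 GET /index.html",
    "2015-01-20 GET /about.html",
    "2015-02-01 GET /index.html",
]

# Map: parse each log line and combine year and month into the key.
mapped = []
for line in log_lines:
    date, _method, _path = line.split()
    mapped.append((date[:7], 1))  # key = "YYYY-MM"

# Shuffle: group the intermediate pairs on the key field.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group into a per-month request count.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'2015-01': 2, '2015-02': 1}
```

In the real cluster, the map and reduce phases run in parallel across data nodes and the shuffle is performed by Hadoop itself; the per-record logic is all that the transformations need to define.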

Customer Examples and Use Cases

INDUSTRY | USE CASE | DATA VOLUME AND TYPE | USERS (TOTAL / CONCURRENT)

- Retail: store operations dashboard; 5+ TB in HP Neoview; 1,200 total users, 200 concurrent.
- Telecom (B2C): customer value; 2+ TB in Greenplum; <500 total users, <25 concurrent.
- Social networking: website activity; 1 TB in Vectorwise plus 10+ TB in a 20-node Hadoop cluster.
- Social networking: website activity; loading 200,000 rows per second; 20 billion chat logs per month; 240 million user profiles.
- System integration (global SI): business performance metrics dashboard; 500 GB to 1 TB in an 8-node Greenplum cluster; >100,000 total users, 3,000 concurrent.
- High-tech manufacturing: customer service management; 200 GB in Oracle plus Cloudera Hadoop; loading 10 million records per hour.
- High-tech manufacturing: customer service management; 650,000 XML documents per week (2 to 4 MB each); 100+ million-device dimension.
- Stream (global provider of sales, customer service and technical support for the Fortune 1000): 10 operational dashboards; data from 28 switches around the world; 12 source systems (e.g. Oracle HRMS, SAP, Salesforce.com); 20 million records per hour; 49 locations across 22 countries; 200+ total users today, 120-200 concurrent, with 50-100 more to be added.
- Sheetz: 2+ TB in Teradata; 80 total users, 30 concurrent.

Learn more about Pentaho Business Analytics: pentaho.com/contact, +1 (866) 660-7555

Global Headquarters: Citadel International - Suite 340, 5950 Hazeltine National Drive, Orlando, FL 32822, USA. tel +1 407 812 6736, fax +1 407 517 4575
US & Worldwide Sales Office: 353 Sacramento Street, Suite 1500, San Francisco, CA 94111, USA. tel +1 415 525 5540, toll free +1 866 660 7555
United Kingdom, Rest of Europe, Middle East, Africa: London, United Kingdom. tel +44 (0) 20 3574 4790, toll free (UK) 0 800 680 0693
France: Offices - Paris, France. tel +33 97 51 82 296, toll free (France) 0800 915343
Germany, Austria, Switzerland: Offices - Munich, Germany. tel +49 (0) 322 2109 4279, toll free (Germany) 0800 186 0332
Belgium, Netherlands, Luxembourg: Offices - Antwerp, Belgium. tel (Netherlands) +31 8 58 880 585, toll free (Belgium) 0800 773 83
Italy, Spain, Portugal: Offices - Valencia, Spain. toll free (Italy) 800 798 217, toll free (Portugal) 800 180 060

Copyright 2015 Pentaho Corporation. Redistribution permitted. All trademarks are the property of their respective owners. For the latest information, please visit our web site at pentaho.com.