Best Practices - Pentaho Data Modeling
Contents

Overview
Best Practices for Data Modeling and Data Storage
Best Practices - Data Modeling
  Dimensional Models
  Database Optimization
  Database Indexing
  Schema Tables
  Default Values
Best Practices - Data Storage
  Data Storage
  Reporting Data
  Out-of-the-Box Configuration
Overview

This document provides best practices for designing and building your Pentaho solution for maximum speed, reuse, portability, maintainability, and knowledge transfer. It is not intended to demonstrate how to implement each best practice or to provide templates based on the best practices defined within the document.

Software    Version
Pentaho     5.4, 6.x, 7.x

Best Practices for Data Modeling and Data Storage

The document is arranged in a series of topic groups, with the individual best practices for each topic explained:

- Data Modeling
- Data Storage

Best Practices - Data Modeling

This section provides best practices and information on data models and on operating and improving databases, tables, and values.

- Dimensional Models
- Database Optimization
- Database Indexing
- Schema Tables
- Default Values

Dimensional Models

A dimensional model design should be used whenever possible. Dimensional models are optimized for online queries and data warehousing. A star schema is a common example of a dimensional model. Dimensional models allow Mondrian and Pentaho to perform best at high volumes.

Pentaho Data Modeling Best Practices Pentaho 1
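To make the star schema idea concrete, the following is a minimal sketch rather than a Pentaho template. It assumes an in-memory SQLite database purely for illustration, and the table and column names (dim_product, dim_date, fact_sales) are hypothetical. It shows one fact table joined to a dimension table through a single-column integer key, which is the shape of query an analysis engine typically issues against a dimensional model.

```python
import sqlite3


def build_star_schema(conn):
    """Create a tiny star schema: one fact table and two dimension tables.

    All table and column names here are hypothetical examples.
    """
    conn.executescript("""
        CREATE TABLE dim_product (
            product_key    INTEGER PRIMARY KEY,
            product_family TEXT NOT NULL,
            product_name   TEXT NOT NULL
        );
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            year     INTEGER NOT NULL,
            month    INTEGER NOT NULL
        );
        CREATE TABLE fact_sales (
            product_key INTEGER NOT NULL REFERENCES dim_product (product_key),
            date_key    INTEGER NOT NULL REFERENCES dim_date (date_key),
            amount      REAL NOT NULL
        );
    """)


def load_sample_rows(conn):
    """Insert a handful of made-up rows so the query below has data."""
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Drinks", "Cola"), (2, "Drinks", "Juice"), (3, "Snacks", "Chips")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20240101, 2024, 1), (20240201, 2024, 2)],
    )
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(1, 20240101, 10.0), (2, 20240101, 5.0),
         (3, 20240201, 7.5), (1, 20240201, 2.5)],
    )


def sales_by_family(conn):
    """Aggregate the fact table by one level of the product dimension."""
    return conn.execute("""
        SELECT p.product_family, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.product_family
        ORDER BY p.product_family
    """).fetchall()


star_conn = sqlite3.connect(":memory:")
build_star_schema(star_conn)
load_sample_rows(star_conn)
print(sales_by_family(star_conn))
```

In a real deployment the same layout would live in the analytic database platform chosen for reporting data, and the dimension attributes would map to hierarchy levels in the schema.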
Database Optimization

Make sure the database server and instance are optimized for analytic workloads. Databases have specific parameters for analysis that do not apply to transaction workloads. Typically, this means providing the maximum amount of RAM and CPU to the database server and adjusting the Database Management System (DBMS) kernel parameters to use that additional capacity efficiently. Mondrian can only perform as fast as the database can return data.

Database Indexing

Make sure the database has standard indexing applied. Create indexes on all primary keys of a dimension and on all foreign keys in a fact table. Create indexes for each level of each hierarchy in all dimensions of all cubes of all schemas. A common approach to indexing can be found in Recommendations for database tuning. Indexes on keys are especially important on high-cardinality dimensions and levels. Primary and foreign keys should be single-column INTEGER or BIGINT values. Keys should not be strings, GUIDs, or a combination of several fields. This allows the Mondrian-generated queries to perform at an optimal rate.

Schema Tables

Where possible, avoid using database views or Structured Query Language (SQL) queries as tables in a schema. Define all tables in the schema as database tables. Normally, when an SQL query or view is desired, this is an indication that more extracting, transforming, and loading (ETL) needs to be done to the data before analysis. The entire database view must be evaluated before the filters are applied, which results in poor performance. Normally, if the dimensional model is set up properly, these techniques are not needed. If you are unable or unwilling to create a dimensional model and use ETL, these may be your only options.
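The Database Indexing recommendations above can be sketched as follows. This is an illustrative example only: it assumes SQLite as a stand-in DBMS, and the table, column, and index names are hypothetical. It creates indexes on the fact table's foreign keys and on each level column of a dimension, then lists the indexes that exist. In SQLite an INTEGER PRIMARY KEY column is already the table's row identifier, so no separate primary key index is created here; other DBMS platforms may need one declared explicitly.

```python
import sqlite3


def create_tables(conn):
    """Create a hypothetical dimension and fact table with integer keys."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_key    INTEGER PRIMARY KEY,  -- single-column integer key
            product_family TEXT NOT NULL,        -- hierarchy level 1
            product_name   TEXT NOT NULL         -- hierarchy level 2
        );
        CREATE TABLE fact_sales (
            product_key INTEGER NOT NULL REFERENCES dim_product (product_key),
            date_key    INTEGER NOT NULL,
            amount      REAL NOT NULL
        );
    """)


def create_indexes(conn):
    """Index every fact foreign key and every dimension level column."""
    conn.executescript("""
        CREATE INDEX idx_fact_sales_product_key ON fact_sales (product_key);
        CREATE INDEX idx_fact_sales_date_key    ON fact_sales (date_key);
        CREATE INDEX idx_dim_product_family     ON dim_product (product_family);
        CREATE INDEX idx_dim_product_name       ON dim_product (product_name);
    """)


def index_names(conn):
    """Return the names of the explicitly created indexes, sorted by name."""
    rows = conn.execute("""
        SELECT name FROM sqlite_master
        WHERE type = 'index' AND name LIKE 'idx%'
        ORDER BY name
    """).fetchall()
    return [row[0] for row in rows]


index_conn = sqlite3.connect(":memory:")
create_tables(index_conn)
create_indexes(index_conn)
print(index_names(index_conn))
```

Index syntax and automatic key indexing differ between database platforms, so the statements should be adapted to the DBMS that holds the reporting data.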
Default Values

Pre-populate a record for all levels of all hierarchies with a default value. Use a value of N/A, unknown, or -1 to represent a value that was not found in a lookup. This allows the data to flow into the analytic database without being lost. N/A records can later be found and updated as appropriate.

Best Practices - Data Storage

This section provides best practices for storing and reporting data, as well as out-of-the-box configuration for reporting.

- Data Storage
- Reporting Data
- Out-of-the-Box Configuration

Data Storage

Data should reside on high-speed input-output storage. Use a physically mounted drive or a storage area network (SAN) with Fibre Channel in the same data center. A virtual machine for a database should be avoided, if possible, unless data storage concerns can be addressed. Poor database performance can degrade the performance of the entire analytic project.

Reporting Data

Do not use the database provided with Pentaho to store your reporting data. Store your reporting data in other database platforms better suited for this workload. The Pentaho database is for metadata about Pentaho objects, not for reporting data.

Out-of-the-Box Configuration

Do not use an out-of-the-box configuration for your reporting data DBMS. Modify your memory and kernel parameters soon after installation, before loading data. Most out-of-the-box DBMS configurations perform well only on very small or demonstration data sets.
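The Default Values practice described earlier in this document can be illustrated with a short sketch. It assumes SQLite as a stand-in database, and the table names, the UNKNOWN_KEY constant, and the lookup_customer_key helper are all hypothetical. A default N/A member with key -1 is pre-populated in the dimension, and any fact row whose lookup fails is assigned to that member instead of being dropped.

```python
import sqlite3

UNKNOWN_KEY = -1  # hypothetical key reserved for the pre-populated N/A member


def setup_customer_tables(conn):
    """Create a dimension with a default N/A record and an empty fact table."""
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_key  INTEGER PRIMARY KEY,
            customer_code TEXT NOT NULL UNIQUE,
            customer_name TEXT NOT NULL
        );
        CREATE TABLE fact_orders (
            customer_key INTEGER NOT NULL REFERENCES dim_customer (customer_key),
            amount       REAL NOT NULL
        );
    """)
    # Pre-populate the default member before any facts are loaded.
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, 'N/A', 'N/A')", (UNKNOWN_KEY,)
    )
    conn.execute("INSERT INTO dim_customer VALUES (1, 'C001', 'Acme')")


def lookup_customer_key(conn, code):
    """Return the dimension key for a source code, or the N/A key if absent."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_code = ?", (code,)
    ).fetchone()
    return row[0] if row else UNKNOWN_KEY


def load_orders(conn, source_rows):
    """Load fact rows, routing failed lookups to the N/A member."""
    for code, amount in source_rows:
        conn.execute(
            "INSERT INTO fact_orders VALUES (?, ?)",
            (lookup_customer_key(conn, code), amount),
        )


default_conn = sqlite3.connect(":memory:")
setup_customer_tables(default_conn)
# 'C999' does not exist in the dimension, so its row lands on the N/A member.
load_orders(default_conn, [("C001", 20.0), ("C999", 5.0)])
order_totals = default_conn.execute("""
    SELECT d.customer_name, SUM(f.amount)
    FROM fact_orders AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    GROUP BY d.customer_name
    ORDER BY d.customer_name
""").fetchall()
print(order_totals)
```

Because the unmatched row is kept under the N/A member, totals still reconcile with the source, and the row can be reassigned once the missing dimension record arrives.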