PowerPlay 6.5 Tips and Techniques


Building Large Cubes

The purpose of this document is to present observations, suggestions, and guidelines that may aid users in their production environments. The examples presented in this paper are illustrative, and are based upon hardware and software combinations designed to test Transformer in as many configurations as possible. Due to the large number of factors contributing to any specific production environment, generalizations were made in an attempt to address as many of these combinations as possible. If any information presented herein does not apply to your specific production situation, please contact Cognos and every attempt will be made to address your specific issues.

Contents

1 Introduction
1.1 The Cube Generation Process
2 Modeling Guidelines
2.1 Data Source Preparation & Processing
2.1.1 Data Staging
2.1.2 Dimension versus Fact Queries
2.1.3 Query Design Considerations
2.1.4 Multi-Processing
2.1.5 Data Source Types
2.1.5.1 Flat Files
2.1.5.1.1 Character Sets
2.1.5.1.2 Flat File Content
2.1.5.2 Oracle
2.1.5.2.1 Fetch Settings
2.1.5.2.2 Numeric Formats
2.1.5.3 Sybase Configuration
2.1.5.3.1 Fetch Settings
2.2 PowerCube Design
2.2.1 Partitioning
2.2.1.1 Auto-Partitioning
2.2.1.2 Manual Partitioning
2.2.1.3 Auto-Partitioning vs. Manual Partitioning
2.2.1.4 Troubleshooting your Partitioning Strategy
2.2.2 Dimension Views
2.2.3 Incremental Update
2.2.4 Compressed Cubes
2.2.5 Enable Cross-Tab Caching
2.3 Logging
2.3.1 Log File Structure
3 Hardware and Environmental Considerations
3.1 Memory
3.2 I/O
3.3 Operating System Considerations
3.3.1 UNIX
3.3.2 Windows NT

1. Introduction

With the introduction of PowerPlay 6.5, it has now become possible to build cubes from over fifty million rows of source data and five hundred thousand categories. In addition, version 6.5 of PowerPlay Transformer has introduced significant performance improvements over 5.21 and, through its auto-partitioning algorithm, has made building large, high-performance cubes easier than ever before. As more users of Transformer attempt to build large-scale PowerCubes, modeling and environment considerations become increasingly important to successful production. This paper discusses proper model design and environment guidelines useful to anyone attempting to build large PowerCubes.

1.1 The Cube Generation Process

The generation of PowerCubes involves three major steps:

- Category/Work File Generation. This includes reading the data source and processing the rows to create categories and compressed working files.
- Meta-Data Update. This stage deals with specifying the structure of the cube. This includes a subset of the dimensions, categories, and measures of the model, and reflects what the end user will see.
- Data Update. This represents the dominant stage of cube processing in terms of time. This stage includes consolidation, partitioning, and updating the cube with the set of records that apply to the cube.

Over the course of the paper, references to these stages will be made.

2. Modeling Guidelines

How you define your models can affect both the size of the model file and the processing time required to generate PowerCubes. This section discusses the following aspects of PowerCube design, and how to exploit these for PowerCube production:

- Data Source Preparation & Processing;
- PowerCube Design and Partitioning;
- Logging; and
- Basic Troubleshooting.

2.1 Data Source Preparation & Processing

The read phase of the PowerCube generation process accounts for a significant portion of the total build time. Proper query design can have a tremendous impact on the time it takes to produce PowerCubes. The rule is simple: the build time is directly proportional to the number of data source records processed by Transformer.

2.1.1 Data Staging

There are several techniques for preparing the data for use by Transformer. Many organizations have data warehouses from which they build PowerCubes; others stage data off the data warehouse or the operational systems. No matter how it is done, staging the data has a number of advantages, including:

- The ability for Transformer to produce cubes without impacting other information systems.
- In case of environmental failure, production can be restarted from the staged data.
- Complex queries can, in many cases, be simplified to a simple table extraction.
- It becomes possible to take advantage of data cleaning and transformation technologies during the staging process. This makes the Transformer models cleaner in terms of the categories that are generated and the number of value rows processed.

Two common methods of staging data for Transformer are temporary tables in a specific database technology, or an exported flat file. In either approach, the result of a complex query is used to populate the table or file, and a query is then added to Transformer that reads either the flat file or the temporary table. Many have found that this simple step saves much time, as it minimizes the impact on the other uses of the database, as well as the effect that database activity has on the cube-building process.
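
As a minimal sketch of the temporary-table approach, assuming a hypothetical source schema (ORDERS, PRODUCTS, and OUTLETS tables feeding a SALES_STAGE staging table; exact syntax varies by database), the staging step and the simple extraction read by Transformer might look like this:

-- Illustrative only: table and column names are hypothetical.
-- Staging step: run the complex join once, outside the cube build window.
CREATE TABLE sales_stage AS
SELECT o.order_date,
       p.product_line,
       p.product_type,
       p.product_name,
       s.outlet_name,
       o.quantity,
       o.revenue
FROM   orders o, products p, outlets s
WHERE  o.product_id = p.product_id
AND    o.outlet_id  = s.outlet_id;

-- Query defined in Transformer: a simple extraction from the staged table.
SELECT order_date, product_line, product_type, product_name,
       outlet_name, quantity, revenue
FROM   sales_stage;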

2.1.2 Dimension versus Fact Queries

There are two main classes of queries in Transformer: dimension queries and fact queries. In a typical production environment, a Transformer model would contain both structural (dimension) and transaction (fact) queries.

Dimension queries are composed of columns whose values build category structures within a Transformer model. The columns of dimension queries are associated with the dimensions and levels of the model, and provide data that is used to generate categories within these dimensions. The most important point to remember about dimension queries is that they do not contain columns that represent measure values. Instead, they establish category structures and provide labels, sort values, descriptions, and so forth. Typically, a dimension query is associated with a particular dimension, and provides all of the columns necessary for "populating" it with categories. Generally, dimension queries may not change as frequently over time as fact queries. For this reason, they may not have to be re-executed each time a cube is generated. For example, a structure query representing a geographic dimension may only need to be executed once if the geography for a model does not change. However, structure queries are typically much smaller (in terms of the number of rows they contain) than transaction queries, and so they are usually not major contributors to overall processing time.

Fact queries provide measure values for a cube. The columns in a fact query are associated with measures, and with unique levels in the model. Unlike dimension queries, fact queries change frequently, representing the latest data to be added to cubes. These queries are the main drivers of cube generation, and are designed to minimize the data source processing time during cube production. Consequently, these queries should have small, concise records, with the absolute minimum amount of information required to add new data to the PowerCubes.

Figure 1. Simple Star Schema with Dimension and Fact Tables.

For example, in a sales analysis model with the dimensions Product Lines, Location, and Market, the following queries could be used:

- A dimension query for the Products dimension. This query provides categories for the products. Since products change rarely, and only at pre-set intervals, the category hierarchy is relatively static.
- A dimension query for the Outlets (location) dimension. Again, because locations rarely change, the information in this query is relatively static.
- A third dimension query that describes the markets in which products are sold. Unlike the other queries, this one is relatively dynamic, as markets change in response to ongoing analyses and strategic initiatives.
- A Sales fact query that provides measures for the model and contains the information required to map these measure values onto levels in the model.
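
As a sketch of the difference, again using the hypothetical tables and columns from the staging example in section 2.1.1, the Products dimension query and the Sales fact query might look like the following:

-- Hypothetical dimension query for the Products dimension: structural
-- columns only, no measure values.
SELECT product_line, product_type, product_name
FROM   products;

-- Hypothetical fact query for Sales: the measure columns plus only the
-- lowest-level columns needed to reference each dimension.
SELECT product_name, outlet_name, order_date, quantity, revenue
FROM   sales_stage;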

[Diagram: the dimension relationships between each query and the model dimension map.]

The kind of information in these queries is different, and you can use Transformer's query timing controls (in the General tab of the Query property sheet) to set when Transformer processes them. The Products and Outlets structural queries would be executed initially to build the category structure within the model. Once this is complete, there may be no requirement to execute them during PowerCube generation; the model would be saved populated with these categories, and the query timing options on the General tab (Query property sheet) would be set accordingly once you've generated categories for the Products and Outlets dimensions. The Market query represents a more volatile structure that would have to be recreated each time a cube is generated, so it should be run during the category generation phase of PowerCube creation, and its General tab settings would reflect this. The Sales transaction query is executed during PowerCube creation, as it provides the measure values for the PowerCube, and its General tab settings would be set for that purpose.

2.1.3 Query Design Considerations

The queries used in models, and their associations with the objects used by Transformer, will have an impact on the performance of data source processing. Consider the following when designing queries to be used in Transformer.

1. To Transformer, a query is a distinct set of columns. This set of columns may correspond to a table, or to the set of records that results from a SQL query executed against a relational database. Each query has a relationship with the dimensions, levels, categories, and measures in the model. Queries do not, however, have any relationship with other queries in the model. In other words, Transformer does not perform any operation between queries, including joins or other like operations.
2. Dimension queries can be used to populate the dimensions of the model. These queries should include the structure, labels, order-by values, tags, and descriptions for the dimensions with which the query is associated. For simplicity, consider having a one-to-one relationship between each dimension query and the dimensions in the model.
3. Fact queries should contain the minimum set of columns needed to reference each dimension. This set should include only unique levels and sets of levels for each dimension to which the query applies.
4. Use Query Timing. This feature is designed to process queries only when appropriate. For example, if a dimension is already populated with categories, and no new categories will be added by the dimension query, then there is no need for the query to be executed during PowerCube generation; the query can be deactivated for this operation. Fact queries often provide little or no valuable structural information, so there is no need to process these queries during a category generation operation.
5. When creating queries, eliminate columns that are not required. In the case of flat files, for example, the presence of these columns will have an impact on the data source read time.
6. When possible, consider consolidating any data sources before they are processed by Transformer. Transformer will process the data, and will consolidate it if requested, but often this can be done more effectively by the data source (see the sketch after this list).
7. If you are using level uniqueness, ensure that your data does not have any uniqueness violations.
8. Use Check Model to verify your query associations. Check Model performs a series of validations that identify potential problems with query design.
9. If you are modifying your query structures regularly, ensure that they are synchronized with the query definitions in Transformer. Check Columns and Modify Columns provide the means of accomplishing this.
10. Ensure that all structure is in place before a fact query is executed. This can be accomplished either by populating the model in advance, or by moving all fact queries to the end of the Queries list. Transformer will handle any ordering of the queries, but better performance will be achieved if dimension queries appear before fact queries in the Queries list.
11. DO NOT EXPECT TRANSFORMER TO PERFORM JOINS BETWEEN TWO QUERIES. Refer to point 1.
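
As a minimal sketch of point 6, again assuming the hypothetical sales_stage table from section 2.1.1, consolidation can be pushed down to the data source so that Transformer receives fewer, pre-summarized rows:

-- Illustrative only: pre-consolidate at the source so that Transformer reads
-- one row per product, outlet, and day rather than one row per order line.
SELECT product_name,
       outlet_name,
       order_date,
       SUM(quantity) AS quantity,
       SUM(revenue)  AS revenue
FROM   sales_stage
GROUP BY product_name, outlet_name, order_date;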

2.1.4 Multi-Processing

With the release of PowerPlay Transformer 6.x, it is now possible to utilize multiple processors during the read phase of PowerCube generation. Multi-processing can be turned on for a query by simply checking "Enable multi-processing" on the General tab of the query property sheet.

Figure 2. Enable Multi-processing for a Query

When enabled, Transformer will use up to two processors to load balance the PowerCube generation process. Turning on this feature is highly recommended for queries that have calculated columns defined in Transformer. The log file can be used to confirm that this option is in effect.

2.1.5 Data Source Types

2.1.5.1 Flat Files

Flat file format is the default format that PowerPlay Transformer supports for reading data source files.

2.1.5.1.1 Character Sets

By default, Transformer uses Windows (ANSI) as the character set when defining a query based on a flat file. This character set uses parsing routines that are multi-byte sensitive. Consider using the Windows (ANSI) Single-Byte character set for any queries in which multi-byte data is not applicable: the performance of the parsing routines is better, which results in improved performance during the data source read phase of PowerCube generation.

2.1.5.1.2 Flat File Content

Another consideration when using flat files is to ensure that they do not contain any trailing blanks or undesired special characters. Performance of the read phase will be directly affected if Transformer must spend time dealing with extraneous data. Review your flat files to ensure that they are being created as you expect.

2.1.5.2 Oracle

This section provides information that can be used to optimize read performance for data sources stored using Oracle 7.x.

2.1.5.2.1 Fetch Settings

When reading Oracle data sources, two settings control the size of the buffer and the number of rows used when fetching rows. These settings can be seen in the cogdmor.ini file located in the Transformer executable directory. They are as follows:

- Fetch Number of Rows. This setting determines the number of rows to fetch each time a fetch operation is performed. The current limit for this setting is 32767. The default value is currently 10; however, increasing it may yield a performance increase. In one experiment, we changed the value from 10 to 100, which yielded roughly a three-fold increase in performance. It should be noted, however, that increasing this number arbitrarily might cause performance degradation.
- Fetch Buffer Size. This setting can be used to control the size of the buffer used to fetch data. It may also yield a performance increase depending on the situation. Fetch Number of Rows takes precedence if both settings are set. By default, this setting is disabled.

It should be noted that these settings can yield differing performance benefits depending on the system, although through experimentation noticeable benefits may be realized.
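
As an illustrative sketch only, the change made in the experiment above amounts to editing the corresponding entry in cogdmor.ini. The line below is shown without its surrounding section header, and its exact formatting is an assumption based on the setting name described above, so verify it against the file shipped with your installation. (The Sybase driver's cogdmct.ini exposes NUMOFROWS in the same spirit; see section 2.1.5.3.1.)

; cogdmor.ini excerpt (illustrative) -- increased from the default of 10.
; Values that are too large can degrade rather than improve performance.
Fetch Number of Rows=100
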
2.1.5.2.2 Numeric Formats

When source data is read from Oracle, integer values that are greater than 10 digits undergo a conversion to the double representation that Transformer uses internally. If possible, consider storing the numeric values using a double equivalent in Oracle. This eliminates the overhead of performing any conversions while Transformer is processing the data, and may be accomplished by using a temporary table to store the data records destined for Transformer. If storing the values using a double equivalent is not possible, you can force Oracle to perform the conversions by simply multiplying each numeric column by 1.0 in the SQL. If Oracle is located on a different machine than the one on which cube production is occurring, a performance increase may be realized. In an experiment on HP/UX 9.04, with Oracle on one server and the Transformer server running on another, we realized roughly a two-fold improvement in data read time using this approach.
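
A minimal sketch of the second approach, with the same hypothetical column names as before; multiplying by 1.0 causes Oracle to return the measures as floating-point values, so Transformer does not have to convert them:

-- Illustrative only: force the conversion to a double on the Oracle side.
SELECT product_name,
       outlet_name,
       order_date,
       quantity * 1.0 AS quantity,
       revenue  * 1.0 AS revenue
FROM   sales_stage;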

2.1.5.3 Sybase Configuration

This section provides information that can be used to optimize read performance for data sources stored using Sybase.

2.1.5.3.1 Fetch Settings

When reading Sybase data sources, you can control the number of rows fetched from the database at a time. This setting can be seen in the cogdmct.ini file located in the Transformer executable directory:

- NUMOFROWS. This setting determines the number of rows to fetch each time a fetch operation is performed. The default value is currently 10; however, increasing it may yield a performance increase. In one experiment, we changed the value from 10 to 100, which yielded an increase in performance. It should be noted, however, that increasing this number arbitrarily might, in fact, cause performance degradation.

2.2 PowerCube Design

This section discusses a number of aspects relating to PowerCube design, including partitioning, dimension views, incremental update, and other options.

2.2.1 Partitioning

Partitioning is a process by which Transformer divides a large PowerCube into a set of nested "sub-cubes" called partitions. Partitioning optimizes run-time performance in PowerPlay by reducing the number of data records searched to satisfy each information request. For large PowerCubes built from millions of rows, you should set up partitions to speed up cube access for PowerPlay users. Partitioning pre-summarizes the data in a PowerCube and groups it into several subordinate partitions so that it can be retrieved significantly faster than if the cube were not partitioned. Creating a very large cube without using partitions can result in poor run-time performance for PowerPlay users.

There are two ways to partition a model:

- Use the Auto-Partition command to have Transformer implement a partitioning strategy for you.
- Define partitions yourself by assigning partition numbers to the categories in your model.

While partitioning does significantly improve run-time access for large PowerCubes, there are some associated processing costs at PowerCube creation time:

1. PowerCube size. Partitioned cubes are larger than non-partitioned cubes.
2. PowerCube creation time. Partitioned cubes take longer to create than non-partitioned cubes.

Consider the trade-off this represents: on one end of the spectrum is the time it takes to build the cube; on the other is the time it takes the end user to navigate the cube. If partitioning is not employed, build performance will be optimal; however, this comes at the potential cost of query performance for end users as they are navigating the cube. As the number of levels of partitioning increases, the time it takes to build the cube increases proportionally; however, this yields performance gains for the end users.

2.2.1.1 Auto-Partitioning

Transformer 6.x now has support for performing auto-partitioning during cube building. This feature has greatly simplified the partitioning process, as Transformer determines the best partitioning strategy while it is creating the cube. In addition, the partitioning strategy is specific to each cube created by a model. Unlike manual partitioning, the user does not have to have a strong understanding of partitioning to be able to effectively partition PowerCubes. The auto-partition feature is controlled through an optimization on the Processing tab of the PowerCube property sheet. When the auto-partition optimization is used, it enables the Auto-Partition tab.

Figure 3. Auto-Partition Tab of the PowerCube Property Sheet.

The controls on this tab are as follows:

- Estimated Number of Consolidated Rows. Default: 10,000,000 rows. This control serves as an estimate of the number of rows the cube would contain after consolidation. The default value is set to the published maximum number of consolidated rows a cube can have in PowerPlay 5.21. This value can be changed, and is used to scale the Desired Partition Size controls.
- Desired Partition Size. Default: 500,000 rows, or 5% of the Estimated Number of Consolidated Rows control. These controls set the desired size for each partition in the cube. The slider control conveys the trade-off between optimizing for cube build performance and end-user query performance in PowerPlay clients. The slider is gauged in 1% increments, with the maximum being 100% of the Estimated Number of Consolidated Rows control; the maximum desired partition size (100%) sits at the cube-build-performance end of the control, and the minimum (1%) at the end-user-query-performance end. The Desired Partition Size edit control reflects the desired partition size as a number of rows corresponding to the position of the slider. The desired partition size can also be set by typing a value in the edit control, in which case the slider will reflect this setting as a percentage of the Estimated Number of Consolidated Rows.
- Maximum Number of Levels of Partitioning. Default: 5. The value of this control is a safeguard for cases in which the Desired Partition Size is too small compared to the number of rows added to the cube. It also safeguards against "shallow" models that lack depth in terms of dimension levels. Each level of partitioning represents an additional pass over a portion of the original source data; as the number of levels of partitioning increases, the number of passes over the data increases accordingly.
- Use Multiple Processes. Default: unchecked. If a cube has this option set, the cube generation phase will exploit multiple processes when possible.

When specifying an auto-partitioning strategy using these controls, consider the following:

- The estimated number of rows is merely an estimate used to allow you to scale the desired partition size controls. This setting does not have to be accurate.
- When setting the desired partition size, don't hesitate to set it larger than you would have set the "Maximum number of transactions per commit" setting in 5.21. The auto-partition algorithm does a very good job of creating near equivalent-sized partitions, so query-time performance is on average much better than for PowerCubes partitioned manually in 5.21.
- The maximum number of levels of partitioning should only be modified when it is clear that the auto-partitioning algorithm is performing extra passes that are not improving the performance of the PowerCube.

When a cube is built using auto-partitioning, Transformer employs a dynamic weighting algorithm to choose each candidate dimension for partitioning. This algorithm tends to favor dimensions with more levels whose category parent-child ratios are consistent throughout the dimension. The auto-partitioning algorithm then dynamically assigns a partitioning strategy to the categories in the dimension, producing partitions of roughly equivalent size, as close to the desired partition size as possible. As Transformer processes each pass of partitioning, the number of rows and categories left in the summary partition decreases. This is evident both through the animation and in the log file.

A number of features are not supported by the auto-partition algorithm. These include externally rolled-up measures, before-rollup calculated measures, and other cube optimizations. Before-rollup calculated measures can be replaced by creating calculated columns. If the model includes settings that are not supported by the auto-partitioning algorithm, Check Model will indicate this with a warning.

Incremental update only supports auto-partitioning during the initial load of the PowerCube. Incremental updates to the PowerCube will have performance comparable to that of 5.21. Refer to section 2.2.3 for using incremental update with auto-partitioning.

Certain dimensions are not good candidates for auto-partitioning, mainly those that consist of few levels and large numbers of categories. It is possible to exclude these dimensions from the auto-partitioning algorithm by checking the check box found on the General tab of the dimension property sheet.

2.2.1.2 Manual Partitioning

The following steps provide a strategy to follow when manually partitioning a model:

1. Select the dimension in your model that contains the largest number of categories. In addition, consider dimensions that contain a large number of levels in comparison with other dimensions in the model. Most often, this offers the greatest chance of consolidation of rows during cube processing.
2. Choose a desired partition size. This would be a size designed to optimize run-time performance against the cube.
3. Calculate the number of categories that require a partition number. This becomes the set of partitions in a partition level, and will be referred to as the number of partitions in the current level of partitioning:

   Number of partitions = number of categories = number of source rows / desired partition size

   For example, with 20,000,000 source rows and a desired partition size of 500,000 rows, you would look for roughly 40 categories.

4. In the selected dimension, choose a level that contains close to the number of categories determined in Step 3.
5. Assign the same partition number to each category in the chosen level. To assign the same partition number to each category of a level, assign a partition number to that level. Ensure that the partition number is larger than the number assigned to categories in the previous partition level. Note: a partition level is not the same as a level in a dimension. A partition level is a set of categories that receive the same partition number; these categories can come from more than one level in a dimension. For simplicity, we suggest choosing a level to which the partition number can be assigned. In practice, you would try to select a set of categories across the dimension and assign these the partition number, while trying to adhere to the number of partitions established in Step 3. Once complete, you should not be able to find a path from the root category to a leaf that does not include a category with the partition number assigned.
6. Build the cube and review the partition status. If the size of any partition is too large, another level of partitioning may be necessary. If the partition status is unacceptable (some partitions contain more than the desired number of records), proceed to test the performance in PowerPlay as described in Step 7. If the partition status is acceptable, you're done.

7. Navigate the cube in PowerPlay, drilling down into the partition with the largest number of records. If the performance is unacceptable, consider another level of partitioning in these dimensions. Examine the number of records in the summary partition; this is the number of records that you have to consider in subsequent partitioning. Go to Step 3 and repeat the entire partitioning process using a level (in some other dimension) that adequately divides the number of records in the summary partition. For each new level of partitioning, you must increase the assigned partition number by 1. After changing the partitioning, recreate the PowerCube and re-examine the partition status.
8. Repeat Steps 3 through 7 until there are a sufficient number of partition levels to yield the desired run-time performance.

2.2.1.3 Auto-Partitioning vs. Manual Partitioning

Auto-partitioning has several advantages over manual partitioning:

- Auto-partitioning is cube specific, whereas a manual partitioning strategy is model specific. This means that for models containing multiple PowerCubes, the manual partitioning strategy may not be good for all PowerCubes.
- Auto-partitioning adapts itself to the model and the data over time; manual partitioning doesn't. In the worst case, a manual partitioning scheme may need to change from build to build depending on the volatility of the data.
- Developing a good manual partitioning strategy is more time consuming. It requires several test builds and may take much time to tune the partitioning scheme.

A better strategy may be to auto-partition the PowerCube, review the result using the "Partition Status" dialog, and manually apply it to your model, making any changes as desired.

2.2.1.4 Troubleshooting your Partitioning Strategy

The following list describes possible problems that could occur with your partitioning strategy, accompanied by suggestions on how to address them.

Problem 1. Performance of the PowerCube in certain reports does not meet expectations.
Suggestion: If your report is a lowest-level detail report, consider partitioning in a dimension other than one involved in the report.

Problem 2. In the log file, the number of records and categories are not decreasing with each pass of partitioning.
Suggestion: Transformer is having trouble consolidating your data within the desired partition size. Increasing the desired partition size should help alleviate this. If that doesn't work, exclude dimensions in which a partitioning pass has no effect. If that is unsuccessful, set the maximum number of passes to the last pass in which the records and categories were reduced.

Problem 3. PowerCube performance decreases with each incremental update.
Suggestion: Incremental update has an effect on any partition scheme in the cube. Over time, the partition scheme will "decay" as new data is added to the PowerCube during incremental update. Your incremental update production environment should include a full rebuild of the cube on a regular basis.

Problem 4. PowerCube build performance does not meet expectations.
Suggestion: Inspect the log file to see if the number of records and categories are decreasing with each successive pass of partitioning. If they are not, refer to problem 2. If they are decreasing but the pass count is not increasing, refer to problem 5. If neither of these suggestions helps, use the log file to isolate the phase of the cube build where the most time is spent, and attempt to attribute it to hardware, environment, or model design.

Problem 5. The pass count is not incrementing during an auto-partitioning build.
Suggestion: Transformer has had no success partitioning in one or more dimensions. This could be due to a number of factors, including the order of dimensions chosen by the auto-partitioning algorithm, the nature of the data, the structure of the dimensions themselves, and so forth. Try either increasing the desired partition size or excluding the dimensions in question from auto-partitioning.

Problem 6. Build performance in 6.x has not improved over 5.21.
Suggestion: The model imported from 5.21 may contain a feature that is not supported by the auto-partitioning algorithm. Use Check Model to determine this.

Problem 7. A PowerCube built using 6.x does not complete, whereas it did in 5.21.
Suggestion: Transformer 6.x uses more disk space than 5.21. Ensure that there is sufficient disk space for creating your PowerCubes.

Problem 8. (TR0536) Partitioning can only be specified for categories in a primary drilldown.
Suggestion: This happens if you are employing a manual partitioning strategy and have assigned a number to a category in an alternate drilldown. You can only partition in the primary drilldown of a dimension. Ensure that manual partitions are assigned to categories that reside in the primary drilldown. The primary drilldown can be identified as the collection of levels that are not in italics on the dimension map.

Problem 9. (TR0537) A partition can have a value between 0 and 15.
Suggestion: This happens if you are employing a manual partitioning strategy and have assigned a partition number to a category that is outside this range.

Problem 10. (TR0538) A partition cannot be specified for root, special or leaf categories.
Suggestion: This happens if you are employing a manual partitioning strategy and have attempted to assign a partition number to a root, special, or leaf category in a dimension.

Problem 11. (TR0542) A category cannot be suppressed or filtered and have a partition number.
Suggestion: Categories that have been suppressed or filtered in a dimension or user class view cannot also be assigned a partition number.

Problem 12. (TR0710) Unable to show partition status. The selected PowerCube could not be found or has not yet been created.
Suggestion: You have attempted to show the partition status of a PowerCube that could not be located by Transformer. Check the location specified on the Output tab of the PowerCube property sheet.

Problem 13. (TR0715) This model is manually partitioned. The auto-partition will not override the manual partitioning scheme.
Suggestion: You have assigned partition numbers to categories in your model. Auto-partitioning will respect these settings and not attempt to auto-partition the cube. You can remove all partition settings by selecting "Reset partitions" on the Tools menu.

Problem 14. (TR1909) The PowerCube contains data and therefore new categories will be added to partition zero.
Suggestion: You cannot change your partition strategy once you have added data to a PowerCube. This error typically occurs in an incremental update situation where you have employed a manual partitioning strategy.

Problem 15. (TR2717) Category X has no children in dimension Y in at least one dimension view or Cube Group. This category cannot have a partition specified.
Suggestion: A view operation or Cube Group specification has effectively made this category a leaf in a PowerCube. This category cannot have a partition number specified.

Problem 16. (TR2723) A gap has been detected between partition X and Y. No gaps in partition numbers are permitted.
Suggestion: This error occurs when you are applying a manual partitioning strategy to a model and have left gaps in the partition numbers you have assigned to categories or levels. Review your partition numbers to make sure there are no gaps in the numbering.

Problem 17. (TR2724) Partition X is defined in dimensions Y and Z. A partition number cannot be assigned in more than one dimension.
Suggestion: You are manually partitioning your model and have assigned the same partition number in more than one dimension. Partition numbers cannot span dimensions.

Problem 18. (TR2725) Category X is a descendant of category Y and shares the same partition number in dimension Z. A partition number can only occur once in a category path of a dimension.
Suggestion: You are manually partitioning and have assigned the same partition number to two categories in a dimension such that one category is a descendant of the other. For the descendant category, consider incrementing or decreasing the assigned partition number.

Problem 19. (TR2751) This model contains one or more cubes with cube optimizations other than Auto-Partition. Better performance may be realized if the optimization setting for these cubes is changed to Auto-Partition.
Suggestion: You have chosen an optimization other than auto-partition for one or more PowerCubes in your model. The performance gains accompanying the auto-partitioning algorithm may not be realized with the current optimization setting. Consider changing it to auto-partitioning.

Problem 20. (TR2752) This model contains one or more data sources with external rollup specified. Auto-partitioning is not possible.
Suggestion: This feature is not supported by the auto-partitioning algorithm.

Problem 21. (TR2753) This model contains one or more cubes with incremental update selected and an existing cube. Auto-partitioning is not possible on subsequent increments of a cube.
Suggestion: This warning indicates that incremental updates to the cube are not supported by the auto-partitioning algorithm. New categories introduced by the increment will be added to the summary partition (partition 0).

2.2.2 Dimension Views

Dimension views permit operations such as suppressing, summarizing, and apexing categories in dimensions included in a PowerCube. In addition, both dimension views and dimension diagrams allow filtering of categories. In most cases, these operations are designed to reduce the number of categories that are placed into a cube, thereby producing smaller cubes for specific audiences.

User Class Views are used in conjunction with the Authenticator. These views do not remove categories from a PowerCube, but rather restrict access to those categories to certain user classes. Multiple users can access the same cube, but each user only sees the data they are entitled to see.

User Class Views are designed for large cubes shared amongst numerous users.

Dimension views allow you to create a single model with multiple cubes, each of which uses only portions of the dimensions and measures in the model. The benefit of dimension views over having several models is that the source data needs to be processed only once during a PowerCube build.

Figure 4. Dimension Tab of the PowerCube Property Sheet

On the Dimensions tab of the PowerCube property sheet, you can select the dimensions to include or omit, as well as specific dimension views that eliminate categories from a dimension in the resulting PowerCube.

2.2.3 Incremental Update

Incremental update is an effective way to optimize cube production by adding only the newest data to an existing PowerCube without reprocessing all previous data. Updates are much smaller than an entire rebuild of the cube and are usually done much more quickly. Incremental update is best employed for cubes that have a static structure; as long as the structure remains static, incrementally updating the cube is possible. Once a structure change occurs, the cube must be regenerated. Incremental update is also not useful in situations where historical data changes regularly; in such cases, the PowerCubes must be rebuilt to reflect these historical changes.

When considering incremental update in a production environment, consider the following:

- When turning on incremental update for a PowerCube, the model and PowerCube become tightly coupled. That is to say, only that model can perform updates to the PowerCube. If the two become out of synch, then future increments may fail.
- Always back up your PowerCube, model, possibly source files, and log files prior to each incremental update. If a problem occurs, you can simply roll back to these archived files.
- As increments are applied to a PowerCube, any partitioning strategy present in the PowerCube will decay over time. It is highly recommended that a full build of the PowerCube be performed on a regular basis. Refer to the following example:

Build 1: Initial load
Build 2: Increment 1 on build 1
Build 3: Increment 2 on build 2
Build 4: Increment 3 on build 3
Build 5: Increment 4 on build 4
Build 6: Full load consisting of the initial load and increments 1 through 4
Build 7: Increment 5 on build 6
Build 8: Increment 6 on build 7

This process does a full rebuild after every four incremental updates. This should be a planned activity. Performing a full rebuild on a regular basis will refresh the partitioning strategy so that performance at query time is optimized.

2.2.4 Compressed Cubes

An option is available for users of Transformer who wish to take advantage of the compression feature available for PowerCubes. The user may specify, for a cube or cube group, that the corresponding standard .mdc file be compressed upon completion of the cube generation cycle. This feature is only available for 32-bit Windows platforms, and is designed to provide additional space savings for standard .mdc files for the purposes of deployment and storage. When the cube is read by the PowerPlay client, it will be uncompressed.

2.2.5 Enable Cross-Tab Caching

This feature caches the initial cross-tab of a PowerCube in order to improve performance when the PowerCube is first opened by a PowerPlay client user.

2.3 Logging

The Transformer log file can be a very useful source of information. It provides information about model characteristics such as the number of categories, the number of source records processed, timings, auto-partitioning, and any errors produced during PowerCube processing.

2.3.1 Log File Structure

1. Initialization

PowerPlay Transformer (6.5.283.0) Tue Aug 31 20:24:26 1999
LogFileDirectory=y:\data\extract
ModelSaveDirectory=y:\data
DataSourceDirectory=d:\program files\cognos\powerplay 6.5\pp6.5 samples
CubeSaveDirectory=y:\data\extract
DataWorkDirectory=y:\data\extract
ModelWorkDirectory=y:\data\extract
MaxTransactionNum=500000
ReadCacheSize=4096
WriteCacheSize=16384

This section outputs the environment variables used by Transformer for the PowerCube generation process.

2. Read Phase

Date/Time Logging Level Object Identifier Message
4 000000CB Start cube update.
4 000000CB Initializing categories.
4 000000CB Timing, INITIALIZING CATEGORIES,00:00:00
4 000000CB Start processing data source 'd:\program files\cognos\powerplay 6.5\pp6.5 samples\national\national.asc'.
4 000000CB Reading source data.
4 000000CB End processing 365 records from data source 'd:\program files\cognos\powerplay 6.5\pp6.5 samples\national\national.asc'.
4 000000CB Marking categories used.
4 000000CB Timing, MARKING CATEGORIES USED,00:00:00

The read phase captures the start of the read phase and the source queries being processed, followed by the total time spent during the read phase. This collection of messages is repeated for each query processed. Any errors that appear here indicate a problem during the read phase or an error encountered by Check Model.

3. Meta Data Update Phase

Date/Time Logging Level Object Identifier Message
4 000000CB Updating category status.
4 000000CB Processing the work file.
4 000000CF Processing cube 'National' at location y:\data\extract\national.mdc
4 000000CB Timing, UPDATE CATEGORY AND PROCESS WORK FILE,00:00:00
4 000000CF Start metadata update of cube 'National'.
4 000000CB Marking categories needed.
08:24:47 PM 4 000000CB Updating the PowerCube metadata.
4 000000CB Updating the PowerCube with currency data.
4 000000CF End metadata update of cube 'National'. 132 categories were added to the cube.
08:24:47 PM 4 000000CB Timing, METADATA,00:00:01

The meta data update phase initializes the PowerCube and loads it with the necessary metadata, including all dimension, level, category, and measure information.

4. Data Update Phase

Date/Time Logging Level Object Identifier Message
08:24:47 PM 4 000000CF Start update of cube 'National'.
08:24:47 PM 4 000000CF --- Performing Pass 0 with 365 rows and 132 categories
08:24:47 PM 4 000000CF Selected Dimension 1 for next pass of partitioning.
08:24:47 PM 4 000000CB Sorting the work file.
08:24:47 PM 4 000000CB Counting category hits.
08:24:47 PM 4 000000CF End sorting 364 records.
08:24:47 PM 4 000000CF Start Count and Consolidation with 364 rows and 132 categories
08:24:47 PM 4 000000CF End Count and Consolidation with 356 rows and 132 categories
08:24:47 PM 4 000000CF Start Write leaving 132 categories
4 000000CB Updating the PowerCube data.
4 000000CB Performing DataBase Commit at record number 357.
4 000000CF End Write leaving 132 categories.
4 000000CB Timing, CUBE UPDATE,00:00:02
4 000000CF --- Performing Pass 1 with 356 rows and 132 categories
4 000000CF Selected Dimension 0 for next pass of partitioning.
4 000000CB Counting category hits.
4 000000CF End sorting 356 records.
4 000000CF Start Count and Consolidation with 356 rows and 132 categories
4 000000CF End Count and Consolidation with 286 rows and 132 categories
4 000000CF Start Write leaving 132 categories
4 000000CB Updating the PowerCube data.
4 000000CB Performing DataBase Commit at record number 287.
4 000000CF End Write leaving 132 categories
4 000000CB Timing, CUBE UPDATE,00:00:00
4 000000CF --- Performing Pass 2 with 286 rows and 132 categories
4 000000CF Selected Dimension 2 for next pass of partitioning.
4 000000CB Counting category hits.
4 000000CF End sorting 286 records.
4 000000CF Start Count and Consolidation with 286 rows and 132 categories
4 000000CF End Count and Consolidation with 94 rows and 132 categories
4 000000CF Start Write leaving 132 categories
4 000000CB Updating the PowerCube data.
08:24:50 PM 4 000000CB Performing DataBase Commit at record number 95.
08:24:50 PM 4 000000CF End Write leaving 132 categories.
4 000000CB Timing, CUBE UPDATE,00:00:01
08:24:50 PM 4 000000CB Committing PowerCube(s).
08:24:50 PM 4 000000CB Timing, CUBE COMMIT,00:00:00
08:24:50 PM 4 000000CB End cube update.
08:24:50 PM 4 000000CB Timing, TOTAL TIME (CREATE CUBE),00:00:04

The data update phase adds the measure values to the PowerCube. This includes all auto-partitioning passes and PowerCube commits.

3. Hardware and Environmental Considerations

3.1 Memory

One of the most important considerations when selecting the appropriate hardware for producing PowerCubes is memory. The performance of PowerCube generation is proportional to the amount of physical memory available; this is particularly evident when the PowerCubes being created are large. Operating systems, including Windows NT, HP/UX, AIX, and SunOS, all allocate a portion of physical memory for disk caching.

As more physical memory is available, there is greater potential for allocating more of it for disk caching. Windows NT, for example, employs a virtual block cache that is dynamic in size, adjusting depending on the amount of available physical memory. The process of cube generation is very I/O intensive, and build performance is heavily dependent on minimizing the number of page faults that occur while the cube is being written. Therefore, optimum build performance will occur when the disk cache is larger than the PowerCube (.mdc file) at all stages during the generation process. In general, try to make as much physical memory as possible available for disk caching. The implementation of this is platform specific, but the rule applies across all platforms.

3.2 I/O

After memory, the composition and configuration of the machine's I/O systems (hard drives and controllers) can be the difference between a fast cube build and a slow one. In many ways, Transformer is similar to a database server, and it is possible to configure the drives on the build machine to minimize I/O contention. A standard configuration for database servers is to have three disk controllers, each with dedicated hard drives. With three controllers available, Transformer can have its working directories specified in the following fashion:

Controller 1: application/OS, OS temporary files, Transformer models, Transformer data sources
Controller 2: Transformer data temporary files, Transformer model temporary files
Controller 3: Transformer PowerCubes, Transformer sort directory

The configuration above provides the minimum amount of drive contention. Another consideration is the RAID (Redundant Array of Independent Disks) level that is selected. As stated previously, Transformer is a very I/O intensive application, and higher levels of RAID (like RAID-3 or RAID-5) can cause a performance decrease during the cube build. A setting of RAID-0 would probably be the most optimal for building PowerCubes (stripe size doesn't really matter, as our internal tests have shown no difference between 16k, 32k, or 64k stripe sizes).

3.3 Operating System Considerations

NT Server and many UNIX operating systems are configured for multiple processes. For this reason, these operating systems restrict the amount of resources that are allocated to a given process. Transformer is a process that wants to consume all available resources for cube building, so every attempt should be made to configure these operating systems for Transformer's needs.

3.3.1 UNIX

The most common advice given regarding the UNIX operating systems is to ensure that the Transformer process receives a sufficient amount of the resources required to produce PowerCubes. Many system administrators have imposed OS limits on processes or users through environment settings. These limit the amount of virtual memory, disk, or even processor power available to the Transformer process itself, or to the users under whom the build process is executing.

3.3.2 Windows NT

On Microsoft Windows NT, disk caching plays a very important role in the generation times of PowerCubes. The Windows NT virtual block cache occupies physical memory and has a dynamic size; the virtual memory manager in NT automatically gives the cache more physical memory when memory is available and less when system-wide demands are high. PowerCube generation is an I/O intensive operation. Consequently, optimum build performance will occur when the disk cache is larger than the PowerCube (.mdc file) at all stages during PowerCube generation. As the size of the disk cache decreases, the number of page faults increases, which causes increased paging and, ultimately, worse build performance.