Solutions for Netezza Performance Issues

Vamsi Krishna Parvathaneni, Tata Consultancy Services, Netezza Architect, Netherlands (vamsi.parvathaneni@tcs.com)
Lata Walekar, Tata Consultancy Services, IBM SW ATU - Information Server and Netezza Lead, Pune (Lata.walekar@tcs.com)
Table of Contents

About the Domain
Introduction
Recommendation for Netezza Optimization
Benefits Derived from Performance Tuning
References
Abstract

Netezza is an appliance from IBM: an expert integrated system with built-in expertise, integration by design, and a simplified user experience. As part of the PureData family, the Netezza appliance is now known as the IBM PureData System for Analytics. It retains the key design tenets of simplicity, speed, scalability, and analytic power that were fundamental to Netezza appliances. With simple deployment, out-of-the-box optimization, no tuning, and minimal ongoing maintenance, the IBM PureData System for Analytics offers the industry's fastest time-to-value and lowest total cost of ownership. This white paper explains how we overcame performance issues in Netezza for one of our customers.

About the Domain

The customer is a world leader in the manufacture of advanced technology systems for the semiconductor industry. The company offers an integrated portfolio for manufacturing complex integrated circuits (also called ICs or chips). The customer organization designs, develops, integrates, markets, and services advanced systems used by its end customers, the major global semiconductor manufacturers, to create chips that power a wide array of electronic, communications, and information technology products. With every generation, the complexity of producing integrated circuits with more functionality increases, and semiconductor manufacturers need partner organizations that provide technology and complete process solutions.

Introduction

With the objective of laying the foundation for centralized machine data with increased efficiency, the current file-repository-based Archive system is to be replaced with a data warehouse appliance (Netezza) to enable fast and controlled access to machine data. The main deliverables of this project to create the New System are:

- A Netezza data warehouse appliance filled with machine data as received from the machines located at end-customer sites, including the loaders to feed the daily inflow of machine data into the appliance.
- An Application Programming Interface (API), giving diagnostic applications efficient access to the stored machine data. Two things are important to note about this API:
  - First: the paradigm shift from the current approach (large amounts of original machine data transferred to a client PC and turned into information on the client) to the new approach (keep the original machine data in the appliance and transfer only information to the client).
  - Second: most data (in volume and number of files) that is part of the current Archive system will be stored in Netezza. Certain information for which there is no value or benefit in storing it in Netezza will be kept as files on a micro archive, which will be accessible as a file system.

Business drivers for the technology shift towards Netezza are:
- Efficiency of diagnostics, reporting, and analysis on machine data
- Preparation for future growth in the volume of machine data
- A single central repository of machine data with proper authorization and authentication, with diagnostic applications delivering a good user experience and eliminating the need for local copies of machine data
- A foundation for future analytic applications

Performance Issues after Implementing the New System with Netezza and InfoSphere

IBM InfoSphere DataStage is the tool used to load data into the Netezza appliance. For reporting and querying purposes, OBIEE and the API are used. Performance issues were observed both while loading data into the Netezza appliance and while running queries on Netezza.

Issues with ETL loading

All the customer machines at end-customer sites send data to the new system in the form of ADC packages. Each ADC package contains files relevant for performance, monitoring, and analysis of machine data. These are packed into a Unix tape archive (tar) and then compressed (gzip), yielding a file with the extension .tgz containing one day of machine data. The new system receives around 2,500 packages per day, together amounting to approximately 200 GB of data per day.

InfoSphere DataStage processes these packages and loads the data into five types of tables: events, parameters, constants, configuration, and test reports. Initially there were no issues with the loading of data, but after a year InfoSphere was no longer able to process 2,500 packages in a day. Whenever there were releases or bug fixes, the backlog of packages grew, and the target of processing a complete day's packages as they arrive was not being met.

Solutions for ETL loading

Each iteration of InfoSphere DataStage processed 200 packages and took 3 hours. At that rate, a full day's inflow of 2,500 packages needs about 13 iterations, or roughly 38 hours of processing, which is more than a day: the backlog could only grow.
Before inserting data into a table, InfoSphere does a lookup into the existing tables, checks whether the data already exists, and based on that either inserts or updates. The biggest fact tables in Netezza hold between 50 and 200 billion records, so a lookup into these big fact tables is expensive. In the new system, InfoSphere DataStage has 8 nodes and is designed for parallel processing: every job is split into 8 tasks, with one task running on each node in parallel, which normally speeds up the jobs. However, this backfired for lookups, because all eight nodes were scanning the same table at the same time for a limited amount of data. A single lookup on a big fact table is expensive by itself; instead of doing one lookup to check for existing records, InfoSphere was doing the lookup 8 times on the same table, which crippled Netezza performance.

After identifying the issue, we altered the InfoSphere job to do a single lookup on the fact table to check for existing records. This improved performance and brought the time to process 200 packages down to 1 hour. That was still far from the performance we expected, so we looked into the Netezza table structures for further optimization. We observed that the ETL jobs look up the tables based on machine and date, but the fact tables only had a column of timestamp datatype, so during the lookup the timestamp column had to be truncated to a date. We therefore proposed two changes to the table structure:
- Add a new column with the date datatype.
- Organize the table on machine and date. Organizing is a Netezza feature that sorts the complete table data on the selected columns, which greatly benefits lookups and filter conditions in queries.

We also observed that a few fact tables were skewed, with data unevenly distributed across Netezza, so we changed the distribution of those tables to eliminate the skew.

All the above changes improved ETL performance: loading 200 packages now completes in 12-15 minutes instead of 3 hours. We can therefore process a full day of 2,500 packages within about 3 hours, and we can now keep up with any volume of packages arriving at InfoSphere.

Issues with Reporting

In the old system, many tools used the Archive files for their analysis. The new system replaced this: most of the data is now in the database, and the tools are being converted to use Netezza instead of the inefficient file archive. Most of the tools that moved to the new system were successful and proved very efficient compared with the old model. A few tools, however, still performed badly.

Solutions for Reporting

By the time these tools went live we had 18 months of data, and the large fact tables holding 5-10 TB of data were causing the problems. We examined each of the individual queries, altered the data model, and made the changes below:

- Joins between big fact tables were expensive, so we kept such joins to a minimum by allowing redundant data in the fact tables.
- Repeatedly querying large volumes of data in big fact tables is expensive. We avoided this by building aggregate tables on top of the base tables, so that end users query the aggregate tables, which are small and efficient.
- Many queries still use the big fact tables, so we saw performance issues whenever many concurrent queries scanned large volumes of data in those tables.
Looking at the Netezza statistics, we observed that disk utilization was at 100% while CPU and RAM utilization were below 10%. Keeping most of the data in a few tables was causing this: splitting them into smaller fact tables reduced disk utilization and increased CPU and RAM utilization, which in turn improved Netezza performance.

- We kept multiple priority buckets so that smaller queries are never starved. Big queries can otherwise take all the resources while smaller queries wait their turn; with separate priority buckets for different types of queries, long queries take their time while short queries complete quickly.
- We revisited how data is organized in the tables and changed the organizing columns. Selecting good organizing keys improved performance because it avoids scanning unnecessary data, making queries quicker.
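The priority buckets described above map onto Netezza's resource-sharing groups in workload management. The sketch below is illustrative only: group names, user names, and percentages are assumptions, not the customer's actual settings, and exact syntax may vary by NPS release.

```sql
-- ETL gets a guaranteed minimum share so the daily load always makes progress.
CREATE GROUP etl_load WITH RESOURCE MINIMUM 40 RESOURCE MAXIMUM 100;

-- Short interactive queries get their own bucket so they are never starved.
CREATE GROUP short_queries WITH RESOURCE MINIMUM 30 RESOURCE MAXIMUM 100;

-- Long-running reports are capped and limited to a few concurrent jobs.
CREATE GROUP long_reports WITH RESOURCE MINIMUM 10 RESOURCE MAXIMUM 50 JOB MAXIMUM 4;

-- Assign users to the appropriate resource group (report_user is hypothetical).
ALTER USER report_user WITH IN RESOURCEGROUP long_reports;
```

With such a scheme, a long report can no longer monopolize the appliance: short queries always have a guaranteed minimum share of resources, which is the behavior we wanted from the priority buckets.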
Recommendation for Netezza Optimization

- Good distribution of tables helps Netezza performance. Large fact tables should always use hash distribution, and the distribution columns should have high cardinality and be frequently used in query joins.
- Large fact tables should either be organized or be covered by materialized views. Organized data avoids scanning large volumes of data in the tables, and queries with filter conditions run quicker.
- Keeping table statistics up to date helps query performance. Inserts usually update table statistics, but deletes and updates leave the statistics outdated.
- Workload management plays a key role in Netezza performance. Make sure groups are assigned resources appropriately, and review resource allocation frequently.
- Monitor Netezza utilization using the nz_sysutil_stats command, and monitor disk, CPU, and RAM utilization on a daily basis. Identify the times when resource utilization is high, then identify faulty queries and fix them.
- Avoid joins between large fact tables; instead, split a query joining two fact tables into multiple queries against fact and dimension tables. This reduces the impact on other queries, and the queries themselves run faster.
- Avoid tables with very large data sets; split them into multiple tables. This increases maintenance effort but improves Netezza efficiency.
- Monitor the catalogue size of the appliance and perform a manual vacuum whenever the catalogue size is greater than 10 GB.

Benefits Derived from Performance Tuning

ETL loads used to take 3 hours to complete a single iteration of 200 packages; they now complete in less than 15 minutes, a performance improvement of over 90%. A few tools that were running queries on the Netezza appliance for more than 20 minutes now complete in less than 5 seconds. For many tools, we improved performance by more than 50%.
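To make the distribution, organization, grooming, and statistics recommendations above concrete, here is a hedged Netezza SQL sketch. The table and column names (machine_events, machine_id, event_ts) are hypothetical, and exact options may vary by NPS release.

```sql
-- Rebuild a skewed fact table with hash distribution on a high-cardinality
-- join column, adding a date column derived from the timestamp for lookups.
CREATE TABLE machine_events_new AS
SELECT e.*, DATE(e.event_ts) AS event_date
FROM machine_events e
DISTRIBUTE ON (machine_id);

-- Organize on the columns used in ETL lookups and report filters so that
-- zone maps can skip extents that cannot match.
ALTER TABLE machine_events_new ORGANIZE ON (machine_id, event_date);

-- GROOM rewrites the table so data is physically ordered on the organizing
-- keys and reclaims the space left behind by deletes and updates.
GROOM TABLE machine_events_new RECORDS ALL;

-- Refresh statistics after heavy delete/update activity so the optimizer
-- has accurate row counts and value distributions.
GENERATE STATISTICS ON machine_events_new;
```

A query such as `SELECT ... WHERE machine_id = 42 AND event_date = '2014-06-01'` can then be restricted using the organizing keys directly, instead of truncating the timestamp column for every row.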
References https://www.ibm.com/developerworks/community/groups/service/html/communityview