Solutions for Netezza Performance Issues


Vamsi Krishna Parvathaneni
Netezza Architect, Tata Consultancy Services, Netherlands
vamsi.parvathaneni@tcs.com

Lata Walekar
IBM SW ATU - Information Server and Netezza Lead, Tata Consultancy Services, Pune
Lata.walekar@tcs.com

Table of Contents

About the Domain
Introduction
Recommendation for Netezza Optimization
Benefits Derived from Performance Tuning
References

Abstract

Netezza is an appliance from IBM: an expert integrated system with built-in expertise, integration by design, and a simplified user experience. As part of the PureData family, the Netezza appliance is now known as the PureData System for Analytics. It retains the key design tenets of simplicity, speed, scalability, and analytic power that were fundamental to Netezza appliances. With simple deployment, out-of-the-box optimization, no tuning, and minimal ongoing maintenance, the IBM PureData System for Analytics offers the industry's fastest time-to-value and lowest total cost of ownership. This white paper explains how we overcame performance issues in Netezza for one of our customers.

About the Domain

The customer is a world leader in the manufacture of advanced technology systems for the semiconductor industry. The company offers an integrated portfolio for manufacturing complex integrated circuits (also called ICs or chips). The customer organization designs, develops, integrates, markets, and services advanced systems used by its end customers, the major global semiconductor manufacturers, to create chips that power a wide array of electronic, communications, and information technology products. With every generation, the complexity of producing integrated circuits with more functionality increases, and semiconductor manufacturers need partner organizations that provide technology and complete process solutions.

Introduction

With the objective of laying the foundation for centralized machine data with increased efficiency, the current file-repository-based Archive system is to be replaced with a data warehouse appliance (Netezza) to enable fast and controlled access to machine data. The main deliverables of this project to create the New System are:

- A Netezza data warehouse appliance filled with machine data as received from the machines located at end-customer sites, including the loaders to feed the daily inflow of machine data into the appliance.
- An Application Programming Interface (API) giving diagnostic applications efficient access to the stored machine data. Two things are important to note about this API:
  - First: the paradigm shift from the current approach (large amounts of original machine data transferred to the client PC and turned into information on the client) to the new approach (keep the original machine data in the appliance and transfer only information to the client).
  - Second: most of the data (in volume and number of files) in the current Archive system will be stored in Netezza. Certain information for which there is no value or benefit in storing it in Netezza will be kept as files on a micro archive, which will be accessible as a file system.

Business drivers for the technology shift towards Netezza are:

- Efficiency of diagnostics, reporting, and analysis on machine data
- Preparation for future growth in the volume of machine data
- A single central repository of machine data with proper authorization and authentication, with diagnostic applications delivering a good user experience and eliminating the need for local copies of machine data
- A foundation for future analytic applications

Performance Issues After Implementing the New System with Netezza and InfoSphere

IBM InfoSphere DataStage is the tool used to load data into the Netezza appliance. For reporting and querying, OBIEE and the API are used. Performance issues were observed both while loading data into the Netezza appliance and while running queries on Netezza.

Issues with ETL Loading

All the customer machines at end-customer sites send data to the new system in the form of ADC packages. Each ADC package contains files relevant for performance, monitoring, and analysis of machine data. These are packed into a Unix tape archive (tar) and then compressed (gzip), yielding a file with the extension .tgz containing one day of machine data. The new system receives around 2,500 packages per day, together constituting approximately 200 GB of data per day.

InfoSphere DataStage processes these packages and loads the data into five types of tables: events, parameters, constants, configuration, and test reports. Initially there were no issues with loading, but after a year InfoSphere was no longer able to process 2,500 packages in a day. Whenever there were releases or bug fixes, the backlog of packages grew, and the target of processing each day's packages as they arrived was not being met.

Solutions for ETL Loading

Each iteration of InfoSphere DataStage would process 200 packages and took 3 hours. Before inserting data into a table, InfoSphere does a lookup against the existing tables to check whether the data already exists, and based on that it either inserts or updates. The biggest fact tables in Netezza contain approximately 50 to 200 billion records, so a lookup into these big fact tables is expensive. In the new system, InfoSphere DataStage has 8 nodes and is designed for parallel processing: any job is split into 8 tasks, each run on a separate node in parallel, which normally speeds up jobs. However, this backfired for lookups, because all eight nodes were scanning the same table at the same time for a limited amount of data. A single lookup on a big fact table is already expensive; instead of one lookup to check for existing records, InfoSphere was performing the lookup 8 times on the same table, which crippled Netezza performance.

After we identified the issue, we altered the InfoSphere job to do a single lookup on the fact table to check for existing records. This improved performance and brought the time to process 200 packages down to 1 hour. This was still far from the performance we expected, so we next looked into the Netezza table structures for further optimization. We observed that the ETL jobs look up the tables by machine and date, while the fact tables had a column of timestamp datatype, so during each lookup the timestamp column was being truncated to the date datatype. We therefore proposed two changes to the table structure:

- Add a new column with the date datatype.
- Organize the table on machine and date. Organizing is a Netezza feature that sorts the complete table data on the selected columns; this is extremely beneficial for lookups and for query filter conditions.
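As an illustration, the two changes might look like the following in Netezza SQL. This is a minimal sketch: the table and column names (fact_events, event_ts, event_date, machine_id) are hypothetical, since the actual data model is not shown in this paper.

    -- Add a DATE column so lookups no longer truncate the TIMESTAMP on the fly
    ALTER TABLE fact_events ADD COLUMN event_date DATE;
    -- ALTER ... ADD COLUMN leaves the table versioned; GROOM merges the versions
    GROOM TABLE fact_events VERSIONS;
    -- Backfill the new column from the existing timestamp
    UPDATE fact_events SET event_date = CAST(event_ts AS DATE);

    -- Organize the table on the lookup keys (machine and date)
    ALTER TABLE fact_events ORGANIZE ON (machine_id, event_date);
    -- GROOM rewrites extents so the new organization takes effect; it also
    -- reclaims the rows logically deleted by the UPDATE above
    GROOM TABLE fact_events RECORDS ALL;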

We also observed that a few fact tables were skewed, with data distributed unevenly across Netezza data slices, so we changed the distribution of those tables to remove the skew.

All of the above changes improved ETL performance: loading 200 packages now completes in 12-15 minutes instead of 3 hours. We can now process a full day of 2,500 packages within 3 hours, and we can keep up with any volume of packages that arrives at InfoSphere.

Issues with Reporting

In the old system, many tools used the Archive files for their analysis. The new system replaced this: most of the data now resides in the database, and the tools were converted to use Netezza instead of the inefficient file archive. Most of the tools that moved to the new system were successful and proved very efficient compared with the old model. However, a few tools still performed badly.

Solutions for Reporting

By the time these tools went live, we had 18 months of data, and the large fact tables of 5-10 TB were causing the problems. We examined each of the individual queries and made the following changes to the data model:

- Joins between big fact tables were expensive, so we kept fact-to-fact joins to a minimum by holding redundant data in the fact tables.
- Repeatedly querying large volume sets of big fact tables is expensive. We avoided this by building aggregate tables on top of the base tables, so end users query the aggregate tables, which are small and efficient.
- Many queries still use the big fact tables, so performance issues appear when many concurrent queries scan large volumes of fact-table data. The Netezza statistics showed disk utilization at 100% with CPU and RAM below 10%; keeping most of the data in a few tables was the cause. Splitting those tables into smaller fact tables reduced disk utilization, increased CPU and RAM utilization, and in turn improved Netezza performance.
- We kept multiple priority buckets so that smaller queries are not impacted under any scenario. At times big queries take all the resources and smaller queries must wait their turn; with multiple priority buckets for different types of queries, longer queries take their time while smaller queries complete quickly.
- We revisited how data is organized in the tables and changed the organizing columns. Selecting good organizing keys improved performance because queries avoided scanning unnecessary data and ran quicker.
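To make the skew and aggregation changes above concrete, the sketch below shows one way to express them in Netezza SQL. The table and column names (fact_parameters, machine_id, parameter_id, parameter_value, event_date) are hypothetical placeholders for our actual model.

    -- Check distribution skew using the implicit DATASLICEID column;
    -- a well-distributed table shows roughly equal counts per data slice
    SELECT datasliceid, COUNT(*) AS rows_per_slice
    FROM fact_parameters
    GROUP BY datasliceid
    ORDER BY rows_per_slice DESC;

    -- Redistribute a skewed table on a higher-cardinality key via CTAS,
    -- then drop the old table and rename the new one into place
    CREATE TABLE fact_parameters_new AS
    SELECT * FROM fact_parameters
    DISTRIBUTE ON (machine_id, parameter_id);

    -- Pre-aggregate a hot query pattern so reports read a small table
    -- instead of rescanning the multi-terabyte base table
    CREATE TABLE agg_parameters_daily AS
    SELECT machine_id, event_date, parameter_id,
           COUNT(*) AS sample_count,
           AVG(parameter_value) AS avg_value
    FROM fact_parameters
    GROUP BY machine_id, event_date, parameter_id
    DISTRIBUTE ON (machine_id);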

Recommendation for Netezza Optimization

- Good distribution of tables helps Netezza performance. Large fact tables should always use hash distribution, and the columns selected should have good cardinality and be frequently used in query joins.
- Large fact tables should either be organized or use materialized views. Organized data avoids scanning large volumes of table data, and queries with filter conditions run quicker.
- Keeping table statistics updated helps query performance. Inserts usually update a table's statistics, but deletes and updates leave the statistics outdated.
- Workload management plays a key role in Netezza performance. Make sure groups are assigned resources appropriately, and review resource allocation frequently.
- Monitor Netezza utilization using the nz_sysutil_stats command, and monitor disk, CPU, and RAM utilization daily. Identify the times when resource utilization is high, then identify and fix faulty queries.
- Avoid joins between large fact tables; instead, split a query that joins two fact tables into multiple queries of fact and dimension tables. This reduces the impact on other queries, and the queries run faster.
- Avoid tables with very large data sets; split them into multiple tables. This increases maintenance but improves Netezza efficiency.
- Monitor the catalogue size of the appliance and perform a manual vacuum whenever the catalogue size exceeds 10 GB.

Benefits Derived from Performance Tuning

- ETL loads used to take 3 hours to complete a single iteration of 200 packages; they now complete in less than 15 minutes, a performance improvement of about 95%.
- A few tools running queries on the Netezza appliance that previously took more than 20 minutes now complete in less than 5 seconds, and for many tools we improved performance by more than 50%.

References

https://www.ibm.com/developerworks/community/groups/service/html/communityview
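As a practical companion to the recommendations above, the following sketch shows how the statistics and workload-management advice can be expressed in Netezza SQL. The table, group, and user names are hypothetical, and the resource percentages are placeholders that must be tuned per environment.

    -- Regenerate optimizer statistics after heavy UPDATE/DELETE activity
    GENERATE STATISTICS ON fact_events;

    -- Reclaim logically deleted rows and maintain table organization
    GROOM TABLE fact_events RECORDS ALL;

    -- Guaranteed resource allocation: separate groups ("priority buckets")
    -- so short queries are never starved by long-running ones
    CREATE GROUP short_queries WITH RESOURCE MINIMUM 30 RESOURCE MAXIMUM 100;
    CREATE GROUP long_queries WITH RESOURCE MINIMUM 10 RESOURCE MAXIMUM 60;
    ALTER GROUP short_queries ADD USER report_user;
    ALTER GROUP long_queries ADD USER batch_user;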