White Paper

EMC GREENPLUM MANAGEMENT ENABLED BY AGINITY WORKBENCH
A Detailed Review

EMC SOLUTIONS GROUP

Abstract

This white paper discusses the features, benefits, and use of Aginity Workbench for EMC Greenplum, a comprehensive management and development tool tailored to the features and architecture of the EMC Greenplum Database.

August 2011
Copyright 2011 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided as is. EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. All trademarks used herein are the property of their respective owners. Part Number: H8762 EMC Greenplum Management Enabled by Aginity Workbench A Detailed Review 2
Table of contents

Executive summary
  Business case
  Solution overview
  Key benefits
Introduction
  Purpose
  Scope
  Audience
  Terminology
Technology overview
  Overview
  Aginity Workbench
  EMC Greenplum Database
Configuration
  Overview
  Environment diagram
  Greenplum environment description
  EMC Greenplum Master Server
  EMC Greenplum Segment Servers
Operational scenarios
  Overview
  List of scenarios
  Scenario 1: Browse objects in the Greenplum Database
  Scenario 2: Examine data distribution in the Greenplum Database
  Scenario 3: Identify poorly performing queries and optimize performance
  Scenario 4: Examine the status of Greenplum segments
  Scenario 5: Optimize space usage in a Greenplum Database
  Scenario 6: Examine roles and resource queues
  Scenario 7: Import or export data into or out of a database
Conclusion
  Summary
  Findings
References
  White papers
  Product documentation
  Other information
Executive summary

Business case

The EMC Greenplum Database is a high-performance data warehouse system that employs a massively parallel processing (MPP) architecture, in which many servers work in parallel on database tasks. While the details of the architecture and operation are largely hidden from database users, database administrators (DBAs) and developers often need access to these details to check system health, ensure optimal performance, and develop business analytics quickly and easily to derive value from the data in the warehouse. Standard query and DBA tools provide little visibility into the features of parallel-processing architectures in general, and into the unique features of the Greenplum Database in particular.

Solution overview

Aginity Workbench for EMC Greenplum (Aginity Workbench) offers a simple and efficient method of managing a Greenplum Database. Aginity Workbench gives you a single point of access to manage, monitor, and develop a Greenplum Database by offering a range of tools and functions that look deep into the Greenplum architecture. With Aginity Workbench, you can:

- Examine the operational status of all segments
- Browse all objects in the Greenplum Database and make modifications
- Run multiple queries and export results to common file formats, including Microsoft Excel
- Generate SQL and DDL with drag-and-drop ease
- Analyze query plans
- Quickly find tables that should be vacuumed to free up database resources
- See how primary and mirror Segment Instances are distributed across the Segment Servers
- Graphically view table distribution and easily spot distribution skew
- Easily redistribute data

Key benefits

Aginity Workbench brings a new level of insight into the Greenplum Database that no other graphical user interface (GUI) tool can provide. Benefits of using Aginity Workbench include:

- Ease of use - With a single access point from a user-friendly GUI, you require less time and effort to accomplish daily tasks with the Greenplum Database.
- Access to individual components allows for detailed diagnostics - You can analyze, test, and reset the database servers more quickly, which reduces downtime.
- Optimization of database performance - You can adjust the database settings to maximize its performance.
- Reduction of user errors - Developers can use the built-in functions instead of user-written scripts, which reduces errors and time spent on scripting.
Introduction

Purpose

The purpose of this white paper is to examine the functionality of Aginity Workbench and demonstrate the benefits of using it to access, manipulate, and monitor a Greenplum Database.

Scope

This white paper describes the features and benefits of using Aginity Workbench in a Greenplum Database environment and describes the functionality of the main features of the product. This white paper does not provide configuration information for installing Aginity Workbench into a Greenplum environment.

Audience

This white paper is intended for EMC employees, partners, customers, and anyone interested in using Aginity Workbench to manage a Greenplum Database.

Terminology

This white paper includes the following terminology.

Table 1. Terminology

Term: Analytics
Definition: Analytics is the study of operational data using statistical analysis with a goal of identifying and using patterns to optimize business performance.

Term: Business intelligence
Definition: Business intelligence is the effective use of information assets to improve the profitability, productivity, or efficiency of a business. Frequently, IT professionals use this term to refer to the business applications and tools that enable such information usage.

Term: DDL
Definition: Data Definition Language is the syntax that is used to define and create objects in a relational database.

Term: Master Server
Definition: In an EMC Greenplum Database, the Master Server or Host controls the operation of the entire system and is the main connection point for external clients accessing the database. The Master Server distributes incoming queries to the Segment Servers, gathers the results, and returns them to the client.

Term: Massively parallel processing (MPP)
Definition: MPP is the coordinated processing of data by multiple machines that work together on a task. In a shared-nothing MPP architecture, such as EMC Greenplum, each machine has its own memory and storage and is not choked by negotiation of shared resources.

Term: Segment Server
Definition: In an EMC Greenplum Database, a Segment Server is one of the worker nodes (servers) that does the work in the MPP deployment.

Term: Shared-nothing architecture
Definition: Shared-nothing is a distributed computing architecture made up of a collection of independent, self-sufficient servers. This is in contrast to a traditional central computer that hosts all information and processing in a single location.

Term: SQL
Definition: Structured Query Language is the syntax that is used to access data from a relational database.
Technology overview

Overview

The primary components used in this environment are:

- Aginity Workbench
- EMC Greenplum Database

Aginity Workbench

Aginity Workbench makes developers and DBAs more productive by providing tools that give new access and insight into the Greenplum Database and the Greenplum Data Computing Appliance. Created by and for Aginity's own developers, Aginity Workbench is a client-based application that communicates with the Greenplum Database and has a deep understanding of the Greenplum internal architecture.

For developers, Aginity Workbench has an intuitive interface for creating, managing, and tracking both individual SQL queries and entire databases. Sophisticated tools help developers analyze and tune queries for maximum performance. Results can be easily viewed or exported to other formats, such as Microsoft Excel, for further use.

For DBAs, Aginity Workbench provides graphical information on important properties such as node status, database size and bloat, and table distribution and skew. Built-in functions assist with generating the commands used to maintain and optimize database operation and health.

EMC Greenplum Database

EMC Greenplum Database is a shared-nothing MPP database that has been designed for business intelligence and analytical processing. In this architecture, each server node acts as a self-contained database management system that owns and manages a distinct portion of the overall data. The system automatically distributes data and parallelizes query workloads across all available hardware.

The core shared-nothing MPP architecture enables massive data storage, loading, and processing with linear scalability. Adaptive services provide worldwide businesses with high availability, workload management, and online expansion of capacity.
Key product features enable petabyte-scale loading, hybrid storage (row or column) to best fit the unique needs of each analytical use case, and embedded support for SQL, MapReduce, and programmable analytics. In addition, all major third-party analytic and administration tools are supported through standard client interfaces.

The core principle of the EMC Greenplum Database is to move the processing dramatically closer to the data and its users. This enables the computational resources to process every query in a fully parallel manner, use all storage connections simultaneously, and flow data efficiently between resources as the query plan dictates. As a result, complex processing can be pushed down close to the data for maximum efficiency and performance.
Configuration

Overview

Aginity Workbench is a Microsoft Windows-based tool that can attach to any Greenplum Database. Aginity Workbench uses a native EMC Greenplum connection from the Microsoft Windows client to the Greenplum Database.

Aginity Workbench is a .NET application and is currently supported on the following platforms:

- Windows XP (32-bit)
- Windows 7 (32-bit and 64-bit)
- Windows Server 2003 (32-bit and 64-bit)
- Windows Server 2008 (32-bit and 64-bit)

Environment diagram

This white paper describes several operational scenarios to show how Aginity Workbench integrates with the Greenplum Database and makes it easier for you to manage the system. Figure 1 shows a generic Greenplum environment being managed by Aginity Workbench.

Figure 1. Aginity Workbench in a generic Greenplum environment
Greenplum environment description

Aginity Workbench runs on a Windows client that has a connection to the Greenplum Master Server through the data center network. You can use Aginity Workbench to develop and analyze queries, as well as maintain and optimize the database.

EMC Greenplum Master Server

The Greenplum Master Server is the access point for all user requests to the Greenplum Database. It also handles all coordination of the Segment Servers.

EMC Greenplum Segment Servers

The Greenplum Segment Servers are the workers of the Greenplum Database and perform all MPP tasks.
Operational scenarios

Overview

This section details some common operational scenarios in which you can use Aginity Workbench to manage the Greenplum Database.

List of scenarios

Aginity Workbench was exercised in the following scenarios:

- Scenario 1: Browse objects in the Greenplum Database
- Scenario 2: Examine data distribution in the Greenplum Database
- Scenario 3: Identify poorly performing queries and optimize performance
- Scenario 4: Examine the status of Greenplum segments
- Scenario 5: Optimize space usage in a Greenplum Database
- Scenario 6: Examine roles and resource queues
- Scenario 7: Import or export data into or out of a database

Scenario 1: Browse objects in the Greenplum Database

The purpose of this scenario is to expand schemas to view tables, columns, views, stored procedures, and other database objects.

A key function of any database tool is to allow browsing and examination of database objects. Aginity Workbench uses a familiar tree structure to walk the hierarchy of the database. Figure 2 shows the top-level view of a Greenplum Database, with the databases in the system and their sizes.

Figure 2. Aginity Workbench tree structure
Figure 3 shows a database expanded to display database objects. The view displays Greenplum-specific objects and information such as Partitions and the Distributed By clause in a table definition. This information is typically missed by tools that do not understand the Greenplum architecture.

Figure 3. Expanded database showing database objects
Each of the objects has a robust context menu that provides many useful functions that DBAs and developers can use to work more efficiently. Figure 4 shows the ability to quickly construct a Select statement for a particular table.

Figure 4. Select statement script

The resulting Select statement can be edited as desired and then executed. Additional menu selections build Insert, Update, and Delete statements as well as the DDL commands to create the table. These commands can be sent to the Workbench query window or to the clipboard for pasting into other programs. These shortcut functions are handy both for initial design and for reverse engineering of existing designs.

Note: Commands are shown in the menu only if they are relevant to the object.

Scenario 2: Examine data distribution in the Greenplum Database

The purpose of this scenario is to:

- Check the data distribution of tables to determine how well the data is balanced across all the Segment Servers
- Identify a poorly distributed table and redistribute the data for better query performance
Figure 5 shows a poor table distribution.

Figure 5. Query results showing poor table distribution
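The check behind a distribution view like Figure 5 can be approximated by counting rows per segment. The sketch below is an illustration only and is not the query that Aginity Workbench runs. It assumes the per-segment counts come from a query such as the one in `SEGMENT_COUNT_SQL`, which groups on the Greenplum `gp_segment_id` system column. The table name `sales`, the helper `skew_ratio`, and the sample counts are hypothetical.

```python
# Illustrative only: a hypothetical helper, not Aginity Workbench's implementation.

# Query that returns one row count per segment for a table. "sales" is a
# made-up table name; gp_segment_id is a Greenplum system column.
SEGMENT_COUNT_SQL = (
    "SELECT gp_segment_id, count(*) FROM sales GROUP BY gp_segment_id"
)


def skew_ratio(rows_per_segment):
    """Return the largest segment row count divided by the mean row count.

    A value of 1.0 means a perfectly even distribution; larger values mean
    the busiest segment holds proportionally more rows than the average.
    """
    counts = list(rows_per_segment.values())
    if not counts or sum(counts) == 0:
        return 1.0  # an empty table has no skew to report
    mean = sum(counts) / len(counts)
    return max(counts) / mean


# Hypothetical row counts keyed by segment id.
even = {0: 1000, 1: 1000, 2: 1000, 3: 1000}
skewed = {0: 3700, 1: 100, 2: 100, 3: 100}
```

With these sample counts, `skew_ratio(even)` is 1.0, while `skew_ratio(skewed)` is well above 1, which is the kind of imbalance that makes one segment the bottleneck for a parallel query.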
To change the table distribution, choose the Change distribution option under Advanced, as shown in Figure 6.

Figure 6. Select Change distribution menu option

As shown in Figure 7, you can choose one or more of the Available Columns by which to redistribute the table. In this example, proc_id was selected. While Aginity Workbench makes it easy to change the distribution key, it is up to you to choose the column (or columns) that will actually result in a better distribution of the data. Selecting multiple columns for a distribution key makes a composite key from those columns.

Figure 7. Select redistribution criteria and execute command
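The paper does not reproduce the command that the Change distribution dialog produces. As a rough sketch, Greenplum changes a distribution key with `ALTER TABLE ... SET DISTRIBUTED BY`, and a statement of that form could be assembled as follows. The function `redistribute_sql`, the table name `claims`, and the second column `member_id` are hypothetical; only `proc_id` comes from the example above.

```python
import re

# Accept only simple, unquoted identifiers in this sketch. Schema-qualified
# or quoted names would need extra handling.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")


def redistribute_sql(table, columns):
    """Build an ALTER TABLE ... SET DISTRIBUTED BY statement.

    Passing more than one column produces a composite distribution key.
    """
    columns = list(columns)
    if not columns:
        raise ValueError("at least one distribution column is required")
    for name in [table] + columns:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsupported identifier: {name!r}")
    return f"ALTER TABLE {table} SET DISTRIBUTED BY ({', '.join(columns)});"


# Hypothetical usage: "claims" and "member_id" are made-up names.
single_key = redistribute_sql("claims", ["proc_id"])
composite_key = redistribute_sql("claims", ["proc_id", "member_id"])
```

Building the statement as text and leaving execution to the user mirrors the workflow described in this scenario, where the generated command is reviewed before it is run.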
After you click OK, Aginity Workbench provides you with the commands that perform the redistribution. Because redistribution is a significant activity that touches all the data in a table, you must manually verify the command and start its execution. Choosing Show Distribution again shows the results of the redistribution. Figure 8 shows the successful completion of the table redistribution.

Figure 8. Successful completion of redistribution showing good table distribution

Scenario 3: Identify poorly performing queries and optimize performance

The purpose of this scenario is to:

- Identify poorly performing queries
- Examine the Explain Plan for the query and determine the reason for the poor performance
- Optimize the query and verify that it performs better
To identify poorly performing queries, go to the Object menu and, under Database, choose Show Query History. Figure 9 shows the Query History window. It provides several filters to narrow down the list. The Duration column visualizes query duration for ease of interpretation.

Figure 9. Query History

After you select a query, the context menu enables you to choose Explain SQL Statement, which shows the full query and the query plan. It also provides the output of an Explain Analysis of the query. Figure 10 shows the Explain Plan for the selected query. For larger and more complex Explain Plans, however, it may be difficult to read through all the output.

Figure 10. Explain Plan for the selected query
As shown in Figure 11, Aginity Workbench helps by providing iterator output for the query. This option is available in the context menu of the query.

Figure 11. Explain Plan

The iterators give much more detailed information about the steps of the Explain Plan. Iterators are available for queries that have been executed and captured in the Greenplum Performance Monitor Database. Figure 12 shows the Query Plan window with the query plan as a navigation tree in the left pane, and summary and detail information in the right panes. Steps that are possible causes of slow performance are color-highlighted, so you can see them immediately.

Figure 12. Query Plan showing iterator details
Without such easy-to-navigate, interactive support, it would be much more difficult to narrow down the pain points in problematic queries as quickly and efficiently.

Scenario 4: Examine the status of Greenplum segments

The purpose of this scenario is to:

- Determine the operational status of Greenplum segments
- Determine the location of primary segments and their corresponding mirror segments
- Identify primary segments that have failed over to their mirror segments
- Observe the failback of mirror segments to the primary server when the Segment Server is restored to operation

Managing a Greenplum Database means managing multiple database instances on multiple servers. Aginity Workbench helps by providing Server Explorer, which gives a detailed view of the inner workings of the Greenplum architecture and allows DBAs to easily visualize the system status. Server Explorer can be accessed from the Server Node in the navigation tree, as shown in Figure 13.

Figure 13. Server explorer
Figure 14 shows the server in a healthy running state.

Figure 14. Server explorer showing a healthy status

The left pane shows the Segment Servers in the cluster. The right pane shows the configuration of each Segment Instance on each Segment Server. Columns can be sorted by clicking the title of a column. Color-highlighting is used to visualize the placement of the primary-mirror pairs. For each primary-mirror pair, there is one row that shows all the configuration details, for example, role, mode, status, host, and so on. The colors show how the primary Segment Instances of a server are spread over different Segment Servers. This overview immediately informs you that there are no failed segments and that each Segment Server has six primary and six mirror Segment Instances.
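The same kind of health check can be sketched in code. The example below assumes segment rows shaped like the Greenplum `gp_segment_configuration` catalog table, with `role` and `preferred_role` values of `p` (primary) or `m` (mirror), `mode` `s` for synchronized, and `status` `u` for up. It is not how Server Explorer is implemented; the `Segment` tuple, the function name, and the host names `sdw1` and `sdw2` are hypothetical.

```python
from collections import namedtuple

# A reduced, hypothetical view of one gp_segment_configuration row.
Segment = namedtuple(
    "Segment", "content role preferred_role mode status hostname"
)


def segments_needing_attention(segments):
    """Return segments that are down, not synchronized, or failed over.

    A segment whose current role differs from its preferred role is
    treated as failed over (for example, a mirror acting as primary).
    """
    return [
        seg
        for seg in segments
        if seg.status != "u"
        or seg.mode != "s"
        or seg.role != seg.preferred_role
    ]


# Hypothetical cluster state: the primary for content 0 on sdw1 is down and
# its mirror on sdw2 has taken over; content 1 is healthy.
cluster = [
    Segment(0, "m", "p", "s", "d", "sdw1"),
    Segment(0, "p", "m", "c", "u", "sdw2"),
    Segment(1, "p", "p", "s", "u", "sdw2"),
    Segment(1, "m", "m", "s", "u", "sdw1"),
]
flagged = segments_needing_attention(cluster)
```

Here `flagged` contains only the two rows for content 0, which correspond to the kind of highlighted entries that a failover view would draw attention to.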
If any Segment Instances are in a mode or status other than Synchronized or Up, this is highlighted as shown in Figure 15 and Figure 16.

Figure 15. Server Explorer showing a failover

Figure 16. Server Explorer showing resynchronization

When you want to focus on a particular Segment Server, clicking the node name in the left pane filters the segment list to that server.

Scenario 5: Optimize space usage in a Greenplum Database

The purpose of this scenario is to:

- Determine space utilization of tables in the database
- Find tables that have bloat caused by deletes that have not been vacuumed
- Reduce system resource usage by easily executing vacuum statements on the database

Periodic vacuuming of database tables helps ensure that the space occupied by deleted items is reclaimed and made available for new data in the database.
Aginity Workbench makes it easy to find the space used by the tables in the database. When you right-click the database, you can choose the Database Maintenance option, as shown in Figure 17.

Figure 17. Database Maintenance

This brings up a display of all the tables in that database, with columns that show the Expected Bytes used, Actual Bytes used, Expired Bytes, and the Percent Unused. As shown in Figure 18, the Diagnostics Message column gives an indication of the amount of bloat in the table. Tables with high bloat (deleted objects whose space can be reclaimed) can be vacuumed directly from the menu.

Figure 18. Diagnostics Message showing bloat
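The relationship between expected and actual size can be illustrated with a small calculation. The sketch below assumes that the unused percentage is the share of actual bytes in excess of the expected bytes. The paper does not state how Aginity Workbench computes its Percent Unused column, so this formula, the 50 percent threshold, and the table names are assumptions for illustration only.

```python
def percent_unused(expected_bytes, actual_bytes):
    """Return the share of actual_bytes beyond expected_bytes, as a percentage.

    This formula is an assumption for illustration; it is not taken from
    Aginity Workbench.
    """
    if actual_bytes <= 0:
        return 0.0
    unused = max(actual_bytes - expected_bytes, 0)
    return 100.0 * unused / actual_bytes


def vacuum_statements(tables, threshold_pct=50.0):
    """Build VACUUM statements for tables at or above the bloat threshold.

    tables is an iterable of (name, expected_bytes, actual_bytes) tuples.
    """
    return [
        f"VACUUM {name};"
        for name, expected, actual in tables
        if percent_unused(expected, actual) >= threshold_pct
    ]


# Hypothetical sizes: "orders" is heavily bloated, "customers" is not.
sizes = [("orders", 250, 1000), ("customers", 900, 1000)]
to_vacuum = vacuum_statements(sizes)
```

With these sample sizes, only `orders` crosses the threshold, so `to_vacuum` holds a single `VACUUM orders;` statement.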
Scenario 6: Examine roles and resource queues

The purpose of this scenario is to:

- Examine the properties of resource queues
- Identify the resource queues to which roles are assigned

An important aspect of Greenplum performance management is the notion of roles and resource queues. Roles roughly correspond to database users, and each user or role is assigned to a particular resource queue. Resource queues have associated properties that determine how much of the Greenplum system resources are applied to queries that run in those queues. Aginity Workbench can display the properties of resource queues as shown in Figure 19.

Figure 19. Resource queues and user roles
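A role-to-queue mapping of this kind can also be assembled by hand. The sketch below assumes the pairs come from a catalog query along the lines of `ROLE_QUEUE_SQL`, which joins `pg_roles` to `pg_resqueue`. It is not the query Aginity Workbench uses, and the helper `roles_by_queue` and the sample role and queue names are hypothetical.

```python
from collections import defaultdict

# Catalog query returning (role name, resource queue name) pairs.
ROLE_QUEUE_SQL = (
    "SELECT rolname, rsqname FROM pg_roles, pg_resqueue "
    "WHERE pg_roles.rolresqueue = pg_resqueue.oid"
)


def roles_by_queue(pairs):
    """Group (role, queue) pairs into a dict of queue -> sorted role names."""
    grouped = defaultdict(list)
    for role, queue in pairs:
        grouped[queue].append(role)
    return {queue: sorted(roles) for queue, roles in grouped.items()}


# Hypothetical query result.
assignments = roles_by_queue(
    [("etl_user", "batch"), ("analyst", "adhoc"), ("loader", "batch")]
)
```

Grouping by queue rather than by role makes it easy to see which users compete for the same limits.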
Aginity Workbench understands the difference between resource queues with active statement limits and resource queues that have maximum query cost limits. It also understands the different priorities that resource queues can have.

Aginity Workbench also displays properties of user roles, and can show the resource queue to which each role or user is assigned, as shown in Figure 19. This easy access to workload management information helps DBAs allocate system resources properly so that database jobs are executed efficiently.

Scenario 7: Import or export data into or out of a database

The purpose of this scenario is to:

- Import data from a disk file to the database
- Export data from the database to a disk file

Moving data into a database from a flat file (TXT or CSV), and exporting data from a table into a flat file, are common actions for developers as well as DBAs. Greenplum provides the SQL COPY command, which can load an entire file into the database. It is considerably more efficient than executing INSERT statements and much easier than writing a script to load data. Unfortunately, the syntax for the SQL COPY command is a little tricky and, unless you use it every day, easy to forget or enter incorrectly.

Aginity Workbench provides an easy way of importing data into the database from flat files and of exporting data from a table back to a disk file. To import data from a CSV file, right-click the table into which you want to load the data and choose Import Data.

In Import Data, as shown in Figure 20, you can specify the location of the file and the format. You can also specify the encoding, delimiters, escape characters, whether the input file has a header row, and the Segment reject limit. The reject limit sets the number of errors in the input file that you are willing to accept before the load is aborted.
Figure 20. Import Data

As shown in Figure 21, the SQL tab shows the corresponding SQL COPY command that is generated, which can be edited further.

Figure 21. SQL tab in Import Data window

Getting data out of the database and into flat files is just as easy: right-click the table and choose Export Data.
In Export Data, as shown in Figure 22, on the Parameters tab, you can specify many of the same kinds of properties as for importing data. The Selection tab allows you to specify the columns you want to export as well as an order-by clause for the desired sort order.

Figure 22. Export Data

The import and export functions do not use the Greenplum gpload/gpfdist programs for parallel bulk loading of extremely large amounts of data. They are nevertheless very handy for quickly getting smaller amounts of data into and out of the database.
Conclusion

Summary

Aginity Workbench integrates easily with EMC Greenplum Database and allows you to quickly and efficiently manage, monitor, and access large-scale enterprise data warehouses.

Findings

Aginity Workbench features and functionality provide many benefits, including:

- Ease of use, reduction of overhead, and improved return on investment
- Access to individual components in the database, which allows for detailed diagnostics and fine tuning
- Optimization of database performance
- Reduction of errors and downtime

Aginity Workbench is unmatched in its ability to expose the internals of the Greenplum Database and optimize the database with ease.
References

White papers

For additional information, see the white papers listed below.

- EMC Greenplum Data Computing Appliance: High Performance for Data Warehousing and Business Intelligence - An Architectural Overview
- EMC Greenplum Database 4.0 - Critical Mass Innovation

Product documentation

For additional information, see the product document listed below.

- Greenplum Database 4.1 Administrator Guide

Other information

For additional information and to download the software, see the websites listed below.

- Aginity.com
- Greenplum.com