Workload Experience Manager


Important Notice

Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder.

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required notices. A copy of the Apache License Version 2.0, including any notices, is included herein. A copy of the Apache License Version 2.0 can also be found here:

Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera.

Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
For information about patents covering Cloudera products, see

The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document.

Cloudera, Inc.
395 Page Mill Road
Palo Alto, CA
info@cloudera.com
US:
Intl:

Release Information
Version: Workload Experience Manager 1.0.x
Date: November 26, 2018

Table of Contents

Overview of Workload Experience Manager...5
Default Time Range...5
Workload XM Diagnostic Data Collection...5
Sources of Data Sent to Workload XM...5
Diagnostic Data Collection Details...6
Redaction Capabilities for Diagnostic Data...6
What's New from Workload XM...8
Download SQL Commands to Address "Corrupt Table Statistics" and "Missing Table Statistics" Query Health Checks...8
New Log and Query Redaction Configuration Properties for Telemetry Publisher...9
Proxy Server Support for Telemetry Publisher...9
Multiple Usability Improvements...9
Setting Up Workload Experience Manager with Telemetry Publisher...12
Pre-Requisites for Setting Up Workload XM...12
Configuring a Firewall for Workload XM...12
Redact Data Before Sending to Workload XM...13
Connecting Cloudera Manager to Workload XM...14
Step 1. Get Altus Credentials...14
Step 2. Add Altus Credentials to Cloudera Manager...15
Step 3. Add the Telemetry Publisher Service Role...15
Configuring Telemetry Publisher When Key Trustee Is Enabled...18
Logging In to Workload XM...19
Using Workload Experience Manager (XM)...20
Default Time Range...20
Common Use Cases...20
Troubleshooting Abnormal Job Durations...20
(Hadoop Administrators) Troubleshooting Failed Data Engineering Jobs...24
(Application Developers) Determining Cause for Slow and Failed Queries...25
Workload Experience Manager (XM) Reference...28

Data Warehouse (Apache Impala) Query Status...28
Data Warehouse (Apache Impala) Query Types...28
Data Warehouse (Apache Impala) Health Checks...31
Potential SQL Issues...35
Data Engineering (Apache Hive, Spark, MapReduce) Health Checks...37
Appendix: Apache License, Version

Overview of Workload Experience Manager

Workload Experience Manager (Workload XM) is a tool that provides insights to help you gain an in-depth understanding of the workloads you send to clusters managed by Cloudera Manager. In addition, it provides information that can be used for troubleshooting failed jobs and optimizing slow jobs that run on those clusters.

After a job ends, information about job execution is sent to Workload XM by the Telemetry Publisher, a role in the Cloudera Manager Management Service. Workload XM uses the information to display metrics about the performance of a job. Additionally, Workload XM compares the current run of a job to previous runs of the same job by creating baselines. You can use the knowledge gained from this information to identify and address abnormal or degraded performance or potential performance improvements.

Default Time Range

If you have not specified a time range, Workload Experience Manager (Workload XM) displays data for the last 24 hours by default. If no data is available for the last 24 hours, Workload XM displays the full range that is available.

Workload XM Diagnostic Data Collection

When you enable Workload XM, the Cloudera Management Service starts the Telemetry Publisher role. Telemetry Publisher collects and transmits metrics, as well as configuration and log files, from the Impala, Oozie, Hive, YARN, and Spark services for jobs running on CDH clusters to Workload XM. Telemetry Publisher collects metrics for all clusters that use the environments where Workload XM is enabled. This topic describes the sources of information sent to Workload XM and how that data is redacted.

Sources of Data Sent to Workload XM

The above diagram shows the sources from which you can configure Telemetry Publisher to collect diagnostic data. This data is collected in the following ways:

Pull
Telemetry Publisher pulls diagnostic data from these services periodically (once per minute, by default). These sources are indicated by the outbound arrows leading from Telemetry Publisher in the above diagram. They are Oozie, YARN, and Spark.

Push
A Cloudera Manager Agent pushes diagnostic data from these services to Telemetry Publisher within 5 seconds after a job finishes. These sources are indicated by the inbound arrows to Telemetry Publisher. They are Hive and Impala.

After the diagnostic data reaches Telemetry Publisher, it is stored temporarily in its data directory and periodically (once per minute) exported to Workload XM.

Diagnostic Data Collection Details

The diagnostic data collected by Telemetry Publisher and sent to Workload XM includes the following:

MapReduce Jobs
Telemetry Publisher polls the YARN Job History Server for recently completed MapReduce jobs. For each of these jobs, Telemetry Publisher collects the configuration and the jhist file, which is the job history file that contains job and task counters, from HDFS. Telemetry Publisher can be configured to collect MapReduce task logs from HDFS and send them to Workload XM. By default, this log collection is turned off.

Spark Applications
Telemetry Publisher polls the Spark History Server for recently completed Spark applications. For each of these applications, Telemetry Publisher collects their event log from HDFS. Telemetry Publisher only collects Spark application data from Spark version 2.2 and later. Telemetry Publisher can be configured to collect the executor logs of Spark applications from HDFS and send them to Workload XM, but this data collection is turned off by default.
Important: CDH version 5.x is packaged with Spark 1.6, so you cannot configure Telemetry Publisher data collection for CDH 5.x clusters unless you are using CDS 2.2 Powered by Apache Spark or later versions with those clusters.

Oozie Workflows
Telemetry Publisher polls Oozie servers for recently completed Oozie workflows and sends their details to Workload XM.

Hive Queries
Telemetry Publisher uses the same mechanism used by Cloudera Navigator (a Cloudera Manager Agent) to collect Hive query audits. The Cloudera Manager Agent periodically searches for query detail files that are generated by HiveServer2 after a query completes, and then sends the details from those files to Telemetry Publisher.

Important: Cloudera Navigator does not need to be enabled on the cluster, but Hive query audits must be enabled.

Impala Queries
A Cloudera Manager Agent periodically looks for query profiles of recently completed queries and sends them to Telemetry Publisher.

Redaction Capabilities for Diagnostic Data

The diagnostic data collected by Telemetry Publisher might contain sensitive data in the job configurations or the logs. There are several ways you can redact sensitive data before it is sent to Telemetry Publisher. Cloudera recommends enabling the following redaction features even if you are not sending diagnostic data to Telemetry Publisher:

Log and query redaction
Refer to the Workload XM documentation and the Cloudera Manager documentation for information about log and query redaction. This redaction feature enables you to redact information in logs and queries collected by Telemetry Publisher based on filters created with regular expressions.

MapReduce job properties redaction
You can redact job configuration properties before they are stored in HDFS. Since Telemetry Publisher reads job configuration files from HDFS, it only fetches redacted configuration information. See Redacting MapReduce Job Properties on page 14 for more information.
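The regular-expression filters mentioned above work by substituting matched content before it leaves the cluster. The sketch below illustrates the mechanism with a made-up rule (masking anything shaped like a credit card number); it is not one of Cloudera's shipped filters:

```python
import re

# Illustrative log/query redaction filter: content matching the pattern is
# replaced before the text is collected. The card-number rule here is an
# example, not a Cloudera default.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")

def redact_log_line(line: str) -> str:
    """Return the line with any card-number-shaped token masked."""
    return CARD_PATTERN.sub("XXXX-XXXX-XXXX-XXXX", line)

print(redact_log_line("charged card 4111-1111-1111-1111 for order 42"))
```

A real deployment would configure such patterns through Cloudera Manager's redaction settings rather than in application code; the point is only that each filter is a pattern plus a replacement string.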

Spark event and executor log redaction
The Spark2 on YARN service has the spark.redaction.regex configuration property that can be used to redact sensitive data from event and executor logs. When this configuration property is enabled, Telemetry Publisher sends only redacted data to Workload XM. This configuration property is enabled by default, but can be overridden by using safety valves in Cloudera Manager or in the Spark application itself.

See Redact Data Before Sending to Workload XM on page 13 for more information about data redaction in Workload XM.
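The behavior of a key-matching rule like spark.redaction.regex can be sketched in a few lines. The pattern below mirrors Spark's documented default, (?i)secret|password (an assumption worth verifying against your Spark version); any configuration entry whose key name matches has its value masked:

```python
import re

# Key-matching redaction in the style of spark.redaction.regex. The pattern
# shown mirrors Spark's documented default; verify it against your Spark
# version. Values of matching keys are masked before event logs or executor
# logs are written.
REDACTION_PATTERN = re.compile(r"(?i)secret|password")

def redact_properties(entries: dict) -> dict:
    """Mask the value of every entry whose key matches the pattern."""
    return {
        key: "*********" if REDACTION_PATTERN.search(key) else value
        for key, value in entries.items()
    }

print(redact_properties({"spark.ssl.keyPassword": "hunter2",
                         "spark.app.name": "nightly-etl"}))
```

Note the difference from log redaction: this rule matches property key names, not log content, which is why leaving the default in place keeps credentials out of event and executor logs.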

What's New from Workload XM

Download SQL Commands to Address "Corrupt Table Statistics" and "Missing Table Statistics" Query Health Checks

If your queries trigger the Corrupt Table Statistics or the Missing Table Statistics health checks, Workload XM generates the SQL code you can copy and run on your cluster to address these issues.

To download SQL code for creating or repairing table statistics:

1. Under Data Warehouse, select Queries.
2. On the Queries page, select the time period you want to investigate in the Range column.
3. In the Health Check column, select either Corrupt Table Statistics or Missing Table Statistics. This filters out queries that do not trigger these health checks.
4. Click the query to view its details.
5. In the Performance Issues region of the query details page, click the Health Check Violations tab. This lists the health checks that were triggered for this query. It is here that you see the SQL code that you can copy and run to repair the table statistics issues.
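In Impala, missing statistics are created with COMPUTE STATS, and corrupt statistics are typically discarded with DROP STATS before being recomputed. The sketch below shows how such repair SQL can be assembled; the exact statements Workload XM generates for a given query may differ:

```python
# Hedged sketch of the kind of repair SQL produced for these health checks.
# COMPUTE STATS and DROP STATS are standard Impala statements; the table
# name is supplied by the caller.
def stats_repair_sql(table: str, corrupt: bool = False) -> list:
    """Return the statements that rebuild statistics for the given table."""
    statements = []
    if corrupt:
        # Discard the bad statistics before recomputing them.
        statements.append(f"DROP STATS {table};")
    statements.append(f"COMPUTE STATS {table};")
    return statements

print(stats_repair_sql("sales.orders", corrupt=True))
```

Running the generated statements in impala-shell (or any Impala client) refreshes the table and column statistics that the planner relies on.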

New Log and Query Redaction Configuration Properties for Telemetry Publisher

In Cloudera Manager 5.16, you can now configure log and query redaction for the Telemetry Publisher service in Cloudera Manager. By default, this configuration is enabled. For more information, see Log and Query Redaction for the Telemetry Publisher Service on page 16.

Proxy Server Support for Telemetry Publisher

In Cloudera Manager 5.16, you can now configure the Telemetry Publisher service to send metrics, as well as configuration and log files, to Workload XM by way of a proxy server for database and Altus metrics uploads. For more information, see Configuring Telemetry Publisher to Use a Proxy Server on page 17.

Multiple Usability Improvements

The Workload XM team is constantly improving usability. Here are some of our recent upgrades to the user experience:

Support for parsing Spark 2.3 application history logs.

Job history files and Spark event logs are now available to download from the Execution Detail tab on the Job detail page:

Figure 1: Download Job History Files

Figure 2: Download Spark Event Logs

Additions to the Query Detail page. Now you can download the query profile for Impala queries and view the total number of joins performed for a specific query.

New Concurrency chart added to the Data Warehouse Summary page. This chart shows query concurrency in the cluster during a selected time range. You can use this chart to gain insight, such as identifying potential resource contention or the busiest time of day on your cluster.


Setting Up Workload Experience Manager with Telemetry Publisher

Diagnostic information about job and query execution is sent to Workload Experience Manager (Workload XM) by Telemetry Publisher, a role in the Cloudera Manager Management Service. When new clusters are added with Cloudera Manager, Telemetry Publisher automatically sends the new cluster information to Workload XM. This section describes how to connect Cloudera Manager to Workload XM by configuring Telemetry Publisher:

Pre-Requisites for Setting Up Workload XM
Connecting Cloudera Manager to Workload XM
Configuring Telemetry Publisher When Key Trustee Is Enabled

Pre-Requisites for Setting Up Workload XM

Before you can set up Cloudera Manager's Telemetry Publisher service to send diagnostic data to Workload XM, you must make sure you have the correct versions of Cloudera Manager and CDH:

Supported Versions of Cloudera Manager and CDH

To use Workload XM with CDH clusters managed by Cloudera Manager, you must have the following versions for CDH 5.x clusters:

Cloudera Manager version and later
CDH version 5.8 and later

Important: Workload XM is not available on Cloudera Manager 6.0, whether you are managing CDH 5.x or CDH 6.x clusters.

After you have verified that you have the correct versions of Cloudera Manager and CDH, you must configure data redaction and your firewall. These topics are addressed in the following sections:

Configuring a Firewall for Workload XM

Workload XM is a cloud service which runs on Amazon Web Services (AWS). The Telemetry Publisher service, which was introduced in Cloudera Manager version , collects metrics from various components in a CDH cluster and securely sends these metrics by way of Transport Layer Security (HTTPS) over the internet to the Workload XM service, as shown in the following illustration.

To connect an on-premises CDH cluster to Workload XM, you must configure your firewall using the following information. The Cloudera Telemetry Publisher service makes outbound connections to two endpoints to communicate with Workload XM:

Endpoint #1: This endpoint maps to a dynamic IP address in AWS us-west-1. AWS us-west-1 IP address ranges are documented here.
Endpoint #2: This endpoint also maps to a dynamic IP address in AWS us-west-1. See the above link for the IP address ranges that are documented on the AWS web site.

Starting with Cloudera Manager version , you can also configure an HTTP proxy between Telemetry Publisher and Workload XM. In this configuration, the proxy acts as an HTTP tunnel for the encrypted TLS communication between Telemetry Publisher and Workload XM. See Configuring Telemetry Publisher to Use a Proxy Server on page 17 for details.

Redact Data Before Sending to Workload XM

Telemetry Publisher collects diagnostic data from logs, job configurations, and queries, and then sends this data to Workload XM. This diagnostic information might contain sensitive data, so it is desirable to redact the sensitive information before Telemetry Publisher sends it to Workload XM.

Redact Logs and Queries

To redact sensitive data in the CDH cluster, such as log files, use Cloudera Manager. See Log and Query Redaction in the Cloudera Manager documentation. However, note that this only redacts data, not metadata. Sensitive data in files is redacted, but the name, owner, and other metadata about the files is not redacted. The Cloudera documentation

referred to above explains what is redacted and what is not. Also see Log and Query Redaction for the Telemetry Publisher Service on page 16 for additional details about log and query redaction in Workload XM.

Redact Spark Data

The Spark on YARN service in CDH enables the spark.redaction.regex configuration property by default, which redacts sensitive data from event and executor logs. Do not override this setting, to ensure that Telemetry Publisher only sends redacted information to Workload XM.

Redacting MapReduce Job Properties

Set the mapreduce.job.redacted-properties configuration property for YARN to redact MapReduce job configuration properties before they are stored in HDFS. Telemetry Publisher reads the job configuration file from HDFS, so if you set this property for all the MapReduce jobs you use, only redacted job configuration information is fetched from HDFS.

To set this property in Cloudera Manager:

1. In the Cloudera Manager Admin Console, select the YARN service, and then click the Configuration tab.
2. Search for mapreduce.job.redacted-properties to locate this configuration property. By default, several MapReduce job properties are set. Leave these set as they are.
3. Click the plus sign after the last property listed and add any additional properties for your MapReduce jobs.
4. Click Save Changes and restart the YARN service.

Connecting Cloudera Manager to Workload XM

Diagnostic information about job and query execution is sent to Workload Experience Manager (Workload XM) by Telemetry Publisher, a role in the Cloudera Manager Management Service. When new clusters are added with Cloudera Manager, Telemetry Publisher automatically sends the new cluster information to Workload XM. This topic describes how to connect Cloudera Manager to Workload XM by connecting the Cloudera Manager Telemetry Publisher service role to a Cloudera Altus account.

Note: Cloudera recommends using Java 8.
If you are using Java 7, additional steps are required when you add the Telemetry Publisher service role.

Connecting Cloudera Manager Telemetry Publisher to Workload XM is a three-step process. After connecting Cloudera Manager to Workload XM, you must enable Log and Query Redaction for the Telemetry Publisher Service on page 16. If you must use a proxy server with Workload XM, see Configuring Telemetry Publisher to Use a Proxy Server on page 17.

Step 1. Get Altus Credentials

In order to use Workload XM, you need an Altus account. For more information about how to set up an Altus account, see the Cloudera Altus documentation.

1. Go to wxm.cloudera.com, and follow the prompts to set up your account.
2. On the Altus Home page, click your user name in the upper right corner of the page, and select My Account.
3. Click Generate Access Key. This creates an Altus Access Key ID and Altus Private Key. The Altus Access Key ID and Altus Private Key are needed to add an Altus account to Cloudera Manager.

Note: The Cloudera Altus console displays the API access key immediately after you create it. You must copy or download the access key information when it is displayed. Do not exit the console without copying the keys. After you exit the console, there is no other way to view or copy the access key.

Step 2. Add Altus Credentials to Cloudera Manager

1. Sign in to the Cloudera Manager Admin Console.
2. Navigate to Administration > External Accounts > Altus Credentials.
3. Select Add Access Key Authentication, provide the following information, and click Add:
Name
Altus Access Key ID
Altus Private Key
4. Navigate to Administration > Settings. Type Altus in the search box to find the Telemetry Altus Account configuration setting. Then select the Altus credentials you created and named in Step 3.
5. Click Save Changes.

Step 3. Add the Telemetry Publisher Service Role

After you add an Altus account, add the Telemetry Publisher service role to the Cloudera Manager Service.

Important: Before you specify a host cluster for the Telemetry Publisher service, make sure that you name the cluster with a human-readable name in Cloudera Manager. If you do not name the cluster in Cloudera Manager before you associate the cluster with the Telemetry Publisher service, Workload XM identifies the cluster with a random string of 32 characters, such as 44a6e75e ea e84c2, which is difficult to identify and work with in the Workload XM application.

To rename a cluster in Cloudera Manager:

1. On the Home page of the Admin Console, click the Clusters drop-down list and select the cluster you want to rename.
2. On the cluster page, click the Actions menu adjacent to the cluster name, and select Rename Cluster.
3. In the Rename Cluster dialog box, type the new cluster name, and then click Rename Cluster.

To add the Telemetry Publisher service role:

1. In the Cloudera Manager Admin Console, navigate to Clusters > Cloudera Management Service.
2. Select Actions > Add Role Instances. The Add Role Instances wizard opens. If a Telemetry Publisher role already exists, Cloudera Manager does not let you add another.
3. Select a host for the Telemetry Publisher and complete the wizard.
4. If you are using Java 8, skip this step.
If you are using Java 7, you must configure Telemetry Publisher as follows:

1. In the Cloudera Manager Admin Console, click Cloudera Management Service.
2. On the Cloudera Management Service page, select the Configuration tab, and then select the Telemetry Publisher filter under Scope.
3. Type java configuration in the search text box to locate the Java Configuration Options for Telemetry Publisher configuration property, and add the following to the text box:

-Dhttps.protocols=TLSv1.2
-Dhttps.cipherSuites=TLS_RSA_WITH_AES_256_CBC_SHA256

Figure 3: Java 7 Configuration for Telemetry Publisher

4. Click Save Changes.
5. Go to Clusters > Cloudera Management Service and select the Telemetry Publisher role.
6. Click Actions > Test Altus Connection. A successful test indicates that the Telemetry Publisher can connect to Altus.
7. Go to Clusters > Hive > Instances and restart the roles for Hive.

Log and Query Redaction for the Telemetry Publisher Service

Log and query redaction for the Telemetry Publisher service is controlled with the Log and Query Redaction configuration property. This property is enabled by default and works with the log and query redaction property for HDFS. If you want to disable log and query redaction for the Telemetry Publisher service, you must also disable log and query redaction for HDFS, or the Telemetry Publisher service will not start. The Log and Query Redaction configuration property is only available in Cloudera Manager version 5.16 and later. For more information about log and query redaction, see the Cloudera Manager documentation.

Important: Cloudera strongly recommends that you enable log and query redaction for both HDFS and the Telemetry Publisher service to protect sensitive data from being accessed by unauthorized users.

If you must disable Telemetry Publisher log and query redaction for testing purposes:

1. In the Cloudera Manager Admin Console, navigate to Clusters > HDFS > Configuration, and type redact into the Search box to locate the log and query redaction properties for HDFS.
2. Uncheck the Enable Log and Query Redaction property, and then click Save Changes.
3. Still in the Cloudera Manager Admin Console, click Clusters > Cloudera Management Service > Configuration > Telemetry Publisher, type redact in the Search box, and uncheck the Log and Query Redaction property for the Telemetry Publisher Default Group.

4. Click Save Changes.
5. Restart both the HDFS and the Telemetry Publisher services to disable log and query redaction.

Configuring Telemetry Publisher to Use a Proxy Server

You can configure the Telemetry Publisher service to send metrics, as well as configuration and log files, to Workload XM by way of a proxy server for database and Altus metrics uploads. You cannot upload information from Amazon Web Services (AWS) by way of a proxy server. By default, this configuration property is disabled.

Telemetry Publisher uses the TLS/HTTPS protocol to send telemetry information to Workload XM. This ensures that the data is encrypted in flight. The proxy you use must support the HTTP CONNECT method in order to be able to pass through the encrypted messages. For more information, see the associated RFC. Telemetry Publisher support for proxy servers is only available in Cloudera Manager version 5.16 and later.

To enable Telemetry Publisher to send information by way of a proxy server:

1. In the Cloudera Manager Admin Console, navigate to Clusters > Cloudera Management Service > Configuration > Telemetry Publisher, and type proxy into the Search box to locate the proxy configuration properties.

2. Select Telemetry Publisher Default Group and provide the proxy server name, port, username, and password.
3. Click Save Changes, and then restart the Telemetry Publisher service.

Configuring Telemetry Publisher When Key Trustee Is Enabled

When Key Trustee is enabled, the default HDFS user for Telemetry Publisher (hdfs) does not have permission to download the relevant files from HDFS. The Telemetry Publisher user must be in both of the user groups that contain the Job History Server and the Spark History Server. For example, if the Job History Server is in the hadoop user group and the Spark History Server is in the spark user group, the Telemetry Publisher user must be in both the hadoop group and the spark group to download files from HDFS when Key Trustee is enabled.
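The group requirement described above can be expressed as a simple check: the Telemetry Publisher user needs membership in the group of each history server before it can read their files from HDFS. The hadoop and spark group names below are the examples from the text; real deployments may use different groups:

```python
# Sketch of the Key Trustee group requirement: the Telemetry Publisher user
# must belong to the groups of both the Job History Server and the Spark
# History Server. Group names are the examples from the text, not fixed
# values.
def can_read_history_files(user_groups, jhs_group="hadoop", shs_group="spark"):
    """Return True if the user is in both history-server groups."""
    required = {jhs_group, shs_group}
    return required.issubset(set(user_groups))

print(can_read_history_files(["hadoop", "spark"]))  # membership in both
print(can_read_history_files(["hadoop"]))           # missing the spark group
```

On a live host, the same check amounts to verifying the output of `groups <telemetry-publisher-user>` against the two history-server groups.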

Logging In to Workload XM

To access Workload XM, perform the following steps:

1. Log in to the Workload XM console: wxm.cloudera.com/
2. In Search, type the name of the cluster you want to analyze.
3. Select either the Data Warehouse or Data Engineering summary pages or details (queries or jobs) in the left navigation menu. These links launch pages where you can drill down to view health checks, execution details, baselines, and trends.

There can be a delay from job completion to when the job is available in Workload XM. Large jobs can take up to 10 minutes to display in Workload XM. For information about how to use Workload XM and the information it can provide, see Using Workload Experience Manager (XM) on page 20.

Using Workload Experience Manager (XM)

Default Time Range

If you have not specified a time range, Workload Experience Manager (Workload XM) displays data for the last 24 hours by default. If no data is available for the last 24 hours, Workload XM displays the full range that is available.

Common Use Cases

The following examples of use cases provide an introduction to Workload XM's capabilities.

Troubleshooting Abnormal Job Durations

Use Workload XM to find and troubleshoot slow-running jobs and to help identify areas of risk in jobs running on your cluster.

1. Log in to the Workload XM console at wxm.cloudera.com, and in Search, type the name of the cluster you want to analyze.
2. On the Data Engineering Summary page, click the time range in the upper right corner of the page and specify a time range you are interested in.
3. In the Trend graph, click the Abnormal Duration tab to view the number of jobs with an abnormal duration that executed within the selected time frame. Any jobs that fall outside of the baseline duration are marked as slow. If you hover over the graph, a comparison between the current period and the previous period displays.

After reviewing the chart, click the number of Abnormal Duration jobs above the graph to see a list of the slow jobs within the specified time range.

4. After clicking the Abnormal Duration number, a list of all slow jobs displays on the Data Engineering Jobs page. These jobs have all triggered the Duration health check. From the Duration drop-down list, select a duration range, or select Customize to enter a custom minimum or maximum duration to view any jobs that meet that duration criteria.
5. Click the Job name to view more detailed information. Under the Duration health check, you can see that this job finished much slower than the normal duration.

To further investigate, click the Task Duration health check.

6. After clicking Task Duration, you can see that this job contains several tasks that are heavily skewed, meaning that they took an abnormal amount of time to finish. Click one of the tasks to view further details about it.
7. After clicking one of the tasks, the Task Details pane displays details about its run. In addition to the long run time, garbage collection is taking significantly more time than the average task.

Click Task GC Time to view more information about garbage collection for this job.

8. On the Task GC Time page, click the Execution Details tab, and then click one of the MapReduce stages.
9. On the MapReduce stage Summary page, click View Configurations, and then enter part of the MapReduce memory configuration property name to search for and view the configuration for garbage collection. In this case, setting this property to 1024 might be causing the mapper JVM to have insufficient memory, which triggers overly frequent garbage collection. Increasing this value might improve performance on your cluster.
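The skew reasoning used in this walkthrough, flagging a task whose garbage-collection time stands far above its peers', can be sketched as a simple outlier test. The 2x threshold below is an illustrative choice, not a Workload XM parameter:

```python
# Illustrative outlier test for task GC times: flag any task whose GC time
# exceeds a multiple of the average across all tasks in the job. The 2x
# threshold is an example value, not a Workload XM setting.
def gc_outliers(task_gc_ms, threshold=2.0):
    """Return the sorted names of tasks with abnormally high GC time."""
    average = sum(task_gc_ms.values()) / len(task_gc_ms)
    return sorted(
        task for task, gc in task_gc_ms.items() if gc > threshold * average
    )

# One task spends roughly ten times longer in GC than its peers.
print(gc_outliers({"task_1": 400, "task_2": 450, "task_3": 5000}))
```

A task flagged this way is a candidate for the memory investigation described in steps 8 and 9: heavy GC often points at an undersized JVM heap.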

(Hadoop Administrators) Troubleshooting Failed Data Engineering Jobs

Use Workload XM to quickly troubleshoot failed data engineering jobs.

1. Log in to the Workload XM console at wxm.cloudera.com, and in Search, type the name of the cluster you want to analyze.
2. On the Data Engineering Jobs page, click the Health Checks drop-down list, and select Failed to Finish. This filters the list to display jobs that did not complete.
3. In the list of jobs, click the Job name to view more detailed information.
4. On the Job details page, click Health Checks to view details for the Failed to Finish health check. It indicates that the failure occurred in the Map stage of job execution. Click Map Stage, and then click Execution Details.
5. In the Summary section of the page, click the number of failures to see all failed tasks.

6. Click a failed task to see the error message from each failed attempt. In this example, the error message, "Task KILL is received. Killing attempt!", is not very descriptive or helpful. To gather more information about the task failure, open the associated log file to further analyze the root cause.

(Application Developers) Determining Cause for Slow and Failed Queries

You can also use Workload XM to find the cause of slow query run times and long execution times.

1. Log in to the Workload XM console at wxm.cloudera.com, and in Search, type the name of the cluster you want to analyze.
2. On the Data Engineering Jobs page, click the Health Checks drop-down list, and select Task Wait Time. This filters the list to display jobs with longer than average wait times.

3. Click the job name to view more detailed information.

4. On the details page for that job, click Health Checks, and then click Task Wait Time to see which tasks have abnormally long wait times. Click one of the tasks listed under Outlier Tasks to view details about it.

5. In the Outlier Task details, note the long wait time, indicated under Wait Duration. Compare this value to the run time after the task started, indicated under Successful Attempt Duration. The Successful Attempt Duration value is significantly better than the average, which could mean that insufficient resources were allocated for this job.
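When a task error message is as unhelpful as the "Task KILL is received" message shown in the failed-jobs procedure above, the aggregated job logs usually hold the real cause. As a sketch, on a YARN-managed cluster the logs can typically be retrieved with the yarn command-line tool; the application ID below is a placeholder, and YARN log aggregation must be enabled on the cluster:

```shell
# Fetch aggregated logs for the failed application (placeholder ID)
yarn logs -applicationId application_1526001234567_0042 > app.log

# Scan the logs for common failure signatures such as OutOfMemoryError
# or container-kill messages
grep -iE 'error|exception|killed' app.log | head -n 40
```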


Workload Experience Manager (XM) Reference

The following topics describe the health checks for data engineering jobs, which involve Hive, MapReduce, and Spark, and the health checks for data warehousing workloads, which involve Impala. In addition to health check descriptions, these topics provide recommendations for addressing the conditions that trigger the health checks, as well as information about the query statuses, types, and potential SQL issues that Workload XM identifies.

Impala Query Status
Data Warehouse Query Types
Impala Health Checks
Potential SQL Issues
Hive, Spark, MapReduce Health Checks

Data Warehouse (Apache Impala) Query Status

Query statuses appear in the Failed Queries graph on the Data Warehouse Summary page and in the Status drop-down list on the Data Warehouse Queries page. All query statuses are described below:

Analysis Exception: These queries failed due to syntax errors or incorrect table or column names.

Authorization Exception: These queries failed because the user executing them does not have permission to access the data.

Cancelled: These queries were cancelled by the system or a user.

Exceeded Memory Limit: The amount of memory required to execute these queries exceeds the allocated memory limit.

Failed - Any Reason: These queries failed for any of the reasons listed here.

Other Failures: These queries failed for other, unclassified reasons.

Rejected from Pool: These queries failed because too many queries were already pending in the Impala resource pool.

Session Closed: The session was closed by the system or a user for this set of queries.

Succeeded: These queries succeeded.

Data Warehouse (Apache Impala) Query Types

Query types appear in the Type drop-down list on the Data Warehouse Queries page. All query types are described below. For more detailed information about these SQL statements, see the Impala documentation.
ALTER TABLE: Changes the structure or properties of an existing table. For example: ALTER TABLE table_name ADD PARTITION (month=1, day=1);

ALTER VIEW: Changes the characteristics of a view. For example: ALTER VIEW view_name AS SELECT * FROM table_name;

COMPUTE STATS: Gathers information about the volume and distribution of data in a table and all associated columns and partitions. For example: COMPUTE STATS table_name;

CREATE DATABASE: Creates a new database. For example: CREATE DATABASE database_name;

CREATE FUNCTION: Creates a user-defined function (UDF), which you can use to implement custom logic during SELECT or INSERT operations. For example: CREATE FUNCTION function_name LOCATION 'hdfs_path_to_jar' SYMBOL='class_name';

CREATE ROLE: Creates a role to which privileges can be granted. After privileges are granted to roles, the roles can be assigned to users. A user who has been assigned a role is only able to exercise the privileges of that role. For example: CREATE ROLE role_name;

CREATE TABLE: Creates a new table and specifies its characteristics. For example: CREATE TABLE table_name (column_name data_type) PARTITIONED BY (column_name data_type) LOCATION 'hdfs_path';

CREATE TABLE AS SELECT: Creates a new table with the output from a SELECT statement. For example: CREATE TABLE table_name AS SELECT * FROM table_3;

CREATE TABLE LIKE: Creates a new table by cloning an existing table. For example: CREATE TABLE table_name_2 LIKE table_name_1;

CREATE VIEW: Creates a shorthand abbreviation (alias) for a query. A view is a purely logical construct with no physical data behind it. For example: CREATE VIEW view_name AS SELECT * FROM table_name;

DDL: Data Definition Language. SQL statements that define data structures. For example: CREATE TABLE.

DESCRIBE DB: Displays metadata about a database. For example: DESCRIBE database_name;

DESCRIBE TABLE: Displays metadata about a table. For example: DESCRIBE table_name;

DML: Data Manipulation Language. SQL statements that manipulate data. For example: INSERT.

DROP DATABASE: Removes a database from the system. For example: DROP DATABASE database_name;

DROP FUNCTION: Removes a user-defined function (UDF) so that it is not available for execution during Impala SELECT or INSERT operations. For example: DROP FUNCTION function_name;

DROP STATS: Removes the specified statistics from a table or a partition. For example: DROP STATS table_name;

DROP TABLE: Removes a table and its underlying HDFS data files for internal tables, although not for external tables. For example: DROP TABLE table_name;

DROP VIEW: Removes the specified view. Because a view is purely a logical construct with no physical data behind it, DROP VIEW only involves changes to metadata in the metastore database, not any data files in HDFS. For example: DROP VIEW view_name;

EXPLAIN: Generates a query execution plan for a specific query. For example: EXPLAIN SELECT * FROM table_1;

GRANT PRIVILEGE: Grants privileges on specified objects to roles. For example: GRANT privilege_name ON TABLE table_name TO role_name;

GRANT ROLE: Grants a role to a group. For example: GRANT ROLE role_name TO GROUP group_name;

LOAD: Loads data from an external data source into a table. For example: LOAD DATA INPATH 'hdfs_file_or_directory_path' INTO TABLE table_name;

N/A: These queries failed due to syntax errors, so Impala was not able to identify a query type for them.

REFRESH: Reloads the metadata for a table from the metastore database and does an incremental reload of the file and block metadata from the HDFS NameNode. REFRESH is used to avoid inconsistencies between Impala and external metadata sources, specifically the Hive metastore and the NameNode. For example: REFRESH table_name;

REVOKE PRIVILEGE: Revokes privileges on a specified object from roles. For example: REVOKE privilege_name ON TABLE table_name;

REVOKE ROLE: Revokes a role from a group. For example: REVOKE ROLE role_name FROM GROUP group_name;

SELECT: Requests data from a data source. For example: SELECT * FROM table_1;

SET: Sets configuration properties or session parameters. For example: SET compression_codec=snappy;

SHOW COLUMN STATS: Displays the column statistics for a specified table. For example: SHOW COLUMN STATS table_name;

SHOW CREATE TABLE: Displays the CREATE TABLE statement used to reproduce the current structure of a table. For example: SHOW CREATE TABLE table_name;

SHOW DATABASES: Displays all available databases. For example: SHOW DATABASES;

SHOW FILES: Displays the files that constitute a specified table, or a partition within a partitioned table. For example: SHOW FILES IN table_name;

SHOW FUNCTIONS: Displays the user-defined functions (UDFs) or user-defined aggregate functions (UDAFs) that are associated with a particular database. For example: SHOW FUNCTIONS IN database_name; or SHOW AGGREGATE FUNCTIONS IN database_name;

SHOW GRANT ROLE: Lists all the grants for the specified role. For example: SHOW GRANT ROLE role_name;

SHOW ROLES: Displays all available roles. For example: SHOW ROLES;

SHOW TABLES: Displays the names of tables. For example: SHOW TABLES;

SHOW TABLE STATS: Displays the statistics for a table. For example: SHOW TABLE STATS table_name;

TRUNCATE TABLE: Removes the data from an Impala table while leaving the table itself. For example: TRUNCATE TABLE table_name;

USE: Switches the current session to a specified database. For example: USE database_name;

Data Warehouse (Apache Impala) Health Checks

Impala health checks appear in the Suboptimal Queries graph on the Data Warehouse Summary page and in the Health Check drop-down list on the Data Warehouse Queries page. All query health checks are described below. These health checks provide hints about how to make your workloads faster, or they point out which aspects of your queries might be causing bottlenecks on your cluster. However, the recommendations are not exhaustive, and there may be additional fixes beyond those listed here that can make your workloads run faster. Keep in mind that query tuning can be as much an art as a science. If you are currently satisfied with your cluster performance, you can use these health checks to gain insight into how your query workloads are executing on your cluster. That said, the suboptimal conditions identified by these health checks might cause problems as new applications are added, the system footprint expands, or the overall load on the system increases. Use these health checks to proactively monitor potential issues across your cluster.

Table 1: Health Checks

Aggregation Spilled Partitions: Indicates that data spilled to disk during the aggregation operation for these queries. This health check is triggered during aggregation when there is not enough memory, which causes data to spill to disk. If you are satisfied with your cluster performance despite this health check being triggered, you can disregard it. If performance is an issue, try the following fixes:
- Use a less complex GROUP BY clause that involves fewer columns (do not use a high-cardinality GROUP BY clause).

- Increase the setting for the query's MEM_LIMIT query option. See the Impala documentation.
- Add more physical memory.
For more details, see SQL Operations that Spill to Disk in the Impala documentation set.

Bytes Read Skew: Indicates that one of the cluster nodes is reading a significantly larger amount of data than the other nodes. To address this condition, rebalance the data or use the Impala SCHEDULE_RANDOM_REPLICA query option. For additional suggestions, see Avoiding CPU Hotspots for HDFS Cached Data in the Impala documentation set.

Corrupt Table Statistics: Indicates that these queries used table statistics that were incorrectly computed and cannot be used. This condition can be caused by metastore database issues. Recompute the table statistics. For more information, see Detecting Missing Statistics in the Impala documentation set.

HashJoin Spilled Partitions: Indicates that data spilled to disk during the hash join operation for these queries. This condition occurs when there is not enough memory during the hash join, which causes data to spill to disk. To address this issue:
- Reduce the cardinality of the right-hand side of the join by filtering more rows from it.
- Add more physical memory.
- Increase the setting for the query's MEM_LIMIT query option. See the Impala documentation.
- Use a denormalized table.

Insufficient Partitioning: Indicates that there is insufficient partitioning for parallel query execution to occur for these queries. This condition is triggered when query execution wastes resources and time because the system reads rows that are not required for the operation. To address this condition:
- Check whether your more popular filters can become partition keys. For example, if many queries use ship date as a filter, consider creating partitions with ship date as the partition key.
- Add filters to your query for existing partition columns.
For more details, see Partitioning for Impala Tables in the Impala documentation set.

Many Materialized Columns: Indicates that an abnormally large number of columns were returned for these queries.

This condition is only triggered for Parquet tables, when a query reads more than 15 columns. To address it, rewrite the query so that it does not return more than 15 columns.

Missing Table Statistics: Indicates that no table statistics were computed for query optimization for these queries. To address this condition, compute table statistics. For more information, see Detecting Missing Statistics in the Impala documentation set.

Slow Aggregate: Indicates that the aggregation operations were slower than expected for these queries. Ten million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the aggregation operation. Addressing this condition depends on the root cause:
- If the root cause is resource conflicts with other queries, allocate different resource pools to reduce conflicts.
- If the root cause is overly complex GROUP BY operations, rewrite the queries to simplify them.

Slow Client: Indicates that the client consumed query results more slowly than expected for these queries. The causes and remediations for this health check can vary:
- If the condition is triggered because some clients take too long to unregister the query, use clients that are more appropriate for the workload. For example, if you are testing and building SQL queries, it might make more sense to use an interactive client than ODBC or JDBC.
- If the condition is triggered because you are doing exploratory analysis, reading some rows and then waiting before reading the next set, system resources are consumed because the query has not closed. To remediate, consider using the Impala timeout feature. See Setting Timeout Periods for Daemons, Queries, and Sessions in the Impala documentation set. As an additional option, consider adding a LIMIT clause to your queries to limit the number of rows returned to 100 or fewer.

Slow Code Generation: Indicates that compiled code was generated more slowly than expected for these queries.

In every query plan fragment, Impala tracks how much time is spent generating code; this health check indicates that the time exceeded 20% of the overall query execution time. It might be triggered by query complexity, for example, a query with too many predicates in its WHERE clauses, too many joins, or too many columns. For queries where code generation is too slow, consider using the DISABLE_CODEGEN query option in your session.

Slow HDFS Scan: Indicates that scanning data from HDFS was slower than expected for these queries. Note: if the workload is accessing data that is stored on Amazon S3, this is a known limitation of that storage platform. This condition is caused by a slow disk, extremely complex scan predicates, or an HDFS NameNode that is too busy. The HDFS scan rate is based on the amount of time that the scanner took to read a specific number of rows. This condition can be addressed by:
- Replacing the disk, if the cause is a slow disk.
- Reducing complexity by simplifying the scan predicates.
- If the HDFS NameNode is too busy, upgrading to CDH 5.15 or later. For more information, see Upgrading Cloudera Manager and CDH.

Slow Hash Join: Indicates that hash join operations were slower than expected for these queries. This health check might be triggered when there are overly complex join predicates or when the hash join causes data to spill to disk. Five million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the hash join operation. To remediate this condition, simplify the join predicates or reduce the size of the right side of the join.

Slow Query Planning: Indicates that the query plan was generated more slowly than expected for these queries. This health check is triggered when the query planning time exceeds 30% of the overall query execution time. It can be caused by very complex queries or by a metadata refresh that occurs while the query is executing. To remediate this condition, consider simplifying your queries: for example, reduce the number of columns returned, the number of filters, or the number of joins.

Slow Row Materialization: Indicates that rows were returned more slowly than expected for these queries. This health check is triggered when it takes more than 20% of the query execution time to return rows. It can be caused by overly complex expressions in the SELECT list or by requesting too many rows. To address this condition, simplify the query by reducing the number of columns in the SELECT list or by reducing the number of rows requested.

Slow Sorting Speed: Indicates that the sorting operations were slower than expected for these queries. Ten million rows per second is the typical throughput; if the observed throughput is less than that, this health check is triggered. Observed throughput is calculated by dividing the number of input rows by the time spent in the sorting operation. To remediate this condition, simplify the ORDER BY clauses in queries. If data is spilling to disk, reduce the volume of data to be sorted by adding more predicates to the WHERE clauses, by increasing the available memory, or by increasing the value specified for the MEM_LIMIT query option. See the Impala documentation.

Slow Write Speed: Indicates that the query write speed was slower than expected for these queries. Note: if the workload is accessing data that is stored on Amazon S3, this is a known limitation of that storage platform. This health check is triggered when the difference between the actual write time and the expected write time is more than 20% of the query execution time. This condition can be caused when overly complex expressions are used, or when too many columns or too many rows are requested in the SELECT list. To address this condition, simplify the query by reducing the number of columns or by reducing the complexity of the SELECT list expressions.
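Several of the remediations above amount to session-level Impala settings. As a hedged sketch (the appropriate values depend entirely on your workload and cluster), the following statements illustrate the MEM_LIMIT, DISABLE_CODEGEN, and COMPUTE STATS remediations mentioned in the health check descriptions; table_name and the memory limit are placeholders:

```sql
-- Illustrative session settings; values are examples, not recommendations.

-- Raise the per-query memory limit to reduce spilling during aggregations,
-- hash joins, and sorts (Aggregation/HashJoin Spilled Partitions,
-- Slow Sorting Speed).
SET MEM_LIMIT=2g;

-- Skip code generation for queries where generating it takes too long
-- (Slow Code Generation).
SET DISABLE_CODEGEN=true;

-- Recompute statistics for a table flagged by the Missing or Corrupt
-- Table Statistics health checks (table_name is a placeholder).
COMPUTE STATS table_name;
```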
Potential SQL Issues

Potential issues found in the SQL in your workloads appear in the Performance Issues region of the query details page, which opens when you click a query in the list on the Data Warehouse Queries page. Potential SQL issues are common mistakes made in writing SQL. All SQL issues that Workload Experience Manager (Workload XM) identifies are listed in the following table.


More information

Symantec Ghost Solution Suite Web Console - Getting Started Guide

Symantec Ghost Solution Suite Web Console - Getting Started Guide Symantec Ghost Solution Suite Web Console - Getting Started Guide Symantec Ghost Solution Suite Web Console- Getting Started Guide Documentation version: 3.3 RU1 Legal Notice Copyright 2019 Symantec Corporation.

More information

Service Manager. Database Configuration Guide

Service Manager. Database Configuration Guide Service Manager powered by HEAT Database Configuration Guide 2017.2.1 Copyright Notice This document contains the confidential information and/or proprietary property of Ivanti, Inc. and its affiliates

More information

AvePoint RevIM Installation and Configuration Guide. Issued May AvePoint RevIM Installation and Configuration Guide

AvePoint RevIM Installation and Configuration Guide. Issued May AvePoint RevIM Installation and Configuration Guide AvePoint RevIM 3.2.1 Installation and Configuration Guide Issued May 2017 1 Table of Contents What s New in This Guide... 4 About AvePoint RevIM... 5 Installation Requirements... 6 Hardware Requirements...

More information

IBM Security QRadar Deployment Intelligence app IBM

IBM Security QRadar Deployment Intelligence app IBM IBM Security QRadar Deployment Intelligence app IBM ii IBM Security QRadar Deployment Intelligence app Contents QRadar Deployment Intelligence app.. 1 Installing the QRadar Deployment Intelligence app.

More information

Cloudera ODBC Driver for Apache Hive Version

Cloudera ODBC Driver for Apache Hive Version Cloudera ODBC Driver for Apache Hive Version 2.5.17 Important Notice 2010-2016 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, Cloudera Impala, Impala, and any other product or service

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

x10data Application Platform v7.1 Installation Guide

x10data Application Platform v7.1 Installation Guide Copyright Copyright 2010 Automated Data Capture (ADC) Technologies, Incorporated. All rights reserved. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the

More information

Diagnostic Manager Advanced Installation Guide

Diagnostic Manager Advanced Installation Guide Diagnostic Manager Publication Date: May 03, 2017 All Rights Reserved. This software is protected by copyright law and international treaties. Unauthorized reproduction or distribution of this software,

More information

Copyright 2018, Oracle and/or its affiliates. All rights reserved.

Copyright 2018, Oracle and/or its affiliates. All rights reserved. Beyond SQL Tuning: Insider's Guide to Maximizing SQL Performance Monday, Oct 22 10:30 a.m. - 11:15 a.m. Marriott Marquis (Golden Gate Level) - Golden Gate A Ashish Agrawal Group Product Manager Oracle

More information

Enterprise Vault Troubleshooting FSA Reporting. 12 and later

Enterprise Vault Troubleshooting FSA Reporting. 12 and later Enterprise Vault Troubleshooting FSA Reporting 12 and later Enterprise Vault : Troubleshooting FSA Reporting Last updated: 2018-04-17. Legal Notice Copyright 2018 Veritas Technologies LLC. All rights reserved.

More information

Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform 4.1 Support Package 2

Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform 4.1 Support Package 2 Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform 4.1 Support Package 2 Copyright 2013 SAP AG or an SAP affiliate company. All rights reserved. No part of this

More information

Documentation. This PDF was generated for your convenience. For the latest documentation, always see

Documentation. This PDF was generated for your convenience. For the latest documentation, always see Management Pack for AWS 1.50 Table of Contents Home... 1 Release Notes... 3 What's New in Release 1.50... 4 Known Problems and Workarounds... 5 Get started... 7 Key concepts... 8 Install... 10 Installation

More information

Remote Support Security Provider Integration: RADIUS Server

Remote Support Security Provider Integration: RADIUS Server Remote Support Security Provider Integration: RADIUS Server 2003-2019 BeyondTrust Corporation. All Rights Reserved. BEYONDTRUST, its logo, and JUMP are trademarks of BeyondTrust Corporation. Other trademarks

More information

External Data Connector for SharePoint

External Data Connector for SharePoint External Data Connector for SharePoint Last Updated: August 2014 Copyright 2014 Vyapin Software Systems Private Limited. All rights reserved. This document is being furnished by Vyapin Software Systems

More information

How to Run the Big Data Management Utility Update for 10.1

How to Run the Big Data Management Utility Update for 10.1 How to Run the Big Data Management Utility Update for 10.1 2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording

More information

Insight Case Studies. Tuning the Beloved DB-Engines. Presented By Nithya Koka and Michael Arnold

Insight Case Studies. Tuning the Beloved DB-Engines. Presented By Nithya Koka and Michael Arnold Insight Case Studies Tuning the Beloved DB-Engines Presented By Nithya Koka and Michael Arnold Who is Nithya Koka? Senior Hadoop Administrator Project Lead Client Engagement On-Call Engineer Cluster Ninja

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Hands-on Lab Session 9909 Introduction to Application Performance Management: Monitoring. Timothy Burris, Cloud Adoption & Technical Enablement

Hands-on Lab Session 9909 Introduction to Application Performance Management: Monitoring. Timothy Burris, Cloud Adoption & Technical Enablement Hands-on Lab Session 9909 Introduction to Application Performance Management: Monitoring Timothy Burris, Cloud Adoption & Technical Enablement Copyright IBM Corporation 2017 IBM, the IBM logo and ibm.com

More information

Database Performance Analyzer

Database Performance Analyzer GETTING STARTED GUIDE Database Performance Analyzer Version 11.1 Last Updated: Friday, December 1, 2017 Retrieve the latest version from: https://support.solarwinds.com/@api/deki/files/32225/dpa_getting_started.pdf

More information

Cloudera Introduction

Cloudera Introduction Cloudera Introduction Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

Database Performance Analyzer Integration Module

Database Performance Analyzer Integration Module ADMINISTRATOR GUIDE Database Performance Analyzer Integration Module Version 11.0 Last Updated: Friday, July 21, 2017 Retrieve the latest version from: https://support.solarwinds.com/@api/deki/files/32921/dpaimadministratorguide.pdf

More information

One Identity Active Roles 7.2. Replication: Best Practices and Troubleshooting Guide

One Identity Active Roles 7.2. Replication: Best Practices and Troubleshooting Guide One Identity Active Roles 7.2 Replication: Best Practices and Troubleshooting Copyright 2017 One Identity LLC. ALL RIGHTS RESERVED. This guide contains proprietary information protected by copyright. The

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

Scan to Digitech v1.0

Scan to Digitech v1.0 Scan to Digitech v1.0 Administrator's Guide June 2009 www.lexmark.com Lexmark and Lexmark with diamond design are trademarks of Lexmark International, Inc., registered in the United States and/or other

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

Storage Foundation and High Availability Solutions HA and Disaster Recovery Solutions Guide for Microsoft SharePoint 2013

Storage Foundation and High Availability Solutions HA and Disaster Recovery Solutions Guide for Microsoft SharePoint 2013 Storage Foundation and High Availability Solutions HA and Disaster Recovery Solutions Guide for Microsoft SharePoint 2013 Windows 7.1 April 2016 Storage Foundation and High Availability Solutions HA and

More information

Netwrix Auditor. Virtual Appliance and Cloud Deployment Guide. Version: /25/2017

Netwrix Auditor. Virtual Appliance and Cloud Deployment Guide. Version: /25/2017 Netwrix Auditor Virtual Appliance and Cloud Deployment Guide Version: 9.5 10/25/2017 Legal Notice The information in this publication is furnished for information use only, and does not constitute a commitment

More information

Table of Contents. Configure and Manage Logging in to the Management Portal Verify and Trust Certificates

Table of Contents. Configure and Manage Logging in to the Management Portal Verify and Trust Certificates Table of Contents Configure and Manage Logging in to the Management Portal Verify and Trust Certificates Configure System Settings Add Cloud Administrators Add Viewers, Developers, or DevOps Administrators

More information

NETWRIX WINDOWS SERVER CHANGE REPORTER

NETWRIX WINDOWS SERVER CHANGE REPORTER NETWRIX WINDOWS SERVER CHANGE REPORTER ADMINISTRATOR S GUIDE Product Version: 4.0 June 2013. Legal Notice The information in this publication is furnished for information use only, and does not constitute

More information

User s Manual. Version 5

User s Manual. Version 5 User s Manual Version 5 Copyright 2017 Safeway. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language,

More information

Have documentation feedback? Submit a Documentation Feedback support ticket using the Support Wizard on support.air-watch.com.

Have documentation feedback? Submit a Documentation Feedback support ticket using the Support Wizard on support.air-watch.com. VMware AirWatch Email Notification Service Installation Guide Providing real-time email notifications to ios devices with AirWatch Inbox and VMware Boxer AirWatch v9.1 Have documentation feedback? Submit

More information

Evaluation Guide Host Access Management and Security Server 12.4

Evaluation Guide Host Access Management and Security Server 12.4 Evaluation Guide Host Access Management and Security Server 12.4 Copyrights and Notices Copyright 2017 Attachmate Corporation, a Micro Focus company. All rights reserved. No part of the documentation materials

More information

SAS Data Loader 2.4 for Hadoop

SAS Data Loader 2.4 for Hadoop SAS Data Loader 2.4 for Hadoop vapp Deployment Guide SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2015. SAS Data Loader 2.4 for Hadoop: vapp Deployment

More information

Oracle Big Data Connectors

Oracle Big Data Connectors Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process

More information

N4A Device Manager 4.6.0

N4A Device Manager 4.6.0 N4ACMSX-UG460 N4A Device Manager 4.6.0 User Guide Version 1.0 October 30, 2015 NOVATEL WIRELESS COPYRIGHT STATEMENT 2015 Novatel Wireless, Inc. All rights reserved. The information contained in this document

More information

Using Hive for Data Warehousing

Using Hive for Data Warehousing An IBM Proof of Technology Using Hive for Data Warehousing Unit 1: Exploring Hive An IBM Proof of Technology Catalog Number Copyright IBM Corporation, 2013 US Government Users Restricted Rights - Use,

More information

USER GUIDE. CTERA Agent for Windows. June 2016 Version 5.5

USER GUIDE. CTERA Agent for Windows. June 2016 Version 5.5 USER GUIDE CTERA Agent for Windows June 2016 Version 5.5 Copyright 2009-2016 CTERA Networks Ltd. All rights reserved. No part of this document may be reproduced in any form or by any means without written

More information

Cloudera Installation

Cloudera Installation Cloudera Installation Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks

More information

INSTALLATION & OPERATIONS GUIDE Wavextend Calculation Framework & List Manager for CRM 4.0

INSTALLATION & OPERATIONS GUIDE Wavextend Calculation Framework & List Manager for CRM 4.0 INSTALLATION & OPERATIONS GUIDE Wavextend Calculation Framework & List Manager for CRM 4.0 COPYRIGHT Information in this document, including URL and other Internet Web site references, is subject to change

More information

Installation Guide. EventTracker Enterprise. Install Guide Centre Park Drive Publication Date: Aug 03, U.S. Toll Free:

Installation Guide. EventTracker Enterprise. Install Guide Centre Park Drive Publication Date: Aug 03, U.S. Toll Free: EventTracker Enterprise Install Guide 8815 Centre Park Drive Publication Date: Aug 03, 2010 Columbia MD 21045 U.S. Toll Free: 877.333.1433 Abstract The purpose of this document is to help users install

More information

VMware vrealize Operations for Horizon Administration

VMware vrealize Operations for Horizon Administration VMware vrealize Operations for Horizon Administration vrealize Operations for Horizon 6.3 This document supports the version of each product listed and supports all subsequent versions until the document

More information

Early Data Analyzer Web User Guide

Early Data Analyzer Web User Guide Early Data Analyzer Web User Guide Early Data Analyzer, Version 1.4 About Early Data Analyzer Web Getting Started Installing Early Data Analyzer Web Opening a Case About the Case Dashboard Filtering Tagging

More information

Installing SmartSense on HDP

Installing SmartSense on HDP 1 Installing SmartSense on HDP Date of Publish: 2018-07-12 http://docs.hortonworks.com Contents SmartSense installation... 3 SmartSense system requirements... 3 Operating system, JDK, and browser requirements...3

More information

BLUEPRINT TEAM REPOSITORY. For Requirements Center & Requirements Center Test Definition

BLUEPRINT TEAM REPOSITORY. For Requirements Center & Requirements Center Test Definition BLUEPRINT TEAM REPOSITORY Installation Guide for Windows For Requirements Center & Requirements Center Test Definition Table Of Contents Contents Table of Contents Getting Started... 3 About the Blueprint

More information

Oracle Cloud Using the Google Calendar Adapter with Oracle Integration

Oracle Cloud Using the Google Calendar Adapter with Oracle Integration Oracle Cloud Using the Google Calendar Adapter with Oracle Integration E85501-05 January 2019 Oracle Cloud Using the Google Calendar Adapter with Oracle Integration, E85501-05 Copyright 2017, 2019, Oracle

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Analytics: Server Architect (Siebel 7.7)

Analytics: Server Architect (Siebel 7.7) Analytics: Server Architect (Siebel 7.7) Student Guide June 2005 Part # 10PO2-ASAS-07710 D44608GC10 Edition 1.0 D44917 Copyright 2005, 2006, Oracle. All rights reserved. Disclaimer This document contains

More information

MarkLogic Server. Monitoring MarkLogic Guide. MarkLogic 9 May, Copyright 2017 MarkLogic Corporation. All rights reserved.

MarkLogic Server. Monitoring MarkLogic Guide. MarkLogic 9 May, Copyright 2017 MarkLogic Corporation. All rights reserved. Monitoring MarkLogic Guide 1 MarkLogic 9 May, 2017 Last Revised: 9.0-2, July, 2017 Copyright 2017 MarkLogic Corporation. All rights reserved. Table of Contents Table of Contents Monitoring MarkLogic Guide

More information

Tenant Administration. vrealize Automation 6.2

Tenant Administration. vrealize Automation 6.2 vrealize Automation 6.2 You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/ If you have comments about this documentation, submit your feedback to

More information

Have documentation feedback? Submit a Documentation Feedback support ticket using the Support Wizard on support.air-watch.com.

Have documentation feedback? Submit a Documentation Feedback support ticket using the Support Wizard on support.air-watch.com. VMware AirWatch Email Notification Service Installation Guide Providing real-time email notifications to ios devices with AirWatch Inbox and VMware Boxer Workspace ONE UEM v9.4 Have documentation feedback?

More information

VMware AirWatch Database Migration Guide A sample procedure for migrating your AirWatch database

VMware AirWatch Database Migration Guide A sample procedure for migrating your AirWatch database VMware AirWatch Database Migration Guide A sample procedure for migrating your AirWatch database For multiple versions Have documentation feedback? Submit a Documentation Feedback support ticket using

More information

AvePoint Cloud Governance. Release Notes

AvePoint Cloud Governance. Release Notes AvePoint Cloud Governance Release Notes Table of Contents New Features and Improvements: June 2018... 2 New Features and Improvements: May 2018... 3 New Features and Improvements: April 2018... 4 New Features

More information