Altus Data Engineering


2 Important Notice Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are trademarks of Cloudera and its suppliers or licensors, and may not be copied, imitated or used, in whole or in part, without the prior written permission of Cloudera or the applicable trademark holder. If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required notices. A copy of the Apache License Version 2.0, including any notices, is included herein. A copy of the Apache License Version 2.0 can also be found here: Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation. All other trademarks, registered trademarks, product names and company names or logos mentioned in this document are the property of their respective owners. Reference to any products, services, processes or other information, by trade name, trademark, manufacturer, supplier or otherwise does not constitute or imply endorsement, sponsorship or recommendation thereof by us. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Cloudera. Cloudera may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Cloudera, the furnishing of this document does not give you any license to these patents, trademarks copyrights, or other intellectual property. For information about patents covering Cloudera products, see The information in this document is subject to change without notice. Cloudera shall not be liable for any damages resulting from technical errors or omissions which may be present in this document, or from use of this document. Cloudera, Inc. 395 Page Mill Road Palo Alto, CA info@cloudera.com US: Intl: Release Information Version: Cloudera Altus Date: September 26, 2018

Table of Contents

Overview of Altus Data Engineering...6
    Altus Data Engineering Service Architecture...6
Altus Data Engineering Clusters...7
    Cluster Status...7
    Connecting to a Cluster...8
    Creating and Working with Clusters on the Console...8
        Creating a Cluster for AWS...8
        Creating a Cluster for Azure...13
        Viewing the Cluster Status...17
        Viewing the Cluster Details...17
        Deleting a Cluster...19
    Creating and Working with Clusters Using the CLI...19
        Creating a Cluster for AWS...19
        Creating a Cluster for Azure...21
        Viewing the Cluster Status...23
        Deleting a Cluster...23
    Clusters on AWS...23
        Worker Nodes...23
        Spot Instances...23
        Instance Reprovisioning...24
        System Volume...25
Altus Data Engineering Jobs...26
    Job Status...26
    Job Queue...27
    Running and Monitoring Jobs on the Console...28
        Submitting a Job on the Console...28
        Submitting Multiple Jobs on the Console...31
        Viewing Job Status and Information...35
        Viewing the Job Details...36
    Running and Monitoring Jobs Using the CLI...36
        Submitting a Spark Job...36
        Submitting a Hive Job...39
        Submitting a MapReduce Job...41
        Submitting a PySpark Job...43
        Submitting a Job Group with Multiple Job Types...44
Tutorial: Clusters and Jobs on AWS...45
    Prerequisites...45
    Altus Console Login...46
    Exercise 1: Installing the Altus Client...46
        Step 1. Install the Altus Client...46
        Step 2. Configure the Altus Client with the API Access Key...46
    Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs...48
        Creating a Spark Cluster on the Console...48
        Submitting a Spark Job...49
        Creating a SOCKS Proxy for the Spark Cluster...50
        Viewing the Cluster and Verifying the Spark Job Output...51
        Creating a Spark Job using the CLI...53
        Terminating the Cluster...53
    Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs...53
        Creating a Hive Cluster on the Console...54
        Submitting a Hive Job Group...55
        Creating a SOCKS Proxy for the Hive Cluster...58
        Viewing the Hive Cluster and Verifying the Hive Job Output...59
        Creating a Hive Job Group using the CLI...62
        Terminating the Hive Cluster...62
Tutorial: Clusters and Jobs on Azure...64
    Prerequisites...64
    Sample Files Upload...65
    Altus Console Login...65
    Exercise 1: Installing the Altus Client...65
        Step 1. Install the Altus Client...65
        Step 2. Configure the Altus Client with the API Access Key...66
    Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs...67
        Creating a Spark Cluster on the Console...67
        Submitting a Spark Job...68
        Creating a SOCKS Proxy for the Spark Cluster...69
        Viewing the Cluster and Verifying the Spark Job Output...70
        Creating a Spark Job using the CLI...72
        Terminating the Cluster...72
    Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs...72
        Creating a Hive Cluster on the Console...73
        Submitting a Hive Job Group...73
        Creating a SOCKS Proxy for the Hive Cluster...76
        Viewing the Hive Cluster and Verifying the Hive Job Output...77
        Creating a Hive Job Group using the CLI...80
        Terminating the Hive Cluster...81
Appendix: Apache License, Version 2.0

Overview of Altus Data Engineering

Altus Data Engineering enables you to create clusters and run jobs specifically for data science and engineering workloads. The Altus Data Engineering service offers multiple distributed processing engine options, including Hive, Spark, Hive on Spark, and MapReduce2 (MR2), which allow you to manage ETL, machine learning, and large-scale data processing workloads.

Altus Data Engineering Service Architecture

When you create an Altus Data Engineering cluster or submit a job in Altus, the Altus Data Engineering service accesses your AWS account or Azure subscription to create the cluster or run the job on your behalf.

If your Altus account uses AWS, an AWS administrator must set up a cross-account access role to provide Altus access to your AWS account. When a user in your Altus account creates an Altus Data Engineering cluster, the Altus Data Engineering service uses the Altus cross-account access credentials to create the cluster in your AWS account.

If your Altus account uses Azure, an administrator of your Azure subscription must provide consent for Altus to access the resources in your subscription. When a user in your Altus account creates an Altus Data Engineering cluster, your consent allows the Altus Data Engineering service to create the cluster in your Azure subscription.

Altus manages the clusters and jobs in your cloud provider account. You can configure your Altus Data Engineering cluster to be terminated when the cluster is no longer in use.

When you submit a job to run on a cluster, the Altus Data Engineering service creates a job queue for the cluster and adds the job to the job queue. The Altus Data Engineering service then runs the jobs in the cluster in your cloud provider account. In AWS, the jobs in the cluster access Amazon S3 object storage for data input and output. In Azure, the jobs in the cluster access Microsoft Azure Data Lake Store (ADLS) for data input and output.

The Altus Data Engineering service sends cluster diagnostic information and job execution metrics to Altus. It also stores the cluster and job information in your cloud object storage.

(The original document includes a diagram showing the architecture and process flow of Altus Data Engineering at this point.)

Altus Data Engineering Clusters

You can use the Cloudera Altus console or the command-line interface to create and manage Altus Data Engineering clusters.

The Altus Data Engineering service provisions single-user, transient clusters. By default, the Altus Data Engineering service creates a cluster that contains a master node and multiple worker nodes. The Altus Data Engineering service also creates a Cloudera Manager instance to manage the cluster. The Cloudera Manager instance provides visibility into the cluster but is not a part of the cluster. You cannot use the Cloudera Manager instance as a gateway node for the cluster.

Cloudera Manager configures the master node with roles that give it the capabilities of a gateway node. The master node has a resource manager, Hive server and metastore, Spark service, and other roles and client configurations that essentially turn the master node into a gateway node. You can use the master node as a gateway node in an Altus Data Engineering cluster to run Hive and Spark shell commands and Hadoop commands.

The Altus Data Engineering service creates a read-only user account to connect to the Cloudera Manager instance. When you create a cluster on the Altus console, you specify the user name and password for the read-only user account. Use the user name and password to log in to Cloudera Manager. When you create a cluster using the CLI and you do not specify a user name and password, the Altus Data Engineering service creates a guest user account with a randomly generated password. You can use the guest user name and password to log in to Cloudera Manager.

Altus appends tags to each node in a cluster. You can use the tags to identify the nodes and the cluster that they belong to.

When you create an Altus Data Engineering cluster, you specify which service runs in the cluster. Select the service appropriate for the type of job that you plan to run on the cluster. The following list describes the services available in Altus clusters and the types of jobs you can run with each service:

Service Type      Job Types You Can Run
Hive              Hive
Hive on Spark     Hive
Spark 2.x         Spark or PySpark
Spark 1.6         Spark or PySpark
MapReduce2        MapReduce2
Multi             Hive, Spark or PySpark, MapReduce2

The Multi service cluster supports Spark 2.x. It does not support Spark 1.6.

Cluster Status

A cluster periodically changes status from the time that you create it until the time it is terminated. An Altus cluster can have the following statuses:

Creating. The cluster creation process is in progress.
Created. The cluster was successfully created.
Failed. The cluster can be in a failed state at creation or at termination time. View the failure message to get more information about the failure.
Terminating. The cluster is in the process of being terminated.

When the cluster is terminated, it is removed from the list of clusters displayed in the Clusters page on the console. It is also not included in the list of clusters displayed when you run the list-clusters command.

Connecting to a Cluster

You can access a cluster created in Altus in the same way that you access other CDH clusters. You can use SSH to connect to a service port in the cluster. If you use SSH, you might need to modify the security group in your cloud service provider to allow an SSH connection to your instances from the public Cloudera IP addresses. You can use the Altus client to set up a SOCKS proxy server to access the Cloudera Manager instance in the cluster.

Creating and Working with Clusters on the Console

You can create a cluster on the Cloudera Altus console. You can also view the status and configuration of all clusters created through Altus in your cloud provider account.

Creating a Cluster for AWS

To create a cluster on the console for AWS:

1. Sign in to the Cloudera Altus console:
2. On the side navigation panel, click Clusters. By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name.
3. Click Create Cluster.
4. In the General Information section, specify the following information:

Cluster Name
The name to identify the cluster that you are creating. The cluster name is an alphanumeric string of any length. It can include dashes (-) and underscores (_). It cannot include a space.

Service Type
Indicates the service to be installed on the cluster. Select the service based on the types of jobs you plan to run on the cluster. You can select from the following service types:
Hive
Hive on Spark
Spark 2.x
Spark 1.6
Select Spark 1.6 only if your application specifically requires Spark version 1.6. Altus supports Spark 1.6 only on CDH 5.11.
MapReduce2
Multi
A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.

9 Altus Data Engineering Clusters Note: If you are creating the cluster for a specific job or job group, the list includes only service types that can handle the job or group of jobs that you want to run. CDH Version The CDH version that the cluster will use. You can select from the following CDH versions: CDH 5.15 CDH 5.14 CDH 5.13 CDH 5.12 CDH 5.11 The CDH version that you select can affect the service that runs on the cluster: Spark 2.x or Spark 1.6 For a Spark service type, you must select the CDH version that supports the selected Spark version. Altus supports the following combinations of CDH and Spark versions: CDH versions 5.12 or later with Spark 2.2 CDH 5.11 with Spark 2.1 or Spark 1.6 Hive on Spark On CDH version 5.13 or later, dynamic partition pruning (DPP) is enabled for Hive on Spark by default. For details, see Dynamic Partition Pruning for Hive Map Joins in the Cloudera Enterprise documentation set. Environment Name of the Altus environment that describes the resources to be used for the cluster. The Altus environment specifies the network and instance settings for the cluster. If a lock icon appears next to the environment name, clusters that you create using this environment are secure. If you do not know which Altus environment to select, check with your Altus administrator. 5. In the Node Configuration section, specify the number of workers to create and the instance type to use for the cluster. Worker The worker nodes in a cluster can run data storage and computational processes. For more information about worker nodes, see Worker Nodes on page 23. You can configure the following properties for the worker node: Instance Type Select the instance type from the list of supported instance types. Default: m4.xlarge (16.0 GB 4 vcpus) Altus Data Engineering 9

Note: The master node, worker nodes, and compute worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master node and compute worker nodes to use the same instance type.

Number of Nodes
Select the number of worker nodes to include in the cluster. A cluster must have a minimum of 3 worker nodes. Default: 5
Note: An Altus cluster can have a total of 50 worker and compute worker nodes.

EBS Storage
In the EBS Volume Configuration window, configure the following properties for the EBS volume:
Storage Type. Select the EBS volume type best suited for the job you want to run.
Storage Size. Set the storage size of the EBS volume expressed in gibibytes (GiB).
Volumes per Instance. Set the number of EBS volumes for each instance in the worker node. All EBS volumes are configured with the same volume size and type.
If you do not configure the EBS volumes, Altus sets the optimum configuration for the EBS volumes based on the service type and instance type. For more information about Amazon EBS, see Amazon EBS Product Details on the AWS website.

Purchasing Option
By default, the worker nodes use On-Demand instances. You cannot modify the worker nodes to use Spot instances.

Compute Worker

In addition to the worker nodes, an Altus cluster can have compute worker nodes. Compute worker nodes run only computational processes. For more information about compute worker nodes, see Worker Nodes on page 23. You can configure the following properties for the compute worker node:

Instance Type
You cannot directly modify the instance type for a compute worker node.
Note: The master node, worker nodes, and compute worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master node and compute worker nodes to use the same instance type.

11 Altus Data Engineering Clusters Number of Nodes Select the number of compute worker nodes to include in the cluster. Default: 0 Note: An Altus cluster can have a total of 50 worker and compute worker nodes. EBS Storage In the EBS Volume Configuration window, configure the following properties for the EBS Volume: Storage Type. Select the EBS volume type best suited for the job you want to run. Storage Size. Set the storage size of the EBS volume expressed in gibibyte (GiB). Volumes per Instance. Set the number of EBS volumes for each instance in the worker node. All EBS volumes are configured with the same volume size and type. If you do not configure the EBS volumes, Altus sets the optimum configuration for the EBS volumes based on the service type and instance type. For more information about Amazon EBS, see Amazon EBS Product Details on the AWS website. Purchasing Option Select whether to use On-Demand instances or Spot instances. If you use Spot instances, you must specify the spot price. For more information about using Spot instances for compute worker nodes, see Spot Instances on page 23. Master Altus configures the master node for the cluster. You cannot modify the master node configuration. By default, Altus sets the following configuration for the master node: Instance Type m4.xlarge (16.0 GB 4 vcpus) Note: The master node, worker nodes, and compute worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master and compute worker nodes to use the same instance type. Number of Nodes 1 EBS Storage Altus sets the optimum configuration for the master node based on the service type and instance type. Purchasing Option On-Demand instance Altus Data Engineering 11

12 Altus Data Engineering Clusters Cloudera Manager Altus configures the Cloudera Manager instance for the cluster. You cannot modify the Cloudera Manager instance configuration. By default, Altus sets following configuration for the Cloudera Manager instance: Instance Type c4.2xlarge (15 GB 8 vcpus) Number of Nodes 1 EBS Storage Altus sets the optimum configuration for the Cloudera Manager node based on the service type and instance type. Purchasing Option On-Demand instance 6. In the Credentials section, provide the credentials for the user account to log in to Cloudera Manager. Public SSH Key Cloudera Manager Username Cloudera Manager Password Confirm Cloudera Manager Password You use an SSH key to access instances in the cluster that you are creating. Provide a public key that Altus will add to the authorized_keys file on each node in the cluster. When you access an Altus cluster, use the private key that corresponds to the public key to connect to the cluster through SSH. Select File Upload to upload a file that contains the public key or select Direct Input to enter the full key code. For more information about connecting to Altus clusters through SSH, see SSH Connection. Username for the guest account to use with Cloudera Manager. The guest account will be created as a read-only user account to access the Cloudera Manager instance in the cluster. Password for the Cloudera Manager guest account. Verification of the Cloudera Manager password. The password entries must match exactly. Important: Take note of the user name and password that you specify for the Cloudera Manager guest account. For security reasons, you cannot view the credentials for the Cloudera Manager guest account after the cluster is created. 7. In the Advanced Settings section, set the following optional properties: Instance bootstrap script Bootstrap script that is executed on all the cluster instances immediately after start-up before any service is configured and started. You can use the bootstrap script to install additional OS packages or application dependencies. You cannot use the bootstrap script to change the cluster configuration. Select File Upload to upload a script file or select Direct Input to type the script on the screen. 12 Altus Data Engineering

13 Altus Data Engineering Clusters Resource Tags The bootstrap script must be a local file. It can be in any executable format, such as a Bash shell script or Python script. The size of the script cannot be larger than 4096 bytes. Tags that you define and that you want Altus to append to the cluster that you are creating. Altus appends the tags you define to the nodes and resources associated with the cluster. You create the tag as a name-value pair. Click + to add a tag name and set the value for that tag. Click - to delete a tag from the list. By default, Altus appends tags to the cluster instance to make it easy to identify nodes in a cluster. When you define tags for the cluster, Altus adds your tags in addition to the default tags. For more information about the tags that Altus appends to the cluster, see Altus Tags. 8. Verify that all required fields are set and click Create Cluster. The Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters. Creating a Cluster for Azure To create a CDH cluster on the console: 1. Sign in to the Cloudera Altus console: 2. On the side navigation panel, click Clusters. By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name. 3. Click Create Cluster. 4. In the General Information section, specify the following information: Cluster Name Service Type The name to identify the cluster that you are creating. The cluster name is an alphanumeric string of any length. It can include dashes (-) and underscores (_). It cannot include a space. Indicates the service to be installed on the cluster. Select the service based on the types of jobs you plan to run on the cluster. You can select from the following service types: Hive Hive on Spark Dynamic partition pruning (DPP) is enabled for Hive on Spark by default. For details, see Dynamic Partition Pruning for Hive Map Joins in the Cloudera Enterprise documentation set. Spark 2.x Altus supports Spark 2.2. Altus Data Engineering 13

Spark 1.6
Select Spark 1.6 only if your application specifically requires Spark version 1.6.
MapReduce2
Multi
A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.

Note: If you are creating the cluster for a specific job or job group, the list includes only service types that can handle the job or group of jobs that you want to run.

CDH Version
The CDH version that the cluster will use. Altus supports CDH 5.14 and CDH 5.15.

Environment
Name of the Altus environment that describes the resources to be used for the cluster. The Altus environment specifies the network and instance settings for the cluster. If you do not know which Altus environment to select, check with your Altus administrator.

5. In the Node Configuration section, specify the configuration of the nodes in the cluster.

Worker

The worker nodes in a cluster can run data storage and computational processes. You can configure the following properties for the worker nodes:

Instance Type
Select the instance type to use for the worker nodes in the cluster. You can use one of the following instance types:
Standard_D4S_v3 (16 GiB, 4 vCPUs)
Standard_D8S_v3 (32 GiB, 8 vCPUs)
Standard_D16S_v3 (64 GiB, 16 vCPUs)
Standard_D32S_v3 (128 GiB, 32 vCPUs)
Standard_D64S_v3 (256 GiB, 64 vCPUs)
Standard_DS12_v2 (28 GiB, 4 vCPUs)
Standard_DS13_v2 (56 GiB, 8 vCPUs)
Standard_DS14_v2 (112 GiB, 16 vCPUs)
Standard_DS15_v2 (140 GiB, 20 vCPUs)
Standard_E4S_v3 (32 GiB, 4 vCPUs)
Standard_E8S_v3 (64 GiB, 8 vCPUs)
Standard_E16S_v3 (128 GiB, 16 vCPUs)
Standard_E32S_v3 (256 GiB, 32 vCPUs)
Standard_E64S_v3 (432 GiB, 64 vCPUs)
Altus uses the same instance type for all the worker nodes in the cluster.

15 Altus Data Engineering Clusters Number of Nodes Select the number of worker nodes to include in the cluster. A cluster must have a minimum of 3 worker nodes. Default: 5 Note: An Altus cluster can have a total of 50 worker nodes. Disk Configuration In the Disk Configuration window, configure the following properties for the disk: Storage Type. Select the storage type best suited for the job you want to run, premium or standard. Storage Size. Set the storage size of the disk expressed in gibibyte (GiB). Disks per Instance. Set the number of disks for each instance in the worker node. If you do not change the disk configuration, Altus sets the optimum configuration for the disks based on the service type and instance type. For more information about Azure Managed Disks, see Managed Disks on the Azure website. Master Altus configures the master node for the cluster. You cannot modify the master node configuration. By default, Altus sets the following configuration for the master node: Instance Type Standard_DS12_v2 56 GiB with 4v CPU Note: The master node and worker nodes use the same instance type. If you modify the instance type for the worker node, Altus configures the master node to use the same instance type as the worker node. Number of Nodes 1 Disk Configuration Altus sets the optimum configuration for the master node based on the service type and instance type. Cloudera Manager Altus configures the Cloudera Manager node for the cluster. You cannot modify the Cloudera Manager node configuration. By default, Altus sets following configuration for the Cloudera Manager node: Instance Type Standard_DS12_v2 56 GiB with 4v CPU Number of Nodes 1 Altus Data Engineering 15

16 Altus Data Engineering Clusters Disk Configuration Altus sets the optimum configuration for the Cloudera Manager node based on the service type and instance type. 6. In the Credentials section, provide the credentials for the user account to log in to Cloudera Manager. SSH Public Key Cloudera Manager Username Cloudera Manager Password Confirm Cloudera Manager Password The public key to use to connect to the cluster. Altus adds the public key to the authorized_keys file on each node in the cluster. To connect to an Altus cluster through SSH, use the SSH private key that corresponds to the public key. Select File Upload to upload a file that contains the public key or select Direct Input to enter the full key code. Username for the guest account to use with Cloudera Manager. Altus creates the guest account as a read-only user account to access the Cloudera Manager instance in the cluster. Password for the Cloudera Manager guest account. Verification of the Cloudera Manager password. The password entries must match exactly. Important: Take note of the user name and password that you specify for the Cloudera Manager guest account. For security reasons, you cannot view the credentials for the Cloudera Manager guest account after the cluster is created. 7. In the Advanced Settings section, set the following optional properties: Instance bootstrap script Resource Tags Bootstrap script that is executed on all the cluster instances immediately after start-up before any service is configured and started. You can use the bootstrap script to install additional OS packages or application dependencies. You cannot use the bootstrap script to change the cluster configuration. Select File Upload to upload a script file or select Direct Input to type the script on the screen. The bootstrap script must be a local file. It can be in any executable format, such as a Bash shell script or Python script. The size of the script cannot be larger than 4096 bytes. Tags that you define and that you want Altus to append to the cluster that you are creating. Altus appends the tags you define to the nodes and resources associated with the cluster. You create the tag as a name-value pair. Click + to add a tag name and set the value for that tag. Click - to delete a tag from the list. By default, Altus appends tags to the cluster instance to make it easy to identify nodes in a cluster. When you define tags for the cluster, Altus adds your tags in addition to the default tags. 16 Altus Data Engineering

17 Altus Data Engineering Clusters For more information about the tags that Altus appends to the cluster, see Altus Tags. 8. Verify that all required fields are set and click Create Cluster. The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters. Viewing the Cluster Status To view the status of clusters on the console: 1. Sign in to the Cloudera Altus console: 2. On the side navigation panel, click Clusters. By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name. The Clusters list shows the following information: Cluster name Status For more information about the different statuses that a cluster can have, see Cluster Status on page 7. Service type for the cluster Number of worker nodes Date and time the cluster was created in Altus Instance type for the cluster Version of CDH that runs in the cluster. 3. You can click the Actions button for a cluster to perform the following tasks: Clone Cluster. To create a cluster of the same type and characteristics as the cluster that you are viewing, select the Clone Cluster action. On the Create Cluster page, you can create a cluster with the same properties as the cluster you are cloning. You can modify or add to the properties before you create the cluster. Delete Cluster. To terminate a cluster, select the Delete Cluster action for the cluster you want to terminate. 4. To view the details of a cluster, click the name of the cluster you want to view. The Cluster Details page displays information about the cluster in more detail, including the list of jobs in the cluster. Viewing the Cluster Details To view the details of a cluster on the console: 1. Sign in to the Cloudera Altus console: 2. On the side navigation panel, click Clusters. By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name. Altus Data Engineering 17

18 Altus Data Engineering Clusters 3. Click the name of a cluster. The details page for the selected cluster displays the status of the cluster and the following information: Cluster Status The details page displays information appropriate for the status of the cluster. For example, if a cluster failed at creation time, the details page displays the failure message that explains the reason for the failure, but does not display a link to the Cloudera Manager instance. Cloudera Manager Configuration The Cloudera Manager Configuration section provides a link to the Cloudera Manager instance in the cluster. You can log in to Cloudera Manager through the public or private IP. You can click a link to view the Altus command to set up a SOCKS proxy server to access the Cloudera Manager instance in the cluster. The section also displays the instance type of the Cloudera Manager instance. The Cloudera Manager Configuration section appears only if the IP addresses for the Cloudera Manager instance are available. The IP addresses might not be available when the cluster status is Creating or when the cluster failed at creation time. Node Configuration The Node Configuration section displays the configuration of the nodes in the cluster. For a cluster on AWS, the section displays the configuration of the master node, worker nodes, and any compute worker node that you add to the cluster. The section displays the number of nodes and their instance types, the EBS volume configuration and the pricing option used to acquire the instance. If the cluster does not have compute worker nodes, the section displays zero for the number of compute worker nodes, but shows the default settings that the Altus Data Engineering service uses for compute worker nodes. For a cluster on Azure, the section displays the configuration of the master node and worker nodes. The section displays the number of nodes, their instance types and storage volume configuration, and the number of disks per instance. Cluster Details Log Archive Location shows where the cluster and job logs are archived. Termination condition shows the action that Altus takes when all jobs in the cluster complete. Uses instance bootstrap script? shows whether a bootstrap script runs before cluster startup. All Jobs The All Jobs section shows the list of jobs that run on the cluster. Click View All to go to the Jobs page and view the list of all jobs in the your account. You might need to clear the filter to view all jobs in the account. Service Type and other key information Service Type shows the service that runs in the cluster. Creation Time shows the time when a user created the cluster in Altus. Total Nodes shows the number of nodes in the cluster. For a cluster on AWS, the total number of nodes includes the master node, worker nodes, and compute worker nodes. The number does not include the Cloudera Manager instance. If the compute worker nodes use Spot instances, the number of compute worker nodes available might not be equivalent to the number of compute worker nodes configured for the cluster. The section shows the number of nodes available in the cluster and the total number of nodes configured for the cluster. For a cluster on Azure, the total number of nodes includes the number of master and worker nodes but not the Cloudera Manager instance. To view information about the nodes, click View. The Instances window displays the list of instances in the cluster, their instance IDs and IP addresses, and their roles in the cluster. 
The list of instances does not include the Cloudera Manager instance. Environment displays the Altus environment used to create the cluster. 18 Altus Data Engineering

Region indicates the region where the cluster is created. CDH Version shows the version of CDH in the cluster. CRN shows the Cloudera Resource Name (CRN) assigned to the cluster. Because the CRN is a long string of characters, Altus provides a copy icon so you can easily copy the CRN for any purpose.

Deleting a Cluster

To delete a cluster on the console:

1. Sign in to the Cloudera Altus console:
2. On the side navigation panel, click Clusters. By default, the Clusters page displays the list of all the Altus Data Engineering clusters in your Altus account. The cloud icon next to the cluster name indicates the cloud service provider for the cluster. You can filter the list by environment and status. You can also search for clusters by name.
3. Click the name of the cluster to terminate. On the Cluster Details page, review the cluster information to verify that it is the cluster that you want to terminate.
Note: Before you terminate a cluster, verify that there are no jobs running on the cluster. If you terminate a cluster when a job is running, the job fails. If you terminate a cluster when a job is queued to run on it, the job is interrupted and cannot complete. You can submit the job again to run on another cluster.
4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.

Creating and Working with Clusters Using the CLI

You can use the Cloudera Altus client to create a cluster, view the properties of a cluster, or terminate a cluster. You can use the commands listed here as examples for how to use the Cloudera Altus commands. For more information about the commands available in the Altus client, run the following command:

altus dataeng help

Creating a Cluster for AWS

You can use the following command to create a cluster:

altus dataeng create-aws-cluster
--service-type=servicetype
--workers-group-size=numberofworkers
--cluster-name=clustername
--instance-type=instancetype
--cdh-version=cdhversion
--public-key=fullpath&filenameofpublickeyfile
--environment-name=altusenvironmentname
--compute-workers-configuration='{"groupsize": NumberOfComputeWorkers, "usespot": true, "bidusdperhr": BidPrice}'

Guidelines for using the create-aws-cluster command:

You must specify the service to include in the cluster. In the service-type parameter, use one of the following service names to specify the service in the cluster:
HIVE

HIVE_ON_SPARK
SPARK
Use this service type for Spark 2.1 or Spark 2.2.
SPARK_16
Use this service type only if your application specifically requires Spark version 1.6. If you specify SPARK_16 in the service-type parameter, you must specify CDH511 in the cdh-version parameter.
MR2
MULTI
A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.

You must specify the version of CDH to include in the cluster. In the cdh-version parameter, use one of the following version names to specify the CDH version:
CDH515
CDH514
CDH513
CDH512
CDH511

The CDH version that you specify can affect the service that runs on the cluster:
Spark 2.x or Spark 1.6: For a Spark service type, you must select the CDH version that supports the selected Spark version. Altus supports the following combinations of CDH and Spark versions: CDH versions 5.12 or later with Spark 2.2, and CDH 5.11 with Spark 2.1 or Spark 1.6.
Hive on Spark: On CDH version 5.13 or later, dynamic partition pruning (DPP) is enabled for Hive on Spark by default. For details, see Dynamic Partition Pruning for Hive Map Joins in the Cloudera Enterprise documentation set.

The public-key parameter requires the full path and file name of a .pub file prefixed with file://. For example:

--public-key=file:///my/file/path/to/ssh/publickey.pub

Altus adds the public key to the authorized_keys file on each node in the cluster.

You can use the cloudera-manager-username and cloudera-manager-password parameters to set the Cloudera Manager credentials. If you do not provide a username and password, the Data Engineering service generates a guest username and password for the Cloudera Manager user account.

The compute-workers-configuration parameter is optional. It adds compute worker nodes to the cluster in addition to worker nodes. Compute worker nodes run only computational processes. If you do not set the configuration for the compute workers, Altus creates a cluster with no compute worker nodes.

The response object for the create-aws-cluster command contains the credentials for the read-only account for the Cloudera Manager instance in the cluster. You must note down the credentials from this response since the credentials are not made available again.

Example: Creating a Cluster in AWS for a PySpark Job

This example shows how to create a cluster with a bootstrap script and run a PySpark job on the cluster. The bootstrap script installs a custom Python environment in which to run the job. The Python script file is available in the Cloudera Altus S3 bucket of job examples.

The following command creates a cluster with a bootstrap script and runs a job to implement an alternating least squares (ALS) algorithm:

altus dataeng create-aws-cluster
--environment-name=environmentname
--service-type=spark
--workers-group-size=3
--cluster-name=clustername
--instance-type=m4.xlarge
--cdh-version=cdh512
--public-key YourPublicSSHKey
--instance-bootstrap-script='file:///pathtoscript/bootstrapscript.sh'
--jobs '{ "name": "PySpark ALS Job",
          "pysparkjob": {
            "mainpy": "s3a://cloudera-altus-data-engineering-samples/pyspark/als/als2.py",
            "sparkarguments" : "--executor-memory 1G --num-executors 2 --conf spark.pyspark.python=/tmp/pyspark-env/bin/python"
        }}'

The bootstrapscript.sh in this example creates a Python environment using the default Python version shipped with Altus and installs the NumPy package. It has the following content:

#!/bin/bash
target="/tmp/pyspark-env"
mypip="${target}/bin/pip"
echo "Provisioning pyspark environment..."
virtualenv ${target}
${mypip} install numpy
if [ $? -eq 0 ]; then
  echo "Successfully installed new python environment at ${target}"
else
  echo "Failed to install custom python environment at ${target}"
fi

Creating a Cluster for Azure

You can use the following command to create a cluster:

altus dataeng create-azure-cluster
--service-type=servicetype
--workers-group-size=numberofworkers
--cluster-name=clustername
--instance-type=instancetype
--cdh-version=cdhversion
--public-key=fullpath&filenameofpublickeyfile
--environment-name=altusenvironmentname

Guidelines for using the create-azure-cluster command:

You must specify the service to include in the cluster. In the service-type parameter, use one of the following service names to specify the service in the cluster:
HIVE
HIVE_ON_SPARK
SPARK
Altus supports Spark 2.2.
SPARK_16
Use this service type only if your application specifically requires Spark version 1.6.
MR2

MULTI
A cluster with service type Multi allows you to run different types of jobs. You can run the following types of jobs in a Multi cluster: Spark 2.x, Hive, and MapReduce2.

Altus supports CDH 5.14 and CDH 5.15. Specify one of the following values for the cdh-version parameter: CDH514 or CDH515.

The public-key parameter requires the full path and file name of a .pub file prefixed with file://. For example:

--public-key=file:///my/file/path/to/ssh/publickey.pub

You can use the cloudera-manager-username and cloudera-manager-password parameters to set the Cloudera Manager credentials. If you do not provide a username and password, the Altus Data Engineering service generates a guest username and password for the Cloudera Manager user account.

The response object for the create-azure-cluster command contains the credentials for the read-only account for the Cloudera Manager instance in the cluster. You must note the credentials from this response since the credentials are not made available again.

Example: Creating a Cluster in Azure for a PySpark Job

This example shows how to create a cluster with a bootstrap script and run a PySpark job on the cluster. The bootstrap script installs a custom Python environment in which to run the job.

Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store (ADLS) account with permissions that allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so that the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 65.

The following command creates a cluster with a bootstrap script and runs a job to implement an alternating least squares (ALS) algorithm:

altus dataeng create-azure-cluster
--environment-name=environmentname
--service-type=spark
--workers-group-size=3
--cluster-name=clustername
--instance-type=standard_ds12_v2
--cdh-version=cdh514
--public-key YourPublicSSHKey
--instance-bootstrap-script='file:///pathtoscript/bootstrapscript.sh'
--jobs '{ "name": "PySpark ALS Job",
          "pysparkjob": {
            "mainpy": "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/als/als2.py",
            "sparkarguments" : "--executor-memory 1G --num-executors 2 --conf spark.pyspark.python=/tmp/pyspark-env/bin/python"
        }}'

The bootstrapscript.sh in this example creates a Python environment using the default Python version shipped with Altus and installs the NumPy package. It has the following content:

#!/bin/bash
target="/tmp/pyspark-env"
mypip="${target}/bin/pip"
echo "Provisioning pyspark environment..."
virtualenv ${target}
${mypip} install numpy
if [ $? -eq 0 ]; then
  echo "Successfully installed new python environment at ${target}"
else
  echo "Failed to install custom python environment at ${target}"
fi

Viewing the Cluster Status

When you create a cluster, you can immediately check its status. If the cluster creation process is not yet complete, you can view information regarding the progress of cluster creation. You can use the following command to display the status of a cluster and other information:

altus dataeng describe-cluster --cluster-name=clustername

cluster-name is a required parameter.

Deleting a Cluster

You can use the following command to delete a cluster:

altus dataeng delete-cluster --cluster-name=clustername

Clusters on AWS

If you create clusters on AWS, you can take advantage of the EC2 Spot instances that AWS offers at a discount. You can add compute worker nodes to your clusters and configure them to use Spot instances.

Worker Nodes

An Altus cluster on AWS can have the following types of worker nodes:

Worker node
A worker node runs both data storage and computational processes. Altus requires a minimum of three worker nodes in a cluster.

Compute worker node
A compute worker node is a type of worker node in an Altus cluster that runs only computational processes. It does not run data storage processes. Altus does not require compute worker nodes in a cluster. You can configure compute worker nodes for a cluster to add compute power and improve cluster performance. Compute worker nodes are stateless. They can be terminated and restarted without risking job execution.

A cluster can have a total of 50 worker and compute worker nodes. You determine the combination of worker and compute worker nodes that provides the best performance for your workload. The worker nodes and compute worker nodes use the same instance type.

If you add compute worker nodes to a cluster, Altus manages the provisioning of new instances to replace terminated or failed worker and compute worker instances in a cluster. For more information about reprovisioning cluster instances, see Instance Reprovisioning on page 24.

All compute worker nodes in a cluster use the same instance pricing. You can configure the compute worker nodes to use On-Demand instances or Spot instances. For more information about using Spot instances for compute worker nodes, see Spot Instances on page 23.

Spot Instances

A Spot instance is an EC2 instance for which the hourly price fluctuates based on demand. The hourly price for a Spot instance is typically much lower than the hourly price of an On-Demand instance. However, you do not have control over when Spot instances are available for your cluster.

When you bid a price on Spot instances, your Spot instances run only when your bid price is higher than the current market price and terminate when your bid price falls below the market price. If an increase in the number of nodes in your cluster can improve job performance, you might want to use Spot instances for compute worker nodes in your cluster. To ensure that jobs continue to run when Spot instances are terminated, Altus allows you to use Spot instances only for compute worker nodes. Compute worker nodes are stateless and can be terminated and restarted without risking job execution.

Altus manages the use of Spot instances in a cluster. When a Spot instance with a running job terminates, Altus attempts to provision a new instance every 15 minutes. Altus uses the new instance to accelerate the running job.

Use the following guidelines when deciding to use Spot instances for compute worker nodes in a cluster:

You can use Spot instances only for compute worker nodes. You cannot use Spot instances for worker nodes. You can configure compute worker nodes to use On-Demand or Spot instances. If you configure compute worker nodes to use Spot instances, and no Spot instances are available, jobs run on the worker nodes.

To ensure that worker nodes are available in a cluster to run the processes required to complete a job, worker nodes must use On-Demand instances. You cannot configure worker nodes to use Spot instances.

Set your bid price for Spot instances high enough to have a good chance of exceeding the market price. Generally, a bid price that is 75% of the On-Demand instance price is a good convention to follow. As you use Spot instances more, you can develop a better standard for setting a bid price that is reasonable but still has a good chance of exceeding the market price.

Use fewer On-Demand instances than required and offset the shortfall with a larger number of Spot instances. For example, suppose a job must run on a cluster with 10 On-Demand instances to meet a service level agreement. You can use 5 On-Demand instances and 15 Spot instances to increase the number of instances on which the job runs at the same or lower cost. This strategy means that most of the job processes run on the cheaper instances and is a cost-effective way to meet the SLA. A sketch of a command that follows this strategy appears after the Instance Reprovisioning section below.

For more information about AWS Spot instances, see Spot Instances on the AWS console.

Instance Reprovisioning

By default, if you add compute worker nodes to a cluster, Altus manages the provisioning of new instances to replace terminated or failed instances in the cluster. Altus periodically attempts to replace failed or terminated worker nodes and compute worker nodes in the cluster. When an instance fails or terminates, Altus attempts to provision a new instance every 15 minutes.

Altus provisions new instances of worker nodes and compute worker nodes in the following manner:

Altus provisions On-Demand instances to replace failed or terminated worker nodes and maintain the number of worker nodes configured for the cluster.

If compute worker nodes are configured to use On-Demand instances, Altus provisions On-Demand instances to replace failed or terminated compute worker nodes and maintain the number of compute worker nodes configured for the cluster.
If compute worker nodes are configured to use Spot instances, Altus provisions Spot instances to replace failed or terminated compute worker nodes and, as much as possible, maintain the number of compute worker nodes configured for the cluster. Depending on the availability of Spot instances, the number of compute worker nodes might not always match the number of compute worker nodes configured for the cluster.

Note: Altus cannot provision new instances to replace a terminated or failed master node or Cloudera Manager instance. When a master node or Cloudera Manager instance fails, the cluster fails.
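The mixed On-Demand and Spot strategy described above maps directly onto the create-aws-cluster command shown earlier in this guide. The following sketch combines five On-Demand worker nodes with fifteen Spot compute worker nodes. The cluster name, environment name, public key path, and bid price are placeholder values for illustration only, and the JSON field names follow the compute-workers-configuration template printed in the Creating a Cluster for AWS section:

altus dataeng create-aws-cluster
--cluster-name=spot-example-cluster
--environment-name=altusenvironmentname
--service-type=SPARK
--cdh-version=CDH514
--instance-type=m4.xlarge
--workers-group-size=5
--public-key=file:///my/file/path/to/ssh/publickey.pub
--compute-workers-configuration='{"groupsize": 15, "usespot": true, "bidusdperhr": 0.15}'

With this configuration, the On-Demand worker nodes keep the cluster and its data storage processes available even if every Spot instance is reclaimed, while the Spot compute worker nodes add processing capacity whenever the bid price exceeds the market price.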

System Volume

By default, when you create an Altus cluster for AWS, each node in the cluster includes a root device volume. In addition, Altus attaches an EBS volume to the node to store data generated by the cluster. The EBS volume that Altus adds to the node is a system volume meant to hold logs and other data generated by Altus services and systems. Although Altus manages it, the system volume counts as a volume that you pay for in your instance. The system volume is deleted when the cluster is terminated.

Altus configures the cluster so that sensitive information is not written to the root volume, but to the system volume. When you enable the secure cluster option for Altus clusters, Altus encrypts the system volume and the EBS volumes that you configure for the cluster. Altus does not need to encrypt the root device volume since it does not contain sensitive data.

Altus Data Engineering Jobs

You can use the Cloudera Altus console or the command-line interface to run and monitor jobs. When you submit a job, configure it to run on a cluster that contains the service you require to run the job. The following list describes the services available in Altus clusters and the types of jobs you can run with each service:

Service Type      Job Types You Can Run
Hive              Hive
Hive on Spark     Hive
Spark 2.x         Spark or PySpark
Spark 1.6         Spark or PySpark
MapReduce2        MapReduce2
Multi             Hive, Spark or PySpark, MapReduce2

The Multi service cluster supports Spark 2.x. It does not support Spark 1.6.

Altus creates a job queue for each cluster. When you submit a job, Altus adds the job to the queue of the cluster on which you configure the job to run. For more information about the job queue, see Job Queue on page 27.

Altus generates a job ID for every job that you submit. If you submit a group of jobs, Altus generates a group ID. If you do not specify a job name or a group name, Altus sets the job name to the job ID and the group name to the group ID.

If the Altus environment has Workload Analytics enabled, you can view performance information for a job after it ends, including health checks, baselines, and other execution information. Use this information to analyze a job's current performance and compare it to past runs of the same job.

Job Status

A job periodically changes status from the time that you submit it until the time it completes or is terminated. A user action or the configuration of the job or the cluster on which it runs can affect the status of the job. A data engineering job can have the following statuses:

Queued. The job is queued to run on the selected cluster.
Submitting. The job is being added to the job queue.
Running. The job is in progress.
Interrupted. A job is set to Interrupted status in the following situations:
  If you create a cluster when you submit a job and the cluster is not successfully created, the job status is set to Interrupted. You can create a cluster and rerun the job on the new cluster.
  If the job is queued to run on a cluster but the cluster is deleted, the job status is set to Interrupted. You can rerun the job on another cluster.
  If the job does not run because a previous job in the queue has the Action on Failure option set to Interrupt Job Queue, the job status is set to Interrupted.
Completed. The job completed successfully.
Terminating. You have initiated termination of the job and the job is in the process of being terminated.
Terminated. The job termination process is complete.
Failed. The job did not complete.
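If you monitor jobs from the CLI rather than the console, the Altus client provides job-listing and job-describing commands alongside the cluster commands documented in this guide. The command and parameter names shown below (list-jobs, describe-job, --job-id) are assumptions based on the naming pattern of the documented cluster commands; run altus dataeng help to confirm the exact commands and options available in your client version.

altus dataeng list-jobs --cluster-name=clustername

altus dataeng describe-job --job-id=jobid

A job that remains in Queued status usually means that an earlier job in the same cluster queue is still running, because Altus runs the jobs in a queue sequentially in the order that the job requests are received.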

Job Queue

When you create a cluster, Altus sets up a job queue for the jobs submitted to the cluster. Altus sets up one job queue for each cluster and adds all jobs that are submitted to a cluster to the same job queue. Altus runs the jobs in the queue in the sequence that the job requests are received. Whether you submit single jobs or groups of jobs to the cluster, Altus runs the jobs sequentially in the order that each job request is received.

You can configure the following options to manage how jobs run in the queue:

Job Failure Action

When you submit a job, you can specify the action that Altus takes when a job fails. You can use the job failure action to specify whether Altus runs the jobs in the queue following a failed job. This option is useful for handling job dependencies. If a job must complete before the next job can run, you can set the option to interrupt the job queue so that, if the job fails, Altus does not run the rest of the jobs in the queue. If a job failure does not affect subsequent jobs in the queue, you can set the option to an action of NONE so that, when a job fails, Altus continues to run the subsequent jobs in the queue. If you do not specify any action, Altus sets the option to interrupt the job queue by default.

You specify the action that Altus takes when a job fails with the following console option or CLI parameter:

Console option: Action on Failure
When you submit a job, you can set the Action on Failure option to one of the following actions:
None. When a job fails, Altus continues with job execution and performs no special action. Altus runs the next job in the queue.
Interrupt Job Queue. When the job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted.

CLI parameter: failureaction
If you use the CLI to submit a job, you can set the failureaction parameter to one of the following actions:
NONE. When a job fails, Altus continues with job execution and performs no special action. Altus runs the next job in the queue.
INTERRUPT_JOB_QUEUE. When the job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted.

Cluster termination after all jobs are processed

When you create a cluster, you can configure how Altus handles the cluster when all jobs sent to the cluster are processed and the job queue becomes empty. You specify the condition by which Altus terminates the cluster with the following console option or CLI parameter:

Console option: Terminate cluster once jobs complete
When you submit a job and you create a cluster on which to run the job, you can enable the Terminate cluster once jobs complete option to terminate the cluster when the job queue is empty. If you do not enable the option, Altus does not terminate the cluster when the job queue is empty. You must manually terminate the cluster if you do not plan to submit jobs to the cluster again.

28 Altus Data Engineering Jobs Interface CLI Option/Parameter --automatic-termination-condition cluster once jobs complete option to terminate the cluster when the job queue is empty. If you do not enable the option, Altus does not terminate the cluster when the job queue is empty. You must manually terminate the cluster if you do not plan to submit jobs to the cluster again. If you use the CLI to create a cluster, you can use the --automatic-termination-condition parameter to specify whether to terminate the cluster when the job queue is empty. You can set the parameter to one of the following conditions: NONE. When the job queue is empty, Altus does not terminate the cluster. EMPTY_JOB_QUEUE. When all jobs in the queue are processed and the queue is empty, Altus terminates the cluster. If you set the option to terminate the cluster, you must include the --jobs parameter and submit at least one job to the cluster. The --automatic-termination-condition parameter is optional. If you do not include the parameter, Altus does not terminate the cluster when the job queue is empty. Running and Monitoring Jobs on the Console You can submit a single job or a group of jobs on the Cloudera Altus console. When you view the list of jobs, you can file a support ticket with Cloudera for any job that has issues with which you require help. Submitting a Job on the Console You can submit a job to run on an existing cluster or create a cluster specifically for the job. To submit a job on the console: 1. Sign in to the Cloudera Altus console: 2. On the side navigation panel, click Jobs. By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status. 3. Click Submit Jobs. 4. On the Job Settings page, select Single job. 5. Select the type of job you want to submit. You can select from the following types of jobs: Hive 28 Altus Data Engineering

29 Altus Data Engineering Jobs MapReduce2 PySpark Spark 6. Enter the job name. The job name is optional. If you do not specify a name, Altus sets the job name to be the same as the job ID. 7. Specify the properties for the job based on the job type. Hive Job Properties Script Hive Script Parameters Job XML The Hive script to execute. Select one of the following sources for the hive script: Script Path. Specify the path and file name of the file that contains the script. File Upload. Upload a file that contains the script. Direct Input. Type in the script. The Hive script can include parameters. Use the format ${Variable_Name} for the parameter. If the script contains parameters, you must specify the variable name and value for each parameter in Hive Script Parameters field. Required. Required if the Hive script includes variables. Select the option and provide the definition of the variables used as parameters in the Hive script. You must define the value of all variables that you use in the script. Click + to add a variable to the list. Click - to delete a variable from the list. Optional. XML document that defines the configuration settings for the job. Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings. Spark Job Properties Main Class Jars Main class and entry point of the Spark application. Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS. Click + to add a jar file to the list. Click - to delete a jar file from the list. Required. Altus Data Engineering 29

30 Altus Data Engineering Jobs Application Arguments Spark Arguments Optional. Arguments to pass to the main method of the main class of the Spark application. Click + to add an argument to the list. Click - to delete an argument from the list. Optional. A list of Spark configuration properties for the job. For example: --executor-memory 4G --num-executors 50 MapReduce2 Job Properties Main Class Jars MapReduce Application Arguments Java Options Job XML Main class and entry point of the MapReduce2 application. Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS. Click + to add a jar file to the list. Click - to delete a jar file from the list. Required. Optional. Arguments for the MapReduce2 application. The arguments are passed to the main method of the main class. Click + to add an argument to the list. Click - to delete an argument from the list. Optional. A list of Java options for the JVM. Optional. XML document that defines the configuration settings for the job. Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings. PySpark Job Properties Main Python File Python File Dependencies Path and file name of the main Python file for the Spark application. This is the entry point for your PySpark application. You can specify a file that is stored in cloud storage or in HDFS. Required. Optional. Files required by the PySpark job, such as.zip,.egg, or.py files. Altus adds the path and file names of the files in the PYTHONPATH for Python applications. You can include files that are stored in cloud storage or in HDFS. Click + to add a file to the list. Click - to delete a file from the list. 30 Altus Data Engineering

31 Altus Data Engineering Jobs Application Arguments Spark Arguments Optional. Arguments to pass to the main method of the PySpark application. Click + to add an argument to the list. Click - to delete an argument from the list. Optional. A list of Spark configuration properties for the job. For example: --executor-memory 4G --num-executors In Action on Failure, specify the action that Altus takes when the job fails. Altus can perform the following actions: None. When a job fails, Altus runs the subsequent jobs in the queue. Interrupt Job Queue. When a job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted. For more information about the Action on Failure option, see Job Queue on page In the Cluster Settings section, select the cluster on which the job will run: Use existing. Select from the list of clusters that is available for your use. Altus displays only the names of clusters where the type of job you selected can run and that you have access to. The list displays the number of workers in the cluster. Create new. Configure and create a cluster for the job. If the cluster creation process is not yet complete when you submit the job, Altus adds it to the job queue and runs it when the cluster is created. Clone existing. Select the cluster on which to base the configuration of a new cluster. 10. If you create or clone a cluster, set the properties and select the options for the new cluster. Complete the following steps: a. To allow Altus to terminate the cluster after the job completes, select the Terminate cluster once jobs complete option. If you create a cluster specifically for this job and you do not need the cluster after the job runs, you can have Altus terminate the cluster when the job completes. If the Terminate cluster once jobs complete option is selected, Altus terminates the cluster after the job runs, whether the job completes successfully or fails. This option is selected by default. If you do not want Altus to terminate the cluster, clear the selection. b. You create a cluster within the Jobs page the same way that you create a cluster on the Clusters page. To create a cluster for AWS, follow the instructions from Step 4 on page 8 to Step 7 on page 12 in Creating a Cluster for AWS on page 8. To create a cluster for Azure, follow the instructions from Step 4 on page 13 to Step 7 on page 16 in Creating a Cluster for Azure on page Verify that all required fields are set and click Submit. The Altus Data Engineering service submits the job to run on the selected cluster in your cloud provider account. Submitting Multiple Jobs on the Console You can group multiple jobs in one job submission. You can submit a group of jobs to run on an existing cluster or you can create a cluster specifically for the job group. Altus Data Engineering 31
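If you prefer the CLI, the equivalent of the console flow described below is a single submit-jobs call that passes a JSON array of jobs and a group name. The following is a minimal sketch with two Hive jobs; it reuses the --job-submission-group-name parameter and the job JSON format shown in Running and Monitoring Jobs Using the CLI on page 36, and the job names and script paths are placeholders:

altus dataeng submit-jobs \
--cluster-name YourClusterName \
--job-submission-group-name "DailyReports" \
--jobs '[
    { "name": "Load Data", "hivejob": { "script": "file:///path/to/load.hql" } },
    { "name": "Build Report", "hivejob": { "script": "file:///path/to/report.hql" } }
]'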

32 Altus Data Engineering Jobs Note: When you create a job group on the console, you can only include the same type of jobs. A job group that you create on the console does not support multiple types of jobs, even if you run the job group on a Multi service cluster. Use the Altus CLI to create a job group with multiple types of jobs and run it in a Multi service cluster. To submit a job on the console: 1. Sign in to the Cloudera Altus console: 2. On the side navigation panel, click Jobs. By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status. 3. Click Submit Jobs. 4. On the Job Settings page, select Group of jobs. 5. Select the type of job you want to submit. You can select from the following types of jobs: Hive MapReduce2 PySpark Spark 6. Enter a name for the job group. The job group name is optional. By default, Altus assigns an ID to the job group. If you do not specify a name, Altus sets the job group name to be the same as the job group ID. 7. Click Add <Job Type>. 8. On the Add Job window, enter the job name. The job name is optional. By default, Altus assigns an ID to the job. If you do not specify a name, Altus sets the job name to be the same as the job ID. 9. Set the properties for the job. Altus displays job properties based on the job type. Hive Job Properties Script The Hive script to execute. Select one of the following sources for the hive script: Script Path. Specify the path and file name of the file that contains the script. File Upload. Upload a file that contains the script. Direct Input. Type in the script. The Hive script can include parameters. Use the format ${Variable_Name} for the parameter. If the script contains parameters, you must specify the variable name and value for each parameter in Hive Script Parameters field. Required. 32 Altus Data Engineering

33 Altus Data Engineering Jobs Hive Script Parameters Job XML Required if the Hive script includes variables. Select the option and provide the definition of the variables used as parameters in the Hive script. You must define the value of all variables that you use in the script. Click + to add a variable to the list. Click - to delete a variable from the list. Optional. XML document that defines the configuration settings for the job. Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings. Spark Job Properties Main Class Jars Application Arguments Spark Arguments Main class and entry point of the Spark application. Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS. Click + to add a jar file to the list. Click - to delete a jar file from the list. Required. Optional. Arguments to pass to the main method of the main class of the Spark application. Click + to add an argument to the list. Click - to delete an argument from the list. Optional. A list of Spark configuration properties for the job. For example: --executor-memory 4G --num-executors 50 MapReduce2 Job Properties Main Class Jars Main class and entry point of the MapReduce2 application. Required. Path and file names of jar files to be added to the classpath. You can include jar files that are stored in AWS S3 or Azure ADLS cloud storage or in HDFS. Click + to add a jar file to the list. Click - to delete a jar file from the list. Required. Altus Data Engineering 33

34 Altus Data Engineering Jobs MapReduce Application Arguments Java Options Job XML Optional. Arguments for the MapReduce2 application. The arguments are passed to the main method of the main class. Click + to add an argument to the list. Click - to delete an argument from the list. Optional. A list of Java options for the JVM. Optional. XML document that defines the configuration settings for the job. Select the option and provide the job configuration. Select File Upload to upload the configuration XML file or select Direct Input to type in the configuration settings. PySpark Job Properties Main Python File Python File Dependencies Application Arguments Spark Arguments Path and file name of the main Python file for the Spark application. This is the entry point for your PySpark application. You can specify a file that is stored in cloud storage or in HDFS. Required. Optional. Files required by the PySpark job, such as.zip,.egg, or.py files. Altus adds the path and file names of the files in the PYTHONPATH for Python applications. You can include files that are stored in cloud storage or in HDFS. Click + to add a file to the list. Click - to delete a file from the list. Optional. Arguments to pass to the main method of the PySpark application. Click + to add an argument to the list. Click - to delete an argument from the list. Optional. A list of Spark configuration properties for the job. For example: --executor-memory 4G --num-executors In Action on Failure, specify the action that Altus takes when a job fails. Altus can perform the following actions: None. When a job fails, Altus runs the subsequent jobs in the queue. Interrupt Job Queue. When a job fails, Altus does not run any of the subsequent jobs in the queue. The jobs that do not run after a job fails are set to a status of Interrupted. For more information about the Action on Failure option, see Job Queue on page Click OK. The Add Job window closes and the job is added to the list of jobs for the group. You can edit the job or delete the job from the group. To add another job to the group, click Add <Job Type> and set the properties for the new job. When you complete setting up all jobs in the group, specify the cluster on which the jobs will run. 34 Altus Data Engineering

35 12. In the Cluster Settings section, select the cluster on which the jobs will run: Use existing. Select from the list of clusters that is available for your use. Altus displays only the names of clusters where the type of job you selected can run and that you have access to. The list displays the number of workers in the cluster. Create new. Configure and create a cluster for the job. If the cluster creation process is not yet complete when you submit the job, Altus adds it to the job queue and runs it when the cluster is created. Clone existing. Select the cluster on which to base the configuration of a new cluster. 13. If you create or clone a cluster, set the properties and select the options for the new cluster: Complete the following steps: a. To allow Altus to terminate the cluster after the job completes, select the Terminate cluster once jobs complete option. If you create a cluster specifically for this job and you do not need the cluster after the job runs, you can have Altus terminate the cluster when the job completes. If the Terminate cluster once jobs complete option is selected, Altus terminates the cluster after the job runs, whether the job completes successfully or fails. This option is selected by default. If you do not want Altus to terminate the cluster, clear the selection. b. You create a cluster within the Jobs page the same way that you create a cluster on the Clusters page. To create a cluster for AWS, follow the instructions from Step 4 on page 8 to Step 7 on page 12 in Creating a Cluster for AWS on page 8. To create a cluster for Azure, follow the instructions from Step 4 on page 13 to Step 7 on page 16 in Creating a Cluster for Azure on page Verify that all required fields are set and click Submit. The Altus Data Engineering service submits the jobs as a group to run on the selected cluster in your cloud service account. Viewing Job Status and Information To view Altus Data Engineering jobs on the console: 1. Sign in to the Cloudera Altus console: 2. On the side navigation panel, click Jobs. By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status. The jobs list displays the name of the group to which the job belongs and the name of the cluster on which the job runs. Click the group name to view the details of the job group and the jobs in the group. Click the cluster name to view the cluster details. The Jobs list displays the status of the job. For more information about the different statuses that a job can have, see Altus Data Engineering Jobs on page You can click the Actions button for the job to perform the following tasks: Altus Data Engineering Jobs Clone a Job. To create a job of the same type as the job that you are viewing, select the Clone Job action. On the Submit Job page, you can submit a job with the same properties as the job you are cloning. You can modify or add to the properties before you submit the job. Altus Data Engineering 35

Terminate a Job. If the job has a status of Queued, Running, or Submitting, you can select Terminate Job to stop the process. If you terminate a job with a status of Running, the job run is aborted. If you terminate a job with a status of Queued or Submitting, the job will not run. If the job status is Complete, the Terminate Job selection does not appear.

Viewing the Job Details

To view the details of a job on the console:

1. Sign in to the Cloudera Altus console:
2. On the side navigation panel, click Jobs. By default, the Jobs page displays the list of all the Altus Data Engineering jobs in your Altus account. You can filter the list of jobs by Altus environment, the cluster on which the jobs run, or the time frame when the jobs run. You can also filter by the user who submitted the job and the job type and status.
3. Click the name of a job.

The Job details page displays information about the job, including the job type and the properties and status of the job.

The Job Settings section displays the properties configured for the job. The section also shows the action that Altus takes if a job in the queue fails.

The Job details page shows the name of the cluster on which the job runs. If the cluster is not terminated, the name is a link to the details page of the cluster where you can see more information about the cluster.

The Cloudera Manager Configuration section displays the IP addresses that you can use to access the Cloudera Manager instance for the cluster. This section appears if the cluster on which the job runs is not terminated. The Altus client provides a command that you can use to SSH to the Cloudera Manager instance. You can view and copy the command with the cluster parameters set and run it to connect to Cloudera Manager.

The Job details page also displays the job ID and CRN. The job CRN is a long string of characters. If you need to include the job CRN when you run a command or create a support case, you can copy the CRN to the clipboard from the Job details page and paste it on the command line or into the support case.

Running and Monitoring Jobs Using the CLI

Use the Cloudera Altus client to submit a job or view the properties of a job. You can use the commands listed here as examples for how to use the Altus commands to submit jobs in Altus. For more information about the commands available in the Altus client, run the following command:

altus dataeng help

Submitting a Spark Job

You can use the following command to submit a Spark job:

altus dataeng submit-jobs --cluster-name ClusterName --jobs '{
    "sparkjob": {
        "jars": [
            "PathAndFilenameOfJar1",
            "PathAndFilenameOfJar2"

        ]
    }}'

You can include the applicationarguments parameter to pass values to the main method and the sparkarguments parameter to specify Spark configuration settings. If you use the application and Spark arguments parameters, you must escape the list of arguments. Alternatively, you can put the arguments into a file and pass the path and file name with the arguments parameters.

Use the following prefixes when you include jar files for the Spark job:

For files in Amazon S3: s3a://
For files in Azure Data Lake Store: adl://
For files in the local file system: file://
For files in HDFS in the cluster: hdfs://

You can also add the mainclass parameter to specify the entry point of your application.

Note: You can find examples for submitting a Spark job in the Altus Data Engineering tutorials:
In the Altus tutorial for AWS: Creating a Spark Job using the CLI on page 53
In the Altus tutorial for Azure: Creating a Spark Job using the CLI on page 72

Spark Job Examples

The following examples show how to submit a Spark job to run on a cluster in AWS and in Azure.

Spark Job Examples for a Cluster in AWS

The following examples show how to submit a Spark job to run on a cluster in AWS:

Pi Estimation Example

Spark provides a library of code examples that illustrate how Spark works. The following example uses the Pi estimation example from the Spark library to show how to submit a Spark job using the Altus CLI.

You can use the following command to submit a Spark job to run the Pi estimation example:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
    "sparkjob": {
        "jars": [ "local:///opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar" ],
        "sparkarguments" : "--executor-memory 1G --num-executors 2",
        "mainclass": "org.apache.spark.examples.SparkPi"
    }}'

The --cluster-name parameter requires the name of a Spark cluster.

Medicare Example

The following example processes publicly available data to show the usage of Medicare procedure codes. The Spark job is available in a Cloudera Altus S3 bucket of job examples and reads input data from the Cloudera Altus example S3 bucket. You can create an S3 bucket in your account to write output data.

To use the example, set up an S3 bucket in your AWS account and set permissions to allow write access when you run the job.

To run the Spark job example:

1. Create a Spark cluster to run the job. You can create a cluster with Spark 2.x or Spark 1.6 service type. The version of the Spark service in the cluster must match the version of the Spark jar file:

For Spark 2.x, use the example jar file named altus-sample-medicare-spark2x.jar
For Spark 1.6, use the example jar file named altus-sample-medicare-spark1x.jar
For more information about creating a cluster, see Creating a Cluster for AWS on page 8.

2. Create an S3 bucket in your AWS account.
3. Use the following command to submit the Medicare job:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
    "sparkjob": {
        "jars": [ "s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-sparkversion.jar" ],
        "mainclass": "com.cloudera.altus.sample.medicare.transform",
        "applicationarguments": [
            "s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/",
            "s3a://nameofoutputs3bucket/outputpath/"
        ]
    }}'

The --cluster-name parameter requires the name of a cluster with a version of Spark that matches the version of the example jar file. The jars parameter requires the name of the jar file that matches the version of the Spark service in the cluster.

Spark Job Example for a Cluster in Azure

This example processes publicly available data to show the usage of Medicare procedure codes. Cloudera provides the job example files and input files that you need to run the jobs.

To use the following example, set up an Azure Data Lake Store (ADLS) account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 65.

To run the Spark job example:

1. Create a Spark cluster to run the job. You can create a cluster with Spark 2.x or Spark 1.6 service type. The version of the Spark service in the cluster must match the version of the Spark jar file:
For Spark 2.x, use the example jar file named altus-sample-medicare-spark2x.jar
For Spark 1.6, use the example jar file named altus-sample-medicare-spark1x.jar
For more information about creating a cluster, see Creating a Cluster for Azure on page 13.

2. Use the following command to submit the Medicare job:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
    "sparkjob": {
        "jars": [ "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-sparkversion.jar" ],
        "mainclass": "com.cloudera.altus.sample.medicare.transform",
        "applicationarguments": [
            "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/",
            "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output/"

        ]
    }}'

The --cluster-name parameter requires the name of a cluster with a version of Spark that matches the version of the example jar file. The jars parameter requires the name of the jar file that matches the version of the Spark service in the cluster.

Submitting a Hive Job

You can use the following command to submit a Hive job:

altus dataeng submit-jobs --cluster-name ClusterName --jobs '{
    "hivejob": {
        "script": "PathAndFilenameOfHQLScript"
    }}'

You can also include the jobxml parameter to pass job configuration settings for the Hive job. The following is an example of the content of a Hive job XML that you can use with the jobxml parameter:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value> </value>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>hive.exec.reducers.max</name>
    <value>1099</value>
  </property>
</configuration>

Note: You can find examples for submitting a Hive job in the Altus Data Engineering tutorials:
In the Altus tutorial for AWS: Creating a Hive Job Group using the CLI on page 62
In the Altus tutorial for Azure: Submitting a Spark Job on page 68

Hive Job Examples

The following examples show how to submit a Hive job to run on a cluster in AWS and in Azure.

Hive Job Example for a Cluster in AWS

The following example of a Hive job reads data from a CSV file in an S3 bucket in the Cloudera AWS account. It then writes the same data, with the commas changed to colons, to an S3 bucket in your AWS account.

40 Altus Data Engineering Jobs To use the example, set up an S3 bucket in your AWS account and set permissions to allow write access when you run the example Hive script. To run the Hive job example: 1. Create a cluster to run the Hive job. You can run a Hive job on a Hive on MapReduce or Hive on Spark cluster. For more information about creating a cluster, see Creating a Cluster for AWS on page Create an S3 bucket in your AWS account. 3. Create a Hive script file on your local drive. This example uses the file name hivescript.hql. 4. Copy and paste the following script into the file: DROP TABLE input; DROP TABLE output; CREATE EXTERNAL TABLE input(f1 STRING, f2 STRING, f3 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3a://cloudera-altus-data-engineering-samples/hive/data/'; CREATE TABLE output(f1 STRING, f2 STRING, f3 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ':' STORED AS TEXTFILE LOCATION 's3a://nameofoutputs3bucket/outputpath/'; INSERT OVERWRITE TABLE output SELECT * FROM input ORDER BY f1; 5. Modify the script and replace the name and path of the output S3 bucket with the name and path of the S3 bucket you created in your AWS account. 6. Run the following command: altus dataeng submit-jobs \ --cluster-name=clustername \ --jobs '{ "hivejob": { "script": "PathToHiveScript/hiveScript.hql" }}' The --cluster-name parameter requires the name of a Hive or Hive on Spark cluster. The script parameter requires the absolute path and file name of the script file prefixed with file://. For example: --jobs '{ "hivejob": { "script": "file:///file/path/to/my/hivescript.hql" }}' Hive Job Example for a Cluster in Azure This example of a Hive job reads data from a CSV file and writes the same data, with the commas changed to colons, to an output folder in your Azure Data Lake Store (ADLS) account. Cloudera provides the job example files and input files that you need to run the jobs. To use the following example, set up an Azure Data Lake Store (ADLS) account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 65. To run the Hive job example: 1. Create a cluster to run the Hive job. You can run a Hive job on a Hive on MapReduce or Hive on Spark cluster. For more information about creating a cluster, see Creating a Cluster for Azure on page Create a Hive script file on your local drive. 40 Altus Data Engineering

This example uses the file name hivescript.hql.

3. Copy and paste the following script into the file:

DROP TABLE input;
DROP TABLE output;
CREATE EXTERNAL TABLE input(f1 STRING, f2 STRING, f3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/';
CREATE TABLE output(f1 STRING, f2 STRING, f3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION 'adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/output/';
INSERT OVERWRITE TABLE output SELECT * FROM input ORDER BY f1;

4. Modify the script and replace the name of the ADLS account with the name of the ADLS account you set up for Altus examples.
5. Run the following command:

altus dataeng submit-jobs \
--cluster-name=ClusterName \
--jobs '{
    "hivejob": {
        "script": "PathToHiveScript/hiveScript.hql"
    }}'

The --cluster-name parameter requires the name of a Hive or Hive on Spark cluster.

The script parameter requires the absolute path and file name of the script file prefixed with file://. For example:

--jobs '{ "hivejob": { "script": "file:///file/path/to/my/hivescript.hql" }}'

Submitting a MapReduce Job

You can use the following command to submit a MapReduce job:

altus dataeng submit-jobs --cluster-name ClusterName --jobs '{
    "mr2job": {
        "mainclass": "main.class.file",
        "jars": [
            "PathAndFilenameOfJar1",
            "PathAndFilenameOfJar2"
        ]
    }}'

Altus uses Oozie to run MapReduce2 jobs. When you submit a MapReduce2 job in Altus, Oozie launches a Java action to process the MapReduce2 job request. You can specify configuration settings for your job in an XML configuration file. To load the Oozie configuration settings into the MapReduce2 job, load the job XML file into the Java main class of the MapReduce2 application. For example, the following code snippet from a MapReduce2 application shows the oozie.action.conf.xml being loaded into the application:

public int run(String[] args) throws Exception {
    Job job = Job.getInstance(loadJobConfiguration(), "wordcount");
    ...
    // Launch MR2 Job
    ...
}

private Configuration loadJobConfiguration() {
    String ooziePreparedConfig = System.getProperty("oozie.action.conf.xml");
    if (ooziePreparedConfig != null) {
        // Oozie collects hadoop configs with job.xml into a single file.
        // So default config is not needed.
        Configuration actionConf = new Configuration(false);
        actionConf.addResource(new Path("file:///", ooziePreparedConfig));
        return actionConf;
    } else {
        return new Configuration(true);
    }
}

MapReduce Job Examples

The following examples show how to submit a MapReduce job to run on a cluster in AWS and in Azure.

MapReduce Job Example for a Cluster in AWS

The following example of a MapReduce job is available in the Cloudera Altus S3 bucket of job examples. The job reads input data from a poetry file in the Cloudera Altus example S3 bucket.

To use the example, set up an S3 bucket in your AWS account to write output data. Set the S3 bucket permissions to allow write access when you run the job.

You can use the following command to submit a MapReduce job to run the example:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
    "mr2job": {
        "mainclass": "com.cloudera.altus.sample.mr2.wordcount.wordcount",
        "jars": ["s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/program/altus-sample-mr2.jar"],
        "arguments": [
            "s3a://cloudera-altus-data-engineering-samples/mr2/wordcount/input/poetry/",
            "s3a://nameofoutputs3bucket/outputpath/"
        ]
    }}'

The --cluster-name parameter requires the name of a MapReduce cluster.

MapReduce Job Example for a Cluster in Azure

This example is a simple job that reads input data from a file and counts the words. Cloudera provides the job example files and input files that you need to run the jobs.

To use the following example, set up an Azure Data Lake Store (ADLS) account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 65.

You can use the following command to submit a MapReduce job to run the example:

altus dataeng submit-jobs \
--cluster-name ClusterName \
--jobs '{
    "mr2job": {
        "mainclass": "com.cloudera.altus.sample.mr2.wordcount.wordcount",
        "jars": ["adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/mr2/wordcount/program/altus-sample-mr2.jar"],
        "arguments": [
            "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/mr2/wordcount/input/poetry/",
            "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/mr2/wordcount/output/"

        ]
    }}'

The --cluster-name parameter requires the name of a MapReduce cluster.

Submitting a PySpark Job

You can use the following command to submit a PySpark job:

altus dataeng submit-jobs --cluster-name=ClusterName --jobs '{
    "name": "WordCountJob",
    "pysparkjob": {
        "mainpy": "PathAndFilenameOfthePySparkMainFile",
        "sparkarguments" : "SparkArgumentsRequiredForYourApplication",
        "pyfiles" : "PythonFilesRequiredForYourApplication",
        "applicationarguments": [
            "PathAndFilenameOfFile1",
            "PathAndFilenameOfFile2"
        ]
    }}'

You can include the applicationarguments parameter to pass values to the main method and the sparkarguments parameter to specify Spark configuration settings. If you use the applicationarguments and sparkarguments parameters, you must escape the list of arguments. Alternatively, you can put the arguments into a file and pass the path and file name with the arguments parameters.

The --cluster-name parameter requires the name of a Spark cluster.

The pyfiles parameter takes the path and file names of Python modules. For example:

"pyfiles" : ["s3a://path/to/module1.py", "s3a://path/to/module2.py"]

PySpark Job Examples

The following examples show how to submit a PySpark job to run on a cluster in AWS and in Azure.

PySpark Job Example for a Cluster in AWS

This example uses a PySpark job to count words in a text file and write the result to an S3 bucket that you specify. The Python file is available in a Cloudera Altus S3 bucket of job examples and also reads input data from the Cloudera Altus S3 bucket. You can create an S3 bucket in your account to write output data. The job in this example runs on a cluster with the Spark 2.2 service.

You can use the following command to submit a PySpark job to run the word count example:

altus dataeng submit-jobs \
--cluster-name=ClusterName \
--jobs '{
    "name": "Word Count Job",
    "pysparkjob": {
        "mainpy": "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/wordcount2.py",
        "sparkarguments" : "--executor-memory 1G --num-executors 2",
        "applicationarguments": [
            "s3a://cloudera-altus-data-engineering-samples/pyspark/wordcount/input/hadooppoem0.txt",
            "s3a://nameofoutputs3bucket/pathtooutputfile"
        ]
    }}'

If you need to use a specific Python environment for your PySpark job, you can use the --instance-bootstrap-script parameter to include a bootstrap script to install a custom Python environment when Altus creates the cluster.
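A bootstrap script is an ordinary shell script that runs on the cluster instances when Altus creates them. The following is an illustrative sketch only, not a script shipped with Altus; it assumes the instance image provides pip and has outbound internet access, and the package name is a placeholder for whatever your PySpark job imports:

#!/bin/bash
# Illustrative bootstrap script: install the Python packages the PySpark job needs.
sudo pip install numpy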

For an example of how to use a bootstrap script in the create-aws-cluster command to install a Python environment for a PySpark job, see Example: Creating a Cluster in AWS for a PySpark Job on page 20.

PySpark Job Example for a Cluster in Azure

This example uses a PySpark job to count words in a text file and write the result to an ADLS account that you specify. The job runs on a cluster with the Spark 2.2 service. Cloudera provides the job example files and input files that you need to run the jobs.

To use the following example, set up an Azure Data Lake Store (ADLS) account with permissions to allow read and write access when you run the Altus jobs. Then run the script that Altus provides to upload the files to the ADLS account so the job files and data files are available for your use. For instructions on uploading the jar file, see Sample Files Upload on page 65.

You can use the following command to submit a PySpark job to run the word count example:

altus dataeng submit-jobs \
--cluster-name=ClusterName \
--jobs '{
    "name": "Word Count Job",
    "pysparkjob": {
        "mainpy": "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/wordcount/wordcount2.py",
        "sparkarguments" : "--executor-memory 1G --num-executors 2",
        "applicationarguments": [
            "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/wordcount/input/hadooppoem0.txt",
            "adl://youradlsaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/pyspark/wordcount/outputfile"
        ]
    }}'

If you need to use a specific Python environment for your PySpark job, you can use the --instance-bootstrap-script parameter to include a bootstrap script to install a custom Python environment when Altus creates the cluster. For an example of how to use a bootstrap script in the create-azure-cluster command to install a Python environment for a PySpark job, see Example: Creating a Cluster in Azure for a PySpark Job on page 22.

Submitting a Job Group with Multiple Job Types

You can submit different types of jobs to run on a Multi type cluster. Use the following command to submit a group of jobs that includes a PySpark job, Hive job, and a MapReduce job:

altus dataeng submit-jobs --cluster-name MultiClusterName \
--job-submission-group-name "JobGroupName" \
--jobs '[
    {"pysparkjob": { "mainpy": "PathAndFilenameOfthePySparkMainFile" }},
    { "hivejob": { "script": "PathAndFilenameOfHQLScript" }},
    { "mr2job": { "mainclass": "main.class.file", "jars": ["PathAndFilenameOfJar1"] }}
]'
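The per-job options described in Job Queue on page 27 can also be carried in the job JSON when you submit a group, so that a failed job stops its dependents. The following sketch interrupts the queue if the first job fails; the failureAction field name and its placement alongside the job type are assumptions based on the CLI parameter described in that section, so confirm the exact syntax with altus dataeng submit-jobs help. The job names and script paths are placeholders:

altus dataeng submit-jobs \
--cluster-name MultiClusterName \
--job-submission-group-name "ETLWithDependencies" \
--jobs '[
    { "name": "Stage Data",
      "failureAction": "INTERRUPT_JOB_QUEUE",
      "hivejob": { "script": "file:///path/to/stage.hql" } },
    { "name": "Aggregate",
      "failureAction": "NONE",
      "hivejob": { "script": "file:///path/to/aggregate.hql" } }
]'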

45 Tutorial: Clusters and Jobs on AWS Tutorial: Clusters and Jobs on AWS This tutorial walks you through using the Altus console and CLI to create Altus Data Engineering clusters and submit jobs in Altus. The tutorial uses publicly available data that show the usage of Medicare procedure codes. Cloudera has created a publicly accessible S3 bucket for files used in Altus examples. This publicly accessible S3 bucket contains the data, scripts, and other artifacts used in the tutorial. You must create an S3 bucket in your AWS account to write output data. The tutorial has the following sections: Prerequisites To use this tutorial, you must have an Altus user account and the roles required to create clusters and run jobs in Altus. Altus Console Login on page 65 Log in to the Altus console to perform the exercises in this tutorial. Exercise 1: Installing the Altus Client on page 65 Learn how to install the Altus client and register an access key to use the CLI. Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs on page 48 Learn how to create a cluster with a Spark service and submit a Spark job using the Altus console and the CLI. This exercise provides instructions on how to create a SOCKS proxy and view the cluster and monitor the job in Cloudera Manager. It also shows you how to delete the cluster on the console. Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs on page 53 Learn how to create a cluster with a Hive service and submit a group of Hive jobs using the Altus console and the CLI. This exercise also walks you through the process of creating a SOCKS proxy and accessing Cloudera Manager. It also shows you how to delete the cluster on the console. Note: The tutorials in this section perform the same tasks as the sample applications provided for the Altus SDK for Java. For more information see Using the Altus SDK for Java. Prerequisites Before you start the tutorial, ensure that you have access to resources in your AWS account and an Altus user account with permission to create clusters and run jobs in Altus. The following are prerequisites for the tutorial: Altus user account, environment, and roles. An Altus user account allows you to log in to the Altus console and perform the exercises in the tutorial. An Altus administrator must assign an Altus environment to your user account so that you have access to resources in your AWS account. The Altus administrator must also assign roles to your user account to allow you to create clusters and run jobs in Altus. Public key. You must provide a public key for Altus to use when creating and configuring clusters in your AWS account. For more information about creating the SSH key in AWS, see Amazon EC2 Key Pairs. You can also create the SSH keys using other tools, such as ssh-keygen. S3 bucket for output.the tutorial provides read access to an S3 bucket that contains the jars, scripts, and input data used in the tutorial exercises. You must set up an S3 bucket in your AWS account for the output data generated by the jobs. The S3 bucket must have the permissions to allow write access when you run the Altus jobs. For more information about creating an S3 bucket in AWS, see Creating and Configuring an S3 Bucket. Altus Data Engineering 45
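If you do not already have an SSH key pair for the public key prerequisite, one way to generate one locally is with ssh-keygen; the file name below is only an example. You provide the contents of the .pub file to Altus when you create a cluster:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/altus-tutorial-key
# Display the public key to paste or upload in the Altus console:
cat ~/.ssh/altus-tutorial-key.pub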

46 Tutorial: Clusters and Jobs on AWS Altus Console Login To access the Altus console, go to the following URL: Log in to Altus with your Cloudera user account. After you are authenticated, the Altus console displays your home page. The Data Engineering section displays on the side navigation panel. If you have been assigned roles and an environment in Altus, you can click on Clusters and Jobs to create clusters and submit jobs as you follow the tutorial exercises. Exercise 1: Installing the Altus Client To use the Altus CLI, you must install the Altus client and configure the client with an access key. Altus manages access to the Altus services so that only users with a registered access key can run commands to create clusters, submit jobs, or use SDX namespaces. Generate and register an access key with the Altus client to create a credentials file so that you do not need submit your access key with each command. This exercise provides instructions to download and install the Altus client on Linux, generate a key, and run the CLI command to register the key. To set up the Cloudera Altus client, complete the following tasks: 1. Install the Altus client. 2. Configure the Altus client with an access key. Step 1. Install the Altus Client To avoid conflicts with older versions of Python or other packages, Cloudera recommends that you install the Cloudera Altus client in a virtual environment. Use the virtualenv tool to create a virtual environment and install the client. The following commands show how you can use pip to install the client on a virtual environment on Linux: mkdir ~/altusclienv virtualenv ~/altusclienv --no-site-packages source ~/altusclienv/bin/activate ~/altusclienv/bin/pip install altuscli To upgrade the client to the latest version, run the following command: ~/altusclienv/bin/pip install --upgrade altuscli After the client installation process is complete, run the following command to confirm that the Altus client is working: If virtualenv is activated: altus --version If virtualenv is not activated: ~/altusclienv/bin/altus --version Step 2. Configure the Altus Client with the API Access Key You use the Altus console to generate the access that you register with the client. Keep the window that displays the access key on the console open until you complete the key registration process. To create and set up the client with a Cloudera Altus API access key: 1. Sign in to the Cloudera Altus console: 2. Click your user account name and select My Account. 3. On the My Account page, click Generate Access Key. 46 Altus Data Engineering

47 Tutorial: Clusters and Jobs on AWS Altus creates the key and displays the information on the screen. The following image shows an example of an Altus API access key as displayed on the Altus console: Note: The Cloudera Altus console displays the API access key immediately after you create it. You must copy the access key information when it is displayed. Do not exit the console without copying the keys. After you exit the console, there is no other way to view or copy the access key. 4. On the command line, run the following command to configure the client with the access key: altus configure 5. Enter the following information at the prompt: Altus Access key. Copy and paste the access key ID that you generated in the Cloudera Altus console. Altus Private key. Copy and paste the private key that you generated in the Cloudera Altus console. The private key is a very long string of characters. Make sure that you enter the full string. The configuration utility creates the following file to store your user credentials: ~/.altus/credentials 6. To verify that the credentials were created correctly, run the following command: altus iam get-user The command displays your Altus client credentials. 7. After the credentials file is created, you can go back to the Cloudera Altus console and click OK to exit the access key window. Altus Data Engineering 47

48 Tutorial: Clusters and Jobs on AWS Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs This exercise shows you how to create a cluster with a Spark service on the Altus console and submit a Spark job on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the job on Cloudera Manager. In this exercise, you complete the following tasks: 1. Create a cluster with a Spark service on the console. 2. Submit a Spark job on the console. 3. Create a SOCKS proxy to access the Spark cluster on Cloudera Manager. 4. View the Spark cluster and verify the Spark job output. 5. Submit a Spark job using the CLI. 6. Terminate the Spark cluster Creating a Spark Cluster on the Console You must be logged in to the Altus console to perform this task. Note that it can take a while for Altus to complete the process of creating a cluster. To create a cluster on the console: 1. In the Data Engineering section of the side navigation panel, click Clusters. 2. On the Clusters page, click Create Cluster. 3. Create a cluster with the following configuration: Cluster Name Service Type CDH Version Environment Node Configuration To help you easily identify your cluster, use your first initial and last name as prefix for the cluster name. This tutorial uses the cluster name mjones-spark-tutorial as an example. Spark 2.x CDH 5.13 Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator. For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default setting. Credentials Configure your access credentials to Cloudera Manager: SSH Public Key If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code. Cloudera Manager User Set both the user name and password to guest. The following figure shows the Create Cluster page with the settings for this tutorial: 48 Altus Data Engineering

49 Tutorial: Clusters and Jobs on AWS 4. Verify that all required fields are set and click Create Cluster. The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters. Submitting a Spark Job Submit a Spark job to run on the cluster you created in the previous task. To submit a Spark job on the console: 1. In the Data Engineering section of the side navigation panel, click Jobs. 2. Click Submit Jobs. 3. On the Job Settings page, select Single job. 4. Select the Spark job type. 5. Create a Spark job with the following configuration: Job Name Main Class Jars Set the job name to Spark Medical Example. Set the main class to com.cloudera.altus.sample.medicare.transform Use the tutorial jar file: s3a:// cloudera-altus-data-engineering-samples/spark/medicare/program/ altus-sample-medicare-spark2x.jar Altus Data Engineering 49

50 Tutorial: Clusters and Jobs on AWS Application Arguments Set the application arguments to the S3 bucket to use for job input and output. Add the tutorial S3 bucket for the job input: s3a:// cloudera-altus-data-engineering-samples/spark/medicare/input/ Click + and add the S3 bucket you created for the job output: s3a://path/of/ The/Output/S3Bucket/ Cluster Settings Use an existing cluster and select the cluster that you created in the previous task. The following figure shows the Submit Jobs page with the settings for this tutorial: 6. Verify that all required fields are set and click Submit Jobs. The Altus Data Engineering service submits the job to run on the selected cluster in your AWS account. Creating a SOCKS Proxy for the Spark Cluster Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the cluster and progress of the job. To create a SOCKS proxy to access Cloudera Manager: 1. In the Data Engineering section of the side navigation panel, click Clusters. 2. On the Clusters page, find the cluster on which you submitted the job and click the cluster name. 3. On the cluster detail page, click View SOCKS Proxy CLI Command. Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Spark cluster that you created. 50 Altus Data Engineering

51 Tutorial: Clusters and Jobs on AWS 4. Click Copy. 5. On a terminal window, paste the command. 6. Modify the command to use the name of the cluster you created and your private key and run the command: altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="yourprivatekey" --open-cloudera-manager="yes" The Cloudera Manager Admin console opens in a Chrome browser. Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser. Viewing the Cluster and Verifying the Spark Job Output Log in to Cloudera Manager with the guest user account that you set up when you created the cluster. To view the cluster and monitor the job on the Cloudera Manager Admin console: 1. Log in to Cloudera Manager using guest as the account name and password. 2. On the Home page, click Clusters on the top navigation bar. 3. On the cluster window, select YARN Applications. The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console: Altus Data Engineering 51

52 Tutorial: Clusters and Jobs on AWS When your Spark job completes, you can view the output of the Spark job in the S3 bucket that you specified for your job output. The Spark job creates the following files in your output S3 bucket: 52 Altus Data Engineering

53 Tutorial: Clusters and Jobs on AWS Success (0 bytes) part (65.5 KB) part (69.5 KB) Note: If you want to use the same output S3 bucket for the next exercise, go to the AWS console and delete files in the S3 bucket. You will recreate the files when you submit the same Spark job using the CLI. Creating a Spark Job using the CLI You can submit the same Spark job to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager. To submit a Spark job using the CLI, run the following command: altus dataeng submit-jobs \ --cluster-name FirstInitialLastName-tutorialcluster \ --jobs '{ "sparkjob": { "jars": [ "s3a://cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar" ], "mainclass": "com.cloudera.altus.sample.medicare.transform", "applicationarguments": [ "s3a://cloudera-altus-data-engineering-samples/spark/medicare/input/", }}' ] "s3a://path/of/the/output/s3bucket/" To view the workload summary, go to the Cloudera Manager console and click Clusters > SPARK_ON_YARN-1. Cloudera Manager displays the same workload summary for this job as for the job that you submitted through the console. To verify the output, go to S3 bucket you specified for your job output and verify that it contains the files created by the Spark job: Success (0 bytes) part (65.5 KB) part (69.5 KB) Terminating the Cluster This task shows you how to terminate the cluster that you created for this tutorial. To terminate the cluster on the Altus console: 1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters. 2. On the Clusters page, click the name of the cluster that you created for this tutorial. 3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate. 4. Click Actions and select Delete Cluster. 5. Click OK to confirm that you want to terminate the cluster. Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs This exercise shows you how to create a cluster with a Hive service on the Altus console and submit Hive jobs on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the jobs on Cloudera Manager. In this exercise, you complete the following tasks: Altus Data Engineering 53

54 Tutorial: Clusters and Jobs on AWS 1. Create a cluster with a Hive service on the console. 2. Submit a group of Hive jobs on the console. 3. Create a SOCKS proxy to access the Hive cluster on Cloudera Manager 4. View the Hive cluster and verify the Hive job output. 5. Submit a group of Hive jobs using the CLI. 6. Terminate the Hive cluster Creating a Hive Cluster on the Console You must be logged in to the Altus console to perform this task. Note that it can take a while for Altus to complete the process of creating a cluster. To create a cluster with a Hive service on the console: 1. In the Data Engineering section of the side navigation panel, click Clusters. 2. On the Clusters page, click Create Cluster. 3. Create a cluster with the following configuration: Cluster Name Service Type CDH Version Environment Node Configuration To help you easily identify your cluster, use your first initial and last name as prefix for the cluster name. This tutorial uses the cluster name mjones-hive-tutorial as an example. Hive CDH 5.13 Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator. For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default setting. Credentials Configure your access credentials to Cloudera Manager: SSH Public Key If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code. Cloudera Manager User Set both the user name and password to guest. The following figure shows the Create Cluster page with the settings for this tutorial: 54 Altus Data Engineering

Submitting a Hive Job Group

Submit multiple Hive jobs as a group to run on the cluster that you created in the previous step.

To submit a job group on the console:
1. In the Data Engineering section of the side navigation panel, click Jobs.
2. Click Submit Jobs.
3. On the Job Settings page, select Group of jobs.
4. Select the Hive job type.
5. Set the Job Group Name to Hive Medical Example.
6. Click Add Hive Job.
7. Create a job with the following configuration:

Job Name: Set the job name to Create External Tables.
Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part1.hql
Hive Script Parameters: Select Hive Script Parameters and add the following variables and values:
  HOSPITALS_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/hospitals/
  READMISSIONS_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/

  EFFECTIVECARE_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/effectiveCare/
  GDP_PATH: s3a://cloudera-altus-data-engineering-samples/hive/data/GDP/
Action on Failure: Select Interrupt Job Queue.

The following figure shows the Add Job window with the settings for this job:

8. Click OK to add the job to the group. On the Submit Jobs page, Altus adds the Create External Tables job to the list of jobs in the group.
9. Click Add Hive Job.
10. Create a job with the following configuration:

Job Name: Set the job name to Clean Data.
Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part2.hql
Action on Failure: Select Interrupt Job Queue.

The following figure shows the Add Job window with the settings for this job:

11. Click OK. On the Submit Jobs page, Altus adds the Clean Data job to the list of jobs in the group.
12. Click Add Hive Job.
13. Create a job with the following configuration:

Job Name: Set the job name to Write Output.
Script: Select Script Path and enter the following script name: s3a://cloudera-altus-data-engineering-samples/hive/program/med-part3.hql
Hive Script Parameters: Select Hive Script Parameters and add the S3 bucket you created for the job output as a variable:
  OUTPUT_DIR: s3a://path/of/the/output/s3bucket/
Action on Failure: Select None.

The following figure shows the Add Job window with the settings for this job:

14. Click OK. On the Submit Jobs page, Altus adds the Write Output job to the list of jobs in the group.
15. In the Cluster Settings section, select Use existing and select the Hive cluster you created for this exercise. The list of clusters displayed includes only those clusters that can run Hive jobs.
16. Click Submit Jobs to run the job group on your Hive cluster.

Creating a SOCKS Proxy for the Hive Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the progress of the job.

To create a SOCKS proxy to access Cloudera Manager:
1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, find the cluster on which you submitted the Hive job group and click the cluster name.
3. On the cluster detail page, click View SOCKS Proxy CLI Command. Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Hive cluster that you created.

4. Click Copy.
5. In a terminal window, paste the command.
6. Modify the command to use the name of the cluster you created and your private key and then run the following command:

altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="YourPrivateKey" --open-cloudera-manager="yes"

The Cloudera Manager Admin console opens in a Chrome browser.

Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser.

Viewing the Hive Cluster and Verifying the Hive Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the Hive cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:
1. Log in to Cloudera Manager using guest as the account name and password.
2. On the Home page, click Clusters on the top navigation bar.
3. On the cluster window, select YARN Applications. The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:

4. Click Clusters on the top navigation bar and select the default Hive service named HIVE-1. Then click HiveServer2 Web UI. The following screenshots show the workload information that you can view for the Hive service:


5. When the jobs complete, go to the S3 bucket you specified for your job output and verify the file created by the Hive jobs. The Hive jobs create the following file in your output S3 bucket:

_0 (135.9 KB)

Creating a Hive Job Group using the CLI

You can submit the same group of Hive jobs to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a group of Hive jobs using the CLI, run the submit-jobs command and provide the list of jobs in the jobs parameter. Run it on the same cluster and use the same job group name.

Run the following command:

altus dataeng submit-jobs \
--cluster-name FirstInitialLastName-tutorialcluster \
--job-submission-group-name "Hive Medical Example" \
--jobs '[
  {
    "name": "Create External Tables",
    "failureAction": "INTERRUPT_JOB_QUEUE",
    "hiveJob": {
      "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part1.hql",
      "params": ["HOSPITALS_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/hospitals/",
                 "READMISSIONS_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/",
                 "EFFECTIVECARE_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/effectiveCare/",
                 "GDP_PATH=s3a://cloudera-altus-data-engineering-samples/hive/data/GDP/"]
    }},
  {
    "name": "Clean Data",
    "failureAction": "INTERRUPT_JOB_QUEUE",
    "hiveJob": {
      "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part2.hql"
    }},
  {
    "name": "Output Data",
    "failureAction": "NONE",
    "hiveJob": {
      "script": "s3a://cloudera-altus-data-engineering-samples/hive/program/med-part3.hql",
      "params": ["OUTPUT_DIR=s3a://path/of/the/output/s3bucket/"]
    }}
]'

You can go to the Cloudera Manager console to view the status of the Hive cluster and jobs:

To view the workload summary, click Clusters > SPARK_ON_YARN-1.
To view the job information, click Clusters > HIVE-1 > HiveServer2 Web UI.

Cloudera Manager displays the same workload summary and job queries for this job as for the job that you submitted through the console.

When the jobs complete, go to the S3 bucket you specified for your job output and verify the file created by the Hive jobs. The Hive job group creates the following file in your output S3 bucket:

_0 (135.9 KB)
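If you have the AWS CLI configured for the account that owns the output bucket, you can also confirm the output from the command line instead of the AWS console. This is an optional check rather than part of the Altus workflow, and the path below is the placeholder used throughout this tutorial; substitute the bucket and prefix you actually used.

aws s3 ls s3://path/of/the/output/s3bucket/ --recursive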

Terminating the Hive Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:
1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
2. On the Clusters page, click the name of the cluster that you created for this tutorial.
3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.
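You can also tear the cluster down from the command line. The subcommand shown below is an assumption based on the CLI cluster sections listed in the table of contents of this guide; the console steps above remain the path this tutorial demonstrates, so check the CLI reference for the exact syntax before relying on it.

altus dataeng delete-cluster --cluster-name mjones-hive-tutorial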

Tutorial: Clusters and Jobs on Azure

This tutorial walks you through using the Altus console and CLI to create Altus Data Engineering clusters and submit jobs in Altus. The tutorial uses publicly available data that show the usage of Medicare procedure codes.

You must set up an ADLS account to store the tutorial job examples and input data and to write output data. Cloudera has created a jar file that contains the job examples and input files that you need to successfully complete the tutorial. Before you start the exercises, upload the files to the ADLS account that you set up for the tutorial files.

The tutorial has the following sections:

Prerequisites
To use this tutorial, you must have an Altus user account and the roles required to create clusters and run jobs in Altus.

Sample Jar File Upload
Upload the files you need to complete the tutorial.

Altus Console Login on page 65
Log in to the Altus console to perform the exercises in this tutorial.

Exercise 1: Installing the Altus Client on page 65
Learn how to install the Altus client and register an access key to use the CLI.

Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs on page 67
Learn how to create a cluster with a Spark service and submit a Spark job using the Altus console and the CLI. This exercise provides instructions on how to create a SOCKS proxy and view the cluster and monitor the job in Cloudera Manager. It also shows you how to delete the cluster on the console.

Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs on page 72
Learn how to create a cluster with a Hive service and submit a group of Hive jobs using the Altus console and the CLI. This exercise also walks you through the process of creating a SOCKS proxy and accessing Cloudera Manager. It also shows you how to delete the cluster on the console.

Prerequisites

Before you start the tutorial, ensure that you have access to resources in your Azure subscription and an Altus user account with permission to create clusters and run jobs in Altus.

The following are prerequisites for the tutorial:

Altus user account, environment, and roles. An Altus user account allows you to log in to the Altus console and perform the exercises in the tutorial. An Altus administrator must assign an Altus environment to your user account so that you have access to resources in your Azure subscription. The Altus administrator must also assign roles to your user account to allow you to create clusters and run jobs in Altus.

Public key. When you create a cluster, provide an SSH public key that Altus can add to the cluster. You can then use the corresponding private key to access the cluster after the cluster is created.

Azure Data Lake Store (ADLS) account. Set up an ADLS account to store sample jobs and input data files for use in the tutorial. You also write job output to the same account. The ADLS account must be set up with permissions to allow read and write access when you run the Altus jobs. For more information about creating an ADLS account in Azure, see Get started with Azure Data Lake Store using the Azure portal.
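Before moving on, you can optionally confirm from the Azure CLI that the ADLS account you plan to use exists and is reachable. This check is not part of the tutorial itself; it assumes the Azure CLI (az) is available, as it is in the Azure Cloud Shell used in the next section, and it uses the same placeholder account name as the rest of this tutorial.

az dls account show --account YourADLSaccountname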

Sample Files Upload

Cloudera provides jar files that contain the Altus job example files and the input files used in the tutorial. Before you start the tutorial, upload the jar file to your ADLS account so the job examples and data are available for your use. Use the Azure Cloud Shell to upload the file.

To upload the jar file to your ADLS account, complete the following steps:
1. Follow the instructions in the Azure documentation to set up an Azure Cloud Shell with a bash environment.
2. Run the following command to download the altus_adls_upload_examples.sh script:

wget

You use the script to upload the files that you need for the tutorials to your ADLS account.
3. In the Azure Cloud Shell, follow the instructions in the Azure documentation to log in to Azure using the Azure CLI. The Azure CLI is installed with Azure Cloud Shell so you do not need to install it separately.
4. Run the script to upload the tutorial files to your ADLS account:

bash ./altus_adls_upload_examples.sh --adls-account YourADLSaccountname --adls-path cloudera-altus-data-engineering-samples

5. Verify that the tutorial examples and input data files are uploaded to your ADLS account in the Altus Data Engineering examples folder.

Altus Console Login

To access the Altus console, go to the following URL:

Log in to Altus with your Cloudera user account. After you are authenticated, the Altus console displays your home page. The Data Engineering section displays on the side navigation panel. If you have been assigned roles and an environment in Altus, you can click on Clusters and Jobs to create clusters and submit jobs as you follow the tutorial exercises.

Exercise 1: Installing the Altus Client

To use the Altus CLI, you must install the Altus client and configure the client with an access key. Altus manages access to the Altus services so that only users with a registered access key can run commands to create clusters, submit jobs, or use SDX namespaces. Generate and register an access key with the Altus client to create a credentials file so that you do not need to submit your access key with each command.

This exercise provides instructions to download and install the Altus client on Linux, generate a key, and run the CLI command to register the key.

To set up the Cloudera Altus client, complete the following tasks:
1. Install the Altus client.
2. Configure the Altus client with an access key.

Step 1. Install the Altus Client

To avoid conflicts with older versions of Python or other packages, Cloudera recommends that you install the Cloudera Altus client in a virtual environment. Use the virtualenv tool to create a virtual environment and install the client.

The following commands show how you can use pip to install the client in a virtual environment on Linux:

mkdir ~/altusclienv
virtualenv ~/altusclienv --no-site-packages
source ~/altusclienv/bin/activate
~/altusclienv/bin/pip install altuscli

To upgrade the client to the latest version, run the following command:

~/altusclienv/bin/pip install --upgrade altuscli

After the client installation process is complete, run the following command to confirm that the Altus client is working:

If virtualenv is activated: altus --version
If virtualenv is not activated: ~/altusclienv/bin/altus --version

Step 2. Configure the Altus Client with the API Access Key

You use the Altus console to generate the access key that you register with the client. Keep the window that displays the access key on the console open until you complete the key registration process.

To create and set up the client with a Cloudera Altus API access key:
1. Sign in to the Cloudera Altus console:
2. Click your user account name and select My Account.
3. On the My Account page, click Generate Access Key. Altus creates the key and displays the information on the screen. The following image shows an example of an Altus API access key as displayed on the Altus console:

Note: The Cloudera Altus console displays the API access key immediately after you create it. You must copy the access key information when it is displayed. Do not exit the console without copying the keys. After you exit the console, there is no other way to view or copy the access key.

4. On the command line, run the following command to configure the client with the access key:

altus configure

5. Enter the following information at the prompt:
  Altus Access key. Copy and paste the access key ID that you generated in the Cloudera Altus console.
  Altus Private key. Copy and paste the private key that you generated in the Cloudera Altus console. The private key is a very long string of characters. Make sure that you enter the full string.
The configuration utility creates the following file to store your user credentials: ~/.altus/credentials (a sketch of this file's layout follows this procedure).
6. To verify that the credentials were created correctly, run the following command:

altus iam get-user

The command displays your Altus client credentials.
7. After the credentials file is created, you can go back to the Cloudera Altus console and click OK to exit the access key window.
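For reference, the generated ~/.altus/credentials file follows a profile-style layout. The key names shown below are an assumption based on the conventions of similar CLIs rather than a guaranteed format, so treat this as a sketch and rely on the file that altus configure actually writes on your machine.

[default]
altus_access_key_id = YourAccessKeyId
altus_private_key = YourPrivateKey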

Exercise 2: Creating a Spark Cluster and Submitting Spark Jobs

This exercise shows you how to create a cluster with a Spark service on the Altus console and submit a Spark job on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the job on Cloudera Manager.

In this exercise, you complete the following tasks:
1. Create a cluster with a Spark service on the console.
2. Submit a Spark job on the console.
3. Create a SOCKS proxy to access the Spark cluster on Cloudera Manager.
4. View the Spark cluster and verify the Spark job output.
5. Submit a Spark job using the CLI.
6. Terminate the Spark cluster.

Creating a Spark Cluster on the Console

You must be logged in to the Altus console to perform this task. Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster on the console:
1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, click Create Cluster.
3. Create a cluster with the following configuration:

Cluster Name: To help you easily identify your cluster, use your first initial and last name as prefix for the cluster name. This tutorial uses the cluster name mjones-spark-tutorial as an example.
Service Type: Spark 2.x
CDH Version: CDH 5.14
Environment: Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.
Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default setting.
Credentials: Configure your access credentials to Cloudera Manager:
  SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
  Cloudera Manager User: Set both the user name and password to guest.

4. Verify that all required fields are set and click Create Cluster. The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.

Submitting a Spark Job

Submit a Spark job to run on the cluster you created in the previous task.

To submit a Spark job on the console:
1. In the Data Engineering section of the side navigation panel, click Jobs.
2. Click Submit Jobs.
3. On the Job Settings page, select Single job.
4. Select the Spark job type.
5. Create a Spark job with the following configuration:

Job Name: Set the job name to Spark Medical Example.
Main Class: Set the main class to com.cloudera.altus.sample.medicare.transform
Jars: Use the tutorial jar file: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar
Application Arguments: Set the application arguments to the ADLS paths to use for job input and output.
  Add the tutorial ADLS path for the job input: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/
  Click + and add the ADLS path for the job output: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output
Cluster Settings: Use an existing cluster and select the cluster that you created in the previous task.

The following figure shows the Submit Jobs page with the settings for this tutorial:

6. Verify that all required fields are set and click Submit Jobs. The Altus Data Engineering service submits the job to run on the selected cluster in your Azure subscription.

Creating a SOCKS Proxy for the Spark Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the cluster and progress of the job.

To create a SOCKS proxy to access Cloudera Manager:
1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, find the cluster on which you submitted the job and click the cluster name.
3. On the cluster detail page, click View SOCKS Proxy CLI Command. Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Spark cluster that you created.

4. Click Copy.
5. In a terminal window, paste the command.
6. Modify the command to use the name of the cluster you created and your private key and run the command:

altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="YourPrivateKey" --open-cloudera-manager="yes"

The Cloudera Manager Admin console opens in a Chrome browser.

Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser.

Viewing the Cluster and Verifying the Spark Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:
1. Log in to Cloudera Manager using guest as the account name and password.
2. On the Home page, click Clusters on the top navigation bar.
3. On the cluster window, select YARN Applications. The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:

When your Spark job completes, you can view the output of the Spark job in the ADLS account that you specified for your job output. The Spark job creates the following files in your ADLS output folder:

Success (0 bytes)
part (65.5 KB)
part (69.5 KB)

Note: If you want to use the same ADLS output folder for the next exercise, go to the Azure portal and delete the files in the ADLS output folder. You will recreate the files when you submit the same Spark job using the CLI.

Creating a Spark Job using the CLI

You can submit the same Spark job to run on the same cluster using the CLI. If you want to view the cluster and monitor the job on Cloudera Manager, stay logged in to Cloudera Manager.

To submit a Spark job using the CLI, run the following command:

altus dataeng submit-jobs \
--cluster-name FirstInitialLastName-tutorialcluster \
--jobs '{
  "sparkJob": {
    "jars": [ "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/program/altus-sample-medicare-spark2x.jar" ],
    "mainClass": "com.cloudera.altus.sample.medicare.transform",
    "applicationArguments": [
      "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/input/",
      "adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/spark/medicare/output"
    ]
  }}'

To view the workload summary, go to the Cloudera Manager console and click Clusters > SPARK_ON_YARN-1. Cloudera Manager displays the same workload summary for this job as for the job that you submitted through the console.

To verify the output, go to the ADLS account that you specified for your job output and verify that it contains the files created by the Spark job:

Success (0 bytes)
part (65.5 KB)
part (69.5 KB)

Terminating the Cluster

This task shows you how to terminate the cluster that you created for this tutorial.

To terminate the cluster on the Altus console:
1. On the Altus console, go to the Data Engineering section of the side navigation panel and click Clusters.
2. On the Clusters page, click the name of the cluster that you created for this tutorial.
3. On the Cluster details page, review the cluster information to verify that it is the cluster that you want to terminate.
4. Click Actions and select Delete Cluster.
5. Click OK to confirm that you want to terminate the cluster.
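If you prefer the command line to the Azure portal for checking the job output from this exercise, the Azure Cloud Shell used earlier can list the output folder directly. This is an optional check and assumes the same placeholder ADLS account name and output path used throughout the tutorial; substitute the values you actually used.

az dls fs list --account YourADLSaccountname --path /cloudera-altus-data-engineering-samples/spark/medicare/output --output table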

Exercise 3: Creating a Hive Cluster and Submitting Hive Jobs

This exercise shows you how to create a cluster with a Hive service on the Altus console and submit Hive jobs on the console and the command line. It also shows you how to create a SOCKS proxy and access the cluster and view the progress of the jobs on Cloudera Manager.

In this exercise, you complete the following tasks:
1. Create a cluster with a Hive service on the console.
2. Submit a group of Hive jobs on the console.
3. Create a SOCKS proxy to access the Hive cluster on Cloudera Manager.
4. View the Hive cluster and verify the Hive job output.
5. Submit a group of Hive jobs using the CLI.
6. Terminate the Hive cluster.

Creating a Hive Cluster on the Console

You must be logged in to the Altus console to perform this task. Note that it can take a while for Altus to complete the process of creating a cluster.

To create a cluster with a Hive service on the console:
1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, click Create Cluster.
3. Create a cluster with the following configuration:

Cluster Name: To help you easily identify your cluster, use your first initial and last name as prefix for the cluster name. This tutorial uses the cluster name mjones-hive-tutorial as an example.
Service Type: Hive
CDH Version: CDH 5.14
Environment: Name of the Altus environment to which you have been given access for this tutorial. If you do not know which Altus environment to select, check with your Altus administrator.
Node Configuration: For the Worker node configuration, set the Number of Nodes to 3. Leave the rest of the node properties with their default setting.
Credentials: Configure your access credentials to Cloudera Manager:
  SSH Public Key: If you have your public key in a file, select File Upload and choose the key file. If you have the key available for pasting on screen, select Direct Input to enter the full key code.
  Cloudera Manager User: Set both the user name and password to guest.

4. Verify that all required fields are set and click Create Cluster. The Altus Data Engineering service creates a CDH cluster with the configuration you set. On the Clusters page, the new cluster displays at the top of the list of clusters.

Submitting a Hive Job Group

Submit multiple Hive jobs as a group to run on the cluster that you created in the previous step.

To submit a job group on the console:
1. In the Data Engineering section of the side navigation panel, click Jobs.
2. Click Submit Jobs.

3. On the Job Settings page, select Group of jobs.
4. Select the Hive job type.
5. Set the Job Group Name to Hive Medical Example.
6. Click Add Hive Job.
7. Create a job with the following configuration:

Job Name: Set the job name to Create External Tables.
Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part1.hql
Hive Script Parameters: Select Hive Script Parameters and add the following variables and values:
  HOSPITALS_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/hospitals/
  READMISSIONS_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/readmissionsDeath/
  EFFECTIVECARE_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/effectiveCare/
  GDP_PATH: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/GDP/
Action on Failure: Select Interrupt Job Queue.

The following figure shows the Add Job window with the settings for this job:

8. Click OK to add the job to the group.

On the Submit Jobs page, Altus adds the Create External Tables job to the list of jobs in the group.
9. Click Add Hive Job.
10. Create a job with the following configuration:

Job Name: Set the job name to Clean Data.
Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part2.hql
Action on Failure: Select Interrupt Job Queue.

The following figure shows the Add Job window with the settings for this job:

11. Click OK. On the Submit Jobs page, Altus adds the Clean Data job to the list of jobs in the group.
12. Click Add Hive Job.
13. Create a job with the following configuration:

Job Name: Set the job name to Write Output.
Script: Select Script Path and enter the following script name: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/program/med-part3.hql
Hive Script Parameters: Select Hive Script Parameters and add the ADLS folder that you created for the job output as a variable:
  OUTPUT_DIR: adl://YourADLSaccountname.azuredatalakestore.net/cloudera-altus-data-engineering-samples/hive/data/output/

Action on Failure: Select None.

The following figure shows the Add Job window with the settings for this job:

14. Click OK. On the Submit Jobs page, Altus adds the Write Output job to the list of jobs in the group.
15. In the Cluster Settings section, select Use existing and select the Hive cluster you created for this exercise. The list of clusters displayed includes only those clusters that can run Hive jobs.
16. Click Submit Jobs to run the job group on your Hive cluster.

Creating a SOCKS Proxy for the Hive Cluster

Use the Altus CLI to create a SOCKS proxy to log in to Cloudera Manager and view the progress of the job.

To create a SOCKS proxy to access Cloudera Manager:
1. In the Data Engineering section of the side navigation panel, click Clusters.
2. On the Clusters page, find the cluster on which you submitted the Hive job group and click the cluster name.
3. On the cluster detail page, click View SOCKS Proxy CLI Command. Altus displays the command that you can use to create a SOCKS proxy to log in to the Cloudera Manager instance for the Hive cluster that you created.

4. Click Copy.
5. In a terminal window, paste the command.
6. Modify the command to use the name of the cluster you created and your private key and then run the following command:

altus dataeng socks-proxy --cluster-name "YourClusterName" --ssh-private-key="YourPrivateKey" --open-cloudera-manager="yes"

The Cloudera Manager Admin console opens in a Chrome browser.

Note: The command includes a parameter to open Cloudera Manager in a Chrome browser. If you do not use Chrome, remove the open-cloudera-manager parameter so that the command displays instructions for accessing the Cloudera Manager URL from any browser.

Viewing the Hive Cluster and Verifying the Hive Job Output

Log in to Cloudera Manager with the guest user account that you set up when you created the Hive cluster.

To view the cluster and monitor the job on the Cloudera Manager Admin console:
1. Log in to Cloudera Manager using guest as the account name and password.
2. On the Home page, click Clusters on the top navigation bar.
3. On the cluster window, select YARN Applications. The following screenshots show the cluster services and workload information that you can view on the Cloudera Manager Admin console:

4. Click Clusters on the top navigation bar and select the default Hive service named HIVE-1. Then click HiveServer2 Web UI. The following screenshots show the workload information that you can view for the Hive service:



More information

User s Manual. Version 5

User s Manual. Version 5 User s Manual Version 5 Copyright 2017 Safeway. All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language,

More information

ElasterStack 3.2 User Administration Guide - Advanced Zone

ElasterStack 3.2 User Administration Guide - Advanced Zone ElasterStack 3.2 User Administration Guide - Advanced Zone With Advance Zone Configuration TCloud Computing Inc. 6/22/2012 Copyright 2012 by TCloud Computing, Inc. All rights reserved. This document is

More information

QUICK START: VERITAS STORAGE FOUNDATION BASIC FOR AMAZON EC2

QUICK START: VERITAS STORAGE FOUNDATION BASIC FOR AMAZON EC2 QUICK START: VERITAS STORAGE FOUNDATION BASIC FOR AMAZON EC2 Quick Start Guide for Using Symantec's Veritas Storage Foundation Basic for Amazon EC2 Quick Start Guide for Using Symantec's Veritas Storage

More information

Amazon Web Services Monitoring Integration User Guide

Amazon Web Services Monitoring Integration User Guide Amazon Web Services Monitoring Integration User Guide Functional Area: Amazon Web Services Monitoring Integration Geneos Release: v4.9 Document Version: v1.0.0 Date Published: 29 October 2018 Copyright

More information

HYCU SCOM Management Pack for Nutanix

HYCU SCOM Management Pack for Nutanix HYCU SCOM Management Pack for Nutanix Product version: 2.5 Product release date: May 2018 Document edition: First Legal notices Copyright notice 2016-2018 HYCU. All rights reserved. This document contains

More information

HYCU SCOM Management Pack for F5 BIG-IP

HYCU SCOM Management Pack for F5 BIG-IP USER GUIDE HYCU SCOM Management Pack for F5 BIG-IP Product version: 5.4 Product release date: May 2018 Document edition: First Legal notices Copyright notice 2015-2018 HYCU. All rights reserved. This document

More information

Ansible Tower Quick Setup Guide

Ansible Tower Quick Setup Guide Ansible Tower Quick Setup Guide Release Ansible Tower 2.4.5 Red Hat, Inc. Jun 06, 2017 CONTENTS 1 Quick Start 2 2 Login as a Superuser 3 3 Import a License 4 4 Examine the Tower Dashboard 6 5 The Setup

More information

Cloudera Navigator Data Management

Cloudera Navigator Data Management Cloudera Navigator Data Management Important Notice 2010-2019 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document

More information

Copyright 2017 Trend Micro Incorporated. All rights reserved.

Copyright 2017 Trend Micro Incorporated. All rights reserved. Information in this document is subject to change without notice. The names of companies, products, people, characters, and/or data mentioned herein are fictitious and are in no way intended to represent

More information

USER GUIDE. Backup and Recovery for Nutanix

USER GUIDE. Backup and Recovery for Nutanix USER GUIDE Backup and Recovery for Nutanix Version: 2.0.1 Product release date: February 2018 Document release date: February 2018 Legal notices Copyright notice 2017 2018 Comtrade Software. All rights

More information

Quest VROOM Quick Setup Guide for Quest Rapid Recovery and Foglight Windows Installers

Quest VROOM Quick Setup Guide for Quest Rapid Recovery and Foglight Windows Installers Quest VROOM Quick Setup Guide for Quest Rapid Recovery and Foglight Windows Installers INTRODUCTION Setup of Quest VROOM requires installation of Rapid Recovery and Foglight for Virtualization on two separate

More information

Application Guide. Connection Broker. Advanced Connection and Capacity Management For Hybrid Clouds

Application Guide. Connection Broker. Advanced Connection and Capacity Management For Hybrid Clouds Application Guide Connection Broker Advanced Connection and Capacity Management For Hybrid Clouds Version 9.0 June 2018 Contacting Leostream Leostream Corporation 271 Waverley Oaks Rd Suite 206 Waltham,

More information

Windows Server 2012: Server Virtualization

Windows Server 2012: Server Virtualization Windows Server 2012: Server Virtualization Module Manual Author: David Coombes, Content Master Published: 4 th September, 2012 Information in this document, including URLs and other Internet Web site references,

More information

Microsoft Office Groove Server Groove Manager. Domain Administrator s Guide

Microsoft Office Groove Server Groove Manager. Domain Administrator s Guide Microsoft Office Groove Server 2007 Groove Manager Domain Administrator s Guide Copyright Information in this document, including URL and other Internet Web site references, is subject to change without

More information

VMware Infrastructure 3 Primer Update 2 and later for ESX Server 3.5, ESX Server 3i version 3.5, VirtualCenter 2.5

VMware Infrastructure 3 Primer Update 2 and later for ESX Server 3.5, ESX Server 3i version 3.5, VirtualCenter 2.5 Update 2 and later for ESX Server 3.5, ESX Server 3i version 3.5, VirtualCenter 2.5 VMware Infrastructure 3 Primer Revision: 20090313 Item: EN-000021-02 You can find the most up-to-date technical documentation

More information

Vaultive and SafeNet KeySecure KMIP Integration Guide v1.0. September 2016

Vaultive and SafeNet KeySecure KMIP Integration Guide v1.0. September 2016 Vaultive and SafeNet KeySecure KMIP Integration Guide v1.0 September 2016 2016 Vaultive Inc. All rights reserved. Published in the U.S.A. This documentation contains proprietary information belonging to

More information

Mission Control for the Microsoft Cloud. 5nine Cloud Security. Web Portal Version 12.o. Getting Started Guide

Mission Control for the Microsoft Cloud. 5nine Cloud Security. Web Portal Version 12.o. Getting Started Guide Mission Control for the Microsoft Cloud 5nine Cloud Security Web Portal Version 12.o Getting Started Guide 2018 5nine Software Inc. All rights reserved. All trademarks are the property of their respective

More information