EMC ISILON HADOOP STARTER KIT


Deploying IBM BigInsights v4.0 with EMC Isilon

Boni Bruno, CISSP, CISM, CGEIT
Chief Solutions Architect

October 2015

#RememberRuddy

To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized reseller, visit EMC.com, or explore and compare products in the EMC Store.

Copyright 2015 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice. The information in this publication is provided "as is". EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any EMC software described in this publication requires an applicable software license. For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. EMC and the EMC logo are registered trademarks or trademarks of EMC Corporation in the United States and/or other jurisdictions. All other trademarks used herein are the property of their respective owners.

Contents

INTRODUCTION
    IBM & EMC Technology Highlights
    Audience
    Apache Hadoop Projects
    IBM Open Platform and the Ambari Manager
    Isilon Scale-Out NAS for HDFS
    Overview of Isilon Scale-Out NAS for Big Data
PRE-INSTALLATION CHECKLIST
    Supported Software Versions
    Hardware Requirements and Suggested Hadoop Service Layout
INSTALLATION OVERVIEW
    Prerequisites
        Isilon Scale-Out NAS or Isilon OneFS Simulator
        Linux
        Networking
        DNS
        Other
    Prepare Isilon
        Assumptions
        SmartConnect for HDFS
        OneFS Access Zones
        Sharing Data between Access Zones
        User & Group IDs
        Configuring Isilon for HDFS
        Create DNS Records for Isilon
    Prepare Linux Compute Nodes
        Linux Operating System Packages Needed for IBM BigInsights
        Enable NTP on All Linux Compute Nodes
        Disable SELinux on Each Node Before Installing Ambari
        Check UMASK Settings
        Set ulimit Properties
        Kernel Modifications
        Create IBM BigInsights Hadoop Users and Groups
        Configure Passwordless SSH
        Additional Linux Packages to Install
        Test DNS Resolution
        Edit sudoers File on All Linux Compute Nodes
INSTALLING IBM OPEN PLATFORM (OP)
    Download IBM Open Platform Software
    Create IBM Open Platform Repository
    Validating IBM Open Platform Install
    Adding a Hadoop User
    Additional Service Tests
        HDFS
        YARN/MAPREDUCE
        HIVE
        HBASE
    Ambari Service Check
INSTALLING IBM VALUE PACKAGES
    Before You Begin
    Installation Procedure
    Select IBM BigInsights Service to Install
    Installing BigInsights Home
    Configure Knox
    Installing BigSheets
    Installing Big SQL
        Connecting to Big SQL
        Running JSqsh
        Connection Setup
        Commands and Queries
        Command and Query Edit
        Configuration Variables
    Installing Text Analytics
    Installing Big R
    IBM BigInsights Online Tutorials
SECURITY CONFIGURATION AND ADMINISTRATION
    Setting up HTTPS for Ambari
    Configuring SSL Support for HBase REST Gateway with Knox
    Overview of Kerberos
    Enabling Kerberos for IBM Open Platform
    Manually Generating Keytabs for Kerberos Authentication
    Setting up Active Directory or LDAP Authentication in Ambari
    Enabling Kerberos for HDFS on Isilon
        Using MIT Kerberos
        Running the Ambari Kerberos Wizard
    Troubleshooting and Support

EMC Isilon Hadoop Starter Kit for IBM BigInsights v4.0

This document describes how to create a Hadoop environment utilizing IBM Open Platform with Apache Hadoop and EMC Isilon scale-out network-attached storage (NAS) for HDFS-accessible shared storage. Installation and configuration of the IBM BigInsights value packages is also presented in this document.

Introduction

IBM & EMC Technology Highlights

The IBM Open Platform with Apache Hadoop is comprised entirely of Apache Hadoop open source components, such as Apache Ambari, YARN, Spark, Knox, Slider, Sqoop, Flume, Hive, Oozie, HBase, ZooKeeper, and more. After installing IBM Open Platform, you can install additional IBM value-add service modules. These value-add service modules are installed separately; they include IBM BigInsights Analyst, IBM BigInsights Data Scientist, and the IBM BigInsights Enterprise Management module, and they provide enhanced capabilities to IBM Open Platform that accelerate the conversion of all types of data into business insight and action.

The EMC Isilon scale-out network-attached storage (NAS) platform provides Hadoop clients with direct access to big data through a Hadoop Distributed File System (HDFS) interface. Powered by the distributed EMC Isilon OneFS operating system, an EMC Isilon cluster delivers a powerful yet simple and highly efficient storage platform with native HDFS integration to accelerate analytics, gain new flexibility, and avoid the costs of a separate Hadoop infrastructure.

Audience

This document is intended for IT program managers, IT architects, developers, and IT management who want to deploy IBM BigInsights v4.0 with EMC Isilon OneFS for HDFS storage. If a physical EMC Isilon cluster is not available, you can download the free EMC Isilon OneFS Simulator, which can be installed as a virtual machine for integration testing and training purposes.

Apache Hadoop Projects

Apache Hadoop is an open source, batch data processing system for enormous amounts of data. Hadoop runs as a platform that provides cost-effective, scalable infrastructure for building Big Data analytic applications. All Hadoop clusters contain a distributed file system called the Hadoop Distributed File System (HDFS) and a computation layer called MapReduce.

The Apache Hadoop project contains the following subprojects:

Hadoop Distributed File System (HDFS) - A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce - A software framework for writing applications to reliably process large amounts of data in parallel across a cluster.

Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive, Sqoop, Flume, Oozie, Slider, HBase, ZooKeeper, and more, that extend the value of Hadoop and improve its usability. Version 2 of Apache Hadoop introduces YARN, a sub-project of Hadoop that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce. For full details, see the Apache Hadoop project website.

IBM Open Platform and the Ambari Manager

The IBM Open Platform with Apache Hadoop enables enterprise Hadoop by providing the complete set of essential Hadoop capabilities required for any enterprise. Utilizing YARN at its core, it provides capabilities for several functional areas, including data management, data access, data governance, integration, security, and operations. IBM Open Platform delivers the core elements of Hadoop, scalable storage and distributed computing, as well as the necessary enterprise capabilities such as security, high availability, and integration with a broad range of hardware and software solutions.

Apache Ambari is an open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters. As of version 4.0 of IBM Open Platform, Ambari can be used to set up and deploy Hadoop clusters for nearly any task. Ambari can provision, manage, and monitor every aspect of a Hadoop deployment. More information on IBM Open Platform can be found on the IBM website.

Isilon Scale-Out NAS for HDFS

EMC Isilon is the only scale-out NAS platform natively integrated with the Hadoop Distributed File System (HDFS). Using HDFS as an over-the-wire protocol, you can deploy a powerful, efficient, and flexible data storage and analytics ecosystem. In addition to native integration with HDFS, EMC Isilon storage easily scales to support massively large Hadoop analytics projects. Isilon scale-out NAS also offers the unmatched simplicity, efficiency, flexibility, and reliability that you need to maximize the value of your Hadoop data storage and analytics workflow investment.

Overview of Isilon Scale-Out NAS for Big Data

The EMC Isilon scale-out platform combines modular hardware with unified software to provide the storage foundation for data analysis. Isilon scale-out NAS is a fully distributed system that consists of nodes of modular hardware arranged in a cluster. The distributed Isilon OneFS operating system combines the memory, I/O, CPUs, and disks of the nodes into a cohesive storage unit to present a global namespace as a single file system. The nodes work together as peers in a shared-nothing hardware architecture with no single point of failure. Every node adds capacity, performance, and resiliency to the cluster, and each node acts as a Hadoop namenode and datanode. The namenode daemon is a distributed process that runs on all the nodes in the cluster. A compute client can connect to any node through HDFS. As nodes are added, the file system expands dynamically and redistributes data, eliminating the work of partitioning disks and creating volumes. The result is a highly efficient and resilient storage architecture that brings all the advantages of an enterprise scale-out NAS system to storing data for analysis.

With traditional direct-attached storage, the ratio of CPU, RAM, and disk space requirements depends on the workload; these factors make it difficult to size a Hadoop cluster before you have had a chance to measure your MapReduce workload. Expanding data sets also make upfront sizing decisions problematic. Isilon scale-out NAS lends itself perfectly to this situation: it lets you increase CPUs, RAM, and disk space by adding nodes to dynamically match storage capacity and performance with the demands of a dynamic Hadoop workload.

An Isilon cluster also optimizes data protection. OneFS protects data more efficiently and reliably than HDFS. The HDFS protocol, by default, replicates a block of data three times. In contrast, OneFS stripes the data across the cluster and protects the data with forward error correction codes, which consume less space than replication while providing better protection.

Pre-installation Checklist

Supported Software Versions

The environment used for this document consists of the following software versions:

- Ambari 1.7.0_IBM
- IBM Open Platform v4.0
- Isilon OneFS with the Ambari support patch installed
- All of the IBM BigInsights v4.0 value packs, i.e. Business Analyst, Data Scientist, and Enterprise Management

Note: IBM BigInsights v4.0 requires OneFS with the Ambari support patch installed; later OneFS versions should also work when available. Do not install IBM BigInsights with earlier OneFS versions. See the EMC Isilon Supportability and Compatibility Guide for the latest compatibility updates.

Hardware Requirements and Suggested Hadoop Service Layout

Detailed system requirements for IBM BigInsights compute nodes can be found on the IBM website.

In a multi-node IBM BigInsights cluster, it is suggested that you have at least one management node in a non-high availability environment, if performance is not an issue. If performance is a concern, consider configuring at least three management nodes. If you use the BigInsights - Big SQL service, consider configuring four management nodes. If you use a high availability environment, consider six management nodes. Use the following list as a guide for the nodes in your IBM/EMC cluster. A suggested layout is shown in Table 1 for both non-high availability and high availability deployments.

Note: With both deployment options, EMC Isilon provides namenode, secondary namenode, and datanode functions for the entire cluster. Do not designate any compute node as a namenode, secondary namenode, or datanode in any aspect of the IBM BigInsights configuration.

Table 1. Suggested Service Layout

Non-High availability:

Management node 1: Ambari, PostgreSQL, Knox, ZooKeeper, Hive, Spark, Spark History Server, BigInsights Home, BigSheets, Big R, Big SQL Head node, Text Analytics
Management node 2: Resource Manager, HBase Master, ZooKeeper, Oozie, Ambari monitoring service
Management node 3: Job History Server, ZooKeeper, App Timeline Server, Kafka
Management node 4: Big SQL Scheduler, Hive Server (MySQL), MySQL metastore, Hive/Oozie metastore, WebHCat Server, Data Server Manager

High availability:

Management node 1: Ambari, PostgreSQL, Spark, Spark History Server, Big SQL Head node
Management node 2: Resource Manager, ZooKeeper, Oozie, Ambari monitoring service, BigInsights Home
Management node 3: Resource Manager (standby), Job History Server, ZooKeeper, App Timeline Server, Kafka, Oozie (standby)
Management node 4: Big SQL Scheduler, HBase Master (standby), Hive Server, MySQL Server, Hive metastore, WebHCat Server, Data Server Manager
Management node 5: Big SQL Head node (standby), Big SQL Scheduler (standby), HBase Master, Hive Server (standby), Hive metastore (standby), Journal Node, ZooKeeper

Installation Overview

Below is an overview of the installation process that this document describes:

1. Confirm prerequisites.
2. Prepare your network infrastructure, including DNS.
3. Prepare your Isilon cluster.
4. Prepare Linux compute nodes.
5. Install the Ambari server.
6. Use the Ambari Manager to deploy IBM Open Platform to the compute nodes.
7. Install the IBM BigInsights value packages.
8. Perform key functional tests.

Prerequisites

Isilon Scale-Out NAS or Isilon OneFS Simulator

For low-capacity, non-performance testing of Isilon, the EMC Isilon OneFS Simulator can be used instead of a cluster of physical Isilon appliances. It can be downloaded for free from EMC; refer to the EMC Isilon OneFS Simulator Install Guide for details. Be sure to follow the section for running the virtual nodes in VMware ESX. Only a single virtual node is required, but adding additional nodes will allow you to explore other features such as data protection, SmartPools (tiering), and SmartConnect (network load balancing).

For physical Isilon nodes, you should have already completed the console-based installation process for your first Isilon node and added two other nodes for a minimum of three Isilon nodes. You should have the required OneFS version and patch installed on Isilon.

You must obtain a OneFS HDFS license code and install it on your Isilon cluster. A free OneFS HDFS license is available from EMC.

It is recommended, but not required, to have a SmartConnect Advanced license for your Isilon cluster.

To allow scripts and other small files to be easily shared between all nodes in your environment, it is highly recommended to enable NFS (UNIX sharing) on your Isilon cluster. By default, the entire /ifs directory is already exported, and this can remain unchanged. This document assumes that a single Isilon cluster is used for this NFS export as well as for HDFS. However, there is no requirement that the NFS export be on the same Isilon cluster that you are using for HDFS.

Linux

- Red Hat Enterprise Linux (RHEL) Server 6 (Update 5 minimum) or a comparable CentOS Server.
- 100 GB root partition.
- At a minimum, 96 GB RAM for production environments. The more RAM the better for Hadoop.

Networking

For the best performance, a single 10 Gigabit Ethernet switch should connect to at least one 10 Gigabit port on each Linux host. Additionally, the same switch should connect to at least one 10 Gigabit port on each Isilon node. A single dedicated layer-2 network can be used to connect all hosts and Isilon nodes, although multiple networks can be used for increased security, monitoring, and robustness.

At least an entire /24 IP address block should be allocated to your network. This will allow a DNS reverse lookup zone to be delegated to your Hadoop DNS server.

If using the EMC Isilon OneFS Simulator, you will need at least two static IP addresses (one for the node's ext-1 interface, another for the SmartConnect service IP). Each additional Isilon node will require an additional IP address.

At a minimum, you will need to allocate to your Isilon cluster one IP address per access zone per Isilon node. In general, you will need one access zone for each separate Hadoop cluster that will use Isilon for HDFS storage. For the best possible load balancing during an Isilon node failure scenario, the recommended number of IP addresses is given by the formula below. Of course, this is in addition to any IP addresses used for non-HDFS pools.

# of IP addresses = 2 * (# of Isilon nodes) * (# of access zones)

For example, 20 IP addresses are recommended for 5 Isilon nodes and 2 access zones.

This document assumes that Internet access is available to all servers to download various components from Internet repositories.

DNS

A DNS server is required, and you must have the ability to create DNS records and zone delegations. It is recommended that your DNS server delegate a subdomain to your Isilon cluster. For instance, DNS requests for subnet0-pool0.isiloncluster1.example.com or isiloncluster1.example.com should be delegated to the service IP defined on your Isilon cluster.

To allow for a convenient way of changing the HDFS NameNode used by all Hadoop applications and services, create a DNS record for your Isilon cluster's HDFS NameNode service. This should be a CNAME alias to your Isilon SmartConnect zone. Specify a TTL of 1 minute to allow for quick changes. For example, create a CNAME record for mycluster1-hdfs.example.com that targets subnet0-pool0.isiloncluster1.example.com.
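As a sketch, the corresponding records on a BIND-style DNS server might look like the following. The host names are the examples used above; the 10.1.1.100 service IP is an assumption for illustration only.

; Delegate the Isilon subdomain to the cluster's SmartConnect service IP
isiloncluster1.example.com.        IN NS    sip.isiloncluster1.example.com.
sip.isiloncluster1.example.com.    IN A     10.1.1.100

; Short-TTL alias used by all Hadoop services as the HDFS NameNode address
mycluster1-hdfs.example.com.  60   IN CNAME subnet0-pool0.isiloncluster1.example.com.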

If you later want to redirect all HDFS I/O to another cluster or a different pool on the same Isilon cluster, you simply need to change the DNS record and restart all Hadoop services.

Other

There are three scripts available for download to help automate new IBM BigInsights installations with EMC Isilon:

1. bi_create_users.sh - use this script to create the users and groups on all the Linux nodes before beginning installation.
2. isilon_create_users.sh - use this script to create the users and groups on Isilon before beginning installation. You must first create your access zone (described later in this document, e.g. ibm) before running this script.
3. isilon_create_directories.sh - run this after the script above.

More information on the use of these scripts is provided in the installation section of this document.

Prepare Isilon

Assumptions

This document makes the assumptions listed below. These are not necessarily requirements, but they are usually valid and simplify the process.

- It is assumed that you are not using a directory service such as Active Directory for Hadoop users and groups.
- It is assumed that you are not using Kerberos authentication for Hadoop.

SmartConnect for HDFS

A best practice for HDFS on Isilon is to utilize two SmartConnect IP address pools for each access zone. One IP address pool should be used by Hadoop clients to connect to the HDFS NameNode service on Isilon, and it should use the dynamic IP allocation method to minimize connection interruptions in the event that an Isilon node fails.

Note: Dynamic IP allocation requires a SmartConnect Advanced license.

A Hadoop client uses a specific SmartConnect IP address pool simply by using its zone name (DNS name) in the HDFS URI. For example: hdfs://subnet0-pool1.isiloncluster1.example.com:8020

A second IP address pool should be used for HDFS datanode connections, and it should also use the dynamic IP allocation method. To assign specific SmartConnect IP address pools for datanode connections, you will use the isi hdfs racks modify command (see the sketch below). If the network is flat, there is no need to use isi hdfs racks modify; the default configuration will suffice.

If IP addresses are limited and you have a SmartConnect Advanced license, you may choose to use a single dynamic pool for namenode and datanode connections. This may result in uneven utilization of Isilon nodes. If you do not have a SmartConnect Advanced license, you may choose to use a single static pool for namenode and datanode connections. This may result in some failed HDFS connections in the event of a node failure.

For more information, see the EMC Isilon Best Practices for Hadoop Data Storage white paper.
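As an illustrative sketch only: the rack name is an assumption, subnet0:pool2 is assumed to be a dynamic pool created for datanode traffic, and the flag syntax can vary between OneFS releases, so consult the OneFS CLI reference for your release before running these commands.

# Map datanode connections from all client IPs to the datanode pool
isiloncluster1-1# isi hdfs racks create /rack0 \
    --client-ip-ranges 0.0.0.0-255.255.255.255 \
    --ip-pools subnet0:pool2
isiloncluster1-1# isi hdfs racks list

With a mapping like this, clients are handed datanode addresses from subnet0:pool2, while the NameNode continues to be reached through the pool named in the HDFS URI.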

OneFS Access Zones

Access zones on OneFS are a way to select a distinct configuration for the OneFS cluster based on the IP address that the client connects to. For HDFS, this configuration includes authentication methods, the HDFS root path, and authentication providers (AD, LDAP, local, etc.). By default, OneFS includes a single access zone called System.

If you will only have a single Hadoop cluster connecting to your Isilon cluster, then you can use the System access zone with no additional configuration. However, to have more than one Hadoop cluster connect to your Isilon cluster, it is best to have each Hadoop cluster connect to a separate OneFS access zone. This will allow OneFS to present each Hadoop cluster with its own HDFS namespace and an independent set of users. For more information, see the Security and Compliance for Scale-out Hadoop Data Lakes white paper.

To view your current list of access zones and the IP pools associated with them:

isiloncluster1-1# isi zone zones list
Name    Path
System  /ifs
Total: 1

isiloncluster1-1# isi networks list pools -v
subnet0:pool0
        In Subnet: subnet0
        Allocation: Static
        Ranges:
        Pool Membership: 4
                1:10gige-1 (up)
                2:10gige-1 (up)
                3:10gige-1 (up)
                4:10gige-1 (up)
        Aggregation Mode: Link Aggregation Control Protocol (LACP)
        Access Zone: System (1)
        SmartConnect:
                Suspended Nodes: None
                Auto Unsuspend: 0
                Zone: subnet0-pool0.isiloncluster1.lab.example.com
                Time to Live: 0
                Service Subnet: subnet0
                Connection Policy: Round Robin
                Failover Policy: Round Robin
                Rebalance Policy: Automatic Failback

To create a new access zone and an associated IP address pool:

isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1
isiloncluster1-1# isi zone zones create --name zone1 \
    --path /ifs/isiloncluster1/zone1
isiloncluster1-1# isi networks create pool --name subnet0:pool1 \
    --ranges <ip_range> --ifaces 1-4:10gige-1 \
    --access-zone zone1 --zone subnet0-pool1.isiloncluster1.lab.example.com \
    --sc-subnet subnet0 --dynamic
Creating pool 'subnet0:pool1': OK
Saving: OK

Note: If you do not have a SmartConnect Advanced license, you will need to omit the --dynamic option.

Sharing Data between Access Zones

By default, the data in one access zone cannot be accessed by users in another access zone. In certain cases, however, you may need to make the same data set available to more than one Hadoop compute cluster. Using fully qualified HDFS paths, e.g. hdfs://zone1-hdfs.example.com/hadoop/dir1, can render a data set available across two or more access zones. With fully qualified HDFS paths, the data sets do not cross access zones. Instead, the Hadoop jobs access the data sets from a common shared HDFS namespace. For instance, you can selectively share data between two or more access zones based on referential links and file/directory permissions.
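As a simple sketch of such access: zone1-hdfs.example.com below is an assumed SmartConnect alias for the shared access zone, and the paths are illustrative. A user in a second Hadoop cluster can read the shared data set by naming it with a fully qualified URI rather than relying on that cluster's default file system:

[hduser1@mycluster2-node-1 ~]$ hdfs dfs -ls hdfs://zone1-hdfs.example.com:8020/hadoop/dir1
[hduser1@mycluster2-node-1 ~]$ hdfs dfs -cat hdfs://zone1-hdfs.example.com:8020/hadoop/dir1/README

Whether the read succeeds is then governed by the file and directory permissions on that path, exactly as described above.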

User & Group IDs

Isilon clusters and Hadoop servers each have their own mapping of user IDs (UIDs) to user names and group IDs (GIDs) to group names. When Isilon is used only for HDFS storage by the Hadoop servers, the IDs do not need to match. This is due to the fact that the HDFS protocol only refers to users and groups by their names, never by their numeric IDs. In contrast, the NFS protocol refers to users and groups by their numeric IDs. Although NFS is rarely used in traditional Hadoop environments, the high-performance, enterprise-class, and POSIX-compatible NFS functionality of Isilon makes NFS a compelling protocol for certain workflows. If you expect to use both NFS and HDFS on your Isilon cluster (or simply want to be open to the possibility in the future), it is highly recommended to maintain consistent names and numeric IDs for all users and groups on Isilon and your Hadoop servers.

In a multi-tenant environment with multiple Hadoop clusters, numeric IDs for users in different clusters should be distinct. For instance, the user bigsql in Hadoop cluster 1 may have ID 1013, and this same ID will be used in the Isilon access zone for Hadoop cluster 1 as well as on every server in Hadoop cluster 1. The user bigsql in Hadoop cluster 2 may have ID 710, and this ID will be used in the Isilon access zone for Hadoop cluster 2 as well as on every server in Hadoop cluster 2.

Configuring Isilon for HDFS

Note: In the steps below, replace zone1 with System to use the default System access zone, or specify the name of a new access zone that you previously created.

1. Open a web browser to your Isilon cluster's web administration page. If you don't know the URL, simply point your browser to https://<isilon_node_ip_address>:8080.

The isilon_node_ip_address is any IP address on any Isilon node that is in the System access zone. This usually corresponds to the ext-1 interface of any Isilon node.

2. Log in with your root account. You specified the root password when you configured your first node using the console.

3. Check, and edit as necessary, your NTP settings. Click Cluster Management -> General Settings -> NTP.

1. SSH into any node in your Isilon cluster as root.

2. Confirm that your Isilon cluster is at the required OneFS version:

isiloncluster1-1# isi version
Isilon OneFS v<version>

3. Confirm that the required patch is installed. You can view the list of patches you have installed with:

isiloncluster1-1# isi pkg info
patch-<patch_number>: This patch adds support for the Ambari 1.7.0_IBM Server.

4. Install the patch if needed:

[user@workstation ~]$ scp patch-<patch_number>.tgz root@mycluster1-hdfs:/tmp
isiloncluster1-1# gunzip < /tmp/patch-<patch_number>.tgz | tar -xvf -
isiloncluster1-1# isi pkg install patch-<patch_number>.tar
Preparing to install the package...
Checking the package for installation...
Installing the package
Committing the installation...
Package successfully installed.

5. Verify your HDFS license:

isiloncluster1-1# isi license
Module  License Status  Configuration   Expiration Date
HDFS    Evaluation      Not Configured  November 12, 2016

6. Create the HDFS root directory. This is usually called hadoop and must be within the access zone directory.

isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1/hadoop

7. Set the HDFS root directory for the access zone.

isiloncluster1-1# isi zone zones modify zone1 \
    --hdfs-root-directory /ifs/isiloncluster1/zone1/hadoop

8. Set the HDFS block size used for reading from Isilon.

isiloncluster1-1# isi hdfs settings modify --default-block-size 128M

9. Create an indicator file so that we can easily determine when we are looking at your Isilon cluster via HDFS.

isiloncluster1-1# touch \
    /ifs/isiloncluster1/zone1/hadoop/this_is_isilon_isiloncluster1_zone1

10. Copy the scripts (isilon_create_users.sh and isilon_create_directories.sh) you downloaded earlier to Isilon:

[user@workstation ~]$ scp isilon_create_*.sh \
    root@isilon_node_ip_address:/ifs/isiloncluster1/scripts

11. Execute the script isilon_create_users.sh. This script will create all required users and groups for IBM BigInsights v4.0.

Warning: The script isilon_create_users.sh will create local user and group accounts on your Isilon cluster for Hadoop services. If you are using a directory service such as Active Directory and you want these users and groups to be defined in your directory service, then DO NOT run this script. Instead, refer to the OneFS documentation and the EMC Isilon Best Practices for Hadoop Data Storage white paper.

Script Usage: isilon_create_users.sh --dist <DIST> [--startgid <GID>] [--startuid <UID>] [--zone <ZONE>]

--dist      This will correspond to your Hadoop distribution: bi4.0
--startgid  Group IDs will begin with this value. For example: 1000
--startuid  User IDs will begin with this value. This is generally the same as startgid. For example: 1000
--zone      Access zone name. For example: zone1

isiloncluster1-1# bash /ifs/isiloncluster1/scripts/isilon_create_users.sh \
    --dist bi4.0 --startgid 1000 --startuid 1000 --zone zone1

Example output of the script is shown below:

Info: Hadoop distribution: bi4.0
Info: groups will start at GID 1000
Info: users will start at UID 1000
Info: will put users in zone: zone1
Info: HDFS root: /ifs/isiloncluster1/hadoop
Failed to add member UID:1001 to group GROUP:hadoop: User is already in local group
SUCCESS -- Hadoop users created successfully!
Done!

Note: The "User is already in local group" message is expected; this user corresponds to the hadoop user, which is already in the hadoop group.

12. Execute the script isilon_create_directories.sh. This script will create all required directories with the appropriate ownership and permissions.

Script Usage: isilon_create_directories.sh --dist <DIST> [--fixperm] [--zone <ZONE>]

--dist     This will correspond to your Hadoop distribution: bi4.0
--fixperm  Updates ownership and permissions on Hadoop directories.
--zone     Access zone name. For example: zone1

isiloncluster1-1# bash /ifs/isiloncluster1/scripts/isilon_create_directories.sh \
    --dist bi4.0 --fixperm --zone zone1

13. Map the hdfs user to the Isilon superuser. This will allow the hdfs user to chown (change ownership of) all files during the IBM BigInsights installation.

Warning: The command below will restart the HDFS service on Isilon to ensure that any cached user mapping rules are flushed. This will temporarily interrupt any HDFS connections coming from other Hadoop clusters.

isiloncluster1-1# isi zone zones modify --user-mapping-rules="hdfs=>root" --zone zone1
isiloncluster1-1# isi services isi_hdfs_d disable ; isi services isi_hdfs_d enable
The service 'isi_hdfs_d' has been disabled.
The service 'isi_hdfs_d' has been enabled.
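To confirm the zone settings took effect, you can view the access zone configuration and check the HDFS root directory and user mapping rules. This is a sketch; the exact output fields vary by OneFS release:

isiloncluster1-1# isi zone zones view zone1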

Create DNS Records for Isilon

You will now create the required DNS records that will be used to access your Isilon cluster.

1. Create a delegation record so that DNS requests for the zone isiloncluster1.example.com are delegated to the service IP that will be defined on your Isilon cluster. The service IP can be any unused static IP address in your lab subnet.

2. Create a CNAME alias for your Isilon SmartConnect zone. For example, create a CNAME record for mycluster1-hdfs.example.com that targets subnet0-pool0.isiloncluster1.example.com.

3. Test name resolution.

[user@workstation ~]$ ping mycluster1-hdfs.example.com
PING subnet0-pool0.isiloncluster1.example.com (<ip_address>) 56(84) bytes of data.
64 bytes from <ip_address>: icmp_seq=1 ttl=64 time=1.15 ms

Prepare Linux Compute Nodes

Linux operating system packages needed for IBM BigInsights:

1. Compatibility Libraries
2. Networking Tools
3. Perl Support
4. Ruby Support
5. Web Services add-on
6. PHP Support
7. Web Server

8. MySQL*
9. PostgreSQL*
10. SNMP support
11. Development Tools
12. Korn Shell

Enable NTP on All Linux Compute Nodes

1. Edit the /etc/ntp.conf file and add your NTP server.
2. Start NTP: service ntpd start
3. Enable NTP at boot: chkconfig --level 2345 ntpd on

Disable SELinux on Each Node (If Enabled) Before Installing Ambari

1. Edit /etc/selinux/config
2. Set SELINUX=disabled
3. Reboot

Note: SELinux can be disabled temporarily with the setenforce 0 command.

Check UMASK Settings

The umask setting on each node should be set to 0022 in /etc/profile and /etc/bashrc. Just modify the existing umask entry if needed, e.g. umask 0022.
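A quick way to spot-check these host settings on every node (a sketch; ntpstat ships with the ntp package on RHEL 6):

# Verify NTP is running and synchronized
service ntpd status
ntpstat
# Verify SELinux is disabled (prints "Disabled", or "Permissive" after setenforce 0)
getenforce
# Verify the default umask
umask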

Set ulimit Properties

1. Edit /etc/security/limits.d/90-nproc.conf:

# set for all users
* soft nofile <value>
* hard nofile <value>
* soft nproc <value>
* hard nproc <value>

Kernel Modifications

1. Edit /etc/sysctl.conf and add the following:

vm.swappiness=5
kernel.pid_max=<value>
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv4.ip_local_port_range = <range>

Create IBM BigInsights Hadoop Users and Groups

Create the required users on all Linux nodes. It is recommended to create all Hadoop users before installing IBM BigInsights. Use the bi_create_users.sh script obtained earlier:

[user@workstation ~]$ scp bi_create_users.sh [node1]:/root

Run the script, e.g.:

# ./bi_create_users.sh

Repeat the above for all nodes (or use the loop sketch below).
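To avoid repeating the copy-and-run step by hand, a small loop can push and execute the script on every node. This is a sketch; the host names are assumptions, and root SSH access to each node is assumed:

# Push and run the user-creation script on each compute node
for h in node1.example.com node2.example.com node3.example.com node4.example.com; do
    scp bi_create_users.sh root@$h:/root/
    ssh root@$h "bash /root/bi_create_users.sh"
done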

Configure Passwordless SSH

Configure passwordless SSH for all Linux nodes.

1. Create authentication SSH keys:

ssh-keygen -f id_rsa -t rsa -N ""

2. Create .ssh directories on all nodes:

ssh <host> "mkdir -p .ssh"
cd .ssh

Upload the generated key to all hosts:

cat id_rsa.pub | ssh <host> 'cat >> .ssh/authorized_keys'

Repeat the above for all nodes.

3. Set permissions on the .ssh directory:

ssh <host> "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

Additional Linux Packages to Install

Install the following packages on all Linux compute nodes:

deltarpm
python-deltarpm
createrepo
pam (el6.i686)
mysql-connector-java (el6.noarch)
ksh
nc
libdbi
libstdc++
libaio
java openjdk-devel
python-paramiko
python-rrdtool (el6.rfx.x86_64)

snappy (el6.x86_64)
web-ui-framework

Install the above packages using the yum install command.

Test DNS Resolution

Make sure all compute nodes resolve with a fully qualified domain name. Ping each host by its FQDN and make sure it is reachable by FQDN.

Edit sudoers File on All Linux Compute Nodes

1. Edit /etc/sudoers:

## Additions needed for IBM BigInsights
hadoop ALL=(ALL) NOPASSWD: ALL
bigsql ALL=(ALL) NOPASSWD: ALL

Check IBM's BigInsights Knowledge Center for more information on preparing Linux nodes.

Installing IBM Open Platform (OP)

Download IBM Open Platform Software

Log into the IBM Passport Advantage web portal with your IBM-assigned credentials and download the following packages onto the designated Ambari server node:

BI-AH IOP-4.0.x86_64.bin
IOP x86_64.rpm
iop x86_64.tar.gz
iop-utils-1.0-iop-4.0.x86_64.tar.gz
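Before building the repository, it is worth confirming that the downloaded artifacts are present on the Ambari server node. This is a sketch; the actual file names carry version strings that vary by release:

# List the downloaded IBM Open Platform artifacts
ls -lh BI-AH*.bin IOP*.rpm iop*.tar.gz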

Create IBM Open Platform Repository

The IBM Open Platform with Apache Hadoop uses the repository-based Ambari installer. You have two options for specifying the location of the repository from which Ambari obtains the component packages. The IBM Open Platform with Apache Hadoop installation includes OpenJDK. During installation, you can either install the version provided or make sure Java 7 is installed on all nodes in the cluster.

1. Log in to your Linux cluster as root, or as a user with root privileges.

2. Ensure that the nc package is installed on all nodes:

yum install -y nc

If you installed the Basic Server option on your server, the nc package might not be installed, which might result in failures on datanodes of the IBM Open Platform with Apache Hadoop.

3. Locate the IOP x86_64.rpm file you downloaded from the download site. Run the following command to install the ambari.repo file into /etc/yum.repos.d:

yum install IOP*.x86_64.rpm

If using a mirror repository, edit the file /etc/yum.repos.d/ambari.repo and replace the baseurl value with your mirror URL, for example baseurl=http://<your.mirror.server>/repos/IOP/RHEL6/x86_64/4.0. Disable the gpgcheck in the ambari.repo file: to disable signature validation, change gpgcheck=1 to gpgcheck=0. Alternatively, you can keep gpgcheck on and change the gpgkey entry so that the public key file location points to your mirror Ambari repository's copy of the key.

4. Clean the yum cache on each node so that the right packages from the remote repository are seen by your local yum:

sudo yum clean all

5. Install the Ambari server on the intended management node, using the following command:

sudo yum install ambari-server

Accept the install defaults.

6. If you are using a mirror repository, after you install the Ambari server, update the following file with the mirror repository URLs:

/var/lib/ambari-server/resources/stacks/BigInsights/4.0/repos/repoinfo.xml

In the file, change the information from the original content to the modified content:

Original content:

<os type="redhat6">
  <repo>
    <baseurl>https://ibm-open-platform.ibm.com/repos/IOP/RHEL6/x86_64/4.0</baseurl>
    <repoid>iop-4.0</repoid>
    <reponame>iop</reponame>
  </repo>
  <repo>
    <baseurl>https://ibm-open-platform.ibm.com/repos/IOP-UTILS/RHEL6/x86_64/1.0</baseurl>
    <repoid>iop-utils-1.0</repoid>
    <reponame>iop-utils</reponame>
  </repo>
</os>

Modified content:

<os type="redhat6">
  <repo>
    <baseurl>http://<your.mirror.server>/repos/IOP/RHEL6/x86_64/4.0</baseurl>
    <repoid>iop-4.0</repoid>
    <reponame>iop</reponame>
  </repo>
  <repo>
    <baseurl>http://<your.mirror.server>/repos/IOP-UTILS/RHEL6/x86_64/1.0</baseurl>
    <repoid>iop-utils-1.0</repoid>
    <reponame>iop-utils</reponame>
  </repo>
</os>

Edit the /etc/ambari-server/conf/ambari.properties file, changing the jdk1.7.url property from the original content to the modified content:

Original content:

jdk1.7.url=https://ibm-open-platform.ibm.com/repos/IOP-UTILS/RHEL6/x86_64/1.0/openjdk/jdk-<version>.tar.gz

Modified content:

jdk1.7.url=http://<your.mirror.server>/repos/IOP-UTILS/RHEL6/x86_64/1.0/openjdk/jdk-<version>.tar.gz

7. Set up the Ambari server, using the following command:

sudo ambari-server setup

Accept the setup preferences. A Java JDK is installed as part of the Ambari server setup. However, the Ambari server setup also allows you to reuse an existing JDK. The command is:

ambari-server setup -j /full/path/to/jdk

The JDK path set by the -j parameter must be the same on each node in the cluster.

8. Start the Ambari server, using the following command:

sudo ambari-server start
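After starting, you can confirm the server came up cleanly before continuing (a sketch; the log path shown is the Ambari default):

# Check that the Ambari server process is running
sudo ambari-server status
# Watch the server log for startup errors
tail -n 50 /var/log/ambari-server/ambari-server.log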

9. If the Ambari server had been installed on your node previously, the node may contain old cluster information. Reset the Ambari server to clean up its cluster information in the database, using the following commands:

sudo ambari-server stop
sudo ambari-server reset
sudo ambari-server start

10. Access the Ambari web user interface from a web browser by using the server name (the fully qualified domain name, or the short name) on which you installed the software, and port 8080. For example, enter abc.com:8080.

You can use any available port other than 8080 that will allow you to connect to the Ambari server. In some networks, port 8080 is already in use. To use another port, do the following:

a. Edit the ambari.properties file:

vi /etc/ambari-server/conf/ambari.properties

b. Add a line in the file to select another port:

client.api.port=8081

c. Save the file and restart the Ambari server:

ambari-server restart

11. Log in to the Ambari server with the default username and password: admin/admin. The default username and password is required only for the first login. You can configure users and groups after the first login to the Ambari web interface.
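You can also verify the login and the REST API from the command line before starting the wizard. This is a sketch; substitute your Ambari server's host name for abc.com:

# Query the Ambari REST API with the default credentials; an empty
# cluster list is expected before the wizard has been run
curl -u admin:admin http://abc.com:8080/api/v1/clusters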

12. On the Welcome page, click Launch Install Wizard.

13. On the Get Started page, enter a name for the cluster you want to create. The name cannot contain blank spaces or special characters. Click Next.

14. You will deploy IBM Open Platform for Apache Hadoop with EMC Isilon. The Ambari server allows for the immediate use of an Isilon cluster for all HDFS services (NameNode and DataNode); no reconfiguration will be necessary once the IBM Open Platform install is completed. SSH into Isilon as root and configure the Ambari agent:

isiloncluster1-1# isi zone zones modify zone1 \
    --hdfs-ambari-namenode mycluster1-hdfs.example.com
isiloncluster1-1# isi zone zones modify zone1 \
    --hdfs-ambari-server managersvr-1.example.com

15. On the Select Stack page, click the stack version you want to install (BigInsights 4.0). Click Next.

16. On the Install Options page, in Target Hosts, add the list of Linux hosts that the Ambari server will manage and on which the IBM Open Platform with Apache Hadoop software will be deployed, one node per line. For example:

host1.example.com
host2.example.com
host3.example.com
host4.example.com

In Host Registration Information, select one of the two options:

Provide the SSH Private Key to automatically register hosts

Click SSH Private Key. The private key file is /root/.ssh/id_rsa, where the root user installed the Ambari server. Click Choose File to find the private key file you installed previously. You should have retained a copy of the SSH private key (.ssh/id_rsa) in your local directory when you set up passwordless SSH. Alternatively, copy and paste the key into the text box manually. Click the Register and Confirm button.

Note: After the Linux hosts register, click the back button and select Perform manual registration for Isilon; do not use SSH. Isilon has an ambari-agent within OneFS and needs to be manually registered in Ambari. After registering Isilon manually, click the Next button. You should see the Ambari agents on both your Linux hosts and Isilon become registered.

17. On the Confirm Hosts page, check that the correct hosts for your cluster have been located and that those hosts have the correct directories, packages, and processes to continue the installation. If hosts were selected in error, click the check boxes next to the hosts you want to remove and click Remove Selected. To remove a single host, click Remove in the Action column.

If warnings are found during the check process, you can click "Click here to see the warnings" to see what caused them. The Host Checks page identifies any issues with the hosts. For example, a host may have Transparent Huge Pages or firewall issues. You can ignore errors related to user names and groups, as we pre-created the users in the pre-installation steps of this document. After you resolve the issues, click Rerun Checks on the Host Checks page. When you have confirmed the hosts, click Next.

18. On the Choose Services page, select the services you want to install.

Ambari shows a confirmation message to install the required service dependencies. For example, when selecting only Oozie, the Ambari web interface shows messages for accepting the YARN/MR2, HDFS, and ZooKeeper installations. It also shows Nagios and Ganglia for monitoring and alerting, but they are not required services.

19. On the Assign Masters page, assign the NameNode and SNameNode components to the Isilon SmartConnect address, e.g. mycluster1-hdfs.example.com. The rest of the services can be deployed per the recommended service layout; refer back to Table 1. Make sure you assign NameNode and SNameNode only to the Isilon SmartConnect address (e.g. only mycluster1-hdfs.example.com) and to none of the Linux nodes. Click Next.

On the Assign Slaves and Clients page, assign the components to the Linux hosts in your cluster and make sure DataNode is only assigned to Isilon. Assign Client to the client nodes. Click Next.

Tip: If you anticipate adding the Big SQL service at some later time, you must include all clients on all the anticipated Big SQL worker nodes. Big SQL specifically needs the HDFS, Hive, HBase, Sqoop, HCat, and Oozie clients.

20. On the Customize Services page, select configuration settings for the selected services. Default values are filled in automatically when available, and they are the recommended values. The installation wizard prompts you for required fields (such as password entries) by displaying a number in a circle next to an installed service. Assign passwords to Hive, Oozie, and any other selected services that require them. The following settings should be checked:

YARN Node Manager log-dirs
YARN Node Manager local-dirs
HBase local directory
ZooKeeper directory

Oozie Data Dir
Storm storm.local.dir

Click the number and enter the requested information in the field outlined in red. Make sure that the service port that is set is not already used by another component. For example, the Knox gateway port is set to 8443 by default; when the Ambari server is set up with HTTPS and the SSL port is set up using 8443, you must change the Knox gateway port to some other value.

Note: If you are working in an LDAP environment where users are set up centrally by the LDAP administrator and therefore already exist, selecting the defaults can cause the installation to fail. Open the Misc tab, and check the box to ignore user modification errors.

21. When you have completed the configuration of the services, click Next.

22. On the Review page, verify that your settings are correct. Click Deploy.

23. The Install, Start, and Test page shows the progress of the installation. The progress bar at the top of the page gives the overall status, while the main section of the page gives the status for each host. Logs for a specific task can be displayed by clicking on the task. Click the link in the Message column to find out what tasks have been completed for a specific host or to see the warnings that have been encountered. When the message "Successfully installed and started the services" appears, click Next.

24. On the Summary page, review the accomplished tasks. Click Complete to go to the IBM Open Platform with Apache Hadoop dashboard.

Validating IBM Open Platform Install

Ambari provides service checks for all the supported services. These checks run automatically after each service installation, or they can be run manually at any time.

You can access the Ambari web interface and use the Services view to make sure all the components pass their checks successfully. The following steps provide another way to validate your installation.

1. As the root user on a node on which Apache Hadoop is installed, enter the following command to become the ambari-qa user:

su - ambari-qa

2. As the ambari-qa user, run the following commands:

export HADOOP_MR_DIR=/usr/iop/current/hadoop-mapreduce-client

# Generate data with 1000 rows. Each row is about 100 bytes.
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen 1000 /tmp/tgout

# Sort data
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort /tmp/tgout /tmp/tsout

# Validate data
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teravalidate /tmp/tsout /tmp/tvout

If the job is successful, you will see a log record similar to the following:

INFO mapreduce.Job: Job <job_id> completed successfully

Browse to your cluster on port 8088 to see the results of your validation tests in the YARN Resource Manager web interface.
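The same results can be confirmed from the command line (a sketch):

# List finished YARN applications; the three example jobs above
# should appear with a final status of SUCCEEDED
yarn application -list -appStates FINISHED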

Adding a Hadoop User

You must add a user account for each Linux user that will submit MapReduce jobs. The procedure below can be used to add a user named hduser1 as an example.

1. Add the user to Isilon.

isiloncluster1-1# isi auth groups create hduser1 --zone zone1 --provider local
isiloncluster1-1# isi auth users create hduser1 --primary-group hduser1 --zone zone1 \
    --provider local --home-directory /ifs/isiloncluster1/zone1/hadoop/user/hduser1

2. Add the user to the Hadoop nodes.

[root@mycluster1-master-0 ~]# adduser hduser1

3. Create the user's home directory on HDFS.

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -mkdir -p /user/hduser1
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chown hduser1:hduser1 \
    /user/hduser1
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chmod 755 /user/hduser1

Additional Service Tests

The tests below should be performed to ensure a proper installation. Perform the tests in the order shown. You must create the Hadoop user hduser1 before proceeding.

HDFS

[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -ls /
Found 5 items
-rw-r--r--   1 root  hadoop      ... /THIS_IS_ISILON
drwxr-xr-x   - hbase hbase       ... /hbase
drwxrwxr-x   - solr  solr        ... /solr
drwxrwxrwt   - hdfs  supergroup  ... /tmp
drwxr-xr-x   - hdfs  supergroup  ... /user
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -put -f /etc/hosts /tmp
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -cat /tmp/hosts
...
localhost
...
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -rm -skipTrash /tmp/hosts

[root@mycluster1-master-0 ~]# su - hduser1
[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls /
Found 5 items
-rw-r--r--   1 root  hadoop      ... /THIS_IS_ISILON
drwxr-xr-x   - hbase hbase       ... /hbase
drwxrwxr-x   - solr  solr        ... /solr
drwxrwxrwt   - hdfs  supergroup  ... /tmp
drwxr-xr-x   - hdfs  supergroup  ... /user
[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls
...

YARN/MAPREDUCE

[hduser1@mycluster1-master-0 ~]$ hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
pi <maps> <samples>
...
Estimated value of Pi is ...

[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir in

You can put any file into the in directory. It will be used as the data source for subsequent tests.

[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f /etc/hosts in
[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls in
...
[hduser1@mycluster1-master-0 ~]$ hadoop fs -rm -r out
[hduser1@mycluster1-master-0 ~]$ hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
wordcount in out
...
[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls out
Found 4 items
-rw-r--r--   1 hduser1 hduser1  ... out/_SUCCESS
-rw-r--r--   1 hduser1 hduser1  ... out/part-r-00000
-rw-r--r--   1 hduser1 hduser1  ... out/part-r-00001
-rw-r--r--   1 hduser1 hduser1  ... out/part-r-00002
[hduser1@mycluster1-master-0 ~]$ hadoop fs -cat out/part*
...
localhost ...
...

Browse to the YARN Resource Manager GUI and the MapReduce History Server GUI. In particular, confirm that you can view the complete logs for task attempts.

HIVE

[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir -p sample_data/tab1
[hduser1@mycluster1-master-0 ~]$ cat - > tab1.csv
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33

Type <Control+D>.

[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f tab1.csv sample_data/tab1
[hduser1@mycluster1-master-0 ~]$ hive
hive> DROP TABLE IF EXISTS tab1;
CREATE EXTERNAL TABLE tab1
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hduser1/sample_data/tab1';
DROP TABLE IF EXISTS tab2;
CREATE TABLE tab2
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
INSERT OVERWRITE TABLE tab2
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
FROM tab1 WHERE YEAR(col_3) = 2012;
...
OK
Time taken: ... seconds
hive> show tables;
OK

tab1
tab2
Time taken: ... seconds, Fetched: 2 row(s)
hive> select * from tab1;
OK
1  true   123.123     2012-10-24 08:55:00
2  false  1243.5      2012-10-25 13:40:00
3  false  24453.325   2008-08-22 09:33:21.123
4  false  243423.325  2007-05-12 22:32:21.33454
5  true   243.325     1953-04-22 09:11:33
Time taken: ... seconds, Fetched: 5 row(s)
hive> select * from tab2;
OK
1  true   123.123  10  24
2  false  1243.5   10  25
Time taken: ... seconds, Fetched: 2 row(s)
hive> select * from tab1 where id=1;
OK
1  true   123.123  2012-10-24 08:55:00
Time taken: ... seconds, Fetched: 1 row(s)
hive> select * from tab2 where id=1;
OK
1  true   123.123  10  24
Time taken: ... seconds, Fetched: 1 row(s)
hive> exit;

HBASE

[hduser1@mycluster1-master-0 ~]$ hbase shell
hbase(main):001:0> create 'test', 'cf'
0 row(s) in ... seconds
=> Hbase::Table - test
hbase(main):002:0> list 'test'
TABLE
test
1 row(s) in ... seconds
=> ["test"]
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in ... seconds
hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'

0 row(s) in ... seconds
hbase(main):005:0> scan 'test'
ROW   COLUMN+CELL
row1  column=cf:a, timestamp=..., value=value1
row2  column=cf:b, timestamp=..., value=value2
2 row(s) in ... seconds
hbase(main):006:0> get 'test', 'row1'
COLUMN  CELL
cf:a    timestamp=..., value=value1
1 row(s) in ... seconds
hbase(main):007:0> quit

Ambari Service Check

Ambari has built-in functional tests for each component. These are executed automatically when you install your cluster with Ambari. To execute them after installation, select the service in Ambari, click the Service Actions button, and select Run Service Check.
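Service checks can also be triggered through the Ambari REST API, which is convenient for scripting. This is a sketch; the cluster name (mycluster1) and host are assumptions:

# Trigger an HDFS service check via the Ambari REST API
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d '{"RequestInfo":{"context":"HDFS Service Check","command":"HDFS_SERVICE_CHECK"},"Requests/resource_filters":[{"service_name":"HDFS"}]}' \
  http://abc.com:8080/api/v1/clusters/mycluster1/requests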

Installing IBM Value Packages

Before You Begin

Please note that the BigInsights Analyst and BigInsights Data Scientist value packages have been sanity tested on EMC Isilon, but have not been performance profiled and tested under load with this Isilon version. EMC and IBM plan to validate these components under load as part of future integration efforts. Please refer to the EMC and IBM BigInsights joint support statement for further details.

You must acquire the software from Passport Advantage. The acquired software has a *.bin extension. The name of the *.bin file depends on whether the BigInsights Analyst or the BigInsights Data Scientist module was downloaded. When you run the *.bin file, configuration files are copied to appropriate locations to enable Ambari to see the value-add services as available.

When adding the value-add services through Ambari, additional software packages can be downloaded. If the Hadoop cluster cannot directly access the Internet, a local mirror repository can be created.

Where you perform the following steps depends on whether the Hadoop cluster has direct Internet access:

- If the Hadoop cluster has direct access to the Internet, perform the steps from the Ambari server of the Hadoop cluster.
- If the Hadoop cluster does not have direct Internet access, perform the steps from a Linux host with direct Internet access. Then, transfer the files, as required, to a local repository mirror.

Installation Procedure

1. Update the permissions on the downloaded *.bin file to enable execute:

chmod +x <package_name>.bin

2. Run the *.bin file to extract and install the services in the module:

./<package_name>.bin

where <package_name> is BI-Analyst-xxxxx.bin for the Analyst module or BI-DS-xxxxx.bin for the Data Scientist module.

3. After the prompt, agree to the license terms. Reply yes (y) to continue the install.

4. After the prompt, choose whether you want to do an online (option 1) or offline (option 2) install.

a. An online install will lay out the Ambari service configuration files and update the repository locations in the Ambari server file. Skip to step 6.

b. An offline install initiates a download of files to set up a local repository mirror. A subdirectory called BigInsights will be created, and the RPMs and associated files will be located in the directory BigInsights/packages.

5. Set up a local repository. A local repository is required if the Hadoop cluster cannot connect directly to the Internet, or if you wish to avoid multiple downloads of the same software when installing services across multiple nodes. In the following steps, the host that performs the repository mirror function is called the repository server. If you do not have an additional Linux host, you can use one of the Hadoop management nodes. The repository server must be accessible over the network by the Hadoop cluster. The repository server requires an HTTP web server. The following instructions describe how to set up a repository server by using a Linux host with an Apache HTTP server.

a. On the repository server, if the Apache HTTP server is not installed, install it:

yum install httpd

b. On the repository server, ensure that the createrepo package is installed.

c. On the repository server, create a directory for your value-add repository, such as <mirror web server document root>/repos/valueadds. For example, for Apache httpd, the default document root is /var/www/html:

mkdir /var/www/html/repos/valueadds

d. By selecting option 2 in step 4, RPMs were downloaded to a subdirectory called BigInsights/packages. Copy all of the RPMs to the mirror web server location, the <your.mirror.web.server.document root>/repos/valueadds directory:

cp BigInsights/packages/* /var/www/html/repos/valueadds/

e. Start the web server. If you use Apache httpd, start it by using either of the following commands:

apachectl start

or

service httpd start

f. Test your local repository by browsing to the web directory, e.g. http://<repository_server>/repos/valueadds. You should see all of the files that you copied to the repository server.

g. On the repository server, run the createrepo command to initialize the repository:

createrepo /var/www/html/repos/valueadds

h. In the BigInsights/packages directory, find the RPM to install on the Ambari Server host of the Hadoop cluster:

BigInsights Analyst: BI-Analyst-X.X.X.X-IOP-X.X.x86_64.rpm

BigInsights Data Scientist: BI-DS-X.X.X.X-IOP-X.X.x86_64.rpm

Tip: The BigInsights Data Scientist module also entitles you to the features of the BigInsights Analyst module. Therefore, consider doing the yum install for both of the RPM packages.

Then, copy the file to the Ambari server host and install the RPMs by using the following command:

sudo yum install <BI-xxx-IOP...>.rpm

i. On the Ambari server node, navigate to the /var/lib/ambari-server/resources/stacks/BigInsights/<version_number>/repos/repoinfo.xml file. If the file does not exist, create it. Ensure the <baseurl> element for the BIGINSIGHTS-VALUEPACK <repo> entry points to your repository server. Remember, there might be multiple <repo> sections. Make sure that the URL you tested in step 5.f matches exactly the value indicated in the <baseurl> element. For example, after you change <baseurl>, the repoinfo.xml might contain a section like the following (the URL is a placeholder for your repository server):

<repo>
  <baseurl>http://<repository_server>/repos/valueadds</baseurl>
  <repoid>BIGINSIGHTS-VALUEPACK</repoid>
  <reponame>BIGINSIGHTS-VALUEPACK</reponame>
</repo>

Note: The new <repo> section might appear as a single line.

Tip: If you later find an error in this configuration file, make corrections and run the following command:

yum clean all

Then, restart the Ambari server:

ambari-server restart

j. When the module is installed, restart the Ambari server.

k. Open the Ambari web interface and log in. The default address is http://<ambari_server_host>:8080. The default login name is admin and the default password is admin.

l. Click Actions > Add Service. In the list of services you will see the services that you previously added as well as the BigInsights services you can now add.
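As a quick sanity check of the mirror (referenced in step 5.f above), you can fetch the repository metadata that createrepo generates; the hostname below is a placeholder for your repository server:

curl http://<repository_server>/repos/valueadds/repodata/repomd.xml

If the metadata XML is returned, yum clients on the cluster should be able to consume the repository.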

Select IBM BigInsights Service to Install

Select the service that you want to install and deploy. Even though your module might contain multiple services, install only the specific service that you want plus the BigInsights Home service. Installing one value-add service at a time is recommended. Follow the service-specific installation instructions for more information. At the conclusion of installing all the IBM BigInsights services, the Ambari GUI software list should have green check marks next to each service.

Installing BigInsights Home

The BigInsights Home service is the main interface to launch BigInsights - BigSheets, BigInsights - Text Analytics, and BigInsights - Big SQL. The BigInsights Home service requires Knox to be installed, configured, and started.

Open a browser and access the Ambari server dashboard. The default URL is http://<ambari_server_host>:8080. The default user name is admin, and the default password is admin.

In the Ambari dashboard, click Actions > Add Service. In the Add Service Wizard > Choose Services, select the BigInsights - BigInsights Home service and click Next. If you do not see the option for BigInsights - BigInsights Home, follow the instructions described in Installing the BigInsights value-add packages.

In the Assign Masters page, select a management node (edge node) that your users can communicate with. BigInsights Home is a web application that your users must be able to open with a web browser.

In the Assign Slaves and Clients page, make selections to assign slaves and clients. The nodes that you select will have JSqsh (an open source, command-line interface to SQL for Big SQL and other database engines) and an SFTP client. Select nodes that might be used to ingest data as an SFTP client, or where you might want to work with Big SQL scripts or other databases interactively.

Click Next to review any options that you might want to customize. Click Deploy.

If the BigInsights - BigInsights Home service fails to install, run the remove_value_add_services.sh cleanup script. The following is an example command:

cd /usr/ibmpacks/bin/<version>

./remove_value_add_services.sh -u admin -p admin -x <ambari_port> -s WEBUIFRAMEWORK -r

For more information about cleaning the value-add service environment, see Removing BigInsights value-add services.

After installation is complete, click Next > Complete.

Configure Knox

The Apache Knox gateway is a system that provides a single point of authentication and access for Apache Hadoop services on the compute nodes in a cluster; however, authentication to HDFS services is controlled entirely by Isilon OneFS. The Knox gateway simplifies Hadoop security both for users who access the cluster and execute jobs and for operators who control access and manage the cluster. The gateway runs as a server, or a cluster of servers, providing centralized access to one or more Hadoop clusters.

In IBM Open Platform with Apache Hadoop, Knox is a service that you start, stop, and configure in the Ambari web interface. Users access the following BigInsights value-added components through Knox by going to the IBM BigInsights Home service:

BigSheets
Text Analytics
Big SQL

Knox supports only REST API calls for the following Hadoop services:

WebHCat

Oozie
HBase
Hive
YARN

Click the Knox service from the Ambari web interface to see the summary page. Select Service Actions > Restart All to restart Knox and all of its components. If you are using LDAP, you must also start LDAP if it is not already started.

Click the BigInsights Home service in the Ambari user interface. Select Service Actions > Restart All to restart it and all of its components.

Open the BigInsights Home page from a web browser. The URL for BigInsights Home has the following form:

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

where:

knox_host - The host where Knox is installed and running
knox_port - The port where Knox is listening (by default this is 8443)
knox_gateway_path - The value entered in the gateway.path field in the Knox configuration (by default this is 'gateway')

For example, the URL might look like the following address:

https://knoxserver.example.com:8443/gateway/default/BigInsightsWeb/index.html

If you are using the Knox Demo LDAP, a default user ID and password is created for you. When you access the web page, use the following preset credentials:

User Name = guest
Password = guest-password

Installing BigSheets

To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights - BigSheets service, which is the IBM spreadsheet interface for big data.

1. Open a browser and access the Ambari server dashboard. The default URL is http://<ambari_server_host>:8080. The default user name is admin, and the default password is admin.

2. In the Ambari dashboard, click Actions > Add Service.

3. In the Add Service Wizard, Choose Services, select the BigInsights - BigSheets service, and if you have not already installed the BigInsights Home service, select that as well. Click Next. If you do not see the BigInsights - BigSheets service, you need to install the appropriate module and restart Ambari as described in Installing the BigInsights value-add packages.

4. In the Assign Masters page, decide on which node of your cluster you want to run the BigSheets master.

5. In the Assign Slaves and Clients page, all the defaults are automatically accepted and the next page automatically appears. The BigSheets service does not have any slaves and clients, so the Assign Slaves and Clients page shows and is skipped immediately during install. This is the expected behavior.

6. In the Customize Services page, accept the recommended configurations for the BigSheets service, or customize the configuration by expanding the configuration files and modifying the values. In the Advanced bigsheets-user-config section, make sure that you enter the following information:

a. In the bigsheets.user field, leave the default user name, which is bigsheets.
b. In the bigsheets.password field, type a valid password.
c. In the bigsheets.userid field, type a valid user ID to use for the bigsheets service user. This user ID is created across all of the nodes of the cluster, and must be unique across all nodes of the cluster.
d. Click Next.

7. In the Advanced bigsheets-ambari-config section, in the ambari.password field, type the correct Ambari administration password.

8. You can review your selections in the Review page before accepting them. If you want to modify any values, click the Back button. If you are satisfied with your setup, click Deploy.

9. In the Install, Start and Test page, the BigSheets service is installed and verified. If you have multiple nodes, you can see the progress on each node. When the installation is complete, either view the errors or warnings by clicking the link, or click Next to see a summary and then the new service added to the list of services.

10. Click Complete.

If the BigInsights - BigSheets service fails to install, run the remove_value_add_services.sh cleanup script. The following is an example of the command:

cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x <ambari_port> -s BIGSHEETS -r

For more information about cleaning the value-add service environment, see Removing BigInsights value-add services.

11. After you install BigInsights - BigSheets, you must restart the HDFS, MapReduce2, YARN, Knox, Nagios, and Ganglia client services.

a. For each service that requires a restart, select the service.
b. Click Service Actions.
c. Click Restart All.

12. Access the BigInsights - BigSheets service from the BigInsights Home service.

If the BigInsights Home service has not yet been added, see Installing BigInsights Home.
If the BigInsights Home service has been installed, it must be restarted so that the BigInsights - BigSheets icon will display.

13. Launch the BigInsights Home service by typing the following address in your browser:

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:

knox_host - The host where Knox is installed and running
knox_port - The port where Knox is listening (by default this is 8443)
knox_gateway_path - The value entered in the gateway.path field in the Knox configuration (by default this is 'gateway')

For example, the URL might look like the following address:

https://knoxserver.example.com:8443/gateway/default/BigInsightsWeb/index.html
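If the BigSheets icon does not appear after the restarts, it can help to verify that the Knox gateway itself is reachable before investigating the service. The following check assumes the Knox Demo LDAP guest credentials and the default gateway port, path, and topology described above; adjust them to your environment:

curl -k -u guest:guest-password "https://<knox_host>:8443/gateway/default/BigInsightsWeb/index.html"

An HTML response indicates that Knox is up and proxying the BigInsights web application; an authentication error points at the LDAP configuration instead.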

Installing Big SQL

To extend the power of the Open Platform for Apache Hadoop, install and deploy the BigInsights - Big SQL service, which is the IBM SQL interface to the Hadoop-based platform, IBM Open Platform with Apache Hadoop.

1. Open a browser and access the Ambari server dashboard. The default URL is http://<ambari_server_host>:8080. The default user name is admin, and the default password is admin.

2. In the Ambari web interface, click Actions > Add Service.

3. In the Add Service Wizard, Choose Services, select the BigInsights - Big SQL service and the BigInsights Home service. Click Next. If you do not see the option to select the BigInsights - Big SQL service, complete the steps in Installing the BigInsights value-add packages.

4. In the Assign Masters page, decide which nodes of your cluster you want to run the specified components, or accept the default nodes. Follow this guideline:

For the Big SQL monitoring and editing tool, make sure that the Data Server Manager (DSM) is assigned to the same node that is assigned to the Big SQL Head node.

5. Click Next.

6. In the Assign Slaves and Clients page, accept the defaults, or make specific assignments for your nodes. Follow these guidelines:

Select the non-head nodes for the Big SQL Worker components. You must select at least one node as the worker node.
Select all nodes for the CLIENT. This puts the JSqsh and SFTP clients on the nodes.

7. In the Customize Services page, accept the recommended configurations for the Big SQL service, or customize the configuration by expanding the configuration files and modifying the values. Make sure that you have a valid bigsql_user, bigsql_user_password, and user_id (created by the bi_create_users.sh script) in the appropriate fields in the Advanced bigsql-users-env section.

8. You can review your selections in the Review page before accepting them. If you want to modify any values, click the Back button. If you are satisfied with your setup, click Deploy.

9. In the Install, Start and Test page, the Big SQL service is installed and verified. If you have multiple nodes, you can see the progress on each node. When the installation is complete, either view the errors or warnings by clicking the link, or click Next to see a summary and then the new service added to the list of services.

If the BigInsights - Big SQL service fails to install, run the remove_value_add_services.sh cleanup script. The following is an example of the command:

cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x <ambari_port> -s BIGSQL -r

For more information about cleaning the value-add service environment, see Removing BigInsights value-add services.

10. A web application interface for Big SQL monitoring and editing is available to your end users to work with Big SQL. You access this monitoring utility from the IBM BigInsights Home service. If you have not added the BigInsights Home service yet, do that now.

11. Restart the Knox service. Also start the Knox Demo LDAP service if you have not configured your own LDAP.

12. Restart the BigInsights Home service.

13. To run SQL statements from the Big SQL monitoring and editing tool, type the following address in your browser to open the BigInsights Home service:

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:

knox_host - The host where Knox is installed and running
knox_port - The port where Knox is listening (by default this is 8443)
knox_gateway_path - The value entered in the gateway.path field in the Knox configuration (by default this is 'gateway')

For example, the URL might look like the following address:

https://knoxserver.example.com:8443/gateway/default/BigInsightsWeb/index.html

If you use the Knox Demo LDAP service, the default credentials are:

userid = guest
password = guest-password

Your end users can also use the JSqsh client, which is a component of the BigInsights - Big SQL service.

14. If the BigInsights - Big SQL service shows as unavailable, there might have been a problem with post-installation configuration. Run the following commands as root (or with sudo) on the node where the Big SQL monitoring utility (DSM) server is installed:

a. Run the dsmknoxsetup script:

cd /usr/ibmpacks/bigsql/<version-number>/dsm/1.1/ibm-datasrvrmgr/bin/
./dsmknoxsetup.sh -knoxhost <knox-host>

where <knox-host> is the node where the Knox gateway service is running.

b. Make sure that you do not stop and restart the Knox gateway service within Ambari. If you do, run the dsmknoxsetup script again.

c. Restart the BigInsights Home service so that the Big SQL monitoring utility (DSM) can be accessed from the BigInsights Home interface.

15. For HBase, do the following post-installation steps:

a. For all nodes where HBase is installed, check that the symlinks to hive-serde.jar and hive-common.jar in the hbase/lib directory are valid. To verify that the symlinks are created and valid:

namei /usr/iop/<version-number>/hbase/lib/hive-serde.jar
namei /usr/iop/<version-number>/hbase/lib/hive-common.jar

If they are not valid, do the following steps:

cd /usr/iop/<version-number>/hbase/lib
rm -rf hive-serde.jar
rm -rf hive-common.jar
ln -s /usr/iop/<version-number>/hive/lib/hive-serde.jar hive-serde.jar
ln -s /usr/iop/<version-number>/hive/lib/hive-common.jar hive-common.jar

b. After installing the Big SQL service and fixing the symlinks, restart the HBase service from the Ambari web interface.

After you add Big SQL worker nodes, make sure that you stop and then restart the Hive service.

Connecting to Big SQL

You can run Big SQL queries from the Java SQL Shell (JSqsh), or from the IBM Data Server Manager. You can also run queries from a client application, such as IBM Data Studio, that uses JDBC or ODBC drivers. You must identify a running Big SQL server and configure either a JDBC or an ODBC driver. For more information about JSqsh or IBM Data Studio, see the related topics in the IBM BigInsights Knowledge Center.

Running JSqsh

JSqsh is installed in /usr/ibmpacks/common-utils/current/jsqsh/bin. Change to that directory and type ./jsqsh to open the JSqsh shell:

cd /usr/ibmpacks/common-utils/current/jsqsh/bin
./jsqsh

You can then run any JSqsh commands from the prompt.

Connection setup

To use the JSqsh command shell, you can use the default connections or define and test a connection to the Big SQL server.

1. The first time that you open the JSqsh command shell, a configuration wizard is started. When you are at the JSqsh command prompt, type \drivers to determine the available drivers.

a. On the driver selection screen, select the Big SQL instance that you want to run. Note: Big SQL is designated as DB2 in this example:

Name    Target                 Class
*db2    IBM Data Server (DB2)  com.ibm.db2.jcc.DB2Driver

b. Verify the port, server, and user name. Run \setup and click C to define a password for the connection. The user name must have database administration privileges, or must be granted those privileges by the Big SQL administrator.

c. Test the connection to the Big SQL server.

d. Save and name this connection.

2. Generally, you can access JSqsh from /usr/ibmpacks/common-utils/current/jsqsh/bin with the following command:

./jsqsh --driver=db2 --user=<username> --password=<user_password>

3. Open the saved configuration wizard any time by typing \setup while in the command interface, or ./jsqsh --setup when you open the command interface.

4. Specify the saved connection name in the JSqsh command shell to establish a connection:

./jsqsh name

5. Use the \connect command when you are already inside the JSqsh shell to establish a connection at the JSqsh prompt:

\connect name

Commands and queries

At the JSqsh command prompt, you can run JSqsh commands or database server commands. JSqsh commands usually begin with a backslash (\) character. JSqsh commands accept command-line arguments and allow for common shell activities, such as I/O redirection and pipes. For example, consider this set of commands:

1> select * from t1
2> where c1 > 10
3> \go --style csv > /tmp/t1.csv

Because the commands do not begin with a backslash character, the first two commands are assumed to be SQL statements and are sent to the Big SQL server. The \go command sends the statements to run on the server. The \go command has a built-in alias so that you can omit the backslash. Additionally, you can specify a trailing semicolon to indicate that you want to run a statement, for example:

1> select * from t1
2> where c1 > 10;

The --style option in the \go command indicates that the display shows comma-separated values (CSV). The \go form is most useful if you provide additional arguments to affect how the query is run; changing the display style is an example of this feature. The redirection operator (>) specifies that the results of the command are sent to a file called /tmp/t1.csv.

A set of frequently run commands does not require the leading backslash. Any JSqsh command can be aliased to another name (without a leading backslash, if you choose) by using the \alias command. For example, if you want to be able to type bye to leave the JSqsh shell, you establish that word as the alias for the \quit command:

\alias bye='\quit'

You can run a script that contains one or more SQL statements. For example, assume that you have a file called mysql.sql that contains these statements:

select tabschema, tabname from syscat.tables fetch first 5 rows only;
select tabschema, colname, colno, typename, length from syscat.columns fetch first 10 rows only;

You can start JSqsh and run the script at the same time with this command:

/usr/ibmpacks/common-utils/current/jsqsh/bin/jsqsh bigsql < /home/bigsql/mysql.sql

The redirection operator specifies that JSqsh gets the commands from the file located in the /home/bigsql directory, and then runs the statements within the file.
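Putting the connection and query pieces together, a short interactive session might look like the following sketch. It assumes a saved connection named bigsql, as created in the connection setup above; the query and output file are illustrative:

./jsqsh
1> \connect bigsql
1> select tabschema, tabname from syscat.tables
2> \go --style csv > /tmp/tables.csv
1> \quit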

Command and query edit

The JSqsh command shell uses the JLine2 library, which allows you to edit previously entered commands and queries. You use the command-line edit features to move around with the arrow keys and to edit the command or query on the current line. The JLine2 library provides the same key bindings (vi and emacs) as the GNU Readline library. In addition, it attempts to apply any custom key maps that you created in a GNU Readline configuration file (.inputrc) in the local file system $HOME/ directory.

In addition to individual line editing, the JSqsh command shell remembers the 50 most recently run statements, which you can view by using the \history command:

1> \history
(1) use tpch;
(2) select count(*) from lineitem

Previously run statements are prefixed with a number in parentheses. You use this number to recall a query with the JSqsh recall operator (!), for example:

1> !2
1> select count(*) from lineitem
2>

The ! recall operator has the following behavior:

!! Recalls the previously run statement.
!5 Recalls the fifth query from history.
!-2 Recalls the query from two prior runs.

You can also edit queries that span multiple lines by using the \buf-edit command, which pulls the current query into an external editor, for example:

1> select id, count(*)
2> from t1, t2
3> where t1.c1 = t2.c2
4> \buf-edit

The query is opened in an external editor (/usr/bin/vi by default; you can specify a different editor in the environment variable $EDITOR). When you close the editor, the edited query is entered at the JSqsh command shell prompt.

The JSqsh command shell provides built-in aliases, vi and emacs, for the \buf-edit command. The following commands, for example, open the query in the vi editor:

1> select id, count(*)
2> from t1, t2
3> where t1.c1 = t2.c2
4> vi

Configuration variables

You can use the \set command to list or define values for a number of configuration variables, for example:

1> \set

If you want to redefine the prompt in the command shell, you run the following command with the prompt option:

1> \set prompt='foo $lineno> '
foo 1>

Every JSqsh configuration variable has built-in help available:

1> \help prompt

If you want to permanently set a specific variable, you can do so by editing your $HOME/.jsqsh/sqshrc file and including the appropriate \set command in it.
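As a sketch, a $HOME/.jsqsh/sqshrc that applies the settings discussed above might contain lines like the following; the prompt string and alias are just the examples used earlier:

\set prompt='foo $lineno> '
\alias bye='\quit'

These commands run every time the JSqsh shell starts, so the prompt and the alias are in effect for every session.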

Installing Text Analytics

The Text Analytics service provides powerful text extraction capabilities. You can extract structured information from unstructured and semi-structured text.

It is recommended that you make sure the python-paramiko package is installed before installing the Text Analytics service:

yum install python-paramiko

You will be selecting a Master node for Text Analytics, and this node must contain the python-paramiko package. The master node is the node where Text Analytics Web Tooling and Text Analytics Runtime are both installed.

1. Open a browser and access the Ambari server dashboard. The default URL is http://<ambari_server_host>:8080. The default user name is admin, and the default password is admin.

2. In the Ambari dashboard, click Actions > Add Service.

3. In the Add Service Wizard, Choose Services, select the BigInsights - Text Analytics service. If you do not see the option to select the BigInsights - Text Analytics service, complete the steps in Installing the BigInsights value-add packages.

4. To assign master nodes, select the Text Analytics Master Server node.

5. Click Next. The Assign Slaves and Clients page displays.

6. Assign slave and client components to the hosts on which you want them to run. An asterisk (*) after a host name indicates the host is assigned a master component.

a. To assign slave nodes and clients, click All in the Clients column. The client package that is installed contains runtime binaries that are needed to run Text Analytics. This client needs to be installed on all datanodes that belong to your cluster. Client nodes install only the Text Analytics Runtime artifacts (/usr/ibmpacks/current/text-analytics-runtime). Choose one or more clients. You do not have to choose the Master node as a client because it already installs the Text Analytics Runtime.

7. Click Next and select BigInsights - Text Analytics.

8. Expand Advanced ta-database-config and enter the password in the database.password field. Recommended configurations for the service are completed automatically, but you can edit these default settings as desired. By default, the database server is MySQL. There are two options:

database.create.new = Yes (default)

a. You must enter the password for the database.
b. You must ensure that the default port is free. You can change the port to any free port.
c. You can change the database.username, but any changes to the database.hostname are ignored.

database.create.new = No

a. You must enter the database.hostname, database.port (where the existing database server instance is running), database.user, and database.password. Ensure that the user and password have full access to create a database in the existing database server instance you specify. Especially if it is a remote MySQL server instance, ensure that all permissions are given to the user and password to access this remote instance. Ensure that the server instance is up and running so that the Text Analytics service can be started successfully.

9. Click Next, and in the Review screen that opens, click Deploy.

10. After the installation is successful, click Next > Complete.

If the BigInsights - Text Analytics service fails to install, run the remove_value_add_services.sh cleanup script. The following is an example command:

cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x <ambari_port> -s TEXTANALYTICS -r

For more information about cleaning the value-add service environment, see Removing BigInsights value-add services.

11. The Text Analytics directory on all nodes where Text Analytics components are installed is created with world-writable permissions, which are not required. Change the permissions to rwxr-xr-x on all nodes to improve security:

chmod go-w /usr/ibmpacks/text-analytics-runtime

12. Restart the Knox service. If you have not configured an LDAP service, start the Knox Demo LDAP service.

13. Open BigInsights Home and launch Text Analytics at the following address:

https://<knox_host>:<knox_port>/<knox_gateway_path>/default/BigInsightsWeb/index.html

Where:

knox_host - The host where Knox is installed and running
knox_port - The port where Knox is listening (by default this is 8443)
knox_gateway_path - The value entered in the gateway.path field in the Knox configuration (by default this is 'gateway')

For example, the URL might look like the following address:

https://knoxserver.example.com:8443/gateway/default/BigInsightsWeb/index.html

If you use the Knox Demo LDAP service and have not modified the default configuration, the default credentials to log in to the BigInsights - Home service are:

userid = guest
password = guest-password

Note: If you do not see the Text Analytics service from BigInsights Home, restart the BigInsights Home service in the Ambari interface.

At this point, BigInsights Home should show all three BigInsights services.

Installing Big R

To extend the power of the Open Platform for Apache Hadoop, install and deploy the Big R service, which is the IBM R extension to the Hadoop-based platform, IBM Open Platform with Apache Hadoop.

1. Open a browser and access the Ambari server dashboard. The default URL is http://<ambari_server_host>:8080. The default user name is admin, and the default password is admin.

2. In the Ambari web interface, click Actions > Add Service.

3. Optional: If you do not already have the R service installed, you can add it now. The Big R service depends on the R statistics environment and the following three R packages: base64enc, rJava, and data.table. If these have been installed on all nodes in the cluster, this step can be skipped. Otherwise, you can choose to install the dependencies with your own approach or, if your cluster has external network access, you can use the following R service to install them.

a. In the Add Service Wizard, Choose Services, select the R service and click Next.

b. In the Assign Slaves and Clients page, mark all of the nodes as the R Client node and click Next.

c. In the Customize Services page, accept the recommended configurations for the R service, or customize the configuration by expanding the configuration files and modifying the values. Make sure that you read the R license, and indicate acceptance by typing Y in the field accept.r.licenses. The value is case sensitive, so make sure you type an uppercase letter. The R Licenses field contains a URL where you can find the licensing information. In the user.r.packages field, you must ensure that the following required packages are listed: base64enc, rJava, and data.table. In the user.r.repository field, enter the preferred repository. The default is epel-release, which uses the EPEL repository, but you can also specify a different repository by entering its URL.

Note: When installing R from the EPEL repository, you might have the following GPG key error:

GPG key retrieval failed: [Errno 14] Could not open/read

If you receive this error, you can import the key with the rpm --import command, pointing at the EPEL GPG key URL, and then retry.

d. Click Next, and in the Review page that opens, click Deploy.

e. If R deployment fails, review and correct the errors before reattempting the installation. Remove the R service from Ambari and delete the RSERV service by using the following command:

curl -u [uid]:[pwd] -H "X-Requested-By: ambari" -X DELETE http://[hostname]:8080/api/v1/clusters/[cluster name]/services/RSERV

where:

[uid]:[pwd] - The Ambari administrator user ID and password.
[hostname] - The correct host name for your environment. The port number 8080 is the default; modify this according to your environment.
[cluster name] - The correct name of your cluster.

The following command is an example:

curl -u admin:admin -H "X-Requested-By: ambari" -X DELETE http://ambari.example.com:8080/api/v1/clusters/mycluster/services/RSERV

f. In the Summary page, click Complete. When you return to the Ambari Dashboard Services tab, you notice that the R service is now listed.

4. In the Add Service Wizard, Choose Services page, select the Big R service and click Next.

5. In the Assign Masters page, decide which nodes of your cluster you want to run the specified components, or accept the default nodes. You must assign the Big R Connector to the same node that is running the MapReduce2 Client service, which is a required service that runs MapReduce2 Hadoop jobs. Click Next.

6. In the Assign Slaves and Clients page, accept the defaults, or make specific assignments for your nodes. For client nodes, mark all of the nodes as the Big R Client node and click Next.

7. In the Customize Services page, default Big R environment variables are set in the bigr-env template field. Review these entries for accuracy and completeness. Make any necessary changes and click Next.

8. You can review your selections in the Review page before accepting them. If you want to modify any values, click the Back button. If you are satisfied with your setup, click Deploy.

9. In the Install, Start and Test page, the Big R service is installed and verified. If you have multiple nodes, you can see the progress on each node. When the installation is complete, either view the errors or warnings by clicking the link, or click Next to see a summary and then the new service added to the list of services.

If the BigInsights - Big R service fails to install, run the remove_value_add_services.sh cleanup script. The following is an example of the command:

cd /usr/ibmpacks/bin/<version>
./remove_value_add_services.sh -u admin -p admin -x <ambari_port> -s BIGR -r

For more information about cleaning the value-add service environment, see Removing BigInsights value-add services.

10. Advise your end users that the service is deployed and ready for their use by having them launch the value-add packages welcome page.

11. In the Summary page, click Complete.

Running BigInsights - Big R as the YARN application master

You must update the Linux Container Executor as the default executor in the yarn-site.xml file to change the owner to the bigr server user (the application process owner).

1. In the Ambari web interface, from the YARN service Configs page, scroll down to find the Advanced yarn-site section and expand it.

2. Change the yarn.nodemanager.container-executor.class property to have the following value:

org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor

3. In the Custom yarn-site section, click Add Property to add the following properties:

yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user = yarn
yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users = false

4. Make sure that the property yarn.nodemanager.linux-container-executor.group has the value hadoop.

5. Click Save in the Configs page to save your configuration changes.

6. Make sure that the directories set in the Node Manager section for the properties yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs are owned by yarn:hadoop on ALL the nodes.

7. On ALL nodes, run the following commands:

$ echo "yarn.nodemanager.linux-container-executor.group=hadoop" >> /etc/hadoop/conf/container-executor.cfg
$ echo "banned.users=hdfs,yarn,mapred,bin" >> /etc/hadoop/conf/container-executor.cfg
$ echo "min.user.id=1000" >> /etc/hadoop/conf/container-executor.cfg
$ chown root:hadoop /etc/hadoop/conf/container-executor.cfg
$ chown root:hadoop /usr/iop/<version-number>/hadoop-yarn/bin/container-executor
$ chmod 6050 /usr/iop/<version-number>/hadoop-yarn/bin/container-executor

8. Make sure that the user ID with which the Big R connection is made (by using bigr.connect) is present on ALL nodes, and that the user belongs to the groups users and hadoop.

If the user does not exist, run the following command as the root user on ALL nodes:

$ useradd -G users,hadoop someuser

9. Change the SystemML configuration file, /usr/ibmpacks/current/bigr/machine-learning/SystemML-config.xml, so that the dml.yarn.appmaster value is true.

10. You can optionally update the MapReduce configuration to get better performance:

a. In the Ambari web interface, from the MapReduce2 service Configs page, scroll down to find the Advanced map-red site section and expand it.

b. Update the property mapreduce.task.io.sort.mb to 384. This should be approximately three times the HDFS block size. Note: If the property is not available, add it to the Custom map-red site section.

11. Click Save in the Configs page to save your configuration changes.

For information about using BigInsights - Big R, see Analyzing data with IBM BigInsights Big R.

IBM BigInsights Online Tutorials

Learn how to use BigInsights by completing online tutorials, which use real data and teach you to run applications. Complete the tutorials in any order:

http://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.tut.doc/doc/tut_introduction.html

You can find additional information, tutorials, and articles about BigInsights, Hadoop, and related components at Hadoop Dev.

Security Configuration and Administration

IBM Open Platform with Apache Hadoop security includes perimeter security, authentication, and authorization. Authenticate, authorize, and protect your data by using the steps and recommendations listed in this section. This section covers security for the Isilon HDFS storage, the resources that you use in YARN, and the cluster infrastructure.

Setting up HTTPS for Ambari

You can limit access to the Ambari web interface to HTTPS connections.

Before you begin

The Ambari server must not be running when you perform this task.

You must provide a certificate. You can use a self-signed certificate for initial trials, but such certificates are not suitable for production environments.

The certificate you use must be PEM-encoded, not DER-encoded. If you attempt to use a DER-encoded certificate, the following error appears:

unable to load certificate
:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:698:Expecting: TRUSTED CERTIFICATE

You can use the following command to convert a DER-encoded certificate to a PEM-encoded certificate, where cert.crt is the DER-encoded certificate and cert.pem is the resulting PEM-encoded certificate:

openssl x509 -in cert.crt -inform der -outform pem -out cert.pem

Procedure

1. Log in to the Ambari server host. Note: Make sure the Ambari server is not running.

2. Locate the certificate that you want to use.

You can use the following example to create a temporary self-signed certificate. Replace $wserver with the Ambari server host name:

openssl genrsa -out $wserver.key 2048
openssl req -new -key $wserver.key -out $wserver.csr
openssl x509 -req -days 365 -in $wserver.csr -signkey $wserver.key -out $wserver.crt

3. Run the following command and answer the prompts that appear:

ambari-server setup-security

a. At the Security setup options prompt, type 1.

b. When asked whether you want to configure HTTPS, type y.

c. Select the port that you want to use for SSL. The default is 8443. Note: Make sure that you choose a port that is not being used by any services on the machine. For example, the default port for Knox is also 8443.

d. Provide the path to your certificate and your private key.

e. Provide the password for the private key.
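After you start the Ambari server again, a quick way to confirm that HTTPS is working is to request the web interface over the SSL port. This sketch assumes the default port 8443 chosen above and uses -k because the example certificate is self-signed:

curl -k https://<ambari_server_host>:8443/

If the server responds with the Ambari login page HTML, HTTPS is configured correctly.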

Configuring SSL support for HBase REST gateway with Knox

By using Knox, Hadoop services on your cluster, such as HBase, Hive, and Oozie, can be made securely accessible to a large number of users. Follow these steps to use SSL on the connection between Knox and a Hadoop component such as HBase.

Many of the services in IBM Open Platform with Apache Hadoop use Knox to allow more users to make use of the data and queries in Hadoop without compromising security. Only a handful of administrators are allowed to connect directly to their Hadoop clusters, while end users are routed through Knox.

Knox acts as a reverse proxy between end users and Hadoop, providing a two-hop connection between the client and the Hadoop cluster. The first connection is between the client and Knox; Knox comes with SSL support for this connection. The second connection is between Knox and a given Hadoop component, such as HBase, and securing it requires some configuration.

Procedure

1. You must have a certificate, either self-signed or signed by a Certificate Authority (CA). Trusted SSL certificates are issued by Certificate Authorities; a self-signed certificate is signed by the same entity whose identity it certifies, that is, with its own private key. The examples use a self-signed certificate, but this might not be suitable for your production environment.

a. Configure SSL on the HBase REST server. This example uses a self-signed certificate; an SSL certificate issued by a Certificate Authority (CA) makes the configuration steps even easier.

i. Log in to the HBase REST server. As the HBase user (su hbase), create a keystore to hold the SSL certificate:

export HOST_NAME=`hostname`
keytool -genkey -keyalg RSA -alias selfsigned -keystore hbase.jks -storepass password -validity 360 -keysize 2048 -dname "CN=$HOST_NAME, OU=Eng, O=MyCompany, L=Central City, ST=CA, C=US" -keypass password

Make sure the common name portion of the certificate matches the host where the certificate will be deployed. For example, when the host that runs HBase is actually sandbox.mycompany.com, the self-signed SSL certificate in the example uses this value as the CN: sandbox.mycompany.com.

Owner: CN=sandbox.MyCompany.com, OU=Eng, O=MC, L=CC, ST=CA, C=US
Issuer: CN=sandbox.MyCompany.com, OU=Eng, O=MC, L=CC, ST=CA, C=US

You can now use this self-signed certificate with HBase.

ii. Skip this step if you use a Certificate Authority signed certificate. Self-signed certificates are rejected during the SSL handshake, so if you use a self-signed certificate, export the certificate and put it in the cacerts file of the JRE that is used by Knox. On the machine that is running HBase, export the HBase SSL certificate into a file hbase.crt:

keytool -exportcert -file hbase.crt -keystore hbase.jks -alias selfsigned -storepass password

iii. Copy the hbase.crt file to the node that is running Knox. Then run the following command:

keytool -import -file hbase.crt -keystore /<your_jdk_path>/jre/lib/security/cacerts -storepass changeit -alias selfsigned

Make sure the path to the cacerts file points to the cacerts of the JDK that is used to run the Knox gateway. The default cacerts password is changeit.

2. Configure the HBase REST server for SSL.

a. Use the Ambari web interface to update the Hadoop configuration properties:

<property>
<name>hbase.rest.ssl.enabled</name>
<value>true</value>
</property>
<property>
<name>hbase.rest.ssl.keystore.store</name>
<value>/path/to/keystore/created/hbase.jks</value>
</property>

<property>
<name>hbase.rest.ssl.keystore.password</name>
<value>password</value>
</property>
<property>
<name>hbase.rest.ssl.keystore.keypassword</name>
<value>password</value>
</property>

b. Click Save in the Ambari configuration page.

c. Restart the HBase REST server by clicking the HBase service in the Ambari web interface. You can also type the following commands in the Linux terminal window:

sudo /usr/iop/current/hbase-client/bin/hbase-daemon.sh stop rest &
sudo /usr/iop/current/hbase-client/bin/hbase-daemon.sh start rest -p <rest_port>

3. Verify the HBase REST server over SSL, replacing localhost with the hostname of your HBase REST server:

curl -H "Accept: application/json" -k https://localhost:<rest_port>/

The command should display the tables in your HBase environment:

{"table":[{"name":"ambarismoketest"}]}

4. Configure Knox to point to HBase over SSL and then restart Knox. Change the URL of the HBase service for your Knox topology to HTTPS, and make sure that the host matches the host of the HBase REST server:

<service>
<role>WEBHBASE</role>
<url>https://<hbase_rest_host>:<rest_port></url>
</service>
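To check the whole path end to end, you can call the WebHBase API through the Knox gateway. This sketch assumes the Knox Demo LDAP guest account, the default gateway port and path, and a topology named default; Knox exposes WebHBase under the hbase path in its default service mappings:

curl -k -u guest:guest-password "https://<knox_host>:8443/gateway/default/hbase/version"

A JSON version response indicates that Knox is successfully proxying the SSL-enabled HBase REST server.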

Overview of Kerberos

To ensure secure access in Hadoop, you need strong authentication and a reliable way to establish the identity of a user. When users successfully identify themselves, that identity can be propagated throughout the Hadoop cluster, and those users can access resources or work with applications on the cluster. The Hadoop cluster resources, such as hosts and services, also must authenticate with each other to avoid potential malicious systems or daemons that pretend to be trusted components of the cluster to gain access to data.

Hadoop uses Kerberos as the basis for strong authentication and identity propagation for both users and services. Kerberos is a third-party authentication mechanism, in which users and services rely on a third party - the Kerberos server - to authenticate each to the other. The Kerberos server itself is known as the Key Distribution Center (KDC). The KDC has three parts:

Principals - A database of the users and services that the server knows about, and their respective Kerberos passwords.

Authentication Server (AS) - Performs the initial authentication and issues a Ticket Granting Ticket (TGT).

Ticket Granting Server (TGS) - Issues subsequent service tickets based on the initial TGT.

The basic flow is illustrated by the following steps:

1. A user principal requests authentication from the AS.

2. The AS returns a TGT that is encrypted by using the Kerberos password of the user principal. This password is known only to the user principal and the AS.

3. The user principal decrypts the TGT locally by using its Kerberos password, and from that point forward, until the ticket expires, the user principal can use the TGT to get service tickets from the TGS.

4. Service tickets are what allow a principal to access various services.

Because cluster resources (hosts or services) cannot provide a password each time to decrypt the TGT, they use a special file called a keytab. The keytab contains the authentication credentials of the resource principal. The set of hosts, users, and services over which the Kerberos server has control is called a realm.

Each service and sub-service in Hadoop must have its own principal. A principal name in a given realm consists of a primary name and an instance name. The instance name is the fully qualified domain name (FQDN) of the host that runs that service. Note: The HDFS service is handled entirely by Isilon, so it is very important to make sure the fully qualified Isilon Hadoop zone name is used as the instance name for the HDFS service.

As services do not log in with a password to acquire their tickets, the authentication credentials of their principal are stored in a keytab file. This file is extracted from the Kerberos database and stored locally in a secured directory with the service principal on the service component host.

In addition to the Hadoop service principals, Ambari also requires a set of Ambari principals to perform service checks and alert health checks. Keytab files for the Ambari, or headless, principals reside on each cluster host, just as keytab files for the service principals do.
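The ticket flow above can be observed from any Kerberos client with the standard MIT tools. The principal names and keytab path below are illustrative only:

kinit someuser@YOUR_REALM.COM
    (steps 1-3: request a TGT from the AS and decrypt it with the user's password)

klist
    (show the cached TGT and any service tickets obtained from the TGS)

kinit -kt /etc/security/keytabs/service.keytab service/host.fqdn@YOUR_REALM.COM
    (how a service principal authenticates: the keytab replaces the password)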

Terminology

The following terms are useful in understanding Kerberos:

Key Distribution Center - The trusted source for authentication in a Kerberos-enabled environment.

Kerberos KDC Server - The server that serves as the KDC.

Kerberos Client - Any machine in the cluster that authenticates against the KDC.

Principal - The unique name of a user or service that authenticates against the KDC.

Keytab - A file that includes one or more principals and their keys.

Realm - The Kerberos network that includes a KDC and a number of clients.

KDC Admin Account - An administrative account that is used by Ambari to create principals and generate keytabs in the KDC.

Kerberos Descriptor - A JSON-formatted text file that contains the information Ambari needs to enable or disable Kerberos for a stack and its services. This file must be named kerberos.json and must be in the root directory of the relevant stack or service. Kerberos descriptors are hierarchical, so that details in the stack-level descriptor can be overwritten or updated by details in the service-level descriptors.

Enabling Kerberos for IBM Open Platform

You begin setting up Kerberos by enabling it from the Ambari web interface. To use Kerberos authentication in IBM Open Platform with Apache Hadoop, you must generate principals and keytabs for each of the services on each node where you installed the product.

Before you begin

1. You must have the latest supported Red Hat Enterprise Linux (RHEL) packages to enable and use Kerberos: krb5-server, krb5-workstation, and krb5-libs.

2. Deploy the Java Cryptography Extension (JCE) security policy files on the Ambari server and on all hosts in the cluster. Depending on the JDK that you selected during the installation of IBM Open Platform with Apache Hadoop, the JCE policy files might already be downloaded and installed on the server.

a. Stop the Ambari server:

ambari-server stop

b. Make sure you have access to the policy file archive.

c. From the Ambari server and on each host in the cluster, add the unlimited-strength security policy JCE jars to $JAVA_HOME/jre/lib/security/. For example, run the following command to extract the policy jars into the JDK that is installed on your host:

unzip -o -j -q UnlimitedJCEPolicyJDK7.zip -d /usr/jdk64/jdk1.<version>/jre/lib/security/

d. Restart the Ambari server:

ambari-server restart

3. Ambari automatically creates principals in the KDC and generates keytabs. Therefore, you must have the Kerberos Admin Account credentials available when running the Kerberos wizard.

4. To use an existing Active Directory installation with Kerberos:

a. Make sure that the Ambari server and cluster hosts have network access to, and are able to resolve the DNS names of, the domain controllers.

b. Configure the LDAP or Active Directory authentication connectivity.

c. The Active Directory user container for principals has been created and is available. For example, "OU=Hadoop,OU=People,dc=apache,dc=org".

Manually generating keytabs for Kerberos authentication

You use the kadmin.local command-line interface to generate keytabs for IBM Open Platform with Apache Hadoop services. All Kerberos-enabled services need a keytab file to authenticate to the Key Distribution Center (KDC). You can also use the kadmin command-line interface, which can be used on Kerberos client nodes as well as KDC server nodes. The kadmind service starts the Kerberos administration server, whereas the kadmin.local command-line interface directly accesses the KDC database.

To generate keytabs for services that contain the HTTP principal, you use the ktadd command with the -norandkey option in the kadmin.local command-line interface. This option indicates not to randomize the keys; the keytabs and their version numbers remain unchanged. Note: If your version of Kerberos does not support this option, or if you cannot use the kadmin.local shell, create your keytabs with the ktadd command and use the ktutil command to merge the keytabs that you create.

You must generate keytabs for the following services to configure them with Kerberos HTTP authentication. If two or more of these services run on the same host, then all running services on that host must use the same HTTP principal and key for their HTTP endpoints. Hadoop, HBase, HttpFS, and Oozie require the HTTP principal.

Procedure

1. From the Linux shell, as the root user, start the kadmin.local or kadmin command-line interface.

Important: If you have root access to your KDC machine, log in to the KDC machine as root and use the kadmin.local command-line interface to generate principals and keytabs. If you do not have root access to the KDC machine, use the kadmin command-line interface on any Kerberos-configured machine to generate principals and keytabs.

kadmin.local

2. Create the principal and keytab for each of the IBM Open Platform with Apache Hadoop services. For each service, you must enter the domain.name and YOUR_REALM.COM parameters:

domain.name - The fully qualified domain name of the cluster node where the server component is running. The domain.name must be lowercase characters.

YOUR_REALM.COM - The name of the Kerberos realm where you are installing IBM Open Platform with Apache Hadoop. Kerberos realm names are typically in all uppercase characters to differentiate the realm from any similar DNS domain that it is associated with.

Flume

On every Kerberos-configured node that runs a Flume agent that writes to HDFS, generate a keytab file that contains entries for the Flume agent principal.

a. On each host where a Flume agent runs, create the Flume principal and keytab file, and then copy the keytab to the respective host under ../conf/security/keytabs/flume.keytab:

addprinc -randkey flume/domain.name@YOUR_REALM.COM
ktadd -k flume.keytab flume/domain.name@YOUR_REALM.COM

b. Check to ensure that the Flume agent principal information was added to the keytab file:

klist -e -k -t flume.keytab

c. Ensure that the flume.keytab file is only readable by the Flume user:

sudo chown flume:biadmin ../conf/security/keytabs/flume.keytab
sudo chmod 400 ../conf/security/keytabs/flume.keytab

d. To enable the Flume agent to store data on a secure HDFS, add the following parameters to the Flume configuration file, flume-conf.properties.template, which exists in the ../flume/conf directory. You can rename this configuration file to generate your own configuration file for Flume:

agentname.sinks.sinkname.type = HDFS
agentname.sinks.sinkname.hdfs.kerberosPrincipal = flume/domain.name@YOUR_REALM.COM
agentname.sinks.sinkname.hdfs.kerberosKeytab = keytab_path

where:

agentname - Name of the Flume agent that you are configuring for Kerberos authentication.
sinkname - Name of the HDFS sink that you are configuring. The sink type must be HDFS.
keytab_path - Path to the Flume keytab. The default path is ../conf/security/keytabs/flume.keytab.

When you start the Flume agent, specify the --conf-file option to point to the Flume configuration file that you modified. For example:

$FLUME_HOME/bin/flume-ng agent --conf-file flume-conf.properties.template --name myagentname -Dflume.root.logger=INFO,console

Hadoop

On every Kerberos-configured node that runs a Hadoop server, generate a keytab file for the HDFS, MapReduce, and HTTP services. The HDFS keytab file must contain entries for the HDFS principal and the HTTP principal, and the MapReduce keytab file must contain entries for the MapReduce principal and the HTTP principal. Both Hadoop and HBase use the HTTP keytab file. On each node, the HTTP principal must be the same in all keytab files.

e. Run the following commands on every host in your cluster that runs a Hadoop server or an HBase server:

addprinc -randkey HTTP/domain.name@YOUR_REALM.COM
ktadd -norandkey -k http.domain.name.keytab HTTP/domain.name@YOUR_REALM.COM

f. Run the following commands on every host in your cluster where Hadoop servers run. Create principals and keytabs for the HDFS services, including the NameNode, Secondary NameNode, and DataNodes; in all cases the instance name must point to the FQDN of the Isilon Hadoop zone, for example isilonzone.domain.name. If you plan to enable high availability with the Quorum Journal Manager (QJM), create principals and keytabs for the JournalNodes as well:

addprinc -randkey hdfs/isilonzone.domain.name@YOUR_REALM.COM

Tip: You can also add keytabs for NFS high availability.

i. Add the following principals:

addprinc -randkey nfs/isilonzone.domain.name@YOUR_REALM.COM
addprinc -randkey HTTP/isilonzone.domain.name@YOUR_REALM.COM

ii. Add the NFS principals and keys to every HA node:

ktadd -norandkey -k hdfs.domain.name.keytab hdfs/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k http.domain.name.keytab HTTP/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k hdfs.isilonzone.domain.name.keytab HTTP/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k hdfs.isilonzone.domain.name.keytab nfs/isilonzone.domain.name@YOUR_REALM.COM

Check to ensure that the HDFS and HTTP principal information was added to the keytab file:

klist -e -k -t hdfs.isilonzone.domain.name.keytab

g. Run the following commands on every host in your cluster where Hadoop servers run, including the JobTracker and TaskTracker:

addprinc -randkey mapred/domain.name@YOUR_REALM.COM
ktadd -norandkey -k mapred.domain.name.keytab mapred/domain.name@YOUR_REALM.COM HTTP/domain.name@YOUR_REALM.COM

Check to ensure that the MapReduce and HTTP principal information was added to the keytab file:

klist -e -k -t mapred.domain.name.keytab

HBase

h. On every Kerberos-configured node that runs HBase,

including the primary and secondary servers, generate a keytab file that contains entries for the HBase principal:

addprinc -randkey hbase/domain.name@YOUR_REALM.COM
ktadd -k hbase.domain.name.keytab hbase/domain.name@YOUR_REALM.COM

i. Check to ensure that the HBase principal information was added to the keytab file:

klist -e -k -t hbase.domain.name.keytab

Hive

j. On every Kerberos-configured node that runs a Hive JDBC server, generate a Hive keytab file that contains entries for the Hive principal:

addprinc -randkey hive/domain.name@YOUR_REALM.COM
ktadd -k hive.domain.name.keytab hive/domain.name@YOUR_REALM.COM

k. Check to ensure that the Hive principal information was added to the keytab file:

klist -e -k -t hive.domain.name.keytab

HttpFS

l. On every Kerberos-configured node that runs the HttpFS server, generate a keytab file that contains entries for the HttpFS principal and an HTTP principal:

addprinc -randkey httpfs/isilonzone.domain.name@YOUR_REALM.COM
addprinc -randkey HTTP/isilonzone.domain.name@YOUR_REALM.COM
ktadd -norandkey -k httpfs.domain.name.keytab httpfs/isilonzone.domain.name@YOUR_REALM.COM HTTP/isilonzone.domain.name@YOUR_REALM.COM

Oozie

m. On every Kerberos-configured node that runs Oozie, generate a keytab file that contains entries for the Oozie principal and an HTTP principal:

addprinc -randkey oozie/domain.name@YOUR_REALM.COM
addprinc -randkey HTTP/domain.name@YOUR_REALM.COM
ktadd -norandkey -k oozie.domain.name.keytab oozie/domain.name@YOUR_REALM.COM HTTP/domain.name@YOUR_REALM.COM

n. Check to ensure that the Oozie and HTTP principal information was added to the keytab file:

klist -e -k -t oozie.domain.name.keytab

ZooKeeper

o. On every Kerberos-configured node that runs

ZooKeeper
o. On every Kerberos-configured node that runs ZooKeeper, generate a keytab file that contains entries for the ZooKeeper principal.

addprinc -randkey zookeeper/domain.name@YOUR_REALM.COM
ktadd -k zookeeper.domain.name.keytab zookeeper/domain.name@YOUR_REALM.COM

p. Check to ensure that ZooKeeper principal information was added to the keytab file.
klist -e -k -t zookeeper.domain.name.keytab

Setting up Active Directory or LDAP authentication in Ambari

Lightweight Directory Access Protocol (LDAP) is a protocol for reading from and writing to directory services such as the Active Directory database. By default, Ambari uses an internal database as the user store for authentication and authorization. You can instead configure Ambari to authenticate against an external LDAP or Active Directory (AD) server.

Before you begin
An LDAP client must be installed on the Ambari server host. The Ambari server must not be running while you perform this task.

The following list describes the properties and values that are required to set up LDAP authentication (Table 1. Ambari server LDAP properties; each entry gives the property, its accepted values in parentheses, and a description).

authentication.ldap.primaryUrl (server:port): The hostname and port for the LDAP or AD server. For example, my.ldap.server:389.

authentication.ldap.secondaryUrl (server:port): The hostname and port for the secondary LDAP or AD server. For example, my.secondary.ldap.server:389. This value is optional.
authentication.ldap.useSSL (true or false): If true, use SSL when connecting to the LDAP or AD server.
authentication.ldap.usernameAttribute (LDAP attribute): The attribute for user name. For example, uid.
authentication.ldap.baseDn (Distinguished Name): The root Distinguished Name to search in the directory for users. For example, ou=people,dc=hadoop,dc=apache,dc=org.
authentication.ldap.bindAnonymously (true or false): If true, bind to the LDAP or AD server anonymously.
authentication.ldap.managerDn (Full Distinguished Name): If bind anonymously is set to false, the Distinguished Name (DN) for the manager. For example, uid=hdfs,ou=people,dc=hadoop,dc=apache,dc=org.
authentication.ldap.managerPassword (password): If bind anonymously is set to false, the password for the manager.
authentication.ldap.userObjectClass (LDAP object class): The object class that is used for users. For example, organizationalPerson.
authentication.ldap.groupObjectClass (LDAP object class): The object class that is used for groups. For example, groupOfUniqueNames.
authentication.ldap.groupMembershipAttr (LDAP attribute): The attribute for group membership. For example, uniqueMember.
authentication.ldap.groupNamingAttr (LDAP attribute): The attribute for group name.
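After ambari-server setup-ldap completes (see the procedure below), these values are written as plain key=value entries in the Ambari server configuration, typically /etc/ambari-server/conf/ambari.properties (verify the path on your Ambari version). A minimal sketch for a fictional directory server, with all host names and DNs purely illustrative:

authentication.ldap.primaryUrl=my.ldap.server:389
authentication.ldap.useSSL=false
authentication.ldap.usernameAttribute=uid
authentication.ldap.baseDn=ou=people,dc=hadoop,dc=apache,dc=org
authentication.ldap.bindAnonymously=false
authentication.ldap.managerDn=uid=hdfs,ou=people,dc=hadoop,dc=apache,dc=org
authentication.ldap.userObjectClass=organizationalPerson
authentication.ldap.groupObjectClass=groupOfUniqueNames
authentication.ldap.groupMembershipAttr=uniqueMember
authentication.ldap.groupNamingAttr=cn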

Note: If you are going to set bindAnonymously to false (the default), make sure that you have an LDAP manager name and password set up. If you are going to use SSL, make sure you have already set up your certificate and keys. To manage authorization and permissions for your users and groups, you must synchronize those LDAP users and groups into the Ambari database. If the LDAP server certificate is signed by a trusted Certificate Authority, you do not need to import the certificate into Ambari. If the LDAP server certificate is self-signed, or is signed by an unrecognized certificate authority such as an internal certificate authority, you must import the certificate and create a keystore file.

Procedure
1. Stop the Ambari server.
ambari-server stop
2. If required, create a keystore file.
a. Create a directory for the keystore file. For example, type mkdir /keys to create a directory called keys.
b. Create the keystore file. For example, type the following command to create the keystore file ldaps-keystore.jks in the keys directory.
$JAVA_HOME/bin/keytool -import -trustcacerts -alias root -file $PATH_TO_YOUR_LDAPS_CERT -keystore /keys/ldaps-keystore.jks
c. When prompted, set a password. The password is needed when you set up LDAP or AD authentication in Ambari.
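It is worth confirming that the certificate actually landed in the new keystore before continuing. A quick check (the alias root matches the -alias used above):

$JAVA_HOME/bin/keytool -list -keystore /keys/ldaps-keystore.jks

Enter the keystore password when prompted and verify that an entry with alias root and type trustedCertEntry is listed.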

3. Run the following LDAP setup command, and answer the prompts with the information that you previously collected.
ambari-server setup-ldap
Note: Prompts marked with an asterisk are required values.
4. At the Primary URL* prompt, type the server URL and port.
5. At the Secondary URL prompt, type the secondary server URL and port.
6. At the Use SSL* prompt, type your value. If you are connecting over SSL (LDAPS), type true.
7. At the User name attribute* prompt, type your value. The default value is uid.
8. At the Base DN* prompt, type your value.
9. At the Bind anonymously* prompt, type your value.
10. If you set bind anonymously to false, at the Manager DN* prompt, type your value.
11. At the Enter the Manager Password* prompt, type the password for your LDAP manager.
12. At the Enter the userObjectClass* prompt, type the object class that is used for users.
13. At the Enter the groupObjectClass* prompt, type the object class that is used for groups.
14. At the Enter the groupMembershipAttr* prompt, type the attribute for group membership.
15. At the Enter the groupNamingAttr* prompt, type the attribute for group name.
16. If you set Use SSL* to true in step 6, the prompt Do you want to provide custom TrustStore for Ambari? appears.

o If you are using a self-signed certificate that you do not want imported into the existing JDK keystore, type y. This option is more secure: only Ambari uses the certificate, and not any other applications run by the JDK on the same host. When you select this option, additional prompts appear.
At the TrustStore type prompt, type jks.
At the Path to TrustStore file prompt, type /keystore_directory/ldaps-keystore.jks.
At the Password for TrustStore prompt, type the password that you defined for the keystore.
o If you are using a self-signed certificate that you want to import and store in the existing, default JDK keystore, type n. This option is less secure. When you select this option, do the following.
If necessary, convert the SSL certificate to X.509 format by executing the following command:
openssl x509 -in slapd.pem -out slapd.crt
where slapd.crt is the path to the X.509 certificate.
Import the SSL certificate into the existing keystore, such as the default JRE certificates store, by typing the following command:
/usr/jdk64/jdk1.7.0_45/bin/keytool -import -trustcacerts -file slapd.crt -keystore /usr/jdk64/jdk1.7.0_45/jre/lib/security/cacerts
Here Ambari is set up to use JDK 1.7, so the certificate must be imported into the JDK 7 keystore.
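If the LDAPS connection later fails, it helps to confirm the TLS handshake from the Ambari host directly. A quick check (my.ldap.server:636 is a placeholder for your directory server's LDAPS address):

openssl s_client -connect my.ldap.server:636 -showcerts < /dev/null

If the handshake succeeds, the server certificate chain is printed; the subject and issuer lines show whether the certificate is self-signed and therefore needs the import step above.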

17. Review your settings, and if they are correct, select y.
18. Restart the Ambari server.
19. Synchronize your LDAP users and groups into the Ambari database.
o To synchronize a specific set of users and groups, type the following command:
ambari-server sync-ldap --users users.txt --groups groups.txt
where users.txt and groups.txt are files that contain comma-separated users and groups.
Note: Group membership is determined using the group membership attribute that you specified when you ran ambari-server setup-ldap.
o If you have synchronized a specific set of users and groups, type the following command to synchronize only the entities that are already in Ambari with LDAP. Users are removed from Ambari if they no longer exist in LDAP, and group membership in Ambari is updated to match LDAP.
ambari-server sync-ldap --existing
Note: Group membership is determined using the group membership attribute that you specified when you ran ambari-server setup-ldap.
o To import all entities with matching LDAP user and group object classes into Ambari, type the following command:
ambari-server sync-ldap --all
Note: Use this option only if you are sure that you want to synchronize all users and groups from LDAP into Ambari. Isilon must also be configured for LDAP authentication for this synchronization to work across the entire cluster.
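The --users and --groups files are plain comma-separated lists. A minimal sketch with fictional account and group names:

users.txt contains: jdoe,asmith,svc-bi
groups.txt contains: hadoop-users,hadoop-admins

ambari-server sync-ldap --users users.txt --groups groups.txt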

Additional User Privileges
Initially, the users you have enabled all have Ambari User privileges. Ambari Users can read metrics, view service status and configuration, and browse job information. If you want users to be able to start or stop services, modify configurations, and run smoke tests, you must give those users administrator privileges.

Enabling Kerberos for HDFS on Isilon Using MIT Kerberos 5
This section explains how to set up an Isilon cluster to authenticate HDFS connections with a stand-alone MIT Kerberos 5 key distribution center (KDC). The following instructions assume that you have already set up a Kerberos system with a resolvable hostname for the KDC and a resolvable hostname for the KDC admin server. It is also assumed that your KDC is running on the Ambari server, that all KDCs have different realm names, that the Hadoop client setup for Kerberos is complete on the compute nodes, and that you have one KDC per zone.

Note: AES encryption must be disabled in krb5.conf, and RC4/DES should be listed as the only supported encryption types on the server and clients. In kdc.conf:
supported_enctypes = RC4-HMAC:normal DES-CBC-MD5:normal DES-CBC-CRC:normal

Note: Deleting principals from Isilon does not remove them from the KDC.
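On the client side, the matching restriction goes in krb5.conf. A minimal sketch of the [libdefaults] entries (note that MIT Kerberos 1.8 and later additionally require allow_weak_crypto = true before DES types are used at all; verify against the Kerberos version shipped with your OS):

[libdefaults]
    default_tkt_enctypes = rc4-hmac des-cbc-md5 des-cbc-crc
    default_tgs_enctypes = rc4-hmac des-cbc-md5 des-cbc-crc
    permitted_enctypes = rc4-hmac des-cbc-md5 des-cbc-crc
    allow_weak_crypto = true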

Procedure
Connect with SSH as root to any node in your Isilon cluster and run the following commands to configure Isilon for Kerberos.
1. To prevent automatic SPN generation in the System zone, set the All Auth Providers setting on the System zone to No:
isi zone zones modify --zone=system --all-auth-providers=no
2. Add the KDC to the Isilon cluster; each KDC needs a unique name:
isi auth krb5 create --realm=example.com --admin-server=kdc.example.com --kdc=kdc.example.com --user=kadmin/admin --password=isi
3. To verify and list all the auth providers for the cluster, run:
isi auth status
4. Modify the zone to use the authentication provider:
isi zone zones modify --zone=zone-example --add-auth-provider=krb5:example.com
5. Verify the zone information with the view command:
isi zone zones view --zone=zone-example
6. Create the Isilon SPNs for the zone. The format must be hdfs/<cluster hostname/SmartConnect name>@REALM and HTTP/<cluster hostname/SmartConnect name>@REALM:
isi auth krb5 spn create --provider-name=example.com --spn=hdfs/cluster.example.com@EXAMPLE.COM --user=kadmin/admin --password=isi
isi auth krb5 spn create --provider-name=example.com --spn=HTTP/cluster.example.com@EXAMPLE.COM --user=kadmin/admin --password=isi
7. Verify the SPN creation:
isi auth krb5 spn list --provider-name=example.com
8. Lastly, create the proxy users (a quick verification sketch follows this list):
o isi hdfs proxyusers create oozie --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create hive --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create zookeeper --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create flume --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create hadoop --zone=zone-example --add-user=ambari-qa
o isi hdfs proxyusers create hbase --zone=zone-example --add-user=ambari-qa
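To confirm that the proxy users took effect, OneFS can list them per zone (a hedged check; consult isi hdfs proxyusers --help if the syntax differs on your OneFS release):

isi hdfs proxyusers list --zone=zone-example

Each of the six service accounts above should appear, and isi hdfs proxyusers view <name> --zone=zone-example shows the members (such as ambari-qa) that each proxy user can impersonate.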

9. Before proceeding to this step, you should have finished the Kerberos setup on the compute nodes and completed the Ambari Security Wizard. After everything has finished installing, configure the Isilon zone to allow only secure connections with the following command:
isi zone zones modify --zone=zone-example --hdfs-authentication=kerberos_only

Note: It is very important during the Ambari Security Wizard (next section) to configure the HDFS principals (namenode, snamenode, datanode) as, for example, hdfs/isilonzone.example.com@EXAMPLE.COM. All three principals must point to the FQDN of the configured Isilon Hadoop zone at your realm name.
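Once the zone is restricted to kerberos_only, a quick smoke test from any compute node confirms the behavior (the keytab path and principal below are placeholders for ones created earlier in this section):

kdestroy
hadoop fs -ls /    (should fail with a GSS/SASL authentication error)
kinit -kt hdfs.isilonzone.example.com.keytab hdfs/isilonzone.example.com@EXAMPLE.COM
hadoop fs -ls /    (should now succeed)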

Running the Ambari Kerberos Wizard
Note: Make sure you complete the Enabling Kerberos for HDFS on Isilon setup (shown in the previous section) before running the Ambari Kerberos Wizard.

Your cluster might use a primary KDC and one or more secondary KDCs to ensure continued availability of Kerberos-enabled services. In this configuration, each KDC contains a copy of the Kerberos database. The primary KDC contains the writable copy of the realm database, which is replicated on each of the secondary KDCs. The Kerberos realm must trust the server. In Kerberos configuration files, your realm is typically identified in uppercase characters to differentiate it from any similar DNS domain that the realm is associated with.

Note: To use Kerberos, you must install a few basic packages on the machines in your cluster or build and install the packages from scratch. If you need to build the packages yourself, you can download the latest version from the MIT website. If your system uses a package management system, you can install the following packages to use a generic version of Kerberos (an example install command follows this list):
krb5-workstation must be installed on all client systems. This package contains the basic Kerberos programs, in addition to Kerberos-enabled versions of the telnet and ftp applications.
krb5-server must be installed on all server and secondary server systems. This package provides the programs that must be installed on a Kerberos 5 server or server replica.
krb5-libs must be installed on all client and server systems. This package contains the shared libraries that are used by Kerberos clients and servers.
pam_krb5 must be installed on all client systems. This package provides a pluggable authentication module (PAM) that enables Kerberos authentication.
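On RHEL or CentOS nodes these packages install directly with yum (a minimal sketch; package names may differ on other distributions):

yum install -y krb5-workstation krb5-libs pam_krb5
(on every client/compute node)

yum install -y krb5-server krb5-libs
(on the KDC and any KDC replicas)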

Procedure
1. From the Ambari web dashboard menu bar, click Admin > Kerberos.
2. Click Enable Kerberos.
3. Select the type of KDC that you want to use and confirm that you meet the prerequisites.
4. Provide information about the KDC and the admin account on the configuration page.
5. Install the Kerberos client. The wizard page shows you the progress, but you can also follow the install in the file /var/log/ambari-server/ambari-server.log. The Kerberos clients are installed on the hosts, and access to the KDC is verified by testing that Ambari can create a principal, generate a keytab, and distribute that keytab.
6. Configure the Kerberos identities that are used by Hadoop.

7. Kerberize the cluster.

Note: Make sure Isilon is configured for Kerberos before configuring HDFS in the Ambari Security Wizard; see Enabling Kerberos for HDFS on Isilon.
Click through the wizard until you get to the screen that configures the principals.
Note: Isilon does not convert principal names to short names using rules, so do not use aliases (for example, rm instead of yarn).
o Realm name
o HDFS -> NameNode principal: hdfs/isilonzone.example.com@EXAMPLE.COM
o HDFS -> Secondary NameNode principal: hdfs/isilonzone.example.com@EXAMPLE.COM
o HDFS -> DataNode principal: hdfs/isilonzone.example.com@EXAMPLE.COM
o YARN -> ResourceManager principal: yarn/_HOST
o YARN -> NodeManager principal: yarn/_HOST
o MapReduce2 -> History Server principal: mapred/_HOST
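After the wizard completes, the principal choices above land in the standard Hadoop security properties. A minimal sketch of what the relevant entries should resemble (these are the stock Hadoop property names; _HOST is expanded by Hadoop to each node's own FQDN at runtime):

dfs.namenode.kerberos.principal = hdfs/isilonzone.example.com@EXAMPLE.COM
dfs.datanode.kerberos.principal = hdfs/isilonzone.example.com@EXAMPLE.COM
yarn.resourcemanager.principal = yarn/_HOST@EXAMPLE.COM
yarn.nodemanager.principal = yarn/_HOST@EXAMPLE.COM
mapreduce.jobhistory.principal = mapred/_HOST@EXAMPLE.COM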
