Building Apache Hadoop on IBM Power Systems


Building Apache Hadoop on IBM Power Systems
January 5, 2015
César Diniz Maciel
Executive IT Specialist
IBM Global Techline
cmaciel@us.ibm.com

Trademarks, Copyrights, Notices and Acknowledgements

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: Active Memory™, AIX, POWER, POWER Hypervisor™, Power Systems™, Power Systems Software™, POWER6, POWER7, POWER7+™, POWER8™, PowerHA, PowerLinux™, PowerVM, System x, System z.

Additional trademarks may be identified in the body of this document.

The following terms are trademarks of other companies: Intel, Intel Xeon, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. LTO, Ultrium, the LTO Logo and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries. Microsoft, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. UNIX is a registered trademark of The Open Group in the United States and other countries.
Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. Other company, product, or service names may be trademarks or service marks of others.

While this white paper has a single principal author/editor, it is the culmination of the work of a number of subject matter experts within IBM who contributed ideas and detailed technical information. I would like to thank the following individuals, who contributed content, comments, validation, and corrections to this document: Corentin Baron, Luke Browning, Pascal Oliva, Tony Reix, Daniele Silvestre.

IBM US

Building Apache Hadoop on IBM Power Systems

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the framework itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

There are three components integrated in this build:

- Hadoop Common, which contains the utilities used by the other Hadoop components.
- Hadoop MapReduce, which is the framework for computing on large data sets distributed across a cluster.
- Hadoop HDFS, which is the distributed file system.

The IBM POWER processor provides several features that improve performance for Hadoop workloads. From multiple SMT threads to memory bandwidth to fast I/O, Power Systems offer an attractive and cost-efficient infrastructure for Big Data solutions.

The IBM Solutions Operating Environment hosts buildable Hadoop source trees optimized for Linux running on POWER. These trees support Red Hat Enterprise Linux (RHEL) v6.5 or later on big-endian with PowerVM, and Ubuntu v14.04 on little-endian with PowerKVM. They also work with openSUSE 13.2 (big-endian). Other versions or releases may also work, but have not been tested.

The current version available for download is Hadoop 2.4.1. The tree is hosted at https://github.com/ibmsoe/hadoop-common, and it includes a README.md file with information on requirements and how to build the package. This document aims to complement that information with additional details on package installation and procedures.

It is recommended to create a user for building and managing Hadoop. In this document, we assume that a hadoop user has been created for that purpose. Therefore, the required environment customizations should be performed under the hadoop user. Software installation requires root authority (either by logging in as root or by using sudo).

© 2015 International Business Machines Corporation

A - Installing requisite components

Several components have to be installed and configured prior to building Hadoop. Some come with the Linux distribution, and some must be downloaded and installed manually.

1. Downloading and installing Linux packages

The following components are part of the Linux OS and can be installed using the system installation tools, such as yum or apt-get:

- cmake
- automake
- autoconf
- git
- openssl (called libopenssl on openSUSE)
- SSL library development package (openssl-devel on Red Hat 7, libssl-dev on Ubuntu, libopenssl-devel on openSUSE)
- zlib (called zlib1g on Ubuntu, libz1 on openSUSE)

2. Installing the C and C++ compilers

Download and install the GNU C and C++ compilers (gcc and g++). Linux distributions come with gcc/g++, but we recommend using the IBM Advanced Toolchain for PowerLinux. It is a build of the GNU C/C++ compilers optimized for the POWER platform and maintained by IBM. Documentation on how to download and install the toolchain is available at http://ibm.biz/powerlinuxat.

Different versions of the Advanced Toolchain are supported depending on the Linux distribution and version. The wiki highlights the supported combinations. Use the latest version available for your Linux distribution.

Although you can have both the standard gcc/g++ and the Advanced Toolchain installed on a system, you must then make sure that each compilation uses the compiler and libraries you intend. It is simpler to remove gcc/g++ from the system (if previously installed) and install only the Advanced Toolchain. It works exactly the same as the compilers provided by the distribution, and can be used for any project.

Once you install the toolchain, customize .bashrc to include the following (where X.X is your Advanced Toolchain version):

    export PATH=$PATH:/opt/atX.X/bin

Example:

    export PATH=$PATH:/opt/at7.0/bin

Test your installation by running the following command:

    gcc --version

It should return the compiler version and build information.
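Because the export shown above appends the toolchain directory to the end of PATH, a distribution gcc that is still installed would be found first. The following is a minimal sketch for checking which gcc the shell actually resolves. The helper names is_at_path and report_gcc are hypothetical, and the /opt/atX.X/bin layout is assumed from the example above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the given executable path lives under
# an Advanced Toolchain directory such as /opt/at7.0/bin (assumed layout).
is_at_path() {
    case "$1" in
        /opt/at[0-9]*/bin/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report which gcc the current PATH resolves to.
report_gcc() {
    gcc_path=$(command -v gcc 2>/dev/null) || gcc_path=""
    if [ -z "$gcc_path" ]; then
        echo "gcc not found on PATH"
    elif is_at_path "$gcc_path"; then
        echo "Advanced Toolchain gcc: $gcc_path"
    else
        echo "distribution gcc: $gcc_path"
    fi
}

report_gcc
```

If the report shows a distribution gcc while the Advanced Toolchain is installed, either remove the distribution compilers as suggested above or place the toolchain directory earlier in PATH.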

3. Installing Java

You must have Java installed in order to build and run Hadoop. Although OpenJDK is available for Linux on Power Systems, the IBM JDK, like the Advanced Toolchain, is optimized for the POWER processor. Moreover, the IBMSOE Hadoop has been optimized for the IBM JDK and does not support OpenJDK at this time.

Download and install the 64-bit IBM POWER JDK, version 7 or later, from http://www.ibm.com/developerworks/java/jdk/linux/download.html.

Customize .bashrc to include the following:

    export JAVA_HOME=/opt/ibm/java-ppc64-71
    export PATH=$PATH:/opt/ibm/java-ppc64-71/bin

Test your installation by running the following command:

    java -version

It should return the JDK version and build information. Although different JDKs can be installed simultaneously, if you have more than one, make sure you specify the correct path for the java executable and set the appropriate JAVA_HOME for this project.

4. Installing Apache Maven

Apache Maven is a software project management and comprehension tool. It is required to build the IBMSOE Hadoop. It depends on Java, so make sure your JDK is set up in advance.

Download Apache Maven from http://maven.apache.org/download.cgi. You must install version 3 or later; Maven version 2 cannot build IBMSOE Hadoop. This document used version 3.2.3. Extract the package to a directory of your choice (/usr/local in this document).

You can either customize .bashrc or add a script to /etc/profile.d/ with the Maven customization. Adding the script keeps the system more organized and easier to manage, but the choice is up to the administrator. In this document we used an /etc/profile.d/maven.sh script with the following (where XXX is your Maven version, for example /usr/local/apache-maven-3.2.3):

    export M2_HOME=/usr/local/apache-maven-XXX
    export M2=$M2_HOME/bin
    export PATH=$PATH:$M2
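Scripts in /etc/profile.d/ can be sourced more than once in nested shells, and each plain PATH=$PATH:... line then adds another copy of the same directory. As a sketch of one way to avoid that, the hypothetical path_append helper below appends a directory only when it is not already present. The variable names follow the maven.sh example above, and the version number is illustrative.

```shell
#!/bin/sh
# Hypothetical helper for a profile.d-style script: append a directory to
# PATH only if it is not already present, so repeated sourcing is harmless.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already on PATH, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Same variables as the maven.sh example; the version is illustrative.
M2_HOME=/usr/local/apache-maven-3.2.3
M2=$M2_HOME/bin
path_append "$M2"
path_append "$M2"                    # second call leaves PATH unchanged
export M2_HOME M2 PATH
```

The same helper could serve an Ant profile script by calling it with the Ant bin directory.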

Test your installation by running the following command:

    mvn -version

It should return the Maven version and build information.

While building Hadoop, Maven needs access to the Internet in order to download packages. Therefore, the build system must be able to reach the Internet either directly or through proxies. Contact your network administrator to find out whether proxies are required in your environment. Refer to the Maven documentation on how to set up proxies at http://maven.apache.org/guides/mini/guide-proxies.html

5. Apache Ant

Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other. Its main known usage is building Java applications.

Download Apache Ant from http://ant.apache.org/bindownload.cgi: download the binary archive and extract it to a directory of your choice (/usr/local in this document).

As with Maven, either customize .bashrc or add a script to /etc/profile.d/ (in this document, we used /etc/profile.d/ant.sh) to include the following (where XXX is your Ant version, for example /usr/local/apache-ant-1.9.4):

    export ANT_HOME=/usr/local/apache-ant-XXX
    export PATH=$PATH:$ANT_HOME/bin

Test your installation by running the following command:

    ant -version

It should return the Ant version and date.

6. Protobuf

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Download Protobuf from https://github.com/ibmsoe/protobuf and follow the compilation and installation instructions in the README.txt file. Essentially, it is the standard Linux compilation process used in many open source applications:

    ./configure
    make
    make check
    make install

The make check step tests that the code compiled and works properly on the system.

An important consideration relates to the Protobuf shared libraries. By default, the package is installed to /usr/local. However, on many platforms, /usr/local/lib is not part of LD_LIBRARY_PATH. As a result, when the protoc command is issued, the library is not found and an error similar to the following is shown:

    $ protoc --version
    protoc: error while loading shared libraries: libprotoc.so.8: cannot open shared object file: No such file or directory

To solve the problem, either invoke the configure script with /usr as the installation prefix:

    ./configure --prefix=/usr

and redo the compilation and installation steps, or simply export the LD_LIBRARY_PATH environment variable (add it to .bashrc):

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<protobuf library path>

Example:

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64

You can also register the library directory using ldconfig. Check your Linux documentation for details. After this, re-running the command should display the installed version.

7. Snappy

Most Linux distributions include the snappy package. In those cases, simply install the snappy library and the snappy development package; nothing else is needed to configure snappy for Hadoop. The package names depend on the Linux distribution. Examples:

- Red Hat 6, SUSE: snappy, snappy-devel
- Ubuntu: libsnappy1, libsnappy-dev
- openSUSE: libsnappy1, snappy-devel

Red Hat version 7 includes the snappy library but not the development package. Therefore, for Red Hat 7 specifically, it is easier to build and install snappy from source. snappy-java is a Java port of snappy (http://code.google.com/p/snappy/), a fast C++ compressor/decompressor developed by Google. Download it from https://github.com/ibmsoe/snappy-java. Run make, and after the build is done, do the following:

    cd target/snappy-1.1.1
    ./configure ; make ; make install

This installs the library and include files required to build Hadoop.

After all the requisites are installed, the environment is ready to build Hadoop. To verify that everything is properly set up, re-check the following environment variables:

    echo $JAVA_HOME
    echo $M2_HOME
    echo $ANT_HOME
    echo $M2
    echo $LD_LIBRARY_PATH   (if used for protobuf)
    echo $PATH

Make sure the variables are set to the values described in the previous sections. Also, make sure that the following commands run when invoked:

    gcc --version
    java -version
    ant -version
    mvn -version
    protoc --version
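The checks above can be gathered into one script. The following is a minimal sketch; the function names missing_tools and unset_vars are hypothetical, and the lists of commands and variables are the ones named in this section.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print the name of every variable in the argument list that is unset or empty.
unset_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

tools=$(missing_tools gcc java ant mvn protoc)
vars=$(unset_vars JAVA_HOME M2_HOME ANT_HOME M2)

[ -z "$tools" ] || echo "Missing commands:" $tools
[ -z "$vars" ] || echo "Unset variables:" $vars
[ -n "$tools$vars" ] || echo "All build prerequisites found"
```

This sketch only confirms that the commands are on PATH and that the variables are non-empty. It does not run the tools, so a shared-library problem such as the protoc error described earlier would still need the protoc --version check.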

B - Building Hadoop

If everything is working properly, retrieve the Hadoop code from https://github.com/ibmsoe/hadoop-common (either as a zip file or by cloning the tree using git).

To compile Hadoop and install it into the Maven cache using JNI and snappy, run the following build command from the root of the Hadoop tree (for example, hadoop-common):

    <hadoop_common> $ mvn install -Pnative -DskipTests -Drequire.snappy

If you get errors that appear related to Maven being unable to retrieve an artifact, the cause is probably a problem downloading JAR files from the web. Clean your Maven repository:

    rm -rf ~/.m2/repository/*

and retry the operation.

After a successful build, you can test it using the following command:

    <hadoop_common> $ mvn test -Pnative -Drequire.snappy

Be aware that the test runs for several hours.

To create an archive containing a Hadoop build, use the following command:

    <hadoop_common> $ mvn package -Pnative,dist -Drequire.snappy -DskipTests -Dtar

This command generates a gzipped tar file, hadoop-2.4.1.tar.gz, under hadoop-common/hadoop-dist/target. This is a complete Hadoop core build that can be copied to other systems to run Hadoop.
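The clean-and-retry advice above can be scripted. The sketch below is an illustration, not part of the IBMSOE tree: run_with_repo_reset is a hypothetical wrapper that runs a command, and if it fails, empties the given repository directory and runs the command once more. It clears the directory on any failure, which is an assumption made for simplicity; a compile error unrelated to downloads would also trigger it.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command; if it fails, empty the given Maven
# repository directory and run the command one more time.
run_with_repo_reset() {
    repo_dir=$1
    shift
    if "$@"; then
        return 0
    fi
    echo "Command failed; clearing $repo_dir and retrying once" >&2
    # ${repo_dir:?} aborts if the variable is empty, so this never expands to /*
    rm -rf "${repo_dir:?}"/*
    "$@"
}

# Example invocation (commented out so that sourcing this file builds nothing):
# run_with_repo_reset "$HOME/.m2/repository" \
#     mvn install -Pnative -DskipTests -Drequire.snappy
```

The wrapper returns the exit status of the last attempt, so it can be chained with the test or package commands shown above.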

C - Building additional Hadoop components

After the Hadoop core is compiled and installed, and Hadoop is configured and operational, other components can be added to the infrastructure to provide additional services, such as Pig, which provides a programming environment for Hadoop, or Hive, which provides an SQL-like query interface for Hadoop. The https://github.com/ibmsoe tree hosts the source code for many of these components, which can be compiled for use with Hadoop. They extend Hadoop's functionality and its integration with other software components for management and data analysis. For details on the various components and how to build them, refer to the link above and browse the different trees available.

About the author:

César Diniz Maciel is an Executive IT Specialist with IBM in the United States. He joined IBM in 1996 as Presales Technical Support for the IBM RS/6000 family of UNIX servers in Brazil, and came to IBM United States in 2005. He is part of the Global Techline team, working on Power Systems pre-sales consulting. He holds a degree in Electrical Engineering from Universidade Federal de Minas Gerais (UFMG) in Brazil. His areas of expertise include Power Systems, AIX, and IBM POWER Virtualization. He has coauthored several ITSO Redbooks on Power Systems and related products.

Notices:

This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
© IBM Corporation 2015
IBM Corporation
Systems and Technology Group
Route 100
Somers, New York 10589

Produced in the United States of America
January 2015
All Rights Reserved

The Power Systems page can be found at: http://www-03.ibm.com/systems/power/

The IBM Systems Software home page on the Internet can be found at: http://www-03.ibm.com/systems/software/