Building Apache Hadoop on IBM Power Systems
January 5, 2015
César Diniz Maciel
Executive IT Specialist, IBM Global Techline
cmaciel@us.ibm.com
Trademarks, Copyrights, Notices and Acknowledgements

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: Active Memory™, AIX, POWER, POWER Hypervisor™, Power Systems™, Power Systems Software™, POWER6, POWER7, POWER7+™, POWER8™, PowerHA, PowerLinux™, PowerVM, System x, System z. Additional trademarks may be identified in the body of this document.

The following terms are trademarks of other companies: Intel, Intel Xeon, the Intel logo, the Intel Inside logo, and the Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. LTO, Ultrium, the LTO Logo and the Ultrium logo are trademarks of HP, IBM Corp. and Quantum in the U.S. and other countries. Microsoft and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. UNIX is a registered trademark of The Open Group in the United States and other countries.
Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. Other company, product, or service names may be trademarks or service marks of others.

While this white paper has a single principal author/editor, it is the culmination of the work of a number of subject matter experts within IBM who contributed ideas and detailed technical information. I would like to thank the following individuals, who contributed content, comments, validation, and corrections to the present document: Corentin Baron, Luke Browning, Pascal Oliva, Tony Reix, Daniele Silvestre. IBM US
Building Apache Hadoop on IBM Power Systems

Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the framework itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Three components are integrated in this build:
- Hadoop Common, which contains the utilities used by the other Hadoop components
- Hadoop MapReduce, the framework for computing large data sets distributed across a cluster
- Hadoop HDFS, the distributed file system

The IBM POWER processor provides several features that deliver better performance for Hadoop workloads. From multiple SMT threads to high memory bandwidth and fast I/O speeds, Power Systems offer an attractive and cost-efficient infrastructure for Big Data solutions.

The IBM Solutions Operating Environment hosts buildable Hadoop source trees optimized for Linux running on POWER. These trees support both Red Hat Enterprise Linux (RHEL) v6.5 or later on big-endian with PowerVM and Ubuntu v14.04 on little-endian with PowerKVM. They also work with openSUSE 13.2 (big-endian). Other versions or releases may also work, but have not been tested. The current version available for download is Hadoop version 2.4.1. The tree is hosted at https://github.com/ibmsoe/hadoop-common and includes a README.md file with information on requirements and how to build the package. This document complements that information with additional details on package installation and procedures.

It is recommended to create a dedicated user for building and managing Hadoop.
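Creating such a user can be done with the standard Linux tools; the following is a minimal sketch (the user name hadoop is the one assumed throughout this document, and the shell and group choices shown are illustrative):

```shell
# Create a dedicated group and user for building and managing Hadoop.
# Run as root. The -m flag creates a home directory, which will hold
# the .bashrc customizations described in the following sections.
groupadd hadoop
useradd -m -g hadoop -s /bin/bash hadoop

# Set an initial password interactively.
passwd hadoop
```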
In this document, we assume that a hadoop user has been created for that purpose; the environment customizations described below should therefore be performed under the hadoop user. For software installation, root authority (either by logging in as root or by using sudo) is required.

© 2015 International Business Machines Corporation
A - Installing requisite components

Several components have to be installed and configured prior to building Hadoop. Some come with the Linux distribution, and some must be manually downloaded and installed on the system.

1. Downloading and installing Linux packages

The following components are part of the Linux OS and can be installed using the system installation tools available, such as yum or apt-get:
- cmake
- automake
- autoconf
- git
- openssl (called libopenssl on openSUSE)
- SSL library development package (openssl-devel on Red Hat 7, libssl-dev on Ubuntu, libopenssl-dev on openSUSE)
- zlib (called zlib1g on Ubuntu, libz1 on openSUSE)

2. Installing the C and C++ compilers

Download and install the GNU C and C++ compilers (gcc and g++). Linux distributions come with gcc/g++, but we recommend using the IBM Advanced Toolchain for PowerLinux. It is a build of the GNU C/C++ compilers optimized for the POWER platform and maintained by IBM. Documentation on how to download and install the toolchain is available at http://ibm.biz/powerlinuxat. Depending on the Linux distribution and version, different versions of the Advanced Toolchain are supported; the wiki highlights the supported combinations. You should use the latest version available for the Linux distribution being used.

Although you can have both the standard gcc/g++ and the Advanced Toolchain installed on a system, make sure that when you compile code you are using the compiler, and libraries, that you require for that operation. It is simpler to remove gcc/g++ from the system (if previously installed) and install the Advanced Toolchain. It works exactly the same as the compilers provided by the distribution, and can be used for any project.
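As an illustration, the packages listed above could be installed with commands like the following. The package names are taken from the list above; exact names can vary by distribution release, so treat this as a sketch rather than exact commands:

```shell
# Red Hat Enterprise Linux (yum):
yum install cmake automake autoconf git openssl openssl-devel zlib

# Ubuntu (apt-get):
apt-get install cmake automake autoconf git openssl libssl-dev zlib1g

# openSUSE (zypper):
zypper install cmake automake autoconf git libopenssl libopenssl-dev libz1
```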
Once you install the toolchain, customize .bashrc to include the following:

export PATH=$PATH:/opt/atX.X/bin

where X.X is your Advanced Toolchain version. For example:

export PATH=$PATH:/opt/at7.0/bin

Test your installation by running the following command:

gcc --version

It should return the compiler version and build information.
3. Installing Java

You must have Java installed in order to build and run Hadoop. Although OpenJDK is available for Linux on Power Systems, the IBM JDK, like the Advanced Toolchain, is optimized for the POWER processor. Moreover, the IBMSOE Hadoop has been optimized for the IBM JDK and does not support OpenJDK at this time. Download and install the 64-bit IBM POWER JDK, version 7 or later, from http://www.ibm.com/developerworks/java/jdk/linux/download.html. Customize .bashrc to include the following:

export JAVA_HOME=/opt/ibm/java-ppc64-71
export PATH=$PATH:/opt/ibm/java-ppc64-71/bin

Test your installation by running the following command:

java -version

It should return the JDK version and build information. Although you can have different JDKs installed simultaneously, if you have more than one JDK installed, make sure you specify the correct path for the java executable and set the appropriate JAVA_HOME for this project.

4. Installing Apache Maven

Apache Maven is a software project management and comprehension tool. It is a requisite to build the IBMSOE Hadoop. It depends on Java, so make sure your JDK is set up in advance. Download Apache Maven from http://maven.apache.org/download.cgi. You must install version 3 or later; Maven version 2 does not work to build IBMSOE Hadoop. In this document, version 3.2.3 was used. Extract the package to a given directory (/usr/local is assumed in this tutorial).

You can either customize .bashrc or add a script to /etc/profile.d/ with the Maven customization. Adding the script keeps the system more organized and easier to manage, but it is up to the administrator to decide. In this document we used a /etc/profile.d/maven.sh script with the following:

export M2_HOME=/usr/local/apache-maven-XXX
export M2=$M2_HOME/bin
export PATH=$PATH:$M2

where XXX is your Maven version (for example: export M2_HOME=/usr/local/apache-maven-3.2.3).
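Putting the Maven steps together, the installation might look like the following sketch. The tarball name and the /usr/local location match the version assumed in this document; adjust both to your download:

```shell
# Extract the Maven binary archive to /usr/local (run as root).
tar -xzf apache-maven-3.2.3-bin.tar.gz -C /usr/local

# Create the profile script so the settings apply to all login shells.
cat > /etc/profile.d/maven.sh <<'EOF'
export M2_HOME=/usr/local/apache-maven-3.2.3
export M2=$M2_HOME/bin
export PATH=$PATH:$M2
EOF
```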
Test your installation by running the following command:

mvn -version

It should return the Maven version and build information. While building Hadoop, Maven needs access to the Internet in order to download packages. Therefore, the build system must be capable of accessing the Internet either directly or through proxies. Contact your network administrator to find out whether proxies are required in your environment. Refer to the Maven documentation on how to set up proxies at http://maven.apache.org/guides/mini/guide-proxies.html

5. Installing Apache Ant

Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other. The main known usage of Ant is the build of Java applications. Download Apache Ant from http://ant.apache.org/bindownload.cgi (use the binary archive) and extract it to a given directory; /usr/local is assumed in this document. As with Maven, either customize .bashrc or add a script to /etc/profile.d/ (in this document, we used /etc/profile.d/ant.sh) to include the following:

export ANT_HOME=/usr/local/apache-ant-XXX
export PATH=$PATH:$ANT_HOME/bin

where XXX is your Ant version (for example: export ANT_HOME=/usr/local/apache-ant-1.9.4).

Test your installation by running the following command:

ant -version

It should return the Ant version and date.

6. Installing Protobuf

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. Download Protobuf from https://github.com/ibmsoe/protobuf and follow the compilation and installation instructions from the README.txt file. Essentially, it is the standard Linux compilation process used in many open source applications:

./configure
make
make check
make install
The make check step verifies that the code compiled and is working properly on the system.

An important consideration relates to the Protobuf shared libraries. By default, the package is installed to /usr/local. However, on many platforms, /usr/local/lib is not part of LD_LIBRARY_PATH. Because of that, when the protoc command is issued, the library is not found and an error similar to the following is shown:

$ protoc --version
protoc: error while loading shared libraries: libprotoc.so.8: cannot open shared object file: No such file or directory

To solve the problem, either invoke the configure script passing /usr as the installation path:

./configure --prefix=/usr

and redo the compilation and installation steps, or simply export the LD_LIBRARY_PATH environment variable as follows (add to .bashrc):

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<protobuf library path>

for example:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib64

You can also register the library using ldconfig; check your Linux documentation for details. After this, re-running the command should display the installed library version.

7. Installing Snappy

Most Linux distributions provide the snappy package as part of the distribution. In those cases, simply install the snappy library and the snappy development package. The package names change depending on the Linux distribution being used. Examples are as follows:

Red Hat 6, SUSE: snappy, snappy-devel
Ubuntu: libsnappy1, libsnappy1-dev
openSUSE: libsnappy1, snappy-devel

To install and configure snappy for Hadoop, simply install those two packages on the Linux system. Red Hat 7 provides the snappy library, but not the development package. Therefore, for Red Hat 7 specifically, it is easier to build and install snappy from source. The snappy-java package is a Java port of snappy (http://code.google.com/p/snappy/), a fast C++ compressor/decompressor developed by Google. Download it from https://github.com/ibmsoe/snappy-java.
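As a sketch of the ldconfig approach mentioned above (the paths assume the default /usr/local installation on a 64-bit system; the configuration file name is illustrative):

```shell
# Register the Protobuf library directory with the dynamic linker (run as root).
# /usr/local/lib64 matches the 64-bit default installation used in this document.
echo "/usr/local/lib64" > /etc/ld.so.conf.d/protobuf.conf
ldconfig

# protoc should now find libprotoc without LD_LIBRARY_PATH being set:
protoc --version
```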
Run make, and after the build is done, do the following:

cd target/snappy-1.1.1
./configure
make
make install

This installs the library and include files required to build Hadoop.

After all the requisites are installed, the environment is ready to build Hadoop. To verify that everything is properly set up, it is recommended to re-check the following environment variables and command output:

echo $JAVA_HOME
echo $M2_HOME
echo $ANT_HOME
echo $M2
echo $LD_LIBRARY_PATH (if used for protobuf)
echo $PATH

Make sure the variables are set to the required values as described in the previous sections. Also, make sure that the following commands run when invoked:

gcc --version
java -version
ant -version
mvn -version
protoc --version
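The checklist above can be wrapped in a small script. The following is an illustrative sketch, not part of the original build instructions; it only reports what is missing and does not change the system:

```shell
#!/bin/sh
# Sanity-check the build prerequisites described in the previous sections.

# Report whether a command is reachable through $PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "MISSING: $1"
    fi
}

# Report whether an environment variable is set and non-empty.
check_var() {
    eval "val=\$$1"
    if [ -n "$val" ]; then
        echo "ok: $1=$val"
    else
        echo "UNSET: $1"
    fi
}

check_var JAVA_HOME
check_var M2_HOME
check_var ANT_HOME
check_var M2

for c in gcc java ant mvn protoc; do
    check_cmd "$c"
done
```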
B - Building Hadoop

If everything is working properly, retrieve the Hadoop code from https://github.com/ibmsoe/hadoop-common (either by downloading the zip file or by cloning the tree using git). To compile and install Hadoop into the Maven cache using JNI and snappy, use the following build command from the root of the Hadoop tree directory (for example, hadoop-common):

<hadoop_common> $ mvn install -Pnative -DskipTests -Drequire.snappy

If you get errors indicating that Maven was unable to retrieve some artifact, that is probably an issue with Maven downloading JAR files from the web. Clean your Maven repository:

rm -rf ~/.m2/repository/*

and retry the operation. After running a successful build, you can test it using the following command:

<hadoop_common> $ mvn test -Pnative -Drequire.snappy

Be aware that the test runs for several hours. To create an archive containing a Hadoop build, use the following command:

<hadoop_common> $ mvn package -Pnative,dist -Drequire.snappy -DskipTests -Dtar

This command generates a gzipped tar file under the distribution target directory (for this version, hadoop-2.4.1.tar.gz under hadoop-dist/target). This is a complete Hadoop core build that can be copied to other systems to run Hadoop.
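Copying the resulting archive to another system could look like the following sketch. The host name, user, and paths are purely illustrative, and the archive name assumes the Hadoop 2.4.1 build produced above:

```shell
# Copy the build archive to a target node and unpack it (illustrative paths).
scp hadoop-dist/target/hadoop-2.4.1.tar.gz hadoop@node1:/home/hadoop/
ssh hadoop@node1 'tar -xzf /home/hadoop/hadoop-2.4.1.tar.gz -C /home/hadoop'

# Quick smoke test on the target node: print the Hadoop build information.
ssh hadoop@node1 '/home/hadoop/hadoop-2.4.1/bin/hadoop version'
```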
C - Building additional Hadoop components

After the Hadoop core is compiled and installed, and Hadoop is configured and operational, other components can be added to the infrastructure to provide additional services, such as Pig, which provides a programming environment for Hadoop, or Hive, which provides an SQL-like query interface for Hadoop. From the https://github.com/ibmsoe tree it is possible to retrieve the source code for many of these components and compile them for use with Hadoop. This allows you to extend Hadoop's functionality, integrate it with other software components, and perform management and data analysis. For details on the various components, and how to build them, refer to the provided link and browse the different trees available.
About the author:

Cesar Diniz Maciel is an Executive IT Specialist with IBM in the United States. He joined IBM in 1996 as Presales Technical Support for the IBM RS/6000 family of UNIX servers in Brazil, and came to IBM United States in 2005. He is part of the Global Techline team, working on Power Systems pre-sales consulting. He holds a degree in Electrical Engineering from Universidade Federal de Minas Gerais (UFMG) in Brazil. His areas of expertise include Power Systems, AIX, and IBM POWER Virtualization. He has coauthored several ITSO Redbooks on Power Systems and related products.

Notices:

This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Users of this document should verify the applicable data for their specific environment. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
© IBM Corporation 2015
IBM Corporation
Systems and Technology Group
Route 100
Somers, New York 10589

Produced in the United States of America
January 2015
All Rights Reserved

The Power Systems home page can be found at: http://www-03.ibm.com/systems/power/
The IBM Systems Software home page can be found at: http://www-03.ibm.com/systems/software/