Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:

Size: px

Start display at page:

Download "Logging on to the Hadoop Cluster Nodes. To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example:"

Charla McDowell
5 years ago
Views:

1 Hadoop User Guide Logging on to the Hadoop Cluster Nodes To login to the Hadoop cluster in ROGER, a user needs to login to ROGER first, for example: ssh username@roger-login.ncsa. illinois.edu after entering password, we can login to the Hadoop login-node (i.e., cg-hm08.ncsa.illinois. edu) ssh cg-hm08 Data access on Hadoop After login to the Hadoop cluster, a user can interact with the HDFS with a set of commands. To check the version of the Hadoop software: hadoop version or the existing nodes in the cluster: yarn node -list Before writing specific MapReduce code, the following steps are necessary to make sure the settings are properly configured. The first thing to do is check existing files in HDFS: hdfs dfs -ls PATH (e.g., /user /account_name/folder_name)

2 hdfs dfs -ls /user (to check existing users in the Hadoop cluster) if it is the first time for a user to use the Hadoop cluster, creating a user folder with the same name as the cluster account name is necessary. To make a directory: hdfs dfs -mkdir /user /account_name (for first time usage) hdfs dfs -mkdir folder_name (it will create a folder /user /account_name/folder_name) To copy your data from local disks to HDFS: hdfs dfs - /user/account_name/folder_name which is equivalent to: hdfs dfs - folder_name Contrary, to collect data from HDFS to local storage: hdfs dfs -getmerge PATH_TO_HDFS_FILE PATH_TO_LOCAL_FILE To remove a file (or directory) from HDFS we can use: hdfs dfs -rm(r) PATH_TO_HDFS_FILE Furthermore, options such as replication factor and block size can be specified using -D directive. hdfs dfs -Ddfs.replication=1 - PATH_TO_HDFS_FILE

3 Click here for a detailed HDFS commands guide» Writing MapReduce Programs Interact with Hadoop using Pig script To run a Pig script, directly type in the Linux console: pig pig_script.pig or directly enter pig console: pig Note that the Pig script is submitted via YARN, which will be introduced later. Writing MapReduce program using Java 3. A Java program It is recommended to compile the Java source code using Ant, where all the dependency and libraries are configured in the build.xml (see Appendix for details) A MapReduce program usually consists of three main classes: one Mapper class, one reducer class and one runner class which runs the Hadoop job. (combiner class is optional) MapReduce with Hadoop Streaming API Hadoop streaming API can be utilized by any executable or script as the mapper and/or the reducer respectively, in this case, Python Create a folder named app Create a python script to handle mapper, e.g., mapper.py Create a python script to handle reducer, e.g., reducer.py If third-party scripts are imported, put the source code in the same folder. Writing Spark program Currently, Spark supports Scala (native), Python and R (in this case, we will use Python as an example) Spark can be launched directly in the local node by typing: pyspark 3. It can also execute python script by: pyspark script.py (in the local node) spark-submit script.py --master yarn-client (as a job running in the cluster)

4 Submitting Jobs via YARN Hadoop jobs are managed by YARN. YARN handles: a. Queuing jobs b. Assigning submitted jobs to computing resources Submit Java programs hadoop jar dist /your_compiled_program.jar [input parameters] Input parameters will then be handled by the main runner class of the program. They usually include input and output address on HDFS, as well as some configuration of the job. Submit MapReduce programs with Hadoop Streaming API (Python) If your source code only contains mapper.py and reducer.py yarn jar /usr/hdp/ /ha doop-mapreduce/hadoop-streaming jar \ -D mapred.reduce.tasks=5 \ -mapper "mapper_script.py" \ -file mapper_script.py -reducer "reducer_script.py" \ -file reducer_script.py -input INPUT_FILE (PATH) - output OUTPUT_FILE (PATH) if your source code utilizes third-party libraries yarn jar /usr/hdp/ /ha doop-mapreduce/hadoop-streaming jar \ -D mapred.reduce.tasks=5 \ -files app \ -mapper "app/mapper_script.py" \ -reducer "app/reducer_script.py" \ -input INPUT_FILE (PATH) - output OUTPUT_FILE (PATH) The -file function will upload the specific python file to the entire cluster so that every node is aware of the file and will execute this script without problem.

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals