Programming for Big Data
Hadoop Lab 2: Exploring the Hadoop Environment

Video: a short video guide for some of what is covered in this lab. The link for this video is on my module webpage.
Hadoop Processes

Open a Terminal window.
- Enter hadoop version to check that Hadoop runs.
- Enter start-dfs.sh to start the HDFS daemons.
- Enter start-yarn.sh to start the YARN daemons.
- Enter mr-jobhistory-daemon.sh start historyserver to start the JobHistoryServer daemon.
- Note: use stop-dfs.sh, stop-yarn.sh and mr-jobhistory-daemon.sh stop historyserver to stop them at the end of the session.
- Type jps to see which daemons are running. You should have a NameNode, DataNode, SecondaryNameNode, NodeManager, ResourceManager and JobHistoryServer. (A full session sketch follows below.)

Hadoop Environment
- Explore the Linux environment variables to discover where Hadoop is installed.
- What other Hadoop-related environment variables are set up?
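A minimal sketch of a full session on the lab VM. The jps output shown is illustrative only (process IDs will differ), and printenv is one way to inspect the Hadoop-related environment variables:

    $ hadoop version                       # confirm Hadoop is installed and on the PATH
    $ start-dfs.sh                         # NameNode, DataNode, SecondaryNameNode
    $ start-yarn.sh                        # ResourceManager, NodeManager
    $ mr-jobhistory-daemon.sh start historyserver
    $ jps                                  # illustrative output; your PIDs will differ
    2101 NameNode
    2233 DataNode
    2418 SecondaryNameNode
    2590 ResourceManager
    2711 NodeManager
    2894 JobHistoryServer
    3020 Jps
    $ printenv | grep -i hadoop            # e.g. HADOOP_HOME shows where Hadoop is installed
    $ # at the end of the session:
    $ mr-jobhistory-daemon.sh stop historyserver
    $ stop-yarn.sh
    $ stop-dfs.sh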
Hadoop Web-Interfaces

- Browse the web interface for the filesystem/NameNode, available at: http://localhost:50070/
- Enter hadoop fs to see all the commands available in the filesystem.
- Enter hadoop fs -help to see full details of the commands available in the filesystem.
- Enter hadoop fs -ls / to see the contents of the root directory in HDFS.
- What files are available in HDFS? Explore them and find out their details.
- Browse the web interface for the NameNode (http://localhost:50070/) and see the same contents.
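A short sketch of these exploration steps. The listing shown is illustrative of a typical fresh single-node setup; your root directory contents, user name and dates will differ:

    $ hadoop fs                  # prints usage: the full list of filesystem commands
    $ hadoop fs -help ls         # detailed help for a single command
    $ hadoop fs -ls /            # list the root of HDFS; illustrative output:
    Found 2 items
    drwxrwxrwx   - hadoop supergroup          0 2016-01-10 10:01 /tmp
    drwxr-xr-x   - hadoop supergroup          0 2016-01-10 10:02 /user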
Most commands behave like POSIX/Linux commands: ls, cat, du, etc.
- List supported commands: hdfs dfs -help
- Display detailed help for a command: hdfs dfs -help <command_name>
- Shell commands follow the general format: hdfs dfs -<command> -<option> <path>
  For example: hdfs dfs -rm -r /removeme
- cat: streams the source to stdout (the entire file): hdfs dfs -cat /dir/file.txt
  It is almost always a good idea to pipe the output to head, tail, more or less.
  Get the first 25 lines of file.txt: hdfs dfs -cat /dir/file.txt | head -n 25
- cp: copies files from source to destination: hdfs dfs -cp /dir/file1 /otherdir/file2
- ls: for a file displays stats; for a directory displays its immediate children: hdfs dfs -ls /dir/
- mkdir: creates a directory: hdfs dfs -mkdir /brandnewdir
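A minimal sketch combining several of the commands above. The paths /demo and /demo/file.txt are hypothetical, and the source file /dir/file.txt is assumed to exist already:

    $ hdfs dfs -mkdir /demo                          # create a new directory
    $ hdfs dfs -cp /dir/file.txt /demo/file.txt      # copy within HDFS (assumes the source exists)
    $ hdfs dfs -ls /demo                             # list the directory's immediate children
    $ hdfs dfs -cat /demo/file.txt | head -n 25      # note the pipe: only 25 lines reach the terminal
    $ hdfs dfs -rm -r /demo                          # tidy up recursively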
- mv: moves from source to destination: hdfs dfs -mv /dir/file1 /dir2/file2
- put: copies a file from the local filesystem to HDFS: hdfs dfs -put localfile /dir/file1
  Can also use -copyFromLocal.
- get: copies a file to the local filesystem: hdfs dfs -get /dir/file localfile
  Can also use -copyToLocal.
- rm: deletes files: hdfs dfs -rm /dir/filetodelete
- rm -r: deletes directories recursively: hdfs dfs -rm -r /dirwithstuff
- du: displays the length of each file/directory (in bytes): hdfs dfs -du /somedir/
  Add the -h option to display sizes in human-readable format instead of bytes: hdfs dfs -du -h /somedir
- More commands: tail, chmod, count, touchz, test, etc.
  To learn more about each command, for example: hdfs dfs -help rm
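A sketch of a local-to-HDFS round trip using the commands above. The names localfile, fetched.txt and /sandbox are hypothetical:

    $ echo "hello hdfs" > localfile                  # create a small local test file
    $ hdfs dfs -mkdir /sandbox
    $ hdfs dfs -put localfile /sandbox/hello.txt     # local filesystem -> HDFS (or -copyFromLocal)
    $ hdfs dfs -du -h /sandbox                       # sizes in human-readable form
    $ hdfs dfs -get /sandbox/hello.txt fetched.txt   # HDFS -> local filesystem (or -copyToLocal)
    $ hdfs dfs -rm /sandbox/hello.txt                # delete the file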
Loading Files into Hadoop

Do all of these tasks on the VM. Check your VM to see what data already exists on it; if the sample data is already there, skip the download and follow the exploration steps below.
- Download the sample data into the VM (available in Webcourses/webpage).
- Extract the shakespeare.tar.gz file.
- Insert the shakespeare directory into HDFS using the command: hadoop fs -put shakespeare shakespeare
You can use these same steps to load your own data.

How to view/explore the data in Hadoop
- Enter hadoop fs -ls to see the updated contents in HDFS.
- Enter hadoop fs -ls shakespeare to see the contents of the shakespeare directory in HDFS.
  Note that the default location in HDFS is /user/<your name>.
- Access the contents of the poems file using: hadoop fs -cat shakespeare/poems | less
- Browse the web interface for the NameNode (http://localhost:50070/) and see the same contents. How many blocks are used? What else can you see/find out about the data?
(A sketch of the full sequence follows below.)

Hadoop Documentation
- Hadoop Website: http://hadoop.apache.org/
- Hadoop Documentation: http://hadoop.apache.org/docs/stable/
- Hadoop APIs: http://hadoop.apache.org/docs/stable/api/index.html
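The full load-and-explore sequence as a sketch, assuming shakespeare.tar.gz extracts to a directory named shakespeare containing a poems file, as the lab data described above does:

    $ tar -xzf shakespeare.tar.gz                    # extract the archive on the VM
    $ hadoop fs -put shakespeare shakespeare         # upload to /user/<your name>/shakespeare
    $ hadoop fs -ls                                  # the new directory appears in your HDFS home
    $ hadoop fs -ls shakespeare                      # list the files inside it
    $ hadoop fs -cat shakespeare/poems | less        # page through the poems file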
Exercise - Download files and load into Hadoop

The Wikimedia Foundation, Inc. (http://wikimediafoundation.org/) is a nonprofit charitable organization dedicated to encouraging the growth, development and distribution of free, multilingual, educational content, and to providing the full content of these wiki-based projects to the public free of charge. The Wikimedia Foundation operates some of the largest collaboratively edited reference projects in the world; you are probably most familiar with Wikipedia, a free encyclopedia available in over 50 languages (see https://meta.wikimedia.org/wiki/List_of_Wikipedias for a list of languages). Information on all the projects at the core of the Wikimedia Foundation is available at http://wikimediafoundation.org/wiki/Our_projects.

Aggregated page view statistics for Wikimedia projects are available at http://dumps.wikimedia.org/other/pagecounts-raw/. This page gives access to files containing the total hourly page views for Wikimedia project pages, by page. Information on the file format is given on that page view statistics page.

1. Download 2 to 3 of the files for the 1st of January, 2016. (A hedged sketch of one way to do steps 1 and 2 follows below.)
2. Load the files into Hadoop.
3. Explore the files and data using the HDFS command line.
4. Explore the files using the web interface.
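One possible way to complete steps 1 and 2. The directory layout and file names on dumps.wikimedia.org are assumptions based on the pagecounts-YYYYMMDD-HHMMSS.gz naming scheme described on the statistics page; browse the site and verify the real names before downloading:

    $ # two hourly files for 1st January 2016 (names assumed; verify on the site)
    $ wget http://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/pagecounts-20160101-000000.gz
    $ wget http://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/pagecounts-20160101-010000.gz
    $ hadoop fs -mkdir pagecounts                    # directory under /user/<your name>
    $ hadoop fs -put pagecounts-20160101-*.gz pagecounts
    $ hadoop fs -ls pagecounts                       # start of step 3: explore from the command line
    $ hadoop fs -du -h pagecounts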