Programming for Big Data
Hadoop Lab 2: Exploring the Hadoop Environment

Video: a short video guide for some of what is covered in this lab is linked on my module webpage.

Hadoop Processes

Open a Terminal window.

Enter hadoop version to check that Hadoop runs.
Enter start-dfs.sh to start the HDFS daemons.
Enter start-yarn.sh to start the YARN daemons.
Enter mr-jobhistory-daemon.sh start historyserver to start the JobHistoryServer daemon.

Note: use stop-dfs.sh, stop-yarn.sh and mr-jobhistory-daemon.sh stop historyserver to stop them at the end of the session.

Type jps to see which daemons are running. You should have a NameNode, DataNode, SecondaryNameNode, NodeManager, ResourceManager and JobHistoryServer.

Hadoop Environment

Explore the Linux environment variables to discover where Hadoop is installed. What other Hadoop-related environment variables are set up?
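As a starting point, the commands below show one way to inspect the environment. The exact variable names depend on how your VM was configured, but HADOOP_HOME is the usual one to look for.

    # Show where Hadoop is installed (if HADOOP_HOME is set on your VM)
    echo $HADOOP_HOME

    # List any Hadoop-related environment variables
    env | grep -i hadoop

    # Locate the hadoop executable itself
    which hadoop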

Hadoop Web Interfaces

Browse the web interface for the filesystem/NameNode, available at http://localhost:50070/

Enter hadoop fs to see all the commands available in the filesystem.
Enter hadoop fs -help to see full details of the commands available in the filesystem.
Enter hadoop fs -ls / to see the contents of the root directory in HDFS.

What files are available in HDFS? Explore them and find out their details, as sketched below. Then browse the web interface for the NameNode (http://localhost:50070/) and view the same contents.
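For example, you might drill down from the root like this (the paths are illustrative; your VM's HDFS layout may differ):

    # List the root directory, then look inside anything interesting
    hadoop fs -ls /
    hadoop fs -ls /user

    # Show how much space each item under /user consumes, in human-readable units
    hadoop fs -du -h /user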

HDFS Shell Commands

Most commands behave like POSIX/Linux commands (ls, cat, du, etc.).

List the supported commands:
    hdfs dfs -help
Display detailed help for a command:
    hdfs dfs -help <command_name>

Shell commands follow the format:
    hdfs dfs -<command> -<option> <path>
For example:
    hdfs dfs -rm -r /removeme

cat: stream a source to stdout. To stream an entire file:
    hdfs dfs -cat /dir/file.txt
It is almost always a good idea to pipe to head, tail, more or less. To get the first 25 lines of file.txt:
    hdfs dfs -cat /dir/file.txt | head -n 25

cp: copy files from source to destination
    hdfs dfs -cp /dir/file1 /otherdir/file2

ls: for a file, displays its stats; for a directory, displays its immediate children
    hdfs dfs -ls /dir/

mkdir: create a directory
    hdfs dfs -mkdir /brandnewdir

mv: move from source to destination
    hdfs dfs -mv /dir/file1 /dir2/file2

put: copy a file from the local filesystem to HDFS
    hdfs dfs -put localfile /dir/file1
You can also use -copyFromLocal.

get: copy a file from HDFS to the local filesystem
    hdfs dfs -get /dir/file localfile
You can also use -copyToLocal.

rm: delete files
    hdfs dfs -rm /dir/filetodelete

rm -r: delete directories recursively
    hdfs dfs -rm -r /dirwithstuff

du: display the length of each file/directory (in bytes)
    hdfs dfs -du /somedir/
Add the -h option to display sizes in a human-readable format instead of bytes:
    hdfs dfs -du -h /somedir

More commands: tail, chmod, count, touchz, test, etc. To learn more about each command, for example:
    hdfs dfs -help rm
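To tie these commands together, a typical round trip might look like the following (the file and directory names are made up for illustration):

    # Create a scratch directory in HDFS
    hdfs dfs -mkdir /tmp/scratch

    # Copy a local file up, list it, and read the first few lines
    hdfs dfs -put notes.txt /tmp/scratch/notes.txt
    hdfs dfs -ls /tmp/scratch
    hdfs dfs -cat /tmp/scratch/notes.txt | head -n 5

    # Copy it back down under a new name, then clean up
    hdfs dfs -get /tmp/scratch/notes.txt notes-copy.txt
    hdfs dfs -rm -r /tmp/scratch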

Loading Files into Hadoop

Do all of these tasks on the VM. First, check your VM to see what data already exists on it.

Download the sample data into the VM (available in Webcourses/webpage).

Extract the shakespeare.tar.gz file.

Insert the shakespeare directory into HDFS using the following command:
    hadoop fs -put shakespeare shakespeare

Viewing/exploring the data in Hadoop

Enter hadoop fs -ls to see the updated contents of HDFS. You can use these same steps to load your own data.

Once the data exists in HDFS, follow these steps:

Enter hadoop fs -ls shakespeare to see the contents of the shakespeare directory in HDFS. Note that the default location in HDFS is /user/<your name>.

Access the contents of the poems file using:
    hadoop fs -cat shakespeare/poems | less

Browse the web interface for the NameNode (http://localhost:50070/) and view the same contents. How many blocks are used? What else can you see/find out about the data? (One command-line way to check the block count is sketched after the documentation links below.)

Hadoop Documentation

Hadoop website: http://hadoop.apache.org/
Hadoop documentation: http://hadoop.apache.org/docs/stable/
Hadoop APIs: http://hadoop.apache.org/docs/stable/api/index.html
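One way to answer the block question from the command line is hdfs fsck, assuming the shakespeare directory sits in your HDFS home directory, /user/<your name>:

    # Report files, block counts and block locations for the shakespeare data
    hdfs fsck /user/<your name>/shakespeare -files -blocks -locations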

Exercise: Download files and load into Hadoop

The Wikimedia Foundation, Inc. (http://wikimediafoundation.org/) is a nonprofit charitable organization dedicated to encouraging the growth, development and distribution of free, multilingual, educational content, and to providing the full content of these wiki-based projects to the public free of charge. The Wikimedia Foundation operates some of the largest collaboratively edited reference projects in the world; you are probably most familiar with Wikipedia, a free encyclopedia available in over 50 languages (see https://meta.wikimedia.org/wiki/List_of_Wikipedias for a list of languages). Information on all the projects at the core of the Wikimedia Foundation is available at http://wikimediafoundation.org/wiki/Our_projects.

Aggregated page view statistics for Wikimedia projects are available at http://dumps.wikimedia.org/other/pagecounts-raw/. This page gives access to files containing the total hourly page views for Wikimedia project pages, broken down by page. Information on the file format is also given on that page.

1. Download 2 to 3 of the files for the 1st of January, 2016
2. Load the files into Hadoop
3. Explore the files and data using the HDFS command line
4. Explore the files using the web interface
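A minimal sketch of steps 1 and 2, assuming the hourly file naming convention used on the pagecounts-raw page (check the 2016/2016-01 listing for the exact file names before downloading):

    # Download two hourly pagecount files for 1st January 2016
    wget https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/pagecounts-20160101-000000.gz
    wget https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-01/pagecounts-20160101-010000.gz

    # Create a directory for them in HDFS and load the files
    hadoop fs -mkdir pagecounts
    hadoop fs -put pagecounts-20160101-*.gz pagecounts

    # Confirm they arrived
    hadoop fs -ls pagecounts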