Distributed Systems
09r. Map-Reduce Programming on AWS/EMR (Part I): Setting Up AWS/EMR
Paul Krzyzanowski
TA: Long Zhao
Rutgers University, Fall 2017
November 21, 2017

Sign up for AWS Educate
Visit https://aws.amazon.com/education/awseducate/, sign up, and create your account.

Step 1/3: Choose your role.

Step 2/3: Tell us about yourself. Use your Rutgers email address (@rutgers.edu).

Step 3/3: Choose one of the following. Choose the option that lets you enter your own AWS account ID. If you already have an AWS account, just enter your account ID; otherwise, sign up for a new account by clicking the link below it (you will need a credit card and a mobile phone for verification). The next page shows where to find your AWS account ID. Leave the remaining field empty.

Find your AWS account ID
First log in to your AWS account, then click your user name and choose My Account. Your AWS account ID is shown on that page.

Redeem your credit code
Check your email; you will find a link and a credit code. Click the link from Step 1, log in to your AWS account, then follow the instructions and enter your credit code. You will receive $100 of AWS credit, which is enough to run EMR for roughly 100+ hours.

Find EMR on the AWS console page
IMPORTANT: Make sure that you are in the correct region. If your cluster seems to be lost, switch back to the region where you created it.

Create your cluster
From the EMR console, create a cluster.

Create an AWS key pair
Follow the instructions there to create a key pair. For example, here the key pair is named awskeypair. Save the .pem file in a safe place. This file is VERY IMPORTANT!

Then go back to the EMR page, choose the key pair you just created, and create the cluster. Wait several minutes until the cluster is created.
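A practical note that is not on the slides: ssh typically refuses to use a private key file whose permissions are too open, so it is common to restrict the .pem file first. A minimal sketch, assuming the key pair created above was saved as awskeypair.pem in the current directory:

# Make the key readable only by its owner; ssh ignores world-readable private keys
>> chmod 400 ./awskeypair.pem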

Configure Security Groups
Click your cluster, then choose the security group for the master node.

Open the master's security group and select the Inbound tab. Add rules so that TCP and ICMP traffic is allowed from Anywhere.

Add another rule for SSH, then Save. Then go back to the EMR page and note the public DNS of the master node.
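For reference, the SSH rule can also be added from the command line. This is only a sketch, assuming the AWS CLI is installed and configured; sg-0123456789abcdef0 is a hypothetical placeholder for the master's security group ID, and the TCP/ICMP rules above can be added the same way:

# Allow inbound SSH (TCP port 22) from anywhere
>> aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0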

Verify that you can reach the cluster
Do all of the following from a Linux machine or Mac OS; the ilab machines are highly recommended.

1. Ping the master's DNS: open a terminal and type ping <dns>, where <dns> is the DNS name of your master node.
2. Log into the master via SSH: in the terminal, type ssh -i <path-to-pem> hadoop@<dns>, where <path-to-pem> is the path to the .pem key-pair file you saved. For example, if the awskeypair.pem file is in the current folder, you would type ssh -i ./awskeypair.pem hadoop@ec2-52-37-85-231.us-west-2.compute.amazonaws.com. If you see login output similar to the screenshot on the slide, you have successfully set up the EMR cluster.

The following tips are VERY IMPORTANT
When you have finished using your cluster, please remember to do the following steps; otherwise there will be a service charge for EMR once the $100 of credit runs out. (You will receive another $100 each year.) If you need EMR again, just repeat the steps above to create another cluster.

1. Terminate the cluster.
2. Delete the S3 storage for the cluster by clicking Services, then S3.
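The same cleanup can also be done from the command line. The sketch below assumes the AWS CLI is configured; the cluster ID (j-XXXXXXXXXXXXX) and the bucket name (my-emr-bucket) are hypothetical placeholders:

# Find the ID of your running cluster
>> aws emr list-clusters --active
# Terminate the cluster
>> aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
# Remove the S3 bucket (and all of its contents) that the cluster used
>> aws s3 rb s3://my-emr-bucket --force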

Introduction to HDFS
What is HDFS? HDFS is an implementation of the Google File System (GFS) within the Apache Hadoop project: a large-scale, distributed, parallel, fault-tolerant, Java-based file system.
1. HDFS is a distributed file system that is fault tolerant, scalable, and extremely easy to expand.
2. HDFS is the primary distributed storage for Hadoop applications.
3. HDFS provides interfaces for applications to move themselves closer to the data.
4. HDFS is designed to "just work"; however, a working knowledge helps in diagnostics and improvements.

Components of HDFS
There are two (and a half) types of machines in an HDFS cluster:
NameNode: the heart of an HDFS filesystem. It maintains and manages the file system metadata, e.g., which blocks make up a file and on which DataNodes those blocks are stored.
DataNode: where HDFS stores the actual data. There are usually quite a few of these.

HDFS Data Organization
1. Each file written into HDFS is split into data blocks.
2. Each block is stored on one or more nodes.
3. Each copy of a block is called a replica.
4. Block placement policy:
   - The first replica is placed on the local node.
   - The second replica is placed in a different rack.
   - The third replica is placed in the same rack as the second replica.

Interfaces to HDFS
- Java API (DistributedFileSystem)
- C wrapper (libhdfs)
- HTTP protocol
- WebDAV protocol
- Shell commands*
*However, the command line is one of the simplest and most familiar.

HDFS Shell Commands
There are two types of shell commands:
User commands
  hdfs dfs: runs filesystem commands on HDFS
  hdfs fsck: runs an HDFS filesystem checking command
Administration commands
  hdfs dfsadmin: runs HDFS administration commands

HDFS User Commands (dfs)
List directory contents:
>> hdfs dfs -ls
>> hdfs dfs -ls /
>> hdfs dfs -ls -R /var
Display the disk space used by files:
>> hdfs dfs -du -h /
>> hdfs dfs -du /hbase/data/hbase/
>> hdfs dfs -du -h /hbase/data/hbase/
>> hdfs dfs -du -s /hbase/data/hbase/

HDFS User Commands (dfs)
Copy data to HDFS:
>> hdfs dfs -mkdir tdata
>> hdfs dfs -ls
>> hdfs dfs -copyFromLocal tutorials/data/geneva.csv tdata
>> hdfs dfs -ls -R
Copy the file back to the local filesystem:
>> cd tutorials/data/
>> hdfs dfs -copyToLocal tdata/geneva.csv geneva.csv.hdfs
>> md5sum geneva.csv geneva.csv.hdfs
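Tying the copy example back to the HDFS Data Organization slide, the following sketch checks how the copied file is stored and changes its replication; it reuses the tdata/geneva.csv path from above, and the replication factor of 2 is only an illustration:

# Show the file's replication factor (%r) and block size (%o)
>> hdfs dfs -stat "%r %o" tdata/geneva.csv
# Change the replication factor to 2 and wait (-w) until re-replication completes
>> hdfs dfs -setrep -w 2 tdata/geneva.csv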

HDFS User Commands (dfs)
List the ACL for a file:
>> hdfs dfs -getfacl tdata/geneva.csv
List file statistics (%r = replication factor):
>> hdfs dfs -stat "%r" tdata/geneva.csv
Write to HDFS, reading from stdin:
>> echo "blah blah blah" | hdfs dfs -put - tdataset/tfile.txt
>> hdfs dfs -ls -R
>> hdfs dfs -cat tdataset/tfile.txt

HDFS User Commands (fsck)
Remove a file:
>> hdfs dfs -rm tdataset/tfile.txt
>> hdfs dfs -ls -R
List the blocks of a file and their locations:
>> hdfs fsck /user/cloudera/tdata/geneva.csv -files -blocks -locations
Print missing blocks and the files they belong to:
>> hdfs fsck / -list-corruptfileblocks

HDFS Administration Commands
Comprehensive status report of the HDFS cluster:
>> hdfs dfsadmin -report
Get a list of NameNodes in the Hadoop cluster:
>> hdfs getconf -namenodes
Print a tree of racks and their nodes:
>> hdfs dfsadmin -printTopology
Get the information for a given DataNode (like ping):
>> hdfs dfsadmin -getDatanodeInfo localhost:50020
Dump the NameNode fsimage to an XML file:
>> cd /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current
>> hdfs oiv -i fsimage_0000000000000003388 -o /tmp/fsimage.xml -p XML
The general command-line syntax is:
>> hdfs command [genericOptions] [commandOptions]

Other Interfaces to HDFS
HTTP interface: http://<dns>:50070
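As an illustration of the HTTP interface above, the NameNode's web port also serves the WebHDFS REST API when WebHDFS is enabled (the default in Hadoop 2.x). A sketch, assuming the earlier commands ran as the hadoop user so that tdata resolves to /user/hadoop/tdata:

# List a directory over HTTP
>> curl -s "http://<dns>:50070/webhdfs/v1/user/hadoop/tdata?op=LISTSTATUS"
# Read a file; -L follows the redirect to the DataNode that serves the data
>> curl -s -L "http://<dns>:50070/webhdfs/v1/user/hadoop/tdata/geneva.csv?op=OPEN"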

Other Useful Links
ilab: https://www.cs.rutgers.edu/resources/instructionallab
Amazon EMR Official Documentation: https://aws.amazon.com/documentation/emr/
HDFS Architecture Guide: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS File System Shell Guide: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html

The end