Getting Started with Hadoop


Getting Started with Hadoop May 28, 2018 Michael Völske, Shahbaz Syed Web Technology & Information Systems Bauhaus-Universität Weimar 1 webis 2018

What is Hadoop
- Started in 2004 by Yahoo
- Open-source implementation of Google MapReduce, Google File System and Google BigTable
- Apache Software Foundation top-level project
- Written in Java
2 webis 2018

What is Hadoop
Scale out, not up!
- 4000+ nodes, 100PB+ data
- cheap commodity hardware instead of supercomputers
- fault-tolerance, redundancy
Bring the program to the data
- storage and data processing on the same node
- local processing (network is the bottleneck)
Working sequentially instead of random-access
- optimized for large datasets
Hide system-level details
- user doesn't need to know what code runs on which machine
3 webis 2018

What is Hadoop [http://hortonworks.com/hadoop/yarn/] 4 webis 2018

HDFS Distributed File System
HDFS Overview
- Designed for storing large files
- Files are split into blocks
- Integrity: blocks are checksummed
- Redundancy: each block is stored on multiple machines
- Optimized for sequentially reading whole blocks
Daemon processes:
- NameNode: central registry of block locations
- DataNode: block storage on each node
5 webis 2018
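The block-splitting and checksumming idea can be illustrated with a small, conceptual Python sketch. This is not how HDFS is implemented (HDFS checksums CRC chunks within blocks and distributes replicas across DataNodes); the MD5 checksum and in-memory list here are purely illustrative assumptions.

```python
import hashlib

def split_into_blocks(data, block_size=128 * 1024 * 1024):
    """Split a byte string into fixed-size blocks and checksum each block."""
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # HDFS verifies block integrity via checksums; MD5 is used here
        # only for illustration
        blocks.append((block, hashlib.md5(block).hexdigest()))
    return blocks

# A tiny block size so the splitting is visible:
blocks = split_into_blocks(b"0123456789" * 10, block_size=32)
print(len(blocks))  # 100 bytes in 32-byte blocks -> prints 4
```

On a real cluster, each of these blocks would additionally be replicated to several DataNodes, and the NameNode would only keep the (block, location) mapping, not the data itself.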

HDFS Distributed File System Reading Files 6 webis 2018

A Virtual Hadoop Cluster Typical Cluster Network Layout HDFS Namenode YARN ResourceManager SSH Gateway user HDFS Datanode HDFS Datanode HDFS Datanode YARN NodeManager YARN NodeManager YARN NodeManager public network cluster network 7 webis 2018

A Virtual Hadoop Cluster Simplification: simulate only relevant processes HDFS Namenode SSH Gateway YARN ResourceManager user HDFS Datanode HDFS Datanode HDFS Datanode YARN NodeManager YARN NodeManager YARN NodeManager public network cluster network 8 webis 2018

Docker tutorial
- Introduction to Docker
- Why Docker?
- Container vs Virtual Machine
- Images & Containers
- Docker workflow
- Dockerfile
9 webis 2018

Why Docker?
Docker provides lightweight containerization of applications.
- Containerization: use of Linux containers to deploy applications
- Container: self-contained, lightweight environment
- Run applications with exactly the same behavior on all platforms
- (Un)install packages/applications without cluttering the host system
- Highly suitable for developing scalable, microservices-based applications (e.g. Netflix)
10 webis 2018

Container vs Virtual Machine
A container runs natively on Linux and shares the kernel of the host machine with other containers (minimal memory consumption).
A VM runs a full-blown guest OS with virtual access to host resources (more memory allocated than needed).
[Container vs VM] 11 webis 2018

Images & Containers
Analogy: {Image : Containers} = {Class : Objects}
A Docker Image
- is an immutable snapshot of a filesystem
- contains the basic kernel & runtime needed for building larger systems (e.g. the base ubuntu image)
- is built using a Dockerfile (a recipe to build an environment from scratch)
A Docker Container
- is a temporary filesystem on top of a base Image
- saves all installations as a stack of layers
- discards all layers when stopped and removed
- has its own network stack (private address) to communicate with the host
- can be started, stopped, restarted, killed, paused, and unpaused
12 webis 2018

Docker workflow [local workflow]
Run in terminal: docker run --rm hello-world
13 webis 2018

Docker workflow
Run in terminal: docker run --rm -it ubuntu:16.04 bash
Try creating some files, then exit the container and start it again.
No persistence by default. How to address this?
Run in terminal: docker run --rm -it -v "$(pwd)/workspace":/my-folder ubuntu:16.04 bash
Mounting a persistent volume works around this: the host directory appears inside the container and survives container removal.
Persistence for system changes: build a new image!
16 webis 2018

Docker workflow Run in terminal: docker build -t test-image. docker run --rm -it test-image bash 17 webis 2018

Dockerfile
A script containing a collection of commands (Docker and Linux) that are executed sequentially in the Docker environment to build a new Docker image.
- FROM: name/path of the base image for building a new image; must be the first command in the Dockerfile
- RUN: execute a command during the build process of the image
- ADD: copy a file from the host machine (or a URL) into the new Docker image
- ENV: define an environment variable (e.g. JAVA_HOME)
- USER: specify the user which will be used to run any subsequent RUN instructions
...
18 webis 2018
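The instructions above can be combined into a minimal Dockerfile sketch. The package, file path, environment variable value, and user name below are illustrative assumptions, not the contents of the tutorial's actual Dockerfile.gateway:

```dockerfile
# Base image: must come first
FROM ubuntu:16.04

# Executed during the image build
RUN apt-get update && apt-get install -y openssh-server

# Copy a local config file from the build context into the image
# (sshd_config is a hypothetical file for this sketch)
ADD sshd_config /etc/ssh/sshd_config

# Environment variable visible to all processes in the container
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# Subsequent RUN instructions (and the container process) run as this user
USER tutorial
```

Building such a file with docker build produces a new image, which containers can then be started from.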

Building Our Virtual Cluster Step 1: SSH Gateway SSH Gateway user public network 19 webis 2018

Building Our Virtual Cluster Step 1: SSH Gateway
- Review Dockerfile.gateway
- Run in terminal: docker-compose up --build
- Connect to the SSH Gateway: ssh -p 10022 tutorial@localhost
20 webis 2018


Building Our Virtual Cluster Step 2: HDFS Distributed File System HDFS Namenode SSH Gateway user HDFS Datanode HDFS Datanode HDFS Datanode public network cluster network 22 webis 2018

Building Our Virtual Cluster Step 2: HDFS Distributed File System
- Add namenode to virtual cluster
- Format the namenode:
  1. docker-compose run --rm namenode bash
  2. hdfs namenode -format
- Add datanodes to virtual cluster
- Restart the virtual cluster:
  1. docker-compose down
  2. docker-compose up
- Set up proxy access to HDFS web UI
- Review configuration files
- Basic HDFS commands + web UI: http://namenode:50070
23 webis 2018

Building Our Virtual Cluster SSH configurations
For the ssh client on Linux & macOS: ssh -p 10022 -D 12345 tutorial@localhost
For Windows, using PuTTY: 24 webis 2018

Building Our Virtual Cluster SSH configurations FoxyProxy configuration in the browser 25 webis 2018

Building Our Virtual Cluster SSH configurations
If you're running Docker Toolbox (e.g. on Windows versions earlier than 10), you need an additional step:
- Open Oracle VirtualBox
- Select Settings for the Docker virtual machine
- Select Network, Advanced, Port Forwarding
- Create a new port forwarding rule (top right button)
- For the new rule, change both Host port and Guest port to 10022 (leave the other fields as they are)
26 webis 2018

Building Our Virtual Cluster Basic HDFS Commands
When logged into the gateway node, you can now run the following commands:
- List files: hadoop fs -ls NAME
- Remove directory: hadoop fs -rmdir NAME
- Remove file: hadoop fs -rm NAME
- Copy from local FS to HDFS: hadoop fs -put LOCAL REMOTE
Create an HDFS home directory for your user for later: hadoop fs -mkdir -p /user/tutorial
27 webis 2018


Building Our Virtual Cluster Step 3: YARN Distributed Processing Framework HDFS Namenode SSH Gateway YARN ResourceManager user HDFS Datanode HDFS Datanode HDFS Datanode YARN NodeManager YARN NodeManager YARN NodeManager public network cluster network 29 webis 2018

Building Our Virtual Cluster Step 3: YARN Distributed Processing Framework
- Add YARN processes to virtual cluster
- Review configuration files
- Explore ResourceManager Web UI: http://resourcemanager:8088
- Continue with MapReduce...
30 webis 2018

MapReduce
Problem
- Collecting data is easy and cheap
- Evaluating data is difficult
Solution
- Divide and Conquer
- Parallel Processing
31 webis 2018

MapReduce MapReduce Steps 1. Map Each worker applies the map() function to the local data and writes the output to temporary storage. Each output record gets a key. 2. Shuffle Worker nodes redistribute data based on the output keys: all records with the same key go to the same worker node. 3. Reduce Workers apply the reduce() function to each group, per key, in parallel. The user specifies the map() and reduce() functions 32 webis 2018

MapReduce Example: Counting Words
Input: Mary had a little lamb its fleece was white as snow and everywhere that Mary went the lamb was sure to go
Map(): Mary 1, had 1, a 1, little 1, lamb 1, its 1, fleece 1, was 1, white 1, as 1, snow 1, and 1, everywhere 1, that 1, Mary 1, went 1, the 1, lamb 1, was 1, sure 1, to 1, go 1
Shuffle
Reduce(): a 1, as 1, lamb 2, little 1, ... Mary 2, was 2, went 1, ...
35 webis 2018

MapReduce Data Representation with Key-Value Pairs
Map step: Map(k1, v1) → list(k2, v2)
Sorting and shuffling: all pairs with the same key are grouped together; one group per key.
Reduce step: Reduce(k2, list(v2)) → list(v3)
36 webis 2018
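The key-value signatures above can be traced through a small single-process Python simulation of the word-count example. This is a conceptual sketch of the data flow, not the Hadoop Java API; the function and variable names are our own:

```python
from collections import defaultdict

def map_func(_, line):
    # Map(k1, v1) -> list(k2, v2): emit one (word, 1) pair per word
    return [(word, 1) for word in line.split()]

def reduce_func(word, counts):
    # Reduce(k2, list(v2)) -> list(v3): sum the counts for one key
    return [(word, sum(counts))]

def mapreduce(records, mapper, reducer):
    # Shuffle: group all intermediate (k2, v2) pairs by key
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Reduce: process each key group (in parallel on a real cluster)
    result = []
    for k2 in sorted(groups):
        result.extend(reducer(k2, groups[k2]))
    return result

lines = ["Mary had a little lamb its fleece was white as snow",
         "and everywhere that Mary went the lamb was sure to go"]
counts = dict(mapreduce(enumerate(lines), map_func, reduce_func))
print(counts["Mary"], counts["lamb"])  # prints: 2 2
```

On a real cluster, each mapper would see only its local split of the input, and the shuffle would move data across the network between worker nodes.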

MapReduce MapReduce on YARN 37 webis 2018

MapReduce MapReduce on YARN
Recap: Components of the YARN Framework
- ResourceManager: single instance per cluster, controls container allocation
- NodeManager: runs on each cluster node, provides containers to applications
Components of a YARN MapReduce Job
- ApplicationMaster: controls execution on the cluster (one for each YARN application)
- Mapper: processes input data
- Reducer: processes (sorted) Mapper output
Each of the above runs in a YARN container.
38 webis 2018

MapReduce MapReduce on YARN Basic process: 1. Client application requests a container for the ApplicationMaster 2. ApplicationMaster runs on the cluster, requests further containers for Mappers and Reducers 3. Mappers execute user-provided map() function on their part of the input data 4. The shuffle() phase is started to distribute map output to reducers 5. Reducers execute user-provided reduce() function on their group of map output 6. Final result is stored in HDFS See also: [Anatomy of a MapReduce Job] 39 webis 2018

MapReduce Examples Quasi-Monte-Carlo Estimation of π
Idea: [figure: unit square with area 1 containing a quarter circle with area π/4]
The area of a circle segment inside the unit square is π/4.
Each mapper generates some random points inside the square, and counts how many fall inside/outside the circle segment:
Mapper 1: (3, 1)  Mapper 2: (2, 2)  Mapper 3: (4, 0)
The reducer sums up points inside and points total to compute our estimate of π:
Reducer: π ≈ 4 · (3+2+4) / (4+4+4)
42 webis 2018
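The mapper and reducer roles above can be sketched in plain Python. Note one simplifying assumption: Hadoop's built-in example uses a quasi-random Halton sequence, while this sketch draws ordinary pseudorandom points:

```python
import random

def mapper(num_points, seed):
    # Each mapper draws random points in the unit square and counts
    # how many fall inside the quarter circle (x^2 + y^2 <= 1)
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return (inside, num_points - inside)

def reducer(results):
    # Sum points inside and points total; estimate pi = 4 * inside / total
    inside = sum(r[0] for r in results)
    total = sum(r[0] + r[1] for r in results)
    return 4.0 * inside / total

# Four "mappers" with different seeds, combined by one "reducer"
estimate = reducer([mapper(100_000, seed) for seed in range(4)])
```

With the tiny counts from the slide, reducer([(3, 1), (2, 2), (4, 0)]) gives 4 · 9/12 = 3.0; with 400,000 points the estimate lands much closer to π.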

MapReduce Examples Monte-Carlo Estimation of π
This is already included as an example program in Hadoop! Connect to the gateway node and run:
cd /opt/hadoop-*/share/hadoop/mapreduce
then:
hadoop jar hadoop-mapreduce-examples-*.jar pi 4 1000000
The output should look like this:
Number of Maps = 4
Samples per Map = 1000000
Wrote input for Map #0
...
Job Finished in 13.74 seconds
Estimated value of Pi is 3.14160400000000000000
43 webis 2018

MapReduce Examples Parallelizing Shell Scripts with Hadoop Streaming
Let's say we want to know which of the words "you" and "thou" occurs more frequently in Shakespeare's works. We can answer our question with a simple Linux shell script. First, some basics.
Download the file shakespeare.txt to the workspace directory of your gateway node. Then, connect to the gateway node.
44 webis 2018

MapReduce Examples Some Shell Scripting Basics
- cat FILE: outputs the contents of FILE
- A | B: the output of command A becomes the input of command B
- grep PATTERN: outputs all input lines containing PATTERN
  Example: cat shakespeare.txt | grep you
- grep -o PATTERN: outputs only the matching part of each input line; \| inside the PATTERN marks an alternative ("or")
  Example: cat shakespeare.txt | grep -o 'you\|thou'
- sort: sorts the input alphabetically
- uniq -c: counts consecutive identical lines in the input
So, finally: cat shakespeare.txt | grep -o 'you\|thou' | sort | uniq -c
[full explanation] 50 webis 2018
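For comparison, the same counting logic in a few lines of Python makes explicit what the mapper (grep -o) and the reducer (sort | uniq -c) each contribute. This is a sketch for illustration, not part of the tutorial's scripts; the function name and sample text are our own:

```python
import re
from collections import Counter

def count_words(text, pattern=r"you|thou"):
    # grep -o 'you\|thou': emit each match separately (the map step)
    matches = re.findall(pattern, text)
    # sort | uniq -c: group identical matches and count them (the reduce step)
    return Counter(matches)

counts = count_words("Wherefore art thou Romeo? Deny thy father, wherefore art thou")
print(counts["thou"])  # prints: 2
```

Splitting the work this way is exactly what makes the pipeline parallelizable: many machines can run the match-emitting step on different parts of the input, as long as matches for the same word end up at the same counter.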

MapReduce Examples Parallelizing Shell Scripts with Hadoop Streaming
We have our answer, but we only used one machine. Hadoop Streaming lets us easily parallelize such shell scripts over the entire cluster!
1. Put the input file in HDFS: hadoop fs -put shakespeare.txt shakespeare-hdfs.txt
2. Go to the directory with the Hadoop Streaming jar file: cd /opt/hadoop-*/share/hadoop/tools/lib
3. Run our shell script as a Streaming job:
hadoop jar hadoop-streaming-*.jar \
    -input shakespeare-hdfs.txt \
    -output my-word-counts \
    -mapper "grep -o 'you\|thou'" \
    -reducer "uniq -c"
Notes: \ at the end of a line means "continue on the next line"; Hadoop already does the sorting for us.
52 webis 2018

MapReduce Examples Parallelizing Shell Scripts with Hadoop Streaming
Let's look at the results:
hadoop fs -ls my-word-counts
Found 2 items
-rw-r--r--   3 tutorial tutorial   0 2018-04-21 13:41 my-word-counts/_SUCCESS
-rw-r--r--   3 tutorial tutorial  31 2018-04-21 13:41 my-word-counts/part-00000
hadoop fs -cat my-word-counts/part-00000
4159 thou
8704 you
53 webis 2018

Working with the Real Cluster Betaweb Cluster Network Layout betaweb020.medien.uni-weimar.de 141.54.132.20 HDFS Namenode webis17.medien.uni-weimar.de 141.54.159.7 YARN ResourceManager user SSH Gateway betaweb001 141.54.132.1 HDFS Datanode betaweb130 141.54.132.130 HDFS Datanode YARN NodeManager YARN NodeManager public network cluster network 54 webis 2018

Working with the Real Cluster Things to Know
- Gateway host: webis17.medien.uni-weimar.de
- Gateway login: your university username
- Gateway password: check your email
- ResourceManager UI: http://betaweb020.medien.uni-weimar.de:8088
- HDFS UI: http://betaweb020.medien.uni-weimar.de:50070
- Python Notebook UI: https://webis17.medien.uni-weimar.de:8000
55 webis 2018