Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems

Introduction to Hadoop Scott Seighman Systems Engineer Sun Microsystems 1

Agenda Identify the Problem Hadoop Overview Target Workloads Hadoop Architecture Major Components > HDFS > Map/Reduce Demo Resources Java User Group December '09 2

Solving a Problem How do you scale up applications? > 100s of terabytes of data > Takes 11 days to read on 1 computer Need lots of cheap computers > Fixes the speed problem (15 minutes on 1,000 computers), but > Introduces reliability problems In large clusters, computers fail every day Cluster size is not fixed Need common infrastructure > Must be efficient and reliable Java User Group December '09 3

A Solution Hadoop Open Source Apache Project Hadoop is named after Cutting's son's stuffed elephant Hadoop Core includes: > Distributed File System - distributes data > Map/Reduce - distributes application Written in Java Runs on > Linux, Mac OS X, Windows, and Solaris > Commodity hardware Java User Group December '09 4

A Solution Hadoop Distributed File System > Modeled on GFS Distributed Processing Framework > Using Map/Reduce metaphor Java User Group December '09 5

A Solution Hadoop It's a framework for large-scale data processing: > Inspired by Google's architecture: MapReduce and GFS > A top-level Apache project > Hadoop is open source > Written in Java, plus a few shell scripts Java User Group December '09 6

A Solution Hadoop The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes: > Hadoop Common utilities > Avro: A data serialization system with scripting languages > Chukwa: managing large distributed systems > HBase: A scalable, distributed database for large tables > HDFS: A distributed file system > Hive: data summarization and ad hoc querying (SQL) > MapReduce: distributed processing on compute clusters > Pig: A high-level data-flow language for parallel computation > ZooKeeper: coordination service for distributed applications Java User Group December '09 7

A Solution Hadoop Fault-tolerant hardware is expensive Hadoop is designed to run on cheap commodity hardware It automatically handles data replication and node failure It does the hard work, you can focus on processing data Java User Group December '09 8

Hadoop History Java User Group December '09 9

Who is Using Hadoop? Java User Group December '09 10

Workload Targets for Hadoop When you must process lots of unstructured data When your processing can easily be made parallel When running batch jobs is acceptable When you have access to lots of cheap hardware Java User Group December '09 11

Avoid Hadoop if... The app includes intense calculations with little or no data Your processing cannot be easily made parallel Your data is not self-contained You need interactive results You own stock in supercomputer companies Java User Group December '09 12

Workload Examples/Anti-Examples Good choice for... > Indexing log files > Sorting vast amounts of data > Image analysis Hadoop would be a poor choice for... > Figuring Pi to 1,000,000 digits > Calculating Fibonacci sequences > A general RDBMS replacement Java User Group December '09 13

Hadoop Architecture Java User Group December '09 14

Hadoop Architecture Java User Group December '09 15

Hadoop Components Name Node There is only one (active) name node per cluster It manages the filesystem namespace and metadata The one place to spend $$$ for good hardware Java User Group December '09 16

Hadoop Components Job Tracker There is exactly one job tracker per cluster Receives job requests submitted by client Schedules and monitors MR jobs on task trackers Java User Group December '09 17

Hadoop Components Task Tracker There are typically many task trackers Responsible for executing MR operations Reads blocks from data nodes Java User Group December '09 18

Hadoop Components Data Nodes There are typically many data nodes Manages data blocks, serves them to clients Data is replicated, failure is no big deal Java User Group December '09 19

Hadoop Distributed File System (HDFS) Java User Group December '09 20

Hadoop Distributed File System HDFS is perhaps Hadoop's most interesting feature HDFS is a userspace filesystem Inspired by Google File System (GFS) High aggregate throughput for streaming large files Replication and locality Java User Group December '09 21

Hadoop Distributed File System Single namespace for entire cluster > Managed by a single namenode. > Hierarchical directories > Optimized for streaming reads of large files. Files are broken into large blocks. > Typically 64 or 128 MB > Replicated to several datanodes, for reliability > Clients can find location of blocks Client talks to both namenode and datanodes > Data is not sent through the namenode. Java User Group December '09 22

Hadoop Distributed File System API + implementation for working with Map Reduce More importantly, it provides infrastructure: > Job configuration and efficient scheduling > Browser-based monitoring of important cluster stats > Handling failures in both computation and data nodes > A distributed FS optimized for HUGE amounts of data Java User Group December '09 23

How HDFS Works Data copied into HDFS is split into blocks Typical block size: UNIX = 4KB vs. HDFS = 64/128MB Java User Group December '09 24

Distributed Workloads User submits Map/Reduce job to JobTracker System: > Splits job into lots of tasks > Schedules tasks on nodes close to data > Monitors tasks > Kills and restarts if they fail/hang/disappear Pluggable file systems for input/output > Local file system for testing, debugging, etc Java User Group December '09 25

How HDFS Replication Works Each data block is replicated to multiple machines Allows for node failure without data loss Java User Group December '09 26

Hadoop Distributed File System Single petabyte file system for entire cluster > Managed by a single namenode > Files are written, read, renamed, deleted, but append-only > Optimized for streaming reads of large files Files are broken into large blocks > Transparent to the client > Blocks are typically 128 MB > Replicated to several datanodes, for reliability Client talks to both namenode and datanodes > Data is not sent through the namenode > Throughput of file system scales nearly linearly Access from Java, C, or command line Java User Group December '09 27
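To make the client/namenode/datanode split concrete, here is a minimal, hedged sketch of writing and reading a file through Hadoop's Java FileSystem API. The path and file contents are made up, and it assumes a hadoop-site.xml with fs.default.name pointing at the namenode is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads fs.default.name from hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");     // hypothetical path
    FSDataOutputStream out = fs.create(file);   // block data streams to datanodes, not the namenode
    out.writeUTF("hello HDFS");
    out.close();

    FSDataInputStream in = fs.open(file);       // namenode only returns block locations
    System.out.println(in.readUTF());
    in.close();
  }
}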

Hadoop Modes of Operation Hadoop supports three modes of operation: > Standalone > Pseudo-distributed > Fully-distributed Java User Group December '09 28

Map Reduce Java User Group December '09 29

Map/Reduce Map/Reduce is a programming model for efficient distributed computing It works like a Unix pipeline: > cat input | grep | sort | uniq -c | cat > output > Input -> Map -> Shuffle & Sort -> Reduce -> Output Efficiency from > Streaming through data, reducing seeks > Pipelining A good fit for a lot of applications > Log processing > Web index building > Data mining and machine learning Java User Group December '09 30

Map/Reduce Features Java, C++, and text-based APIs > In Java you use Objects; in C++, raw bytes > Text-based (streaming) is great for scripting or legacy apps > Higher-level interfaces: Pig, Hive, Jaql Automatic re-execution on failure > In a large cluster, some nodes are always slow or flaky > Framework re-executes failed tasks Locality optimizations > With large data, bandwidth to data is a problem > Map-Reduce queries HDFS for locations of input data > Map tasks are scheduled close to the inputs when possible Java User Group December '09 31

Map/Reduce Features Abstracts a very common pattern (munge, regroup, munge) Designed for: > Building or updating offline databases (e.g. Indexes) > Computing statistics (e.g. query log analysis) Software framework > Frozen part: distributed sort, and reliability via re-execution > Hot parts: input, map, partition, compare, reduce, and output Java User Group December '09 32

Map/Reduce Features Data is a stream of keys and values Mapper > Input: key1, value1 pairs > Output: key2, value2 pairs Reducer > Called once per key, in sorted order > Input: key2, stream of value2 > Output: key3, value3 pairs Launching Program > Creates a JobConf to define a job > Submits JobConf and waits for completion Java User Group December '09 33

Map/Reduce Optimizations Overlap of maps, shuffle, and sort Mapper locality > Schedule mappers close to the data Combiner > Mappers may generate duplicate keys > Side-effect free reducer run on mapper node > Minimize data size before transfer > Reducer is still run Speculative execution > Some nodes may be slower > Run duplicate task on another node Java User Group December '09 34
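As a hedged illustration of how these optimizations are exposed to a job (old org.apache.hadoop.mapred API; WCReduce is the reducer from the word count example a few slides ahead, reused as the combiner):

import org.apache.hadoop.mapred.JobConf;

public class OptimizationSketch {
  // Sketch only: enable the combiner and speculative execution for a job.
  public static void tune(JobConf conf) {
    conf.setCombinerClass(WCReduce.class);   // side-effect-free pre-aggregation on the mapper node
    conf.setSpeculativeExecution(true);      // allow a duplicate attempt of a slow task on another node
  }
}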

Simple Application Java User Group December '09 35

Writing a Basic Application To write a distributed word count program: Mapper: Given a line of text, break it into words and output the word and the count of 1: > "hi Apache bye Apache" -> > ("hi", 1), ("Apache", 1), ("bye", 1), ("Apache", 1) Combiner/Reducer: Given a word and a set of counts, output the word and the sum > ("Apache", [1, 1]) -> ("Apache", 2) Launcher: Builds the configuration and submits job Java User Group December '09 36

Word Count Example: Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WCMap extends MapReduceBase implements Mapper {
  private static final IntWritable ONE = new IntWritable(1);

  // Emit (word, 1) for every token in the input line
  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      output.collect(new Text(itr.nextToken()), ONE);
    }
  }
}

Java User Group December '09 37

Word Count Example: Reduce

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WCReduce extends MapReduceBase implements Reducer {
  // Sum the counts for a single word and emit (word, sum)
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Java User Group December '09 38

Word Count Example: Launcher

// main() of the WordCount driver class
public static void main(String[] args) throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(WCMap.class);
  conf.setCombinerClass(WCReduce.class);
  conf.setReducerClass(WCReduce.class);

  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
}

Java User Group December '09 39

Running a Hadoop Job Basic Steps Compile your job into a JAR file Copy input data into HDFS Execute bin/hadoop jar with relevant args Monitor tasks via Web interface (optional) Examine output when job is complete Java User Group December '09 40

Running on Amazon EC2/S3 Amazon sells cluster services > EC2: $0.10/cpu hour > S3: $0.20/gigabyte month Hadoop supports: > EC2: cluster management scripts included > S3: file system implementation included Tested on 400 node cluster Combination used by several startups Java User Group December '09 41

Block Placement Default is 3 replicas, but configurable Blocks are placed (writes are pipelined): > On same node > On different rack > On the other rack Clients read from closest replica If the replication for a block drops below target, it is automatically re-replicated Java User Group December '09 42
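A hedged sketch of changing the replication factor, both as the default for new files and per file (the path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);               // default replication for files created with this conf
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/logs/2009/12/01.log"), (short) 5);   // per-file override
  }
}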

Data Validation Data is checked with CRC32 File Creation > Client computes a checksum per 512 bytes > DataNode stores the checksum File access > Client retrieves the data and checksum from DataNode > If validation fails, Client tries other replicas Periodic validation by DataNode Java User Group December '09 43

Installing Hadoop Java User Group December '09 44

Installing Hadoop The installation process, for distributed modes: > Requirements: Java 1.6, sshd, rsync > Configure SSH for password-free authentication > Unpack Hadoop distribution > Edit a few configuration files > Format the DFS on the name node > Start all the daemon processes Java User Group December '09 45

Installing Hadoop Modify hadoop-site.xml to set directories and master hostnames. Create a slaves file that lists the worker machines, one per line. Run bin/start-dfs.sh on the namenode. Run bin/start-mapred.sh on the jobtracker. Java User Group December '09 46

Demo Java User Group December '09 47

Resources For more information: > Website: http://hadoop.apache.org/core Mailing lists: > core-dev@hadoop.apache.org > core-user@hadoop.apache.org IRC: #hadoop on irc.freenode.net Java User Group December '09 48

Thanks! Scott Seighman scott.seighman@sun.com 49

Job Launch: Client Client program creates a JobConf Identify classes implementing Mapper and Reducer interfaces setMapperClass(), setReducerClass() Specify inputs, outputs setInputPath(), setOutputPath() Optionally, other options too: setNumReduceTasks(), setOutputFormat() Java User Group December '09 50

Appendix Java User Group December '09 51

Job Launch: JobClient Pass JobConf to JobClient.runJob() // blocks JobClient.submitJob() // does not block JobClient: Determines proper division of input into InputSplits Sends job data to master JobTracker server Java User Group December '09 52
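A hedged sketch of the non-blocking path: submit the job and poll the RunningJob handle instead of blocking in runJob(). It assumes the WordCount driver from the earlier slides; the progress printing and sleep interval are illustrative only:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndPoll {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);   // configured as in the launcher slide
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);       // returns immediately

    while (!job.isComplete()) {                    // poll instead of blocking
      System.out.println("map " + (int) (job.mapProgress() * 100) + "%, " +
                         "reduce " + (int) (job.reduceProgress() * 100) + "%");
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
  }
}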

Job Launch: JobTracker JobTracker: Inserts jar and JobConf (serialized to XML) in shared location Posts a JobInProgress to its run queue Java User Group December '09 53

Job Launch: TaskTracker TaskTrackers running on slave nodes periodically query JobTracker for work Retrieve job-specific jar and config Launch task in a separate instance of Java; main() is provided by Hadoop Java User Group December '09 54

Job Launch: Task TaskTracker.Child.main(): Sets up the child TaskInProgress attempt Reads XML configuration Connects back to necessary MapReduce components via RPC Uses TaskRunner to launch user process Java User Group December '09 55

Job Launch: TaskRunner TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch Mapper Task knows ahead of time which InputSplits it should be mapping Calls Mapper once for each record retrieved from the InputSplit Running the Reducer is much the same Java User Group December '09 56

Creating the Mapper Your instance of Mapper should extend MapReduceBase One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress Exists in a separate process from all other instances of Mapper, so no data sharing! Java User Group December '09 57

HDFS Limitations Almost GFS (Google FS) No file update options (record append, etc); all files are write-once Does not implement demand replication Designed for streaming Random seeks devastate performance Java User Group December '09 58

NameNode Head interface to HDFS cluster Records all global metadata Java User Group December '09 59

Secondary NameNode Not a failover NameNode! Records metadata snapshots from real NameNode Can merge update logs in flight Can upload snapshot back to primary Java User Group December '09 60

NameNode Death No new requests can be served while NameNode is down Secondary will not fail over as new primary So why have a secondary at all? Java User Group December '09 61

NameNode Death, cont'd If NameNode dies from software glitch, just reboot But if machine is hosed, metadata for cluster is irretrievable! Java User Group December '09 62

Bringing the Cluster Back If original NameNode can be restored, secondary can re-establish the most current metadata snapshot If not, create a new NameNode, use secondary to copy metadata to new primary, restart whole cluster Is there another way? Java User Group December '09 63

Keeping the Cluster Up Problem: DataNodes fix the address of the NameNode in memory, can't switch in flight Solution: Bring new NameNode up, but use DNS to make the cluster believe it's the original one Java User Group December '09 64

Further Reliability Measures Namenode can output multiple copies of metadata files to different directories Including an NFS-mounted one May degrade performance; watch for NFS locks Java User Group December '09 65

Making Hadoop Work Basic configuration involves pointing nodes at master machines mapred.job.tracker fs.default.name dfs.data.dir, dfs.name.dir hadoop.tmp.dir mapred.system.dir See Hadoop Quickstart in online documentation Java User Group December '09 66
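A minimal sketch of the same settings expressed programmatically through a Configuration object; on a real cluster they normally live in conf/hadoop-site.xml, and the hostnames and paths below are made up:

import org.apache.hadoop.conf.Configuration;

public class ClusterConfSketch {
  public static Configuration make() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://master.example.com:9000");   // namenode
    conf.set("mapred.job.tracker", "master.example.com:9001");       // jobtracker
    conf.set("dfs.name.dir", "/srv/hadoop/name");                    // namenode metadata
    conf.set("dfs.data.dir", "/srv/hadoop/data");                    // datanode block storage
    conf.set("hadoop.tmp.dir", "/srv/hadoop/tmp");
    conf.set("mapred.system.dir", "/hadoop/mapred/system");          // shared job files in HDFS
    return conf;
  }
}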

Configuring for Performance Configuring Hadoop is performed via the base JobConf in conf/hadoop-site.xml It contains 3 different categories of settings: Settings that make Hadoop work Settings for performance Optional flags/bells & whistles Java User Group December '09 67

Configuring for Performance

mapred.child.java.opts           -Xmx512m
dfs.block.size                   134217728 (128 MB)
mapred.reduce.parallel.copies    20-50
dfs.datanode.du.reserved         1073741824 (1 GB)
io.sort.factor                   100
io.file.buffer.size              32K-128K
io.sort.mb                       20-200
tasktracker.http.threads         40-50

Java User Group December '09 68

Number of Tasks Controlled by two parameters: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum Two degrees of freedom in mapper run time: Number of tasks/node, and size of InputSplits Current conventional wisdom: 2 map tasks/core, fewer for reducers See http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces Java User Group December '09 69
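As a hedged sketch, the per-job knobs look like this (the numbers are illustrative only; the mapred.tasktracker.*.tasks.maximum settings above cap slots per node, not per job):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void tune(JobConf conf) {
    conf.setNumMapTasks(40);     // a hint; the real count follows the number of InputSplits
    conf.setNumReduceTasks(7);   // honored exactly
  }
}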

Dead Tasks Student jobs would run away, admin restart needed Very often stuck in huge shuffle process Students did not know about Partitioner class, may have had non-uniform distribution Did not use many Reducer tasks Lesson: Design algorithms to use Combiners where possible Java User Group December '09 70

Working With the Scheduler Remember: Hadoop has a FIFO job scheduler No notion of fairness or round-robin Design your tasks to play well with one another Decompose long tasks into several smaller ones which can be interleaved at the job level Java User Group December '09 71