Introduction to Hadoop. Scott Seighman Systems Engineer Sun Microsystems

Introduction to Hadoop Scott Seighman Systems Engineer Sun Microsystems 1

Agenda Identify the Problem Hadoop Overview Target Workloads Hadoop Architecture Major Components > HDFS > Map/Reduce Demo Resources Java User Group December '09 2

Solving a Problem How do you scale up applications? > 100s of terabytes of data > Takes 11 days to read on 1 computer Need lots of cheap computers > Fixes the speed problem (15 minutes on 1,000 computers), but > Introduces reliability problems In large clusters, computers fail every day Cluster size is not fixed Need common infrastructure > Must be efficient and reliable Java User Group December '09 3

A Solution Hadoop Open Source Apache Project Hadoop is named after Cutting's son's stuffed elephant Hadoop Core includes: > Distributed File System - distributes data > Map/Reduce - distributes application Written in Java Runs on > Linux, Mac OS X, Windows, and Solaris > Commodity hardware Java User Group December '09 4

A Solution Hadoop Distributed File System > Modeled on GFS Distributed Processing Framework > Using Map/Reduce metaphor Java User Group December '09 5

A Solution Hadoop It's a framework for large-scale data processing: > Inspired by Google's architecture: MapReduce and GFS > A top-level Apache project > Hadoop is open source > Written in Java, plus a few shell scripts Java User Group December '09 6

A Solution Hadoop The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop includes: > Hadoop Common utilities > Avro: A data serialization system with scripting languages > Chukwa: managing large distributed systems > HBase: A scalable, distributed database for large tables > HDFS: A distributed file system > Hive: data summarization and ad hoc querying (SQL) > MapReduce: distributed processing on compute clusters > Pig: A high-level data-flow language for parallel computation > ZooKeeper: coordination service for distributed applications Java User Group December '09 7

A Solution Hadoop Fault-tolerant hardware is expensive Hadoop is designed to run on cheap commodity hardware It automatically handles data replication and node failure It does the hard work, you can focus on processing data Java User Group December '09 8

Hadoop History Java User Group December '09 9

Who is Using Hadoop? Java User Group December '09 10

Workload Targets for Hadoop When you must process lots of unstructured data When your processing can easily be made parallel When running batch jobs is acceptable When you have access to lots of cheap hardware Java User Group December '09 11

Avoid Hadoop if... The app includes intense calculations with little or no data Your processing cannot be easily made parallel Your data is not self-contained You need interactive results You own stock in supercomputer companies Java User Group December '09 12

Workload Examples/Anti-Examples Good choice for... > Indexing log files > Sorting vast amounts of data > Image analysis Hadoop would be a poor choice for... > Figuring Pi to 1,000,000 digits > Calculating Fibonacci sequences > A general RDBMS replacement Java User Group December '09 13

Hadoop Architecture Java User Group December '09 14

Hadoop Architecture Java User Group December '09 15

Hadoop Components Name Node There is only one (active) name node per cluster It manages the filesystem namespace and metadata The one place to spend $$$ for good hardware Java User Group December '09 16

Hadoop Components Job Tracker There is exactly one job tracker per cluster Receives job requests submitted by client Schedules and monitors MR jobs on task trackers Java User Group December '09 17

Hadoop Components Task Tracker There are typically many task trackers Responsible for executing MR operations Reads blocks from data nodes Java User Group December '09 18

Hadoop Components Data Nodes There are typically many data nodes Manages data blocks, serves them to clients Data is replicated, failure is no big deal Java User Group December '09 19

Hadoop Distributed File System (HDFS) Java User Group December '09 20

Hadoop Distributed File System HDFS is perhaps Hadoop's most interesting feature HDFS is a userspace filesystem Inspired by Google File System (GFS) High aggregate throughput for streaming large files Replication and locality Java User Group December '09 21

Hadoop Distributed File System Single namespace for entire cluster > Managed by a single namenode. > Hierarchical directories > Optimized for streaming reads of large files. Files are broken into large blocks. > Typically 64 or 128 MB > Replicated to several datanodes, for reliability > Clients can find location of blocks Client talks to both namenode and datanodes > Data is not sent through the namenode. Java User Group December '09 22

Hadoop Distributed File System API + implementation for working with Map Reduce More importantly, it provides infrastructure: > Job configuration and efficient scheduling > Browser-based monitoring of important cluster stats > Handling failures in both computation and data nodes > A distributed FS optimized for HUGE amounts of data Java User Group December '09 23

How HDFS Works Data copied into HDFS is split into blocks Typical block size: UNIX = 4KB vs. HDFS = 64/128MB Java User Group December '09 24

Distributed Workloads User submits Map/Reduce job to JobTracker System: > Splits job into lots of tasks > Schedules tasks on nodes close to data > Monitors tasks > Kills and restarts if they fail/hang/disappear Pluggable file systems for input/output > Local file system for testing, debugging, etc Java User Group December '09 25

How HDFS Replication Works Each data block is replicated to multiple machines Allows for node failure without data loss Java User Group December '09 26

Hadoop Distributed File System Single petabyte file system for entire cluster > Managed by a single namenode > Files are written, read, renamed, deleted, but append-only > Optimized for streaming reads of large files Files are broken into large blocks > Transparent to the client > Blocks are typically 128 MB > Replicated to several datanodes, for reliability Client talks to both namenode and datanodes > Data is not sent through the namenode > Throughput of file system scales nearly linearly Access from Java, C, or command line Java User Group December '09 27
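To make the client/namenode/datanode split concrete, here is a minimal, hedged sketch of writing and reading a file through Hadoop's Java FileSystem API. The path and file contents are made up, and it assumes a hadoop-site.xml with fs.default.name pointing at the namenode is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads fs.default.name from hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");     // hypothetical path
    FSDataOutputStream out = fs.create(file);   // block data streams to datanodes, not the namenode
    out.writeUTF("hello HDFS");
    out.close();

    FSDataInputStream in = fs.open(file);       // namenode only returns block locations
    System.out.println(in.readUTF());
    in.close();
  }
}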

Hadoop Modes of Operation Hadoop supports three modes of operation: > Standalone > Pseudo-distributed > Fully-distributed Java User Group December '09 28

Map Reduce Java User Group December '09 29

Map/Reduce Map/Reduce is a programming model for efficient distributed computing It works like a Unix pipeline: > cat input | grep | sort | uniq -c | cat > output > Input -> Map -> Shuffle & Sort -> Reduce -> Output Efficiency from > Streaming through data, reducing seeks > Pipelining A good fit for a lot of applications > Log processing > Web index building > Data mining and machine learning Java User Group December '09 30

Map/Reduce Features Java, C++, and text-based APIs > In Java you use Objects; in C++, raw bytes > Text-based (streaming) is great for scripting or legacy apps > Higher-level interfaces: Pig, Hive, Jaql Automatic re-execution on failure > In a large cluster, some nodes are always slow or flaky > Framework re-executes failed tasks Locality optimizations > With large data, bandwidth to data is a problem > Map-Reduce queries HDFS for locations of input data > Map tasks are scheduled close to the inputs when possible Java User Group December '09 31

Map/Reduce Features Abstracts a very common pattern (munge, regroup, munge) Designed for: > Building or updating offline databases (e.g. Indexes) > Computing statistics (e.g. query log analysis) Software framework > Frozen part: distributed sort, and reliability via re-execution > Hot parts: input, map, partition, compare, reduce, and output Java User Group December '09 32

Map/Reduce Features Data is a stream of keys and values Mapper > Input: key1, value1 pairs > Output: key2, value2 pairs Reducer > Called once per key, in sorted order > Input: key2, stream of value2 > Output: key3, value3 pairs Launching Program > Creates a JobConf to define a job > Submits JobConf and waits for completion Java User Group December '09 33

Map/Reduce Optimizations Overlap of maps, shuffle, and sort Mapper locality > Schedule mappers close to the data Combiner > Mappers may generate duplicate keys > Side-effect free reducer run on mapper node > Minimize data size before transfer > Reducer is still run Speculative execution > Some nodes may be slower > Run duplicate task on another node Java User Group December '09 34
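As a hedged illustration of how these optimizations are exposed to a job (old org.apache.hadoop.mapred API; WCReduce is the reducer from the word count example a few slides ahead, reused as the combiner):

import org.apache.hadoop.mapred.JobConf;

public class OptimizationSketch {
  // Sketch only: enable the combiner and speculative execution for a job.
  public static void tune(JobConf conf) {
    conf.setCombinerClass(WCReduce.class);   // side-effect-free pre-aggregation on the mapper node
    conf.setSpeculativeExecution(true);      // allow a duplicate attempt of a slow task on another node
  }
}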

Simple Application Java User Group December '09 35

Writing a Basic Application To write a distributed word count program: Mapper: Given a line of text, break it into words and output the word and the count of 1: > "hi Apache bye Apache" -> > ("hi", 1), ("Apache", 1), ("bye", 1), ("Apache", 1) Combiner/Reducer: Given a word and a set of counts, output the word and the sum > ("Apache", [1, 1]) -> ("Apache", 2) Launcher: Builds the configuration and submits job Java User Group December '09 36

Word Count Example: Mapper

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WCMap extends MapReduceBase implements Mapper {
  private static final IntWritable ONE = new IntWritable(1);

  // Emit (word, 1) for every token in the input line
  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      output.collect(new Text(itr.nextToken()), ONE);
    }
  }
}

Java User Group December '09 37

Word Count Example: Reduce

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WCReduce extends MapReduceBase implements Reducer {
  // Sum the counts for a single word and emit (word, sum)
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Java User Group December '09 38

Word Count Example: Launcher

// main() of the WordCount driver class
public static void main(String[] args) throws IOException {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(WCMap.class);
  conf.setCombinerClass(WCReduce.class);
  conf.setReducerClass(WCReduce.class);

  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
}

Java User Group December '09 39

Running a Hadoop Job Basic Steps Compile your job into a JAR file Copy input data into HDFS Execute bin/hadoop jar with relevant args Monitor tasks via Web interface (optional) Examine output when job is complete Java User Group December '09 40

Running on Amazon EC2/S3 Amazon sells cluster services > EC2: $0.10/cpu hour > S3: $0.20/gigabyte month Hadoop supports: > EC2: cluster management scripts included > S3: file system implementation included Tested on 400 node cluster Combination used by several startups Java User Group December '09 41

Block Placement Default is 3 replicas, but configurable Blocks are placed (writes are pipelined): > On same node > On different rack > On the other rack Clients read from closest replica If the replication for a block drops below target, it is automatically re-replicated Java User Group December '09 42
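A hedged sketch of changing the replication factor, both as the default for new files and per file (the path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);               // default replication for files created with this conf
    FileSystem fs = FileSystem.get(conf);
    fs.setReplication(new Path("/logs/2009/12/01.log"), (short) 5);   // per-file override
  }
}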

Data Validation Data is checked with CRC32 File Creation > Client computes a checksum per 512 bytes > DataNode stores the checksum File access > Client retrieves the data and checksum from DataNode > If validation fails, Client tries other replicas Periodic validation by DataNode Java User Group December '09 43

Installing Hadoop Java User Group December '09 44

Installing Hadoop The installation process, for distributed modes: > Requirements: Java 1.6, sshd, rsync > Configure SSH for password-free authentication > Unpack Hadoop distribution > Edit a few configuration files > Format the DFS on the name node > Start all the daemon processes Java User Group December '09 45

Installing Hadoop Modify hadoop-site.xml to set directories and master hostnames. Create a slaves file that lists the worker machines, one per line. Run bin/start-dfs.sh on the namenode. Run bin/start-mapred.sh on the jobtracker. Java User Group December '09 46

Demo Java User Group December '09 47

Resources For more information: > Website: http://hadoop.apache.org/core Mailing lists: > core-dev@hadoop.apache.org > core-user@hadoop.apache.org IRC: #hadoop on irc.freenode.net Java User Group December '09 48

Thanks! Scott Seighman scott.seighman@sun.com 49

Job Launch: Client Client program creates a JobConf Identify classes implementing Mapper and Reducer interfaces setMapperClass(), setReducerClass() Specify inputs, outputs setInputPath(), setOutputPath() Optionally, other options too: setNumReduceTasks(), setOutputFormat() Java User Group December '09 50

Appendix Java User Group December '09 51

Job Launch: JobClient Pass JobConf to JobClient.runJob() // blocks JobClient.submitJob() // does not block JobClient: Determines proper division of input into InputSplits Sends job data to master JobTracker server Java User Group December '09 52
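A hedged sketch of the non-blocking path: submit the job and poll the RunningJob handle instead of blocking in runJob(). It assumes the WordCount driver from the earlier slides; the progress printing and sleep interval are illustrative only:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndPoll {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);   // configured as in the launcher slide
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);       // returns immediately

    while (!job.isComplete()) {                    // poll instead of blocking
      System.out.println("map " + (int) (job.mapProgress() * 100) + "%, " +
                         "reduce " + (int) (job.reduceProgress() * 100) + "%");
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
  }
}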

Job Launch: JobTracker JobTracker: Inserts jar and JobConf (serialized to XML) in shared location Posts a JobInProgress to its run queue Java User Group December '09 53

Job Launch: TaskTracker TaskTrackers running on slave nodes periodically query JobTracker for work Retrieve job-specific jar and config Launch task in a separate instance of Java; main() is provided by Hadoop Java User Group December '09 54

Job Launch: Task TaskTracker.Child.main(): Sets up the child TaskInProgress attempt Reads XML configuration Connects back to necessary MapReduce components via RPC Uses TaskRunner to launch user process Java User Group December '09 55

Job Launch: TaskRunner TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch Mapper Task knows ahead of time which InputSplits it should be mapping Calls Mapper once for each record retrieved from the InputSplit Running the Reducer is much the same Java User Group December '09 56

Creating the Mapper Your instance of Mapper should extend MapReduceBase One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress Exists in a separate process from all other instances of Mapper, so no data sharing! Java User Group December '09 57

HDFS Limitations Almost GFS (Google FS) No file update options (record append, etc); all files are write-once Does not implement demand replication Designed for streaming Random seeks devastate performance Java User Group December '09 58

NameNode Head interface to HDFS cluster Records all global metadata Java User Group December '09 59

Secondary NameNode Not a failover NameNode! Records metadata snapshots from real NameNode Can merge update logs in flight Can upload snapshot back to primary Java User Group December '09 60

NameNode Death No new requests can be served while NameNode is down Secondary will not fail over as new primary So why have a secondary at all? Java User Group December '09 61

NameNode Death, cont'd If NameNode dies from software glitch, just reboot But if machine is hosed, metadata for cluster is irretrievable! Java User Group December '09 62

Bringing the Cluster Back If original NameNode can be restored, secondary can re-establish the most current metadata snapshot If not, create a new NameNode, use secondary to copy metadata to new primary, restart whole cluster Is there another way? Java User Group December '09 63

Keeping the Cluster Up Problem: DataNodes fix the address of the NameNode in memory, can't switch in flight Solution: Bring new NameNode up, but use DNS to make the cluster believe it's the original one Java User Group December '09 64

Further Reliability Measures Namenode can output multiple copies of metadata files to different directories Including an NFS-mounted one May degrade performance; watch for NFS locks Java User Group December '09 65

Making Hadoop Work Basic configuration involves pointing nodes at master machines mapred.job.tracker fs.default.name dfs.data.dir, dfs.name.dir hadoop.tmp.dir mapred.system.dir See Hadoop Quickstart in online documentation Java User Group December '09 66
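A minimal sketch of the same settings expressed programmatically through a Configuration object; on a real cluster they normally live in conf/hadoop-site.xml, and the hostnames and paths below are made up:

import org.apache.hadoop.conf.Configuration;

public class ClusterConfSketch {
  public static Configuration make() {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://master.example.com:9000");   // namenode
    conf.set("mapred.job.tracker", "master.example.com:9001");       // jobtracker
    conf.set("dfs.name.dir", "/srv/hadoop/name");                    // namenode metadata
    conf.set("dfs.data.dir", "/srv/hadoop/data");                    // datanode block storage
    conf.set("hadoop.tmp.dir", "/srv/hadoop/tmp");
    conf.set("mapred.system.dir", "/hadoop/mapred/system");          // shared job files in HDFS
    return conf;
  }
}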

Configuring for Performance Configuring Hadoop is performed via the base JobConf in conf/hadoop-site.xml It contains 3 different categories of settings: Settings that make Hadoop work Settings for performance Optional flags/bells & whistles Java User Group December '09 67

Configuring for Performance

mapred.child.java.opts           -Xmx512m
dfs.block.size                   134217728 (128 MB)
mapred.reduce.parallel.copies    20-50
dfs.datanode.du.reserved         1073741824 (1 GB)
io.sort.factor                   100
io.file.buffer.size              32K-128K
io.sort.mb                       20-200
tasktracker.http.threads         40-50

Java User Group December '09 68

Number of Tasks Controlled by two parameters: mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum Two degrees of freedom in mapper run time: Number of tasks/node, and size of InputSplits Current conventional wisdom: 2 map tasks/core, fewer for reducers See http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces Java User Group December '09 69
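As a hedged sketch, the per-job knobs look like this (the numbers are illustrative only; the mapred.tasktracker.*.tasks.maximum settings above cap slots per node, not per job):

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void tune(JobConf conf) {
    conf.setNumMapTasks(40);     // a hint; the real count follows the number of InputSplits
    conf.setNumReduceTasks(7);   // honored exactly
  }
}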

Dead Tasks Student jobs would run away, admin restart needed Very often stuck in huge shuffle process Students did not know about Partitioner class, may have had non-uniform distribution Did not use many Reducer tasks Lesson: Design algorithms to use Combiners where possible Java User Group December '09 70

Working With the Scheduler Remember: Hadoop has a FIFO job scheduler No notion of fairness or round-robin Design your tasks to play well with one another Decompose long tasks into several smaller ones which can be interleaved at the job level Java User Group December '09 71