Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines, provides high availability, provides MapReduce functionality, and hides the complexity of achieving high scalability and better hardware utilization.
HDFS HDFS stands for Hadoop Distributed File System. It is a subproject of Hadoop and has the following features: write-once-read-many for a simple concurrency model; replication for fault tolerance; support for large files (terabytes/petabytes) and lots of them (millions); Linux-like shell support; data access by MapReduce streaming.
Map-Reduce MapReduce was introduced by Google. It is a programming model that breaks a task into sub-tasks, distributes them to be executed in parallel (map), and aggregates the results (reduce). Between the map and reduce parts there is an intermediate phase called shuffle, in which the output of the map operations is sorted and grouped by key for the reduce part.
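For example, the canonical word-count job flows roughly like this:

    Input lines:     "to be or"   "not to be"
    Map output:      (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    Shuffle/sort:    (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
    Reduce output:   (be,2) (not,1) (or,1) (to,2)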
NameNode In a single Hadoop cluster, there is a single NameNode. The NameNode is responsible for the directory structure of HDFS and for the locations of the blocks. It is currently a single point of failure, but this is planned to be fixed in the next version (2.0.2). There can be a Secondary NameNode that periodically merges the edit log into the namespace image, which enables faster recovery. Still, when the NameNode goes down, the whole HDFS is down.
DataNode There are many DataNodes in a Hadoop cluster. They contain the data and provide high availability and fault tolerance through replication.
Master Nodes There are two nodes in a Hadoop cluster that are single points of failure: the NameNode, the HDFS point of failure (discussed above), and the JobTracker, the MapReduce point of failure. If the JobTracker machine goes down, all running map-reduce jobs are halted. This is not planned to be resolved in the coming version.
JobTracker A JobTracker node is responsible for assigning job IDs and distributing the task code and configuration. It also monitors the tasks for failures and handles reassignment, and it provides a diagnostic API for the jobs.
TaskTracker A TaskTracker is responsible for task execution. There is one TaskTracker per slave node. It uses a heartbeat service to notify the JobTracker of its progress.
Map-Reduce Interfaces In order to perform a MapReduce operation you'll have to create the following components: JobConf, InputFormat, OutputFormat, Mapper, Partitioner, Combiner and Reducer. Let's review them.
JobConf Represents the configuration of the job. It binds together all the other components. A job can be run by calling JobClient.runJob() and providing a JobConf to the method.
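A minimal driver sketch using the classic org.apache.hadoop.mapred API; WordCountDriver, WordCountMapper and WordCountReducer are hypothetical class names, with the mapper and reducer sketched under the Mapper and Reducer slides below:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);   // binds the job to this JAR
            conf.setJobName("wordcount");

            conf.setOutputKeyClass(Text.class);                   // types of the job output
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCountMapper.class);           // see the Mapper slide below
            conf.setReducerClass(WordCountReducer.class);         // see the Reducer slide below

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);                               // submits the job and blocks until it finishes
        }
    }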
Input Processing You can provide an InputFormat, InputSplit and RecordReader for processing the input and deciding whether to split a file into records for the map operations, whether to combine files, etc. Hadoop knows how to decompress common formats by default; you can provide a codec implementation for other formats. See the sketch below.
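As a small illustration of hooking into input processing, here is a sketch (assuming the old org.apache.hadoop.mapred API) of an InputFormat that refuses to split files, so each file goes to exactly one mapper:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Force each file to be processed by a single mapper,
    // e.g. for formats that cannot be split at arbitrary offsets.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;   // never split: one InputSplit per file
        }
    }

It would be registered on the job with conf.setInputFormat(WholeFileTextInputFormat.class).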
Mapper The Mapper represents the logic to be applied to a key/value pair of input. It returns an intermediary key/value output. It's like the SELECT clause in SQL (with a simple WHERE). There are many built-in Mapper implementations that provide services like concurrency, inverting, identity and regex matching.
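A word-count Mapper sketch in the old org.apache.hadoop.mapred API (the class name is hypothetical):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits (word, 1) for every word in the input line.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, ONE);   // intermediate key/value pair
            }
        }
    }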
Side Effects The mapper's logic should be free of side effects and idempotent, since tasks may be re-executed on failure. In fact, every implementation of Mapper, Combiner, Partitioner and Reducer should be purely functional.
Reducer The Reducer is responsible for aggregating the outputs from the Mappers into smaller results, just like aggregate functions in SQL. It is composed of 3 stages: shuffle, fetching the relevant partition from the mappers; sort, grouping the outputs by key; and reduce, generating the final output (usually saved to the file system).
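The matching word-count Reducer sketch (same assumptions as the Mapper above):

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Sums the counts emitted by the mappers for each word.
    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));   // final (word, total) pair
        }
    }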
Combiner The combiner performs aggregation on the mapper's output before it is sent to the reducer. There is no special interface for the combiner; Hadoop uses the same interface as for the Reducer. Usually the combiner class is the same as the reducer class. It is best suited for monoid-like aggregations (such as sum).
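For word count, the summation is associative and commutative, so in the hypothetical driver sketched earlier the reducer class can simply be registered as the combiner as well:

    conf.setCombinerClass(WordCountReducer.class);   // pre-aggregate on the map side before shuffling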
Partitioner By default, the results from the mappers are distributed to the reducers according to a hash of the partition key. However, in order to implement an order-by behavior, you can implement a custom partitioner that uses ranges instead of hashing. Note that this is usually combined with a predefined histogram-sampling job that defines the ranges.
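A range partitioner sketch in the old org.apache.hadoop.mapred API; the integer keys and the hard-coded cut points are hypothetical, and in practice the boundaries would come from a sampling job:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Routes keys to reducers by value range instead of by hash,
    // so that concatenating the sorted reducer outputs yields a globally ordered result.
    public class RangePartitioner implements Partitioner<IntWritable, Text> {
        private int[] cutPoints = {100, 1000};   // hypothetical hard-coded boundaries

        public void configure(JobConf job) {
            // the boundaries could be read from the job configuration instead
        }

        public int getPartition(IntWritable key, Text value, int numPartitions) {
            for (int i = 0; i < cutPoints.length && i < numPartitions - 1; i++) {
                if (key.get() < cutPoints[i]) {
                    return i;   // key falls into the i-th range
                }
            }
            return Math.min(cutPoints.length, numPartitions - 1);   // last range
        }
    }

It is registered with conf.setPartitionerClass(RangePartitioner.class).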
Reporter Both Mappers and Reducers can use a Reporter to report their progress and to update counters. Counters are defined either by the application or by the MapReduce framework, and they are based on long values.
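A minimal sketch of how a map() or reduce() body might use its Reporter argument; MyCounters and the malformed flag are hypothetical:

    enum MyCounters { MALFORMED_RECORDS }

    // inside map() or reduce():
    reporter.setStatus("processing key " + key);                   // shows up in the job UI
    if (malformed) {
        reporter.incrCounter(MyCounters.MALFORMED_RECORDS, 1);     // enum-based counter
    }
    reporter.incrCounter("Quality", "empty-lines", 1);             // dynamic (group, name) counter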
Hadoop Streaming Bundled with the Hadoop distribution. It can take any executable as the Mapper and the Reducer. E.g.:
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input idir -output odir -mapper /bin/cat -reducer /bin/wc
Hadoop Streaming You can also specify Java classes as the mapper, reducer, combiner, partitioner, etc. In fact, you can create the entire JobConf from the command line.
Writable Inputs and outputs in Hadoop must be serialized. The standard Java serialization is way too slow. Instead, Hadoop uses DataInput and DataOutput stream wrappers to write the primitives of your data. You should implement the Writable interface if you want custom serialization, or use the out-of-the-box implementations. Note that there is no support for super-fast serialization using direct memory or Unsafe.
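A sketch of a custom value type implementing Writable (the PointWritable type is hypothetical); a type used as a key would additionally need to implement WritableComparable:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A point with an id, serialized field by field as raw primitives.
    public class PointWritable implements Writable {
        private long id;
        private double x;
        private double y;

        public void write(DataOutput out) throws IOException {
            out.writeLong(id);       // primitives are written directly to the stream
            out.writeDouble(x);
            out.writeDouble(y);
        }

        public void readFields(DataInput in) throws IOException {
            id = in.readLong();      // fields must be read in the same order they were written
            x = in.readDouble();
            y = in.readDouble();
        }
    }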
Caching Hadoop provides a distributed cache utility that can efficiently distribute read-only files and JARs to the task nodes. The application specifies in the JobConf which files will be distributed through the cache.
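A driver-side sketch of using the distributed cache; the HDFS paths are hypothetical, and the fragment after '#' becomes the local symlink name:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
        // Called from the driver before submitting the job.
        public static void addSideData(JobConf conf) throws Exception {
            DistributedCache.addCacheFile(new URI("/app/lookup.dat#lookup.dat"), conf);  // read-only file
            DistributedCache.addFileToClassPath(new Path("/app/lib/helper.jar"), conf);  // extra JAR
            DistributedCache.createSymlink(conf);  // expose the '#' symlink names in the task working dir
        }
    }

Tasks can then read the local copies, e.g. via DistributedCache.getLocalCacheFiles(conf).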
What Hadoop is Not Hadoop is not well suited for the following use cases: low-latency applications (Hadoop is tuned for high throughput) and small files (the default file block size is 64 MB). It is also not fully HA and has a complex configuration. Note that for real-time processing you can use Twitter Storm.
Additional Projects The Hadoop ecosystem provides several projects that are used on top of Hadoop and provide some useful abstractions: Pig, Hive, Mahout, ZooKeeper, HBase and Oozie.
Pig Pig provides an SQL-like DSL (Pig Latin) on top of Hadoop. It also allows great flexibility by way of writing Java functions that can be used in the select and where clauses. The framework will automatically run these functions in the Mappers or Reducers. Let's see some examples.
Pig
    C = LOAD 'course' USING PigStorage() AS (name:chararray, duration:int);
    Ns = FOREACH C GENERATE name;
    DUMP Ns;
Hive Hive provides data-warehouse facilities on top of Hadoop. It can be seen as an alternative to Pig. The query language is very much like SQL. Some consider Pig appropriate for developers and Hive for analysts.
HBase HBase is a NoSQL store on top of HDFS. It doesn't use MapReduce and is tuned for low latency. It belongs to the column-family category.
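A minimal client sketch, assuming the classic HBase Java client API and a hypothetical users table with an info column family:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");                     // hypothetical table

            Put put = new Put(Bytes.toBytes("user-1"));                   // row key
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"),         // column family, qualifier
                    Bytes.toBytes("alice"));
            table.put(put);                                               // low-latency single-row write

            Result result = table.get(new Get(Bytes.toBytes("user-1")));  // low-latency single-row read
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            table.close();
        }
    }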
Zookeeper ZooKeeper is a distributed coordination service. It allows implementing distributed transactions, configuration management, leader election, etc. It is very useful in the Hadoop ecosystem.
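A minimal client sketch of storing and reading a piece of shared configuration; the connect string and znode path are hypothetical, and a production client would wait for the connection to be established before issuing requests:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
                public void process(WatchedEvent event) { /* ignore connection events in this sketch */ }
            });

            // store a small piece of shared configuration as a persistent znode
            zk.create("/demo-config", "key=value".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            byte[] data = zk.getData("/demo-config", false, null);   // read it back, no watch
            System.out.println(new String(data));

            zk.close();
        }
    }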
Mahout A machine-learning library on top of Hadoop. It implements most of the common machine-learning algorithms. Machine-learning computations are usually well suited for Hadoop, as we usually have big data and some parts of the algorithms can be easily parallelized.