Big Data Analysis using Hadoop Lecture 3
- Phyllis Morrison
Last Week - Recap
- Driver Class
- Mapper Class
- Reducer Class
- Created our first MR process
- Ran it on Hadoop
- Monitored it on the web pages
- Checked outputs using the HDFS command line and web pages
In this Class
- Counters
- Combiners
- Partitioners
- Reading and Writing Data
- Chaining MapReduce Jobs (Workflows)
- Lab work
- Assignment

Counters
Passing information back to the driver
- Counters provide a way for Mappers or Reducers to pass aggregated values back to the driver after the job has completed.
- The framework provides built-in counters:
  - Map-Reduce counters - e.g. number of input and output records for mappers and reducers, time and memory statistics
  - File System counters - e.g. number of bytes read or written
  - Job counters - e.g. launched tasks, failed tasks, etc.
- Counters are visible from the JobTracker UI
- Counters are reported on the console when the job finishes

User Defined Counters
- User-defined counters are a useful mechanism for gathering statistics about the job, e.g.
  - quality control - tracking different types of input records, e.g. bad input records
  - instrumentation of code - number of warnings, number of errors, ...
- The framework aggregates all user-defined counters over all mappers and reducers and reports them back on the UI
- Counters are set up as enums in the Mapper or Reducer:

    enum GroupName { Counter1, Counter2, ... }

- Counters are retrieved from the Context object which is passed to the Mapper and Reducer:

    context.getCounter(Enum<?> counterName);

- Increment or set the counter value:

    increment(long incrementAmt);
    setValue(long value);
Using Counters
Set up the user-defined counters:

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        static enum MyCounters { Bad, Good, Missing }

        public void map(LongWritable key, Text value, Context context) ...

Increment the counter when necessary:

    ...
    if (<input data problem>) {
        context.getCounter(MyCounters.Bad).increment(1);
    }
    ...

Using Counters
Print out the counters at the end of the job in the Driver:

    ...
    int exitCode = job.waitForCompletion(true) ? 0 : 1;
    System.out.println("Job is complete - printing counters now:");
    Counters counters = job.getCounters();
    Counter bad = counters.findCounter(WordMapper.MyCounters.Bad);
    System.out.println("Number of bad records is " + bad.getValue());
    ...
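The aggregation the framework performs can be mimicked in plain Java. The sketch below is a hypothetical illustration, not the Hadoop API: each "task" keeps its own enum-keyed tallies, and the driver sees the sum over all tasks after the job finishes.

```java
import java.util.EnumMap;

// Hypothetical illustration (not the Hadoop Counter API): mimicking how the
// framework aggregates per-task enum counters into job-wide totals.
class CounterDemo {
    enum MyCounters { BAD, GOOD, MISSING }

    // Each "task" (mapper or reducer) keeps its own local tallies.
    static EnumMap<MyCounters, Long> newTaskCounters() {
        EnumMap<MyCounters, Long> c = new EnumMap<>(MyCounters.class);
        for (MyCounters k : MyCounters.values()) c.put(k, 0L);
        return c;
    }

    static void increment(EnumMap<MyCounters, Long> c, MyCounters k, long amt) {
        c.put(k, c.get(k) + amt);
    }

    // The driver sees the sum over all tasks, as reported when the job ends.
    @SafeVarargs
    static long aggregate(MyCounters k, EnumMap<MyCounters, Long>... tasks) {
        long total = 0;
        for (EnumMap<MyCounters, Long> t : tasks) total += t.get(k);
        return total;
    }

    public static void main(String[] args) {
        EnumMap<MyCounters, Long> mapper1 = newTaskCounters();
        EnumMap<MyCounters, Long> mapper2 = newTaskCounters();
        increment(mapper1, MyCounters.BAD, 1);
        increment(mapper2, MyCounters.BAD, 1);
        System.out.println("Bad records: " + aggregate(MyCounters.BAD, mapper1, mapper2));
    }
}
```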
Combiners
- Mappers can produce a large amount of intermediate data, generating significant network traffic when it is passed to the Reducers.
- A Combiner is a mini-reducer:
  - runs locally on a single Mapper's output
  - passes its output to the Reducer
  - reduces the intermediate data passed to the Reducer
- Can lead to faster jobs and less network traffic
  - often reduces the amount of work needed to be done by the Reducer
  - may be the same code as the Reducer
Typical MR process / MR process with Combiner

WordCount Example
Input to the Mapper:

    (124, "this one I think is called a yink")
    (158, "he likes to wink, he likes to drink")
    (195, "he likes to drink and drink and drink")

Output from the Mapper:

    (this,1) (one,1) (I,1) (think,1) (is,1) (called,1) (a,1) (yink,1)
    (he,1) (likes,1) (to,1) (wink,1) (he,1) (likes,1) (to,1) (drink,1)
    (he,1) (likes,1) (to,1) (drink,1) (and,1) (drink,1) (and,1) (drink,1)
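The mapper's tokenise-and-emit logic above can be sketched in plain Java (this is not the Hadoop Mapper class, just the per-line logic it would contain):

```java
import java.util.*;

// Plain-Java sketch of the WordCount mapper logic: each input line is
// tokenised and a (word, 1) pair is emitted for every token.
class MapperDemo {
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map("he likes to drink and drink and drink"));
    }
}
```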
Example Job - What happens with only a Reducer
Intermediate data sent to the Reducer:

    (a,[1]) (and,[1,1]) (called,[1]) (drink,[1,1,1,1]) (he,[1,1,1]) (I,[1]) (is,[1]) (likes,[1,1,1]) (one,[1]) (think,[1]) (this,[1]) (to,[1,1,1]) (wink,[1]) (yink,[1])

Reducer output:

    (a,1) (and,2) (called,1) (drink,4) (he,3) (I,1) (is,1) (likes,3) (one,1) (think,1) (this,1) (to,3) (wink,1) (yink,1)

When we use a Combiner
Output from the Mapper, grouped locally:

    (a,[1]) (and,[1,1]) (called,[1]) (drink,[1,1,1,1]) (he,[1,1,1]) (I,[1]) (is,[1]) (likes,[1,1,1]) (one,[1]) (think,[1]) (this,[1]) (to,[1,1,1]) (wink,[1]) (yink,[1])

Combiner output:

    (a,[1]) (and,[2]) (called,[1]) (drink,[4]) (he,[3]) (I,[1]) (is,[1]) (likes,[3]) (one,[1]) (think,[1]) (this,[1]) (to,[3]) (wink,[1]) (yink,[1])
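The shuffle & sort grouping and the reduce summation shown above can be simulated in plain Java (a hypothetical sketch, not the framework itself): pairs are grouped by key in sorted order, then each group's values are summed.

```java
import java.util.*;

// Plain-Java sketch of shuffle & sort plus reduce: (word, 1) pairs are
// grouped by key (keys kept sorted, as the framework does), then summed.
class ShuffleReduceDemo {
    // Shuffle & sort: group values by key, keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<String[]> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] p : pairs) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(Integer.parseInt(p[1]));
        }
        return grouped;
    }

    // Reduce: sum the list of counts for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new LinkedHashMap<>();
        grouped.forEach((k, v) -> out.put(k, v.stream().mapToInt(Integer::intValue).sum()));
        return out;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : "he likes to drink and drink".split(" ")) pairs.add(new String[]{w, "1"});
        System.out.println(reduce(shuffle(pairs)));
    }
}
```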
When we use a Combiner
Intermediate data sent to the Reducer (the Combiner output):

    (a,[1]) (and,[2]) (called,[1]) (drink,[4]) (he,[3]) (I,[1]) (is,[1]) (likes,[3]) (one,[1]) (think,[1]) (this,[1]) (to,[3]) (wink,[1]) (yink,[1])

Reducer output:

    (a,1) (and,2) (called,1) (drink,4) (he,3) (I,1) (is,1) (likes,3) (one,1) (think,1) (this,1) (to,3) (wink,1) (yink,1)

Other Combiner outputs are merged in the same way.

Specifying a Combiner
- Set the Combiner up in the job configuration in the Driver:

    job.setCombinerClass(YourCombinerClass.class);

- The Combiner uses the same interface as the Reducer:
  - takes in a key and a list of values (output from the Mapper)
  - outputs zero or more key/value pairs
  - the work is done in the reduce() method
- Note:
  - The Combiner input types must be the same as the Mapper output types (K2, V2)
  - The Combiner output types must be the same as the Reducer input types (K2, V2)
  - Don't put code in the Combiner that alters the data from the Mapper
Example Code
- The Reducer code is used for the Combiner
- Sometimes the code can be slightly different between the Combiner and the Reducer
- You need to be careful to maintain the input and output formats
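The warning about not altering the data in the Combiner can be made concrete: a Combiner only works when the operation gives the same answer applied per-mapper first and then globally (as summation does). The plain-Java sketch below (a hypothetical illustration, not Hadoop code) shows that summing commutes with combining, while averaging does not.

```java
import java.util.*;

// Plain-Java sketch of why a combiner must not change the meaning of the
// mapper output: combining partial sums gives the same total, but
// combining partial means gives a different (wrong) final average.
class CombinerSafetyDemo {
    static int sum(List<Integer> xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    static double mean(List<Double> xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.size();
    }

    public static void main(String[] args) {
        // Mean of per-"mapper" means: mean(mean(1,2,3), mean(4)) = 3.0,
        // but the global mean of (1,2,3,4) is 2.5.
        System.out.println(mean(Arrays.asList(mean(Arrays.asList(1.0, 2.0, 3.0)),
                                              mean(Arrays.asList(4.0)))));
        System.out.println(mean(Arrays.asList(1.0, 2.0, 3.0, 4.0)));
    }
}
```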
Partitioners

Partitioners
- The number of Reducers that run is specified in the job configuration; the default number is 1
- The number of Reducers can be set when setting up the job with setNumReduceTasks(value):

    job.setNumReduceTasks(10);

- No need to set this value if you just want one Reducer
- The Partitioner implementation directs key-value pairs to a specific reducer
- Number of Partitions = Number of Reducers
- The default is to hash the key to determine the partition, implemented by HashPartitioner<K,V>
Partitioners - Shuffle & Sort
[Diagram: shapes represent keys, inner patterns represent values]
Note:
- All records with the same key go to the same reducer
- A reducer can handle different keys
- Reducers can have different loads
Partitioners
- Partitioners determine which reducer each map output record is sent to in the shuffle & sort phase
- This is normally determined using a hash function on the key
- It is important that the Partitioner distributes the map output records evenly over the reducers

    job.setPartitionerClass(CasePartitioner.class);
    job.setNumReduceTasks(10);

Default Partitioner
- The default Partitioner is the HashPartitioner
- The key's hash code is turned into a non-negative integer by bitwise ANDing it with the largest integer value
- It is then reduced modulo the number of partitions to find the index of the partition (the reducer number) that the record belongs to

    public int getPartition(K2 key, V2 value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

- Keys are distributed evenly across the available reduce tasks, assuming a good hashCode() function
- Records with the same key will end up in the same reduce task
Default Partitioner (example with 3 reducers)

    Key      Hashcode   Modulo 3   Goes to
    This     1          1          Reducer 1
    is       2          2          Reducer 2
    not      3          0          Reducer 0
    my       4          1          Reducer 1
    office   5          2          Reducer 2
    Colour   6          0          Reducer 0
    Pen      7          1          Reducer 1
    Money    8          2          Reducer 2

To Implement a Custom Partitioner
- Set the number of reducers in the job configuration
- Create a custom Partitioner class by extending the Partitioner class
- Implement the getPartition() method to return a number between 0 and the number of reducers - 1, indexing which reducer the key/value pair should be sent to

    public class MyPartitioner extends Partitioner<KEY, VALUE> {
        public int getPartition(KEY key, VALUE value, int numPartitions) {
            // put code here to decide, based on the key, which
            // reducer the map output should go to
            ...
            return partitionNumber;
        }
    }
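The HashPartitioner arithmetic can be checked in plain Java (the real class lives in Hadoop; this sketch just reproduces the one-line formula for String keys):

```java
// Plain-Java sketch of the HashPartitioner formula shown above:
// (hashCode & Integer.MAX_VALUE) % numPartitions always yields a valid,
// deterministic reducer index, even for keys whose hash code is negative.
class HashPartitionDemo {
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String k : new String[]{"This", "is", "not", "my", "office"}) {
            System.out.println(k + " -> reducer " + getPartition(k, 3));
        }
    }
}
```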
Simple MapReduce Job / Complete MapReduce Job
[Diagrams: a simple MapReduce job, and a complete MapReduce job with Combiner and Partitioner]
Drawbacks
- Need to know the number of partitions/reducers at the start; it is not dynamic
- Letting the application fix the number of reducers, rather than the cluster, can result in inefficient use of the cluster and uneven reduce tasks that can dominate the job execution time
- See MultipleOutputs later on for an alternative solution

Partitioner example code
Reading & Writing Data
[Diagram: File → HDFS blocks (physical locations) → InputSplits → Mappers → map]
- Logical splits are created by an InputFormat
- Each split is processed by a single mapper - data locality
- Each record (key/value pair) is processed by the map method

Input Data
- The input data is split into chunks called input splits (a logical division)
  - the split size is normally the size of a block, but it is configurable
- The size of the splits determines the level of parallelisation:
  - one input split per mapper
  - all input data in one single split means no parallelisation
  - small input splits are useful for parallelisation of CPU-bound tasks
- HDFS stores input data in blocks spread across the nodes (a physical division)
  - one block per input split is very efficient for I/O-bound tasks
- An input split may span blocks - Hadoop guarantees the processing of all records, but data-local mappers will occasionally need to access remote data
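The relationship between file size, split size, and the number of mappers follows from simple ceiling division. A plain-Java sketch (the sizes below are illustrative assumptions, not defaults pulled from a real cluster):

```java
// Plain-Java sketch: the number of input splits (and hence mappers) that a
// file produces when the split size equals the block size.
class SplitDemo {
    static long numSplits(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;   // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // e.g. a 300 MB file with 128 MB splits needs 3 mappers
        System.out.println(numSplits(300 * mb, 128 * mb));
    }
}
```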
Input Data
- Data input is supported by:
  - InputFormat: indicates how the input files should be split into input splits
  - Reader: performs the reading of the data, providing a key/value pair for input to the mapper
- Hadoop provides predefined InputFormats:
  - TextInputFormat: the default input format
  - KeyValueTextInputFormat
  - SequenceFileInputFormat (normally used for chaining multiple MapReduce jobs)
  - NLineInputFormat
- The input format can be set in the job configuration in your driver, e.g.

    job.setInputFormatClass(KeyValueTextInputFormat.class);

- To read input data in a way not supported by the standard InputFormat classes you can create a custom InputFormat
Output Data
- Each reducer writes its output to its own file, normally named part-nnnnn, where nnnnn is the partition ID of the reducer
- Data output is supported by an OutputFormat and a Writer
- Hadoop provides predefined OutputFormats:
  - TextOutputFormat: the default output format
  - SequenceFileOutputFormat (normally used for chaining multiple MapReduce jobs)
  - NullOutputFormat

    job.setOutputFormatClass(SequenceFileOutputFormat.class);
Writing Multiple Files
- MultipleOutputs allows you to write data to multiple files whose names are derived from the output keys and values
- Output file names are of the form name-X-nnnn:
  - name is set by the code
  - X = m for mapper output, X = r for reducer output
  - nnnn is an integer designating the part number

Using MultipleOutputs
- Create an instance of MultipleOutputs in the reducer or mapper where the output is being generated, normally in setup():

    protected void setup(Context context) throws IOException, InterruptedException {
        multipleOutputs = new MultipleOutputs<KEY, VALUE>(context);
    }

- Close the MultipleOutputs instance once finished with it, normally in cleanup():

    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
Using MultipleOutputs
- Write the output key/value pair to the instance of MultipleOutputs, where name identifies the base output path:

    multipleOutputs.write(key, value, name);

- name is interpreted relative to the output directory, so it is possible to create subdirectories by including file path separator characters in name
- Include logic to determine which output file to write to, normally dependent on the key and/or value, e.g.
  - monthly or weekly reports - files identified by time periods
  - store or branch reports - files identified by store or branch, or both

Using MultipleOutputs
- MultipleOutputs delegates to the given OutputFormat
- Separate named outputs can be set up in the driver using addNamedOutput(), each with its own OutputFormat and key/value types:

    MultipleOutputs.addNamedOutput(job, name, OUTPUTFORMAT, KEY, VALUE);

- A single record can be written to multiple output files
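The routing logic one would wrap around multipleOutputs.write() can be sketched in plain Java. This is a hypothetical illustration (the "month" field and "reports/" path are invented for the example, and no real files are written):

```java
import java.util.*;

// Plain-Java sketch of the routing logic around multipleOutputs.write():
// the base output name is derived from the record itself (here, a month
// field), and a path separator in the name creates a subdirectory.
class MultipleOutputsDemo {
    // Returns the base output name for a record, e.g. "reports/2014-01".
    static String outputName(String month) {
        return "reports/" + month;
    }

    // Routes (month, value) records into per-name buckets, standing in for
    // the per-file output that MultipleOutputs would produce.
    static Map<String, List<String>> route(List<String[]> records) {
        Map<String, List<String>> files = new HashMap<>();
        for (String[] r : records) {   // r[0] = month, r[1] = value
            files.computeIfAbsent(outputName(r[0]), k -> new ArrayList<>()).add(r[1]);
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> recs = Arrays.asList(
                new String[]{"2014-01", "a"}, new String[]{"2014-02", "b"});
        System.out.println(route(recs).keySet());
    }
}
```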
Reading/Writing other types of data
- Reading from / writing to a database using JDBC:
  - Can use the DBInputFormat and DBOutputFormat
  - DBInputFormat doesn't have sharding capabilities, so you have to be careful not to overwhelm the database by reading with too many mappers
  - DBOutputFormat is very useful for outputting data to a database
- Reading XML:
  - Create a custom InputFormat to read a whole file at a time (see Chapter 7 in Hadoop: The Definitive Guide), suitable for small XML files, or
  - Use XMLInputFormat from Mahout (the machine learning library that is implemented on Hadoop)

Chaining MapReduce Jobs
Chaining MapReduce Jobs
- We've looked at single MapReduce jobs, but not every problem can be solved with a single MapReduce job
- MapReduce can get very complex with multiple MR jobs
- Many problems can be solved with MapReduce by writing several MapReduce steps which run in series, in parallel, or both
- These can be controlled and made to interact with each other (dependencies)
- This gives better control and allows for greater computational capabilities

Chaining Jobs in a Sequence
MapReduce 1 → MapReduce 2 → MapReduce 3 → ...
- Run MapReduce jobs sequentially, with the output of one job being the input of another
- Note: watch the intermediate file format - SequenceFileOutputFormat & SequenceFileInputFormat are useful for this
- Remember: Sequence files are Hadoop's compressed binary file format for storing key/value pairs
- Set up a Job (job1) in the Driver and run job1; then set up a new Job (job2) in the Driver, with its input path set to the output path of job1, and run job2, etc.
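The sequential pattern can be sketched in plain Java, with the output of "job1" feeding "job2" the way job2's input path would point at job1's output directory. The two stages here (a word count, then a frequency filter) are invented purely for illustration:

```java
import java.util.*;

// Plain-Java sketch of chaining: job1's output becomes job2's input,
// mirroring input-path = output-path wiring between two MapReduce jobs.
class ChainDemo {
    // "Job 1": a word count over the input lines.
    static Map<String, Integer> job1WordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines)
            for (String w : line.split(" "))
                counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // "Job 2": keep only words that occurred more than once.
    static Map<String, Integer> job2FilterFrequent(Map<String, Integer> counts) {
        Map<String, Integer> out = new TreeMap<>();
        counts.forEach((k, v) -> { if (v > 1) out.put(k, v); });
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("he likes to drink", "he likes to wink");
        System.out.println(job2FilterFrequent(job1WordCount(input)));
    }
}
```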
Chaining Jobs with Dependencies
MapReduce 1, MapReduce 2 → MapReduce 3
- Dependencies occur when jobs don't simply run sequentially
- Use the Job, ControlledJob & JobControl classes to set up and manage job dependencies

Setting up workflows
- A JobControl is created to hold the workflow
  - allows for the creation of simple workflows
  - represents a graph of Jobs to run
  - dependencies are specified in code

    JobControl control = new JobControl("Workflow-Example");

- A ControlledJob is set up for each job in the workflow
  - ControlledJob is a wrapper for Job
  - the ControlledJob constructor can take in job dependencies

    ControlledJob step1 = new ControlledJob(job1, null);
    List<ControlledJob> dependencies = new ArrayList<ControlledJob>();
    dependencies.add(step1);
    ControlledJob step2 = new ControlledJob(job2, dependencies);

- Dependencies between jobs can also be set up using addDependingJob(), e.g. step2.addDependingJob(step1) means step2 will not start until step1 has finished
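The rule that JobControl enforces - a job may start only when every job it depends on has finished - can be sketched in plain Java (a hypothetical Step class, not the Hadoop ControlledJob API):

```java
import java.util.*;

// Plain-Java sketch of the dependency rule enforced by ControlledJob /
// JobControl: a step is ready to run only when all its dependencies are done.
class WorkflowDemo {
    static class Step {
        final String name;
        final List<Step> dependencies = new ArrayList<>();
        boolean finished = false;

        Step(String name) { this.name = name; }

        void addDependingOn(Step dep) { dependencies.add(dep); }

        boolean ready() {
            for (Step d : dependencies) if (!d.finished) return false;
            return true;
        }
    }

    public static void main(String[] args) {
        Step step1 = new Step("job1");
        Step step2 = new Step("job2");
        step2.addDependingOn(step1);
        System.out.println("job2 ready before job1 finishes? " + step2.ready());
    }
}
```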
Setting up workflows
- Each ControlledJob is added to the JobControl object using addJob():

    control.addJob(step1);
    control.addJob(step2);

- The JobControl is executed in a thread (JobControl implements Runnable):

    Thread workflowThread = new Thread(control, "Workflow-Thread");
    workflowThread.setDaemon(true);
    workflowThread.start();

Setting up workflows
- Wait for the JobControl to complete and report the results:

    while (!control.allFinished()) {
        Thread.sleep(500);
    }
    if (control.getFailedJobList().size() > 0) {
        log.error(control.getFailedJobList().size() + " jobs failed!");
        for (ControlledJob job : control.getFailedJobList()) {
            log.error(job.getJobName() + " failed");
        }
    } else {
        log.info("Success!! Workflow completed [" + control.getSuccessfulJobList().size() + "] jobs");
    }

- JobControl has methods to allow monitoring and tracking of its jobs
Chaining preprocessing & postprocessing steps
- Sequential jobs: [ map reduce ]+
- Modular jobs: map+ reduce map*
- Preprocessing (and postprocessing) might require a number of Mappers to run sequentially, e.g. text preprocessing
- Sequential jobs (using the identity Reducer) are inefficient
- Use ChainMapper and ChainReducer to implement modular pre- and postprocessing steps:
  - each mapper is added, with its own job configuration parameters, to the ChainMapper or ChainReducer
  - each mapper can be run individually, which is useful for testing/debugging

MapReduce Algorithms
Available in pdf at
Programmer Control
- The ability to construct complex data structures as keys and values to store and communicate partial results
- The ability to execute user-specified initialisation code at the beginning of a map or reduce task, and the ability to execute user-specified termination code at the end of a map or reduce task
- The ability to preserve state in both mappers and reducers across multiple input or intermediate keys
- The ability to control the sort order of intermediate keys, and therefore the order in which a reducer will encounter particular keys
- The ability to control the partitioning of the key space, and therefore the set of keys that will be encountered by a particular reducer
Useful code!
User-defined types - example

    public class TextPair implements WritableComparable<TextPair> {
        private Text first;
        private Text second;

        public TextPair() {
            set(new Text(), new Text());
        }
        public TextPair(String first, String second) {
            set(new Text(first), new Text(second));
        }
        public TextPair(Text first, Text second) {
            set(first, second);
        }
        public void set(Text first, Text second) {
            this.first = first;
            this.second = second;
        }
        public Text getFirst() { return first; }
        public Text getSecond() { return second; }

        @Override
        public void write(DataOutput out) throws IOException {
            first.write(out);
            second.write(out);
        }
        @Override
        public void readFields(DataInput in) throws IOException {
            first.readFields(in);
            second.readFields(in);
        }
        @Override
        public int compareTo(TextPair tp) {
            int cmp = first.compareTo(tp.first);
            if (cmp != 0) {
                return cmp;
            }
            return second.compareTo(tp.second);
        }
        @Override
        public int hashCode() {
            return first.hashCode() * 163 + second.hashCode();
        }
        @Override
        public boolean equals(Object o) {
            if (o instanceof TextPair) {
                TextPair tp = (TextPair) o;
                return first.equals(tp.first) && second.equals(tp.second);
            }
            return false;
        }
        @Override
        public String toString() {
            return first + "\t" + second;
        }
    }
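The ordering and hashing behaviour of TextPair can be exercised without Hadoop by substituting String for Text. The sketch below is an analog for illustration only, not the Writable type itself (so write()/readFields() are omitted):

```java
// Plain-Java analog of TextPair (String instead of Text): comparison is on
// the first field, then the second; the hash combines both fields with the
// same 163 multiplier used above.
class StringPair implements Comparable<StringPair> {
    final String first, second;

    StringPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public int compareTo(StringPair tp) {
        int cmp = first.compareTo(tp.first);
        return cmp != 0 ? cmp : second.compareTo(tp.second);
    }
    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StringPair)) return false;
        StringPair tp = (StringPair) o;
        return first.equals(tp.first) && second.equals(tp.second);
    }
    @Override
    public String toString() {
        return first + "\t" + second;
    }

    public static void main(String[] args) {
        System.out.println(new StringPair("apple", "zebra").compareTo(
                new StringPair("banana", "ant")) < 0);
    }
}
```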
WritableComparator acts as a factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use:

    RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

    IntWritable w1 = new IntWritable(163);
    IntWritable w2 = new IntWritable(67);
    assertThat(comparator.compare(w1, w2), greaterThan(0));
More informationRecommended Literature
COSC 6339 Big Data Analytics Introduction to Map Reduce (I) Edgar Gabriel Fall 2018 Recommended Literature Original MapReduce paper by google http://research.google.com/archive/mapreduce-osdi04.pdf Fantastic
More informationMap Reduce and Design Patterns Lecture 4
Map Reduce and Design Patterns Lecture 4 Fang Yu Software Security Lab. Department of Management Information Systems College of Commerce, National Chengchi University http://soslab.nccu.edu.tw Cloud Computation,
More informationDatabase Applications (15-415)
Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April
More informationAnnouncements. Lab Friday, 1-2:30 and 3-4:30 in Boot your laptop and start Forte, if you brought your laptop
Announcements Lab Friday, 1-2:30 and 3-4:30 in 26-152 Boot your laptop and start Forte, if you brought your laptop Create an empty file called Lecture4 and create an empty main() method in a class: 1.00
More informationProcessing Distributed Data Using MapReduce, Part I
Processing Distributed Data Using MapReduce, Part I Computer Science E-66 Harvard University David G. Sullivan, Ph.D. MapReduce A framework for computation on large data sets that are fragmented and replicated
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationVendor: Cloudera. Exam Code: CCD-410. Exam Name: Cloudera Certified Developer for Apache Hadoop. Version: Demo
Vendor: Cloudera Exam Code: CCD-410 Exam Name: Cloudera Certified Developer for Apache Hadoop Version: Demo QUESTION 1 When is the earliest point at which the reduce method of a given Reducer can be called?
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationAgenda CS121/IS223. Reminder. Object Declaration, Creation, Assignment. What is Going On? Variables in Java
CS121/IS223 Object Reference Variables Dr Olly Gotel ogotel@pace.edu http://csis.pace.edu/~ogotel Having problems? -- Come see me or call me in my office hours -- Use the CSIS programming tutors Agenda
More informationBig Data: Architectures and Data Analytics
Big Data: Architectures and Data Analytics January 22, 2018 Student ID First Name Last Name The exam is open book and lasts 2 hours. Part I Answer to the following questions. There is only one right answer
More informationCS435 Introduction to Big Data Spring 2018 Colorado State University. 2/12/2018 Week 5-A Sangmi Lee Pallickara
W5.A.0.0 CS435 Introduction to Big Data W5.A.1 FAQs PA1 has been posted Feb. 21, 5:00PM via Canvas Individual submission (No team submission) Source code of examples in lectures: https://github.com/adamjshook/mapreducepatterns
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationProgramming with Hadoop MapReduce. Kostas Solomos Computer Science Department University of Crete, Greece
Programming with Hadoop MapReduce Kostas Solomos Computer Science Department University of Crete, Greece What we will cover Diving deeper into Hadoop architecture Using a shared input for the mappers/reducers
More informationMap Reduce. Yerevan.
Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate
More informationCS121/IS223. Object Reference Variables. Dr Olly Gotel
CS121/IS223 Object Reference Variables Dr Olly Gotel ogotel@pace.edu http://csis.pace.edu/~ogotel Having problems? -- Come see me or call me in my office hours -- Use the CSIS programming tutors CS121/IS223
More informationGiraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi
Giraph: Large-scale graph processing infrastructure on Hadoop Qu Zhi Why scalable graph processing? Web and social graphs are at immense scale and continuing to grow In 2008, Google estimated the number
More informationExpert Lecture plan proposal Hadoop& itsapplication
Expert Lecture plan proposal Hadoop& itsapplication STARTING UP WITH BIG Introduction to BIG Data Use cases of Big Data The Big data core components Knowing the requirements, knowledge on Analyst job profile
More informationVendor: Hortonworks. Exam Code: HDPCD. Exam Name: Hortonworks Data Platform Certified Developer. Version: Demo
Vendor: Hortonworks Exam Code: HDPCD Exam Name: Hortonworks Data Platform Certified Developer Version: Demo QUESTION 1 Workflows expressed in Oozie can contain: A. Sequences of MapReduce and Pig. These
More informationMapReduce-style data processing
MapReduce-style data processing Software Languages Team University of Koblenz-Landau Ralf Lämmel and Andrei Varanovich Related meanings of MapReduce Functional programming with map & reduce An algorithmic
More informationSnapshots and Repeatable reads for HBase Tables
Snapshots and Repeatable reads for HBase Tables Note: This document is work in progress. Contributors (alphabetical): Vandana Ayyalasomayajula, Francis Liu, Andreas Neumann, Thomas Weise Objective The
More informationData Analytics Job Guarantee Program
Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction
More informationCS 231 Data Structures and Algorithms, Fall 2016
CS 231 Data Structures and Algorithms, Fall 2016 Dr. Bruce A. Maxwell Department of Computer Science Colby College Course Description Focuses on the common structures used to store data and the standard
More informationData abstractions: ADTs Invariants, Abstraction function. Lecture 4: OOP, autumn 2003
Data abstractions: ADTs Invariants, Abstraction function Lecture 4: OOP, autumn 2003 Limits of procedural abstractions Isolate implementation from specification Dependency on the types of parameters representation
More informationCS455: Introduction to Distributed Systems [Spring 2018] Dept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [SPARK] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Return type for collect()? Can
More informationExamTorrent. Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you
ExamTorrent http://www.examtorrent.com Best exam torrent, excellent test torrent, valid exam dumps are here waiting for you Exam : Apache-Hadoop-Developer Title : Hadoop 2.0 Certification exam for Pig
More informationIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce Who Am I - Ryan Tabora - Data Developer at Think Big Analytics - Big Data Consulting - Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc. 2 Who Am I -
More informationsqoop Automatic database import Aaron Kimball Cloudera Inc. June 18, 2009
sqoop Automatic database import Aaron Kimball Cloudera Inc. June 18, 2009 The problem Structured data already captured in databases should be used with unstructured data in Hadoop Tedious glue code necessary
More informationActual4Dumps. Provide you with the latest actual exam dumps, and help you succeed
Actual4Dumps http://www.actual4dumps.com Provide you with the latest actual exam dumps, and help you succeed Exam : HDPCD Title : Hortonworks Data Platform Certified Developer Vendor : Hortonworks Version
More informationSection 05: Solutions
Section 05: Solutions 1. Asymptotic Analysis (a) Applying definitions For each of the following, choose a c and n 0 which show f(n) O(g(n)). Explain why your values of c and n 0 work. (i) f(n) = 5000n
More informationLecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!
Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm
More informationMAT 3670: Lab 3 Bits, Data Types, and Operations
MAT 3670: Lab 3 Bits, Data Types, and Operations Background In previous labs, we have used Turing machines to manipulate bit strings. In this lab, we will continue to focus on bit strings, placing more
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationHadoop Streaming. Table of contents. Content-Type text/html; utf-8
Content-Type text/html; utf-8 Table of contents 1 Hadoop Streaming...3 2 How Does Streaming Work... 3 3 Package Files With Job Submissions...4 4 Streaming Options and Usage...4 4.1 Mapper-Only Jobs...
More informationML from Large Datasets
10-605 ML from Large Datasets 1 Announcements HW1b is going out today You should now be on autolab have a an account on stoat a locally-administered Hadoop cluster shortly receive a coupon for Amazon Web
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationApril Final Quiz COSC MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model.
1. MapReduce Programming a) Explain briefly the main ideas and components of the MapReduce programming model. MapReduce is a framework for processing big data which processes data in two phases, a Map
More informationComplex stories about Sqooping PostgreSQL data
Presentation slide for Sqoop User Meetup (Strata + Hadoop World NYC 2013) Complex stories about Sqooping PostgreSQL data 10/28/2013 NTT DATA Corporation Masatake Iwasaki Introduction 2 About Me Masatake
More informationAssignment 4: Hashtables
Assignment 4: Hashtables In this assignment we'll be revisiting the rhyming dictionary from assignment 2. But this time we'll be loading it into a hashtable and using the hashtable ADT to implement a bad
More information1/30/2019 Week 2- B Sangmi Lee Pallickara
Week 2-A-0 1/30/2019 Colorado State University, Spring 2019 Week 2-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING Term project deliverable
More information