Cloud Computing. Up until now


Ami Gibbs

Cloud Computing
Lecture 9: MapReduce Introduction

Up until now
- Definition of Cloud Computing
- Grid Computing
- Content Distribution Networks
- Cycle-Sharing
- Distributed Scheduling

Outline
- MapReduce: What is it?
- Concepts
- Examples

MapReduce: What is it?
- Functional programming + distributed processing platform.
- Typical challenges in parallel processing:
  - How do we assign tasks to the workers?
  - What if we have more tasks than workers?
  - What if the workers need to share partial results?
  - How do we aggregate partial results?
  - How do we know whether the workers have finished?
  - What if the workers fail?

Origins: Functional Programming
- An old idea from the 1950s.
- What is functional programming?
  - Computing as the composition/sequencing of a set of functions.
  - Its theoretical foundation is the lambda calculus.
- How does it differ from imperative programming?
  - The concepts of data and instructions are blended.
  - The flow of data is implicit.
  - Different execution flows are possible.
- For example, in Scheme:

    (define (foo x y)
      (sqrt (+ (* x x) (* y y))))
    (foo 3 4)                      ; => 5

    (define (bar f x) (f (f x)))
    (define (baz x) (* x x))
    (bar baz 2)                    ; => 16

So... What does this have to do with MapReduce?
- Scheme and Lisp are built around list processing.
- They use two basic concepts from functional programming:
  - Map: apply the same operation to all the elements of a list.
  - Fold: use an operator to combine all the elements of a list.

Map
- Map is a second-order function: it receives another function as a parameter.
- It works by:
  - Applying the parameter function to every element of a list.
  - Thereby generating a new list.

[Figure: the function f applied to each element of the list, producing a new list.]

Reduce
- Fold (Reduce) is also a second-order function.
- It works by:
  - Initializing an accumulator.
  - Applying the parameter function to the accumulator and the first element of the list.
  - Storing the result in the accumulator.
  - Repeating the operation for each remaining element of the list.
- The result is the final value of the accumulator.

[Figure: f applied to the accumulator and each list element in turn, from the initial value to the final value.]

Map/Fold Example
- Simple map example:

    (map (lambda (x) (* x x)) '(1 2 3 4 5))    ; => '(1 4 9 16 25)

- Simple fold examples:

    (fold + 0 '(1 2 3 4 5))                    ; => 15
    (fold * 1 '(1 2 3 4 5))                    ; => 120

- Sum of squares:

    (define (squares-sum v)
      (fold + 0 (map (lambda (x) (* x x)) v)))
    (squares-sum '(1 2 3 4 5))                 ; => 55
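The same map and fold steps can be sketched in plain Java. This is only an illustration, not part of the lecture: the class name MapFoldSketch and its helper methods are hypothetical, and the input list 1..5 is assumed because it yields the results 15, 120 and 55 quoted on the slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntBinaryOperator;
import java.util.stream.Collectors;

class MapFoldSketch {
    // map: apply x -> x * x to every element, producing a new list
    static List<Integer> squares(List<Integer> v) {
        return v.stream().map(x -> x * x).collect(Collectors.toList());
    }

    // fold: start from an initial accumulator and combine it with each element in turn
    static int fold(List<Integer> v, int initial, IntBinaryOperator op) {
        int acc = initial;
        for (int x : v) {
            acc = op.applyAsInt(acc, x);
        }
        return acc;
    }

    // sum of squares: fold with + over the mapped list
    static int squaresSum(List<Integer> v) {
        return fold(squares(v), 0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        List<Integer> v = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(squares(v));                    // [1, 4, 9, 16, 25]
        System.out.println(fold(v, 0, (a, b) -> a + b));   // 15
        System.out.println(fold(v, 1, (a, b) -> a * b));   // 120
        System.out.println(squaresSum(v));                 // 55
    }
}
```

The map step touches each element independently, which is what makes it easy to distribute; the fold step carries an accumulator and is inherently sequential unless the operator allows regrouping.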

MapReduce
- Map + fold over lists of <key, value> pairs.
- Map: operates on <key1, value1> pairs, resulting in lists of <key2, value2> pairs.
- Reduce: receives all <key2, value2> pairs for a specific key2 and generates <key3, value3> pairs.

[Figure: MapReduce data flow: input data, master, partitioned output.]
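To make the <key, value> flow concrete, here is a minimal single-process sketch in plain Java, using word counting as the job. It runs everything in memory and is not how a distributed runtime works; the class MiniMapReduce and its methods are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    // Map: <key1, value1> = <line number, line text>  ->  list of <key2, value2> = <word, 1>
    static List<Map.Entry<String, Integer>> map(int lineNo, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce: all values for one key2  ->  value3 (here the total count for the word)
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Grouping step: collect intermediate values per key, with keys kept sorted
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (Map.Entry<String, Integer> kv : map(i, lines.get(i))) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // One call to reduce per distinct intermediate key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));  // {a=2, b=2, c=1}
    }
}
```

In a real deployment the calls to map run on different workers and the grouping step moves data across the network; only the two user-supplied functions stay the same.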

Good and Bad Examples
- MapReduce is good for:
  - Log indexing.
  - Ordering large amounts of data.
  - Analysing images.
- MapReduce is bad for:
  - Calculating digits of π.
  - Calculating sequences of Fibonacci numbers.
  - Replacing a relational database.

Real Examples
- Implementing scalable learning algorithms.
- Graph algorithms, e.g. travelling salesman.
- Gathering and analysing medical information.
- Detecting face similarities in large sets of images.
- Web crawling.

Hadoop: Map/Reduce
- Hadoop: a FLOSS Apache project that reimplements several of Google's cloud components, for example MapReduce.
- Example: the "Hello World" of distributed processing, word counting.

Wordcount: Map

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class Map extends MapReduceBase implements Mapper {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits <word, 1> for every token of the input line.
        public void map(WritableComparable key, Writable values,
                        OutputCollector output, Reporter reporter) throws IOException {
            String line = values.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

Wordcount: Reduce

    public class Reduce extends MapReduceBase implements Reducer {
        // Sums all the counts received for one word and emits <word, sum>.
        public void reduce(WritableComparable _key, Iterator _values,
                           OutputCollector output, Reporter reporter) throws IOException {
            Iterator<IntWritable> values = (Iterator<IntWritable>) _values;
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(_key, new IntWritable(sum));
        }
    }

Datatypes
- Writable
  - Defines a serialization protocol.
  - All datatypes are Writable.
- WritableComparable
  - Defines an ordering criterion.
  - All keys must be of this type; values need not be.
- IntWritable, LongWritable, Text
  - Concrete classes for the classic datatypes.

Basic Datatypes
- IntWritable
- DoubleWritable
- FloatWritable
- BooleanWritable
- ArrayWritable
- BytesWritable
- MapWritable
- VLongWritable
- VIntWritable

Complex Datatypes
- The easy way:
  - Encode them as text, e.g. (a, b) = a:b.
  - Use regular expressions to parse and extract the data.
  - It works, but it is bad software engineering.
- The not so easy way:
  - Define an implementation of WritableComparable.
  - You must implement readFields, write and compareTo.
  - Computationally efficient.
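The "easy way" can be sketched as follows. This is a hedged illustration in plain Java, without any Hadoop classes: the class TextPairSketch, the choice of two integer fields and the exact regular expression are all assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextPairSketch {
    // Hypothetical format: two integers separated by a colon, e.g. "3:1700000000"
    private static final Pattern PAIR = Pattern.compile("(-?\\d+):(-?\\d+)");

    // Encode the pair (a, b) as the text "a:b"
    static String encode(long a, long b) {
        return a + ":" + b;
    }

    // Parse "a:b" back into its two components, rejecting anything else
    static long[] decode(String text) {
        Matcher m = PAIR.matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a pair: " + text);
        }
        return new long[] { Long.parseLong(m.group(1)), Long.parseLong(m.group(2)) };
    }

    public static void main(String[] args) {
        String text = encode(3, 1700000000L);
        long[] pair = decode(text);
        System.out.println(text + " -> " + pair[0] + ", " + pair[1]);
    }
}
```

The weakness is visible even in this small case: every record is formatted and re-parsed as text, and the scheme breaks as soon as a field can itself contain the separator.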

Writable

    public class MyWritable implements Writable {
        private int counter;
        private long timestamp;

        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        public static MyWritable read(DataInput in) throws IOException {
            MyWritable w = new MyWritable();
            w.readFields(in);
            return w;
        }
    }

WritableComparable

    public class MyWritableComparable
            implements WritableComparable<MyWritableComparable> {
        private int counter;
        private long timestamp;

        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // Orders instances by their counter field.
        public int compareTo(MyWritableComparable w) {
            int thisValue = this.counter;
            int thatValue = w.counter;
            return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
        }
    }

Wordcount: Main

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(pt.utl.ist.cn.Wordcount.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setInputPath(new Path("src"));
        conf.setOutputPath(new Path("out"));
        conf.setMapperClass(pt.utl.ist.cn.Map.class);
        conf.setReducerClass(pt.utl.ist.cn.Reduce.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

Master (i)
- Execution is controlled by the master process:
  - Input data is split into 64 MB blocks.
  - Tasks are assigned to the worker processes dynamically.
  - In the reduce phase there is one reduce task per partition of the map output (but only 1 reducer process by default).
  - A typical Google run: Mappers = 200,000; Reducers = 4,000; Workers = 2,000.
- The master assigns each map task (1 split) to one worker:
  - The worker reads its input, ideally from the local disk,
  - and produces R local files with <key, value> pairs.

Master (ii)
- The master assigns each reduce task to a reducer:
  - The reducer reads the intermediate files from the mappers.
  - The reducer orders the <key, value> pairs and applies the reduce function.
- The user may specify partitions to control which values each reducer gets.

Fault Tolerance
- Worker faults:
  - Faults are detected by periodic pings.
  - Completed or ongoing map tasks on a failed worker are redone.
  - Ongoing reduce tasks that fail are redone.
  - Finished tasks are reported to the master.
- Master failure:
  - The master state is checkpointed in a distributed file system.
  - When a failed master is restarted, it reads the saved state and continues from that point on.
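How a partition decides which reducer receives a key can be sketched with a hash-modulo rule, a common default choice. This is a plain-Java illustration under that assumption; the class PartitionSketch and its methods are hypothetical and do not use any Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionSketch {
    // Map a key to one of numReducers partitions.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Distribute keys into one bucket per reducer; equal keys always share a bucket.
    static List<List<String>> assign(List<String> keys, int numReducers) {
        List<List<String>> buckets = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partition(key, numReducers)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "pear", "apple", "fig");
        System.out.println(assign(keys, 3));
    }
}
```

Because the rule depends only on the key, every mapper sends a given key to the same reducer without any coordination. A user-defined partition replaces the hash with a rule of its own, for example one that keeps related keys together.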

Splits
- Input data is provided to the mappers in splits.
- A split is a block of input data.
- HDFS blocks are 64 MB by default.
- In particular, Hadoop tries to create splits containing data local to only one node.
- A mapper then runs on the node where the split was created.

Input Formats
- Controlled by conf.setInputFormat.
- FileInputFormat (default option): 64 MB blocks (for files larger than 64 MB) per split (and per map).
- CombineFileInputFormat: groups files on the same node into the same split.
- KeyValueTextInputFormat
- NLineInputFormat
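The arithmetic behind block-sized splits can be sketched as below. This is a simplification written for illustration: the class SplitSketch is hypothetical, and a real input format applies further rules (record boundaries, locality, tolerance for a small last block) that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    static final long MB = 1024L * 1024L;

    // Cut a file of fileSize bytes into consecutive splits of at most blockSize bytes.
    // Each split is returned as {offset, length}.
    static List<long[]> splits(long fileSize, long blockSize) {
        List<long[]> out = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            out.add(new long[] { offset, Math.min(blockSize, fileSize - offset) });
        }
        return out;
    }

    public static void main(String[] args) {
        // A 200 MB file with 64 MB blocks gives 4 splits: 64 + 64 + 64 + 8 MB
        for (long[] s : splits(200 * MB, 64 * MB)) {
            System.out.println("offset " + s[0] / MB + " MB, length " + s[1] / MB + " MB");
        }
    }
}
```

Since one map task is started per split, this count is also the number of mappers for the file: a 200 MB input yields four map tasks, the last one processing only 8 MB.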

Output Formats
- Controlled by conf.setOutputFormat.
- TextOutputFormat: one file per reducer (default option).
- SequenceFileOutputFormat: binary output that can be compressed. Normally used to feed other MapReduce cycles.
- MultipleOutputFormat: allows control over the number and names of output files.
- It is also possible to output to databases.

The Original Challenges
- Scheduling: assigns map and reduce tasks to the workers.
- Distribution: workers are moved to the data. (Will be important later on.)
- Synchronization: gathers, orders and distributes intermediate results.
- Fault tolerance: detection and restart of failed tasks.
- All of it relies on a distributed file system (next lecture).

Next time
- MapReduce: systems perspective.
