GC3: Grid Computing Competence Center Introduction to Python programming, II (with a hint of MapReduce) Riccardo Murri Grid Computing Competence Center, University of Zurich Oct. 10, 2012
Today's class Explain more Python constructs and semantics by looking at John Arley Burns' "MapReduce in 98 lines of Python". These slides are available for download from: http://www.gc3.uzh.ch/teaching/lsci2012/lecture03.pdf
References
See the course website for an extensive and commented list.
  Dean, J., and Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
  Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce
  Carter, J.: Simple MapReduce with Ruby and Rinda
What is MapReduce? MapReduce is: 1. a programming model 2. an associated implementation Both are important!!
MapReduce The Map function processes a key/value pair to produce intermediate key/value pairs. Image source: Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce
MapReduce The Reduce function merges all intermediate values associated with a given key. Image source: Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce
MapReduce: advantages of the model Programs written in this style are automatically parallelized and executed on a large cluster of machines... Quoted from: Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
Example: word count Input is a text file, to be split at line boundaries. Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count The Map function scans an input line and outputs a pair (word, 1) for each word in the text line. Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count The pairs are shuffled and sorted so that each reducer gets all pairs (word, 1) with the same word part. Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count The Reduce function gets all pairs (word, 1) with the same word part, and outputs a single pair (word, count) where count is the number of input items received. Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
Example: word count The global output is a list of pairs (word, count) where count is the number of occurrences of word in the input text. Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
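The steps of the word count example can be sketched in plain, sequential Python. This only illustrates the data flow; it is not a MapReduce run-time, and nothing here is parallel. The function names (map_fn, shuffle, reduce_fn, word_count) are made up for this sketch, and the code assumes Python 3.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a pair (word, 1) for each word in the input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Shuffle: group the intermediate values by key (the word)."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups


def reduce_fn(word, ones):
    """Reduce: merge all values for one key into a single (word, count)."""
    return (word, len(ones))


def word_count(text):
    # Split the input at line boundaries, then run the phases in order.
    pairs = []
    for line in text.splitlines():
        pairs.extend(map_fn(line))
    groups = shuffle(pairs)
    return sorted(reduce_fn(word, ones) for word, ones in groups.items())


counts = word_count("the quick brown fox\nthe lazy dog\nthe end")
```

In a real implementation each call to map_fn and reduce_fn could run on a different machine; only the shuffle step needs to see pairs from all mappers.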
MapReduce: features of the implementation The run-time system takes care of the details: partitioning the input data, scheduling the program execution, handling machine failures, managing the required inter-machine communication. Quoted from: Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters
MapReduce: features of the implementation Partitioning the input data, scheduling the program execution, handling machine failures, and managing the required inter-machine communication are all highly nontrivial tasks to handle! The quality of a MapReduce implementation should be judged by how effective it is at handling the non-map/reduce part.
Back to Python! mapreduce.py by John Arley Burns is a simple Python class that simulates running a MapReduce algorithm using in-memory data structures. A MapReduce algorithm is specified by subclassing the MapReduce class and overriding methods to provide the Split, Map, and Reduce functions. (There's no Partition/Shuffle function because all the data is kept in memory and sorted there, so no locality issues.)
The word count example using mapreduce.py

    import operator  # needed by output_fn
    import re
    from mapreduce import MapReduce

    class WordCount(MapReduce):

        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data

        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [line_to_tuple(line) for line in data.splitlines()]
            return data_list

        def map_fn(self, key, value):
            for word in re.split(r'\W+', value.lower()):
                bareword = re.sub(r"[^a-zA-Z0-9]*", r"", word)
                if len(bareword) > 0:
                    yield (bareword, 1)

        def reduce_fn(self, word, count_list):
            return [(word, sum(count_list))]

        def output_fn(self, output_list):
            sorted_list = sorted(output_list, key=operator.itemgetter(1))
            for word, count in sorted_list:
                print(word, count)
Importing modules

    import re
    from mapreduce import MapReduce

    class WordCount(MapReduce):
        # ...
        def map_fn(self, key, value):
            for word in re.split(...):
                # ...
                bareword = re.sub(...)
                if len(bareword) > 0:
                    yield (bareword, 1)

The statement import re imports the re (regular expressions) module. All names defined in that module are now visible under the re namespace, e.g., re.sub, re.split.
Importing names

    import re
    from mapreduce import MapReduce

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The from ... import statement imports the MapReduce name, defined in the mapreduce module, into this module's namespace. So you need not use a prefix to qualify it.
Defining objects

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The class keyword starts the definition of a class (in the OOP sense). The class definition is indented.
Inheritance

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The parenthesized name tells Python that the WordCount class inherits from the MapReduce class. Every class must inherit from some other class; the root of the class hierarchy is the built-in object class.
Declaring methods

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

A method declaration looks exactly like a function definition. Every method must have at least one argument, which by convention is named self. (Why the double underscore? More on this later!)
The self argument

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

self is a reference to the object instance (like, e.g., this in Java). It is used to access attributes and invoke methods of the instance itself.
The self argument

Every method of a Python object always has self as first argument. However, you do not specify it when calling a method: it's automatically inserted by Python:

    >>> class ShowSelf(object):
    ...     def show(self):
    ...         print(self)
    ...
    >>> x = ShowSelf()  # construct instance
    >>> x.show()        # self automatically inserted!
    <__main__.ShowSelf object at 0x299e150>

The self variable is a reference to the object instance itself. You need to use self when accessing methods or attributes of this instance.
The self argument

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

Q: (1) Why is the data identifier qualified with the self. namespace?
The self argument

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

Q: (2) Why do we explicitly write self in the call MapReduce.__init__(self)?
Name resolution rules

Within a function/method body, names are resolved according to the LEGB rule:
  L  Local scope: any names defined in the current function;
  E  Enclosing function scope: names defined in enclosing functions (outermost last);
  G  Global scope: names defined in the toplevel of the current module;
  B  Built-in names (i.e., Python's __builtins__ module).

Any name that is not in one of the above scopes must be qualified. So you have to write self.data to access an attribute of this instance, re.sub to mean a function defined in module re, MapReduce.__init__ to reference a method defined in the MapReduce class, etc.
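A small sketch of the LEGB lookup order, assuming Python 3. All function names here are invented for the illustration; each function finds x (or len) in a different scope.

```python
x = "global"           # G: defined at the top level of the module


def outer():
    x = "enclosing"    # E: defined in the enclosing function

    def inner():
        x = "local"    # L: defined in the current function
        return x

    def inner_no_local():
        return x       # no local x here: found in the enclosing scope

    return inner(), inner_no_local()


def uses_global():
    return x           # neither local nor enclosing: found in the module


def uses_builtin():
    return len("abc")  # len is not defined in this file: found in built-ins


results = (outer(), uses_global(), uses_builtin())
```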
Object attributes

A Python object is (in particular) a key-value mapping: attributes (keys) are valid identifiers, values can be any Python object. Any object has attributes, which you can access (create, read, overwrite) using the dot notation:

    # create or overwrite the name attribute of w
    w.name = "Joe"
    # get the value of w.name and print it
    print(w.name)

So, in the constructor you create the required instance attributes using self.var = ...

Note: methods are attributes, too!
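The dot notation can be tried out on a throwaway class. This is a minimal sketch assuming Python 3; the Worker class and its greet method are hypothetical. The built-in functions setattr, getattr and hasattr do the same job as the dot notation when the attribute name is only known as a string.

```python
class Worker(object):
    def greet(self):
        return "Hello, " + self.name


w = Worker()
w.name = "Joe"                   # create the attribute
w.name = "Jane"                  # overwrite it
greeting = w.greet()             # read it from inside a method

# The same operations, spelled with built-in functions.
setattr(w, "age", 42)
age = getattr(w, "age")
has_email = hasattr(w, "email")  # this attribute was never created

# Methods are attributes too: look one up now, call it later.
method = w.greet
same_greeting = method()
```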
No access control

There are no public / private / etc. qualifiers for object attributes. Any code can create/read/overwrite/delete any attribute on any object. There are conventions, though:
  protected attributes: _name
  private attributes: __name
(But again, note that this is not enforced by the system in any way.)
Class attributes, I

Classes are Python objects too, hence they can have attributes. Class attributes can be created with the variable assignment syntax in a class definition block:

    class A(object):
        class_attr = value
        def __init__(self):
            # ...

Class attributes are shared among all instances of the same class!
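A minimal sketch of the sharing behaviour, assuming Python 3; the Counter class is invented for the example. Note the difference between updating the attribute through the class and assigning it through an instance.

```python
class Counter(object):
    instances = 0                # class attribute: one value for the class

    def __init__(self):
        Counter.instances += 1   # update the shared value via the class


a = Counter()
b = Counter()
shared = (a.instances, b.instances, Counter.instances)

# Assigning through an instance creates a new instance attribute that
# shadows the class attribute; the value stored on the class is unchanged.
a.instances = 99
after = (a.instances, b.instances, Counter.instances)
```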
Class attributes, II Methods are class attributes, too. However, looking up a method attribute on an instance returns a bound method, i.e., one for which self is automatically inserted. Looking up the same method on a class returns an unbound method, which is just like a regular function, i.e., you must pass self explicitly. (In Python 3 the class lookup simply returns the plain function.)
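The two kinds of lookup can be compared directly. This sketch assumes Python 3, where the lookup on the class yields a plain function rather than an unbound method object; the way it is called is the same either way. The describe method is made up for the example.

```python
class ShowSelf(object):
    def describe(self):
        return "I am " + type(self).__name__


x = ShowSelf()

bound = x.describe                # bound method: self is already filled in
via_instance = bound()            # no argument needed

via_class = ShowSelf.describe(x)  # looked up on the class: pass self yourself

remembers_instance = bound.__self__ is x
```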
Constructors, I

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The __init__ method is the instance constructor. It should never return any value (other than None).
Constructors, II

The __init__ method is the instance constructor. It should never return any value (other than None). However, you call a constructor by class name:

    # make wc an instance of WordCount
    wc = WordCount("some text")

(Again, note that the self part is automatically inserted by Python.)
No overloading Python does not allow overloading of functions. Any function. Hence, no overloading of constructors. So: a class can have one and only one constructor.
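A short sketch of what "no overloading" means in practice, assuming Python 3. A second def with the same name simply replaces the first one. Default argument values are one common way to get several call signatures out of a single constructor; this workaround is an addition of this sketch and is not taken from the slides, and the area function and Rect class are hypothetical.

```python
def area(width):
    return width * width


def area(width, height):         # replaces the definition above
    return width * height


try:
    area(3)                      # the one-argument version is gone
    one_arg_call_works = True
except TypeError:
    one_arg_call_works = False


class Rect(object):
    # A single constructor; the default value covers the "square" case.
    def __init__(self, width, height=None):
        self.width = width
        self.height = width if height is None else height


square = Rect(3)
rect = Rect(3, 4)
```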
Constructor chaining

When a class is instantiated, Python only calls the first constructor it can find in the class inheritance call-chain. If you need to call a superclass constructor, you need to do it explicitly:

    class WordCount(MapReduce):
        def __init__(self, ...):
            # do WordCount-specific stuff here
            MapReduce.__init__(self, ...)
            # some more WordCount-specific stuff

Calling a superclass constructor is optional, and it can happen anywhere in the __init__ method body.
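The three possible situations can be put side by side. This is a sketch assuming Python 3, with hypothetical class names: a subclass that does not chain, one that chains explicitly, and one that defines no constructor at all.

```python
class Base(object):
    def __init__(self):
        self.log = ["Base.__init__"]


class Silent(Base):
    def __init__(self):
        # Base.__init__ is NOT called, so self.log is never created.
        self.name = "silent"


class Chained(Base):
    def __init__(self):
        Base.__init__(self)               # explicit superclass call
        self.log.append("Chained.__init__")


class Inherited(Base):
    pass  # no __init__ here: the first one found is Base.__init__


silent_has_log = hasattr(Silent(), "log")
chained_log = Chained().log
inherited_log = Inherited().log
```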
Multiple-inheritance

Python allows multiple inheritance. Just list all the parent classes:

    class C(A, B):
        # class definition

With multiple inheritance, it is your responsibility to call all the needed superclass constructors. Python uses the C3 algorithm to determine the call precedence in an inheritance chain. You can always query a class for its method resolution order, via the __mro__ attribute:

    >>> C.__mro__
    (<class 'ex.C'>, <class 'ex.A'>, <class 'ex.B'>, <type 'object'>)
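A runnable version of the C(A, B) example, assuming Python 3. The method who and the attributes from_a and from_b are invented here to make the effect of the method resolution order visible.

```python
class A(object):
    def __init__(self):
        self.from_a = True

    def who(self):
        return "A"


class B(object):
    def __init__(self):
        self.from_b = True

    def who(self):
        return "B"


class C(A, B):
    def __init__(self):
        # Our responsibility: call every superclass constructor we need.
        A.__init__(self)
        B.__init__(self)


c = C()
mro_names = [cls.__name__ for cls in C.__mro__]
winner = c.who()   # resolved along the MRO: A comes before B
```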
Nested functions

    import re

    class WordCount(MapReduce):
        # ...
        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [line_to_tuple(line) for line in data.splitlines()]
            return data_list
        # ...

You can define functions (and classes) within functions. The nested functions are only visible within the enclosing function. (But they can capture any variable from the enclosing function environment by name.)
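Both properties (limited visibility and capture of enclosing variables) can be checked with a standalone variant of split_fn. This is a sketch assuming Python 3; the tag parameter is an invention of this example, added only so that the nested function has something to capture.

```python
def split_lines(text, tag=None):
    def line_to_tuple(line):
        # 'tag' is not local to line_to_tuple: it is captured from
        # the enclosing function's environment.
        return (tag, line)

    return [line_to_tuple(line) for line in text.splitlines()]


untagged = split_lines("a\nb")
tagged = split_lines("a\nb", tag="doc1")

# The nested function is not visible outside split_lines.
try:
    line_to_tuple
    visible_outside = True
except NameError:
    visible_outside = False
```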
List comprehensions, I

    class WordCount(MapReduce):
        # ...
        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [line_to_tuple(line) for line in data.splitlines()]
            return data_list
        # ...

Q: What is this?
An easy exercise A dotfile is a file whose name starts with a dot character ('.'). How can you list the full pathname of all dotfiles in a given directory? (The Python library call for listing the entries in a directory is os.listdir(), which returns a list of file names.)
A very basic solution

Use a for loop to accumulate the results into a list:

    import os

    dotfiles = []
    for entry in os.listdir(path):
        if entry.startswith('.'):
            dotfiles.append(os.path.join(path, entry))
List comprehensions, II

Python has a better and more compact syntax for filtering elements of a list and/or applying a function to them:

    dotfiles = [os.path.join(path, entry)
                for entry in os.listdir(path)
                if entry.startswith('.')]

This is called a list comprehension.
List comprehensions, III

The general syntax of a list comprehension is:

    [ expr for var in iterable if condition ]

where:
  expr is any Python expression;
  iterable is a (generalized) sequence;
  condition is a boolean expression, depending on var;
  var is a variable that will be bound in turn to each item in iterable which satisfies condition.

The if condition part is optional.
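To see that the comprehension and the for loop compute the same list, the dotfile exercise can be run on a fixed list of names instead of a real directory. This is a sketch assuming Python 3; the entries list and the path value are made-up sample data.

```python
import os.path

entries = [".bashrc", "notes.txt", ".vimrc", "todo"]  # sample data
path = "/home/user"                                   # hypothetical directory

# Loop version: filter, transform, accumulate.
loop_result = []
for entry in entries:
    if entry.startswith("."):
        loop_result.append(os.path.join(path, entry))

# Comprehension version: same expr, same iterable, same condition.
comp_result = [os.path.join(path, entry)
               for entry in entries
               if entry.startswith(".")]

# Without the optional "if" part, every item is kept.
lengths = [len(entry) for entry in entries]
```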
Generator expressions

List comprehensions are a special case of generator expressions:

    ( expr for var in iterable if condition )

A generator expression is a valid iterable and can be used to initialize tuples, sets, dicts, etc.:

    # the set of square numbers < 100
    squares = set(n*n for n in range(10))

Generator expressions are valid expressions, so they can be nested:

    # cartesian product of sets A and B
    C = set((a, b) for a in A for b in B)
Generators

Generator expressions are a special case of generators. A generator is like a function, except it uses yield instead of return:

    def squares():
        n = 0
        while True:
            yield n*n
            n += 1

At each iteration, execution resumes with the statement logically following yield in the generator's execution flow. There can be multiple yield statements in a generator.

Reference: http://wiki.python.org/moin/generators
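Since squares() never terminates, it has to be consumed lazily. The sketch below, assuming Python 3, uses itertools.islice and the built-in next() to pull a few values; the ends generator is a hypothetical example of a generator with more than one yield statement.

```python
from itertools import islice


def squares():
    n = 0
    while True:
        yield n * n
        n += 1


# Take only the first five values of the infinite sequence.
first_five = list(islice(squares(), 5))

# Or step through it by hand: each next() resumes after the last yield.
gen = squares()
a = next(gen)
b = next(gen)
c = next(gen)


def ends(items):
    yield items[0]    # execution pauses here first...
    yield items[-1]   # ...then resumes and pauses here


first_and_last = list(ends("abc"))
```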
Generators in action

    class WordCount(MapReduce):
        # ...
        def map_fn(self, key, value):
            for word in re.split(r'\W+', value.lower()):
                bareword = re.sub(r"[^a-zA-Z0-9]*", r"", word)
                if len(bareword) > 0:
                    yield (bareword, 1)
        # ...

The yield statement makes map_fn into a generator that returns pairs (word, 1).
The Iterator Protocol An object can function as an iterator iff it implements a next() method (spelled __next__() in Python 3), that: either returns the next value in the iteration, or raises StopIteration to signal the end of the iteration. An object can be iterated over with for if it implements an __iter__() method. Reference: http://www.python.org/dev/peps/pep-0234/
Iterate over the words in the given text: split the text at white spaces, and return the parts one by one.

    class WordIterator(object):

        def __init__(self, text):
            self._words = text.split()

        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration

        def __iter__(self):
            return self

Source code available at: http://www.gc3.uzh.ch/teaching/lsci2011/lecture08/worditerator.py
    class WordIterator(object):

        def __init__(self, text):
            self._words = text.split()

        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration

        def __iter__(self):
            return self

Every class must inherit from a parent class. If there's no other class, inherit from the object class. (Root of the class hierarchy.)
Using iterators

Iterators can be used in a for loop:

    >>> for word in WordIterator("a nice sunny day"):
    ...     print '*'+word+'*',
    ...
    *a* *nice* *sunny* *day*

They can be composed with other iterators for effect:

    >>> for n, word in enumerate(WordIterator("a...")):
    ...     print str(n)+':'+word,
    ...
    0:a 1:nice 2:sunny 3:day

See also: http://docs.python.org/library/itertools.html
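The sessions above use Python 2 syntax. For readers on Python 3, here is a sketch of the same iterator with the method spelled __next__; the alias next = __next__ is an addition of this sketch that keeps the class usable from Python 2 as well. Results are collected into lists instead of being printed.

```python
class WordIterator(object):

    def __init__(self, text):
        self._words = text.split()

    def __iter__(self):
        return self

    def __next__(self):
        if len(self._words) > 0:
            return self._words.pop(0)
        raise StopIteration

    next = __next__  # Python 2 looks for this name instead


starred = ["*" + word + "*" for word in WordIterator("a nice sunny day")]

numbered = [str(n) + ":" + word
            for n, word in enumerate(WordIterator("a nice sunny day"))]
```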
    class WordIterator(object):

        def __init__(self, text):
            self._words = text.split()

        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration

        def __iter__(self):
            return self

Q: What is this?
Exceptions

Exceptions are objects that inherit from the built-in Exception class. To create a new exception just make a new class:

    class NewKindOfError(Exception):
        """
        Do use the docstring to document
        what this error is about.
        """
        pass

Exceptions are handled by class name, so they usually do not need any new methods (although you are free to define some if needed).

See also: http://docs.python.org/library/exceptions.html
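Because exceptions are handled by class, a handler for a parent class also catches all its subclasses. The sketch below assumes Python 3; JobError, JobNotFound and the lookup function are hypothetical names chosen to match the job table theme of the later examples.

```python
class JobError(Exception):
    """Base class for all errors of this (hypothetical) job table."""


class JobNotFound(JobError):
    """Raised when a job ID is not in the table."""


def lookup(table, jobid):
    if jobid not in table:
        raise JobNotFound("no such job: %s" % jobid)
    return table[jobid]


found = lookup({"1": "RUNNING"}, "1")

try:
    lookup({"1": "RUNNING"}, "2")
except JobError as ex:           # the parent class matches JobNotFound too
    caught = type(ex).__name__
    message = str(ex)
```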
    try:
        # code that might raise an exception
    except SomeException:
        # handle some exception
    except AnotherException as ex:
        # the actual Exception instance
        # is available as variable ex
    else:
        # performed on normal exit from try
    finally:
        # performed on exit in any case

The optional else clause is executed if and when control flows off the end of the try clause. The optional finally clause is executed on exit from the try or except block in any case.

Reference: http://docs.python.org/reference/compound_stmts.html#try
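Which clauses run in which case can be traced with a small function. This is a sketch assuming Python 3; the parse function and its steps list are made up for the purpose.

```python
def parse(text):
    steps = []
    try:
        value = int(text)        # may raise ValueError
    except ValueError:
        steps.append("except")
        value = None
    else:
        steps.append("else")     # only when the try body finished normally
    finally:
        steps.append("finally")  # runs in both cases
    return value, steps


ok = parse("42")
bad = parse("forty-two")
```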
Raising exceptions

Use the raise statement with an Exception instance:

    if an_error_occurred:
        raise AnError("Spider sense is tingling.")

Within an except clause, you can use raise with no arguments to re-raise the current exception:

    try:
        something()
    except ItDidntWork:
        do_cleanup()
        # re-raise exception to caller
        raise
Exception handling example

Read lines from a CSV file, ignoring those that do not have the required number of fields. If other errors occur, abort. Close the file when done.

    job_state = {}  # empty dict
    # open the file before `try`: if open() itself fails,
    # there is no file for the `finally` clause to close
    csv_file = open('jobs.csv', 'r')
    try:
        for line in csv_file:
            line = line.strip()  # remove trailing newline
            try:
                name, jobid, state = line.split(",")
            except ValueError:
                continue  # ignore line
            job_state[jobid] = state
    except IOError:
        raise  # up to caller
    finally:
        csv_file.close()
A common case

The cleanup pattern is so common that Python has a special statement to deal with it:

    with open('jobs.csv', 'r') as csv_file:
        for line in csv_file:
            line = line.strip()  # remove trailing newline
            try:
                name, jobid, state = line.split(",")
            except ValueError:
                continue  # ignore line
            job_state[jobid] = state

The with statement ensures that the file is closed upon exit from the with block (for whatever reason).

Reference: http://docs.python.org/reference/compound_stmts.html#with
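The same loop can be tried without a jobs.csv file on disk by reading from an in-memory text stream. This is a sketch assuming Python 3; io.StringIO stands in for the real file and the CSV content is made-up sample data. Like file objects, StringIO objects support the with statement and are closed on exit.

```python
import io

csv_text = "alice,1,RUNNING\nthis line is malformed\nbob,2,DONE\n"

job_state = {}
with io.StringIO(csv_text) as csv_file:
    for line in csv_file:
        line = line.strip()      # remove trailing newline
        try:
            name, jobid, state = line.split(",")
        except ValueError:
            continue             # wrong number of fields: ignore line
        job_state[jobid] = state

closed_after = csv_file.closed   # the with block has closed the stream
```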
The context manager protocol

Any object can be used in a with statement, provided it defines the following two methods:

  __enter__()
      Called upon entrance of the with block; its return value is assigned to the variable following as (if any).

  __exit__(exc_cls, exc_val, exc_tb)
      Called with three arguments upon exit from the block. If an exception occurred, the three arguments are the exception type, value and traceback; otherwise, the three arguments are all set to None.

Q: Can you think of other examples where this could be useful?

See also: http://www.python.org/dev/peps/pep-0343/
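A minimal context manager that only records what happens to it, as a sketch assuming Python 3. The Recorder class is hypothetical; it returns False from __exit__ so that any exception raised inside the block keeps propagating to the caller.

```python
class Recorder(object):

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self              # this is what "as" binds to

    def __exit__(self, exc_cls, exc_val, exc_tb):
        if exc_cls is None:
            self.events.append("exit: ok")
        else:
            self.events.append("exit: " + exc_cls.__name__)
        return False             # do not suppress the exception


with Recorder() as ok:
    pass                         # normal exit: all three arguments are None

try:
    with Recorder() as failed:
        raise ValueError("boom")
except ValueError:
    pass                         # __exit__ has already run at this point
```

The same shape fits any acquire/release pair, for example taking and releasing a lock, or changing into a directory and changing back.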