Introduction to Python programming, II


GC3: Grid Computing Competence Center

Introduction to Python programming, II (with a hint of MapReduce)

Riccardo Murri
Grid Computing Competence Center, University of Zurich
Oct. 10, 2012

Today's class

Explain more Python constructs and semantics by looking at John Arley Burns' "MapReduce in 98 lines of Python".

These slides are available for download from: http://www.gc3.uzh.ch/teaching/lsci2012/lecture03.pdf

References

See the course website for an extensive and commented list.

Dean, J., and Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce
Carter, J.: Simple MapReduce with Ruby and Rinda

What is MapReduce?

MapReduce is:
1. a programming model
2. an associated implementation

Both are important!

MapReduce

The Map function processes a key/value pair to produce intermediate key/value pairs.

The Reduce function merges all intermediate values associated with a given key.

Image source: Greiner, J., Wong, S.: Distributed Parallel Processing with MapReduce

MapReduce: advantages of the model

"Programs written in this style are automatically parallelized and executed on a large cluster of machines..."

Quoted from: Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters

Example: word count

Input is a text file, to be split at line boundaries.

The Map function scans an input line and outputs a pair (word, 1) for each word in the text line.

The pairs are shuffled and sorted so that each reducer gets all pairs (word, 1) with the same word part.

The Reduce function gets all pairs (word, 1) with the same word part, and outputs a single pair (word, count) where count is the number of input items received.

The global output is a list of pairs (word, count) where count is the number of occurrences of word in the input text.

Image source: http://blog.jteam.nl/2009/08/04/introduction-to-hadoop/
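The word count pipeline can be sketched in a few lines of plain Python. This is only an illustration of the model, not the mapreduce.py module discussed later; the function names map_fn, shuffle, reduce_fn and word_count are chosen here for the example, and everything runs in a single process.

```python
import re
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in re.findall(r"[a-z0-9]+", line.lower()):
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group the intermediate values by key (the word)."""
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    return groups


def reduce_fn(word, ones):
    """Reduce: collapse all values for one key into a single count."""
    return (word, sum(ones))


def word_count(text):
    # Split at line boundaries, map each line, then shuffle and reduce.
    pairs = [pair for line in text.splitlines() for pair in map_fn(line)]
    return dict(reduce_fn(w, ones) for w, ones in shuffle(pairs).items())


counts = word_count("the quick fox\nthe lazy dog\nthe end")
```

Here counts maps each distinct word to the number of times it occurs in the input.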

MapReduce: features of the implementation

"The run-time system takes care of the details: partitioning the input data, scheduling the program execution, handling machine failures, managing the required inter-machine communication."

Quoted from: Dean and Ghemawat: MapReduce: Simplified Data Processing on Large Clusters

These are all highly nontrivial tasks to handle! The quality of a MapReduce implementation should be judged by how effective it is at handling the non-map/reduce part.

Back to Python!

mapreduce.py by John Arley Burns is a simple Python class that simulates running a MapReduce algorithm using in-memory data structures.

A MapReduce algorithm is specified by subclassing the MapReduce class and overriding methods to provide the Split, Map, and Reduce functions. (There's no Partition/Shuffle function because all the data is kept in memory and sorted there, so there are no locality issues.)

The word count example using mapreduce.py

    import operator
    import re

    from mapreduce import MapReduce

    class WordCount(MapReduce):

        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data

        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [ line_to_tuple(line) for line in data.splitlines() ]
            return data_list

        def map_fn(self, key, value):
            for word in re.split(r'\W+', value.lower()):
                bareword = re.sub(r"[^a-zA-Z0-9]*", r"", word)
                if len(bareword) > 0:
                    yield (bareword, 1)

        def reduce_fn(self, word, count_list):
            return [(word, sum(count_list))]

        def output_fn(self, output_list):
            # sort the (word, count) pairs by count
            sorted_list = sorted(output_list, key=operator.itemgetter(1))
            for word, count in sorted_list:
                print(word, count)

Importing modules

    import re
    from mapreduce import MapReduce

    class WordCount(MapReduce):
        # ...
        def map_fn(self, key, value):
            for word in re.split(...):
                # ...
                bareword = re.sub(...)
                if len(bareword) > 0:
                    yield (bareword, 1)

The statement import re imports the re (regular expressions) module. All names defined in that module are now visible under the re namespace, e.g., re.sub, re.split.

Importing names

    import re
    from mapreduce import MapReduce

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The statement from mapreduce import MapReduce imports the MapReduce name, defined in the mapreduce module, into this module's namespace. So you need not use a prefix to qualify it.

Defining objects

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The class keyword starts the definition of a class (in the OOP sense). The class definition is indented.

Inheritance

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The (MapReduce) part tells Python that the WordCount class inherits from the MapReduce class. Every class must inherit from some other class; the root of the class hierarchy is the built-in object class.

Declaring methods

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

A method declaration looks exactly like a function definition. Every method must have at least one argument, named self. (Why the double underscore? More on this later!)

The self argument

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

self is a reference to the object instance (like, e.g., this in Java). It is used to access attributes and invoke methods of the instance itself.

The self argument

Every method of a Python object always has self as first argument. However, you do not specify it when calling a method: it's automatically inserted by Python:

    >>> class ShowSelf(object):
    ...     def show(self):
    ...         print(self)
    ...
    >>> x = ShowSelf()   # construct instance
    >>> x.show()         # self automatically inserted!
    <__main__.ShowSelf object at 0x299e150>

The self variable is a reference to the object instance itself. You need to use self when accessing methods or attributes of this instance.

The self argument

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

Q: (1) Why is the data identifier qualified with the self. namespace?

The self argument

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

Q: (2) Why do we explicitly write self in the call MapReduce.__init__(self)?

Name resolution rules

Within a function/method body, names are resolved according to the LEGB rule:

L  Local scope: any names defined in the current function;
E  Enclosing function scope: names defined in enclosing functions (outermost last);
G  Global scope: names defined in the toplevel of the current module;
B  Built-in names (i.e., Python's __builtins__ module).

Any name that is not in one of the above scopes must be qualified. So you have to write self.data to access an attribute of this instance, re.sub to mean a function defined in module re, MapReduce.__init__ to reference a method defined in the MapReduce class, etc.
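A small sketch of the LEGB lookup order may help; the names factor, offset, outer and inner are invented for this illustration.

```python
import math

factor = 10                 # G: defined at the top level of the module


def outer():
    offset = 1              # E: visible to functions nested inside outer()

    def inner(x):
        y = x * factor      # L: x and y are local; factor is found in G
        # offset is found in the enclosing scope, len among the built-ins
        return y + offset + len("ab")

    return inner(2)


result = outer()
# sqrt lives in the math module's namespace, so it must be qualified
root = math.sqrt(16)
```

Writing sqrt(16) without the math. prefix would fail with a NameError, because sqrt is in none of the four scopes.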

Object attributes

A Python object is (in particular) a key-value mapping: attributes (keys) are valid identifiers, values can be any Python object.

Any object has attributes, which you can access (create, read, overwrite) using the dot notation:

    # create or overwrite the name attribute of w
    w.name = "Joe"
    # get the value of w.name and print it
    print(w.name)

So, in the constructor you create the required instance attributes using self.var = ...

Note: methods are attributes, too!

No access control

There are no public / private / etc. qualifiers for object attributes. Any code can create/read/overwrite/delete any attribute on any object.

There are conventions, though:

protected attributes: _name
private attributes: __name

(But again, note that this is not enforced by the system in any way.)
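The following sketch shows how little the naming conventions restrict outside code. The Account class and its attributes are hypothetical. One detail worth knowing: inside a class body, Python rewrites a name with two leading underscores to _ClassName__name ("name mangling"), which makes accidental clashes less likely but does not hide the attribute.

```python
class Account(object):
    def __init__(self, owner, balance):
        self.owner = owner          # public by convention
        self._balance = balance     # "protected" by convention only
        self.__pin = "0000"         # "private": stored as _Account__pin


acct = Account("Joe", 100)
acct._balance = 5                   # nothing stops outside code from writing
mangled = acct._Account__pin        # the mangled name is still reachable
del acct.owner                      # attributes can even be deleted
```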

Class attributes, I

Classes are Python objects too, hence they can have attributes. Class attributes can be created with the variable assignment syntax in a class definition block:

    class A(object):
        class_attr = value
        def __init__(self):
            # ...

Class attributes are shared among all instances of the same class!
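As an illustration of sharing, here is a minimal sketch with a hypothetical Counter class that counts how many instances have been created.

```python
class Counter(object):
    instances = 0                 # class attribute, stored on Counter itself

    def __init__(self):
        Counter.instances += 1    # update the shared value through the class


a = Counter()
b = Counter()
# reading through an instance falls back to the class attribute
shared = (a.instances, b.instances, Counter.instances)

a.instances = 99                  # creates an instance attribute on a only
after = (a.instances, b.instances, Counter.instances)
```

Note the asymmetry: reading a.instances finds the class attribute, but assigning to a.instances creates a new attribute on that one instance and leaves the shared value untouched.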

Class attributes, II

Methods are class attributes, too. However, looking up a method attribute on an instance returns a bound method, i.e., one for which self is automatically inserted.

Looking up the same method on a class returns an unbound method, which is just like a regular function, i.e., you must pass self explicitly.
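A short sketch of the two lookups, using a made-up Greeter class. It is written so that it also runs under Python 3, where looking a method up on the class simply returns the plain function; either way, self must be passed explicitly in that case.

```python
class Greeter(object):
    def greet(self, name):
        return "hello, " + name


g = Greeter()

bound = g.greet                       # bound method: self is already filled in
via_instance = bound("Ann")

via_class = Greeter.greet(g, "Ann")   # looked up on the class: pass self
```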

Constructors, I

    class WordCount(MapReduce):
        def __init__(self, data):
            MapReduce.__init__(self)
            self.data = data
        # ...

The __init__ method is the instance constructor. It should never return any value (other than None).

Constructors, II

The __init__ method is the instance constructor. It should never return any value (other than None).

However, you call a constructor by class name:

    # make wc an instance of WordCount
    wc = WordCount("some text")

(Again, note that the self part is automatically inserted by Python.)

No overloading

Python does not allow overloading of functions. Any function. Hence, no overloading of constructors.

So: a class can have one and only one constructor.

Constructor chaining

When a class is instantiated, Python only calls the first constructor it can find in the class inheritance call-chain.

If you need to call a superclass constructor, you need to do it explicitly:

    class WordCount(MapReduce):
        def __init__(self, ...):
            # do WordCount-specific stuff here
            MapReduce.__init__(self, ...)
            # some more WordCount-specific stuff

Calling a superclass constructor is optional, and it can happen anywhere in the __init__ method body.
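To see what "only the first constructor" means in practice, consider this sketch. The classes Base, Silent and Chained are invented for the example.

```python
class Base(object):
    def __init__(self):
        self.log = ["Base"]


class Silent(Base):
    def __init__(self):
        # Base.__init__ is NOT called, so self.log is never created
        self.name = "silent"


class Chained(Base):
    def __init__(self):
        Base.__init__(self)          # explicit call to the superclass constructor
        self.log.append("Chained")


s = Silent()
c = Chained()
```

The Silent instance has no log attribute at all, because nothing ever ran the code that creates it.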

Multiple inheritance

Python allows multiple inheritance. Just list all the parent classes:

    class C(A, B):
        # class definition

With multiple inheritance, it is your responsibility to call all the needed superclass constructors.

Python uses the C3 algorithm to determine the call precedence in an inheritance chain. You can always query a class for its method resolution order, via the __mro__ attribute:

    >>> C.__mro__
    (<class 'ex.C'>, <class 'ex.A'>, <class 'ex.B'>, <type 'object'>)
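The method resolution order decides which parent wins when several define the same method. A minimal sketch, with a hypothetical who method defined on both parents:

```python
class A(object):
    def who(self):
        return "A"


class B(object):
    def who(self):
        return "B"


class C(A, B):
    pass


# names of the classes, in the order Python searches them
order = [cls.__name__ for cls in C.__mro__]

# A comes before B in the MRO, so A.who is the one that gets called
winner = C().who()
```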

Nested functions

    import re

    class WordCount(MapReduce):
        # ...
        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [ line_to_tuple(line) for line in data.splitlines() ]
            return data_list
        # ...

You can define functions (and classes) within functions. The nested functions are only visible within the enclosing function. (But they can capture any variable from the enclosing function environment by name.)
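The remark about capturing variables can be illustrated with a small sketch; make_prefixer and add_prefix are names made up for this example.

```python
def make_prefixer(prefix):
    def add_prefix(word):
        # prefix is not local here: it is captured from make_prefixer
        return prefix + word
    return add_prefix


dotted = make_prefixer(".")
name = dotted("bashrc")
```

The nested function keeps working after make_prefixer has returned, because it still holds a reference to the captured prefix variable.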

List comprehensions, I

    class WordCount(MapReduce):
        # ...
        def split_fn(self, data):
            def line_to_tuple(line):
                return (None, line)
            data_list = [ line_to_tuple(line) for line in data.splitlines() ]
            return data_list
        # ...

Q: What is the bracketed expression assigned to data_list?

An easy exercise

A dotfile is a file whose name starts with a dot character ("."). How can you list the full pathname of all dotfiles in a given directory?

(The Python library call for listing the entries in a directory is os.listdir(), which returns a list of file names.)

A very basic solution

Use a for loop to accumulate the results into a list:

    import os

    dotfiles = [ ]
    for entry in os.listdir(path):
        if entry.startswith('.'):
            dotfiles.append(os.path.join(path, entry))

List comprehensions, II

Python has a better and more compact syntax for filtering elements of a list and/or applying a function to them:

    dotfiles = [ os.path.join(path, entry)
                 for entry in os.listdir(path)
                 if entry.startswith('.') ]

This is called a list comprehension.

List comprehensions, III

The general syntax of a list comprehension is:

    [ expr for var in iterable if condition ]

where:

expr is any Python expression;
iterable is a (generalized) sequence;
condition is a boolean expression, depending on var;
var is a variable that will be bound in turn to each item in iterable which satisfies condition.

The if condition part is optional.
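For comparison, here is the same syntax with and without the optional condition, next to the equivalent explicit loop. The words list is made-up sample data.

```python
words = ["map", "reduce", "split", "shuffle", "sort"]

# no condition: apply an expression to every item
lengths = [len(w) for w in words]

# with condition: keep only the items that satisfy it
long_upper = [w.upper() for w in words if len(w) > 4]

# the explicit loop that the second comprehension replaces
long_upper_loop = []
for w in words:
    if len(w) > 4:
        long_upper_loop.append(w.upper())
```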

Generator expressions

List comprehensions are a special case of generator expressions:

    ( expr for var in iterable if condition )

A generator expression is a valid iterable and can be used to initialize tuples, sets, dicts, etc.:

    # the set of square numbers < 100
    squares = set(n*n for n in range(10))

Generator expressions are valid expressions, so they can be nested:

    # cartesian product of sets A and B
    C = set( (a,b) for a in A for b in B )

Generators

Generator expressions are a special case of generators. A generator is like a function, except it uses yield instead of return:

    def squares():
        n = 0
        while True:
            yield n*n
            n += 1

At each iteration, execution resumes with the statement logically following yield in the generator's execution flow. There can be multiple yield statements in a generator.

Reference: http://wiki.python.org/moin/generators
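Because the squares generator never terminates, it has to be consumed lazily. A sketch of two ways to do that, using the built-in next() function and itertools.islice from the standard library:

```python
from itertools import islice


def squares():
    n = 0
    while True:
        yield n * n
        n += 1


# take only the first five values of the infinite stream
first_five = list(islice(squares(), 5))

# or pull values one at a time; each call resumes after the last yield
gen = squares()
a = next(gen)
b = next(gen)
c = next(gen)
```

Calling list(squares()) directly would never return, since the generator has no end.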

Generators in action

    class WordCount(MapReduce):
        # ...
        def map_fn(self, key, value):
            for word in re.split(r'\W+', value.lower()):
                bareword = re.sub(r"[^a-zA-Z0-9]*", r"", word)
                if len(bareword) > 0:
                    yield (bareword, 1)
        # ...

The yield statement makes map_fn into a generator that returns pairs (word, 1).

The Iterator Protocol

An object can function as an iterator iff it implements a next() method that:

either returns the next value in the iteration,
or raises StopIteration to signal the end of the iteration.

An object can be iterated over with for if it implements an __iter__() method.

Reference: http://www.python.org/dev/peps/pep-0234/

Iterate over the words in the given text: split the text at white spaces, and return the parts one by one.

    class WordIterator(object):

        def __init__(self, text):
            self._words = text.split()

        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration

        def __iter__(self):
            return self

Source code available at: http://www.gc3.uzh.ch/teaching/lsci2011/lecture08/worditerator.py

    class WordIterator(object):

        def __init__(self, text):
            self._words = text.split()

        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration

        def __iter__(self):
            return self

Every class must inherit from a parent class. If there's no other class, inherit from the object class. (Root of the class hierarchy.)

Using iterators

Iterators can be used in a for loop:

    >>> for word in WordIterator("a nice sunny day"):
    ...     print '*'+word+'*',
    ...
    *a* *nice* *sunny* *day*

They can be composed with other iterators for effect:

    >>> for n, word in enumerate(WordIterator("a...")):
    ...     print str(n)+':'+word,
    ...
    0:a 1:nice 2:sunny 3:day

See also: http://docs.python.org/library/itertools.html
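The slides use the Python 2 spelling of the protocol, where the method is called next(). Under Python 3 the method is named __next__() instead. The following variant of WordIterator is a sketch that defines both names so it runs on either version; the variables words and numbered are added here for illustration.

```python
class WordIterator(object):
    def __init__(self, text):
        self._words = text.split()

    def __iter__(self):
        return self

    def __next__(self):              # Python 3 name of the method
        if self._words:
            return self._words.pop(0)
        raise StopIteration

    next = __next__                  # Python 2 name, same implementation


words = list(WordIterator("a nice sunny day"))

# composing with enumerate(), as in the interactive session
numbered = ["%d:%s" % (n, w)
            for n, w in enumerate(WordIterator("a nice sunny day"))]
```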

    class WordIterator(object):

        def __init__(self, text):
            self._words = text.split()

        def next(self):
            if len(self._words) > 0:
                return self._words.pop(0)
            else:
                raise StopIteration

        def __iter__(self):
            return self

Q: What is the raise StopIteration statement?

Exceptions

Exceptions are objects that inherit from the built-in Exception class. To create a new exception just make a new class:

    class NewKindOfError(Exception):
        """
        Do use the docstring to document
        what this error is about.
        """
        pass

Exceptions are handled by class name, so they usually do not need any new methods (although you are free to define some if needed).

See also: http://docs.python.org/library/exceptions.html

    try:
        # code that might raise an exception
    except SomeException:
        # handle some exception
    except AnotherException, ex:
        # the actual Exception instance
        # is available as variable ex
    else:
        # performed on normal exit from try
    finally:
        # performed on exit in any case

The optional else clause is executed if and when control flows off the end of the try clause.

The optional finally clause is executed on exit from the try or except block in any case.

Reference: http://docs.python.org/reference/compound_stmts.html#try
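The order in which the clauses run can be traced with a small sketch. The parse_int helper is hypothetical. It uses the except ... as ex spelling, which Python 3 requires in place of the comma form shown in the template.

```python
def parse_int(text):
    """Convert text to an int, recording which clauses were executed."""
    trace = []
    try:
        value = int(text)
    except ValueError as ex:
        # ex is the actual exception instance
        trace.append("except " + type(ex).__name__)
        value = None
    else:
        trace.append("else")         # only when try raised nothing
    finally:
        trace.append("finally")      # always runs
    return value, trace


ok = parse_int("42")
bad = parse_int("forty-two")
```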

Raising exceptions

Use the raise statement with an Exception instance:

    if an_error_occurred:
        raise AnError("Spider sense is tingling.")

Within an except clause, you can use raise with no arguments to re-raise the current exception:

    try:
        something()
    except ItDidntWork:
        do_cleanup()
        # re-raise exception to caller
        raise

Exception handling example

Read lines from a CSV file, ignoring those that do not have the required number of fields. If other errors occur, abort. Close the file when done.

    job_state = { }  # empty dict
    # opened before the try block: if open() fails there is no file to
    # close, and the IOError propagates to the caller
    csv_file = open('jobs.csv', 'r')
    try:
        for line in csv_file:
            line = line.strip()  # remove trailing newline
            try:
                name, jobid, state = line.split(",")
            except ValueError:
                continue  # ignore line
            job_state[jobid] = state
    except IOError:
        raise  # up to caller
    finally:
        csv_file.close()

A common case

The cleanup pattern is so common that Python has a special statement to deal with it:

    with open('jobs.csv', 'r') as csv_file:
        for line in csv_file:
            line = line.strip()  # remove trailing newline
            try:
                name, jobid, state = line.split(",")
            except ValueError:
                continue  # ignore line
            job_state[jobid] = state

The with statement ensures that the file is closed upon exit from the with block (for whatever reason).

Reference: http://docs.python.org/reference/compound_stmts.html#with

The context manager protocol

Any object can be used in a with statement, provided it defines the following two methods:

__enter__()
Called upon entrance of the with block; its return value is assigned to the variable following as (if any).

__exit__(exc_cls, exc_val, exc_tb)
Called with three arguments upon exit from the block. If an exception occurred, the three arguments are the exception type, value and traceback; otherwise, the three arguments are all set to None.

Q: Can you think of other examples where this could be useful?

See also: http://www.python.org/dev/peps/pep-0343/
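A minimal sketch of a class implementing the two methods; Recorder is a made-up example that just logs when each method is called. Returning a false value from __exit__ tells Python not to suppress the exception, so it continues to propagate after the cleanup.

```python
class Recorder(object):
    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self                  # bound to the name after "as"

    def __exit__(self, exc_cls, exc_val, exc_tb):
        if exc_cls is None:
            self.events.append("exit")
        else:
            self.events.append("exit after " + exc_cls.__name__)
        return False                 # do not swallow the exception


rec = Recorder()
with rec as r:
    r.events.append("body")

failed = Recorder()
try:
    with failed:
        raise KeyError("boom")
except KeyError:
    pass                             # __exit__ ran before we got here
```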