Jussi Enkovaara Martti Louhivuori. Python in High-Performance Computing. CSC IT Center for Science Ltd, Finland January 29-31, 2018

Size: px

Start display at page:

Download "Jussi Enkovaara Martti Louhivuori. Python in High-Performance Computing. CSC IT Center for Science Ltd, Finland January 29-31, 2018"

Deirdre Conley
5 years ago
Views:

PDB import PDBParser biopython_installed = True except ImportError: biopython_installed = False

0 default_segid = '' # default B-factor # default occupancy level # empty segment ID class

and forth without loading anything to memory. """ def init (self, filename): try: self.

1 Jussi Enkovaara Martti Louhivuori Python in High-Performance Computing CSC IT Center for Science Ltd, Finland January 29-31, 2018 import sys, os try: from Bio.PDB import PDBParser biopython_installed = True except ImportError: biopython_installed = False default_bfactor = 0.0 default_occupancy = 1.0 default_segid = '' # default B-factor # default occupancy level # empty segment ID class EOF(Exception): def init (self): pass class FileCrawler: """ Crawl through a file reading back and forth without loading anything to memory. """ def init (self, filename): try: self. fp = open(filename) except IOError: raise ValueError, "Couldn't open file '%s' for reading." % filename self.tell = self. fp.tell self.seek = self. fp.seek def prevline(self): try: self.prev()

3 Agenda Monday 9:00-9:15 Python and HPC 9:15-10:00 NumPy fast array interface to Python 10:00-10:30 Exercises 10:30-10:45 Coffee Break 10:45-11:00 NumPy tools 11:00-12:00 Exercises 12:00-13:00 Lunch break 13:00-13:45 Advanced indexing, Vectorized operations & broadcasting, numexpr 13:45-14:30 Exercises 14:30-14:45 Coffee Break 14:45-15:30 Performance analysis 15:30-16:30 Exercises Tuesday Optimising Python with Cython 9:45-10:30 Cython cont Coffee break 10:45-12:00 Exercises Lunch break :45 Interfacing external libraries 13:45-14:30 Exercises Coffee break Multiprocessing 15:30-16:30 Exercises Wednesday MPI introduction 9:45-10:30 Point-to-point communication Coffee break 10:45-12:00 Exercises Lunch break :45 Non-blocking communication and communicators 13:45-14:30 Exercises Coffee break Collective communications 15:30-16:30 Exercises 16:30-16:45 Summary of Python HPC strategies

5 PYTHON AND HIGH-PERFORMANCE COMPUTING

7 Efficiency Python is an interpreted language no pre-compiled binaries, all code is translated on-the-fly to machine instructions byte-code as a middle step and may be stored (.pyc) All objects are dynamic in Python nothing is fixed == optimisation nightmare lot of overhead from metadata Flexibility is good, but comes with a cost! Improving Python performance Array based computations with NumPy Using extended Cython programming language Embed compiled code in a Python program C, Fortran Utilize parallel processing Parallelisation strategies for Python Agenda Global Interpreter Lock (GIL) CPython s memory management is not thread-safe no threads possible, except for I/O etc. affects overall performance if threading Process-based threading with multiprocessing fork independent processes that have a limited way to communicate Message-passing is the Way to Go to achieve true parallelism in Python Monday 9:00-9:15 Python and HPC 9:15-10:00 NumPy fast array interface to Python 10:00-10:30 Exercises 10:30-10:45 Coffee Break 10:45-11:00 NumPy tools 11:00-12:00 Exercises 12:00-13:00 Lunch break 13:00-13:45 Advanced indexing, Vectorized operations & broadcasting, numexpr 13:45-14:30 Exercises 14:30-14:45 Coffee Break 14:45-15:30 Performance analysis 15:30-16:30 Exercises Tuesday Optimising Python with Cython 9:45-10:30 Cython cont Coffee break 10:45-12:00 Exercises Lunch break :45 Interfacing external libraries 13:45-14:30 Exercises Coffee break Multiprocessing 15:30-16:30 Exercises Wednesday MPI introduction 9:45-10:30 Point-to-point communication Coffee break 10:45-12:00 Exercises Lunch break :45 Non-blocking communication and communicators 13:45-14:30 Exercises Coffee break Collective communications 15:30-16:30 Exercises 16:30-16:45 Summary of Python HPC strategies

9 NUMPY BASICS

11 Numpy fast array interface Standard Python is not well suitable for numerical computations lists are very flexible but also slow to process in numerical computations Numpy adds a new array data type static, multidimensional fast processing of arrays some linear algebra, random numbers Numpy arrays All elements of an array have the same type Array can have multiple dimensions The number of elements in the array is fixed, shape can be changed Python list vs. NumPy array Creating numpy arrays Python list Memory layout NumPy array Memory layout From a list: >>> import numpy as np >>> a = np.array((1, 2, 3, 4), float) >>> a array([ 1., 2., 3., 4.]) >>> >>> list1 = [[1, 2, 3], [4,5,6]] >>> mat = np.array(list1, complex) >>> mat array([[ 1.+0.j, 2.+0.j, 3.+0.j], [ 4.+0.j, 5.+0.j, 6.+0.j]]) >>> mat.shape (2, 3) >>> mat.size 6 Creating numpy arrays More ways for creating arrays: >>> import numpy as np >>> a = np.arange(10) >>> a array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> >>> b = np.linspace(-4.5, 4.5, 5) >>> b array([-4.5, -2.25, 0., 2.25, 4.5 ]) >>> >>> c = np.zeros((4, 6), float) >>> c.shape (4, 6) >>> >>> d = np.ones((2, 4)) >>> d array([[ 1., 1., 1., 1.], [ 1., 1., 1., 1.]]) Simple indexing: Slicing: Indexing and slicing arrays >>> mat = np.array([[1, 2, 3], [4, 5, 6]]) >>> mat[0,2] 3 >>> mat[1,-2] >>> 5 >>> a = np.arange(5) >>> a[2:] array([2, 3, 4]) >>> a[:-1] array([0, 1, 2, 3]) >>> a[1:3] = -1 >>> a array([0, -1, -1, 3, 4]) Indexing and slicing arrays Slicing is possible over all dimensions: >>> a = np.arange(10) >>> a[1:7:2] array([1, 3, 5]) >>> >>> a = np.zeros((4, 4)) >>> a[1:3, 1:3] = 2.0 >>> a array([[ 0., 0., 0., 0.], [ 0., 2., 2., 0.], [ 0., 2., 2., 0.], [ 0., 0., 0., 0.]]) Views and copies of arrays Simple assignment creates references to arrays Slicing creates views to the arrays Use copy() for real copying of arrays example.py a = np.arange(10) b = a # reference, changing values in b changes a b = a.copy() # true copy c = a[1:4] # view, changing c changes elements [1:4] of a c = a[1:4].copy() # true copy of subarray

12 Array manipulation reshape : change the shape of array >>> mat = np.array([[1, 2, 3], [4, 5, 6]]) >>> mat array([[1, 2, 3], [4, 5, 6]]) >>> mat.reshape(3,2) array([[1, 2], [3, 4], [5, 6]]) ravel : flatten array to 1-d >>> mat.ravel() array([1, 2, 3, 4, 5, 6]) Array manipulation concatenate : join arrays together >>> mat1 = np.array([[1, 2, 3], [4, 5, 6]]) >>> mat2 = np.array([[7, 8, 9], [10, 11, 12]]) >>> np.concatenate((mat1, mat2)) array([[ 1, 2, 3], [ 4, 5, 6], [ 7, 8, 9], [10, 11, 12]]) >>> np.concatenate((mat1, mat2), axis=1) array([[ 1, 2, 3, 7, 8, 9], [ 4, 5, 6, 10, 11, 12]]) split : split array to N pieces >>> np.split(mat1, 3, axis=1) [array([[1], [4]]), array([[2], [5]]), array([[3], [6]])] Array operations Most operations for numpy arrays are done elementwise +, -, *, /, ** >>> a = np.array([1.0, 2.0, 3.0]) >>> b = 2.0 >>> a * b array([ 2., 4., 6.]) >>> a + b array([ 3., 4., 5.]) >>> a * a array([ 1., 4., 9.]) Array operations Numpy has special functions which can work with array arguments sin, cos, exp, sqrt, log,... >>> import numpy, math >>> a = numpy.linspace(-math.pi, math.pi, 8) >>> a array([ , , , , , , , ]) >>> numpy.sin(a) array([ e-16, e-01, e-01, e-01, e-01, e-01, e-01, e-16]) >>> >>> math.sin(a) Traceback (most recent call last): File "<stdin>", line 1, in? TypeError: only length-1 arrays can be converted to Python scalars I/O with Numpy NUMPY TOOLS Numpy provides functions for reading data from file and for writing data into the files Simple text files numpy.loadtxt numpy.savetxt Data in regular column layout Can deal with comments and different column delimiters Random numbers The module numpy.random provides several functions for constructing random arrays random: uniform random numbers normal: normal distribution poisson: Poisson distribution... >>> import numpy.random as rnd >>> rnd.random((2,2)) array([[ , ], [ , ]]) >>> rnd.poisson(size=(2,2)) Polynomials Polynomial is defined by array of coefficients p p(x, N) = p[0] x N-1 + p[1] x N p[n-1] Least square fitting: numpy.polyfit Evaluating polynomials: numpy.polyval Roots of polynomial: numpy.roots... >>> x = np.linspace(-4, 4, 7) >>> y = x**2 + rnd.random(x.shape) >>> >>> p = np.polyfit(x, y, 2) >>> p array([ , , ])

13 Linear algebra Numpy can calculate matrix and vector products efficiently: dot, vdot,... Eigenproblems: linalg.eig, linalg.eigvals, Linear systems and matrix inversion: linalg.solve, linalg.inv >>> A = np.array(((2, 1), (1, 3))) >>> B = np.array(((-2, 4.2), (4.2, 6))) >>> C = np.dot(a, B) >>> >>> b = np.array((1, 2)) >>> np.linalg.solve(c, b) # solve C x = b array([ , ]) Linear algebra Normally, NumPy utilises high performance libraries in linear algebra operations Exampe: matrix multiplication C = A * B matrix dimension 200 pure python: 5.30 s naive C: 0.09 s numpy.dot: 0.01 s Anatomy of NumPy array NUMPY ADVANCED TOPICS ndarray type is made of one dimensional contiguous block of memory (raw data) indexing scheme: how to locate an element data type descriptor: how to interpret an element NumPy indexing There are many possible ways of arranging items of N- dimensional array in a 1-dimensional block NumPy uses striding where N-dimensional index (n 0, n 1, n N-1 ) corresponds to offset from the beginning of 1-dimensional block offset = N 1 s k n k k=0 s k stride in dimension k ndarray attributes a = np.array( ) a.flags various information about memory layout a.strides bytes to step in each dimension when traversing a.itemsize size of one array element in bytes a.data Python buffer object pointing to start of arrays data (n 0, n 1 ) offset a. array_interface Python internal interface Advanced indexing Numpy arrays can be indexed also with other arrays (integer or boolean) >>> x = np.arange(10,1,-1) >>> x array([10, 9, 8, 7, 6, 5, 4, 3, 2]) >>> x[np.array([3, 3, 1, 8])] array([7, 7, 9, 2]) Boolean mask arrays >>> m = x > 7 >>> m array([ True, True, True, False, False,... >>> x[m] array([10, 9, 8]) Advanced indexing creates copies of arrays Vectorized operations for loops in Python are slow Use vectorized operations when possible Example: difference example.py # brute force using a for loop arr = np.arange(1000) dif = np.zeros(999, int) for i in range(1, len(arr)): dif[i-1] = arr[i] - arr[i-1] # vectorized operation arr = np.arange(1000) dif = arr[1:] - arr[:-1] for loop is ~80 times slower!

14 Broadcasting If array shapes are different, the smaller array may be broadcasted into a larger shape >>> from numpy import array >>> a = array([[1,2],[3,4],[5,6]], float) >>> a array([[ 1., 2.], [ 3., 4.], [ 5., 6.]]) >>> b = array([[7,11]], float) >>> b array([[ 7., 11.]]) >>> >>> a * b array([[ 7., 22.], [ 21., 44.], [ 35., 66.]]) Broadcasting Example: calculate distances from a given point example.py # array containing 3d coordinates for 100 points points = np.random.random((100, 3)) origin = np.array((1.0, 2.2, -2.2)) dists = (points - origin)**2 dists = np.sqrt(np.sum(dists, axis=1)) # find the most distant point i = np.argmax(dists) print(points[i]) Temporary arrays In complex expressions, NumPy stores intermediate values in temporary arrays Memory consumption can be higher than expected example.py a = np.random.random((1024, 1024, 50)) b = np.random.random((1024, 1024, 50)) # two temporary arrays will be created c = 2.0 * a 4.5 * b temp1 temp2 # three temporary arrays will be created due to unnecessary parenthesis c = (2.0 * a 4.5 * b) * (np.sin(a) + np.cos(b)) Temporary arrays Broadcasting approaches can lead also to hidden temporary arrays Example: pairwise distance of M points in 3 dimensions Input data is M x 3 array Output is M x M array containing the distance between points i and j example.py X = np.random.random((1000, 3)) D = np.sqrt(((x[:, np.newaxis, :] - X) ** 2).sum(axis=-1)) Temporary 1000 x 1000 x 3 array Numexpr Evaluation of complex expressions with one operation at a time can lead also into suboptimal performance Effectively, one carries out multiple for loops in the NumPy C-code Numexpr package provides fast evaluation of array expressions example.py import numexpr as ne x = np.random.random(( , 1)) y = np.random.random(( , 1)) poly = ne.evaluate("((.25*x +.75)*x - 1.5)*x - 2") Numexpr By default, numexpr tries to use multiple threads Number of threads can be queried and set with ne.set_num_threads(nthreads) Supported operators and functions +,-,*,/,**, sin, cos, tan, exp, log, sqrt Speedups in comparison to NumPy are typically between 0.95 and 4 Works best on arrays that do not fit in CPU cache Summary Numpy provides a static array data structure Multidimensional arrays Fast mathematical operations for arrays Tools for linear algebra and random numbers Arrays can be broadcasted into same shapes Expression evaluation can lead into temporary arrays

15 PERFORMANCE MEASUREMENT

17 Measuring application performance Correctness is the most import factor in any application Premature optimization is the root of all evil! Before starting to optimize application, one should measure where time is spent Typically 90 % of time is spent in 10 % of application Applications own timers timeit module cprofile module Full fedged profiling tools: TAU, Intel Vtune, Python Tools for Visual Studio Measuring application performance Python time module can be used for measuring time spent in specific part of the program time.time(), time.clock(), In Python 3: time.perf_counter(), time.process_time() timing.py import time t0 = time.time() for n in range(niter): heavy_calculation() t1 = time.time() Print( Time spent in heavy calculation, t1-t0) timeit module Easy timing of small bits of Python code Tries to avoid common pitfalls in measuring execution times Command line interface and Python interface $ python m timeit s from mymodule import func func() 10 loops, best of 3: 433 msec per loop %timeit magic in IPython In [1]: from mymodule import func In [2]: %timeit func() 10 loops, best of 3: 433 msec per loop cprofile Execution profile of Python program Time spent in different parts of the program Call graphs Python API: profile.py import cprofile # profile statement and save results to a file func.prof cprofile.run( func(), func.prof ) Profiling whole program from command line $ python m cprofile o myprof.prof myprogram.py Investigating profile with pstats Printing execution time of selected functions Sorting by function name, time, cumulative time, Python module interface and interactive browser In [1]: from pstats import Stats In [2]: p = Stats( myprof.prof ) In [3]: p.strip_dirs() In [4]: p.sort_stats( time ) In [5]: p.print_stats(5) Mon Oct 12 10:11: my.prof $ python m pstats myprof.prof Welcome to the profile statistics % strip % sort time % stats 5 Mon Oct 12 10:11: my.prof Summary Python has various built-in tools for measuring application performance time module timeit module cprofile and pstats modules

19 CYTHON

21 Cython Optimising static compiler for Python Extended Cython programming language Tune readable Python code into plain C performance by adding static type declarations Easy interfacing to external C libraries Python overheads Interpreting Boxing - everything is an object Function call overhead Global interpreter lock no threading benefits (CPython) Interpreting Case study: Mandelbrot fractal Cython command generates a C /C++ source file from a Cython source file C/C++ source is then compiled into an extension module Interpreting overhead is normally not drastic setup.py from distutils.core import setup from Cython.Build import cythonize # Normally, one compiles cython extended code with.pyx ending setup(ext_modules=cythonize( mandel_cyt.py ), ) $ python setup.py build_ext --inplace In [1]: import mandel_cyt Pure Python: Compiled with Cython: 2.71 s 2.61 s mandel.py def kernel(zr, zi, cr, ci, lim, cutoff): count = 0 while ((zr*zr + zi*zi) < (lim*lim)) \ and count < cutoff: zr = zr * zr - zi * zi + cr zi = zr * zr - zi * zi + cr count += 1 return count Boxing In Python, everything is an object Object Integer int 7 other stuff Object Integer int 6 other stuff Check the types: integers int 7 + int 6 = int 13 Object Integer int 13 other stuff Static type declarations Cython extended code should have.pyx ending Cannot be run with normal Python Types are declared with cdef keyword In function signatures only type is given example.py def integrate(f, a, b, N): s = 0 dx = (b-a)/n for i in range(n): s += f(a+i*dx) return s * dx example.pyx def integrate(f, double a, double b, int N): cdef double s = 0 cdef int i cdef double dx = (b-a)/n for i in range(n): s += f(a+i*dx) return s * dx Static type declarations Pure Python: 2.71 s Type declarations in kernel: 20.2 ms Function call overhead Function calls in Python can involve lots of checking and boxing Overhead can be reduced by declaring functions to be C-functions cdef keyword: functions can be called only from Cython cpdef keyword: generate also Python wrapper (can have additional overhead in some cases)

22 Using C functions NumPy arrays with Cython Static type declarations: Kernel as C function: 20.2 ms 12.5 ms Cython supports fast indexing for NumPy arrays Type and dimensions of array have to be declared mandel.py cdef int kernel(double zr, double zi, ): cdef int count = 0 while ((zr*zr + zi*zi) < (lim*lim)) \ and count < cutoff: zr = zr * zr - zi * zi + cr zi = zr * zr - zi * zi + cr count += 1 return count numpy_example.py import numpy as np # Normal NumPy import cimport numpy as cnp # Import for NumPY C-API def func(): # declarations can be made only in function scope cdef cnp.ndarray[cnp.int_t, ndim=2] data data = np.empty((n, N), dtype=int) for i in range(n): for j in range(n): data[i,j] = # double loop is done in nearly C speed Compiler directives Compiler directives can be used for turning of certain Python features for additional performance boundscheck (False) : assume no IndexErrors wraparound (False): no negative indexing numpy_example.py import numpy as np # Normal NumPy import cimport numpy as cnp # Import for NumPY C-API import cython Final performance Pure Python: 2.7 s Static type declarations: 20.2 ms Kernel as C function: 12.5 ms Fast indexing and directives: 2.4 def func(): # declarations can be made only in function scope cdef cnp.ndarray[cnp.int_t, ndim=2] data data = np.empty((n, N), dtype=int) Where to add types? HTML-report Typing everything reduces readibility and can even slow down the performance Profiling should be first step when optimising Cython is able to provide annotated HTML-report Lines are colored according to the level of typedness white lines translate to pure C lines that require the Python C-API are yellow (darker as they translate to more C-API interaction) $cython a cython_module.pyx $firefox cython_module.html Profiling Cython code By default, Cython code does not show up in profile produced by cprofile Profiling can be enabled for entire source file or on per function basis profiling.py # cython: profile=true profiling.py # cython: profile=false Summary Cython is optimising static compiler for Python Possible to add type declarations with Cython language Fast indexing for NumPy arrays At best cases, huge speed ups can be obtained Some compromise for Python flexibility import cdef func(): import cdef func():

23 Further functionality in Cython Using C structs and C++ classes in Cython Exceptions handling Parallelisation (threading) with Cython INTERFACING EXTERNAL LIBRARIES Increasing performance with compiled code There are Python interfaces for many high performance libraries However, sometimes one might want to utilize a library without Python interface Existing libraries Own code written in C or Fortran Python C-API provides the most comprehensive way to extend Python Cffi, cython, and f2py can provide easier approaches cffi C Foreign Function Interface for Python Interact with almost any C code C-like declarations within Python Can often be copy-pasted from headers / documentation ABI and API modes ABI does not require compilation API can be more robust Only ABI discussed here Some understanding of C required cffi_example.py from cffi import FFI import numpy as np cffi example ffi = FFI() lib = ffi.dlopen("./myclib.so") ffi.cdef("""void add(double *x, double *y, int n);""") ffi.cdef("""void subtract(double *x, double *y, int n);""") Interfacing C with cython As Cython code compiles down to C code, it is relatively easy to call C functions C declarations need to be given in Cython code C sources or libraries need to be specified in setup.py a = np.random.random(( ,1)) b = np.zeros_like(a) # Pointer objects need to be passed to library aptr = ffi.cast("double *", ffi.from_buffer(a)) bptr = ffi.cast("double *", ffi.from_buffer(b)) lib.add(bptr, aptr, len(a)) lib.subtract(bptr, aptr, len(a)) cython_example.pyx import numpy as np cimport numpy as cnp cdef extern from "myclib.h": Cython example void add(double *a, double *b, int n) void subtract(double *a, double *b, int n) def add_py(cnp.ndarray[cnp.double_t,ndim=1] a, cnp.ndarray[cnp.double_t,ndim=1] b): Cython example Including C-code in Cython module setup.py from distutils.core import setup, Extension from Cython.Build import cythonize # Specify all sources in Extension object ext = Extension("module_name", sources=["cython_source.pyx", "c_source.c"]) setup(ext_modules=cythonize(ext)) add(&a[0], &b[0], len(a))

24 Linking agains C library Cython example setup.py from distutils.core import setup, Extension from Cython.Build import cythonize # Specify all sources in Extension object ext = Extension("module_name", sources=["cython_source.pyx",], libraries=["name",], # Cython module is linked against library_dirs=[".",]) # libname.so, looked in "." setup(ext_modules=cythonize(ext)) diff.f90 Interfacing with Fortran NumPy includes f2py for connecting Python with Fortran codes f2py creates C wrapper for Fortran code that can then be used as extension module subroutine diff(in, out, n) real*8, intent(in) :: in(n) real*8, intent(out) :: out(n-1)!f2py intent(in, out) :: out integer :: n... $ f2py -c diff.f90 -m diff_mod example.py import numpy as np import diff_mod a = np.arange(10.0) b = np.zeros_like(a[1:]) diff_mod.diff(a, b, len(a)) Interfacing with Fortran f2py nitty-gritties If multidimensional NumPy arrays are not stored in Fortran order, f2py creates copies of arrays Modifications of arrays in Fortran are not seen in Python example.py a = np.zeros((10, 10)) modify(a) # a won t be modified a = np.asfortranarray(a) # convert a to Fortran modify(a) # a will now be modified # Array can be created in Fortran order in first place a = np.zeros((10, 10), order="f") Summary External libraries can interfaced in various ways cffi provides easy interfacing Cython can give more control and sometimes better performance Fortran routines are called implicitly via C wrappers f2py automates the creation of wrappers

25 MULTIPROCESSING PROCESS BASED THREADING

27 Processes and threads Serial region Processes and threads Serial region Parallel processes Serial region Serial region Parallel region Parallel region Parallel processes Serial region Serial region Parallel region Parallel region Process Independent execu:on units Have their own state informa:on and own memory address space Thread A single process may contain mul:ple threads Have their own state informa:on, but share the same memory address space Process Long-lived: created when parallel program started, killed when program is finished Explicit communica:on between processes Thread Short-lived: created when entering a parallel region, destroyed (joined) when region ends Communica:on through shared memory Parallel processes Process MPI good performance scales from a laptop to a supercomputer Processes and threads Serial region Serial region Serial region Parallel region Parallel region Thread OpenMP C / Fortran, not Python threading module only for I/O bound tasks (maybe) Global Interpreter Lock (GIL) limits usability Parallel processes Process MPI good performance scales from a laptop to a supercomputer Processes and threads master process master master + workers master + workers master process Thread Process mul:processing module relies on OS for forking worker processes that mimic threads limited communica:on between the parallel processes Mul;processing Underlying OS used to spawn new independent subprocesses processes are independent and execute code in an asynchronous manner no guarantee on the order of execu:on Communica:on possible only through dedicated, shared communica:on channels Queues, Pipes must be created before a new process is forked Spawn a process spawn.py from multiprocessing import Process import os def hello(name): print 'Hello', name print 'My PID is', os.getpid() print "My parent's PID is", os.getppid() # Create a new process p = Process(target=hello, args=('alice', )) # Start the process p.start() print 'Spawned a new process from PID', os.getpid() # End the process p.join() Communica;on Sharing data shared memory, data manager Pipes direct communica:on between two processes Queues work sharing among a group of processes Pool of workers offloading tasks to a group of worker processes Queues FIFO (first-in-first-out) task queues that can be used to distribute work among processes Shared among all processes all processes can add and retrieve data from the queue Automa:cally takes care of locking, so can be used safely with minimal hassle

28 Queues task-queue.py from multiprocessing import Process, Queue def f(q): while True: x = q.get() if x is None: break print(x**2) q = Queue() for i in range(100): q.put(i) # task queue: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,..., 99] for i in range(3): q.put(none) p = Process(target=f, args=(q, )) p.start() Queues task-queue.py from multiprocessing import Process, Queue def f(q): while True: x = q.get() if x is None: # if sentinel, stop execution break print(x**2) q = Queue() for i in range(100): q.put(i) # task queue: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,..., 99] for i in range(3): q.put(none) # add sentinels to the queue to signal STOP p = Process(target=f, args=(q, )) p.start() Pool of workers Group of processes that carry out tasks assigned to them 1. Master process submits tasks to the pool 2. Pool of worker processes perform the tasks 3. Master process retrieves the results from the pool Blocking and non-blocking (= asynchronous) calls available pool.py from multiprocessing import Pool import time def f(x): return x**2 pool = Pool(8) Pool of workers # Blocking execution (with a single process) result = pool.apply(f, (4,)) print(result) # Non-blocking execution "in the background" result = pool.apply_async(f, (12,)) while not result.ready(): time.sleep(1) print(result.get()) # an alternative to "sleeping" is to use e.g. result.get(timeout=1) Pool of workers Mul;processing summary pool-map.py from multiprocessing import Pool import time def f(x): return x**2 pool = Pool(8) # calculate x**2 in parallel for x in 0..9 result = pool.map(f, range(10)) print(result) # non-blocking alternative result = pool.map_async(f, range(10)) while not result.ready(): time.sleep(1) print(result.get()) Parallelism achieved by launching new OS processes Only limited communica:on possible work sharing: queues / pool of workers Non-blocking execu:on available do something else while wai:ng for results Further informa:on: h_ps://docs.python.org/2/library/mul:processing.html

29 MESSAGE PASSING INTERFACE

Message passing interface MPI is an applica,on programming interface (API) for communica,on between separate processes The most widely used approach for distributed parallel compu,ng MPI programs are

31 Message passing interface MPI is an applica,on programming interface (API) for communica,on between separate processes The most widely used approach for distributed parallel compu,ng MPI programs are portable and scalable the same program can run on different types of computers, from PC's to supercomputers MPI is flexible and comprehensive large (over 120 procedures) concise (ohen only 6 procedures are needed) MPI standard defines C and Fortran interfaces MPI for Python (mpi4py) provides an unofficial Python interface Parallel processes Process Independent execu,on units Have their own state informa,on and own memory address space Processes and threads Serial region Serial region Serial region Parallel region Parallel region Thread A single process may contain mul,ple threads Have their own state informa,on, but share the same memory address space Execu>on model MPI program is launched as a set of independent, iden0cal processes execute the same program code and instruc,ons can reside in different nodes (or even in different computers) The way to launch a MPI program depends on the system mpirun, mpiexec, aprun, srun,... aprun on sisu.csc.fi, srun on taito.csc.fi MPI rank Rank: ID number given to a process it is possible to query for rank processes can perform different tasks based on their rank example.py if (rank == 0): # do something elif (rank == 1): # do something else else: # all other processes do something different Data model Each MPI process has its own separate memory space, i.e. all variables and data structures are local to the process Processes can exchange data by sending and receiving messages rank 0 a = 1.0 b = 2.0 process 1 MPI messages rank 1 a = -0.7 b = 3.0 process 2 MPI communicator Communicator: a group containing all the processes that will par,cipate in communica,on in mpi4py all MPI calls are implemented as methods of a communicator object MPI_COMM_WORLD contains all processes (MPI.COMM_WORLD in mpi4py) user can define custom communicators Rou>nes in MPI for Python Communica,on between processes sending and receiving messages between two processes sending and receiving messages between several processes Synchroniza,on between processes Communicator crea,on and manipula,on Advanced features (e.g. user defined datatypes, onesided communica,on and parallel I/O) GeHng started Basic methods of communicator object Get_size() Number of processes in communicator Get_rank() rank of this process hello.py from mpi4py import MPI comm = MPI.COMM_WORLD # communicator object containing all processes size = comm.get_size() rank = comm.get_rank() print("i am rank %d in group of %d processes" % (rank, size))

Running an example program hello.py from mpi4py import MPI comm = MPI.COMM_WORLD # communicator object containing all processes size = comm.get_size() rank = comm.

32 Running an example program hello.py from mpi4py import MPI comm = MPI.COMM_WORLD # communicator object containing all processes size = comm.get_size() rank = comm.get_rank() print("i am rank %d in group of %d processes" % (rank, size)) POINT-TO-POINT COMMUNICATION $ mpirun np 4 python hello.py I am rank 2 in group of 4 processes I am rank 0 in group of 4 processes I am rank 3 in group of 4 processes I am rank 1 in group of 4 processes MPI communica>on MPI point-to-point opera>ons MPI processes are independent, they communicate to coordinate work Point-to-point communica,on Messages are sent between two processes Collec,ve communica,on Involving a number of processes at the same,me One process sends a message to another process that receives it Sends and receives in a program should match one receive per send Sending and receiving data Sending and receiving data Sending and receiving a dic,onary send.py from mpi4py import MPI comm = MPI.COMM_WORLD # communicator object containing all processes rank = comm.get_rank() if rank == 0: data = {'a': 7, 'b': 3.14} comm.send(data, dest=1, tag=11) elif rank == 1: data = comm.recv(source=0, tag=11) Arbitrary Python objects can be communicated with the send and receive methods of a communicator send(data, dest, tag) data Python object to send dest des,na,on rank tag ID given to the message recv(source, tag) source source rank tag ID given to the message data is provided as return value Des,na,on and source ranks as well as tags have to match Case study: parallel sum Case study: parallel sum Memory P0 P1 Array originally on process #0 (P0) Parallel algorithm Scader Half of the array is sent to process 1 Compute P0 & P1 sum independently their segments Reduc,on Par,al sum on P1 sent to P0 P0 sums the par,al sums Memory P0 P1 Step 1.1: Receive opera,on in scader Timeline P0 P1 recv P1 posts a receive to receive half of the array from P0

33 Case study: parallel sum Case study: parallel sum Memory Step 1.2: Send opera,on in scader Memory Step 2: Compute the sum in parallel P0 P1 Timeline P0 P1 Timeline P0 send P0 send compute P1 recv = P1 recv compute P0 posts a send to send the lower part of the array to P1 = P0 & P1 computes their parallel sums and store them locally Case study: parallel sum Case study: parallel sum Memory Step 3.1: Receive opera,on in reduc,on Memory Step 3.2: send opera,on in reduc,on P0 P1 Timeline P0 P1 Timeline P0 send compute r P0 send compute r = P1 recv compute = P1 recv compute s = P0 posts a receive to receive par,al sum = P1 posts a send with par,al sum Case study: parallel sum Blocking rou>nes & deadlocks Memory P0 P1 Step 4: compute final answer Timeline P0 send compute r P1 recv compute s P0 sums the par,al sums send() and recv() are blocking rou,nes the func,ons exit only once it is safe to use the data (memory) involved in the communica,on Comple,on depends on other processes => risk for deadlocks for example, if all processes call recv() there is no-one leh to call a corresponding send() and the program is stuck forever = Typical point-to-point communica>on parerns Communica>ng NumPy arrays Pairwise exchange Process 0 Process 1 Process 2 Process 3 Pipe, a ring of processes exchanging data Process 0 Process 1 Process 2 Process 3 Arbitrary Python objects are converted to byte streams (pickled) when sending and back to Python objects (unpickled) when receiving these conversions may be a serious overhead to communica,on Con,guous memory buffers (such as NumPy arrays) can be communicated with very lidle overhead using upper case methods: Send(data, dest, tag) Recv(data, source, tag) Incorrect ordering of sends and receives may result in a deadlock note the difference in receiving: the data array has to exist at the,me of call

34 send-array.py from mpi4py import MPI import numpy comm = MPI.COMM_WORLD rank = comm.get_rank() Send/receive a NumPy array if rank == 0: data = numpy.arange(100, dtype=float) comm.send(data, dest=1, tag=13) elif rank == 1: data = numpy.empty(100, dtype=float) comm.recv(data, source=0, tag=13) Note the difference between upper/lower case! send/recv: general Python objects, slow Send/Recv: con,nuous arrays, fast Combined send and receive sendrecv.py data = numpy.arange(10, dtype=float) * (rank + 1) # send buffer buffer = numpy.empty(10, float) # receive buffer if rank == 0: comm.sendrecv(data, dest=1, sendtag=8, recvbuf=buffer, source=1, recvtag=22) elif rank == 1: comm.sendrecv(data, dest=0, sendtag=22, recvbuf=buffer, source=0, recvtag=8) Send one message and receive another with a single command reduces risk for deadlocks Des,na,on and source ranks can be same or different MPI.PROC_NULL can be used for no des0na0on/source MPI datatypes MPI has a number of predefined datatypes to represent data e.g. MPI.INT for integer and MPI.DOUBLE for float No need to specify the datatype for Python objects or Numpy arrays objects are serialised as byte streams automa,c detec,on for NumPy arrays If needed, one can also define custom datatypes for example to use non-con,guous data buffers Summary Point-to-point communica,on = messages are sent between two MPI processes Point-to-point opera,ons enable any parallel communica,on padern (in principle) Arbitrary Python objects (that can be pickled!) send / recv sendrecv Memory buffers such as Numpy arrays Send / Recv Sendrecv Non-blocking communica>on NON-BLOCKING COMMUNICATION Non-blocking sends and receives isend & irecv returns immediately and sends/receives in background return value is a Request object Enables some compu,ng concurrently with communica,on Avoids many common dead-lock situa,ons Non-blocking communica>on Typical usage parern Have to finalize send/receive opera,ons wait() Waits for the communica,on started with isend or irecv to finish (blocking) test() Tests if the communica,on has finished (non-blocking) You can mix non-blocking and blocking p2p rou,nes e.g., receive isend with recv request = comm.irecv(ghost_data) comm.isend(border_data) compute(ghost_independent_data) request.wait() compute(border_data) P 0 P 1 P 2

35 Non-blocking send/receive Interleaving communica,on and computa,on isend.py rank = comm.get_rank() size = comm.get_size() if rank == 0: data = arange(size, dtype=float) * (rank + 1) req = comm.isend(data, dest=1, tag=0) # start a send calculate_something(rank) #.. do something else.. MPI.Request.Wait(req) # wait for send to finish # safe to read/write data again elif rank == 1: data = empty(size, float) req = comm.irecv(data, source=0, tag=0) # post a receive calculate_something(rank) #.. do something else.. MPI.Request.Wait(req) # wait for receive to finish # data is now ready for use Addi>onal comple>on opera>ons Other useful rou,nes in MPI.Request Waitall(requests) wait for all ini,ated requests to complete Waitany(requests) wait for any ini,ated request to complete Test(request) test for the comple,on of a send or receive Communicator also has a blocking call e.g. to probe the incoming message size before pos,ng a receive comm.probe(status) Non-blocking collec>ves Non-blocking collec>ves New in MPI 3: no support in mpi4py Non-blocking collec,ves enable the overlapping of communica,on and computa,on together with the benefits of collec,ve communica,on Restric,ons have to be called in same order by all ranks in a communicator mixing of blocking and non-blocking collec,ves is not allowed MPI.Allreduce Two independent opera,ons NOT SUPPORTED by MPI for Python MPI_Iallreduce MPI_Wait Summary Non-blocking communica,on is usually the smart way to do point-to-point communica,on in MPI Non-blocking communica,on realiza,on isend / Isend irecv / Irecv MPI.Request.Wait(req) MPI-3 contains also non-blocking collec,ves, but these are not supported by MPI for Python COMMUNICATORS Communicators User-defined communicators MPI.COMM_WORLD Comm Comm 2 e.g. comm.split 1 0 Comm 3 By default a single, universal communicator exists to which all processes belong (MPI.COMM_WORLD) One can create new communicators, e.g. by spliong this into sub-groups split.py comm = MPI.COMM_WORLD rank = comm.get_rank() color = rank % 4 local_comm = comm.split(color) local_rank = local_comm.get_rank() print("global rank: %d Local rank: %d" % (rank, local_rank))

Collec>ve communica>on COLLECTIVE COMMUNICATION Collec,ve communica,on transmits data among all processes in a process group (communicator) These rou,nes must be called by all the processes in the

36 Collec>ve communica>on COLLECTIVE COMMUNICATION Collec,ve communica,on transmits data among all processes in a process group (communicator) These rou,nes must be called by all the processes in the group Collec,ve communica,on includes data movement collec,ve computa,on synchroniza,on Example comm.barrier() makes every task hold un,l all tasks in the communicator comm have called it Collec>ve communica>on Collec,ve communica,on typically outperforms point-topoint communica,on Code becomes more compact (and efficient!) and easier to maintain: p2p.py if rank == 0: for i in range(1, size): comm.send(data, i, tag) else: comm.recv(data, 0, tag) collective.py comm.bcast(data, 0) Collec>ve communica>on Amount of sent and received data must match No tag arguments Order of execu,on must coincide across processes Communica,ng a Numpy array of 1M elements from task 0 to all other tasks Broadcas>ng Send the same data from one process to all the other Processes A Local memory Bcast This buffer may contain mul,ple elements of any datatype. A A A A Broadcas>ng Broadcast sends same data to all processes bcast.py from mpi4py import MPI import numpy comm = MPI.COMM_WORLD rank = comm.get_rank() if rank == 0: n = 100 data = numpy.arange(n, dtype=float) comm.bcast(n, root=0) else: n = comm.bcast(none, root=0) # returns the value! data = numpy.zeros(n, float) # prepare a receive buffer comm.bcast(data, root=0) # in-place modification on the receive side ScaRering Send equal amount of data from one process to others Processes Send buffer Recv buffer A B C D A Scatter B C D Segments A, B, may contain mul,ple elements ScaRering Scader distributes data to processes scatter.py from mpi4py import MPI from numpy import arange, empty comm = MPI.COMM_WORLD rank = comm.get_rank() size = comm.get_size() if rank == 0: py_data = range(size) data = arange(size**2, dtype=float) else: py_data = None data = None buffer = empty(size, float) # prepare a receive buffer new_data = comm.scatter(py_data, root=0) # returns the value comm.scatter(data, buffer, root=0) # in-place modification

Gathering Collect data from all the process to one process Processes Send buffer Recv buffer A B C D Gather A B C D Gathering Gather pulls data from all processes gather.

37 Gathering Collect data from all the process to one process Processes Send buffer Recv buffer A B C D Gather A B C D Gathering Gather pulls data from all processes gather.py from mpi4py import MPI from numpy import arange, zeros comm = MPI.COMM_WORLD rank = comm.get_rank() size = comm.get_size() data = arange(10, dtype=float) * (rank + 1) buffer = zeros(size * 10, float) n = comm.gather(rank, root=0) # returns the value comm.gather(data, buffer, root=0) # in-place modification Segments A, B, may contain mul,ple elements Reduce opera>on Applies an opera,on over set of processes and places result in single process Send buffer A 0 B 0 C 0 D 0 Reduce Recv buffer A i B i C i D i Reduce opera>on Reduce gathers data and applies an opera,on on it reduce.py from mpi4py import MPI from numpy import arange, empty comm = MPI.COMM_WORLD rank = comm.get_rank() size = comm.get_size() Processes A 1 B 1 C 1 D 1 A 2 B 2 C 2 D 2 (sum) data = arange(10 * size, dtype=float) * (rank + 1) buffer = zeros(size * 10, float) n = comm.reduce(rank, op=mpi.sum, root=0) # returns the value comm.reduce(data, buffer, op=mpi.sum, root=0) # in-place modification A 3 B 3 C 3 D 3 Other common collec>ve opera>ons Scaderv each process receives different amount of data Gatherv each process sends different amount of data Allreduce all processes receive the results of reduc,on Alltoall each process sends and receives to/from each other Alltoallv each process sends and receives different amount of data to/from each other Common mistakes with collec>ves Using a collec,ve opera,on within one branch of an iftest of the rank if rank == 0: comm.bcast(...) all processes in a communicator must call a collec,ve rou,ne! Assuming that all processes making a collec,ve call would complete at the same,me Using the input buffer as the output buffer comm.scatter(a, a, MPI.SUM) Summary Collec,ve communica,ons involve all the processes within a communicator all processes must call them Collec,ve opera,ons make code more transparent and compact Collec,ve rou,nes allow op,miza,ons by MPI library On-line resources Documenta,on for mpi4py is quite limited short on-line manual and API reference available at hdp://pythonhosted.org/mpi4py/ Some good references: "A Python Introduc,on to Parallel Programming with MPI" by Jeremy Bejarano hdp://materials.jeremybejarano.com/mpiwithpython/ "mpi4py examples" by Jörg Bornschein hdps://github.com/jbornschein/mpi4py-examples

38 mpi4py performance Summary Ping-pong test mpi4py provides Python interface to MPI MPI calls via communicator object Possible to communicate arbitrary Python objects NumPy arrays can be communicated with nearly same speed as from C/Fortran

Numpy fast array interface

NUMPY Numpy fast array interface Standard Python is not well suitable for numerical computations lists are very flexible but also slow to process in numerical computations Numpy adds a new array data type