Jussi Enkovaara, Martti Louhivuori. Python in High-Performance Computing. CSC IT Center for Science Ltd, Finland, January 29-31, 2018



All material (C) 2018 by the authors. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Agenda

Monday
- 9:00-9:15 Python and HPC
- 9:15-10:00 NumPy: fast array interface to Python
- 10:00-10:30 Exercises
- 10:30-10:45 Coffee break
- 10:45-11:00 NumPy tools
- 11:00-12:00 Exercises
- 12:00-13:00 Lunch break
- 13:00-13:45 Advanced indexing, vectorized operations & broadcasting, numexpr
- 13:45-14:30 Exercises
- 14:30-14:45 Coffee break
- 14:45-15:30 Performance analysis
- 15:30-16:30 Exercises

Tuesday
- Optimising Python with Cython
- 9:45-10:30 Cython cont.
- Coffee break
- 10:45-12:00 Exercises
- Lunch break
- Interfacing external libraries
- 13:45-14:30 Exercises
- Coffee break
- Multiprocessing
- 15:30-16:30 Exercises

Wednesday
- MPI introduction
- 9:45-10:30 Point-to-point communication
- Coffee break
- 10:45-12:00 Exercises
- Lunch break
- Non-blocking communication and communicators
- 13:45-14:30 Exercises
- Coffee break
- Collective communications
- 15:30-16:30 Exercises
- 16:30-16:45 Summary of Python HPC strategies


PYTHON AND HIGH-PERFORMANCE COMPUTING


Efficiency
- Python is an interpreted language
  - no pre-compiled binaries, all code is translated on-the-fly to machine instructions
  - byte-code is used as a middle step and may be stored (.pyc)
- All objects are dynamic in Python
  - nothing is fixed == optimisation nightmare
  - lot of overhead from metadata
- Flexibility is good, but comes with a cost!

Improving Python performance
- Array based computations with NumPy
- Using the extended Cython programming language
- Embedding compiled code in a Python program
  - C, Fortran
- Utilizing parallel processing

Parallelisation strategies for Python
- Global Interpreter Lock (GIL)
  - CPython's memory management is not thread-safe
  - threads cannot execute Python code in parallel, so threading helps only for I/O etc.
  - affects overall performance if threading
- Process-based "threading" with multiprocessing
  - fork independent processes that have a limited way to communicate
- Message-passing is the Way to Go to achieve true parallelism in Python


NUMPY BASICS


Numpy: fast array interface
- Standard Python is not well suited for numerical computations
  - lists are very flexible but also slow to process in numerical computations
- Numpy adds a new array data type
  - static, multidimensional
  - fast processing of arrays
  - some linear algebra, random numbers

Numpy arrays
- All elements of an array have the same type
- Array can have multiple dimensions
- The number of elements in the array is fixed, shape can be changed

Python list vs. NumPy array
[Figure: memory layout of a Python list and of a NumPy array]

Creating numpy arrays
From a list:

    >>> import numpy as np
    >>> a = np.array((1, 2, 3, 4), float)
    >>> a
    array([ 1., 2., 3., 4.])
    >>>
    >>> list1 = [[1, 2, 3], [4, 5, 6]]
    >>> mat = np.array(list1, complex)
    >>> mat
    array([[ 1.+0.j, 2.+0.j, 3.+0.j],
           [ 4.+0.j, 5.+0.j, 6.+0.j]])
    >>> mat.shape
    (2, 3)
    >>> mat.size
    6

Creating numpy arrays
More ways for creating arrays:

    >>> import numpy as np
    >>> a = np.arange(10)
    >>> a
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>>
    >>> b = np.linspace(-4.5, 4.5, 5)
    >>> b
    array([-4.5 , -2.25, 0. , 2.25, 4.5 ])
    >>>
    >>> c = np.zeros((4, 6), float)
    >>> c.shape
    (4, 6)
    >>>
    >>> d = np.ones((2, 4))
    >>> d
    array([[ 1., 1., 1., 1.],
           [ 1., 1., 1., 1.]])

Indexing and slicing arrays
Simple indexing:

    >>> mat = np.array([[1, 2, 3], [4, 5, 6]])
    >>> mat[0,2]
    3
    >>> mat[1,-2]
    5

Slicing:

    >>> a = np.arange(5)
    >>> a[2:]
    array([2, 3, 4])
    >>> a[:-1]
    array([0, 1, 2, 3])
    >>> a[1:3] = -1
    >>> a
    array([ 0, -1, -1, 3, 4])

Indexing and slicing arrays
Slicing is possible over all dimensions:

    >>> a = np.arange(10)
    >>> a[1:7:2]
    array([1, 3, 5])
    >>>
    >>> a = np.zeros((4, 4))
    >>> a[1:3, 1:3] = 2.0
    >>> a
    array([[ 0., 0., 0., 0.],
           [ 0., 2., 2., 0.],
           [ 0., 2., 2., 0.],
           [ 0., 0., 0., 0.]])

Views and copies of arrays
- Simple assignment creates references to arrays
- Slicing creates views to the arrays
- Use copy() for real copying of arrays

example.py

    import numpy as np

    a = np.arange(10)
    b = a                # reference, changing values in b changes a
    b = a.copy()         # true copy
    c = a[1:4]           # view, changing c changes elements [1:4] of a
    c = a[1:4].copy()    # true copy of subarray
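The difference between a reference, a view and a copy can be checked directly. The following is a minimal sketch of the rules listed above; the variable names and the use of `np.shares_memory` are illustrative choices, not part of the original slides.

```python
import numpy as np

a = np.arange(10)

b = a                  # plain assignment: b is another name for the same array
view = a[1:4]          # basic slicing: a view onto a's memory
copy = a[1:4].copy()   # explicit copy with its own memory

view[0] = -1           # writes through to a[1]
copy[1] = 99           # does not touch a

same_object = b is a
view_shares = np.shares_memory(a, view)
copy_shares = np.shares_memory(a, copy)

print(a)
print(same_object, view_shares, copy_shares)
```

Only the write through `view` shows up in `a`; the write to `copy` stays local to the copy.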

Array manipulation
reshape: change the shape of array

    >>> mat = np.array([[1, 2, 3], [4, 5, 6]])
    >>> mat
    array([[1, 2, 3],
           [4, 5, 6]])
    >>> mat.reshape(3,2)
    array([[1, 2],
           [3, 4],
           [5, 6]])

ravel: flatten array to 1-d

    >>> mat.ravel()
    array([1, 2, 3, 4, 5, 6])

Array manipulation
concatenate: join arrays together

    >>> mat1 = np.array([[1, 2, 3], [4, 5, 6]])
    >>> mat2 = np.array([[7, 8, 9], [10, 11, 12]])
    >>> np.concatenate((mat1, mat2))
    array([[ 1, 2, 3],
           [ 4, 5, 6],
           [ 7, 8, 9],
           [10, 11, 12]])
    >>> np.concatenate((mat1, mat2), axis=1)
    array([[ 1, 2, 3, 7, 8, 9],
           [ 4, 5, 6, 10, 11, 12]])

split: split array to N pieces

    >>> np.split(mat1, 3, axis=1)
    [array([[1], [4]]), array([[2], [5]]), array([[3], [6]])]

Array operations
Most operations for numpy arrays are done elementwise: +, -, *, /, **

    >>> a = np.array([1.0, 2.0, 3.0])
    >>> b = 2.0
    >>> a * b
    array([ 2., 4., 6.])
    >>> a + b
    array([ 3., 4., 5.])
    >>> a * a
    array([ 1., 4., 9.])

Array operations
Numpy has special functions which can work with array arguments: sin, cos, exp, sqrt, log, ...

    >>> import numpy, math
    >>> a = numpy.linspace(-math.pi, math.pi, 8)
    >>> a
    array([...])
    >>> numpy.sin(a)
    array([...])
    >>>
    >>> math.sin(a)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    TypeError: only length-1 arrays can be converted to Python scalars

NUMPY TOOLS

I/O with Numpy
- Numpy provides functions for reading data from files and for writing data into files
- Simple text files
  - numpy.loadtxt
  - numpy.savetxt
  - Data in regular column layout
  - Can deal with comments and different column delimiters

Random numbers
- The module numpy.random provides several functions for constructing random arrays
  - random: uniform random numbers
  - normal: normal distribution
  - poisson: Poisson distribution
  - ...

    >>> import numpy.random as rnd
    >>> rnd.random((2,2))
    array([...])
    >>> rnd.poisson(size=(2,2))

Polynomials
- A polynomial is defined by an array of coefficients p:
  $p(x, N) = p[0]\,x^{N-1} + p[1]\,x^{N-2} + \dots + p[N-1]$
- Least square fitting: numpy.polyfit
- Evaluating polynomials: numpy.polyval
- Roots of polynomial: numpy.roots
- ...

    >>> x = np.linspace(-4, 4, 7)
    >>> y = x**2 + rnd.random(x.shape)
    >>>
    >>> p = np.polyfit(x, y, 2)
    >>> p
    array([...])
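The three polynomial tools named above can be combined in a short, reproducible sketch. Unlike the slide example, it uses noise-free data (an assumption made here so the fitted coefficients are predictable), sampled from y = x**2 - 4.

```python
import numpy as np

# Noise-free samples of y = x**2 - 4 (illustrative data, not from the slides)
x = np.linspace(-4.0, 4.0, 7)
y = x**2 - 4.0

p = np.polyfit(x, y, 2)           # coefficients, highest power first
y_fit = np.polyval(p, x)          # evaluate the fitted polynomial at x
r = np.sort(np.roots(p).real)     # roots of the fitted polynomial

print(p)
print(r)
```

The coefficient order matches the definition above: `p[0]` multiplies the highest power and the last entry is the constant term.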

Linear algebra
- Numpy can calculate matrix and vector products efficiently: dot, vdot, ...
- Eigenproblems: linalg.eig, linalg.eigvals, ...
- Linear systems and matrix inversion: linalg.solve, linalg.inv

    >>> A = np.array(((2, 1), (1, 3)))
    >>> B = np.array(((-2, 4.2), (4.2, 6)))
    >>> C = np.dot(A, B)
    >>>
    >>> b = np.array((1, 2))
    >>> np.linalg.solve(C, b)  # solve C x = b
    array([...])

Linear algebra
- Normally, NumPy utilises high performance libraries in linear algebra operations
- Example: matrix multiplication C = A * B, matrix dimension 200
  - pure python: 5.30 s
  - naive C: 0.09 s
  - numpy.dot: 0.01 s

NUMPY ADVANCED TOPICS

Anatomy of NumPy array
- ndarray type is made of
  - one dimensional contiguous block of memory (raw data)
  - indexing scheme: how to locate an element
  - data type descriptor: how to interpret an element

NumPy indexing
- There are many possible ways of arranging items of an N-dimensional array in a 1-dimensional block
- NumPy uses striding, where the N-dimensional index $(n_0, n_1, \dots, n_{N-1})$ corresponds to an offset from the beginning of the 1-dimensional block:
  $\text{offset} = \sum_{k=0}^{N-1} s_k n_k$, where $s_k$ is the stride in dimension $k$

ndarray attributes

    a = np.array(...)

- a.flags: various information about memory layout
- a.strides: bytes to step in each dimension when traversing
- a.itemsize: size of one array element in bytes
- a.data: Python buffer object pointing to start of the array's data
- a.__array_interface__: Python internal interface

Advanced indexing
- Numpy arrays can be indexed also with other arrays (integer or boolean)

    >>> x = np.arange(10,1,-1)
    >>> x
    array([10, 9, 8, 7, 6, 5, 4, 3, 2])
    >>> x[np.array([3, 3, 1, 8])]
    array([7, 7, 9, 2])

- Boolean "mask" arrays

    >>> m = x > 7
    >>> m
    array([ True, True, True, False, False, ...
    >>> x[m]
    array([10, 9, 8])

- Advanced indexing creates copies of arrays

Vectorized operations
- for loops in Python are slow
- Use vectorized operations when possible
- Example: difference

example.py

    import numpy as np

    # brute force using a for loop
    arr = np.arange(1000)
    dif = np.zeros(999, int)
    for i in range(1, len(arr)):
        dif[i-1] = arr[i] - arr[i-1]

    # vectorized operation
    arr = np.arange(1000)
    dif = arr[1:] - arr[:-1]

- The for loop is ~80 times slower!
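The striding formula from the NumPy indexing slide can be verified on a small array. This is a sketch under stated assumptions: a C-ordered 3 x 4 array of 64-bit integers is chosen so that the strides are known, and the raw bytes are read back with `np.frombuffer`.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)   # C-ordered 3 x 4 array

index = (2, 1)
# offset = sum over k of s_k * n_k, with the strides s_k given in bytes
offset = sum(s * n for s, n in zip(a.strides, index))

# Read the element straight from the raw one-dimensional block of bytes
raw = a.tobytes()
value = np.frombuffer(raw, dtype=a.dtype, count=1, offset=offset)[0]

print(a.strides, offset, value, a[index])
```

With 8-byte elements the strides are (32, 8): stepping one row skips four elements, stepping one column skips one.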

Broadcasting
- If array shapes are different, the smaller array may be broadcast into a larger shape

    >>> from numpy import array
    >>> a = array([[1,2],[3,4],[5,6]], float)
    >>> a
    array([[ 1., 2.],
           [ 3., 4.],
           [ 5., 6.]])
    >>> b = array([[7,11]], float)
    >>> b
    array([[ 7., 11.]])
    >>>
    >>> a * b
    array([[ 7., 22.],
           [ 21., 44.],
           [ 35., 66.]])

Broadcasting
- Example: calculate distances from a given point

example.py

    import numpy as np

    # array containing 3d coordinates for 100 points
    points = np.random.random((100, 3))
    origin = np.array((1.0, 2.2, -2.2))
    dists = (points - origin)**2
    dists = np.sqrt(np.sum(dists, axis=1))
    # find the most distant point
    i = np.argmax(dists)
    print(points[i])

Temporary arrays
- In complex expressions, NumPy stores intermediate values in temporary arrays
- Memory consumption can be higher than expected

example.py

    a = np.random.random((1024, 1024, 50))
    b = np.random.random((1024, 1024, 50))

    # two temporary arrays will be created (temp1 = 2.0 * a, temp2 = 4.5 * b)
    c = 2.0 * a - 4.5 * b

    # three temporary arrays will be created due to unnecessary parenthesis
    c = (2.0 * a - 4.5 * b) * (np.sin(a) + np.cos(b))

Temporary arrays
- Broadcasting approaches can lead also to hidden temporary arrays
- Example: pairwise distance of M points in 3 dimensions
  - Input data is M x 3 array
  - Output is M x M array containing the distance between points i and j

example.py

    X = np.random.random((1000, 3))
    # creates a temporary 1000 x 1000 x 3 array
    D = np.sqrt(((X[:, np.newaxis, :] - X) ** 2).sum(axis=-1))

Numexpr
- Evaluation of complex expressions with one operation at a time can lead also into suboptimal performance
  - Effectively, one carries out multiple for loops in the NumPy C-code
- Numexpr package provides fast evaluation of array expressions

example.py

    import numexpr as ne
    import numpy as np

    # N: array length (not given here)
    x = np.random.random((N, 1))
    y = np.random.random((N, 1))
    poly = ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")

Numexpr
- By default, numexpr tries to use multiple threads
- Number of threads can be queried and set with ne.set_num_threads(nthreads)
- Supported operators and functions: +, -, *, /, **, sin, cos, tan, exp, log, sqrt
- Speedups in comparison to NumPy are typically between 0.95 and 4
- Works best on arrays that do not fit in CPU cache

Summary
- Numpy provides a static array data structure
- Multidimensional arrays
- Fast mathematical operations for arrays
- Tools for linear algebra and random numbers
- Arrays can be broadcast into same shapes
- Expression evaluation can lead into temporary arrays
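One way to limit the hidden temporary in the pairwise-distance example is to accumulate the squared differences one coordinate at a time into preallocated arrays. This is only a sketch of that idea, not a method from the slides; the array size (200 points) and the seeded generator are assumptions chosen to keep the example small and reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 3))          # M = 200 points in 3 dimensions

# Broadcasting version: builds a temporary M x M x 3 array
D_ref = np.sqrt(((X[:, np.newaxis, :] - X) ** 2).sum(axis=-1))

# Coordinate-by-coordinate version: work arrays are only M x M
M = X.shape[0]
D = np.zeros((M, M))
diff = np.empty((M, M))
for k in range(X.shape[1]):
    # diff[i, j] = X[i, k] - X[j, k], written into the preallocated array
    np.subtract(X[:, k, np.newaxis], X[np.newaxis, :, k], out=diff)
    np.multiply(diff, diff, out=diff)   # square in place
    D += diff                           # accumulate squared distances
np.sqrt(D, out=D)                       # in-place square root

print(np.allclose(D, D_ref))
```

The loop trades one large vectorized expression for three smaller ones, so it may be slower for small inputs; the benefit is the reduced peak memory.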

PERFORMANCE MEASUREMENT


Measuring application performance
- Correctness is the most important factor in any application
  - Premature optimization is the root of all evil!
- Before starting to optimize an application, one should measure where time is spent
  - Typically 90 % of time is spent in 10 % of the application
- Application's own timers
- timeit module
- cProfile module
- Full-fledged profiling tools: TAU, Intel VTune, Python Tools for Visual Studio

Measuring application performance
- Python time module can be used for measuring time spent in a specific part of the program
  - time.time(), time.clock() (removed in Python 3.8), ...
  - In Python 3: time.perf_counter(), time.process_time()

timing.py

    import time

    t0 = time.time()
    for n in range(niter):
        heavy_calculation()
    t1 = time.time()
    print("Time spent in heavy calculation", t1 - t0)

timeit module
- Easy timing of small bits of Python code
- Tries to avoid common pitfalls in measuring execution times
- Command line interface and Python interface

    $ python -m timeit -s "from mymodule import func" "func()"
    10 loops, best of 3: 433 msec per loop

- %timeit magic in IPython

    In [1]: from mymodule import func
    In [2]: %timeit func()
    10 loops, best of 3: 433 msec per loop

cProfile
- Execution profile of Python program
  - Time spent in different parts of the program
  - Call graphs
- Python API:

profile.py

    import cProfile

    # profile statement and save results to a file func.prof
    cProfile.run('func()', 'func.prof')

- Profiling whole program from command line

    $ python -m cProfile -o myprof.prof myprogram.py

Investigating profile with pstats
- Printing execution time of selected functions
- Sorting by function name, time, cumulative time, ...
- Python module interface and interactive browser

    In [1]: from pstats import Stats
    In [2]: p = Stats('myprof.prof')
    In [3]: p.strip_dirs()
    In [4]: p.sort_stats('time')
    In [5]: p.print_stats(5)
    Mon Oct 12 10:11:... my.prof
    ...

    $ python -m pstats myprof.prof
    Welcome to the profile statistics
    % strip
    % sort time
    % stats 5
    Mon Oct 12 10:11:... my.prof
    ...

Summary
- Python has various built-in tools for measuring application performance
  - time module
  - timeit module
  - cProfile and pstats modules
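The cProfile and pstats steps described above can also be run in a single script without an intermediate file. The following is a minimal sketch; `heavy_calculation` is a hypothetical stand-in workload, and the report is captured in a string so it can be inspected or logged.

```python
import cProfile
import io
import pstats


def heavy_calculation(n=20000):
    """Stand-in workload (hypothetical): sum of squares in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Collect profiling data only around the call of interest
profiler = cProfile.Profile()
profiler.enable()
result = heavy_calculation()
profiler.disable()

# Format the five most expensive entries, sorted by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Enabling the profiler only around the region of interest keeps the report short; for whole-program profiles the command line form shown above is usually more convenient.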


CYTHON


Cython
- Optimising static compiler for Python
- Extended Cython programming language
- Tune readable Python code into plain C performance by adding static type declarations
- Easy interfacing to external C libraries

Python overheads
- Interpreting
- Boxing: everything is an object
- Function call overhead
- Global interpreter lock: no threading benefits (CPython)

Interpreting
- Cython command generates a C/C++ source file from a Cython source file
- C/C++ source is then compiled into an extension module
- Interpreting overhead is normally not drastic

setup.py

    from distutils.core import setup
    from Cython.Build import cythonize

    # Normally, one compiles cython extended code with .pyx ending
    setup(ext_modules=cythonize("mandel_cyt.py"))

    $ python setup.py build_ext --inplace

    In [1]: import mandel_cyt

Case study: Mandelbrot fractal

mandel.py

    def kernel(zr, zi, cr, ci, lim, cutoff):
        count = 0
        while ((zr*zr + zi*zi) < (lim*lim)) \
                and count < cutoff:
            # z -> z**2 + c, real and imaginary parts updated together
            zr, zi = zr * zr - zi * zi + cr, 2 * zr * zi + ci
            count += 1
        return count

- Pure Python: 2.71 s
- Compiled with Cython: 2.61 s

Boxing
- In Python, everything is an object
[Figure: adding two boxed integers. Each Integer object holds the int value (7, 6) plus other stuff; the types are checked (integers), int 7 + int 6 = int 13 is computed, and the result is boxed into a new Integer object.]

Static type declarations
- Cython extended code should have .pyx ending
  - Cannot be run with normal Python
- Types are declared with cdef keyword
  - In function signatures only type is given

example.py

    def integrate(f, a, b, N):
        s = 0
        dx = (b-a)/N
        for i in range(N):
            s += f(a+i*dx)
        return s * dx

example.pyx

    def integrate(f, double a, double b, int N):
        cdef double s = 0
        cdef int i
        cdef double dx = (b-a)/N
        for i in range(N):
            s += f(a+i*dx)
        return s * dx

Static type declarations
- Pure Python: 2.71 s
- Type declarations in kernel: 20.2 ms

Function call overhead
- Function calls in Python can involve lots of checking and boxing
- Overhead can be reduced by declaring functions to be C-functions
  - cdef keyword: functions can be called only from Cython
  - cpdef keyword: generate also Python wrapper (can have additional overhead in some cases)

Using C functions
- Static type declarations: 20.2 ms
- Kernel as C function: 12.5 ms

mandel.py

    cdef int kernel(double zr, double zi, ...):
        cdef int count = 0
        while ((zr*zr + zi*zi) < (lim*lim)) \
                and count < cutoff:
            # z -> z**2 + c, real and imaginary parts updated together
            zr, zi = zr * zr - zi * zi + cr, 2 * zr * zi + ci
            count += 1
        return count

NumPy arrays with Cython
- Cython supports fast indexing for NumPy arrays
- Type and dimensions of array have to be declared

numpy_example.py

    import numpy as np      # Normal NumPy import
    cimport numpy as cnp    # Import for NumPy C-API

    def func():
        # declarations can be made only in function scope
        cdef cnp.ndarray[cnp.int_t, ndim=2] data
        data = np.empty((N, N), dtype=int)
        for i in range(N):
            for j in range(N):
                data[i, j] = ...   # double loop is done in nearly C speed

Compiler directives
- Compiler directives can be used for turning off certain Python features for additional performance
  - boundscheck (False): assume no IndexErrors
  - wraparound (False): no negative indexing

numpy_example.py

    import numpy as np      # Normal NumPy import
    cimport numpy as cnp    # Import for NumPy C-API
    import cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def func():
        # declarations can be made only in function scope
        cdef cnp.ndarray[cnp.int_t, ndim=2] data
        data = np.empty((N, N), dtype=int)

Final performance
- Pure Python: 2.7 s
- Static type declarations: 20.2 ms
- Kernel as C function: 12.5 ms
- Fast indexing and directives: 2.4

Where to add types?
- Typing everything reduces readability and can even slow down the performance
- Profiling should be the first step when optimising
- Cython is able to provide an annotated HTML-report

HTML-report
- Lines are colored according to the level of "typedness"
  - white lines translate to pure C
  - lines that require the Python C-API are yellow (darker as they translate to more C-API interaction)

    $ cython -a cython_module.pyx
    $ firefox cython_module.html

Profiling Cython code
- By default, Cython code does not show up in the profile produced by cProfile
- Profiling can be enabled for the entire source file or on a per function basis

profiling.py

    # cython: profile=True
    import cython

    @cython.profile(False)
    cdef func():
        ...

profiling.py

    # cython: profile=False
    import cython

    @cython.profile(True)
    cdef func():
        ...

Summary
- Cython is an optimising static compiler for Python
- Possible to add type declarations with the Cython language
- Fast indexing for NumPy arrays
- In the best cases, huge speed-ups can be obtained
  - Some compromise for Python flexibility

Further functionality in Cython
- Using C structs and C++ classes in Cython
- Exceptions handling
- Parallelisation (threading) with Cython
- ...

INTERFACING EXTERNAL LIBRARIES

Increasing performance with compiled code
- There are Python interfaces for many high performance libraries
- However, sometimes one might want to utilize a library without a Python interface
  - Existing libraries
  - Own code written in C or Fortran
- Python C-API provides the most comprehensive way to extend Python
- cffi, Cython, and f2py can provide easier approaches

cffi
- C Foreign Function Interface for Python
- Interact with almost any C code
- C-like declarations within Python
  - Can often be copy-pasted from headers / documentation
- ABI and API modes
  - ABI does not require compilation
  - API can be more robust
  - Only ABI discussed here
- Some understanding of C required

cffi example

cffi_example.py

    from cffi import FFI
    import numpy as np

    ffi = FFI()
    lib = ffi.dlopen("./myclib.so")
    ffi.cdef("""void add(double *x, double *y, int n);""")
    ffi.cdef("""void subtract(double *x, double *y, int n);""")

    # N: array length (not given here)
    a = np.random.random((N, 1))
    b = np.zeros_like(a)

    # Pointer objects need to be passed to library
    aptr = ffi.cast("double *", ffi.from_buffer(a))
    bptr = ffi.cast("double *", ffi.from_buffer(b))

    lib.add(bptr, aptr, len(a))
    lib.subtract(bptr, aptr, len(a))

Interfacing C with Cython
- As Cython code compiles down to C code, it is relatively easy to call C functions
- C declarations need to be given in Cython code
- C sources or libraries need to be specified in setup.py

Cython example

cython_example.pyx

    import numpy as np
    cimport numpy as cnp

    cdef extern from "myclib.h":
        void add(double *a, double *b, int n)
        void subtract(double *a, double *b, int n)

    def add_py(cnp.ndarray[cnp.double_t, ndim=1] a,
               cnp.ndarray[cnp.double_t, ndim=1] b):
        add(&a[0], &b[0], len(a))

Cython example
Including C-code in Cython module

setup.py

    from distutils.core import setup, Extension
    from Cython.Build import cythonize

    # Specify all sources in Extension object
    ext = Extension("module_name",
                    sources=["cython_source.pyx", "c_source.c"])
    setup(ext_modules=cythonize(ext))

Cython example
Linking against C library

setup.py

    from distutils.core import setup, Extension
    from Cython.Build import cythonize

    # Specify all sources in Extension object
    ext = Extension("module_name",
                    sources=["cython_source.pyx"],
                    libraries=["name"],     # Cython module is linked against
                    library_dirs=["."])     # libname.so, looked in "."
    setup(ext_modules=cythonize(ext))

Interfacing with Fortran
- NumPy includes f2py for connecting Python with Fortran codes
- f2py creates a C wrapper for Fortran code that can then be used as an extension module

diff.f90

    subroutine diff(in, out, n)
      real*8, intent(in) :: in(n)
      real*8, intent(out) :: out(n-1)
      !f2py intent(in, out) :: out
      integer :: n
      ...

    $ f2py -c diff.f90 -m diff_mod

example.py

    import numpy as np
    import diff_mod

    a = np.arange(10.0)
    b = np.zeros_like(a[1:])
    diff_mod.diff(a, b, len(a))

Interfacing with Fortran: f2py nitty-gritties
- If multidimensional NumPy arrays are not stored in Fortran order, f2py creates copies of arrays
  - Modifications of arrays in Fortran are not seen in Python

example.py

    a = np.zeros((10, 10))
    modify(a)                   # a won't be modified
    a = np.asfortranarray(a)    # convert a to Fortran order
    modify(a)                   # a will now be modified

    # Array can be created in Fortran order in the first place
    a = np.zeros((10, 10), order="F")

Summary
- External libraries can be interfaced in various ways
- cffi provides easy interfacing
- Cython can give more control and sometimes better performance
- Fortran routines are called implicitly via C wrappers
  - f2py automates the creation of wrappers
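Whether an array is already in Fortran order, and therefore whether f2py would need to copy it, can be checked from Python before calling the wrapped routine. The sketch below uses only NumPy; the variable names are illustrative, and no Fortran routine is called.

```python
import numpy as np

a_c = np.zeros((10, 10))               # default C (row-major) order
a_f = np.asfortranarray(a_c)           # Fortran (column-major) version
a_f2 = np.zeros((10, 10), order="F")   # created in Fortran order directly

c_is_fortran = a_c.flags["F_CONTIGUOUS"]
f_is_fortran = a_f.flags["F_CONTIGUOUS"]

# Converting a C-ordered 2D array produces new memory ...
converted_is_copy = not np.shares_memory(a_c, a_f)
# ... while an array that is already Fortran-ordered is reused as it is
reused = np.shares_memory(np.asfortranarray(a_f2), a_f2)

print(a_c.strides, a_f.strides)
print(c_is_fortran, f_is_fortran, converted_is_copy, reused)
```

The strides show the difference in layout: in C order the last index varies fastest in memory, in Fortran order the first one does.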

MULTIPROCESSING: PROCESS BASED "THREADING"


Processes and threads
[Figure: timelines of serial and parallel regions, shown for parallel processes and for threads]

Process
- Independent execution units
- Have their own state information and own memory address space

Thread
- A single process may contain multiple threads
- Have their own state information, but share the same memory address space

Processes and threads

Process
- Long-lived: created when the parallel program is started, killed when the program is finished
- Explicit communication between processes

Thread
- Short-lived: created when entering a parallel region, destroyed (joined) when the region ends
- Communication through shared memory

Processes and threads

Process
- MPI
  - good performance
  - scales from a laptop to a supercomputer

Thread
- OpenMP
  - C / Fortran, not Python
- threading module
  - only for I/O bound tasks (maybe)
  - Global Interpreter Lock (GIL) limits usability

Processes and threads
[Figure: master process; master + workers; master process]

Process
- multiprocessing module
  - relies on the OS for forking worker processes that mimic threads
  - limited communication between the parallel processes

Multiprocessing
- Underlying OS used to spawn new independent subprocesses
  - processes are independent and execute code in an asynchronous manner
  - no guarantee on the order of execution
- Communication possible only through dedicated, shared communication channels
  - Queues, Pipes
  - must be created before a new process is forked

Spawn a process

spawn.py

    from multiprocessing import Process
    import os

    def hello(name):
        print('Hello', name)
        print('My PID is', os.getpid())
        print("My parent's PID is", os.getppid())

    # Create a new process
    p = Process(target=hello, args=('alice', ))

    # Start the process
    p.start()
    print('Spawned a new process from PID', os.getpid())

    # End the process
    p.join()

Communication
- Sharing data
  - shared memory, data manager
- Pipes
  - direct communication between two processes
- Queues
  - work sharing among a group of processes
- Pool of workers
  - offloading tasks to a group of worker processes

Queues
- FIFO (first-in-first-out) task queues that can be used to distribute work among processes
- Shared among all processes
  - all processes can add and retrieve data from the queue
- Automatically takes care of locking, so can be used safely with minimal hassle

Queues

task-queue.py

    from multiprocessing import Process, Queue

    def f(q):
        while True:
            x = q.get()
            if x is None:   # if sentinel, stop execution
                break
            print(x**2)

    q = Queue()
    for i in range(100):
        q.put(i)
    # task queue: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 99]

    for i in range(3):
        q.put(None)   # add sentinels to the queue to signal STOP
        p = Process(target=f, args=(q, ))
        p.start()

Pool of workers
- Group of processes that carry out tasks assigned to them
  1. Master process submits tasks to the pool
  2. Pool of worker processes perform the tasks
  3. Master process retrieves the results from the pool
- Blocking and non-blocking (= asynchronous) calls available

Pool of workers

pool.py

    from multiprocessing import Pool
    import time

    def f(x):
        return x**2

    pool = Pool(8)

    # Blocking execution (with a single process)
    result = pool.apply(f, (4,))
    print(result)

    # Non-blocking execution "in the background"
    result = pool.apply_async(f, (12,))
    while not result.ready():
        time.sleep(1)
    print(result.get())
    # an alternative to "sleeping" is to use e.g. result.get(timeout=1)

Pool of workers

pool-map.py

    from multiprocessing import Pool
    import time

    def f(x):
        return x**2

    pool = Pool(8)

    # calculate x**2 in parallel for x in 0..9
    result = pool.map(f, range(10))
    print(result)

    # non-blocking alternative
    result = pool.map_async(f, range(10))
    while not result.ready():
        time.sleep(1)
    print(result.get())

Multiprocessing summary
- Parallelism achieved by launching new OS processes
- Only limited communication possible
  - work sharing: queues / pool of workers
- Non-blocking execution available
  - do something else while waiting for results
- Further information: https://docs.python.org/2/library/multiprocessing.html

MESSAGE PASSING INTERFACE


Message passing interface
- MPI is an application programming interface (API) for communication between separate processes
  - The most widely used approach for distributed parallel computing
- MPI programs are portable and scalable
  - the same program can run on different types of computers, from PCs to supercomputers
- MPI is flexible and comprehensive
  - large (over 120 procedures)
  - concise (often only 6 procedures are needed)
- MPI standard defines C and Fortran interfaces
  - MPI for Python (mpi4py) provides an unofficial Python interface

Processes and threads
[Figure: timelines of serial and parallel regions for parallel processes and threads]

Process
- Independent execution units
- Have their own state information and own memory address space

Thread
- A single process may contain multiple threads
- Have their own state information, but share the same memory address space

Execution model
- MPI program is launched as a set of independent, identical processes
  - execute the same program code and instructions
  - can reside in different nodes (or even in different computers)
- The way to launch an MPI program depends on the system
  - mpirun, mpiexec, aprun, srun, ...
  - aprun on sisu.csc.fi, srun on taito.csc.fi

MPI rank
- Rank: ID number given to a process
  - it is possible to query for rank
  - processes can perform different tasks based on their rank

example.py

    if (rank == 0):
        # do something
    elif (rank == 1):
        # do something else
    else:
        # all other processes do something different

Data model
- Each MPI process has its own separate memory space, i.e. all variables and data structures are local to the process
- Processes can exchange data by sending and receiving messages
[Figure: process 1 (rank 0) holds a = 1.0, b = 2.0; process 2 (rank 1) holds a = -0.7, b = 3.0; the processes exchange MPI messages]

MPI communicator
- Communicator: a group containing all the processes that will participate in communication
  - in mpi4py all MPI calls are implemented as methods of a communicator object
  - MPI_COMM_WORLD contains all processes (MPI.COMM_WORLD in mpi4py)
  - user can define custom communicators

Routines in MPI for Python
- Communication between processes
  - sending and receiving messages between two processes
  - sending and receiving messages between several processes
- Synchronization between processes
- Communicator creation and manipulation
- Advanced features (e.g. user defined datatypes, one-sided communication and parallel I/O)

Getting started
- Basic methods of communicator object
  - Get_size(): number of processes in communicator
  - Get_rank(): rank of this process

hello.py

    from mpi4py import MPI

    comm = MPI.COMM_WORLD   # communicator object containing all processes
    size = comm.Get_size()
    rank = comm.Get_rank()
    print("I am rank %d in group of %d processes" % (rank, size))

Running an example program

    $ mpirun -np 4 python hello.py
    I am rank 2 in group of 4 processes
    I am rank 0 in group of 4 processes
    I am rank 3 in group of 4 processes
    I am rank 1 in group of 4 processes

POINT-TO-POINT COMMUNICATION

MPI communication
- MPI processes are independent, they communicate to coordinate work
- Point-to-point communication
  - Messages are sent between two processes
- Collective communication
  - Involving a number of processes at the same time

MPI point-to-point operations
- One process sends a message to another process that receives it
- Sends and receives in a program should match: one receive per send

Sending and receiving data
- Sending and receiving a dictionary

send.py

    from mpi4py import MPI

    comm = MPI.COMM_WORLD   # communicator object containing all processes
    rank = comm.Get_rank()

    if rank == 0:
        data = {'a': 7, 'b': 3.14}
        comm.send(data, dest=1, tag=11)
    elif rank == 1:
        data = comm.recv(source=0, tag=11)

Sending and receiving data
- Arbitrary Python objects can be communicated with the send and receive methods of a communicator
- send(data, dest, tag)
  - data: Python object to send
  - dest: destination rank
  - tag: ID given to the message
- recv(source, tag)
  - source: source rank
  - tag: ID given to the message
  - data is provided as return value
- Destination and source ranks as well as tags have to match

Case study: parallel sum
- Array originally on process #0 (P0)
- Parallel algorithm
  - Scatter: half of the array is sent to process 1
  - Compute: P0 & P1 sum independently their segments
  - Reduction: partial sum on P1 is sent to P0, P0 sums the partial sums
[Figure: memory of P0 and P1]

Case study: parallel sum
- Step 1.1: Receive operation in scatter
  - P1 posts a receive to receive half of the array from P0
[Figure: memory of P0 and P1; timeline with P1 in recv]

Case study: parallel sum

- Step 1.2: Send operation in scatter
  - P0 posts a send to send the lower part of the array to P1
- Step 2: Compute the sum in parallel
  - P0 & P1 compute their partial sums and store them locally
- Step 3.1: Receive operation in reduction
  - P0 posts a receive to receive the partial sum
- Step 3.2: Send operation in reduction
  - P1 posts a send with the partial sum
- Step 4: Compute final answer
  - P0 sums the partial sums

[Figure: memory of P0 and P1 and a timeline for each step; P0: send, compute, recv; P1: recv, compute, send]

Blocking routines & deadlocks

- send() and recv() are blocking routines
  - the functions exit only once it is safe to use the data (memory) involved in the communication
- Completion depends on other processes => risk for deadlocks
  - for example, if all processes call recv() there is no-one left to call a corresponding send() and the program is stuck forever

Typical point-to-point communication patterns

- Pairwise exchange
- Pipe, a ring of processes exchanging data
- Incorrect ordering of sends and receives may result in a deadlock

[Figure: Process 0 to Process 3 in a pairwise exchange and in a pipe/ring]

Communicating NumPy arrays

- Arbitrary Python objects are converted to byte streams (pickled) when sending and back to Python objects (unpickled) when receiving
  - these conversions may be a serious overhead to communication
- Contiguous memory buffers (such as NumPy arrays) can be communicated with very little overhead using upper case methods:
  - Send(data, dest, tag)
  - Recv(data, source, tag)
  - note the difference in receiving: the data array has to exist at the time of call
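The parallel sum case study can be sketched with the upper-case Send and Recv methods. This is only a minimal sketch, not the course's reference solution: the helper name halves(), the tags 1 and 2, the array length and the serial fallback for a single process are assumptions made for illustration. The mpi4py import is guarded so that the pure-NumPy helper can be tried without MPI.

```python
import numpy as np

try:
    from mpi4py import MPI
except ImportError:  # the pure-NumPy helper below still works without MPI
    MPI = None


def halves(array):
    """Split a 1-D array into two contiguous halves (hypothetical helper)."""
    middle = len(array) // 2
    return array[:middle], array[middle:]


def parallel_sum(comm, n=100):
    """Sum 0..n-1 using ranks 0 and 1; the result is returned on rank 0 only."""
    rank = comm.Get_rank()
    if comm.Get_size() < 2:
        # Assumed fallback: with a single process there is nobody to send to
        return float(np.arange(n, dtype=float).sum())

    if rank == 0:
        data = np.arange(n, dtype=float)       # array originally on P0
        kept, sent = halves(data)
        comm.Send(sent, dest=1, tag=1)         # scatter: one half goes to P1
        partial = kept.sum()                   # compute on P0
        other = np.empty(1, dtype=float)       # receive buffer must exist
        comm.Recv(other, source=1, tag=2)      # reduction: partial sum from P1
        return float(partial + other[0])
    if rank == 1:
        segment = np.empty(n - n // 2, dtype=float)
        comm.Recv(segment, source=0, tag=1)    # scatter: receive the half
        partial = np.array([segment.sum()])    # compute on P1
        comm.Send(partial, dest=0, tag=2)      # reduction: send partial sum
    return None


if __name__ == "__main__":
    if MPI is None:
        print("mpi4py is not installed; only halves() can be tried")
    else:
        total = parallel_sum(MPI.COMM_WORLD)
        if total is not None:
            print("total =", total)
```

It could be started with, e.g., mpirun -np 2 python parallel_sum.py (the file name is hypothetical). Each send is matched by exactly one receive with the same tag, and the receives are posted in an order that cannot deadlock.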

Send/receive a NumPy array

send-array.py

    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = numpy.arange(100, dtype=float)
        comm.Send(data, dest=1, tag=13)
    elif rank == 1:
        data = numpy.empty(100, dtype=float)
        comm.Recv(data, source=0, tag=13)

- Note the difference between upper/lower case!
  - send/recv: general Python objects, slow
  - Send/Recv: contiguous arrays, fast

Combined send and receive

sendrecv.py

    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    data = numpy.arange(10, dtype=float) * (rank + 1)  # send buffer
    buffer = numpy.empty(10, float)                    # receive buffer

    if rank == 0:
        comm.Sendrecv(data, dest=1, sendtag=8,
                      recvbuf=buffer, source=1, recvtag=22)
    elif rank == 1:
        comm.Sendrecv(data, dest=0, sendtag=22,
                      recvbuf=buffer, source=0, recvtag=8)

- Send one message and receive another with a single command
  - reduces risk for deadlocks
- Destination and source ranks can be same or different
  - MPI.PROC_NULL can be used for no destination/source

MPI datatypes

- MPI has a number of predefined datatypes to represent data
  - e.g. MPI.INT for integer and MPI.DOUBLE for float
- No need to specify the datatype for Python objects or NumPy arrays
  - objects are serialised as byte streams
  - automatic detection for NumPy arrays
- If needed, one can also define custom datatypes
  - for example to use non-contiguous data buffers

Summary

- Point-to-point communication = messages are sent between two MPI processes
- Point-to-point operations enable any parallel communication pattern (in principle)
- Arbitrary Python objects (that can be pickled!)
  - send / recv
  - sendrecv
- Memory buffers such as NumPy arrays
  - Send / Recv
  - Sendrecv

NON-BLOCKING COMMUNICATION

Non-blocking communication

- Non-blocking sends and receives
  - isend & irecv return immediately and send/receive in the background
  - return value is a Request object
- Enables some computing concurrently with communication
- Avoids many common deadlock situations

Non-blocking communication

- Have to finalize send/receive operations
  - wait(): waits for the communication started with isend or irecv to finish (blocking)
  - test(): tests if the communication has finished (non-blocking)
- You can mix non-blocking and blocking p2p routines
  - e.g. receive a message sent with isend using recv

Typical usage pattern (schematic)

    request = comm.irecv(ghost_data)
    comm.isend(border_data)
    compute(ghost_independent_data)
    request.wait()
    compute(border_data)

[Figure: neighbouring processes P0, P1 and P2 exchanging border data]
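The typical usage pattern above is schematic. A runnable sketch of the same idea, assuming a periodic ring of processes and the upper-case Irecv/Isend methods for NumPy buffers, could look like the following. The names ring_neighbours() and exchange_and_compute(), the buffer length and the placeholder computations are hypothetical and only serve the illustration.

```python
import numpy as np

try:
    from mpi4py import MPI
except ImportError:  # the neighbour helper below still works without MPI
    MPI = None


def ring_neighbours(rank, size):
    """Return (left, right) neighbour ranks in a periodic ring."""
    return (rank - 1) % size, (rank + 1) % size


def exchange_and_compute(comm, n=8):
    """Overlap a ghost-data exchange with work that does not need it."""
    rank = comm.Get_rank()
    left, right = ring_neighbours(rank, comm.Get_size())

    border_data = np.full(n, float(rank))   # data sent to the right neighbour
    ghost_data = np.empty(n, dtype=float)   # receive buffer must exist already

    recv_req = comm.Irecv(ghost_data, source=left, tag=0)   # post the receive
    send_req = comm.Isend(border_data, dest=right, tag=0)   # start the send

    # Placeholder for work that is independent of the ghost data
    interior = np.arange(n, dtype=float).sum()

    recv_req.Wait()   # ghost_data is now safe to read
    send_req.Wait()   # border_data is now safe to modify

    # Placeholder for work that needs the received ghost data
    return interior + ghost_data.sum()


if __name__ == "__main__":
    if MPI is None:
        print("mpi4py is not installed; only ring_neighbours() can be tried")
    else:
        comm = MPI.COMM_WORLD
        result = exchange_and_compute(comm)
        print("rank %d: result = %.1f" % (comm.Get_rank(), result))
```

Because both requests are started before either is waited for, every process can post its receive and its send in the same order without risking the deadlock that blocking calls in that order could cause.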

Non-blocking send/receive

- Interleaving communication and computation

isend.py

    from mpi4py import MPI
    from numpy import arange, empty

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # calculate_something() stands for any work that does not touch data
    if rank == 0:
        data = arange(size, dtype=float) * (rank + 1)
        req = comm.Isend(data, dest=1, tag=0)    # start a send
        calculate_something(rank)                # .. do something else ..
        MPI.Request.Wait(req)                    # wait for send to finish
        # safe to read/write data again
    elif rank == 1:
        data = empty(size, float)
        req = comm.Irecv(data, source=0, tag=0)  # post a receive
        calculate_something(rank)                # .. do something else ..
        MPI.Request.Wait(req)                    # wait for receive to finish
        # data is now ready for use

Additional completion operations

- Other useful routines in MPI.Request
  - Waitall(requests): wait for all initiated requests to complete
  - Waitany(requests): wait for any initiated request to complete
  - Test(request): test for the completion of a send or receive
- Communicator also has a blocking call, e.g. to probe the incoming message size before posting a receive
  - comm.probe(status=status)

Non-blocking collectives

- New in MPI 3: no support in mpi4py
- Non-blocking collectives enable the overlapping of communication and computation together with the benefits of collective communication
- Restrictions
  - have to be called in same order by all ranks in a communicator
  - mixing of blocking and non-blocking collectives is not allowed

[Figure: blocking MPI.Allreduce compared with MPI_Iallreduce followed by MPI_Wait, with two independent operations overlapped; marked NOT SUPPORTED by MPI for Python]

Summary

- Non-blocking communication is usually the smart way to do point-to-point communication in MPI
- Non-blocking communication realization
  - isend / Isend
  - irecv / Irecv
  - MPI.Request.Wait(req)
- MPI-3 also contains non-blocking collectives, but these are not supported by MPI for Python

COMMUNICATORS

Communicators

- By default a single, universal communicator exists to which all processes belong (MPI.COMM_WORLD)
- One can create new communicators, e.g. by splitting this into sub-groups

[Figure: MPI.COMM_WORLD divided into several smaller communicators, e.g. with comm.Split]

User-defined communicators

split.py

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    color = rank % 4
    local_comm = comm.Split(color)
    local_rank = local_comm.Get_rank()

    print("Global rank: %d Local rank: %d" % (rank, local_rank))
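To make the effect of the split concrete: with color = rank % 4, global ranks 0, 4, 8, ... end up in one communicator, ranks 1, 5, 9, ... in another, and so on. The sketch below passes the global rank as the optional key argument of Split, so the ordering inside each new communicator follows the global ranks and the local rank should equal rank // 4. The helper names color_of(), expected_local_rank() and split_and_check() are hypothetical and exist only to check this reasoning.

```python
try:
    from mpi4py import MPI
except ImportError:  # the pure-Python helpers below still work without MPI
    MPI = None

NCOLORS = 4  # number of sub-communicators, as in split.py


def color_of(rank, ncolors=NCOLORS):
    """Color used for the split: ranks with equal color share a communicator."""
    return rank % ncolors


def expected_local_rank(rank, ncolors=NCOLORS):
    """Local rank expected when processes are ordered by their global rank."""
    return rank // ncolors


def split_and_check(comm):
    """Split comm by color and compare the local rank with the expectation."""
    rank = comm.Get_rank()
    local_comm = comm.Split(color_of(rank), key=rank)
    local_rank = local_comm.Get_rank()
    assert local_rank == expected_local_rank(rank)
    return local_rank, local_comm.Get_size()


if __name__ == "__main__":
    if MPI is None:
        print("mpi4py is not installed; only the helpers can be tried")
    else:
        comm = MPI.COMM_WORLD
        local_rank, local_size = split_and_check(comm)
        print("Global rank: %d Local rank: %d of %d"
              % (comm.Get_rank(), local_rank, local_size))
```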

COLLECTIVE COMMUNICATION

Collective communication

- Collective communication transmits data among all processes in a process group (communicator)
  - These routines must be called by all the processes in the group
- Collective communication includes
  - data movement
  - collective computation
  - synchronization
- Example
  - comm.Barrier() makes every task hold until all tasks in the communicator comm have called it

Collective communication

- Collective communication typically outperforms point-to-point communication
- Code becomes more compact (and efficient!) and easier to maintain:

p2p.py

    if rank == 0:
        for i in range(1, size):
            comm.Send(data, i, tag)
    else:
        comm.Recv(data, 0, tag)

collective.py

    comm.Bcast(data, 0)

[Figure: communicating a NumPy array of 1M elements from task 0 to all other tasks]

Collective communication

- Amount of sent and received data must match
- No tag arguments
  - Order of execution must coincide across processes

Broadcasting

- Send the same data from one process to all the others

[Figure: Bcast copies buffer A from the local memory of one process to all processes; the buffer may contain multiple elements of any datatype]

Broadcasting

- Broadcast sends same data to all processes

bcast.py

    from mpi4py import MPI
    import numpy

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        n = 100
        data = numpy.arange(n, dtype=float)
        comm.bcast(n, root=0)
    else:
        n = comm.bcast(None, root=0)    # returns the value!
        data = numpy.zeros(n, float)    # prepare a receive buffer

    comm.Bcast(data, root=0)            # in-place modification on the receive side

Scattering

- Send equal amount of data from one process to others

[Figure: Scatter splits the send buffer A B C D of one process into the receive buffers of all processes; segments A, B, ... may contain multiple elements]

Scattering

- Scatter distributes data to processes

scatter.py

    from mpi4py import MPI
    from numpy import arange, empty

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == 0:
        py_data = list(range(size))
        data = arange(size**2, dtype=float)
    else:
        py_data = None
        data = None

    buffer = empty(size, float)                 # prepare a receive buffer
    new_data = comm.scatter(py_data, root=0)    # returns the value
    comm.Scatter(data, buffer, root=0)          # in-place modification

Gathering

- Collect data from all the processes to one process

[Figure: Gather collects the send buffers A, B, C, D of the processes into the receive buffer A B C D of one process; segments A, B, ... may contain multiple elements]

Gathering

- Gather pulls data from all processes

gather.py

    from mpi4py import MPI
    from numpy import arange, zeros

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    data = arange(10, dtype=float) * (rank + 1)
    buffer = zeros(size * 10, float)

    n = comm.gather(rank, root=0)        # returns the value
    comm.Gather(data, buffer, root=0)    # in-place modification

Reduce operation

- Applies an operation over a set of processes and places the result in a single process

[Figure: Reduce (sum) combines the send buffers A_i, B_i, C_i, D_i of processes i = 0..3 element by element into the receive buffer of one process]

Reduce operation

- Reduce gathers data and applies an operation on it

reduce.py

    from mpi4py import MPI
    from numpy import arange, zeros

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    data = arange(10 * size, dtype=float) * (rank + 1)
    buffer = zeros(size * 10, float)

    n = comm.reduce(rank, op=MPI.SUM, root=0)        # returns the value
    comm.Reduce(data, buffer, op=MPI.SUM, root=0)    # in-place modification

Other common collective operations

- Scatterv: each process receives different amount of data
- Gatherv: each process sends different amount of data
- Allreduce: all processes receive the results of reduction
- Alltoall: each process sends and receives to/from each other
- Alltoallv: each process sends and receives different amount of data to/from each other

Common mistakes with collectives

- Using a collective operation within one branch of an if-test of the rank

      if rank == 0: comm.bcast(...)

  - all processes in a communicator must call a collective routine!
- Assuming that all processes making a collective call would complete at the same time
- Using the input buffer as the output buffer

      comm.Reduce(a, a, MPI.SUM)

Summary

- Collective communications involve all the processes within a communicator
  - all processes must call them
- Collective operations make code more transparent and compact
- Collective routines allow optimizations by MPI library

On-line resources

- Documentation for mpi4py is quite limited
  - short on-line manual and API reference available at http://pythonhosted.org/mpi4py/
- Some good references:
  - "A Python Introduction to Parallel Programming with MPI" by Jeremy Bejarano, http://materials.jeremybejarano.com/mpiwithpython/
  - "mpi4py examples" by Jörg Bornschein, https://github.com/jbornschein/mpi4py-examples
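As a sketch of the collective routines summarised above, the parallel sum from the point-to-point case study can be written with two collective calls, Scatter and Reduce. The function names collective_sum() and expected_total() and the per_rank parameter are hypothetical, and the data is assumed to divide evenly between the processes (otherwise Scatterv would be needed). The mpi4py import is guarded so that the checking helper can be tried without MPI.

```python
import numpy as np

try:
    from mpi4py import MPI
except ImportError:  # the checking helper below still works without MPI
    MPI = None


def expected_total(n):
    """Closed-form sum of 0..n-1, used only to check the result."""
    return n * (n - 1) / 2.0


def collective_sum(comm, per_rank=10):
    """Sum 0..size*per_rank-1; the result is returned on rank 0 only."""
    rank = comm.Get_rank()
    size = comm.Get_size()

    data = None
    if rank == 0:
        data = np.arange(size * per_rank, dtype=float)  # array on the root

    # Every rank calls both collectives, never inside "if rank == 0"
    segment = np.empty(per_rank, dtype=float)   # receive buffer
    comm.Scatter(data, segment, root=0)         # equal segments to all ranks

    partial = np.array([segment.sum()])         # separate send buffer
    total = np.zeros(1, dtype=float)            # separate receive buffer
    comm.Reduce(partial, total, op=MPI.SUM, root=0)

    return float(total[0]) if rank == 0 else None


if __name__ == "__main__":
    if MPI is None:
        print("mpi4py is not installed; only expected_total() can be tried")
    else:
        comm = MPI.COMM_WORLD
        total = collective_sum(comm)
        if total is not None:
            print("total =", total,
                  "expected =", expected_total(10 * comm.Get_size()))
```

Compared with the point-to-point version there are no tags and no explicit loops over ranks, and the send and receive buffers of Reduce are kept separate to avoid the input-as-output mistake listed above.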

mpi4py performance

- Ping-pong test

Summary

- mpi4py provides Python interface to MPI
- MPI calls via communicator object
- Possible to communicate arbitrary Python objects
- NumPy arrays can be communicated with nearly same speed as from C/Fortran
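A ping-pong test like the one mentioned above can be sketched as follows: ranks 0 and 1 bounce an array back and forth, once with the pickling send/recv methods and once with the buffer-based Send/Recv methods, and the time per round trip is converted to a bandwidth. The function names, message size and repeat count are hypothetical, and no particular timings are claimed here.

```python
import numpy as np

try:
    from mpi4py import MPI
except ImportError:  # the bandwidth helper below still works without MPI
    MPI = None


def round_trip_bandwidth(nbytes, seconds):
    """MB/s for one round trip that moves nbytes in each direction."""
    return 2.0 * nbytes / seconds / 1.0e6


def pingpong(comm, n=1000000, repeats=10, pickled=False):
    """Time round trips between ranks 0 and 1; other ranks return None."""
    rank = comm.Get_rank()
    if comm.Get_size() < 2 or rank > 1:
        return None
    other = 1 - rank
    data = np.zeros(n, dtype=float)

    start = MPI.Wtime()
    for _ in range(repeats):
        if rank == 0:                       # rank 0 sends first
            if pickled:
                comm.send(data, dest=other, tag=0)
                data = comm.recv(source=other, tag=0)
            else:
                comm.Send(data, dest=other, tag=0)
                comm.Recv(data, source=other, tag=0)
        else:                               # rank 1 receives first
            if pickled:
                data = comm.recv(source=other, tag=0)
                comm.send(data, dest=other, tag=0)
            else:
                comm.Recv(data, source=other, tag=0)
                comm.Send(data, dest=other, tag=0)
    elapsed = (MPI.Wtime() - start) / repeats
    return round_trip_bandwidth(data.nbytes, elapsed)


if __name__ == "__main__":
    if MPI is None:
        print("mpi4py is not installed; only the helper can be tried")
    else:
        comm = MPI.COMM_WORLD
        for pickled in (True, False):
            bandwidth = pingpong(comm, pickled=pickled)
            if comm.Get_rank() == 0 and bandwidth is not None:
                label = "send/recv" if pickled else "Send/Recv"
                print("%s: %.1f MB/s" % (label, bandwidth))
```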

Numpy fast array interface

Numpy fast array interface NUMPY Numpy fast array interface Standard Python is not well suitable for numerical computations lists are very flexible but also slow to process in numerical computations Numpy adds a new array data type

More information

NumPy. Arno Proeme, ARCHER CSE Team Attributed to Jussi Enkovaara & Martti Louhivuori, CSC Helsinki

NumPy. Arno Proeme, ARCHER CSE Team Attributed to Jussi Enkovaara & Martti Louhivuori, CSC Helsinki NumPy Arno Proeme, ARCHER CSE Team aproeme@epcc.ed.ac.uk Attributed to Jussi Enkovaara & Martti Louhivuori, CSC Helsinki Reusing this material This work is licensed under a Creative Commons Attribution-

More information

Python in Scientific Computing

Python in Scientific Computing Jussi Enkovaara import sys, os try: from Bio.PDB import PDBParser biopython_installed = True except ImportError: biopython_installed = False Python in Scientific Computing June 10, 2014 Scientific computing

More information

Jussi Enkovaara Martti Louhivuori. Python in High-Performance Computing. CSC IT Center for Science Ltd, Finland January 29-31, 2018

Jussi Enkovaara Martti Louhivuori. Python in High-Performance Computing. CSC IT Center for Science Ltd, Finland January 29-31, 2018 Jussi Enkovaara Martti Louhivuori Python in High-Performance Computing CSC IT Center for Science Ltd, Finland January 29-31, 2018 import sys, os try: from Bio.PDB import PDBParser biopython_installed =

More information

Session 12: Introduction to MPI (4PY) October 9 th 2018, Alexander Peyser (Lena Oden)

Session 12: Introduction to MPI (4PY) October 9 th 2018, Alexander Peyser (Lena Oden) Session 12: Introduction to MPI (4PY) October 9 th 2018, Alexander Peyser (Lena Oden) Overview Introduction Basic concepts mpirun Hello world Wrapping numpy arrays Common Pitfalls Introduction MPI: de

More information

Python in High-Performance Computing

Python in High-Performance Computing Jussi Enkovaara Martti Louhivuori Python in High-Performance Computing CSC IT Center for Science Ltd, Finland March 1-3, 2017 import sys, os try: from Bio.PDB import PDBParser biopython_installed = True

More information

Session 12: Introduction to MPI (4PY) October 10 th 2017, Lena Oden

Session 12: Introduction to MPI (4PY) October 10 th 2017, Lena Oden Session 12: Introduction to MPI (4PY) October 10 th 2017, Lena Oden Overview Introduction Basic concepts mpirun Hello world Wrapping numpy arrays Common Pittfals Introduction MPI de facto standard for

More information

June 10, 2014 Scientific computing in practice Aalto University

June 10, 2014 Scientific computing in practice Aalto University Jussi Enkovaara import sys, os try: from Bio.PDB import PDBParser biopython_installed = True except ImportError: biopython_installed = False Exercises for Python in Scientific Computing June 10, 2014 Scientific

More information

mpi4py HPC Python R. Todd Evans January 23, 2015

mpi4py HPC Python R. Todd Evans January 23, 2015 mpi4py HPC Python R. Todd Evans rtevans@tacc.utexas.edu January 23, 2015 What is MPI Message Passing Interface Most useful on distributed memory machines Many implementations, interfaces in C/C++/Fortran

More information

Diffusion processes in complex networks

Diffusion processes in complex networks Diffusion processes in complex networks Digression - parallel computing in Python Janusz Szwabiński Outlook: Multiprocessing Parallel computing in IPython MPI for Python Cython and OpenMP Python and OpenCL

More information

http://tinyurl.com/cq-advanced-python-20151029 1 2 ##: ********** ## csuser## @[S## ********** guillimin.hpc.mcgill.ca class## ********** qsub interactive.pbs 3 cp -a /software/workshop/cq-formation-advanced-python

More information

CIS192 Python Programming

CIS192 Python Programming CIS192 Python Programming Graphical User Interfaces Robert Rand University of Pennsylvania December 03, 2015 Robert Rand (University of Pennsylvania) CIS 192 December 03, 2015 1 / 21 Outline 1 Performance

More information

Collective Communication

Collective Communication Lab 14 Collective Communication Lab Objective: Learn how to use collective communication to increase the efficiency of parallel programs In the lab on the Trapezoidal Rule [Lab??], we worked to increase

More information

MPI: the Message Passing Interface

MPI: the Message Passing Interface 15 Parallel Programming with MPI Lab Objective: In the world of parallel computing, MPI is the most widespread and standardized message passing library. As such, it is used in the majority of parallel

More information

NumPy. Daniël de Kok. May 4, 2017

NumPy. Daniël de Kok. May 4, 2017 NumPy Daniël de Kok May 4, 2017 Introduction Today Today s lecture is about the NumPy linear algebra library for Python. Today you will learn: How to create NumPy arrays, which store vectors, matrices,

More information

Exercises for Python in HPC

Exercises for Python in HPC Jussi Enkovaara Martti Louhivuori import sys, os try: from Bio.PDB import PDBParser biopython_installed = True except ImportError: biopython_installed = False Exercises for Python in HPC April 16-18, 2013

More information

multiprocessing and mpi4py

multiprocessing and mpi4py multiprocessing and mpi4py 02-03 May 2012 ARPA PIEMONTE m.cestari@cineca.it Bibliography multiprocessing http://docs.python.org/library/multiprocessing.html http://www.doughellmann.com/pymotw/multiprocessi

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements HW#5 Finally Graded Had right idea, but often result not an *exact*

More information

Exercises for Python in HPC

Exercises for Python in HPC Sebastian von Alfthan Jussi Enkovaara Martti Louhivuori import sys, os try: from Bio.PDB import PDBParser biopython_installed = True except ImportError: biopython_installed = False Exercises for Python

More information

An Introduction to Parallel Programming using MPI

An Introduction to Parallel Programming using MPI Lab 13 An Introduction to Parallel Programming using MPI Lab Objective: Learn the basics of parallel computing on distributed memory machines using MPI for Python Why Parallel Computing? Over the past

More information

What s in this talk? Quick Introduction. Programming in Parallel

What s in this talk? Quick Introduction. Programming in Parallel What s in this talk? Parallel programming methodologies - why MPI? Where can I use MPI? MPI in action Getting MPI to work at Warwick Examples MPI: Parallel Programming for Extreme Machines Si Hammond,

More information

Scientific Computing with Python and CUDA

Scientific Computing with Python and CUDA Scientific Computing with Python and CUDA Stefan Reiterer High Performance Computing Seminar, January 17 2011 Stefan Reiterer () Scientific Computing with Python and CUDA HPC Seminar 1 / 55 Inhalt 1 A

More information

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group MPI: Parallel Programming for Extreme Machines Si Hammond, High Performance Systems Group Quick Introduction Si Hammond, (sdh@dcs.warwick.ac.uk) WPRF/PhD Research student, High Performance Systems Group,

More information

multiprocessing HPC Python R. Todd Evans January 23, 2015

multiprocessing HPC Python R. Todd Evans January 23, 2015 multiprocessing HPC Python R. Todd Evans rtevans@tacc.utexas.edu January 23, 2015 What is Multiprocessing Process-based parallelism Not threading! Threads are light-weight execution units within a process

More information

Practical Introduction to Message-Passing Interface (MPI)

Practical Introduction to Message-Passing Interface (MPI) 1 Outline of the workshop 2 Practical Introduction to Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Theoretical / practical introduction Parallelizing your

More information

Advanced and Parallel Python

Advanced and Parallel Python Advanced and Parallel Python December 1st, 2016 http://tinyurl.com/cq-advanced-python-20161201 By: Bart Oldeman and Pier-Luc St-Onge 1 Financial Partners 2 Setup for the workshop 1. Get a user ID and password

More information

Slides prepared by : Farzana Rahman 1

Slides prepared by : Farzana Rahman 1 Introduction to MPI 1 Background on MPI MPI - Message Passing Interface Library standard defined by a committee of vendors, implementers, and parallel programmers Used to create parallel programs based

More information

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing

The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing The Message Passing Interface (MPI) TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Parallelism Decompose the execution into several tasks according to the work to be done: Function/Task

More information

Advanced Message-Passing Interface (MPI)

Advanced Message-Passing Interface (MPI) Outline of the workshop 2 Advanced Message-Passing Interface (MPI) Bart Oldeman, Calcul Québec McGill HPC Bart.Oldeman@mcgill.ca Morning: Advanced MPI Revision More on Collectives More on Point-to-Point

More information

PyConZA High Performance Computing with Python. Kevin Colville Python on large clusters with MPI

PyConZA High Performance Computing with Python. Kevin Colville Python on large clusters with MPI PyConZA 2012 High Performance Computing with Python Kevin Colville Python on large clusters with MPI Andy Rabagliati Python to read and store data on CHPC Petabyte data store www.chpc.ac.za High Performance

More information

Message-Passing and MPI Programming

Message-Passing and MPI Programming Message-Passing and MPI Programming 2.1 Transfer Procedures Datatypes and Collectives N.M. Maclaren Computing Service nmm1@cam.ac.uk ext. 34761 July 2010 These are the procedures that actually transfer

More information

An introduction to scientific programming with. Session 5: Extreme Python

An introduction to scientific programming with. Session 5: Extreme Python An introduction to scientific programming with Session 5: Extreme Python Managing your environment Efficiently handling large datasets Optimising your code Squeezing out extra speed Writing robust code

More information

Sebastian von Alfthan Jussi Enkovaara Martti Louhivuori. Python in High-performance Computing

Sebastian von Alfthan Jussi Enkovaara Martti Louhivuori. Python in High-performance Computing Sebastian von Alfthan Jussi Enkovaara Martti Louhivuori import sys, os try: from Bio.PDB import PDBParser biopython_installed = True except ImportError: biopython_installed = False Python in High-performance

More information

CS4961 Parallel Programming. Lecture 16: Introduction to Message Passing 11/3/11. Administrative. Mary Hall November 3, 2011.

CS4961 Parallel Programming. Lecture 16: Introduction to Message Passing 11/3/11. Administrative. Mary Hall November 3, 2011. CS4961 Parallel Programming Lecture 16: Introduction to Message Passing Administrative Next programming assignment due on Monday, Nov. 7 at midnight Need to define teams and have initial conversation with

More information

Programming with MPI

Programming with MPI Programming with MPI p. 1/?? Programming with MPI Composite Types and Language Standards Nick Maclaren Computing Service nmm1@cam.ac.uk, ext. 34761 March 2008 Programming with MPI p. 2/?? Composite Types

More information

Implementation of Parallelization

Implementation of Parallelization Implementation of Parallelization OpenMP, PThreads and MPI Jascha Schewtschenko Institute of Cosmology and Gravitation, University of Portsmouth May 9, 2018 JAS (ICG, Portsmouth) Implementation of Parallelization

More information

NumPy quick reference

NumPy quick reference John W. Shipman 2016-05-30 12:28 Abstract A guide to the more common functions of NumPy, a numerical computation module for the Python programming language. This publication is available in Web form1 and

More information

DSC 201: Data Analysis & Visualization

DSC 201: Data Analysis & Visualization DSC 201: Data Analysis & Visualization Arrays Dr. David Koop Class Example class Rectangle: def init (self, x, y, w, h): self.x = x self.y = y self.w = w self.h = h def set_corner(self, x, y): self.x =

More information

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign

Introduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s

More information

Introduction to the Message Passing Interface (MPI)

Introduction to the Message Passing Interface (MPI) Introduction to the Message Passing Interface (MPI) CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction to the Message Passing Interface (MPI) Spring 2018

More information

Python for Scientists

Python for Scientists High level programming language with an emphasis on easy to read and easy to write code Includes an extensive standard library We use version 3 History: Exists since 1991 Python 3: December 2008 General

More information

Advanced Fortran Programming

Advanced Fortran Programming Sami Ilvonen Pekka Manninen Advanced Fortran Programming March 20-22, 2017 PRACE Advanced Training Centre CSC IT Center for Science Ltd, Finland type revector(rk) integer, kind :: rk real(kind=rk), allocatable

More information

Message-Passing and MPI Programming

Message-Passing and MPI Programming Message-Passing and MPI Programming More on Collectives N.M. Maclaren Computing Service nmm1@cam.ac.uk ext. 34761 July 2010 5.1 Introduction There are two important facilities we have not covered yet;

More information

Programming with MPI

Programming with MPI Programming with MPI p. 1/?? Programming with MPI Miscellaneous Guidelines Nick Maclaren Computing Service nmm1@cam.ac.uk, ext. 34761 March 2010 Programming with MPI p. 2/?? Summary This is a miscellaneous

More information

Introduction to NumPy

Introduction to NumPy Lab 3 Introduction to NumPy Lab Objective: NumPy is a powerful Python package for manipulating data with multi-dimensional vectors. Its versatility and speed makes Python an ideal language for applied

More information

Administrivia. HW1 due Oct 4. Lectures now being recorded. I ll post URLs when available. Discussing Readings on Monday.

Administrivia. HW1 due Oct 4. Lectures now being recorded. I ll post URLs when available. Discussing Readings on Monday. Administrivia HW1 due Oct 4. Lectures now being recorded. I ll post URLs when available. Discussing Readings on Monday. Keep posting discussion on Piazza Python Multiprocessing Topics today: Multiprocessing

More information

Interfacing With Other Programming Languages Using Cython

Interfacing With Other Programming Languages Using Cython Lab 19 Interfacing With Other Programming Languages Using Cython Lab Objective: Learn to interface with object files using Cython. This lab should be worked through on a machine that has already been configured

More information

Exercise: Introduction to NumPy arrays

Exercise: Introduction to NumPy arrays Exercise: Introduction to NumPy arrays Aim: Introduce basic NumPy array creation and indexing Issues covered: Importing NumPy Creating an array from a list Creating arrays of zeros or ones Understanding

More information

MPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh

MPI Optimisation. Advanced Parallel Programming. David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh MPI Optimisation Advanced Parallel Programming David Henty, Iain Bethune, Dan Holmes EPCC, University of Edinburgh Overview Can divide overheads up into four main categories: Lack of parallelism Load imbalance

More information

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Parallel Programming Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Challenges Difficult to write parallel programs Most programmers think sequentially

More information

CIS192 Python Programming

CIS192 Python Programming CIS192 Python Programming Profiling and Parallel Computing Harry Smith University of Pennsylvania November 29, 2017 Harry Smith (University of Pennsylvania) CIS 192 November 29, 2017 1 / 19 Outline 1 Performance

More information

LECTURE 20. Optimizing Python

LECTURE 20. Optimizing Python LECTURE 20 Optimizing Python THE NEED FOR SPEED By now, hopefully I ve shown that Python is an extremely versatile language that supports quick and easy development. However, a lot of the nice features

More information

NumPy and SciPy. Shawn T. Brown Director of Public Health Applications Pittsburgh Supercomputing Center Pittsburgh Supercomputing Center

NumPy and SciPy. Shawn T. Brown Director of Public Health Applications Pittsburgh Supercomputing Center Pittsburgh Supercomputing Center NumPy and SciPy Shawn T. Brown Director of Public Health Applications Pittsburgh Supercomputing Center 2012 Pittsburgh Supercomputing Center What are NumPy and SciPy NumPy and SciPy are open-source add-on

More information

Introduction to MPI, the Message Passing Library

Introduction to MPI, the Message Passing Library Chapter 3, p. 1/57 Basics of Basic Messages -To-? Introduction to, the Message Passing Library School of Engineering Sciences Computations for Large-Scale Problems I Chapter 3, p. 2/57 Outline Basics of

More information

Message Passing Interface

Message Passing Interface MPSoC Architectures MPI Alberto Bosio, Associate Professor UM Microelectronic Departement bosio@lirmm.fr Message Passing Interface API for distributed-memory programming parallel code that runs across

More information

Speeding up Python. Antonio Gómez-Iglesias April 17th, 2015

Speeding up Python. Antonio Gómez-Iglesias April 17th, 2015 Speeding up Python Antonio Gómez-Iglesias agomez@tacc.utexas.edu April 17th, 2015 Why Python is nice, easy, development is fast However, Python is slow The bottlenecks can be rewritten: SWIG Boost.Python

More information

Introduction to C. Sami Ilvonen Petri Nikunen. Oct 6 8, CSC IT Center for Science Ltd, Espoo. int **b1, **b2;
