Parallel Computing Lab 10: Parallel Game of Life (GOL) with structured grid


November 17, 2012

This lab asks you to write a program in the Bulk Synchronous Parallelism (BSP) style, namely a parallel Game of Life (GOL) simulation on a regular grid. It also shows how to use the VTK visualization library (no details yet).

1 Code and setup

A skeleton code is provided (see gol1-template.tgz) in case you do not want to write everything from scratch. You can use any part of the code, although it is easier to evaluate a solution if it follows my code structure. Unpack the code on aur and compile with

    module load intelmpi
    module load VTK
    make

If you want to compile on your own computer, change the VTK directories in the Makefile.

1.1 Sequential GOL

There are 2 sequential versions:

1. Fortran with terminal output: ./main_gol-seq
   (a) sources are main_gol-seq.f90, gol-seq.f90, commonf.f90, and common.h
   (b) this program prints the new state of the GOL field to the console every time a key is pressed
2. C++ with VTK visualization: ./gol-visual-seq
   (a) you need to pass the -X option to ssh and possibly set the LIBGL_ALWAYS_INDIRECT environment variable to 1
   (b) the source is gol-visual-seq.cc
   (c) this program draws the new state of the GOL field with VTK every second
   (d) you can use the mouse to rotate and zoom the field (it is 3D)

As you can see, the grid is structured. If 3D is too slow over ssh, try NX; see the instructions on the HPC web site [1].

[Figure 1: Processes and communicators. MPI_COMM_WORLD contains the vis process and the compute processes (compute 1 to compute 4); compcomm contains only the compute processes.]

2 Parallel GOL

There is one parallel version, main_gol-mpi.f90, which is incomplete (just a template). It can run in two modes:

1. mpirun -np 1 ./main_gol-mpi
   (a) initializes MPI, runs the sequential GOL and prints the output to the console
2. mpirun -np 1 ./main_gol-mpi : -np 1 ./gol-visual-mpi
   (a) runs one visualizer MPI process (gol-visual-mpi.cc) and one compute MPI process (main_gol-mpi.f90)
   (b) the visualizer waits for messages from the compute process
   (c) the compute process calculates the new state of the GOL field on a key press

In either case the compcomm communicator is created from MPI_COMM_WORLD (see Figure 1); it combines only the MPI processes with the compute role (main_gol-mpi.f90). Use this communicator in the following task.

Task 1 Rewrite the program so that GOL runs in parallel. To do so, create gol-mpi.f90 and write code similar to gol-seq.f90 that runs in parallel by decomposing the domain into n × n subdomains and exchanging neighbours' (ghost) values as necessary (see the lecture slides). You may completely ignore the visual VTK C++ part if you do not like it and just use the console for output.

An example of splitting the field into 3 × 3 parts for 9 processes is shown in Figure 2. The idea is to create, on each process, a larger (lxn+2) × (lyn+2) data field that comprises:

- local values, which are computed by the current process,
- inner values, which are local values that in addition are not needed by any other process and can be computed without any ghost values,
- ghost values, a thin region of non-local values along the boundary that are needed by the local computations and must eventually be copied from other processes.
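The bookkeeping behind such a decomposition can be sketched in a few lines. The sketch below is written in Python rather than Fortran so that it can be run without MPI, and it is not part of the template: the function names (block_range, subdomain), the row-by-row rank numbering, the way remainder cells are distributed, and the non-periodic boundary are all assumptions made for illustration. For one rank of an n × n process grid it computes which global cells the rank owns, the shape of its (lxn+2) × (lyn+2) array, and the ranks of its up to eight neighbours (GOL needs the diagonal ones for the corner ghost cells).

```python
def block_range(total, parts, index):
    """Return the 1-based inclusive global range (lo, hi) of block `index`.

    Assumption: remainder cells go to the first `total % parts` blocks.
    """
    base, extra = divmod(total, parts)
    lo = index * base + min(index, extra) + 1
    hi = lo + base + (1 if index < extra else 0) - 1
    return lo, hi


def subdomain(rank, n, XN, YN):
    """Describe the subdomain of `rank` in an n x n process grid.

    Assumptions: ranks are numbered row by row (rank = py * n + px) and
    the field is not periodic, so edge processes have fewer neighbours.
    """
    px, py = rank % n, rank // n
    x_lo, x_hi = block_range(XN, n, px)
    y_lo, y_hi = block_range(YN, n, py)
    lxn, lyn = x_hi - x_lo + 1, y_hi - y_lo + 1

    # Neighbour ranks keyed by offset (dx, dy); diagonals are included
    # because corner ghost cells come from diagonal neighbours.
    neighbours = {}
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            qx, qy = px + dx, py + dy
            if (dx, dy) != (0, 0) and 0 <= qx < n and 0 <= qy < n:
                neighbours[(dx, dy)] = qy * n + qx

    return {
        "global_x": (x_lo, x_hi),
        "global_y": (y_lo, y_hi),
        "local_size": (lxn, lyn),
        "array_shape": (lxn + 2, lyn + 2),  # local values plus ghost layer
        "neighbours": neighbours,
    }


if __name__ == "__main__":
    # 9 processes on a 3 x 3 process grid, 10 x 10 field (example sizes).
    for rank in range(9):
        info = subdomain(rank, 3, 10, 10)
        print(rank, info["global_x"], info["global_y"], info["array_shape"])
```

In an MPI program each process would evaluate something like subdomain for its own rank in compcomm and then post one send and one receive per entry in the neighbours table.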

[Figure 2: Decomposition of GOL between 9 processes; local, inner and all values for one process. Axis labels: XN, YN (whole field) and lxn, lyn (one subdomain). Index ranges marked in the figure:]

    all values    data(1:lxn+2,1:lyn+2)
    local values  data(2:lxn+1,2:lyn+1)
    inner values  data(3:lxn,3:lyn)

It is possible to exchange the latest ghost values every time before doing the local computations. In this case the code I have is the following:

    !> Do one step in gol
    subroutine gol_parfield_step(pf)
      type(parfield_t),intent(inout) :: pf
      type(ghostinfo_t) :: ghostinfo
      ! exchange ghosts
      call gol_parfield_exchange_ghosts_start(pf, ghostinfo)
      call gol_parfield_exchange_ghosts_finish(pf, ghostinfo)
      ! make the step
      call gol_field_calculate(pf%base, (/2,2,pf%lXN+1,pf%lYN+1/))
      call gol_field_step_finish(pf%base)
    end subroutine gol_parfield_step

The second argument of the gol_field_calculate routine specifies the region to compute. Here it is the local values, but in the next task we will need finer control. The ghostinfo value holds all MPI request handles, buffers, and other useful information between the MPI_Isend/MPI_Irecv and MPI_Wait calls.

3 Parallel optimized GOL

One optimization idea is to initiate the exchange of ghost values and in the meantime compute the inner values; then, once the ghost values are available, compute the non-inner local values. The code of one GOL step then changes to the following:

    subroutine gol_parfield_step_async(pf)
      type(parfield_t),intent(inout) :: pf
      type(ghostinfo_t) :: ghostinfo
      ! start exchanging ghost values and calculate inner part
      call gol_parfield_exchange_ghosts_start(pf, ghostinfo)
      call gol_field_calculate(pf%base, (/3,3,pf%lXN,pf%lYN/))
      call gol_parfield_exchange_ghosts_finish(pf, ghostinfo)
      ! calculate outer part
      ! top and bottom borders
      call gol_field_calculate(pf%base, (/2,2,pf%lXN+1,2/))
      call gol_field_calculate(pf%base, (/2,pf%lYN+1,pf%lXN+1,pf%lYN+1/))
      ! left and right borders
      call gol_field_calculate(pf%base, (/2,3,2,pf%lYN/))
      call gol_field_calculate(pf%base, (/pf%lxn+1,3,pf%lxn+1,pf%lyn/))
      ! finish the step
      call gol_field_step_finish(pf%base)
    end subroutine gol_parfield_step_async

[Figure 3: Intel TAC charts for parallel Game of Life with 9 processes: (a) simple ghost exchange, (b) overlap of ghost exchange with inner computations]

Task 2 Rewrite the GOL code so that each process first calculates its boundary values needed by other processes, sends them out and then calculates the rest. Benchmark the new code against the non-optimized one for a large GOL field and provide a table with run times for 1, 4, 9, and 16 processes for both versions. Disable visualization and provide event charts from Intel TAC for 4, 9 and 16 MPI processes for both versions. Is the optimization helpful?

For the numbers I tried ( grid) and 4, 9, 16, 36 processes I did not notice any substantial win in time. The charts from ITAC are shown in Figure 3.

References

[1]
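As a supplement to Section 3, the region arguments passed to gol_field_calculate in gol_parfield_step_async can be checked with a short script. The following Python sketch is not part of the template: cells, async_regions and covers_local_once are illustrative names, and reading a region as (/x0, y0, x1, y1/) with inclusive bounds is an assumption inferred from the calls shown in the handout. Under that assumption it verifies that the inner block plus the four border strips cover every local value exactly once.

```python
from collections import Counter


def cells(region):
    """List all (x, y) cells of a region (x0, y0, x1, y1), bounds inclusive."""
    x0, y0, x1, y1 = region
    return [(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)]


def async_regions(lxn, lyn):
    """Regions computed by the overlapped step, in the order of the handout."""
    return [
        (3, 3, lxn, lyn),                # inner values (no ghosts needed)
        (2, 2, lxn + 1, 2),              # top border
        (2, lyn + 1, lxn + 1, lyn + 1),  # bottom border
        (2, 3, 2, lyn),                  # left border
        (lxn + 1, 3, lxn + 1, lyn),      # right border
    ]


def covers_local_once(lxn, lyn):
    """True if the regions tile the local values data(2:lxn+1, 2:lyn+1)."""
    counts = Counter(c for r in async_regions(lxn, lyn) for c in cells(r))
    local = cells((2, 2, lxn + 1, lyn + 1))
    return set(counts) == set(local) and all(n == 1 for n in counts.values())


if __name__ == "__main__":
    print(covers_local_once(8, 5))
```

The check passes whenever lxn and lyn are both at least 2. For a subdomain that is only one cell wide or high, two opposite border strips coincide and some cells would be computed twice, so a very fine decomposition would need a special case.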

Appendix A: Using VTK

Here follow the details of VTK programming.
