Supercomputing in Plain English Exercise #6: MPI Point to Point


In this exercise, we'll use the same conventions and commands as in Exercises #1, #2, #3, #4 and #5. You should refer back to the Exercise #1, #2, #3, #4 and #5 descriptions for details on various Unix commands.

In this exercise, you'll again be parallelizing and benchmarking, but this time you'll be parallelizing with MPI instead of with OpenMP. Then you'll benchmark various numbers of MPI processes using various compilers and various levels of compiler optimization.

NOTE: This exercise is VERY DIFFICULT!

Here are the steps for this exercise:

1. Log in to the Linux cluster supercomputer (sooner.oscer.ou.edu).

2. Choose which language you want to use (C or Fortran90), and cd into the appropriate directory:

     % cd ~/SIPE2011_exercises/NBody/C/

   OR:

     % cd ~/SIPE2011_exercises/NBody/Fortran90/

3. Copy the Serial directory to a new MPI directory:

     % cp -r Serial/ MPI_p2p/

4. Copy the MPI point-to-point batch script into your MPI_p2p directory:

     % cp ~hneeman/SIPE2011_exercises/NBody/C/nbody_mpi_p2p.bsub MPI_p2p/

5. Go into your MPI point-to-point directory:

     % cd MPI_p2p

6. Edit the makefile to change the compiler to mpicc (regardless of the compiler family).

7. Parallelize the code using MPI, specifically using point-to-point communication with MPI_Send and MPI_Recv.

   HINTS

   a. Be sure to include mpi.h (C) at the top of the file, or include mpif.h (Fortran) at the beginning of the main program.

   b. In the declaration section of the main function (C) or main program (Fortran), declare the appropriate variables (my rank, the number of processes, and the MPI error code that's returned by each MPI routine).

   c. In the main function (C) or the main program (Fortran), put the startup of MPI (MPI_Init, MPI_Comm_rank and MPI_Comm_size) as the very first executable statements in the body of the main function, and put the shutdown of MPI (MPI_Finalize) as the very last statement in the body of the main function. (A minimal sketch of this structure appears right after hint d below.)

   d. Identify the routine where the bulk of the calculations take place, as in Exercise #5.
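   Putting hints a through c together, a minimal sketch of the program skeleton in C might look like the following. The variable names (my_rank, number_of_processes, mpi_error_code) are just the ones suggested by hint b, not required names; the MPI calls themselves are standard.

       #include <stdio.h>
       #include <stdlib.h>
       #include <mpi.h>                     /* hint a: MPI declarations */

       int main (int argc, char** argv)
       {
           int my_rank;                     /* hint b: this process's rank      */
           int number_of_processes;         /* hint b: how many MPI processes   */
           int mpi_error_code;              /* hint b: MPI routines' return code */

           /* Hint c: start up MPI before any other executable statements. */
           mpi_error_code = MPI_Init(&argc, &argv);
           mpi_error_code = MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
           mpi_error_code = MPI_Comm_size(MPI_COMM_WORLD, &number_of_processes);

           /* ... the rest of the N-body program goes here ... */

           /* Hint c: shut down MPI as the very last statement. */
           mpi_error_code = MPI_Finalize();
           return 0;
       }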

   e. You can get the information about my rank and the number of processes into that routine in either of the following ways:

      i. you can pass, as arguments, those variables from the main function (C) or the main program (Fortran) to the routine that does the bulk of the calculations, OR

      ii. inside the routine that does the bulk of the calculations, you can declare the same variables, and then call MPI_Comm_rank and MPI_Comm_size inside that routine.

   f. At the beginning of that routine, determine the subset of interactions that will be performed by each MPI process.

      i. A simple way to do this is to split up the iterations of the outer loop (which correspond to the list of particles) into roughly equal sized subregions, with each subregion calculated by a different MPI process.

      ii. It's a good idea to have a pair of integer arrays, each of length equal to the number of processes, with one array containing the first iteration of the outer loop to be calculated by each MPI process, and the other array containing the last iteration of the outer loop for each MPI process. So, be sure to declare dynamically allocatable arrays for the local first iteration and local last iteration, and at the beginning of the routine, allocate those arrays. In C, it would look something like this:

          int* local_first_iteration = (int*)NULL;
          int* local_last_iteration  = (int*)NULL;

          local_first_iteration = (int*)malloc(sizeof(int) * ...);
          if (local_first_iteration == (int*)NULL) {
              fprintf(stderr,
                  "%d: ERROR: can't allocate local_first_iteration of length %d.\n",
                  my_rank, ...);
              mpi_error_code = MPI_Abort(MPI_COMM_WORLD, 0);
          } /* if (local_first_iteration == (int*)NULL) */

          local_last_iteration = (int*)malloc(sizeof(int) * ...);
          if (local_last_iteration == (int*)NULL) {
              fprintf(stderr,
                  "%d: ERROR: can't allocate local_last_iteration of length %d.\n",
                  my_rank, ...);
              mpi_error_code = MPI_Abort(MPI_COMM_WORLD, 0);
          } /* if (local_last_iteration == (int*)NULL) */

      iii. And in Fortran90, it'd look like this:

          INTEGER,PARAMETER :: memory_success = 0
          INTEGER,DIMENSION(:),ALLOCATABLE :: local_first_iteration
          INTEGER,DIMENSION(:),ALLOCATABLE :: local_last_iteration
          INTEGER :: memory_status

          ALLOCATE(local_first_iteration(...), STAT=memory_status)
          IF (memory_status /= memory_success) THEN
              WRITE (0,*) my_rank, "ERROR: can't allocate local_first_iteration of length ", ...
              CALL MPI_Abort(MPI_COMM_WORLD, 0, mpi_error_code)
          END IF

          ALLOCATE(local_last_iteration(...), STAT=memory_status)
          IF (memory_status /= memory_success) THEN
              WRITE (0,*) my_rank, "ERROR: can't allocate local_last_iteration of length ", ...
              CALL MPI_Abort(MPI_COMM_WORLD, 0, mpi_error_code)
          END IF

      iv. For the first MPI process, the first iteration to be calculated is the first iteration of the outer loop (i.e., the first particle). For each subsequent MPI process, the number of iterations is either (a) the total number of iterations divided by the number of MPI processes, or (b) one more than that (in case there is a remainder when dividing the number of iterations by the number of MPI processes). A good rule of thumb is to put at most one leftover (remainder) iteration on each of the MPI processes. How can you determine which MPI processes should get one of the leftover iterations?

      v. For each MPI process, the last iteration to be calculated is the first iteration plus the number of iterations, minus one.

      vi. Just in case, check to be sure that the last iteration of the last MPI process is the last iteration overall. (A sketch of one possible decomposition appears right after this hint.)
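      Here is a sketch, in C, of one way to fill those two arrays (hints f.iv through f.vi). It answers the question in hint iv by giving the leftover iterations to the lowest-ranked processes, which is just one reasonable choice; the names decompose_iterations, number_of_particles and number_of_processes are placeholders for whatever your code actually uses.

          #include <stdio.h>
          #include <mpi.h>

          /* Sketch: block decomposition of the outer loop (hint f).
           * Assumes local_first_iteration and local_last_iteration have
           * already been allocated with one element per MPI process. */
          void decompose_iterations (int number_of_particles,
                                     int number_of_processes, int my_rank,
                                     int* local_first_iteration,
                                     int* local_last_iteration)
          {
              int iterations_per_process = number_of_particles / number_of_processes;
              int leftover_iterations    = number_of_particles % number_of_processes;
              int process;
              int mpi_error_code;

              for (process = 0; process < number_of_processes; process++) {
                  /* Hint f.iv: the base count, plus at most one leftover;
                   * here the lowest-ranked processes absorb the leftovers. */
                  int my_iterations = iterations_per_process;
                  if (process < leftover_iterations) {
                      my_iterations++;
                  }
                  if (process == 0) {
                      local_first_iteration[process] = 0;
                  }
                  else {
                      local_first_iteration[process] =
                          local_last_iteration[process - 1] + 1;
                  }
                  /* Hint f.v: last = first + number of iterations - 1. */
                  local_last_iteration[process] =
                      local_first_iteration[process] + my_iterations - 1;
              }
              /* Hint f.vi: the last process must end on the last particle. */
              if (local_last_iteration[number_of_processes - 1] !=
                  number_of_particles - 1) {
                  fprintf(stderr, "%d: ERROR: bad iteration decomposition.\n",
                      my_rank);
                  mpi_error_code = MPI_Abort(MPI_COMM_WORLD, 0);
              }
          }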

   g. Dynamically allocate two buffer arrays of floating point (real) elements, both having length the number of spatial dimensions (in this case, 3 for X, Y and Z) times the total number of particles. These are the send buffer and the receive buffer.

   h. Fill both buffer arrays with zeros.

   i. In the outer loop, instead of looping over all of the particles, loop from the first particle for this MPI process to the last particle for this MPI process. That way, this MPI process will only calculate a subset of the forces.

   j. After the outer loop, you'll have an array of total aggregate forces on the subset of particles corresponding to the subset of outer loop iterations calculated by this MPI process.

   k. Copy those values into the appropriate portion of the send buffer.

   l. Using MPI_Send and MPI_Recv, send this MPI process's subset of total aggregate forces to all other MPI processes. You can use the same basic structure as in the greetings.c or greetings.f program, except after the master process receives all of the subsets of the forces sent from each of the other processes (as the master in greetings receives the messages sent from the other processes), the master then sends the full force array to each of the other processes (and they each receive the full force array). (A sketch of this exchange appears right after step 14 below.)

   m. Copy the array of received values into the original data structure.

   n. At the end of that routine, deallocate the dynamically allocated arrays (VERY IMPORTANT!). In C, a deallocation would look something like this:

          free(local_first_iteration);

      And in Fortran90, a deallocation would look like this:

          DEALLOCATE(local_first_iteration, STAT=memory_status)
          IF (memory_status /= memory_success) THEN
              WRITE (0,*) my_rank, "ERROR: can't deallocate local_first_iteration of length ", ...
              CALL MPI_Abort(MPI_COMM_WORLD, 0, mpi_error_code)
          END IF

   o. For outputting, only one of the MPI processes should do the output (which one?).

13. Copy the MPI point-to-point template batch script:

      % cp ~hneeman/SIPE2011_exercises/NBody/C/nbody_mpi_p2p.bsub MPI_p2p/

    OR:

      % cp ~hneeman/SIPE2011_exercises/NBody/Fortran90/nbody_mpi_p2p.bsub MPI_p2p/

14. Edit your new batch script, following the instructions in the template batch script to make it use the appropriate files in the appropriate directories, as well as your e-mail address and so on.
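Here is the sketch of the point-to-point exchange described in hint l, written in C. The names (exchange_forces, send_buffer, recv_buffer, DIM, master_rank) are placeholders rather than required names, and the sketch relies on hint h: because each process's buffer is zero outside its own portion, the master can reassemble the full force array simply by adding the received buffers together.

    #include <string.h>
    #include <mpi.h>

    #define DIM 3   /* hint g: number of spatial dimensions (X, Y, Z) */

    /* Sketch of the point-to-point exchange in hint l.  Both buffers hold
     * DIM * number_of_particles floats (hint g); send_buffer contains this
     * process's subset of forces and zeros elsewhere (hints h and k). */
    void exchange_forces (float* send_buffer, float* recv_buffer,
                          int number_of_particles,
                          int my_rank, int number_of_processes)
    {
        const int buffer_length = DIM * number_of_particles;
        const int master_rank   = 0;
        int process, element, mpi_error_code;

        if (my_rank == master_rank) {
            /* Start from the master's own contribution ... */
            memcpy(recv_buffer, send_buffer, buffer_length * sizeof(float));
            /* ... receive each worker's buffer and add it in (the zeros
             * outside each worker's portion leave the rest unchanged) ... */
            for (process = 1; process < number_of_processes; process++) {
                mpi_error_code =
                    MPI_Recv(send_buffer, buffer_length, MPI_FLOAT, process,
                             0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                for (element = 0; element < buffer_length; element++) {
                    recv_buffer[element] += send_buffer[element];
                }
            }
            /* ... then send the full force array to every other process. */
            for (process = 1; process < number_of_processes; process++) {
                mpi_error_code =
                    MPI_Send(recv_buffer, buffer_length, MPI_FLOAT, process,
                             0, MPI_COMM_WORLD);
            }
        }
        else {
            /* Workers: send my subset to the master, then receive the full
             * force array back; hint m says to copy it into the original
             * data structure afterward. */
            mpi_error_code =
                MPI_Send(send_buffer, buffer_length, MPI_FLOAT, master_rank,
                         0, MPI_COMM_WORLD);
            mpi_error_code =
                MPI_Recv(recv_buffer, buffer_length, MPI_FLOAT, master_rank,
                         0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

Note that on the master, this sketch receives each worker's message into send_buffer, overwriting it; that is safe only because the master's own contribution has already been copied into recv_buffer first.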

15. Set the MPI_COMPILER and MPI_INTERCONNECT environment variables. Typically, you'll need to do this once per login.

    a. If your Unix shell is tcsh, then you would do this:

         % setenv MPI_COMPILER gnu
         % setenv MPI_INTERCONNECT ib

    b. If your Unix shell is bash, then you would do this:

         % export MPI_COMPILER=gnu
         % export MPI_INTERCONNECT=ib

16. Compile using make. You may need to do this multiple times, debugging as you go.

17. Submit the batch job and let it run to completion. If it seems to take a very long time, you probably have a bug.

18. For each run, once the batch job completes:

    a. Examine the various output files to see the timings for your runs with executables created by the various compilers under the various levels of optimization. (If you'd like to add your own timing, see the MPI_Wtime sketch at the end of this exercise.)

    b. Profile, as described above.

19. How can you debug an MPI program? The most straightforward way is to add outputs to standard error (stderr) before and after each MPI call. For example, in C:

      fprintf(stderr, "%d: about to call MPI_Send(force_array, %d, ...)\n",
          my_rank, last_iteration[my_rank] - first_iteration[my_rank] + 1);
      mpi_error_code = MPI_Send(...);
      fprintf(stderr, "%d: done calling MPI_Send(force_array, %d, ...)\n",
          my_rank, last_iteration[my_rank] - first_iteration[my_rank] + 1);
      fprintf(stderr, "%d: mpi_error_code=%d\n", my_rank, mpi_error_code);

    Or, in Fortran90:

      WRITE (0,*) my_rank, ": about to call MPI_Send(force_array, ", &
          last_iteration(my_rank) - first_iteration(my_rank) + 1, ", ...)"
      CALL MPI_Send(..., mpi_error_code)
      WRITE (0,*) my_rank, ": done calling MPI_Send(force_array, ", &
          last_iteration(my_rank) - first_iteration(my_rank) + 1, ", ...)"
      WRITE (0,*) my_rank, ": mpi_error_code=", mpi_error_code

20. Continue to debug and run until you've got a working version of the code.

21. Use your favorite graphing program (for example, Microsoft Excel) to create graphs of your various runs, so that you can compare the various methods visually.

22. You should also run different problem sizes, to see how problem size affects relative performance.
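The timings in step 18a come from whatever timing the N-body code already does. If you want an additional, MPI-specific measurement while you vary the number of processes and the problem size (steps 21 and 22), MPI_Wtime is the standard MPI wall-clock timer. Here is a sketch, assuming mpi.h and stdio.h are already included; calculate_forces is a made-up placeholder for whatever your force routine is actually called.

      /* Sketch: wall-clock timing of the force routine with MPI_Wtime, so
       * that each run's time shows up in the batch job's output file. */
      double start_time, end_time;

      start_time = MPI_Wtime();
      calculate_forces(/* ... your arguments ... */);   /* placeholder name */
      end_time   = MPI_Wtime();

      if (my_rank == 0) {   /* hint o: only one process does the output */
          printf("force calculation: %f seconds on %d MPI processes\n",
                 end_time - start_time, number_of_processes);
      }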