Introduction to Parallel Programming
Message Passing Interface - Practical Session, Part I

T. Streit, H.-J. Pflug (streit@rz.rwth-aachen.de)
October 28, 2008

1. Examples

We provide the codes from the theoretical part as well as serial codes for the exercises. Download and extract the codes using the commands:

    wget http://support.rz.rwth-aachen.de/public/mpi1codes.tar.gz
    tar xzvf MPI1Codes.tar.gz
    wget http://support.rz.rwth-aachen.de/public/mpi1exercises.tar.gz
    tar xzvf MPI1Exercises.tar.gz

2. Hello World

Test the hello.c example. Every process prints the "Hello World" line. Compile and run the program with 4 MPI processes using:

    $MPICC hello.c -o hello
    $MPIEXEC -n 4 hello

3. Size & Rank

Test the ranks.c example. Every process identifies itself with its rank (myrank) and the communicator size (nprocs). Compile and run the program with 4 MPI processes using:

    $MPICC ranks.c -o ranks
    $MPIEXEC -n 4 ranks
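For reference, a minimal MPI program covering both exercises might look roughly like the following. This is a sketch written for this handout, not the provided hello.c or ranks.c.

```c
/* Minimal MPI sketch: every process prints a "Hello World" line together
   with its rank and the communicator size. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myrank, nprocs;

    MPI_Init(&argc, &argv);                  /* start up the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* who am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many of us are there? */

    printf("Hello World from process %d of %d\n", myrank, nprocs);

    MPI_Finalize();                          /* shut down MPI */
    return 0;
}
```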

4. Array of Integers - Part I

The program numarray_serial.c creates an array of k (e.g. k=50) integer numbers between 0 and 9 and then counts the zeros. The output is:

    k=50
    Array: 3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6 0 6 2 6 1 8 7 9 2 0 2 3 7 5 9 2 2 8 9 7 3 6 1 2 9 3 1 9 4 7 8 4 5 0
    Number 0 was found 4 times in the array

Parallelize the code so that only the master creates the array. The integer numbers should be between 0 and nprocs-1. All CPUs have access to the value of k. The master then prints the array and sends it to all workers. Each worker counts how many times myrank occurs in the array. All workers send their counts back to the master, which receives all results and prints them.

Example: k=50, 10 CPUs

    Array: 6 1 1 4 8 8 2 3 2 3 3 9 1 6 7 6 4 0 5 2 4 8 9 3 0 4 7 5 6 7 2 4 1 3 8 1 2 2 5 4 5 0 5 7 8 4 5 4 5 1
    Number 0 was found 8 times in the array
    Number 1 was found 12 times in the array
    Number 2 was found 15 times in the array
    Number 3 was found 15 times in the array
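One possible shape of the Part I solution, using only MPI_Send and MPI_Recv, is sketched below. The fixed k, the variable names, and the choice that the master counts the zeros itself are assumptions made for illustration; this is not the course reference solution.

```c
/* Part I sketch: the master creates and distributes the array, each worker
   counts occurrences of its own rank and reports back. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    int myrank, nprocs, i, k = 50, count = 0, tag = 0;
    int *array;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    array = malloc(k * sizeof(int));

    if (myrank == 0) {
        srand((unsigned)time(NULL));
        for (i = 0; i < k; i++)                /* master creates the array ... */
            array[i] = rand() % nprocs;
        printf("k=%d\nArray:", k);             /* ... and prints it */
        for (i = 0; i < k; i++)
            printf(" %d", array[i]);
        printf("\n");

        for (i = 1; i < nprocs; i++)           /* send the whole array to every worker */
            MPI_Send(array, k, MPI_INT, i, tag, MPI_COMM_WORLD);

        for (i = 0; i < k; i++)                /* master counts the zeros itself */
            if (array[i] == 0)
                count++;
        printf("Number 0 was found %d times in the array\n", count);

        for (i = 1; i < nprocs; i++) {         /* collect and print one count per worker */
            int wcount;
            MPI_Recv(&wcount, 1, MPI_INT, i, tag, MPI_COMM_WORLD, &status);
            printf("Number %d was found %d times in the array\n", i, wcount);
        }
    } else {
        MPI_Recv(array, k, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        for (i = 0; i < k; i++)                /* count my own rank in the array */
            if (array[i] == myrank)
                count++;
        MPI_Send(&count, 1, MPI_INT, 0, tag, MPI_COMM_WORLD);
    }

    free(array);
    MPI_Finalize();
    return 0;
}
```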

5. Array of Integers - Part II

Now modify your parallel code from Part I. Send an array whose size is unknown to the workers - only the root initializes the value of k. The workers have to check the incoming size and allocate memory accordingly. You will have to use the MPI_Probe and MPI_Get_count functions. Type man MPI_Probe and man MPI_Get_count for help.

Example: k=50, 4 CPUs

    Array: 3 2 1 3 1 3 2 0 1 1 2 3 2 3 3 2 0 2 0 0 3 0 3 1 2 2 2 3 3 3 1 2 2 2 1 3 1 0 3 2 1 1 1 3 0 1 2 0 3 2
    Number 0 was found 8 times in the array
    Process 1 received value 50
    Process 2 received value 50
    Number 1 was found 12 times in the array
    Process 3 received value 50
    Number 2 was found 15 times in the array
    Number 3 was found 15 times in the array
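A worker-side sketch of the probe-then-allocate pattern is shown below, under the assumption that the master is rank 0 and sends MPI_INT data with a tag known to both sides. The helper name receive_unknown_size is made up for this handout.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Probe for the incoming array, query its size, allocate the buffer, then
   receive.  The caller is responsible for free()ing the returned buffer. */
static int *receive_unknown_size(int myrank, int tag, int *k_out)
{
    MPI_Status status;
    int k;
    int *array;

    MPI_Probe(0, tag, MPI_COMM_WORLD, &status);  /* block until a message from rank 0 arrives */
    MPI_Get_count(&status, MPI_INT, &k);         /* number of MPI_INT elements in that message */
    printf("Process %d received value %d\n", myrank, k);

    array = malloc(k * sizeof(int));             /* allocate exactly as much as announced */
    MPI_Recv(array, k, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);

    *k_out = k;
    return array;
}
```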

6. Array of Integers - Part III

Modify Part I again, now to test the collective operations MPI_Scatter and MPI_Bcast. The master distributes the array using the scatter function, so that each worker only works on a small part of the array. Assume that the number of array elements is divisible by the number of CPUs. In addition, the number n to look for (now only ONE number for all CPUs) is a random number between 0 and nprocs-1; the master broadcasts it to all workers, so each worker looks for the same number. The master iteratively receives and sums up all results. For help, type man MPI_Scatter and man MPI_Bcast.

Example: k=50, 4 CPUs, n=3

    Array: 2 1 3 1 3 2 0 1 1 2 3 2 3 3 2 0 2 0 0 3 0 3 1 2 2 2 3 3 3 1 2 2 2 1 3 1 0 3 2 1 1 1 3 0 1 2 0 3 2 1
    Process 0 received value 3
    Process 0 has found 3 times the number 3 in the part of the array
    Process 1 received value 3
    Process 1 has found 4 times the number 3 in the part of the array
    Process 2 received value 3
    Process 2 has found 4 times the number 3 in the part of the array
    Process 3 received value 3
    Process 3 has found 3 times the number 3 in the part of the array
    All processes have found 14 times the number 3 in the array
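A possible layout of the Part III program is sketched below. The fixed k=50, the tag 0, and the variable names are illustrative assumptions; the final summation still uses the explicit receive loop asked for in this part.

```c
/* Part III sketch: broadcast the search number, scatter the array,
   count locally, then let the master collect and sum the counts. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
    int myrank, nprocs, i, k = 50, n = 0, local_count = 0, total = 0, chunk;
    int *array = NULL, *part;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    chunk = k / nprocs;                           /* k is assumed divisible by nprocs */
    part = malloc(chunk * sizeof(int));

    if (myrank == 0) {
        srand((unsigned)time(NULL));
        array = malloc(k * sizeof(int));
        for (i = 0; i < k; i++)                   /* master creates the array ... */
            array[i] = rand() % nprocs;
        n = rand() % nprocs;                      /* ... and the one number everybody looks for */
    }

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* everyone learns n */
    MPI_Scatter(array, chunk, MPI_INT,            /* each rank gets its part of the array */
                part, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < chunk; i++)
        if (part[i] == n)
            local_count++;
    printf("Process %d has found %d times the number %d in the part of the array\n",
           myrank, local_count, n);

    if (myrank == 0) {
        total = local_count;                      /* master's own share */
        for (i = 1; i < nprocs; i++) {            /* iteratively receive and sum the worker counts */
            int c;
            MPI_Recv(&c, 1, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
            total += c;
        }
        printf("All processes have found %d times the number %d in the array\n", total, n);
        free(array);
    } else {
        MPI_Send(&local_count, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    free(part);
    MPI_Finalize();
    return 0;
}
```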

7. Array of Integers - Part IV (Homework)

Modify Part III in the following way: use MPI_Reduce to receive and automatically sum up all results. For help: man MPI_Reduce.

Example: k=50, 4 CPUs, n=3

    Array: 2 1 3 1 3 2 0 1 1 2 3 2 3 3 2 0 2 0 0 3 0 3 1 2 2 2 3 3 3 1 2 2 2 1 3 1 0 3 2 1 1 1 3 0 1 2 0 3 2 1
    Process 0 received value 3
    Process 0 has found 3 times the number 3 in the part of the array
    Process 2 received value 3
    Process 1 received value 3
    Process 2 has found 4 times the number 3 in the part of the array
    Process 3 received value 3
    Process 1 has found 4 times the number 3 in the part of the array
    Process 3 has found 3 times the number 3 in the part of the array
    All processes have found 14 times the number 3 in the array

8. Array of Integers - Part V (Homework)

Is it possible to modify Part IV (using MPI_Scatter) so that the program works for array sizes that are not divisible by the number of CPUs? If yes, explain how. If not, explain why.
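For Part IV, the explicit receive-and-sum loop from the Part III sketch above could be replaced by a single collective call, roughly as follows. This is a fragment that reuses the names local_count, n, and myrank from that sketch; it is not the course reference solution.

```c
/* Every rank contributes its local_count; the sum arrives in total on rank 0 only. */
int total = 0;
MPI_Reduce(&local_count, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (myrank == 0)
    printf("All processes have found %d times the number %d in the array\n", total, n);
```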

9. Calculation of π with Numerical Integration

Given is a program (pi_tangent_serial.c) that calculates π by numerically approximating the integral

    \pi = \int_0^1 \frac{4}{1 + x^2} \, dx

To do the integration, the domain [0, 1] is divided into intervals, and the well-known trapezium rule (http://en.wikipedia.org/wiki/Trapezium_rule) is used. One can either use the simple trapezium rule

    \int_0^1 f(x) \, dx \approx w \left( \frac{f(0) + f(1)}{2} + \sum_{i=1}^{n-1} f(0 + i w) \right)

or (in this code) the tangent (midpoint) rule

    \int_0^1 f(x) \, dx \approx w \sum_{i=1}^{n} f(0 + w (i - 0.5))

where f(x) = 4 / (1 + x^2) and w is the interval width, i.e. w = 1/n. Results will differ for small numbers of intervals.

Now think about a way to distribute the work among several processes and parallelize the code. The master should read the number of intervals and broadcast it to the other processes. All processes (including the master) should get approximately the same amount of work. Compute the final sum using the collective MPI_Reduce function.

Example, single CPU:

    hpclab@sciprog:~/desktop/exercises/day1/pi$ pi_tangent_serial
    how many intervals: 10
    The computed value of the integral is 3.142425985001098
    hpclab@sciprog:~/desktop/exercises/day1/pi$ pi_tangent_serial
    how many intervals: 100
    The computed value of the integral is 3.141600986923125
    hpclab@sciprog:~/desktop/exercises/day1/pi$ pi_tangent_serial
    how many intervals: 1000
    The computed value of the integral is 3.141592736923123

2 CPUs:

    hpclab@sciprog:~/desktop/exercises/day1/pi$ $MPIEXEC -n 2 pi
    how many intervalls: 10
    calculated pi value: 3.14242598500109826531
    hpclab@sciprog:~/desktop/exercises/day1/pi$ $MPIEXEC -n 2 pi
    how many intervalls: 100
    calculated pi value: 3.14160098692312494961
    hpclab@sciprog:~/desktop/exercises/day1/pi$ $MPIEXEC -n 2 pi
    how many intervalls: 1000
    calculated pi value: 3.14159273692313067983
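One way the parallel version could look is sketched below, assuming a cyclic distribution of the intervals and input on the master only; it is not the course reference code.

```c
/* Parallel tangent-rule pi sketch: master reads n, everyone sums its share
   of the midpoints, MPI_Reduce collects the final value on rank 0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int myrank, nprocs, i, n = 0;
    double w, x, local_sum = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (myrank == 0) {
        printf("how many intervals: ");
        fflush(stdout);
        scanf("%d", &n);
    }
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* everyone needs the interval count */

    w = 1.0 / (double)n;
    for (i = myrank; i < n; i += nprocs) {         /* cyclic distribution: rank r takes i = r, r+nprocs, ... */
        x = w * ((double)i + 0.5);                 /* midpoint of interval i */
        local_sum += 4.0 / (1.0 + x * x);
    }
    local_sum *= w;

    /* sum the partial results on the master */
    MPI_Reduce(&local_sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myrank == 0)
        printf("calculated pi value: %.20f\n", pi);

    MPI_Finalize();
    return 0;
}
```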

10. Matrix-Vector Multiplication

Matrix-vector multiplication c = A*b is a widely used operation in scientific computing. Given an l x m matrix A and a vector b of length m, the result is a vector c of length l. In pseudocode:

    do i=1,l
      do j=1,m
        c(i) = c(i) + A(i,j)*b(j)
      end do
    end do

Given is a serial code for matrix-vector multiplication (mxv_serial_1pointer.c). Parallelize the code so that the master distributes the vector b and the matrix rows, with an equal number of rows for each process, i.e. a rowwise block distribution:

    ( P_0      P_0      ...  P_0      )
    (  .        .             .       )
    ( P_0      P_0      ...  P_0      )
    ( P_1      P_1      ...  P_1      )
    (  .        .             .       )
    ( P_1      P_1      ...  P_1      )
    (  .        .             .       )
    ( P_{N-1}  P_{N-1}  ...  P_{N-1}  )
    (  .        .             .       )
    ( P_{N-1}  P_{N-1}  ...  P_{N-1}  )

Assume that the number of rows is divisible by the number of processes. The master also computes one of the blocks. The master collects the calculated elements in a vector and prints the result.

Example:

    Number of rows: 10
    Number of columns: 10
    A[0] = 0 1 2 3 4 5 6 7 8 9
    A[1] = 1 2 3 4 5 6 7 8 9 0
    A[2] = 2 3 4 5 6 7 8 9 0 1
    A[3] = 3 4 5 6 7 8 9 0 1 2
    A[4] = 4 5 6 7 8 9 0 1 2 3
    A[5] = 5 6 7 8 9 0 1 2 3 4
    A[6] = 6 7 8 9 0 1 2 3 4 5
    A[7] = 7 8 9 0 1 2 3 4 5 6
    A[8] = 8 9 0 1 2 3 4 5 6 7
    A[9] = 9 0 1 2 3 4 5 6 7 8
    b = 0 1 2 3 4 5 6 7 8 9

    resultvector(0) = 285
    resultvector(1) = 240
    resultvector(2) = 205
    resultvector(3) = 180
    resultvector(4) = 165
    resultvector(5) = 160
    resultvector(6) = 165
    resultvector(7) = 180
    resultvector(8) = 205
    resultvector(9) = 240

The quadratic matrix in our test example can be initialized very simply: A_{i,j} = (i + j) mod rows for i, j in {0, 1, ..., rows-1} (cf. the example above). Test your program for some large matrices.
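A sketch of the rowwise parallelization using collectives is given below. The contiguous row-major storage, the fixed problem size, and the use of MPI_Scatter/MPI_Gather instead of explicit sends are assumptions made for this illustration; the provided serial code may be organized differently.

```c
/* Rowwise parallel matrix-vector product sketch for a square rows x rows
   matrix, with rows assumed divisible by the number of processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int myrank, nprocs, i, j, rows = 10, local_rows;
    double *A = NULL, *b, *c = NULL, *A_local, *c_local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    local_rows = rows / nprocs;                    /* assumed to divide evenly */
    b = malloc(rows * sizeof(double));
    A_local = malloc(local_rows * rows * sizeof(double));
    c_local = malloc(local_rows * sizeof(double));

    if (myrank == 0) {
        A = malloc(rows * rows * sizeof(double));
        c = malloc(rows * sizeof(double));
        for (i = 0; i < rows; i++) {               /* test data from the example */
            b[i] = i;
            for (j = 0; j < rows; j++)
                A[i * rows + j] = (i + j) % rows;
        }
    }

    MPI_Bcast(b, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);        /* everyone gets b */
    MPI_Scatter(A, local_rows * rows, MPI_DOUBLE,              /* each rank gets its block of rows */
                A_local, local_rows * rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < local_rows; i++) {             /* local block times b */
        c_local[i] = 0.0;
        for (j = 0; j < rows; j++)
            c_local[i] += A_local[i * rows + j] * b[j];
    }

    MPI_Gather(c_local, local_rows, MPI_DOUBLE,    /* master collects the result vector */
               c, local_rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (myrank == 0) {
        for (i = 0; i < rows; i++)
            printf("resultvector(%d) = %g\n", i, c[i]);
        free(A);
        free(c);
    }

    free(b);
    free(A_local);
    free(c_local);
    MPI_Finalize();
    return 0;
}
```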

11. Floyd's Shortest Path Algorithm (Homework)

Given is the program floyd_serial.c that computes shortest paths using the Floyd algorithm, sometimes also called the Floyd-Warshall algorithm (http://en.wikipedia.org/wiki/Floyd-Warshall_algorithm). Given a graph G = (V, E), the Floyd algorithm finds the shortest path between all pairs of nodes i, j.

Input: a distance matrix D with D(i,j) >= 0. We assume that v is the number of vertices and that D(i,i) = 0 for all i in {0, 1, ..., v-1}.

Serial Floyd algorithm:

    for k = 0 to v-1
      for i = 0 to v-1
        for j = 0 to v-1
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

Output: D(i,j) contains the shortest path distance from i to j.

Parallel Floyd with rowwise distribution:

A simple parallel Floyd algorithm is based on a one-dimensional, rowwise domain decomposition of the intermediate matrix D. Each of the n processors owns v/n rows. [1]

    for k = 0 to v-1
      the processor that holds row k broadcasts it to all others
      for i = i_local_start to i_local_end
        for j = 0 to v-1
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

In the kth step, each task requires, in addition to its local data, the kth row of D.

[1] An alternative parallel version of Floyd's algorithm uses a two-dimensional (checkerboard) decomposition of the matrices and allows the use of up to v^2 processors. The parallel Floyd with checkerboard distribution requires additional row and column communicators:

    for k = 0 to v-1
      the processor that holds row k broadcasts it (or parts of it) to all others
      the processor that holds column k broadcasts it (or parts of it) to all others
      for i = i_local_start to i_local_end
        for j = j_local_start to j_local_end
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

In each step, each task requires, in addition to its local data, data from the kth row and the kth column of D. Hence, communication requires two broadcast operations for each step.

Implement the parallel Floyd with rowwise distribution.

Example (floydmatrix.txt)

Input:

    0 1 3 20 20 20 2 9 20 20
    9 0 9 4 20 4 20 20 20 9
    20 20 0 7 3 20 20 20 7 2
    2 20 3 0 20 20 5 8 20 4
    9 4 3 9 0 20 20 20 8 20
    6 4 9 20 3 0 20 20 5 20
    20 2 1 20 3 20 0 20 20 20
    5 2 20 8 20 20 20 0 20 7
    2 20 20 20 20 20 4 20 0 20
    3 20 4 20 7 6 20 4 8 0

Result:

    0 1 3 5 5 5 2 9 10 5
    6 0 7 4 7 4 8 12 9 8
    5 6 0 7 3 8 7 6 7 2
    2 3 3 0 6 7 4 8 10 4
    8 4 3 8 0 8 10 9 8 5
    6 4 6 8 3 0 8 12 5 8
    6 2 1 6 3 6 0 7 8 3
    5 2 8 6 9 6 7 0 11 7
    2 3 5 7 7 7 4 11 0 7
    3 4 4 8 7 6 5 4 8 0

Test your program for larger matrices. Upload your examples as well.
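The core loop of the rowwise parallel Floyd could be organized as follows. The block-row ownership, the integer distance type, and the function name parallel_floyd are assumptions for this sketch, not a required interface.

```c
/* Sketch of the rowwise parallel Floyd loop: each of the nprocs ranks owns a
   contiguous block of v/nprocs rows of D (v assumed divisible by nprocs),
   stored row-major in D_local. */
#include <mpi.h>
#include <stdlib.h>

static int min_int(int a, int b) { return a < b ? a : b; }

void parallel_floyd(int *D_local, int v, int myrank, int nprocs)
{
    int local_rows = v / nprocs;                /* rows owned by this rank */
    int row_start  = myrank * local_rows;       /* global index of my first row */
    int *row_k = malloc(v * sizeof(int));       /* buffer for the broadcast row */
    int k, i, j, owner;

    for (k = 0; k < v; k++) {
        owner = k / local_rows;                 /* rank that holds global row k */
        if (myrank == owner)
            for (j = 0; j < v; j++)             /* copy my row k into the broadcast buffer */
                row_k[j] = D_local[(k - row_start) * v + j];

        MPI_Bcast(row_k, v, MPI_INT, owner, MPI_COMM_WORLD);  /* everyone gets row k */

        for (i = 0; i < local_rows; i++)        /* relax all of my rows against row k */
            for (j = 0; j < v; j++)
                D_local[i * v + j] = min_int(D_local[i * v + j],
                                             D_local[i * v + k] + row_k[j]);
    }
    free(row_k);
}
```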

12. Clean Buggy Code (Homework)

We have prepared an MPI program with five errors (fixme.c). These errors violate the MPI standard.

    /*
     * This program does the same operation as an MPI_Bcast() but
     * does it using MPI_Send() and MPI_Recv().
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int nprocs;        /* the number of processes in the task */
        int myrank;        /* my rank */
        int i;
        int l = 0;
        int tag = 42;      /* tag used for all communication */
        int tag2 = 99;     /* extra tag used for what ever you want */
        int data = 0;      /* initialize all the data buffers to 0 */
        MPI_Status status; /* status of MPI_Recv() operation */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Initialize the data for rank 0 process only. */
        if (myrank == 0) {
            data = 399;
        }

        if (myrank == 0) {
            for (i = 1; i < nprocs; i++) {
                MPI_Send(&data, 1, MPI_BYTE, i, tag, MPI_COMM_WORLD);
            }
        } else {
            MPI_Recv(data, l, MPI_INT, 0, tag2, MPI_COMM_WORLD, &status);
        }

        MPI_Barrier(MPI_COMM_WORLD);

        /* Check the data everywhere. */
        if (data != 399) {
            fprintf(stdout, "Whoa! The data is incorrect\n");
        } else {
            fprintf(stdout, "Whoa! Got the message... \n");
        }

        return 0;
    }

Fix the errors in the code and add comments explaining your changes.