MPI Lab

Getting Started
Log in to ranger.tacc.utexas.edu and untar the lab source code:
  lslogin3$ cd
  lslogin3$ tar -xvf ~train00/mpibasic_lab.tar
Part 1: Getting started with simple parallel coding (hello, mpi-world)
Part 2: Calculating pi: cd mpibasic_lab/pi
Part 3: 1-D domain decomposition: cd mpibasic_lab/decomp1d

Part 1: Hello MPI World!
Starting with the serial hello.f (or hello.c):
- Make a copy and name it mpi-hello.f (or a name of your own).
- Include the mpi.h header file.
- Insert MPI_INIT and MPI_FINALIZE.
- Get the basic parallelization info using MPI_COMM_SIZE and MPI_COMM_RANK.
- Modify the print* statement.
Now you have made a parallel code with MPI; a minimal sketch follows.
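For reference, a minimal C version of the converted code might look like the sketch below; the greeting text is illustrative, since the lab's hello.c source is not reproduced here.

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int myid, numprocs;

      MPI_Init(&argc, &argv);                    /* start up MPI */
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);  /* total number of processes */
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);      /* this process's rank */

      printf("Hello, MPI world! I am rank %d of %d\n", myid, numprocs);

      MPI_Finalize();                            /* shut down MPI */
      return 0;
  }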

Hello, MPI-World!
Check your module set first.
Running the serial code:
  lslogin3$ ifort hello.f
  lslogin3$ ./a.out
After converting the serial code to parallel code:
  lslogin3$ mpif90 mpi-hello.f -o mpi-hello
Edit your job submission script, then submit your job:
  lslogin3$ qsub job.sge
Examine the job output:
  lslogin3$ more $Job_Name.o$Job_ID

Part 2: Calculation of Pi
Objective: parallelize the serial pi calculation, starting with the serial code (serial_pi.c or serial_pi.f90).
In C:
  for (i = 1; i <= n; i++) {
    x = h * ( (double)(i) - 0.5e0 );
    sum = sum + f(x);
  }
  pi = h*sum
In Fortran:
  do i = 1, n
    x = h * (dble(i) - 0.5_KR8)
    sum = sum + f(x)
  end do
  pi = h*sum
The main focus is on parallelization:
- How to split a problem across multiple processors
- Broadcasting input to the other nodes
- Using MPI_Reduce to accumulate partial sums
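For context, the serial code is a midpoint-rule integration; a complete C sketch of that idea follows. The integrand f(x) = 4/(1+x*x) and the read of n are assumptions about serial_pi.c, which is not reproduced here.

  #include <stdio.h>

  /* assumed integrand: the integral of 4/(1+x^2) over [0,1] equals pi */
  double f(double x) { return 4.0 / (1.0 + x * x); }

  int main(void)
  {
      int i, n;
      double h, x, pi, sum = 0.0;

      scanf("%d", &n);              /* number of intervals (assumed read from stdin) */
      h = 1.0 / (double) n;         /* width of each interval */

      for (i = 1; i <= n; i++) {
          x = h * ((double) i - 0.5e0);   /* midpoint of interval i */
          sum = sum + f(x);
      }
      pi = h * sum;

      printf("pi is approximately %.16f\n", pi);
      return 0;
  }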

MPI Parallelization: Basic Setup
Modify the serial_pi.f90 or serial_pi.c file:
  cp serial_pi.f90 mpi-pi.f90   or   cp serial_pi.c mpi-pi.c
Include the MPI startup and finalization routines at the beginning and end of mpi-pi.c/f90, and add declaration statements for the rank and number of processors (myid and numprocs, respectively):
  C: #include "mpi.h"      F90: include 'mpif.h'
  MPI_Init( ... )
  MPI_Comm_rank(MPI_COMM_WORLD, ... )
  MPI_Comm_size(MPI_COMM_WORLD, ... )
  ... serial code ...
  MPI_Finalize( ... )
Don't forget:
- Declare myid, numprocs, and ierr as ints in C and as integers in Fortran.
- Use "call" and an error argument in Fortran; use the error return value in C.
- Use myid and numprocs for the rank and processor count.

Calculating Pi: Parallel Version
Each processor performs a partial sum over x_i, x_(i+N), x_(i+2N), x_(i+3N), ..., where N is the processor count and i is the rank.
In C:
  for (i = myid+1; i <= n; i = i + numprocs) {
    x = h * ( (double)(i) - 0.5e0 );
    sum = sum + f(x);
  }
  part_pi = h*sum
In Fortran:
  do i = myid+1, n, numprocs
    x = h * (dble(i) - 0.5_KR8)
    sum = sum + f(x)
  end do
  part_pi = h*sum
Accumulate and add the partial sums on processor 0:
  C:   ierr = MPI_Reduce(&part_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD)
  F90: call MPI_Reduce(part_pi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

Read & Form Partial Sums
Have the rank 0 processor read n, the total number of elements to integrate. Make the read statement conditional, executed only on the root:
  if ( myid == 0 ) read ...
Broadcast n to the other nodes:
  MPI_Bcast(n, 1, <datatype>, 0, MPI_COMM_WORLD)
Use MPI_INTEGER and MPI_INT for the Fortran and C datatypes, respectively (use the &n address in C).
Specify the integral elements for each processor:
  F90: do i = 1, n              becomes  do i = myid+1, n, numprocs
  C:   for (i=1; i<=n; i++)     becomes  for (i=myid+1; i<=n; i=i+numprocs)

MPI_Reduce Partial Sums, Print
Assign the sum from each rank to a partial sum:
- Declare part_pi as a double [ real(kr8) in F90 ].
- After the loop, replace pi = h * sum with part_pi = h * sum.
Sum the partial sums with an MPI_Reduce call:
  MPI_Reduce(part_pi, pi, 1, <type>, MPI_SUM, 0, MPI_COMM_WORLD)
where <type> is MPI_DOUBLE or MPI_DOUBLE_PRECISION for C and F90, respectively; use the addresses &part_pi and &pi in C code.
Write out the calculated pi from the rank 0 process (use an if):
  if (myid == 0) print ...
A complete C sketch of the parallel version follows.
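Putting the steps together, the parallel code might look like the C sketch below. The integrand f and the stdin read are assumptions carried over from the serial sketch; see the lab's finished mpi_pi.c for the actual version.

  #include <stdio.h>
  #include <mpi.h>

  /* assumed integrand: the integral of 4/(1+x^2) over [0,1] equals pi */
  double f(double x) { return 4.0 / (1.0 + x * x); }

  int main(int argc, char *argv[])
  {
      int i, n, myid, numprocs, ierr;
      double h, x, part_pi, pi, sum = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      MPI_Comm_rank(MPI_COMM_WORLD, &myid);

      if (myid == 0)                 /* only the root reads n */
          scanf("%d", &n);

      /* broadcast n to the other ranks */
      ierr = MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

      /* each rank sums every numprocs-th midpoint, starting at myid+1 */
      h = 1.0 / (double) n;
      for (i = myid + 1; i <= n; i = i + numprocs) {
          x = h * ((double) i - 0.5e0);
          sum = sum + f(x);
      }
      part_pi = h * sum;

      /* accumulate the partial sums on rank 0 */
      ierr = MPI_Reduce(&part_pi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (myid == 0)
          printf("pi is approximately %.16f\n", pi);

      MPI_Finalize();
      return 0;
  }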

Compilation & Running the Code
Compile the code:
  mpif90 -O3 mpi-pi.f90
  mpicc -O3 mpi-pi.c
Prepare the job. Modify the processor count in the job submission script:
- Keep the number of processors per node set to 16 (keep the "16way").
- The last argument, divided by 16, is the number of nodes.
  #$ -pe 16way 16      (change the final 16 to 32, 48, 64, etc.)
Create a file called input containing the total number of elements (n) on the first line:
  echo 2000 > input
Submit the job:
  qsub job
See mpi_pi.f90 (.c) for the finished parallel versions. A sketch of a job script is shown below.
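The tarball ships a ready-made job script (job.sge); its exact contents are not reproduced here, but an SGE script for Ranger typically resembled the sketch below. The job name, queue, wall-clock limit, and input redirection are assumptions; ibrun is TACC's MPI launcher, and the -pe line requests 16 tasks per node with the last argument divided by 16 giving the node count.

  #!/bin/bash
  # Sketch of an SGE job script for Ranger (job name, queue, and time limit are assumptions).
  #$ -N mpi_pi
  #$ -cwd
  #$ -j y
  #$ -q normal
  #$ -l h_rt=00:10:00
  #$ -pe 16way 16
  # ibrun is TACC's MPI launcher; it starts one task per assigned core.
  ibrun ./a.out < input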

Part 3: Domain Decomposition
Solve a 2-D partial differential equation by finite differences:
- Represent the x-y domain as a 2-D Cartesian grid.
- Solution matrix = A(x,y).
- Initialize the grid elements with a guess.
- Iteratively update the solution matrix (A) until converged.
- Each iteration uses the neighbor elements to update A.
[Stencil diagram: A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), and A(i,j+1); a second array A' holds the updated elements of A.]
A sketch of one such update sweep follows.
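For illustration, one Jacobi sweep over the interior of the grid (the role played by sweep1d in the lab code) might look like the C sketch below; the array names, the right-hand side f, and the mesh spacing h are placeholders rather than the lab's actual variables.

  /* One Jacobi sweep for a Poisson-type problem on an nx-by-ny interior grid.
   * a    : current iterate, including boundary/ghost values
   * anew : receives the updated interior values
   * f    : right-hand side;  h : mesh spacing
   * Indices run 0..nx+1 and 0..ny+1; rows/columns 0 and nx+1 (ny+1) hold
   * boundary or ghost values. */
  void jacobi_sweep(int nx, int ny, double h,
                    double a[nx+2][ny+2], double anew[nx+2][ny+2],
                    double f[nx+2][ny+2])
  {
      for (int i = 1; i <= nx; i++)
          for (int j = 1; j <= ny; j++)
              anew[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] +
                                   a[i][j-1] + a[i][j+1] -
                                   h * h * f[i][j]);
  }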

Domain Decomposition: Sharing Data Across Processors
Decompose the 2-D grid into column blocks across p processors (1-D decomposition). The edge columns must be duplicated on the neighbor processors, and the updated values must be sent after each iteration; that is, each processor keeps "ghost" copies of the real edge columns owned by its neighbors.

Domain Decomposition: Matrix Layout with Ghost Cells
Redefine the array for easy ghost access:
  Fortran:  real*8 :: A(n, 0:n+1)
  C:        double a[n*(n+2)];
            #define A(i,j) a[ (i-1) + (j)*n ]
[Layout diagram: columns are numbered 0 through n+1; columns 1..n hold the local data, while column 0 (A(1,0)..A(n,0)) and column n+1 (A(1,n+1)..A(n,n+1)) are the ghost columns.]
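As a quick illustration of this column-major indexing convention (the function and variable names below are placeholders, not part of the lab code):

  #include <stdlib.h>

  /* n rows; columns 0..n+1, where columns 0 and n+1 are ghosts */
  #define A(i,j) a[ (i-1) + (j)*n ]

  void layout_example(int n)
  {
      double *a = malloc((size_t) n * (n + 2) * sizeof(double));

      A(1, 0)     = 0.0;   /* first row of the left ghost column  */
      A(n, n + 1) = 0.0;   /* last row of the right ghost column  */
      A(1, 1)     = 1.0;   /* first locally owned element         */

      free(a);
  }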

Domain Decomposition: Node Exchange
[Diagram: two neighboring nodes, each with columns 0..n+1. The real edge columns on one node are sent into the neighboring node's ghost columns; arrows mark the source column on one node and the destination ghost column on the other. The original slide shows the corresponding F90 code.]
A C sketch of this exchange is given below.
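The lab's exchng1 routine performs this exchange with MPI_SENDRECV. A C sketch of the same idea follows; the column-major layout matches the macro above, while the function name, ncols, and the neighbor ranks (as returned by MPI_Cart_shift, with MPI_PROC_NULL at the domain edges) are assumptions.

  #include <mpi.h>

  /* Exchange ghost columns with the left/right neighbors for a 1-D
   * column decomposition.  a has n rows and columns 0..ncols+1, stored
   * column-major: element (i,j) lives at a[(i-1) + j*n]. */
  void exchange_ghosts(double *a, int n, int ncols,
                       int left, int right, MPI_Comm comm)
  {
      MPI_Status status;

      /* send the last real column to the right neighbor and
         receive our left ghost column (column 0) from the left neighbor */
      MPI_Sendrecv(&a[ncols * n],       n, MPI_DOUBLE, right, 0,
                   &a[0],               n, MPI_DOUBLE, left,  0,
                   comm, &status);

      /* send the first real column to the left neighbor and
         receive our right ghost column (column ncols+1) from the right neighbor */
      MPI_Sendrecv(&a[1 * n],           n, MPI_DOUBLE, left,  1,
                   &a[(ncols + 1) * n], n, MPI_DOUBLE, right, 1,
                   comm, &status);
  }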

Domain Decomposition (cont'd)
MPI functions used in the example code:
  MPI_INIT / MPI_FINALIZE
  MPI_COMM_SIZE / MPI_COMM_RANK
  MPI_BCAST
  MPI_CART_CREATE / MPI_CART_COORDS / MPI_CART_SHIFT
  MPI_SENDRECV
  MPI_BARRIER
  MPI_REDUCE / MPI_ALLREDUCE
Subroutines in the code:
1. mpe_decomp1d: decomposes a domain (matrix) into column or row slices for each processor and redefines the local grid index (in the example code, the decomposition is done along the y-axis).
2. onedinit: initializes the right-hand side of the Poisson equation and the initial solution guess.
3. exchng1: MPI send & receive between ghost columns.
4. sweep1d: performs a Jacobi sweep for the 1-D decomposition.

mpi_1d_decomp.f90: Workflow
1. MPI initialization: MPI_INIT, MPI_COMM_SIZE, MPI_COMM_RANK
2. Cartesian MPI topology setup: MPI_CART_CREATE, MPI_CART_COORDS, MPI_CART_SHIFT
3. Domain decomposition (1-D): mpe_decomp1d
4. Jacobi iteration: sweep1d
5. Ghost-cell information exchange: exchng1, MPI_SENDRECV
6. Convergence check: function diff, MPI_ALLREDUCE
7. Run-time calculation and result print-out: MPI_WTIME, MPI_REDUCE
8. Finalize: MPI_FINALIZE

Domain Decomposition
Compile the code:
  mpif90 -O3 mpi_1d_decomp.f90
Prepare the job. Modify the job submission script:
- Keep the number of processors per node set to 16 (keep the "16way").
- The last argument, divided by 16, is the number of nodes.
  #$ -pe 16way 16      (change the final 16 to 32, 48, 64, etc.)
- If the number of cores is not a multiple of 16, set the MY_NSLOTS environment variable to the number of cores, e.g. setenv MY_NSLOTS 31. The last argument of the -pe option must still be a multiple of 16, since it determines the number of nodes.
Submit the job:
  qsub job