Hands-on. MPI basic exercises

WIFI XSF-UPC: Username: xsf.convidat / Password: 1nt3r3st3l4r
WIFI EDUROAM: Username: roam06@bsc.es / Password: Bsccns.4

MareNostrum III User Guide: http://www.bsc.es/support/marenostrum3-ug.pdf

Remember to include #BSUB -U patc1 in your job scripts.

Hands-on. MPI basic exercises

Go to the mpi directory; there you will find a folder for each exercise. You can use the mpich man pages for the MPI calls: http://www.mpich.org/static/docs/v3.1/

Exercise 0

Using MPI, try to print out:

    Hello world, I am proc X of total Y

with a total number of tasks Y = 4 and X ranging from 0 to 3. After having compiled the code, try to launch it as a batch job. You have to complete hello.c and job.lsf with proper values. Once completed, you can compile with

    make

and execute with

    make submit

HELP:
    int MPI_Init(int *argc, char ***argv)
    int MPI_Comm_size(MPI_Comm comm, int *size)
    int MPI_Comm_rank(MPI_Comm comm, int *rank)
    int MPI_Finalize(void)
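As a rough guide, a completed hello.c could look like the minimal sketch below, using only the calls listed above (the skeleton provided in the exercise folder and the values in job.lsf may differ):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                  /* start the MPI runtime          */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* Y: total number of tasks       */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* X: this task's rank, 0..size-1 */

        printf("Hello world, I am proc %d of total %d\n", rank, size);

        MPI_Finalize();                          /* shut down the MPI runtime      */
        return 0;
    }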

Exercise 1

Write a code using point-to-point communication that makes two processes send each other an array of floats containing their rank. Each of the processes will declare two float arrays, A and B, of a fixed dimension (10000). All of the elements of array A will be initialized with the rank of the process. Then, A and B will be used as the buffers for SEND and RECEIVE, respectively. The program terminates with each process printing out one element of array B. The program should follow the scheme shown in the accompanying figure.

Once you have compiled the code using make, you can try it with:

    make submit

The output (out-exercise2) should look like:

    I am task 0 and I have received b(0) = 1.00
    I am task 1 and I have received b(0) = 0.00

HELP:
    int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
    int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Exercise 2

Let's extend the previous exercise and let the program be executed with more than 2 processes. Each process sends its A array to the process on its right (the last one sends it to process 0) and receives B from the process on its left (the first one receives from the last process). The send/receive mechanism is represented in the accompanying figure.
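One possible sketch of this ring exchange (not the provided skeleton; the array size and variable names are illustrative) posts a non-blocking send to the right neighbour so the blocking receives cannot deadlock. With only 2 processes it reduces to the pairwise exchange of Exercise 1:

    #include <stdio.h>
    #include <mpi.h>

    #define DIM 10000

    int main(int argc, char *argv[])
    {
        float A[DIM], B[DIM];
        int rank, size, i, right, left;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < DIM; i++)              /* fill A with this process' rank */
            A[i] = (float)rank;

        right = (rank + 1) % size;             /* destination of A               */
        left  = (rank - 1 + size) % size;      /* source of B                    */

        /* Non-blocking send avoids a deadlock when every rank also receives. */
        MPI_Isend(A, DIM, MPI_FLOAT, right, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(B, DIM, MPI_FLOAT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        printf("I am task %d and I have received b(0) = %.2f\n", rank, B[0]);

        MPI_Finalize();
        return 0;
    }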

HINT: You can use the modulo operator % in C. For example, the following code:

    for (i = 0; i < 10; i++) {
        printf("%d ", i);      /* i   takes the values 0 1 2 3 4 5 6 7 8 9 */
        printf("%d ", i % 2);  /* i%2 takes the values 0 1 0 1 0 1 0 1 0 1 */
    }

The output of the program should look like this (the order may change):

    I am task 3 and I have received b(0) = 2.00
    I am task 1 and I have received b(0) = 0.00
    I am task 2 and I have received b(0) = 1.00
    I am task 0 and I have received b(0) = 3.00

HELP:
    int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

Exercise 3

Task 0 initializes a variable to a given value (in this case 47), then modifies the variable (for example, by calculating the square of its value) and finally broadcasts it to all the other tasks. The program will print the value of this variable before and after the broadcast.

HELP:
    int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
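A minimal sketch of the broadcast pattern for Exercise 3, assuming an int variable and printing it before and after (variable names are illustrative, not the provided skeleton):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value = -1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 47;
            value = value * value;   /* task 0 modifies the variable: 47*47 = 2209 */
        }

        printf("Task %d before broadcast: %d\n", rank, value);

        /* The root (task 0) sends its value to every task in the communicator. */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("Task %d after broadcast: %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }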

Hands-on. Simple performance tools

Go to ./performance. For all the following exercises, if you don't have an MPI code to test, you can use the one in performance/mpi_test (it only needs to be compiled using the Makefile available in the same directory).

[PERF-1] TOP example: this example shows you how to execute the top command from your job script to monitor the master node of your job. It helps to find memory problems, etc.

Exercise PERF-1-A: Create an LSF job using the top.sh bash script. The idea is to launch top.sh in the background and then execute the standard mpirun command with any of the MPI jobs previously executed during this training.

Exercise PERF-1-B: Increase the number of repetitions for the top.sh script and put its output in a separate file (different from the standard output of your job).

Help about top: http://linux.about.com/od/commands/l/blcmdl1_top.htm

[PERF-2] MPIP example: this example shows you how to use the mpiP performance monitoring tool with any MPI application. In this case the mpiP wrapper is compiled with Intel MPI. The job.cmd file contains the instructions to use mpiP with an MPI job.

Exercise PERF-2-A: Launch the LSF job (job.cmd). Check the output generated by the mpiP performance tool.

Exercise PERF-2-B: Try the mpiP tool with any other binary. This implies modifying the job.cmd file.

Help about mpiP: http://mpip.sourceforge.net/

[PERF-3] VMSTAT example: this example shows you how to execute the vmstat command from your job script to monitor the master node of your job. It helps to track memory usage, disk usage, etc.

Exercise PERF-3-A: Create an LSF job using the vmstat.sh bash script. The idea is to launch vmstat.sh in the background and then execute the standard mpirun command with any of the MPI jobs previously executed during this training.

Exercise PERF-3-B: Increase the number of repetitions for the vmstat.sh script and put its output in a separate file (different from the standard output of your job). Limit the run time of vmstat.sh to the duration of your main execution.

Exercise PERF-3-C: Check the meaning of all the fields shown by the vmstat command.

Help about vmstat: http://linux.about.com/library/cmd/blcmdl8_vmstat.htm

[PERF-4] PERF example: this example shows you how to use the perf command in an LSF job script. It is useful for monitoring the performance of a serial or parallel job. The output of perf is a list of different system performance metrics.

Exercise PERF-4-A: Launch the LSF job (job.cmd). Check the output generated by the perf performance tool.

Exercise PERF-4-B: Try the perf tool with any other binary. This implies modifying the job.cmd file.

Optional -- Exercise PERF-4-C: Modify the perf.sh file to run with a parallel (MPI) job, changing the output file for perf depending on the MPI_RANK of the parallel execution.

Help about perf: https://perf.wiki.kernel.org/index.php/Main_Page

Hands-on. MPI advanced settings

Go to ./linpack. In this exercise you have to compile and execute the same MPI test program (the Linpack benchmark) with different MPI implementations. To do so, you have to load the correct environment before compilation and execution.

Exercise 1: different MPI implementations

1. OpenMPI. Go to the directory with the OpenMPI version of the code:

    cd mp_linpack_openmpi

If you have the default module environment, just type 'make arch=intel64'. This will generate the first OpenMPI binary. Check that the binary has been created:

    ls ../../linpack/mp_linpack_openmpi/bin/intel64/xhpl

Try to execute it by doing:

    cd runs

Create a job file using the example from job.cmd, then submit it:

    bsub < openmpi_job.cmd   ## take into account that you have to generate this openmpi_job.cmd

2. IntelMPI. Now you have to switch the modules environment to compile with Intel MPI. Once this is done, go to the mp_linpack_intelmpi directory and execute:

    make arch=intel64

Test the binary in the same way as the OpenMPI binary.

3. Try MVAPICH: directory mp_linpack_mpich2

4. Try POE: directory mp_linpack_poe

5. When all the binaries have been compiled, submit the job.cmd file and check the output with the ./check.sh script:

    bsub < job.cmd
    ./check.sh <stdout of your job.cmd file>

What kind of results do you get? Which MPI implementation is better?

Exercise 2: different MPI implementations and advanced MPI tuning

1. When all the binaries are compiled, submit the job2.cmd file and check the output with the ./check.sh script:

    bsub < job2.cmd
    ./check.sh <stdout of your job2.cmd file>

What kind of results do you get? Are the variables affecting the results? Which MPI version is better now?

2. Try to modify the variables to improve the Linpack execution.

Go to the greasy directory.

Hands-on. GREASY

Now you are going to generate a tasklist and execute it using Greasy. We have prepared a simple C program which generates a random number r between 10 and 20 and then calls sleep(r). This program lets us simulate real sequential tasks that may be executed with Greasy. The program is called like this:

    ./rand <ID>

1. Generate a task list of 100 tasks, each one with its own ID:

    ./rand 1
    ./rand 2
    ...

HELP: you can use a for loop in bash, for example:

    for i in `seq 1 100`; do echo "./rand $i" >> tasklist.txt; done

2. Now prepare the bsc_greasy.lsf.job to execute with:
   a. 32 workers, 16 per node
   b. 16 workers, 8 per node
   c. 4 workers, 1 per node

Check the Greasy log to ensure that everything is working properly.
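For reference only, a program with the behaviour described above (a random sleep between 10 and 20 seconds) could look roughly like the sketch below; the actual rand binary is already provided in the greasy directory, so you do not need to write it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    /* Illustrative stand-in for the provided ./rand <ID> program. */
    int main(int argc, char *argv[])
    {
        int id = (argc > 1) ? atoi(argv[1]) : 0;   /* task ID passed on the command line */
        unsigned int r;

        srand((unsigned int)time(NULL) + (unsigned int)id);
        r = 10 + rand() % 11;                      /* random value in [10, 20] */

        printf("Task %d sleeping for %u seconds\n", id, r);
        sleep(r);

        return 0;
    }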

Hands-on. Xeon Phi (MIC)

Preparation

* Use 'bsub -Is -x -q mic-interactive -W 02:00 /bin/bash' to access the Xeon Phi nodes.
* Load the Intel 14.0.1 module:

    module switch intel intel/14.0.1
    source /gpfs/apps/mn3/intel/2013_sp1.1.106/bin/compilervars.sh intel64
    module switch openmpi impi

* Enter the mic hands-on directory.

Native Programming

* Enter the Native directory. Native Intel(r) Xeon Phi(tm) coprocessor applications treat the coprocessor as a standalone multicore computer. Once the binary is built on your host system, it is copied to the filesystem on the coprocessor along with any other binaries and data it requires. The program is then run from the ssh console. In the BSC system, we need to copy the binaries manually to /scratch/tmp/<something> because it is the filesystem shared over NFS.

Important: Run all the commands of this section from the master node where you log in.

* Edit the native.cpp file and note that it is not different from any other program that would run on an Intel(r) Xeon system.
* Compile the program using the -mmic flag to generate a MIC binary. Remember to add the -openmp flag.
* Copy the compiled binary to /scratch/tmp.
* Now log in to the coprocessor in your node using: ssh mic0
* Go to the same directory: cd /scratch/tmp/<something>
* And now, try to execute the binary by issuing the command: ./<binary name> 1024
* Did this work? We can alternatively use ssh to run the binary from the host directly with the following command:

    ssh mic0 "LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH /scratch/tmp/native 1024"

* Try to run the command line and see if it works. Note two things about this command line. First, we set the LD_LIBRARY_PATH environment variable using MIC_LD_LIBRARY_PATH. MIC_LD_LIBRARY_PATH is defined by the Intel Compiler setup scripts to point to the location of the MIC version of the compiler libraries.

Sometimes it is not so simple to prepare the execution environment of our application, and it is better to encapsulate all the commands in a shell script that we can then invoke from ssh. Now we are going to use LSF batch mode.

* Run the command: bsub -x -q mic -W 02:00 ./<binary name> 1024
* Did you need to set the LD_LIBRARY_PATH variable? Why do you think that is?

On the BSC computing nodes the Intel Compiler libraries are already installed in a standard directory, which is why it is not necessary to set any extra environment. In fact, LSF sources the correct compilervars before executing on the MIC. On Intel Xeon Phi coprocessors, as on any other system, the exact configuration of your environment will depend on what software environment the system administrators provide you. Be sure to check beforehand.

MPI + MIC Example

* Copy the MPI directory anywhere under /gpfs/scratch/nct01/$user/
* Enter the MPI directory.
* Check the source code of test_mpi.c. (A rough idea of what such a test program typically contains is sketched after this list.)
* Compile the program with and without the -mmic flag to generate a MIC binary and a standard MPI binary.
* Have a look at mic_wrapper.sh.
* Edit job.cmd and replace 'change_me_host' and 'change_me_mic' with correct values.
* Submit the job and check the results. Is it what you expected?
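The file to inspect is the provided test_mpi.c. Purely as an illustration of what such a test program typically does (print each rank together with the processor it runs on, so you can see which ranks landed on the host and which on the coprocessor), a hypothetical version might look like this:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);   /* name of the host (or mic) this rank runs on */

        printf("Rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }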