WIFI XSF-UPC: Username: xsf.convidat  Password: 1nt3r3st3l4r
WIFI EDUROAM: Username: roam06@bsc.es  Password: Bsccns.4

MareNostrum III User Guide: http://www.bsc.es/support/marenostrum3-ug.pdf

Remember to include
   #BSUB -U patc1
in your job scripts.

Hands-on. MPI basic exercises

Go to the mpi directory; there you will find a folder for each exercise. You can use the MPICH man pages for the MPI calls: http://www.mpich.org/static/docs/v3.1/

Exercise 0

Using MPI, try to print out:
   Hello world, I am proc X of total Y
with a total number of tasks Y = 4 and X ranging from 0 to 3. After compiling the code, try to launch it as a batch job. You have to complete hello.c and job.lsf with proper values. Once completed, you can compile with
   make
and execute with
   make submit

HELP:
int MPI_Init(int *argc, char ***argv)
int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)
int MPI_Finalize(void)
Exercise 1

Write a code using point-to-point communication that makes two processes send each other an array of floats containing their rank. Each process declares two float arrays, A and B, of a fixed dimension (10000). All the elements of array A are initialized with the rank of the process. Then A and B are used as the buffers for SEND and RECEIVE, respectively. The program terminates with each process printing out one element of array B. The program should follow this scheme:

Once you have compiled the code using make, you can try it with:
   make submit
The output out-exercise2 should look like:
   I am task 0 and I have received b(0) = 1.00
   I am task 1 and I have received b(0) = 0.00

HELP:
int MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Exercise 2

Let's extend the previous exercise and let the program be executed with more than 2 processes. Each process sends the A array to the process on its right (the last one sends it to process 0) and receives B from the process on its left (the first one receives from the last process). The send/receive mechanism is represented in this figure:
HINT: You can use the modulo operator % in C. For example, the following code:

   for (i = 0; i < 10; i++) printf("%d ", i);
   for (i = 0; i < 10; i++) printf("%d ", i % 2);

will print:
   0 1 2 3 4 5 6 7 8 9
   0 1 0 1 0 1 0 1 0 1

The output should look like (the order may change):
   I am task 3 and I have received b(0) = 2.00
   I am task 1 and I have received b(0) = 0.00
   I am task 2 and I have received b(0) = 1.00
   I am task 0 and I have received b(0) = 3.00

HELP:
int MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)

Exercise 3

Task 0 initializes a variable to a given value (in this case 47), then modifies the variable (for example, by calculating the square of its value) and finally broadcasts it to all the other tasks. The program will print the value of this variable before and after the broadcast.

HELP:
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Hands-on. Simple performance tools

Go to ./performance

For all the following exercises, if you don't have an MPI code to test with, you can use the one in performance/mpi_test (it only needs to be compiled using the Makefile available in the same directory).

[PERF-1] TOP example: this example shows you how to execute the top command from your job script to monitor the master node of your job. It helps to find memory problems, etc.

Exercise PERF-1-A: Create an LSF job using the top.sh bash script. The idea is to launch top.sh in the background and then execute the standard mpirun command with any of the MPI jobs previously executed during this training.

Exercise PERF-1-B: Increase the number of repetitions for the top.sh script and send its output to a separate file (different from the standard output of your job).

Help about top: http://linux.about.com/od/commands/l/blcmdl1_top.htm

[PERF-2] MPIP example: this example shows you how to use the mpiP performance monitoring tool with any MPI application. In this case the mpiP wrapper is compiled with Intel MPI. The job.cmd file contains the instructions to use mpiP with an MPI job.

Exercise PERF-2-A: Launch the LSF job (job.cmd). Check the output generated by the mpiP performance tool.

Exercise PERF-2-B: Try the mpiP tool with any other binary. This implies modifying the job.cmd file.

Help about mpiP: http://mpip.sourceforge.net/

[PERF-3] VMSTAT example: this example shows you how to execute the vmstat command from your job script to monitor the master node of your job. It helps to find memory usage, disk usage, etc.

Exercise PERF-3-A: Create an LSF job using the vmstat.sh bash script. The idea is to launch vmstat.sh in the background and then execute the standard mpirun command with any of the MPI jobs previously executed during this training.

Exercise PERF-3-B: Increase the number of repetitions for the vmstat.sh script and send its output to a separate file (different from the standard output of your job). Limit the time of vmstat.sh to the duration of your main execution.

Exercise PERF-3-C: Check the meaning of all the fields shown by the vmstat command.

Help about vmstat: http://linux.about.com/library/cmd/blcmdl8_vmstat.htm
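A monitoring job script of the kind described in PERF-1 and PERF-3 could be shaped like this (a sketch only; the queue settings, resource numbers and binary name are assumptions, not values from the training material):

```shell
#!/bin/bash
#BSUB -n 16
#BSUB -W 00:30
#BSUB -U patc1
#BSUB -oo job-%J.out

# launch the monitor in the background, redirecting it to its own file
# (different from the standard output of the job)
./vmstat.sh > vmstat-$LSB_JOBID.log 2>&1 &
MONITOR_PID=$!

# run the real MPI job while the monitor samples the master node
mpirun ./mpi_test

# limit the monitor to the duration of the main execution
kill $MONITOR_PID
```

The same pattern works for top.sh: only the background command and its log file name change.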
[PERF-4] PERF example: this example shows you how to use the perf command in an LSF job script. It is useful to monitor the performance of a serial or parallel job. The output of perf is a list of different system performance metrics.

Exercise PERF-4-A: Launch the LSF job (job.cmd). Check the output generated by the perf performance tool.

Exercise PERF-4-B: Try the perf tool with any other binary. This implies modifying the job.cmd file.

Optional -- Exercise PERF-4-C: Modify the perf.sh file to run with a parallel (MPI) job, changing the output file for perf depending on the MPI rank of the parallel execution.

Help about perf: https://perf.wiki.kernel.org/index.php/main_page

Hands-on. MPI advanced settings

Go to ./linpack

In this exercise you have to compile and execute the same MPI test program (the Linpack benchmark) with different MPI implementations. To do so, you have to load the correct environment before compilation and execution.

Exercise 1: different MPI implementations

1. OpenMPI: go to the directory with the code for the OpenMPI binary:
      cd mp_linpack_openmpi
   If you have the default module environment, just type 'make arch=intel64'. This will generate the first OpenMPI binary. Check that the binary has been created:
      ls ../../linpack/mp_linpack_openmpi/bin/intel64/xhpl
   Try to execute it:
      cd runs
   Create a job file using the example from job.cmd:
      bsub < openmpi_job.cmd   ## note that you have to generate this openmpi_job.cmd yourself
2. IntelMPI: now switch the module environment to compile with IntelMPI. Once this is done, go to the mp_linpack_intelmpi directory and execute:
      make arch=intel64
   Test the binary in the same way as the OpenMPI binary.
3. Try MVAPICH: directory mp_linpack_mpich2
4. Try POE: directory mp_linpack_poe
5. When all the binaries have been compiled, submit the job.cmd file and check the output with the ./check.sh script:
      bsub < job.cmd
      ./check.sh <stdout file of your job.cmd run>
   What kind of results do you get? Which MPI implementation is better?

Exercise 2: different MPI implementations and advanced MPI tuning

1. When all the binaries are compiled, submit the job2.cmd file and check the output with the ./check.sh script:
      bsub < job2.cmd
      ./check.sh <stdout file of your job2.cmd run>
   What kind of results do you get? Are the variables affecting the results? Which MPI version is better now?
2. Try to modify the variables to improve the Linpack execution.

Go to the greasy directory.

Hands-on. GREASY

Now you are going to generate a task list and execute it using GREASY. We have prepared a simple C program which generates a random number r between 10 and 20 and then calls sleep(r). This program lets us simulate real sequential tasks that may be executed with GREASY. The program is invoked like this:
   ./rand <ID>

1. Generate the task list of 100 tasks, each one with its own ID:
      ./rand 1
      ./rand 2
      ...
   HELP: you can use a for loop in bash:
      for i in `seq 1 100`; do echo "./rand $i" >> tasklist.txt ; done
2. Now prepare bsc_greasy.lsf.job to execute with:
      a. 32 workers, 16 per node
      b. 16 workers, 8 per node
      c. 4 workers, 1 per node

Check the GREASY log to ensure that everything is working properly.
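The bash hint for step 1, written out as a small script (a sketch; it assumes the task list file is called tasklist.txt, as in the HELP line):

```shell
#!/bin/bash
# generate a GREASY task list of 100 tasks, one "./rand <ID>" line per task
rm -f tasklist.txt
for i in $(seq 1 100); do
    echo "./rand $i" >> tasklist.txt
done
```

The resulting tasklist.txt has 100 lines, "./rand 1" through "./rand 100", ready to be passed to GREASY from the job script.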
Hands-on. Xeon Phi (MIC)

Preparation
* Use 'bsub -Is -x -q mic-interactive -W 02:00 /bin/bash' to access the Xeon Phi nodes.
* Load the Intel 14.0.1 module:
   module switch intel intel/14.0.1
   source /gpfs/apps/mn3/intel/2013_sp1.1.106/bin/compilervars.sh intel64
   module switch openmpi impi
* Enter the mic hands-on directory.

Native programming
* Enter the Native directory. Native Intel(r) Xeon Phi(tm) coprocessor applications treat the coprocessor as a standalone multicore computer. Once the binary is built on your host system, it is copied to the filesystem on the coprocessor along with any other binaries and data it requires. The program is then run from the ssh console. In the BSC system, we need to copy the binaries manually to /scratch/tmp/<something> because that is the filesystem shared over NFS. Important: run all the commands of this section from the master node where you logged in.
* Edit the native.cpp file and note that it is no different from any other program that would run on an Intel(r) Xeon system.
* Compile the program using the -mmic flag to generate a MIC binary. Remember to add the -openmp flag.
* Copy the compiled binary to /scratch/tmp
* Now log in to the coprocessor of your node using: ssh mic0
* Go to the same directory: cd /scratch/tmp/<something>
* Now try to execute the binary by issuing the command: ./<binary name> 1024
* Did this work? We can alternatively use ssh to run the binary from the host directly, using the following command:
   ssh mic0 "LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH /scratch/tmp/native 1024"
* Try to run the command line and see if it works. Note two things from this command line. First, we set the LD_LIBRARY_PATH environment variable using MIC_LD_LIBRARY_PATH. MIC_LD_LIBRARY_PATH is defined by the Intel compiler setup scripts to point to the location of the MIC version of the compiler libraries.
Sometimes it is not so simple to prepare the execution environment of our application, and it is better to encapsulate all the commands in a shell script that we can then invoke from ssh.

Now we are going to use LSF batch mode.
* Run the command: bsub -x -q mic -W 02:00 ./<binary name> 1024
* Did you need to set the LD_LIBRARY_PATH variable? Why do you think that is? On the BSC computing nodes the Intel compiler libraries are already installed in a standard directory, which is why it is not necessary to set any extra environment; in fact, LSF sources the correct compilervars before executing on the MIC. On Intel Xeon Phi coprocessors, as in any other system, the exact configuration of your environment will depend on what software environment the system administrators provide you. Be sure to check beforehand.

MPI + MIC example
* Copy the MPI directory anywhere in /gpfs/scratch/nct01/$user/
* Enter the MPI directory.
* Check the source code of test_mpi.c
* Compile the program with and without the -mmic flag to generate a MIC binary and a standard MPI binary.
* Have a look at mic_wrapper.sh
* Edit job.cmd and replace 'change_me_host' and 'change_me_mic' with the correct values.
* Submit the job and check the results. Is it what you expected?
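The idea of encapsulating the execution environment in a script invoked over ssh could look like this (a sketch only; the script name run_native.sh is our own, while the binary path and the MIC_LD_LIBRARY_PATH variable come from the native-programming section above):

```shell
#!/bin/bash
# run_native.sh -- hypothetical wrapper placed in /scratch/tmp;
# invoked from the host as: ssh mic0 /scratch/tmp/run_native.sh 1024

# point the runtime linker at the MIC versions of the compiler libraries
export LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH

# any other environment setup (e.g. OMP_NUM_THREADS) can go here

# launch the native binary, forwarding the script's arguments to it
exec /scratch/tmp/native "$@"
```

This keeps the ssh command line short, and any additional environment variables the application needs go into the script rather than onto the command line.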