Intel Parallel Studio XE Cluster Edition - Intel MPI - Intel Traceanalyzer & Collector
- Betty Johnson
- 6 years ago
1 Intel Parallel Studio XE Cluster Edition - Intel MPI - Intel Traceanalyzer & Collector
2 A brief Introduction to MPI 2
3 What is MPI? Message Passing Interface
- Explicit parallel model: all parallelism is explicit; the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs
- For parallel computers, clusters, heterogeneous networks and accelerators like the Intel MIC architecture
- Designed as a standard to provide access to advanced parallel hardware for end users, library writers and tool developers
- Communication is done between MPI ranks, typically implemented as operating system processes
4 MPI Standard
- Standard maintained by an open forum; Intel is one of the founders (1992) and is still very actively engaged
- Versions: 1.0 (1994), 2.0 (2000), 2.1 (2008), 2.2 (2009)
- Version 3, released in 2012, is the latest; not all implementations support it yet
- A message-passing library specification: an extended message-passing model, not a language or compiler specification, and not a specific implementation or product
5 Notes on C and Fortran
- C and Fortran bindings correspond closely
- In C: mpi.h must be #included; MPI functions return error codes or MPI_SUCCESS
- In Fortran: mpif.h must be included, or use the MPI module (MPI-2); all MPI calls are to subroutines, with a place for the return code in the last argument
- C++ bindings and Fortran-90 issues are part of MPI-2; MPI-3 introduces a Fortran 2008 interface
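As an illustrative sketch (not part of the original slides), the C error-code convention mentioned above can be checked explicitly; most programs rely on the default error handler instead, which aborts on failure:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    /* Every C binding returns an error code; MPI_SUCCESS on success */
    int err = MPI_Init(&argc, &argv);
    if (err != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init failed with code %d\n", err);
        MPI_Abort(MPI_COMM_WORLD, err);
    }
    MPI_Finalize();
    return 0;
}
```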
6 A first MPI program

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    printf( "I am %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}

MPI_COMM_WORLD is the default communicator whose group initially contains all processes.
7 Point-To-Point Communication

MPI_SEND(start, count, datatype, dest, tag, comm)
MPI_RECV(start, count, datatype, source, tag, comm, status)

- Messages are sent with an accompanying user-defined integer tag to assist the receiving process in identifying the message
- Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive
- MPI_SEND and MPI_RECV are blocking; there are non-blocking versions too (MPI_ISEND, MPI_IRECV)
- The six functions introduced so far (Init, Finalize, Comm_rank, Comm_size, Send, Recv) are all that many numerical programs need, but there is a lot more, like the MPI collective operations
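A minimal sketch (not from the original slides) of a blocking send/receive between rank 0 and rank 1 might look like this:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send one int to rank 1 with tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive from rank 0 with tag 0; MPI_ANY_TAG would accept any tag */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two ranks, e.g. $ mpirun -n 2 ./a.out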
8 MPI Collective Routines
- Many routines: MPI_ALLGATHER, MPI_ALLGATHERV, MPI_ALLREDUCE, MPI_ALLTOALL, MPI_ALLTOALLV, MPI_BCAST, MPI_GATHER, MPI_GATHERV, MPI_REDUCE, MPI_REDUCE_SCATTER, MPI_SCAN, MPI_SCATTER, MPI_SCATTERV, ...
- Collective operations are called by all processes in a communicator
- The ALL versions deliver results to all participating processes
- V versions ("vector") allow the chunks to have different sizes
- MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in (like MPI_SUM, MPI_MAX) and user-defined combiner functions
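As an illustrative sketch (not part of the original slides), a broadcast followed by a sum reduction shows the typical collective pattern: every rank in the communicator makes the same call:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 chooses a value; MPI_Bcast copies it to every rank */
    value = (rank == 0) ? 10 : 0;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Sum the (now identical) values; the result lands on rank 0 */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %d\n", size, sum);

    MPI_Finalize();
    return 0;
}
```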
9 Extending MPI: MPI-2
- Dynamic process management: dynamic process startup, dynamic establishment of connections
- One-sided communication: put/get, other operations
- Parallel I/O
- Other MPI-2 features: generalized requests; bindings for C++/Fortran-90; inter-language topics
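To make the one-sided put/get model concrete, here is a hedged sketch (not from the original slides) using the MPI-2 window and fence calls: rank 0 writes into rank 1's memory without rank 1 posting a receive:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, target_buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one int as a window for one-sided access */
    MPI_Win_create(&target_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 99;
        /* Put one int into rank 1's window; no matching call on rank 1 */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);   /* second fence completes the epoch */

    if (rank == 1)
        printf("rank 1's window now holds %d\n", target_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```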
10 MPI-3 Planned and Added Features

Topic                  | Motivation                     | Main Result
-----------------------|--------------------------------|---------------------------------------------
Collective Operations  | Collective performance         | Non-blocking & sparse collectives
Remote Memory Access   | Cache coherence, PGAS support  | Fast RMA
Backward Compatibility | Buffers > 2GB                  | Large buffer support, const buffers
Language Bindings      | ABI for latest C++ and Fortran | Fortran 2008 binding, removed C++ binding
Tools Support          | PMPI limitations               | MPIT interface (a very little bit was added)
Hybrid Programming     | Core count growth              | MPI_Mprobe, shared memory windows
Fault Tolerance        | Node count growth              | None. Next time?

(The slide also has an "Intel MPI 5.0 Support?" column; only the entries "Supported" and "N/A" for Fault Tolerance survive the transcription.)
Slide courtesy of A. Supalov
11 Overview Intel MPI Library 11
12 Intel MPI Library
- Intel MPI Library is derived from MPICH2; the latest version is 5.0
- Reasons to use Intel MPI:
- Many additional features make the Intel MPI Library more user friendly compared to other implementations: correctness checking, statistics, Trace Analyzer support, ...
- Intel MPI Library provides top performance, e.g. extensive performance tuning of key algorithms such as collective operations
- MPITUNE tool for automatic selection of the best algorithms and settings
- Scalability up to 150K ranks
- Available for Linux* and Windows*
- Professional support
13 Setting the environment
Use this handy script to define all necessary paths:
$ source /shared/intel/impi_5.0.1/bin64/mpivars.sh
or
$ module load Intel_MPI
No additional paths to binaries and libs have to be specified.
Recommended: if Intel MPI is the only MPI you will use, just include the above in your .bashrc
14 Intro to Intel MPI Library Compilation
A simple test program is part of the Intel MPI Library distribution:
$ cp $I_MPI_ROOT/test/test.c .
$ mpiicc -o test.x test.c
- mpiicc is the wrapper script for Intel icc (C compiler)
- mpicc is the wrapper script for GNU gcc
Also available are:
- mpiifort (Intel Fortran Compiler)
- mpiicpc (Intel C++ Compiler)
- mpicxx (GNU g++)
15 Intro to Intel MPI Library Execution
Intel MPI provides an easy-to-use run script:
$ mpirun -n <nprocs> ./test.x
The above works automatically on a single node and on clusters with job schedulers present.
For more nodes we usually need to define a host file with a single node name per line:
$ mpirun -f <host file> -n <nprocs> ./test.x
Unlike years before, there is no need anymore to start a daemon like the legacy mpd, since the MPICH Hydra process management is used.
16 Intro to Intel MPI Library Execution The test program prints out rank and hostname for each MPI process More debug information available by setting: $ export I_MPI_DEBUG=5 Will be propagated to all ranks automatically Prints basic settings of the Intel MPI Library 16
17 Output of test Program 17
18 Simple process placement using the Intel MPI Library
- Default pinning scheme: cores, sockets and nodes
- The easiest way to override the default behavior is to use the processes-per-node flag:
$ mpirun -ppn <nprocs-per-node> -n <nprocs> ./test.x
- If <nprocs-per-node> == 1, round robin placement (next process on next node) is used
19 Intel MPI Library: ppn = 1 19
20 Overview Intel TraceAnalyzer and Collector 20
21 Intel Trace Analyzer and Collector (ITAC) A tool for understanding MPI program behavior, finding bottlenecks, performance analysis and MPI-correctness checking More than a profiler: Visualizes temporal behavior of MPI routines Shows dependencies and load imbalances Includes a correctness checking library Easy to use. Invoke via: Setting an extra flag to mpirun/mpiexec Setting an environment variable without changing your application or your run scripts 21
22 Intel Trace Analyzer and Collector
ITAC may be applied without touching the program or environment. One way to get a first trace is:
$ mpirun -trace -n <nprocs> ./test.x
Alternatively, just set the preload library and run without the -trace flag:
$ export LD_PRELOAD=libVT.so
$ mpirun -f <hostfile> -n <nprocs> ./test.x
This is actually what the flag does internally. This methodology may be applied to situations with complex run scripts where it is not known where mpirun is actually executed.
Note: this does not work for statically linked Intel MPI (not recommended).
23 Viewing the trace file ITAC will generate several files inside the directory where you started mpirun. Just start traceanalyzer in this directory: $ traceanalyzer test.x.stf Alternatively there is a Windows version of traceanalyzer contained in the Linux ICS package. 23
24 ITAC Function Profile After starting ITAC a window showing a basic timing profile for MPI and Application will be displayed. Right click on the red MPI bar to show the profiling for each used MPI routine: 24
25 ITAC Event Timeline Most important view of ITAC is the Event Timeline. This shows the temporal development of MPI routines and messages: 25
26 ITAC MPI Correctness Checker
The Correctness Checker validates MPI correctness. It uses another library but may be started like ordinary ITAC:
$ mpirun -check_mpi -n <nprocs> ./test.x
or
$ export LD_PRELOAD=libVTmc.so
$ mpirun -n <nprocs> ./test.x
27 Intel VTune Amplifier XE for MPI
Intel VTune Amplifier XE provides detailed information about timings and core events. It can also provide insight into the behavior of threaded applications:
$ source /opt/intel/vtune_amplifier_xe/amplxe-vars.sh
$ mpirun -n <N> amplxe-cl --result-dir <result dir> --collect <mode> \
  -- <MPI executable>
Examples (hotspots and concurrency are predefined analysis types; concurrency only makes sense with additional threading):
$ mpirun -n 2 amplxe-cl --result-dir axe_ho --collect hotspots -- ./poisson.x
$ mpirun -n 2 amplxe-cl --result-dir axe_co -c concurrency -- ./poisson.x
28 Results with Intel VTune Amplifier XE
After running the MPI program, result directories should appear with the previously defined base name, indexed by MPI rank.
Results may be viewed as ASCII output:
$ amplxe-cl --report hotspots -r axe_ho.0
or by using the Intel VTune Amplifier GUI:
$ amplxe-gui axe_ho.0
Results may also be transferred to a Windows* laptop and viewed with the Windows* version of Intel VTune Amplifier XE.
29 Intel Inspector XE for MPI Applications
Intel Inspector XE offers memory checking and correctness checking for threaded applications. For MPI applications we may use it in the following way:
$ source /opt/intel/inspector_xe/inspxe-vars.sh intel64
$ mpirun -n <N> inspxe-cl --result-dir <result dir> --collect <mode> \
  -- <MPI executable>
Examples:
$ mpirun -n 4 inspxe-cl --result-dir insp_mi3 --collect mi3 -- ./poisson.x
$ mpirun -n 4 inspxe-cl --result-dir insp_ti3 --collect ti3 -- ./poisson.x
mi3 and ti3 are the most demanding memory and threading modes.
30 Results with Intel Inspector XE
After running the MPI program, result directories should appear with the previously defined base name, indexed by MPI rank.
Results may be viewed as ASCII output:
$ inspxe-cl --report problems -r insp_mi3.0
or by using the Intel Inspector XE GUI:
$ inspxe-gui insp_mi3.0
Results may also be transferred to a Windows* computer and viewed with the Windows* version of Intel Inspector XE.
31 Advanced Topics: Cluster Exploration Tools 31
32 Cluster Exploration Tools
- cpuinfo: included in the Intel MPI Library package
- Debug level: raising the debug level of the Intel MPI Library will provide extra information
- ifconfig etc.: Linux tools for showing available network devices
- Intel MPI Benchmarks (IMB): collection of timed MPI tests for generic MPI performance evaluation
- MPITUNE: tuning script for automatic determination of optimal settings; results can be stored and used on demand. This lecture covers the generic mode using IMB as the program to be tuned
33 Cluster Node Exploration: cpuinfo Shows important features of a node: number of sockets, cores per socket including hyper-threads and caches Part of the Intel MPI Library distribution Reads its data from /proc/cpuinfo and prints it in a more appropriate format 33
35 Using Environment Variables
Environment variables may be exported inside your shell and are automatically propagated to each rank.
Or, they can be specified on the command line for a single run:
$ mpirun -genv I_MPI_DEBUG 4 <program.x>
-genv stands for "global environment", propagated to all nodes.
It is also possible to define local environments for different nodes; -env defines environment variables locally:
$ mpirun -env OMP_NUM_THREADS 4 -n 2 <program1.x> : \
  -env OMP_NUM_THREADS 2 -n 4 <program2.x>
36 Cluster Node Exploration: Debug Info
- Setting the I_MPI_DEBUG environment variable increases the information printed to stdout, depending on the non-negative integer value specified
- For example, I_MPI_DEBUG=4 prints information about process pinning, the network interfaces used, and the Intel MPI Library environment variables set by the user
- Process pinning is the mapping of MPI ranks to hardware resources like cores, sockets, caches etc.
- The default pinning strategy of the Intel MPI Library may depend on the version!
- To increase performance you should control the pinning, especially for hybrid programs (pinning domains)
38 Cluster Node Exploration: Pinning
Pin the ranks to explicit processors using the environment variable as shown below:
$ export I_MPI_PIN_PROCESSOR_LIST=p1,p2,p3,
Rank #n is mapped to logical processor pn. Besides explicit mapping of ranks to logical processors as shown, you can also use the predefined settings.
39 I_MPI_PIN_PROCESSOR_LIST=1-8 First rank on socket #0 and core #0 Second rank on socket #1 and core #1 39
40 Cluster Structure
(Diagram: inter-node communication via IB router and ETH router, inter-socket via QPI, intra-socket; head node for compile, edit and job management, connected to the Internet)
41 Three Levels of Communication Speed
Communication speed is not homogeneous:
- Inter-node (InfiniBand*, Ethernet, etc.)
- Intra-node, inter-socket (Quick Path Interconnect, QPI)
- Intra-socket
Two additional levels when using an Intel Xeon Phi coprocessor:
- Host to Intel Xeon Phi coprocessor communication
- Inter Intel Xeon Phi coprocessor communication
42 Measuring Comm Speed with IMB
The simplest benchmark in IMB is called PingPong: data packages of different sizes are sent from rank 0 to rank 1 and back:
$ mpirun -n 2 IMB-MPI1 pingpong
43 Placing MPI Ranks on a Cluster
- Process placement on a single node was already discussed
- The default strategy for mapping MPI ranks on a cluster tries to balance resources (same number of processes on each socket) and to minimize the distance between adjacent ranks
- A mapping with 2 MPI ranks on different nodes may be enforced by using the flag -ppn 1
- PPN stands for Processes Per Node; the value 1 will place the first rank on the first node and the second rank on the next node (alternative env. var.: I_MPI_PERHOST=1)
44 Measuring 3 Levels of Comm Speed
Inter-node communication (e.g. InfiniBand*):
$ mpirun -ppn 1 -n 2 IMB-MPI1 pingpong
Intra-node, inter-socket (QPI):
$ export I_MPI_PIN_PROCESSOR_LIST=allsocks
$ mpirun -n 2 IMB-MPI1 pingpong
Intra-node, intra-socket (between cores on a processor):
$ export I_MPI_PIN_PROCESSOR_LIST=allcores:grain=1
$ mpirun -n 2 IMB-MPI1 pingpong
45 Multiple PingPongs
The default IMB pingpong will just use the first 2 ranks for the pingpong and put all other ranks into a barrier.
It is possible to do simultaneous pingpongs, e.g. 4 pairs:
$ mpirun -n 8 IMB-MPI1 -multi <x> pingpong
with x=0 for average results and x=1 for all results.
A stretch goal for the labs is to show all the different communication speeds in a single IMB run.
46 Three Different Comm Levels 46
47 Automatic Tuning with MPITUNE
- Provides generic tuning of optimal settings for environment variables
- Uses the IMB benchmark
- Provides results in scripts that can be read by using mpirun with -tune
- The resulting settings may be just copied or used as a hint for further optimization
- The resulting settings are only taken if the time is reduced by more than 3%
- The 3% limit can be configured to another value
48 How To Run MPITUNE
MPITUNE is an executable script. The easiest way is to simply run:
$ mpitune
We may restrict MPITUNE to full nodes and the default fabric:
$ mpitune -pr 8:8 -fl shm:dapl
Hosts should be taken from the provided hostfile or the batch system.
49 MPITUNE output 49
50 MPITUNE result file File: mpiexec_shm:dapl_nn_1_ppn_8.conf 50
51 IMB and Cache Effects
- IMB may deliver too optimistic results because send and receive buffers stay in cache
- Real applications will normally use data from main memory for sending
- Results may be more realistic if we make sure that cache lines are not reused
- The flag -off_cache <last level cache size [MB]> may help in avoiding cache reuse
52 Summary
- Tuning can only be effective when hardware parameters like node structure and communication speeds are well known
- cpuinfo and I_MPI_DEBUG=4 provide useful information about node structure, process mapping and the selected network fabric
- IMB provides information about communication speeds
- Many environment variables are available for fine tuning; we may automatically set some of them by using MPITUNE
- Labs show practical usage of IMB and MPITUNE
53 Performance Caveats and Notes Performance varies with each application, regardless of the technology and methods used. Certain types of HPC applications are amenable to acceleration and it is important to understand their characteristics. Once an application is identified to take advantage of acceleration, the high level and low level techniques are expected to work equally well. 53
Chip Multiprocessors COMP35112 Lecture 9 - OpenMP & MPI Graham Riley 14 February 2018 1 Today s Lecture Dividing work to be done in parallel between threads in Java (as you are doing in the labs) is rather
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 4 Message-Passing Programming Learning Objectives n Understanding how MPI programs execute n Familiarity with fundamental MPI functions
More informationMPI Runtime Error Detection with MUST
MPI Runtime Error Detection with MUST At the 27th VI-HPS Tuning Workshop Joachim Protze IT Center RWTH Aachen University April 2018 How many issues can you spot in this tiny example? #include #include
More informationAdvanced MPI. Andrew Emerson
Advanced MPI Andrew Emerson (a.emerson@cineca.it) Agenda 1. One sided Communications (MPI-2) 2. Dynamic processes (MPI-2) 3. Profiling MPI and tracing 4. MPI-I/O 5. MPI-3 11/12/2015 Advanced MPI 2 One
More informationOur new HPC-Cluster An overview
Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization
More informationECE 574 Cluster Computing Lecture 13
ECE 574 Cluster Computing Lecture 13 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 15 October 2015 Announcements Homework #3 and #4 Grades out soon Homework #5 will be posted
More informationScalasca performance properties The metrics tour
Scalasca performance properties The metrics tour Markus Geimer m.geimer@fz-juelich.de Scalasca analysis result Generic metrics Generic metrics Time Total CPU allocation time Execution Overhead Visits Hardware
More informationMPI Mechanic. December Provided by ClusterWorld for Jeff Squyres cw.squyres.com.
December 2003 Provided by ClusterWorld for Jeff Squyres cw.squyres.com www.clusterworld.com Copyright 2004 ClusterWorld, All Rights Reserved For individual private use only. Not to be reproduced or distributed
More informationPractical Course Scientific Computing and Visualization
July 5, 2006 Page 1 of 21 1. Parallelization Architecture our target architecture: MIMD distributed address space machines program1 data1 program2 data2 program program3 data data3.. program(data) program1(data1)
More informationA few words about MPI (Message Passing Interface) T. Edwald 10 June 2008
A few words about MPI (Message Passing Interface) T. Edwald 10 June 2008 1 Overview Introduction and very short historical review MPI - as simple as it comes Communications Process Topologies (I have no
More informationRecap of Parallelism & MPI
Recap of Parallelism & MPI Chris Brady Heather Ratcliffe The Angry Penguin, used under creative commons licence from Swantje Hess and Jannis Pohlmann. Warwick RSE 13/12/2017 Parallel programming Break
More informationIntroduction to parallel computing concepts and technics
Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing
More informationSymmetric Computing. SC 14 Jerome VIENNE
Symmetric Computing SC 14 Jerome VIENNE viennej@tacc.utexas.edu Symmetric Computing Run MPI tasks on both MIC and host Also called heterogeneous computing Two executables are required: CPU MIC Currently
More informationAdvanced MPI. Andrew Emerson
Advanced MPI Andrew Emerson (a.emerson@cineca.it) Agenda 1. One sided Communications (MPI-2) 2. Dynamic processes (MPI-2) 3. Profiling MPI and tracing 4. MPI-I/O 5. MPI-3 22/02/2017 Advanced MPI 2 One
More informationIntroduction to MPI. Jerome Vienne Texas Advanced Computing Center January 10 th,
Introduction to MPI Jerome Vienne Texas Advanced Computing Center January 10 th, 2013 Email: viennej@tacc.utexas.edu 1 Course Objectives & Assumptions Objectives Teach basics of MPI-Programming Share information
More informationIntroduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero
Introduction to Intel Xeon Phi programming techniques Fabio Affinito Vittorio Ruggiero Outline High level overview of the Intel Xeon Phi hardware and software stack Intel Xeon Phi programming paradigms:
More informationIntroduction to MPI. Ritu Arora Texas Advanced Computing Center June 17,
Introduction to MPI Ritu Arora Texas Advanced Computing Center June 17, 2014 Email: rauta@tacc.utexas.edu 1 Course Objectives & Assumptions Objectives Teach basics of MPI-Programming Share information
More informationMessage-Passing Computing
Chapter 2 Slide 41þþ Message-Passing Computing Slide 42þþ Basics of Message-Passing Programming using userlevel message passing libraries Two primary mechanisms needed: 1. A method of creating separate
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More information15-440: Recitation 8
15-440: Recitation 8 School of Computer Science Carnegie Mellon University, Qatar Fall 2013 Date: Oct 31, 2013 I- Intended Learning Outcome (ILO): The ILO of this recitation is: Apply parallel programs
More informationExperiencing Cluster Computing Message Passing Interface
Experiencing Cluster Computing Message Passing Interface Class 6 Message Passing Paradigm The Underlying Principle A parallel program consists of p processes with different address spaces. Communication
More informationIntroduction to MPI. SHARCNET MPI Lecture Series: Part I of II. Paul Preney, OCT, M.Sc., B.Ed., B.Sc.
Introduction to MPI SHARCNET MPI Lecture Series: Part I of II Paul Preney, OCT, M.Sc., B.Ed., B.Sc. preney@sharcnet.ca School of Computer Science University of Windsor Windsor, Ontario, Canada Copyright
More informationL14 Supercomputing - Part 2
Geophysical Computing L14-1 L14 Supercomputing - Part 2 1. MPI Code Structure Writing parallel code can be done in either C or Fortran. The Message Passing Interface (MPI) is just a set of subroutines
More informationCS 426. Building and Running a Parallel Application
CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations
More informationMPI MESSAGE PASSING INTERFACE
MPI MESSAGE PASSING INTERFACE David COLIGNON, ULiège CÉCI - Consortium des Équipements de Calcul Intensif http://www.ceci-hpc.be Outline Introduction From serial source code to parallel execution MPI functions
More informationExercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing
Exercises: April 11 1 PARTITIONING IN MPI COMMUNICATION AND NOISE AS HPC BOTTLENECK LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2017 Hermann Härtig THIS LECTURE Partitioning: bulk synchronous
More informationParallel Applications on Distributed Memory Systems. Le Yan HPC User LSU
Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming
More informationThe Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) CPUs
1 The Message Passing Interface (MPI): Parallelism on Multiple (Possibly Heterogeneous) s http://mpi-forum.org https://www.open-mpi.org/ Mike Bailey mjb@cs.oregonstate.edu Oregon State University mpi.pptx
More informationLOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig
LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2016 Hermann Härtig LECTURE OBJECTIVES starting points independent Unix processes and block synchronous execution which component (point in
More informationPaul Burton April 2015 An Introduction to MPI Programming
Paul Burton April 2015 Topics Introduction Initialising MPI & basic concepts Compiling and running a parallel program on the Cray Practical : Hello World MPI program Synchronisation Practical Data types
More informationCornell Theory Center. Discussion: MPI Collective Communication I. Table of Contents. 1. Introduction
1 of 18 11/1/2006 3:59 PM Cornell Theory Center Discussion: MPI Collective Communication I This is the in-depth discussion layer of a two-part module. For an explanation of the layers and how to navigate
More informationMPI MESSAGE PASSING INTERFACE
MPI MESSAGE PASSING INTERFACE David COLIGNON, ULiège CÉCI - Consortium des Équipements de Calcul Intensif http://www.ceci-hpc.be Outline Introduction From serial source code to parallel execution MPI functions
More informationIntroduction to MPI HPC Workshop: Parallel Programming. Alexander B. Pacheco
Introduction to MPI 2018 HPC Workshop: Parallel Programming Alexander B. Pacheco Research Computing July 17-18, 2018 Distributed Memory Model Each process has its own address space Data is local to each
More informationParallel Programming Using MPI
Parallel Programming Using MPI Prof. Hank Dietz KAOS Seminar, February 8, 2012 University of Kentucky Electrical & Computer Engineering Parallel Processing Process N pieces simultaneously, get up to a
More informationCSE. Parallel Algorithms on a cluster of PCs. Ian Bush. Daresbury Laboratory (With thanks to Lorna Smith and Mark Bull at EPCC)
Parallel Algorithms on a cluster of PCs Ian Bush Daresbury Laboratory I.J.Bush@dl.ac.uk (With thanks to Lorna Smith and Mark Bull at EPCC) Overview This lecture will cover General Message passing concepts
More informationIPM Workshop on High Performance Computing (HPC08) IPM School of Physics Workshop on High Perfomance Computing/HPC08
IPM School of Physics Workshop on High Perfomance Computing/HPC08 16-21 February 2008 MPI tutorial Luca Heltai Stefano Cozzini Democritos/INFM + SISSA 1 When
More informationTool for Analysing and Checking MPI Applications
Tool for Analysing and Checking MPI Applications April 30, 2010 1 CONTENTS CONTENTS Contents 1 Introduction 3 1.1 What is Marmot?........................... 3 1.2 Design of Marmot..........................
More informationPart One: The Files. C MPI Slurm Tutorial - Hello World. Introduction. Hello World! hello.tar. The files, summary. Output Files, summary
C MPI Slurm Tutorial - Hello World Introduction The example shown here demonstrates the use of the Slurm Scheduler for the purpose of running a C/MPI program. Knowledge of C is assumed. Having read the
More informationCluster Clonetroop: HowTo 2014
2014/02/25 16:53 1/13 Cluster Clonetroop: HowTo 2014 Cluster Clonetroop: HowTo 2014 This section contains information about how to access, compile and execute jobs on Clonetroop, Laboratori de Càlcul Numeric's
More informationProgramming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam
Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction
More informationDesigning Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi
More informationHands-on. MPI basic exercises
WIFI XSF-UPC: Username: xsf.convidat Password: 1nt3r3st3l4r WIFI EDUROAM: Username: roam06@bsc.es Password: Bsccns.4 MareNostrum III User Guide http://www.bsc.es/support/marenostrum3-ug.pdf Remember to
More information