Advanced MPI: MPI Tuning or How to Save SUs by Optimizing your MPI Library! The lab.


1 Advanced MPI: MPI Tuning or How to Save SUs by Optimizing your MPI Library! The lab.
Jérôme VIENNE
Texas Advanced Computing Center (TACC), University of Texas at Austin
Tuesday 25th March

2 Grab the Lab files
- Log in to Stampede: ssh <username>@...
- XSEDE users: follow the instructions provided in the ...
- Change to your $WORK directory: cdw
- Untar the file mpi_tuning.tar.gz (in train00) into your directory:
    tar xvzf train00/mpi_tuning.tar.gz
    ls

3 Access to compute nodes
Training allocation:
    XSEDE: TG-TRA
    TACC: TRAINING-HPC
Login:
    idev -t 2:00:00

4 Goals
1 Play with the mapping of MVAPICH2
2 Try the different intra-node mechanisms with MVAPICH2
3 Tune MPI_Gather of Intel MPI for osu_gather

5 Plan
1 Map me if you can!
2 The Daltons
3 Out of Sight


7 Things to know
- Go to mpi_tuning/1.mapping.
- The default MPI stack on Stampede is MVAPICH2 1.9a2; there is no need to change it.
- You just have to edit and execute the launch script.
- You will probably need to change the number of MPI tasks (the default is 16).

8 Mapping command (MVAPICH2)
- MV2_CPU_BINDING_POLICY=bunch|scatter
- MV2_CPU_BINDING_LEVEL=core|socket|numanode
- Manual: MV2_CPU_MAPPING=0:8:9-15:1-7
- Report: MV2_SHOW_CPU_BINDING=1
How to set/unset env variables?
- To set: export MV2_CPU_BINDING_POLICY=bunch
- To unset: unset MV2_CPU_BINDING_POLICY
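As a minimal sketch of the set, inspect, and unset cycle from this slide, the session below uses only the variable names listed above. The chosen values (scatter, socket) are just one illustrative combination, and no MPI job is launched here.

```shell
# Ask for a scattered placement at socket granularity (illustrative values).
export MV2_CPU_BINDING_POLICY=scatter
export MV2_CPU_BINDING_LEVEL=socket
# Ask MVAPICH2 to report the binding it applied when the job starts.
export MV2_SHOW_CPU_BINDING=1

# List what is currently set before launching the job.
env | grep '^MV2_' | sort

# Drop the policy again so the library default applies.
unset MV2_CPU_BINDING_POLICY
env | grep '^MV2_' | sort
```

Checking the environment just before the launch line is a cheap way to catch a variable left over from a previous exercise.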

9 1 MPI task per socket
[Figure: topology diagram of one node. Machine (32GB) with two NUMA nodes of 16GB each; each NUMA node holds one socket (Socket P#0, Socket P#1) with a 20MB L3 cache and eight cores (Core P#0 to Core P#7). eth0, eth1 and sda are attached to NUMANode P#0; ib0 (mlx4_0) and mic0 are attached to NUMANode P#1. Rank 0 is marked on Socket P#0 and rank 1 on Socket P#1.]

10 1 MPI task per socket, IMPI style
[Figure: same topology diagram as the previous slide, with the placement swapped: rank 1 is marked on Socket P#0 and rank 0 on Socket P#1.]

11 Solutions
Exercise 1:
    export MV2_CPU_BINDING_POLICY=scatter
    export MV2_CPU_BINDING_LEVEL=socket
  or
    export MV2_CPU_MAPPING=0-7:8-15
Exercise 2:
    export MV2_CPU_MAPPING=8-15:
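To make the manual mapping strings easier to read, here is a small sketch that prints which cores each local rank receives. It assumes the usual MV2_CPU_MAPPING convention in which colon-separated entries are assigned to successive local ranks; show_mapping is a hypothetical helper written for this note, not an MVAPICH2 tool.

```shell
# show_mapping is a hypothetical helper, not part of MVAPICH2.
# It splits a mapping string on ':' and prints one line per local rank.
show_mapping() {
    rank=0
    old_ifs=$IFS
    IFS=:
    for cores in $1; do
        echo "rank $rank -> core(s) $cores"
        rank=$((rank + 1))
    done
    IFS=$old_ifs
}

show_mapping "0-7:8-15"        # the Exercise 1 mapping
show_mapping "0:8:9-15:1-7"    # the manual example from the mapping slide
```

For the Exercise 1 string this prints rank 0 with cores 0-7 and rank 1 with cores 8-15, which matches one task per socket on a node with eight cores per socket.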


13 Plan
1 Map me if you can!
2 The Daltons
3 Out of Sight

14 Things to know
- Go to mpi_tuning/2.intra-node.
- You will need to change the MPI stack to MVAPICH2 1.9 by using: module load mvapich2/1.9
- You just have to edit and execute the launch script.
- We will evaluate the performance of NAS FT with Shared Memory, LiMIC, CMA and Loopback.

15 Env. variables to set
Mechanism         MV2_USE_SHARED_MEM   MV2_SMP_USE_LIMIC2   MV2_SMP_USE_CMA
Shared            1                    0                    0
CMA               1                    0                    1
Loopback          0                    0                    0
LiMIC (Default)   1                    1                    0
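Switching between the four mechanisms by hand is error-prone, so one possible approach is a small selector function that sets all three variables together. The values are taken from the slide above; use_intranode and its argument names are hypothetical, introduced only for this sketch.

```shell
# use_intranode is a hypothetical helper; the 0/1 values come from the slide.
use_intranode() {
    case "$1" in
        shared)   shm=1; limic=0; cma=0 ;;
        cma)      shm=1; limic=0; cma=1 ;;
        loopback) shm=0; limic=0; cma=0 ;;
        limic)    shm=1; limic=1; cma=0 ;;
        *) echo "unknown mechanism: $1" >&2; return 1 ;;
    esac
    export MV2_USE_SHARED_MEM=$shm
    export MV2_SMP_USE_LIMIC2=$limic
    export MV2_SMP_USE_CMA=$cma
}

use_intranode cma
echo "$MV2_USE_SHARED_MEM $MV2_SMP_USE_LIMIC2 $MV2_SMP_USE_CMA"
```

Setting all three variables in one call avoids a stale value from the previous run silently changing which mechanism is measured.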

16 Results
Here is what I found:
CMA = seconds
LiMIC = seconds
Shared Memory = seconds
Loopback = 26.2 seconds


20 Plan
1 Map me if you can!
2 The Daltons
3 Out of Sight

21 Things to know
- Go to mpi_tuning/3.tune-collective.
- You will need to change the MPI stack to Intel MPI by using: module swap mvapich2 impi/
- Tuning will be done for 1 node, 16 cores!
- You have to try each algorithm and evaluate which one is best for each message size.
- We will use omb_gather, which is part of the OSU Benchmarks.
- You can redirect the output into a file with: ibrun ./a.out | tee tune-1
- Then use: paste tune-1 tune-2 tune-3 to compare the different results.
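The try-each-algorithm workflow described above can be sketched as a loop. This is only an outline under stated assumptions: run_benchmark is a hypothetical stand-in that echoes the current setting so the sketch runs anywhere, and in the lab it would be replaced by the real launch line (for example ibrun ./a.out).

```shell
# run_benchmark is a hypothetical stand-in for the real launch line.
run_benchmark() {
    echo "I_MPI_ADJUST_GATHER=$I_MPI_ADJUST_GATHER"
}

# Scratch directory so the per-algorithm files do not clutter the lab folder.
workdir=$(mktemp -d)

for alg in 1 2 3; do
    export I_MPI_ADJUST_GATHER=$alg
    # tee shows the output and keeps a copy for the comparison step.
    run_benchmark | tee "$workdir/tune-$alg"
done
unset I_MPI_ADJUST_GATHER

# One column per algorithm, rows aligned line by line.
paste "$workdir/tune-1" "$workdir/tune-2" "$workdir/tune-3"
```

Because the benchmark prints one row per message size, paste lines the three runs up side by side, which makes it easy to pick the fastest algorithm for each size.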

22 Gather tuning command (Intel MPI)
- I_MPI_ADJUST_GATHER=1|2|3
- I_MPI_ADJUST_GATHER=1:
- I_MPI_ADJUST_GATHER= 2: ;1:
How to set/unset env variables?
- To set: export I_MPI_ADJUST_GATHER=1:
- To unset: unset I_MPI_ADJUST_GATHER
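The message-size ranges on this slide did not survive in the text, so the sketch below only illustrates the general shape of the setting. It assumes the Intel MPI form algid:range;algid:range, where a range such as l-m selects message sizes in bytes. The ranges used here are placeholders chosen for illustration, not the values from the slide and not a recommended tuning.

```shell
# Placeholder ranges, chosen only to illustrate the syntax.
small="2:0-1024"          # hypothetical: algorithm 2 for small messages
large="1:1025-1048576"    # hypothetical: algorithm 1 for larger messages

# The quotes matter: an unquoted ';' would end the shell command early.
export I_MPI_ADJUST_GATHER="$small;$large"
echo "$I_MPI_ADJUST_GATHER"
```

Building the value from named pieces keeps each range easy to adjust while comparing runs.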

23 You should see
[Plot: latency (us) against message size (bytes), with the x-axis running up to 1M, for the reference run (Ref).]

24 My Solution
Possible solution:
    export I_MPI_ADJUST_GATHER= 3: ;1:
[Plot: latency (us) against message size (bytes), with the x-axis running up to 1M, comparing the reference (Ref) and tuned (Tuned) runs.]
