Symmetric Computing. ISC 2015 July John Cazes Texas Advanced Computing Center

Size: px

Start display at page:

Download "Symmetric Computing. ISC 2015 July John Cazes Texas Advanced Computing Center"

Rosamond Gibbs
5 years ago
Views:

1 Symmetric Computing ISC 2015 July 2015 John Cazes Texas Advanced Computing Center

2 Symmetric Computing Run MPI tasks on both MIC and host Also called heterogeneous computing Two executables are required: CPU MIC Works with Intel MPI (IMPI) and MVAPICH2

3 Definition of a Node A node contains a host component and a MIC component Host refers to the Sandy Bridge component MIC refers to one or two Intel Xeon Phi co-processor cards NODE Host 2x Intel 2.7 GHz E cores MIC 1-2 Intel Xeon PHI SE10P 61 cores / 244 HW threads

4 Environment variables for MIC By default, environment variables are inherited by all MPI tasks Since the MIC has a different architecture, several environment variables must be modified OMP_NUM_THREADS # of threads on MIC LD_LIBRARY_PATH must point to MIC libraries I_MPI_PIN_MODE controls the placement of tasks KMP_AFFINITY controls thread binding

5 Symmetric run on 1 Node 16 tasks on host mpiexec.hydra \ n 16 host localhost./host.exe \ : env OMP_NUM_THREADS 30 \ env LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH \ env I_MPI_PIN_MODE mpd \ env KMP_AFFINITY balanced \ n 4 host mic0./mic.exe 4 tasks on mic0 Environment variables for MIC tasks

6 Steps to create a symmetric run 1. Compile a host executable and a MIC executable: mpicc openmp o host.exe my_code.c mpicc openmp mmic o mic.exe my_code.c 2. Determine the appropriate number of tasks and threads for both MIC and host: 16 tasks/host 1 thread/mpi task 4 tasks/mic 30 threads/mpi task

7 Steps to create a symmetric run 3. Create a batch script to distribute the job #!/bin/bash # # symmetric.slurm # Generic symmetric script MPI + OpenMP # #SBATCH J symmetric #Job name #SBATCH -o symmetric.%j.out #stdout; %j expands to jobid #SBATCH e symmetric.%j.err #stderr; skip to combine #SBATCH p development #queue #SBATCH N 2 #Number of nodes #SBATCH n 32 #Total number of MPI tasks #SBATCH t 00:30:00 #max time #SBATCH A TG #necessary if multiple projects export MIC_PPN=4 export MIC_OMP_NUM_THREADS=30 ibrun.symm m./mic.exe c./host.exe

8 Steps to create a symmetric run 1. Compile a host executable and a MIC executable 2. Determine the appropriate number of tasks and threads for both MIC and host 3. Create the batch script 4. Submit the batch script sbatch symmetric.slurm

9 Symmetric launcher ibrun.symm Usage: ibrun.symm m./<mic_executable> -c./<cpu_executable> Analog of ibrun for symmetric execution # of MIC tasks and threads are controlled by environment variables MIC_PPN=<# of MPI tasks/mic card> MIC_OMP_NUM_THREADS=<# of OMP threads/mic MPI task> MIC_MY_NSLOTS=<Total # of MIC MPI tasks>

10 Symmetric launcher # of host tasks determined by batch script (same as regular ibrun) ibrun.symm does not support np and n flags Command line arguments may be passed within quotes ibrun.symm m./my_exe.mic args c./my_exe.cpu args

11 Symmetric launcher If the executables require redirection or complicated command lines, a simple shell script may be used: run_mic.sh #!/bin/sh a.out.mic <args> < inputfile run_cpu.sh #!/bin/bash a.out.host <args> < inputfile ibrun.symm m./run_mic.sh c./run_cpu.sh Note: The csh and tcsh shells are not available on MIC. So, the MIC script must begin with #!/bin/sh or #!/bin/bash

12 Symmetric Launcher Example #SBATCH N 4 n 32 export OMP_NUM_THREADS=2 export MIC_OMP_NUM_THREADS=60 export MIC_PPN=2 The MPI tasks will be allocated in consecutive order by node (CPU tasks first, then MIC tasks). For example, the task allocation described by the above script snippet will be: NODE 1 8 host tasks (0-7) 2 MIC tasks (8-9) NODE 2 8 host tasks (10-17) 2 MIC tasks (18-19) NODE 3 8 host tasks (20-27) 2 MIC tasks (28-29) NODE 4 8 host tasks (30-37) 2 MIC tasks (38-39)

13 IMPI Task Binding When using IMPI, the mode of process binding may be controlled with the following environment variable: I_MPI_PIN_MODE=<pinmode> mpd pm lib mpd daemon pins MPI processes at startup (Best performance for MIC) Hydra launcher pins MPI processes at startup (Doesn t appear to work on MIC) MPI library pins processes BUT this does not guarantee colocation of CPU and memory (Default) I_MPI_PIN_MODE=mpd (default for ibrun.symm)

14 IMPI Task Binding You can also lay out tasks across the local cores Explicitly: I_MPI_PIN_PROCESSOR_LIST=<proclist> export I_MPI_PIN_PROCESSOR_LIST=1-7,9-15 Grouped: I_MPI_PIN_PROCESSOR_LIST=<map> export I_MPI_PIN_PROCESSOR=spread bunch scatter spread The processes are mapped as closely as possible on the socket The processes are mapped as remotely as possible to avoid sharing common resources: caches, cores The processes are mapped consecutively with the possibility to not share common resources

15 IMPI Task Binding Be careful when using MIC and host MIC 244 H/W threads and 1 socket Host 16 cores and 2 sockets To set I_MPI_PROCESSOR_LIST for MIC simply use the MIC prefix, e.g. export MIC_I_MPI_PROCESSOR_LIST=1,61,121,181 Turn on debugging to see where tasks are placed: export I_MPI_DEBUG=4 Recommend using IMPI defaults

16 IMPI Task Binding Using IMPI defaults Example: 4 MPI tasks/mic with 60 threads each export KMP_AFFINITY=balanced

17 MVAPICH2 Task Binding Binding is disabled by default HOWEVER Since the Xeon Phi appears as one processor to MVAPICH2, all MPI tasks will be bound to the same hardware threads without explicit binding Example: 4 MPI tasks/mic with 60 threads each export KMP_AFFINITY=balanced

18 MVAPICH2 Task Binding Enable affinity with a direct mapping to avoid overlap Example: 4 MPI tasks/mic with 60 threads each export KMP_AFFINITY=balanced export MV2_MIC_MAPPING=1-60:61-120: :

19 Thread Placement Thread placement may be controlled with the following environment variable MIC_KMP_AFFINITY=<type> compact pack threads close to each other scatter Round-Robin threads to cores compact balanced keep OMP thread ids consecutive (MIC only) scatter explicit use the proclist modifier to pin threads none does not pin threads balanced

MPI Performance MPI performance is highly dependent on the path the data takes Imbalance in perfomance is mitigated by the use of a proxy Proxy runs on the host and controls the movement of data

20 MPI Performance MPI performance is highly dependent on the path the data takes Imbalance in perfomance is mitigated by the use of a proxy Proxy runs on the host and controls the movement of data through the host Both Intel MPI and MVAPICH2 provide proxy support Source: Configuration and Benchmarks of Peer-to-Peer Communication over Gigabit Ethernet and InfiniBand in a Cluster with Intel Xeon Phi Coprocessors by Vadim Karpusenko and Andrey Vladimirov (Colfax International)

21 Proxy can help quite a bit for both Intel MPI and MVAPICH2 even on one node Pingpong Host to MIC

22 Doesn t help much for latency bound codes Pingpong Host to MIC off node

23 MIC to MIC bandwidth is almost as good as host to MIC Pingpong MIC to MIC off node

24 Intel Trace Analyzer and Collector (ITAC) ITAC is a graphical tool for understanding MPI application behavior, quickly finding bottlenecks, and achieving high performance. Use with Intel and Intel MPI Load the ITAC module: module load itac Trace files provide a timeline of MPI execution More powerful than an MPI profiler such as mpip or IPM Trace files can become quite large Supports symmetric execution on the Xeon Phi

25 Compile and Run for ITAC To instrument for ITAC, the application must be compiled and linked with the trace option for both MIC and host executable. C/C++ : Fortran: mpicc trace mmic -g o mycode.mic mycode.c mpicc trace xavx g o mycode.host mycode.c mpif90 trace mmic o mycode.mic mycode.f90 mpif90 trace xavx o mycode.host mycode.f90 Run normally: ibrun.symm m./mycode.mic c./mycode.host ITAC will generate multiple trace files for each task and a master file named <executable_name>.stf

26 Analyze ITAC Trace files Once the trace files are created, the traceanalyzer used to analyze the results. $ traceanalyzer mycode.stf tool may be Since this is a graphical tool, the user should be logged in with X11 enabled or via a VNC session. SSH: ssh Y stampede.tacc.utexas.edu VNC: Follow the instructions in the Stampede user guide for Remote Desktop Access ( )

27 Example with 2 nodes and 2 MPI tasks per host and MIC ITAC Summary

28 MPI Pingpong benchmark ITAC Flat Profile

29 Note the long MPI_Finalize time for 4 of the 8 tasks These are the MPI tasks running on the MICs. ITAC Event Timeline

30 Collective Parallel I/O Stampede has the Lustre filesystem mounted on the MIC as an NSF mount from the host Results in very poor performance Lustre client is available for the MIC we haven t tested it But, there is a way to get decent parallel I/O performance for collective I/O operations

31 ROMIO Hints Use ROMIO hints to limit the writers of collective I/O to the hosts Example: Running on 2 nodes 2 hosts and 2 MICs c stampede.tacc.utexas.edu c stampede.tacc.utexas.edu c mic0.stampede.tacc.utexas.edu c mic0.stampede.tacc.utexas.edu Set hints to only write from the host $ echo cb_config_list c stampede.tacc.utexas.edu:1,\ c stampede.tacc.utexas.edu:1 > romio_hints $ export ROMIO_HINTS=romio_hints $ ibrun.symm m./test_io.mic c./test_io.host NOTE: Must use hostname returned by MPI_Get_processor_name

32 Balance How to balance the code? Sandy Bridge Xeon Phi Memory 32 GB 8 GB Cores Clock Speed 2.7 GHz 1.1 GHz Memory Bandwidth 51.2 GB/s(x2) 352 GB/s Vector Length 4 DP words 8 DP words

33 Balance Example: Memory balance Balance memory use and performance by using a different # of tasks/threads on host and MIC Host 16 tasks/1 thread/task 2GB/task Xeon PHI 4 tasks/60 threads/task 2GB/task

34 Balance Example: Performance balance Balance performance by tuning the # of tasks and threads on host and MIC Host? tasks/? threads/task?gb/task Xeon PHI? tasks/? threads/task?gb/task

35 MPI with Offload Sections ADVANTAGES Offload Sections may easily be added to MPI/OpenMP codes with directives Intel compiler will automatically detect and compile offloaded sections CAVEATS However, there may be no MPI calls within offload sections Each host task will spawn an offload section Must explicitly place each offload set of threads if you have more than one MPI task per node offload

36 Exercises Exercise 1 Run in MPI symmetric mode using mpiexec.hydra Exercise 2 Run in a symmetric mode using MIC and host Exercise 3 Compare performance with and without proxy Exercise 4 Compare collective I/O performance with and without hints

37 John Cazes For more information:

Symmetric Computing. SC 14 Jerome VIENNE

Symmetric Computing. SC 14 Jerome VIENNE Symmetric Computing SC 14 Jerome VIENNE viennej@tacc.utexas.edu Symmetric Computing Run MPI tasks on both MIC and host Also called heterogeneous computing Two executables are required: CPU MIC Currently