
1 Koronis Performance Tuning 2 By Brent Swartz December 1, 2011

2 Application Tuning Methodology See the SGI Linux Application Tuning Guide in the SGI Tech Pubs Library (coll=linux&db=bks&cmd=toc&pth=/sgi_developer/lx_86_apptune)

3 1. Tune the serial performance If possible, incorporate already-tuned library routines, e.g.: MKL: Intel's Math Kernel Library, which includes BLAS, LAPACK, sparse-matrix, and FFT routines. VML: the Vector Math Library, part of the MKL package (libmkl_vml_itp.so), described on Intel's software libraries pages.
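Using MKL instead of hand-coded loops is usually just a matter of linking it in. A minimal sketch, assuming the Intel compilers' -mkl convenience flag is available on Koronis (the module and file names are hypothetical):

module load intel                                        # hypothetical module name
ifort -O3 -o solver solver.f90 -mkl=sequential           # serial code calling MKL BLAS/LAPACK
ifort -O3 -openmp -o solver_omp solver.f90 -mkl=parallel # same code with threaded MKL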

4 2. Tune the parallel performance a. OpenMP: see the Stream OpenMP results/graphs, before and after dplace. b. MPI: perfboost/perfcatcher can be used; see the Stream MPI results/graphs. c. Scaling can be improved by determining which routines are NOT scaling well (sorting the routines by a metric) and then improving those routines. The metric can be as simple as the ratio of routine times running on different numbers of nodes (e.g., run on 24 and then 72 procs, then sort by the ratio of the routine times, as in the sketch below).
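A minimal sketch of the ratio metric from (c), assuming gprof flat profiles from the 24-proc and 72-proc runs were saved as prof24.txt and prof72.txt (hypothetical names). Ideal strong scaling from 24 to 72 procs would give a time ratio near 0.33, so routines near or above 1.0 are the ones to attack:

# extract "routine-name self-seconds" pairs from a gprof flat profile
extract() { awk '$1+0 > 0 && NF >= 7 {print $NF, $3}' "$1" | sort; }
join <(extract prof24.txt) <(extract prof72.txt) |
  awk '$2 > 0 {printf "%-32s %6.2f\n", $1, $3/$2}' |  # 72-proc time / 24-proc time
  sort -k2,2 -rn | head -20                           # worst-scaling routines first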

5 Profiling Tools 1. gprof (easiest; an example workflow follows) 2. perfcatch/perfboost (MPI): see PerfBoost in the Message Passing Toolkit (MPT) User's Guide, available from the SGI Tech Pubs Library. 3. OPT 4. TAU
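A hedged example of the gprof workflow (tool 1), using standard profiling flags of the Intel compilers; the program name is hypothetical:

icc -O2 -p -o myapp myapp.c    # instrument for profiling
./myapp                        # a normal run writes gmon.out
gprof ./myapp gmon.out | less  # flat profile and call graph, hottest routines first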

6 Application Examples A. Stream B. NAMD

7 Stream Benchmark Download: available from the STREAM benchmark site. Fortran is compiled with: mpif90 -O3 -ipo -xsse4.2 -fno-alias -i8 -openmp -extend_source -mcmodel=medium -i-dynamic -opt-streaming-stores always -nolib-inline C is compiled with: icc -O3 -ipo -xsse4.2 -fno-alias -openmp -mcmodel=medium -i-dynamic -opt-streaming-stores always -nolib-inline

8 Stream Benchmark (cont.) We start with the original stream benchmark, running the C and Fortran versions with and without dplace, from 4 to 720 procs (by setting OMP_NUM_THREADS). Conclusion: the Fortran and C performance is essentially the same (with dplace, at least). Note how the dplace bandwidth is consistently better. Here is the PBS job that ran the benchmarks (note the form of the dplace command):

9 #!/bin/bash
#PBS -j oe
#PBS -q uv1000
#PBS -l select=120:ncpus=6:mpiprocs=6
#PBS -l place=pack:excl:group=boardpair
#PBS -l walltime=1:0:0
cpuset -d. ; sleep 5
ulimit -s unlimited
pset=`qstat -f $PBS_JOBID | grep pset | cut -f 2- -d"="`
echo "pset:" $pset
cd $PBS_O_WORKDIR
export KMP_AFFINITY=disabled
for NT in 4 12 24 48 96 192 384 720  # thread-count list lost in transcription; slide 8 says 4 to 720
do
  export OMP_NUM_THREADS=$NT
  NT1=`echo "$NT - 1" | bc`
  echo "NT1=$NT1"
  dplace -x 2 -c 0-$NT1 stream_c.omp.exe > stream_c.omp.dplace.out.$NT
  stream_c.omp.exe > stream_c.omp.out.$NT
  dplace -x 2 -c 0-$NT1 stream_f.exe > stream_f.omp.dplace.out.$NT
  stream_f.exe > stream_f.omp.out.$NT
done
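The script is submitted in the usual way (the script name is hypothetical):

qsub stream.pbs

A note on the dplace form used above: -c 0-$NT1 pins the OpenMP threads to cores 0 through NT-1, while the -x 2 skip mask leaves the Intel OpenMP runtime's extra monitor thread unplaced, the usual dplace invocation for Intel-compiled OpenMP codes.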

10 [Chart: Stream OpenMP bandwidth vs. thread count, C and Fortran, with and without dplace]

11 Stream Benchmark (cont.) Now we look at the MPI Fortran Copy and Triad performance, with and without dplace. The number of procs is varied from 1 to 384 (1 rack). Conclusion: dplace does not affect MPI performance.

12 [Chart: Stream MPI Fortran Copy and Triad bandwidth vs. process count (1 to 384), with and without dplace]

13 Stream Benchmark (cont.) Now we compare the OpenMP Triad performance with the MPI Triad performance. Conclusion: MPI is scaling much better. Why? Note also that KMP_AFFINITY=disabled here (as in the job above). KMP_AFFINITY determines how the threads are mapped to the cores (see the Intel compiler documentation, optaps_openmp_thread_affinity.htm).
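KMP_AFFINITY also accepts a verbose modifier, which makes the Intel OpenMP runtime report the thread-to-core binding at startup; a quick way to inspect a mapping, sketched here with the Stream binary from the earlier slides:

export KMP_AFFINITY=verbose,granularity=fine,compact,2,0
export OMP_NUM_THREADS=12
./stream_c.omp.exe   # binding report is printed to stderr before the run starts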

14 [Chart: Stream OpenMP Triad vs. MPI Triad bandwidth]

15 Stream Benchmark (cont.) Now we do an empirical parameter study of the KMP_AFFINITY alternatives. Eleven alternatives are tried, with and without dplace, e.g. export KMP_AFFINITY=granularity=fine,compact,2,0 (a scripting sketch follows). Conclusion: dplace is consistently worse for all KMP_AFFINITY settings except disabled and none. Optimal performance is seen with many KMP_AFFINITY settings without dplace, e.g. KMP_AFFINITY=granularity=fine,compact,2,0 or KMP_AFFINITY=granularity=core,compact,1,0
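A sketch of how such a sweep can be scripted; the list of eleven settings is abridged here and the thread count is an arbitrary example:

export OMP_NUM_THREADS=144
for aff in disabled none compact scatter \
           granularity=fine,compact,2,0 granularity=core,compact,1,0
do
  export KMP_AFFINITY=$aff
  tag=${aff//[,=]/_}                 # make the setting safe to use in a file name
  ./stream_c.omp.exe > stream.$tag.out
  dplace -x 2 -c 0-$((OMP_NUM_THREADS-1)) ./stream_c.omp.exe > stream.$tag.dplace.out
done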

16 [Chart: Stream Triad bandwidth for the eleven KMP_AFFINITY settings, with and without dplace]

17 Stream Benchmark (cont.) Looking at the OpenMP vs. MPI performance, we realize the cause of the OpenMP fall-off is the fixed problem size (strong scaling), the situation Gustafson's Law addresses by growing the work with the processor count: the original benchmark spreads the 2M-element arrays across all threads, so at 100 threads each proc (thread) works on only a 20,000-element array slice, and the time to do the work approaches the time required to create the threads. The stream benchmark is therefore recoded to malloc the 2M arrays instead of statically allocating them: arrays are now allocated based on the number of threads (procs), giving a constant 2M elements per proc (weak scaling), and we look at the various good KMP_AFFINITY settings.

18 [Chart: Stream Triad bandwidth with dynamically allocated arrays (constant 2M elements per thread), for the candidate KMP_AFFINITY settings]

19 Stream Benchmark (cont.) We also tried OMPLACE_AFFINITY_COMPAT and omplace alternatives, with no improvement. Conclusion: with KMP_AFFINITY=granularity=core,compact,1,0 or KMP_AFFINITY=granularity=fine,compact,2,0, OpenMP performance is now on a par with MPI: better than MPI for Nprocs between 1 and 192, transitioning to MPI-level performance from 193 to ~300, and matching MPI performance from 336 on. For MPI apps, no dplace or KMP_AFFINITY setting is needed.

20 NAMD NAMD 2.7 is currently installed on Koronis. NAMD 2.8 has been installed locally to test performance, and has shown an ~80% improvement on the apoa1 benchmark using the following invocation: numactl is used to obtain the processor list from the batch control system, and this proclist is fed to charm via the +setcpuaffinity flag. The stmv benchmark shows a performance degradation, so this is still experimental.

21 NAMD
#!/bin/bash -x
#PBS -j oe
#PBS -q uv1000
#PBS -l select=7:ncpus=6:mpiprocs=6
#PBS -l place=pack:excl:group=iruquadrant
#PBS -l walltime=1:0:0
cpuset -d. ; sleep 5
ulimit -s 100M ; ulimit -v unlimited
export KMP_STACKSIZE=100M
export NAMD_PATH=/home/koronis/swartzbr/namd/NAMD_2.8_Linux-x86_64-multicore
cd $PBS_O_WORKDIR
proc_list=`numactl --show | awk '/^physcpubind/ {printf "+p%d +pemap %d",(NF-1),$2; for(i=3;i<=NF;++i){printf ",%d",$i}}'`
$NAMD_PATH/namd2 +isomalloc_sync +setcpuaffinity $proc_list apoa1.namd > apoa1.out.mc.39
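For illustration: if numactl --show reported "physcpubind: 0 1 2 3 4 5" (a hypothetical six-CPU cpuset; the real job binds 42 CPUs per the select statement), the awk pipeline above would expand proc_list to "+p6 +pemap 0,1,2,3,4,5", i.e. six worker threads pinned to those six cores.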

22 NAMD NAMD can utilize the Koronis GPUs for certain types of runs (see the NAMD documentation for details). Versions of GAMESS, GROMACS, and LAMMPS have just been released that can utilize GPUs for certain types of runs. These have not yet been installed on Koronis, so if you plan to use these codes on the GPUs, send an email to the Koronis support address.
