GRID Testing and Profiling November 2017
GRID
GRID is a C++ library for Lattice Quantum Chromodynamics (Lattice QCD) calculations, developed by Peter Boyle (U. of Edinburgh) et al. It uses a hybrid MPI+OpenMP model with NUMA-aware socket optimizations.
When GRID is built, a set of test and benchmark programs is built along with it. One benchmark program in particular, Benchmark_ITT, exercises the library in various ways and prints several performance measures.
For more information:
- Grid: A next generation data parallel C++ QCD library: https://arxiv.org/abs/1512.03487
- Grid: data parallel library for QCD: https://indico2.riken.jp/indico/getfile.py/access?contribid=5&sessionid=4&resid=0&materialid=slides&confid=1805
Test Cluster Configuration
Thor cluster:
- Dell PowerEdge R730/R630 36-node cluster
- Dual-socket Intel Xeon E5-2697A v4 16-core CPUs @ 2.60 GHz
- Mellanox ConnectX-5 EDR 100Gb/s InfiniBand adapters
- Mellanox Switch-IB 2 SB7800 36-port 100Gb/s EDR InfiniBand switches
- Memory: 256GB DDR4 2400MHz RDIMMs per node
- Storage: 1TB 7.2K RPM SATA 2.5" hard drive per node
Helios cluster:
- Supermicro SYS-6029U-TR4 16-node cluster
- Dual-socket Intel Xeon Gold 6138 CPUs @ 2.00GHz
- Mellanox ConnectX-5 EDR 100Gb/s InfiniBand/VPI adapters
- Mellanox Switch-IB 2 SB7800 36-port 100Gb/s EDR InfiniBand switches
- Memory: 192GB DDR4 2666MHz RDIMMs per node
- Storage: 1TB 2.5" SSD per node
Getting GRID
GRID is available from GitHub: https://github.com/paboyle/grid
Two prerequisite packages are needed: GMP (https://gmplib.org/) and MPFR (http://www.mpfr.org/)
For the DiRAC ITT benchmark, a slightly different tarball was provided:
wget https://github.com/paboyle/grid/archive/dirac-itt-fix1.tar.gz
tar zxvf dirac-itt-fix1.tar.gz
cd Grid-dirac-ITT-fix1
./bootstrap.sh
Configuring GRID
After loading the appropriate software modules (Intel 2017.5 or 2018.0 compilers and HPC-X 2.0 for the example below and the results shown later on):
mkdir build_hpcx
cd build_hpcx
../configure --enable-precision=single --enable-simd=avx2 \
    --enable-comms=mpi3-auto --enable-mkl CXX=mpicxx \
    --prefix=/<path-to-installation-directory>/grid_hpcx20
The DiRAC benchmark instructions called for running the program in single precision, but we also ran it in double precision (as suggested in the README file from the GitHub distribution), because double precision is what was initially used to guide the build.
AVX512 and KNL are among the other possible options for --enable-simd.
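For Skylake nodes, the AVX512 target can be selected the same way. A sketch of that variant (the build-directory name and install prefix suffix are illustrative, not from the benchmark instructions):

```shell
# Illustrative AVX-512 configure variant; directory names are assumptions.
mkdir build_hpcx_avx512
cd build_hpcx_avx512
../configure --enable-precision=single --enable-simd=AVX512 \
    --enable-comms=mpi3-auto --enable-mkl CXX=mpicxx \
    --prefix=/<path-to-installation-directory>/grid_hpcx20_avx512
```

As noted below, a build configured for a SIMD level the build host does not support will still produce executables, but make check will fail on that host.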
Building and Testing GRID
Building and testing is straightforward:
make 2>&1 | tee make_i17hpcx20.log
make check 2>&1 | tee check_i17hpcx20.log
make install 2>&1 | tee install_i17hpcx20.log
NOTE: It may be necessary to add -lrt to the end of the line starting with "LIBS = " in each of the sixteen Makefiles ({,*/,*/*/}Makefile)
NOTE: If the build server or node has processors that don't support the configured SIMD option, the make check step will fail. Nonetheless, the executables will have been produced.
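The -lrt edit can be scripted rather than done by hand. A minimal sketch, assuming GNU sed and that the Makefiles sit at most two directories below the build directory (as the {,*/,*/*/}Makefile glob implies):

```shell
# Append -lrt to every "LIBS = ..." line in the build tree's Makefiles,
# skipping lines that already mention -lrt so the edit is idempotent.
find . -maxdepth 3 -name Makefile \
    -exec sed -i '/^LIBS = /{/-lrt/!s/$/ -lrt/;}' {} +
```

Re-running configure regenerates the Makefiles, so the edit may need to be repeated after a reconfigure.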
Running Benchmark_ITT
Benchmark_ITT requires several arguments:
- --mpi M.N.P.Q, where M*N*P*Q is the total number of ranks; each of M, N, P and Q must be a power of two
- --shm size, where size is the size (in KB) of the shared memory segments that will be allocated (1024 worked fine for 4 and fewer nodes, 2048 for 8 or more)
- --threads nthreads, where nthreads is the number of OpenMP threads
- --shm-hugetlb (for Xeon Phi runs)
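A quick sanity check before launching is to confirm that the chosen --mpi layout multiplies out to the total rank count. A sketch, using an illustrative layout of 2.2.2.4 for 32 ranks (these values are examples, not settings from the benchmark runs):

```shell
# Verify that the --mpi layout M.N.P.Q multiplies to the total rank count.
layout="2.2.2.4"   # illustrative M.N.P.Q decomposition
ranks=32           # illustrative total number of MPI ranks
product=$(echo "$layout" | awk -F. '{ print $1 * $2 * $3 * $4 }')
if [ "$product" -eq "$ranks" ]; then
    echo "layout OK"
else
    echo "layout mismatch: $product != $ranks"
fi
```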
Launch Command
Generic launch command for HPC-X:
/usr/bin/time -p mpirun -np $ranks --map-by ppr:$rpn:node \
    --bind-to $object --report-bindings --display-map \
    -mca coll_hcoll_enable 0 -mca mtl ^mxm \
    -x UCX_RC_VERBS_TM_ENABLE=y -mca pml ucx \
    -x UCX_TLS=$transport,self,shm \
    /path-to-installation-directory/bin/Benchmark_ITT \
    --mpi M.N.P.Q --shm 2048 --threads $OMP_NUM_THREADS
where:
- $ranks is the total number of MPI ranks (= M*N*P*Q)
- $rpn is the number of MPI ranks per node
- $object is none if $rpn is 1 and socket otherwise
- $transport is rc or rc_x
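As a concrete instantiation of the variables above, here is a sketch for a small run on the Broadwell (16-core-per-socket) nodes — 4 nodes, one rank per socket, rc_x transport. All numeric values are illustrative assumptions, not measured-best settings:

```shell
ranks=8                      # 4 nodes x 2 ranks/node, e.g. --mpi 1.2.2.2
rpn=2                        # one rank per socket on dual-socket nodes
object=socket                # bind to socket since rpn > 1
transport=rc_x               # one of the best-performing UCX transports here
export OMP_NUM_THREADS=16    # one OpenMP thread per core in a socket
```

These values then drop straight into the generic mpirun line above.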
Notes
HPC-X 2.0 / UCX:
- HPC-X 2.0 introduced two new accelerated transports, dc_x and rc_x, that use enhanced verbs
- The best transports for GRID turned out to be rc_x and rc
- In the mpirun example, we enabled hardware tag matching (UCX_RC_VERBS_TM_ENABLE), which performs the tag-matching lookup in hardware instead of software
Selection of a PML (point-to-point management layer):
- The UCX PML and the MXM PML are mutually exclusive; thus, MXM was disabled
Single vs. double precision:
- Single-precision numbers use 32 bits, while double-precision numbers use 64 bits
- Double-precision calculations generate more data transfer for the same test size
- The differences in the results presented later in the presentation emphasize the need for a fast, accelerated interconnect for best performance
MPI Profiling (IPM 2.06)
Performance Measures
At the end of the run, Benchmark_ITT writes a Memory benchmark results table, a Communications benchmark table (for runs on two or more nodes), and an overall floating-point performance report.
Memory benchmark example:
==================
= Benchmarking a*x + y bandwidth
==================
  L        bytes           GB/s     Gflop/s   seconds   GB/s / node
 ----------------------------------------------------------
  8     50331648.000     4371.720   364.310    1.492      273.232
 12    254803968.000     3818.499   318.208    1.708      238.656
 16    805306368.000     5851.782   487.649    1.115      365.736
 20   1966080000.000     5225.404   435.450    1.248      326.588
 24   4076863488.000     2918.898   243.242    2.235      182.431
 28   7552892928.000     3062.299   255.192    2.129      191.394
 32  12884901888.000     2952.386   246.032    2.208      184.524
 36  20639121408.000     2824.411   235.368    2.309      176.526
 40  31457280000.000     2825.827   235.486    2.304      176.614
 44  46056603648.000     2798.276   233.190    2.321      174.892
 48  65229815808.000     2801.425   233.452    2.328      175.089
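One internal consistency check on this table: the GB/s and Gflop/s columns sit in a fixed 12:1 ratio in every row, consistent with a double-precision a*x + y (24 bytes moved per 2 flops, i.e. 12 bytes per flop). A small sketch verifying the first row:

```shell
# Check that GB/s / Gflop/s = 12 for the first table row.
awk 'BEGIN { printf "%.1f\n", 4371.720 / 364.310 }'   # prints 12.0
```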
Performance Measures (cont.)
Communications benchmark example:
==================
= Benchmarking threaded STENCIL halo exchange in 4 dimensions
==================
  L  Ls     bytes      MB/s uni (err/min/max)          MB/s bidi (err/min/max)
  4   8     49152     1472.6  178.3   88.9  2531.4     2945.2  356.6  177.9  5062.9
  8   8    393216     6147.1  605.2  381.8  8886.2    12294.2 1210.3  763.6 17772.5
 12   8   1327104     8722.7  270.0 1960.3  9909.9    17445.3  539.9 3920.5 19819.8
 16   8   3145728     9456.4   77.4 3753.1 10058.3    18912.8  154.8 7506.2 20116.6
 20   8   6144000     9475.3   61.0 4976.9 10245.7    18950.5  121.9 9953.8 20491.4
 24   8  10616832     9499.0   47.4 4991.3 10102.0    18998.1   94.8 9982.5 20204.1
 28   8  16859136     9340.3   45.1 4918.4  9866.6    18680.5   90.3 9836.9 19733.2
 32   8  25165824     9320.1   47.7 4917.7  9949.8    18640.2   95.4 9835.4 19899.5

Per-node summary table (Ls=16):
  L     Wilson      DWF4       DWF5
  8     3895.6    33637.9    35835.8
 12    13308.6    74021.3    91408.2
 16    35034.3    85631.9   114475.0
 24    76632.2    98933.4   133285.8
Performance Measures (cont.)
Overall floating-point performance report sample:
Comparison point result: 79826.6 Mflop/s per node
Comparison point is 0.5*(85631.9+74021.3)
Comparison point robustness: 0.819
Note that the Comparison point result is given in Mflop/s per node. For the performance graphs that follow, that value was multiplied by the number of nodes and converted to Gflop/s to obtain total Gflop/s.
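The conversion used for the graphs can be reproduced directly. The comparison point is the average of the DWF4 results at L=12 and L=16 from the per-node summary table; the 16-node count below is illustrative:

```shell
awk 'BEGIN {
    # Comparison point = 0.5 * (DWF4 at L=16 + DWF4 at L=12), Mflop/s per node.
    cp = 0.5 * (85631.9 + 74021.3)
    printf "comparison point: %.1f Mflop/s per node\n", cp   # -> 79826.6
    # Total Gflop/s = per-node Mflop/s * nodes / 1000; 16 nodes is illustrative.
    nodes = 16
    printf "total: %.1f Gflop/s\n", cp * nodes / 1000
}'
```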
GRID Double Precision Performance: Interconnects
CPU: Xeon E5-2697A v4 (code name "Broadwell")
[Chart: interconnect comparison, up to 48% difference; higher is better]
GRID Double Precision Performance: Interconnects
CPU: Xeon Gold 6138 (code name "Skylake")
[Chart: interconnect comparison, up to 27% difference; higher is better]
GRID Double Precision Performance: Larger Scale
CPU: Intel Xeon E5-2697A v4 (code name "Broadwell")
Interconnect: InfiniBand
[Chart: scaling results; higher is better]
GRID Single Precision Performance: Larger Scale
CPU: Xeon Gold 6148 (code name "Skylake")
Interconnect: InfiniBand
[Chart: scaling results; higher is better]
References
GRID:
- https://github.com/paboyle/grid
- https://arxiv.org/abs/1512.03487
- https://indico2.riken.jp/indico/getfile.py/access?contribid=5&sessionid=4&resid=0&materialid=slides&confid=1805
HPC Advisory Council write-up:
- https://hpcadvisorycouncil.atlassian.net/wiki/spaces/hpcworks/pages/11370500/HPC-X+2.0+Boosts+Performance+of+Grid+Benchmark
Thank You
All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.