GRID Testing and Profiling. November 2017

GRID
C++ library for Lattice Quantum Chromodynamics (Lattice QCD) calculations
Developed by Peter Boyle (U. of Edinburgh) et al.
Hybrid MPI+OpenMP plus NUMA-aware socket optimizations
When building GRID, a set of test and benchmark programs also gets built
  One benchmark program in particular, Benchmark_ITT, exercises the library in various ways and prints several performance measures
For more information:
  Grid: A next generation data parallel C++ QCD library
  Grid: data parallel library for QCD

Test Cluster Configuration
Thor cluster
  Dell PowerEdge R730/R node cluster
  Dual Socket Intel Xeon 16-core CPUs E5-2697A 2.60 GHz
  Mellanox ConnectX-5 EDR 100Gb/s InfiniBand adapters
  Mellanox Switch-IB 2 SB Port 100Gb/s EDR InfiniBand switches
  Memory: 256GB DDR4 2400MHz RDIMMs per node
  1TB 7.2K RPM SATA 2.5" hard drives per node
Helios cluster
  Supermicro SYS-6029U-TR4 16-node cluster
  Dual Socket Intel Xeon Gold GHz
  Mellanox ConnectX-5 EDR 100Gb/s InfiniBand/VPI adapters
  Mellanox Switch-IB 2 SB Port 100Gb/s EDR InfiniBand switches
  Memory: 192GB DDR4 2677MHz RDIMMs per node
  1TB 7.2K RPM SSD 2.5" hard drive per node

Getting GRID
GRID is available from GitHub
Two packages are needed: GMP and MPFR
For the DiRAC ITT benchmark, a slightly different tarball was provided:
    wget
    tar zxvf dirac-itt-fix1.tar.gz
    cd Grid-dirac-ITT-fix1
    ./bootstrap.sh

Configuring GRID
After loading the appropriate software modules (Intel or compilers and HPC-X 2.0 for the example below and the results shown later on):
    mkdir build_hpcx
    cd build_hpcx
    ../configure --enable-precision=single --enable-simd=avx2 \
        --enable-comms=mpi3-auto --enable-mkl CXX=mpicxx \
        --prefix=/<path-to-installation-directory>/grid_hpcx20
The DiRAC benchmark instructions called for running the program in single precision, but we also ran it in double precision (as suggested in the README file from the GitHub distribution) because that is what was initially used to guide the build
AVX512 and KNL are among the other possible options for --enable-simd

Building and Testing GRID
Building and testing is straightforward:
    make 2>&1 | tee make_i17hpcx20.log
    make check 2>&1 | tee check_i17hpcx20.log
    make install 2>&1 | tee install_i17hpcx20.log
NOTE: It may be necessary to add -lrt to the end of the line starting with LIBS = in each of sixteen Makefile files ({,*/,*/*/}Makefile)
NOTE: If the build server or node has processors that don't support the configured SIMD option, the make check step will fail. Nonetheless, the executables will have been produced
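The -lrt note above can be applied by hand, but with sixteen files a small helper may be less error-prone. The following is a minimal sketch, not part of the GRID distribution: the function name add_lrt and the sample Makefile text (including its library flags) are hypothetical, and it only assumes that the relevant line starts with "LIBS =" as the note states.

```python
def add_lrt(makefile_text: str) -> str:
    """Append -lrt to every line starting with 'LIBS =' that lacks it."""
    out = []
    for line in makefile_text.splitlines():
        # Only touch the LIBS assignment; skip it if -lrt is already present.
        if line.startswith("LIBS =") and "-lrt" not in line.split():
            line = line.rstrip() + " -lrt"
        out.append(line)
    # Preserve the trailing newline of the original text, if any.
    return "\n".join(out) + ("\n" if makefile_text.endswith("\n") else "")


# Hypothetical Makefile fragment, for illustration only.
sample = "CXX = mpicxx\nLIBS = -lmpfr -lgmp\n"
print(add_lrt(sample))
```

To use it on a build tree, one would read each file matched by {,*/,*/*/}Makefile, pass its contents through add_lrt, and write the result back. The function leaves lines that already end in -lrt unchanged, so running it twice is harmless.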

Running Benchmark_ITT
Requires several arguments:
  --mpi M.N.P.Q, where M*N*P*Q is the total number of ranks; each of M, N, P and Q must be a power of two
  --shm size, where size is the size (in kb) of the shared memory segments that will be allocated (1024 for 4 and fewer nodes, 2048 for 8 or more worked fine)
  --threads nthreads, where nthreads is the number of OpenMP threads
  --shm-hugetlb (for Xeon Phi runs)
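The constraints on the --mpi argument can be checked before a job is submitted. The sketch below is illustrative only: parse_mpi_grid and total_ranks are hypothetical helper names, and the checks encode just the two rules stated above (four factors, each a power of two).

```python
from math import prod


def parse_mpi_grid(spec: str) -> tuple:
    """Parse an 'M.N.P.Q' string and check that each factor is a power of two."""
    dims = tuple(int(part) for part in spec.split("."))
    if len(dims) != 4:
        raise ValueError(f"expected four factors M.N.P.Q, got {spec!r}")
    for d in dims:
        # A positive integer is a power of two when it has a single set bit.
        if d < 1 or d & (d - 1):
            raise ValueError(f"{d} in {spec!r} is not a power of two")
    return dims


def total_ranks(spec: str) -> int:
    """Total number of MPI ranks implied by the grid, M*N*P*Q."""
    return prod(parse_mpi_grid(spec))


print(total_ranks("1.2.2.4"))
```

The value returned by total_ranks is what would be passed to mpirun as the rank count.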

Launch Command
Generic launch command for HPC-X:
    /usr/bin/time -p mpirun -np $ranks --map-by ppr:$rpn:node \
        --bind-to $object -report-bindings --display-map \
        -mca coll_hcoll_enable 0 -mca mtl ^mxm \
        -x UCX_RC_VERBS_TM_ENABLE=y -mca pml ucx \
        -x UCX_TLS=$transport,self,shm \
        /path-to-installation-directory/bin/benchmark_itt \
        --mpi M.N.P.Q --shm size --threads $OMP_NUM_THREADS
$ranks is the total number of MPI ranks (= M*N*P*Q)
$rpn is the number of MPI ranks per node
$object is none if $rpn is 1 and socket otherwise
$transport is rc or rc_x
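The substitution rules for $ranks, $object and $transport can be sketched as a small command builder. This is an illustration under stated assumptions, not a script from the original study: build_launch_command is a hypothetical name, the binary path is the placeholder used in the slide, and the --shm and --threads arguments follow the argument list given for Benchmark_ITT. Only options that appear in the launch command above are emitted.

```python
import shlex


def build_launch_command(grid, rpn, threads, shm_size, transport="rc_x",
                         binary="/path-to-installation-directory/bin/benchmark_itt"):
    """Return the mpirun argument list for a given rank grid and node layout."""
    if transport not in ("rc", "rc_x"):
        raise ValueError("transport is expected to be rc or rc_x")
    m, n, p, q = grid
    ranks = m * n * p * q                       # $ranks = M*N*P*Q
    bind_to = "none" if rpn == 1 else "socket"  # $object rule from the slide
    return [
        "/usr/bin/time", "-p",
        "mpirun", "-np", str(ranks),
        "--map-by", f"ppr:{rpn}:node",
        "--bind-to", bind_to, "-report-bindings", "--display-map",
        "-mca", "coll_hcoll_enable", "0",
        "-mca", "mtl", "^mxm",
        "-x", "UCX_RC_VERBS_TM_ENABLE=y",
        "-mca", "pml", "ucx",
        "-x", f"UCX_TLS={transport},self,shm",
        binary,
        "--mpi", f"{m}.{n}.{p}.{q}",
        "--shm", str(shm_size),
        "--threads", str(threads),
    ]


# Hypothetical example: 16 ranks, 2 per node, 16 OpenMP threads each.
cmd = build_launch_command((1, 2, 2, 4), rpn=2, threads=16, shm_size=2048)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument intact if it is later handed to subprocess.run, and shlex.join is used only to print a copy-pasteable line.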

Notes
HPC-X 2.0 / UCX
  HPC-X 2.0 introduced two new accelerated transports, dc_x and rc_x, that use enhanced verbs
  The best transports for GRID turned out to be rc_x and rc
  In the mpirun example, we enabled hardware tag matching (UCX_RC_VERBS_TM_ENABLE), which performs the tag matching lookup in hardware instead of software
Selection of a pml (point-to-point messaging layer)
  The UCX pml and the MXM pml are mutually exclusive; thus, MXM was disabled
Single vs. Double Precision
  Single precision numbers use 32 bits, while double precision numbers use 64 bits
  Double precision calculations generate more data transfers for the same test size
  The differences in the results presented later emphasize the need for a fast, accelerated interconnect for best performance
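The single vs. double precision point can be made concrete with a short calculation. The sketch below is only an illustration of the 32-bit vs. 64-bit word size: payload_bytes is a hypothetical helper, and the count of real numbers is an arbitrary example value, not a figure from the benchmark.

```python
import struct

# Standard sizes of IEEE 754 single and double precision values, in bytes.
WORD_BYTES = {
    "single": struct.calcsize("=f"),
    "double": struct.calcsize("=d"),
}


def payload_bytes(n_reals: int, precision: str) -> int:
    """Bytes needed to transfer n_reals real numbers at the given precision."""
    return n_reals * WORD_BYTES[precision]


n = 1_000_000  # arbitrary example size
print(payload_bytes(n, "single"), payload_bytes(n, "double"))
```

For the same number of values, the double precision payload is twice the single precision one, which is why the interconnect matters more in the double precision runs.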

MPI Profiling (IPM 2.06)
[Two slides of IPM profiling charts; figures not preserved]

Performance Measures
At the end of the run, Benchmark_ITT writes a Memory benchmark results table, a Communications benchmark table (for runs on two or more nodes), and an overall floating point performance report
Memory benchmark example (table header only):
    Memory benchmark
    = Benchmarking a*x + y bandwidth
    L    bytes    GB/s    Gflop/s    seconds    GB/s / node
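The columns of the memory benchmark table can be related to one another with the usual accounting for an a*x + y update. The sketch below assumes the common convention of three words moved (read x, read y, write the result) and two floating point operations per element; whether Benchmark_ITT counts bytes and flops exactly this way is an assumption, and axpy_metrics is a hypothetical helper name.

```python
def axpy_metrics(n_elements, word_bytes, seconds, nodes=1):
    """Return (GB/s, Gflop/s, GB/s per node) for one a*x + y pass.

    Assumes 3 words of memory traffic and 2 flops per element.
    """
    bytes_moved = 3 * n_elements * word_bytes
    flops = 2 * n_elements
    gb_per_s = bytes_moved / seconds / 1e9
    gflop_per_s = flops / seconds / 1e9
    return gb_per_s, gflop_per_s, gb_per_s / nodes


# Hypothetical numbers: 1e6 double precision elements in 0.5 s on 2 nodes.
print(axpy_metrics(1_000_000, 8, 0.5, nodes=2))
```

Under this accounting the ratio of Gflop/s to GB/s is fixed by the word size, so the kernel measures memory bandwidth rather than arithmetic throughput.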

Performance Measures (cont.)
Communications benchmark example (table headers only):
    Communications benchmark
    ==================
    = Benchmarking threaded STENCIL halo exchange in 4 dimensions
    ==================
    L    Ls    bytes    MB/s uni (err/min/max)    MB/s bidi (err/min/max)
Per Node Summary table, Ls=16:
    L    Wilson    DWF4    DWF

Performance Measures
Overall floating point performance report sample:
    Comparison point result: Mflop/s per node
    Comparison point is 0.5*( )
    Comparison point robustness:
Note that the Comparison point result is given in Mflop/s per node. For the performance graphs that follow, that value was multiplied by the number of nodes and converted to Gflop/s to obtain total Gflop/s
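The conversion from the reported per-node figure to the totals plotted in the graphs is simple arithmetic. A minimal sketch follows; total_gflops is a hypothetical helper name and the input values are made up for illustration, not taken from the runs.

```python
def total_gflops(mflops_per_node: float, nodes: int) -> float:
    """Convert a 'Comparison point result' in Mflop/s per node to total Gflop/s."""
    return mflops_per_node * nodes / 1000.0


# Made-up example: 250000 Mflop/s per node on 16 nodes.
print(total_gflops(250000.0, 16))
```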

GRID Double Precision Performance: Interconnects
CPU: Xeon E5-2697A v4 (code name Broadwell)
[Chart not preserved; callout: 48%; higher is better]

GRID Double Precision Performance: Interconnects
CPU: Xeon Gold 6138 (code name Skylake)
[Chart not preserved; callout: 27%; higher is better]

GRID Double Precision Performance: Larger Scale
CPU: Intel Xeon E5-2697A v4 (code name Broadwell)
Interconnect: InfiniBand
[Chart not preserved; higher is better]

GRID Single Precision Performance: Larger Scale
CPU: Xeon Gold 6148 (code name Skylake)
Interconnect: InfiniBand
[Chart not preserved; higher is better]

References
GRID: =slides&confid=1805
HPC Advisory Council write-up: X+2.0+Boosts+Performance+of+Grid+Benchmark

Thank You
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein
