GRID Testing and Profiling. November 2017


GRID
C++ library for Lattice Quantum Chromodynamics (Lattice QCD) calculations
Developed by Peter Boyle (U. of Edinburgh) et al.
Hybrid MPI+OpenMP, plus NUMA-aware socket optimizations
When building GRID, a set of test and benchmark programs also gets built
One benchmark program in particular, Benchmark_ITT, exercises the library in various ways and prints several performance measures
For more information:
Grid: A next generation data parallel C++ QCD library: https://arxiv.org/abs/1512.03487
Grid: data parallel library for QCD: https://indico2.riken.jp/indico/getfile.py/access?contribid=5&sessionid=4&resid=0&materialid=slides&confid=1805

Test Cluster Configuration
Thor cluster
Dell PowerEdge R730/R630 36-node cluster
Dual-socket Intel Xeon E5-2697A V4 16-core CPUs @ 2.60 GHz
Mellanox ConnectX-5 EDR 100Gb/s InfiniBand adapters
Mellanox Switch-IB 2 SB7800 36-port 100Gb/s EDR InfiniBand switches
Memory: 256GB DDR4 2400MHz RDIMMs per node
1TB 7.2K RPM SATA 2.5" hard drives per node
Helios cluster
Supermicro SYS-6029U-TR4 16-node cluster
Dual-socket Intel Xeon Gold 6138 CPUs @ 2.00 GHz
Mellanox ConnectX-5 EDR 100Gb/s InfiniBand/VPI adapters
Mellanox Switch-IB 2 SB7800 36-port 100Gb/s EDR InfiniBand switches
Memory: 192GB DDR4 2666MHz RDIMMs per node
1TB 7.2K RPM SSD 2.5" hard drive per node

Getting GRID
GRID is available from GitHub: https://github.com/paboyle/grid
Two packages are needed: GMP (https://gmplib.org/) and MPFR (http://www.mpfr.org/)
For the DiRAC ITT benchmark, a slightly different tarball was provided:
  wget https://github.com/paboyle/grid/archive/dirac-itt-fix1.tar.gz
  tar zxvf dirac-itt-fix1.tar.gz
  cd Grid-dirac-ITT-fix1
  ./bootstrap.sh
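The deck does not show how GMP and MPFR were installed; a minimal sketch assuming source builds into a user-chosen prefix (the versions, prefix, and the --with-gmp/--with-mpfr hand-off are illustrative assumptions, not taken from the deck):
  # Illustrative install prefix; adjust to the site's layout
  export PREFIX=$HOME/local
  wget https://gmplib.org/download/gmp/gmp-6.1.2.tar.bz2
  tar xjf gmp-6.1.2.tar.bz2 && cd gmp-6.1.2
  ./configure --prefix=$PREFIX && make -j && make install && cd ..
  wget https://www.mpfr.org/mpfr-3.1.6/mpfr-3.1.6.tar.bz2
  tar xjf mpfr-3.1.6.tar.bz2 && cd mpfr-3.1.6
  ./configure --prefix=$PREFIX --with-gmp=$PREFIX && make -j && make install && cd ..
  # GRID's configure can then be pointed at these installs, e.g.
  #   --with-gmp=$PREFIX --with-mpfr=$PREFIX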

Configuring GRID
After loading the appropriate software modules (Intel 2017.5 or 2018.0 compilers and HPC-X 2.0 for the example below and for the results shown later on):
  mkdir build_hpcx
  cd build_hpcx
  ../configure --enable-precision=single --enable-simd=avx2 \
    --enable-comms=mpi3-auto --enable-mkl CXX=mpicxx \
    --prefix=/<path-to-installation-directory>/grid_hpcx20
The DiRAC benchmark instructions called for running the program in single precision, but we also ran it in double precision (as suggested in the README file from the GitHub distribution), because that is what was initially used to guide the build
AVX512 and KNL are among the other possible options for --enable-simd
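For the Skylake (Xeon Gold) nodes, a corresponding AVX512 build might look like the following. This is a sketch that only mirrors the options named above; the precision choice, build directory, and prefix are illustrative, not commands from the deck:
  mkdir build_hpcx_skl
  cd build_hpcx_skl
  ../configure --enable-precision=double --enable-simd=AVX512 \
    --enable-comms=mpi3-auto --enable-mkl CXX=mpicxx \
    --prefix=/<path-to-installation-directory>/grid_hpcx20_skl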

Building and Testing GRID
Building and testing is straightforward:
  make 2>&1 | tee make_i17hpcx20.log
  make check 2>&1 | tee check_i17hpcx20.log
  make install 2>&1 | tee install_i17hpcx20.log
NOTE: It may be necessary to add -lrt to the end of the line starting with LIBS = in each of the sixteen Makefile files ({,*/,*/*/}Makefile)
NOTE: If the build server or node has processors that don't support the SIMD option configured, the make check step will fail. Nonetheless, the executables will have been produced
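One way to apply the -lrt note to all sixteen Makefiles at once; this sed line is our own sketch, not from the deck (run it from the build directory and verify the LIBS lines afterwards):
  # Append -lrt to every line that starts with "LIBS = " in the generated Makefiles
  sed -i 's/^LIBS = \(.*\)$/LIBS = \1 -lrt/' {,*/,*/*/}Makefile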

Running Benchmark_ITT
Requires several arguments:
--mpi M.N.P.Q, where M, N, P and Q give the MPI process decomposition in each of the four dimensions and M*N*P*Q is the total number of ranks; each of M, N, P and Q must be a power of two
--shm size, where size is the size (in KB) of the shared memory segments that will be allocated (1024 for 4 and fewer nodes, 2048 for 8 or more worked fine)
--threads nthreads, where nthreads is the number of OpenMP threads
--shm-hugetlb (for Xeon Phi runs)
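As a worked illustration of the argument arithmetic (these particular values are ours, not from the deck): a 16-node run with 32 ranks per node gives 512 ranks, and since 512 = 4*4*4*8 with every factor a power of two, a valid argument set is
  --mpi 4.4.4.8 --shm 2048 --threads 1
where --shm 2048 follows the "8 or more nodes" guidance above.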

Launch Command
Generic launch command for HPC-X:
  /usr/bin/time -p mpirun -np $ranks --map-by ppr:$rpn:node \
    --bind-to $object -report-bindings --display-map \
    -mca coll_hcoll_enable 0 -mca mtl ^mxm \
    -x UCX_RC_VERBS_TM_ENABLE=y -mca pml ucx \
    -x UCX_TLS=$transport,self,shm \
    /path-to-installation-directory/bin/Benchmark_ITT \
    --mpi M.N.P.Q --shm 2048 --threads $OMP_NUM_THREADS
$ranks is the total number of MPI ranks (= M*N*P*Q)
$rpn is the number of MPI ranks per node
$object is none if $rpn is 1 and socket otherwise
$transport is rc or rc_x
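A filled-in instance of the generic command, with values chosen by us purely for illustration (4 Thor nodes, 32 ranks per node, 128 ranks total decomposed as 2.4.4.4, the rc_x transport, and one OpenMP thread per rank):
  export OMP_NUM_THREADS=1
  ranks=128; rpn=32; object=socket; transport=rc_x
  /usr/bin/time -p mpirun -np $ranks --map-by ppr:$rpn:node \
    --bind-to $object -report-bindings --display-map \
    -mca coll_hcoll_enable 0 -mca mtl ^mxm \
    -x UCX_RC_VERBS_TM_ENABLE=y -mca pml ucx \
    -x UCX_TLS=$transport,self,shm \
    /path-to-installation-directory/bin/Benchmark_ITT \
    --mpi 2.4.4.4 --shm 1024 --threads $OMP_NUM_THREADS
Here --shm 1024 follows the "4 and fewer nodes" guidance from the previous slide.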

Notes
HPC-X 2.0 / UCX
HPC-X 2.0 introduced two new accelerated transports, dc_x and rc_x, that use enhanced verbs
The best transports for GRID turned out to be rc_x and rc
In the mpirun example, we enabled hardware tag matching (UCX_RC_VERBS_TM_ENABLE), which performs the tag-matching lookup in hardware instead of software
Selection of a pml (point-to-point management layer)
The UCX pml and the MXM pml are mutually exclusive; thus, MXM was disabled
Single vs. Double Precision
Single-precision numbers use 32 bits, while double-precision numbers use 64 bits
The double-precision calculations generate more data transfer for the same test size
The differences in the results presented later in the presentation emphasize the need for a fast, accelerated interconnect for best performance
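To compare the two transports between otherwise identical runs, only the UCX_TLS setting in the launch command needs to change (a sketch using the $transport variable defined on the previous slide):
  -x UCX_TLS=rc,self,shm     # plain RC verbs transport
  -x UCX_TLS=rc_x,self,shm   # accelerated RC transport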

MPI Profiling (IPM 2.06)

MPI Profiling (IPM 2.06)

Performance Measures
At the end of the run, Benchmark_ITT writes a Memory benchmark results table, a Communications benchmark table (for runs on two or more nodes), and an overall floating point performance report
Memory benchmark example (benchmarking a*x + y bandwidth):
   L   bytes             GB/s      Gflop/s   seconds   GB/s per node
   8   50331648.000      4371.720  364.310   1.492     273.232
  12   254803968.000     3818.499  318.208   1.708     238.656
  16   805306368.000     5851.782  487.649   1.115     365.736
  20   1966080000.000    5225.404  435.450   1.248     326.588
  24   4076863488.000    2918.898  243.242   2.235     182.431
  28   7552892928.000    3062.299  255.192   2.129     191.394
  32   12884901888.000   2952.386  246.032   2.208     184.524
  36   20639121408.000   2824.411  235.368   2.309     176.526
  40   31457280000.000   2825.827  235.486   2.304     176.614
  44   46056603648.000   2798.276  233.190   2.321     174.892
  48   65229815808.000   2801.425  233.452   2.328     175.089

Performance Measures (cont.)
Communications benchmark example (benchmarking threaded STENCIL halo exchange in 4 dimensions):
   L   Ls   bytes       MB/s uni (err/min/max)            MB/s bidi (err/min/max)
   4   8    49152        1472.6   178.3    88.9   2531.4    2945.2   356.6    177.9   5062.9
   8   8    393216       6147.1   605.2   381.8   8886.2   12294.2  1210.3    763.6  17772.5
  12   8    1327104      8722.7   270.0  1960.3   9909.9   17445.3   539.9   3920.5  19819.8
  16   8    3145728      9456.4    77.4  3753.1  10058.3   18912.8   154.8   7506.2  20116.6
  20   8    6144000      9475.3    61.0  4976.9  10245.7   18950.5   121.9   9953.8  20491.4
  24   8    10616832     9499.0    47.4  4991.3  10102.0   18998.1    94.8   9982.5  20204.1
  28   8    16859136     9340.3    45.1  4918.4   9866.6   18680.5    90.3   9836.9  19733.2
  32   8    25165824     9320.1    47.7  4917.7   9949.8   18640.2    95.4   9835.4  19899.5
Per Node Summary table (Ls=16):
   L   Wilson     DWF4       DWF5
   8    3895.6    33637.9    35835.8
  12   13308.6    74021.3    91408.2
  16   35034.3    85631.9   114475.0
  24   76632.2    98933.4   133285.8

Performance Measures
Overall floating point performance report sample:
  Comparison point result: 79826.6 Mflop/s per node
  Comparison point is 0.5*(85631.9+74021.3)
  Comparison point robustness: 0.819
Note that the Comparison point result is given in Mflop/s per node. For the performance graphs that follow, that value was multiplied by the number of nodes and converted to Gflop/s to obtain total Gflop/s
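As a worked example of that conversion (the node count here is ours, purely for illustration):
  total Gflop/s = (Mflop/s per node) * nodes / 1000
                = 79826.6 * 16 / 1000 ≈ 1277.2 Gflop/s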

GRID Double Precision Performance: Interconnects
CPU: Xeon E5-2697A v4 (code name "Broadwell")
48% (delta called out on the chart); Higher is Better

GRID Double Precision Performance: Interconnects
CPU: Xeon Gold 6138 (code name "Skylake")
27% (delta called out on the chart); Higher is Better

GRID Double Precision Performance: Larger Scale
CPU: Intel Xeon E5-2697A v4 (code name "Broadwell")
Interconnect: InfiniBand
Higher is Better

GRID Single Precision Performance: Larger Scale
CPU: Xeon Gold 6148 (code name "Skylake")
Interconnect: InfiniBand
Higher is Better

References
GRID
https://github.com/paboyle/grid
https://arxiv.org/abs/1512.03487
https://indico2.riken.jp/indico/getfile.py/access?contribid=5&sessionid=4&resid=0&materialid=slides&confid=1805
HPC Advisory Council write-up
https://hpcadvisorycouncil.atlassian.net/wiki/spaces/hpcworks/pages/11370500/HPC-X+2.0+Boosts+Performance+of+Grid+Benchmark

Thank You
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein