Performance of a multi-physics code on Cavium ThunderX2

1 Performance of a multi-physics code on Cavium ThunderX2
User Productivity Enhancement, Technology Transfer, and Training (PETTT)
Presented by John G. Wohlbier (PETTT/Engility), Keith Obenschain, Gopal Patnaik (NRL-DC)
September 26, 2018
PETTT

2 This material is based upon work supported by, or in part by, the Department of Defense (DoD) High Performance Computing Modernization Program (HPCMP) under the User Productivity Enhancement, Technology Transfer, and Training (PETTT) Program, contract number GS04T09DBC0017.

3 Outline
Aster
Comparison methodology
Initial performance numbers
Extract serial kernel
Rev 1 performance numbers
Extract more kernels
Characterize kernels
Summary and future work

4 Aster¹
Direct drive inertial confinement fusion code
Spherical, structured grid
Two temperature explicit CFD
Tabular equation of state
Implicit species heat conduction
Laser ray tracing
Fusion reactions
Multi-group radiation diffusion
Operator split time stepping
¹ I.V. Igumenshchev, et al. Three-Dimensional Modeling of Direct-Drive Cryogenic Implosions on OMEGA. Physics of Plasmas 23, (2016).
² Image:

5 Comparison methodology
Fixed problem size on one node
  21.3M cells/node
  1 MPI rank/core
Node characteristics
  Two sockets
  32 ranks on SKL, 2 x 125W TDP
  56/64 ranks on TX2, 2 x 165W TDP
Compilers
  Intel/Intelmpi on SKL
  Gcc/OpenMPI on TX2
Table: Manufacturer, Architecture, Part, SIMD width, Peak GF/s, Peak DRAM GB/s, Cores, Peak DRAM GB/s/core, Memory channels; rows Intel Skylake, Cavium ThunderX2 (two parts)

6 Comparison methodology
Compare STREAM Triad and HPCG
Triad peak: (Peak GB/s / 24 B/triad x 2 FLOP/triad) / Peak GF/s
STREAM table: Manufacturer, Architecture, Part, Peak GF/s, Peak DRAM GB/s, Triad peak efficiency (%), STREAM Triad (GB/s), Threads; rows Intel Skylake, Cavium ThunderX2 B0, Cavium ThunderX2 B1
HPCG table: CPU, N rank, Stack, Read (GB/s), Write (GB/s), Total (GB/s, % peak), FLOPS (GF/s), Peak (GF/s); rows Skylake, 32 ranks, Intel 19.0 with Intel MPI; ThunderX2, 64 ranks, gcc 7.2 with OpenMPI; ThunderX2, 56 ranks, gcc 7.2 with OpenMPI
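The Triad peak expression can be evaluated directly. The following is a minimal Python sketch, not part of the original slides. The function names are hypothetical, and the node-level inputs are assumptions derived from the Additional Material slides (two sockets, eight DDR4-2666 channels per socket at about 21.3 GB/s each, and 563.2 GF/s double-precision peak per 32-core socket), not the values of the slide's own table.

```python
def triad_peak_gflops(peak_dram_gbs: float) -> float:
    """FLOP rate STREAM Triad could reach if DRAM bandwidth were the only limit."""
    bytes_per_triad = 24.0  # a[i] = b[i] + s * c[i]: two 8-byte loads, one 8-byte store
    flop_per_triad = 2.0    # one multiply and one add
    return peak_dram_gbs / bytes_per_triad * flop_per_triad


def triad_peak_efficiency(peak_dram_gbs: float, peak_gflops: float) -> float:
    """Fraction of the peak FLOP rate that a bandwidth-bound Triad can use."""
    return triad_peak_gflops(peak_dram_gbs) / peak_gflops


# Assumed inputs for a two-socket, 32-core-per-socket ThunderX2 node.
channel_gbs = 8 * 2666e6 / 1e9       # 8 bytes per transfer at 2666 MT/s
node_dram_gbs = 2 * 8 * channel_gbs  # two sockets, eight channels each
node_peak_gflops = 2 * 563.2         # two sockets, double precision

efficiency = triad_peak_efficiency(node_dram_gbs, node_peak_gflops)
print(f"Triad peak: {triad_peak_gflops(node_dram_gbs):.1f} GF/s "
      f"({100 * efficiency:.2f}% of peak)")
```

Under these assumptions the bandwidth-limited Triad rate is only a few percent of the peak FLOP rate, which is why the comparison focuses on bandwidth rather than FLOP rate.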

7 Initial performance numbers
Ten cycles of Aster
21.3M cells
Table: Architecture, Ranks, Time (s), Ratio; rows SKL, TX2
Profile with Arm MAP
mpi_recv + gauss_seidel_in_plane are largest cost, but have very similar absolute run times
pow + exp have disparate run times
  SKL 15s
  TX2 60s
  Difference (60-15=45) is nearly ½ of the total difference ( =107)

8 Initial performance numbers
Profile with Arm MAP
Self time, TX2 vs SKL (% and seconds)
Function                 TX2 self %   SKL self %
mpi_recv_                13.70%       14.40%
gauss_seidel_in_plane                 18.70%
pow_finite                            4.60%
mpi_waitall_                          3.60%
mpi_bcast_                            1.60%
_gfortran_string_index                7.20%
mpi_send_                             3.00%
exp1                                  <0.1%
ppi                      4.70%        4.00%
get_diff_coefs                        6.10%

9 Extract serial kernel
Identified pow-heavy Aster function
Extract kernel using KGen
  Instruments application and generates verification data for kernel
aster_tubr ("temperature update by radiation")
Kernel used internally at Arm to work on precision issues with armflang
Arm recommends using Arm Performance Libraries (ArmPL)
Found Arm Optimized-Routines (AOR)
  Upstream for ArmPL

10 Extract serial kernel
aster_tubr results
Run on same input data
Weak scaling implies the TX2 data set would be half the size of the SKL data set
Best effective TX2 time ~ 1.36s compared to 0.88s on SKL
Architecture   Time (s)   Compiler, library
SKL            0.88       Intel
TX2                       gcc 7.2, default
TX2                       armflang, default
TX2                       armflang, -L${ARMPL_LIBRARIES} -lamath
TX2                       gcc 7.2, Arm Optimized-Routines -lmathlib

11 Rev 1 performance numbers
Ten cycles of Aster
21.3M cells
Table: Architecture, Ranks, Time (s), Ratio; rows SKL, TX2, TX2
Profile with Arm MAP
mpi_recv + mpi_waitall + mpi_send
  SKL 69s
  TX2 95s
  Difference (95-69=16) is nearly as large as total difference ( =19)

12 Rev 1 performance numbers
Profile with Arm MAP
Self time, TX2 vs SKL (% and seconds)
Function                 TX2 self %   SKL self %
gauss_seidel_in_plane    13.60%       18.70%
mpi_recv_                             14.40%
mpi_waitall_                          3.60%
_gfortran_string_index                7.20%
ppi                                   4.00%
mpi_send_                             3.00%
get_diff_coefs                        6.10%
_int_free                             <0.1%
log_inline [inlined]     3.20%        0.50%
get_residual

13 Extract more kernels
gauss_seidel_in_plane
Most expensive function
Called many times during l-cycles and V-cycles with variable-sized input data for fine and coarse grids
Tridiagonal solver in radial direction introduces MPI sweep-like dependency
Characterization useful, but not as important as multigrid
Many calls to gauss_seidel_in_plane
Accounts for 48% inclusive time
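The sweep-like dependency follows from the structure of a tridiagonal solve. The sketch below is a generic serial Thomas algorithm in Python, offered as an illustration only; it is not Aster's solver, and the function name and argument layout are assumptions. Each forward-elimination step needs the result of the previous row, so when rows are distributed across MPI ranks in the radial direction, a rank cannot start until its inner neighbour has finished.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i reads: lower[i] * x[i-1] + diag[i] * x[i] + upper[i] * x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. Assumes no pivoting is needed
    (for example, a diagonally dominant matrix).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n

    # Forward elimination: step i depends on step i - 1 (the sweep dependency).
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom

    # Back substitution: step i depends on step i + 1 (the return sweep).
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x


# Small check: the 4 x 4 system with stencil (-1, 2, -1) and solution 1, 2, 3, 4.
solution = thomas_solve([0.0, -1.0, -1.0, -1.0],
                        [2.0, 2.0, 2.0, 2.0],
                        [-1.0, -1.0, -1.0, 0.0],
                        [0.0, 0.0, 0.0, 5.0])
print(solution)
```

With more ranks along the sweep direction, the chain of waiting ranks gets longer, which is consistent with the larger rank counts on TX2 making the dependency more visible.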

14 Extract more kernels
gauss_seidel_in_plane
Table: Architecture, Ranks, Time (s), Ratio; rows SKL, TX2
multigrid
Table: Architecture, Ranks, Time (s), Ratio; rows SKL, TX2, TX2
MPI in multigrid
Table: Function, SKL (s), TX2 64 (s), TX2 56 (s); rows mpi_recv, mpi_waitall, mpi_send, Total

15 Characterize kernels
Run multigrid kernel through Intel VTune on SKL to determine performance characterization
Intel performance analysis tools provide extensive detail
multigrid kernel is memory bound on SKL
  65% of pipeline slots stalled due to load/store
  ~10% clock ticks stalled on cache
  35% clock ticks stalled on DRAM
  41% clock ticks stalled for DRAM bandwidth boundedness
  16% clock ticks stalled for DRAM latency

16 Characterize kernels
Multigrid Time vs Arithmetic Intensity
Low arithmetic intensity implies memory bandwidth will be the limiting factor

17 Characterize kernels
Multigrid roofline on SKL
Heavy vertical lines show bounds of measured arithmetic intensity
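The roofline argument can be written in a few lines. This is a minimal Python sketch of the standard roofline model, not the authors' analysis. The function names are hypothetical, the peak values are assumptions taken from the ThunderX2 Additional Material slides (563.2 GF/s and eight DDR4-2666 channels per socket), and the arithmetic intensity of 0.1 FLOP/byte is an illustrative placeholder, not a measured value.

```python
def roofline_gflops(arithmetic_intensity: float,
                    peak_gflops: float,
                    peak_bandwidth_gbs: float) -> float:
    """Attainable GF/s: the lower of the compute roof and the bandwidth roof."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)


def ridge_point(peak_gflops: float, peak_bandwidth_gbs: float) -> float:
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute bound."""
    return peak_gflops / peak_bandwidth_gbs


# Assumed per-socket peaks for a 32-core ThunderX2.
socket_peak_gflops = 563.2
socket_bandwidth_gbs = 8 * (8 * 2666e6 / 1e9)  # eight channels of DDR4-2666

ridge = ridge_point(socket_peak_gflops, socket_bandwidth_gbs)
attainable = roofline_gflops(0.1, socket_peak_gflops, socket_bandwidth_gbs)
print(f"ridge point: {ridge:.2f} FLOP/byte, attainable at 0.1 FLOP/byte: {attainable:.1f} GF/s")
```

Any kernel whose arithmetic intensity sits well below the ridge point is limited by the bandwidth roof, so its attainable rate scales with memory bandwidth rather than with peak FLOP rate.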

18 Characterize kernels
Based on DRAM bandwidth boundedness, expect higher aggregate bandwidth to run code faster
Would like to measure effective bandwidth on Arm
Histogram shows MPI imbalance due to sweep dependency of tridiagonal solver
Larger number of ranks on TX2 than on SKL exacerbates sweep dependency
Number of ranks in angular dimensions stays the same; only the sweep direction increases in ranks

19 Multigrid on four CPU architectures
Single node performance for multigrid kernel
Available memory bandwidth has large impact on performance
Intel VTune measured 41% clock ticks limited by DRAM bandwidth boundedness
More work needed to understand discrepancy between TX2 and EPYC
Table: CPU, Bandwidth (GB/s), Measured kernel time (s), Aster time (s); rows Broadwell, Skylake, ThunderX2, EPYC

20 Summary and future work
Node-level results for Aster code are encouraging
  Initially disparate results were reconciled through profiling and finding correct math libraries
Codes that are clearly bandwidth bound might be expected to perform similarly on TX2 and SKL
Shared memory byte transport layers show similar bandwidths and latencies when measured with micro-benchmarks
Additional latencies appear to be present in Aster and the extracted kernels, which require further study
Preparing Aster to run on Astra
  Will perform multi-node scaling studies next
Sweep algorithm needs to be studied for improvement
  Will benefit both SKL and TX2

21 Additional Material

22 Intel Skylake
Xeon Gold GHz
16 cores, 32 threads
Max turbo frequency: 3.7 GHz
22 MB L3 cache
TDP 125 W
Max memory speed: 2666 MHz
Number of AVX-512 FMA Units: 2
Max number of memory channels: 6
  Single: GiB/s [= (64/8/ ) GiB x 2666 MHz], [= 21.3 GB/s]
  Double: GiB/s [= 42.7 GB/s]
  Quad: GiB/s [= 85.4 GB/s]
  Hexa: GiB/s [= GB/s]
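The per-channel figure follows from the bus width and transfer rate. The sketch below is a minimal Python illustration with hypothetical function names; it assumes a 64-bit channel and treats the quoted 2666 MHz as 2666 million transfers per second, so its results can differ from the slide's rounded values in the last digit.

```python
def channel_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int = 64) -> float:
    """Peak bandwidth of one memory channel in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


def gb_to_gib(gigabytes: float) -> float:
    """Convert GB (10^9 bytes) to GiB (2^30 bytes)."""
    return gigabytes * 1e9 / 2**30


single_channel = channel_bandwidth_gbs(2666)
for label, channels in [("Single", 1), ("Double", 2), ("Quad", 4), ("Hexa", 6), ("Octo", 8)]:
    total_gbs = channels * single_channel
    print(f"{label}: {gb_to_gib(total_gbs):.2f} GiB/s [= {total_gbs:.1f} GB/s]")
```

The same function covers both processors in these slides: six channels per Skylake socket and eight per ThunderX2 socket.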

23 Intel Skylake
Floating point capacity
2 x 512 bit VPU/core
Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
Double precision
  2 FLOP/VPU/cycle x 2 VPU/core x 8 reals = 32 FLOP/cycle/core
  32 FLOP/cycle/core x [ ] GHz = [ ] GF/s/core
  [ ] GF/s/core x 16 cores = [ ] GF/s
  Single thread measurement¹: 110 GF/s/core
Single precision
  2 FLOP/VPU/cycle x 2 VPU/core x 16 reals = 64 FLOP/cycle/core
  64 FLOP/cycle/core x [ ] GHz = [ ] GF/s/core
  [ ] GF/s/core x 16 cores = [ ] GF/s
  Single thread measurement¹: 220 GF/s/core
¹ Intel Advisor 19

24 Cavium ThunderX2
Cavium ThunderX2, 2.2 GHz, B0 stepping
Some specs are best guesses based on public information and the A2 stepping
32 cores, 64 threads (up to 128 threads)
Max turbo frequency: ? GHz
32 MB L3 cache
TDP 165 W
Max memory speed: 2666 MHz
Max number of DDR4 memory channels: 8
  Single: GiB/s [= (64/8/ ) GiB x 2666 MHz], [= 21.3 GB/s]
  Double: GiB/s [= 42.7 GB/s]
  Quad: GiB/s [= 85.4 GB/s]
  Hexa: GiB/s [= GB/s]
  Octo: GiB/s [= GB/s]

25 Cavium ThunderX2 (32 core)
Floating point capacity¹
2 x 128 bit VPU/core
Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
Double precision
  2 FLOP/VPU/cycle x 2 VPU/core x 2 reals = 8 FLOP/cycle/core
  8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
  17.6 GF/s/core x 32 cores = 563 GF/s
  Single thread measurement: ? GF/s/core
Single precision
  2 FLOP/VPU/cycle x 2 VPU/core x 4 reals = 16 FLOP/cycle/core
  16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
  35.2 GF/s/core x 32 cores = 1126 GF/s
  Single thread measurement: ? GF/s/core
¹ Based on Broadcom Vulcan

26 Cavium ThunderX2 (28 core)
Floating point capacity¹
2 x 128 bit VPU/core
Fused Multiply Add (FMA): 2 FLOP/VPU/cycle
Double precision
  2 FLOP/VPU/cycle x 2 VPU/core x 2 reals = 8 FLOP/cycle/core
  8 FLOP/cycle/core x 2.2 GHz = 17.6 GF/s/core
  17.6 GF/s/core x 28 cores = 493 GF/s
  Single thread measurement: ? GF/s/core
Single precision
  2 FLOP/VPU/cycle x 2 VPU/core x 4 reals = 16 FLOP/cycle/core
  16 FLOP/cycle/core x 2.2 GHz = 35.2 GF/s/core
  35.2 GF/s/core x 28 cores = 986 GF/s
  Single thread measurement: ? GF/s/core
¹ Based on Broadcom Vulcan
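The peak figures for both ThunderX2 parts can be reproduced with one small function. This is a minimal Python sketch with a hypothetical function name. It counts two 128-bit vector units per core, each holding 2 double-precision or 4 single-precision values, which yields the same 8 and 16 FLOP/cycle/core used in the slides.

```python
def peak_gflops(vpus_per_core: int, vector_bits: int, precision_bits: int,
                clock_ghz: float, cores: int, flop_per_fma: int = 2) -> float:
    """Theoretical peak GF/s assuming every vector unit issues one FMA per cycle."""
    lanes = vector_bits // precision_bits  # values per vector register
    flop_per_cycle_per_core = flop_per_fma * vpus_per_core * lanes
    return flop_per_cycle_per_core * clock_ghz * cores


for cores in (32, 28):
    double_peak = peak_gflops(2, 128, 64, 2.2, cores)
    single_peak = peak_gflops(2, 128, 32, 2.2, cores)
    print(f"ThunderX2 {cores} core: {double_peak:.1f} GF/s double, {single_peak:.1f} GF/s single")
```

The clock frequency is the only input that the Skylake slide leaves blank, so the same function applies there once a sustained AVX-512 frequency is chosen (two 512-bit units per core, 16 cores).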
