Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA


Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi
Tohoku University
SC18, 14 November 2018

Outline
- Background
- Overview of SX-Aurora TSUBASA
- Performance evaluation
  - Benchmark performance
  - Application performance
- Conclusions

Background
- Supercomputers have become important infrastructure, widely used for scientific research as well as in various industries
  - The Top-1 Summit system reaches 143.5 Pflop/s
- There is a big gap between theoretical peak performance and sustained performance
  - Compute-intensive applications benefit from high peak performance
  - Memory-intensive applications are limited by lower memory performance
- Memory performance has therefore attracted more and more attention

A new vector supercomputer: SX-Aurora TSUBASA
- Two important concepts in its design: high usability and high sustained performance
- New memory integration technology realizes the world's highest memory bandwidth
- New architecture: a vector host (VH) is attached to vector engines (VEs)
  - The VE is responsible for executing an entire application
  - The VH is used for processing system calls invoked by the applications
[Figure: a VH (x86, Linux) connected to VEs (vector processors)]

New execution model
- Conventional model (host + GPU): the host starts processing, loads the executable module, launches kernel executions on the GPU, and handles OS functions itself; processing finishes on the host
- New execution model (VH + VE): the VH loads the executable module, then the VE starts processing and executes the kernels; system calls (I/O, etc.) are transparently offloaded to the VH as OS functions, and processing finishes on the VE

Highlights of the execution model
Two advantages over the conventional execution model:
- High sustained performance: frequent data transfers between the VE and the VH are avoided, because applications are executed entirely on the VE and only the data necessary for system calls needs to be transferred
- High usability: no special programming is required; explicit specification of computation kernels is not necessary, and system calls are transparently offloaded to the VH, so programmers do not need to care about them
[Figure: the VH loads the executable module and handles offloaded OS functions; the VE starts processing, issues system calls (I/O, etc.), and finishes processing]
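To make "no special programming" concrete, here is a minimal sketch of my own (not taken from the slides): an ordinary C program whose loop runs on the VE and whose file I/O is forwarded to the VH as a system call. The only assumption is that it is built with the VE toolchain (for example NEC's ncc compiler); no offload pragmas or kernel annotations appear in the source.

```c
/* Hypothetical example: ordinary C code with no offload annotations.
 * Assumption: compiled for the VE (e.g., "ncc -O3 example.c"); the
 * compute loop runs on the VE and the file I/O below is transparently
 * offloaded to the VH as a system call. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 24;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    if (!a || !b) return 1;

    for (size_t i = 0; i < n; i++) b[i] = (double)i;

    /* Vectorizable compute kernel: executed entirely on the VE. */
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i] + 1.0;

    /* System call (file I/O): transparently handled via the VH. */
    FILE *f = fopen("result.bin", "wb");
    if (f) {
        fwrite(a, sizeof(double), n, f);
        fclose(f);
    }

    free(a);
    free(b);
    return 0;
}
```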

Specification of SX-Aurora TSUBASA
- Memory bandwidth: 1.22 TB/s, the world's highest
  - Integration of six HBM2 memory modules
  - 3.0 TB/s LLC bandwidth; the LLC is connected to the cores via a 2D mesh network
- Computational capability: 2.15 Tflop/s @ 1.4 GHz with 8 powerful vector cores
- 16 nm FinFET process technology, 4.8 billion transistors, 14.96 mm x 33.00 mm die
[Figure: block diagram of a vector processor]
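As a quick sanity check of these figures (my own arithmetic; the pipe/lane breakdown in the second line is an assumption not stated on this slide), the per-core peak of 268.8 Gflop/s quoted on the following slides corresponds to 192 double-precision flops per cycle at 1.4 GHz, and eight cores give the 2.15 Tflop/s socket peak:

```latex
\begin{aligned}
268.8~\mathrm{Gflop/s} \div 1.4~\mathrm{GHz} &= 192~\mathrm{DP~flop/cycle/core} \\
192 &= 3~\mathrm{FMA~pipes} \times 32~\mathrm{lanes} \times 2~\mathrm{flop/FMA} \\
8~\mathrm{cores} \times 268.8~\mathrm{Gflop/s} &\approx 2.15~\mathrm{Tflop/s}
\end{aligned}
```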


Experimental environments
SX-Aurora TSUBASA A300-2: 1x VH + 2x VEs (Type 10B)

                     VH: Intel Xeon Gold 6126       VE: Type 10B
  Frequency          2.60 GHz / 3.70 GHz (Turbo)    1.4 GHz
  Peak FP / core     83.2 Gflop/s                   268.8 Gflop/s
  # of cores         12                             8
  Peak DP / socket   998.4 Gflop/s                  2.15 Tflop/s
  Memory BW          128 GB/s                       1.22 TB/s
  Memory capacity    96 GB                          48 GB
  Memory config      DDR4-2666 DIMM 16 GB x 6       HBM2

Experimental environments (cont.)

  Processor         SX-Aurora Type 10B  Xeon Gold 6126        SX-ACE        Tesla V100    Xeon Phi KNL 7290
  Frequency         1.4 GHz             2.6 GHz               1.0 GHz       1.245 GHz     1.5 GHz
  # of cores        8                   12                    4             5120          72
  DP (SP) flop/s    2.15 TF (4.30 TF)   998.4 GF (1996.8 GF)  256 GF        7 TF (14 TF)  3.456 TF (6.912 TF)
  Memory subsystem  HBM2 x6             DDR4 x6ch             DDR3 x16ch    HBM2 x4       MCDRAM / DDR4
  Memory BW         1.22 TB/s           128 GB/s              256 GB/s      900 GB/s      450+ GB/s / 115.2 GB/s
  Memory capacity   48 GB               96 GB                 64 GB         16 GB         16 GB / 96 GB
  LLC BW            2.66 TB/s           N/A                   1.0 TB/s      N/A           N/A
  LLC capacity      16 MB shared        19.25 MB shared       1 MB private  6 MB shared   1 MB shared by 2 cores
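From the table above one can also derive the machine-side memory Bytes/Flop ratios that the later bottleneck analysis relies on (my arithmetic, not a slide from the talk):

```latex
\begin{aligned}
\text{SX-Aurora Type 10B:}\quad & 1.22~\mathrm{TB/s} / 2.15~\mathrm{Tflop/s} \approx 0.57~\mathrm{B/F} \\
\text{Xeon Gold 6126:}\quad     & 128~\mathrm{GB/s} / 998.4~\mathrm{Gflop/s} \approx 0.13~\mathrm{B/F} \\
\text{SX-ACE:}\quad             & 256~\mathrm{GB/s} / 256~\mathrm{Gflop/s} = 1.0~\mathrm{B/F} \\
\text{Tesla V100:}\quad         & 900~\mathrm{GB/s} / 7~\mathrm{Tflop/s} \approx 0.13~\mathrm{B/F} \\
\text{Xeon Phi KNL 7290:}\quad  & 450~\mathrm{GB/s} / 3.456~\mathrm{Tflop/s} \approx 0.13~\mathrm{B/F}
\end{aligned}
```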

Applications used for evaluation
- SGEMM/DGEMM: matrix-matrix multiplications to evaluate peak flop/s
- STREAM benchmark: simple kernels (copy, scale, add, triad) to measure sustained memory performance
- Himeno benchmark: Jacobi kernels with a 19-point stencil, used as memory-intensive kernels
- Application kernels: kernels of practical applications of Tohoku University in the earthquake, CFD, and electromagnetics fields
- Microbenchmark to evaluate the execution model: a mixture of vector-friendly Jacobi kernels and I/O kernels
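For reference, the STREAM triad kernel measured here is essentially the loop below (a minimal sketch in the spirit of McCalpin's STREAM, not the official benchmark source); the reported bandwidth counts three 8-byte accesses per element. With this accounting, the 79% triad efficiency reported later for SX-Aurora TSUBASA corresponds to roughly 0.96 TB/s sustained out of the 1.22 TB/s peak.

```c
/* Minimal STREAM-triad-style sketch (not the official STREAM source).
 * Bandwidth is computed as 3 arrays x 8 bytes x N / elapsed time. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 27)   /* array length, chosen to far exceed the LLC */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 2.0; c[i] = 1.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for            /* triad: a = b + scalar * c */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * 8.0 * (double)N / 1.0e9;
    printf("Triad bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```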

Overview of application kernels

  Kernel          Field             Method         Memory access  Mesh size        Code B/F  Actual B/F
  Land mine       Electromagnetics  FDTD           Sequential     100x750x750      6.22      5.15
  Earthquake      Seismology        Friction law   Sequential     2047x2047x256    4.00      4.00
  Turbulent flow  CFD               Navier-Stokes  Sequential     512x16384x512    1.91      0.35
  Antenna         Electromagnetics  FDTD           Sequential     252755x9x97336   1.73      0.98
  Plasma          Physics           Lax-Wendroff   Indirect       20,048,000       1.12      0.075
  Turbine         CFD               LU-SGS         Indirect       480x80x80x10     0.96      0.0084


SGEMM/DGEMM performance
[Figure: DGEMM/SGEMM performance (Gflop/s) and efficiency (%) vs. number of threads, 1 to 8]
- High scalability up to 8 threads
- High vectorization: vectorization ratio 99.36%, average vector length 253.8
- High efficiency: 97.8% to 99.2% of peak
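The Gflop/s and efficiency figures of a GEMM run are typically obtained as sketched below (my own harness, not the authors'; it assumes a CBLAS-providing BLAS library, e.g. the vendor BLAS on the VE, is linked): the 2*n^3 flops of the call are divided by the elapsed time and by the 2.15 Tflop/s socket peak.

```c
/* Hedged sketch: timing DGEMM through CBLAS and reporting efficiency.
 * Assumes a CBLAS-providing BLAS library is linked (vendor BLAS on the VE). */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <omp.h>

int main(void)
{
    const int n = 8192;                      /* square matrices */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    if (!A || !B || !C) return 1;
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();

    double gflops = 2.0 * n * (double)n * n / (t1 - t0) / 1.0e9;
    printf("DGEMM: %.1f Gflop/s (%.1f%% of the 2150 Gflop/s VE peak)\n",
           gflops, 100.0 * gflops / 2150.0);

    free(A); free(B); free(C);
    return 0;
}
```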

Memory performance (STREAM Triad)
[Figure: sustained STREAM triad bandwidth (GB/s); SX-Aurora TSUBASA delivers x4.72 the bandwidth of SX-ACE, x11.7 that of Skylake, x1.37 that of Tesla V100, and x2.23 that of KNL; a second plot shows bandwidth vs. number of threads for SX-Aurora, SX-ACE, and Skylake]
- High sustained memory bandwidth on SX-Aurora TSUBASA
- Efficiency: Aurora 79%, SX-ACE 83%, Skylake 66%, V100 81%
- Scalability: the bandwidth saturates even when fewer than half of the threads are used

Himeno (Jacobi) performance
[Figure: Himeno performance (Gflop/s) of SX-Aurora TSUBASA relative to SX-ACE (x3.4), Skylake (x7.96), Tesla V100 (x0.93), and KNL (x2.1); a second plot shows performance vs. number of threads for SX-Aurora, SX-ACE, and Skylake]
- Higher performance than the other platforms, except the GPU
  - The vector reduction becomes a bottleneck due to copies among the vector pipes
- Good thread scalability: 6.9x speedup with 8 threads, i.e. 86% parallel efficiency
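For orientation, a 19-point Jacobi sweep of the kind the Himeno benchmark measures looks roughly like the sketch below (a simplified stand-in of my own, not the benchmark's source; the Himeno coefficient arrays are folded into constants). The gosa accumulation at the end is the vector reduction mentioned above.

```c
/* Simplified 19-point Jacobi sketch in the spirit of the Himeno benchmark
 * (not the benchmark's exact source). p is the pressure field, omega the
 * relaxation factor; gosa is the reduced residual. */
#include <stdio.h>
#include <string.h>

#define NX 128
#define NY 128
#define NZ 256

static float p[NX][NY][NZ], pn[NX][NY][NZ];

static float jacobi_sweep(float omega)
{
    float gosa = 0.0f;                       /* residual norm (vector reduction) */
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++) {
                /* 6 face + 12 edge neighbours + centre = 19-point stencil */
                float s = p[i+1][j][k] + p[i-1][j][k]
                        + p[i][j+1][k] + p[i][j-1][k]
                        + p[i][j][k+1] + p[i][j][k-1]
                        + p[i+1][j+1][k] + p[i+1][j-1][k]
                        + p[i-1][j+1][k] + p[i-1][j-1][k]
                        + p[i][j+1][k+1] + p[i][j+1][k-1]
                        + p[i][j-1][k+1] + p[i][j-1][k-1]
                        + p[i+1][j][k+1] + p[i+1][j][k-1]
                        + p[i-1][j][k+1] + p[i-1][j][k-1];
                float ss = s / 18.0f - p[i][j][k];
                gosa += ss * ss;
                pn[i][j][k] = p[i][j][k] + omega * ss;
            }
    return gosa;
}

int main(void)
{
    float gosa = 0.0f;
    for (int it = 0; it < 10; it++) {
        gosa = jacobi_sweep(0.8f);
        memcpy(p, pn, sizeof(p));            /* adopt the updated field */
    }
    printf("residual %e\n", (double)gosa);
    return 0;
}
```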

Application kernel performance
[Figure: execution times (sec) of the six application kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake, with SX-Aurora speedups of x4.3 (Land mine), x2.9 (Earthquake), x3.4 (Turbulent flow), x9.9 (Antenna), x2.6 (Plasma), and x3.4 (Turbine); and a roofline-style plot of attainable performance (Gflop/s) vs. arithmetic intensity (flop/byte) for SX-Aurora TSUBASA and SX-ACE]
- SX-Aurora TSUBASA achieves high performance on all kernels
- Plasma and Turbine: indirect memory accesses, memory latency-bound
- Antenna: shifts from computation-bound to memory BW-bound
- Land mine, Earthquake, and Turbulent flow: memory or LLC BW-bound

Memory-bound or LLC-bound?
Further analysis using four types of Bytes/Flop (B/F) ratios:
- Memory B/F = (memory BW) / (peak performance)
- LLC B/F = (LLC BW) / (peak performance)
- Code B/F = (necessary data in bytes) / (# of FP operations)
- Actual B/F = (# of memory block accesses) x (block size) / (# of FP operations)

                      Actual B/F < Memory B/F   Actual B/F > Memory B/F
  Code B/F < LLC B/F  Computation-bound         Memory BW-bound
  Code B/F > LLC B/F  LLC BW-bound              Memory or LLC bound (*)

(*) Code B/F > Actual B/F x (LLC BW / Memory BW) => LLC bound
    Code B/F < Actual B/F x (LLC BW / Memory BW) => memory bound
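The decision procedure can be written down directly; the sketch below is my own illustration (the classify function and the kernel values in main are hypothetical, not from the talk), applying the 2x2 table and the tie-break rule exactly as stated above.

```c
/* Sketch of the B/F classification described above.
 * mem_bf  = memory BW / peak flop/s   (machine)
 * llc_bf  = LLC BW / peak flop/s      (machine)
 * code_bf = required bytes / FP ops   (kernel, from the source code)
 * act_bf  = measured memory traffic / FP ops (kernel, from counters) */
#include <stdio.h>

static const char *classify(double code_bf, double act_bf,
                            double mem_bf, double llc_bf,
                            double mem_bw, double llc_bw)
{
    if (act_bf < mem_bf && code_bf < llc_bf) return "computation-bound";
    if (act_bf < mem_bf)                     return "LLC BW-bound";
    if (code_bf < llc_bf)                    return "memory BW-bound";
    /* Both limits exceeded: apply the tie-break rule from the slide. */
    return (code_bf > act_bf * llc_bw / mem_bw) ? "LLC bound"
                                                : "memory bound";
}

int main(void)
{
    /* SX-Aurora TSUBASA Type 10B figures from the slides. */
    const double mem_bw = 1.22e12, llc_bw = 2.66e12, peak = 2.15e12;
    const double mem_bf = mem_bw / peak;   /* ~0.57 B/F */
    const double llc_bf = llc_bw / peak;   /* ~1.24 B/F */

    /* Hypothetical kernel (illustrative values only, not from the talk). */
    const double code_bf = 2.0, act_bf = 1.5;
    printf("Memory B/F %.2f, LLC B/F %.2f -> kernel is %s\n",
           mem_bf, llc_bf,
           classify(code_bf, act_bf, mem_bf, llc_bf, mem_bw, llc_bw));
    return 0;
}
```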

Application kernel performance (cont.)
Applying the B/F analysis on SX-Aurora TSUBASA:
- Memory B/F = 1.22 TB/s / 2.15 Tflop/s = 0.57
- LLC B/F = 2.66 TB/s / 2.15 Tflop/s = 1.24
- Land mine (Code B/F 6.22, Actual B/F 5.79) => LLC bound
- Earthquake (Code B/F 6.00, Actual B/F 2.00) => LLC bound
- Turbulent flow (Code B/F 1.91, Actual B/F 0.35) => memory BW bound
- Antenna (Code B/F 1.73, Actual B/F 0.98) => memory BW bound
[Figure: execution times of the application kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake, as on the previous slide, alongside the B/F decision table]

Evaluation of the execution model (transparent vs. explicit offload)
[Figure: execution time (sec) breakdown (data transfer, Jacobi 1, I/O, Jacobi 2) of the microbenchmark under three configurations: VE execution with transparent system-call offload to the VH, explicit offload from the VE to the VH, and offload from the VH to the VE]

Multi-VE performance on A300-8
[Figure: STREAM triad bandwidth (GB/s) and Himeno performance (Gflop/s) vs. number of VEs, 1 to 8, measured against ideal scaling]
- STREAM VE-level scalability: almost ideal scaling up to 8 VEs
- Himeno VE-level scalability: good scaling up to 4 VEs; beyond that the vector lengths become too short because the problem size is too small


Application evaluation
Applications:
- Tsunami simulation code
  - Mainly consists of ground fluctuation and tsunami inundation prediction
  - Used for a real-time tsunami inundation forecast system
  - A coastal region of Japan (1244 x 826 km) with an 810 m resolution
- Direct numerical simulation (DNS) code of turbulent flows
  - Incompressible Navier-Stokes equations and a continuity equation are solved with 3D FFTs
  - MPI parallelization on up to 4 VEs (1 to 32 processes)
  - 128^3, 256^3, and 512^3 grid points

Performance of the Tsunami code
- Per-core performance is 11.4x, 4.1x, and 2.4x that of KNL, Skylake, and SX-ACE, respectively
- Per-socket performance is 2.5x, 1.9x, and 3.5x, respectively

Performance of the DNS code
- Speedups of about 8.14, 6.73, and 5.48 with 32 cores over 1 core for the 128^3, 256^3, and 512^3 grids, respectively
- The LLC hit ratios for 256^3 and 512^3 are lower than that for 128^3
- About 10% of the time is spent on memory bank conflicts
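Put in terms of parallel efficiency at 32 cores (my arithmetic, not a figure from the talk):

```latex
\begin{aligned}
128^3:&\quad 8.14 / 32 \approx 25\% \\
256^3:&\quad 6.73 / 32 \approx 21\% \\
512^3:&\quad 5.48 / 32 \approx 17\%
\end{aligned}
```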

Conclusions
- A vector supercomputer, SX-Aurora TSUBASA
  - The world's highest memory bandwidth of 1.22 TB/s through the integration of six HBM2 modules
  - A new execution model to achieve high usability and high sustained performance
- Performance evaluation
  - Benchmark programs: high sustained memory performance and effectiveness of the new execution model
  - Application programs: high sustained performance
- Future work: further optimizations for SX-Aurora TSUBASA

Acknowledgements
- People: Hiroyuki Takizawa, Ryusuke Egawa, Souya Fujimoto, Yasuhisa Masaoka
- Projects: the closed beta program for early access to Aurora; High Performance Computing Division (jointly organized with NEC), Cyberscience Center, Tohoku University