Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA


Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi
Tohoku University
SC18, 14 November 2018

Outline
- Background
- Overview of SX-Aurora TSUBASA
- Performance evaluation
  - Benchmark performance
  - Application performance
- Conclusions

Background
- Supercomputers have become important infrastructure, widely used for scientific research as well as in various industries
  - The Top-1 Summit system reaches 143.5 Pflop/s
- There is a big gap between theoretical peak performance and sustained performance
  - Compute-intensive applications benefit from high peak performance
  - Memory-intensive applications are limited by lower memory performance
- Memory performance has therefore attracted more and more attention

A new vector supercomputer: SX-Aurora TSUBASA
- Two important concepts in its design: high usability and high sustained performance
- New memory integration technology realizes the world's highest memory bandwidth
- New architecture: a vector host (VH) is attached to vector engines (VEs)
  - The VE is responsible for executing an entire application
  - The VH is used for processing system calls invoked by the applications
[Figure: a VH (x86, Linux) connected to VEs (vector processors)]

New execution model
- Conventional model (host + GPU): the host starts processing, loads the executable module, launches kernel executions on the GPU, and handles OS functions itself; processing finishes on the host
- New execution model (VH + VE): the VH loads the executable module, then the VE starts processing and executes the kernels; system calls (I/O, etc.) are transparently offloaded to the VH as OS functions, and processing finishes on the VE

Highlights of the execution model
Two advantages over the conventional execution model:
- High sustained performance: frequent data transfers between the VE and the VH are avoided, because applications are executed entirely on the VE and only the data necessary for system calls needs to be transferred
- High usability: no special programming is required; explicit specification of computation kernels is not necessary, and system calls are transparently offloaded to the VH, so programmers do not need to care about them
[Figure: the VH loads the executable module and handles offloaded OS functions; the VE starts processing, issues system calls (I/O, etc.), and finishes processing]
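To make "no special programming" concrete, here is a minimal sketch of my own (not taken from the slides): an ordinary C program whose loop runs on the VE and whose file I/O is forwarded to the VH as a system call. The only assumption is that it is built with the VE toolchain (for example NEC's ncc compiler); no offload pragmas or kernel annotations appear in the source.

```c
/* Hypothetical example: ordinary C code with no offload annotations.
 * Assumption: compiled for the VE (e.g., "ncc -O3 example.c"); the
 * compute loop runs on the VE and the file I/O below is transparently
 * offloaded to the VH as a system call. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1 << 24;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    if (!a || !b) return 1;

    for (size_t i = 0; i < n; i++) b[i] = (double)i;

    /* Vectorizable compute kernel: executed entirely on the VE. */
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i] + 1.0;

    /* System call (file I/O): transparently handled via the VH. */
    FILE *f = fopen("result.bin", "wb");
    if (f) {
        fwrite(a, sizeof(double), n, f);
        fclose(f);
    }

    free(a);
    free(b);
    return 0;
}
```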

Specification of SX-Aurora TSUBASA
- Memory bandwidth: 1.22 TB/s, the world's highest
  - Integration of six HBM2 memory modules
  - 3.0 TB/s LLC bandwidth; the LLC is connected to the cores via a 2D mesh network
- Computational capability: 2.15 Tflop/s @ 1.4 GHz with 8 powerful vector cores
- 16 nm FinFET process technology, 4.8 billion transistors, 14.96 mm x 33.00 mm die
[Figure: block diagram of a vector processor]
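As a quick sanity check of these figures (my own arithmetic; the pipe/lane breakdown in the second line is an assumption not stated on this slide), the per-core peak of 268.8 Gflop/s quoted on the following slides corresponds to 192 double-precision flops per cycle at 1.4 GHz, and eight cores give the 2.15 Tflop/s socket peak:

```latex
\begin{aligned}
268.8~\mathrm{Gflop/s} \div 1.4~\mathrm{GHz} &= 192~\mathrm{DP~flop/cycle/core} \\
192 &= 3~\mathrm{FMA~pipes} \times 32~\mathrm{lanes} \times 2~\mathrm{flop/FMA} \\
8~\mathrm{cores} \times 268.8~\mathrm{Gflop/s} &\approx 2.15~\mathrm{Tflop/s}
\end{aligned}
```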


Experimental environments
SX-Aurora TSUBASA A300-2: 1x VH + 2x VEs (Type 10B)

                     VH: Intel Xeon Gold 6126       VE: Type 10B
  Frequency          2.60 GHz / 3.70 GHz (Turbo)    1.4 GHz
  Peak FP / core     83.2 Gflop/s                   268.8 Gflop/s
  # of cores         12                             8
  Peak DP / socket   998.4 Gflop/s                  2.15 Tflop/s
  Memory BW          128 GB/s                       1.22 TB/s
  Memory capacity    96 GB                          48 GB
  Memory config      DDR4-2666 DIMM 16 GB x 6       HBM2

Experimental environments (cont.)

  Processor         SX-Aurora Type 10B  Xeon Gold 6126        SX-ACE        Tesla V100    Xeon Phi KNL 7290
  Frequency         1.4 GHz             2.6 GHz               1.0 GHz       1.245 GHz     1.5 GHz
  # of cores        8                   12                    4             5120          72
  DP (SP) flop/s    2.15 TF (4.30 TF)   998.4 GF (1996.8 GF)  256 GF        7 TF (14 TF)  3.456 TF (6.912 TF)
  Memory subsystem  HBM2 x6             DDR4 x6ch             DDR3 x16ch    HBM2 x4       MCDRAM / DDR4
  Memory BW         1.22 TB/s           128 GB/s              256 GB/s      900 GB/s      450+ GB/s / 115.2 GB/s
  Memory capacity   48 GB               96 GB                 64 GB         16 GB         16 GB / 96 GB
  LLC BW            2.66 TB/s           N/A                   1.0 TB/s      N/A           N/A
  LLC capacity      16 MB shared        19.25 MB shared       1 MB private  6 MB shared   1 MB shared by 2 cores
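From the table above one can also derive the machine-side memory Bytes/Flop ratios that the later bottleneck analysis relies on (my arithmetic, not a slide from the talk):

```latex
\begin{aligned}
\text{SX-Aurora Type 10B:}\quad & 1.22~\mathrm{TB/s} / 2.15~\mathrm{Tflop/s} \approx 0.57~\mathrm{B/F} \\
\text{Xeon Gold 6126:}\quad     & 128~\mathrm{GB/s} / 998.4~\mathrm{Gflop/s} \approx 0.13~\mathrm{B/F} \\
\text{SX-ACE:}\quad             & 256~\mathrm{GB/s} / 256~\mathrm{Gflop/s} = 1.0~\mathrm{B/F} \\
\text{Tesla V100:}\quad         & 900~\mathrm{GB/s} / 7~\mathrm{Tflop/s} \approx 0.13~\mathrm{B/F} \\
\text{Xeon Phi KNL 7290:}\quad  & 450~\mathrm{GB/s} / 3.456~\mathrm{Tflop/s} \approx 0.13~\mathrm{B/F}
\end{aligned}
```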

Applications used for evaluation
- SGEMM/DGEMM: matrix-matrix multiplications to evaluate peak flop/s
- STREAM benchmark: simple kernels (copy, scale, add, triad) to measure sustained memory performance
- Himeno benchmark: Jacobi kernels with a 19-point stencil, used as memory-intensive kernels
- Application kernels: kernels of practical applications of Tohoku University in the earthquake, CFD, and electromagnetics fields
- Microbenchmark to evaluate the execution model: a mixture of vector-friendly Jacobi kernels and I/O kernels
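For reference, the STREAM triad kernel measured here is essentially the loop below (a minimal sketch in the spirit of McCalpin's STREAM, not the official benchmark source); the reported bandwidth counts three 8-byte accesses per element. With this accounting, the 79% triad efficiency reported later for SX-Aurora TSUBASA corresponds to roughly 0.96 TB/s sustained out of the 1.22 TB/s peak.

```c
/* Minimal STREAM-triad-style sketch (not the official STREAM source).
 * Bandwidth is computed as 3 arrays x 8 bytes x N / elapsed time. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 27)   /* array length, chosen to far exceed the LLC */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 2.0; c[i] = 1.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for            /* triad: a = b + scalar * c */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    double t1 = omp_get_wtime();

    double gbytes = 3.0 * 8.0 * (double)N / 1.0e9;
    printf("Triad bandwidth: %.1f GB/s\n", gbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}
```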

Overview of application kernels

  Kernel          Field             Method         Memory access  Mesh size        Code B/F  Actual B/F
  Land mine       Electromagnetics  FDTD           Sequential     100x750x750      6.22      5.15
  Earthquake      Seismology        Friction law   Sequential     2047x2047x256    4.00      4.00
  Turbulent flow  CFD               Navier-Stokes  Sequential     512x16384x512    1.91      0.35
  Antenna         Electromagnetics  FDTD           Sequential     252755x9x97336   1.73      0.98
  Plasma          Physics           Lax-Wendroff   Indirect       20,048,000       1.12      0.075
  Turbine         CFD               LU-SGS         Indirect       480x80x80x10     0.96      0.0084


SGEMM/DGEMM performance
[Figure: DGEMM/SGEMM performance (Gflop/s) and efficiency (%) vs. number of threads, 1 to 8]
- High scalability up to 8 threads
- High vectorization: vectorization ratio 99.36%, average vector length 253.8
- High efficiency: 97.8% to 99.2% of peak
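The Gflop/s and efficiency figures of a GEMM run are typically obtained as sketched below (my own harness, not the authors'; it assumes a CBLAS-providing BLAS library, e.g. the vendor BLAS on the VE, is linked): the 2*n^3 flops of the call are divided by the elapsed time and by the 2.15 Tflop/s socket peak.

```c
/* Hedged sketch: timing DGEMM through CBLAS and reporting efficiency.
 * Assumes a CBLAS-providing BLAS library is linked (vendor BLAS on the VE). */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>
#include <omp.h>

int main(void)
{
    const int n = 8192;                      /* square matrices */
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    if (!A || !B || !C) return 1;
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();

    double gflops = 2.0 * n * (double)n * n / (t1 - t0) / 1.0e9;
    printf("DGEMM: %.1f Gflop/s (%.1f%% of the 2150 Gflop/s VE peak)\n",
           gflops, 100.0 * gflops / 2150.0);

    free(A); free(B); free(C);
    return 0;
}
```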

Memory performance (STREAM Triad)
[Figure: sustained STREAM triad bandwidth (GB/s); SX-Aurora TSUBASA delivers x4.72 the bandwidth of SX-ACE, x11.7 that of Skylake, x1.37 that of Tesla V100, and x2.23 that of KNL; a second plot shows bandwidth vs. number of threads for SX-Aurora, SX-ACE, and Skylake]
- High sustained memory bandwidth on SX-Aurora TSUBASA
- Efficiency: Aurora 79%, SX-ACE 83%, Skylake 66%, V100 81%
- Scalability: the bandwidth saturates even when fewer than half of the threads are used

Himeno (Jacobi) performance
[Figure: Himeno performance (Gflop/s) of SX-Aurora TSUBASA relative to SX-ACE (x3.4), Skylake (x7.96), Tesla V100 (x0.93), and KNL (x2.1); a second plot shows performance vs. number of threads for SX-Aurora, SX-ACE, and Skylake]
- Higher performance than the other platforms, except the GPU
  - The vector reduction becomes a bottleneck due to copies among the vector pipes
- Good thread scalability: 6.9x speedup with 8 threads, i.e. 86% parallel efficiency
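For orientation, a 19-point Jacobi sweep of the kind the Himeno benchmark measures looks roughly like the sketch below (a simplified stand-in of my own, not the benchmark's source; the Himeno coefficient arrays are folded into constants). The gosa accumulation at the end is the vector reduction mentioned above.

```c
/* Simplified 19-point Jacobi sketch in the spirit of the Himeno benchmark
 * (not the benchmark's exact source). p is the pressure field, omega the
 * relaxation factor; gosa is the reduced residual. */
#include <stdio.h>
#include <string.h>

#define NX 128
#define NY 128
#define NZ 256

static float p[NX][NY][NZ], pn[NX][NY][NZ];

static float jacobi_sweep(float omega)
{
    float gosa = 0.0f;                       /* residual norm (vector reduction) */
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            for (int k = 1; k < NZ - 1; k++) {
                /* 6 face + 12 edge neighbours + centre = 19-point stencil */
                float s = p[i+1][j][k] + p[i-1][j][k]
                        + p[i][j+1][k] + p[i][j-1][k]
                        + p[i][j][k+1] + p[i][j][k-1]
                        + p[i+1][j+1][k] + p[i+1][j-1][k]
                        + p[i-1][j+1][k] + p[i-1][j-1][k]
                        + p[i][j+1][k+1] + p[i][j+1][k-1]
                        + p[i][j-1][k+1] + p[i][j-1][k-1]
                        + p[i+1][j][k+1] + p[i+1][j][k-1]
                        + p[i-1][j][k+1] + p[i-1][j][k-1];
                float ss = s / 18.0f - p[i][j][k];
                gosa += ss * ss;
                pn[i][j][k] = p[i][j][k] + omega * ss;
            }
    return gosa;
}

int main(void)
{
    float gosa = 0.0f;
    for (int it = 0; it < 10; it++) {
        gosa = jacobi_sweep(0.8f);
        memcpy(p, pn, sizeof(p));            /* adopt the updated field */
    }
    printf("residual %e\n", (double)gosa);
    return 0;
}
```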

Application kernel performance
[Figure: execution times (sec) of the six application kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake, with SX-Aurora speedups of x4.3 (Land mine), x2.9 (Earthquake), x3.4 (Turbulent flow), x9.9 (Antenna), x2.6 (Plasma), and x3.4 (Turbine); and a roofline-style plot of attainable performance (Gflop/s) vs. arithmetic intensity (flop/byte) for SX-Aurora TSUBASA and SX-ACE]
- SX-Aurora TSUBASA achieves high performance on all kernels
- Plasma and Turbine: indirect memory accesses, memory latency-bound
- Antenna: shifts from computation-bound to memory BW-bound
- Land mine, Earthquake, and Turbulent flow: memory or LLC BW-bound

Memory-bound or LLC-bound?
Further analysis using four types of Bytes/Flop (B/F) ratios:
- Memory B/F = (memory BW) / (peak performance)
- LLC B/F = (LLC BW) / (peak performance)
- Code B/F = (necessary data in bytes) / (# of FP operations)
- Actual B/F = (# of memory block accesses) x (block size) / (# of FP operations)

                      Actual B/F < Memory B/F   Actual B/F > Memory B/F
  Code B/F < LLC B/F  Computation-bound         Memory BW-bound
  Code B/F > LLC B/F  LLC BW-bound              Memory or LLC bound (*)

(*) Code B/F > Actual B/F x (LLC BW / Memory BW) => LLC bound
    Code B/F < Actual B/F x (LLC BW / Memory BW) => memory bound
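The decision procedure can be written down directly; the sketch below is my own illustration (the classify function and the kernel values in main are hypothetical, not from the talk), applying the 2x2 table and the tie-break rule exactly as stated above.

```c
/* Sketch of the B/F classification described above.
 * mem_bf  = memory BW / peak flop/s   (machine)
 * llc_bf  = LLC BW / peak flop/s      (machine)
 * code_bf = required bytes / FP ops   (kernel, from the source code)
 * act_bf  = measured memory traffic / FP ops (kernel, from counters) */
#include <stdio.h>

static const char *classify(double code_bf, double act_bf,
                            double mem_bf, double llc_bf,
                            double mem_bw, double llc_bw)
{
    if (act_bf < mem_bf && code_bf < llc_bf) return "computation-bound";
    if (act_bf < mem_bf)                     return "LLC BW-bound";
    if (code_bf < llc_bf)                    return "memory BW-bound";
    /* Both limits exceeded: apply the tie-break rule from the slide. */
    return (code_bf > act_bf * llc_bw / mem_bw) ? "LLC bound"
                                                : "memory bound";
}

int main(void)
{
    /* SX-Aurora TSUBASA Type 10B figures from the slides. */
    const double mem_bw = 1.22e12, llc_bw = 2.66e12, peak = 2.15e12;
    const double mem_bf = mem_bw / peak;   /* ~0.57 B/F */
    const double llc_bf = llc_bw / peak;   /* ~1.24 B/F */

    /* Hypothetical kernel (illustrative values only, not from the talk). */
    const double code_bf = 2.0, act_bf = 1.5;
    printf("Memory B/F %.2f, LLC B/F %.2f -> kernel is %s\n",
           mem_bf, llc_bf,
           classify(code_bf, act_bf, mem_bf, llc_bf, mem_bw, llc_bw));
    return 0;
}
```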

Application kernel performance (cont.)
Applying the B/F analysis on SX-Aurora TSUBASA:
- Memory B/F = 1.22 TB/s / 2.15 Tflop/s = 0.57
- LLC B/F = 2.66 TB/s / 2.15 Tflop/s = 1.24
- Land mine (Code B/F 6.22, Actual B/F 5.79) => LLC bound
- Earthquake (Code B/F 6.00, Actual B/F 2.00) => LLC bound
- Turbulent flow (Code B/F 1.91, Actual B/F 0.35) => memory BW bound
- Antenna (Code B/F 1.73, Actual B/F 0.98) => memory BW bound
[Figure: execution times of the application kernels on SX-Aurora TSUBASA, SX-ACE, and Skylake, as on the previous slide, alongside the B/F decision table]

Evaluation of the execution model (transparent vs. explicit offload)
[Figure: execution time (sec) breakdown (data transfer, Jacobi 1, I/O, Jacobi 2) of the microbenchmark under three configurations: VE execution with transparent system-call offload to the VH, explicit offload from the VE to the VH, and offload from the VH to the VE]

Multi-VE performance on A300-8
[Figure: STREAM triad bandwidth (GB/s) and Himeno performance (Gflop/s) vs. number of VEs, 1 to 8, measured against ideal scaling]
- STREAM VE-level scalability: almost ideal scaling up to 8 VEs
- Himeno VE-level scalability: good scaling up to 4 VEs; beyond that the vector lengths become too short because the problem size is too small


Application evaluation
Applications:
- Tsunami simulation code
  - Mainly consists of ground fluctuation and tsunami inundation prediction
  - Used for a real-time tsunami inundation forecast system
  - A coastal region of Japan (1244 x 826 km) with an 810 m resolution
- Direct numerical simulation (DNS) code of turbulent flows
  - Incompressible Navier-Stokes equations and a continuity equation are solved with 3D FFTs
  - MPI parallelization on up to 4 VEs (1 to 32 processes)
  - 128^3, 256^3, and 512^3 grid points

Performance of the Tsunami code
- Per-core performance is 11.4x, 4.1x, and 2.4x that of KNL, Skylake, and SX-ACE, respectively
- Per-socket performance is 2.5x, 1.9x, and 3.5x, respectively

Performance of the DNS code
- Speedups of about 8.14, 6.73, and 5.48 with 32 cores over 1 core for the 128^3, 256^3, and 512^3 grids, respectively
- The LLC hit ratios for 256^3 and 512^3 are lower than that for 128^3
- About 10% of the time is spent on memory bank conflicts
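Put in terms of parallel efficiency at 32 cores (my arithmetic, not a figure from the talk):

```latex
\begin{aligned}
128^3:&\quad 8.14 / 32 \approx 25\% \\
256^3:&\quad 6.73 / 32 \approx 21\% \\
512^3:&\quad 5.48 / 32 \approx 17\%
\end{aligned}
```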

Conclusions
- A vector supercomputer, SX-Aurora TSUBASA
  - The world's highest memory bandwidth of 1.22 TB/s through the integration of six HBM2 modules
  - A new execution model to achieve high usability and high sustained performance
- Performance evaluation
  - Benchmark programs: high sustained memory performance and effectiveness of the new execution model
  - Application programs: high sustained performance
- Future work: further optimizations for SX-Aurora TSUBASA

Acknowledgements
- People: Hiroyuki Takizawa, Ryusuke Egawa, Souya Fujimoto, Yasuhisa Masaoka
- Projects: the closed beta program for early access to Aurora; High Performance Computing Division (jointly organized with NEC), Cyberscience Center, Tohoku University