Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du*, I-Hsin Chung**, Weichung Wang*. *Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan. **IBM T. J. Watson Research Center, NY, US.

Outline 2 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary

Introduction 3 (Ref: Sun et al., Nature 528, 2015; Ivinskaya & Lavrinenko, 2011) Photonics: waveguides, resonant cavities, frequency filters, plasmonic devices. Design concerns: structural characteristics, parameter refinement, experimental data.

Introduction - Why Multi-GPU Scaling 4 Global supercomputing trend: high energy efficiency; growing popularity in deep learning applications; integration of high-performance numerical simulation and deep learning. (Image sources: ORNL, NVIDIA.)

Introduction 5 (Diagram components) Machine-Learning-Derived Behavior Model and Intelligent Design; Photonic Integrated Circuit Design; Broadband Spectral Analysis; Nonlinear Equations with Multiphysics Features; Photonic Crystal Analyzer; Shift-Inverse Eigensolver; Preconditioner and Algorithm for Iterative Side-Equation Solver; Parallel Direct FDFD Solver Kernel.

Introduction 6 (Diagram components) Machine-Learning-Derived Behavior Model; Photonic Integrated Circuit Design; Broadband Spectral Analysis; Nonlinear Equations with Multiphysics Features; Photonic Crystal Analyzer; Shift-Inverse Eigensolver; Preconditioner and Algorithm for Iterative Side-Equation Solver. When the iterative solver fails: Parallel Direct FDFD Solver Kernel.

Introduction - Objectives 7 Fast generation of numerical data for different parameters. Data-driven intelligent design of optical components. Explicit and fast acquisition of quantitative characteristics. Reduction of postprocessing and data storage/transfer requirements. (Diagram: Finite-Difference Frequency-Domain, Parallel Direct FDFD Solver Kernel.)

Outline 8 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary

Implementation - FDFD Problem 9 Linear system from the frequency-domain curl-curl equation, ∇×∇×E − k₀² ε_r E = −iωμ₀ J. Direct solver for robust solution. Yee's mesh, perfectly-matched layer, high-frequency problem. Challenge: heavy factorization loads. (Parallel Direct FDFD Solver Kernel.)

Implementation 10 Compressed hierarchical Schur method (CHiS): domain decomposition, multi-level algorithm; 3D nested dissection of Yee's mesh (Nx × Ny × Nz). Ideal periodic structure: D_1 = D_2 = D_3 = ... = D_16; S_1,1 = S_1,2 = S_1,3 = ... = S_1,8; S_2,1 = S_2,2 = S_2,3 = S_2,4; S_3,1 = S_3,2; S_4,1.

Implementation 11 Compressed hierarchical Schur method: elimination tree deduplication. Diagonals; interfaces to children (I_U, I_L).

Implementation 12 Compressed hierarchical Schur method: elimination tree deduplication. Diagonals; interfaces to children.
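
One way to realize this elimination-tree deduplication is to key factorizations by a subdomain signature and reuse them for identical blocks (D_1 = D_2 = ... in the ideal periodic structure). The sketch below is illustrative only, assuming a hypothetical FactorData type and a caller-supplied factorization routine; it is not the solver's actual data structure.

```cpp
// Sketch: reuse one factorization for all structurally identical subdomains.
// FactorData and the factorize callable are hypothetical placeholders.
#include <unordered_map>
#include <memory>
#include <cstdint>

struct FactorData { /* LU factors of the diagonal block and interfaces I_U, I_L (omitted) */ };

template <class Factorizer>
std::shared_ptr<FactorData> get_or_factorize(
        std::unordered_map<std::uint64_t, std::shared_ptr<FactorData>> &cache,
        std::uint64_t signature, Factorizer &&factorize)
{
    auto it = cache.find(signature);
    if (it != cache.end())
        return it->second;                      // duplicate block: reuse existing factors
    auto factors = factorize(signature);        // factorize only the first occurrence
    cache.emplace(signature, factors);
    return factors;
}
```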

Implementation 13 Compressed hierarchical Schur method: Leaf-level Interface Compression (LIC). Use one updating submatrix over multiple Schur complement submatrices with row/column permutations. Less sparse-matrix computation means less CPU-centric load.

Implementation 14 Compressed hierarchical Schur method: expose larger chunks of matrix computation. Major function calls and libraries:
Subdomains. Sparse diagonal: sparse factorize. Sparse interface: sparse LS solve and matrix multiply. Libraries: (Option 1) PARDISO and Sparse BLAS; (Option 2) MUMPS.
Separators. Dense diagonal: dense LU. Packed dense interface: dense LS solve and matrix multiply. Libraries: BLAS (ZGEMM) and LAPACK (ZGETRF, ZGETRS), with hardware acceleration (GPU: cuBLAS, cuSOLVER, etc.).
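
For the dense separator diagonals, a minimal sketch of how the ZGETRF step might be issued through cuSOLVER is shown below. It assumes the block already resides in device memory in column-major order; the function name factorize_diag_block and the buffer names are illustrative, not the solver's actual API.

```cpp
// Sketch: LU-factorize one dense separator diagonal block on the GPU with
// cuSOLVER's ZGETRF (workspace query, then in-place factorization).
#include <cuda_runtime.h>
#include <cusolverDn.h>

int factorize_diag_block(cusolverDnHandle_t solver, cuDoubleComplex *d_S, int n,
                         int *d_ipiv, int *d_info)
{
    int lwork = 0;
    cusolverDnZgetrf_bufferSize(solver, n, n, d_S, n, &lwork);       // query workspace size

    cuDoubleComplex *d_work = nullptr;
    cudaMalloc(&d_work, sizeof(cuDoubleComplex) * lwork);

    cusolverDnZgetrf(solver, n, n, d_S, n, d_work, d_ipiv, d_info);  // in-place LU with pivoting

    cudaFree(d_work);

    int info = 0;
    cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
    return info;                                                     // 0 on success
}
```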

Implementation 15 GPU acceleration considerations: multi-GPU scaling in a single node (scale-up), no longer based solely on nested dissection. Asynchronous streams for small submatrices, overlapping some computation kernels, hardware scheduling. Threaded GPU controls, thread affinity.
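
One common way to realize the threaded GPU controls mentioned above is a host thread per device, each owning its own stream and cuBLAS handle. The sketch below assumes OpenMP threads; block_queue_t and process_block are hypothetical placeholders for the solver's work units.

```cpp
// Sketch: one host thread drives each GPU with its own asynchronous stream.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <omp.h>

void drive_gpus(int num_gpus /*, block_queue_t &queue */) {
    #pragma omp parallel num_threads(num_gpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                 // bind this thread to one GPU

        cudaStream_t stream;
        cudaStreamCreate(&stream);          // asynchronous work queue for this GPU

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetStream(handle, stream);    // cuBLAS calls issued on this stream

        // while (queue.pop(dev, block)) {              // pull submatrices assigned to this GPU
        //     process_block(handle, stream, block);    // H2D copy + ZGETRS/ZGEMM, all async
        // }

        cudaStreamSynchronize(stream);      // wait for this GPU's work to finish
        cublasDestroy(handle);
        cudaStreamDestroy(stream);
    }
}
```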

Implementation GPU acceleration 16 Factorize all diagonal blocks S_i,j related to level i (CPU or GPU work).

Implementation GPU acceleration 17 Asynchronously send some blocks to the GPU and perform S_i,j⁻¹ I_U.

Implementation GPU acceleration 18 Continue to ZGEMM with no D2H data transmission. S_i,j⁻¹ I_U is kept in GPU memory for the later I_L (S_i,j⁻¹ I_U) operation. Workspace is simply discarded when no longer needed.

Implementation GPU acceleration 19 Asynchronously perform ZGEMM: I_L (S_i,j⁻¹ I_U).

Implementation GPU acceleration 20 Collect I_L (S_i,j⁻¹ I_U) from all GPUs and perform the higher-level Schur update on the CPU.

Implementation GPU acceleration 21 Continue with more ZGEMM operations I_L (S_i,j⁻¹ I_U) that reuse the same S_i,j⁻¹ I_U, together with further Schur updates.
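
Putting slides 16 through 21 together, a minimal single-GPU sketch of one interface update follows. It assumes the diagonal block has already been LU-factorized on the device (e.g., by the ZGETRF sketch above), column-major cuDoubleComplex storage, and a pinned host buffer for the asynchronous copy-back; all names are illustrative.

```cpp
// Sketch of one Schur-update step on a single GPU stream: ZGETRS forms
// S^{-1} I_U (kept on the GPU), ZGEMM forms I_L * (S^{-1} I_U), then the
// contribution is copied back asynchronously for the CPU-side Schur update.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

void schur_update(cusolverDnHandle_t solver, cublasHandle_t blas, cudaStream_t stream,
                  cuDoubleComplex *d_S, int *d_ipiv, int n,     // LU factors of S_i,j (n x n)
                  cuDoubleComplex *d_IU, int nrhs,              // interface block I_U (n x nrhs)
                  const cuDoubleComplex *d_IL, int m,           // interface block I_L (m x n)
                  cuDoubleComplex *d_update,                    // result I_L * S^{-1} I_U (m x nrhs)
                  cuDoubleComplex *h_update,                    // pinned host buffer for the result
                  int *d_info)
{
    cusolverDnSetStream(solver, stream);
    cublasSetStream(blas, stream);

    // 1) d_IU <- S^{-1} * I_U  (ZGETRS on the LU factors; result stays on the GPU)
    cusolverDnZgetrs(solver, CUBLAS_OP_N, n, nrhs, d_S, n, d_ipiv, d_IU, n, d_info);

    // 2) d_update <- I_L * (S^{-1} I_U)  (ZGEMM, no intermediate D2H transfer)
    const cuDoubleComplex one = {1.0, 0.0}, zero = {0.0, 0.0};
    cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, nrhs, n,
                &one, d_IL, m, d_IU, n, &zero, d_update, m);

    // 3) Copy the contribution back asynchronously; the CPU accumulates it into
    //    the parent-level Schur complement once the stream has completed.
    cudaMemcpyAsync(h_update, d_update, sizeof(cuDoubleComplex) * m * nrhs,
                    cudaMemcpyDeviceToHost, stream);
}
```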

Implementation 22 GPU acceleration: workload balance for multi-GPU. Distribute I_U blocks by parent levels. Tackle extreme cases with many duplicates. Minor increase in H2D transfer.

Implementation 23 GPU acceleration: workload balance for multi-GPU. Panel I_U: each I_U column panel should be large enough. Multiple I_L copies sent to GPUs. Moderate increase in H2D transfer.
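
The slides do not spell out the scheduling heuristic; the following is a minimal greedy sketch that assigns I_U column panels to the least-loaded GPU, using panel column count as a stand-in cost measure (an assumption; the actual scheduler may weight by FLOPs or parent level).

```cpp
// Sketch: greedy assignment of I_U column panels to GPUs by accumulated work.
#include <vector>
#include <algorithm>
#include <cstddef>

std::vector<int> assign_panels(const std::vector<std::size_t> &panel_cols, int num_gpus) {
    std::vector<int> owner(panel_cols.size());
    std::vector<std::size_t> load(num_gpus, 0);

    // Visit panels from largest to smallest so big panels are spread first.
    std::vector<std::size_t> order(panel_cols.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return panel_cols[a] > panel_cols[b]; });

    for (std::size_t idx : order) {
        int g = static_cast<int>(std::min_element(load.begin(), load.end()) - load.begin());
        owner[idx] = g;              // this GPU gets the panel (and a copy of I_L)
        load[g] += panel_cols[idx];  // update its accumulated work
    }
    return owner;
}
```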

Implementation 24 GPU acceleration Without workload balance Finishing time > 325 seconds

Implementation 25 GPU acceleration With workload balance Finishing time < 250 seconds

Outline 26 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary

Numerical Results I 27 Hardware specifications. Brillante: 2× Intel E5-2670 v3 CPUs (12 + 12 cores used), 256 GB memory, 2× K40 GPUs. P8Exp: 2× IBM Power8 CPUs (8 + 8 cores used), 1 TB memory, 4× K80 GPUs. Software: Intel Parallel Studio 2016 update 1 and Intel PARDISO (Brillante); IBM ESSL and Parallel ESSL, IBM XL Fortran and XL C compilers (P8Exp); MUMPS 5.0.1; CUDA 7.5 on both.

Numerical Results I 28 SOI dielectric waveguide. Total grids: 79 × 319 × 39 (matrix dimension 2,948,517). Wavelength: 1.5 μm. Grid size: 0.02 μm. 100 GB RAM.

Numerical Results I 29 Brillante: 2× K40. ZGETRS + ZGEMM: 439.3 seconds (90% of overall time).

Numerical Results I 30 Brillante: 2× K40. Naïve GPU acceleration yields good speedup due to high arithmetic intensity. Scatter time includes D2H transfer.

Numerical Results I 31 Brillante: 2× K40. Async streams apply to low-level separators, which finish in seconds even in CPU-only mode.

Numerical Results I 32 Brillante: 2× K40. Workload balance yields better speedup and multi-GPU scaling.

Numerical Results I 33 P8Exp: 4× K80 with autoboost. Good performance scaling in the quad-K80 server. Higher performance with half-K80 computing (one GPU per K80 board). Two threads compete for a single PCI-E link's bandwidth when using a full K80.

Numerical Results I 34 P8Exp: 4× K80 with autoboost. AccTRSMM: multi-GPU scaling. Increased H2D transfer due to multiple I_L copies sent to work-sharing GPUs. We still get acceptable scaling performance.

Numerical Results I 35 Periodic air hole wavelength filter: no propagation at λ₀ = 1.5 μm. Total grids: 79 × 575 × 47 (matrix dimension 6,404,925). 188 GB RAM.

Brillante: 2 K40 Numerical Results I 36

Numerical Results I P8Exp: 4 K80 with autoboost 37

Numerical Results I 38 P8Exp: GPU scaling of AccTRSMM. Many more dense-matrix operations. Good scaling in multi-GPU systems.

Outline 39 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary

P2P Matrix Sharing 40 Improved multi-GPU scaling with P2P transfer. Past: multiple I_L copies sent to work-sharing GPUs; growing H2D transfer with increasing GPU sharing; major bottleneck for multi-P100 acceleration. No cuBLAS-XT: some matrix contents are already distributed among GPUs. (Diagram: S⁻¹ broadcast.)


P2P Matrix Sharing 42 Improved multi-GPU scaling with P2P transfer: I_L division, cudaMemcpyPeerAsync, threaded GPU control with busy-waiting, S⁻¹ division; I_U is shared among GPUs holding identical S⁻¹. Expectation: replace massive H2D with P2P, reducing H2D transmission. Other improvements: asynchronous D2H transfer right after ZGEMM; S⁻¹ D2H is counted in AccTRSMM time in our P2P scheme.
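
A minimal sketch of the cudaMemcpyPeerAsync broadcast described above, assuming each GPU already holds a destination buffer for the shared panel; buffer layout, stream ownership, and the function name are illustrative only.

```cpp
// Sketch: broadcast a panel (e.g., S^{-1} I_U) from the GPU that computed it
// to the other work-sharing GPUs with peer-to-peer copies instead of repeated
// host-to-device transfers.
#include <cuda_runtime.h>
#include <vector>
#include <cstddef>

void broadcast_panel_p2p(int src_gpu, const std::vector<int> &dst_gpus,
                         void *const *dev_buffers,              // dev_buffers[g] = panel buffer on GPU g
                         std::size_t bytes,
                         const std::vector<cudaStream_t> &streams)  // one stream per GPU
{
    for (int dst : dst_gpus) {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, dst, src_gpu);
        if (can_access) {
            cudaSetDevice(dst);
            cudaDeviceEnablePeerAccess(src_gpu, 0);   // returns an "already enabled" error on repeats; ignored here
        }
        // Device-to-device copy over PCI-E/NVLink; CUDA stages through host
        // memory automatically if direct peer access is unavailable.
        cudaMemcpyPeerAsync(dev_buffers[dst], dst,
                            dev_buffers[src_gpu], src_gpu,
                            bytes, streams[dst]);
    }
}
```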

Outline 43 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary

Numerical Results II 44 IntelExp: 2× Intel E5-2640 v4 (20 physical cores), 8× Tesla P100 with 16 GB device memory, PCI-E switch enclosure, no NVLink. DGX-1: 2× Intel E5-2698 v4 (40 physical cores), 8× NVLink-enabled Tesla P100.

Numerical Results II 45 IntelExp: PCI-E enclosure on one CPU (experimental build). Aggregate CPU-GPU bandwidth: 10~12 GB/s (unidirectional). GPU-GPU link bandwidth: 12.5 GB/s (unidirectional). (Topology diagram: CPU0, CPU1, GPU0 through GPU7.)

IntelExp: 4GPU Numerical Results II Consistent PCI-E speed between GPUs at 12.5 GB/s Saturated CPU-GPU link 46

IntelExp: 8GPU Numerical Results II Some GPU links slow down by half Heavy congestion between CPU-GPU 47

Numerical Results II IntelExp: SOI waveguide simulation 48

Numerical Results II IntelExp: AccTRSMM Speedup (SOI waveguide) 49

Numerical Results II 50 GPU AccTRSMM in SOI waveguide case. Great scaling performance in computing; H2D and D2H transfer becomes the major scaling bottleneck; P2P sharing eliminates H2D growth in multi-GPU runs.

Total H2D (GB), total D2H (GB), AccTRSMM time (seconds), and AccTRSMM scale (No-P2P / With-P2P):
1-GPU: H2D 207.8 / 207.8; D2H 170.9 / 170.9; time 121.3 / 146.1; scale 1.00X / 1.00X
2-GPU: H2D 341.1 / 207.8; D2H 170.9 / 170.9; time 85.4 / 89.6; scale 1.42X / 1.63X
4-GPU: H2D 531.2 / 207.8; D2H 170.9 / 170.9; time 87.1 / 67.1; scale 1.39X / 2.18X
8-GPU: H2D 805.5 / 207.8; D2H 170.9 / 170.9; time 109.3 / 58.4; scale 1.11X / 2.50X

Numerical Results II IntelExp: Periodic air hole wavelength filter 51

Numerical Results II IntelExp: AccTRSMM Speedup (Air hole filter) 52

Numerical Results II 53 GPU AccTRSMM in filter case. Great scaling performance in computing; H2D and D2H transfer becomes the major scaling bottleneck; P2P sharing eliminates H2D growth in multi-GPU runs.

Total H2D (GB), total D2H (GB), AccTRSMM time (seconds), and AccTRSMM scale (No-P2P / With-P2P):
1-GPU: H2D 427.5 / 427.5; D2H 348.0 / 348.0; time 320.4 / 376.3; scale 1.00X / 1.00X
2-GPU: H2D 690.9 / 427.5; D2H 348.0 / 348.0; time 204.2 / 220.6; scale 1.57X / 1.71X
4-GPU: H2D 1144.2 / 427.5; D2H 348.0 / 348.0; time 195.5 / 158.8; scale 1.64X / 2.37X
8-GPU: H2D 1839.9 / 427.5; D2H 348.0 / 348.0; time 252.1 / 134.0; scale 1.27X / 2.81X

Numerical Results II 54 DGX-1: doubled CPU-GPU bandwidth in multi-GPU computing; aggregate bandwidth 24 GB/s (unidirectional). NVLink: up to 20 GB/s (unidirectional), over 18 GB/s measured in the profiler. (Image source: NVIDIA.)

Numerical Results II 55 DGX-1: SOI waveguide simulation Strange CPU behavior with OpenMP under investigation

Numerical Results II DGX-1: AccTRSMM (SOI waveguide) 56

Numerical Results II 57 DGX-1 AccTRSMM in SOI waveguide case. Significant speedup from H2D and D2H (doubled CPU-GPU links); NVLink further reduces sharing overheads. NVLink between CPU and GPU?

AccTRSMM time (seconds) and AccTRSMM scale (DGX-1 / IntelExp):
1-GPU: time 146.1 / 146.1; scale 1.00X / 1.00X
2-GPU: time 78.5 / 89.6; scale 1.86X / 1.63X
4-GPU: time 47.5 / 67.1; scale 3.08X / 2.18X
8-GPU: time 35.3 / 58.4; scale 4.14X / 2.50X

From 439.3 seconds (24 Haswell cores) to 35.3 seconds: over 12.4X speedup.

Outline 58 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary

Summary 59 CHiS solver for 3D photonic simulation with multi-GPU. FLOP, time, and memory savings: CPU-GPU traffic reduced. Dense LA functions: ready for modern HPC architectures. Sparse LA functions: SpMM, sparse LS solver. Balanced multi-GPU acceleration with asynchronous data transfers and matrix computations. P2P transfer: great computation scaling up to 8 GPUs; successful harnessing of high-density GPU-accelerated systems; fast transfer between CPU and GPU. MPI implementation in progress: fit computation task units into GPUs while maintaining resource savings and scheduling and exposing parallelization simultaneously.

Acknowledgement 60 IBM Research, NVIDIA Taiwan, NVAITC Program. Thank you!