Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du*, I-Hsin Chung**, Weichung Wang* *Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan **IBM T. J. Watson Research Center, NY, US
Outline 2 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary
Introduction (Ref: Sun et al., Nature 528, 2015) 3 Photonics: waveguides, resonant cavities, frequency filters, plasmonic devices Design concerns: structural characteristics, parameter refinement, experimental data (Ref: Ivinskaya & Lavrinenko, 2011)
Introduction - Why Multi-GPU Scaling Global supercomputing trend High energy efficiency Growing popularity in deep learning applications Integration of high-performance numerical simulation and deep learning Source: ORNL 4 Source: NVIDIA
Introduction 5 Machine-Learning-Derived Behavior Model and Intelligent Design Photonic Integrated Circuit Design Broadband Spectral Analysis Nonlinear Equations with Multiphysics Features Photonic Crystal Analyzer Shift-Inverse Eigensolver Preconditioner and Algorithm for Iterative Side-Equation Solver Parallel Direct FDFD Solver Kernel
Introduction 6 Machine-Learning-Derived Behavior Model Photonic Integrated Circuit Design Broadband Spectral Analysis Nonlinear Equations with Multiphysics Features Photonic Crystal Analyzer Shift-Inverse Eigensolver Preconditioner and Algorithm for Iterative Side-Equation Solver When iterative solver fails Parallel Direct FDFD Solver Kernel
Objectives Introduction Fast generation of numerical data for different parameters Data-driven intelligent design of optical components Explicit and fast acquisition of quantitative characteristics Reduction of postprocessing and data storage/transfer requirement 7 Finite-Difference Frequency-Domain Parallel Direct FDFD Solver Kernel
Outline 8 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary
FDFD Problem Implementation 9 Linear system: ∇ × ∇ × E − k₀² ε_r E = −iωμ₀ J Direct solver for robust solution Yee's mesh Perfectly matched layer (PML) High-frequency problem Challenge: heavy factorization loads Parallel Direct FDFD Solver Kernel
Implementation 10 Compressed hierarchical Schur method (CHiS) Domain decomposition, multi-level algorithm 3D nested dissection of Yee's mesh (Nx × Ny × Nz) Ideal periodic structure: D_1 = D_2 = D_3 = ... = D_16; S_{1,1} = S_{1,2} = S_{1,3} = ... = S_{1,8}; S_{2,1} = S_{2,2} = S_{2,3} = S_{2,4}; S_{3,1} = S_{3,2}; S_{4,1}
Implementation 11 Compressed hierarchical Schur method Elimination tree deduplication: diagonals, interfaces to children (I_U, I_L)
Implementation Compressed hierarchical Schur method Elimination tree deduplication Diagonals Interfaces to children 12
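As an illustration of the elimination-tree deduplication above, the hedged sketch below (hypothetical helper, not the CHiS implementation) maps every block to the first block of its equivalence class, so only one representative per class is actually factorized.

```cpp
// Hedged sketch: deduplicate identical blocks (e.g., D_1 = ... = D_16,
// S_{1,1} = ... = S_{1,8}) so only one representative per class is factorized.
#include <unordered_map>
#include <vector>

// For each block, return the index of the block whose factorization it reuses.
std::vector<int> dedup_factor_targets(const std::vector<int> &class_of_block) {
  std::unordered_map<int, int> first_in_class;  // class id -> first block seen
  std::vector<int> reuse(class_of_block.size());
  for (int b = 0; b < static_cast<int>(class_of_block.size()); ++b) {
    auto it = first_in_class.find(class_of_block[b]);
    if (it == first_in_class.end()) {
      first_in_class.emplace(class_of_block[b], b);  // this block gets factorized
      reuse[b] = b;
    } else {
      reuse[b] = it->second;                         // duplicate: reuse the factor
    }
  }
  return reuse;
}
```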
Implementation 13 Compressed hierarchical Schur method Leaf-level Interface Compression (LIC): use one updating submatrix for multiple Schur complement submatrices via row/column permutations. Less sparse-matrix computation means a lighter CPU-centric load (see the sketch below).
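To make the LIC reuse concrete, here is a hedged sketch (assumed column-major layout and hypothetical permutation arrays) that scatters one shared update matrix into a permuted Schur complement block instead of recomputing the update per block.

```cpp
// Hedged sketch of LIC reuse: one dense update U (m x k) is applied to a Schur
// complement block S through per-block row/column permutations, so U is computed
// once and reused across blocks that share the same structure.
#include <cuComplex.h>
#include <cstddef>

void apply_shared_update(cuDoubleComplex *S, int lds,             // column-major block
                         const cuDoubleComplex *U, int m, int k,  // shared update
                         const int *row_perm, const int *col_perm) {
  for (int j = 0; j < k; ++j)
    for (int i = 0; i < m; ++i) {
      const cuDoubleComplex u = U[i + static_cast<std::size_t>(j) * m];
      cuDoubleComplex *s =
          &S[row_perm[i] + static_cast<std::size_t>(col_perm[j]) * lds];
      s->x -= u.x;  // S(row_perm[i], col_perm[j]) -= U(i, j), real part
      s->y -= u.y;  // imaginary part
    }
}
```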
Implementation 14 Compressed hierarchical Schur method Expose larger chunks of matrix computation Major function calls and libraries: Subdomains: sparse diagonal (sparse factorization) and sparse interface (sparse LS solve and matrix multiply), via (Option 1) PARDISO and Sparse BLAS or (Option 2) MUMPS. Separators: dense diagonal (dense LU) and packed dense interface (dense LS solve and matrix multiply), via BLAS (ZGEMM) and LAPACK (ZGETRF, ZGETRS) with hardware acceleration (GPU: cuBLAS, cuSOLVER, etc.). A hedged kernel sketch follows.
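As a rough, hedged illustration of the dense separator kernels named above (ZGETRF/ZGETRS on the diagonal block, feeding later ZGEMM), the sketch below uses cuSOLVER; all buffer names and dimensions are assumptions, not the solver's actual code.

```cpp
// Hedged sketch: factor one dense separator diagonal block on the GPU (ZGETRF)
// and reuse the factors to solve against its packed interface (ZGETRS).
#include <cusolverDn.h>
#include <cuda_runtime.h>

void factor_and_solve_separator(cusolverDnHandle_t h,
                                cuDoubleComplex *dS, int n,      // diagonal block
                                cuDoubleComplex *dIU, int nrhs,  // interface I_U
                                int *dIpiv, int *dInfo) {
  int lwork = 0;
  cusolverDnZgetrf_bufferSize(h, n, n, dS, n, &lwork);           // workspace query
  cuDoubleComplex *dWork = nullptr;
  cudaMalloc(&dWork, sizeof(cuDoubleComplex) * lwork);
  cusolverDnZgetrf(h, n, n, dS, n, dWork, dIpiv, dInfo);         // dense LU (ZGETRF)
  cusolverDnZgetrs(h, CUBLAS_OP_N, n, nrhs, dS, n, dIpiv,        // S^{-1} I_U (ZGETRS)
                   dIU, n, dInfo);
  cudaFree(dWork);
}
```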
Implementation 15 GPU acceleration: considerations Multi-GPU scaling within a single node (scale-up) No longer based solely on nested dissection Asynchronous streams for small submatrices, overlapping some computation kernels Hardware scheduling: threaded GPU controls, thread affinity (a hedged sketch follows)
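One plausible realization of the threaded GPU controls and asynchronous streams listed above is sketched below, assuming one OpenMP host thread per device; the structure is illustrative only, not the actual scheduler.

```cpp
// Hedged sketch: one OpenMP host thread per GPU, each with its own stream and
// cuBLAS handle, so small submatrices can be processed asynchronously.
#include <omp.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

void run_gpu_workers(int num_gpus) {
  #pragma omp parallel num_threads(num_gpus)
  {
    const int dev = omp_get_thread_num();
    cudaSetDevice(dev);                 // bind this host thread to one GPU
    cudaStream_t stream;
    cudaStreamCreate(&stream);          // asynchronous stream for small submatrices
    cublasHandle_t blas;
    cublasCreate(&blas);
    cublasSetStream(blas, stream);      // BLAS calls enqueue on this stream
    // ... enqueue H2D copies, ZGETRS/ZGEMM kernels, and D2H copies here ...
    cudaStreamSynchronize(stream);
    cublasDestroy(blas);
    cudaStreamDestroy(stream);
  }
}
```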
Implementation GPU acceleration 16 Factorize all diagonal blocks S_{i,j} related to level i (CPU or GPU work).
Implementation GPU acceleration 17 Asynchronously send some blocks to the GPU and compute S_{i,j}^{-1} I_U.
Implementation GPU acceleration 18 Continue to ZGEMM with no D2H data transmission; S_{i,j}^{-1} I_U is kept on the GPU for the later I_L (S_{i,j}^{-1} I_U) operation. The workspace is simply discarded when no longer needed.
Implementation GPU acceleration 19 Asynchronously perform ZGEMM: I_L (S_{i,j}^{-1} I_U).
Implementation GPU acceleration 20 Collect I_L (S_{i,j}^{-1} I_U) from all GPUs and perform the higher-level Schur update on the CPU.
Implementation GPU acceleration 21 Continue with more ZGEMM operations I_L (S_{i,j}^{-1} I_U) that reuse the cached S_{i,j}^{-1} I_U, plus the corresponding Schur updates (condensed in the sketch below).
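Slides 17 through 21 describe the per-block flow; the hedged sketch below condenses it for a single block on one stream. Buffer names, dimensions, and the handle setup are assumptions, and the higher-level accumulation stays on the CPU as described above.

```cpp
// Hedged sketch of one block's flow: async H2D of I_U, W = S^{-1} I_U kept on the
// GPU (ZGETRS), product I_L * W (ZGEMM), then the product is copied back so the
// CPU can accumulate the higher-level Schur update.
#include <cublas_v2.h>
#include <cusolverDn.h>

void schur_product_block(cublasHandle_t blas, cusolverDnHandle_t solv, cudaStream_t s,
                         int n, int m, int k,
                         const cuDoubleComplex *hIU,  // host I_U block (n x k)
                         const cuDoubleComplex *dS,   // LU factors of S (n x n)
                         const int *dIpiv,
                         const cuDoubleComplex *dIL,  // device I_L block (m x n)
                         cuDoubleComplex *dW,         // device workspace (n x k)
                         cuDoubleComplex *dC,         // device product (m x k)
                         cuDoubleComplex *hC, int *dInfo) {
  cublasSetStream(blas, s);
  cusolverDnSetStream(solv, s);
  const cuDoubleComplex one = make_cuDoubleComplex(1.0, 0.0);
  const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
  cudaMemcpyAsync(dW, hIU, sizeof(cuDoubleComplex) * n * k,
                  cudaMemcpyHostToDevice, s);                        // async H2D
  cusolverDnZgetrs(solv, CUBLAS_OP_N, n, k, dS, n, dIpiv, dW, n,
                   dInfo);                                           // W = S^{-1} I_U
  cublasZgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, m, k, n, &one,
              dIL, m, dW, n, &zero, dC, m);                          // I_L (S^{-1} I_U)
  cudaMemcpyAsync(hC, dC, sizeof(cuDoubleComplex) * m * k,
                  cudaMemcpyDeviceToHost, s);                        // CPU accumulates
}
```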
Implementation 22 GPU acceleration Workload balance for multi-GPU: distribute I_U blocks by parent levels Tackles extreme cases with many duplicates Minor increase in H2D transfer
Implementation 23 GPU acceleration Workload balance for multi-GPU: panelized I_U; each I_U column panel should be large enough Multiple I_L copies sent to the GPUs Moderate increase in H2D transfer (see the sketch below)
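A hedged sketch of the panel-based work sharing above: I_U columns are grouped into panels that are dealt round-robin to the GPUs, so each GPU receives a full I_L copy but only its share of I_U panels. The panel width and structure names are assumptions.

```cpp
// Hedged sketch: deal I_U column panels to GPUs round-robin. Every GPU gets a
// copy of I_L (the moderate H2D increase), but only its own I_U panels.
#include <vector>

struct Panel { int col_begin, col_end; };  // half-open column range [begin, end)

std::vector<std::vector<Panel>> split_iu_panels(int total_cols, int panel_width,
                                                int num_gpus) {
  std::vector<std::vector<Panel>> per_gpu(num_gpus);
  int p = 0;
  for (int c = 0; c < total_cols; c += panel_width, ++p) {
    const int end = (c + panel_width < total_cols) ? c + panel_width : total_cols;
    per_gpu[p % num_gpus].push_back({c, end});  // round-robin panel assignment
  }
  return per_gpu;
}
```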
Implementation 24 GPU acceleration Without workload balance Finishing time > 325 seconds
Implementation 25 GPU acceleration With workload balance Finishing time < 250 seconds
Outline 26 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary
Numerical Results I 27 Hardware specifications
Server:   Brillante                                | P8Exp
CPU:      2x Intel E5-2670 v3 (12 + 12 cores used) | 2x IBM Power8 (8 + 8 cores used)
Memory:   256 GB                                   | 1 TB
GPU:      2x K40                                   | 4x K80
Software: Intel Parallel Studio 2016 update 1, Intel PARDISO, MUMPS 5.0.1, CUDA 7.5 | IBM XL Fortran and XL C Compiler, IBM ESSL and Parallel ESSL, CUDA 7.5
Numerical Results I 28 SOI dielectric waveguide Total grids: 79 × 319 × 39; matrix dimension 2,948,517 (three E-field components per grid point) Wavelength: 1.5 μm Grid size: 0.02 μm 100 GB RAM
Numerical Results I 29 Brillante: 2 K40 ZGETRS + ZGEMM: 439.3 seconds (90% of overall time)
Numerical Results I 30 Brillante: 2 K40 Naïve GPU acceleration yields good speedup due to high arithmetic intensity. Scatter time includes D2H transfer.
Numerical Results I 31 Brillante: 2 K40 Async streams apply to low-level separators, which finish in seconds even in CPU-only mode.
Numerical Results I 32 Brillante: 2 K40 Workload balance yields better speedup and multi-GPU scaling.
Numerical Results I 33 P8Exp: 4 K80 with autoboost Good performance scaling on the quad-K80 server Higher performance with half-K80 computing (one GPU used per K80 board) Two host threads compete for a single PCI-E link's bandwidth when using the full K80
Numerical Results I 34 P8Exp: 4 K80 with autoboost AccTRSMM: multi-GPU scaling Increased H2D transfer due to multiple I_L copies sent to work-sharing GPUs Scaling performance remains acceptable
Numerical Results I 35 Periodic air-hole wavelength filter No propagation at λ₀ = 1.5 μm Total grids: 79 × 575 × 47; matrix dimension 6,404,925 188 GB RAM
Brillante: 2 K40 Numerical Results I 36
Numerical Results I P8Exp: 4 K80 with autoboost 37
Numerical Results I 38 P8Exp: GPU scaling of AccTRSMM Far more dense-matrix operations Good scaling on multi-GPU systems
Outline 39 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary
P2P Matrix Sharing 40 Improved multi-GPU scaling with P2P transfer Past: multiple I_L copies sent to work-sharing GPUs; H2D transfer grows as more GPUs share the work Major bottleneck for multi-P100 acceleration No cuBLAS-XT: some matrix contents are already distributed across the GPUs S^{-1} broadcast
P2P Matrix Sharing 42 Improved multi-GPU scaling with P2P transfer I_L division; S^{-1} division; cudaMemcpyPeerAsync; threaded GPU control with busy-waiting; I_U is shared among blocks with identical S^{-1} Expectation: replace massive H2D with P2P for reduced H2D transmission Other improvements: asynchronous D2H transfer right after ZGEMM; S^{-1} D2H is counted in the AccTRSMM time in our P2P scheme (a hedged sketch follows)
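A minimal, hedged sketch of the P2P path described above: peer access is enabled per device pair, and a device-resident panel is broadcast with cudaMemcpyPeerAsync instead of repeated H2D copies. The buffer layout and names are assumptions.

```cpp
// Hedged sketch: broadcast a device-resident panel (e.g., S^{-1} I_U) from its
// owner GPU to the other work-sharing GPUs over P2P instead of repeated H2D.
#include <cuda_runtime.h>

void p2p_broadcast(void *const *dBuf,        // dBuf[g]: destination buffer on GPU g
                   size_t bytes, int owner, int num_gpus,
                   const cudaStream_t *streams) {
  for (int dev = 0; dev < num_gpus; ++dev) {
    if (dev == owner) continue;
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dev, owner);
    if (!can_access) continue;               // a real code path would fall back to H2D
    cudaSetDevice(dev);
    cudaDeviceEnablePeerAccess(owner, 0);    // ignorable error if already enabled
    cudaMemcpyPeerAsync(dBuf[dev], dev, dBuf[owner], owner, bytes, streams[dev]);
  }
}
```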
Outline 43 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary
Numerical Results II 44 IntelExp: 2x Intel E5-2640 v4 (20 physical cores), 8 Tesla P100 with 16 GB device memory, PCI-E switch enclosure, no NVLink DGX-1: 2x Intel E5-2698 v4 (40 physical cores), 8 NVLink-enabled Tesla P100
Numerical Results II 45 IntelExp PCI-E enclosure on one CPU (experimental build) Aggregate CPU-GPU bandwidth: 10~12 GB/s (unidirectional) GPU-GPU link bandwidth: 12.5 GB/s (unidirectional) (Topology figure: CPU0, CPU1; GPU0 through GPU7)
Numerical Results II 46 IntelExp: 4 GPUs Consistent PCI-E speed between GPUs at 12.5 GB/s Saturated CPU-GPU link
Numerical Results II 47 IntelExp: 8 GPUs Some GPU links slow down by half Heavy CPU-GPU congestion
Numerical Results II IntelExp: SOI waveguide simulation 48
Numerical Results II IntelExp: AccTRSMM Speedup (SOI waveguide) 49
Numerical Results II 50 GPU AccTRSMM in SOI waveguide case
Great scaling performance in computation
H2D and D2H transfer becomes the major scaling bottleneck
P2P sharing eliminates H2D growth in multi-GPU runs

         Total H2D (GB)    Total D2H (GB)    AccTRSMM time (s)   AccTRSMM scale
         No-P2P  W/P2P     No-P2P  W/P2P     No-P2P  W/P2P       No-P2P  W/P2P
1-GPU     207.8  207.8      170.9  170.9      121.3  146.1        1.00X  1.00X
2-GPU     341.1  207.8      170.9  170.9       85.4   89.6        1.42X  1.63X
4-GPU     531.2  207.8      170.9  170.9       87.1   67.1        1.39X  2.18X
8-GPU     805.5  207.8      170.9  170.9      109.3   58.4        1.11X  2.50X
Numerical Results II IntelExp: Periodic air hole wavelength filter 51
Numerical Results II IntelExp: AccTRSMM Speedup (Air hole filter) 52
Numerical Results II 53 GPU AccTRSMM in filter case
Great scaling performance in computation
H2D and D2H transfer becomes the major scaling bottleneck
P2P sharing eliminates H2D growth in multi-GPU runs

         Total H2D (GB)    Total D2H (GB)    AccTRSMM time (s)   AccTRSMM scale
         No-P2P  W/P2P     No-P2P  W/P2P     No-P2P  W/P2P       No-P2P  W/P2P
1-GPU     427.5  427.5      348.0  348.0      320.4  376.3        1.00X  1.00X
2-GPU     690.9  427.5      348.0  348.0      204.2  220.6        1.57X  1.71X
4-GPU    1144.2  427.5      348.0  348.0      195.5  158.8        1.64X  2.37X
8-GPU    1839.9  427.5      348.0  348.0      252.1  134.0        1.27X  2.81X
Numerical Results II 54 DGX-1 Doubled CPU-GPU bandwidth in multi-GPU computing; aggregate bandwidth: 24 GB/s (unidirectional) NVLink: up to 20 GB/s (unidirectional), over 18 GB/s observed in the profiler Source: NVIDIA
Numerical Results II 55 DGX-1: SOI waveguide simulation Strange CPU behavior with OpenMP under investigation
Numerical Results II DGX-1: AccTRSMM (SOI waveguide) 56
Numerical Results II 57 DGX-1 AccTRSMM in SOI waveguide case
Significant speedup from H2D and D2H (doubled CPU-GPU links)
NVLink further reduces sharing overheads
NVLink between CPU and GPU?

         AccTRSMM time (s)      AccTRSMM scale
         DGX-1    IntelExp      DGX-1    IntelExp
1-GPU    146.1    146.1         1.00X    1.00X
2-GPU     78.5     89.6         1.86X    1.63X
4-GPU     47.5     67.1         3.08X    2.18X
8-GPU     35.3     58.4         4.14X    2.50X

From 439.3 seconds (24 Haswell cores) to 35.3 seconds: over 12.4X speedup
Outline 58 Introduction Implementation Numerical Results I P2P Matrix Sharing Numerical Results II Summary
Summary 59 CHiS solver for 3D photonic simulation with multi-GPU acceleration FLOP, time, and memory savings: reduced CPU-GPU traffic Dense LA functions: ready for modern HPC architectures Sparse LA functions: SpMM, sparse LS solver Balanced multi-GPU acceleration with asynchronous data transfers and matrix computations P2P transfer: excellent computation scaling up to 8 GPUs Successfully harnesses high-density GPU-accelerated systems with fast CPU-GPU transfer MPI implementation in progress: fit the computation task unit onto the GPU; maintain resource saving and scheduling while exposing parallelization
Acknowledgement 60 IBM Research, NVIDIA Taiwan, NVAITC Program Thank you!