Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation


1 Cheng-Han Du*, I-Hsin Chung**, Weichung Wang*
* Institute of Applied Mathematical Sciences, National Taiwan University, Taipei, Taiwan
** IBM T. J. Watson Research Center, NY, US

2 Outline
- Introduction
- Implementation
- Numerical Results I
- P2P Matrix Sharing
- Numerical Results II
- Summary

3 Introduction
- Photonics
  - Waveguides
  - Resonant cavities
  - Frequency filters
  - Plasmonic devices
- Design concerns
  - Structural characteristics
  - Parameter refinement
  - Experiment data
(Figure refs: Sun et al., Nature 528, 2015; Ivinskaya & Lavrinenko, 2011)


4 Introduction - Why Multi-GPU Scaling
- Global supercomputing trend
- High energy efficiency
- Growing popularity in deep learning applications
- Integration of high-performance numerical simulation and deep learning
(Figure sources: ORNL, NVIDIA)


5 Introduction
- Machine-Learning-Derived Behavior Model and Intelligent Design
- Photonic Integrated Circuit Design
- Broadband Spectral Analysis
- Nonlinear Equations with Multiphysics Features
- Photonic Crystal Analyzer
- Shift-Inverse Eigensolver
- Preconditioner and Algorithm for Iterative Side-Equation Solver
- Parallel Direct FDFD Solver Kernel

6 Introduction
- Machine-Learning-Derived Behavior Model
- Photonic Integrated Circuit Design
- Broadband Spectral Analysis
- Nonlinear Equations with Multiphysics Features
- Photonic Crystal Analyzer
- Shift-Inverse Eigensolver
- Preconditioner and Algorithm for Iterative Side-Equation Solver
- When iterative solver fails
- Parallel Direct FDFD Solver Kernel

7 Introduction - Objectives
- Fast generation of numerical data for different parameters
- Data-driven intelligent design of optical components
- Explicit and fast acquisition of quantitative characteristics
- Reduction of postprocessing and data storage/transfer requirements
- Parallel Direct FDFD (Finite-Difference Frequency-Domain) Solver Kernel


9 Implementation - FDFD Problem
- Linear system: $-\nabla \times \nabla \times \mathbf{E} + k_0^2 \varepsilon_r \mathbf{E} = c\,\mathbf{J}$
- Direct solver for robust solution
- Yee's mesh
- Perfectly-matched layer
- High-frequency problem
- Challenge: heavy factorization loads
- Parallel Direct FDFD Solver Kernel
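To make the shape of such a system concrete, the following is a minimal sketch of a one-dimensional scalar analogue, d2E/dx2 + k0^2 * eps_r * E = J, discretized with central differences and solved directly by tridiagonal elimination. It is not the authors' solver: the cell count, the lossy permittivity, the unit point source, and all function names are assumptions chosen for illustration, and the sketch omits the Yee mesh and perfectly-matched layer of the real 3D problem.

```python
import math

def assemble_1d_operator(n, h, k0, eps_r):
    """Return (diagonal, off_diagonal) of the central-difference operator
    for d2E/dx2 + k0^2 * eps_r * E, with E = 0 assumed outside the grid."""
    off = 1.0 / (h * h)
    diag = [-2.0 * off + (k0 ** 2) * eps_r[i] for i in range(n)]
    return diag, off

def solve_tridiagonal(diag, off, rhs):
    """Direct solve: forward elimination, then back substitution (no pivoting)."""
    n = len(diag)
    d = list(diag)
    b = list(rhs)
    for i in range(1, n):
        w = off / d[i - 1]
        d[i] -= w * off
        b[i] -= w * b[i - 1]
    x = [0j] * n
    x[n - 1] = b[n - 1] / d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - off * x[i + 1]) / d[i]
    return x

def max_residual(diag, off, x, rhs):
    """Largest absolute entry of A x - rhs, used to check the direct solve."""
    n = len(diag)
    worst = 0.0
    for i in range(n):
        r = diag[i] * x[i] - rhs[i]
        if i > 0:
            r += off * x[i - 1]
        if i + 1 != n:
            r += off * x[i + 1]
        worst = max(worst, abs(r))
    return worst

wavelength = 1.5            # micrometres, as on the later result slides
h = 0.02                    # grid size in micrometres
n = 50                      # assumed number of cells for this toy problem
k0 = 2.0 * math.pi / wavelength
eps_r = [1.0 + 0.1j] * n    # assumed lossy medium, keeps the system non-singular
rhs = [0j] * n
rhs[n // 2] = 1.0 + 0j      # assumed unit point source in the middle

diag, off = assemble_1d_operator(n, h, k0, eps_r)
field = solve_tridiagonal(diag, off, rhs)
print("max residual:", max_residual(diag, off, field, rhs))
```

In three dimensions the same operator is no longer tridiagonal, which is why the factorization load becomes the main challenge.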


10 Implementation - Compressed hierarchical Schur method (CHiS)
- Domain decomposition, multi-level algorithm
- 3D nested dissection of Yee's mesh ($N_x \times N_y \times N_z$)
- Ideal periodic structure
  - $D_1 = D_2 = D_3 = \dots = D_{16}$
  - $S_{1,1} = S_{1,2} = S_{1,3} = \dots = S_{1,8}$
  - $S_{2,1} = S_{2,2} = S_{2,3} = S_{2,4}$
  - $S_{3,1} = S_{3,2}$
  - $S_{4,1}$
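The saving implied by these equalities can be counted with a few lines of Python. This is only a sketch under the slide's ideal assumption (a binary elimination tree over 16 identical subdomains in which all blocks on one level coincide); the function name is hypothetical.

```python
def blocks_per_level(num_leaves):
    """Number of diagonal blocks on each level of a binary elimination tree.

    Assumes num_leaves is a power of two, as in the 16-subdomain example.
    """
    counts = [num_leaves]
    while counts[-1] > 1:
        counts.append(counts[-1] // 2)
    return counts

counts = blocks_per_level(16)      # D level, then S_1 ... S_4
total_blocks = sum(counts)         # blocks a generic solver would factorize
unique_blocks = len(counts)        # one representative per level if all are equal
print(counts, total_blocks, unique_blocks)   # [16, 8, 4, 2, 1] 31 5
```

Under that ideal periodicity, 5 factorizations stand in for 31. Real structures are only partly periodic, so the actual saving lies somewhere between the two.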

11 Implementation - Compressed hierarchical Schur method
- Elimination tree deduplication
  - Diagonals
  - Interfaces to children ($I_U$, $I_L$)

12 Implementation - Compressed hierarchical Schur method
- Elimination tree deduplication
  - Diagonals
  - Interfaces to children
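For reference, the standard block elimination behind these labels can be written as below. The 2-by-2 partition and the symbols $D$ (a child diagonal block), $S$ (its parent separator block), and $\mathrm{Id}$ (identity) are notation assumed here for illustration; the slides themselves only name the interface blocks $I_U$ and $I_L$.

```latex
\begin{bmatrix} D & I_U \\ I_L & S \end{bmatrix}
=
\begin{bmatrix} \mathrm{Id} & 0 \\ I_L D^{-1} & \mathrm{Id} \end{bmatrix}
\begin{bmatrix} D & I_U \\ 0 & S - I_L D^{-1} I_U \end{bmatrix}
```

The lower-right block $S - I_L D^{-1} I_U$ is the Schur complement. When two children have identical diagonal and interface blocks, the product $I_L D^{-1} I_U$ only needs to be formed once.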

13 Implementation - Compressed hierarchical Schur method
- Leaf-level Interface Compression (LIC)
  - Use one updating submatrix over multiple Schur complement submatrices with row/column permutations.
  - The less sparse-matrix computing, the less CPU-centric load.

14 Implementation - Compressed hierarchical Schur method
- Expose larger chunks of matrix computation
- Major function calls and libraries
  - Subdomains
    - Sparse diagonal: sparse factorize
    - Sparse interface: sparse LS solve and matrix multiply
    - Libraries: (Option 1) PARDISO, Sparse BLAS; (Option 2) MUMPS
  - Separators
    - Dense diagonal: dense LU
    - Packed dense interface: dense LS solve and matrix multiply
    - Libraries: BLAS (ZGEMM) and LAPACK (ZGETRF, ZGETRS)
    - Hardware acceleration (GPU: cuBLAS, cuSOLVER, etc.)

15 Implementation - GPU acceleration
- Considerations
  - Multi-GPU scaling in single node (scale-up)
  - No longer solely based on nested dissection
  - Asynchronous streams for small submatrices
  - Overlapping some computation kernels
  - Hardware scheduling
  - Threaded GPU controls
  - Thread affinity

16 Implementation - GPU acceleration
- Factorize all diagonal blocks $S_{i,j}$ related to level $i$ (CPU or GPU work).

17 Implementation - GPU acceleration
- Asynchronously send some blocks to GPU and perform $S_{i,j}^{-1} I_U$.

18 Implementation - GPU acceleration
- Continue to ZGEMM, no D2H data transmission.
- $S_{i,j}^{-1} I_U$ is kept in GPU for the later $I_L S_{i,j}^{-1} I_U$ operation.
- Workspace is simply discarded if no longer needed.

19 Implementation - GPU acceleration
- Asynchronously perform ZGEMM $I_L (S_{i,j}^{-1} I_U)$.

20 Implementation - GPU acceleration
- Collect $I_L (S_{i,j}^{-1} I_U)$ from all GPUs and perform higher-level Schur update by CPU.

21 Implementation - GPU acceleration
- Continue more ZGEMM $I_L (S_{i,j}^{-1} I_U)$ related to $(S_{i,j}^{-1} I_U)$ and Schur updates.
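The sequence on slides 16 to 21 (factorize, solve, multiply, update) can be traced on a tiny dense example. The sketch below is a CPU-only stand-in written in plain Python: `lu_factor`, `lu_solve`, and `matmul` play the roles that ZGETRF, ZGETRS, and ZGEMM play in the slides, and `schur_update`, the matrices, and all names are hypothetical illustrations rather than the authors' code.

```python
def lu_factor(a):
    """LU factorization with partial pivoting (the role of ZGETRF)."""
    n = len(a)
    lu = [row[:] for row in a]
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if abs(lu[p][k]) == 0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        for r in range(k + 1, n):
            lu[r][k] = lu[r][k] / lu[k][k]
            for c in range(k + 1, n):
                lu[r][c] -= lu[r][k] * lu[k][c]
    return lu, piv

def lu_solve(lu, piv, b):
    """Solve A X = B for a matrix right-hand side (the role of ZGETRS)."""
    n, m = len(lu), len(b[0])
    x = [b[piv[i]][:] for i in range(n)]       # apply the row permutation
    for i in range(n):                         # forward substitution, unit lower
        for j in range(i):
            for c in range(m):
                x[i][c] -= lu[i][j] * x[j][c]
    for i in reversed(range(n)):               # back substitution
        for j in range(i + 1, n):
            for c in range(m):
                x[i][c] -= lu[i][j] * x[j][c]
        for c in range(m):
            x[i][c] = x[i][c] / lu[i][i]
    return x

def matmul(a, b):
    """Dense matrix product (the role of ZGEMM)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def schur_update(parent, i_l, s_child, i_u):
    """parent - I_L (S^{-1} I_U), following the order of the slides."""
    lu, piv = lu_factor(s_child)               # slide 16: factorize S
    x = lu_solve(lu, piv, i_u)                 # slide 17: X = S^{-1} I_U
    contrib = matmul(i_l, x)                   # slide 19: I_L X, X reused in place
    return [[parent[i][j] - contrib[i][j]      # slide 20: higher-level update
             for j in range(len(parent[0]))] for i in range(len(parent))]

s_child = [[2 + 0j, 1 + 0j], [1 + 0j, 3 + 0j]]
i_u = [[1 + 0j], [0j]]
i_l = [[0j, 5 + 0j]]
parent = [[10 + 1j]]
updated = schur_update(parent, i_l, s_child, i_u)
print(updated)
```

In the GPU version described on the slides, `x` would stay in device memory between the solve and the multiply, which is what removes the intermediate D2H transfer.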

22 Implementation - GPU acceleration
- Workload balance for multi-GPU
  - Distribute $I_U$ blocks by parent levels
  - Tackle extreme cases with lots of duplicates
  - Minor increase in H2D transfer

23 Implementation - GPU acceleration
- Workload balance for multi-GPU
  - Panel $I_U$
  - Each $I_U$ column should be large enough
  - Multiple $I_L$ copies sent to GPUs
  - Moderate increase in H2D transfer
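The slides do not spell out the balancing rule, so the sketch below swaps in a generic technique: split a wide $I_U$ block into column panels that stay above a minimum width, then assign tasks to GPUs with a greedy longest-processing-time heuristic. The cost numbers, the minimum panel width, and both function names are assumptions for illustration only.

```python
import heapq

def split_into_panels(num_cols, min_panel_cols, num_gpus):
    """Split a block's columns into at most num_gpus panels,
    each at least min_panel_cols wide when the block allows it."""
    panels = max(1, min(num_gpus, num_cols // min_panel_cols))
    base, extra = divmod(num_cols, panels)
    return [base + (1 if p < extra else 0) for p in range(panels)]

def balance(costs, num_gpus):
    """Greedy assignment: take tasks from most to least expensive and
    give each one to the GPU with the smallest accumulated load."""
    heap = [(0.0, g) for g in range(num_gpus)]
    heapq.heapify(heap)
    assignment = {g: [] for g in range(num_gpus)}
    for task, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, g = heapq.heappop(heap)
        assignment[g].append(task)
        heapq.heappush(heap, (load + cost, g))
    loads = {g: sum(costs[t] for t in assignment[g]) for g in assignment}
    return assignment, loads

# Hypothetical relative costs of five I_U blocks.
costs = {"A": 8, "B": 7, "C": 6, "D": 5, "E": 4}
assignment, loads = balance(costs, num_gpus=2)
panels = split_into_panels(num_cols=1000, min_panel_cols=300, num_gpus=4)
print(assignment, loads, panels)
```

The minimum panel width reflects the slide's point that each $I_U$ column panel should be large enough to keep a GPU busy; the price is that every GPU working on a panel needs its own copy of $I_L$.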

24 Implementation - GPU acceleration
- Without workload balance: finishing time > 325 seconds

25 Implementation - GPU acceleration
- With workload balance: finishing time < 250 seconds


27 Numerical Results I - Hardware specifications
- Brillante
  - CPU: 2 × Intel E… v…, … cores used
  - Memory: 256 GB
  - GPU: 2 × K40
  - Software: Intel Parallel Studio 2016 update 1, Intel PARDISO, CUDA 7.5
- P8Exp
  - CPU: 2 × IBM Power…, … cores used
  - Memory: 1 TB
  - GPU: 4 × K80
  - Software: IBM ESSL and Parallel ESSL, IBM XL Fortran and XL C Compiler, MUMPS, CUDA 7.5

28 Numerical Results I - SOI dielectric waveguide
- Total grids: …; 2,948,517 in matrix dimension
- Wavelength: 1.5 μm
- Grid size: 0.02 μm
- 100 GB RAM
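Two pieces of arithmetic follow from these figures. The sketch below assumes, as is usual for a Yee discretization but not stated on the slide, that each grid cell carries three electric-field unknowns; the function names are hypothetical.

```python
def cells_from_matrix_dimension(matrix_dim, components_per_cell=3):
    """Number of grid cells implied by a matrix dimension, assuming a fixed
    number of field unknowns per cell (3 is an assumption, not from the slide)."""
    cells, remainder = divmod(matrix_dim, components_per_cell)
    if remainder:
        raise ValueError("matrix dimension is not a multiple of components_per_cell")
    return cells

def points_per_wavelength(wavelength_um, grid_um):
    """Grid points per free-space wavelength."""
    return wavelength_um / grid_um

soi_cells = cells_from_matrix_dimension(2948517)
soi_resolution = points_per_wavelength(1.5, 0.02)
print(soi_cells, round(soi_resolution))   # 982839 75
```

The matrix dimension divides evenly by three, which is consistent with the assumption. At roughly 75 points per free-space wavelength, the grid is fine enough that even a structure a few wavelengths across produces millions of unknowns.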

29 Numerical Results I - Brillante: 2 × K40
- ZGETRS + ZGEMM: … seconds (90% of overall time)

30 Numerical Results I - Brillante: 2 × K40
- Naïve GPU acceleration yields good speedup due to high arithmetic intensity (AI).
- Scatter time includes D2H transfer.

31 Numerical Results I - Brillante: 2 × K40
- Async streams apply to low-level separators, which are finished in seconds even in CPU-only mode.

32 Numerical Results I - Brillante: 2 × K40
- Workload balance yields better speedup and multi-GPU scaling.

33 Numerical Results I - P8Exp: 4 × K80 with autoboost
- Good performance scaling in quad-K80 server
- Higher performance with half-K80 computing
  - Two threads compete for a single PCI-E link's bandwidth when using the full K80

34 Numerical Results I - P8Exp: 4 × K80 with autoboost
- AccTRSMM: multi-GPU scaling
  - Increased H2D transfer due to multiple $I_L$ copies to work-sharing GPUs
  - We still get acceptable scaling performance

35 Numerical Results I - Periodic air hole wavelength filter
- No propagation at $\lambda_0 = 1.5$ μm
- Total grids: …; 6,404,925 in matrix dimension
- 188 GB RAM

36 Numerical Results I - Brillante: 2 × K40

37 Numerical Results I - P8Exp: 4 × K80 with autoboost

38 Numerical Results I - P8Exp: GPU scaling of AccTRSMM
- Much more dense matrix operations
- Good scaling in multi-GPU systems



40 P2P Matrix Sharing
- Improved multi-GPU scaling with P2P transfer
  - Past: multiple $I_L$ copies sent to work-sharing GPUs
    - Growing H2D transfer with increasing GPU sharing
    - Major bottleneck for multi-P100 acceleration
  - No cuBLAS-XT: some matrix contents are already distributed in GPUs
  - $S^{-1}$ broadcast

42 P2P Matrix Sharing
- Improved multi-GPU scaling with P2P transfer
  - $I_L$ division
    - cudaMemcpyPeerAsync
    - Threaded GPU control with busy-waiting
  - $S^{-1}$ division
    - $I_U$ is shared with identical $S^{-1}$
- Expectation
  - Replace massive H2D with P2P
  - Reduced H2D transmission
- Other improvements
  - Asynchronous D2H transfer right after ZGEMM
  - $S^{-1}$ D2H will be counted in AccTRSMM time in our P2P scheme
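A simple traffic model shows why dividing $I_L$ helps. The sketch below assumes one $I_L$ block of a given size shared by all GPUs: without P2P every GPU receives a full copy from the host, and with P2P each GPU uploads only its own slice and fetches the remaining slices from its peers. The model and the function names are illustrative assumptions and ignore link topology and contention.

```python
def h2d_traffic_gb(block_gb, num_gpus, use_p2p):
    """Host-to-device volume needed to make one block visible on all GPUs."""
    if use_p2p:
        # Each GPU uploads a 1/num_gpus slice, so the host sends the block once.
        return block_gb
    # Every GPU receives its own full copy from the host.
    return block_gb * num_gpus

def p2p_traffic_gb(block_gb, num_gpus, use_p2p):
    """GPU-to-GPU volume needed to complete the block on every GPU."""
    if not use_p2p:
        return 0.0
    # Each GPU fetches the (num_gpus - 1) slices it did not upload itself.
    return block_gb * (num_gpus - 1)

for gpus in (1, 2, 4, 8):
    print(gpus,
          h2d_traffic_gb(2.0, gpus, use_p2p=False),
          h2d_traffic_gb(2.0, gpus, use_p2p=True),
          p2p_traffic_gb(2.0, gpus, use_p2p=True))
```

In this model the H2D volume stays flat as GPUs are added, while the growth moves to the GPU-to-GPU links, which matches the expectation stated on the slide.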


44 Numerical Results II
- IntelExp
  - 2 × Intel E… v4 (20 physical cores)
  - 8 × Tesla P100 with 16 GB device memory
  - PCI-E switch enclosure
  - No NVLink
- DGX-1
  - 2 × Intel E… v4 (40 physical cores)
  - 8 × NVLink-enabled Tesla P100

45 Numerical Results II - IntelExp
- PCI-E enclosure on one CPU (experimental build)
- Aggregate CPU-GPU bandwidth: 10~12 GB/s (unidirectional)
- GPU-GPU link bandwidth: 12.5 GB/s (unidirectional)
(Diagram: CPU0, CPU1, GPU0 to GPU7)

46 Numerical Results II - IntelExp: 4 GPUs
- Consistent PCI-E speed between GPUs at 12.5 GB/s
- Saturated CPU-GPU link

47 Numerical Results II - IntelExp: 8 GPUs
- Some GPU links slow down by half
- Heavy congestion between CPU and GPU

48 Numerical Results II - IntelExp: SOI waveguide simulation

49 Numerical Results II - IntelExp: AccTRSMM speedup (SOI waveguide)


50 Numerical Results II - GPU AccTRSMM in SOI waveguide case
- Great scaling performance in computing
- H2D and D2H transfers become the major scaling bottleneck
- P2P sharing eliminates H2D growth in multi-GPU

        Total H2D (GB)    Total D2H (GB)    AccTRSMM time (seconds)    AccTRSMM scale
        No-P2P  W/P2P     No-P2P  W/P2P     No-P2P  W/P2P              No-P2P  W/P2P
1-GPU   …       …         …       …         …       …                  …X      1.00X
2-GPU   …       …         …       …         …       …                  …X      1.63X
4-GPU   …       …         …       …         …       …                  …X      2.18X
8-GPU   …       …         …       …         …       …                  …X      2.50X

51 Numerical Results II - IntelExp: Periodic air hole wavelength filter

52 Numerical Results II - IntelExp: AccTRSMM speedup (air hole filter)

53 Numerical Results II - GPU AccTRSMM in filter case
- Great scaling performance in computing
- H2D and D2H transfers become the major scaling bottleneck
- P2P sharing eliminates H2D growth in multi-GPU

        Total H2D (GB)    Total D2H (GB)    AccTRSMM time (seconds)    AccTRSMM scale
        No-P2P  W/P2P     No-P2P  W/P2P     No-P2P  W/P2P              No-P2P  W/P2P
1-GPU   …       …         …       …         …       …                  …X      1.00X
2-GPU   …       …         …       …         …       …                  …X      1.71X
4-GPU   …       …         …       …         …       …                  …X      2.37X
8-GPU   …       …         …       …         …       …                  …X      2.81X
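The scale figures that survive in the SOI waveguide and filter tables can be turned into parallel efficiency (speedup divided by GPU count), which makes the transfer bottleneck easier to see. This is a small sketch: the dictionary layout and function name are assumptions, and the numbers are exactly the AccTRSMM scale values quoted on the slides.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved."""
    return speedup / num_gpus

# AccTRSMM scale values as quoted on the slides (IntelExp).
reported_scale = {
    "SOI waveguide": {1: 1.00, 2: 1.63, 4: 2.18, 8: 2.50},
    "air hole filter": {1: 1.00, 2: 1.71, 4: 2.37, 8: 2.81},
}

efficiency = {
    case: {gpus: parallel_efficiency(scale, gpus) for gpus, scale in rows.items()}
    for case, rows in reported_scale.items()
}

for case, rows in efficiency.items():
    for gpus, value in rows.items():
        print(f"{case:16s} {gpus}-GPU efficiency: {value:.2f}")
```

Efficiency falls from about 0.8 at two GPUs to roughly a third at eight, in line with the slides' statement that data transfer, not computation, limits the scaling on this PCI-E system.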

54 Numerical Results II - DGX-1
- Doubled CPU-GPU bandwidth in multi-GPU computing
  - Aggregate bandwidth: 24 GB/s (unidirectional)
- NVLink
  - Up to 20 GB/s (unidirectional)
  - Over 18 GB/s in profiler
(Figure source: NVIDIA)

55 Numerical Results II - DGX-1: SOI waveguide simulation
- Strange CPU behavior with OpenMP under investigation

56 Numerical Results II - DGX-1: AccTRSMM (SOI waveguide)

57 Numerical Results II - DGX-1: AccTRSMM in SOI waveguide case
- Significant speedup from H2D and D2H (double CPU-GPU links)
- NVLink further reduces sharing overheads
- NVLink between CPU-GPU?

        AccTRSMM time (seconds)    AccTRSMM scale
        DGX1     IntelExp          DGX1     IntelExp
1-GPU   …        …                 …X       1.00X
2-GPU   …        …                 …X       1.63X
4-GPU   …        …                 …X       2.18X
8-GPU   …        …                 …X       2.50X

- From … seconds (24 Haswell cores) to 35.3 seconds: over 12.4X speedup


59 Summary
- CHiS solver for 3D photonic simulation with multi-GPU
  - FLOP, time, and memory saving: CPU-GPU traffic reduced
  - Dense LA functions: ready for modern HPC architecture
  - Sparse LA functions: SpMM, sparse LS solver
- Balanced multi-GPU acceleration with asynchronous data transfer and matrix computations
  - P2P transfer: great computation scaling up to 8 GPUs
- Successfully harnessing high-density GPU-accelerated systems
  - Fast transfer between CPU and GPU
- MPI implementation in progress
  - Fit computation task unit into GPU
  - Maintain resource saving and scheduling while exposing parallelization

60 Acknowledgement
- IBM Research
- NVIDIA Taiwan
- NVAITC Program
Thank you!
