Speedup Altair RADIOSS Solvers Using NVIDIA GPU

Size: px

Start display at page:

Download "Speedup Altair RADIOSS Solvers Using NVIDIA GPU"

Emil Fox
6 years ago
Views:

1 Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012

2 Innovation Intelligence ALTAIR OVERVIEW

3 Altair s Vision Simulation, predictive analytics and optimization leveraging high performance computing for engineering and business decision making 3

Altair Engineering 25+ 40+ 1500+ Years of

4 Altair Engineering Years of Innovation Offices in 16 Countries Employees Worldwide 4

5 Altair s Brands and Companies Engineering Simulation Platform On-demand Cloud Computing Technology Product Innovation Consulting Industrial Design Technology Business Intelligence & Data Analytics Solutions Solid State Lighting Products 5

6 HyperWorks for Analysis and Optimization Pre-Post Statics NVH Non-Linear (Implicit) Non-Linear (Explicit) Thermal and CFD Multi-Body Dynamics 1-D Systems, Fatigue, Ergonomics, Industrial Design, Injection Molding, Noise and Vibration (NVH), Composite Materials, Electromagnetics RADIOSS AcuSolve MotionSolve Partner Alliance Solutions Optimization OptiStruct and HyperStudy Data and Process Management 6

7 Simulate Real-life Models with HyperWorks Solvers 7

AcuSolve Already Benefits from GPU Accelerator High performance computational fluid dynamics software (CFD) The leading finite element based CFD technology High accuracy and scalability on massively

8 AcuSolve Already Benefits from GPU Accelerator High performance computational fluid dynamics software (CFD) The leading finite element based CFD technology High accuracy and scalability on massively parallel architectures S-duct 80K Degrees of Freedom Hybrid MPI/OpenMP for Multi-GPU test Xeon Nehalem CPU Nehalem CPU + Tesla GPU core CPU core + 1 GPU 4 core CPU 165 2X* 3.3X* 2 core + 2 GPU Lower is better *Performance gain versus 4 core CPU

9 Innovation Intelligence RADIOSS PORTING ON GPU

10 Motivations to use GPU Cost effective solution Power efficient solution How much of the peak can we get? Which part of the code is best suited? Which coding effort is required? What is the speedup for my application? Gflops ,74 0,74 1 Node Intel X Nodes Intel X , Node+1 Nvidia M ,5 2,1 1 Node+2 Nvidia M2090 Gflop Peak (DP) Gflop cost in $ Gflops per Watt 60 $

11 RADIOSS Porting on Nvidia GPU Assess the potential of GPU for RADIOSS Focus on Implicit Direct Solver Highly optimized compute intensive solver Limited scalability on multicores and cluster Iterative Solver Ability to solve huge problems with low memory requirement Efficient parallelization High cost on CPU due to large number of iterations for convergence Double precision required Integrate GPU parallelization into our Hybrid parallelization technology 11

12 Innovation Intelligence RADIOSS DIRECT SOLVER PORTING

13 Multifrontal Sparse Direct Solver do j = 1, N assemble(a(j)) factor(j) update(j) end Non-pivoting: Cholesky Pivoting: LDLT 13

14 Concerns with GPU Acceleration CUBLAS (DGEMM) perfect candidate to speed up update module Frontal matrix could be too huge to fit in GPU memory Frontal matrix could be too small and thus inefficient w/ GPU Data transfer is not trivial Only the lower triangular matrix is interesting Pivoting is required in real applications Pivot searching is a sequential operation Factor module has limited parallel potential It could be as expensive as update module in extreme case 14

15 CUBLAS speedup Update module Improve profile of Update module Base BCSLIB4.3 Asynchronous computing S1 S2 S3 Overlap the computation Overlap the communication T U = LDL 15

Numerical Test : Non-Pivoting Case Linear static - Non-pivoting case Benchmark 2,8 Millions of Degrees of Freedom Platform Intel Xeon X5550, 4 Core, 48GB RAM Nvidia C2070, CUDA 4.0 MKL 10.3, RHEL 5.

16 Numerical Test : Non-Pivoting Case Linear static - Non-pivoting case Benchmark 2,8 Millions of Degrees of Freedom Platform Intel Xeon X5550, 4 Core, 48GB RAM Nvidia C2070, CUDA 4.0 MKL 10.3, RHEL 5.1 GPU speedup - non-pivoting case solver elapsed 1,2 1 base(bcslib-ext) improved elapsed (s) Lower is better 3.9X 4 CPU 4 CPU + 1GPU 2.9X profile(%) 0,8 0,6 0,4 0,2 0 update total 16

17 Numerical Test: Pivoting Case Nonlinear static - Pivoting case Customer model Engine Block Benchmark Platform 2,5 Millions of Degrees of Freedom Intel Xeon X5550, 4 Core, 48GB RAM Nvidia C2070, CUDA 4.0 MKL 10.3, RHEL 5.1 GPU speedup - pivoting case elapsed (s) solver Lower is better elapsed profile(%) 1,2 1 0,8 0,6 0,4 base(bcslib-ext) improved ,2 2.8X 2.5X CPU 4 CPU + 1GPU update total 17

18 Challenging Work 1,2 On models update is not that 1 base(bcslib-ext) improved dominated E.g. engine block with 1 st order element profile(%) 0,8 0,6 0,4 0,2 In-core / Quasi in-core is preferred A memory threshold to get reasonable good speedup will be provided to the user Make sure the essential computation is in core Elapsed (s) update solver total 1.5 Millions of DOFs with Pivoting Lower is better elapsed 2.1X 1.8X 0 4 CPU 4 CPU + 1GPU 18

19 Summary & Perspectives GPU and CUDA enhance the performance of direct solver significantly CUBLAS on Fermi card is so fast! Asynchronous computing is the key component Improving the profile is a necessary procedure Amdahl's law Application Area Nonlinear analysis robust and accurate solution Optimization matrix factorization reused in sensitivity analysis Future works Other dynamic solvers Block Lanczos Eigen value solver Altair AMSES (automated multilevel-substructure Eigen value solver) 19

20 Innovation Intelligence RADIOSS ITERATIVE SOLVER PORTING

21 Preconditioned Conjugate Gradient Method Problem Solve linear equation Ax b = 0 with A symmetric positive-definite Solution Preconditioned Conjugate Gradient (PCG) solves iteratively: M -1. (Ax b) = 0 Method quickly converges Efficiency depends on M Other advantages Low memory consumption Efficient parallelization: SMP, MPI 21

22 PCG Porting on GPU using Cuda Pretty simple original Fortran code Few kernels to write in Cuda Sparse Matrix Vector Focus optimization effort on this kernel Use of CUBLAS DAXPY DDOT r 0 = b Ax 0 z 0 = M -1 r 0 p 0 = z 0 k = 0 DO WHILE NOT DONE alpha k = r kt. z k / p kt. A. p k x k+1 = x k + alpha k p k r k+1 = r k alpha k A. p k IF(r k+1 < precision) THEN END IF DONE = TRUE z k+1 = M -1. r k+1 beta k = z k+1t. r k+1 / z k t. r k p k+1 = z k+1 + beta k p k k = k+1 END DO result = x k+1 22

23 Hybrid MPP Computing with CUDA and MPI Hybrid MPP version of RADIOSS 2 parallelization levels MPI parallelization based on domain decomposition Multi-threaded MPI processes under OpenMP Enhanced performance Higher scalability Better efficiency RADIOSS 11 Hybrid MPP Speedup Extend Hybrid to multi GPUs programming Integrate GPU parallelization into our hybrid programming model MPI to manage communication between GPUs Portions of code not GPU enable benefit from OpenMP parallelization Programming model expendable to multi nodes with multi GPUs 23

PCG Multi GPUs Parallelization Domain decomposition to split original problem Optimal load balancing MPI Communications between domains

24 PCG Multi GPUs Parallelization Domain decomposition to split original problem Optimal load balancing MPI Communications between domains Controlled by the CPU One GPU associated to one MPI OpenMP Multithreading Speedup calculations not performed under GPU Leverage the available CPU cores 24

25 Benchmark #1 400 Linear Problem #1 Hood of a car with pressure loads SMP 6-core Hybrid 2 MPI x 6 SMP SMP GPU Compute displacements and stresses Benchmark 0,9 Millions of Degrees of Freedom Elapsed(s) Hybrid 2 MPI x 6 SMP + 2 GPUs Platform 24 Millions of non zero Shells Solids RBE iterations Nvidia PSG Cluster 2 nodes with: X* 1 node - 2 Nividia M X* Lower is better Dual Nvidia M2090 GPUs Cuda v3.2 Intel Westmere 2x6 X5670@2,93Ghz Linux RHEL 5.4 with Intel MPI 4.0 Elapsed(s) Hybrid 4 MPI x 6 SMP Hybrid 4 MPI x 6 SMP + 4 GPUs 38 Performance Elapsed time decreased by up to 9X X* 0 2 nodes - 4 Nvidia M2090 *Performance gain versus SMP 6-core 25

26 Benchmark #2 Linear Problem #2 Hood of a car with pressure loads Refined model Compute displacements and stresses Benchmark 2,2 Millions of Degrees of Freedom 62 Millions of non zero Shells Solids RBE iterations Platform Nvidia PSG Cluster 2 nodes with: Dual Nvidia M2090 GPUs Cuda v3.2 Intel Westmere 2x6 X5670@2,93Ghz Elapsed(s) Elapsed(s) SMP 6-core Hybrid 2 MPI x 6 SMP SMP GPU Hybrid 2 MPI x 6 SMP + 2 GPUs X* 7.5X* 1 node - 2 Nividia M Hybrid 4 MPI x 6 SMP Hybrid 4 MPI x 6 SMP + 4 GPUs Lower is better Performance Linux RHEL 5.4 with Intel MPI 4.0 Elapsed time decreased by up to 13X X* 2 nodes - 4 Nvidia M2090 *Performance gain versus SMP 6-core 26

27 Benchmark #3 Gravity Problem with contacts Full car model Compute displacements and stresses Benchmark 0,85 Millions of Degrees of Freedom 21 Millions of non zero Shells Solids D elements Rigid bodies + 6 Rigid walls 1 Gravity load and 22 Contact interfaces ~ 9000 iterations Elapsed(s) SMP 6-core 1223 Hybrid 2 MPI x 6 SMP SMP GPU Hybrid 2 MPI x 6 SMP + 2 GPUs node - 2 Nividia M X* 6.4X* Lower is better Platform Nvidia PSG Cluster 2 nodes with: Dual Nvidia M2090 GPUs Cuda v3.2 Elapsed(s) Hybrid 4 MPI x 6 SMP Hybrid 4 MPI x 6 SMP + 4 GPUs Intel Westmere 2x6 X5670@2,93Ghz Performance Linux RHEL 5.4 with Intel MPI 4.0 Elapsed time decreased by up to 9X X* 2 nodes - 4 Nvidia M2090 *Performance gain versus SMP 6-core 27

28 Innovation Intelligence CONCLUSION

29 Conclusion RADIOSS implicit direct & iterative solvers have been successfully ported on Nvidia GPU using Cuda Adding GPU improves significantly the performance of these solvers For iterative solver, Hybrid MPP allows to run on multi GPU card workstation and GPU cluster with good scalability and enhanced performance GPU support for implicit solvers is planed for HyperWorks 12 A big thanks to the Nvidia team for their great support: Stan Posey, Steven Rennich, Thomas Reed, Peng Wang! 29

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,