Speeding Up Altair RADIOSS Solvers Using NVIDIA GPUs
Eric Lequiniou, HPC Director
Hongwei Zhou, Senior Software Developer
May 16, 2012
ALTAIR OVERVIEW
Altair's Vision
Simulation, predictive analytics and optimization, leveraging high-performance computing for engineering and business decision making
Altair Engineering
- 25+ years of innovation
- 40+ offices in 16 countries
- 1,500+ employees worldwide
Altair's Brands and Companies
- Engineering simulation platform
- On-demand cloud computing technology
- Product innovation consulting
- Industrial design technology
- Business intelligence and data analytics solutions
- Solid-state lighting products
HyperWorks for Analysis and Optimization
- Pre- and post-processing
- Solvers (RADIOSS, AcuSolve, MotionSolve): statics, NVH, non-linear implicit, non-linear explicit, thermal and CFD, multi-body dynamics
- Optimization: OptiStruct and HyperStudy
- Partner Alliance solutions: 1-D systems, fatigue, ergonomics, industrial design, injection molding, noise and vibration (NVH), composite materials, electromagnetics
- Data and process management
Simulate Real-life Models with HyperWorks Solvers
AcuSolve Already Benefits from GPU Acceleration
- High-performance computational fluid dynamics (CFD) software
- The leading finite-element-based CFD technology
- High accuracy and scalability on massively parallel architectures
- Benchmark: S-duct, 80K degrees of freedom; hybrid MPI/OpenMP for the multi-GPU test (Xeon Nehalem CPU vs. Nehalem CPU + Tesla GPU)
- Elapsed time (lower is better): 4-core CPU: 549; 1 core + 1 GPU: 279 (2X*); 2 cores + 2 GPUs: 165 (3.3X*)
- *Performance gain versus 4-core CPU
RADIOSS PORTING ON GPU
Motivations to Use GPU
- Cost-effective solution
- Power-efficient solution
- Key questions: How much of the peak can we get? Which part of the code is best suited? Which coding effort is required? What is the speedup for my application?
- Peak performance (DP), cost per Gflop and power efficiency:
  - 1 node Intel X5670: 140 Gflops, $50/Gflop, 0.74 Gflops/Watt
  - 2 nodes Intel X5670: 280 Gflops, $50/Gflop, 0.74 Gflops/Watt
  - 1 node + 1 NVIDIA M2090: 795 Gflops, $12/Gflop, 1.92 Gflops/Watt
  - 1 node + 2 NVIDIA M2090: 1450 Gflops, $8.5/Gflop, 2.1 Gflops/Watt
RADIOSS Porting on NVIDIA GPU
- Assess the potential of GPU for RADIOSS, with focus on the implicit solvers
- Direct solver: highly optimized, compute-intensive solver; limited scalability on multicores and clusters
- Iterative solver: able to solve huge problems with low memory requirements; efficient parallelization; high cost on CPU due to the large number of iterations needed for convergence
- Double precision required
- Integrate GPU parallelization into our hybrid parallelization technology
RADIOSS DIRECT SOLVER PORTING
Multifrontal Sparse Direct Solver
- Factorization loop (pseudocode):
    do j = 1, N
      assemble(a(j))
      factor(j)
      update(j)
    end do
- Non-pivoting case: Cholesky
- Pivoting case: LDL^T
Concerns with GPU Acceleration
- CUBLAS (DGEMM) is a perfect candidate to speed up the update module (a sketch follows this slide)
- A frontal matrix can be too large to fit in GPU memory
- A frontal matrix can be too small, and thus inefficient on the GPU
- Data transfer is not trivial: only the lower triangular part of the matrix is of interest
- Pivoting is required in real applications, and pivot searching is a sequential operation
- The factor module has limited parallel potential; it can be as expensive as the update module in extreme cases
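A minimal sketch of the idea (not the actual RADIOSS code): the Schur-complement update of one frontal matrix is a rank-k update, S = S - L.L^T, which maps directly onto a cuBLAS DGEMM once the blocks are on the device. The sizes n and k, the host arrays hL and hS, and the use of DGEMM rather than DSYRK are illustrative assumptions.

    // Hypothetical example: offload the frontal-matrix update S = S - L * L^T
    // to cuBLAS DGEMM (double precision, column major).
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void frontal_update(cublasHandle_t handle, int n, int k,
                        const double *hL, double *hS)
    {
        double *dL, *dS;
        cudaMalloc((void **)&dL, sizeof(double) * (size_t)n * k);
        cudaMalloc((void **)&dS, sizeof(double) * (size_t)n * n);
        cublasSetMatrix(n, k, sizeof(double), hL, n, dL, n);   // upload L (n x k)
        cublasSetMatrix(n, n, sizeof(double), hS, n, dS, n);   // upload S (n x n)

        const double alpha = -1.0, beta = 1.0;
        // S <- S - L * L^T
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, k,
                    &alpha, dL, n, dL, n, &beta, dS, n);

        cublasGetMatrix(n, n, sizeof(double), dS, n, hS, n);   // bring S back
        cudaFree(dL);
        cudaFree(dS);
    }

A DSYRK variant would exploit the fact that only the lower triangle of S is needed; the synchronous transfers shown here are exactly what the asynchronous scheme on the next slide tries to hide.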
CUBLAS Speedup
- Use CUBLAS to speed up the update module
- Improve the profile of the update module (base: BCSLIB-EXT 4.3)
- Asynchronous computing with several CUDA streams (S1, S2, S3): overlap the computation, overlap the communication (see the sketch after this slide)
- LDL^T factorization of the frontal matrix
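A simplified sketch of the asynchronous scheme (the slide shows three streams S1, S2, S3; a two-stream, double-buffered variant is shown here, and the panel layout and array names are assumptions): the upload of the next panel of L overlaps the DGEMM update of the current one.

    // Hypothetical example: pipeline host-to-device copies and cuBLAS DGEMM
    // updates with CUDA streams and events so transfers hide behind compute.
    // Assumes pinned host panels hPanel[p] of size n x kb, two device buffers
    // dPanel[0..1], and an n x n update block dS resident on the device.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void async_update(cublasHandle_t handle, int n, int kb, int npanels,
                      double * const *hPanel, double *dPanel[2], double *dS)
    {
        cudaStream_t copyS, compS;
        cudaEvent_t copied[2], consumed[2];
        cudaStreamCreate(&copyS);
        cudaStreamCreate(&compS);
        for (int b = 0; b < 2; ++b) {
            cudaEventCreate(&copied[b]);
            cudaEventCreate(&consumed[b]);
        }

        const double alpha = -1.0, beta = 1.0;
        cublasSetStream(handle, compS);              // all DGEMMs on the compute stream

        for (int p = 0; p < npanels; ++p) {
            int b = p & 1;                           // double buffering
            cudaStreamWaitEvent(copyS, consumed[b], 0);   // buffer free again?
            cudaMemcpyAsync(dPanel[b], hPanel[p],
                            sizeof(double) * (size_t)n * kb,
                            cudaMemcpyHostToDevice, copyS);
            cudaEventRecord(copied[b], copyS);
            cudaStreamWaitEvent(compS, copied[b], 0);     // wait for the upload
            // S <- S - L_p * L_p^T, overlapping the upload of the next panel
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, n, kb,
                        &alpha, dPanel[b], n, dPanel[b], n, &beta, dS, n);
            cudaEventRecord(consumed[b], compS);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(copyS);
        cudaStreamDestroy(compS);
    }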
Numerical Test: Non-Pivoting Case
- Linear static analysis, non-pivoting case
- Benchmark: 2.8 million degrees of freedom
- Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; NVIDIA C2070, CUDA 4.0; MKL 10.3, RHEL 5.1
- GPU speedup, 4 CPU + 1 GPU versus 4 CPU (elapsed time, lower is better): 3.9X on solver time, 2.9X on total elapsed time
- Profile chart: share of the update module in the total, base (BCSLIB-EXT) vs. improved
Numerical Test: Pivoting Case
- Nonlinear static analysis, pivoting case; customer model: engine block
- Benchmark: 2.5 million degrees of freedom
- Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; NVIDIA C2070, CUDA 4.0; MKL 10.3, RHEL 5.1
- GPU speedup, 4 CPU + 1 GPU versus 4 CPU (elapsed time, lower is better): 2.8X on solver time, 2.5X on total elapsed time
- Profile chart: share of the update module in the total, base (BCSLIB-EXT) vs. improved
Challenging Work
- On some models the update module is not that dominant, e.g. an engine block with 1st-order elements
- In-core / quasi in-core execution is preferred; a memory threshold to obtain a reasonably good speedup will be provided to the user, to make sure the essential computation stays in core
- Example: 1.5 million DOFs with pivoting, 4 CPU + 1 GPU versus 4 CPU (elapsed time, lower is better): 2.1X on solver time, 1.8X on total elapsed time
- Profile chart: update / solver / total, base (BCSLIB-EXT) vs. improved
Summary & Perspectives
- GPU and CUDA enhance the performance of the direct solver significantly
- CUBLAS on the Fermi card is very fast
- Asynchronous computing is the key component
- Improving the profile is a necessary step (Amdahl's law, recalled below)
- Application areas: nonlinear analysis (robust and accurate solution); optimization (matrix factorization reused in sensitivity analysis)
- Future work: other dynamic solvers; Block Lanczos eigenvalue solver; Altair AMSES (automated multilevel-substructuring eigenvalue solver)
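For reference, the Amdahl's law bound behind that remark (generic notation, not from the slides): if the accelerated part takes a fraction p of the original run time and is sped up by a factor s, the overall speedup is

    S = 1 / ((1 - p) + p / s)

so even an infinitely fast GPU update cannot exceed 1 / (1 - p); improving the profile, i.e. increasing p, matters as much as accelerating the kernel itself.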
RADIOSS ITERATIVE SOLVER PORTING
Preconditioned Conjugate Gradient Method
- Problem: solve the linear equation Ax - b = 0, with A symmetric positive-definite
- Solution: the Preconditioned Conjugate Gradient (PCG) method iteratively solves M^-1.(Ax - b) = 0
- The method converges quickly; efficiency depends on M
- Other advantages: low memory consumption; efficient parallelization (SMP, MPI)
PCG Porting on GPU Using CUDA
- Pretty simple original Fortran code
- Few kernels to write in CUDA; the sparse matrix-vector product is the main one, so the optimization effort is focused on this kernel
- Use of CUBLAS DAXPY and DDOT (a GPU sketch of one iteration follows this slide)
- PCG algorithm:
    r_0 = b - A.x_0
    z_0 = M^-1.r_0
    p_0 = z_0
    k = 0
    DO WHILE NOT DONE
      alpha_k = (r_k^T.z_k) / (p_k^T.A.p_k)
      x_k+1 = x_k + alpha_k.p_k
      r_k+1 = r_k - alpha_k.A.p_k
      IF (r_k+1 < precision) THEN
        DONE = TRUE
      END IF
      z_k+1 = M^-1.r_k+1
      beta_k = (z_k+1^T.r_k+1) / (z_k^T.r_k)
      p_k+1 = z_k+1 + beta_k.p_k
      k = k + 1
    END DO
    result = x_k+1
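A minimal sketch of how one PCG iteration can be laid out on the GPU (illustrative only: the CSR SpMV kernel, the Jacobi diagonal preconditioner and all array names are assumptions, not the RADIOSS kernels), with cuBLAS handling the DDOT and DAXPY operations named on the slide:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Hypothetical CSR sparse matrix-vector product: y = A.x
    __global__ void spmv_csr(int n, const int *rowPtr, const int *colInd,
                             const double *val, const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            double sum = 0.0;
            for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
                sum += val[j] * x[colInd[j]];
            y[row] = sum;
        }
    }

    // Hypothetical diagonal (Jacobi) preconditioner: z = M^-1.r with M = diag(A)
    __global__ void jacobi_precond(int n, const double *invDiag,
                                   const double *r, double *z)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = invDiag[i] * r[i];
    }

    // One PCG iteration; all vectors stay on the device, only the scalars
    // alpha and beta live on the host. The caller tests the residual norm.
    double pcg_iteration(cublasHandle_t h, int n,
                         const int *rowPtr, const int *colInd, const double *val,
                         const double *invDiag, double rz_old,
                         double *x, double *r, double *z, double *p, double *Ap)
    {
        int threads = 256, blocks = (n + threads - 1) / threads;
        spmv_csr<<<blocks, threads>>>(n, rowPtr, colInd, val, p, Ap);   // Ap = A.p

        double pAp, rz_new;
        cublasDdot(h, n, p, 1, Ap, 1, &pAp);        // p^T.A.p
        double alpha = rz_old / pAp, nalpha = -alpha;
        cublasDaxpy(h, n, &alpha,  p,  1, x, 1);    // x = x + alpha.p
        cublasDaxpy(h, n, &nalpha, Ap, 1, r, 1);    // r = r - alpha.A.p

        jacobi_precond<<<blocks, threads>>>(n, invDiag, r, z);
        cublasDdot(h, n, r, 1, z, 1, &rz_new);      // r^T.z for the next beta

        double beta = rz_new / rz_old, one = 1.0;
        cublasDscal(h, n, &beta, p, 1);             // p = beta.p
        cublasDaxpy(h, n, &one, z, 1, p, 1);        // p = z + beta.p
        return rz_new;
    }

Keeping all vectors resident on the device avoids per-iteration transfers; only the two scalar reductions cross the PCIe bus.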
Hybrid MPP Computing with CUDA and MPI
- Hybrid MPP version of RADIOSS: two parallelization levels
  - MPI parallelization based on domain decomposition
  - Multi-threaded MPI processes under OpenMP
- Enhanced performance: higher scalability, better efficiency (chart: RADIOSS 11 Hybrid MPP speedup)
- Extend the hybrid approach to multi-GPU programming (see the sketch after this slide)
  - Integrate GPU parallelization into our hybrid programming model
  - MPI manages the communication between GPUs
  - Portions of the code that are not GPU-enabled benefit from OpenMP parallelization
  - Programming model extendable to multiple nodes with multiple GPUs
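A skeleton of how the pieces fit together (illustrative, not the RADIOSS source): one MPI process per subdomain, each bound to its own GPU, with OpenMP threads covering the portions that are not GPU-enabled. The modulo rank-to-device mapping is an assumption for a node with several GPUs.

    // Hypothetical hybrid MPI + OpenMP + CUDA skeleton.
    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev > 0)
            cudaSetDevice(rank % ndev);      // one GPU associated to one MPI process

        #pragma omp parallel                 // CPU-only portions run multi-threaded
        {
            if (omp_get_thread_num() == 0)
                printf("rank %d: %d OpenMP threads, GPU %d\n",
                       rank, omp_get_num_threads(), ndev > 0 ? rank % ndev : -1);
        }

        // ... per-subdomain PCG iterations would run here: SpMV and vector
        // updates on the GPU, interface exchanges and global reductions via MPI ...

        MPI_Finalize();
        return 0;
    }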
PCG Multi-GPU Parallelization
- Domain decomposition to split the original problem, with optimal load balancing
- MPI communications between domains, controlled by the CPU; one GPU associated to one MPI process
- OpenMP multithreading speeds up the calculations not performed on the GPU and leverages the available CPU cores (a sketch of the distributed dot product follows this slide)
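One concrete consequence of the decomposition is that the dot products in PCG become global reductions. A small sketch (hypothetical names, not RADIOSS code): each MPI process computes its local contribution on its own GPU with cuBLAS, and MPI_Allreduce combines them so every rank gets the same alpha and beta.

    // Hypothetical example: distributed dot product r^T.z across subdomains.
    // Assumes each degree of freedom is owned by exactly one rank.
    #include <mpi.h>
    #include <cublas_v2.h>

    double distributed_ddot(cublasHandle_t handle, int nLocal,
                            const double *d_r, const double *d_z)
    {
        double local = 0.0, global = 0.0;
        cublasDdot(handle, nLocal, d_r, 1, d_z, 1, &local);  // local part, on the GPU
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;                                       // same value on all ranks
    }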
Benchmark #1
- Linear problem #1: hood of a car with pressure loads; compute displacements and stresses
- Benchmark: 0.9 million degrees of freedom, 24 million non-zeros, 140,000 shells + 13,000 solids + 1,100 RBE3, 4,300 iterations
- Platform: NVIDIA PSG cluster, 2 nodes, each with dual NVIDIA M2090 GPUs, CUDA v3.2, Intel Westmere 2x6 X5670@2.93GHz, Linux RHEL 5.4 with Intel MPI 4.0
- Results, 1 node with 2 NVIDIA M2090 (elapsed time, lower is better): SMP 6-core: 351 s; Hybrid 2 MPI x 6 SMP: 185 s; SMP 6 + 1 GPU: 84 s (4.2X*); Hybrid 2 MPI x 6 SMP + 2 GPUs: 53 s (6.5X*)
- Results, 2 nodes with 4 NVIDIA M2090: Hybrid 4 MPI x 6 SMP: 104 s; Hybrid 4 MPI x 6 SMP + 4 GPUs: 38 s (9.2X*)
- Performance: elapsed time decreased by up to 9X
- *Performance gain versus SMP 6-core
Benchmark #2
- Linear problem #2: hood of a car with pressure loads, refined model; compute displacements and stresses
- Benchmark: 2.2 million degrees of freedom, 62 million non-zeros, 380,000 shells + 13,000 solids + 1,100 RBE3, 5,300 iterations
- Platform: NVIDIA PSG cluster, 2 nodes, each with dual NVIDIA M2090 GPUs, CUDA v3.2, Intel Westmere 2x6 X5670@2.93GHz, Linux RHEL 5.4 with Intel MPI 4.0
- Results, 1 node with 2 NVIDIA M2090 (elapsed time, lower is better): SMP 6-core: 1106 s; Hybrid 2 MPI x 6 SMP: 572 s; SMP 6 + 1 GPU: 254 s (4.3X*); Hybrid 2 MPI x 6 SMP + 2 GPUs: 143 s (7.5X*)
- Results, 2 nodes with 4 NVIDIA M2090: Hybrid 4 MPI x 6 SMP: 306 s; Hybrid 4 MPI x 6 SMP + 4 GPUs: 85 s (13X*)
- Performance: elapsed time decreased by up to 13X
- *Performance gain versus SMP 6-core
Benchmark #3
- Gravity problem with contacts: full car model; compute displacements and stresses
- Benchmark: 0.85 million degrees of freedom, 21 million non-zeros, 138,000 shells + 11,000 solids + 5,700 1-D elements + 230 rigid bodies + 6 rigid walls, 1 gravity load and 22 contact interfaces, ~9,000 iterations
- Platform: NVIDIA PSG cluster, 2 nodes, each with dual NVIDIA M2090 GPUs, CUDA v3.2, Intel Westmere 2x6 X5670@2.93GHz, Linux RHEL 5.4 with Intel MPI 4.0
- Results, 1 node with 2 NVIDIA M2090 (elapsed time, lower is better): SMP 6-core: 1430 s; Hybrid 2 MPI x 6 SMP: 1223 s; SMP 6 + 1 GPU: 443 s (3.2X*); Hybrid 2 MPI x 6 SMP + 2 GPUs: 225 s (6.4X*)
- Results, 2 nodes with 4 NVIDIA M2090: Hybrid 4 MPI x 6 SMP: 531 s; Hybrid 4 MPI x 6 SMP + 4 GPUs: 163 s (8.8X*)
- Performance: elapsed time decreased by up to 9X
- *Performance gain versus SMP 6-core
CONCLUSION
Conclusion
- RADIOSS implicit direct and iterative solvers have been successfully ported to NVIDIA GPUs using CUDA
- Adding a GPU significantly improves the performance of these solvers
- For the iterative solver, Hybrid MPP makes it possible to run on multi-GPU workstations and GPU clusters with good scalability and enhanced performance
- GPU support for the implicit solvers is planned for HyperWorks 12
- A big thanks to the NVIDIA team for their great support: Stan Posey, Steven Rennich, Thomas Reed, Peng Wang!