Speedup Altair RADIOSS Solvers Using NVIDIA GPU


Innovation Intelligence
Speedup Altair RADIOSS Solvers Using NVIDIA GPU
Eric LEQUINIOU, HPC Director
Hongwei Zhou, Senior Software Developer
May 16, 2012

Innovation Intelligence ALTAIR OVERVIEW

Altair's Vision: simulation, predictive analytics and optimization leveraging high-performance computing for engineering and business decision making.

Altair Engineering: 25+ years of innovation, 40+ offices in 16 countries, 1500+ employees worldwide.

Altair's Brands and Companies
- Engineering Simulation Platform
- On-demand Cloud Computing Technology
- Product Innovation Consulting
- Industrial Design Technology
- Business Intelligence & Data Analytics Solutions
- Solid State Lighting Products

HyperWorks for Analysis and Optimization
- Pre-Post
- Statics, NVH, Non-Linear (Implicit), Non-Linear (Explicit), Thermal and CFD, Multi-Body Dynamics
- 1-D Systems, Fatigue, Ergonomics, Industrial Design, Injection Molding, Noise and Vibration (NVH), Composite Materials, Electromagnetics
- Solvers: RADIOSS, AcuSolve, MotionSolve, Partner Alliance Solutions
- Optimization: OptiStruct and HyperStudy
- Data and Process Management

Simulate Real-life Models with HyperWorks Solvers

AcuSolve Already Benefits from GPU Acceleration
- High-performance computational fluid dynamics (CFD) software
- The leading finite-element-based CFD technology
- High accuracy and scalability on massively parallel architectures
S-duct benchmark: 80K degrees of freedom, hybrid MPI/OpenMP for the multi-GPU test (Xeon Nehalem CPU vs. Nehalem CPU + Tesla GPU).
Results (elapsed time, lower is better; gains versus the 4-core CPU):
- 4-core CPU: 549
- 1 core + 1 GPU: 279 (2X)
- 2 cores + 2 GPUs: 165 (3.3X)

Innovation Intelligence RADIOSS PORTING ON GPU

Motivations to Use GPU
- Cost-effective solution
- Power-efficient solution
Key questions:
- How much of the peak can we get?
- Which part of the code is best suited?
- What coding effort is required?
- What is the speedup for my application?
Peak performance, cost and power efficiency (values from the original chart):

    Configuration              Peak Gflops (DP)   Gflop cost in $   Gflops per Watt
    1 node, Intel X5670        140                50                0.74
    2 nodes, Intel X5670       280                50                0.74
    1 node + 1 Nvidia M2090    795                12                1.92
    1 node + 2 Nvidia M2090    1450               8.5               2.1

RADIOSS Porting on Nvidia GPU
Assess the potential of the GPU for RADIOSS, with a focus on the implicit solvers:
- Direct solver: highly optimized, compute-intensive solver; limited scalability on multicores and clusters.
- Iterative solver: able to solve huge problems with low memory requirements; efficient parallelization; high cost on the CPU due to the large number of iterations needed for convergence; double precision required.
Goal: integrate GPU parallelization into our Hybrid parallelization technology.

Innovation Intelligence RADIOSS DIRECT SOLVER PORTING

Multifrontal Sparse Direct Solver

    do j = 1, N
        assemble(a(j))
        factor(j)
        update(j)
    end

Non-pivoting case: Cholesky. Pivoting case: LDL^T.
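For reference, the three steps can be written in the standard multifrontal form (a textbook formulation, not necessarily the exact RADIOSS kernels). With the frontal matrix F partitioned into the pivot block F_11 and the remainder:

    factor:  F_11 = L_11.D_1.L_11^T
    solve:   L_21 = F_21.L_11^-T.D_1^-1
    update:  S    = F_22 - L_21.D_1.L_21^T    (Schur complement passed to the parent front)

The update step is a dense matrix-matrix product, which is why DGEMM is the natural kernel to offload.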

Concerns with GPU Acceleration
- CUBLAS (DGEMM) is a perfect candidate to speed up the update module.
- The frontal matrix can be too large to fit in GPU memory.
- The frontal matrix can be too small, and thus inefficient, on the GPU.
- Data transfer is not trivial: only the lower triangular part of the matrix is of interest.
- Pivoting is required in real applications, and pivot searching is a sequential operation.
- The factor module has limited parallel potential; in extreme cases it can be as expensive as the update module.
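To make the DGEMM candidacy concrete, here is a minimal sketch of offloading one dense update, C := C - L21.W^T with W = L21.D1, to cuBLAS. The array names and the simple copy-in/copy-out scheme are illustrative assumptions, not the RADIOSS implementation, which keeps data resident and overlaps transfers as described next.

    /* Minimal sketch: offload the dense Schur-complement update
     *     C := C - L21 * W^T   (L21, W: m x k; C: m x m; W = L21 * D1)
     * to cuBLAS DGEMM.  Names and the blocking copies are illustrative. */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    void frontal_update(cublasHandle_t handle, int m, int k,
                        const double *h_L21,  /* m x k, column-major */
                        const double *h_W,    /* m x k, column-major */
                        double *h_C)          /* m x m, column-major */
    {
        double *d_L21, *d_W, *d_C;
        size_t mk = sizeof(double) * (size_t)m * k;
        size_t mm = sizeof(double) * (size_t)m * m;
        cudaMalloc((void**)&d_L21, mk);
        cudaMalloc((void**)&d_W,   mk);
        cudaMalloc((void**)&d_C,   mm);
        cudaMemcpy(d_L21, h_L21, mk, cudaMemcpyHostToDevice);
        cudaMemcpy(d_W,   h_W,   mk, cudaMemcpyHostToDevice);
        cudaMemcpy(d_C,   h_C,   mm, cudaMemcpyHostToDevice);

        const double alpha = -1.0, beta = 1.0;
        /* C = beta*C + alpha * L21 * W^T */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, m, k,
                    &alpha, d_L21, m, d_W, m, &beta, d_C, m);

        cudaMemcpy(h_C, d_C, mm, cudaMemcpyDeviceToHost);
        cudaFree(d_L21); cudaFree(d_W); cudaFree(d_C);
    }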

CUBLAS Speedup of the Update Module
- CUBLAS speeds up the update module and improves its profile relative to the base BCSLIB 4.3.
- Asynchronous computing: the work is split across streams (S1, S2, S3 in the original diagram) so as to overlap the computation and overlap the communication during the LDL^T factorization.
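As a rough illustration of this overlap (a sketch of a generic double-buffering pattern, not the actual RADIOSS scheme), blocks of the frontal matrix can be copied on one stream while cuBLAS processes the previous block on another; the buffer names and the DGEMM-based update are assumptions.

    /* Illustrative double-buffered, asynchronous update: PCIe transfers on a
     * copy stream overlap DGEMM on a compute stream.  Host blocks must be
     * pinned (cudaMallocHost) for cudaMemcpyAsync to be truly asynchronous. */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    void update_blocks_async(cublasHandle_t handle, int nblk, int m, int k,
                             double *h_blk[],   /* pinned host blocks, m x k each */
                             double *d_blk[2],  /* two device buffers             */
                             double *d_W,       /* m x k, already on the device   */
                             double *d_C)       /* m x m result, on the device    */
    {
        cudaStream_t copy_s, comp_s;
        cudaEvent_t  copied[2], used[2];
        cudaStreamCreate(&copy_s);
        cudaStreamCreate(&comp_s);
        for (int i = 0; i < 2; ++i) {
            cudaEventCreate(&copied[i]);
            cudaEventCreate(&used[i]);
            cudaEventRecord(used[i], comp_s);   /* buffers start out free */
        }

        const double alpha = -1.0, beta = 1.0;
        cublasSetStream(handle, comp_s);        /* all DGEMMs on the compute stream */
        for (int b = 0; b < nblk; ++b) {
            int s = b % 2;
            /* wait until the DGEMM that last read d_blk[s] has finished */
            cudaStreamWaitEvent(copy_s, used[s], 0);
            cudaMemcpyAsync(d_blk[s], h_blk[b], sizeof(double) * (size_t)m * k,
                            cudaMemcpyHostToDevice, copy_s);
            cudaEventRecord(copied[s], copy_s);

            /* the compute stream waits for the copy, then accumulates into d_C */
            cudaStreamWaitEvent(comp_s, copied[s], 0);
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, m, k,
                        &alpha, d_blk[s], m, d_W, m, &beta, d_C, m);
            cudaEventRecord(used[s], comp_s);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(copy_s);
        cudaStreamDestroy(comp_s);
    }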

Numerical Test: Non-Pivoting Case
Linear static analysis, non-pivoting case.
Benchmark: 2.8 million degrees of freedom.
Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070, CUDA 4.0; MKL 10.3, RHEL 5.1.
Results (elapsed time, lower is better): GPU speedups of 3.9X and 2.9X for 4 CPU + 1 GPU versus 4 CPU. The accompanying profile chart compared the update share of the total solver time for the base (BCSLIB-Ext) and improved versions.

Numerical Test: Pivoting Case
Nonlinear static analysis, pivoting case; customer model of an engine block.
Benchmark: 2.5 million degrees of freedom.
Platform: Intel Xeon X5550, 4 cores, 48 GB RAM; Nvidia C2070, CUDA 4.0; MKL 10.3, RHEL 5.1.
Results (elapsed time, lower is better): GPU speedups of 2.8X and 2.5X for 4 CPU + 1 GPU versus 4 CPU, with the profile again compared between the base (BCSLIB-Ext) and improved versions.

Challenging Work
- On some models the update module is not as dominant, e.g., an engine block meshed with 1st-order elements.
- In-core / quasi in-core execution is preferred: a memory threshold for obtaining reasonably good speedup will be provided to the user, to make sure the essential computation stays in core.
- Example: 1.5 million DOFs with pivoting; the GPU speedups drop to 2.1X and 1.8X (4 CPU + 1 GPU versus 4 CPU, lower elapsed time is better).

Summary & Perspectives GPU and CUDA enhance the performance of direct solver significantly CUBLAS on Fermi card is so fast! Asynchronous computing is the key component Improving the profile is a necessary procedure Amdahl's law Application Area Nonlinear analysis robust and accurate solution Optimization matrix factorization reused in sensitivity analysis Future works Other dynamic solvers Block Lanczos Eigen value solver Altair AMSES (automated multilevel-substructure Eigen value solver) 19

Innovation Intelligence RADIOSS ITERATIVE SOLVER PORTING

Preconditioned Conjugate Gradient Method
Problem: solve the linear system Ax - b = 0 with A symmetric positive-definite.
Solution: the Preconditioned Conjugate Gradient (PCG) method iteratively solves M^-1.(Ax - b) = 0.
- The method converges quickly; its efficiency depends on the preconditioner M.
Other advantages:
- Low memory consumption
- Efficient parallelization: SMP, MPI

PCG Porting on GPU Using CUDA
- The original Fortran code is quite simple; only a few kernels need to be written in CUDA.
- Sparse matrix-vector product: the optimization effort is focused on this kernel.
- CUBLAS is used for DAXPY and DDOT.

PCG algorithm:

    r_0 = b - A.x_0
    z_0 = M^-1.r_0
    p_0 = z_0
    k = 0
    DO WHILE NOT DONE
        alpha_k = (r_k^T.z_k) / (p_k^T.A.p_k)
        x_k+1 = x_k + alpha_k.p_k
        r_k+1 = r_k - alpha_k.A.p_k
        IF (r_k+1 < precision) THEN
            DONE = TRUE
        END IF
        z_k+1 = M^-1.r_k+1
        beta_k = (z_k+1^T.r_k+1) / (z_k^T.r_k)
        p_k+1 = z_k+1 + beta_k.p_k
        k = k + 1
    END DO
    result = x_k+1
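A minimal sketch of the corresponding GPU building blocks is shown below, assuming a CSR matrix layout; the kernel and helper names are illustrative, and the actual RADIOSS sparse matrix-vector kernel is more heavily optimized.

    /* Illustrative PCG building blocks: a simple CSR sparse matrix-vector
     * kernel plus cuBLAS DDOT/DAXPY for the vector updates. */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    /* y = A.x for a CSR matrix: one thread per row */
    __global__ void spmv_csr(int n, const int *rowptr, const int *colind,
                             const double *val, const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n) {
            double sum = 0.0;
            for (int j = rowptr[row]; j < rowptr[row + 1]; ++j)
                sum += val[j] * x[colind[j]];
            y[row] = sum;
        }
    }

    /* One PCG step (preconditioning of r into z is handled elsewhere):
     * q = A.p ; alpha = (r.z)/(p.q) ; x = x + alpha.p ; r = r - alpha.q */
    void pcg_step(cublasHandle_t handle, int n,
                  const int *d_rowptr, const int *d_colind, const double *d_val,
                  double *d_x, double *d_r, const double *d_z,
                  const double *d_p, double *d_q)
    {
        int threads = 256, blocks = (n + threads - 1) / threads;
        spmv_csr<<<blocks, threads>>>(n, d_rowptr, d_colind, d_val, d_p, d_q);

        double rz, pq;
        cublasDdot(handle, n, d_r, 1, d_z, 1, &rz);
        cublasDdot(handle, n, d_p, 1, d_q, 1, &pq);

        double alpha = rz / pq, malpha = -alpha;
        cublasDaxpy(handle, n, &alpha,  d_p, 1, d_x, 1);  /* x = x + alpha.p   */
        cublasDaxpy(handle, n, &malpha, d_q, 1, d_r, 1);  /* r = r - alpha.A.p */
    }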

Hybrid MPP Computing with CUDA and MPI
Hybrid MPP version of RADIOSS, with two parallelization levels:
- MPI parallelization based on domain decomposition
- Multi-threaded MPI processes using OpenMP
Benefits: enhanced performance, higher scalability, better efficiency (RADIOSS 11 Hybrid MPP speedup).
Extending Hybrid to multi-GPU programming:
- Integrate GPU parallelization into our hybrid programming model
- MPI manages communication between GPUs
- Portions of code that are not GPU-enabled benefit from OpenMP parallelization
- The programming model is extendable to multiple nodes with multiple GPUs

PCG Multi-GPU Parallelization
- Domain decomposition splits the original problem with optimal load balancing.
- MPI communications between domains are controlled by the CPU.
- One GPU is associated with each MPI process.
- OpenMP multithreading speeds up the calculations not performed on the GPU, leveraging the available CPU cores.
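As an illustration of this mapping (a generic MPI + OpenMP + CUDA pattern, not the RADIOSS source), each MPI process selects its own GPU and keeps OpenMP threads for the work that stays on the CPU:

    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, ndev;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);
        /* one GPU per MPI process; assumes ranks on a node map to distinct devices */
        cudaSetDevice(rank % ndev);

        /* ... the PCG iterations for this rank's domain run on its GPU ... */

        /* work not ported to the GPU uses the available CPU cores */
        #pragma omp parallel
        {
            /* e.g. assembly or other CPU-side tasks for this domain */
        }

        /* interface data between domains is exchanged with MPI, driven by the CPU */
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }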

Benchmark #1
Linear problem #1: hood of a car with pressure loads; compute displacements and stresses.
Benchmark: 0.9 million degrees of freedom, 24 million nonzeros, 140,000 shells + 13,000 solids + 1,100 RBE3, 4,300 iterations.
Platform: Nvidia PSG cluster, 2 nodes, each with dual Nvidia M2090 GPUs, CUDA v3.2, Intel Westmere 2x6 X5670 @ 2.93 GHz, Linux RHEL 5.4 with Intel MPI 4.0.
Results (elapsed time in seconds, lower is better; speedups versus SMP 6-core):
- SMP 6-core: 351
- Hybrid 2 MPI x 6 SMP: 185
- SMP 6 + 1 GPU: 84 (4.2X)
- Hybrid 2 MPI x 6 SMP + 2 GPUs: 53 (6.5X)
- Hybrid 4 MPI x 6 SMP (2 nodes): 104
- Hybrid 4 MPI x 6 SMP + 4 GPUs (2 nodes): 38 (9.2X)
Performance: elapsed time decreased by up to 9X.

Benchmark #2
Linear problem #2: hood of a car with pressure loads, refined model; compute displacements and stresses.
Benchmark: 2.2 million degrees of freedom, 62 million nonzeros, 380,000 shells + 13,000 solids + 1,100 RBE3, 5,300 iterations.
Platform: Nvidia PSG cluster, 2 nodes, each with dual Nvidia M2090 GPUs, CUDA v3.2, Intel Westmere 2x6 X5670 @ 2.93 GHz, Linux RHEL 5.4 with Intel MPI 4.0.
Results (elapsed time in seconds, lower is better; speedups versus SMP 6-core):
- SMP 6-core: 1106
- Hybrid 2 MPI x 6 SMP: 572
- SMP 6 + 1 GPU: 254 (4.3X)
- Hybrid 2 MPI x 6 SMP + 2 GPUs: 143 (7.5X)
- Hybrid 4 MPI x 6 SMP (2 nodes): 306
- Hybrid 4 MPI x 6 SMP + 4 GPUs (2 nodes): 85 (13X)
Performance: elapsed time decreased by up to 13X.

Benchmark #3
Gravity problem with contacts: full car model; compute displacements and stresses.
Benchmark: 0.85 million degrees of freedom, 21 million nonzeros, 138,000 shells + 11,000 solids + 5,700 1-D elements + 230 rigid bodies + 6 rigid walls; 1 gravity load and 22 contact interfaces; ~9,000 iterations.
Platform: Nvidia PSG cluster, 2 nodes, each with dual Nvidia M2090 GPUs, CUDA v3.2, Intel Westmere 2x6 X5670 @ 2.93 GHz, Linux RHEL 5.4 with Intel MPI 4.0.
Results (elapsed time in seconds, lower is better; speedups versus SMP 6-core):
- SMP 6-core: 1430
- Hybrid 2 MPI x 6 SMP: 1223
- SMP 6 + 1 GPU: 443 (3.2X)
- Hybrid 2 MPI x 6 SMP + 2 GPUs: 225 (6.4X)
- Hybrid 4 MPI x 6 SMP (2 nodes): 531
- Hybrid 4 MPI x 6 SMP + 4 GPUs (2 nodes): 163 (8.8X)
Performance: elapsed time decreased by up to 9X.

Innovation Intelligence CONCLUSION

Conclusion
- The RADIOSS implicit direct and iterative solvers have been successfully ported to the Nvidia GPU using CUDA.
- Adding a GPU significantly improves the performance of these solvers.
- For the iterative solver, Hybrid MPP allows runs on multi-GPU workstations and GPU clusters with good scalability and enhanced performance.
- GPU support for the implicit solvers is planned for HyperWorks 12.
A big thanks to the Nvidia team for their great support: Stan Posey, Steven Rennich, Thomas Reed, Peng Wang!