Large Scale FEA on the GPU. Krishnan Suresh, Associate Professor, Mechanical Engineering.

High-Performance Trick: Computations (e.g., 3.4*1.22) are essentially free; memory access determines the speed of the code. Pick a memory-efficient algorithm!

Finite Element Analysis: simulation of engineering phenomena. Structural, thermal, fluid; static, modal, transient; linear, non-linear.

Structural Static FEA pipeline: Model -> Discretize -> Element Stiffness (Ke, fe) -> Assemble (K = Σ_e Ke, f = Σ_e fe) and Solve (Ku = f) -> Postprocess.

Typical Bottleneck: in the pipeline Model -> Discretize -> Element Stiffness -> Assemble/Solve (Ku = f) -> Postprocess, the Assemble/Solve step is the typical bottleneck and is where the GPU emphasis lies.

Linear Solvers: Ku = f, where K is sparse and usually symmetric positive definite. Direct: factor K = L D L^T, then u = L^-T D^-1 L^-1 f.

Direct Sparse on GPU: Ku = f (2008).

Direct Sparse on GPU: Ku = f. Memory limited! Effective speed-up.

Linear Solvers: Ku = f. Direct: K = L D L^T, u = L^-T D^-1 L^-1 f. Iterative: u^(i+1) = u^i + B(f - K u^i), where B is a preconditioner of K; repeated matrix-vector operations, ideally suited for the GPU!

Linear Solvers: Ku = f, direct or iterative, u^(i+1) = u^i + B(f - K u^i), B: preconditioner of K. Two goals: 1. minimize the number of iterations (B dependent); 2. minimize the cost of K*u (SpMV).
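
To make the scheme concrete, below is a minimal host-side sketch (illustrative, not the talk's code) of the update u^(i+1) = u^i + B(f - K u^i) with the simplest preconditioner, B = diag(K)^-1 (Jacobi); the spmv callback stands in for whichever K*u kernel is used, assembled or assembly-free. The CG solver used in the timings later is more sophisticated, but its cost is driven by the same two factors: the number of iterations (set by B) and the cost of each K*u.

```
/* Minimal sketch (host code, hypothetical names) of the preconditioned
 * iteration u^(i+1) = u^i + B (f - K u^i) with B = diag(K)^-1 (Jacobi). */
#include <math.h>
#include <stddef.h>

typedef void (*SpMvFn)(const double *x, double *y, void *ctx); /* y = K*x */

void richardson_jacobi(SpMvFn spmv, void *ctx,
                       const double *diagK,   /* diagonal of K             */
                       const double *f,       /* right-hand side           */
                       double *u,             /* in: guess, out: solution  */
                       double *r, double *Ku, /* scratch vectors, length n */
                       size_t n, double tol, int max_iter)
{
    for (int it = 0; it < max_iter; ++it) {
        spmv(u, Ku, ctx);                      /* Ku = K*u (the costly step) */
        double norm2 = 0.0;
        for (size_t i = 0; i < n; ++i) {
            r[i] = f[i] - Ku[i];               /* residual                   */
            norm2 += r[i] * r[i];
        }
        if (sqrt(norm2) < tol) return;         /* converged                  */
        for (size_t i = 0; i < n; ++i)
            u[i] += r[i] / diagK[i];           /* u += B*r, B = diag(K)^-1   */
    }
}
```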

Iterative Ku = f: related presentations at the GTC 2012 conference include Fine-Grained Parallel Preconditioners, CULA, MAGMA, Accelerating Iterative Linear Solvers, Efficient AMG on Hybrid GPU Clusters, and Preconditioning for Large-Scale Linear Solvers.

cuSPARSE

CULA Sparse

Ku = f in the FEA context: Model -> Discretize -> Element Stiffness (Ke, fe) -> Assemble (K = Σ_e Ke, f = Σ_e fe) and Solve (Ku = f) -> Postprocess. Accelerate the linear solve in the FEA context!

Linear Solvers: Ku = f, direct or iterative, u^(i+1) = u^i + P(f - K u^i), P: preconditioner of K. The iterative K*u product can be computed from an assembled matrix.

Assembled K*u: 1. Assemble K (on CPU or GPU). 2. Push K to the GPU. 3. Execute K*u in parallel, one GPU thread per node, with roughly ~100 non-zeros per row. Coalesced memory access limits performance!
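
For step 3, a minimal CUDA sketch (illustrative, one thread per matrix row) of the assembled product in CSR format; the scattered reads of col_idx, val, and x at ~100 non-zeros per row are what make coalescing difficult:

```
// Minimal sketch of the assembled K*u product in CSR format, one thread per
// matrix row (hypothetical kernel, not the talk's code).
__global__ void spmv_csr_row_per_thread(int n_rows,
                                        const int *row_ptr,   // size n_rows+1
                                        const int *col_idx,   // size nnz
                                        const double *val,    // size nnz
                                        const double *x,
                                        double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double sum = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
        sum += val[k] * x[col_idx[k]];   // scattered reads of val and x
    y[row] = sum;
}
```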

Linear Solvers: Ku = f, direct or iterative, u^(i+1) = u^i + P(f - K u^i), P: preconditioner of K. The iterative K*u product can be assembled or assembly-free.

Matrix-Free: K = Σ_e Ke, so Ku = (Σ_e Ke) u = Σ_e (Ke u_e); the product is accumulated element by element, without ever forming K.
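
A minimal host-side sketch (illustrative data layout) of this sum for 8-node hex elements with 3 dof per node: gather each element's 24 dof, multiply by that element's 24 x 24 Ke, and scatter-add into y. The GPU kernel shown a few slides later performs the same arithmetic gathered per node instead, which avoids write conflicts between threads.

```
// Minimal host-side sketch of Ku = sum_e Ke*u_e (hypothetical layout).
void matrix_free_spmv_cpu(int n_elems,
                          const int *elem_nodes,           // n_elems * 8 node ids
                          const double *const *Ke_of_elem, // per-element 24x24 Ke
                          const double *x, double *y, int n_dof)
{
    for (int i = 0; i < n_dof; ++i) y[i] = 0.0;

    for (int e = 0; e < n_elems; ++e) {
        const double *K = Ke_of_elem[e];
        double xe[24], ye[24];
        for (int b = 0; b < 8; ++b)                      // gather element dof
            for (int j = 0; j < 3; ++j)
                xe[3 * b + j] = x[3 * elem_nodes[8 * e + b] + j];
        for (int i = 0; i < 24; ++i) {                   // ye = Ke * xe
            ye[i] = 0.0;
            for (int j = 0; j < 24; ++j)
                ye[i] += K[24 * i + j] * xe[j];
        }
        for (int b = 0; b < 8; ++b)                      // scatter-add into y
            for (int j = 0; j < 3; ++j)
                y[3 * elem_nodes[8 * e + b] + j] += ye[3 * b + j];
    }
}
```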

Overview: hex mesh -> detect congruency -> compute Ke of the template elements -> push data to the GPU -> assembly-free Ku = f on the GPU.

Congruence Examples. Structured grid, isotropic material: 8,192 elements, one template element, 1 Ke matrix (24 x 24), 4.6 KB. Unstructured grid, isotropic material: 6,144 elements, 34 template elements (~0.5%), 34 Ke matrices, 156 KB (vs. 28 MB).
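
One way to picture the data layout this implies (illustrative structures, not the talk's actual code): a small pool of template Ke matrices plus a per-element index into the pool. The storage numbers on the slide follow directly: a 24 x 24 double-precision Ke is 24*24*8 = 4,608 bytes, so 34 templates take roughly 156 KB while 6,144 per-element copies would take roughly 28 MB.

```
// Illustrative (not from the talk) data layout for element congruency:
// store each distinct template Ke once and index into the pool per element.
struct TemplateStiffness {
    int     n_templates;   // 1, 34, or 322 in the slide's examples
    double *Ke;            // n_templates * 24 * 24 doubles, resident on the GPU
};

struct HexMesh {
    int  n_elems;
    int *elem_nodes;       // n_elems * 8 global node ids
    int *elem_template;    // per-element index into TemplateStiffness::Ke
};
```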

Results: unstructured grid, composite layers, 83,584 elements. With a material congruency check: 322 templates (~0.4%), 1.48 MB storage (vs. 384 MB).

Exploit Element Congruency: example meshes with 120, 1, and 20 distinct elements respectively; 1.48 MB storage (vs. 384 MB).

Philosophy: the less memory you use, the less memory you will fetch, and the faster your code will be!

Assembly-free y = Kx on the GPU. Algorithm (per thread, one thread per node):
  for each neighboring element
    for each of the 8 nodes of that element
      get (u, v, w) of that node
      apply the corresponding block of the element's Ke
      accumulate into uresult
  push uresult to memory
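
A minimal CUDA sketch of the per-node algorithm above (the data layout and names are assumptions, not the talk's actual code): each thread owns one node, walks that node's neighboring hex elements, and applies the 3 x 3 blocks of each element's template Ke that couple the element's 8 nodes to the owned node. The separate xu/xv/xw and yu/yv/yw arrays anticipate the coalesced (u, v, w) storage described on the next slide.

```
// Assembly-free y = K*x, one thread per node (hypothetical data layout).
__global__ void assembly_free_spmv(
    int n_nodes,
    const int *node_elem_ptr,   // CSR-like: neighboring elements of each node
    const int *node_elem,       //   element indices
    const int *node_local,      //   this node's local index (0..7) in that elem
    const int *elem_nodes,      // n_elems * 8 global node ids
    const int *elem_template,   // per-element template index
    const double *Ke,           // n_templates * 24 * 24 template stiffness
    const double *xu, const double *xv, const double *xw,  // input (u, v, w)
    double *yu, double *yv, double *yw)                    // output (u, v, w)
{
    int node = blockIdx.x * blockDim.x + threadIdx.x;
    if (node >= n_nodes) return;

    double acc[3] = {0.0, 0.0, 0.0};
    for (int p = node_elem_ptr[node]; p < node_elem_ptr[node + 1]; ++p) {
        int e = node_elem[p];
        int a = node_local[p];                       // my local node in elem e
        const double *K = Ke + elem_template[e] * 24 * 24;
        for (int b = 0; b < 8; ++b) {                // the 8 nodes of elem e
            int g = elem_nodes[8 * e + b];
            double x[3] = { xu[g], xv[g], xw[g] };
            for (int i = 0; i < 3; ++i)              // 3x3 block (a, b) of Ke
                for (int j = 0; j < 3; ++j)
                    acc[i] += K[(3 * a + i) * 24 + (3 * b + j)] * x[j];
        }
    }
    yu[node] = acc[0]; yv[node] = acc[1]; yw[node] = acc[2];
}
```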

GPU Data: separate storage arrays for the (u, v, w) components enable coalesced memory access; the footprint can be reduced further.

K*u on the GPU.

Platform: 64-bit Windows (Professional). CPU: Intel i7, 3.2 GHz, 6 GB RAM (C code, single core). GPU: GTX 480 (400 cores), 1.5 GB.

Timing, Problem 1: fixed, isotropic, structured hex mesh, 32 x 16 x 16; 1 template element; 28,000 dof.
  Solver                             Overhead cost (s)     Solution time (s)
  Matlab, assembled; direct solve    1.8 (K assembly)      7.0
  Matlab, assembled; CG (1e-10)      1.8 (K assembly)      8.7
  C, assembly-free; CG (1e-10)       -                     8.0
  CUDA, assembly-free; CG (1e-10)    0.13 (GPU transfer)   0.20

Timing, Problem 2: fixed, isotropic, structured hex mesh, 32 x 32 x 16; 1 template element; 55,539 dof.
  Solver                             Overhead cost (s)     Solution time (s)
  Matlab, assembled; direct solve    3.6 (K assembly)      33.2
  Matlab, assembled; CG (1e-10)      3.6 (K assembly)      2.73
  C, assembly-free; CG (1e-10)       -                     19.8
  CUDA, assembly-free; CG (1e-10)    0.16 (GPU transfer)   0.29

Timing, Problem 3: fixed, isotropic, structured hex mesh, 64 x 32 x 16; 1 template element; 109,385 dof.
  Solver                             Overhead cost (s)     Solution time (s)
  Matlab, assembled; direct solve    7.1 (K assembly)      > 8 mins (stalled)
  Matlab, assembled; CG (1e-10)      7.1 (K assembly)      8.8
  C, assembly-free; CG (1e-10)       0.0                   60.8
  CUDA, assembly-free; CG (1e-10)    0.19 (GPU transfer)   0.66

Timing, Problem 4: fixed, structured hex mesh, 64 x 32 x 16; 2 template elements; 109,000 dof; isotropic vs. composite material.
  Solver                    Overhead (s) iso / comp    Solution (s) iso / comp
  Matlab: direct            -- / --                    -- / --
  Matlab: CG                7.1 / 7.1                  8.8 / 22.0
  C: CG, assembly-free      0.0 / 0.0                  60.8 / 154
  CUDA: CG, assembly-free   0.16 / 0.17                0.66 / 1.6
Same K sparsity, but 2.5x the CG iterations for the composite; the GPU incurs no per-iteration penalty from the non-isotropy.

Timing, Problem 5: fixed, isotropic, structured hex mesh, 256 x 128 x 64; 1 template element; 6.5 million dof; 468 MB of GPU memory.
  Solver                             Overhead cost (s)     Solution time (s)
  Matlab, assembled; direct solve    --                    Stalls!
  Matlab, assembled; CG (1e-10)      --                    Stalls!
  C, assembly-free; CG (1e-10)       --                    Stalls!
  CUDA, assembly-free; CG (1e-10)    0.41 (GPU transfer)   87.2

Timing, Problem 6: unstructured hex mesh (loaded and fixed faces per the slide figure), isotropic and composite (alternating layers of +/- 60 deg fibers); 274,000 dof; 322 template elements.
  Solver                    Overhead (s) iso / comp    Solution (s) iso / comp
  Matlab: direct            -- / --                    -- / --
  Matlab: CG                18.3 / 18.3                214.2 / 216.3
  C: CG, assembly-free      0.0 / 0.0                  720.3 / 770.6
  CUDA: CG, assembly-free   0.18 / 0.18                29.3 / 35.2

Typical FEA Computing Time
  Part                     DOF     PareTO CPU    PareTO GPU
  Cantilever Beam; Edge    110K    8.3 secs      1.9 secs
  Stool                    2.7M    186 secs      32 secs
  Point Load Cantilever    15M     51 mins       5.73 mins
                           92M     12 hr         - (1.5 hr on 44 cores, ONR)

Topology Optimization: a structural problem with optimal topologies. Reduce weight, but keep the part stiff: a 10% reduction in weight gives a 2% loss in stiffness; a 50% reduction in weight gives a 12% loss in stiffness. Topology optimization is the systematic generation of such structures.

Topology Optimization: 10^5 to 10^7 dof. The loop over the design space: run finite element analysis (solve Ku = f, K sparse SPD); if the design is not yet optimal, change the topology and repeat. Hundreds of iterations! Variants: multi-load, nonlinear FEA, constraints, ...
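
A high-level host-side sketch of this loop (all names are hypothetical placeholders, not PareTO's API): every design iteration requires a full FEA solve of Ku = f, which is why hundreds of iterations make the solver the dominant cost.

```
// Hypothetical driver for the optimization loop on the slide.  The callbacks
// stand in for the FEA solve, sensitivity analysis, and topology update.
typedef struct Design Design;   // opaque design description

typedef void (*FeaSolveFn)(Design *d);        // solve Ku = f (the costly step)
typedef void (*SensitivityFn)(Design *d);     // effect of removing material
typedef int  (*IsOptimalFn)(const Design *d); // stopping test
typedef void (*UpdateTopologyFn)(Design *d);  // remove/redistribute material

void topology_optimization(Design *d, int max_iters,
                           FeaSolveFn solve_fea,
                           SensitivityFn compute_sensitivities,
                           IsOptimalFn is_optimal,
                           UpdateTopologyFn update_topology)
{
    // Hundreds of iterations are typical, each needing a full FEA solve.
    for (int iter = 0; iter < max_iters; ++iter) {
        solve_fea(d);
        compute_sensitivities(d);
        if (is_optimal(d))
            break;
        update_topology(d);
    }
}
```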

Knuckle Optimization.

TopOpt Computing Time
  Part (volume fraction)          DOF     Published data    PareTO CPU       PareTO GPU
  Cantilever Beam; Edge (0.50)    110K    2.4 hr            200 secs         45 secs
  Knuckle (0.55)                  20K     --                111 secs         44 secs
  Bridge (0.35)                   113K    --                2 mins           36.2 secs
  Stool; N=96 (0.20)              2.7M    21.8 hr           1 hr, 24 mins    14 mins
                                  783K    3.9 hr            16 mins          125 s
  Point Load Cantilever (0.50)    15M     -                 19 hr, 28 mins   2 hr, 12 mins
                                  92M     -                 12 days, 2 hr    -

PareTO. Download: www.ersl.wisc.edu. Email: suresh@engr.wisc.edu.

Acknowledgements: graduate students, NSF, UW-Madison, Kulicke and Soffa, Luvata, Trek Bicycles. Publications available at www.ersl.wisc.edu. Email: suresh@engr.wisc.edu.