ME964 High Performance Computing for Engineering Applications


1 ME964 High Performance Computing for Engineering Applications
Outlining Midterm Projects
Topic 3: GPU-based FEA
Topic 4: GPU Direct Solver for Sparse Linear Algebra
March 01, 2011
Dan Negrut, 2011, ME964 UW-Madison
"The real problem is not whether machines think but whether men do." B. F. Skinner

2 Before We Get Started
Last time: Midterm Project topics 1 and 2
- Discrete Element Method on the GPU. Area coordinator: Toby Heyn
- Collision Detection on the GPU. Area coordinator: Arman Pazouki
Today: Midterm Project topics 3 and 4
- Finite Element Method on the GPU. Area coordinators: Prof. Suresh and Naresh Khude
- Sparse direct solver on the GPU (Cholesky). Area coordinator: Dan Negrut
Midterm Project related issues
- Midterm Project is due on 04/13 at 11:59 PM (use Learn@UW drop-box)
- Intermediate report is due on 03/22 at 11:59 PM (use the same Learn@UW drop-box)
- Each area coordinator will provide a test problem for you to test your GPU implementation, and will also assist you with questions related to the non-programming aspects (the "theory") behind the topic you chose
- You can continue your Midterm Project (MP) and have it become your Final Project (FP). In this case you will be expected to show how the FP implementation is superior to your MP implementation
Other issues
- HW5 is due tonight at 11:59 PM. Use Learn@UW drop-box to submit homework

3 Finite Element Analysis on the GPU? Krishnan Suresh Associate Professor

4 Finite Element Analysis
Computer simulation of engineering models
Physics: structural, thermal, fluid, ...
Mode: static, modal, transient
Linear, non-linear, multi-physics

5 Why GPU? [Gordon; JPL] Hours or even days of CPU time.

6 Question
Can one exploit graphics processing units (GPUs) to speed up finite element analysis?

7 Structural Static FEA
Pipeline: Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$; $Ku = f$) → Postprocess

8 FEA: Variations
Pipeline: Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$; $Ku = f$) → Postprocess
Variations: Tet/Hex/Order/Hybrid; Direct/Iterative; Nonlinear; Optimization

9 FEA: Challenges
Pipeline: Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$; $Ku = f$) → Postprocess
Variations: Tet/Hex/Order/Hybrid; Direct/Iterative; Nonlinear; Optimization
Challenges:
1. Accuracy
2. Automation
3. Speed

10 Typical Bottleneck
Pipeline: Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$; $Ku = f$) → Postprocess

11 GPU & Engineering Analysis
Model (CPU) → Discretize (GPU?)
Discretization
- Data: small b-rep (+)
- Logic: complex (-)
- Threads: few (-)
Not a good candidate for GPU!?

12 Element Stiffness ($K_e$, $f_e$)
Model (CPU) → Discretize (CPU) → Element Stiffness (GPU?)
Element types: Hex 2nd Order, Hex Hybrid
Element Stiffness
- Data: O(N) (+/-)
- Logic: simple (+)
- Threads: N (+)

13 Stiffness: Hex 2nd Order
$K_e = [k_{ij}]_{(M, M)}$
8 corners (x, y, z): ~100 bytes of data
27 nodes (u, v, w): M = 81 DOF
$k_{ij}$: Gaussian integration, ~30 flops
Flops ~ $N(15 M^2)$
N = [value missing], M = 81: CPU ~4 sec
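
The operation count on this slide can be turned into a quick back-of-the-envelope estimate. The sketch below assumes one reading of the factor 15: each entry $k_{ij}$ costs about 30 flops and, because $K_e$ is symmetric, only about $M^2/2$ entries need to be integrated. The function name and the mesh size `N` are hypothetical and chosen only for illustration; the slide's own value of N is not recoverable.

```python
def element_stiffness_flops(num_elements, dofs_per_element=81, flops_per_entry=30):
    """Rough flop estimate for building all element stiffness matrices.

    Assumption: K_e is symmetric, so about M*M/2 entries are integrated,
    each costing flops_per_entry flops, which gives 15*M^2 per element
    when flops_per_entry is 30.
    """
    entries_per_element = dofs_per_element * dofs_per_element / 2
    return num_elements * flops_per_entry * entries_per_element


# Hypothetical mesh size, used only to show the order of magnitude.
N = 100_000
print(f"Estimated work for N = {N}: {element_stiffness_flops(N):.3e} flops")
```

Since every element is processed independently with the same simple logic, this stage maps naturally to one GPU thread (or thread group) per element, which is the point the slide makes with "Threads: N (+)".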

14 Typical Bottleneck
Pipeline: Model → Discretize → Element Stiffness ($K_e$, $f_e$) → Assemble/Solve ($K = \sum_e K_e$, $f = \sum_e f_e$; $Ku = f$)

15 Direct vs. Iterative
$Ku = f$, where $K$ is sparse and usually symmetric positive definite (P.D.)
Direct: $K = LDL^T$, $u = L^{-T} D^{-1} L^{-1} f$
Iterative: $u^{i+1} = u^i + B(f - Ku^i)$, with $B$ a preconditioner of $K$ (GPU variation: assembly-free)
Note: Nvidia offers the cuBLAS (BLAS level 3) dense matrix library
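
To make the iterative update $u^{i+1} = u^i + B(f - Ku^i)$ concrete, here is a minimal sketch that uses the simplest possible choice, $B = \mathrm{diag}(K)^{-1}$ (Jacobi). The function names, the dense list-of-lists storage, and the 3x3 test matrix are assumptions made for illustration; they are not taken from the slides.

```python
def matvec(K, u):
    """Dense matrix-vector product K u (K is a list of rows)."""
    return [sum(kij * uj for kij, uj in zip(row, u)) for row in K]


def preconditioned_iteration(K, f, tol=1e-10, max_iter=500):
    """Iterate u <- u + B (f - K u) with B = diag(K)^-1 (Jacobi)."""
    u = [0.0] * len(f)
    for iteration in range(max_iter):
        residual = [fi - kui for fi, kui in zip(f, matvec(K, u))]
        if max(abs(ri) for ri in residual) < tol:
            return u, iteration
        # Applying B is a cheap, fully parallel scaling by 1 / K[i][i].
        u = [ui + ri / K[i][i] for i, (ui, ri) in enumerate(zip(u, residual))]
    return u, max_iter


# Small diagonally dominant SPD example whose exact solution is [1, 2, 3].
K = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
f = [2.0, 4.0, 10.0]
u, iterations = preconditioned_iteration(K, f)
print(u, iterations)
```

The only operations involved are a matrix-vector product and elementwise vector updates, which is why this family of methods is attractive on the GPU and why it can be run without ever assembling $K$.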

16 Direct Sparse on GPU (1) (2006)

17 Direct Sparse on GPU (1) Ku = f

18 Direct Sparse on GPU (1) Ku = f

19 Direct Sparse on GPU (2) Ku = f (2008)

20 Direct Sparse on GPU (2) Ku = f

21 Iterative Sparse on GPU (1) (2008) Jacobi preconditioned conjugate gradient ATI GPU Speed-up 3.5.

22 Iterative Sparse on GPU (2)
Double-precision, real-world SpMV (sparse matrix-vector product)
CPU (2.3 GHz Dual Xeon): 1 GFLOPS
GPU (GTX 280): 16 GFLOPS
Speedup ~16
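
For reference, the SpMV kernel being benchmarked is the product $y = Kx$ with $K$ stored in a sparse format. The sketch below assumes the common compressed sparse row (CSR) layout; the slides do not say which format was used, and the function and array names are illustrative.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = K x for a matrix stored in CSR format.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):          # rows are independent of each other
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# CSR storage of [[4, -1, 0], [-1, 4, -1], [0, -1, 4]].
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
col_idx = [0, 1, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5, 7]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)
```

Each row needs only two flops per nonzero but several memory reads, so the achievable GFLOPS figure is governed by memory bandwidth rather than arithmetic throughput.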

23 FEA/GPU Class Projects?
1. Complete in < 6 weeks
2. Important (publishable)
3. Pilot code

24 FEA/GPU Class Projects?
1. GPU-Friendly Preconditioners for Thin Structures: research papers; OpenCL and ViennaCL pilot code
2. Topology Optimization: research papers; CUDA code
3. Others: can discuss

25 Thin Structure?

26 Thin Structure? Large K

27 Preconditioners?
$Ku = f$; $u^{i+1} = u^i + B(f - Ku^i)$, with $B$ a preconditioner of $K$
Iterative methods: GPU methods are available for $Ku$
Typical preconditioners: simple Jacobi, ...
Poor preconditioner → slow convergence
Objective: GPU-friendly preconditioner for thin structures
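
The effect of a preconditioner on convergence can be observed with a small experiment. The sketch below is a textbook preconditioned conjugate gradient, not the method from the research publication discussed next; the badly scaled test matrix, the function names, and the tolerances are all assumptions made for illustration. Jacobi preconditioning typically needs fewer iterations here because it removes the artificial row and column scaling.

```python
def matvec(K, u):
    return [sum(kij * uj for kij, uj in zip(row, u)) for row in K]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pcg(K, f, apply_B=None, rel_tol=1e-8, max_iter=1000):
    """Conjugate gradient for SPD K; apply_B(r) applies the preconditioner B."""
    if apply_B is None:
        apply_B = list                  # identity: no preconditioning
    u = [0.0] * len(f)
    r = list(f)                         # residual for the zero initial guess
    z = apply_B(r)
    p = list(z)
    rz = dot(r, z)
    stop = rel_tol * dot(f, f) ** 0.5
    for iteration in range(1, max_iter + 1):
        Kp = matvec(K, p)
        alpha = rz / dot(p, Kp)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * kpi for ri, kpi in zip(r, Kp)]
        if dot(r, r) ** 0.5 <= stop:
            return u, iteration
        z = apply_B(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return u, max_iter


# Hypothetical badly scaled SPD matrix: a 1-D Laplacian with rows and
# columns scaled by factors of 1, 10, and 100.
n = 12
scale = [10.0 ** (i % 3) for i in range(n)]
K = [[0.0] * n for _ in range(n)]
for i in range(n):
    K[i][i] = 2.0 * scale[i] ** 2
    if i + 1 < n:
        K[i][i + 1] = K[i + 1][i] = -scale[i] * scale[i + 1]
f = matvec(K, [1.0] * n)


def jacobi(r):
    """B = diag(K)^-1: one independent division per unknown."""
    return [ri / K[i][i] for i, ri in enumerate(r)]


u_plain, it_plain = pcg(K, f)
u_jacobi, it_jacobi = pcg(K, f, apply_B=jacobi)
print("CG iterations without preconditioner:", it_plain)
print("CG iterations with Jacobi preconditioner:", it_jacobi)
```

Jacobi is attractive on the GPU because applying it is embarrassingly parallel, but for ill-conditioned problems such as thin structures a diagonal scaling alone is usually not enough, which motivates the search for a stronger GPU-friendly preconditioner.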

28 Research Publication

29 Basic Idea

30 Algorithm

31 Why Preconditioner?

32 Why Double Precision?

33 How Expensive is Preconditioner?

34 GPU Friendly Speed-up without Preconditioner Speed-up with Preconditioner

35 FEA/GPU Class Projects?
1. GPU-Friendly Preconditioners for Thin Structures: research papers; OpenCL and ViennaCL pilot code
2. Topology Optimization: research papers; CUDA code
3. Others: can discuss

36 Topology Optimization
Stiffest topology for a given volume? Where to remove material?
$\min_{\Omega \subset D} J$ subject to $|\Omega| = V_0$, e.g. $V = 50\%$ [Sigmund 2001]
Multi-Objective + Topology Optimization = MOTO: $\min_{\Omega \subset D} \{J, V\}$

37 Demo Matlab code

38 Pareto Optimal Designs Purely pareto optimal

39 Comparison D

40 3-D Pareto-Method SIMP

41 3-D GPU Implementation Multi-grid Topology Optimization on the GPU (IDETC conf. 2011)

42 Motivation for Topic 4: Sparse Direct Solver

43 Nomenclature & Simplifying Assumptions

44 The Schur Complement Problem in Multi-Body Dynamics Applications

45 Formulation Framework
Position: $r_i = [x_i, y_i, z_i]^T$
Orientation: Euler parameters, $p_i = [e_{i0}, e_{i1}, e_{i2}, e_{i3}]^T$
Translational velocity: $\dot{r}_i = [\dot{x}_i, \dot{y}_i, \dot{z}_i]^T$
Angular velocity: $\omega_i = [\omega_{ix}, \omega_{iy}, \omega_{iz}]^T$

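46 Constrained Equations of Motion
Position constraints: $\Phi(r, p, t) = 0$
Velocity constraints: $\Phi_\eta(r, p, t)\,\dot{r} + \Phi_\rho(r, p, t)\,\omega = -\Phi_t(r, p, t)$
Acceleration constraints: $\Phi_\eta(r, p, t)\,\ddot{r} + \Phi_\rho(r, p, t)\,\dot{\omega} = \tau(\dot{r}, \omega, r, p, t)$
Equations of motion: $\begin{bmatrix} M & 0 \\ 0 & J \end{bmatrix} \begin{bmatrix} \ddot{r} \\ \dot{\omega} \end{bmatrix} + \begin{bmatrix} \Phi_\eta^T(r, p, t) \\ \Phi_\rho^T(r, p, t) \end{bmatrix} \lambda = \begin{bmatrix} F(\dot{r}, \omega, r, p, t) \\ \hat{n}(\dot{r}, \omega, r, p, t) \end{bmatrix}$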

47 Numerical Solution of the Newton-Euler Constrained Equations of Motion
One has to solve a set of Differential Algebraic Equations (DAEs) to find the time evolution of a mechanical system
Most often the numerical solution of the DAEs requires the solution of a linear system of the form:
$\begin{bmatrix} M & 0 & \Phi_\eta^T \\ 0 & J & \Phi_\rho^T \\ \Phi_\eta & \Phi_\rho & 0 \end{bmatrix} \begin{bmatrix} \ddot{r} \\ \dot{\omega} \\ \lambda \end{bmatrix} = \begin{bmatrix} F \\ \hat{n} \\ \tau \end{bmatrix}$

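48 Approach Followed
First solve the Reduced System for $\lambda$: $\begin{bmatrix} \Phi_\eta & \Phi_\rho \end{bmatrix} \begin{bmatrix} M & 0 \\ 0 & J \end{bmatrix}^{-1} \begin{bmatrix} \Phi_\eta^T \\ \Phi_\rho^T \end{bmatrix} \lambda = b$
Then recover the accelerations: $\ddot{r} = M^{-1}(F - \Phi_\eta^T \lambda)$, $\dot{\omega} = J^{-1}(\hat{n} - \Phi_\rho^T \lambda)$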
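
The right-hand side $b$ is not spelled out on this slide. One way to see where it comes from, assuming the block system of slide 47, is to solve the first two block rows for the accelerations and substitute them into the third (acceleration constraint) row:

```latex
\begin{aligned}
\ddot{r} &= M^{-1}\left(F - \Phi_\eta^T \lambda\right), \qquad
\dot{\omega} = J^{-1}\left(\hat{n} - \Phi_\rho^T \lambda\right), \\
\tau &= \Phi_\eta \ddot{r} + \Phi_\rho \dot{\omega}
      = \Phi_\eta M^{-1} F + \Phi_\rho J^{-1} \hat{n}
        - \left(\Phi_\eta M^{-1} \Phi_\eta^T + \Phi_\rho J^{-1} \Phi_\rho^T\right)\lambda, \\
\Longrightarrow \quad
&\underbrace{\left(\Phi_\eta M^{-1} \Phi_\eta^T + \Phi_\rho J^{-1} \Phi_\rho^T\right)}_{E}\,\lambda
 = \underbrace{\Phi_\eta M^{-1} F + \Phi_\rho J^{-1} \hat{n} - \tau}_{b}.
\end{aligned}
```

Under this reading, the reduced matrix is the Schur complement of the mass block in the full system, and it inherits symmetry and positive definiteness when $M$ and $J$ are positive definite and the constraint Jacobian has full row rank.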

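49 Iterative Solution of the Reduced System
Define the positive definite Reduced Matrix $E = \begin{bmatrix} \Phi_\eta & \Phi_\rho \end{bmatrix} \begin{bmatrix} M & 0 \\ 0 & J \end{bmatrix}^{-1} \begin{bmatrix} \Phi_\eta^T \\ \Phi_\rho^T \end{bmatrix}$
Preconditioned Conjugate Gradient
- requires computation, at time $t_n$, of $E_n \lambda^{(k)}$
- requires preconditioning: $E_{\text{old}} \lambda = b$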

50 Computing $E_n \lambda^{(k)}$
Time step $n$, iteration $(k)$: $e_n^{(k)} = E_n \lambda_n^{(k)} = [e_1, e_2, \ldots, e_J]^T \in \mathbb{R}^m$
A thread is associated with each body
We'll look at how thread 9 does its share of the work to compute $e_3$

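51 How Thread-9 Does its Work
S1. Compute reaction forces acting on me: $F^C = (\Phi_3)^T \lambda_3 + (\Phi_5)^T \lambda_5 + (\Phi_6)^T \lambda_6$
S2. Compute my constraint acceleration: $a^C = M^{-1} F^C$
S3. Project my constraint acceleration: $\Pi_3 = \Phi_3 a^C$, $\Pi_5 = \Phi_5 a^C$, $\Pi_6 = \Phi_6 a^C$
Finally, $e_3$ is the sum of two $\Pi_3$ projections, one from each body attached to constraint 3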
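
Steps S1 to S3 can be sketched as a matrix-free product $e = E\lambda$ in which each body does its own share of the work. This is a simplified illustration under stated assumptions, not the course implementation: constraints are scalar, each body has a diagonal inverse mass, and the data layout (`jac[c]` as a dictionary from body index to Jacobian row) and all names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def body_contribution(body, jac, inv_mass, lam):
    """Work done by the thread assigned to `body` (steps S1 to S3).

    jac[c] maps a body index to the Jacobian row of scalar constraint c
    with respect to that body's DOFs; bodies not touched by c are absent.
    """
    adjacent = [c for c in range(len(jac)) if body in jac[c]]
    ndof = len(inv_mass[body])
    # S1: reaction force F^C = sum over adjacent constraints of Phi_c^T lambda_c
    force = [sum(jac[c][body][d] * lam[c] for c in adjacent) for d in range(ndof)]
    # S2: constraint acceleration a^C = M^-1 F^C (diagonal mass assumed)
    accel = [inv_mass[body][d] * force[d] for d in range(ndof)]
    # S3: projections Pi_c = Phi_c a^C, one per adjacent constraint
    return {c: dot(jac[c][body], accel) for c in adjacent}


def apply_E(jac, inv_mass, lam):
    """e = E lambda, accumulated from per-body contributions."""
    e = [0.0] * len(jac)
    for body in range(len(inv_mass)):   # conceptually one GPU thread per body
        for c, projection in body_contribution(body, jac, inv_mass, lam).items():
            e[c] += projection
    return e


# Three bodies with two DOFs each; constraint 0 couples bodies 0 and 1,
# constraint 1 couples bodies 1 and 2.
inv_mass = [[1.0, 0.5], [0.25, 0.25], [1.0, 1.0]]
jac = [{0: [1.0, 0.0], 1: [-1.0, 1.0]},
       {1: [0.0, 1.0], 2: [2.0, -1.0]}]
e = apply_E(jac, inv_mass, [2.0, 4.0])
print(e)
```

In a real parallel implementation the accumulation `e[c] += projection` is where threads meet: two bodies contribute to each constraint, so the sum needs either an atomic update or a second pass that gathers the two projections per constraint.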

52 Iteration Operation Count for Body 9 (Thread-9)
[Table: multiplications and additions for steps S1, S2, S3, expressed in terms of $C_9$; entries not recoverable]

53 Computing $E_n \lambda^{(k)}$ [Concluding Remarks]
- The algorithm scales very well: one thread for each body
- Each thread only interacts with adjacent joints
- Load balance is obtained when the bodies have a similar topology index

54 Direct Solution of the Reduced System

55 The Sparse Direct Solver

56 The Direct Solver: How Things Get Done
In the reduced linear system $E\lambda = b$ each constraint induces an equation
Example: constraint 3 induces an equation with four terms, $E_{33}\lambda_3 + \sum_{j \neq 3} E_{3j}\lambda_j = b_3$, where $j$ runs over the three constraints coupled to constraint 3
Since $E$ is positive definite, $E_{33}$ is also positive definite
Fundamental idea: solve for $\lambda_3$ and substitute it in all the equations where it shows up
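
The "solve for one unknown and substitute" idea can be sketched in a few lines. The example below treats every unknown as a scalar, whereas on the slide $E_{33}$ is a block; the function names and the 3x3 system are assumptions chosen for illustration, and the matrix is stored densely for readability rather than sparsely.

```python
def eliminate(E, b, k):
    """Solve equation k for lambda_k and substitute it into all other equations.

    Returns the reduced system over the remaining unknowns, i.e. the
    Schur complement of the pivot E[k][k].
    """
    keep = [i for i in range(len(b)) if i != k]
    pivot = E[k][k]                     # positive because E is positive definite
    E_red = [[E[i][j] - E[i][k] * E[k][j] / pivot for j in keep] for i in keep]
    b_red = [b[i] - E[i][k] * b[k] / pivot for i in keep]
    return E_red, b_red


def solve_by_elimination(E, b):
    """Eliminate the first remaining unknown repeatedly, then back-substitute."""
    if len(b) == 1:
        return [b[0] / E[0][0]]
    E_red, b_red = eliminate(E, b, 0)
    rest = solve_by_elimination(E_red, b_red)
    coupling = sum(E[0][j + 1] * rest[j] for j in range(len(rest)))
    return [(b[0] - coupling) / E[0][0]] + rest


E = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
lam = solve_by_elimination(E, b)
print(lam)
```

Only the equations in which the eliminated unknown appears are modified, so for a sparse $E$ each elimination touches a small neighborhood of the system. The reduced matrix stays symmetric positive definite, which is why no pivoting for stability is needed.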

57 First Example: Seven-Body Mechanism


59 The Elimination Sequence
The fundamental question is this: in what sequence should the unknowns (the edges of the graph) be eliminated?
- Different elimination sequences result in different levels of effort
- The question becomes more complicated since you are interested in a parallel elimination sequence
- You would like to limit the number of synchronization barriers that you impose in the implementation
- In the end, although it is formulated as solving a linear system, the problem becomes one of starting with a graph and eliminating its edges in parallel
- Similar to a Mikado (pick-up sticks) game that you want to play in parallel
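
How much the elimination sequence matters can be shown with the standard symbolic "elimination game". The sketch below works on the coupling graph of the unknowns (one vertex per unknown, an edge whenever two unknowns appear in the same equation), which is an assumption that differs from the slide's picture where the unknowns are edges of the body graph. The function name and the star-shaped example are hypothetical.

```python
def fill_in(adjacency, order):
    """Count the new couplings (fill) created by eliminating in the given order.

    adjacency maps each unknown to the set of unknowns it is coupled to.
    Eliminating an unknown couples all of its remaining neighbors pairwise.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill = 0
    for v in order:
        nbrs = sorted(graph.pop(v))
        for u in nbrs:
            graph[u].discard(v)
        for i, a in enumerate(nbrs):
            for c in nbrs[i + 1:]:
                if c not in graph[a]:   # a new nonzero appears in the matrix
                    graph[a].add(c)
                    graph[c].add(a)
                    fill += 1
    return fill


# Star coupling: unknown 0 is coupled to 1..4, which are not coupled to each other.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
bad = fill_in(star, [0, 1, 2, 3, 4])    # hub first
good = fill_in(star, [1, 2, 3, 4, 0])   # hub last
print("fill with hub first:", bad)
print("fill with hub last:", good)
```

In the good ordering the four outer unknowns are not coupled to one another, so they could also be eliminated at the same time with a single synchronization point before the hub is handled, which is the kind of parallel sequence the slide is after.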

60 Second Example: HMMWV Model
[Table comparing a bad and a good elimination sequence; columns: Elim. Sequence, A, M, I, F, NNZ; values not recoverable]
Index Reduction
