TFLOP Performance for ANSYS Mechanical
Slide 1: TFLOP Performance for ANSYS Mechanical. Dr. Herbert Güttler, Engineering GmbH, Holunderweg, Bernstadt. Engineering H. Güttler, Seite 1.
Slide 2: May 2009, ANSYS 12, 512 cores, 1 TFLOP per second.
Slide 3: (no text transcribed)
Slide 4: Numerical Effort for a Random Selection of MCE Projects (ANSYS MAPDL, sparse solver). How long will your simulation take? Runtime can vary by an order of magnitude for the same number of DOFs. Source: AnandTech.
Slide 5: Stats data can be found in the sparse solver output, e.g. file.DSP:

    ===========================
    = multifrontal statistics =
    ===========================
    number of equations                     =
    no. of nonzeroes in lower triangle of a =
    no. of nonzeroes in the factor l        =
    ratio of nonzeroes in factor (min/max)  =
    number of super nodes                   =
    maximum order of a front matrix         =
    maximum size of a front matrix          =
    maximum size of a front trapezoid       =
    no. of floating point ops for factor    =  D+13
    no. of floating point ops for solve     =  D+10
    ratio of flops for factor (min/max)     =
    near zero pivot monitoring activated
    number of pivots adjusted               = 0
    negative pivot monitoring activated
    number of negative pivots encountered   = 0
    factorization panel size                = 128
    number of cores used                    = 64
    GPU acceleration activated
    percentage of GPU accelerated flops     =
    time (cpu & wall) for structure input   =
    time (cpu & wall) for ordering          =
    time (cpu & wall) for other matrix prep =
    time (cpu & wall) for value input       =
    time (cpu & wall) for matrix distrib.   =
    time (cpu & wall) for numeric factor    =
    computational rate (mflops) for factor  =
    time (cpu & wall) for numeric solve     =
    computational rate (mflops) for solve   =
    effective I/O rate (MB/sec) for solve   =
    Memory allocated on core 0              =  MB
    Memory allocated on core 1              =  MB
    ...
    Memory allocated on core 62             =  MB
    Memory allocated on core 63             =  MB
    Total Memory allocated by all cores     =  MB
    DSP Matrix Solver CPU Time (sec)        =
    DSP Matrix Solver ELAPSED Time (sec)    =
    DSP Matrix Solver Memory Used (MB)      =

(Numeric values lost in transcription.)
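The statistics block above is plain "key = value" text, so the figures can be pulled out of a .DSP file with a few lines of script. A minimal sketch in Python; the field names follow the listing above, but the sample values are placeholders, since the slide's actual numbers were not captured in this transcription.

```python
import re

def parse_dsp_stats(text):
    """Collect 'key = value' lines from an ANSYS .DSP multifrontal
    statistics block into a dict (values kept as strings)."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*=\s*(\S.*)$", line)
        if m:
            stats[m.group(1)] = m.group(2).strip()
    return stats

# Placeholder sample (real values come from the solver's file.DSP):
sample = """
 number of equations                  =      5000000
 no. of floating point ops for factor =   2.6000D+14
 number of cores used                 =           64
"""
stats = parse_dsp_stats(sample)
# Fortran-style D exponent -> Python float:
factor_flops = float(stats["no. of floating point ops for factor"].replace("D", "E"))
print(stats["number of cores used"], factor_flops)
```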
Slide 6: Performance Results.
Slide 7: Numerical Effort for a Random Selection of MCE Projects (ANSYS MAPDL, sparse solver): 260 s on a 1 TFLOP/s machine, 40 s on a 2 TFLOP/s machine. Source: AnandTech.
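A runtime figure like "260 s on a 1 TFLOP/s machine" follows directly from dividing the factorization flop count by the sustained floating-point rate. A small worked sketch, assuming (purely for illustration) a job with 2.6e14 factor flops; that flop count is not taken from the slide:

```python
def factor_time_s(factor_flops, sustained_tflops):
    """Lower-bound wall time for the numeric factorization:
    total factor flops / sustained floating-point rate."""
    return factor_flops / (sustained_tflops * 1e12)

# 2.6e14 factor flops at a sustained 1 TFLOP/s:
print(factor_time_s(2.6e14, 1.0))  # 260.0 seconds
```

This is only a lower bound: a real run also spends time on ordering, matrix input, distribution, and the triangular solve, which is why measured times for the same DOF count can still vary widely.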
Slide 8: Current Status of HPC Computing. Source: AnandTech.
Slide 9: Tools (Hardware: Oct 2010). Compute servers: 8 Intel Harpertown systems (Sun X4150), a total of 64 cores and 496 GB RAM; 16 Intel Nehalem systems (Sun X4170), a total of 128 cores and 1140 GB RAM; memory per core typically 8 GB. InfiniBand interconnect across servers; each server with a local RAID 0 disk array. Operating system: SUSE Linux Enterprise Server. Latest addition: 1 AMD Opteron 6172 system (Magny-Cours), 48 cores, 192 GB RAM. UPS and air conditioning; max. power consumption ~18 kW. Applications: ANSYS Mechanical, optiSLang.
Slide 10: Interconnect: FDR Performance. Table of latencies (µs) and communication speeds (MB/sec) from the master to selected cores (cores 1-3, 9-11, 16-18, 28-31), grouped as core to core (on die), socket to socket, and node to node. Numeric values lost in transcription.
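The grouping at the bottom of the slide (core to core on die, socket to socket, node to node) can be read off a ping-pong latency measurement. A rough illustration in Python; the microsecond thresholds and sample latencies are assumptions chosen for the sketch, not values from the slide:

```python
def classify_link(latency_us):
    """Map a measured master-to-core ping-pong latency onto a link
    class. The thresholds below are illustrative assumptions only."""
    if latency_us < 1.0:
        return "core-core (on die)"
    elif latency_us < 3.0:
        return "socket-socket"
    else:
        return "node-node (interconnect)"

# Hypothetical measurements in microseconds:
for lat in (0.4, 1.8, 6.5):
    print(f"{lat} us -> {classify_link(lat)}")
```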
Slide 11: Tools (Hardware: Jan 2013). 128 E5 Sandy Bridge cores at 2.9 GHz (up to 4 GPUs per node); 156 Westmere cores at 2.9 GHz (up to 2 GPUs per node).
Slide 12: Tools (Hardware: April 2013). 128 E5 Sandy Bridge cores at 2.9 GHz, 8 nodes in a 4U case, 1 TB RAM, 1.1 to 3.5 kW (theoretical peak 2.5 TFLOPs).
Slide 13: Tools (Hardware: June 2013). 2 nodes with a total of 32 E5 cores at 2.9 GHz plus 8 K20X GPUs, 0.4 TB RAM; power draw (kW) and theoretical peak (TFLOPs) not captured in transcription.
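The "theoretical peak" figures on these hardware slides come from cores x clock x FLOPs per cycle. A quick check, assuming 8 double-precision FLOPs per cycle for a Sandy Bridge core with AVX (an assumption about the hardware, not a number from the slides):

```python
def peak_tflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical double-precision peak in TFLOPs:
    cores x clock (GHz) x FLOPs per cycle / 1000."""
    return cores * clock_ghz * flops_per_cycle / 1000.0

# 128 Sandy Bridge E5 cores at 2.9 GHz, 8 DP FLOPs/cycle with AVX:
cpu_peak = peak_tflops(128, 2.9, 8)
print(round(cpu_peak, 2))  # 2.97 TFLOPs
```

For the GPU side, NVIDIA's advertised double-precision peak for one Kepler K20X is about 1.31 TFLOPs, so 8 of them would add roughly 10.5 TFLOPs of theoretical peak on top of the CPUs.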
Slide 14: Comparison for 5 MDOF model (R14.5.7). Side-by-side multifrontal statistics: without GPUs (16x E5-2690, 128 cores used) versus with dual GPUs per node (4x E5 + 8x Kepler K20X, 32 cores used, GPU acceleration activated). Both listings follow the same file.DSP statistics format shown on slide 5; numeric values lost in transcription.
Slide 15: Applications.
Slide 16: Example: Ball Grid Array. Model summary table (per-processor max/min counts of elements, nodes, shared nodes, and DOFs; numeric values lost in transcription). SOLID186 and SOLID187 elements only, no contact elements. Components: mold, solder balls, PCB.
Slide 17: HPC with ANSYS.
Slide 18: HPC with ANSYS 14.0.
Slide 19: HPC with ANSYS 14.0.
Slide 20: HPC with ANSYS 14.5.
Slide 21: BGA Benchmark with R14.5 on Sandy Bridge Xeons + GPUs (single node / workstation class).
Slide 22: BGA Benchmark with R14.5 (compilation of all results).
Slide 23: GPU Acceleration in Real Life. Hardware: E5 Xeon plus Tesla K20X accelerator, DSPARSE solver; GPU duty cycle of ca. [value lost in transcription] %.
Slide 24: Next Steps.
Slide 25: Applications: BGA, LQFP; individual components and system-level analysis, with a focus on solder creep.
Slide 26: Benchmark Results: Leda. Benchmark procedure across ANSYS 11, 12, 12.x, 13, 14, 14.5, and 14.5 SP02 (table columns scrambled in transcription; recoverable figures below).
- Thermal (full model), 3 MDOF: 4 h (8 cores); 1 h (8 cores + 1 GPU); 0.8 h (32 cores).
- Thermomechanical simulation (full model), 7.8 MDOF, with submodel: ~5.5 days for 163 iterations (8 cores); 37 h for 16 load steps; 34.3 h for 164 iterations (20 cores); 38.5 h (16 cores); results identical to ANSYS 11.
- Interpolation of boundary conditions: 0.2 h (improved algorithm).
- Creep strain analysis, 5.5 MDOF: 12.5 h for 195 iterations (64 cores); 6.1 h for 488 iterations (128 cores); 7.5 h for 195 iterations (128 cores); 4.2 h (256 cores); 6.4 h for 196 iterations (128 E5 cores); 4 h for 498 iterations (128 E5 cores); 4.8 h (128 E5 cores + 16 GPUs); 7.2 h for 196 iterations (72 cores + 12 GPUs); 5.5 h for 498 iterations (72 cores + 12 GPUs); further runs with 492 iterations (76 cores) and 498 iterations (64 cores + 8 GPUs).
- Overall runtime trend across releases: 2 weeks, 5 days, 2 days, 1 day, 1/2 day. Best performance with E5 Xeons.
- All runs with the SMP sparse or DSPARSE solver. Hardware for ANSYS 11 & 12: dual X5460 (3.16 GHz Harpertown Xeon). Hardware for later releases: dual X5570 (2.93 GHz Nehalem Xeon) or dual X5670 (2.93 GHz Westmere Xeon) with M207x NVIDIA GPUs; 14.5 results also with dual E5 (2.9 GHz Sandy Bridge Xeon). Creep runs with NROPT,,crpl and DDOPT,metis; runs with InfiniBand interconnect.
Slide 27: Comparison: 2009 vs. 2013. Update 2013: software costs dominate (128 cores).
Slide 28: Examples: periodic structure, identical pins.
Slide 29: Comparison for 5 MDOF model (with contacts; R14.5). Side-by-side multifrontal statistics: without GPUs (E5-2690, 128 cores used) versus with dual GPUs (E5-2690, 64 cores used, GPU acceleration activated). Both listings follow the same file.DSP statistics format shown on slide 5; numeric values lost in transcription.
Slide 30: GPU Performance, tested with a mold-injected part (with fibers).
Slide 31: Objective. For a plastic cover produced by injection molding from a fiber-reinforced plastic (PA66 GF30), there is considerable variation in the material properties caused by variation in the fiber orientation direction; furthermore, the degree of orientation varies locally. The fiber orientation can be calculated outside of ANSYS and mapped onto the model. However, a much finer mesh is needed to represent the locally varying material accurately than in the case of a homogeneous material. During a customer project we made a study with models of different mesh density (meshed inside Workbench) to investigate the displacements under thermal load. The model is a simple bulk model (SOLID186), no contacts, no material nonlinearities. Coarse model (2 mm tets): 0.7 MDOF. Medium model (0.5 mm hex-dominant): 5.9 MDOF.
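Mapping externally computed fiber-orientation data onto the analysis mesh amounts to transferring point data between two meshes. A deliberately minimal nearest-neighbour sketch in Python; real mapping tools (and the workflow on this slide) interpolate full orientation tensors rather than copying a scalar from the closest point, and all names and data below are illustrative assumptions:

```python
import math

def map_orientation(source_pts, source_vals, target_pts):
    """Transfer a per-point quantity (e.g. a scalar degree of fiber
    alignment) from a molding-simulation mesh onto a finer FE mesh
    by nearest-neighbour lookup. O(n*m), fine for a sketch; real
    tools use spatial search trees and tensor interpolation."""
    out = []
    for p in target_pts:
        nearest = min(range(len(source_pts)),
                      key=lambda i: math.dist(p, source_pts[i]))
        out.append(source_vals[nearest])
    return out

# Hypothetical data: two source points, three target nodes.
src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
vals = [0.9, 0.3]
tgt = [(0.1, 0.0, 0.0), (0.8, 0.0, 0.0), (0.6, 0.1, 0.0)]
print(map_orientation(src, vals, tgt))  # [0.9, 0.3, 0.3]
```

The finer the target mesh relative to the source data, the more nodes share the same source value, which is exactly why the slide notes that a much finer mesh is needed to resolve the locally varying material.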
Slide 32: Objective: Orientation, Material, Mapping.
Slide 33: Model, 0.5 mm hex-dominant mesh.
Slide 34: Difference in Displacements (free expansion), 2 mm tet mesh vs. 0.5 mm hex-dominant mesh. Coarse model (2 mm tets): 0.7 MDOF; medium model (0.5 mm hex-dominant): 5.9 MDOF. Displacement range differs by about 50%.
Slide 35: Results for the 0.5 mm hex-dominant model: 100% speedup when using GPUs and the latest hardware.
Slide 36: Conclusions. ANSYS Mechanical routinely delivers TFLOP-per-second performance in an HPC environment. Highest peak performance is reached with GPUs (and a suitable case); a conventional CPU-only solution provides similar performance with fewer surprises. GPU licensing and stability are critical for adoption.
Slide 37: Acknowledgements. Jeff Beisheim, ANSYS Inc.; Erke Wang, Peter Tiefenthaler, CADFEM GmbH; Natalja Schafet, Wolfgang Müller-Hirsch, Robert Bosch GmbH; Philipp Schmid, Holger Mai, Engineering GmbH.
Slide 38: (no text transcribed)
Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.
More informationPerformance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem
Performance Comparison between Blocking and Non-Blocking Communications for a Three-Dimensional Poisson Problem Guan Wang and Matthias K. Gobbert Department of Mathematics and Statistics, University of
More informationQLogic TrueScale InfiniBand and Teraflop Simulations
WHITE Paper QLogic TrueScale InfiniBand and Teraflop Simulations For ANSYS Mechanical v12 High Performance Interconnect for ANSYS Computer Aided Engineering Solutions Executive Summary Today s challenging
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationPhilippe Thierry Sr Staff Engineer Intel Corp.
HPC@Intel Philippe Thierry Sr Staff Engineer Intel Corp. IBM, April 8, 2009 1 Agenda CPU update: roadmap, micro-μ and performance Solid State Disk Impact What s next Q & A Tick Tock Model Perenity market
More informationComputer Aided Engineering with Today's Multicore, InfiniBand-Based Clusters ANSYS, Inc. All rights reserved. 1 ANSYS, Inc.
Computer Aided Engineering with Today's Multicore, InfiniBand-Based Clusters 2006 ANSYS, Inc. All rights reserved. 1 ANSYS, Inc. Proprietary Our Business Simulation Driven Product Development Deliver superior
More informationPyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent
PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary. Motivation
More informationOzenCloud Case Studies
OzenCloud Case Studies Case Studies, April 20, 2015 ANSYS in the Cloud Case Studies: Aerodynamics & fluttering study on an aircraft wing using fluid structure interaction 1 Powered by UberCloud http://www.theubercloud.com
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationThe AMD64 Technology for Server and Workstation. Dr. Ulrich Knechtel Enterprise Program Manager EMEA
The AMD64 Technology for Server and Workstation Dr. Ulrich Knechtel Enterprise Program Manager EMEA Agenda Direct Connect Architecture AMD Opteron TM Processor Roadmap Competition OEM support The AMD64
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationHMEM and Lemaitre2: First bricks of the CÉCI s infrastructure
HMEM and Lemaitre2: First bricks of the CÉCI s infrastructure - CÉCI: What we want - Cluster HMEM - Cluster Lemaitre2 - Comparison - What next? - Support and training - Conclusions CÉCI: What we want CÉCI:
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationLS-DYNA Performance Benchmark and Profiling. October 2017
LS-DYNA Performance Benchmark and Profiling October 2017 2 Note The following research was performed under the HPC Advisory Council activities Participating vendors: LSTC, Huawei, Mellanox Compute resource
More informationAutomatic Tuning of the High Performance Linpack Benchmark
Automatic Tuning of the High Performance Linpack Benchmark Ruowei Chen Supervisor: Dr. Peter Strazdins The Australian National University What is the HPL Benchmark? World s Top 500 Supercomputers http://www.top500.org
More informationThe Road from Peta to ExaFlop
The Road from Peta to ExaFlop Andreas Bechtolsheim June 23, 2009 HPC Driving the Computer Business Server Unit Mix (IDC 2008) Enterprise HPC Web 100 75 50 25 0 2003 2008 2013 HPC grew from 13% of units
More informationPerformance Analysis and Prediction for distributed homogeneous Clusters
Performance Analysis and Prediction for distributed homogeneous Clusters Heinz Kredel, Hans-Günther Kruse, Sabine Richling, Erich Strohmaier IT-Center, University of Mannheim, Germany IT-Center, University
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationSATGPU - A Step Change in Model Runtimes
SATGPU - A Step Change in Model Runtimes User Group Meeting Thursday 16 th November 2017 Ian Wright, Atkins Peter Heywood, University of Sheffield 20 November 2017 1 SATGPU: Phased Development Phase 1
More informationImproving Your Structural Mechanics Simulations with Release 14.0
Improving Your Structural Mechanics Simulations with Release 14.0 1 What will Release 14.0 bring you? 2 Let s now take a closer look at some topics 3 MAPDL/WB Integration Finite Element Information Access
More informationProperly Sizing Processing and Memory for your AWMS Server
Overview This document provides guidelines for purchasing new hardware which will host the AirWave Wireless Management System. Your hardware should incorporate margin for WLAN expansion as well as future
More informationHPC Current Development in Indonesia. Dr. Bens Pardamean Bina Nusantara University Indonesia
HPC Current Development in Indonesia Dr. Bens Pardamean Bina Nusantara University Indonesia HPC Facilities Educational & Research Institutions in Indonesia CIBINONG SITE Basic Nodes: 80 node 2 processors
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationPerformance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet. Swamy N. Kandadai and Xinghong He and
Performance of HPC Applications over InfiniBand, 10 Gb and 1 Gb Ethernet Swamy N. Kandadai and Xinghong He swamy@us.ibm.com and xinghong@us.ibm.com ABSTRACT: We compare the performance of several applications
More informationAdaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics H. Y. Schive ( 薛熙于 ) Graduate Institute of Physics, National Taiwan University Leung Center for Cosmology and Particle Astrophysics
More informationTwo-Phase flows on massively parallel multi-gpu clusters
Two-Phase flows on massively parallel multi-gpu clusters Peter Zaspel Michael Griebel Institute for Numerical Simulation Rheinische Friedrich-Wilhelms-Universität Bonn Workshop Programming of Heterogeneous
More information(software agnostic) Computational Considerations
(software agnostic) Computational Considerations The Issues CPU GPU Emerging - FPGA, Phi, Nervana Storage Networking CPU 2 Threads core core Processor/Chip Processor/Chip Computer CPU Threads vs. Cores
More informationHigh Performance Computing with Accelerators
High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing
More informationThe determination of the correct
SPECIAL High-performance SECTION: H i gh-performance computing computing MARK NOBLE, Mines ParisTech PHILIPPE THIERRY, Intel CEDRIC TAILLANDIER, CGGVeritas (formerly Mines ParisTech) HENRI CALANDRA, Total
More informationEnhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations
Performance Brief Quad-Core Workstation Enhancing Analysis-Based Design with Quad-Core Intel Xeon Processor-Based Workstations With eight cores and up to 80 GFLOPS of peak performance at your fingertips,
More informationAPENet: LQCD clusters a la APE
Overview Hardware/Software Benchmarks Conclusions APENet: LQCD clusters a la APE Concept, Development and Use Roberto Ammendola Istituto Nazionale di Fisica Nucleare, Sezione Roma Tor Vergata Centro Ricerce
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationEfficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling
Iterative Solvers Numerical Results Conclusion and outlook 1/22 Efficient multigrid solvers for strongly anisotropic PDEs in atmospheric modelling Part II: GPU Implementation and Scaling on Titan Eike
More informationHigh performance Computing and O&G Challenges
High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating
More informationIndustrial finite element analysis: Evolution and current challenges. Keynote presentation at NAFEMS World Congress Crete, Greece June 16-19, 2009
Industrial finite element analysis: Evolution and current challenges Keynote presentation at NAFEMS World Congress Crete, Greece June 16-19, 2009 Dr. Chief Numerical Analyst Office of Architecture and
More informationComputing on GPU Clusters
Computing on GPU Clusters Robert Strzodka (MPII), Dominik Göddeke G (TUDo( TUDo), Dominik Behr (AMD) Conference on Parallel Processing and Applied Mathematics Wroclaw, Poland, September 13-16, 16, 2009
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More information