Understanding Hardware Selection to Speedup Your CFD and FEA Simulations
- Cecilia Carr
- 5 years ago
1 Understanding Hardware Selection to Speedup Your CFD and FEA Simulations 1
2 Agenda Why Talking About Hardware HPC Terminology ANSYS Work-flow Hardware Considerations Additional resources 2
4 Most Users Constrained by Hardware 4 Source: HPC Usage survey with over 1,800 ANSYS respondents
5 Problem Statement I am not achieving the performance and throughput I was expecting from my hardware & software 5 Image courtesy of Intel Corporation
6 Building A Balanced System Is The Key To Improving Your Experience If Your System Is Slow So Are Your Engineers & Analysts Networks Storage Memory Processors 6
7 What Hardware Configuration to Select? CPUs? GPUs? Clusters? Interconnects? HDD vs. SSD SMP vs. DMP The right combination of hardware and software leads to maximum efficiency 7
8 Agenda Why Talking About Hardware HPC Terminology ANSYS Work-flow Hardware Considerations Additional resources 8
9 HPC Hardware Terminology Machine 1 (or Node 1) Machine N (or Node N) Processor 1 (or Socket 1) Processor 1 (or Socket 1) Processor 2 (or Socket 2) Processor 2 (or Socket 2) GPU GPU Interconnect (GigE or InfiniBand) 9
10 Shared Memory Parallel Machine 1 (or Node 1) Processor 1 (or Socket 1) Shared memory parallel (SMP) systems share a single global memory image that may be distributed physically across multiple cores, but is globally addressable. OpenMP is the industry standard. 10
11 Distributed Memory Parallel Machine 1 (or Node 1) Processor 1 (or Socket 1) Distributed memory parallel processing (DMP) assumes that physical memory for each process is separate from all other processes. Parallel processing on such a system requires some form of message passing software to exchange data between the cores. 11 MPI (Message Passing Interface) is the industry standard for this.
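The SMP/DMP distinction above can be sketched in a few lines. This is an illustrative sketch only (not ANSYS or MPI code): Python's multiprocessing stands in for MPI ranks, with each worker summing its own chunk in a private address space and sending its partial result back as a message; the names worker and distributed_sum are hypothetical.

```python
# Illustrative DMP sketch: separate processes (separate address spaces)
# exchange data only by message passing, in the spirit of MPI.
from multiprocessing import Process, Pipe

def worker(conn, chunk):
    # Each process computes a partial sum in its own private memory...
    conn.send(sum(chunk))
    conn.close()

def distributed_sum(data, nprocs=2):
    chunks = [data[i::nprocs] for i in range(nprocs)]
    pipes, procs = [], []
    for chunk in chunks:
        parent, child = Pipe()
        p = Process(target=worker, args=(child, chunk))
        p.start()
        pipes.append(parent)
        procs.append(p)
    total = sum(conn.recv() for conn in pipes)  # ...and a "master" gathers them
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(distributed_sum(list(range(100))))  # 4950
```

In an SMP code the workers would instead be threads reading and writing one shared array directly, which is exactly the programming-model difference the two slides describe.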
12 Agenda Why Talking About Hardware HPC Terminology ANSYS Work-flow Hardware Considerations Additional resources 12
13 Typical HPC Growth Path Desktop User Workstation and/or Server Users Cluster Users Cloud Solution 13
14 Remote Visualization Ideal for: remote users submitting jobs from a Windows machine to a Linux cluster; local users submitting jobs to a Linux cluster; users that do not have enough power (memory or graphics) on their local workstation to build large meshes or view graphics. ANSYS 16.0 supports the following remote visualization applications: NICE Desktop Cloud Visualization (DCV) 2013 (Linux server + Linux/Windows client); OpenText Exceed onDemand 8 SP2/SP3 (Linux server + Linux/Windows client); RealVNC Enterprise Edition with VirtualGL (Linux server + Linux/Windows client); on a Windows cluster, Microsoft Remote Desktop. Remote visualization servers require: GPU-capable video cards; large amounts of RAM accessible to multiple users; availability when running ANSYS applications and pre/post-processing. 14
15 Virtual Desktop (VDI) Support Key focus area at ANSYS (internal use & software QA). Focus on GPU pass-through: one GPU per VM, up to 8 VMs per machine (K1, K2 cards); memory constraints will limit this in any case. vGPU (NVIDIA GRID) will be adopted as it matures; being tested internally. Not software rendering, not shared GPU (too slow). Supported at R16.0. 15
16 ANSYS Remote Solve Manager (RSM) Desktop Server Cluster (with 3rd-party scheduler) The Remote Solve Manager (RSM) is a GUI-based job queuing system that distributes simulation tasks to (shared) computing resources. RSM as a scheduler: submits to RSM itself. RSM as a transport mechanism: submits through RSM to a high-level scheduler such as LSF, PBS Pro, Windows HPC Server 2008 R2 / 2012, and Univa Grid Engine (at R15.0). RSM enables tasks to be: run in background mode on the local machine; sent to a remote compute machine; broken into a series of jobs for parallel processing across a variety of computers. Unit of recognition: jobs (e.g. a run of a solver such as CFX, Fluent, Mechanical) and cores. 16
19 RSM Usage Scenarios Submission from a client to a centralized (shared) compute resource, allowing back-ground queuing on a centralized machine multiple users to share a common, usually large memory/fast machine (compared to client machine) Submission from a client to multiple (shared) compute resources, allowing back-ground queuing on a centralized machine that submits to other machines (compute servers) multiple users to share user workstations (often at night) using the RSM Limit Times for Job Submission feature Submission from a client to a centralized (shared) compute resource with a job scheduler, allowing back-ground queuing on a centralized machine that submits to a job scheduler (e.g. LSF) multiple users to run multi-node jobs on shared compute resources 19
20 Recent Enhancements in RSM Improved robustness and scalability. Added support for Univa Grid Engine. Added support for Mechanical/MAPDL restart. Non-root users on Linux can now use the RSM wizard. Enriched support for RSM customization. Added component override for design point update. Improved efficiency of design point updates. Example: parametric optimization of an intake manifold (initial vs. optimized). Design objectives: equal fresh and exhaust gas mass flow distribution to each cylinder; minimize the overall pressure drop. Input parameters: radii of 3 fillets near the inlet (8 design points). ~5.0x speed-up over sequential execution. 20
21 Guidelines Know your hardware lifecycle. Have a goal in mind for what you want to achieve. Use licensing productively. Use ANSYS-provided processes effectively. 21
22 Agenda Why Talking About Hardware HPC Terminology ANSYS Work-flow Hardware Considerations Additional resources 22
23 What Hardware Configuration to Select? CPUs? GPU/Phi? HDD vs. SSD SMP vs. DMP Clusters? Interconnects? 23
24 Understanding the effect of clock speed Generally, ANSYS applications scale with clock frequency. Cost/performance argues for a high clock (but maybe not the top bin). Using a higher clock speed is always helpful to realize productivity gains. ANSYS DMP benchmarks (8 cores): the clock effect is highest for the sparse solver. 24
25 Understanding the effect of memory bandwidth - Is 24 Cores Equal to 24 Cores? 3 x (2 x 4) = 24 cores (Xeon X5570) vs. 2 x (2 x 6) = 24 cores (Xeon X5670). Consider memory per core! 26
27 Understanding the effect of memory bandwidth - Is 16 Cores Equal to 16 Cores? 2 x (2 x 4) = 16 cores (Xeon X5570) vs. 2 x (2 x 4) = 16 cores (Xeon X5670). Using fewer cores per node can be helpful to realize productivity gains. 27
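The "memory per core" point behind these slides is simple arithmetic, sketched here under illustrative assumptions (triple-channel memory at 1333 MT/s, typical of the Xeon 5500/5600-era parts mentioned; bandwidth_per_core is a hypothetical helper, not a vendor formula):

```python
# Back-of-envelope estimate (illustrative assumptions, not vendor specs):
# per-socket memory bandwidth = channels * transfer rate * 8 bytes/transfer.
def bandwidth_per_core(channels, mt_per_s, cores_per_socket):
    """Rough GB/s of memory bandwidth available to each core."""
    socket_gb_s = channels * mt_per_s * 8 / 1e3  # MT/s * 8 B -> MB/s -> GB/s
    return socket_gb_s / cores_per_socket

# On the same memory system, a 4-core socket leaves each core more
# bandwidth headroom than a 6-core socket:
print(round(bandwidth_per_core(3, 1333, 4), 1))  # 8.0 (GB/s per core)
print(round(bandwidth_per_core(3, 1333, 6), 1))  # 5.3 (GB/s per core)
```

This is why "24 cores" on fewer, denser sockets can be slower than "24 cores" spread over more sockets, and why leaving cores idle on a node can pay off for bandwidth-bound solvers.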
28 Understanding the effect of memory bandwidth - ANSYS Mechanical Consider memory per core! 28
29 Understanding the effect of memory speed We can see here the effect of memory speed, which has implications for how you build your hardware. Some processor types have slower memory speeds by default. On other processors, non-optimally filling the memory channels can slow the memory speed. This has an effect on memory bandwidth. Using a higher memory speed can be helpful to realize productivity gains. 29
30 Turbo Boost (Intel) / Turbo Core (AMD) - ANSYS CFD Turbo Boost (Intel) / Turbo Core (AMD) is a form of over-clocking that gives more GHz to individual cores when others are idle. With Intel processors we have seen variable performance with this, ranging between 0-8% improvement depending on the number of cores in use. The graph below is for CFX on an Intel X5550, which sees a maximum of only 2.5% improvement. 30
31 Turbo Boost (Intel) / Turbo Core (AMD) - ANSYS Mechanical Relative to 1 core, we see good performance gains in many cases by using Turbo Boost on the E5 processor family. Using Turbo Boost / Turbo Core can be helpful to realize productivity gains, particularly at lower core counts. 31
32 Hyper-threading Evaluation of Hyper-Threading on ANSYS Fluent performance: iDataPlex M3 (Intel Xeon X5670, 2.93 GHz), Turbo ON; measurement is improvement relative to Hyper-Threading OFF. HT OFF: 12 threads on 12 physical cores. HT ON: 24 threads on 12 physical cores. Models: eddy_417k, turbo_500k, aircraft_2m, sedan_4m, truck_14m. Hyper-threading is NOT recommended. 32
33 Generation to Generation - ANSYS Mechanical Optimized for Intel Xeon E5 v3 processors: ANSYS Mechanical 16.0 performs well on the latest Intel processor architecture. A Haswell processor-based system is 20% to 40% faster than a Sandy Bridge processor-based system for a variety of benchmarks. 34
34 ANSYS Fluent on Intel Ivy Bridge Ivy Bridge vs. Sandy Bridge, single node. Ivy Bridge is the "tick" release of Sandy Bridge: similar micro-architecture, more cores, reduced power. Expect similar core-to-core performance on Ivy Bridge and Sandy Bridge, with improved node-to-node performance. Single-node performance of ANSYS Fluent 14.5 over six benchmark cases (turbo_500k, eddy_417k, aircraft_2m, sedan_4m, truck_14m, truck_poly_14m), 2x8-core Sandy Bridge vs. 2x12-core Ivy Bridge: a 50% performance boost matches the core count increase. Scaling is maintained at the higher core density, achieved via efficient memory use (and higher RAM speed). ANSYS, Inc. June 18, 2015
35 ANSYS Fluent Ivy Bridge vs. Sandy Bridge Scaling Multi-node performance of ANSYS Fluent 14.5, up to 192 cores. Nearly identical core-to-core scaling confirms system balance for Fluent. Chart: truck_14m Fluent solver rating vs. number of cores, Sandy Bridge vs. Ivy Bridge.
36 Per Node vs. Per Core Comparisons This is a 4-socket vs. 2-socket node comparison: Xeon E v GHz (4 socket) vs. Xeon E v GHz (2 socket). From the per-node comparison you'd assume it was better to go with the 4-socket system; per core, however, the 2-socket system is the better choice. Neither shows linear scalability, as they are running on all the cores per node (bandwidth constrained). 37
37 Generation to Generation - ANSYS Fluent ANSYS Application Example Case Details: Flow through a Combustor Number of cells: 12 Million Cell Type: Polyhedra Models used: Realizable K-ε turbulence Pressure based coupled, species transport, Least Square cell based, pseudo transient 38
38 Generation to Generation - ANSYS Fluent ANSYS Application Example Case Details: External flow over a passenger sedan Number of cells: 4 Million Cell Type: Mixed Models used: Standard K-ε turbulence Solver: Pressure based coupled, steady, Green-Gauss cell based 39
39 Recap Faster cores mean a faster solution. Faster memory means a faster solution. Memory bandwidth is an important factor for (linear) scalability. Turbo Boost / Turbo Core modes do give some benefit, especially at low core counts per node. In general, hyper-threading should not be used because of licensing implications. Be careful when looking at comparisons! Make sure you are comparing like with like! 40
40 What Hardware Configuration to Select? CPUs? GPU/Phi? HDD vs. SSD SMP vs. DMP Clusters? Interconnects? 41
41 Understanding the effect of the interconnect Need fast interconnects to feed fast processors Two main characteristics for each interconnect: latency and bandwidth Distributed ANSYS is highly bandwidth bound D I S T R I B U T E D A N S Y S S T A T I S T I C S Release: 14.5 Build: UP Platform: LINUX x64 Date Run: 08/09/2012 Time: 23:07 Processor Model: Intel(R) Xeon(R) CPU E GHz Total number of cores available : 32 Number of physical cores available : 32 Number of cores requested : 4 (Distributed Memory Parallel) MPI Type: INTELMPI Core Machine Name Working Directory hpclnxsmc00 /data1/ansyswork 1 hpclnxsmc00 /data1/ansyswork 2 hpclnxsmc01 /data1/ansyswork 3 hpclnxsmc01 /data1/ansyswork Latency time from master to core 1 = microseconds Latency time from master to core 2 = microseconds Latency time from master to core 3 = microseconds Communication speed from master to core 1 = MB/sec Same machine Communication speed from master to core 2 = MB/sec QDR Infiniband Communication speed from master to core 3 = MB/sec QDR Infiniband 42
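Latency and bandwidth of the kind reported in the statistics above can be estimated with a ping-pong test. A minimal sketch, assuming a loopback socketpair as a stand-in for a real interconnect (proper MPI-level tools such as osu_latency or the Intel MPI Benchmarks measure this across nodes); pingpong_latency_us is a hypothetical helper:

```python
# Illustrative latency micro-benchmark: bounce a 1-byte message back and
# forth and take half the average round-trip time as the one-way latency.
import socket
import time

def pingpong_latency_us(iters=1000):
    """Estimate one-way message latency over a loopback socketpair."""
    a, b = socket.socketpair()
    t0 = time.perf_counter()
    for _ in range(iters):
        a.sendall(b"x")   # 1-byte "message" out...
        b.recv(1)
        b.sendall(b"x")   # ...and echoed straight back
        a.recv(1)
    elapsed = time.perf_counter() - t0
    a.close()
    b.close()
    return elapsed / iters / 2 * 1e6  # half the round trip, in microseconds

if __name__ == "__main__":
    print(f"loopback latency ~ {pingpong_latency_us():.1f} us")
```

The same ping-pong idea with large messages instead of 1-byte ones yields the bandwidth figure; Distributed ANSYS, as the slide notes, is mostly sensitive to the latter.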
42 Understanding the effect of the interconnect - ANSYS Fluent ANSYS Fluent performance: iDataPlex M3 (Intel Xeon X5670, 12C 2.93 GHz). Networks: Gigabit, 10-Gigabit, 4X QDR InfiniBand (QLogic, Voltaire). Hyper-threading OFF, Turbo ON. Model: truck_14m. Chart: Fluent rating (higher is better) vs. number of cores used by a single job, by interconnect (QLogic, Voltaire, 10-Gigabit, Gigabit). 43
43 Understanding the effect of the interconnect - ANSYS Fluent Exhaust model ( M cells): transient simulation with explicit time stepping for an engine startup cycle. Fujitsu PRIMERGY CX250 HPC systems (E5-2690v2 with 20 and E5-2697v2 with 24 cores per node, respectively). For CFD we can see the performance of InfiniBand vs. GigE: GigE starts to drop off after 2 nodes. 44
44 Understanding the effect of the interconnect - ANSYS Fluent For CFD, 10GigE starts to taper off after 8 nodes. 45
45 Understanding the effect of the interconnect - ANSYS Mechanical V13sp-5 model: turbine geometry, 2,100K DOF, SOLID187 FEs, static, nonlinear, one iteration, direct sparse solver. Linux cluster (8 cores per node). Chart: rating (runs/day) for Gigabit Ethernet vs. DDR InfiniBand at 8, 16, 32, 64, and 128 cores. 46
46 Understanding the effect of the interconnect - ANSYS Mechanical For ANSYS Mechanical, GigE does not scale to more than 1 node! 47
47 Understanding the effect of the interconnect - ANSYS Mechanical GigE (Gigabit Ethernet): 1 Gbit/s (~100 MB/s), not recommended!! 10GigE: 10 Gbit/s (~1000 MB/s), bare minimum!! Myrinet (Myricom, Inc.): 2 Gbit/s (~250 MB/s); Myri-10G: 10 Gbit/s (4th-generation Myrinet). InfiniBand (many vendors/speeds): SDR/DDR/QDR, 1x, 4x, 12x. RECOMMENDATION: over 1000 MB/s, especially when running on more than 4 nodes. 48
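The Gbit/s-to-MB/s figures quoted above are easy to sanity-check: the raw conversion is divide-by-8, and the slide's effective ~100 and ~1000 MB/s figures reflect protocol overhead on top of that. gbit_to_mb is a hypothetical helper for illustration:

```python
# Quick unit sanity check: raw link rate in MB/s from Gbit/s.
# Sustained application-level rates land below this due to protocol overhead.
def gbit_to_mb(gbit_per_s):
    return gbit_per_s * 1000 / 8  # 8 bits per byte

print(gbit_to_mb(1))    # 125.0  (GigE: ~100 MB/s effective)
print(gbit_to_mb(10))   # 1250.0 (10GigE: ~1000 MB/s effective)
```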
48 Recap 10GigE and InfiniBand are recommended for HPC clusters. Currently, only InfiniBand is recommended for large clusters: QDR should be more than adequate for small to medium clusters, FDR for large clusters. For more than 1 node you will see a performance decrease using GigE. Mechanical users should not use GigE at all if their jobs span more than one node. 49
49 What Hardware Configuration to Select? CPUs? GPU/Phi? HDD vs. SSD SMP vs. DMP Clusters? Interconnects? 50
50 Parallel file systems NFS: server and/or master node causes an I/O bottleneck. Master node causes an I/O bottleneck. Parallel file system: I/O scales with the cluster. 51
51 Parallel file systems - ANSYS Mechanical The example shown here uses GPFS for Mechanical. Notice how it is very similar in speed to a local RAID 0 configuration (4 x 15k SAS). 52
52 Understanding the effect of I/O - ANSYS Fluent Parallel I/O is based on MPI-IO, implemented for data file read and write. A single file is written collectively by the nodes. Suited for parallel file systems; does not work on NFS. Support for Panasas, PVFS2, HP/SFS, IBM/GPFS, EMC/MPFS2, Lustre. Files cannot be written directly compressed, but can be compressed asynchronously. 53
53 Understanding the effect of I/O - ANSYS Fluent Truck-111m (111-million-cell model using DES with the segregated implicit solver), 176 cores. Chart: data file write throughput (MB/s) for Legacy NAS, Serial I/O, Parallel I/O, and Parallel I/O (RAID-10, CW). Parallel I/O = 7x (vs. Legacy NAS); Parallel I/O = 4x (vs. Serial I/O). Panasas layout available with MPI-IO hints in Fluent. 54
54 Understanding the effect of I/O - ANSYS Fluent Landing Gear Noise Predictions using Scale-Resolving Simulations (180M cell model using pressure based segregated solver) 55
55 Understanding the effect of I/O - ANSYS Fluent Asynchronous I/O for Linux Fluent: total write time 3-5x quicker over NFS; even larger speed-ups on bigger cases and local disk (up to 10x).
Mesh | File | Location | Async I/O | Time
15M | Cas | NFS | OFF | 217s
15M | Cas | NFS | ON | 62s
15M | Dat | NFS | OFF | 113s
15M | Dat | NFS | ON | 8s
30M | Cas | NFS | OFF | 207s
30M | Cas | NFS | ON | 75s
30M | Dat | NFS | OFF | 144s
30M | Dat | NFS | ON | 10s
56
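The quoted speed-ups follow directly from the times in the table; a quick check (times copied from the table above):

```python
# Async I/O write times over NFS, in seconds: (async_off, async_on).
writes = {
    "15M cas": (217, 62), "15M dat": (113, 8),
    "30M cas": (207, 75), "30M dat": (144, 10),
}
for name, (off, on) in writes.items():
    print(f"{name}: {off/on:.1f}x")          # per-file speedup
total_off = sum(off for off, _ in writes.values())
total_on = sum(on for _, on in writes.values())
print(f"total: {total_off/total_on:.1f}x")   # lands in the quoted 3-5x range
```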
56 Understanding the effect of I/O - ANSYS Mechanical SP-5 (in-core) R14.5 benchmark results. Chart: rating (jobs/day) vs. #machines x #cores (1x1, 1x2, 1x4, 1x8, 1x16; memory used: 29 GB, 33 GB, 35.6 GB, 40.8 GB, 47.8 GB) for four disk configurations: 4x SSD RAID-0 SATA 3Gb/s, 2x SSD RAID-0 SATA 3Gb/s, SSD SATA 6Gb/s, and HD (7.2k RPM) SATA 6Gb/s. 57
59 Recap I/O is very important for the Mechanical solver: RAID 0 is mandatory for multiple disks; SSDs are recommended for speed, or 15k SAS drives. Fluent and CFX for most customers won't require fast local disk access (for most types of job). Parallel file systems can meet the requirements of both types of solver. 60
60 Is Your Hardware Ready for HPC? - ANSYS Mechanical Chart: I/O [MB/s] vs. RAM [GB], mapping disk configurations (1x SAS, 2x SAS, 2x SSD, 4x SSD) to the model sizes they can support (2 Mdof, 4 Mdof, > 6 Mdof). 61
61 What Hardware Configuration to Select? CPUs? GPU/Phi? HDD vs. SSD SMP vs. DMP Clusters? Interconnects? 62
62 DMP Outperforming SMP 6 Mio Degrees of Freedom Plasticity, Contact Bolt pretension 4 load steps 63
63 DMP: Good Performance at High Core Counts Number of Cores 10.7 Mio Degrees of Freedom Static, linear, structural 1 load step Number of Cores 1 Mio Degrees of Freedom Harmonic, linear, structural 4 frequencies Intel Xeon E processors (2.9 GHz, 16 cores total) 128 GB of RAM 64
64 ANSYS Mechanical 14.5 DMP Enabling Scalability at High Core Counts Minimum time to solution more important than scaling V14sp-5 Model Solution Scalability Turbine geometry 2.1 million DOF Static, nonlinear analysis 1 loadstep, 7 substeps, 25 equilibrium iterations 8-node Linux cluster (with 8 cores per node) Speedup
65 ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts by an enhanced domain decomposition method. Improved scaling at 8 cores: 8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR). Speedups over R14.5 of 1.7x, 2.7x, and 2.4x across the Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), and Bracket (45 KDOF) benchmarks. 66
66 ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts by an enhanced domain decomposition method. Improved scaling at 16 cores: 8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR). Speedups over R14.5 of 1.8x, 3.8x, and 4.0x across the Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), and Bracket (45 KDOF) benchmarks. 67
67 ANSYS Mechanical 15.0 Faster Performance at Higher Core Counts by an enhanced domain decomposition method. Improved scaling at 32 cores: 8-node Linux cluster (with 8 cores and 48 GB of RAM per node, InfiniBand DDR). Speedups over R14.5 of 2.2x, 3.9x, and 5.0x across the Engine (9 MDOF), Stent (520 KDOF), Clutch (160 KDOF), and Bracket (45 KDOF) benchmarks. 68
68 ANSYS Mechanical 16.0 Faster Performance at Higher Core Counts Continually improving Core Solver Rating to 128 cores Courtesy of HP 70
69 ANSYS Mechanical 15.0 HPC & Solver Technology Improvements Coupled acoustic, 1.2 M DOF, full harmonic response: improved scalability of the distributed solver at higher core counts. NEW subspace eigensolver supports shared- and distributed-parallel technology (2.09 MDOF, first 20 modes). NEW MSUP harmonic method for unsymmetric systems, e.g. vibro-acoustics. 71
70 What Hardware Configuration to Select? CPUs? GPU/Phi? HDD vs. SSD SMP vs. DMP Clusters? Interconnects? 72
71 Some Basics ANSYS Software on NVIDIA GPUs GPUs are accelerators and can significantly speed up your simulations GPUs work hand in hand with CPUs Most ANSYS GPU acceleration is user-transparent Only requirement is to inform ANSYS of how many GPUs to use Schematic of a CPU with an attached GPU accelerator CPU begins/ends job, GPU manages heavy computations 73
72 GPU Accelerator Capability - ANSYS Fluent GPU-based Model: Radiation Heat Transfer using OptiX GPU-based Solver: Coupled Algebraic Multigrid (AMG) PBNS linear solver Operating Systems: Both Linux and Win64 for workstations and servers Parallel Methods: Shared and distributed memory Supported GPUs: Tesla K40, Tesla K80, and Quadro 6000 Multi-GPU Support: Full multi-gpu and multi-node support Model Suitability: Unlimited (hardware dependent) 74
73 ANSYS Fluent on GPU Performance of Pressure-Based Solver Sedan geometry: 3.6M mixed cells, steady, turbulent, external aerodynamics, coupled PBNS, DP. CPU: Intel Xeon E5-2680 (8 cores); GPU: 2x Tesla K40. Results (jobs/day, higher is better): CPU-only segregated solver, 12; CPU-only coupled solver, 15; CPU + GPU coupled solver, 27 (1.9x). Convergence criteria: 10e-03 for all variables. Iterations until convergence: segregated CPU, 2798 iterations (7070 s); coupled CPU, 967 iterations (5900 s); coupled CPU + GPU, 985 iterations (3150 s). NOTE: times are for the total solution until convergence. 75
74 ANSYS Fluent on GPU Performance of Pressure-Based Solver Truck model: external aerodynamics, 14 million cells, steady, k-ε turbulence, coupled PBNS, DP. 2 nodes, each with dual Intel Xeon E V3 (16 CPU cores) and dual Tesla K80 GPUs. Results (jobs/day, higher is better, with an HPC Workgroup 64 license): 64 CPU cores, 11 jobs/day; 56 CPU cores + 4 Tesla K80s, 33 jobs/day. Adding GPUs cost 40% of the CPU-only solution cost but delivered 200% additional simulation productivity over the CPU-only system (100%). 76
75 ANSYS Fluent on GPU Better Speedup on Larger Models Truck model: external aerodynamics, steady, k-ε turbulence, double-precision solver. CPU: Intel Xeon E5-2667 (12 cores per node); GPU: Tesla K40 (4 per node). Configurations: 36 CPU cores vs. 36 CPU cores + 12 GPUs on 14 million cells, and 144 CPU cores vs. 144 CPU cores + 48 GPUs on 111 million cells. Chart: ANSYS Fluent time in seconds (lower is better). NOTE: reported times are per iteration. 77
76 NVIDIA-GPU Solution Fit for ANSYS Fluent CFD analysis decision flow: Is it single-phase and flow dominant? No: not ideal for GPUs. Yes: are you using the pressure-based coupled solver? Yes: best fit for GPUs. No (segregated solver): if it is a steady-state analysis, consider switching to the pressure-based coupled solver for better performance (faster convergence) and further speedups with GPUs. Please see the next slide. 78
77 NVIDIA-GPU Solution Fit for ANSYS Fluent - Supported Hardware Configurations Requirements: homogeneous process distribution across nodes, homogeneous GPU selection, and the number of processes must be an exact multiple of the number of GPUs. Invalid examples: some nodes with 16 processes and some with 12 processes; some nodes with 2 GPUs and some with 1 GPU; 15 processes, not divisible by 2 GPUs. 79
78 ANSYS Fluent - Power Consumption Study Adding GPUs to a CPU-only node resulted in 2.1x speed up while reducing energy consumption by 38% 80
79 NVIDIA-GPU Solution Fit for ANSYS Fluent GPUs accelerate the AMG solver portion of the CFD analysis, and thus benefit problems with a relatively high %AMG. Coupled solvers have a high %AMG, in the range of 60-70%. Fine meshes and low-dissipation problems have a high %AMG. In some cases, pressure-based coupled solvers offer faster convergence compared to segregated solvers (problem-dependent). The whole problem must fit on the GPUs for the calculations to proceed: in the pressure-based coupled solver, each million cells needs approx. 4 GB of GPU memory, so high-memory cards such as Tesla K40 or Quadro K6000 are ideal. Moving scalar equations such as turbulence to the GPU may not benefit much because of low workloads (using the "scalar yes" option in "amg-options"). Better performance at lower CPU core counts: a ratio of 3 or 4 CPU cores to 1 GPU is recommended. 81
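The %AMG point is Amdahl's law: if only the AMG portion runs on the GPU, the fraction it represents bounds the overall gain. A sketch, where the 3x AMG-only speedup is an illustrative assumption, not a measured figure:

```python
# Amdahl-style estimate: only the AMG fraction of an iteration is
# accelerated; the rest stays on the CPU at its original speed.
def overall_speedup(amg_fraction, amg_speedup):
    return 1 / ((1 - amg_fraction) + amg_fraction / amg_speedup)

# Coupled solver, ~60% AMG, assuming the GPU runs AMG 3x faster:
print(round(overall_speedup(0.6, 3.0), 2))   # 1.67
# A segregated solver with low %AMG barely benefits:
print(round(overall_speedup(0.3, 3.0), 2))   # 1.25
```

This is why the slide steers coupled-solver, fine-mesh cases toward GPUs and expects little from low-%AMG segregated runs.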
80 GPU Accelerator Capability - ANSYS Mechanical Supports majority of ANSYS structural mechanics solvers: Covers both sparse direct and PCG iterative solvers Only a few minor limitations Ease of use: Requires at least one supported GPU card to be installed No rebuild, no additional installation steps Performance: Offer significantly faster time to solution Should never slow down your simulation V14sp-5 Model 82
81 Influence of GPU Accelerator on Speedup ANSYS Mechanical Model Impeller: impeller geometry of ~2M DOF, solid FEs; normal modes analysis using cyclic symmetry; ANSYS Mechanical SMP and Block-Lanczos solver; 4 cores + GPU = 2.4x speedup vs. 4 cores. ANSYS Mechanical Model Speaker: speaker geometry of ~0.7M DOF, solid FEs; vibro-acoustic harmonic analysis for one frequency; ANSYS Mechanical distributed sparse solver; 4 cores + GPU = 2.7x speedup vs. 4 cores. 83
82 NVIDIA-GPU Solution Fit for ANSYS Mechanical GPUs accelerate the solver part of the analysis; consequently, problems with high solver workloads benefit the most from GPUs, characterized by both high DOF and high factorization requirements. Models with solid elements (such as castings) that have >500K DOF experience good speedups. Better performance when run in DMP mode rather than SMP mode. GPU and system memories both play important roles in performance. Sparse solver: bulkier and/or higher-order FE models are good and will be accelerated; if the model exceeds 5M DOF, either add another GPU with 5-6 GB of memory (Tesla K20 or K20X) or use a single GPU with 12 GB of memory (Tesla K40 or Quadro K6000). PCG/JCG solver: the memory-saving (MSAVE) option should be turned off to enable GPUs; models with a lower Level of Difficulty value (Lev_Diff) are better suited for GPUs. 84
83 GPU Achievements ANSYS Mechanical 16.0 Supporting Newest GPUs V15sp-4 model (turbine geometry, 3.2 million DOF, SOLID187 elements, static nonlinear analysis, sparse direct solver): 159 jobs/day on 8 CPU cores vs. 371 jobs/day on 6 CPU cores + K80 GPU (2.3x, higher is better). V15sp-5 model (ball grid array geometry, 6.0 million DOF, static nonlinear analysis, sparse direct solver): 135 jobs/day on 8 CPU cores vs. 247 jobs/day on 6 CPU cores + K80 GPU (1.8x). Distributed ANSYS Mechanical 16.0 with Intel Xeon E5-2697v2 2.7 GHz 8-core CPU; Tesla K80 GPU with boost clocks. 87
84 GPU Achievements ANSYS Mechanical 15.0 Supporting Newest GPUs GPUs can offer significantly faster time to solution Higher core counts favor multiple GPUs Lower core counts favor a single GPU Courtesy of HP 89
85 GPU Achievements ANSYS Mechanical 16.0 Supporting Xeon Phi Background: ANSYS Mechanical 15.0 was the first commercial FEA program to support the Intel Xeon Phi coprocessor, but it was limited to shared memory parallelism (SMP) on Linux only. R16 now supports distributed memory parallelism (DMP) and Windows. Chart: speedup with and without Xeon Phi at 1, 2, 4, 8, and 16 cores.
86 GPU Achievements ANSYS License Scheme for GPU and Phi Licensing examples: 1x ANSYS HPC Pack = total 8 HPC tasks (4 GPU/Phi max); example valid configurations: 6 CPU cores + 2 GPU/Phi, or 4 CPU cores + 4 GPU/Phi. 2x ANSYS HPC Pack = total 32 HPC tasks (16 GPU/Phi max); e.g. 24 CPU cores + 8 GPU/Phi (total use of 2 compute nodes). (Applies to all schemes: ANSYS HPC, ANSYS HPC Pack, ANSYS HPC Workgroup.) 93
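The pack figures quoted above (1 pack = 8 tasks, 2 packs = 32) follow a 2 x 4^n pattern. Sketching that here; extrapolating beyond two packs is an assumption of this sketch, so consult current ANSYS licensing documentation for authoritative numbers:

```python
# HPC Pack sizing pattern implied by the quoted examples:
# tasks = 2 * 4**packs (1 pack -> 8, 2 packs -> 32).
# Values beyond two packs are an extrapolation, not quoted in the deck.
def hpc_pack_tasks(packs):
    return 2 * 4 ** packs

print([hpc_pack_tasks(n) for n in range(1, 4)])  # [8, 32, 128]
```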
87 Maximizing Performance Putting it Together HDD vs. SSD SMP vs. DMP The right combination of hardware and software leads to maximum efficiency 95 CPUs? Clusters? GPU/Phi? Interconnects?
88 Maximizing Performance ANSYS Mechanical #1 rule: avoid waiting for I/O to complete. Always check if a job is I/O bound or compute bound: check the output file for CPU and elapsed times ("Total CPU time for main thread" and "Elapsed Time (sec)"). When elapsed time >> main thread CPU time, the job is I/O bound: consider adding more RAM or a faster hard drive configuration. When elapsed time is close to main thread CPU time, the job is compute bound: consider moving the simulation to a machine with newer, faster processors; consider using Distributed ANSYS (DMP) instead of SMP; consider running on more CPU cores or possibly using GPU(s). 96
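The elapsed-vs-CPU-time rule can be sketched as a simple check. The 1.5x threshold is illustrative (the slide only says elapsed time >> CPU time), and classify_run is a hypothetical helper, not part of any ANSYS tool:

```python
# Rule-of-thumb classifier: compare solver CPU time with wall-clock time,
# as read from the "Total CPU time" and "Elapsed Time" lines of the output.
def classify_run(cpu_time_s, elapsed_s, io_bound_ratio=1.5):
    """Flag a run as I/O bound when elapsed time far exceeds CPU time."""
    if elapsed_s > io_bound_ratio * cpu_time_s:
        return "I/O bound: add RAM or a faster disk configuration"
    return "compute bound: faster CPUs, DMP, more cores, or GPU"

print(classify_run(cpu_time_s=1000, elapsed_s=4000))  # I/O bound
print(classify_run(cpu_time_s=1000, elapsed_s=1100))  # compute bound
```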
89 Maximizing Performance ANSYS Mechanical How to improve an I/O bound simulation First consider adding more RAM Always the best option for optimal performance Allows the operating system to cache file data in memory Next consider improving the I/O configuration Need fast hard drives to feed fast processors Consider SSDs Higher bandwidths and extremely low seek times Consider RAID configurations RAID 0 for speed RAID 1,5 for redundancy RAID 10 for speed and redundancy 97
90 Maximizing Performance ANSYS Mechanical Example of an I/O bound simulation: 2.1 million DOF, nonlinear static analysis, direct sparse solver (DSPARSE), 2 Intel Xeon E (2.6 GHz, 16 cores total), one 10k RPM HDD, one SSD, Windows 7. Chart: relative speedup (benefits of SSD and RAM) for 2 cores + HDD, 8 cores + HDD, and 8 cores + SSD, with 16 GB vs. 128 GB RAM (from 0.8x up to 5.9x). Adding RAM gives the biggest gains and allows good scaling. A single SSD helps allow some scaling; not as helpful as RAM, but cheaper. Lack of RAM and a slow HDD ruin scaling. 98
91 Maximizing Performance ANSYS Mechanical How to improve a compute bound simulation First consider using newer, faster processors New CPU architecture and faster clock speeds always help Next consider using parallel processing DMP virtually always recommended over SMP More computations performed in parallel with DMP Significantly faster speedups achieved using DMP DMP can take advantage of all resources on a cluster Whole new class of problems can be solved!! Last consider using GPU acceleration Can help accelerate critical, time-consuming computations 99
92 Maximizing Performance ANSYS Mechanical Example of a compute bound simulation: 2.1 million DOF, nonlinear static analysis, direct sparse solver (DSPARSE), 2 Intel Xeon E (2.6 GHz, 16 cores total), 128 GB RAM, 1 Tesla K20c, Windows 7. Chart: relative speedup (benefits of DMP and GPU), Xeon X5675 vs. Xeon E5, at 2 cores, 8 cores, and 8 cores + GPU (up to 11.0x). Using newer Xeons gives a big gain; using 8 cores gives faster performance; maximum performance is found by adding the GPU. 100
93 Maximizing Performance ANSYS Mechanical Balanced System for Overall Optimum Performance 2.1 million DOF, nonlinear static analysis, direct sparse solver (DSPARSE), 2 Intel Xeon E (2.6 GHz, 16 cores total), 16 GB RAM, SSD and SATA disks, 1 Tesla K20c, Windows 7. Chart: relative speedup (I/O bound) for 2 cores, 8 cores, 8 cores + GPU, and 8 cores + GPU + SSD (2.7x, 5.2x, up to 12.5x). 101
94 Maximizing Performance ANSYS Mechanical Balanced System for Overall Optimum Performance 2.1 million DOF, nonlinear static analysis, direct sparse solver (DSPARSE), 2 Intel Xeon E (2.6 GHz, 16 cores total), 128 GB RAM, SSD and SATA disks, 1 Tesla K20c, Windows 7. Chart: relative speedup, I/O bound (16 GB) vs. balanced/compute bound (128 GB), for 2 cores, 8 cores, 8 cores + GPU, and 8 cores + GPU + SSD; the balanced series reaches 5.7x, 12.0x, 24.8x, and 27.3x, against 12.5x for the I/O bound configuration. 102
95 Agenda Why Talking About Hardware HPC Terminology ANSYS Work-flow Hardware Considerations Additional resources 103
96 Wrap-up - Hardware An important part of specifying an HPC system is to purchase a balanced system. There is no point in spending all your money on the processor if the I/O is your biggest bottleneck. You are only as good as your slowest component! 104
97 Scalable HPC Licensing ANSYS HPC (per-process) ANSYS HPC Pack HPC product rewarding volume parallel processing for high-fidelity simulations Each simulation consumes one or more Packs Parallel enabled increases quickly with added Packs ANSYS HPC Workgroup HPC product rewarding volume parallel processing for increased simulation throughput within a single colocated workgroup 16 to parallel shared across any number of simulations on a single server Enterprise options available to deploy and use anywhere in the world Single HPC solution for FEA/CFD/FSI and any level of fidelity Parallel Enabled (Cores) HPC Packs per Simulation 105
98 Which type of licensing is right for me? ANSYS HPC and ANSYS HPC Workgroup give flexible use of a pool of licenses. ANSYS HPC Pack gives quick scale-up but is more restrictive in how it can be used. That added flexibility is why the HPC Workgroup options cost more than HPC Packs.
99 ANSYS HPC Parametric Pack License HPC license for running parametric FEA or CFD simulations on multiple CPU cores simultaneously, and more cost-effectively.
Key benefits:
- Ability to automatically and simultaneously execute design points while consuming just one set of application licenses
- Scalable, because the number of simultaneous design points enabled increases quickly with added Packs
- Amplifies the complete workflow, because design points can include execution of multiple applications (pre, meshing, solve, HPC, post)
(Chart: number of simultaneous design points enabled vs. number of HPC Parametric Pack licenses.)
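The throughput benefit of executing design points simultaneously can be sketched as a simple batching calculation (the sweep size, concurrency levels, and 30-minute solve time below are illustrative assumptions, not figures from the deck):

```python
import math

# Wall-clock batches for a parametric sweep when `simultaneous` design
# points can execute at once (e.g. as enabled by Parametric Pack licenses).
def sweep_batches(total_points, simultaneous):
    return math.ceil(total_points / simultaneous)

# Illustrative sweep: 64 design points at 30 minutes of solve time each.
for simult in (1, 4, 16):
    batches = sweep_batches(64, simult)
    print(f"{simult:2d} at a time -> {batches} batches -> {batches * 30} min of wall clock")
```

Running 16 design points at a time turns a 32-hour serial sweep into a 2-hour one, which is where the "more cost effectively" argument comes from.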
100 Additional Resources - IT Webinars
Recorded webinars:
- Understanding Hardware Selection for ANSYS 15.0
- How to Speed Up ANSYS 15.0 with GPUs
- Intel Technologies Enabling Faster, More Effective Simulation
- Optimizing Remote Access to Simulation
See the HPC/IT webinar listings for more and upcoming sessions.
101 Additional Resources - IT White Papers & Technical Briefs
White papers:
- Optimizing Business Value in High-Performance Engineering Computing
- IBM Application Ready Solutions Reference Architecture for ANSYS
- Intel Solid-State Drives Increase Productivity of Product Design and Simulation
- Value of HPC for Ensuring Product Integrity
Technical briefs:
- Parallel Scalability of ANSYS 15.0 on Hewlett-Packard Systems
- SGI Technology Guide for ANSYS Mechanical Analysts
- SGI Technology Guide for ANSYS Fluent Analysts
- Accelerating ANSYS Fluent 15.0 Using NVIDIA GPUs
102 Additional Resources - ANSYS IT Webcast Series
On-demand webinars:
- Understanding Hardware Selection for ANSYS 15.0
- How to Speed Up ANSYS 15.0 with GPUs
- Cloud Hosting of ANSYS: Gompute On-Demand Solutions
- Simplified HPC Clusters for ANSYS Users
- Intel Technologies Enabling Faster, More Effective Simulation
- Accelerating Time-to-Results with Parallel I/O
- Extreme Scalability for High-Fidelity CFD Simulations
- Methodology and Tools for Compute Performance at Any Scale
- Understanding Hardware Selection for Structural Mechanics
- Optimizing Remote Access to Simulation
- Scalable Storage and Data Management for Engineering Simulation
103 Additional Resources - ANSYS Platform Support
- Platform Support Policies
- Supported Platforms
- Supported Hardware
- Tested Systems
- ANSYS Benchmarks
104 Additional Resources - ANSYS Partner Solutions
- Reference configurations
- Performance data
- White papers
- Sales contact points
105 Additional Resources - The Manual
- Sections on best practices and parallel processing for various solvers
- Performance Guide for Mechanical
- Installation walkthroughs for installing the products, parallel processing, licensing, and RSM (Remote Solve Manager)
- ANSYS Advantage Online Magazine
106 Thank You! Connect with Me. Connect with ANSYS, Inc.: LinkedIn ANSYSInc, Facebook ANSYSInc, and follow our blog at ansys-blog.com