ANSYS HPC. Technology Leadership. Barbara Hutchings ANSYS, Inc. September 20, 2011


1 ANSYS HPC Technology Leadership
Barbara Hutchings, ANSYS, Inc.

2 Why ANSYS Users Need HPC
Insight you can't get any other way
- HPC enables high fidelity
  - Include details, for reliable results
  - Be sure your design is right
  - Innovate with confidence
- HPC delivers throughput
  - Consider multiple design ideas
  - Optimize the design
  - Ensure performance across a range of conditions

3 ANSYS HPC Leadership: A History of HPC Performance
- 15% spent on R&D
- 570 software developers
- Partner relationships
Today's multi-core / many-core hardware evolution makes HPC a software development imperative. ANSYS is committed to maintaining performance leadership.
Timeline milestones:
- 1980s: Vector processing on mainframes
- 1990: Shared memory multiprocessing for structural simulations; iterative PCG solver introduced for large structural analysis
- 1994-1995: Parallel dynamic mesh refinement and coarsening; dynamic load balancing
- 1999-2000: 64-bit large memory addressing; shared memory multiprocessing (HFSS 7)
- 2009: Ideal scaling to 2048 cores (fluids); teraflop performance at 512 cores (structures); parallel I/O (fluids); domain decomposition introduced (HFSS 12); parallel meshing (fluids); support for clusters using Windows HPC
- 2010: Hybrid parallel for sustained multicore performance (fluids); GPU acceleration (structures)
Milestones whose year labels did not survive extraction:
- 1st general-purpose parallel CFD with interactive client-server user environment
- Integration with load management systems; support for Linux clusters, low latency interconnects; 10M cell fluids simulations, 128 processors
- Distributed sparse solver; distributed PCG solver; Variational Technology; DANSYS released; Distributed Solve (DSO), HFSS
- 1st company to solve 100M structural DOF
- Parallel dynamic moving/deforming mesh; distributed memory particle tracking
- Optimized performance on multicore processors; 1st one billion cell fluids simulation
2008 ANSYS, Inc. All rights reserved. ANSYS, Inc. Proprietary

4 HPC: A Software Development Imperative
- Clock speed: leveling off
- Core counts: growing, exploding (GPUs)
- Future performance depends on highly scalable parallel software

5 ANSYS FLUENT Scaling Achievement
[Two charts: rating vs. number of cores, each with an IDEAL line. Panels: 2008 hardware (Intel Harpertown, DDR IB); 2010 hardware (Intel Westmere, QDR IB).]
- Systems keep improving: faster processors, more cores
  - Ideal rating (speed) doubled in two years!
- Memory bandwidth per core and network latency/bandwidth stress scalability
  - 2008 release (12.0) re-architected MPI: huge scaling improvement, for a while
  - 2010 release (13.0) introduces hybrid parallelism, and scaling continues!
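The charts on this slide plot "rating" against core count. The following is a minimal sketch, assuming the slide uses the usual Fluent benchmark convention in which rating is the number of benchmark runs that fit in 24 hours (86,400 s divided by the wall-clock time of one run). The function names and the sample timings are hypothetical and are not taken from the slide.

```python
SECONDS_PER_DAY = 86_400


def rating(wall_seconds: float) -> float:
    """Benchmark runs per day (assumed definition of 'rating')."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return SECONDS_PER_DAY / wall_seconds


def ideal_rating(base_rating: float, base_cores: int, cores: int) -> float:
    """Rating expected at `cores` if scaling from `base_cores` were perfectly linear."""
    if base_cores <= 0 or cores <= 0:
        raise ValueError("core counts must be positive")
    return base_rating * cores / base_cores


# Hypothetical timings, for illustration only.
base = rating(432.0)                 # one run takes 432 s on 8 cores
ideal = ideal_rating(base, 8, 64)    # the IDEAL line at 64 cores
print(f"rating on 8 cores: {base:.0f}, ideal rating on 64 cores: {ideal:.0f}")
```

Read this way, a measured curve that stays close to the IDEAL line means that adding cores still yields nearly proportional throughput.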

6 Extreme CFD Scaling: 1000s of Cores
[Chart: "Scaling to Thousands of Cores", 111M cell truck benchmark. Core solver rating vs. number of cores for ANSYS Fluent 13.0 and ANSYS Fluent 14.0 (pre-release).]
- Enabled by ongoing software innovation
- Hybrid parallel: fast shared memory communication (OpenMP) within a machine to speed up overall solver performance; distributed memory (MPI) between machines

7 Parallel Scaling: ANSYS Mechanical
Focus on bottlenecks in the distributed memory solvers (DANSYS)
- Sparse Solver
  - Parallelized equation ordering
  - 40% faster w/ updated Intel MKL
- Preconditioned Conjugate Gradient (PCG) Solver
  - Parallelized preconditioning step
[Charts: "Sparse Solver (Parallel Re-Ordering)" and "PCG Solver (Pre-Conditioner Scaling)"; solution rating vs. number of cores, R12.1 vs. R13.0.]

8 Architecture-Aware Partitioning
- Original partitions are remapped to the cluster considering the network topology and latencies
- Minimizes inter-machine traffic, reducing load on network switches
- Improves performance, particularly on slow interconnects and/or large clusters
[Figure: partition graph for 3 machines, 8 cores each; colors indicate machines. Panels: original mapping, new mapping.]
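The idea of remapping partitions so that heavily communicating neighbours share a machine can be illustrated with a toy example. The slide does not describe the remapping algorithm, so this is not ANSYS's method: the sketch below scores a mapping by the communication weight that crosses machine boundaries and finds the best one by exhaustive search, which is only feasible for a handful of partitions. It also ignores network topology and latency beyond "same machine or not". The graph, weights, and function names are hypothetical.

```python
from itertools import permutations

# Hypothetical partition graph: (partition_a, partition_b) -> communication weight.
EDGES = {(0, 1): 10, (1, 2): 1, (2, 3): 10, (3, 4): 1, (4, 5): 10, (5, 0): 1}


def inter_machine_traffic(edges, machine_of):
    """Sum the weights of edges whose endpoints sit on different machines."""
    return sum(w for (a, b), w in edges.items() if machine_of[a] != machine_of[b])


def best_mapping(edges, n_parts, cores_per_machine):
    """Exhaustively search partition-to-machine mappings (toy sizes only)."""
    best_cost, best = None, None
    for order in permutations(range(n_parts)):
        # Fill machines slot by slot: the first `cores_per_machine` partitions
        # in `order` go to machine 0, the next group to machine 1, and so on.
        machine_of = {p: slot // cores_per_machine for slot, p in enumerate(order)}
        cost = inter_machine_traffic(edges, machine_of)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, machine_of
    return best_cost, best


# "Original" mapping: round-robin over 3 machines, 2 cores each.
round_robin = {p: p % 3 for p in range(6)}
print("round-robin traffic:", inter_machine_traffic(EDGES, round_robin))
print("best traffic:", best_mapping(EDGES, 6, 2)[0])
```

In this toy graph the round-robin mapping splits every heavy edge across machines, while the best mapping keeps each heavy pair together and leaves only the light edges on the network.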

9 File I/O Performance
- Case file IO
  - Both read and write significantly faster in R13
  - A combination of serial-IO optimizations as well as parallel-IO techniques, where available
- Parallel-IO (.pdat)
  - Significant speedup of parallel IO, particularly for cases with a large number of zones
  - Support for Lustre, EMC/MPFS, AIX/GPFS file systems added
- Data file IO (.dat)
  - Performance in R12 was highly optimized; further incremental improvements done in R13
[Chart: parallel data write and case read, truck_14m; surviving values 91.2 and 79.5.]
R12 vs. R13:
  Case        Change
  BMW         -68%
  FL5L2 4M    -63%
  Circuit     -97%
  Truck 14M   -64%

10 What about GPU Computing?
CPUs and GPUs work in a collaborative fashion, linked by a PCI Express channel
- CPU: multi-core processors
  - Typically 4-6 cores
  - Powerful, general purpose
- GPU: many-core processors
  - Typically hundreds of cores
  - Great for highly parallel code, within memory constraints

11 ANSYS Mechanical SMP GPU Speedup
[Charts: solver kernel speedups; overall speedups. Hardware: Tesla C2050 and Intel Xeon 5560.]
From NAFEMS World Congress, May, Boston, MA, USA: "Accelerate FEA Simulations with a GPU" by Jeff Beisheim, ANSYS

12 R14: GPU Acceleration for DANSYS
[Chart: R14 Distributed ANSYS total simulation speedups for the R13 benchmark set. Series: 4 CPU cores; CPU cores + 1 GPU.]
Benchmarks:
- V13cg-1 (JCG, 1100k)
- V13sp-1 (sparse, 430k)
- V13sp-2 (sparse, 500k)
- V13sp-3 (sparse, 2400k)
- V13sp-4 (sparse, 1000k)
- V13sp-5 (sparse, 2100k)
Windows workstation: two Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB RAM, NVIDIA Tesla C2070, Windows 7, TCC driver mode

13 ANSYS Mechanical Multi-Node GPU
Solder Joint Benchmark (4 MDOF, Creep Strain Analysis)
Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, NVIDIA Tesla M2070, InfiniBand
[Chart: R14 Distributed ANSYS with and without GPU; total speedup by core count (32 cores and 64 cores; the first core-count label is lost). Bar labels: 1.0, 1.7x, 1.9x, 3.2x, 3.4x, 4.4x.]
[Figure labels: solder balls, mold, PCB.]
Results courtesy of MicroConsult Engineering, GmbH

14 GPU Acceleration for CFD
- Radiation viewfactor calculation (ANSYS FLUENT 14, beta)
- First capability for "specialty physics": view factors, ray tracing, reaction rates, etc.
- R&D focus on linear solvers and smoothers, but potential limited by Amdahl's Law
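The remark that GPU potential is limited by Amdahl's Law can be made concrete. The sketch below applies the standard formula: if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The function names and the example fractions are illustrative assumptions, not measurements from ANSYS FLUENT.

```python
def amdahl_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only part of the runtime is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


def amdahl_limit(accelerated_fraction: float) -> float:
    """Upper bound on overall speedup as the kernel speedup grows without limit."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if accelerated_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - accelerated_fraction)


# Hypothetical example: the linear solver is half of the runtime.
for s in (2.0, 10.0, 100.0):
    print(f"kernel {s:>5.0f}x faster -> overall {amdahl_speedup(0.5, s):.2f}x")
print(f"limit: {amdahl_limit(0.5):.1f}x")
```

With half of the runtime left on the CPU, even an arbitrarily fast GPU kernel cannot deliver more than a 2x overall gain, which is the constraint the slide alludes to.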

15 Case Study: HPC for High Fidelity CFD
8M to 12M element turbocharger models (ANSYS CFX)
- Previous practice (8 nodes HPC)
  - Full stage compressor runs: 36-48 hours
  - Turbine simulations: up to 72 hours
- Current practice (160 nodes)
  - 32 nodes per simulation
  - Full stage compressor: 4 hours
  - Turbine simulations: 5-6 hours
  - Simultaneous consideration of 5 ideas
  - Ability to address design uncertainty: clearance tolerance
"ANSYS HPC technology is enabling Cummins to use larger models with greater geometric details and more-realistic treatment of physical phenomena."
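The figures quoted for this case study (160 nodes at 32 nodes per simulation, 36-48 hours down to 4 hours, up to 72 hours down to 5-6 hours) can be turned into throughput numbers with simple arithmetic. The sketch below is only that arithmetic; the helper names are hypothetical, and the slide does not say how much of the gain comes from node count as opposed to newer hardware or software.

```python
def concurrent_simulations(total_nodes: int, nodes_per_simulation: int) -> int:
    """How many simulations fit on the cluster at the same time."""
    return total_nodes // nodes_per_simulation


def speedup(before_hours: float, after_hours: float) -> float:
    """Ratio of old to new turnaround time."""
    return before_hours / after_hours


# Values as stated on the slide.
slots = concurrent_simulations(160, 32)
compressor = (speedup(36, 4), speedup(48, 4))
turbine = (speedup(72, 6), speedup(72, 5))

print(f"simultaneous design ideas: {slots}")
print(f"compressor turnaround: {compressor[0]:.0f}x to {compressor[1]:.0f}x faster")
print(f"turbine turnaround: {turbine[0]:.0f}x to {turbine[1]:.1f}x faster (vs. the 72 h upper bound)")
```

The node arithmetic matches the slide's "simultaneous consideration of 5 ideas": each of the five concurrent runs uses a 32-node share of the 160-node system.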

16 Case Study: HPC for High Fidelity CFD (EURO/CFD)
- Model sizes up to 200M cells (ANSYS FLUENT)
- Cluster of 700 cores (the cores-per-simulation figure is not preserved)
[Chart: model size and turnaround over time: 3 million cells (6 days), 10 million (5 days), 25 million (4 days), 50 million (2 days). Arrows: increase of spatial-temporal accuracy; increase of complexity of physical phenomenon.]
Physics labels on the chart: compressibility, conduction/convection, supersonic, multiphase, radiation, transient, optimisation / DOE, dynamic mesh, LES, combustion, aeroacoustic, fluid structure interaction

17 Case Study: HPC for High Fidelity Mechanical (Microconsult GmbH)
- Solder joint failure analysis
  - Thermal stress: 7.8 MDOF
  - Creep strain: 5.5 MDOF
- Simulation time reduced from 2 weeks to 1 day
  - From 8-26 cores (past) to 128 cores (present)
"HPC is an important competitive advantage for companies looking to optimize the performance of their products and reduce time to market."

18 Case Study: HPC for Desktop Productivity
Cognity Limited: steerable conductors for oil recovery
- ANSYS Mechanical simulations to determine load carrying capacity
- 750K elements, many contacts
- 12 core workstations / 24 GB RAM
- 6X speedup / results in 1 hour or less
- 5-10 design iterations per day
"Parallel processing makes it possible to evaluate five to 10 design iterations per day, enabling Cognity to rapidly improve their design."

19 Case Study: Skewed Waveguide Array (HFSS)
- 16X16 (256 elements and excitations) skewed rectangular waveguide (WR90) array
- 1.3M matrix size
- Using 8 cores: 3 hrs. solution time, 0.4GB memory total
- Using 16 cores: 2 hrs. solution time, 0.8GB memory total
- Additional cores: faster solution time, more memory
[Figure: unit cell shown with wireframe view of virtual array.]
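The two timings on this slide (3 hours on 8 cores, 2 hours on 16 cores) give a relative speedup and a parallel efficiency for the step from 8 to 16 cores. The sketch below shows the arithmetic; the function names are hypothetical, and the 8-core run is taken as the baseline because the slide gives no single-core time.

```python
def relative_speedup(base_hours: float, new_hours: float) -> float:
    """Speedup of the new run relative to the baseline run."""
    return base_hours / new_hours


def parallel_efficiency(base_cores: int, base_hours: float,
                        cores: int, hours: float) -> float:
    """Relative speedup divided by the factor by which the core count grew."""
    return relative_speedup(base_hours, hours) / (cores / base_cores)


# Values as stated on the slide.
s = relative_speedup(3.0, 2.0)
e = parallel_efficiency(8, 3.0, 16, 2.0)
print(f"8 -> 16 cores: {s:.2f}x faster, {e:.0%} parallel efficiency")
```

Doubling the cores here buys a 1.5x faster solution rather than 2x, alongside the doubled memory total the slide reports, which is the trade-off summarized in its last bullet.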

20 Case Study: Desktop Productivity Cautionary Tale
NVIDIA: case study on the value of HW refresh and SW best practice
- Deflection and bending of 3D glasses
- ANSYS Mechanical, 1M DOF models
- Optimization of:
  - Solver selection (direct vs. iterative)
  - Machine memory (in-core execution)
  - Multicore (8-way) parallel with GPU acceleration
- Before/after: 77x speedup, from 60 hours per simulation to 47 minutes
- Most importantly: HPC tuning added scope for design exploration and optimization

21 Take Home Points / Discussion
- ANSYS HPC performance enables scaling for high fidelity
  - What could you learn from a 10M (or 100M) cell / DOF model?
  - What could you learn if you had time to consider 10x more design ideas?
  - Scaling applies to all physics, all hardware (desktop and cluster)
- ANSYS continually invests in software development for HPC
  - Maximized value from your HPC investment
- This creates differentiated competitive advantage for ANSYS users
Comments / Questions / Discussion

HPC and IT Issues: Session Agenda
- Deployment of Simulation (Trends and Issues Impacting IT)
- Discussion
- Mapping HPC to Performance (Scaling, Technology Advances)
- Discussion
- Optimizing IT for Remote Access