ANSYS HPC Technology Leadership
Barbara Hutchings
barbara.hutchings@ansys.com
ANSYS, Inc. September 20,
Why ANSYS Users Need HPC
Insight you can't get any other way.
HPC enables high fidelity:
- Include details for reliable results
- Be sure your design is right
- Innovate with confidence
HPC delivers throughput:
- Consider multiple design ideas
- Optimize the design
- Ensure performance across a range of conditions
ANSYS HPC Leadership: A History of HPC Performance
- 15% spent on R&D; 570 software developers; partner relationships
Timeline:
- 1980s: Vector processing on mainframes
- 1990: Shared-memory multiprocessing for structural simulations
- 1993: 1st general-purpose parallel CFD with interactive client-server user environment
- 1994: Iterative PCG solver introduced for large structural analysis
- 1994-1995: Parallel dynamic mesh refinement and coarsening; dynamic load balancing
- 1998-1999: Integration with load-management systems; support for Linux clusters and low-latency interconnects; 10M-cell fluids simulations on 128 processors
- 1999-2000: Shared-memory multiprocessing (HFSS 7)
- 2001-2003: Parallel dynamic moving/deforming mesh; distributed-memory particle tracking
- 2004: 1st company to solve 100M structural DOF
- 2005-2006: Parallel meshing (fluids); support for clusters using Windows HPC
- 2005-2007: Distributed sparse solver; distributed PCG solver; Variational Technology; DANSYS released; Distributed Solve (DSO) in HFSS 10
- 2007-2008: Optimized performance on multicore processors; 1st one-billion-cell fluids simulation
- 2009: Ideal scaling to 2048 cores (fluids); teraflop performance at 512 cores (structures); parallel I/O (fluids); domain decomposition introduced (HFSS 12)
- 2010: Hybrid parallel for sustained multicore performance (fluids); GPU acceleration (structures)
Today's multi-core/many-core, 64-bit, large-memory-addressing hardware evolution makes HPC a software development imperative. ANSYS is committed to maintaining performance leadership.
2008 ANSYS, Inc. All rights reserved. ANSYS, Inc. Proprietary
HPC: A Software Development Imperative
- Clock speed: leveling off
- Core counts: growing (CPUs), exploding (GPUs)
- Future performance depends on highly scalable parallel software
Source: http://www.lanl.gov/news/index.php/fuseaction/1663.article/d/20085/id/13277
ANSYS FLUENT Scaling Achievement
[Charts: solver rating vs. number of cores on 2008 hardware (Intel Harpertown, DDR InfiniBand; releases 6.3.35 and 12.0.5 vs. ideal, up to ~1200 cores) and 2010 hardware (Intel Westmere, QDR InfiniBand; releases 12.1.0 and 13.0.0 vs. ideal, up to 1536 cores)]
- Systems keep improving: faster processors, more cores; ideal rating (speed) doubled in two years
- Memory bandwidth per core and network latency/bandwidth stress scalability
- 2008 release (12.0): re-architected MPI brought a huge scaling improvement, for a while
- 2010 release (13.0): introduces hybrid parallelism, and scaling continues
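For readers unfamiliar with the "rating" axis on these charts: in ANSYS Fluent benchmarking, rating is conventionally the number of benchmark jobs that fit in a 24-hour day, so higher is faster. A minimal sketch of how such scaling curves are derived from wall-clock times (the core counts and run times below are hypothetical, not the benchmark data from the charts):

```python
# Benchmark "rating" = jobs per day; higher rating means a faster solver.
# Hypothetical run times used purely to illustrate the arithmetic.

def rating(seconds_per_run):
    """Number of benchmark runs that fit in 24 hours (86400 s)."""
    return 86400.0 / seconds_per_run

def parallel_efficiency(ratings, cores):
    """Measured speedup divided by ideal linear speedup, relative to the
    smallest core count in the series."""
    base_r, base_c = ratings[0], cores[0]
    return [(r / base_r) / (c / base_c) for r, c in zip(ratings, cores)]

cores = [64, 128, 256, 512]
run_times = [900.0, 470.0, 250.0, 150.0]   # hypothetical seconds per run
ratings = [rating(t) for t in run_times]
eff = parallel_efficiency(ratings, cores)  # 1.0 at 64 cores, then tails off
```

An efficiency below 1.0 at higher core counts is exactly the memory-bandwidth and interconnect stress the slide describes.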
Extreme CFD Scaling - 1000s of Cores
[Chart: core solver rating vs. number of cores (0-3840), 111M-cell truck benchmark, ANSYS Fluent 13.0 vs. ANSYS Fluent 14.0 (pre-release), showing scaling to thousands of cores]
- Enabled by ongoing software innovation
- Hybrid parallel: fast shared-memory communication (OpenMP) within a machine to speed up overall solver performance; distributed memory (MPI) between machines
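The hybrid scheme described above is a two-level decomposition: cells are split across machines (the MPI level), and each machine's share is split again across its threads (the OpenMP level). A minimal sketch of that layout, with hypothetical function names that are not ANSYS Fluent internals:

```python
# Illustrative two-level (hybrid) domain decomposition: distributed memory
# between machines, shared memory within a machine.

def split_evenly(n_items, n_parts):
    """Split n_items into n_parts contiguous chunks, sizes differing by at most 1."""
    base, extra = divmod(n_items, n_parts)
    bounds, start = [], 0
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def hybrid_decomposition(n_cells, n_machines, threads_per_machine):
    """Assign cells first to machines, then to threads within each machine.
    Only machine-level boundaries require network (MPI) communication;
    thread-level boundaries are crossed through shared memory."""
    layout = {}
    for m, (lo, hi) in enumerate(split_evenly(n_cells, n_machines)):
        layout[m] = [(lo + a, lo + b)
                     for a, b in split_evenly(hi - lo, threads_per_machine)]
    return layout

layout = hybrid_decomposition(n_cells=1000, n_machines=4, threads_per_machine=6)
# 4 machines x 6 threads; every cell is owned by exactly one thread range.
```

The payoff is that halo exchanges between threads on one machine avoid the interconnect entirely, which is why hybrid parallelism keeps scaling where flat MPI stalls.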
Parallel Scaling: ANSYS Mechanical
- Focus on bottlenecks in the distributed-memory solvers (DANSYS)
Sparse solver (parallel reordering):
- Parallelized equation ordering
- 40% faster with updated Intel MKL
- [Chart: solution rating vs. number of cores (0-256), R12.1 vs. R13.0]
Preconditioned conjugate gradient (PCG) solver:
- Parallelized preconditioning step
- [Chart: solution rating vs. number of cores (0-64), R12.1 vs. R13.0]
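To see why parallelizing the preconditioning step matters, here is a toy PCG iteration with a Jacobi (diagonal) preconditioner: applying the preconditioner (z = M⁻¹r) touches each row independently, so it parallelizes trivially. This is a generic pure-Python sketch of the textbook algorithm, not the ANSYS solver:

```python
# Minimal preconditioned conjugate gradient (PCG) for a small SPD system,
# using a Jacobi preconditioner. The z = M^-1 r step is row-independent,
# which is the kind of work a parallel preconditioner distributes.

def pcg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]       # Jacobi preconditioner
    x = [0.0] * n
    r = b[:]                                           # r = b - A x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]         # apply preconditioner
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [inv_diag[i] * r[i] for i in range(n)]     # parallelizable step
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

# Small symmetric positive-definite test system.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 5.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
```

In a production solver the matrix is sparse and distributed, but the structure of the loop is the same; any serial step in it caps scaling.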
Architecture-Aware Partitioning
- Original partitions are remapped to the cluster considering the network topology and latencies
- Minimizes inter-machine traffic, reducing load on network switches
- Improves performance, particularly on slow interconnects and/or large clusters
[Figure: partition graph for 3 machines with 8 cores each; colors indicate machines; original vs. new mapping]
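The benefit can be made concrete by counting partition-graph edges that cross machine boundaries, since only those edges consume network bandwidth. A small sketch with a hypothetical 8-partition ring on 2 machines (the figure's 3x8 layout is simplified here for brevity):

```python
# Count inter-machine edges in a partition graph under two mappings.
# Edges within a machine use shared memory; edges between machines use
# the network, so a topology-aware mapping minimizes the latter.

def inter_machine_edges(edges, machine_of):
    """Edges whose endpoints live on different machines need the network."""
    return sum(1 for a, b in edges if machine_of[a] != machine_of[b])

# Ring of 8 partitions: 0-1-2-...-7-0
edges = [(i, (i + 1) % 8) for i in range(8)]

# Naive round-robin mapping alternates machines: every edge crosses.
round_robin = {p: p % 2 for p in range(8)}
# Topology-aware mapping keeps contiguous blocks together: only 2 edges cross.
blocked = {p: p // 4 for p in range(8)}

print(inter_machine_edges(edges, round_robin))  # 8
print(inter_machine_edges(edges, blocked))      # 2
```

Real remapping works on weighted communication graphs and measured latencies, but the objective is the same: fewer, cheaper cut edges.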
File I/O Performance
Case-file I/O:
- Both read and write significantly faster in R13
- A combination of serial-I/O optimizations and parallel-I/O techniques, where available
Parallel I/O (.pdat):
- Significant speedup of parallel I/O, particularly for cases with a large number of zones
- Support added for Lustre, EMC/MPFS, and AIX/GPFS file systems
Data-file I/O (.dat):
- Performance in R12 was already highly optimized; further incremental improvements in R13
[Charts: 12.1.0 vs. 13.0.0 case-read times for the 14M-cell truck case at 48, 96, 192, and 384 cores; parallel data-write time reductions: BMW -68%, FL5L2 4M -63%, Circuit -97%, Truck 14M -64%]
What About GPU Computing?
- CPUs and GPUs work in a collaborative fashion, connected by a PCI Express channel
- CPU: multi-core processor, typically 4-6 cores; powerful and general purpose
- GPU: many-core processor, typically hundreds of cores; great for highly parallel code, within memory constraints
ANSYS Mechanical SMP GPU Speedup
[Charts: solver-kernel speedups and overall speedups, Tesla C2050 vs. Intel Xeon 5560]
From "Accelerate FEA Simulations with a GPU" by Jeff Beisheim, ANSYS, NAFEMS World Congress, May, Boston, MA, USA
R14: GPU Acceleration for DANSYS
[Chart: R14 Distributed ANSYS total simulation speedups for the R13 benchmark set, 4 CPU cores vs. 4 CPU cores + 1 GPU; speedups of 1.16x-2.24x across V13cg-1 (JCG, 1100k), V13sp-1 (sparse, 430k), V13sp-2 (sparse, 500k), V13sp-3 (sparse, 2400k), V13sp-4 (sparse, 1000k), V13sp-5 (sparse, 2100k)]
Windows workstation: two Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB RAM, NVIDIA Tesla C2070, Windows 7, TCC driver mode
ANSYS Mechanical Multi-Node GPU
Solder-joint benchmark (4 MDOF, creep-strain analysis; solder balls, mold, PCB)
[Chart: R14 Distributed ANSYS total speedup at 16, 32, and 64 cores, with vs. without GPU; GPU runs reach 3.2x-4.4x vs. 1.7x-1.9x CPU-only]
Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, NVIDIA Tesla M2070, InfiniBand
Results courtesy of MicroConsult Engineering GmbH
GPU Acceleration for CFD
- Radiation view-factor calculation (ANSYS FLUENT 14 beta)
- First capability targets specialty physics: view factors, ray tracing, reaction rates, etc.
- R&D focus on linear solvers and smoothers, but potential is limited by Amdahl's law
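Amdahl's law quantifies that limit: if only a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p/s). A quick worked example with illustrative numbers:

```python
# Amdahl's law: overall speedup when fraction p of the runtime is
# accelerated s-fold; the rest stays serial (or on the CPU).

def amdahl(p, s):
    """Overall speedup = 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - p) + p / s)

# Even a 10x GPU speedup on half the runtime yields under 2x overall:
print(amdahl(0.5, 10.0))   # ~1.82
# The limit as s grows without bound is 1 / (1 - p), here 2x:
print(amdahl(0.5, 1e9))    # ~2.0
```

Hence the slide's caveat: unless the GPU covers a large fraction of total solver time, kernel-level gains translate into modest end-to-end gains.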
Case Study: HPC for High-Fidelity CFD (Cummins)
- 8M- to 12M-element turbocharger models (ANSYS CFX)
- Previous practice (8-node HPC): full-stage compressor runs 36-48 hours; turbine simulations up to 72 hours
- Current practice (160 nodes, 32 nodes per simulation): full-stage compressor 4 hours; turbine simulations 5-6 hours
- Simultaneous consideration of 5 design ideas
- Ability to address design uncertainty (clearance tolerance)
"ANSYS HPC technology is enabling Cummins to use larger models with greater geometric details and more-realistic treatment of physical phenomena."
http://www.ansys.com/about+ansys/ansys+advantage+magazine/current+issue
Case Study: HPC for High-Fidelity CFD (EURO/CFD)
- Model sizes up to 200M cells (ANSYS FLUENT)
- Cluster of 700 cores; 64-256 cores per simulation
- Typical runs: 3M cells (6 days), 10M cells (5 days), 25M cells (4 days), 50M cells (2 days)
- Growing complexity of physical phenomena: compressibility, conduction/convection, supersonic flow, multiphase, radiation, transient analysis, optimization/DOE, dynamic mesh, spatial-temporal accuracy, LES, combustion, aeroacoustics, fluid-structure interaction
Case Study: HPC for High-Fidelity Mechanical (MicroConsult GmbH)
- Solder-joint failure analysis: thermal stress (7.8 MDOF), creep strain (5.5 MDOF)
- Simulation time reduced from 2 weeks to 1 day
- From 8-26 cores (past) to 128 cores (present)
"HPC is an important competitive advantage for companies looking to optimize the performance of their products and reduce time to market."
Case Study: HPC for Desktop Productivity (Cognity Limited)
- Steerable conductors for oil recovery
- ANSYS Mechanical simulations to determine load-carrying capacity
- 750K elements, many contacts
- 12-core workstations, 24 GB RAM
- 6X speedup; results in 1 hour or less; 5-10 design iterations per day
"Parallel processing makes it possible to evaluate five to 10 design iterations per day, enabling Cognity to rapidly improve their design."
http://www.ansys.com/about+ansys/ansys+advantage+magazine/current+issue
Case Study: Skewed Waveguide Array (HFSS)
- 16x16 (256 elements and excitations) skewed rectangular waveguide (WR90) array
- 1.3M matrix size
- 8 cores: 3 hr solution time, 0.4 GB memory total
- 16 cores: 2 hr solution time, 0.8 GB memory total
- Additional cores: faster solution time, more memory
[Figure: unit cell shown with wireframe view of virtual array]
Case Study: Desktop Productivity, a Cautionary Tale (NVIDIA)
- Case study on the value of hardware refresh and software best practice
- Deflection and bending of 3D glasses; ANSYS Mechanical, 1M-DOF models
- Optimization of: solver selection (direct vs. iterative), machine memory (in-core execution), multicore (8-way) parallelism with GPU acceleration
- Before/after: 77x speedup, from 60 hours per simulation to 47 minutes
- Most importantly: HPC tuning added scope for design exploration and optimization
Take-Home Points / Discussion
- ANSYS HPC performance enables scaling for high fidelity
  - What could you learn from a 10M (or 100M) cell/DOF model?
  - What could you learn if you had time to consider 10x more design ideas?
- Scaling applies to all physics and all hardware (desktop and cluster)
- ANSYS continually invests in software development for HPC, maximizing the value of your HPC investment
- This creates a differentiated competitive advantage for ANSYS users
Comments / Questions / Discussion