ANSYS HPC Technology Leadership. Barbara Hutchings, ANSYS, Inc. September 20, 2011


ANSYS HPC Technology Leadership
Barbara Hutchings, ANSYS, Inc. (barbara.hutchings@ansys.com)
September 20, 2011

Why ANSYS Users Need HPC
Insight you can't get any other way.
HPC enables high fidelity:
- Include details, for reliable results
- Be sure your design is right
- Innovate with confidence
HPC delivers throughput:
- Consider multiple design ideas
- Optimize the design
- Ensure performance across a range of conditions

ANSYS HPC Leadership: A History of HPC Performance
(15% spent on R&D; 570 software developers; partner relationships)
1980s: Vector processing on mainframes
1990: Shared memory multiprocessing for structural simulations
1993: 1st general-purpose parallel CFD with interactive client-server user environment
1994: Iterative PCG solver introduced for large structural analysis
1994-1995: Parallel dynamic mesh refinement and coarsening; dynamic load balancing
1998-1999: Integration with load management systems; support for Linux clusters and low-latency interconnects; 10M cell fluids simulations on 128 processors
1999-2000: 64-bit hardware, large memory addressing; shared memory multiprocessing (HFSS 7)
2001-2003: Parallel dynamic moving/deforming mesh; distributed memory particle tracking
2004: 1st company to solve 100M structural DOF
2005-2006: Parallel meshing (fluids); support for clusters using Windows HPC
2005-2007: Distributed sparse solver; distributed PCG solver; Variational Technology; DANSYS released; Distributed Solve (DSO), HFSS 10
2007-2008: Optimized performance on multicore processors; 1st one billion cell fluids simulation
2009: Ideal scaling to 2048 cores (fluids); teraflop performance at 512 cores (structures); parallel I/O (fluids); Domain Decomposition introduced (HFSS 12)
2010: Hybrid parallel for sustained multicore performance (fluids); GPU acceleration (structures)
Today's multi-core / many-core evolution makes HPC a software development imperative. ANSYS is committed to maintaining performance leadership.

HPC: A Software Development Imperative
Clock speeds are leveling off; core counts are growing, and exploding on GPUs.
Future performance depends on highly scalable parallel software.
Source: http://www.lanl.gov/news/index.php/fuseaction/1663.article/d/20085/id/13277

ANSYS FLUENT Scaling Achievement
[Charts: solver rating vs. number of cores. Left: 2008 hardware (Intel Harpertown, DDR IB), FLUENT 6.3.35 and 12.0.5 vs. ideal, up to ~1200 cores. Right: 2010 hardware (Intel Westmere, QDR IB), FLUENT 12.1.0 and 13.0.0 vs. ideal, up to 1536 cores.]
Systems keep improving: faster processors, more cores. Ideal rating (speed) doubled in two years!
Memory bandwidth per core and network latency/bandwidth stress scalability.
The 2008 release (12.0) re-architected MPI: a huge scaling improvement, for a while.
The 2010 release (13.0) introduces hybrid parallelism, and scaling continues!

Extreme CFD Scaling: 1000s of Cores
[Chart: core solver rating vs. number of cores (up to 3840), scaling to thousands of cores on the 111M cell truck benchmark, ANSYS Fluent 13.0 vs. ANSYS Fluent 14.0 (pre-release).]
Enabled by ongoing software innovation.
Hybrid parallel: fast shared memory communication (OpenMP) within a machine to speed up overall solver performance; distributed memory (MPI) between machines.
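For readers unfamiliar with the hybrid pattern, here is a minimal sketch (illustration only, not ANSYS source code): MPI ranks own partitions on different machines, while OpenMP threads share the work inside each rank. The cell count and the update kernel are hypothetical placeholders.

    /* Hybrid MPI + OpenMP sketch: MPI between machines, OpenMP within one. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define LOCAL_CELLS 100000   /* hypothetical cells per MPI rank */

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;
        /* Request thread support so OpenMP threads may coexist with MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double *phi = malloc(LOCAL_CELLS * sizeof(double));
        for (int i = 0; i < LOCAL_CELLS; ++i) phi[i] = 0.0;

        /* Shared-memory parallelism inside the node: threads update local cells. */
        #pragma omp parallel for
        for (int i = 0; i < LOCAL_CELLS; ++i)
            phi[i] += 1.0;                /* placeholder cell update */

        /* Distributed-memory step between machines: a global reduction. */
        double local_sum = 0.0, global_sum = 0.0;
        for (int i = 0; i < LOCAL_CELLS; ++i) local_sum += phi[i];
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("ranks=%d threads/rank=%d global sum=%g\n",
                   nranks, omp_get_max_threads(), global_sum);

        free(phi);
        MPI_Finalize();
        return 0;
    }

The design point is the same one the slide makes: message passing only crosses machine boundaries, while communication inside a machine stays in shared memory.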

Parallel Scaling: ANSYS Mechanical
Focus on bottlenecks in the distributed memory solvers (DANSYS).
Sparse solver: parallelized equation ordering; 40% faster with updated Intel MKL.
[Chart: sparse solver (parallel re-ordering) solution rating vs. number of cores, up to 256, R12.1 vs. R13.0.]
Preconditioned Conjugate Gradient (PCG) solver: parallelized preconditioning step.
[Chart: PCG solver (pre-conditioner scaling) solution rating vs. number of cores, up to 64, R12.1 vs. R13.0.]
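For background on where a parallelized preconditioning step sits in a PCG iteration, here is a generic Jacobi-preconditioned CG sketch (an assumed textbook formulation for illustration, not the ANSYS PCG solver). The preconditioner application z = M^-1 r and the sparse matrix-vector product are the loops threaded here.

    #include <math.h>

    typedef struct { int n; const int *rowptr, *col; const double *val; } CSR;

    static void spmv(const CSR *A, const double *x, double *y) {
        #pragma omp parallel for
        for (int i = 0; i < A->n; ++i) {
            double s = 0.0;
            for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; ++k)
                s += A->val[k] * x[A->col[k]];
            y[i] = s;
        }
    }

    /* Solve A x = b; diag holds the matrix diagonal (Jacobi preconditioner);
     * r, z, p, q are caller-provided work arrays of length n. */
    int pcg(const CSR *A, const double *diag, const double *b, double *x,
            double *r, double *z, double *p, double *q, int maxit, double tol)
    {
        int n = A->n;
        spmv(A, x, q);
        for (int i = 0; i < n; ++i) r[i] = b[i] - q[i];

        double rho = 0.0;
        #pragma omp parallel for reduction(+:rho)
        for (int i = 0; i < n; ++i) {      /* preconditioning step, in parallel */
            z[i] = r[i] / diag[i];
            rho += r[i] * z[i];
        }
        for (int i = 0; i < n; ++i) p[i] = z[i];

        for (int it = 0; it < maxit; ++it) {
            spmv(A, p, q);
            double pq = 0.0;
            for (int i = 0; i < n; ++i) pq += p[i] * q[i];
            double alpha = rho / pq;
            double rnorm = 0.0;
            for (int i = 0; i < n; ++i) {
                x[i] += alpha * p[i];
                r[i] -= alpha * q[i];
                rnorm += r[i] * r[i];
            }
            if (sqrt(rnorm) < tol) return it + 1;   /* converged */

            double rho_new = 0.0;
            #pragma omp parallel for reduction(+:rho_new)
            for (int i = 0; i < n; ++i) {  /* parallel preconditioner apply */
                z[i] = r[i] / diag[i];
                rho_new += r[i] * z[i];
            }
            double beta = rho_new / rho;
            rho = rho_new;
            for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        }
        return -1;  /* not converged within maxit iterations */
    }

In a production solver the preconditioner is far more elaborate than a diagonal scaling, but the structural point is the same: if the preconditioning step stays serial, it caps scaling of the whole iteration.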

Architecture-Aware Partitioning
Original partitions are remapped to the cluster considering the network topology and latencies.
Minimizes inter-machine traffic, reducing load on network switches.
Improves performance, particularly on slow interconnects and/or large clusters.
[Figure: partition graph for 3 machines, 8 cores each; colors indicate machines; original mapping vs. new mapping.]
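To make the objective concrete, here is a small sketch (illustration only, not the ANSYS remapping algorithm) that measures the quantity being minimized: how much partition-to-partition communication crosses machine boundaries under a given mapping. The adjacency weights and mappings are made up.

    #include <stdio.h>

    #define NPART 6

    /* comm[i][j]: hypothetical communication volume between partitions i and j */
    static const int comm[NPART][NPART] = {
        {0, 9, 1, 0, 0, 2},
        {9, 0, 8, 1, 0, 0},
        {1, 8, 0, 7, 1, 0},
        {0, 1, 7, 0, 9, 1},
        {0, 0, 1, 9, 0, 8},
        {2, 0, 0, 1, 8, 0},
    };

    static int inter_machine_traffic(const int *machine_of)
    {
        int total = 0;
        for (int i = 0; i < NPART; ++i)
            for (int j = i + 1; j < NPART; ++j)
                if (machine_of[i] != machine_of[j])
                    total += comm[i][j];
        return total;
    }

    int main(void)
    {
        /* Two machines, three partitions each. */
        int original[NPART] = {0, 1, 0, 1, 0, 1};  /* round-robin mapping      */
        int remapped[NPART] = {0, 0, 0, 1, 1, 1};  /* keeps neighbors together */
        printf("original mapping: %d units cross machines\n",
               inter_machine_traffic(original));   /* 43 units */
        printf("remapped:         %d units cross machines\n",
               inter_machine_traffic(remapped));   /* 11 units */
        return 0;
    }

An architecture-aware remapper searches for an assignment with a low value of this metric, so neighboring partitions land on the same machine and the switch fabric carries less traffic.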

File I/O Performance
Case file I/O: both read and write are significantly faster in R13, through a combination of serial-I/O optimizations and parallel-I/O techniques, where available.
Parallel I/O (.pdat): significant speedup of parallel I/O, particularly for cases with a large number of zones; support for Lustre, EMC/MPFS, and AIX/GPFS file systems added.
Data file I/O (.dat): performance in R12 was already highly optimized; further incremental improvements in R13.
[Charts: R12 vs. R13 case-read times (BMW -68%, FL5L2 4M -63%, Circuit -97%, Truck 14M -64%) and parallel data write for the truck_14m case at 48, 96, 192, and 384 cores, release 12.1.0 vs. 13.0.0.]
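As a point of reference for what "parallel-I/O techniques" look like at the code level, here is the basic collective MPI-IO write pattern (illustration only, not the Fluent I/O layer); the file name and buffer size are hypothetical.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int local_count = 1000;                 /* values owned by this rank */
        double *buf = malloc(local_count * sizeof(double));
        for (int i = 0; i < local_count; ++i) buf[i] = rank + i * 1e-3;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "solution.pdat.example",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Collective write: each rank writes its non-overlapping slice, and the
         * MPI library can aggregate requests into large, well-aligned accesses
         * that parallel file systems such as Lustre or GPFS handle efficiently. */
        MPI_Offset offset = (MPI_Offset)rank * local_count * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, local_count, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }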

What About GPU Computing?
CPUs and GPUs work in a collaborative fashion, connected by a PCI Express channel.
CPU: multi-core processor, typically 4-6 cores; powerful and general purpose.
GPU: many-core processor, typically hundreds of cores; great for highly parallel code, within memory constraints.

ANSYS Mechanical SMP GPU Speedup
[Charts: solver kernel speedups and overall speedups, Tesla C2050 and Intel Xeon 5560.]
From the NAFEMS World Congress (May, Boston, MA, USA): "Accelerate FEA Simulations with a GPU" by Jeff Beisheim, ANSYS.

R14: GPU Acceleration for DANSYS
[Chart: R14 Distributed ANSYS total simulation speedups on the R13 benchmark set, 4 CPU cores + 1 GPU vs. 4 CPU cores, for V13cg-1 (JCG, 1100k), V13sp-1 (sparse, 430k), V13sp-2 (sparse, 500k), V13sp-3 (sparse, 2400k), V13sp-4 (sparse, 1000k), and V13sp-5 (sparse, 2100k); speedups range from 1.16x to 2.24x.]
Windows workstation: two Intel Xeon 5560 processors (2.8 GHz, 8 cores total), 32 GB RAM, NVIDIA Tesla C2070, Windows 7, TCC driver mode.

ANSYS Mechanical Multi-Node GPU
Solder joint benchmark (4 MDOF, creep strain analysis); model includes solder balls, mold, and PCB.
Linux cluster: each node contains 12 Intel Xeon 5600-series cores, 96 GB RAM, an NVIDIA Tesla M2070, and InfiniBand.
[Chart: R14 Distributed ANSYS total speedup with and without GPU at 16, 32, and 64 cores; reported values include 1.7x, 1.9x, 3.2x, 3.4x, and 4.4x.]
Results courtesy of MicroConsult Engineering, GmbH.

GPU Acceleration for CFD
Radiation view factor calculation (ANSYS FLUENT 14, beta): the first GPU capability, aimed at specialty physics such as view factors, ray tracing, and reaction rates.
R&D focus is on linear solvers and smoothers, but the potential is limited by Amdahl's law.
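For context (illustrative numbers, not ANSYS measurements): Amdahl's law gives the overall speedup S = 1 / ((1 - p) + p/s), where p is the fraction of runtime that is accelerated and s is the speedup of that fraction. If, say, the linear solver were p = 0.5 of total runtime and a GPU made it s = 10 times faster, the whole simulation would speed up by only 1 / (0.5 + 0.05), about 1.8x, which is why accelerating only one kernel has limited headroom.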

Case Study: HPC for High-Fidelity CFD (Cummins)
8M to 12M element turbocharger models (ANSYS CFX).
Previous practice (8 HPC nodes): full-stage compressor runs took 36-48 hours; turbine simulations up to 72 hours.
Current practice (160 nodes, 32 nodes per simulation): full-stage compressor in 4 hours; turbine simulations in 5-6 hours.
Simultaneous consideration of 5 ideas; ability to address design uncertainty (clearance tolerance).
"ANSYS HPC technology is enabling Cummins to use larger models with greater geometric details and more-realistic treatment of physical phenomena."
http://www.ansys.com/about+ansys/ansys+advantage+magazine/current+issue

Case Study: HPC for High-Fidelity CFD (EURO/CFD)
Model sizes up to 200M cells (ANSYS FLUENT); cluster of 700 cores; 64-256 cores per simulation.
[Figure: progression of model size and turnaround time: 3 million cells (6 days), 10 million (4-5 days), 25 million (4 days), 50 million (2 days).]
Increasing complexity of physical phenomena: compressibility, conduction/convection, supersonic flow, multiphase, radiation, transient analysis, optimisation/DOE, dynamic mesh, spatial-temporal accuracy, LES, combustion, aeroacoustics, fluid-structure interaction.

Case Study: HPC for High-Fidelity Mechanical (MicroConsult GmbH)
Solder joint failure analysis: thermal stress at 7.8 MDOF; creep strain at 5.5 MDOF.
Simulation time reduced from 2 weeks to 1 day; from 8-26 cores (past) to 128 cores (present).
"HPC is an important competitive advantage for companies looking to optimize the performance of their products and reduce time to market."

Case Study: HPC for Desktop Productivity (Cognity Limited)
Steerable conductors for oil recovery; ANSYS Mechanical simulations to determine load-carrying capacity.
750K elements, many contacts; 12-core workstations with 24 GB RAM.
6X speedup; results in 1 hour or less; 5-10 design iterations per day.
"Parallel processing makes it possible to evaluate five to 10 design iterations per day, enabling Cognity to rapidly improve their design."
http://www.ansys.com/about+ansys/ansys+advantage+magazine/current+issue

Case Study: Skewed Waveguide Array (HFSS)
16x16 (256 elements and excitations) skewed rectangular waveguide (WR90) array; 1.3M matrix size.
Using 8 cores: 3 hrs solution time, 0.4 GB memory total.
Using 16 cores: 2 hrs solution time, 0.8 GB memory total.
Additional cores: faster solution time, more memory.
[Figure: unit cell shown with wireframe view of the virtual array.]

Case Study: Desktop Productivity, a Cautionary Tale
NVIDIA case study on the value of hardware refresh and software best practice.
Deflection and bending of 3D glasses; ANSYS Mechanical, 1M DOF models.
Optimization of: solver selection (direct vs. iterative), machine memory (in-core execution), and multicore (8-way) parallelism with GPU acceleration.
Before/after: 77x speedup, from 60 hours per simulation to 47 minutes.
Most importantly: HPC tuning added scope for design exploration and optimization.

Take-Home Points / Discussion
ANSYS HPC performance enables scaling for high fidelity. What could you learn from a 10M (or 100M) cell/DOF model? What could you learn if you had time to consider 10x more design ideas?
Scaling applies to all physics and all hardware (desktop and cluster).
ANSYS continually invests in software development for HPC to maximize the value of your HPC investment. This creates a differentiated competitive advantage for ANSYS users.
Comments / Questions / Discussion