AcuSolve Performance Benchmark and Profiling October 2011

Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: AMD, Dell, Mellanox, Altair
  - Compute resource: HPC Advisory Council Cluster Center
- For more info please refer to
  - http://www.amd.com
  - http://www.dell.com
  - http://www.mellanox.com
  - http://www.altairhyperworks.com/product,54,acusolve.aspx

AcuSolve
- AcuSolve is a leading general-purpose finite element-based Computational Fluid Dynamics (CFD) flow solver with superior robustness, speed, and accuracy
- AcuSolve can be used by designers and research engineers with all levels of expertise, either as a standalone product or seamlessly integrated into a powerful design and analysis application
- With AcuSolve, users can quickly obtain quality solutions without iterating on solution procedures or worrying about mesh quality or topology

Objectives
- The following was done to provide best practices
  - AcuSolve performance benchmarking
  - Understanding AcuSolve communication patterns
  - Ways to increase AcuSolve productivity
  - Network interconnect comparisons
- The presented results will demonstrate
  - The scalability of the compute environment
  - The capability of AcuSolve to achieve scalable productivity
  - Considerations for performance optimizations

Test Cluster Configuration
- Dell PowerEdge C6145 6-node quad-socket (288-core) cluster
  - AMD Opteron 6174 (code name "Magny-Cours") 12-core 2.2 GHz CPUs
  - Memory: 128GB DDR3 1066MHz memory per node
- Mellanox ConnectX-3 VPI adapters for 56Gb/s FDR InfiniBand and 40Gb/s Ethernet
- Mellanox MTS3600Q 36-port 40Gb/s QDR InfiniBand switch
- Fulcrum-based 10Gb/s Ethernet switch
- OS: RHEL 6.1, MLNX-OFED 1.5.3 InfiniBand SW stack
- MPI: Platform MPI 7.1
- Application: AcuSolve 1.8a
- Benchmark workload: Pipe_fine, 2 meshes
  - 350 axial nodes: 1.52 million mesh points total, 8.89 million tetrahedral elements
  - 700 axial nodes: 3.04 million mesh points total, 17.8 million tetrahedral elements
  - The pipe_fine test computes the steady-state flow conditions for the turbulent flow (Re = 30000) of water in a pipe with heat transfer. The pipe is 1 meter in length and 150 cm in diameter. Water enters the inlet at room temperature conditions.

About Dell PowerEdge Platform Advantages
- Best-of-breed technologies and partners
- The combination of the AMD Opteron 6100 series platform and Mellanox ConnectX InfiniBand on Dell HPC Solutions provides the ultimate platform for speed and scale
- The Dell PowerEdge C6145 system delivers 8-socket performance in a dense 2U form factor
  - Up to 48 cores/32 DIMMs per server, 2016 cores in a 42U enclosure
- Integrated stacks designed to deliver the best price/performance/watt
  - 2x more memory and processing power in half the space
  - Energy-optimized low-flow fans, improved power supplies and dual SD modules
- Optimized for long-term capital and operating investment protection
  - System expansion
  - Component upgrades and feature releases

AcuSolve Performance - Threads Per Node
- AcuSolve allows running in an MPI-thread hybrid mode
  - Allows the MPI processes to focus on message passing while threads handle the computation (see the sketch below)
- The optimal thread count differs between the datasets
  - Using 12 threads per node is the most optimal for the dataset with 350 axial nodes
  - Using 24 threads per node is the most optimal for the dataset with 700 axial nodes
(Chart: higher is better; 6 nodes, InfiniBand QDR)
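To make the hybrid model above concrete, here is a minimal MPI + OpenMP sketch (illustrative only, not AcuSolve code or its launch mechanism): one MPI rank per node does the message passing while OpenMP threads share the local computation.

```c
/* Minimal MPI + OpenMP hybrid sketch (illustrative only, not AcuSolve code):
 * one MPI rank per node handles message passing while OpenMP threads
 * share the local computation. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0;

    /* Threads share the per-node work (e.g. element loops). */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += (double)i * 1.0e-6;

    /* Only the MPI rank communicates; the threads stay out of MPI. */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%g\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```

In a setup like this sketch, the rank count would match the node count and the thread count per rank would be tuned per dataset (12 or 24 in the results above), e.g. via OMP_NUM_THREADS.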

AcuSolve Performance - Interconnect
- InfiniBand QDR delivers the best performance for AcuSolve
  - Up to 75% better performance than 10GigE at 6 nodes (12 threads per node)
  - Up to 99% better performance than 1GigE at 6 nodes (12 threads per node)
- Network bandwidth enables AcuSolve to scale
  - Higher throughput allows AcuSolve to achieve higher productivity
(Chart: higher is better; 48 cores per node; gains of 99%, 75%, 67%, 34% shown)

AcuSolve Performance - CPU Frequency
- Higher CPU core frequency enables higher job performance
  - 28% more jobs produced by running the CPU cores at 2200MHz instead of 1800MHz (see the note below)
- The increase in CPU core frequency can directly improve overall job performance
(Chart: higher is better; 48 threads per node; 28% gain shown)
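As a quick sanity check (not part of the original study): if performance scaled exactly linearly with core frequency, the expected gain from 1800 MHz to 2200 MHz would be

\[
\frac{2200\ \text{MHz}}{1800\ \text{MHz}} - 1 \approx 0.22 \;(= 22\%),
\]

so the observed 28% improvement meets, and slightly exceeds, linear frequency scaling on this workload.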

AcuSolve Profiling - MPI/User Time Ratio
- Communication time plays a major role for AcuSolve
  - Communication time occupies the majority of the run time beyond 4 nodes for this benchmark (see the profiling sketch below)
  - A high-speed interconnect becomes crucial as the node count grows
(Chart: 48 threads per node)
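An MPI/user time ratio like the one above can be gathered in several ways; a minimal sketch of one common approach, the standard PMPI interposition interface, is shown below (an assumed illustration, not the tooling used in this study). It times a few representative calls and reports the fraction of wall time spent inside them.

```c
/* Sketch of collecting an MPI-vs-compute time ratio with the standard
 * PMPI profiling interface (assumed approach, not the study's tooling).
 * Link this ahead of the application to intercept a few MPI calls and
 * accumulate the time spent inside MPI. */
#include <mpi.h>
#include <stdio.h>

static double mpi_time = 0.0;   /* seconds spent inside intercepted MPI calls */
static double t_start  = 0.0;   /* wall-clock time at MPI_Init */

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    mpi_time += MPI_Wtime() - t0;
    return rc;
}

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm comm, MPI_Status *status)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Recv(buf, count, datatype, source, tag, comm, status);
    mpi_time += MPI_Wtime() - t0;
    return rc;
}

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    t_start = MPI_Wtime();
    return rc;
}

int MPI_Finalize(void)
{
    double total = MPI_Wtime() - t_start;
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %.1f%% of wall time in intercepted MPI calls\n",
           rank, 100.0 * mpi_time / total);
    return PMPI_Finalize();
}
```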

AcuSolve Profiling - MPI/User Run Time
- InfiniBand reduces the CPU overhead for processing network data
  - Better network communication reduces the time spent in both computation and communication
  - InfiniBand offloads network transfers to the HCA, allowing the CPU to focus on computation
  - The Ethernet solutions cause the job to run slower
(Chart: 12 threads per node)

AcuSolve Profiling - Number of MPI Calls
- The most used MPI functions are for data transfers: MPI_Recv and MPI_Isend
  - Reflects that AcuSolve communicates mostly through point-to-point data transfers and requires good network throughput
- The number of calls increases proportionally as the cluster scales
(Chart: 48 threads per node)

AcuSolve Profiling - Time Spent in MPI Calls
- The communication time is spent mostly in the following MPI functions (a sketch of this communication pattern follows below):
  - InfiniBand: MPI_Allreduce (41%), MPI_Recv (30%), MPI_Barrier (24%)
  - 10GigE: MPI_Allreduce (58%), MPI_Recv (32%), MPI_Barrier (9%)
  - 1GigE: MPI_Recv (54%), MPI_Barrier (29%), MPI_Allreduce (16%)
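The sketch below illustrates, under assumed partition and halo sizes, the kind of iteration loop that produces such a profile: point-to-point exchange with MPI_Isend/MPI_Recv plus a global MPI_Allreduce on a residual value (illustrative only; this is not AcuSolve code).

```c
/* Illustrative sketch (not AcuSolve code) of the communication pattern
 * behind the profile above: each rank exchanges halo data with a
 * neighbor using MPI_Isend / MPI_Recv, then all ranks join an
 * MPI_Allreduce on a residual value. Run with e.g. mpirun -np 4. */
#include <mpi.h>
#include <stdio.h>

#define HALO 4096   /* assumed halo size (doubles) per neighbor */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* neighbors in a simple ring */
    int left  = (rank - 1 + size) % size;

    static double sendbuf[HALO], recvbuf[HALO];
    for (int i = 0; i < HALO; i++) sendbuf[i] = rank;

    for (int iter = 0; iter < 10; iter++) {
        MPI_Request req;

        /* Non-blocking send of boundary data ... */
        MPI_Isend(sendbuf, HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req);
        /* ... matched by a blocking receive from the other neighbor. */
        MPI_Recv(recvbuf, HALO, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        /* Global residual: the collective that dominates MPI time above. */
        double local = recvbuf[0], global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0 && iter == 0)
            printf("iteration %d: global value %.1f\n", iter, global);
    }

    MPI_Finalize();
    return 0;
}
```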

AcuSolve Profiling - MPI Message Sizes
- The majority of the MPI messages are small to medium in size
  - Mostly in the range between 0B and 256B (a collection sketch follows below)
- The message size distributions are very similar between the two datasets
  - The dataset with 700 axial nodes generates a much larger number of messages
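A message-size histogram like this can be collected with the same PMPI technique sketched earlier; the fragment below (an assumed illustration, with hypothetical bin boundaries) bins the payload size of each MPI_Isend call.

```c
/* Sketch (assumed approach) of collecting a message-size histogram:
 * intercept MPI_Isend via PMPI and bin the payload size in bytes. */
#include <mpi.h>
#include <stdio.h>

static long bins[4];   /* <=256 B, <=4 KB, <=64 KB, larger */

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)
{
    int typesize;
    PMPI_Type_size(datatype, &typesize);
    long bytes = (long)count * typesize;

    if      (bytes <= 256)    bins[0]++;
    else if (bytes <= 4096)   bins[1]++;
    else if (bytes <= 65536)  bins[2]++;
    else                      bins[3]++;

    return PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
}

int MPI_Finalize(void)
{
    printf("msgs: <=256B %ld, <=4KB %ld, <=64KB %ld, >64KB %ld\n",
           bins[0], bins[1], bins[2], bins[3]);
    return PMPI_Finalize();
}
```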

AcuSolve Profiling - Data Transfer by Process
- The data transferred to each MPI rank is not evenly distributed
  - The data transfer per rank is mirrored according to the rank numbers
- The amount of data grows as the cluster scales
  - From around 20GB max per rank at 4 nodes up to around 80GB per rank at 6 nodes

AcuSolve Profiling - Aggregated Data Transfer
- Aggregated data transfer refers to the total amount of data transferred over the network between all MPI ranks collectively
- The total data transfer jumps sharply as the cluster scales
  - For both datasets, a sizable amount of data is sent and received across the network
  - As compute nodes are added, more data communication generally takes place
(Chart: InfiniBand QDR)

Summary
- AcuSolve is a CFD application with the capability to scale to many nodes
- MPI-thread hybrid mode:
  - Allows the MPI processes to focus on message passing while threads handle the computation
  - Selecting a suitable thread count can have a huge impact on performance and productivity
- CPU: AcuSolve has a high demand for good CPU utilization
  - Higher CPU core frequency allows AcuSolve to achieve higher performance
- Interconnects:
  - InfiniBand QDR delivers the network throughput needed for scaling to many nodes
  - 10GigE and 1GigE take away CPU runtime for handling network transfers
  - The interconnect becomes crucial beyond 4 nodes as more time is spent in MPI for these datasets
- Profiling:
  - A sizable load of data is exchanged over the network
  - MPI calls are mostly concentrated on data transfers rather than data synchronization

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.