The Effect of HPC Cluster Architecture on the Scalability Performance of CAE Simulations


The Effect of HPC Cluster Architecture on the Scalability Performance of CAE Simulations Pak Lui HPC Advisory Council June 7, 2016

Agenda Introduction to HPC Advisory Council Benchmark Configuration Performance Benchmark Testing/Results Summary Q&A / For More Information

The HPC Advisory Council Mission Statement World-wide HPC non-profit organization (429+ members) Bridges the gap between HPC usage and its potential Provides best practices and a support/development center Explores future technologies and future developments Leading edge solutions and technology demonstrations

HPC Advisory Council Members

HPC Advisory Council Cluster Center
Dell PowerEdge R730 32-node cluster
Dell PowerVault MD3420
Dell PowerVault MD3460
HP ProLiant XL230a Gen9 10-node cluster
HP Cluster Platform 3000SL 16-node cluster
InfiniBand Storage (Lustre)
Dell PowerEdge C6145 6-node cluster
Dell PowerEdge R815 11-node cluster
Dell PowerEdge R720xd/R720 32-node cluster
Dell PowerEdge M610 38-node cluster
Dell PowerEdge C6100 4-node cluster
White-box InfiniBand-based Storage (Lustre)

HPC Training HPC Training Center CPUs GPUs Interconnects Clustering Storage Cables Programming Applications Network of Experts Ask the experts

Special Interest Subgroups
HPC Scale Subgroup: Explore usage of commodity HPC as a replacement for multi-million dollar mainframes and proprietary-based supercomputers
HPC Storage Subgroup: Demonstrate how to build high-performance storage solutions and their effect on application performance and productivity
HPC Cloud Subgroup: Explore usage of HPC components as part of the creation of external/public/internal/private cloud computing environments
HPC GPU Subgroup: Explore usage models of GPU components as part of next generation compute environments and potential optimizations for GPU based computing
HPC Works Subgroup: Provide best practices for building balanced and scalable HPC systems, performance tuning and application guidelines
HPC Music: To enable HPC in music production and to develop HPC cluster solutions that further enable the future of music production

University Award Program
Universities / individuals are encouraged to submit proposals for advanced research
Selected proposals will be provided with: exclusive computation time on the HPC Advisory Council's Compute Center, an invitation to present in one of the HPC Advisory Council's worldwide workshops, and publication of the research results on the HPC Advisory Council website
2010 award winner: Dr. Xiangqian Hu, Duke University Topic: Massively Parallel Quantum Mechanical Simulations for Liquid Water
2011 award winner: Dr. Marco Aldinucci, University of Torino Topic: Effective Streaming on Multi-core by Way of the FastFlow Framework
2012 award winner: Jacob Nelson, University of Washington Topic: Runtime Support for Sparse Graph Applications
2013 award winner: Antonis Karalis Topic: Music Production using HPC
2014 award winner: Antonis Karalis Topic: Music Production using HPC
2015 award winner: Christian Kniep Topic: Dockers
To submit a proposal please check the HPC Advisory Council web site

Exploring All Platforms X86, Power, GPU, FPGA and ARM based Platforms x86 Power GPU FPGA ARM

158+ Applications Best Practices Published: Abaqus, ABySS, AcuSolve, Amber, AMG, AMR, ANSYS CFX, ANSYS FLUENT, ANSYS Mechanics, BQCD, CCSM, CESM, COSMO, CP2K, CPMD, Dacapo, Desmond, DL-POLY, Eclipse, FLOW-3D, GADGET-2, GROMACS, Himeno, HOOMD-blue, HYCOM, ICON, Lattice QCD, LAMMPS, LS-DYNA, MILC, minife, MM5, MPQC, MR Bayes, MSC Nastran, NAMD, Nekbone, NEMO, NWChem, Octopus, OpenAtom, OpenFOAM, OpenMX, PARATEC, PFA, PFLOTRAN, Quantum ESPRESSO, RADIOSS, SPECFEM3D, WRF. For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php

HPCAC - ISC 16 Student Cluster Competition University-based teams compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the International Supercomputing HPC (ISC HPC) show floor The Student Cluster Challenge is designed to introduce the next generation of students to the high performance computing world and community

ISC'15 Student Cluster Competition Award Ceremony

ISC'16 Student Cluster Competition Teams

Getting Ready for the 2016 Student Cluster Competition

HPCAC Conferences 2015 Conferences

2016 HPC Advisory Council Conferences
Introduction: HPC Advisory Council (HPCAC), 429+ members, http://www.hpcadvisorycouncil.com/
Application best practices, case studies
Benchmarking center with remote access for users
World-wide workshops
Value add for your customers to stay up to date and in tune with the HPC market
2016 Conferences: USA (Stanford University) February 24-26; Switzerland (CSCS) March 21-23; Mexico TBD; Spain (BSC) September 21; China (HPC China) October 26
For more information: www.hpcadvisorycouncil.com, info@hpcadvisorycouncil.com

RADIOSS by Altair Altair RADIOSS is a structural analysis solver for highly non-linear problems under dynamic loadings It consists of features for multiphysics simulation and advanced materials such as composites Highly differentiated for Scalability, Quality and Robustness RADIOSS is used across all industries worldwide Improves crashworthiness, safety, and manufacturability of structural designs RADIOSS has established itself as an industry standard for automotive crash and impact analysis for over 20 years

Test Cluster Configuration
Dell PowerEdge R730 32-node (896-core) Thor cluster
Dual-Socket 14-core Intel E5-2697v3 @ 2.60 GHz CPUs (Turbo on, Max Perf set in BIOS)
OS: RHEL 6.5, OFED MLNX_OFED_LINUX-2.4-1.0.5 InfiniBand SW stack
Memory: 64GB memory, DDR4 2133 MHz
Hard Drives: 1TB 7.2K RPM SATA 2.5"
Mellanox ConnectX-4 EDR 100Gb/s InfiniBand VPI adapters
Mellanox Switch-IB SB7700 100Gb/s InfiniBand VPI switch
Mellanox ConnectX-3 40/56Gb/s QDR/FDR InfiniBand VPI adapters
Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand VPI switch
MPI: Intel MPI 5.0.2, Mellanox HPC-X v1.2.0
Application: Altair RADIOSS 13.0
Benchmark dataset: Neon benchmark, 1 million elements (8ms, Double Precision), unless otherwise stated

PowerEdge R730
Massive flexibility for data intensive operations: performance and efficiency; intelligent hardware-driven systems management with extensive power management features; innovative tools including automation for parts replacement and lifecycle manageability; broad choice of networking technologies from GbE to IB; built-in redundancy with hot-plug and swappable PSUs, HDDs and fans
Benefits: designed for performance workloads, from big data analytics, distributed storage or distributed computing where local storage is key, to classic HPC and large-scale hosting environments; high-performance scale-out compute and low-cost dense storage in one package
Hardware capabilities: flexible compute platform with dense storage capacity; 2S/2U server, 7 PCIe slots; large memory footprint (up to 1.5TB / 24 DIMMs); high I/O performance and optional storage configurations; HDD: SAS, SATA, nearline SAS; SSD: SAS, SATA; 16 x 2.5" up to 29TB via 1.8TB hot-plug SAS hard drives; 8 x 3.5" up to 64TB via 8TB hot-plug nearline SAS hard drives

RADIOSS Performance Interconnect (MPP) EDR InfiniBand provides higher scalability than Ethernet 70 times better performance than 1GbE at 16 nodes / 448 cores 4.8x better performance than 10GbE at 16 nodes / 448 cores Ethernet solutions do not scale beyond 4 nodes with pure MPI 70x 4.8x Intel MPI Higher is better 28 Processes/Node

RADIOSS Profiling % Time Spent on MPI RADIOSS utilizes point-to-point communications in most data transfers The most time-consuming MPI calls are MPI_Recv() and MPI_Waitany(): MPI_Recv (55%), MPI_Waitany (23%), MPI_Allreduce (13%) MPP Mode 28 Processes/Node

RADIOSS Performance Interconnect (MPP) EDR InfiniBand provides better scalability performance EDR IB improves over QDR IB by 28% at 16 nodes / 448 cores EDR InfiniBand outperforms FDR InfiniBand by 25% at 16 nodes 28% 25% Higher is better 28 Processes/Node

RADIOSS Performance CPU Cores Running more cores per node generally improves overall performance Seen improvement of 18% from 20 to 28 cores per node at 8 nodes Improvement is not as consistent at higher node counts Guideline: the most optimal workload distribution is 4000 elements/process For the test case of 1 million elements, the most optimal core count is ~256 cores 4000 elements per process should provide sufficient workload for each process (see the sizing sketch below) Hybrid MPP (HMPP) provides a way to achieve additional scalability on more CPUs 18% 6% Higher is better Intel MPI
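As a rough illustration of the 4000-elements-per-process guideline above, here is a minimal sizing sketch in Python; the helper name, the 28-cores-per-node default, and the rounding choices are illustrative assumptions, not part of any RADIOSS tooling.

```python
# Rough sizing helper for the "~4000 elements per MPI process" guideline
# quoted above. Function name and rounding choices are illustrative assumptions.

def suggested_mpi_processes(num_elements, elements_per_process=4000,
                            cores_per_node=28):
    """Return a (processes, nodes) suggestion for a pure-MPP run."""
    processes = max(1, round(num_elements / elements_per_process))
    nodes = -(-processes // cores_per_node)   # ceiling division to whole nodes
    return processes, nodes

if __name__ == "__main__":
    procs, nodes = suggested_mpi_processes(1_000_000)
    # For the 1M-element Neon case this lands near ~250 processes,
    # consistent with the "~256 cores" sweet spot quoted in the slide.
    print(f"~{procs} MPI processes, i.e. about {nodes} nodes at 28 PPN")
```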

RADIOSS Performance IMPI Tuning (MPP) Tuning the Intel MPI collective algorithm can improve performance The MPI profile shows about 20% of runtime spent on MPI_Allreduce communications The default algorithm in Intel MPI is Recursive Doubling The default algorithm is the best among all tested for MPP Intel MPI Higher is better 28 Processes/Node

RADIOSS Performance Hybrid MPP version Enabling Hybrid MPP mode unlocks the RADIOSS scalability At larger scale, productivity improves as more threads are involved As more threads are involved, the amount of communication between processes is reduced At 32 nodes/896 cores, the best configuration is 1 process per socket spawning 14 threads each 28 threads/1 PPN is not advised because it breaks data locality across CPU sockets The following environment settings and tuned flags are used for Intel MPI (a launch sketch follows below): I_MPI_PIN_DOMAIN auto, I_MPI_ADJUST_ALLREDUCE 5, I_MPI_ADJUST_BCAST 1, KMP_AFFINITY compact, KMP_STACKSIZE 400m, ulimit -s unlimited 3.7x 32% 70% Intel MPI EDR InfiniBand
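For concreteness, a minimal launch sketch that applies the environment settings listed above. The solver binary, input deck, host file, and rank counts are placeholder assumptions, not the actual RADIOSS invocation used in the benchmark; the stack-size limit (ulimit -s unlimited) would normally be raised in the job script itself.

```python
# Hybrid-MPP launch sketch applying the Intel MPI / OpenMP settings listed
# above. Executable, input file, host file and rank counts are placeholders.
import os
import subprocess

env = dict(os.environ,
           I_MPI_PIN_DOMAIN="auto",       # pin each rank to its own domain
           I_MPI_ADJUST_ALLREDUCE="5",    # binomial gather+scatter Allreduce
           I_MPI_ADJUST_BCAST="1",
           KMP_AFFINITY="compact",        # keep OpenMP threads on one socket
           KMP_STACKSIZE="400m",
           OMP_NUM_THREADS="14")          # assumption: 1 rank/socket, 14 threads

cmd = ["mpirun", "-np", "64", "-ppn", "2",           # 32 nodes x 2 ranks per node
       "-hostfile", "hosts.txt",
       "./radioss_hmpp_solver", "-i", "NEON1M.rad"]  # placeholder command line

subprocess.run(cmd, env=env, check=True)
```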

RADIOSS Profiling Number of MPI Calls MPP mostly utilizes non-blocking calls for communication MPI_Recv, MPI_Waitany, MPI_Allreduce are used most of the time For HMPP, the communication behavior has changed: higher time percentage in MPI_Waitany, MPI_Allreduce, and MPI_Recv MPI communication behavior changed from the previous RADIOSS version Most likely due to more CPU cores available on the current cluster MPP, 28PPN HMPP, 2PPN / 14 Threads

RADIOSS Profiling MPI Message Sizes The most time-consuming MPI communications are: MPI_Recv: messages concentrated at 640B, 1KB, 320B, 1280B MPI_Waitany: messages at 48B, 8B, 384B MPI_Allreduce: most message sizes appear at 80B MPP, 28PPN HMPP, 2PPN / 14 Threads Pure MPP 28 Processes/Node

RADIOSS Performance Intel MPI Tuning (DP) For Hybrid MPP DP, tuning MPI_Allreduce shows more gain than in MPP For the DAPL provider, Binomial gather+scatter (#5) improved performance by 27% over the default For the OFA provider, the tuned MPI_Allreduce algorithm improves by 44% over the default Both OFA and DAPL improved by tuning I_MPI_ADJUST_ALLREDUCE=5 Flags for OFA: I_MPI_OFA_USE_XRC 1; for DAPL: the ofa-v2-mlx5_0-1u provider (see the sketch below) 27% 44% Intel MPI Higher is better 2 PPN / 14 OpenMP
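A hedged sketch of how the two provider variants compared above might be expressed. I_MPI_FABRICS and I_MPI_DAPL_PROVIDER are standard Intel MPI controls, but the exact fabric selection used in the benchmark is not spelled out on the slide, so treat the combinations below as assumptions; only the flags named on the slide (I_MPI_OFA_USE_XRC, ofa-v2-mlx5_0-1u, Allreduce #5) come from the source.

```python
# Sketch of the DAPL vs OFA provider settings discussed above.
# Fabric selection values are assumptions for illustration.
import os

def dapl_env():
    env = dict(os.environ)
    env["I_MPI_FABRICS"] = "shm:dapl"                 # assumed fabric selection
    env["I_MPI_DAPL_PROVIDER"] = "ofa-v2-mlx5_0-1u"   # provider named on the slide
    env["I_MPI_ADJUST_ALLREDUCE"] = "5"               # binomial gather+scatter
    return env

def ofa_env():
    env = dict(os.environ)
    env["I_MPI_FABRICS"] = "shm:ofa"                  # assumed fabric selection
    env["I_MPI_OFA_USE_XRC"] = "1"                    # flag named on the slide
    env["I_MPI_ADJUST_ALLREDUCE"] = "5"
    return env
```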

RADIOSS Performance Interconnect (HMPP) EDR InfiniBand provides better scalability performance than Ethernet 214% better performance than 1GbE at 16 nodes 104% better performance than 10GbE at 16 nodes InfiniBand typically outperforms other interconnects in collective operations 214% 104% Intel MPI Higher is better 2 PPN / 14 OpenMP

RADIOSS Performance Interconnect (HMPP) EDR InfiniBand provides better scalability performance than FDR IB EDR IB outperforms FDR IB by 27% at 32 nodes Improvement for EDR InfiniBand occurs at high node count 27% Intel MPI Higher is better 2 PPN / 14 OpenMP

RADIOSS Performance System Generations The Intel E5-2697v3 (Haswell) cluster outperforms prior generations Performs faster by 100% vs Jupiter, by 238% vs Janus at 16 nodes System components used: Thor: 2-socket Intel E5-2697v3@2.6GHz, 2133MHz DIMMs, EDR IB, v13.0 Jupiter: 2-socket Intel E5-2680@2.7GHz, 1600MHz DIMMs, FDR IB, v12.0 Janus: 2-socket Intel X5670@2.93GHz, 1333MHz DIMMs, QDR IB, v12.0 238% 100% Single Precision

RADIOSS Summary RADIOSS is designed to perform at scale in HPC environments Shows excellent scalability over 896 cores/32 nodes and beyond with Hybrid MPP The Hybrid MPP version enhanced RADIOSS scalability 2 MPI processes per node (i.e. 1 MPI process per socket), 14 threads each Additional CPU cores generally accelerate time-to-solution performance Network and MPI Tuning EDR IB outperforms Ethernet in scalability EDR IB delivers higher scalability performance than FDR/QDR IB Tuning environment/parameters helps maximize performance Tuning MPI collective ops helps RADIOSS achieve even better scalability

STAR-CCM+ STAR-CCM+ is an engineering process-oriented CFD tool Client-server architecture, object-oriented programming Delivers the entire CFD process in a single integrated software environment Developed by CD-adapco

Test Cluster Configuration
Dell PowerEdge R730 32-node (896-core) Thor cluster
Dual-Socket 14-Core Intel E5-2697v3 @ 2.60 GHz CPUs
BIOS: Maximum Performance, Home Snoop
Memory: 64GB memory, DDR4 2133 MHz (Snoop Mode: Home Snoop)
OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack
Hard Drives: 2x 1TB 7.2K RPM SATA 2.5" on RAID 1
Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters
Mellanox Switch-IB SB7700 36-port EDR 100Gb/s InfiniBand Switch
Mellanox ConnectX-3 FDR VPI InfiniBand and 40Gb/s Ethernet Adapters
Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet Switch
Dell InfiniBand-Based Lustre Storage based on Dell PowerVault MD3460 and Dell PowerVault MD3420
MPI: Platform MPI 9.1.2
Application: STAR-CCM+ 10.02.012
Benchmarks: lemans_poly_17m, civil_trim_20m, reactor_9m, LeMans_100M.amg

STAR-CCM+ Performance Network Interconnects EDR InfiniBand delivers superior scalability in application performance IB delivers 66% higher performance than 40GbE, 88% higher than 10GbE at 32 nodes Scalability stops beyond 4 nodes for 1GbE; scalability is limited for 10/40GbE Input data: Lemans_poly_17m: A race car model with 17 million cells 88% 66% 748% Higher is better 28 MPI Processes / Node

STAR-CCM+ Performance Network Interconnects EDR InfiniBand delivers superior scalability in application performance EDR IB provides 177% higher performance than 40GbE and 194% higher than 10GbE at 32 nodes InfiniBand demonstrates continuous performance gain at scale Input data: reactor_9m: A reactor model with 9 million cells 194% 177% Higher is better 28 MPI Processes / Node

STAR-CCM+ Profiling % of MPI Calls For the most time consuming MPI calls: Lemans_17m: 55% MPI_Allreduce, 23% MPI_Waitany, 7% MPI_Bcast, 7% MPI_Recv Reactor_9m: 59% MPI_Allreduce, 21% MPI_Waitany, 7% MPI_Recv, 4% MPI_Bcast MPI as a percentage in wall clock times: Lemans_17m: 12% MPI_Allreduce, 5% MPI_Waitany, 2% MPI_Bcast, 2% MPI_Recv Reactor_9m: 15% MPI_Allreduce, 5% MPI_Waitany, 2% MPI_Recv, 1% MPI_Bcast lemans_17m 32 nodes / 896 Processes reactor_9m 32 Nodes / 896 Processes

STAR-CCM+ Profiling MPI Message Size Distribution For the most time consuming MPI calls Lemans_17m: MPI_Allreduce 4B (30%), 16B (19%), 8B (6%), MPI_Bcast 4B (4%) Reactor_9m: MPI_Allreduce 16B (35%), 4B (15%), 8B (8%), MPI_Bcast 1B (4%) lemans_17m 32 nodes / 896 Processes reactor_9m 32 Nodes / 896 Processes

STAR-CCM+ Profiling Time Spent in MPI The majority of the MPI time is spent on MPI collective operations and non-blocking communications Heavy use of MPI collective operations (MPI_Allreduce, MPI_Bcast) and MPI_Waitany Some node imbalance characteristics are shown on both input datasets Some processes appeared to take more time in communications, particularly in MPI_Allreduce lemans_17m 32 nodes / 896 Processes reactor_9m 32 Nodes / 896 Processes

STAR-CCM+ Performance Scalability Speedup EDR InfiniBand demonstrates linear scaling for STAR-CCM+ STAR-CCM+ is able to achieve linear scaling with EDR InfiniBand Other interconnects only provided limited scalability As demonstrated in previous slides (the scaling metrics are defined below) Higher is better 28 MPI Processes / Node
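For reference, the strong-scaling metrics behind this chart can be written out explicitly; these are the standard definitions, and the choice of baseline node count N_0 (often the smallest run in such charts) is an assumption rather than something stated on the slide.

```latex
% Standard strong-scaling definitions; the baseline node count N_0 is an assumption.
\[
  S(N) = \frac{T(N_0)}{T(N)}, \qquad
  E(N) = \frac{S(N)}{N / N_0},
\]
% where $T(N)$ is the elapsed solver time on $N$ nodes. "Linear scaling"
% on the slide corresponds to $E(N) \approx 1$.
```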

STAR-CCM+ Performance System Generations Current system generations of HW & SW configuration outperform prior generations Current Haswell systems outperformed Ivy Bridge by 38%, Sandy Bridge by 149%, Westmere by 409% Dramatic performance benefit due to better system architecture in compute and network scalability System components used: Haswell: 2-socket 14-core E5-2697v3@2.6GHz, DDR4 2133MHz DIMMs, ConnectX-4 EDR InfiniBand, v10.02.012 Ivy Bridge: 2-socket 10-core E5-2680v2@2.8GHz, DDR3 1600MHz DIMMs, Connect-IB FDR InfiniBand, v9.02.005 Sandy Bridge: 2-socket 8-core E5-2680@2.7GHz, DDR3 1600MHz DIMMs, ConnectX-3 FDR InfiniBand, v7.02.008 Westmere: 2-socket 6-core X5670@2.93GHz, DDR3 1333MHz DIMMs, ConnectX-2 QDR InfiniBand, v5.04.006 409% 149% 38% Higher is better

STAR-CCM+ Summary Compute: the current-generation cluster outperforms system architectures of previous generations Outperformed Ivy Bridge by 38%, Sandy Bridge by 149%, Westmere by 409% Dramatic performance benefit due to better system architecture in compute and network scalability Network: EDR InfiniBand demonstrates superior scalability in STAR-CCM+ performance EDR IB provides higher performance by over 4-5 times vs 1GbE, 10GbE and 40GbE, and 15% vs FDR IB at 32 nodes Lemans_17m: scalability stops beyond 4 nodes for 1GbE; scalability is limited for 10/40GbE Reactor_9m: EDR IB provides 177% higher performance than 40GbE and 194% higher than 10GbE at 32 nodes EDR InfiniBand demonstrates linear scalability in STAR-CCM+ performance on the test cases

ANSYS Fluent Computational Fluid Dynamics (CFD) is a computational technology Enables the study of the dynamics of things that flow Enables better understanding of qualitative and quantitative physical phenomena in the flow, which is used to improve engineering design CFD brings together a number of different disciplines Fluid dynamics, mathematical theory of partial differential systems, computational geometry, numerical analysis, and computer science ANSYS FLUENT is a leading CFD application from ANSYS Widely used in almost every industry sector and manufactured product

Test Cluster Configuration
Dell PowerEdge R730 32-node (896-core) Thor cluster
Dual-Socket 14-Core Intel E5-2697v3 @ 2.60 GHz CPUs Turbo enabled (Power Management: Maximum Performance)
Memory: 64GB memory, DDR4 2133 MHz (Memory Snoop: Home Snoop)
OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack
Hard Drives: 2x 1TB 7.2K RPM SATA 2.5" on RAID 1
Mellanox Switch-IB SB7700 36-port 100Gb/s EDR InfiniBand Switch
Mellanox ConnectX-4 EDR 100Gb/s InfiniBand Adapters
Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet Switch
Mellanox ConnectX-3 FDR InfiniBand, 10/40GbE Ethernet VPI Adapters
MPI: Mellanox HPC-X v1.2.0-326, Platform MPI 9.1
Application: ANSYS Fluent 16.1
Benchmark datasets: eddy_417k, truck_111m

Fluent Performance Network Interconnects InfiniBand delivers superior scalability performance EDR InfiniBand provides higher performance than Ethernet InfiniBand delivers ~20 to 44 times higher performance and continuous scalability Ethernet performance stays flat (or stops scaling) beyond 2 nodes 20x 44x 20x Higher is better 28 MPI Processes / Node

Fluent Performance EDR vs FDR InfiniBand EDR InfiniBand delivers superior scalability in application performance As the number of nodes scales, the performance gap of EDR IB widens Performance advantage of EDR InfiniBand increases for larger core counts EDR InfiniBand provides 111% higher performance versus FDR InfiniBand at 16 nodes (448 cores) 111% Higher is better 28 MPI Processes / Node

Fluent Performance MPI Libraries HPC-X delivers higher scalability performance than Platform MPI by 16% Support for HPC-X in Fluent is based on Fluent's support for Open MPI The new yalla PML reduces overhead Tuning parameters used for HPC-X (see the launch sketch below): -mca coll_fca_enable 1 -mca pml yalla -map-by node -x MXM_TLS=self,shm,ud --bind-to core 16% 12% Higher is better 28 MPI Processes / Node
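To make the flag list above concrete, here is a minimal HPC-X (Open MPI based) launch sketch. In practice Fluent is started through its own launcher and passes MPI options on to mpirun; the solver command line, rank count, and host file below are placeholder assumptions, and only the MCA flags themselves come from the slide.

```python
# HPC-X / Open MPI launch sketch using the tuning flags listed above.
# Solver executable, input file, rank count and host file are placeholders.
import subprocess

hpcx_flags = [
    "-mca", "coll_fca_enable", "1",     # enable FCA-accelerated collectives
    "-mca", "pml", "yalla",             # low-overhead MXM point-to-point path
    "-map-by", "node",
    "-x", "MXM_TLS=self,shm,ud",        # MXM transports: self, shared memory, UD
    "--bind-to", "core",
]

cmd = (["mpirun", "-np", "896", "-hostfile", "hosts.txt"]   # 32 nodes x 28 PPN
       + hpcx_flags
       + ["./cfd_solver_placeholder", "eddy_417k.cas"])     # placeholder command

subprocess.run(cmd, check=True)
```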

Fluent Profiling Time Spent by MPI Calls Different communication patterns seen depending on data Eddy_417k: Most time spent in MPI_Recv, MPI_Allreduce, MPI_Waitall Truck_111m: Most time spent in MPI_Bcast, MPI_Recv, MPI_Allreduce eddy_417k truck_111m

Fluent Profiling MPI Message Sizes The most time consuming transfers are from small messages: Eddy_417k: MPI_Recv @16B (28% wall), MPI_Allreduce @4B (14% wall), MPI_Bcast @4B (6% wall) Truck_111m: MPI_Bcast @24B (22% wall), MPI_Recv @16B (19% wall), MPI_Bcast @4B (13% wall) eddy_417k truck_111m 32 Nodes

Fluent Summary Performance Compute: the Intel Haswell cluster outperforms system architectures of previous generations The Haswell cluster outperforms the Ivy Bridge cluster by 26%-49% at 32 nodes (896 cores) depending on workload Network: EDR InfiniBand delivers superior scalability in application performance EDR InfiniBand provides 20 to 44 times higher performance and is more scalable compared to 1GbE/10GbE/40GbE Performance for Ethernet (1GbE/10GbE/40GbE) stays flat (or stops scaling) beyond 2 nodes EDR InfiniBand provides 111% higher performance versus FDR InfiniBand at 16 nodes / 448 cores MPI: HPC-X delivers higher scalability performance than Platform MPI by 16%

Thank You! All trademarks are property of their respective owners. All information is provided As-Is without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.