Clustering Optimizations: How to achieve optimal performance? Pak Lui

130 Applications Best Practices Published
Abaqus, ABySS, AcuSolve, Amber, AMG, AMR, ANSYS CFX, ANSYS FLUENT, ANSYS Mechanics, BQCD, CCSM, CESM, COSMO, CP2K, CPMD, Dacapo, Desmond, DL-POLY, Eclipse, FLOW-3D, GADGET-2, GROMACS, Himeno, HOOMD-blue, HYCOM, ICON, LAMMPS, Lattice QCD, LS-DYNA, MILC, minife, MM5, MPQC, MR Bayes, MSC Nastran, NAMD, Nekbone, NEMO, NWChem, Octopus, OpenAtom, OpenFOAM, OpenMX, PARATEC, PFA, PFLOTRAN, Quantum ESPRESSO, RADIOSS, SPECFEM3D, WRF
For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php

Agenda
- Overview of HPC application performance
- Ways to inspect/profile/optimize HPC applications
  - CPU/memory, file I/O, network
- System configurations and tuning
- Case studies, performance optimizations and highlights
  - STAR-CCM+
  - ANSYS Fluent
- Conclusions

HPC Application Performance Overview
- Achieving scalable performance on HPC applications involves understanding the workload through profile analysis
  - Tune for where the most time is spent (CPU, network, I/O, etc.)
- Underlying implicit requirement: each node must perform similarly
  - Run CPU/memory/network tests or a cluster checker to identify bad node(s) (a minimal memory-bandwidth check is sketched below)
- Comparing behavior across different hardware components pinpoints bottlenecks in different areas of the HPC cluster
- A selection of HPC applications will be shown
  - To demonstrate the method of profiling and analysis
  - To determine the bottleneck in software/hardware
  - To determine the effectiveness of tuning in improving performance
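
As a hedged illustration of the per-node memory check mentioned above, the following minimal C sketch (not part of the original material) times a STREAM-style triad loop; a node whose sustained bandwidth falls well below its peers is a candidate bad node. The array size and repetition count are arbitrary choices for illustration; compile with something like "gcc -O2 -fopenmp".

    /* triad_check.c - minimal STREAM-style memory bandwidth check (illustrative sketch) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N    (40L * 1000 * 1000)   /* ~320 MB per array of doubles */
    #define REPS 10

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }

        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        double best = 0.0;
        for (int r = 0; r < REPS; r++) {
            double t0 = omp_get_wtime();
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                c[i] = a[i] + 3.0 * b[i];               /* triad: c = a + scalar*b */
            double gbps = 3.0 * N * sizeof(double) / (omp_get_wtime() - t0) / 1e9;
            if (gbps > best) best = gbps;
        }
        printf("best triad bandwidth: %.1f GB/s\n", best); /* compare this across nodes */
        free(a); free(b); free(c);
        return 0;
    }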

Ways To Inspect and Profile Applications
- Computation (CPU/accelerators)
  - Tools: top, htop, perf top, pstack, Visual Profiler, etc.
  - Tests and benchmarks: HPL, STREAM
- File I/O
  - Bandwidth and block size: iostat, collectl, darshan, etc.
  - Characterization tools and benchmarks: iozone, ior, etc.
- Network interconnect
  - Tools and profilers: perfquery, MPI profilers (IPM, TAU, etc.)
  - Characterization tools and benchmarks for latency and bandwidth: OSU benchmarks, IMB (a minimal ping-pong sketch follows)
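
To make the latency measurement concrete, here is a minimal ping-pong kernel of the kind the OSU and IMB latency tests implement. This is an illustrative standalone sketch, not the actual benchmark code; the message size and iteration count are arbitrary. Run it with two ranks placed on different nodes to measure inter-node latency.

    /* pingpong.c - minimal MPI latency sketch (run with 2 ranks, e.g. one per node) */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define ITERS    1000
    #define MSG_SIZE 8          /* bytes; small-message latency */

    int main(int argc, char **argv)
    {
        int rank;
        char buf[MSG_SIZE];
        memset(buf, 0, sizeof(buf));

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("average one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));
        MPI_Finalize();
        return 0;
    }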

Case Study: STAR-CCM+

STAR-CCM+
- An engineering process-oriented CFD tool
- Client-server architecture, object-oriented programming
- Delivers the entire CFD process in a single integrated software environment
- Developed by CD-adapco

Objectives
- The presented research was done to provide best practices for
  - CD-adapco performance benchmarking
  - Interconnect performance comparisons
  - Ways to increase CD-adapco productivity
  - Power-efficient simulations
- The presented results will demonstrate
  - The scalability of the compute environment/application
  - Considerations for higher productivity and efficiency

Test Cluster Configuration
- Dell PowerEdge R720xd 32-node (640-core) Jupiter cluster
  - Dual-socket 10-core Intel Xeon E5-2680 v2 CPUs @ 2.80 GHz (Static Max Performance profile in BIOS)
  - Memory: 64GB, DDR3 1600 MHz
  - OS: RHEL 6.2 with the OFED 2.1-1.0.0 InfiniBand software stack
  - Hard drives: 24x 250GB 7,200 RPM 2.5" SATA drives in RAID 0
- Intel Cluster Ready certified cluster
- Mellanox Connect-IB FDR InfiniBand and ConnectX-3 Ethernet adapters
- Mellanox SwitchX 6036 VPI InfiniBand and Ethernet switches
- MPI: Mellanox HPC-X v1.0.0 (based on Open MPI), Platform MPI 8.3.0.6, Intel MPI 4.1.3
- Application: STAR-CCM+ version 9.02.005 (unless specified otherwise)
- Benchmark: Lemans_Poly_17M (Epsilon Euskadi Le Mans car external aerodynamics)

STAR-CCM+ Performance - Network
- FDR InfiniBand delivers the best network scalability performance
  - Provides up to 208% higher performance than 10GbE at 32 nodes
  - Provides up to 191% higher performance than 40GbE at 32 nodes
- FDR IB scales linearly, while 10GbE and 40GbE hit a scalability limit beyond 16 nodes
(Higher is better; 20 processes per node)

STAR-CCM+ Profiling - Network
- InfiniBand reduces network overhead, resulting in higher CPU utilization
  - An efficient network interconnect reduces MPI communication overhead
  - With less time spent on the network, overall application runtime improves
- Ethernet solutions consume more time in communication
  - 73%-95% of overall time is spent in the network due to Ethernet congestion
  - FDR InfiniBand spends about 38% of overall runtime in the network
(Higher is better; 20 processes per node)

STAR-CCM+ Profiling - MPI Communication Time
- Identified MPI overheads by profiling communication time
  - Covers collective, point-to-point and non-blocking operations
- 10/40GbE vs FDR IB: more time spent in Allreduce, Bcast, Recv and Waitany

STAR-CCM+ Profiling - MPI Communication Time
- Observed how MPI time is distributed across the different network hardware
  - FDR IB: MPI_Test (46%), MPI_Waitany (13%), MPI_Alltoall (13%), MPI_Reduce (13%)
  - 10GbE: MPI_Recv (29%), MPI_Waitany (25%), MPI_Allreduce (15%), MPI_Wait (10%)
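
MPI profilers such as IPM typically obtain per-call breakdowns like the one above through the PMPI profiling interface. The sketch below wraps a single call (MPI_Allreduce) just to show the mechanism; it is an illustrative example that assumes an MPI-3 mpi.h, not the tooling actually used for these results. Build it as a shared object with "mpicc -shared -fPIC" and load it via LD_PRELOAD so it intercepts the application's calls.

    /* allreduce_timer.c - minimal PMPI interposition sketch (illustrative) */
    #include <mpi.h>
    #include <stdio.h>

    static double allreduce_time  = 0.0;
    static long   allreduce_calls = 0;

    /* Intercept MPI_Allreduce; the real implementation is reached via PMPI_Allreduce. */
    int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
        allreduce_time += MPI_Wtime() - t0;
        allreduce_calls++;
        return rc;
    }

    /* Report per-rank totals at shutdown. */
    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: MPI_Allreduce calls=%ld time=%.3f s\n",
               rank, allreduce_calls, allreduce_time);
        return PMPI_Finalize();
    }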

STAR-CCM+ Performance - Software Versions
- Improvements in the latest STAR-CCM+ release result in higher performance at scale
  - v9.02.005 demonstrated a 28% gain over v8.06.005 on a 32-node run
  - A slight change in the communication pattern helps improve scalability
  - The improvement gap is expected to widen at larger scale
  - See the subsequent MPI profiling slides for the differences
(Higher is better; 20 processes per node)

STAR-CCM+ Profiling - MPI Time Spent
- Communication time has dropped with the latest STAR-CCM+ version
  - Less time is spent in MPI, although the communication pattern stays roughly the same
  - MPI_Barrier time is reduced significantly between the two releases
(Higher is better; 20 processes per node)

STAR-CCM+ Performance - MPI Implementations
- STAR-CCM+ makes various MPI implementations available
  - The default MPI implementation in STAR-CCM+ is Platform MPI
  - The MPI implementations start to differentiate beyond 8 nodes
  - Optimization flags are already set in the vendor's startup scripts
  - Support for HPC-X is based on the existing Open MPI support in STAR-CCM+
- HPC-X provides 21% higher scalability than the alternatives
(Higher is better; version 9.02.005)

STAR-CCM+ Performance - Single/Dual Port
- The benefit of deploying dual-port InfiniBand is demonstrated at scale
  - Running with dual ports provides up to 11% higher performance at 32 nodes
  - Connect-IB sits in a PCIe Gen3 x16 slot, which can provide additional throughput over 2 links
(Higher is better; 20 processes per node)

STAR-CCM+ Performance - Turbo Mode
- Enabling Turbo mode results in higher application performance
  - Up to 17% improvement is seen by enabling Turbo mode
  - Higher performance gains are seen at higher node counts
  - Turbo boosts the base frequency and consequently increases power consumption
- The msr-tools kernel utilities can adjust Turbo mode dynamically
  - Allows Turbo mode to be turned on/off at the OS level (see the sketch below)
(Higher is better; 20 processes per node)
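
As a rough illustration of what the msr-tools utilities (rdmsr/wrmsr) read and write, the hedged sketch below checks the turbo-disengage bit of the IA32_MISC_ENABLE register (MSR 0x1A0, bit 38) on core 0 through the msr kernel module. This is an assumption about Intel CPUs of this generation, not part of the benchmark setup; it requires "modprobe msr" and root privileges, and writing the bit (as wrmsr does) is deliberately left out.

    /* turbo_check.c - read the turbo-disengage bit of IA32_MISC_ENABLE (illustrative) */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define IA32_MISC_ENABLE  0x1A0   /* MSR address */
    #define TURBO_DISABLE_BIT 38      /* 1 = Turbo Boost disengaged */

    int main(void)
    {
        uint64_t val;
        int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* needs the msr module and root */
        if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }
        if (pread(fd, &val, sizeof(val), IA32_MISC_ENABLE) != sizeof(val)) {
            perror("pread");
            close(fd);
            return 1;
        }
        printf("Turbo mode is %s on core 0\n",
               ((val >> TURBO_DISABLE_BIT) & 1) ? "disabled" : "enabled");
        close(fd);
        return 0;
    }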

STAR-CCM+ Performance - File I/O
- Staging data in a temporary in-memory file system has advantages
  - A data write of ~8GB occurs at the end of the run for this benchmark
  - Staging on a local file system avoids all processes accessing the data over NFS
  - Staging on local /dev/shm shows an even higher gain (~11%)
- Using temporary storage is not recommended for production environments
  - Although /dev/shm outperforms the local file system, it is not recommended for production
  - If available, a parallel file system is preferred over local disk or /dev/shm
(Chart shows ~8% and ~11% gains; higher is better; version 8.06.005)
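
A minimal sketch of the staging idea, assuming the solver can be pointed at a scratch directory: write the ~8GB result to /dev/shm during the run, then copy it to permanent storage afterwards. The file paths below are hypothetical and the copy routine is deliberately simple.

    /* stage_output.c - stage output in /dev/shm, then copy to permanent storage (illustrative) */
    #include <stdio.h>

    /* Copy src to dst in 8 MB chunks. */
    static int copy_file(const char *src, const char *dst)
    {
        static char buf[8 << 20];
        FILE *in  = fopen(src, "rb");
        FILE *out = fopen(dst, "wb");
        size_t n;
        if (!in || !out) {
            if (in)  fclose(in);
            if (out) fclose(out);
            return -1;
        }
        while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
            fwrite(buf, 1, n, out);
        fclose(in);
        fclose(out);
        return 0;
    }

    int main(void)
    {
        const char *staged = "/dev/shm/starccm_result.sim";     /* fast, memory-backed staging area */
        const char *final  = "/nfs/results/starccm_result.sim"; /* hypothetical permanent location */

        /* ... the solver writes its result file to 'staged' during the run ... */

        /* After the timed portion of the job, move the file to permanent storage. */
        if (copy_file(staged, final) != 0) {
            perror("copy_file");
            return 1;
        }
        remove(staged);   /* free the memory-backed space */
        return 0;
    }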

STAR-CCM+ Performance - System Generations
- New generations of software and hardware provide performance gains
  - The gains are demonstrated across combined hardware and software variables
  - The latest stack provides ~55% higher performance than one generation behind
  - The latest stack provides ~2.6x higher performance than two generations behind
- System components used:
  - WSM: X5670 @ 2.93GHz, DDR3-10666, ConnectX-2 QDR IB, 1 disk, v5.04.006
  - SNB: E5-2680 @ 2.7GHz, DDR3-12800, ConnectX-3 FDR IB, 24 disks, v7.02.008
  - IVB: E5-2680 v2 @ 2.8GHz, DDR3-12800, Connect-IB FDR IB, 24 disks, v9.02.005
(Higher is better)

STAR-CCM+ Profiling - User/MPI Time Ratio
- STAR-CCM+ spends more time in computation than in communication
  - The share of time spent on the network gradually increases with more nodes in the job
  - Improvements in network efficiency would therefore be reflected in overall runtime
(FDR InfiniBand)

STAR-CCM+ Profiling - Message Sizes
- The majority of messages are small
  - Messages are concentrated below 64KB
  - The number of messages increases with the number of nodes

STAR-CCM+ Profiling - MPI Data Transfer
- As the cluster grows, less data is transferred per MPI process
  - Drops from ~20GB per rank at 1 node to ~3GB at 32 nodes
- Some node imbalance is visible in the amount of data transferred
  - Rank 0 shows significantly higher network activity than the other ranks

STAR-CCM+ Profiling - Aggregated Transfer
- Aggregated data transfer refers to the total amount of data moved over the network between all MPI ranks collectively
- Very large data transfers take place in STAR-CCM+
  - High network throughput is required to deliver this bandwidth
  - 1.5TB of data is transferred between the MPI processes at 32 nodes
(Version 9.02.005)

STAR-CCM+ Summary
- Performance
  - STAR-CCM+ v9.02.005 improves scalability over v8.06.005 by 28% at 32 nodes; the gap is expected to widen at higher node counts
  - FDR InfiniBand delivers the highest network performance for STAR-CCM+ to scale
    - ~191% higher than 40GbE and ~208% higher than 10GbE on a 32-node run
  - Deploying a dual-port Connect-IB HCA provides 11% higher performance at 32 nodes
  - Performance improves over older hardware/software generations: ~55% for one generation and ~2.6x for two generations
  - Enabling Turbo mode results in up to 17% higher application performance
  - Mellanox HPC-X provides better performance than the alternatives
- MPI profiling
  - Communication time is reduced with v9.02.005, which improves overall performance
  - Ethernet solutions consume more time in communication: 73%-95% of overall time is spent in the network due to congestion, while IB spends ~38%

Case Study: ANSYS Fluent

ANSYS Fluent
- Computational Fluid Dynamics (CFD) is a computational technology
  - Enables the study of the dynamics of things that flow
  - Enables a better qualitative and quantitative understanding of the physical phenomena in the flow, which is used to improve engineering design
- CFD brings together a number of different disciplines
  - Fluid dynamics, the mathematical theory of partial differential systems, computational geometry, numerical analysis and computer science
- ANSYS Fluent is a leading CFD application from ANSYS
  - Widely used in almost every industry sector and manufactured product

Objectives
- The presented research was done to provide best practices for
  - Fluent performance benchmarking
  - MPI library performance comparison
  - Interconnect performance comparison
  - CPU comparison
  - Compiler comparison
- The presented results will demonstrate
  - The scalability of the compute environment/application
  - Considerations for higher productivity and efficiency

Test Cluster Configuration
- Dell PowerEdge R720xd 32-node (640-core) Jupiter cluster
  - Dual-socket 10-core Intel Xeon E5-2680 v2 CPUs @ 2.80 GHz (Turbo mode enabled unless otherwise stated)
  - Memory: 64GB, DDR3 1600 MHz
  - OS: RHEL 6.2 with the OFED 2.3-1.0.1 InfiniBand software stack
  - Hard drives: 24x 250GB 7,200 RPM 2.5" SATA drives in RAID 0
- Intel Cluster Ready certified cluster
- Mellanox Connect-IB FDR InfiniBand adapters
- Mellanox ConnectX-3 QDR InfiniBand and Ethernet VPI adapters
- Mellanox SwitchX SX6036 VPI InfiniBand and Ethernet switches
- MPI: Mellanox HPC-X v1.2.0 (based on Open MPI); also provided: Intel MPI 4.1.030, IBM Platform MPI 9.1
- Application: ANSYS Fluent 15.0.7
- Benchmarks: eddy_417k, turbo_500k, aircraft_2m, sedan_4m, truck_poly_14m, truck_14m
  - Descriptions of the test cases can be found on the ANSYS Fluent 15.0 benchmark page

Fluent Performance - Interconnects
- FDR InfiniBand enables the highest cluster productivity
  - Surpasses the other network interconnects in scalability performance
  - FDR InfiniBand outperforms QDR InfiniBand by up to 200% at 32 nodes
  - Similarly, FDR outperforms 10GbE by 16 times and 1GbE by over 39 times
(Higher is better)

Fluent Performance - Interconnects
- FDR InfiniBand also outperforms the other interconnects on the remaining Fluent benchmarks

Fluent Performance - MPI Implementations
- HPC-X delivers higher scalability performance than the other MPIs compared
  - HPC-X outperforms the default Platform MPI by 10% and Intel MPI by 19%
  - Support for HPC-X in Fluent is based on Fluent's Open MPI support
  - The new yalla PML reduces overhead
- Flags used for HPC-X: -mca coll_fca_enable 1 -mca coll_fca_np 0 -mca pml yalla -map-by node -mca mtl mxm -mca mtl_mxm_np 0 -x MXM_TLS=self,shm,ud --bind-to core
(Higher is better; FDR InfiniBand)

Fluent Performance - MPI Implementations
- HPC-X also outperforms the other MPIs on the remaining benchmark datasets

Fluent Performance - Turbo Mode and Clock
- Running Fluent at a higher clock rate shows clear advantages
  - Either by enabling Turbo mode or by raising the CPU clock frequency
- Boosting the CPU clock rate yields higher performance at lower cost
  - Increasing from 2200MHz to 2800MHz runs 42% faster for 18% more power
- Running Turbo mode also yields higher performance, but at a higher cost
  - 13% more performance at the expense of 25% more power
(Higher is better; FDR InfiniBand)

Fluent Performance - Best Published Results
- The results demonstrated by the HPC Advisory Council outperform the previous best record
  - The ANSYS Fluent 15.0 benchmark page publishes ANSYS Fluent performance results
  - The HPCAC achieved 26.36% higher performance than the best published result (as of 9/22/2014), despite the slower CPUs used in the Jupiter cluster
  - The 32-node/640-core result beats the previous 96-node/1920-core record by 8.53%
  - Performance would be expected to climb further on the Jupiter cluster if more nodes were available
(Higher is better)

Fluent Profiling - I/O
- Minor disk I/O activity takes place on all MPI ranks for this workload
  - The majority of the disk read activity appears at the beginning of the job run
(FDR InfiniBand)

Fluent Profiling - Point-to-Point Dataflow
- Communication appears to be mostly limited to MPI ranks close to each rank itself
  - Heavy communication is also seen between the first and last ranks
- The communication pattern does not change as the cluster scales
  - However, the amount of data transferred is reduced as the node count grows
(Charts: 2 nodes and 32 nodes; FDR InfiniBand)

Fluent Profiling - Time Spent by MPI Calls
- The majority of MPI time is spent in MPI_Waitall
  - Accounts for 30% of wall time; MPI_Allreduce 20%, MPI_Recv 11% (eddy_417k, 32 nodes)
- Some load imbalance in the network is observed
  - Some ranks spend more time in MPI_Waitall and MPI_Allreduce
  - Might be related to how the workload is distributed among the MPI ranks
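
This MPI_Waitall/MPI_Allreduce mix is typical of an iterative CFD solver: each iteration posts non-blocking exchanges with neighboring partitions, waits on the whole set, then reduces a small residual across all ranks. The sketch below is a generic illustration of that pattern, not Fluent's actual code; the ring neighbors and buffer sizes are made up.

    /* halo_exchange.c - generic non-blocking exchange plus residual reduction (illustrative) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define HALO 1024   /* hypothetical halo size per neighbor, in doubles */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Two neighbors in a 1-D ring stand in for mesh-partition neighbors. */
        int nbr[2] = { (rank + 1) % size, (rank - 1 + size) % size };
        double *sendbuf = calloc(2 * HALO, sizeof(double));
        double *recvbuf = calloc(2 * HALO, sizeof(double));
        MPI_Request reqs[4];

        for (int iter = 0; iter < 100; iter++) {
            /* Post all receives and sends, then wait for the whole set:
             * this MPI_Waitall is where a profiler attributes the exchange time. */
            for (int n = 0; n < 2; n++) {
                MPI_Irecv(recvbuf + n * HALO, HALO, MPI_DOUBLE, nbr[n], 0,
                          MPI_COMM_WORLD, &reqs[n]);
                MPI_Isend(sendbuf + n * HALO, HALO, MPI_DOUBLE, nbr[n], 0,
                          MPI_COMM_WORLD, &reqs[2 + n]);
            }
            MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

            /* Small (8-byte) global reduction of a local residual every iteration:
             * the source of the many tiny MPI_Allreduce messages seen in the profile. */
            double local_residual = 1.0 / (iter + 1), global_residual;
            MPI_Allreduce(&local_residual, &global_residual, 1, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
        }

        if (rank == 0) printf("completed 100 iterations\n");
        free(sendbuf); free(recvbuf);
        MPI_Finalize();
        return 0;
    }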

Fluent Profiling - MPI Message Sizes
- The majority of data transfer messages are small to medium sized
  - MPI_Allreduce: large concentration of 4-byte messages (~18% of wall time)
  - MPI_Wait: large concentration of 16-byte messages (~11% of wall time)
(eddy_417k, 32 nodes)

Fluent Summary
- Performance
  - The Jupiter cluster outperforms other system architectures on Fluent
    - FDR InfiniBand delivers 200% higher performance than QDR InfiniBand
    - FDR IB outperforms 10GbE by up to 11 times at 32 nodes / 640 cores
  - FDR InfiniBand enables Fluent to break the previous performance record
    - Outperforms the previously set record by 25.38% at 640 cores / 32 nodes
    - Outperforms the previously set record by 8.52% at 1920 cores / 96 nodes
  - HPC-X delivers higher performance than the other MPI implementations
    - HPC-X outperforms Platform MPI by 10% and Intel MPI by 19%
- CPU
  - A higher CPU clock rate and Turbo mode yield higher Fluent performance
    - Bumping the CPU clock from 2200MHz to 2800MHz yields 42% faster performance for 18% more power
    - Enabling Turbo mode translates to 13% more performance for 25% more power
- Profiling
  - Heavy use of small messages in MPI_Waitall, MPI_Allreduce and MPI_Recv communication

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.