AcuSolve Performance Benchmark and Profiling
October 2011
Note
- The following research was performed under the HPC Advisory Council activities
- Participating vendors: AMD, Dell, Mellanox, Altair
- Compute resource: HPC Advisory Council Cluster Center
- For more information, please refer to:
  http://www.amd.com
  http://www.dell.com
  http://www.mellanox.com
  http://www.altairhyperworks.com/product,54,acusolve.aspx
AcuSolve
- AcuSolve is a leading general-purpose finite element-based Computational Fluid Dynamics (CFD) flow solver with superior robustness, speed, and accuracy
- AcuSolve can be used by designers and research engineers of all levels of expertise, either as a standalone product or seamlessly integrated into a powerful design and analysis application
- With AcuSolve, users can quickly obtain quality solutions without iterating on solution procedures or worrying about mesh quality or topology
Objectives
- The following was done to provide best practices:
  - AcuSolve performance benchmarking
  - Understanding AcuSolve communication patterns
  - Ways to increase AcuSolve productivity
  - Network interconnect comparisons
- The presented results will demonstrate:
  - The scalability of the compute environment
  - The capability of AcuSolve to achieve scalable productivity
  - Considerations for performance optimizations
Test Cluster Configuration
- Dell PowerEdge C6145 6-node, quad-socket (288-core) cluster
  - AMD Opteron 6174 (code name "Magny-Cours") CPUs, 12 cores @ 2.2 GHz
  - Memory: 128GB DDR3 1066MHz per node
- Mellanox ConnectX-3 VPI adapters for 56Gb/s FDR InfiniBand and 40Gb/s Ethernet
- Mellanox MTS3600Q 36-port 40Gb/s QDR InfiniBand switch
- Fulcrum-based 10Gb/s Ethernet switch
- OS: RHEL 6.1, MLNX-OFED 1.5.3 InfiniBand SW stack
- MPI: Platform MPI 7.1
- Application: AcuSolve 1.8a
- Benchmark workload: Pipe_fine, 2 meshes
  - 350 axial nodes: 1.52 million mesh points total, 8.89 million tetrahedral elements
  - 700 axial nodes: 3.04 million mesh points total, 17.8 million tetrahedral elements
  - The pipe_fine test computes the steady-state flow conditions for the turbulent flow (Re = 30000) of water in a pipe with heat transfer. The pipe is 1 meter in length and 150 cm in diameter. Water enters the inlet at room temperature conditions.
About Dell PowerEdge Platform Advantages
- Best-of-breed technologies and partners
- The combination of the AMD Opteron 6100 series platform and Mellanox ConnectX InfiniBand on Dell HPC Solutions provides the ultimate platform for speed and scale
  - The Dell PowerEdge C6145 system delivers 8-socket performance in a dense 2U form factor
  - Up to 48 cores / 32 DIMMs per server; 2016 cores in a 42U enclosure
- Integrated stacks designed to deliver the best price/performance/watt
  - 2x more memory and processing power in half the space
  - Energy-optimized low-flow fans, improved power supplies, and dual SD modules
- Optimized for long-term capital and operating investment protection
  - System expansion
  - Component upgrades and feature releases
AcuSolve Performance: Threads Per Node
- AcuSolve allows running in an MPI-thread hybrid mode
  - MPI processes focus on message passing while threads handle computation
- The optimal thread count differs between the datasets
  - Using 12 threads per node is optimal for the dataset with 350 axial nodes
  - Using 24 threads per node is optimal for the dataset with 700 axial nodes
[Chart: jobs/day on 6 nodes, InfiniBand QDR; higher is better]
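In the hybrid model described above, each node is filled by some number of MPI ranks, each running a block of threads. A minimal sketch of the possible layouts on a 48-core node (this is illustrative arithmetic, not AcuSolve code; the helper name and layout list are assumptions):

```python
# Illustrative sketch: enumerating MPI-rank/thread layouts for a 48-core
# node under an MPI-thread hybrid model. Not AcuSolve code.

def hybrid_layouts(cores_per_node=48):
    """Return (threads_per_rank, mpi_ranks_per_node) pairs that fill the node."""
    return [(t, cores_per_node // t)
            for t in (1, 2, 4, 6, 12, 24, 48)
            if cores_per_node % t == 0]

for threads, ranks in hybrid_layouts():
    print(f"{ranks:2d} MPI ranks/node x {threads:2d} threads/rank = {ranks * threads} cores")
```

The benchmark results above correspond to the 12-threads/rank and 24-threads/rank rows, i.e. 4 or 2 MPI ranks per node.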
AcuSolve Performance: Interconnect
- InfiniBand QDR delivers the best performance for AcuSolve
  - Up to 75% better performance than 10GigE on 6 nodes (12 threads per node)
  - Up to 99% better performance than 1GigE on 6 nodes (12 threads per node)
- Network bandwidth enables AcuSolve to scale
  - Higher throughput allows AcuSolve to achieve higher productivity
[Chart: 48 cores/node; higher is better; gains of 34% and 67% at smaller node counts, rising to 75% and 99% at 6 nodes]
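Since the "X% better performance" figures are quoted on a throughput (jobs/day) rating, they translate directly into per-job wall-time ratios. A small sketch of that conversion, using the 6-node percentages from above:

```python
# Converting "X% better performance" (a jobs/day throughput rating)
# into a per-job wall-time ratio. Percentages are the 6-node results.

def throughput_ratio(pct_better):
    """'75% better' on a throughput metric means 1.75x the jobs/day,
    i.e. the slower interconnect needs that many times the wall time per job."""
    return 1.0 + pct_better / 100.0

for network, pct in [("10GigE", 75), ("1GigE", 99)]:
    print(f"vs {network}: {throughput_ratio(pct):.2f}x wall time per job")
```

So at 6 nodes, each job takes roughly 1.75x as long over 10GigE and roughly 1.99x as long over 1GigE as it does over InfiniBand QDR.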
AcuSolve Performance: CPU Frequency
- Higher CPU core frequency enables higher job performance
  - 28% more jobs produced by running CPU cores at 2200MHz instead of 1800MHz
- Increases in CPU core frequency can directly improve overall job performance
[Chart: 48 threads/node; higher is better; 28% gain]
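As a sanity check on the 28% figure, the clock-frequency ratio alone accounts for about a 22% gain; a sketch of the arithmetic (frequencies and the observed speedup are from the result above):

```python
# Comparing the observed speedup with the raw clock-frequency ratio.
# 1800MHz/2200MHz settings and the 28% figure are from the benchmark above.
freq_ratio = 2200 / 1800            # ~1.222x higher clock
observed_speedup = 1.28             # 28% more jobs/day

# The observed gain meeting (and slightly exceeding) the clock ratio is
# consistent with the workload being strongly CPU-bound at this scale.
print(f"clock ratio: {freq_ratio:.3f}x, observed: {observed_speedup:.2f}x")
```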
AcuSolve Profiling: MPI/User Time Ratio
- Communication time plays a major role for AcuSolve
  - Communication occupies the majority of run time beyond 4 nodes for this benchmark
  - A high-speed interconnect becomes crucial as the node count grows
[Chart: 48 threads/node]
AcuSolve Profiling: MPI/User Run Time
- InfiniBand reduces CPU overhead for processing network data
  - Better network communication reduces time spent in both computation and communication
  - InfiniBand offloads network transfers to the HCA, which allows the CPU to focus on computation
  - The Ethernet solutions cause jobs to run slower
[Chart: 12 threads/node]
AcuSolve Profiling: Number of MPI Calls
- The most used MPI functions are for data transfers: MPI_Recv and MPI_Isend
  - Reflects that AcuSolve is communication-intensive and requires good network throughput
- The number of calls increases proportionally as the cluster scales
[Chart: 48 threads/node]
AcuSolve Profiling: Time Spent in MPI Calls
- The communication time is spent in the following MPI functions:
  - InfiniBand: MPI_Allreduce (41%), MPI_Recv (30%), MPI_Barrier (24%)
  - 10GigE: MPI_Allreduce (58%), MPI_Recv (32%), MPI_Barrier (9%)
  - 1GigE: MPI_Recv (54%), MPI_Barrier (29%), MPI_Allreduce (16%)
AcuSolve Profiling: MPI Message Sizes
- The majority of MPI messages are small to medium in size
  - Mostly in the range between 0B and 256B
- The message-size distributions are very similar between the 2 datasets
  - The dataset with 700 axial nodes has a much larger number of messages
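A message-size distribution like the one above is produced by bucketing each message's payload size into histogram bins. A minimal sketch of that bucketing; the bin edges here are assumptions chosen to match the 0B-256B range quoted above, not the profiler's actual bins:

```python
from collections import Counter

# Hypothetical bin edges in bytes; the slide only states that most
# messages fall between 0B and 256B.
BIN_EDGES = [64, 256, 1024, 65536]

def bin_label(size):
    """Map a message size in bytes to a histogram bucket label."""
    lo = 0
    for hi in BIN_EDGES:
        if size <= hi:
            return f"[{lo}..{hi}]B"
        lo = hi + 1
    return f">{BIN_EDGES[-1]}B"

def message_histogram(sizes):
    """Count how many messages fall into each size bucket."""
    return Counter(bin_label(s) for s in sizes)

print(message_histogram([8, 48, 100, 200, 2048]))
```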
AcuSolve Profiling: Data Transfer By Process
- Data transferred to each MPI rank is not evenly distributed
  - The data transfer pattern is mirrored across the rank numbers
- The amount of data grows as the cluster scales
  - From around 20GB max per rank on 4 nodes up to around 80GB per rank on 6 nodes
AcuSolve Profiling: Aggregated Data Transfer
- Aggregated data transfer refers to the total amount of data transferred over the network between all MPI ranks collectively
- The total data transfer grows sharply as the cluster scales
  - For both datasets, a sizable amount of data is sent and received across the network
  - As compute nodes are added, more data communication generally takes place
[Chart: InfiniBand QDR]
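The aggregate figure is simply the sum over all ranks of the per-rank traffic from the previous slide, which is why it climbs so quickly as ranks are added. A sketch with hypothetical per-rank numbers (the slides report only rough maxima, ~20GB/rank at 4 nodes and ~80GB/rank at 6 nodes):

```python
# Hypothetical per-rank transfer totals in GB, for illustration only;
# the real distribution is uneven and mirrored across rank numbers.
per_rank_gb = [62.0, 80.0, 71.5, 75.0, 68.0, 80.0]

def aggregated_transfer(per_rank):
    """Aggregated data transfer: sum of traffic across all MPI ranks."""
    return sum(per_rank)

print(f"aggregate: {aggregated_transfer(per_rank_gb):.1f} GB, "
      f"max per rank: {max(per_rank_gb):.1f} GB")
```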
Summary
- AcuSolve is a CFD application with the capability to scale to many nodes
- MPI-thread hybrid mode:
  - MPI processes focus on message passing while threads handle computation
  - Selecting a suitable thread count can have a large impact on performance and productivity
- CPU: AcuSolve has a high demand for good CPU utilization
  - Higher CPU core frequency allows AcuSolve to achieve higher performance
- Interconnects:
  - InfiniBand QDR delivers the network throughput needed for scaling to many nodes
  - 10GigE and 1GigE take away CPU runtime for handling network transfers
  - The interconnect becomes crucial after 4 nodes, as more time is spent in MPI for these datasets
- Profiling:
  - A sizable load of data is exchanged over the network
  - MPI calls are mostly concentrated on data transfers rather than data synchronization
Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.