Himeno Performance Benchmark and Profiling. December 2010



Note
The following research was performed under the HPC Advisory Council activities
- Participating vendors: AMD, Dell, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
For more info, please refer to:
- http://www.amd.com
- http://www.dell.com/hpc
- http://www.mellanox.com
- http://accc.riken.jp/hpc_e/himenobmt_e.html

Himeno
- Developed by Dr. Ryutaro Himeno, RIKEN, Japan
- Intended to evaluate the performance of incompressible fluid analysis code
- Takes measurements of the major loop that solves the Poisson equation using the Jacobi iteration method
- Available under the LGPL 2.0 or later
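For context, the kernel that the benchmark times is a Jacobi-style stencil sweep over a 3D pressure array. The sketch below is a minimal, simplified single-node illustration of that pattern (a 7-point stencil with made-up array names and sizes, no MPI, no coefficient arrays); it is not the actual HimenoBMTxp source, which uses a 19-point stencil.

```c
#include <stdio.h>

/* Minimal Jacobi-style pressure sweep on a 3D grid (illustrative only).
 * HimenoBMTxp itself uses a 19-point stencil with coefficient arrays;
 * this sketch uses a plain 7-point average to show the access pattern. */
#define NX 65
#define NY 33
#define NZ 33

static float p[NX][NY][NZ];    /* pressure field        */
static float wrk[NX][NY][NZ];  /* work array for Jacobi */

int main(void)
{
    /* Simple gradient initialization, similar in spirit to the benchmark. */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            for (int k = 0; k < NZ; k++)
                p[i][j][k] = (float)(i * i) / (float)((NX - 1) * (NX - 1));

    const float omega = 0.8f;  /* relaxation factor */
    float gosa = 0.0f;         /* residual per sweep */

    for (int iter = 0; iter < 100; iter++) {
        gosa = 0.0f;
        /* Jacobi sweep: each interior point is updated from its six
         * face neighbors, all read from the previous iterate. */
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++) {
                    float s = (p[i+1][j][k] + p[i-1][j][k] +
                               p[i][j+1][k] + p[i][j-1][k] +
                               p[i][j][k+1] + p[i][j][k-1]) / 6.0f;
                    float ss = s - p[i][j][k];
                    gosa += ss * ss;
                    wrk[i][j][k] = p[i][j][k] + omega * ss;
                }
        /* Copy back only after the full sweep (Jacobi, not Gauss-Seidel). */
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                for (int k = 1; k < NZ - 1; k++)
                    p[i][j][k] = wrk[i][j][k];
    }
    printf("residual after 100 sweeps: %e\n", gosa);
    return 0;
}
```

The benchmark reports the sustained MFLOPS of this sweep, so memory bandwidth and, in the MPI version, the halo exchange between subdomains largely determine the score.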

Objectives
The following was done to provide best practices:
- Himeno performance benchmarking
- Interconnect performance comparisons
- Understanding Himeno communication patterns
- Ways to increase Himeno productivity
- Compilers and MPI libraries comparisons
The presented results will demonstrate:
- The scalability of the compute environment to provide nearly linear application scalability
- The capability of Himeno to achieve scalable productivity
- Considerations for performance optimizations

Test Cluster Configuration
- Dell PowerEdge R815 11-node cluster
- AMD Opteron 6174 (code name "Magny-Cours") 12-core 2.2 GHz CPUs, 4 CPUs per server node
- Mellanox ConnectX-2 VPI adapters for 40Gb/s QDR InfiniBand and 10Gb/s Ethernet
- Mellanox M3600 36-port 40Gb/s QDR InfiniBand switch
- Fulcrum-based 10Gb/s Ethernet switch
- Memory: 128GB DDR3 1333MHz memory per node
- OS: RHEL 5.5, MLNX-OFED 1.5.1 InfiniBand SW stack
- MPI: Intel MPI 4.0, Open MPI 1.5.1, Platform MPI 8.0.1
- Compilers: GNU Compilers 4.1.2 & 4.4, Intel Compilers 11.1, Open64 4.2.4, PGI 10.9
- Application: HimenoBMTxp (f77_xp_mpi)
- Benchmark workload: L grid size (512x256x256)

Dell PowerEdge R815 11-node Cluster - HPC Advisory Council Test-bed System
- New 11-node, 528-core cluster featuring Dell PowerEdge R815 servers
- Replacement for the Dell PowerEdge SC1435 (192-core) cluster system following 2 years of rigorous benchmarking and product EOL
  - System to be redirected to explore HPC in the Cloud applications
- Workload profiling and benchmarking
  - Characterization for HPC and compute-intensive environments
  - Optimization for scale, sizing, configuration and workload performance
  - Test-bed benchmarks, RFPs, customers/prospects, etc.
- ISV & industry-standard application characterization
- Best practices & usage analysis

About the Dell PowerEdge Platform Advantages
- Best-of-breed technologies and partners
- The combination of the AMD Opteron 6100 series platform and Mellanox ConnectX InfiniBand on Dell HPC Solutions provides the ultimate platform for speed and scale
  - The Dell PowerEdge R815 system delivers 4-socket performance in a dense 2U form factor
  - Up to 48 cores/32 DIMMs per server; 1008 cores in a 42U enclosure
- Integrated stacks designed to deliver the best price/performance/watt
  - 2x more memory and processing power in half the space
  - Energy-optimized low-flow fans, improved power supplies and dual SD modules
- Optimized for long-term capital and operating investment protection
  - System expansion
  - Component upgrades and feature releases

Himeno Performance - Compilers
- PGI provides the best CPU utilization among the compilers tested
- Compiler flags used:
  - GNU 4.1.2 / 4.4: -O3 -ffast-math -ftree-vectorize -ftree-loop-linear -funroll-loops
  - Open64: -O3 -OPT:unroll_level=2 -OPT:Ofast -ipa -ftz -Ofast -OPT:keep_ext=on -msse3 -mso -HP -INLINE -LNO -ffast-math
  - Intel: -O3 -ip -xsse2 -w -ftz -align all -fno-alias -fp-model fast=1 -convert big_endian
  - PGI: -fastsse -Mipa=fast,inline -Mconcur
- Open MPI 1.5
Higher is better. 12 cores per node.

Himeno Performance - Interconnects
- InfiniBand enables higher scalability
  - Up to 153% gain over 10GigE at 11 nodes (528 processes) with the L dataset
- The performance of GigE plummets after 2 nodes
  - The effect of MPI traffic saturating the Ethernet network
Higher is better. 12 cores per node.

Himeno Performance - Scalability
- The L dataset demonstrates network dependency
  - Scalability of 10GigE drops to 32% at 11 nodes
  - Scalability of InfiniBand QDR stays above 80% throughout
Higher is better. 12 cores per node.
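For reference, the scalability percentages quoted here match the usual definition of parallel efficiency against a single-node baseline; the deck does not state the exact baseline used, so treat the formula below as an assumption:

\[
E(N) = \frac{P(N)}{N \cdot P(1)} \times 100\%
\]

where \(P(N)\) is the Himeno performance achieved on \(N\) nodes and \(P(1)\) the single-node result. Under this definition, 10GigE falls to \(E(11) \approx 32\%\) while InfiniBand QDR keeps \(E(11) > 80\%\).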

Himeno Performance - MPI Implementations
- Platform MPI performs better at 528 cores (11 nodes)
- Both Platform MPI and Open MPI show slight performance advantages as the cluster scales
Higher is better. 12 cores per node.

Himeno Profiling - MPI/User Time Ratio
- InfiniBand maintains low time consumption by MPI communications
- With 1GigE, MPI communication time overtakes compute (user) time beyond 2 nodes (96 processes)
Higher is better. 12 cores per node.
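The ratio plotted here is presumably the share of wall-clock time each rank spends inside MPI; the profiling tool's exact definition is not given in the deck, so the following is an assumption:

\[
f_{\mathrm{MPI}} = \frac{t_{\mathrm{MPI}}}{t_{\mathrm{MPI}} + t_{\mathrm{user}}}
\]

A value above 50% means the ranks spend more time communicating than computing, which is the regime the 1GigE runs enter beyond 2 nodes.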

Himeno Profiling - Number of MPI Calls
- The most used MPI functions are MPI_Isend and MPI_Irecv
  - Each accounts for 38% of all MPI calls in a 14-node job
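That call mix is characteristic of a nearest-neighbor halo exchange built on non-blocking point-to-point operations, where every sweep posts matched MPI_Isend/MPI_Irecv pairs per neighbor. The sketch below illustrates the pattern for a 1-D decomposition with hypothetical buffer names; it is not the benchmark's actual exchange code.

```c
#include <mpi.h>
#include <stdlib.h>

/* One iteration's halo exchange along a 1-D domain decomposition.
 * Each rank posts one MPI_Irecv and one MPI_Isend per neighbor,
 * which is why those two calls dominate the MPI call counts. */
static void exchange_halos(float *left_halo, float *right_halo,
                           float *left_edge, float *right_edge,
                           int face_count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    MPI_Request req[4];

    /* Post receives first so the matching sends can complete promptly. */
    MPI_Irecv(left_halo,  face_count, MPI_FLOAT, left,  0, comm, &req[0]);
    MPI_Irecv(right_halo, face_count, MPI_FLOAT, right, 1, comm, &req[1]);
    MPI_Isend(right_edge, face_count, MPI_FLOAT, right, 0, comm, &req[2]);
    MPI_Isend(left_edge,  face_count, MPI_FLOAT, left,  1, comm, &req[3]);

    /* Interior computation could be overlapped here before waiting. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    const int face = 64 * 64;  /* e.g. 64x64 single-precision values = 16 KB */
    float *lh = calloc(face, sizeof(float)), *rh = calloc(face, sizeof(float));
    float *le = calloc(face, sizeof(float)), *re = calloc(face, sizeof(float));
    exchange_halos(lh, rh, le, re, face, MPI_COMM_WORLD);
    free(lh); free(rh); free(le); free(re);
    MPI_Finalize();
    return 0;
}
```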

Himeno Profiling - MPI Message Sizes
- The growth in the number of messages accelerates as the node count increases
- The majority of MPI message sizes are in the range of 16KB to 64KB for the L dataset
- For the XL dataset, message sizes fall in the range of 64B to 256B and above 64KB
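The dominant 16KB-64KB range is consistent with exchanging subdomain faces of single-precision values. As a rough illustration only (the deck does not state the actual domain decomposition):

\[
64 \times 64 \times 4\,\text{bytes} = 16\,\text{KB}, \qquad 128 \times 128 \times 4\,\text{bytes} = 64\,\text{KB}
\]

so halo faces of roughly 64 to 128 grid points on a side would produce messages in the observed range.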

Summary
- PGI provides the best CPU utilization among the compilers tested
- Platform MPI performs better at 528 cores (11 nodes)
- The L dataset is network dependent
  - Scalability of 10GigE dropped to 32% at 11 nodes
  - Scalability of InfiniBand QDR stayed above 80% throughout

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.