MILC Performance Benchmark and Profiling. April 2013


Note: The following research was performed under the HPC Advisory Council activities. Special thanks to HP and Mellanox. For more information on the supporting vendors' solutions, please refer to: www.mellanox.com and http://www.hp.com/go/hpc. For more information on the application: http://www.physics.utah.edu/~detar/milc/

MILC
MILC (MIMD Lattice Computation) QCD code:
- Developed by the MIMD Lattice Computation (MILC) collaboration
- Performs large-scale numerical simulations to study quantum chromodynamics (QCD), the theory of the strong interactions of subatomic physics
- Simulates 4-dimensional SU(3) lattice gauge theory on MIMD parallel machines
- Contains a set of codes written in C, publicly available for research purposes
- Free, open-source code distributed under the GNU General Public License
The MILC Collaboration:
- Has produced application codes to study several different QCD research areas
- Is engaged in a broad research program in Quantum Chromodynamics (QCD)
- Its research addresses fundamental questions in high energy and nuclear physics, related to major experimental programs in the field, including studies of the mass spectrum of strongly interacting particles, the weak interactions of these particles, and the behavior of strongly interacting matter under extreme conditions
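As background for readers unfamiliar with lattice QCD codes, the fundamental data object is an SU(3) link matrix: a 3x3 complex matrix attached to each link of the 4-dimensional lattice. The C sketch below is a minimal illustration of such a matrix type and the 3x3 complex multiply that dominates the compute kernels; it is an assumption-based example for exposition, not MILC's actual data layout or routines.

```c
#include <complex.h>

/* Illustrative SU(3) link matrix: a 3x3 complex matrix.
   A sketch for exposition only, not MILC's actual su3_matrix type. */
typedef struct {
    double complex e[3][3];
} su3_matrix;

/* c = a * b : the small complex matrix multiply at the heart of
   lattice gauge theory compute kernels. */
static void su3_mult(const su3_matrix *a, const su3_matrix *b, su3_matrix *c)
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double complex sum = 0.0;
            for (int k = 0; k < 3; k++)
                sum += a->e[i][k] * b->e[k][j];
            c->e[i][j] = sum;
        }
}
```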

Objectives
The presented research was done to provide best practices for:
- MILC performance benchmarking
- Interconnect performance comparisons
- MPI performance comparison
- Understanding MILC communication patterns
The presented results will demonstrate the scalability of the compute environment, providing nearly linear application scalability

Test Cluster Configuration
- HP ProLiant SL230s Gen8 4-node "Athena" cluster
- Processors: Dual eight-core Intel Xeon E5-2680 @ 2.7 GHz
- Memory: 32GB per node, 1600MHz DDR3 DIMMs
- OS: RHEL 6 Update 2, OFED 1.5.3 InfiniBand SW stack
- Mellanox ConnectX-3 VPI InfiniBand adapters
- Mellanox SwitchX SX6036 VPI switch (56Gb/s FDR InfiniBand and 40Gb/s Ethernet)
- MPI: Open MPI 1.6.4, Platform MPI 8.2.1
- Compiler: Intel Compilers Version 13 (Intel Composer XE 2013)
- Application: milc_qcd-7.7.8
- Benchmark workload:
  - Input dataset: n6_256.in, from NERSC: http://www.nersc.gov/assets/rd/milc7v1.0.tar
  - Global lattice size of 32x32x32x36

About HP ProLiant SL230s Gen8
- Processor: Two Intel Xeon E5-2600 Series, 4/6/8 cores
- Chipset: Intel Sandy Bridge EP Socket-R
- Memory: 16 DIMM sockets, DDR3 up to 1600MHz, ECC (up to 512 GB)
- Max Memory: 512 GB
- Internal Storage: Two LFF non-hot-plug SAS/SATA bays or four SFF non-hot-plug SAS/SATA/SSD bays; two hot-plug SFF drives (option)
- Max Internal Storage: 8TB
- Networking: Dual-port 1GbE NIC / single 10GbE NIC
- I/O Slots: One PCIe Gen3 x16 LP slot; 1Gb and 10Gb Ethernet, IB, and FlexFabric options
- Ports: Front: (1) Management, (2) 1GbE, (1) Serial, (1) S.U.V. port, (2) PCIe; internal Micro SD card & Active Health
- Power Supplies: 750W, 1200W (92% or 94% efficiency), high-power chassis
- Integrated Management: iLO 4; hardware-based power capping via SL Advanced Power Manager
- Additional Features: Shared power and cooling, up to 8 nodes per 4U chassis, single GPU support, Fusion-io support
- Form Factor: 16P/8 GPUs per 4U chassis

MILC Performance - Interconnect
- InfiniBand FDR is the most efficient inter-node interconnect for MILC
  - Outperforms 10GbE by 32% at 4 nodes
  - Outperforms 1GbE by 332% at 4 nodes
- 1GbE does not show a performance gain beyond 1 node
(Chart: higher is better; 16 processes per node)

MILC Performance - MPI
- Open MPI performs better than Platform MPI
  - Outperforms it by around 20% at 64 processes
- No tuning flags were used other than processor binding, which was applied in both cases
- The same compiler flags were used for both cases
(Chart: higher is better; 16 processes per node)

MILC Performance - Processors
- The Intel Xeon E5-2680 (Sandy Bridge) cluster outperforms the prior CPU generation
  - Performs 68% higher than the X5670 cluster at 4 nodes
- Configurations used:
  - Athena: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR InfiniBand, 16 PPN
  - Plutus: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR InfiniBand, 12 PPN
- Compiler optimization flags: OCFLAGS=-O3 -ip; Athena adds the -xavx flag
(Chart: higher is better; Platform MPI)

MILC Profiling - MPI Time Ratio
- InfiniBand FDR reduces the communication time at scale
  - InfiniBand FDR consumes about 29% of total runtime
  - 1GbE consumes 84% of total runtime, while 10GbE consumes about 41%
(Chart: 16 processes per node)
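As a rough illustration of how an MPI-time ratio like the one above can be estimated without a full profiler, the hypothetical sketch below brackets a communication phase with MPI_Wtime timers and reports the communication fraction. The percentages in this study come from MPI profiling of the real application, not from this code.

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical sketch: estimate the fraction of runtime spent in MPI
   by timing the communication phase of each iteration with MPI_Wtime. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_total = MPI_Wtime();
    double t_comm = 0.0;

    for (int iter = 0; iter < 100; iter++) {
        /* ... compute phase would go here ... */

        double t0 = MPI_Wtime();
        double local = (double)rank, global;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t_comm += MPI_Wtime() - t0;
    }

    t_total = MPI_Wtime() - t_total;
    if (rank == 0)
        printf("MPI time ratio: %.1f%%\n", 100.0 * t_comm / t_total);

    MPI_Finalize();
    return 0;
}
```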

MILC Profiling - MPI Functions
- Most used MPI functions: MPI_Wait (49%), MPI_Isend (25%), MPI_Irecv (25%)

MILC Profiling - MPI Functions
- The most time-consuming MPI functions: MPI_Wait (75%), MPI_Allreduce (17%)

MILC Profiling - Message Size
- Distribution of message sizes for the MPI calls:
  - MPI_Wait: messages between 256B and 1KB
  - MPI_Allreduce: small messages from 16B
(64 MPI processes)

MILC Profiling - Point-to-Point Flow
- Heavy MPI communication is seen between processes, mainly concentrated between close neighboring ranks
- This applies to the point-to-point traffic, such as the non-blocking communications
(64 MPI processes)
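This profile matches the classic halo-exchange pattern of lattice codes: each rank posts non-blocking receives and sends to its nearest neighbors, waits on them with MPI_Wait, and performs small global reductions with MPI_Allreduce. The sketch below is a generic 1-D illustration of that pattern (hypothetical buffer names and halo size), not MILC's actual communication routines.

```c
#include <mpi.h>

#define HALO 64  /* illustrative halo of 64 doubles = 512B, in the observed 256B-1KB range */

/* Generic non-blocking nearest-neighbor exchange on a 1-D ring, followed by a
   small MPI_Allreduce: an illustration of the profiled pattern, not MILC code. */
static void halo_exchange(double *send_lo, double *send_hi,
                          double *recv_lo, double *recv_hi, double *local_sum)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int lo = (rank - 1 + size) % size;   /* close neighboring ranks */
    int hi = (rank + 1) % size;

    MPI_Request req[4];
    MPI_Irecv(recv_lo, HALO, MPI_DOUBLE, lo, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_hi, HALO, MPI_DOUBLE, hi, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_hi, HALO, MPI_DOUBLE, hi, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_lo, HALO, MPI_DOUBLE, lo, 1, MPI_COMM_WORLD, &req[3]);

    /* Interior computation could overlap here; then wait for the halos. */
    for (int i = 0; i < 4; i++)
        MPI_Wait(&req[i], MPI_STATUS_IGNORE);

    /* Small global reduction (one double), like the 16B MPI_Allreduce seen in the profile. */
    double global_sum;
    MPI_Allreduce(local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    *local_sum = global_sum;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    double send_lo[HALO] = {0}, send_hi[HALO] = {0};
    double recv_lo[HALO], recv_hi[HALO];
    double local_sum = 1.0;
    halo_exchange(send_lo, send_hi, recv_lo, recv_hi, &local_sum);
    MPI_Finalize();
    return 0;
}
```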

MILC Summary
- HP ProLiant Gen8 servers deliver better MILC performance than their predecessor
  - ProLiant Gen8 equipped with Intel E5-series processors and InfiniBand FDR provides 68% higher performance than ProLiant G7 servers when compared at 4 nodes
- InfiniBand FDR is the most efficient inter-node interconnect for MILC
  - Outperforms 10GbE by 32% at 4 nodes
  - Outperforms 1GbE by 332% (over 3x) at 4 nodes
- MILC profiling:
  - Heavy MPI communication is seen between MPI processes
  - InfiniBand FDR reduces communication time, leaving more time for computation
    - InfiniBand FDR consumes 29% of total time, versus 41% for 10GbE and 84% for 1GbE
  - Non-blocking communications dominate:
    - Time spent: MPI_Wait (75%), MPI_Allreduce (17%)
    - Most used: MPI_Wait (49%), MPI_Isend (25%), MPI_Irecv (25%)

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.