CPMD Performance Benchmark and Profiling. February 2014


Note
- The following research was performed under the HPC Advisory Council activities
- Special thanks to: HP, Mellanox
- For more information on the supporting vendors' solutions please refer to: www.mellanox.com, http://www.hp.com/go/hpc
- For more information on the application: http://www.cpmd.org/

CPMD
- CPMD is a parallelized implementation of density functional theory (DFT)
  - Plane-wave / pseudopotential implementation of DFT
  - Particularly designed for ab-initio molecular dynamics
- Brings together methods from:
  - Classical molecular dynamics
  - Solid state physics
  - Quantum chemistry
- CPMD supports MPI and mixed MPI/SMP
- CPMD is distributed and developed by the CPMD consortium
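
A rough illustration of what "mixed MPI/SMP" means in practice (a minimal hybrid MPI + OpenMP sketch, not CPMD source code): MPI ranks handle communication between nodes while OpenMP threads provide the shared-memory parallelism within each node.

    /* Minimal hybrid MPI + OpenMP sketch; illustrative only, not CPMD code. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Request an MPI library that tolerates OpenMP threads inside a rank */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each MPI rank spawns a thread team for on-node (SMP) parallelism */
        #pragma omp parallel
        {
            #pragma omp master
            printf("rank %d of %d: %d OpenMP threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }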

Objectives
- The presented research was done to provide best practices for:
  - CPMD performance benchmarking
  - Interconnect performance comparisons
  - MPI performance comparisons
  - Understanding CPMD communication patterns
- The presented results will demonstrate:
  - The scalability of the compute environment to provide nearly linear application scalability

Test Cluster Configuration
- HP ProLiant SL230s Gen8 4-node "Athena" cluster
- Processors: Dual-socket 10-core Intel Xeon E5-2680 v2 @ 2.8 GHz
- Memory: 32GB per node, 1600MHz DDR3 dual-ranked DIMMs
- OS: RHEL 6 Update 2, OFED 2.1-1.0.0 InfiniBand SW stack
- Mellanox Connect-IB FDR InfiniBand adapters
- Mellanox ConnectX-3 VPI adapters
- Mellanox SwitchX SX6036 56Gb/s FDR InfiniBand and Ethernet VPI switch
- MPI: Platform MPI 8.3, Open MPI 1.7.4 (with MXM 2.1 and FCA 2.5)
- Compiler: Intel Composer XE 2013 (2013.5.192)
- Application: CPMD 3.17.1
- Benchmark workload: C120 (120 carbon atoms)

About the HP ProLiant SL230s Gen8 Server
- Processor: Two Intel Xeon E5-2600 v2 Series, 4/6/8/10/12 cores
- Chipset: Intel Xeon E5-2600 v2 product family
- Memory: (256 GB), 16 DIMM slots, DDR3 up to 1600MHz, ECC
- Max Memory: 256 GB
- Internal Storage: Two LFF non-hot-plug SAS/SATA bays, or four SFF non-hot-plug SAS/SATA/SSD bays; two hot-plug SFF drives (option)
- Max Internal Storage: 8TB
- Networking: Dual-port 1GbE NIC / single 10GbE NIC
- I/O Slots: One PCIe Gen3 x16 LP slot; 1Gb and 10Gb Ethernet, IB, and FlexFabric options
- Ports: Front: (1) Management, (2) 1GbE, (1) Serial, (1) S.U.V port, (2) PCIe, and internal Micro SD card & Active Health
- Power Supplies: 750, 1200W (92% or 94%), high-power chassis
- Integrated Management: iLO4; hardware-based power capping via SL Advanced Power Manager
- Additional Features: Shared power & cooling and up to 8 nodes per 4U chassis; single GPU support; Fusion I/O support
- Form Factor: 16P/8 GPUs/4U chassis

CPMD Performance - Processors
- The Intel E5-2680 v2 (Ivy Bridge) cluster outperforms the prior CPU generation
  - Performs up to 203% higher than the Xeon X5670 (Westmere) cluster at 4 nodes
- Configurations used:
  - Athena: 2-socket Intel E5-2680 v2 @ 2.8GHz, 1600MHz DIMMs, FDR InfiniBand, 20 PPN
  - Plutus: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR InfiniBand, 12 PPN
  - Same set of compiler optimization flags used in both cases
[Chart: performance gains over the Westmere cluster by node count (151%, 161%, 178%, 203%); higher is better]

CPMD Performance - Interconnect
- FDR InfiniBand is the most efficient inter-node interconnect for CPMD
  - FDR outperforms 1GbE by over 7 times at 80 MPI processes
  - FDR also outperforms 10GbE and 40GbE by over 3.5 times at 80 MPI processes
  - FDR provides 117% higher performance than 40GbE-RoCE
- The performance benefit of InfiniBand is expected to grow at larger CPU core counts
[Chart (Open MPI, higher is better): data labels 348%, 350%, 786%, 117%]

CPMD Performance - Interconnect
- Tuned Platform MPI performs better than Open MPI
  - A slight advantage (about 5%) is seen over Open MPI at 80 MPI processes
  - The same compiler flags were used in both cases
[Chart: Platform MPI vs. Open MPI; 20 processes per node; higher is better]

CPMD Profiling - # of MPI Functions
- Most used MPI functions: MPI_Bcast (78%) and MPI_Barrier (12%)
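
Call counts like these are typically collected through the MPI profiling (PMPI) interface. As a sketch of that general technique (the specific profiling tool used in this study is not named, so this is only an assumed illustration), an interposition layer can count the collectives of interest:

    /* PMPI interposer sketch: counts MPI_Bcast and MPI_Barrier calls per rank.
     * Illustrates the technique only; not the profiler used in this study. */
    #include <mpi.h>
    #include <stdio.h>

    static long bcast_calls = 0, barrier_calls = 0;

    int MPI_Bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
    {
        bcast_calls++;
        return PMPI_Bcast(buf, count, type, root, comm);
    }

    int MPI_Barrier(MPI_Comm comm)
    {
        barrier_calls++;
        return PMPI_Barrier(comm);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: MPI_Bcast calls = %ld, MPI_Barrier calls = %ld\n",
               rank, bcast_calls, barrier_calls);
        return PMPI_Finalize();
    }

Such a layer is built as a library and linked (or preloaded) ahead of the MPI library, so the application's MPI calls pass through the counters before reaching the real implementation.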

CPMD Profiling - Time Spent on MPI
- The most time-consuming MPI functions: MPI_Alltoall (74%), MPI_Bcast (16%), MPI_Allreduce (7%), MPI_Barrier (2%)
- Mostly MPI collective operations are used
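
MPI_Alltoall is expensive because every rank exchanges a distinct data block with every other rank, so each call generates on the order of N*(N-1) transfers across the fabric. A minimal timing sketch (the buffer size is an arbitrary example value, unrelated to the CPMD runs above):

    /* Sketch: time one MPI_Alltoall exchange, the collective that dominates
     * CPMD's communication time. Block size is an arbitrary example value. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        const int block = 4096;  /* doubles sent to each peer rank */
        double *sendbuf = malloc((size_t)block * nranks * sizeof(double));
        double *recvbuf = malloc((size_t)block * nranks * sizeof(double));
        for (int i = 0; i < block * nranks; i++)
            sendbuf[i] = rank + i;

        double t0 = MPI_Wtime();
        /* Every rank sends one block to, and receives one block from,
         * every other rank in a single collective call. */
        MPI_Alltoall(sendbuf, block, MPI_DOUBLE,
                     recvbuf, block, MPI_DOUBLE, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("MPI_Alltoall with %d ranks took %.6f s\n", nranks, t1 - t0);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

This all-to-all pattern is why interconnect latency and bandwidth dominate the MPI time ratios shown on the next slide.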

CPMD Profiling - MPI Time Ratio
- FDR InfiniBand reduces the communication time at scale for MPI collectives (measured at 80 MPI processes)
  - MPI_Alltoall: 1GbE takes >60x longer, and 10GbE ~7x longer, than FDR IB
  - MPI_Allreduce: 1GbE takes ~7x longer, while 10GbE runs about 82% longer than FDR IB

CPMD Summary
- HP ProLiant Gen8 servers deliver better CPMD performance than their predecessors
  - ProLiant Gen8 equipped with Intel E5-2600 v2 series processors and FDR InfiniBand provides 203% higher performance than ProLiant G7 (Westmere) servers at 4 nodes
- FDR InfiniBand is the most efficient inter-node interconnect for CPMD
  - FDR IB outperforms 10/40GbE by >3.5x, and 1GbE by over 7x, with 4 nodes
  - FDR IB also outperforms 40GbE-RoCE by 117% at 4 nodes
- CPMD profiling
  - FDR InfiniBand reduces communication time, leaving more time for computation
  - MPI_Alltoall: 1GbE takes >60x longer, and 10GbE ~7x longer, than FDR IB
  - MPI_Allreduce: 1GbE takes ~7x longer, while 10GbE runs about 82% longer than FDR IB
  - Communication is dominated by collective operations:
    - Time spent: MPI_Alltoall (74%), MPI_Bcast (16%), MPI_Allreduce (7%), MPI_Barrier (2%)
    - Most used: MPI_Bcast (78%) and MPI_Barrier (12%)

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.