CP2K Performance Benchmark and Profiling April 2011

Note
- The following research was performed under the HPC Advisory Council HPC Works working group activities
- Participating vendors: HP, Intel, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
- For more information please refer to:
  http://www.hp.com/go/hpc
  www.intel.com
  www.mellanox.com
  http://cp2k.berlios.de

CP2K
- CP2K is an atomistic and molecular simulation package for solid-state, liquid, molecular and biological systems
- CP2K provides a general framework for different methods, such as:
  - Density functional theory (DFT) using a mixed Gaussian and plane waves approach (GPW)
  - Classical pair and many-body potentials
- CP2K is a freely available (GPL) program, written in Fortran 95

Objectives
- The presented research was done to provide best practices:
  - MPI library comparisons
  - Interconnect performance benchmarking
  - CP2K application profiling
  - Understanding CP2K communication patterns
  - CP2K performance optimization
- The presented results will demonstrate:
  - A balanced compute environment determines application performance
  - Tips to tune MPI to achieve maximum CP2K scalability

Test Cluster Configuration
- HP ProLiant SL2x170z G6 16-node cluster
  - Six-Core Intel X5670 @ 2.93 GHz CPUs
  - Memory: 24 GB per node
  - OS: CentOS 5 Update 5, OFED 1.5.3 InfiniBand SW stack
- Mellanox ConnectX-2 InfiniBand QDR adapters and switches
- Fulcrum-based 10Gb/s Ethernet switch
- MPI: Intel MPI 4, Open MPI 1.5.3 with KNEM 0.9.6, Platform MPI 8.0.1
- Compilers: Intel Compilers 11.1
- Application: CP2K version 2.2.196
- Libraries: Intel MKL 10.1, BLACS, ScaLAPACK 1.8.0
- Benchmark workload: H2O-128.inp
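For context, a minimal sketch of how this benchmark configuration might be launched is shown below. The hostfile name and the binary location are assumptions rather than details from the slides; cp2k.popt is the conventional name of CP2K's MPI-parallel executable, and H2O-128.inp is the workload listed above.

    # Hypothetical launch of the H2O-128 workload on 16 nodes x 12 cores (192 ranks).
    # hosts.16nodes and the working directory are assumptions; -i/-o are the
    # standard CP2K input/output options.
    mpirun -np 192 -hostfile hosts.16nodes \
        ./cp2k.popt -i H2O-128.inp -o H2O-128.out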

About HP ProLiant SL6000 Scalable System
- Solution-optimized for extreme scale-out
  - Save on cost and energy per node, rack and data center
  - Mix and match configurations
  - Deploy with confidence
  - #1 Power Efficiency*
- ProLiant z6000 chassis: shared infrastructure (fans, chassis, power)
- ProLiant SL160z G6 / SL165z G7: large memory, memory-cache apps
- ProLiant SL170z G6: large storage, Web search and database apps
- ProLiant SL2x170z G6: highly dense, HPC compute and web front-end apps

* SPECpower_ssj2008, www.spec.org, 17 June 2010, 13:28

CP2K Benchmark Results: Open MPI
- Input dataset: H2O-128.inp
- MPI tuning enables nearly 18% performance gain at 16 nodes / 192 cores
  - KNEM accelerates intra-node communication
  - MPI RDMA optimization reduces inter-node communication latency (see the launch sketch below):
    --mca btl_openib_eager_limit 65536
    --mca btl_openib_max_eager_rdma 8
    --mca btl_openib_eager_rdma_num 8
- Higher is better; 12 cores per node
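A hedged sketch of the tuned Open MPI invocation follows. Only the three btl_openib parameters come from the slide; the BTL list, hostfile and binary path are illustrative assumptions, and KNEM acceleration is assumed to be enabled in the Open MPI 1.5.3 build.

    # Sketch of a tuned Open MPI run; only the three btl_openib parameters
    # are taken from the slide, the rest is illustrative.
    mpirun -np 192 -hostfile hosts.16nodes \
        --mca btl sm,openib,self \
        --mca btl_openib_eager_limit 65536 \
        --mca btl_openib_max_eager_rdma 8 \
        --mca btl_openib_eager_rdma_num 8 \
        ./cp2k.popt -i H2O-128.inp -o H2O-128.out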

CP2K Benchmark Results: Intel MPI
- Applying optimized parameters improves CP2K performance by 23%
  - Increase the alignment of the send buffer
  - Select the right Alltoall and Alltoallv algorithms (see the sketch below)
- Higher is better; 12 cores per node
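The slide does not list the exact Intel MPI settings. As a hedged sketch, collective algorithm selection in Intel MPI is done through the I_MPI_ADJUST_* environment variables; the algorithm indices below are illustrative assumptions to be validated on the target cluster rather than measured recommendations, and the send-buffer alignment change is an application-level adjustment for which no variable is given on the slide.

    # Illustrative Intel MPI tuning for the two dominant collectives.
    # The algorithm indices are assumptions, not values from the slide.
    export I_MPI_ADJUST_ALLTOALL=3     # select an Alltoall algorithm variant
    export I_MPI_ADJUST_ALLTOALLV=2    # select an Alltoallv algorithm variant
    mpirun -np 192 ./cp2k.popt -i H2O-128.inp -o H2O-128.out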

CP2K Benchmark Results: MPI Libraries
- Open MPI performs better at smaller node counts
- All tested MPI libraries have similar performance at 16 nodes
- Higher is better; 12 cores per node

CP2K Benchmark Results: Interconnects
- Only InfiniBand enables CP2K application scalability
  - GigE can't scale even on 2 nodes
  - 10GigE stops scaling after 2 nodes
- Higher is better; 12 cores per node

CP2K MPI Profiling: MPI Functions
- Both MPI collectives and point-to-point operations account for a large share of the communication time
  - Collectives: MPI_Alltoallv and MPI_Allreduce
  - Point-to-point: MPI_Waitall and MPI_Isend
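The slides do not name the profiler used to collect these statistics. As one hedged way to reproduce a per-function and per-message-size MPI profile, an LD_PRELOAD-based tool such as IPM can be wrapped around the same run; the library path below is an assumption.

    # Sketch: gathering an MPI profile of the CP2K run with IPM.
    # The path to libipm.so is an assumption for this cluster.
    export IPM_REPORT=full
    export LD_PRELOAD=/opt/ipm/lib/libipm.so
    mpirun -np 192 ./cp2k.popt -i H2O-128.inp -o H2O-128.out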

CP2K MPI Profiling: Message Size
- The majority of messages are smaller than 64KB

CP2K MPI Profiling: 10GigE vs InfiniBand
- With 10GigE, most of the time is spent in communication rather than computation
- 48 ranks

CP2K MPI Profiling: 10GigE vs InfiniBand
- 10GigE has a large communication overhead in MPI collectives
  - 22 times longer than InfiniBand QDR
- 48 ranks

CP2K Benchmark Summary
- The CP2K performance benchmark demonstrates:
  - InfiniBand QDR enables application performance and scalability
  - Neither GigE nor 10GigE meets CP2K's network requirements
- CP2K MPI profiling:
  - CP2K uses a large number of small messages
  - MPI_Alltoallv, MPI_Alltoall, and MPI_Allreduce are the major collectives
  - Point-to-point communication carries a large overhead
  - Interconnect latency is critical to CP2K performance
- MPI tuning accelerates CP2K performance
  - More than 20% gain at 16 nodes / 192 cores

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.