LAMMPS-KOKKOS Performance Benchmark and Profiling. September 2015


Note
- The following research was performed under the HPC Advisory Council activities
  - Participating vendors: Intel, Dell, Mellanox, NVIDIA
  - Compute resource: HPC Advisory Council Cluster Center
- The following was done to provide best practices
  - LAMMPS performance overview
  - Understanding LAMMPS communication patterns
  - Ways to increase LAMMPS productivity
- For more information please refer to
  - http://lammps.sandia.gov
  - http://www.dell.com
  - http://www.intel.com
  - http://www.mellanox.com
  - http://www.nvidia.com

LAMMPS
- Large-scale Atomic/Molecular Massively Parallel Simulator
- Classical molecular dynamics code which can model:
  - Atomic, polymeric, biological, metallic, granular, and coarse-grained systems
- The LAMMPS-KOKKOS package contains:
  - Versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library
- LAMMPS runs efficiently in parallel using message-passing techniques (a sketch of a typical Kokkos-enabled launch follows this list)
- Developed at Sandia National Laboratories
- Open-source code, distributed under the GNU Public License
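
To make the package description concrete, below is a minimal sketch (not taken from the slides) of how a Kokkos-enabled LAMMPS run is commonly launched. The binary name, rank counts, and input file name are assumptions based on the test configuration described later in this deck; the -k/-sf switches are the standard LAMMPS KOKKOS command-line options, though their exact spelling can vary between LAMMPS versions.

```python
# Minimal sketch (not from the slides): a typical Kokkos-enabled LAMMPS launch.
# The binary name, rank counts, and input file are assumptions based on the test
# configuration in this deck; -k/-sf are the standard LAMMPS KOKKOS switches,
# but their exact spelling can vary between LAMMPS versions.
nodes = 8
ranks_per_node = 28                      # 28 MPI ranks per node, sharing 1 GPU per node
cmd = [
    "mpirun", "-np", str(nodes * ranks_per_node),
    "lmp_kokkos_cuda",                   # hypothetical name of the Kokkos/CUDA build
    "-k", "on", "g", "1",                # enable Kokkos with 1 GPU per node
    "-sf", "kk",                         # apply the kk suffix to pair/fix/atom styles
    "-in", "in.eam",                     # EAM input deck used in this benchmark
]
print(" ".join(cmd))                     # print the command; run it via subprocess on a real cluster
```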

Objectives
- The presented research was done to provide best practices:
  - LAMMPS performance benchmarking
  - MPI library performance comparison
  - Interconnect performance comparison
  - CPU comparison
  - Optimization tuning
- The presented results demonstrate:
  - The scalability of the compute environment/application
  - Considerations for higher productivity and efficiency

Test Cluster Configuration
- Dell PowerEdge R730 32-node (896-core) "Thor" cluster
  - Dual-socket 14-core Intel Xeon E5-2697 v3 @ 2.60 GHz CPUs (BIOS: Maximum Performance, Home Snoop)
  - Memory: 64GB DDR4 2133 MHz; Memory Snoop Mode in BIOS set to Home Snoop
  - OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack
  - Hard drives: 2x 1TB 7.2K RPM SATA 2.5" in RAID 1
- Mellanox ConnectX-4 EDR 100Gb/s InfiniBand adapters
- Mellanox Switch-IB SB7700 36-port EDR 100Gb/s InfiniBand switch
- Mellanox ConnectX-3 FDR VPI InfiniBand and 40Gb/s Ethernet adapters
- Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet switch
- Dell InfiniBand-based Lustre storage built on Dell PowerVault MD3460 and Dell PowerVault MD3420
- NVIDIA Tesla K40 GPUs (on 8 nodes) and NVIDIA Tesla K80 GPUs (on 2 nodes); 1 GPU per node
- MPI: Mellanox HPC-X v1.3 (based on Open MPI 1.8.7) with CUDA 6.5 and 7.0 support
- Application: LAMMPS 15May15
- Benchmark: input data with the embedded-atom method (in.eam)

PowerEdge R730: Massive flexibility for data-intensive operations
- Performance and efficiency
  - Intelligent hardware-driven systems management with extensive power management features
  - Innovative tools including automation for parts replacement and lifecycle manageability
  - Broad choice of networking technologies from Ethernet to InfiniBand
  - Built-in redundancy with hot-plug and swappable PSUs, HDDs, and fans
- Benefits
  - Designed for performance workloads
  - High-performance scale-out compute and low-cost dense storage in one package
- Hardware capabilities
  - Flexible compute platform with dense storage capacity: 2S/2U server, 6 PCIe slots
  - Large memory footprint (up to 768GB / 24 DIMMs)
  - High I/O performance and optional storage configurations
    - HDD options: 12x 3.5" or 24x 2.5" plus 2x 2.5" HDDs in the rear of the server
    - Up to 26 HDDs, with 2 hot-plug drives in the rear of the server for boot or scratch

LAMMPS Performance - KOKKOS Vars
- The Kokkos variables determine the communication model between host and device
- The best Kokkos variable settings differ between the CPU and GPU tests
  - For GPU, the best-performing settings are among #2, #3, and #4
  - For CPU, the best-performing settings are among #1, #5, #6, and #7
- A sketch of how such settings might map to LAMMPS command-line options follows this list
(Higher is better; 8 nodes)
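
The slide does not spell out which settings the numbered "kokkos vars" correspond to. As an illustration only, the sketch below assumes they map to combinations of the LAMMPS "package kokkos" options (passed on the command line via "-pk kokkos ..."), which control things like the neighbor-list style and whether MPI communication buffers are packed on the host or the device; the specific combinations listed are hypothetical.

```python
# Illustration only: the slide does not list what the numbered "kokkos vars" are.
# The assumption here is that they correspond to combinations of the LAMMPS
# "package kokkos" options (passed on the command line via "-pk kokkos ..."),
# e.g. the neighbor-list style and where MPI buffers are packed/unpacked:
#   comm no     = pack/unpack with non-Kokkos host code
#   comm host   = pack/unpack on the host with threads
#   comm device = pack/unpack on the GPU
variants = {
    1: ["neigh", "half", "comm", "no"],       # hypothetical CPU-friendly setting
    2: ["neigh", "full", "comm", "device"],   # hypothetical GPU-friendly setting
    3: ["neigh", "half", "comm", "device"],
    4: ["neigh", "full", "comm", "host"],
}
for idx, opts in variants.items():
    # Each variant is an extra fragment appended to the launch command shown earlier
    print(f"variant {idx}: -pk kokkos {' '.join(opts)}")
```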

LAMMPS Performance - CUDA Versions
- The CUDA 6.5 and CUDA 7.0 builds perform similarly
  - Using the given input data and workload
(Higher is better; 8 nodes; 1x K40 per node)

LAMMPS Performance - Tesla K40 vs K80
- The Tesla K80 doubles the performance of the K40 with LAMMPS
  - Demonstrates a 110% increase in performance on 2 nodes with 1 GPU per node
(Higher is better; 1 GPU per node)

LAMMPS Performance - Network Interconnects
- EDR InfiniBand delivers superior scalability in application performance
  - InfiniBand delivers 77% higher performance than 10GbE on 8 nodes
  - 4 InfiniBand nodes outperform 8 Ethernet (10GbE) nodes
  - The benefit of InfiniBand over Ethernet is expected to increase as the cluster scales
  - Ethernet scalability stops beyond 4 nodes, while InfiniBand continues to scale
- One way such fabric selection is typically done is sketched after this list
(Higher is better; GPU: 1x K40 per node)
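
For context, the sketch below shows one common way an interconnect comparison like this can be set up with an Open MPI-based stack such as HPC-X: forcing the TCP BTL onto the 10GbE interface for the Ethernet runs, and using the MXM-accelerated pml for the InfiniBand runs. The interface name "eth2" and the binary name are hypothetical, the MCA parameter spellings follow Open MPI 1.8-era conventions, and the slides do not state the exact runtime flags that were used.

```python
# Sketch of one common way to pin an Open MPI / HPC-X job to a specific fabric
# for this kind of interconnect comparison. The interface name "eth2" and the
# binary name are hypothetical, and the MCA parameter spellings follow Open MPI
# 1.8-era conventions; the slides do not state the exact runtime flags used.
base = ["mpirun", "-np", "224", "--npernode", "28"]
app  = ["lmp_kokkos_cuda", "-k", "on", "g", "1", "-sf", "kk", "-in", "in.eam"]
fabrics = {
    # Ethernet runs: force the TCP BTL onto the 10GbE interface
    "10GbE": ["--mca", "btl", "tcp,self,sm", "--mca", "btl_tcp_if_include", "eth2"],
    # InfiniBand runs: use the MXM-accelerated pml shipped with HPC-X
    "EDR InfiniBand": ["--mca", "pml", "yalla"],
}
for name, mca in fabrics.items():
    print(f"{name}: {' '.join(base + mca + app)}")
```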

LAMMPS Performance - Network Interconnects (EDR vs FDR)
- EDR InfiniBand delivers superior scalability in application performance
  - EDR InfiniBand demonstrates an 8% increase over FDR on 8 nodes / 224 cores
  - The performance gap between FDR and EDR is expected to increase as the cluster scales
(Higher is better; 28 MPI processes per node)

LAMMPS Performance - Scalability
- kokkos_omp demonstrates higher speedup and performance than kokkos_cuda
  - CPU performance outpaces Tesla K40 performance when using 28 cores per node
  - With the Tesla K80 available, the GPU outperforms the CPU; the performance gap is expected to grow
- Hypothetical launch commands for the two backends are sketched after this list
(Chart annotations: 51%, 78%, 17%, 25%, 17%; higher is better; CPU: 28 processes per node, GPU: 1 GPU per node)
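
The kokkos_omp and kokkos_cuda results above differ mainly in which Kokkos backend the launch switches point at. The sketch below contrasts hypothetical launch commands for the two cases; binary names and counts are assumptions, not taken from the slides ("t" sets Kokkos OpenMP threads per MPI rank, "g" sets GPUs per node).

```python
# Hypothetical launch commands contrasting the two Kokkos backends compared above:
# "t" sets Kokkos OpenMP threads per MPI rank, "g" sets GPUs per node. Binary
# names and counts are assumptions, not taken from the slides.
cpu_run = ["mpirun", "-np", "224", "lmp_kokkos_omp",
           "-k", "on", "t", "1", "-sf", "kk", "-in", "in.eam"]   # 28 ranks/node, 1 thread each
gpu_run = ["mpirun", "-np", "224", "lmp_kokkos_cuda",
           "-k", "on", "g", "1", "-sf", "kk", "-in", "in.eam"]   # 28 ranks/node share 1 GPU
print("kokkos_omp : " + " ".join(cpu_run))
print("kokkos_cuda: " + " ".join(gpu_run))
```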

LAMMPS Profiling - % of MPI Calls
- The most time-consuming MPI calls for LAMMPS-KOKKOS (cuda):
  - MPI_Send: 67% of MPI time / 8% of wall time
  - MPI_Wait: 18% of MPI time / 2% of wall time
  - MPI_Bcast: 11% of MPI time / 1% of wall time
(kokkos_cuda, 8 nodes / 224 processes, EDR InfiniBand)

LAMMPS Profiling - MPI Message Size Distribution
- For the most time-consuming MPI calls:
  - MPI_Send: 48KB (26% of MPI time), 32KB (15%), 56KB (7%)
  - MPI_Wait: 0B (18% of MPI time)
  - MPI_Bcast: 4B (10% of MPI time)
(kokkos_cuda, 8 nodes / 224 processes, EDR InfiniBand)

LAMMPS Profiling - MPI Memory Consumption
- Some load imbalance is seen across the processes
- Memory consumption: about 9GB of memory is used on each compute node for this input data
(kokkos_cuda, 8 nodes / 224 processes, EDR InfiniBand)

LAMMPS Profiling - CUDA Profiler
- NVIDIA Visual Profiler and nvprof: profilers for GPUs
- The profile shows that many memcpy operations occur between host and device throughout the run
- A sketch of collecting per-rank nvprof traces follows this list
(kokkos_cuda, 8 nodes / 224 processes, EDR InfiniBand)
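
As a usage illustration (not from the slides), one nvprof trace per MPI rank can be collected as sketched below and then opened in the NVIDIA Visual Profiler to inspect the host-device memcpy traffic. The output-file pattern relies on nvprof's %q{ENV} substitution; the binary and file names are hypothetical.

```python
# Sketch (assumption): collecting one nvprof trace per MPI rank so the traces can
# be opened in the NVIDIA Visual Profiler to inspect host<->device memcpy traffic.
# The "%q{ENV}" substitution in the output name is an nvprof feature; binary and
# file names are hypothetical.
cmd = [
    "mpirun", "-np", "224",
    "nvprof", "-o", "lammps.%q{OMPI_COMM_WORLD_RANK}.nvprof",   # one trace file per rank
    "lmp_kokkos_cuda", "-k", "on", "g", "1", "-sf", "kk", "-in", "in.eam",
]
print(" ".join(cmd))
```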

LAMMPS Summary
- Performance
  - Compute: the current-generation cluster outperforms previous-generation system architectures
    - The Tesla K80 doubles the performance of the K40 with LAMMPS-KOKKOS
    - The KOKKOS variable settings can have a significant impact on LAMMPS performance
    - CUDA versions 6.5 and 7.0 perform similarly
  - Network: EDR InfiniBand demonstrates superior scalability in LAMMPS performance
    - EDR InfiniBand provides 77% higher performance than 10GbE and 86% higher than 1GbE on 8 nodes
    - 4 InfiniBand nodes outperform 8 Ethernet (10GbE) nodes
    - The benefit of InfiniBand over Ethernet is expected to increase as the cluster scales
    - Ethernet scalability stops beyond 4 nodes, while InfiniBand continues to scale
- MPI profiles
  - The most time-consuming MPI calls for LAMMPS-KOKKOS (cuda):
    - MPI_Send: 67% of MPI time / 8% of wall time
    - MPI_Wait: 18% of MPI time / 2% of wall time
    - MPI_Bcast: 11% of MPI time / 1% of wall time

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.