LAMMPSCUDA GPU Performance April 2011

Note
The following research was performed under the HPC Advisory Council activities
- Participating vendors: Dell, Intel, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
For more information, please refer to:
- http://www.dell.com
- http://www.intel.com
- http://www.mellanox.com
- http://www.nvidia.com
- http://code.google.com/p/lammps
- http://www.tu-ilmenau.de/theophys2/forschung/lammpscuda/

LAMMPS
- Large-scale Atomic/Molecular Massively Parallel Simulator
- Classical molecular dynamics code which can model atomic, polymeric, biological, metallic, granular, and coarse-grained systems
- LAMMPS runs efficiently in parallel using message-passing techniques
- Developed at Sandia National Laboratories
- An open-source code, distributed under the GNU General Public License

Objectives
The presented research was done to provide best practices:
- LAMMPS performance benchmarking
- Interconnect performance comparisons
- Ways to increase LAMMPS productivity
- GPU acceleration evaluation (GPUDirect)
The presented results will demonstrate:
- The scalability of the compute environment / application
- Considerations for performance optimization

Test Cluster Configuration
- Dell PowerEdge C6100 4-node cluster
- Six-core Intel Xeon X5670 CPUs @ 2.93 GHz, 24GB memory per node
- OS: RHEL5 Update 5, OFED 1.5.2 via MLNX_OFED-GPU_DIRECT 2.1.0-1.1.0010
- Dell PowerEdge C410x PCIe Expansion Chassis
- 4 NVIDIA Tesla C2050 "Fermi" GPUs (CUDA driver and runtime version 3.2)
- Mellanox ConnectX-2 VPI mezzanine cards for InfiniBand and 10GigE
- Mellanox MTS3600Q 36-port 40Gb/s QDR InfiniBand switch
- Mellanox NVIDIA GPUDirect: MLNX_OFED-GPU_DIRECT 2.1.0-1.1.0010
- Fulcrum-based 10Gb/s Ethernet switch
- MPI: Open MPI 1.4.2
- Application: gpulammps (Subversion rev 627), based on LAMMPS (4 Dec 2010); LAMMPSCUDA version 1.1 (7 March 2011)

LAMMPSCUDA Performance: Interconnect
InfiniBand provides the best data throughput
- Up to 153% higher performance than 1GigE at 3 nodes
- Up to 26% higher performance than 10GigE at 3 nodes
1GigE provides no job gain beyond 1 node
- GPU communications between nodes become a significant burden to the network
(Chart: higher is better; 12 cores per node)

LAMMPSCUDA GPUDirect
Benefit of GPUDirect:
- Allows the GPU to directly access system memory without CPU involvement
- Utilizes native RDMA for efficient data transfer over InfiniBand
- Reduces latency by 30% for GPU communications
Enabling GPUDirect (a combined sketch follows this slide):
- Open MPI needs an MCA parameter set to enable GPUDirect: --mca btl_openib_flags 310
- The LAMMPSCUDA input file needs pinned memory enabled: accelerator cuda gpu/node 1 pinned 1
(Diagram: transmit and receive data paths between system memory, CPU, chipset, GPU memory and InfiniBand on each node)
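For illustration, a minimal sketch of how these two settings fit together is shown below. The MCA flag and the accelerator line are taken from this slide; the binary name, rank count and input file name in the launch line are placeholder assumptions, not values from the benchmark.

# Launch command (sketch; binary name, rank count and input file are placeholders):
#   mpirun -np 24 --mca btl_openib_flags 310 ./lmp_cuda -in in.melt.cuda
#
# In the LAMMPSCUDA input file: request 1 GPU per node with pinned (page-locked)
# host memory so GPUDirect can move data over InfiniBand without extra copies
accelerator cuda gpu/node 1 pinned 1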

LAMMPSCUDA in.lj-coul.cuda (1 GPU/node)
Dataset: in.lj-coul.cuda
- GI-System, Debye length 2.5, 7.0A cutoff
- 64584 atoms, 2000 steps, charge atom style, pair force: lj/cut/coul/debye 2.5 7.0
GPUDirect demonstrates an increase in job productivity
- Up to 4-6% increase in performance with GPUDirect on 2 and 3 nodes
(Chart: InfiniBand QDR; higher is better; 12 cores per node)

LAMMPSCUDA in.melt.cuda (1 GPU/node)
Dataset: in.melt.cuda
- 3d Lennard-Jones melt, 2.5A cutoff
- 108000 atoms, 2000 steps, atomic atom style, pair force: lj/cut 2.5
GPUDirect demonstrates an increase in job productivity
- Up to 9-13% increase in performance with GPUDirect on 2 and 3 nodes
An input sketch for this dataset is shown after this slide.
(Chart: InfiniBand QDR; higher is better; 12 cores per node)
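As a concrete illustration, the sketch below reconstructs a melt-style input from the parameters listed on this slide (atomic atom style, lj/cut with 2.5 cutoff, a 30x30x30 fcc box giving 108000 atoms, 2000 steps). The reduced density, velocity seed and neighbor settings are assumptions borrowed from the standard LAMMPS melt example, not values from the benchmark.

# 3d Lennard-Jones melt, sketched to match the slide's parameters
units           lj
atom_style      atomic
lattice         fcc 0.8442              # density assumed from the standard melt example
region          box block 0 30 0 30 0 30
create_box      1 box
create_atoms    1 box                   # 4 atoms/cell x 30^3 cells = 108000 atoms
mass            1 1.0
velocity        all create 3.0 87287    # temperature and seed are illustrative
pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5
neighbor        0.3 bin
neigh_modify    every 20 delay 0 check no
fix             1 all nve
run             2000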

LAMMPSCUDA in.melt-expand.cuda (1 GPU/node)
Dataset: in.melt-expand.cuda
- 3d Lennard-Jones melt, 2.5A cutoff
- 108000 atoms, 2000 steps, atomic atom style, pair force: lj/expand 2.5
GPUDirect demonstrates an increase in job productivity
- Up to 9-13% increase in performance with GPUDirect on 2 and 3 nodes
(Chart: InfiniBand QDR; higher is better; 12 cores per node)

LAMMPSCUDA in.melt-msd.cuda (1 GPU/node)
Dataset: in.melt-msd.cuda
- 3d Lennard-Jones melt (Mean Square Displacement), 2.5A cutoff
- 108000 atoms, 2000 steps, atomic atom style, pair force: lj/cut 2.5
GPUDirect demonstrates an increase in job productivity
- Up to 10-13% increase in performance with GPUDirect on 2 and 3 nodes
(Chart: InfiniBand QDR; higher is better; 12 cores per node)

LAMMPSCUDA in.melt-morse.cuda (1 GPU/node)
Dataset: in.morse.cuda
- Morse interaction, 3.0A cutoff
- 108000 atoms, 2000 steps, atomic atom style, pair force: morse 3.0, box of 30x30x30
GPUDirect demonstrates an increase in job productivity
- Up to 7-11% increase in performance with GPUDirect on 2 and 3 nodes
The pair-style lines that distinguish these melt variants are sketched after this slide.
(Chart: InfiniBand QDR; higher is better; 12 cores per node)
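The three melt variants above share the same base input and differ mainly in the pair interaction (and, for the MSD case, an added mean-square-displacement compute). A hedged sketch of just those lines follows; the pair coefficients are illustrative placeholders, not values from the benchmark.

# in.melt-expand.cuda style: shifted Lennard-Jones, 2.5 cutoff
pair_style      lj/expand 2.5
pair_coeff      1 1 1.0 1.0 0.5         # epsilon sigma delta (illustrative values)

# in.melt-msd.cuda style: same lj/cut pair force plus a mean-square-displacement compute
compute         msd_all all msd
thermo_style    custom step temp pe c_msd_all[4]

# in.melt-morse.cuda style: Morse interaction, 3.0 cutoff
pair_style      morse 3.0
pair_coeff      1 1 1.0 2.0 1.1         # D0 alpha r0 (illustrative values)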

LAMMPSCUDA GPUDirect (1 GPU/node)
GPUDirect enables faster speedup with 1 GPU per node on multiple nodes
- Speedup of 4% to 10% for a 2-node job
- Speedup of 6% to 13% for a 3-node job
- The productivity gain is expected to increase as the cluster size increases
(Chart: InfiniBand QDR, Open MPI; higher is better)

LAMMPSCUDA GPUDirect (2 GPUs/node)
GPUDirect enables faster speedup with 2 GPUs per node on 2 nodes
- Speedup of 16% for a 2-node job with the melt-expand dataset
- As the cluster scales, more speedup is expected
(Chart: InfiniBand QDR, Open MPI; higher is better)

Summary
LAMMPS performance with GPUDirect
- GPUDirect enables a substantial performance speedup
- Results were demonstrated on a small-scale environment; a bigger performance advantage is expected at larger system sizes
- The performance boost depends on the benchmark dataset
Interconnect effect on LAMMPS performance
- InfiniBand enables the highest performance and best GPU cluster scalability
- Mainly due to low CPU overhead, native RDMA and low latency

Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.