HYCOM Performance Benchmark and Profiling


January 2011

Acknowledgment: The DoD High Performance Computing Modernization Program

Note
- The following research was performed under the HPC Advisory Council activities
- Participating vendors: AMD, Dell, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
- We would like to acknowledge the DoD High Performance Computing Modernization Program for providing access to the FY 2009 benchmark suite
- For more information, please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com, and http://www.hycom.org

HYCOM (HYbrid Coordinate Ocean Model)
- A primitive-equation ocean general circulation model
- Evolved from the Miami Isopycnic-Coordinate Ocean Model
- Provides the capability of selecting several different vertical mixing schemes for:
  - The surface mixed layer
  - The comparatively weak interior diapycnal mixing
- Fully parallelized
- Open source, jointly developed by the University of Miami, the Los Alamos National Laboratory, and the Naval Research Laboratory

Objectives
- The presented research was done to provide best practices for:
  - HYCOM performance benchmarking
  - Interconnect performance comparisons
  - MPI library performance comparisons
  - Understanding HYCOM communication patterns
- The presented results demonstrate a balanced system that provides good application scalability

Test Cluster Configuration
- Dell PowerEdge R815 11-node cluster
- CPUs: AMD Opteron 6174 (code name "Magny-Cours"), 12 cores @ 2.2 GHz, 4 CPUs per server node
- Mellanox ConnectX-2 40Gb/s QDR InfiniBand adapters
- Mellanox M3600 36-port 40Gb/s QDR InfiniBand switch
- Memory: 128GB DDR3-1333 per node
- OS: CentOS 5.5, MLNX_OFED 1.5.1 InfiniBand software stack
- MPI: Open MPI 1.5.1, Platform MPI 8.0, MVAPICH2 1.5.1
- Application: HYCOM 2.2.10
- Benchmark workload: HYCOM "large" benchmark dataset (26-layer, 1/12-degree, fully global HYCOM benchmark)

HPC Advisory Council Test-bed System
- New 11-node, 528-core cluster featuring Dell PowerEdge R815 servers
  - Replacement for the Dell PowerEdge SC1435 (192-core) cluster following two years of rigorous benchmarking and product EOL
  - System to be redirected to explore HPC-in-the-Cloud applications
- Workload profiling and benchmarking
  - Characterization for HPC and compute-intense environments
  - Optimization for scale, sizing, configuration, and workload performance
  - Test-bed benchmarks: RFPs, customers/prospects, etc.
  - ISV and industry-standard application characterization
  - Best practices and usage analysis

About the Dell PowerEdge Platform - Advantages
- Best-of-breed technologies and partners
- The combination of the AMD Opteron 6100 series platform and Mellanox ConnectX InfiniBand on Dell HPC Solutions provides the ultimate platform for speed and scale
  - The Dell PowerEdge R815 system delivers 4-socket performance in a dense 2U form factor
  - Up to 48 cores / 32 DIMMs per server; 1008 cores in a 42U enclosure
- Integrated stacks designed to deliver the best price/performance/watt
  - 2x more memory and processing power in half the space
  - Energy optimized: low-flow fans, improved power supplies, and dual SD modules
- Optimized for long-term capital and operating investment protection
  - System expansion
  - Component upgrades and feature releases

HYCOM Benchmark Results - Multi-Rail
- Dual InfiniBand cards enable up to 9% better performance
  - A higher advantage is expected as cluster size scales
- Higher is better; 48 cores per node

HYCOM Benchmark Results - Interconnect
- InfiniBand enables better application performance and scalability
  - GigE stops scaling after 2 nodes (extremely slow with 3+ nodes)
  - Up to 35% higher performance than 10GigE at 504 cores
  - Application performance over InfiniBand scales as cluster size increases
- Higher is better; 48 cores per node

HYCOM Benchmark Results - KNEM
- KNEM is a Linux kernel module enabling high-performance intra-node MPI communication for large messages
- KNEM improves application performance by up to 4% at 504 cores
  - Higher gains are expected as cluster size increases
- Higher is better; 48 cores per node
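For context, the following is a minimal, hypothetical sketch (not part of HYCOM or the benchmark suite) of the kind of large intra-node point-to-point transfer that KNEM targets: with a KNEM-enabled MPI library, such a message can be moved with a single kernel-assisted copy instead of two copies through a shared-memory segment.

    /* Hypothetical sketch (not from HYCOM): one large intra-node message of
     * the kind KNEM accelerates. Run with two ranks placed on the same node. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

        const int count = 8 * 1024 * 1024;              /* 8M doubles = 64 MB */
        double *buf = calloc(count, sizeof(double));

        double t0 = MPI_Wtime();                        /* rough timing only */
        if (rank == 0) {
            MPI_Send(buf, count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("64 MB intra-node transfer: %.3f ms\n", (MPI_Wtime() - t0) * 1e3);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

The measured time is only a rough indication, since it includes any start-up skew between the two ranks; the point is that the copy path for messages of this size is what KNEM changes.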

HYCOM Benchmark Results - MPI Libraries
- Open MPI with KNEM enables better performance at 504 cores
- Higher is better; dual IB cards; 48 cores per node

HYCOM MPI Profiling - MPI Functions
- MPI_Waitall and MPI_Allreduce generate the most communication overhead
- The time share of MPI_Allreduce increases as cluster size scales
- Profiles shown for 123, 256, and 504 cores
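The slides do not show HYCOM source, but the generic pattern below indicates why these two calls dominate: per-timestep halo exchanges are posted with non-blocking sends and receives and completed in MPI_Waitall, while small global reductions go through MPI_Allreduce. This is a hedged sketch; the neighbor layout, message counts, and sizes are illustrative assumptions, not HYCOM's actual decomposition.

    /* Hypothetical sketch of the communication pattern behind the profile
     * (not HYCOM source): non-blocking halo exchanges completed by
     * MPI_Waitall, followed by a small global MPI_Allreduce. */
    #include <mpi.h>
    #include <stdlib.h>

    #define NNBRS 4          /* illustrative number of neighbours          */
    #define HALO  65536      /* halo strip length: a mid-to-large message  */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *sendbuf[NNBRS], *recvbuf[NNBRS];
        for (int i = 0; i < NNBRS; i++) {
            sendbuf[i] = calloc(HALO, sizeof(double));
            recvbuf[i] = calloc(HALO, sizeof(double));
        }

        MPI_Request reqs[2 * NNBRS];
        for (int step = 0; step < 10; step++) {
            /* Exchange halo strips with (here, arbitrary ring) neighbours. */
            for (int i = 0; i < NNBRS; i++) {
                int right = (rank + i + 1) % size;
                int left  = (rank - i - 1 + size) % size;
                MPI_Irecv(recvbuf[i], HALO, MPI_DOUBLE, left,  i, MPI_COMM_WORLD, &reqs[2 * i]);
                MPI_Isend(sendbuf[i], HALO, MPI_DOUBLE, right, i, MPI_COMM_WORLD, &reqs[2 * i + 1]);
            }
            /* Most point-to-point time is attributed to this completion call. */
            MPI_Waitall(2 * NNBRS, reqs, MPI_STATUSES_IGNORE);

            /* Small global reduction, e.g. a diagnostic scalar. */
            double local = 1.0, global;
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
        }

        for (int i = 0; i < NNBRS; i++) { free(sendbuf[i]); free(recvbuf[i]); }
        MPI_Finalize();
        return 0;
    }

Because the exchange is non-blocking, the wait time reported under MPI_Waitall also absorbs network transfer time and load imbalance, which is consistent with it appearing as a leading cost in the profile.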

HYCOM MPI Profiling - Message Size
- Both large and small messages are used by HYCOM
  - MPI_Waitall completes mid-to-large size messages
  - MPI_Allreduce uses small size messages
- Distributions shown for 256 and 504 cores
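The slides do not name the profiling tool used to collect these distributions. As a hedged illustration of the general technique, the PMPI interposition sketch below (an assumption about methodology, not the tool used in this study) counts MPI_Allreduce calls by payload size; the same approach extends to timing and to other MPI functions.

    /* Illustrative PMPI interposition layer (an assumption about methodology,
     * not the tool used in this study): record the payload size of every
     * MPI_Allreduce, then hand the call to the real implementation. */
    #include <mpi.h>
    #include <stdio.h>

    static long small_msgs = 0;   /* <= 1 KB */
    static long large_msgs = 0;   /* >  1 KB */

    int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
    {
        int type_size;
        MPI_Type_size(datatype, &type_size);
        if ((long)count * type_size <= 1024) small_msgs++; else large_msgs++;
        return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("MPI_Allreduce calls: %ld small (<=1KB), %ld large\n",
                   small_msgs, large_msgs);
        return PMPI_Finalize();
    }

Compiled into a library and linked ahead of the MPI library (or preloaded), such wrappers intercept each call, record its size, and forward it to the corresponding PMPI_ entry point.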

HYCOM Benchmark Results Summary
- HYCOM communication pattern
  - MPI collective and completion functions generate most of the overhead: MPI_Allreduce and MPI_Waitall
  - Both small and large messages are used by the application
  - At small scale, bandwidth is critical to HYCOM performance; as cluster size grows, latency becomes more important
- Performance optimizations
  - The MPI libraries showed comparable performance overall
  - KNEM enables better performance
  - Dual InfiniBand cards improve performance further
- Interconnect characterization
  - InfiniBand delivers superior scalability as cluster size increases
  - InfiniBand enables much higher performance than 10GigE and GigE

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without warranty of any kind. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.