
NEMO Performance Benchmark and Profiling May 2011

Note
- The following research was performed under the HPC Advisory Council HPC Works working group activities
- Participating vendors: HP, Intel, Mellanox
- Compute resource: HPC Advisory Council Cluster Center
- For more information please refer to:
  http://www.hp.com/go/hpc
  www.intel.com
  www.mellanox.com
  http://www.nemo-ocean.eu

NEMO (Nucleus for European Modelling of the Ocean)
- NEMO is a state-of-the-art modelling framework for:
  - Oceanographic research
  - Operational oceanography and seasonal forecasting
  - Climate studies
- NEMO includes 4 major components:
  - The blue ocean (ocean dynamics, NEMO-OPA)
  - The white ocean (sea ice, NEMO-LIM)
  - The green ocean (biogeochemistry, NEMO-TOP)
  - The adaptive mesh refinement software (AGRIF)
- NEMO is used by a large community: 240 projects in 27 countries
- Released under the CeCILL license (a public license) and controlled by a European consortium of CNRS, Mercator-Ocean, UKMO and NERC
- NEMO is part of the DEISA benchmark suite (http://www.deisa.eu)

Objectives
- The presented research was done to provide best practices for:
  - File-system performance comparison
  - MPI library comparison
  - Interconnect performance benchmarking
  - NEMO application profiling and understanding of NEMO communication patterns
- The presented results demonstrate that a balanced compute environment determines application performance

Test Cluster Configuration
- HP ProLiant SL2x170z G6 16-node cluster
  - Six-core Intel Xeon X5670 CPUs @ 2.93 GHz
  - Memory: 24GB per node
  - OS: CentOS 5.5, OFED 1.5.3 InfiniBand software stack
- Mellanox ConnectX-2 InfiniBand QDR adapters and switches
- Fulcrum-based 10Gb/s Ethernet switch
- MPI: Intel MPI 4, Open MPI 1.7, Platform MPI 8.0.1
- Compilers: Intel Compilers 11.1.064
- Application: NEMO 3.2
- Libraries: Intel MKL 2011.3.174, netcdf 2.122
- Benchmark workload: OPA (the ocean engine), confcoef=25

About HP ProLiant SL6000 Scalable System
- Solution-optimized for extreme scale-out; #1 power efficiency (SPECpower_ssj2008, www.spec.org, 17 June 2010)
- Save on cost and energy per node, rack and data center
- ProLiant z6000 chassis: shared infrastructure (fans, chassis, power); mix and match configurations; deploy with confidence
- ProLiant SL160z G6 / SL165z G7: large memory, memory-cache apps
- ProLiant SL170z G6: large storage, web search and database apps
- ProLiant SL2x170z G6: highly dense, HPC compute and web front-end apps

Lustre File System Configuration
- Lustre configuration:
  - 1 MDS
  - 4 OSS (each with 2 OSTs)
  - InfiniBand-based backend storage
- All components (Lustre MDS/OSS servers, storage and cluster nodes) are connected through the InfiniBand QDR interconnect via IB QDR switches

NEMO Benchmark Results - File System
- File I/O performance is important to NEMO performance
  - The InfiniBand-powered Lustre file system enables application scalability
  - NFS over GigE doesn't meet the application's file I/O requirements
(Chart: higher is better; Open MPI over InfiniBand QDR, 12 cores per node)
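The file-system comparison above measures how well the storage keeps up with the application's I/O. As a rough, stand-alone illustration of the kind of parallel file access being stressed (NEMO itself writes its output through netCDF), the following minimal MPI-IO sketch has each rank write one block of a shared file on an assumed Lustre mount point; the path and block size are placeholders, not values from the study.

```c
/* Minimal MPI-IO write sketch (illustrative only; NEMO does its own I/O
 * through netCDF).  The file path and block size are assumptions. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset block = 4 << 20;          /* 4 MB per rank (assumed)  */
    char *buf = malloc((size_t)block);
    memset(buf, rank, (size_t)block);          /* payload to write         */

    MPI_File fh;
    /* "/mnt/lustre/nemo_io_test" is a hypothetical mount point / file name */
    MPI_File_open(MPI_COMM_WORLD, "/mnt/lustre/nemo_io_test",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own contiguous block of the shared file */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, (int)block,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Timing such a test against both the Lustre and the NFS mounts gives a quick sanity check of the file-system gap the chart reports.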

NEMO Benchmark Results - MPI Optimization
- Intel MPI with tuning runs 13% faster than the default mode at 16 nodes:
  -genv I_MPI_RDMA_TRANSLATION_CACHE 1
  -genv I_MPI_RDMA_RNDV_BUF_ALIGN 65536
  -genv I_MPI_DAPL_DIRECT_COPY_THRESHOLD 65536
(Chart: higher is better; 12 cores per node)
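The three settings above reach the ranks as environment variables (for example via the -genv options shown on the slide). A minimal sanity-check sketch, assuming Intel MPI and exactly the variable names listed, is to print them from rank 0 to confirm the tuned values were propagated:

```c
/* Minimal sketch: confirm from inside the job that the Intel MPI tuning
 * variables listed on the slide reached the ranks.  Variable names come
 * from the slide; everything else is illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        const char *vars[] = {
            "I_MPI_RDMA_TRANSLATION_CACHE",
            "I_MPI_RDMA_RNDV_BUF_ALIGN",
            "I_MPI_DAPL_DIRECT_COPY_THRESHOLD",
        };
        for (int i = 0; i < 3; i++) {
            const char *val = getenv(vars[i]);
            printf("%s = %s\n", vars[i], val ? val : "(not set)");
        }
    }

    MPI_Finalize();
    return 0;
}
```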

NEMO Benchmark Results - MPI Libraries
- Intel MPI and Open MPI are faster at 16 nodes
(Chart: higher is better; 12 cores per node)

NEMO Benchmark Results - Interconnects
- InfiniBand enables the highest performance and linear scalability for NEMO
  - 420% faster than 10GigE and 580% faster than GigE at 16 nodes
(Chart: higher is better; 12 cores per node)

NEMO MPI Profiling - MPI Functions
- MPI point-to-point communication overhead is dominant
  - Point-to-point: MPI_Isend/recv
  - Collectives: MPI_Allreduce overhead increases faster beyond 8 nodes
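NEMO itself is Fortran and the profile above does not show its source, but the mix it reports (nonblocking point-to-point traffic plus MPI_Allreduce) is the classic signature of a halo exchange followed by a global reduction. The sketch below illustrates that pattern in C under assumed conditions (a 1-D ring of neighbours and ~12KB halo messages); it is not NEMO code.

```c
/* Illustrative C sketch of the communication pattern the profile shows:
 * a nonblocking halo exchange with east/west neighbours followed by a
 * global MPI_Allreduce.  Neighbour layout and buffer sizes are
 * assumptions for illustration only. */
#include <mpi.h>

#define HALO 1500   /* 1500 doubles = 12000 bytes, i.e. a "small" message */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int west = (rank - 1 + size) % size;       /* assumed 1-D decomposition */
    int east = (rank + 1) % size;

    double send_w[HALO] = {0}, send_e[HALO] = {0}, recv_w[HALO], recv_e[HALO];
    MPI_Request req[4];

    /* Exchange halo rows with both neighbours without blocking */
    MPI_Irecv(recv_w, HALO, MPI_DOUBLE, west, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recv_e, HALO, MPI_DOUBLE, east, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(send_e, HALO, MPI_DOUBLE, east, 0, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(send_w, HALO, MPI_DOUBLE, west, 1, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* Global reduction, e.g. for a solver norm or diagnostic */
    double local = 0.0, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

As nodes are added the halo messages stay small while every rank joins the MPI_Allreduce, which is consistent with the collective overhead growing faster beyond 8 nodes.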

NEMO MPI Profiling - Message Size
- Most messages are small: under 12KB
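The slides do not say which profiler produced the message-size histogram; one generic way such data can be collected is a PMPI interposition layer linked ahead of the MPI library. The sketch below counts MPI_Isend messages under and over the 12KB mark; the counters and cut-off are illustrative only, not the tool used for this study.

```c
/* Minimal PMPI interposition sketch: count how many MPI_Isend messages
 * fall under 12KB versus at or above it (MPI-3 prototypes assumed). */
#include <mpi.h>
#include <stdio.h>

static long small_msgs = 0, large_msgs = 0;   /* <12KB vs >=12KB */

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);

    if ((long)count * type_size < 12 * 1024)
        small_msgs++;
    else
        large_msgs++;

    return PMPI_Isend(buf, count, datatype, dest, tag, comm, request);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %ld messages <12KB, %ld messages >=12KB\n",
           rank, small_msgs, large_msgs);
    return PMPI_Finalize();
}
```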

NEMO MPI Profiling - Interconnects
- InfiniBand QDR has the least communication overhead
  - 13% of total MPI time over GigE
  - 17% of total MPI time over 10GigE

NEMO Benchmark Summary
- The NEMO performance benchmark demonstrates:
  - InfiniBand QDR delivers higher application performance and linear scalability
    - 420% higher performance than 10GigE and 580% higher than GigE
  - Intel MPI tuning can boost application performance by 13%
  - The application has intensive file I/O operations
    - Lustre over InfiniBand eliminates the NFS bottleneck and enables application performance
- NEMO MPI profiling:
  - Message send/recv creates large communication overhead
  - Most messages used by NEMO are small
  - Collectives overhead increases as the cluster size scales up

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.