MM5 Modeling System Performance Research and Profiling. March 2009


Note
- The following research was performed under the HPC Advisory Council activities
  - AMD, Dell, Mellanox
  - HPC Advisory Council Cluster Center
- For more information please refer to www.mellanox.com, www.dell.com/hpc, www.amd.com

MM5
- Fifth-Generation NCAR/Penn State Mesoscale Model
  - Designed to simulate or predict mesoscale and regional-scale atmospheric circulation
- Mesoscale meteorology is the study of weather systems smaller than synoptic-scale systems but larger than microscale and storm-scale cumulus systems
  - Horizontal dimensions generally range from around 5 kilometers to several hundred kilometers
  - Examples: sea breezes, squall lines, and mesoscale convective complexes
  - http://www.mmm.ucar.edu/mm5/
- The parallel version of MM5 is MPI-enabled
  - Supports execution of the model on distributed-memory (DM) parallel machines

Objectives
- The presented research was done to provide best practices for:
  - MM5 performance benchmarking
  - Interconnect performance comparisons
  - Ways to increase MM5 productivity
  - Understanding MM5 communication patterns

Test Cluster Configuration
- Dell PowerEdge SC 1435 24-node cluster
- Quad-Core AMD Opteron 2382 ("Shanghai") CPUs
- Mellanox InfiniBand ConnectX DDR HCAs
- Mellanox InfiniBand DDR switch
- Memory: 16GB DDR2 800MHz per node
- OS: RHEL5U2, OFED 1.4 InfiniBand software stack
- MPI: Platform MPI 5.6.4, MVAPICH 1.1.0
- Application: MM5 Version 3
- Benchmark workload: T3A benchmark test case from NCAR

Mellanox InfiniBand Solutions
- Industry standard
  - Hardware, software, cabling, management
  - Designed for clustering and storage interconnect
- Performance
  - 40Gb/s node-to-node
  - 120Gb/s switch-to-switch
  - 1us application latency
  - Most aggressive roadmap in the industry
- Reliable with congestion management
- Efficient
  - RDMA and transport offload
  - Kernel bypass; CPU focuses on application processing
- Scalable for Petascale computing and beyond
- End-to-end quality of service
- Virtualization acceleration
- I/O consolidation, including storage
[Chart: "The InfiniBand Performance Gap is Increasing" - bandwidth roadmap for InfiniBand 4X (20/40/80Gb/s) and 12X (60/120/240Gb/s) versus Ethernet and Fibre Channel; "InfiniBand Delivers the Lowest Latency"]

Quad-Core AMD Opteron Processor
- Performance
  - Quad-core, enhanced CPU IPC
  - 4x 512KB L2 cache, 6MB L3 cache
- Direct Connect Architecture
  - HyperTransport Technology: up to 24 GB/s peak per processor
- Floating point
  - 128-bit FPU per core, 4 FLOPS/clk peak per core
- Integrated memory controller
  - Up to 12.8 GB/s with DDR2-800 MHz or DDR2-667 MHz
- Scalability
  - 48-bit physical addressing
- Compatibility
  - Same power/thermal envelopes as 2nd/3rd generation AMD Opteron processors
[Diagram: dual-socket Direct Connect platform - processors connected by 8 GB/s HyperTransport links to each other, to PCI-E bridges and to the I/O hub (USB, PCI), each socket with dual-channel registered DDR2 memory]
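As a rough illustration of the headline numbers above, the peak floating-point rate of one such node can be estimated from the 4 FLOPS/clk figure. The 2.6 GHz clock of the Opteron 2382 is an assumption not stated on the slide:

    \[
      2~\text{sockets} \times 4~\tfrac{\text{cores}}{\text{socket}} \times 4~\tfrac{\text{FLOP}}{\text{clk}} \times 2.6~\text{GHz} \approx 83~\text{GFLOP/s per node}
    \]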

Dell PowerEdge Servers Helping Simplify IT
- System structure and sizing guidelines
  - 24-node cluster built with Dell PowerEdge SC 1435 servers
  - Servers optimized for High Performance Computing environments
  - Building-block foundations for best price/performance and performance/watt
- Dell HPC solutions
  - Scalable architectures for high performance and productivity
  - Dell's comprehensive HPC services help manage the lifecycle requirements
  - Integrated, tested and validated architectures
- Workload modeling
  - Optimized system size, configuration and workloads
  - Test-bed benchmarks
  - ISV applications characterization
  - Best practices and usage analysis

MM5 Benchmark Results - Interconnect
- Input data: T3A
  - Resolution 36KM, grid size 112x136, 33 vertical levels
  - 81-second time-step, 3-hour forecast
- InfiniBand DDR delivers higher performance at any cluster size
  - Up to 46% versus GigE, 30% versus 10GigE
[Chart: MM5 Benchmark Results - T3A, Mflop/s (higher is better) vs. number of nodes (4-22) for GigE, 10GigE and InfiniBand-DDR; Platform MPI]

MM5 Benchmark Results - Productivity
- InfiniBand DDR increases productivity by allowing multiple jobs to run simultaneously
  - Providing the productivity required for weather simulations
- Two cases are presented
  - A single job over the entire system
  - Four jobs per node, each on two cores per CPU (socket)
- Four jobs per node increase productivity by up to 35%
- Multiple MPI jobs per node work well with Platform MPI due to its dynamic binding of processes to cores (see the sketch after this list)
[Chart: MM5 Benchmark Results - T3A over InfiniBand DDR, Mflop/s (higher is better) vs. number of nodes (4-20) for 1 job per node and 4 jobs per node; Platform MPI]
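Platform MPI performs this binding dynamically at launch time. Purely as an illustration of what pinning a process to a core involves on Linux (not Platform MPI's actual mechanism), a minimal sketch using sched_setaffinity:

    /* Hypothetical illustration: pin the calling process to one CPU core.
     * MPI libraries such as Platform MPI do this binding automatically;
     * this sketch only shows the underlying OS facility. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int core = (argc > 1) ? atoi(argv[1]) : 0;  /* core chosen by the launcher */

        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(core, &mask);

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core %d\n", core);
        /* ... the rank's work would run here, constrained to that core ... */
        return 0;
    }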

MM5 Performance Results - MPI
- Platform MPI demonstrates higher performance versus MVAPICH
  - Up to 25% higher performance
  - The Platform MPI advantage increases with cluster size
[Chart: MM5 Benchmark Results - T3A, Mflop/s (higher is better) vs. number of nodes (4-22) for MVAPICH and Platform MPI; single job on each node]

MM5 Profiling - MPI Functions
- MPI_Send and MPI_Recv are the most heavily used MPI functions in MM5
[Chart: MM5 MPI Profiling - number of messages per MPI function (MPI_Bcast, MPI_Comm_size, MPI_Gather, MPI_Irecv, MPI_Keyval_free, MPI_Recv, MPI_Send, MPI_Wait) at 4, 8, 12, 20 and 22 nodes]

MM5 Profiling - Data Transferred
- Most outgoing MPI messages are smaller than 128B
[Chart: MM5 MPI Profiling (MPI_Send) - number of messages per message-size range ([0-128B), [128B-1K), [1K-8K), [8K-256K)) at 4, 8, 12, 20 and 22 nodes]

MM5 Profiling - Data Transferred
- Most received MPI messages are between 128 bytes and 1KB
- The total number of medium-size messages increases with cluster size
[Chart: MM5 MPI Profiling (MPI_Recv) - number of messages per message-size range ([0-128B), [128B-1K), [1K-8K), [8K-256K)) at 4, 8, 12, 20 and 22 nodes]

MM5 Profiling Summary - Interconnect
- MM5 was profiled to determine its networking dependency
- Majority of data is transferred between compute nodes
  - Small to medium size messages
  - Data transferred increases with cluster size
- Most used message sizes
  - <128B messages for MPI_Send
  - 128B-1KB and 8K-256K for MPI_Recv
  - Total number of messages increases with cluster size
- Interconnect effect on MM5 performance
  - Interconnect latency and throughput for the <256KB message range
  - Interconnect latency and throughput become more critical with cluster size
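Message counts and size histograms like those above can be collected with a PMPI interposition layer linked into (or preloaded with) the application. The following is a minimal sketch under that approach, not the profiling tool used in this study; it counts MPI_Send calls per size bucket matching the ranges in the charts, and assumes an MPI-3 const-correct prototype:

    /* Hypothetical PMPI interposition sketch: tally MPI_Send calls per
     * message-size bucket and print the totals at MPI_Finalize. */
    #include <mpi.h>
    #include <stdio.h>

    static long send_buckets[4];  /* [0-128B), [128B-1K), [1K-8K), [8K-256K+) */

    int MPI_Send(const void *buf, int count, MPI_Datatype type, int dest,
                 int tag, MPI_Comm comm)
    {
        int tsize;
        MPI_Type_size(type, &tsize);
        long bytes = (long)tsize * count;
        if (bytes < 128)        send_buckets[0]++;
        else if (bytes < 1024)  send_buckets[1]++;
        else if (bytes < 8192)  send_buckets[2]++;
        else                    send_buckets[3]++;
        return PMPI_Send(buf, count, type, dest, tag, comm);  /* forward to the real call */
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d MPI_Send counts: <128B=%ld 128B-1K=%ld 1K-8K=%ld >=8K=%ld\n",
               rank, send_buckets[0], send_buckets[1], send_buckets[2], send_buckets[3]);
        return PMPI_Finalize();
    }

The same pattern extends to MPI_Recv and the other functions shown in the profiles; per-rank counts can then be reduced to cluster-wide totals.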

MM5 Performance with Power Management
- Test scenario: 24 servers, 4 processes per node, 2 processes per CPU (socket)
- Similar performance with power management enabled or disabled
  - Only 1.4% performance degradation
[Chart: MM5 Benchmark Results - T3A over InfiniBand DDR, Gflop/s (higher is better) with power management enabled vs. disabled]

MM5 Benchmark - Power Consumption
- Power management reduces total system power consumption by 5.1%
[Chart: MM5 Benchmark - T3A power consumption over InfiniBand DDR, Wh/iteration with power management enabled vs. disabled]

MM5 Benchmark - Power Cost Savings
- Power management saves $673/year for the 24-node cluster
- As cluster size increases, bigger savings are expected
- $/year = total power consumption per year (kWh) * $0.20
  - For more information: http://enterprise.amd.com/downloads/svrpwrusecompletefinal.pdf
[Chart: MM5 Benchmark - T3A power cost comparison over InfiniBand DDR, $/year for the 24-node cluster with power management enabled vs. disabled (roughly 7% difference)]
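As a back-of-envelope reading of the stated formula, the quoted $673/year saving at $0.20/kWh corresponds to

    \[
      \frac{\$673/\text{yr}}{\$0.20/\text{kWh}} \approx 3{,}365~\text{kWh/yr},
      \qquad
      \frac{3{,}365~\text{kWh}}{8{,}760~\text{h}} \approx 0.38~\text{kW}
    \]

i.e. roughly 16 W of average draw avoided per node across the 24 nodes.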

Conclusions
- MM5 is widely used weather simulation software
- MM5 performance and productivity rely on
  - Scalable HPC systems and interconnect solutions
  - Low-latency and high-throughput interconnect technology
  - NUMA-aware application placement for fast access to memory
- Reasonable job distribution can dramatically improve productivity
  - Increasing the number of jobs per day while maintaining fast run times
- The interconnect comparison shows InfiniBand DDR delivers superior performance and productivity
  - Versus GigE and 10GigE, due to its throughput and latency advantage
- Power management provides a 5.1% saving in power consumption per 24-node system
  - $673 power savings per year for the 24-node cluster
  - Power savings increase with cluster size

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided "as-is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy or completeness of the information contained herein. The HPC Advisory Council and Mellanox undertake no duty and assume no obligation to update or correct any information presented herein.