Single-Points of Performance

Mellanox Technologies Inc.
2900 Stender Way, Santa Clara, CA 95054
Tel: 408-970-3400  Fax: 408-970-3403
http://www.mellanox.com

High-performance computation is rapidly becoming a critical tool for conducting research, creative activity, and economic development. In the global economy, speed-to-market is an essential component of staying ahead of the competition. High-performance computing applies compute power to solve highly complex problems, perform critical analysis, or run computationally intensive workloads faster and with greater efficiency. In the time needed to read this sentence, each of the Top10 clusters on the TOP500 list would have performed over 15,000,000,000,000 calculations.

HPC clusters have become the most common building blocks for high-performance computing. Clusters provide the needed flexibility and deliver superior price/performance compared to proprietary symmetric multiprocessing (SMP) systems, with the simplicity and value of industry-standard computing. A cluster's ability to perform depends on three elements: the central processing unit, the memory on each server, and the interconnect that gathers the individual servers into one cluster.

The TOP500, an industry-respected report published since 1993, ranks the most powerful computer systems and provides an indication of usage trends in computing and interconnect solutions. According to the TOP500, InfiniBand is emerging as the most used high-speed interconnect, replacing proprietary or low-performance solutions. As shown below, InfiniBand shows a higher adoption rate in the TOP500 compared to Ethernet. The graph compares the adoption rate of InfiniBand and Ethernet from the first year each was introduced to the TOP500 list (InfiniBand: June 2003, Ethernet: June 1996).

[Figure: TOP500 Interconnect Penetration - number of clusters versus years since introduction (Year 1 through Year 8), InfiniBand versus Ethernet.]

Single-points of performance

The most common approach for comparing interconnect solutions is the single-points approach. Latency, bandwidth, N/2 and message rate are known as single-points of performance and are used to make the initial determination. When referring to bandwidth, it is typical to report the maximum available bandwidth. Message rate, defined as the number of messages transmitted in a period of time, is yet another single point on the bandwidth graph. Message rate is simply bandwidth divided by message size, typically reported for a message size of 0 or 2 bytes.
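As a concrete illustration of how these single-points are typically gathered, the minimal MPI ping-pong sketch below (an illustrative example, not part of the original paper) measures them directly: half the round-trip time of small messages gives the latency figure, large messages give the bandwidth figure, and dividing bandwidth by message size gives the message rate.

/*
 * Minimal MPI ping-pong sketch (illustrative, not from the paper) measuring
 * the single-points discussed above: one-way latency for small messages,
 * uni-directional bandwidth for large messages, and a derived message rate
 * (bandwidth divided by message size).
 *
 * Compile: mpicc -O2 pingpong.c -o pingpong
 * Run:     mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int size = 1; size <= (1 << 22); size <<= 1) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double elapsed = MPI_Wtime() - t0;
        if (rank == 0) {
            double latency_us = elapsed / (2.0 * ITERS) * 1e6;   /* one-way */
            double bw_mbs     = (2.0 * ITERS * (double)size) / elapsed / 1e6;
            double msg_rate   = bw_mbs * 1e6 / size;             /* msgs/sec */
            printf("%8d bytes  %8.2f us  %9.2f MB/s  %12.0f msg/s\n",
                   size, latency_us, bw_mbs, msg_rate);
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Running this between two nodes over the interconnect under test produces the latency, bandwidth and message-rate points discussed above; N/2 can then be read off the resulting bandwidth curve.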

The N/2 benchmark provides a better understanding of the bandwidth characteristics, as it indicates the message size at which the interconnect achieves half of its maximum bandwidth, and therefore gives an indication of the shape of the bandwidth curve. However, the N/2 number must be accompanied by the corresponding bandwidth number, or it can be misleading: a higher-bandwidth solution can present a higher N/2 than a lower-bandwidth one even though its bandwidth curve is higher at every point. An example is shown later in this paper.

Any set of single-points relates to a specific application or application interface, and each application interface has a different set of parameters associated with it. The most common application interface in high-performance computing is the Message Passing Interface (MPI). Among the single-points of performance, the N/2 benchmark provides an additional indication of the interconnect's capabilities: how fast the bandwidth curve rises. A lower N/2 results in higher actual, or effective, bandwidth delivered by the interconnect to the applications. The following graphs compare the N/2 MPI benchmark results of the two InfiniBand providers, Mellanox and QLogic.

[Figure: MPI bandwidth (4 CPU cores) - bandwidth (MB/s) versus message size (Bytes), QLogic HTX versus Mellanox PCIe 10Gb/s. Mellanox N/2: 870 Bytes; QLogic N/2: 940 Bytes.]

Mellanox's InfiniBand 10Gb/s PCIe adapter delivers a lower N/2 than the QLogic HTX InfiniPath adapter, and therefore delivers overall higher effective bandwidth. QLogic has reported an N/2 of 880 Bytes on an 8-core server, which is still higher than that of the Mellanox 10Gb/s adapter on a 4-core server, as shown in the graph.

The second graph compares the best-performing solutions from Mellanox and QLogic: Mellanox's InfiniBand 20Gb/s adapter and QLogic's InfiniPath HTX. The result is a good example of how N/2 can be misleading if it is not accompanied by the corresponding bandwidth figures. Mellanox's 20Gb/s adapter shows a higher N/2 of about 1,100 Bytes versus QLogic's 940 Bytes, but as the graph shows, Mellanox has a superior bandwidth curve and therefore an overall higher effective bandwidth.
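Since N/2 is read off a measured bandwidth curve rather than reported directly by most benchmarks, the small helper below (a hypothetical sketch, not from the paper; the sample data points are invented for illustration) shows one way to estimate it by linear interpolation between the two samples that bracket half of the peak bandwidth.

/*
 * Hypothetical helper (not from the paper): estimate N/2 from a measured
 * bandwidth curve, i.e. the message size at which the link first reaches
 * half of its peak bandwidth, using linear interpolation between samples.
 */
#include <stdio.h>

/* sizes[] in bytes (ascending), bw[] in MB/s, n samples */
static double estimate_n_half(const double *sizes, const double *bw, int n)
{
    double peak = 0.0;
    for (int i = 0; i < n; i++)
        if (bw[i] > peak)
            peak = bw[i];

    double half = peak / 2.0;
    for (int i = 1; i < n; i++) {
        if (bw[i] >= half) {
            /* interpolate between the two bracketing samples */
            double frac = (half - bw[i - 1]) / (bw[i] - bw[i - 1]);
            return sizes[i - 1] + frac * (sizes[i] - sizes[i - 1]);
        }
    }
    return sizes[n - 1]; /* half of peak bandwidth never reached */
}

int main(void)
{
    /* illustrative sample points only, not measured data */
    double sizes[] = { 64, 128, 256, 512, 1024, 2048, 4096, 8192 };
    double bw[]    = { 90, 170, 320, 520,  700,  850,  920,  950 };
    int n = sizeof(sizes) / sizeof(sizes[0]);

    printf("estimated N/2: %.0f bytes\n", estimate_n_half(sizes, bw, n));
    return 0;
}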

[Figure: MPI bandwidth (4 CPU cores) - bandwidth (MB/s) versus message size (Bytes), QLogic HTX versus Mellanox PCIe 20Gb/s. QLogic N/2: 940 Bytes; Mellanox N/2: about 1,100 Bytes.]

The bandwidth advantage of Mellanox increases with message size to about 50%. Mellanox reaches QLogic's maximum bandwidth (954 MB/s) at a message size of roughly 2,000 Bytes, and provides a maximum uni-directional bandwidth of more than 1,400 MB/s (the Mellanox adapter is capable of almost 2,000 MB/s uni-directional bandwidth and is limited by the current system components).

Real world application performance

Sets of single-points have traditionally been used as the prime metric for assessing the performance of a system's interconnect fabric. However, this metric is typically not sufficient to determine the performance of real-world applications. Real-world applications use a variety of message sizes and a diverse mixture of communication patterns. Moreover, the interconnect architecture becomes a key factor and greatly influences overall performance and efficiency. Mellanox's adapter architecture is based on a full-offload approach with RDMA capabilities, removing the traditional protocol overhead from the CPU and increasing processor efficiency. QLogic's architecture is based on an on-load approach, in which the CPU must handle the transport layer, error handling and so on; this increases the overhead on the CPU, reduces processor efficiency, and leaves fewer cycles for useful application processing.

The following chart compares the Mellanox InfiniBand and QLogic InfiniPath interconnects using the LS-DYNA Neon-Refined benchmark (a frontal crash with an initial speed of 31.5 miles/hour, a model size of 535,000 elements, and a simulation length of 15 ms; the model was created by the National Crash Analysis Center (NCAC) at George Washington University).

[Figure: LS-DYNA Neon-Refined benchmark - elapsed time in seconds (lower is better) versus number of CPU cores (4, 8, 16, 32), QLogic InfiniPath versus Mellanox InfiniBand, with Mellanox advantages of 29% and 32% annotated at the higher core counts. Mellanox measured on Intel Xeon 5160, QLogic on AMD Opteron 2.6GHz.]

Mellanox InfiniBand shows higher performance and better scaling than QLogic InfiniPath. Livermore Software Technology Corporation's (LSTC) LS-DYNA, a general-purpose transient dynamic finite element program capable of simulating complex real-world problems, is a latency-sensitive application. While QLogic shows lower latency as a single-point of performance, Mellanox's architecture delivers higher system performance, efficiency and scalability.

It is difficult, and sometimes misleading, to predict real-world application performance from single points of data alone. To determine a system's performance and the interconnect of choice, one should consider a set of metrics: the single-points of performance (bandwidth, latency, etc.), architecture characteristics (CPU utilization, the ability to overlap computation with communication, scalability, etc.), application results, field-proven experience, and hardware reliability.
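One of the architecture characteristics listed above, the ability to overlap computation with communication, is what a full-offload, RDMA-capable adapter is designed to enable: the host posts a transfer and continues computing while the adapter moves the data. The sketch below (an illustrative example, not taken from the paper) shows the standard MPI idiom for expressing such overlap with non-blocking sends and receives; how much overlap is actually achieved depends on the adapter and the MPI implementation.

/*
 * Illustrative sketch (not from the paper): overlapping computation with
 * communication using non-blocking MPI calls. With a full-offload adapter,
 * the halo exchange can progress while compute_interior() runs on the CPU.
 */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

static double halo_out[N], halo_in[N];

static void compute_interior(void)
{
    /* placeholder for work that does not depend on the incoming halo data */
    for (int i = 0; i < N; i++)
        halo_out[i] = halo_out[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;          /* neighbor to send to   */
    int left  = (rank - 1 + nprocs) % nprocs; /* neighbor to recv from */
    MPI_Request reqs[2];

    /* post the exchange, then compute while the adapter moves the data */
    MPI_Irecv(halo_in,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    compute_interior();

    /* only now wait for the halo exchange to complete */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0)
        printf("halo exchange and interior compute completed\n");

    MPI_Finalize();
    return 0;
}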