InfiniBand-based HPC Clusters

Boosting Scalability of InfiniBand-based HPC Clusters
Asaf Wachtel, Senior Product Manager
2010 Voltaire Inc.

InfiniBand-based HPC Clusters: Scalability Challenges
- Cluster TCO scalability: hardware costs, software license costs, space, power & cooling
- Communication scalability: handling increasing compute power (multi-core CPUs, GPUs)
- Utilization scalability: many jobs & users; varying sizes, traffic patterns & QoS
- Application scalability: home-grown or ISV applications; MPI collectives

Voltaire 40Gb/s InfiniBand Portfolio
- Fabric provisioning and performance monitoring
- Application acceleration
- 40Gb/s InfiniBand switching platforms:
  - HSSM and SSI blade switches
  - 4036: 36 x IB ports
  - 4036E: 34 x IB ports + 2 x 1/10GbE
  - 4200: 162 x IB ports
  - 4700: 324/648 x IB ports

Scalable Architectures
- Fat Tree: full bisectional bandwidth at any node count; uniform oversubscription options (a rough sizing sketch follows below)
- HyperScale: scales to thousands of nodes with linear performance; large non-blocking islands (more than 2,000 cores); 4-hop maximum latency to any port; lowest number of switches and cables
- Torus: lowest-cost solution; built entirely with edge switches and copper cables; optimized support in Voltaire software, including Torus2QoS routing
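
As a rough check on how such topologies size, the sketch below counts nodes and switches for a two-tier fat tree built from 36-port edge switches such as the 4036, at a chosen leaf oversubscription ratio. This is back-of-the-envelope arithmetic only; `fat_tree_capacity` is a hypothetical helper, not a Voltaire sizing tool.

```python
# Illustrative two-tier (leaf/spine) fat-tree sizing for fixed-radix switches.
# Hypothetical helper, not a Voltaire sizing tool.

def fat_tree_capacity(radix=36, oversubscription=1):
    """Return (nodes, leaves, spines) for a two-tier fat tree.

    `oversubscription` is the downlink:uplink ratio at each leaf switch
    (1 = non-blocking, 2 = 2:1, ...).
    """
    uplinks = radix // (oversubscription + 1)   # leaf ports toward the spine
    downlinks = radix - uplinks                 # leaf ports toward the nodes
    spines = uplinks                            # each leaf sends one link to every spine
    leaves = radix                              # each spine port feeds a different leaf
    nodes = leaves * downlinks
    return nodes, leaves, spines

if __name__ == "__main__":
    for ratio in (1, 2):
        nodes, leaves, spines = fat_tree_capacity(36, ratio)
        print(f"{ratio}:1 oversubscription: {nodes} nodes, "
              f"{leaves} leaf + {spines} spine switches")
```

With 36-port switches and no oversubscription this gives the familiar 648-node limit, consistent with the 324/648-port configurations of the 4700 chassis.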

HyperScale in the Top500
- Large, low-latency, non-blocking islands
- Lowest number of switches & cables
- Scales to thousands of nodes with linear performance
- 8:1 oversubscribed core
- 1,200-node interconnect in only 2 racks
- 13 x non-blocking HyperScale islands
- 1.05 PFLOPs, 83.7% efficiency

The Challenge: Static Routing Inefficiency (One-Size Routing Does Not Fit All)
- Static routing assumes uniform traffic across the entire fabric; real life is different:
  - Most jobs use only a small portion of the cluster
  - Different jobs have different traffic patterns
  - Different traffic types (e.g. storage) have different requirements

The Solution: Voltaire TARA (Traffic Aware Routing Algorithm)
- A new routing algorithm on top of OpenSM
- Dynamically optimizes routing according to the defined traffic patterns: fabric topology, job-specific communication patterns, symmetric/asymmetric communication, traffic load/QoS (a toy illustration of the idea follows below)
- Fully integrated with leading job schedulers
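
The effect TARA targets can be shown with a toy model (hypothetical code, not the TARA implementation shipped with Voltaire software): static, destination-hashed routing can pile many of a job's active flows onto one uplink, while a traffic-aware placement that knows which flows are actually in use spreads them across the least-loaded uplinks.

```python
# Toy comparison of static vs. traffic-aware routing on one leaf switch.
# Illustrative model only; this is not the TARA algorithm.
from collections import defaultdict
import random

UPLINKS = 4  # uplinks from the leaf switch toward the core

def static_route(dst):
    # Static routing: uplink chosen by hashing the destination,
    # regardless of which flows are actually active.
    return dst % UPLINKS

def static_load(flows):
    load = defaultdict(int)
    for _src, dst, volume in flows:
        load[static_route(dst)] += volume
    return load

def traffic_aware_load(flows):
    # Greedy traffic-aware placement: put each active flow on the
    # uplink with the least accumulated load so far.
    load = defaultdict(int)
    for _src, _dst, volume in flows:
        uplink = min(range(UPLINKS), key=lambda u: load[u])
        load[uplink] += volume
    return load

if __name__ == "__main__":
    random.seed(0)
    # A job whose destinations all hash onto the same uplink --
    # exactly the case static routing handles poorly.
    flows = [(src, random.choice([0, 4, 8, 12]), 10) for src in range(16)]
    print("static  max uplink load:", max(static_load(flows).values()))
    print("aware   max uplink load:", max(traffic_aware_load(flows).values()))
```

In this contrived case static routing drives all traffic through a single uplink, while the traffic-aware placement balances it evenly; the real algorithm additionally takes topology, QoS, and scheduler input into account, as the slide notes.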

TARA (Traffic Aware Routing Algorithm): Maximizing Cluster Utilization
[Charts: port weight per switch.port (internal ports on the line cards), OpenSM without TARA vs. UFM with TARA ON]

The Challenge: Collective Operations Scalability
- Grouping algorithms are unaware of the topology and inefficient (a counting sketch below illustrates this)
- Network congestion due to all-to-all communication
- Slow nodes & OS involvement impair scalability and predictability
- The more powerful servers become (more cores, GPUs), the worse collectives scale in the fabric
[Charts vs. # ranks: % of collectives out of total run time; total run time; run time variance]
A significant inhibitor to MPI application scalability.
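
The topology-unawareness point is easy to quantify with a toy count (an illustrative model, not FCA's actual tree construction): when a job's ranks are scattered round-robin across switches, a reduction tree built blindly over rank order pushes most of its messages between switches, whereas a hierarchical tree that reduces inside each switch first and sends only one leader per switch into the inter-switch step crosses the fabric far less.

```python
# Count inter-switch messages for a binomial reduction tree built blindly
# over rank order vs. built hierarchically per switch.
# Illustrative model only, not FCA's tree construction.

def binomial_edges(ranks):
    """Yield (child, parent) pairs of a binomial reduction tree over `ranks`."""
    n = len(ranks)
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            if i + step < n:
                yield ranks[i + step], ranks[i]   # child sends to parent
        step *= 2

def inter_switch_messages(edges, switch_of):
    return sum(1 for child, parent in edges if switch_of[child] != switch_of[parent])

if __name__ == "__main__":
    cores_per_switch = 12
    switches = 8
    ranks = list(range(cores_per_switch * switches))
    # Round-robin placement: rank r lands on switch r % switches.
    switch_of = {r: r % switches for r in ranks}

    # Topology-unaware: one flat tree over all ranks in rank order.
    flat = list(binomial_edges(ranks))

    # Topology-aware: reduce inside each switch first, then a small tree
    # over one leader rank per switch.
    hier, leaders = [], []
    for s in range(switches):
        local = [r for r in ranks if switch_of[r] == s]
        hier += list(binomial_edges(local))
        leaders.append(local[0])
    hier += list(binomial_edges(leaders))

    print("flat tree inter-switch messages:        ", inter_switch_messages(flat, switch_of))
    print("hierarchical tree inter-switch messages:", inter_switch_messages(hier, switch_of))
```

For 96 ranks on 8 switches the flat tree sends an order of magnitude more messages across the fabric than the hierarchical one, which is the kind of traffic a topology-based collective tree is meant to avoid.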

Introducing: Voltaire Fabric Collective Accelerator
- Grid Director switches: collective operations offloaded to switch CPUs (fabric collective processing power)
- FCA Manager: topology-based collective tree + separate virtual network + IB multicast for result distribution
- Unified Fabric Manager (UFM): topology-aware orchestrator
- FCA Agent: inter-core processing localized & optimized
- Breakthrough performance with no additional hardware

FCA (Fabric Collective Accelerator): Unmatched Application Scalability
- First and only system-wide solution for offloading MPI collectives
- Accelerates MPI collective computation by as much as 100X
- 10-40% improvement in application runtime (a microbenchmark sketch for checking this on a given cluster follows below)
- Integrated with leading MPI implementations
[Chart: Fluent truck_111m, 192 cores, run time for PMPI vs. PMPI + FCA]
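
One way to verify the collective-offload benefit on a given cluster is to time the same collective with and without the offload path enabled in the site's MPI stack. Below is a minimal latency microbenchmark sketch, assuming mpi4py and NumPy are installed on top of the cluster's MPI; how the offload is switched on or off depends on the MPI implementation and is not shown here.

```python
# Minimal MPI_Allreduce latency microbenchmark (mpi4py assumed).
# Run once with fabric collective offload enabled in the MPI stack
# and once without, then compare the reported per-operation times.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

msg = np.ones(8, dtype=np.float64)   # small message: the latency-bound case
out = np.empty_like(msg)
iters = 10000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    comm.Allreduce(msg, out, op=MPI.SUM)
t1 = MPI.Wtime()

if rank == 0:
    print(f"Allreduce ({comm.Get_size()} ranks): "
          f"{(t1 - t0) / iters * 1e6:.2f} us/op")
```

Launch it with the site's usual `mpirun`/scheduler wrapper at the scale of interest; the interesting number is how the per-operation time grows with rank count in each configuration, not the absolute value on a few nodes.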

Summary
- Reduce total cost of ownership via scalable topologies (HyperScale)
- Increase cluster utilization via Traffic Aware Routing (TARA)
- Boost application scalability using Fabric Collective Acceleration (FCA)
More performance for each dollar spent.

Thank You
Asaf Wachtel, Senior Product Manager, InfiniBand Solutions
asafw@voltaire.com
2010 Voltaire Inc.