InfiniBand and OpenFabrics Successes and Future Requirements:

Transcription:

InfiniBand and OpenFabrics Successes and Future Requirements
Matt Leininger, Ph.D., Sandia National Laboratories, Scalable Computing R&D, Livermore, CA
25 September 2006

Outline
- DOE InfiniBand Clusters
- InfiniBand Cluster Experiences
- Early Experiences using the OpenFabrics Enterprise Distribution
- Future Petascale InfiniBand Cluster Requirements
- How do we reach these goals?

DOE/ASC has Evaluated Several Generations of InfiniBand
- 2001-2002: Nitro I & II: IB blade reference designs (SNL), 2.2 GHz Xeon processors, small clusters, funded early MPI/IB work; and Cadillac (LANL), 128-node cluster
- 2003: Catalyst: 128 nodes 4X PCI-X IB (SNL); Blue Steel: 256 dual nodes 4X PCI-X (LANL); 96 nodes 4X PCI-X Viz Red RoSE (SNL)
- 2004: Catalyst: added 85 nodes 4X PCIe IB and a 288-port IB switch (SNL); ~300 nodes 4X PCIe Viz Red RoSE (SNL)
- 2005: Thunderbird and Talon: 4,480 and 128 dual 3.6 GHz nodes, 4X PCIe IB (SNL); Lustre/IB in production on SNL Red RoSE
- 2006: 2,048 + 1,500 nodes of IB (LANL); 1,800 nodes (LLNL)
- SNL, LANL, and LLNL have ~11,000 nodes of IB today, with more to come

DOE Goals for InfiniBand
- To accelerate the development of a Linux IB software stack for HPC
- High performance (high bandwidth, low latency, low CPU overhead), scalability, robustness, portability, reliability, manageability
- A single open-source stack, diagnostics, and management tools supported across multiple (i.e. all) system vendors
- Integrate the IB stack into the mainline Linux kernel at kernel.org
- Get the stack into Linux distributions (RedHat, SuSE, etc.)
- OpenFabrics was formed around these goals
- The DOE ASC PathForward program has been funding OpenFabrics development since early 2005
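For a sense of what that single stack looks like to the software layered on top of it, here is a minimal sketch (not from the talk) using the libibverbs API that the OpenFabrics stack provides: it lists the HCAs on a node and reports the state of port 1 on each. The hardcoded port number and the fields printed are illustrative choices.

```c
/* Minimal libibverbs sketch: enumerate HCAs and query port 1 on each.
 * Build with:  gcc list_hcas.c -libverbs  */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;
        struct ibv_port_attr port;
        if (!ibv_query_port(ctx, 1, &port))   /* IB port numbers start at 1 */
            printf("%s: state=%d active_mtu=%d lid=%d\n",
                   ibv_get_device_name(devs[i]),
                   port.state, port.active_mtu, port.lid);
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```

MPI libraries, Lustre's LNET, and the diagnostic tools all sit on top of the same verbs layer, which is why a single, vendor-neutral implementation matters.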

Sandia Thunderbird Architecture
- 4,480 2-socket, 1-core EM64T (8,960 CPUs) compute nodes
- 4,608-port InfiniBand 4X SDR fabric [280(168U)+8(288)]
- QsNet Elan3 and 100BaseT control networks; Sandia network 1/10 GbE switch
- 8 user login nodes, 1/10 GbE attached
- 2 InfiniBand subnet management nodes
- 16 boot/service nodes, 10/1 GE I/O & management
- 16 object storage servers (16-32, 1 GbE attached), 120 TB and 6.0 GB/s; two metadata servers, 1 GbE attached
System parameters:
- 14.4 GF/s dual-socket 3.6 GHz single-core Intel SMP nodes, DDR-2 400 SDRAM
- 50% blocking (2:1 oversubscription of the InfiniBand fabric)
- ~300 InfiniBand switches to manage; ~9,000 InfiniBand ports
- ~33,600 meters (about 21 miles) of 4X InfiniBand copper cables; ~10,000 meters (about 6 miles) of copper Ethernet cables
- 26,880 1 GB DDR-2 400 SDRAM modules
- 1.8 MW of power, 400 tons of cooling
- Linpack efficiency was ~82% on runs of up to 2,000 nodes
- #5 in the Top500: 38.2 TFlops on 3,721 nodes, 71% efficiency
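As a sanity check (my arithmetic, not part of the slide), the quoted peak and Linpack figures are mutually consistent, assuming the 2 double-precision flops per cycle per socket implied by the 14.4 GF/s per-node figure:

```latex
\[
R_{\text{peak,node}} = 2~\text{sockets} \times 3.6~\text{GHz} \times 2~\tfrac{\text{flops}}{\text{cycle}}
                     = 14.4~\text{GF/s}
\]
\[
R_{\text{peak}} = 3721 \times 14.4~\text{GF/s} \approx 53.6~\text{TF/s},
\qquad
\frac{R_{\max}}{R_{\text{peak}}} = \frac{38.2}{53.6} \approx 71\%
\]
```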

LLNL Peloton Clusters
- 1,024 4-socket, dual-core (8,192 CPUs) compute nodes
- 1,152-port InfiniBand 4X DDR fabric [96(1212U)+4(288)]
- QsNet Elan3, 100BaseT control
- 1/10 GbEnet federated switch (1/10 GbE supplied by site); 100BaseT management
- 8 gateway nodes @ 0.8+0.8 GB/s delivered I/O over 1x 10GbE
- Metadata supplied by site; object storage, 1 or 10 GbE attached, 400 TB and 25.6 GB/s
System parameters:
- Three clusters of 1,024, 512, and 256 nodes
- Quad-socket, dual-core AMD Opterons
- 4X DDR PCIe InfiniBand, full fat-tree, DDR2-667 SDRAM
- <3 us / 1.8 GB/s MPI latency and bandwidth over IBA 4X (see the ping-pong sketch below)
- Support for 800 MB/s transfers to Archive over quad jumbo-frame Gb-Enet and IBA links from each login node; some solutions have 10GbE
- No local disk; remote boot and SRP target for root and swap partitions on a RAID5 device for improved RAS
- Software for build and acceptance: 3 provided software stack distributions (all based on several common key components: RedHat, Lustre, OpenFabrics, MPICH or OpenMPI)
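To illustrate how point-to-point figures like the "<3 us, 1.8 GB/s" above are typically measured, here is a minimal MPI ping-pong sketch (my example, not the Peloton acceptance test); the message size, iteration count, and output format are arbitrary choices.

```c
/* Minimal MPI ping-pong between two ranks (run one rank per node).
 * Use a small message (e.g. 8 B) for latency, a large one for bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int    iters = 1000;
    const size_t msg   = 1 << 20;          /* 1 MiB for a bandwidth check */
    char *buf = malloc(msg);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, (int)msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, (int)msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * iters);   /* one-way time per message */
    if (rank == 0)
        printf("one-way time %.2f us, bandwidth %.2f GB/s\n",
               t * 1e6, msg / t / 1e9);

    MPI_Finalize();
    free(buf);
    return 0;
}
```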

Thunderbird and Peloton InfiniBand Software
Sandia Thunderbird (4,480 nodes; 8,960 processors):
- Production computing resource (~1 yr)
- Running the Cisco/Mellanox VAPI proprietary software stack, CiscoSM, and Cisco diagnostics
- MVAPICH1 and OpenMPI
- Starting to test OFED-1.0 with RHEL4U4 and OpenMPI
LLNL Peloton (~1024+512+256 nodes; ~14,400 processors):
- First large IB cluster built using only OFED software
- The first cluster of 512 nodes is undergoing hardware and software burn-in
- Using an OFED-1.0 based stack with RedHat
- Using OpenSM and the OpenFabrics diagnostics/management tools

InfiniBand Successes
- Large IB clusters are running DOE production capacity computing workloads
- OFED-1.0 (and updates) is available in RHEL4U4 and SLES10
- Several DOE clusters are now using OFED or have plans to upgrade to OFED
- OFED-1.0 on RHEL4U4 is undergoing testing this week on up to 1,024 nodes (2,048 processors)
- Stability and ease of use of OFED have been much better than the proprietary stacks
- Sandia Thunderbird will likely move OFED into production with RHEL4U5
- OpenMPI is proving to be more scalable than MVAPICH
- The OpenFabrics diagnostics and management tools provide good basic (static) information about the IB fabric

InfiniBand Technical Issues
Current solutions are working, but there is still room for much improvement.
InfiniBand scalability:
- Current scaling is reasonable for small capacity jobs (< 256-512 nodes)
- Need single-job scalability up to 4,000 nodes (multipath routing, etc.)
- Network congestion information is still difficult to extract from the fabric
- The IB congestion control architecture is not yet supported; vendors will need to implement congestion control agents in switches
- OpenFabrics SRP and iSER are not production worthy yet, so DOE production storage purchases have remained Fibre Channel rather than InfiniBand
- The current OpenFabrics stack is missing a performance manager (PM)
- Error rates are more helpful than raw error counts (see the sketch below); a PM could also provide information about fabric congestion
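A rough sketch of the "rates, not raw counts" point: given two samples of a port's cumulative counters, a performance manager would report deltas per second, with PortXmitWait doubling as a congestion hint. How the counters are fetched (e.g. via perfquery or PerfMgt MADs) is left out here, and `port_sample`/`report` are hypothetical names.

```c
/* Turn cumulative port counters into per-second rates, the kind of
 * derived data a performance manager would track and alarm on. */
#include <stdint.h>
#include <stdio.h>

struct port_sample {
    double   t;              /* sample time in seconds                 */
    uint64_t symbol_errors;  /* cumulative SymbolErrorCounter          */
    uint64_t xmit_wait;      /* cumulative PortXmitWait (congestion)   */
};

/* Rate of change between two cumulative samples, in events per second. */
static double rate(uint64_t newer, uint64_t older, double dt)
{
    return dt > 0.0 ? (double)(newer - older) / dt : 0.0;
}

/* 'older' and 'newer' are two samples of the same port, taken dt apart. */
void report(const struct port_sample *older, const struct port_sample *newer)
{
    double dt = newer->t - older->t;
    printf("symbol errors: %.2f /s   xmit wait: %.2f /s\n",
           rate(newer->symbol_errors, older->symbol_errors, dt),
           rate(newer->xmit_wait, older->xmit_wait, dt));
}
```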

OpenFabrics Support Issues
- Vendors need to fully support the OpenFabrics software in production environments
- Stop pushing proprietary solutions (still happening); there is no long-term value add for customers in proprietary stacks
- Production environments are multi-vendor (e.g. SNL has Voltaire, Cisco, SilverStorm, Mellanox, and QLogic)
- Make sure OpenSM supports your switches and any advanced features (performance manager, congestion manager, etc.)
- Customers are willing to pay for OpenFabrics support to meet their performance, stability, and robustness requirements
- OFED is a reasonable start, but we need vendors to stand behind the OFED product

OpenFabrics Enterprise Distribution (OFED)
Goals: improve and control the quality of the OF software stack
- Performance, compliance, stability, reliability
- Diagnostic tool set
- Industrial-strength support and rigorous regression testing
OFED has created a collaborative testing and productization environment: a single OpenFabrics software stack supported by all IB vendors and Linux distributions.
OFED is a good first step, however...
- The OFED process has been difficult
- Collaboration is working, but it is a constant struggle
- Need a more robust and smoother release plan and process in order to succeed
- The release process will only get more complicated with the addition of iWARP
- Use experiences from the OFED 1.0 and 1.1 releases to improve the process

Petascale InfiniBand Cluster Requirements
- 4,000-node systems with multi-core CPUs are feasible in the next 2-3 years; InfiniBand needs to scale beyond the current 512 nodes to 4K nodes
- Scalability to this level will require:
  - < 1 us point-to-point latency
  - High-performance RC/UD send/recv (near line rate)
  - High message injection rate (15-20 M msgs/s) for small/medium-sized messages
  - Hardware and OF software support for the congestion control architecture
  - An OF performance manager and support in vendor switches
  - Fully adaptive routing (an addition to the IB spec)
  - Cheap, reliable fiber for 4X/12X DDR and QDR (matching the cost of copper)
  - High-performance (near line rate SDR, DDR, QDR) native IB-IB routing
  - Reliable multicast (up to a minimum of 128 peers)
  - SRQ shared receive block (improve flow control for SRQ; see the sketch below)
  - Reduced/eliminated memory registration overhead
- More requirements were presented at the Sonoma Workshop (http://openfabrics.org/conference/spring2006sonoma/hpc-reqs-openib-0206-v3.pdf)
- Achieving these goals will be a collaborative effort between OFA, IBTA, and the HPC community
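For the SRQ flow-control item, a minimal sketch of what is involved on the verbs side (my example, with arbitrary queue depths): create a Shared Receive Queue and arm its low-watermark limit so the consumer is warned before the shared pool of receives runs dry.

```c
/* Sketch: create an SRQ and arm its limit event with libibverbs. */
#include <stdio.h>
#include <infiniband/verbs.h>

struct ibv_srq *setup_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr init = {
        .attr = {
            .max_wr    = 4096,  /* receive WRs shared by all attached QPs */
            .max_sge   = 1,
            .srq_limit = 0,
        },
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &init);
    if (!srq)
        return NULL;

    /* Arm the limit: when fewer than 256 receives remain posted, the HCA
     * raises IBV_EVENT_SRQ_LIMIT_REACHED so the consumer can repost
     * buffers (or throttle senders) before the queue runs dry. */
    struct ibv_srq_attr mod = { .srq_limit = 256 };
    if (ibv_modify_srq(srq, &mod, IBV_SRQ_LIMIT))
        fprintf(stderr, "failed to arm SRQ limit\n");
    return srq;
}
```

QPs are then attached to the SRQ through the `srq` field of their `ibv_qp_init_attr`, so one pool of posted receives serves every connection instead of each QP pre-posting its own.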

How Can OpenFabrics and IBTA Collaborate?
- Complete the IB-IB routing specification
- Add fully adaptive routing to the InfiniBand specification
- Have joint memberships between IBTA and OpenFabrics?
- More joint workshops
- Continued discussions...

How Do We Move Forward?
- OFA is making good strides in the development and hardening of an OpenFabrics stack: a single multi-vendor software stack included in Linux distributions
- The use of OFED in a production environment is not 6 months out, it is today
- Need stronger commitment from vendors (from the CEO to field engineers)
- Streamlined OFED testing and release process, perhaps using better web-based collaborative tools
- Continue to develop a strong collaboration between OFA, HPC customers, and IBTA

For more information: Matt Leininger, mlleini@sandia.gov