Paving the Road to Exascale

Paving the Road to Exascale. Gilad Shainer, August 2015, MVAPICH User Group (MUG) Meeting

The Ever-Growing Demand for Performance (timeline chart): terascale, petascale, and exascale milestones plotted along a 2000-2020 axis, with Roadrunner marked as the first petascale system. Technology development over the same period moved from SMP to clusters and from single-core to multi-core.

The Road to Exascale Computing: from clusters, to multi-core, to co-design.

Co-Design Architecture: from discrete to system focused, shifting design from the discrete device level to the system level.

Exascale will be Enabled via a Co-Design Architecture
Co-design spans software-hardware, hardware-hardware (e.g. GPU-Direct), and software-software (e.g. OpenUCX) interfaces, developed jointly by industry, users, and academia. The result is standard, open source, and ecosystem based; programmable, configurable, and innovative.

Software-Hardware Co-Design Example: Breaking the Latency Wall
Network 10 years ago: ~10 microsecond network latency, ~100s of microseconds communication framework latency.
Network today: ~0.1 microsecond network latency, ~10s of microseconds communication framework latency.
Network of the future, with co-design: ~0.1 microsecond network latency, ~1 microsecond communication framework latency.
Today's network devices already operate at ~100ns latency. Challenge: how to enable the next order-of-magnitude performance improvement? Solution: co-design, mapping the communication frameworks onto all active devices. Result: HPC communication framework latency reduced by an order of magnitude.
Co-Design Architecture Paves the Road to Exascale Performance

The Road to Exascale: Co-Design System Architecture
The road to Exascale requires order-of-magnitude performance improvements. The co-design architecture enables all active devices (CPU, GPU, FPGA, HCA, switch) to become co-processors, spanning compute, communication, and management along the data path. Mapping communication frameworks on all active devices.

Co-Design Architecture (diagram): co-design applied across all active devices in the data path, from CPU, GPU, and FPGA to HCA and switch.

The Elements of the Co-Design Architecture
Applications (innovation, scalability, performance)
Communication frameworks (MPI, SHMEM/PGAS)
Offloading technologies: RDMA, GPUDirect, direct communications, software-defined everything (see the sketch after this list)
Flexibility, programmability, heterogeneous systems, virtualization, backward and future compatibility
Co-Design Implementation Via Offloading Technologies
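To make the offloading point concrete, here is a minimal sketch (not from the presentation, and with setup assumed) of posting a one-sided RDMA write through the libibverbs API: once the work request is posted, the HCA moves the data with no involvement from the remote CPU. The connected queue pair, the registered memory region, and the peer's remote_addr/rkey are assumed to have been established and exchanged out of band.

```c
/* Hypothetical helper: post an offloaded one-sided RDMA write.
 * Assumes qp is already connected and mr covers local_buf. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                    void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE; /* data placed directly into remote memory */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED; /* completion reported on the send CQ */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The CPU only posts the descriptor; the HCA executes the transfer. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```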

Mellanox Co-Design Architecture (Collaborative Effort)
Applications and communication frameworks (MPI, SHMEM/PGAS) run over offloaded capabilities: transport, RDMA, SR-IOV, collectives, Peer-Direct, GPU-Direct, and more to come. The Only Approach to Deliver 10X Performance Improvements.
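As an illustration of how such offloads remain transparent to applications (a sketch, not a slide from the deck): the collective below is ordinary MPI code, and when collective offload is enabled in the underlying stack the same MPI_Allreduce can be progressed largely by the HCA and switches instead of the host CPU.

```c
/* Ordinary MPI collective; offload, where available, happens underneath. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = rank; /* each rank contributes its rank id */
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```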

Mellanox InfiniBand: Proven and the Most Scalable HPC Interconnect. Chosen for the Summit and Sierra systems, paving the road to Exascale.

High-Performance 100Gb/s Interconnect Solutions
Adapter: 100Gb/s, 0.7us latency, 150 million messages per second (10 / 25 / 40 / 50 / 56 / 100Gb/s).
InfiniBand switch: 36 EDR (100Gb/s) ports, <90ns latency, throughput of 7.2Tb/s.
Ethernet switch: 32 100GbE ports or 64 25/50GbE ports (10 / 25 / 40 / 50 / 100GbE), throughput of 6.4Tb/s.
Cables: copper (passive, active), optical cables (VCSEL), silicon photonics.

InfiniBand Adapters Performance Comparison (single port)

                                  ConnectX-4          Connect-IB         ConnectX-3 Pro
Port speed                        EDR 100G            FDR 56G            FDR 56G
Uni-directional throughput        100 Gb/s            54.24 Gb/s         51.1 Gb/s
Bi-directional throughput         195 Gb/s            107.64 Gb/s        98.4 Gb/s
Latency                           0.61 us             0.63 us            0.64 us
Message rate (uni-directional)    149.5 million/sec   105 million/sec    35.9 million/sec

EDR InfiniBand Performance, Commercial Applications (chart): gains of 40%, 25%, and 15% shown across three commercial applications.

EDR InfiniBand Performance: Weather Simulation
Weather Research and Forecasting (WRF) model, an optimization effort with the HPC Advisory Council (HPCAC). EDR InfiniBand delivers 28% higher performance on a 32-node cluster, and the performance advantage increases with system size.

Lenovo EDR InfiniBand System (TOP500)
LENOX, an EDR InfiniBand-connected system at the Lenovo HPC Innovation Center. EDR InfiniBand provides >20% higher performance over FDR on Graph500 at 128 nodes.

Unified Communication X (UCX) Framework: the next-generation HPC software framework, built to meet the needs of future systems and applications.

Exascale Co-Design Collaboration: the next-generation HPC software framework, a collaborative effort across industry, national laboratories, and academia.

UCX Framework Mission
Collaboration between industry, laboratories, and academia to create an open-source, production-grade communication framework for HPC applications; to enable the highest performance through co-design of software-hardware interfaces; and to unify industry, national laboratory, and academia efforts.
API: exposes broad semantics that target data-centric and HPC programming models and applications.
Performance oriented: optimization for low software overheads in the communication path allows near native-level performance.
Production quality: developed, maintained, tested, and used by industry and the research community.
Community driven: collaboration between industry, laboratories, and academia.
Research: the framework concepts and ideas are driven by research in academia, laboratories, and industry.
Cross platform: support for InfiniBand, Cray, various shared memory (x86-64 and Power), and GPUs.
Co-design of Exascale network APIs.

The UCX Framework
UC-S (Services): basic infrastructure for component-based programming, memory management, and useful system utilities. Functionality: platform abstractions and data structures.
UC-T (Transport): low-level API that exposes the basic network operations supported by the underlying hardware. Functionality: work request setup and instantiation of operations.
UC-P (Protocols): high-level API that uses the UCT framework to construct protocols commonly found in applications. Functionality: multi-rail, device selection, pending queues, rendezvous, tag matching, software atomics, etc.
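For orientation, a minimal sketch of bringing up the UC-P layer with the public UCP C API (hedged: the interface has evolved since 2015, and error handling is trimmed): a context is created from the requested features, and a worker provides the progress engine over whichever UC-T transports UCX selects.

```c
/* Sketch: initialize a UCP context and worker (error handling trimmed). */
#include <ucp/api/ucp.h>
#include <stdio.h>

int main(void)
{
    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA, /* tag matching + one-sided */
    };
    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE,
    };
    ucp_config_t *config;
    ucp_context_h context;
    ucp_worker_h worker;

    if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
        return 1;
    if (ucp_init(&params, config, &context) != UCS_OK)
        return 1;
    ucp_config_release(config);

    /* The worker drives progress over the transports selected by UCX (UC-T). */
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK)
        return 1;

    printf("UCP context and worker initialized\n");

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return 0;
}
```

Endpoints created on such a worker would then carry tag-matched sends, RMA, or atomics, with UC-P selecting protocols such as rendezvous or multi-rail underneath.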

UCX High-Level Overview (architecture diagram)

Collaboration
Mellanox co-designs the network interface and contributes its MXM technology: infrastructure, transport, shared memory, protocols, and integration with Open MPI/SHMEM and MPICH.
ORNL co-designs the network interface and contributes the UCCS project: InfiniBand optimizations, Cray devices, shared memory.
NVIDIA co-designs high-quality support for GPU devices: GPU-Direct, GDRCopy, etc.
IBM co-designs the network interface and contributes ideas and concepts from PAMI.
UH/UTK focus on integration with their research platforms.

UCX Performance (benchmark charts)

Mellanox Multi-Host Technology: Next-Generation Data Center Architecture

New Compute Rack / Data Center Architecture
Traditional data center: an expensive design for fixed data centers; requires many ports on the top-of-rack switch; a dedicated NIC and cable per server.
Scalable data center with Multi-Host: flexible, configurable, application optimized; optimized top-of-rack switches; takes advantage of the high-throughput network.
The Network is The Computer

Multi-Host Dramatically Reduces Server Cost
Modern CPUs with 8-20 cores do not require expensive SMP architectures; additional parallelism is achieved with higher-level, network-based distributed programming techniques such as Hadoop MapReduce.
Traditional 4-way SMP machine (proprietary cache-coherent bus): expensive 4-way CPUs; a massive but unused cache-coherent domain; a high-overhead yet unnecessary CPU bus; high pin count, high power, complex layout; asymmetric (NUMA) data access.
Multi-Host 4-socket architecture: low-cost single-socket CPUs; clean, simple, cost-effective, software transparent; the cache-coherent domain is the multi-core CPU itself; eliminates pins, lowers power, simplifies layout; symmetric data access.

ConnectX-4 on the Facebook OCP Multi-Host Platform (Yosemite): a single ConnectX-4 Multi-Host adapter serving the platform's compute slots. The next-generation compute and storage rack design.

Summary: Paving the Road to Exascale Computing

End-to-End Interconnect Solutions for All Platforms
Highest performance and scalability for x86, OpenPOWER, GPU, ARM, and FPGA-based compute and storage platforms, with 10, 20, 25, 40, 50, 56, and 100Gb/s speeds. Smart interconnect to unleash the power of all compute architectures.

Technology Roadmap: One-Generation Lead over the Competition (timeline chart)
Mellanox generations of 20Gb/s, 40Gb/s, 56Gb/s, 100Gb/s, and 200Gb/s span the terascale, petascale, and exascale eras along a 2000-2020 timeline. Milestones include the Mellanox-connected 2003 Virginia Tech (Apple) system, ranked 3rd on the TOP500, and Roadrunner, the first petascale system.

Thank You