The Exascale Architecture

Richard Graham, HPC Advisory Council China 2013

Overview
- Programming-model challenges for Exascale
- Challenges for scaling MPI to Exascale
- InfiniBand enhancements:
  - Dynamically Connected Transport (DC)
  - Cross-Channel synchronization
  - Non-contiguous data transfer
  - On-Demand Paging

Exascale-Class Computer Platforms: Communication Challenges
- Very large functional unit count (~10,000,000). Implication for the communication stack: scalable point-to-point and collective communication capabilities.
- Large on-node functional unit count (~500). Implication: a scalable HCA architecture.
- Deeper memory hierarchies. Implication: cache-aware network access.
- Smaller amounts of memory per functional unit. Implication: low-latency, high-bandwidth capabilities.
- Possible functional unit heterogeneity. Implication: support for data heterogeneity.
- Component failures are part of normal operation. Implication: a resilient and redundant stack.
- Data movement is expensive. Implication: optimize data movement.

Challenges of a Scalable MPI Standard
- A scalable communication interface: interface objects scale in less than the system dimension
- Support for asynchronous communication: non-blocking communication interfaces
- Sparse and topology-aware interfaces
- System noise issues: interfaces that support communication delegation (e.g., offloading)
- Minimize data motion: semantics that support data aggregation
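As one concrete illustration of the asynchronous, non-blocking interfaces called for above, MPI-3 non-blocking collectives let an application start a collective, keep computing, and complete it later, which is what makes communication delegation and offload possible. A minimal sketch (the overlapped computation is only a placeholder comment):

```c
/* Minimal MPI-3 sketch: start a non-blocking allreduce, overlap it with
 * computation that does not depend on the result, complete it later.
 * Error handling omitted. */
#include <mpi.h>

void overlapped_allreduce(const double *local, double *global, int n,
                          MPI_Comm comm)
{
    MPI_Request req;
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    /* ... computation that does not touch 'global' proceeds here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* result now available in 'global' */
}
```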

Extended InfiniBand Capabilities: Mellanox Innovation

Mellanox Interconnect Development Timeline (2008-2013)
[Timeline: world's first Petaflop systems; QDR InfiniBand end-to-end; GPUDirect technology released; CORE-Direct technology; FDR InfiniBand end-to-end; InfiniBand/Ethernet bridging; MPI/SHMEM collectives offloads (FCA), Scalable HPC (MXM), OpenSHMEM, PGAS/UPC; long-haul solutions; Connect-IB 100Gb/s HCA; Dynamically Connected Transport; GPUDirect RDMA technology and solutions leadership. EDR InfiniBand expected in 2014-2015.]

Mellanox InfiniBand Paves the Road to Exascale

Leading Interconnect, Leading Performance
[Charts: InfiniBand bandwidth growing from 10Gb/s in 2001 through 20, 40, 56, and 100Gb/s toward 200Gb/s by 2017, and latency falling from 5usec through 2.5, 1.3, and 0.7usec to 0.5usec and below over the same period, all behind the same software interface.]

Dynamically Connected Transport: A Scalable Transport

New Transport
Challenges being addressed:
- A scalable communication protocol
- High-performance communication
- Asynchronous communication
Current status: the transports in widest use are RC and UD.
- RC - high performance: supports RDMA and atomic operations. Scalability limitation: one connection per destination.
- UD - scalable: one QP services multiple destinations. Limited communication support: no RDMA or atomic operations, and unreliable.
What is needed is a scalable transport that also supports RDMA and atomic operations.
DC - the best of both worlds:
- High performance: supports RDMA and atomic operations, reliable
- Scalable: one QP services multiple destinations
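For reference, today's rdma-core exposes DC through the mlx5 provider's direct verbs; the 2013-era stack used Mellanox experimental verbs instead, so the sketch below illustrates the model rather than the API of the time. It creates a DC target (DCT); a DC initiator (DCI) is created the same way with MLX5DV_DCTYPE_DCI, and each work request then names its destination DCT (for example via mlx5dv_wr_set_dc_addr()). QP capability attributes and error handling are omitted.

```c
/* Sketch (assumes the current rdma-core mlx5dv interface, not the 2013
 * experimental verbs): create a DC target that any remote DC initiator
 * can reach, provided it presents the matching DC access key. */
#include <stdint.h>
#include <infiniband/verbs.h>
#include <infiniband/mlx5dv.h>

struct ibv_qp *create_dct(struct ibv_context *ctx, struct ibv_pd *pd,
                          struct ibv_cq *cq, struct ibv_srq *srq,
                          uint64_t dc_access_key)
{
    struct ibv_qp_init_attr_ex attr = {
        .qp_type   = IBV_QPT_DRIVER,     /* DC is a driver-specific transport */
        .comp_mask = IBV_QP_INIT_ATTR_PD,
        .pd        = pd,
        .send_cq   = cq,
        .recv_cq   = cq,
        .srq       = srq,                /* a DCT receives through an SRQ */
    };
    struct mlx5dv_qp_init_attr dv = {
        .comp_mask = MLX5DV_QP_INIT_ATTR_MASK_DC,
        .dc_init_attr = {
            .dc_type        = MLX5DV_DCTYPE_DCT,
            .dct_access_key = dc_access_key,
        },
    };
    return mlx5dv_create_qp(ctx, &attr, &dv);
}
```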

IB Reliable Transports Model
[Figure: n nodes, each running processes 0 through p-1, with shared contexts for XRC/RD.]
Per-node QP count for n nodes with p processes per node (example: n = 4K, p = 16):
- RC: n*p*p (~1M)
- XRC: ~n*p (~64K)
- RD (+XRC): ~n (~4K)
QoS/Multipathing: 2 to 8 times the above. Resource sharing (XRC/RD) causes processes to impact each other.
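Working out the slide's example numbers (n = 4096 nodes, p = 16 processes per node):

```latex
\begin{align*}
\text{RC:}\quad  & n \cdot p^2 = 4096 \cdot 16^2 = 1{,}048{,}576 \;\approx\; 1\text{M QPs per node} \\
\text{XRC:}\quad & n \cdot p   = 4096 \cdot 16   = 65{,}536      \;\approx\; 64\text{K} \\
\text{RD:}\quad  & n           = 4096                            \;\approx\; 4\text{K}
\end{align*}
```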

The DC Model
[Figure: processes 0 through p-1, each with cs DC initiator and cr DC responder resources; annotated p*(cs+cr)/2. cs = concurrency of the sender, cr = concurrency of the responder.]
- Dynamic connectivity: each DC initiator can be used to reach any remote DC target
- No resource sharing between processes:
  - the process controls how many resources it uses (and can adapt to load)
  - the process controls the usage model (e.g., SQ allocation policy)
  - no inter-process dependencies
- Resource footprint: a function of HCA capability, independent of system size
- Fast communication setup time

Connect-IB Exascale Scalability
[Chart: host memory consumption (MB) for transport state at 8, 2K, 10K, and 100K nodes, on a log scale from 1 to 1,000,000,000, for InfiniHost with RC (2002), InfiniHost-III with SRQ (2005), ConnectX with XRC (2008), and Connect-IB with DCT (2012).]

Dynamically Connected Transport: Key Objects
- DC Initiator: initiates data transfers
- DC Target: handles incoming data

Reliable Connection Transport Mode

Dynamically Connected Transport Mode

Cross-Channel Synchronization

Challenges Being Addressed
- Scalable collective communication
- Asynchronous communication
- Communication resources (not computational resources) are used to manage communication
- Avoids some of the effects of system noise

High-Level Objectives
- Provide a synchronization mechanism between QPs
- Provide mechanisms for managing communication dependencies without additional host intervention
- Support asynchronous progress of multi-staged communication protocols

Motivating Example - HPC: Collective Communications Optimization
- A communication pattern involving multiple processes
- Optimized collectives involve a communicator-wide, data-dependent communication pattern; e.g., communication initiation depends on prior completion of other communication operations
- Data needs to be manipulated at intermediate stages of a collective operation (reduction operations)
- Collective operations limit application scalability: performance, scalability, and system noise

Scalability of Collective Operations
[Figure: an ideal collective algorithm versus the impact of system noise on its scalability.]

Scalability of Collective Operations - II
[Figure: an offloaded algorithm versus a nonblocking algorithm, highlighting where communication processing takes place.]

Network-Managed Multi-Stage Communication
Key ideas:
- Create a local description of the communication pattern
- Pass the description to the communication subsystem
- Manage the communication operations on the network, freeing the CPU to do meaningful computation
- Check for full-operation completion
Current assumptions:
- Data delivery is detected by new Completion Queue events
- The Completion Queue is used to identify the data source
- Completion order is used to associate data with a specific operation
- RDMA with immediate data is used to generate Completion Queue events

Key New Features
- New QP trait, the managed QP: WQEs on such a QP must be enabled by WQEs from other QPs
- Synchronization primitives:
  - Wait work queue entry: waits until the specified completion queue (QP) reaches a specified producer index value
  - Enable tasks: a WQE on one QP can enable a WQE on a second QP
- Lists of tasks can be submitted to multiple QPs in a single post, sufficient to describe chained operations (such as collective communication)
- A special completion queue can be set up to monitor list completion (request a CQE from the relevant task)

Setting Up CORE-Direct QPs
- Create QPs with the ability to use the CORE-Direct primitives
- Decide whether managed QPs will be used; if so, create the QP that will take the enable tasks. Most likely this is a centralized resource handling both the enable and the wait tasks
- Decide on a Completion Queue strategy
- Set up all needed QPs

Initiating CORE-Direct Communication
A task list is created. Each task specifies:
- The target QP for the task
- The operation: send, wait, or enable
- For wait: the number of completions to wait for
  - The number of completions is specified relative to the beginning of the task list
  - The number of completions can be positive, zero, or negative (wait on previously posted tasks)
- For enable: the number of send tasks to enable on the target QP
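A hypothetical sketch of such a task list is shown below; the type and field names are invented for illustration (the actual CORE-Direct support lived in Mellanox's experimental verbs extensions), but the send/wait/enable fields mirror the items above.

```c
/* Hypothetical illustration of a CORE-Direct-style task list.  The
 * names below are invented for clarity; they are not a verbs API. */
#include <infiniband/verbs.h>

enum cd_op { CD_OP_SEND, CD_OP_WAIT, CD_OP_ENABLE };

struct cd_task {
    enum cd_op     op;           /* send, wait, or enable                  */
    struct ibv_qp *target_qp;    /* QP the task is posted to / acts upon   */
    int            completions;  /* wait: completions to wait for, counted
                                    relative to the start of the task list
                                    (zero or negative refers to previously
                                    posted tasks)                          */
    int            enable_count; /* enable: number of send tasks to enable
                                    on the target QP                       */
};

/* One chained step of a collective at some rank: wait for the peer's
 * data to arrive on recv_qp, then release the pre-posted send on the
 * managed QP send_qp. */
void build_chain(struct ibv_qp *recv_qp, struct ibv_qp *send_qp,
                 struct cd_task chain[2])
{
    chain[0] = (struct cd_task){ .op = CD_OP_WAIT,   .target_qp = recv_qp,
                                 .completions = 1 };
    chain[1] = (struct cd_task){ .op = CD_OP_ENABLE, .target_qp = send_qp,
                                 .enable_count = 1 };
}
```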

Posting of Tasks

CORE-Direct Task-List Completion
- Can specify which task will generate a completion in the collective completion queue
- A single CQ signals full-list (collective) completion
- The CPU is not needed for progress

Example: Four-Process Recursive Doubling
[Figure: processes 1-4 exchanging data pairwise over two steps, Step 1 and Step 2.]
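For comparison, the recursive doubling exchange in the figure is what a host-driven MPI barrier does in software; CORE-Direct's contribution is that the same pairwise wait-then-send chain can be posted as a task list and progressed by the HCA without the CPU. A minimal host-side sketch, assuming a power-of-two number of processes:

```c
/* Recursive doubling barrier among P processes (P a power of two):
 * in step k each rank exchanges an empty message with the rank whose
 * ID differs in bit k.  After log2(P) steps every rank has
 * (transitively) heard from every other rank. */
#include <mpi.h>

static void rd_barrier(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);      /* assumed to be a power of two */

    for (int mask = 1; mask < size; mask <<= 1) {
        int peer = rank ^ mask;      /* partner for this step */
        MPI_Sendrecv(NULL, 0, MPI_BYTE, peer, 0,
                     NULL, 0, MPI_BYTE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```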

Four-Process Barrier Example, Using Managed Queues
[Figure: the example as seen from rank 0.]

Four-Process Barrier Example, No Managed Queues
[Figure: the example as seen from rank 0.]

User-Mode Memory Registration

Key Features
- Supports combining contiguous registered memory regions into a single memory region; the hardware treats them as a single contiguous region (and handles the non-contiguous underlying regions)
- For a given memory region, supports non-contiguous access to memory using a regular structure representation: base pointer, element length, stride, repeat count. These can be combined from multiple different memory keys
- Memory descriptors are created by posting WQEs that fill in the memory key
- Supports local and remote non-contiguous memory access
- Eliminates the need for some memory copies
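The regular-structure description amounts to a (base pointer, element length, stride, repeat count) tuple. The sketch below is purely illustrative (the struct is not a verbs API); it shows the address arithmetic the HCA applies in hardware, which is what removes the host-side packing copy.

```c
/* Illustrative only: a regular (strided) region description of the
 * kind the slide describes.  The struct and its fields are
 * hypothetical, not a verbs API. */
#include <stddef.h>

struct strided_region {
    void    *base;       /* base pointer                     */
    size_t   elem_len;   /* bytes accessed per element       */
    size_t   stride;     /* distance between element starts  */
    unsigned repeat;     /* number of elements (repeat count)*/
};

/* Start address of element i; the HCA performs the equivalent
 * gather/scatter in hardware, so no host-side packing is needed. */
static inline void *elem_addr(const struct strided_region *r, unsigned i)
{
    return (char *)r->base + (size_t)i * r->stride;
}
```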

Combining Contiguous Memory Regions

Non-Contiguous Memory Access: Regular Access
[Figure: a regular (strided) access pattern over a three-dimensional x, y, z region.]

On-Demand Paging

Memory Registration
- Applications register Memory Regions (MRs) for IO
- Referenced memory must be part of the process address space at registration time
- A memory key is returned to identify the MR
- The registration operation:
  - pins down the MR
  - hands off the virtual-to-physical mapping to the HW
[Figure: the application and userspace driver stack above the kernel driver stack and HW, with the MR, SQ/RQ/CQ, and memory key.]
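In standard verbs this is the ibv_reg_mr() path; a minimal sketch, with error handling omitted:

```c
/* Register a buffer as described above: pin the pages and hand the
 * virtual-to-physical translation to the HCA.  The returned MR
 * carries the lkey/rkey memory keys. */
#include <stdlib.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```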

Memory Registration, continued
Fast path:
- Applications post IO operations directly to the HCA
- The HCA accesses memory using the translations referenced by the memory key
[Figure: the application posting through SQ/RQ/CQ straight to the HW via the memory key, bypassing the kernel driver stack.]
Wow!!! But...
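On the fast path the application builds a work request that references registered memory through the lkey and posts it straight to the HCA; a minimal sketch using ibv_post_send(), error handling omitted:

```c
/* Fast-path posting: the WQE references registered memory through the
 * lkey, and the HCA resolves the addresses on its own, with no kernel
 * involvement in the data path. */
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int post_send(struct ibv_qp *qp, struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,  /* virtual address inside the MR      */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,             /* key handed out at registration     */
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_SEND,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED, /* completion reported through the CQ */
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```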

Challenges
- The size of registered memory must fit in physical memory
- Applications must have memory-locking privileges
- Continuously synchronizing the translation tables between the address space and the HCA is hard:
  - address space changes (malloc, mmap, stack)
  - NUMA migration
  - fork()
- Registration is a costly operation
- No locality of reference

Achieving High Performance
Requires careful design:
- Dynamic registration:
  - a naïve approach induces significant overheads
  - pin-down cache logic is complex and not complete
- Pinned bounce buffers:
  - application-level memory management
  - copying adds overhead
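The pin-down cache idea is roughly the following: keep MRs for recently used buffers and register only on a miss. The sketch below is a deliberately simplified illustration (the cache structure is hypothetical); the hard, incomplete part the slide alludes to is invalidating entries when the application frees or remaps the memory.

```c
/* Minimal pin-down (registration) cache sketch: reuse an existing MR
 * when a buffer is used repeatedly, register only on a miss.  A real
 * cache must also intercept free()/munmap() to invalidate entries. */
#include <stddef.h>
#include <infiniband/verbs.h>

#define CACHE_SLOTS 64

struct reg_cache_entry { void *addr; size_t len; struct ibv_mr *mr; };
static struct reg_cache_entry cache[CACHE_SLOTS];

struct ibv_mr *lookup_or_register(struct ibv_pd *pd, void *addr, size_t len)
{
    for (int i = 0; i < CACHE_SLOTS; i++)       /* hit: buffer already pinned */
        if (cache[i].mr && cache[i].addr == addr && cache[i].len >= len)
            return cache[i].mr;

    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len, IBV_ACCESS_LOCAL_WRITE);
    for (int i = 0; i < CACHE_SLOTS; i++)       /* miss: register and cache */
        if (!cache[i].mr) {
            cache[i] = (struct reg_cache_entry){ addr, len, mr };
            break;
        }
    return mr;
}
```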

On-Demand Paging
- MR pages are never pinned by the OS:
  - paged in when the HCA needs them
  - paged out when reclaimed by the OS
- HCA translation tables may contain non-present pages:
  - initially, a new MR is created with non-present pages
  - virtual memory mappings don't necessarily exist

Semantics
- ODP memory registration: specify the IBV_ACCESS_ON_DEMAND access flag
- Work request processing:
  - WQEs in HW ownership must reference mapped memory, from Post_Send()/Recv() until PollCQ()
  - RDMA operations must target mapped memory
  - access attempts to unmapped memory trigger an error
- Transport:
  - RC semantics unchanged
  - UD responder drops packets while page faults are resolved (standard semantics cannot be achieved unless the wire is back-pressured)
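With ODP, registration itself changes only by one access flag; a minimal sketch (a real application should first check the device's ODP capabilities, e.g. via ibv_query_device_ex()):

```c
/* Register an ODP memory region: the IBV_ACCESS_ON_DEMAND flag means
 * the pages are not pinned and the HCA faults them in on first access. */
#include <stddef.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_odp(struct ibv_pd *pd, void *addr, size_t len)
{
    return ibv_reg_mr(pd, addr, len,
                      IBV_ACCESS_ON_DEMAND |     /* no pinning; fault on demand */
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```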

Execution Time Breakdown (Send Requestor)
[Pie charts: servicing a 4KB page fault takes about 135us in total, while a 4MB page fault takes about 1ms; both break down into schedule-in, WQE read, get-user-pages, PTE update, TLB flush, QP resume, and miscellaneous time.]

Thank You