A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers

A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers
Maxime Martinasso, Grzegorz Kwasniewski, Sadaf R. Alam, Thomas C. Schulthess, Torsten Hoefler
Swiss National Supercomputing Centre, ETH Zurich, 6900 Lugano, Switzerland; Department of Computer Science, ETH Zurich, Universitätstr. 6, 8092 Zurich, Switzerland; Institute for Theoretical Physics, ETH Zurich, 8093 Zurich, Switzerland; Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA

Why more densely populated accelerator servers?
- Accelerators are faster and more energy-efficient than CPUs.
- Densely populated accelerator servers are high-performance nodes.
- They reduce the space occupancy of the data center.

Cray CS-Storm, the new MeteoSwiss supercomputer
- 2 cabinets
- 12 hybrid computing nodes per cabinet
- 2 Intel Haswell 12-core CPUs per node
- 8 NVIDIA Tesla K80 GPU accelerators per node
- 2 GPU processors per accelerator
- 192 GPU processors in total
- 360 GPU teraflops in total
- Production system
- GPUs connected by PCI-Express

Building a densely populated accelerator server with PCIe:
- generation 3, 16 GB/s using an x16 link
- dual simplex (a pair of unidirectional links)
- buffer availability is exchanged between the pair of ports of a link
- tree-based topology
[Diagram built up step by step: two CPUs with their PCIe Root Complexes, PCIe Switches, and K80 GPUs attached via x16 links. Legend: PCIe Root Complex, PCIe Switch, GPU, Port, x16 Link.]

Communication conflicts
Example communications: 0→7, 1→4, 6→7 on a tree of switches (ports 8-19).
Conflict types identified on the topology:
- Upstream port conflict
- Downstream port conflict
- Crossing the Root Complex conflict
- Head-of-Line blocking
Conflicts were identified by:
- carefully reading the PCIe and vendor documentation
- our own experiments with micro-benchmarks (looking for HoL blocking)

Motivation: COSMO halo exchange
GPU0-GPU7 arranged for a 2D domain decomposition and for a 3D domain decomposition.
Which order of communications is the fastest?
[Figure: candidate orderings of GPU-to-GPU transfers, e.g. 0→4, 1→0, 2→1, 3→7, ...]
The 2D domain decomposition alone has 20,736 possible orderings; the 3D domain decomposition has more than 1.6 million.

PCIe performance model
We want to identify the congestion factors ρ ∈ [0, 1] which limit the available bandwidth of each communication in each communication phase.
Model loop over communication phases:
- Communication graph
- Step (A): source arbitration
- Step (B): upstream port conflicts
- Step (C): downstream port conflicts
- Step (D): Head-of-Line blocking
- Update messages, advance to the next time step, update the communication graph
- Repeat while messages remain, then end.
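
To make the control flow concrete, here is a minimal Python sketch of the phase loop, assuming the per-phase congestion factors (steps A-D) are supplied by a user-provided callback; the function names and data layout are illustrative, not the authors' implementation.

```python
# Minimal sketch of the model's phase loop, assuming a user-supplied
# congestion_factors(active) callback that returns rho in [0, 1] for each active
# communication (i.e. the combined result of source arbitration, upstream and
# downstream conflicts, and Head-of-Line blocking). Names/units are illustrative.

def simulate(messages, bandwidth, congestion_factors):
    """messages: dict comm_id -> remaining bytes; bandwidth: bytes/s."""
    elapsed = {c: 0.0 for c in messages}          # completion time per communication
    t = 0.0
    remaining = dict(messages)
    while remaining:                               # "Remaining messages?"
        rho = congestion_factors(set(remaining))   # congestion factors for this phase
        # each active communication progresses at rho * bandwidth; the phase ends
        # when the first message completes
        phase = min(m / (rho[c] * bandwidth) for c, m in remaining.items())
        t += phase
        for c in list(remaining):                  # update messages
            remaining[c] -= rho[c] * bandwidth * phase
            if remaining[c] <= 1e-9:               # message finished in this phase
                elapsed[c] = t
                del remaining[c]
    return elapsed                                  # per-communication elapsed times
```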

Communication phase: update messages
Elapsed time L_c, message size M_c, set of communication phases S_c.
After the first phase: L_{C_0} = t_0 = (1/B) · m_{0,0}/ρ_{0,0} and M_{C_0} = m_{0,0}.
After two phases: L_{C_0} = t_0 + t_1 = (1/B) · (m_{0,0}/ρ_{0,0} + m_{0,1}/ρ_{0,1}) and M_{C_0} = m_{0,0} + m_{0,1}.
In general:
L_c = Σ_{i ∈ S_c} t_i = (1/B) Σ_{i ∈ S_c} m_{c,i}/ρ_{c,i} and M_c = Σ_{i ∈ S_c} m_{c,i}
(1) The start times t_i^start and M_c are known.
(2) The ρ_{c,i} are given by the model.
With (1) and (2), the L_c are computable.
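
Once the per-phase message portions and congestion factors are known, the formula can be evaluated directly; a tiny Python illustration with made-up portion values (not taken from the slides):

```python
# L_c = (1/B) * sum_i m_{c,i} / rho_{c,i}, with M_c = sum_i m_{c,i}.
B = 11.6e9                       # link bandwidth in bytes/s (the model parameter B)
portions = [(100e6, 0.5),        # (m_{c,i} in bytes, rho_{c,i}) per phase -- toy values
            (50e6, 1.0)]
L_c = sum(m / rho for m, rho in portions) / B
M_c = sum(m for m, _ in portions)
print(f"M_c = {M_c/1e6:.0f} MB, L_c = {L_c*1e3:.1f} ms")   # -> 150 MB, 21.6 ms
```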

Model conflicts on a switch
We want to identify the congestion factors ρ ∈ [0, 1] which limit the available bandwidth of each communication in each communication phase.
Each communication enters a switch with a congestion factor ρ and leaves it with a congestion factor ρ′.
If Σ_i ρ_i > 1 at a port, an arbitration policy is required; otherwise ρ′ = ρ.

Upstream port conflict
Proportional sharing of the available bandwidth.
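
A hedged sketch of what proportional sharing at an oversubscribed port could look like, illustrating the Σρ > 1 trigger from the previous slide; this is only a sketch, not the paper's arbitration code.

```python
# Proportional sharing at a port: if the incoming factors oversubscribe the link
# (sum > 1), scale them down proportionally; otherwise they pass through unchanged.
def share_port(rho_in):
    total = sum(rho_in)
    if total <= 1.0:                      # no conflict: rho' = rho
        return list(rho_in)
    return [r / total for r in rho_in]    # conflict: share bandwidth proportionally

print(share_port([1.0, 1.0]))             # two full-rate flows -> [0.5, 0.5]
print(share_port([0.5, 0.25]))            # port not saturated -> unchanged
```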

Downstream port conflict
Round-robin policy, with a performance reduction for crossing the root complex.
- C_R: set of communications crossing the root complex
- n: number of grouped communication sets
- R: congestion factor of a grouped communication set
- τ: congestion penalty for crossing the root complex
If C_R = ∅: R = 1/n.
If C_R ≠ ∅:
  R = min(max(1/n − τ, 0), R) if the grouped set contains a communication in C_R,
  R = min(1/n + τ, R) otherwise.
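
Reading the reconstructed rule above literally, a sketch of the downstream arbitration might look as follows; the grouping of communications and the exact use of τ are assumptions based on this slide, not a verified implementation.

```python
# Downstream port: round-robin over n grouped communication sets, with a penalty
# tau for groups that contain a communication crossing the root complex (C_R).
# Reconstruction of the slide's rule -- treat as an assumption, not ground truth.
def downstream_factors(groups, crosses_rc, current, tau):
    """groups: list of group ids; crosses_rc: set of groups with a comm. in C_R;
    current: dict group -> congestion factor on entry; tau: root-complex penalty."""
    n = len(groups)
    out = {}
    for g in groups:
        if not crosses_rc:                       # C_R empty: plain round-robin
            out[g] = 1.0 / n
        elif g in crosses_rc:                    # penalised group
            out[g] = min(max(1.0 / n - tau, 0.0), current[g])
        else:                                    # the remaining groups gain up to tau
            out[g] = min(1.0 / n + tau, current[g])
    return out

print(downstream_factors(["a", "b"], {"a"}, {"a": 1.0, "b": 1.0}, tau=0.2))
# -> {'a': 0.3, 'b': 0.7}, i.e. the 3/10 vs 7/10 split seen in the complete example
```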

Complete example
Topology with switch ports 8-19; message size 300 MB; bandwidth B = 11.6 GB/s; τ = 0.2.
Steps: (A) source arbitration, (B) upstream port conflicts, (C) downstream port conflicts, (D) Head-of-Line blocking, then update messages.

Congestion factors per step:
comm.      Step (A)  Step (B)  Step (C)  Step (D)  Update messages
(a) 0→2    1         1/2       1/2       3/10      3/10
(b) 1→4    1         1/2       3/10      3/10      3/10
(c) 3→2    1         1         1/2       1/2       7/10
(d) 6→4    1         1         7/10      7/10      7/10

Congestion graph, step 1:
comm.      cong. factor  data remaining  elapsed time
(a) 0→2    3/10          128 MB          36 ms
(b) 1→4    3/10          128 MB          36 ms
(c) 3→2    7/10          0 MB            36 ms
(d) 6→4    7/10          0 MB            36 ms

Congestion graph, step 2:
comm.      cong. factor  data remaining  elapsed time
(a) 0→2    1/2           0 MB            65 ms
(b) 1→4    1/2           0 MB            65 ms
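
Plugging the example's parameters (300 MB messages, B = 11.6 GB/s) and the step-1 congestion factors into the bandwidth-sharing arithmetic reproduces the phase lengths of the tables above; this is just arithmetic on the slide's numbers, and the small differences from the quoted 36 ms / 128 MB / 65 ms are presumably rounding in the slides.

```python
# Phase 1 of the complete example: (a) and (b) run at rho = 3/10, (c) and (d) at 7/10.
# The phase ends when the faster pair finishes its 300 MB message.
B = 11.6e9                        # bytes/s
M = 300e6                         # message size in bytes
phase1 = M / (0.7 * B)            # time until (c)/(d) complete
sent_a = 0.3 * B * phase1         # data moved by (a)/(b) during phase 1
phase2 = (M - sent_a) / (0.5 * B) # (a)/(b) then share the link at rho = 1/2
print(f"phase 1 ~ {phase1*1e3:.0f} ms, (a) has sent ~ {sent_a/1e6:.0f} MB")
print(f"total for (a)/(b) ~ {(phase1 + phase2)*1e3:.0f} ms")
```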

Model validation
Architecture parameters: B = 11.6 GB/s, τ = 0.1735.
22,259 non-isomorphic communication graphs, measured with cudaMemcpyAsync:
- communication patterns: scatter, gather, all-to-all
- the entire set of graphs for subsets of GPUs
- 100K randomly generated communications
Message size: 300 MB; time without contention: T_ref = 25.3 ms.
95% of communications are within ±15%.
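
A quick check on the no-contention reference time; whether the slide's 300 MB and 11.6 GB/s are decimal or binary units is my assumption, shown only because the binary reading lands on the quoted 25.3 ms.

```python
# No-contention reference time T_ref for a single 300 MB transfer at B = 11.6 GB/s.
M_dec, B_dec = 300e6, 11.6e9                 # decimal MB / GB/s
M_bin, B_bin = 300 * 2**20, 11.6 * 2**30     # binary MiB / GiB/s (assumption)
print(f"decimal units: {M_dec / B_dec * 1e3:.1f} ms")   # ~25.9 ms
print(f"binary units:  {M_bin / B_bin * 1e3:.1f} ms")   # ~25.3 ms, matching T_ref
```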

Back to the motivation: COSMO halo exchange
- Upper limit on time to solution (throughput approach)
- Running mode: one instance per socket (8 GPUs)
- Large domain size: 256x256x80 per GPU
- One step triggers 312 halo exchanges
- Message size: 40 KB to 254 KB
- Uses MPI
- (3!)^4 · (2!)^4 = 20,736 communication graphs for the 2D domain
[Figure: elapsed time (0.14-0.28 s) of the congestion graphs sorted by elapsed time, with the graph currently implemented in COSMO marked and 1.6x / 1.9x annotations; 2D and 3D domain decompositions over GPU0-GPU7.]
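
The quoted count of communication graphs for the 2D domain can be checked directly:

```python
# Number of communication orderings for the 2D decomposition: (3!)^4 * (2!)^4
from math import factorial
print(factorial(3)**4 * factorial(2)**4)   # -> 20736
```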

Fastest schedule for 2D decomposition
[Animation: the fastest communication schedule among GPUs 4 5 6 7 / 0 1 2 3, shown phase by phase.]

Fastest schedule for 3D decomposition
[Animation: the fastest communication schedule among GPUs 4 5 6 7 / 0 1 2 3, shown phase by phase.]

COSMO improvement with the fastest schedule
[Figure: the fastest schedule among GPUs 4 5 6 7 / 0 1 2 3.]
COSMO gain: 5.6% per halo exchange step. The gain is limited by MPI two-sided overhead; initial tests with one-sided communication seem to clearly improve performance.

Conclusion
- Latency not modeled
- MPI two-sided overhead not modeled (using one-sided is in progress)
+ Captures all PCIe features, including congestion
+ Simple model with only 2 parameters (B and τ)
+ Precise for large messages
+ Enables the design of topology-aware algorithms
+ COSMO halo exchange performance gain

Thank you for your attention.

Backup: measured bandwidth vs. message size
[Figure: measured bandwidth (0-12 GB/s) for message sizes from 2 B to 32 MB, for communications 0→1 or 0→3 alone, 4→1 alone, and 0→3 / 0→1 under conflicts 1-3; the conflicting cases are between 1.21x and 1.88x slower than the uncongested ones (annotations: 1.44x, 1.21x, 1.80x, 1.88x).]
τ = 1 − 1/1.21 ≈ 0.17, consistent with the τ = 0.1735 used in the model validation.

Backup: PCIe topologies T1 and T2
[Figure: two socket/K80 PCIe topologies, T1 and T2. Legend: PCIe Root Complex, PCIe Switch, GPU, Port, x16 Link.]