HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs


HA-PACS/TCA: Tightly Coupled Accelerators for Low-Latency Communication between GPUs
Yuetsu Kodama
Division of High Performance Computing Systems, Center for Computational Sciences, University of Tsukuba, Japan

What is Tightly Coupled Accelerators (TCA)?
Concept: direct connection between accelerators (GPUs) across nodes
- Eliminates extra memory copies to the host
- Improves latency and strong scaling for small data sizes
Uses PCIe as the communication fabric between accelerators
- Most accelerators and other I/O devices are attached to PCIe as end points (slave devices)
- An intelligent PCIe device logically enables one end-point device to communicate directly with other end-point devices
PEACH2: PCI Express Adaptive Communication Hub ver. 2
- To configure a TCA system, each node is connected to the other nodes through a PEACH2 chip.

Design policy of PEACH2
Implemented on an FPGA with four PCIe Gen2 IP blocks (Altera Stratix IV GX)
- Prototyping and flexible enhancement
Sufficient communication bandwidth
- PCI Express Gen2 x8 for each port
Sophisticated DMA controller
- Chaining DMA
Latency reduction
- Hardwired logic
- Low-overhead routing mechanism: efficient address mapping that uses unused bits of the PCIe address space, with a simple comparator deciding the output port (see the sketch below)
Not only a proof-of-concept implementation: it is also intended for production runs on a GPU cluster.
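A minimal sketch of how such address-based routing could work. The bit layout (a destination node ID carried in otherwise unused high-order bits of the PCIe address) and the port names are assumptions for illustration, not the actual PEACH2 encoding.

```c
#include <stdint.h>

/* Hypothetical output ports of the routing function. */
enum port { PORT_N /* local host/GPUs */, PORT_E, PORT_W, PORT_S };

/* Assumed encoding: the destination node ID lives in otherwise
 * unused high-order bits of the 64-bit PCIe address window. */
#define NODE_ID_SHIFT 40
#define NODE_ID_MASK  0x3Full          /* up to 64 nodes in this sketch */

static inline unsigned dest_node(uint64_t pcie_addr)
{
    return (unsigned)((pcie_addr >> NODE_ID_SHIFT) & NODE_ID_MASK);
}

/* Comparator-style routing on a ring of `ring_size` nodes:
 * deliver locally, or forward east/west along the shorter direction. */
static enum port route(uint64_t pcie_addr, unsigned my_id, unsigned ring_size)
{
    unsigned dst = dest_node(pcie_addr);
    if (dst == my_id)
        return PORT_N;                 /* local host or GPU */
    unsigned east_hops = (dst - my_id + ring_size) % ring_size;
    return (east_hops <= ring_size / 2) ? PORT_E : PORT_W;
}
```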

TCA node structure example
- PEACH2 can access every GPU: NVIDIA Kepler architecture + CUDA 5.0, with GPUDirect support for RDMA
- Performance across QPI is quite bad, so only GPU0 and GPU1 are supported
- The example shows connections among 3 nodes through PEACH2 (PCIe Gen2 x8 links)
[Node block diagram: two Xeon E5 CPUs linked by QPI in a single PCIe address space; GPU0-GPU3 (NVIDIA K20/K20X, Kepler architecture) on PCIe Gen2 x16; PEACH2 on PCIe Gen2 x8; InfiniBand HCA on PCIe Gen3 x8]

Overview of the PEACH2 chip
- Fully compatible with the PCIe Gen2 specification; Root and Endpoint must be paired, as the PCIe spec requires
- Port N: connected to the host and GPUs (Endpoint)
- Ports E and W: form the ring topology
- Port S: connects to the other ring; selectable between Root and Endpoint
- All ports except Port N are write-only; instead, a proxy write performed on the remote node realizes a pseudo-read (see the sketch below)
[Chip block diagram: NIOS CPU, DMAC, memory, and routing function inside PEACH2; Port N to the CPU & GPU side (Endpoint); Ports E, W, and S to neighboring PEACH2 chips (Root Complex or Endpoint)]
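Because the inter-node ports are write-only, a read has to be emulated by asking the remote side to write the data back. A minimal sketch of that proxy-write idea; peach2_put() and the request/flag layout are hypothetical placeholders, not the actual PEACH2 driver API.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical one-sided write primitive (placeholder for the real driver call):
 * writes `len` bytes from local `src` into `dst_addr` on node `dst_node`. */
void peach2_put(int dst_node, uint64_t dst_addr, const void *src, size_t len);

struct read_request {            /* written into the remote node's request slot */
    uint64_t remote_addr;        /* address to read on the remote node          */
    uint64_t reply_addr;         /* where the remote node should write the data */
    uint32_t length;
    uint32_t valid;              /* remote handler polls this flag              */
};

/* Requester side: emulate "read remote_addr on node dst" with a proxy write. */
void pseudo_read(int dst, uint64_t remote_addr, uint64_t my_reply_addr,
                 uint32_t len, volatile uint32_t *done_flag,
                 uint64_t remote_request_slot)
{
    struct read_request req = { remote_addr, my_reply_addr, len, 1 };
    *done_flag = 0;
    peach2_put(dst, remote_request_slot, &req, sizeof req); /* 1: post request  */
    while (*done_flag == 0)                                 /* 3: wait for data */
        ;  /* the remote node's proxy write sets this flag after the payload */
}

/* Remote side (step 2, e.g. run by the NIOS core or a host thread): on seeing
 * req.valid, write req.length bytes from req.remote_addr back to req.reply_addr
 * on the requester, then write its done_flag. */
```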

Communication by PEACH2
PIO
- The CPU can store data directly to a remote node via mmap.
DMA
- Chaining mode: DMA requests are prepared as DMA descriptors chained in host memory; the hardware executes the DMA transactions automatically by following the descriptor chain.
- Register mode: up to 16 DMA requests are registered directly in PEACH2, giving lower overhead than chaining mode because the descriptors do not have to be fetched from the host.
- Block-stride transfer function
Each descriptor holds Source, Destination, Next, Length, and Flags fields; Descriptor0 through Descriptor(n-1) are linked through the Next field (see the sketch below).
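A minimal sketch of what a chained descriptor list could look like, based only on the fields named on the slide (Source, Destination, Next, Length, Flags); the field widths, flag bits, and the address-translation helper are assumptions for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Descriptor fields as named on the slide; widths and flag bits are assumed. */
struct dma_desc {
    uint64_t source;        /* PCIe address of the source buffer            */
    uint64_t destination;   /* PCIe address of the destination buffer       */
    uint64_t next;          /* PCIe address of the next descriptor, 0 = end */
    uint32_t length;        /* transfer length in bytes                     */
    uint32_t flags;         /* e.g. interrupt-on-completion, last entry     */
};

/* Build a chain of n descriptors in host memory. dma_addr_of() stands in
 * for whatever translates a host pointer to a PCIe/DMA address. */
struct dma_desc *build_chain(const uint64_t *src, const uint64_t *dst,
                             const uint32_t *len, int n,
                             uint64_t (*dma_addr_of)(const void *))
{
    struct dma_desc *chain = calloc((size_t)n, sizeof *chain);
    for (int i = 0; i < n; i++) {
        chain[i].source      = src[i];
        chain[i].destination = dst[i];
        chain[i].length      = len[i];
        chain[i].next  = (i + 1 < n) ? dma_addr_of(&chain[i + 1]) : 0;
        chain[i].flags = (i + 1 == n) ? 1u /* assumed "last" flag */ : 0u;
    }
    return chain;   /* the head address is then handed to the DMA engine */
}
```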

PEACH2 board (production version for HA-PACS/TCA)
- PCI Express Gen2 x8 peripheral board, compatible with the PCIe specification
[Photographs: side view and top view of the board]

PEACH2 board (production version for HA-PACS/TCA)
- Main board + sub board
- FPGA: Altera Stratix IV 530GX; most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
- PCI Express x8 card edge, PCIe x16 cable connector, PCIe x8 cable connector
- DDR3 SDRAM
- Power supply for the various voltages

HA-PACS system
- TCA: 5 racks x 2 lines, in operation since Nov. 2013
- Base cluster: in operation since Feb. 2012

HA-PACS total system
[System diagram: HA-PACS base cluster (268 nodes) and HA-PACS/TCA (64 nodes) attached through InfiniBand QDR switches (324-port and 108-port) to the Lustre filesystem; InfiniBand QDR, 40 ports x 2 channels, connects the base cluster and TCA]
- Base cluster: 421 TFLOPS, 54% efficiency, 41st on the June 2012 Top500, 1.15 GFLOPS/W
- TCA: 277 TFLOPS, 76% efficiency, 134th on the November 2013 Top500, 3.52 GFLOPS/W, 3rd on the November 2013 Green500

HA-PACS/TCA (computation node)
[Node block diagram]
- CPUs: 2 x Ivy Bridge; AVX at 2.8 GHz x 8 flop/clock = 22.4 GFLOPS per core, x20 cores = 448.0 GFLOPS
- Memory: 4 channels at 1,866 MHz per socket, 59.7 GB/s each; (16 GB, 14.9 GB/s) x8 = 128 GB, 119.4 GB/s
- GPUs: 4 x NVIDIA K20X, 1.31 TFLOPS x4 = 5.24 TFLOPS; (6 GB, 250 GB/s) x4 = 24 GB, 1 TB/s; PCIe Gen2 x16 (8 GB/s) each
- PEACH2 board (TCA interconnect): PCIe Gen2 x8 links
- Legacy devices on the host
- Node total: 5.688 TFLOPS

HA-PACS/TCA
- PEACH2 boards are installed and connected with cables
[Photographs: front view (8 nodes/rack, 3U height per node) and rear view]

HA-PACS/TCA

TCA sub-cluster (16 nodes)
- TCA has four sub-clusters; each sub-cluster consists of two racks.
- 2x8 torus (one example): a ring of 8 nodes is formed between the East and West ports (orange links), and the two rings are connected at each node through the South ports (red links). See the sketch below.
- The 32 GPUs in a sub-cluster can be used seamlessly, just like multiple GPUs within a node; only 2 GPUs per node are used because of the QPI bottleneck.
- Sub-clusters are connected by InfiniBand (QDR, 2 ports).
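A minimal sketch of how the neighbors in such a 2x8 torus could be enumerated; the node numbering (ring index times ring length plus position) is an assumption for illustration, not the actual HA-PACS/TCA node numbering.

```c
#include <stdio.h>

#define RING_LEN  8   /* nodes per ring (East-West direction)        */
#define NUM_RINGS 2   /* rings connected through the South ports     */

/* Assumed numbering: node = ring * RING_LEN + pos. */
static int node_id(int ring, int pos) { return ring * RING_LEN + pos; }

/* Print the three neighbors of each node in the 2x8 torus:
 * East and West along the ring, South to the same position on the other ring. */
int main(void)
{
    for (int ring = 0; ring < NUM_RINGS; ring++) {
        for (int pos = 0; pos < RING_LEN; pos++) {
            int east  = node_id(ring, (pos + 1) % RING_LEN);
            int west  = node_id(ring, (pos + RING_LEN - 1) % RING_LEN);
            int south = node_id((ring + 1) % NUM_RINGS, pos);
            printf("node %2d: E=%2d W=%2d S=%2d\n",
                   node_id(ring, pos), east, west, south);
        }
    }
    return 0;
}
```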

Evaluation items
- Ping-pong performance between nodes: latency and bandwidth, written as an application-level benchmark (a sketch of the measurement loop follows this list).
- Comparison with MVAPICH2 1.9 (with CUDA support) for GPU-GPU communication and MVAPICH2-GDR (with GPUDirect support for RDMA), both over InfiniBand (dual QDR x4, whose bandwidth is twice that of TCA).
- To let another device access GPU memory, the GPUDirect support for RDMA in the CUDA 5 API is used.
- A special driver, the TCA p2p driver, was developed to enable the memory mapping; a PEACH2 driver to control the board was also developed.
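A minimal sketch of the shape of such a ping-pong latency measurement; peach2_dma_kick(), peach2_wait_recv(), and the flag-based completion are hypothetical placeholders for the actual TCA/PEACH2 driver interface, which the slides do not spell out.

```c
#include <stdio.h>
#include <time.h>

#define ITERS 1000

/* Hypothetical placeholders for the PEACH2/TCA driver calls. */
void peach2_dma_kick(int desc_set);        /* start a pre-built DMA (put to peer) */
void peach2_wait_recv(volatile int *flag); /* spin until the peer's put arrives   */

/* One side of a ping-pong: send, wait for the echo, repeat, and report
 * one-way latency as half the averaged round-trip time. */
double pingpong_latency(int send_desc, volatile int *recv_flag, size_t size)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        *recv_flag = 0;
        peach2_dma_kick(send_desc);        /* ping: GPU-to-GPU put via PEACH2 */
        peach2_wait_recv(recv_flag);       /* pong: peer's put sets our flag  */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
    printf("size %zu B: %.2f us one-way\n", size, usec / ITERS / 2.0);
    return usec / ITERS / 2.0;
}
```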

Ping-pong latency
[Latency plots vs. message size for CPU(IVB), GPU(SB), GPU(IVB), MV2-GPU(SB), and MV2GDR-GPU(IVB)]
Minimum latency:
- PIO (CPU to CPU): 0.9 us
- DMA, CPU to CPU: 1.9 us
- DMA, GPU to GPU: 2.3 us (cf. MVAPICH2 1.9: 19 us; MVAPICH2-GDR: 6 us)

Ping-pong bandwidth
[Bandwidth plot vs. message size for CPU(IVB), GPU(SB), GPU(IVB), MV2-GPU(SB), and MV2GDR-GPU(IVB)]
- CPU-CPU DMA: max. 3.5 GByte/s (95% of the theoretical peak)
- GPU-GPU DMA: max. 2.6 GByte/s
- GPU(SB) saturates at 880 MByte/s because of the poor performance of the PCIe switch in the CPU
- GPU(IVB) is faster than MV2GDR for message sizes below 512 KB

Programming for a TCA cluster
- Data transfer to a remote GPU within TCA can be treated like multi-GPU transfer within a node.
- Particularly suitable for stencil computation: good performance for nearest-neighbor communication thanks to the direct network.
- Chaining DMA can bundle the data transfers for all halo planes into a single DMA: the XY-plane is a contiguous array, the XZ-plane is a block-stride transfer, and the YZ-plane is a stride transfer.
- In each iteration the DMA descriptors can be reused, so only a DMA kick operation is needed, which improves strong scaling for small data sizes (see the sketch below).
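A minimal sketch of this build-once, kick-per-iteration pattern for a 3D halo exchange; build_halo_descriptors(), peach2_dma_kick(), peach2_dma_wait(), and stencil_update() are hypothetical placeholders, not the actual TCA API.

```c
/* Hypothetical placeholders for the chained-DMA interface. */
struct dma_desc;  /* as sketched earlier: source, destination, next, length, flags */
struct dma_desc *build_halo_descriptors(void *gpu_field, int nx, int ny, int nz);
void peach2_dma_kick(struct dma_desc *head);   /* start the chained transfer   */
void peach2_dma_wait(struct dma_desc *head);   /* wait for the chain to finish */
void stencil_update(void *gpu_field, int nx, int ny, int nz);

/* Stencil time loop: the descriptor chain covering the XY (contiguous),
 * XZ (block-stride), and YZ (stride) halo planes is built once and then
 * only re-kicked each iteration. */
void run(void *gpu_field, int nx, int ny, int nz, int iters)
{
    struct dma_desc *halo = build_halo_descriptors(gpu_field, nx, ny, nz);
    for (int it = 0; it < iters; it++) {
        peach2_dma_kick(halo);             /* one kick sends every halo plane */
        peach2_dma_wait(halo);
        stencil_update(gpu_field, nx, ny, nz);
    }
}
```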

Summary
- TCA (Tightly Coupled Accelerators) enables direct communication among accelerators as an element technology, and is expected to become a basic technology for next-generation accelerated computing in the exascale era.
- The PEACH2 board is an implementation that realizes TCA with PCIe technology.
  Bandwidth: max. 3.5 GByte/s between CPUs (over 95% of the theoretical peak).
  Minimum latency: 0.9 us (PIO), 1.9 us (DMA between CPUs), 2.3 us (DMA between GPUs).
- GPU-GPU communication across nodes was demonstrated on a 16-node cluster: in the ping-pong benchmark, PEACH2 achieves lower latency than existing technology such as MVAPICH2 for small data sizes.
- HA-PACS/TCA, with 64 nodes and 4 GPUs per node, was installed at the end of Oct. 2013 as a proof system of the TCA architecture.
- Next steps: development of HPC applications using TCA, and production runs.