Tightly Coupled Accelerators Architecture

Yuetsu Kodama
Division of High Performance Computing Systems, Center for Computational Sciences, University of Tsukuba, Japan

What is Tightly Coupled Accelerators (TCA)?
- Concept: direct connection between accelerators (GPUs) across nodes
  - Eliminates extra memory copies to the host
  - Improves latency and strong scaling for small data sizes
- PCIe is used as the communication link between accelerators
  - Most accelerator devices and other I/O devices are attached to PCIe as endpoints (slave devices)
  - An intelligent PCIe device can logically enable an endpoint device to communicate directly with other endpoint devices
- PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - To configure a TCA system, each node is connected to the other nodes through a PEACH2 chip.

Design policy of PEACH2
- Implemented on an FPGA (Altera Stratix IV GX) with four PCIe Gen2 IP blocks, allowing prototyping and flexible enhancement
- Sufficient communication bandwidth: PCI Express Gen2 x8 on each port
- Sophisticated DMA controller with chaining DMA
- Latency reduction through hardwired logic and a low-overhead routing mechanism: efficient address mapping that uses unused bits of the PCIe address space, so a simple comparator can decide the output port (see the sketch below)
- Not only a proof-of-concept implementation: it is also intended for production runs in a GPU cluster.
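The slide only says that routing is a simple comparison on unused bits of the PCIe address; the concrete bit layout is not given. The following minimal C sketch illustrates the idea under the assumption that a destination node ID is encoded in high-order address bits; the field position, width, and ring-direction rule are hypothetical.

```c
#include <stdint.h>

/* Hypothetical encoding: bits [55:48] of the 64-bit PCIe address carry
 * the destination node ID; the remaining bits address memory there. */
#define NODE_ID_SHIFT 48
#define NODE_ID_MASK  0xFFull

enum port { PORT_N, PORT_E, PORT_W, PORT_S };

/* Decide the output port with plain comparisons, as a hardwired
 * comparator in the FPGA would: packets addressed to this node go to
 * the host port (N); packets for the other ring leave via S; otherwise
 * they are forwarded along the ring via E or W. */
static enum port route(uint64_t pcie_addr, uint8_t my_node, uint8_t ring_size)
{
    uint8_t dst = (pcie_addr >> NODE_ID_SHIFT) & NODE_ID_MASK;

    if (dst == my_node)
        return PORT_N;                            /* deliver locally            */
    if ((dst / ring_size) != (my_node / ring_size))
        return PORT_S;                            /* destination on other ring  */
    return (dst > my_node) ? PORT_E : PORT_W;     /* forward along this ring    */
}
```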

TCA node structure example
- PEACH2 can access every GPU in the node: NVIDIA Kepler architecture + CUDA 5.0 with GPUDirect support for RDMA
- Performance across QPI is quite poor, so direct access is supported only for GPU0 and GPU1
- Three nodes are connected through PEACH2
[Figure: node block diagram. Two Xeon E5 CPUs linked by QPI form a single PCIe address space; GPU0-GPU3 (NVIDIA K20/K20X, Kepler architecture) attach via PCIe Gen2 x16, the PEACH2 board via PCIe Gen2 x8 links, and an InfiniBand HCA via PCIe Gen3 x8.]

Overview of PEACH2 chip
- Fully compatible with the PCIe Gen2 specification; Root and Endpoint must be paired as the specification requires
- Port N: connected to the host and GPUs
- Ports E and W: form the ring topology
- Port S: connected to the other ring; selectable between Root and Endpoint
- All ports except Port N are write-only; instead, a proxy write on the remote node realizes a pseudo-read.
[Figure: chip block diagram with NIOS (CPU), DMAC, memory, and routing function; Port N faces the CPU and GPU side as Endpoint, Port W connects to a neighboring PEACH2 as Root Complex, Port E as Endpoint, and Port S as Root Complex / Endpoint.]
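The slide states only that remote reads are emulated by a proxy write carried out on the remote side; no message format is shown. The sketch below gives one way such a pseudo-read could be structured in C; the request layout, mailbox mechanism, and completion flag are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical pseudo-read: because all ports except Port N are
 * write-only, the requester cannot issue a PCIe read to a remote node.
 * Instead it writes a request, and the remote side writes the data back. */
struct read_request {
    uint64_t remote_src;   /* address to read on the remote node          */
    uint64_t reply_dst;    /* requester-side address to write the data to */
    uint64_t done_flag;    /* requester-side address of a completion flag */
    uint32_t length;       /* bytes to return                             */
};

/* Requester: store the request into a mailbox window on the remote node
 * (a PIO store through the mmap'ed PCIe window), then poll the local
 * completion flag.
 *
 * Remote side: on seeing the request, perform an ordinary write (the
 * proxy write) of 'length' bytes from remote_src to reply_dst, then
 * write the completion flag last so the requester observes completion. */
```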

Communication by PEACH2
- PIO: the CPU can store data directly to a remote node through an mmap'ed window.
- DMA
  - Chaining mode: DMA requests are prepared as DMA descriptors chained in host memory, and the hardware carries out the transactions automatically by following the descriptor chain.
  - Register mode: up to 16 DMA requests are registered directly in the PEACH2; lower overhead than chaining mode because descriptors do not have to be fetched from the host.
  - Block-stride transfer function
[Figure: descriptor chain. Each descriptor holds Source, Destination, Next, Length, and Flags fields; Descriptor0 through Descriptor(n-1) are linked through their Next pointers.]
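The slide names the descriptor fields (Source, Destination, Next, Length, Flags) but not their widths or encoding. The C sketch below shows how such a chain might be laid out and linked in host memory; the struct layout, the flag semantics, and the final doorbell write are assumptions for illustration, not the actual PEACH2 register interface.

```c
#include <stdint.h>

/* Hypothetical layout of one chained-DMA descriptor in host memory. */
struct peach2_desc {
    uint64_t src;     /* PCIe address of the source buffer               */
    uint64_t dst;     /* PCIe address of the destination buffer          */
    uint64_t next;    /* PCIe address of the next descriptor (0 = last)  */
    uint32_t length;  /* transfer length in bytes                        */
    uint32_t flags;   /* e.g. end-of-chain, interrupt-on-completion bits */
};

/* Link an array of descriptors into a chain the DMA engine can walk.
 * desc_bus[i] is the bus (PCIe) address at which descriptor i resides. */
static void link_chain(struct peach2_desc *d, const uint64_t *desc_bus, int n)
{
    for (int i = 0; i < n; i++)
        d[i].next = (i + 1 < n) ? desc_bus[i + 1] : 0;
}

/* Starting the transfer would then be a single "kick": writing the bus
 * address of Descriptor0 into a doorbell register on the PEACH2 board
 * (register name and offset are not shown on the slides). */
```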

PEACH2 board (Production version for HA-PACS/TCA)
- PCI Express Gen2 x8 peripheral board, compatible with the PCIe specification
[Figure: side view and top view photographs of the board.]

PEACH2 board (Production version for HA-PACS/TCA)
- Main board plus sub board
- FPGA: Altera Stratix IV 530GX; most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
[Figure: annotated board photo showing the PCI Express x8 card edge, the PCIe x16 and x8 cable connectors, the DDR3 SDRAM, and the power supplies for the various voltages.]

HA-PACS System
- TCA part: 5 racks x 2 lines, in operation since Nov. 2013
- Base Cluster: in operation since Feb. 2012

HA-PACS Total System
- HA-PACS Base Cluster: 268 nodes; 421 TFLOPS at 54% efficiency (41st on the June 2012 Top500), 1.15 GFLOPS/W
- HA-PACS/TCA: 64 nodes; 277 TFLOPS at 76% efficiency (134th on the Nov. 2013 Top500), 3.52 GFLOPS/W (3rd on the Nov. 2013 Green500)
- The base cluster and TCA are connected by InfiniBand QDR (40 ports x 2 channels)
[Figure: system diagram with 324-port and 108-port InfiniBand QDR switches and the shared Lustre filesystem.]

HA-PACS/TCA (Computation node)
- CPUs: 2 x Intel Ivy Bridge; AVX at 2.8 GHz x 8 flop/clock gives 22.4 GFLOPS per core, x 20 cores = 448.0 GFLOPS
- Memory: (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s (4 channels of 1,866 MHz per socket, 59.7 GB/s each)
- GPUs: 4 x NVIDIA K20X on PCIe Gen2 x16 (8 GB/s); 1.31 TFLOPS x 4 = 5.24 TFLOPS; (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s
- PEACH2 board (TCA interconnect) on PCIe Gen2 x8
- Node total: 5.688 TFLOPS

HA-PACS/TCA
- PEACH2 boards are installed and connected by cables
[Figure: front view (8 nodes per rack, 3U height) and rear view of a rack.]


TCA sub-cluster (16 nodes)
- TCA has four sub-clusters; each sub-cluster consists of two racks.
- 2x8 torus (one example): a ring of 8 nodes is formed between the East and West ports (orange links), and the two rings are connected at each node through the South ports (red links).
- The 32 GPUs in a sub-cluster can be used seamlessly, just like multiple GPUs within a node (only 2 GPUs per node are used because of the QPI bottleneck).
- Sub-clusters are connected by InfiniBand (QDR, 2 ports).

Evaluation items
- Ping-pong performance between nodes: latency and bandwidth, written as an application
- Comparison with MVAPICH2 1.9 (with CUDA support) and MVAPICH2-GDR (with GPUDirect support for RDMA) for GPU-to-GPU communication over InfiniBand (dual QDR x4, i.e., twice the bandwidth of TCA)
- To let the other device access GPU memory, GPUDirect support for RDMA in the CUDA 5 API is used; a special driver named the TCA p2p driver was developed to enable the memory mapping, and a PEACH2 driver was developed to control the board (see the sketch below).
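The slides do not show the TCA p2p driver itself. Purely as an illustration, the fragment below sketches how a kernel-mode driver typically pins a CUDA buffer and obtains its bus addresses through the GPUDirect RDMA interface (nv-p2p.h), which is the CUDA 5 era mechanism the slide refers to; the surrounding ioctl plumbing and how PEACH2 consumes the addresses are omitted.

```c
#include <linux/module.h>
#include "nv-p2p.h"   /* GPUDirect RDMA interface shipped with the NVIDIA driver */

/* Pin a GPU virtual-address range so a third-party PCIe device (here,
 * PEACH2) can access the GPU memory directly. gpu_va and len should be
 * 64 KB aligned per the GPUDirect RDMA documentation; free_cb is called
 * if the CUDA mapping is torn down behind the driver's back. */
static int tca_p2p_map(uint64_t gpu_va, uint64_t len,
                       struct nvidia_p2p_page_table **pt,
                       void (*free_cb)(void *), void *ctx)
{
    int ret = nvidia_p2p_get_pages(0, 0, gpu_va, len, pt, free_cb, ctx);
    if (ret)
        return ret;

    /* Each (*pt)->pages[i]->physical_address can now be programmed into
     * a DMA descriptor as a source or destination address. */
    return 0;
}

static void tca_p2p_unmap(uint64_t gpu_va, struct nvidia_p2p_page_table *pt)
{
    nvidia_p2p_put_pages(0, 0, gpu_va, pt);
}
```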

Ping-pong Latency
- Minimum latency: PIO (CPU to CPU) 0.9 us; DMA CPU to CPU 1.9 us; DMA GPU to GPU 2.3 us (cf. MVAPICH2 1.9: 19 us, MVAPICH2-GDR: 6 us)
[Figure: latency vs. message size for CPU(IVB), GPU(SB), GPU(IVB), MV2-GPU(SB), and MV2GDR-GPU(IVB).]

Ping-pong Bandwidth
- CPU-to-CPU DMA: max. 3.5 GB/s (95% of the theoretical peak)
- GPU-to-GPU DMA: max. 2.6 GB/s
- GPU(SB) saturates at 880 MB/s because of the poor performance of the PCIe switch in the (Sandy Bridge) CPU
- GPU(IVB) is faster than MVAPICH2-GDR for message sizes below 512 KB
[Figure: bandwidth vs. message size for CPU(IVB), GPU(SB), GPU(IVB), MV2-GPU(SB), and MV2GDR-GPU(IVB).]

Programming for TCA cluster
- Data transfers to a remote GPU within TCA can be treated like multi-GPU transfers within a node.
- Particularly suitable for stencil computation: the direct network gives good performance for nearest-neighbor communication.
- Chaining DMA can bundle the data transfers for all halo planes into one DMA: XY-plane (contiguous array), XZ-plane (block stride), YZ-plane (stride).
- In each iteration the DMA descriptors can be reused, so only a DMA kick operation is needed, which improves strong scaling for small data sizes (see the sketch below).
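To make the descriptor-reuse pattern concrete, the sketch below fills one descriptor per halo plane of an NX x NY x NZ array once, chains them, and then only re-kicks the chain each iteration. It reuses the hypothetical struct peach2_desc and link_chain() from the earlier sketch; the array dimensions, offsets, and the kick/wait calls are illustrative, not the real TCA API.

```c
#include <stdint.h>
#include <stddef.h>

/* struct peach2_desc and link_chain() as declared in the earlier sketch. */

enum { NX = 256, NY = 256, NZ = 256 };   /* illustrative problem size */

/* Build the three halo-plane descriptors once. local_base/remote_base are
 * the PCIe addresses of the boundary data on the local and remote GPU. */
void setup_halo_chain(struct peach2_desc d[3], const uint64_t desc_bus[3],
                      uint64_t local_base, uint64_t remote_base)
{
    const size_t elem = sizeof(double);

    /* XY-plane (z = const): one contiguous block of NX*NY elements. */
    d[0].src = local_base;
    d[0].dst = remote_base;
    d[0].length = NX * NY * elem;

    /* XZ-plane (y = const): block-stride transfer, NZ blocks of NX
     * elements each (the stride parameters would live in additional
     * descriptor fields not shown here). */
    d[1].src = local_base;
    d[1].dst = remote_base + NX * NY * elem;
    d[1].length = NX * NZ * elem;

    /* YZ-plane (x = const): strided transfer of NY*NZ elements. */
    d[2].src = local_base;
    d[2].dst = remote_base + (NX * NY + NX * NZ) * elem;
    d[2].length = NY * NZ * elem;

    d[0].flags = d[1].flags = d[2].flags = 0;
    link_chain(d, desc_bus, 3);   /* bundle the three planes into one DMA */
}

/* Per-iteration cost is then a single doorbell write:
 *
 *   run_stencil_kernel();          // update the interior on the GPU
 *   peach2_dma_kick(desc_bus[0]);  // hypothetical kick of the reused chain
 *   wait_for_dma_completion();     // hypothetical
 */
```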

Current activities
- Developing an API for user programming similar to the cudaMemcpy API: it lets the GPUs in a sub-cluster be used as seamlessly as multiple GPUs in a node (see the sketch below).
- XMP for TCA: developing XcalableMP (XMP) support for TCA in cooperation with RIKEN AICS.
- Function offloading on TCA: a reduction mechanism between GPUs in a sub-cluster will be offloaded onto TCA, in cooperation with the Amano laboratory at Keio University and the astrophysics group at CCS.
- QUDA (QCD libraries for CUDA): TCA support will be added to QUDA in cooperation with NVIDIA.
- Supported by the JST/CREST program "Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era".
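The slides do not show this API. The fragment below only illustrates the stated intent: that copying to a GPU on another node of the sub-cluster should look like the intra-node peer copy programmers already use. cudaMemcpyPeer() is the real CUDA runtime call; tcaMemcpyToRemote() and its parameters are hypothetical placeholders, not the actual TCA API.

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Intra-node today: a peer-to-peer copy between two GPUs in one node. */
void copy_within_node(void *dst, int dst_gpu,
                      const void *src, int src_gpu, size_t n)
{
    cudaMemcpyPeer(dst, dst_gpu, src, src_gpu, n);
}

/* Hypothetical TCA analogue (placeholder prototype): the same one-line
 * copy, but the destination GPU lives on another node and the transfer
 * travels over PEACH2 instead of the node-local PCIe switch. */
int tcaMemcpyToRemote(int dst_node, int dst_gpu, size_t dst_offset,
                      const void *src, size_t n);
```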

Summary
- TCA (Tightly Coupled Accelerators) enables direct communication among accelerators as an element technology, and becomes a basic technology for next-generation accelerated computing in the exascale era.
- PEACH2 board: an implementation that realizes TCA using PCIe technology.
  - Bandwidth: max. 3.5 GB/s between CPUs (over 95% of the theoretical peak)
  - Minimum latency: 0.9 us (PIO), 1.9 us (DMA between CPUs), 2.3 us (DMA between GPUs)
  - GPU-to-GPU communication across nodes was demonstrated on a 16-node cluster; in the ping-pong program, PEACH2 achieves lower latency than existing technologies such as MVAPICH2 for small data sizes.
- HA-PACS/TCA, with 64 nodes and 4 GPUs per node, was installed at the end of Oct. 2013 as a proof system of the TCA architecture; HPC applications using TCA are being developed toward production runs.


Related Work
- Non-Transparent Bridge (NTB): appends a bridge function to a downstream port of a PCIe switch. Inflexible, because the host must recognize it during the BIOS scan; it is not defined in the PCIe standard and is incompatible across vendors.
- APEnet+ (Italy): direct GPU copies using Fermi GPUs, with a protocol different from TCA's; GPU-to-GPU latency is around 5 us(?). Custom 3-D torus network with QSFP+ cables.
- MVAPICH2 + GPUDirect (CUDA 5 + Kepler): GPU-to-GPU latency is reported as 6 us.