Interconnection Network for Tightly Coupled Accelerators Architecture

1 Interconnection Network for Tightly Coupled Accelerators Architecture
Toshihiro Hanawa, Yuetsu Kodama, Taisuke Boku, Mitsuhisa Sato
Center for Computational Sciences, University of Tsukuba, Japan

2 What is Tightly Coupled Accelerators (TCA)?
- Concept: direct connection between accelerators (GPUs) across nodes
- PCIe is used as the communication link between accelerators across nodes
- PEACH2: PCI Express Adaptive Communication Hub ver. 2

3 Design policy of PEACH2
- Implemented on an FPGA with four PCIe Gen2 IPs
- Sufficient communication bandwidth
- Latency reduction

4 TCA node structure example
- PEACH2 can access all GPUs
- Three nodes are connected using PEACH2
- GPU: NVIDIA K20, K20X (Kepler architecture)
[Figure: node block diagram. Two CPUs (Xeon E5) linked by QPI; GPU 0 to GPU 3, each on a PCIe Gen2 x16 link; PEACH2 with PCIe Gen2 x8 links; IB HCA on a PCIe Gen3 x8 link. Diagram label: "Single PCIe address".]

5 Overview of PEACH2 chip
- Fully compatible with the PCIe Gen2 spec.
- Root Complex and Endpoint must be paired according to the PCIe spec.
- Port N: connected to the host and GPUs
- Ports E and W: form the ring topology
- Port S: connected to the other ring
- Write only, except on Port N
[Figure: chip block diagram. Port N: CPU & GPU side (Endpoint). Port E: to PEACH2 (Endpoint). Port W: to PEACH2 (Root Complex). Port S: to PEACH2 (Root Complex / Endpoint). Internal blocks: NIOS (CPU), DMAC, memory, routing function.]

6 Communication by PEACH2
- PIO
- DMA
[Figure: chained DMA descriptors, Descriptor0 through Descriptor(n-1); each descriptor holds Source, Destination, Next, Length, and Flags fields.]
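The descriptor fields named on this slide (Source, Destination, Next, Length, Flags) suggest a linked chain that the DMA controller walks until it reaches the last entry. The following is a minimal software sketch of that idea only, not the PEACH2 hardware format: the `Descriptor` class, the `run_chain` function, the use of a table index for Next, and the flat `bytearray` standing in for memory are all assumptions made for illustration, and Flags is carried but not interpreted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    """Hypothetical software model of one chained-DMA descriptor."""
    source: int                        # byte offset to read from
    destination: int                   # byte offset to write to
    length: int                        # number of bytes to copy
    flags: int = 0                     # carried along, not interpreted here
    next_index: Optional[int] = None   # next descriptor in the table; None ends the chain


def run_chain(memory: bytearray, table: List[Descriptor], start: int = 0) -> int:
    """Walk the chain from `start`, copy each segment, return total bytes moved."""
    moved = 0
    index: Optional[int] = start
    visited = set()
    while index is not None:
        if index in visited:
            raise ValueError("descriptor chain contains a loop")
        visited.add(index)
        d = table[index]
        if max(d.source, d.destination) + d.length > len(memory):
            raise ValueError("descriptor points outside memory")
        # The right-hand slice is copied before assignment, so overlap is safe.
        memory[d.destination:d.destination + d.length] = memory[d.source:d.source + d.length]
        moved += d.length
        index = d.next_index
    return moved


# Two descriptors chained together copy 8 bytes in a single run.
memory = bytearray(b"ABCDEFGH" + bytes(8))
table = [
    Descriptor(source=0, destination=8, length=4, next_index=1),
    Descriptor(source=4, destination=12, length=4),
]
moved = run_chain(memory, table)
```

Chaining lets software set up many transfers and start them once, which is the property the later "Bundle to 1 DMA" remark relies on.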

7 PEACH2 board (Production version for HA-PACS/TCA)
- PCI Express Gen2 x8 peripheral board
- Compatible with the PCIe spec.
[Figure: photographs of the board, top view and side view.]

8 PEACH2 board (Production version for HA-PACS/TCA)
- Main board + sub board
- FPGA (Altera Stratix IV 530GX); most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
- PCI Express x8 card edge
- DDR3 SDRAM
- Power supplies for the various voltages
- PCIe x16 cable connector
- PCIe x8 cable connector

9 Performance evaluation environment: 8-node GPU cluster (TCAMINI)
- CPU: Intel Xeon E5 (Sandy Bridge-EP) 2.6 GHz x 2 sockets
- MB: SuperMicro X9DRG-QF
- Memory: DDR3 128 GB
- OS: CentOS 6.3 (kernel el6.x86_64)
- GPU: NVIDIA K20, GDDR5 5 GB x 1
- CUDA: 5.0, NVIDIA-Linuxx86_
- PEACH2 board: Altera Stratix IV 530GX
- MPI: MVAPICH2 1.9 with IB FDR10

10 Evaluation items
- Ping-pong performance between nodes
- To let another device access GPU memory, the GPUDirect support for RDMA in the CUDA 5 API is used.
- A special driver, the TCA p2p driver, was developed to enable memory mapping.
- A PEACH2 driver to control the board was also developed.

11 Ping-pong latency
Minimum latency (nearest-neighbor communication):
- PIO, CPU to CPU: 0.9 usec
- DMA, CPU to CPU: 1.9 usec
- DMA, GPU to GPU: 2.3 usec (cf. MVAPICH2 1.9: 19 usec)
- PIO gives good performance for small data (< 64 B)
[Figure: latency (usec) versus data size (bytes) for PIO, DMA (CPU), and DMA (GPU).]

12 Ping-pong latency
Minimum latency (nearest-neighbor communication):
- PIO, CPU to CPU: 0.9 usec
- DMA, CPU to CPU: 1.9 usec
- DMA, GPU to GPU: 2.3 usec (cf. MVAPICH2 1.9: 19 usec)
- Forwarding overhead: 200 to 300 nsec
[Figure: latency (usec) versus data size (bytes) for DMA Direct, DMA 1 hop, DMA 2 hop, DMA 3 hop, and DMA (CPU).]
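As a rough worked example of what the forwarding overhead means for multi-hop transfers, the sketch below adds 200 to 300 nsec per forwarding hop to the minimum latencies quoted above. The additive per-hop model is an assumption, as are the names `BASE_LATENCY_USEC`, `FORWARDING_OVERHEAD_USEC`, and `estimate_latency_usec`; the slide gives only the overhead range and the nearest-neighbor minimums.

```python
# Minimum nearest-neighbor latencies quoted on the slide, in microseconds.
BASE_LATENCY_USEC = {"pio_cpu": 0.9, "dma_cpu": 1.9, "dma_gpu": 2.3}

# Forwarding overhead range quoted on the slide (200 to 300 nsec), in microseconds.
FORWARDING_OVERHEAD_USEC = (0.2, 0.3)


def estimate_latency_usec(kind: str, hops: int) -> tuple:
    """Return a (low, high) latency estimate, assuming overhead adds up per hop."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    base = BASE_LATENCY_USEC[kind]
    low, high = FORWARDING_OVERHEAD_USEC
    return (base + hops * low, base + hops * high)


direct = estimate_latency_usec("dma_cpu", 0)
three_hops = estimate_latency_usec("dma_cpu", 3)
```

Under this assumption a three-hop CPU-to-CPU DMA transfer would land at roughly 2.5 to 2.8 usec, still far below the 19 usec quoted for MVAPICH2 1.9.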

13 Ping-pong bandwidth
- Max. 3.5 Gbyte/sec
- Max Payload Size = 256 bytes
- Theoretical peak: 4 Gbyte/sec x 256 / (256 + packet overhead) = 3.66 Gbyte/sec
- GPU-to-GPU DMA saturates at about 880 Mbyte/sec
[Figure: bandwidth (Mbytes/sec) versus data size (bytes) for DMA Direct, DMA 1 hop, DMA 2 hop, DMA 3 hop, GPU Direct, and MVA2 GPU, with the 3.5 Gbyte/sec and 880 Mbyte/sec levels marked.]
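The 3.66 Gbyte/sec figure can be reproduced by scaling the 4 Gbyte/sec raw rate of a PCIe Gen2 x8 link by the payload fraction of each packet. The denominator did not survive on the slide, so the 24-byte overhead used below is an assumption: it is the usual Gen2 framing for a memory write with 64-bit addressing (16-byte header, 2-byte sequence number, 4-byte LCRC, 1 start and 1 end symbol), and it happens to match the stated result. The function and constant names are illustrative.

```python
def effective_bandwidth_gbyte_s(raw_gbyte_s: float, payload_bytes: int,
                                overhead_bytes: int) -> float:
    """Scale the raw link rate by the share of each packet that is payload."""
    return raw_gbyte_s * payload_bytes / (payload_bytes + overhead_bytes)


# Assumed per-packet overhead: header + sequence number + LCRC + start/end framing.
ASSUMED_OVERHEAD_BYTES = 16 + 2 + 4 + 1 + 1

peak = effective_bandwidth_gbyte_s(4.0, 256, ASSUMED_OVERHEAD_BYTES)

# Share of that peak reached by the measured 3.5 Gbyte/sec maximum.
measured_share = 3.5 / peak
```

With these numbers `peak` rounds to 3.66, so the measured 3.5 Gbyte/sec is about 96% of the theoretical peak.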

14 Programming for TCA cluster
- Data transfer to a remote GPU within TCA can be treated like a transfer to a local GPU.
- Particularly suitable for stencil computation => improves strong scaling with small data sizes
- Transfers are bundled into 1 DMA
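One way to see why bundling helps stencil codes: a boundary column of a row-major grid is not contiguous in memory, so exchanging it means many small transfers. The sketch below lists the (byte offset, length) segments for one such column; each segment could then become one entry in a single chained DMA. This is an illustrative reading of the bundling remark, not the TCA API, and `column_halo_segments` and its parameters are hypothetical names.

```python
def column_halo_segments(rows: int, cols: int, col: int, elem_size: int = 8) -> list:
    """List (byte_offset, byte_length) for each element of one grid column.

    The grid is assumed to be stored row-major, so consecutive elements of a
    column are separated by one full row of bytes.
    """
    if not 0 <= col < cols:
        raise ValueError("col is outside the grid")
    row_stride = cols * elem_size
    return [(r * row_stride + col * elem_size, elem_size) for r in range(rows)]


# East boundary column of a 4 x 6 grid of 8-byte values.
segments = column_halo_segments(rows=4, cols=6, col=5)
total_bytes = sum(length for _, length in segments)
```

Here four 8-byte segments make up a 32-byte halo; issuing them as one bundled DMA avoids paying the per-transfer startup latency four times, which matters most at the small data sizes where strong scaling is limited.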

15 Related work
- Non-Transparent Bridge (NTB)
- APEnet+ (Italy)
- MVAPICH2 + GPUDirect

16 Summary
- TCA: Tightly Coupled Accelerators
- PEACH2 board: implementation for realizing TCA using PCIe technology
- HA-PACS/TCA with 64 nodes will be installed at the end of Oct
