Design of Scalable Network Considering Diameter and Cable Delay

Design of Scalable Network Considering Diameter and Cable Delay. Kentaro Sano, Tohoku University, JAPAN.

Agenda: Introduction; Assumptions; Preliminary evaluation and candidate networks; Cable length and delay; Simulator and emulator; Summary.

Introduction: A feasibility study ran from 2012 to 2014, with three teams working on next-generation supercomputers. Within the Tohoku-NEC-JAMSTEC team, a working group for the interconnection network subsystem was formed by Tohoku University and Osaka University.

Background and Objective: Future systems will have more nodes with higher per-node performance, requiring a high-performance, scalable network. Application demands include global/collective communication, local communication (e.g., point-to-point with 3D decomposition), usability, performance robustness, and scalability. Goal: find a network (NW) for next-generation supercomputers by exploring the design space under application demands and technology constraints. We target a small-diameter NW built from high-radix switches (SWs) that is also good at local point-to-point communication, evaluated for performance, cost, power, usability, and reliability.

Assumptions for Design Space Exploration: System scale of ~65,536 SMP nodes. Technology: ~64x64 full-crossbar switches and 10+ GB/s per link, following the InfiniBand technology roadmap, with n network planes per SMP node. Candidate topologies: fat tree or hybrid NW. Each 64x64 switch has input queues 1-64 feeding a full crossbar with output buffers, and uses virtual cut-through switching with virtual channels.
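As a reference for the sketches that follow, these assumptions can be collected into a small configuration object; the snippet is illustrative only (identifier names are mine, not from the talk).

```python
# Hypothetical configuration object collecting the stated assumptions
# (identifier names are illustrative, not from the talk).
from dataclasses import dataclass

@dataclass
class SystemAssumptions:
    nodes: int = 65536           # ~65,536 SMP nodes
    switch_radix: int = 64       # 64x64 full-crossbar switch
    link_bw_gbs: float = 10.0    # 10+ GB/s per link (IB roadmap)
    network_planes: int = 1      # n network planes per SMP node

cfg = SystemAssumptions()
print(cfg.nodes // cfg.switch_radix)       # 1024 switches just to attach all nodes
print(cfg.nodes * cfg.link_bw_gbs / 1000)  # 655.36 TB/s aggregate injection BW
```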

Preliminary Evaluation: Typical topologies considered are the full fat tree, the 3D and 5D torus, and Dragonfly.

Comparison of Topologies (cable delay not considered):

Topology                 | Full fat-tree           | 3D torus     | 5D torus     | Dragonfly
Nodes                    | 65,536                  | 65,536       | 65,536       | 65,536
Organization             | 3 stages (64 x 32 x 32) | 64x32x32     | 16x8x8x8x8   | all-to-all (1D 16, 2D 16x16)
Node injection BW [GB/s] | 10                      | 10           | 10           | 10
Bisection BW [TB/s]      | 320                     | 20           | 80           | 160
Min-max hops             | 2-6                     | 1-63         | 1-23         | 2-5
Min-max delay [ns]       | 100-500                 | 100-6300     | 100-2300     | 100-400
Links                    | 196,608                 | 196,608      | 1,310,720    | 468,736
Switches                 | 5,120                   | within nodes | within nodes | 4,096

Takeaways: the low-dimensional torus has too large a diameter, while the high-dimensional torus and Dragonfly need too many links. The fat tree looks good, but may require long cables.
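As a sanity check on the torus rows of this table, bisection bandwidth and diameter follow directly from the torus dimensions. The sketch below is my own arithmetic, assuming 10 GB/s links and wrap-around channels; the hop counts it prints can differ by one from the slide depending on how node-to-switch hops are counted.

```python
# Rough sanity check of the torus rows in the comparison table.
# Assumes 10 GB/s per link; hop-counting conventions may differ
# by one from the slide (node-to-switch links, etc.).

def torus_stats(dims, link_bw_gbs=10.0):
    nodes = 1
    for d in dims:
        nodes *= d
    # Diameter: at most floor(d/2) wrap-around hops per dimension.
    diameter = sum(d // 2 for d in dims)
    # Bisection: cut the largest dimension in half; 2 * (nodes / d_max)
    # links cross the cut thanks to the wrap-around channels.
    d_max = max(dims)
    bisection_links = 2 * nodes // d_max
    bisection_tbs = bisection_links * link_bw_gbs / 1000.0
    return nodes, diameter, bisection_tbs

print(torus_stats((64, 32, 32)))      # (65536, 64, 20.48)  ~ 20 TB/s
print(torus_stats((16, 8, 8, 8, 8)))  # (65536, 24, 81.92)  ~ 80 TB/s
```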

Full Fat Tree: Small diameter, but high latency when traversing the spine SWs. The maximum hop count stays small, especially with high-radix SWs, but cable length grows with the number of nodes. In the example configuration (64 links per switch, 10 GB/s/link), 32 edge SWs and 32 upper-stage SWs form an island of 1,024 nodes (32 nodes per edge SW); 64 islands give 65,536 nodes with a maximum of 6 hops via the spine.
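The 5,120 switches and 196,608 links in the comparison table can be re-derived from the 64-port building block; this is a sketch under the slide's parameters (32 nodes per edge switch, full bisection bandwidth), not the authors' tool.

```python
# Switch/link counts for a 3-stage full fat tree built from
# 64-port switches with 32 nodes per edge switch (slide parameters).

def full_fat_tree(nodes=65536, radix=64):
    down = radix // 2              # 32 down-links per edge/aggregation switch
    edge = nodes // down           # 2048 edge switches
    agg = edge                     # 2048 aggregation switches (full bandwidth)
    spine = nodes // radix         # 1024 spine switches
    switches = edge + agg + spine  # 5120
    # One link per node at each of the three levels (full bisection):
    links = 3 * nodes              # 196,608
    return switches, links

print(full_fat_tree())  # (5120, 196608)
```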

Another Candidate: FTT Hybrid (FTT: Fat Tree & Torus). A hierarchical network: each group is a local 2-stage fat tree of 256 nodes, so only short cables are needed within a group. The global NW is a 2D torus of 16x16 such groups, with short cables connecting adjacent groups and 512 links between neighboring groups. Expected advantages: shorter cables, and an expandable, flexible system.
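The 1-16 global hop range quoted on a later slide comes from shortest paths on the 16x16 group torus; below is a minimal helper of my own, assuming group-index coordinates.

```python
# Hop distance between groups on the 16x16 global 2D torus
# (illustrative helper; coordinates are group indices).

def torus_distance(a, b, dims=(16, 16)):
    dist = 0
    for x, y, k in zip(a, b, dims):
        d = abs(x - y)
        dist += min(d, k - d)  # wrap-around (torus) shortest path
    return dist

print(torus_distance((0, 0), (8, 8)))  # 16: worst case, 8 hops per axis
print(torus_distance((3, 5), (4, 5)))  # 1: adjacent groups
```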

Comparison Summary:
- Fat tree: general-purpose, high usability; note: high cable delay?
- Low-D torus: good cost performance.
- High-D torus: extendability.
- Dragonfly: pseudo high-radix NW.
- FTT-hybrid: combination of fat tree and torus; note: low cable delay?
Next step: a detailed, quantitative evaluation of the full fat tree and the FTT-hybrid, considering more details of implementation and applications.

Cable Length and Delay: Preliminary estimation based on the expected implementation: boards (node, switch), cabinets (node, switch), floor layout, and cabling. [Figure: FTT-hybrid floor-layout example with cabinets C0-C3.]

Preliminary Result: Estimated cable lengths and delays per stage: stage 1 (node to edge switch) 2 m, 10 ns; stage 2 10-20 m, 50-100 ns; stage 3 (to the spine switches) 80 m, 400 ns. In the FTT-hybrid, the global 2D torus adds 1-16 hops over 15-20 m (75-100 ns) cables. The fraction of destination nodes in each distance class (same edge switch / same island or group / remote) is 0.05% / 1.5% / 98.4% for the fat tree (max 6 hops) and 0.05% / 0.33% / 99.6% for the FTT-hybrid (max 20 hops). There is no big difference in the maximum cable delay: fat tree = 1020 ns + (5 x SW delay); hybrid = 1395 ns + (19 x SW delay). The hybrid, however, can have shorter delay for local p2p communication.
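The 1020 ns figure is consistent with summing cable delays along the worst-case fat-tree path at about 5 ns/m (implied by the 2 m / 10 ns and 80 m / 400 ns pairs above). A sketch of that arithmetic, with the per-hop lengths read off the slide:

```python
# Worst-case end-to-end delay model: sum of cable delays plus a
# per-switch delay for each switch traversed. Cable lengths are the
# slide's estimates; 5 ns/m propagation is implied by 2 m -> 10 ns.

NS_PER_M = 5.0

def path_delay_ns(cable_lengths_m, n_switches, sw_delay_ns):
    cable = sum(cable_lengths_m) * NS_PER_M
    return cable + n_switches * sw_delay_ns

# Fat tree, max 6 hops (5 switches): node->edge, edge->agg, agg->spine,
# then back down. Lengths per hop: 2, 20, 80, 80, 20, 2 m.
fat_tree_path = [2, 20, 80, 80, 20, 2]
print(path_delay_ns(fat_tree_path, 5, 0))  # 1020.0 ns + 5 x SW delay
```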

Example of 3D Mesh Communication: With a 3D decomposition and adjacent communication, the x and y directions map onto the global 2D torus of 16x16 groups and z onto the local 256-node group (x and y can also be assigned within a group). Data exchange among 3D subgrids then costs: latency (4 hops) = 195 ns + (4 x SW delay) for x and y, and 120 ns + (3 x SW delay) for z, which is much shorter than the fat tree's 1020 ns + (5 x SW delay) for x, y, and z.
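Under the same ~5 ns/m model, the neighbor-exchange numbers correspond to short per-hop cables. The lengths below are my reconstruction (they reproduce the slide's totals but are not stated there explicitly):

```python
# Neighbor-exchange latency on the FTT-hybrid, reusing the 5 ns/m model.
# The per-hop lengths below are a plausible reconstruction, not slide data.
NS_PER_M = 5.0

def path_delay_ns(cable_lengths_m):
    return sum(cable_lengths_m) * NS_PER_M

print(path_delay_ns([2, 15, 20, 2]))  # 195.0 ns + 4 x SW delay (x, y)
print(path_delay_ns([2, 20, 2]))      # 120.0 ns + 3 x SW delay (z)
```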

Quantitative Evaluation (ongoing): Both tools model a 64-port switch (input ports 0-63, output ports 0-63) whose latency decomposes into routing delay, switching delay, transferring delay, and buffering delay, plus Tx/Rx delay at the links. Software simulator (OPNET-based): its purpose is to get rough results quickly and validate collective communication; a rough model with simple arbitration, no back pressure, and a limited NW size of ~8,192 nodes. Hardware emulator (FPGA-based): obtains detailed results with a cycle-accurate model, including real arbitration, flit-level transmission, and back pressure, at large NW sizes up to ~65,536 nodes.
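The delay decomposition on this slide suggests a simple additive per-switch model; the sketch below is a minimal illustration with made-up numbers, not the OPNET model or the FPGA RTL.

```python
# Minimal additive per-switch latency model mirroring the slide's
# decomposition (illustrative only; not the OPNET model or the RTL).
from dataclasses import dataclass

@dataclass
class SwitchDelays:
    routing_ns: float    # route computation
    switching_ns: float  # arbitration and crossbar traversal
    transfer_ns: float   # serialization onto the output link
    buffering_ns: float  # input-queue / output-buffer residency

    def total(self) -> float:
        return (self.routing_ns + self.switching_ns
                + self.transfer_ns + self.buffering_ns)

def end_to_end_ns(cable_ns, switches):
    """Cable delay plus the sum of per-switch delays along the path."""
    return cable_ns + sum(sw.total() for sw in switches)

sw = SwitchDelays(20, 30, 40, 10)     # made-up example numbers
print(end_to_end_ns(1020, [sw] * 5))  # fat-tree worst case: 1520.0 ns
```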

Hardware Emulator Overview: An FPGA cluster with 4 host PCs, 4 FPGAs per PC, and 4 x 10G SFP+ ports per FPGA; nodes are implemented in software on the Linux hosts, switches in hardware on the FPGAs. Each DE5-NET board carries an Altera Stratix V FPGA (5SGXEA7N2F45C2) with four SFP+ 10GbE ports (A-D, 10 Gbps+ each for Tx and Rx), PCI Express 3.0 x8 (8 GB/s Tx/Rx), two DDR3 DRAM channels (PC3-12800 / DDR3-1600, 12.8 GB/s each, x64 @ 800 MHz DDR, up to 1066 MHz; 2 GB by default, up to 8 GB), and four QDR II+ SRAMs (18 Mbit each, x18 @ 500 MHz, 1 GB/s for read/write, 20-bit addressing for 18-bit data). Other nodes are not installed yet.

Hardware Emulator Overview (photo): One node of the FPGA cluster, with 4 FPGA boards, SFP+ 10GbE ports, and a 64-port 10GbE switch. Other nodes are not installed yet.

Summary: We explored the design space of small-diameter NWs built from high-radix switches, under technology constraints and application demands (global and local p2p communication). Two candidates remain after the topology comparison, the full fat tree and the FTT-hybrid, for which we made a preliminary evaluation of cable length and delay. Future (ongoing) work: quantitative evaluation with simulation and emulation, and application performance estimation.

Thank you!