Design of Scalable Network Considering Diameter and Cable Delay

Design of Scalable Network Considering Diameter and Cable Delay. Kentaro Sano, Tohoku University, JAPAN.

Agenda: Introduction; Assumptions; Preliminary evaluation and candidate networks; Cable length and delay; Simulator and emulator; Summary.

Introduction: A feasibility study ran from 2012 to 2014, with three teams working on next-generation supercomputers. Within the Tohoku-NEC-JAMSTEC team, a working group for the interconnection network subsystem was formed by Tohoku University and Osaka University.

Background and Objective: Future systems will have more nodes with higher per-node performance, requiring a high-performance, scalable network. Application demands include global/collective communication, local communication (e.g., point-to-point with 3D decomposition), usability, performance robustness, and scalability. Goal: find a network (NW) for next-generation supercomputers by exploring the design space under application demands and technology constraints. We target a small-diameter NW built from high-radix switches (SWs) that is also good at local point-to-point communication, evaluated for performance, cost, power, usability, and reliability.

Assumptions for Design Space Exploration: System scale of ~65,536 SMP nodes. Technology: ~64x64 full-crossbar switches and 10+ GB/s per link, following the InfiniBand technology roadmap, with n network planes per SMP node. Candidate topologies: fat tree or hybrid NW. Each 64x64 switch has input queues 1-64 feeding a full crossbar with output buffers, and uses virtual cut-through switching with virtual channels.
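As a reference for the sketches that follow, these assumptions can be collected into a small configuration object; the snippet is illustrative only (identifier names are mine, not from the talk).

```python
# Hypothetical configuration object collecting the stated assumptions
# (identifier names are illustrative, not from the talk).
from dataclasses import dataclass

@dataclass
class SystemAssumptions:
    nodes: int = 65536           # ~65,536 SMP nodes
    switch_radix: int = 64       # 64x64 full-crossbar switch
    link_bw_gbs: float = 10.0    # 10+ GB/s per link (IB roadmap)
    network_planes: int = 1      # n network planes per SMP node

cfg = SystemAssumptions()
print(cfg.nodes // cfg.switch_radix)       # 1024 switches just to attach all nodes
print(cfg.nodes * cfg.link_bw_gbs / 1000)  # 655.36 TB/s aggregate injection BW
```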

Preliminary Evaluation: Typical topologies considered are the full fat tree, the 3D and 5D torus, and Dragonfly.

Comparison of Topologies (cable delay not considered):

Topology                 | Full fat-tree           | 3D torus     | 5D torus     | Dragonfly
Nodes                    | 65,536                  | 65,536       | 65,536       | 65,536
Organization             | 3 stages (64 x 32 x 32) | 64x32x32     | 16x8x8x8x8   | all-to-all (1D 16, 2D 16x16)
Node injection BW [GB/s] | 10                      | 10           | 10           | 10
Bisection BW [TB/s]      | 320                     | 20           | 80           | 160
Min-max hops             | 2-6                     | 1-63         | 1-23         | 2-5
Min-max delay [ns]       | 100-500                 | 100-6300     | 100-2300     | 100-400
Links                    | 196,608                 | 196,608      | 1,310,720    | 468,736
Switches                 | 5,120                   | within nodes | within nodes | 4,096

Takeaways: the low-dimensional torus has too large a diameter, while the high-dimensional torus and Dragonfly need too many links. The fat tree looks good, but may require long cables.
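As a sanity check on the torus rows of this table, bisection bandwidth and diameter follow directly from the torus dimensions. The sketch below is my own arithmetic, assuming 10 GB/s links and wrap-around channels; the hop counts it prints can differ by one from the slide depending on how node-to-switch hops are counted.

```python
# Rough sanity check of the torus rows in the comparison table.
# Assumes 10 GB/s per link; hop-counting conventions may differ
# by one from the slide (node-to-switch links, etc.).

def torus_stats(dims, link_bw_gbs=10.0):
    nodes = 1
    for d in dims:
        nodes *= d
    # Diameter: at most floor(d/2) wrap-around hops per dimension.
    diameter = sum(d // 2 for d in dims)
    # Bisection: cut the largest dimension in half; 2 * (nodes / d_max)
    # links cross the cut thanks to the wrap-around channels.
    d_max = max(dims)
    bisection_links = 2 * nodes // d_max
    bisection_tbs = bisection_links * link_bw_gbs / 1000.0
    return nodes, diameter, bisection_tbs

print(torus_stats((64, 32, 32)))      # (65536, 64, 20.48)  ~ 20 TB/s
print(torus_stats((16, 8, 8, 8, 8)))  # (65536, 24, 81.92)  ~ 80 TB/s
```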

Full Fat Tree: Small diameter, but high latency when traversing the spine SWs. The maximum hop count stays small, especially with high-radix SWs, but cable length grows with the number of nodes. In the example configuration (64 links per switch, 10 GB/s/link), 32 edge SWs and 32 upper-stage SWs form an island of 1,024 nodes (32 nodes per edge SW); 64 islands give 65,536 nodes with a maximum of 6 hops via the spine.
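The 5,120 switches and 196,608 links in the comparison table can be re-derived from the 64-port building block; this is a sketch under the slide's parameters (32 nodes per edge switch, full bisection bandwidth), not the authors' tool.

```python
# Switch/link counts for a 3-stage full fat tree built from
# 64-port switches with 32 nodes per edge switch (slide parameters).

def full_fat_tree(nodes=65536, radix=64):
    down = radix // 2              # 32 down-links per edge/aggregation switch
    edge = nodes // down           # 2048 edge switches
    agg = edge                     # 2048 aggregation switches (full bandwidth)
    spine = nodes // radix         # 1024 spine switches
    switches = edge + agg + spine  # 5120
    # One link per node at each of the three levels (full bisection):
    links = 3 * nodes              # 196,608
    return switches, links

print(full_fat_tree())  # (5120, 196608)
```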

Another Candidate: FTT Hybrid (FTT: Fat Tree & Torus). A hierarchical network: each group is a local 2-stage fat tree of 256 nodes, so only short cables are needed within a group. The global NW is a 2D torus of 16x16 such groups, with short cables connecting adjacent groups and 512 links between neighboring groups. Expected advantages: shorter cables, and an expandable, flexible system.
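The 1-16 global hop range quoted on a later slide comes from shortest paths on the 16x16 group torus; below is a minimal helper of my own, assuming group-index coordinates.

```python
# Hop distance between groups on the 16x16 global 2D torus
# (illustrative helper; coordinates are group indices).

def torus_distance(a, b, dims=(16, 16)):
    dist = 0
    for x, y, k in zip(a, b, dims):
        d = abs(x - y)
        dist += min(d, k - d)  # wrap-around (torus) shortest path
    return dist

print(torus_distance((0, 0), (8, 8)))  # 16: worst case, 8 hops per axis
print(torus_distance((3, 5), (4, 5)))  # 1: adjacent groups
```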

Comparison Summary:
- Fat tree: general-purpose, high usability; note: high cable delay?
- Low-D torus: good cost performance.
- High-D torus: extendability.
- Dragonfly: pseudo high-radix NW.
- FTT-hybrid: combination of fat tree and torus; note: low cable delay?
Next step: a detailed, quantitative evaluation of the full fat tree and the FTT-hybrid, considering more details of implementation and applications.

Cable Length and Delay: Preliminary estimation based on the expected implementation: boards (node, switch), cabinets (node, switch), floor layout, and cabling. [Figure: FTT-hybrid floor-layout example with cabinets C0-C3.]

Preliminary Result: Estimated cable lengths and delays per stage: stage 1 (node to edge switch) 2 m, 10 ns; stage 2 10-20 m, 50-100 ns; stage 3 (to the spine switches) 80 m, 400 ns. In the FTT-hybrid, the global 2D torus adds 1-16 hops over 15-20 m (75-100 ns) cables. The fraction of destination nodes in each distance class (same edge switch / same island or group / remote) is 0.05% / 1.5% / 98.4% for the fat tree (max 6 hops) and 0.05% / 0.33% / 99.6% for the FTT-hybrid (max 20 hops). There is no big difference in the maximum cable delay: fat tree = 1020 ns + (5 x SW delay); hybrid = 1395 ns + (19 x SW delay). The hybrid, however, can have shorter delay for local p2p communication.
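The 1020 ns figure is consistent with summing cable delays along the worst-case fat-tree path at about 5 ns/m (implied by the 2 m / 10 ns and 80 m / 400 ns pairs above). A sketch of that arithmetic, with the per-hop lengths read off the slide:

```python
# Worst-case end-to-end delay model: sum of cable delays plus a
# per-switch delay for each switch traversed. Cable lengths are the
# slide's estimates; 5 ns/m propagation is implied by 2 m -> 10 ns.

NS_PER_M = 5.0

def path_delay_ns(cable_lengths_m, n_switches, sw_delay_ns):
    cable = sum(cable_lengths_m) * NS_PER_M
    return cable + n_switches * sw_delay_ns

# Fat tree, max 6 hops (5 switches): node->edge, edge->agg, agg->spine,
# then back down. Lengths per hop: 2, 20, 80, 80, 20, 2 m.
fat_tree_path = [2, 20, 80, 80, 20, 2]
print(path_delay_ns(fat_tree_path, 5, 0))  # 1020.0 ns + 5 x SW delay
```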

Example of 3D Mesh Communication: With a 3D decomposition and adjacent communication, the x and y directions map onto the global 2D torus of 16x16 groups and z onto the local 256-node group (x and y can also be assigned within a group). Data exchange among 3D subgrids then costs: latency (4 hops) = 195 ns + (4 x SW delay) for x and y, and 120 ns + (3 x SW delay) for z, which is much shorter than the fat tree's 1020 ns + (5 x SW delay) for x, y, and z.
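Under the same ~5 ns/m model, the neighbor-exchange numbers correspond to short per-hop cables. The lengths below are my reconstruction (they reproduce the slide's totals but are not stated there explicitly):

```python
# Neighbor-exchange latency on the FTT-hybrid, reusing the 5 ns/m model.
# The per-hop lengths below are a plausible reconstruction, not slide data.
NS_PER_M = 5.0

def path_delay_ns(cable_lengths_m):
    return sum(cable_lengths_m) * NS_PER_M

print(path_delay_ns([2, 15, 20, 2]))  # 195.0 ns + 4 x SW delay (x, y)
print(path_delay_ns([2, 20, 2]))      # 120.0 ns + 3 x SW delay (z)
```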

Quantitative Evaluation (ongoing): Both tools model a 64-port switch (input ports 0-63, output ports 0-63) whose latency decomposes into routing delay, switching delay, transferring delay, and buffering delay, plus Tx/Rx delay at the links. Software simulator (OPNET-based): its purpose is to get rough results quickly and validate collective communication; a rough model with simple arbitration, no back pressure, and a limited NW size of ~8,192 nodes. Hardware emulator (FPGA-based): obtains detailed results with a cycle-accurate model, including real arbitration, flit-level transmission, and back pressure, at large NW sizes up to ~65,536 nodes.
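The delay decomposition on this slide suggests a simple additive per-switch model; the sketch below is a minimal illustration with made-up numbers, not the OPNET model or the FPGA RTL.

```python
# Minimal additive per-switch latency model mirroring the slide's
# decomposition (illustrative only; not the OPNET model or the RTL).
from dataclasses import dataclass

@dataclass
class SwitchDelays:
    routing_ns: float    # route computation
    switching_ns: float  # arbitration and crossbar traversal
    transfer_ns: float   # serialization onto the output link
    buffering_ns: float  # input-queue / output-buffer residency

    def total(self) -> float:
        return (self.routing_ns + self.switching_ns
                + self.transfer_ns + self.buffering_ns)

def end_to_end_ns(cable_ns, switches):
    """Cable delay plus the sum of per-switch delays along the path."""
    return cable_ns + sum(sw.total() for sw in switches)

sw = SwitchDelays(20, 30, 40, 10)     # made-up example numbers
print(end_to_end_ns(1020, [sw] * 5))  # fat-tree worst case: 1520.0 ns
```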

Hardware Emulator Overview: An FPGA cluster with 4 host PCs, 4 FPGAs per PC, and 4 x 10G SFP+ ports per FPGA; nodes are implemented in software on the Linux hosts, switches in hardware on the FPGAs. Each DE5-NET board carries an Altera Stratix V FPGA (5SGXEA7N2F45C2) with four SFP+ 10GbE ports (A-D, 10 Gbps+ each for Tx and Rx), PCI Express 3.0 x8 (8 GB/s Tx/Rx), two DDR3 DRAM channels (PC3-12800 / DDR3-1600, 12.8 GB/s each, x64 @ 800 MHz DDR, up to 1066 MHz; 2 GB by default, up to 8 GB), and four QDR II+ SRAMs (18 Mbit each, x18 @ 500 MHz, 1 GB/s for read/write, 20-bit addressing for 18-bit data). Other nodes are not installed yet.

Hardware Emulator Overview (photo): One node of the FPGA cluster, with 4 FPGA boards, SFP+ 10GbE ports, and a 64-port 10GbE switch. Other nodes are not installed yet.

Summary: We explored the design space of small-diameter NWs built from high-radix switches, under technology constraints and application demands (global and local p2p communication). Two candidates remain after the topology comparison, the full fat tree and the FTT-hybrid, for which we made a preliminary evaluation of cable length and delay. Future (ongoing) work: quantitative evaluation with simulation and emulation, and application performance estimation.

Thank you!