Search for Optimal Network Topologies for Supercomputers


1 Search for Optimal Network Topologies for Supercomputers
GUO, Meng
Shandong Computer Science Center (National Supercomputer Center in Jinan)
2014/11/5, Guangzhou

2 Acknowledgements
Y.F. Deng of Stony Brook, USA & National Supercomputer Center in Jinan, China
M. Michalewicz and L. Orlowski of A*CRC, Singapore and Stony Brook
T. Mayer, Z. Ye, and L. Zhang of Stony Brook, USA
C. C. Hwang, Y. T. Chen, C. H. Liang, and S. W. Liou of NCKU, Taiwan
Joint work with Prof. Deng's students, postdocs, and other colleagues.
Early work was done on Shenway Bluelight at the National Supercomputer Center in Jinan, China.

3 Search for Optimal Network Topologies for Supercomputers 01 Motivations 02 Reviews of Network Topologies 03 Search for Optimal Topologies 04 Summaries

4 Development of calculator / computer
Human: ~0.01 Flops
Mechanical Calculator: ~0.1 Flops
ENIAC (1946): ~300 Flops
Tianhe-2 (2013): ~50 PFlops

5 TOP1: Tianhe-2
System: Tianhe-2 (TH-IVB FEP Cluster, Xeon E5-2692 12C 2.2GHz with Phi 31S1p)
Cores: 3,120,000
Nodes: 16,000
Flops/node: 3.… Tflops / node
RAM: 1,024,000 GB
Rmax: 33.9 Pflop/s
Rpeak: 54.9 Pflop/s
Power: 17.8 MW
Network: TH Express-2
OS: Kylin Linux


6 Communication Costly in Time and Energy
[Power costs per operation, today]
Operation | Approximate energy cost
DP FMADD flop | 100 pJ
DP DRAM read-to-register | 4800 pJ
DP word transmit-to-neighbor | 7500 pJ
DP word transmit-across-system | 9000 pJ
[Power costs per operation, 2019]
Local calculation, FMAdd: 11 pJ
Moving a double across the die: 96 pJ
Communication vs. computation: 96/11 = 8.73
Sources: DARPA report of P. Kogge (ND) et al., T. Schulthess (ETH), and David Keyes' PPT
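The ratios behind this slide can be reproduced with a few lines of arithmetic. The sketch below only restates the energy figures quoted on the slide; the dictionary and function names are illustrative, not from the talk.

```python
# Energy costs per operation in picojoules, copied from the slide.
COSTS_TODAY_PJ = {
    "DP FMADD flop": 100,
    "DP DRAM read-to-register": 4800,
    "DP word transmit-to-neighbor": 7500,
    "DP word transmit-across-system": 9000,
}
COSTS_2019_PJ = {"local FMAdd": 11, "cross-die double": 96}


def ratios_to_flop(costs, flop_key):
    """Express every other operation as a multiple of the flop cost."""
    flop = costs[flop_key]
    return {op: cost / flop for op, cost in costs.items() if op != flop_key}


today = ratios_to_flop(COSTS_TODAY_PJ, "DP FMADD flop")
ratio_2019 = COSTS_2019_PJ["cross-die double"] / COSTS_2019_PJ["local FMAdd"]

for op, ratio in today.items():
    print(f"{op}: {ratio:.0f}x the energy of one flop")
print(f"2019 cross-die move vs. FMAdd: {ratio_2019:.2f}x")
```

With today's figures, sending one word across the system costs 90 times the energy of the flop that consumes it.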

7 Interconnection Network: Topologies & Technologies [Topologies] Bus, ring, grid/mesh, torus, hypercube, tree, fat tree, omega, crossbar, etc. [Technologies] Device: Ethernet, Myrinet, Infiniband, etc. Protocol: TCP/IP, UDP, VMMC, U-net, BIP, etc. NETWORK: Latency Bandwidth SYSTEM: Performance Cost

8 Search for Optimal Network Topologies for Supercomputers 01 Motivations 02 Reviews of Network Topologies 03 Search for Optimal Topologies 04 Summaries

9 Source: Diagram of different BASIC network topologies

10 Source: B. Parhami An Ocean of Networks

11 The Top Supercomputer in TOP500 Lists in the Last 20 Years
System | Site | Topology | Date
TMC CM-5 | Los Alamos National Lab | Fat Tree | 6/93~11/93
Fujitsu Numerical Wind Tunnel | National Aerospace Laboratory of Japan | Crossbar | 11/93~6/96
Intel XP/S 140 Paragon | Sandia National Labs | 2D Mesh | 6/94~11/94
Hitachi SR2201 | University of Tokyo | 3D Crossbar | 6/96~11/96
Hitachi CP-PACS | University of Tsukuba | 3D Hyper-crossbar | 11/96~6/97
Intel ASCI Red | Sandia National Laboratory | Mesh | 6/97~11/00
IBM ASCI White | Lawrence Livermore National Laboratory | Omega | 11/00~6/02
NEC The Earth Simulator | Earth Simulator Center | Crossbar | 6/02~11/04
IBM BlueGene/L | Lawrence Livermore National Laboratory | 3D Torus | 11/04~6/08
IBM Roadrunner | Los Alamos National Laboratory | Fat-Tree hierarchy of crossbars | 6/08~11/09
Cray Jaguar | Oak Ridge National Laboratory | 3D Torus | 11/09~11/10
NUDT Tianhe-1A | National Supercomputing Center in Tianjin | Fat Tree | 11/10~6/11
Fujitsu K Computer | RIKEN Advanced Institute for Computational Science | Tofu: 6D Mesh / Torus | 6/11~6/12
IBM Sequoia Blue Gene/Q | Lawrence Livermore National Laboratory | 5D Torus | 6/12~11/12
Cray Titan | Oak Ridge National Laboratory | 3D Torus | 11/12~6/13
NUDT Tianhe-2 | National Super Computer Center in Guangzhou | Fat Tree | 6/13~
Source:

12 Source: Interconnect Family System Share of TOP 500 (June 2014)

13 Popular Networks: Butterfly (Monsoon) Dragonfly (Cray XC30) Hypercube (SGI Origin) 3D Torus (Cray Gemini) 5D Torus (IBM) Tofu: 6D Mesh/Torus (K)

14 Popular Network: Fat Tree The networks for Tianhe-2 (GZ), Shenway (JN), Dawning Nebulae (SZ),

15 Returning to Square One
Stages in the cycle diagram: 1960s Mesh-based (ILLIAC IV); 1970s Butterfly, other MINs; 1980s Hypercube, bus-based; 1990s Fat tree, LAN-based; 2000s.
Drivers labeled between stages: direct to indirect, shared memory; lower diameter, message passing; greater bandwidth; scalability, local wires.
So, only a small portion of the networks has been explored in practical parallel computers.

16 Comparison of Common Topologies
Network topology | Node degree | Network diameter | Bisection bandwidth
Full Connected | N - 1 | 1 | N^2/4
Ring | 2 | N/2 | 2
2D Torus | 4 | sqrt(N) - 1 | 2 sqrt(N)
Tree | - | 2 log_(d-1) N | 1
Fat Tree | - | 2 log_2 N | N/2
Hypercube | log_2 N | log_2 N | N/2
Butterfly | 4 | 2l | N/(l + 1)
de Bruijn | d | log_d N | 2dN/log_d N
Dcell | k + 1 | < log_n N - 1 | N/(4 log_n N)
[Figure: diameter versus degree. N - 1: Linear Array; N/2: Ring; sqrt N: Torus; log N: Binary tree, Hypercube; 1: Full Connected]

17 Supercomputer Interconnects?!

18 Data Traffic in Computer is Similar to This New York

19 Search for Optimal Network Topologies for Supercomputers [Our Goals] Design the state-of-the-art interconnection networks. [Challenges] The entire ecosystem of network design is too big. [Our Focus] On discovering the optimal network topologies.

20 Search for Optimal Network Topologies for Supercomputers 01 Motivations 02 Reviews of Network Topologies 03 Search for Optimal Topologies 04 Summaries

21 Interconnection Networks
Main attributes: number of nodes N, diameter D, bisection bandwidth B, node degree k, longest wire, heterogeneous nodes.
Other attributes: regularity, scalability, packageability, robustness.
Adapted from B. Parhami

22 Strategy to Search for Optimal Topologies
Add bypass links on known topologies
Add links on a Hamiltonian Cycle
Remove links from full-connected network
Successive construction
Exhaustion and embedding
[Figure: diameter scale. N - 1: Linear Array; N/2: Ring; sqrt N: Torus; log N: Binary tree, Hypercube; 1: Full Connected]

23 Add Bypass Links on Torus
[Figures: ibt(8x8; b=<4>) and ibt(9^3; b=<3>). Legend: Torus Link, X-Axis Bypass Link, Y-Axis Bypass Link]
Source: P. Zhang, R. Powell, and Y. Deng, IEEE Trans. Parallel and Distributed Systems, Vol. 22, Issue 2 (2011)

24 3D ibt vs. 4D Torus
[Figure: two panels, Diameter and Mean Path Length, each plotted against Network Size (# of nodes); series include 3D ibt (8)]

25 Add Links on A Hamiltonian Cycle
[Figures: N8k4, N16k6, N16k6, N64k6, each annotated with Mean Path Length, Diameter, and Number of cables]
Configuration | HPL with CPU stability | HPL with CPU Turbo Boost | HPL with CPU Turbo & HT
N8k | Done | Pending | Pending
N16k | Done | Pending | Pending
N32k | Done | Pending | Pending
N64k | Pending | Pending | Pending
Tests at 10/14 at NCKU

26 Add Links on A Hamiltonian Cycle
[Figure: HPL Efficiency and Parallel Efficiency versus node count, for 64G RAM and 96G RAM]
Node | HPL Best Efficiency (64G RAM) | HPL Best Efficiency (96G RAM) | Parallel Efficiency (64G RAM) | Parallel Efficiency (96G RAM)
1 | …% | 91.31% | 100% | 100%
… | …% | 87.49% | …% | 95.82%
… | …% | 84.44% | …% | 92.48%
… | …% | 81.94% | …% | 89.74%
64 | | | |
Tests at 10/14 at NCKU

27 Remove Links from Full-connected Network M. Michalewicz, L. Orlowski and Y.F. Deng, Constructing graphs by algorithmic edge removal (in preparation)

28 Successive Construction M. Michalewicz, L. Orlowski and Y.F. Deng, Constructing graphs by algorithmic edge removal (in preparation)

29 What is The Best Network Topology? Diameter N - 1 Linear Array N / 2 Ring sqrt N Torus Wires vs. Diameter? log N Binary tree, Hypercube 1 Full Connected

30 The Degree/Diameter Graph Problem
Suppose you have an unlimited supply of degree-d nodes. How many can be connected into a network of diameter D?
Petersen graph: d = 3, D = 2, N = 10
Hoffman-Singleton graph: d = 7, D = 2, N = 50
Source: Singleton_graph
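Both graphs named on this slide meet the Moore bound, the standard upper limit on how many degree-d nodes fit within diameter D. The sketch below evaluates that bound and checks the Petersen graph by breadth-first search; the function names and the vertex numbering are choices made for this illustration, not taken from the talk.

```python
from collections import deque


def moore_bound(d, D):
    """Upper bound on node count for maximum degree d and diameter D."""
    return 1 + d * sum((d - 1) ** i for i in range(D))


def diameter(adj):
    """Longest shortest path, found by BFS from every node.

    Assumes the graph is connected.
    """
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


def petersen():
    """Petersen graph: outer 5-cycle (0-4), inner pentagram (5-9), 5 spokes."""
    adj = {v: set() for v in range(10)}

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for i in range(5):
        link(i, (i + 1) % 5)          # outer cycle
        link(5 + i, 5 + (i + 2) % 5)  # inner pentagram
        link(i, 5 + i)                # spoke
    return adj


graph = petersen()
print(moore_bound(3, 2), moore_bound(7, 2))   # bounds for the two slide examples
print(max(len(nb) for nb in graph.values()), diameter(graph))
```

For most (d, D) pairs no graph reaches the bound, which is why the largest known graphs have to be found by search.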

31 Graph Theory, Topology, Network
[Figure: E8 picture (E8 Lie group: 240 points in 8 dimensions). Source: http://www.math.lsa.umich.edu/~jrs/coxplane.]
Problem Statement (N, k): Given N vertices, find a graph for which the diameter, defined as the longest of the geodesic distances between all pairs of nodes, is minimal for a fixed vertex degree k (defined as the number of edges incident to the vertex). The mean path length should also be minimal.

32 Why Do an Exhaustive Search?
[Figures: 2D and 3D layouts, before and after rearranging a couple of links, each annotated with diameter and mean path length]
Rearranging a couple of links reduces the diameter by 33.3% and the mean path length by 8.3%!
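A tiny case makes the effect of link placement concrete. The sketch below is an illustration only, not the authors' search code: it takes a ring of N=8 nodes (a Hamiltonian cycle), tries every way of adding one chord per node so that all degrees become k=3, and keeps the arrangement with the smallest diameter, breaking ties by mean path length. All function names are hypothetical.

```python
from collections import deque


def perfect_matchings(nodes):
    """Yield every way to split the node list into disjoint pairs."""
    if not nodes:
        yield []
        return
    first, rest = nodes[0], nodes[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail


def diameter_and_mean_path(adj):
    """Return (diameter, mean path length) using BFS from every node."""
    n = len(adj)
    total, worst = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        worst = max(worst, max(dist.values()))
    return worst, total / (n * (n - 1))


def best_chords_on_ring(n):
    """Exhaustively place one chord per node on an n-ring (n even)."""
    ring_edges = {frozenset((i, (i + 1) % n)) for i in range(n)}
    best = None
    for chords in perfect_matchings(list(range(n))):
        if any(frozenset(c) in ring_edges for c in chords):
            continue  # a chord may not duplicate an existing ring link
        adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
        for u, v in chords:
            adj[u].add(v)
            adj[v].add(u)
        score = diameter_and_mean_path(adj)
        if best is None or score < best[0]:
            best = (score, chords)
    return best


(best_diameter, best_mean_path), best_chords = best_chords_on_ring(8)
print(best_diameter, round(best_mean_path, 3), best_chords)
```

Some chord arrangements in this search space have diameter 3, for example pairing 0-2, 1-3, 4-6, and 5-7, while the best ones reach diameter 2 with the same number of links. The enumeration grows factorially with N, so this brute-force form is only practical for very small graphs.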

33 Comparison of Topologies for N=16
Networks compared: Hypercube, 4x4 Mesh, 4x4 Torus, Optimal N16k3, Optimal N16k4
Metrics compared: Degree, Diameter, Mean path length, Number of edges
Optimal N16k3: uses 25% fewer wires while keeping a similar mean path length and a 25% smaller diameter.
Optimal N16k4: uses the same amount of wires to get a 25% smaller diameter and mean path length.
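The four metrics in this comparison can be recomputed for the three standard 16-node topologies. The sketch below builds the hypercube, the 4x4 mesh, and the 4x4 torus from their textbook definitions and measures them by breadth-first search. It does not reproduce the optimal N16k3 and N16k4 graphs, whose link lists are not given in the slides; the function names are illustrative.

```python
from collections import deque
from itertools import product


def metrics(adj):
    """Degree, diameter, mean path length, and edge count of a connected graph."""
    n = len(adj)
    total, worst = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        worst = max(worst, max(dist.values()))
    return {
        "degree": max(len(nb) for nb in adj.values()),
        "diameter": worst,
        "mean_path_length": total / (n * (n - 1)),
        "edges": sum(len(nb) for nb in adj.values()) // 2,
    }


def hypercube(dim):
    """Nodes are integers; neighbors differ in exactly one bit."""
    return {v: {v ^ (1 << b) for b in range(dim)} for v in range(1 << dim)}


def grid(rows, cols, wrap):
    """2D mesh (wrap=False) or 2D torus (wrap=True)."""
    adj = {(r, c): set() for r, c in product(range(rows), range(cols))}
    for r, c in adj:
        for dr, dc in ((0, 1), (1, 0)):
            rr, cc = r + dr, c + dc
            if wrap:
                rr, cc = rr % rows, cc % cols
            if rr < rows and cc < cols:
                adj[(r, c)].add((rr, cc))
                adj[(rr, cc)].add((r, c))
    return adj


for name, graph in [
    ("Hypercube", hypercube(4)),
    ("4x4 Mesh", grid(4, 4, wrap=False)),
    ("4x4 Torus", grid(4, 4, wrap=True)),
]:
    print(name, metrics(graph))
```

The hypercube and the 4x4 torus come out identical here, because a 4-node ring is itself a 2-dimensional hypercube.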

34 Comparison of Topologies for N=32 Hypercube 4x8 Torus Optimal N32k3 Optimal N32k4 Optimal N32k5 Network Degree Diameter Mean path length Number of edges

35 Graph for N=64 N64k6: D=3; A=2.33; L=192 How to find topologies for N=1,024 or even 3,000,000???

36 Possible to Generate Massive Graphs?
Exhaustive searches for top topologies are possible for N64k6, i.e., N=64 and k=6.
The search space for N256k8 is ~10^1,760. That far exceeds the number of stars in the universe, so an exhaustive search is probably impossible for larger graphs.
Therefore, we must invent techniques to search for top (quasi-optimal) topologies.
McKay, B. D., & Wormald, N. C. (1990). Asymptotic enumeration by degree sequence of graphs of high degree. European Journal of Combinatorics, 11.
Deng, Y. et al. (2014, in preparation). The first-principle discovery of k-degree optimal graphs and engineering validations of optimality.
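The order of magnitude quoted for N=256, k=8 can be checked with the classic asymptotic count of labeled k-regular graphs, (Nk)! / ((Nk/2)! 2^(Nk/2) (k!)^N) * exp(-(k^2 - 1)/4). Whether the authors used exactly this formula is an assumption; the slide cites McKay and Wormald for the enumeration. The sketch below works in logarithms to avoid overflow, and the function name is illustrative.

```python
import math


def log10_regular_graph_count(n, k):
    """log10 of the asymptotic number of labeled k-regular graphs on n nodes.

    Uses the sparse-graph asymptotic
        (nk)! / ((nk/2)! * 2**(nk/2) * (k!)**n) * exp(-(k*k - 1) / 4),
    evaluated with log-gamma so the huge factorials never materialize.
    """
    if (n * k) % 2:
        raise ValueError("n * k must be even for a k-regular graph to exist")
    m = n * k // 2  # number of edges
    ln_count = (
        math.lgamma(2 * m + 1)      # ln (nk)!
        - math.lgamma(m + 1)        # ln (nk/2)!
        - m * math.log(2)           # ln 2**(nk/2)
        - n * math.lgamma(k + 1)    # ln (k!)**n
        - (k * k - 1) / 4           # correction for loops and multi-edges
    )
    return ln_count / math.log(10)


print(f"N256k8: about 10^{log10_regular_graph_count(256, 8):.0f} graphs")
print(f"N64k6:  about 10^{log10_regular_graph_count(64, 6):.0f} graphs")
```

This counts labeled graphs, so many of them are relabelings of one another; the figure is a measure of raw search-space size, not of the number of distinct topologies.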

37 Method 2: Graph Embedding (N8k3)x(N8k3(a))(M=64) =

38 Best Way to Connect M=32 Nodes Hypercube 4x8 Torus N4k2 x N8k3 (a) N4k2 x N8k3 (b) Network Degree Diameter Mean path length Number of edges

39 Graph Embedding N8k3 x N8k3(a) (M=64)
For hypercube 2^6: M=64, k=6, A = …, D = 6, L = 192 = (64x6/2)
For 2D torus 8x8: M=64, k=4, A = …, D = 8, L = 128 = (64x4/2)
N8k3 x N8k3(a): M=64, k=3 or 4, A = ???, D = ???, L = 108 = (12x8+8+4)

40 Graph Embedding N8k3 x N16k3 (M=128)
For hypercube 2^7: M=128, k=7, A = …, D = 7, L = 448
For 16x8 torus: M=128, k=4, A = …, D = 12, L = 256
N8k3 x N16k3: M=128, k=3 or 4, A = …, D = 13, L = 216
[Figure: Hop Distributions]

41 Graph Embedding (N16k3)^2 (M=256)
For hypercube 2^8: M=256, k=8, A = …, D = 8, L = 1024
For 16x16 torus: M=256, k=4, A = …, D = 16, L = 512
N16k3 x N16k3: M=256, k=3 or 4, A = 9.23, D = 15, L = 408
[Figure: Hop Distributions]

42 Graph Embedding (N8k3)^3 (M=512)
For hypercube 2^9: M=512, k=9, A = 4.…, D = 9, L = 2304
For 32x16 torus: M=512, k=4, A = 12.0235, D = 24, L = 1024
N8k3 x N8k3 x N8k3: M=512, k=3 or 4, A = …, D = 20, L = 876
[Figure: Hop Distributions, number of node pairs versus hop count (1 to 14)]

43 Graph Embedding (M=4096): (N16k3)^3 & (N8k3)^4

44 Hop Distributions for M=4096
For hypercube 2^12: M=4096, k=12, A = …, D = 12, L = 24,576
For 64x64 torus: M=4096, k=4, A = 32.0078, D = 64, L = 8,192
(N16k3)^3: M=4096, k=3 or 4, A = 34.72, D = 60, L = 6,552
(N8k3)^4: M=4096, k=3 or 4, A = …, D = 55, L = 7,020
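The link counts L on the embedding slides follow a simple recurrence, sketched below under one assumption that the slides do not state explicitly: every edge of an outer-level graph is realized by exactly one physical link between two inner copies. With that assumption, embedding copies of an inner graph into an outer graph with n nodes and e edges gives L_total = n x L_inner + e. A k-regular graph such as the hypercube or torus has L = M x k / 2. Function names are illustrative.

```python
def regular_edges(nodes, degree):
    """Edge count of a regular graph: every edge is shared by two nodes."""
    return nodes * degree // 2


def embedded_edges(levels):
    """Total (nodes, edges) of a nested embedding.

    `levels` lists (nodes, edges) of each base graph, innermost first.
    Assumes one physical link per outer-level edge.
    """
    total_nodes, total_edges = 1, 0
    for nodes, edges in levels:
        total_edges = total_edges * nodes + edges
        total_nodes *= nodes
    return total_nodes, total_edges


N8K3 = (8, regular_edges(8, 3))     # 8 nodes, 12 edges
N16K3 = (16, regular_edges(16, 3))  # 16 nodes, 24 edges

cases = {
    "N8k3 x N8k3": [N8K3, N8K3],
    "N8k3 x N16k3": [N8K3, N16K3],
    "(N16k3)^2": [N16K3, N16K3],
    "(N8k3)^3": [N8K3] * 3,
    "(N16k3)^3": [N16K3] * 3,
    "(N8k3)^4": [N8K3] * 4,
}
for name, levels in cases.items():
    nodes, edges = embedded_edges(levels)
    print(f"{name}: M={nodes}, L={edges}, "
          f"hypercube L={regular_edges(nodes, nodes.bit_length() - 1)}, "
          f"torus L={regular_edges(nodes, 4)}")
```

For M=64 the recurrence gives 12 x 8 + 12 = 108, which equals the sum 12x8+8+4 written on the M=64 slide, and it reproduces the values 216, 408, 876, 6,552, and 7,020 quoted for the larger embeddings.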

45 Prototype 1: ibt(8^2,b=2) vs. T(8^2) vs. T(4^3) NAMD NAS Parallel Benchmarks HPC Challenge Benchmarks LINPACK Benchmarks

46 Prototype 2: Optimal N8k3 at NCKU

47 One Prototype with N=1024 at NCKU CK Pflops, 1.1 MWs 5,120 Fiber links

48 Search for Optimal Network Topologies for Supercomputers 01 Motivations 02 Reviews of Network Topologies 03 Search for Optimal Topologies 04 Summaries

49 Search for Optimal Network Topologies for Supercomputers [Summaries]
Next-generation supercomputers need better interconnection networks: technologies and topologies.
Optimal topologies show better performance: diameter, mean path length, utilization of wires, etc.
There is still a long way to go to find and use optimal topologies:
Other optimization metrics: bandwidth, applications, etc.
Massive networks: optimization algorithms, embedding, packaging, etc.
Routing, mapping; scalability, robustness, etc.
Engineering
