The Tofu Interconnect D


2018 IEEE International Conference on Cluster Computing

Yuichiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouichi Hirai, Toshiyuki Shimizu
Next Generation Technical Computing Unit, Fujitsu Limited, Kawasaki, Japan
{aji, t-kawashima, tokamoto, shidax, k-hirai, t.shimizu}@jp.fujitsu.com

Shinya Hiramoto, Yoshiro Ikeda, Takahide Yoshikawa, Kenji Uchida, Tomohiro Inoue
AI Platform Business Unit, Fujitsu Limited, Kawasaki, Japan
{hiramoto.shinya, ikeda.yoshir-02, yoshikawa.takah, k_uchida, inoue.tomohiro}@jp.fujitsu.com

Abstract: In this paper, we introduce a new highly scalable interconnect called Tofu interconnect D that will be used in the post-K machine. This machine will officially be operational around [...]. The letter D represents high-density nodes and dynamic packet slicing for dual-rail transfer. Herein we describe the design and the evaluation results of TofuD. Due to the high-density packaging, the optical link ratio of TofuD has decreased to 25% from the 66% optical link ratio of Tofu2. TofuD applies a new technique called dynamic packet slicing to reduce latency and to improve fault resilience. The evaluation results show that the one-way 8-byte Put latency is 0.49 μs, which is 31% lower than the latency of Tofu2. The injection rate per node is 38.1 GB/s, which is approximately 83% of the injection rate of Tofu2. The link efficiency is as high as approximately 93%.

Keywords: high-performance computing, interconnect, high-density packaging, fault resilience

I. INTRODUCTION

The Tofu interconnect family is a group of system interconnects for highly scalable HPC systems developed by Fujitsu. The Tofu Interconnect D (TofuD) is a new member of this family and is designed for use in the post-K machine [1] that will be operational around [...]. Tofu stands for torus fusion, which represents the designed combination of dimensions with an independent configuration and a routing algorithm. The letter D represents high-density nodes and dynamic packet slicing for dual-rail transfer.
In this paper, we describe the design overview, specification, and evaluation results of TofuD. The design overview includes the new node configuration that incorporates the high-density memory packaging technology, the optimizations for the increasing number of non-uniform memory access (NUMA) domains, and a new packet transfer technique that reduces latency and improves resilience. Section II explains the background of this work. Section III presents related work. Section IV introduces the design of TofuD, and Section V presents the results of performance evaluation. Section VI concludes this paper.

II. BACKGROUND

A. Tofu Interconnect

The Tofu interconnect [2][3] was developed for the K computer [4] that became operational in [...]. The 6D mesh/torus network of Tofu achieved high scalability of 82,944 compute nodes, and the virtual 3D torus rank mapping scheme provided both high availability and topology-aware programmability. Tofu was also used in the PRIMEHPC FX10 system, which doubled the number of processor cores per node to sixteen from the eight of the K computer.

A node address in the physical 6D network is represented by six-dimensional coordinates X, Y, Z, A, B, and C. The A and C coordinates can be 0 or 1, and the B coordinate can be 0, 1, or 2. The range of the X, Y, and Z coordinates depends on the system size. Two nodes whose coordinates differ by 1 in one axis and are identical in the other five axes are adjacent and are connected to each other. When a certain axis is configured as a torus, the node with coordinate 0 in the axis and the node with the maximum coordinate value are connected to each other. The A- and C-axes are fixed to the mesh configuration and the B-axis is fixed to the torus configuration. Each node has 10 ports for the 6D mesh/torus network. Each of the X-, Y-, Z-, and B-axes uses two ports, and each of the A- and C-axes uses one port. Each link provided 5.0 GB/s peak throughput. Each link had 8 lanes of high-speed differential I/O signals at a 6.25-Gbps data rate.
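The adjacency rule described above can be illustrated with a minimal Python sketch. The X, Y, and Z axis lengths (4 each, configured as tori) and the helper name `neighbors` are assumptions made for illustration; only the A-, B-, and C-axis settings come from the text.

```python
from itertools import product

# Axis order: X, Y, Z, A, B, C.
# The X/Y/Z lengths are hypothetical; the A, B, C lengths follow the text.
SIZES = (4, 4, 4, 2, 3, 2)
# Assumption: X, Y, Z are configured as tori. A and C are fixed to mesh, B to torus.
IS_TORUS = (True, True, True, False, True, False)


def neighbors(coord, sizes=SIZES, is_torus=IS_TORUS):
    """Return the set of nodes adjacent to `coord` in the 6D mesh/torus."""
    result = set()
    for axis, (c, n) in enumerate(zip(coord, sizes)):
        for step in (-1, 1):
            nc = c + step
            if is_torus[axis]:
                nc %= n  # wrap around: coordinate 0 connects to the maximum coordinate
            elif not 0 <= nc < n:
                continue  # mesh axis: no link beyond the edge
            if nc != c:
                result.add(coord[:axis] + (nc,) + coord[axis + 1:])
    return result


if __name__ == "__main__":
    for nb in sorted(neighbors((0, 0, 0, 0, 0, 0))):
        print(nb)
    # Collect the distinct port counts over all nodes of this hypothetical system.
    print({len(neighbors(c)) for c in product(*(range(n) for n in SIZES))})
```

Under these assumed axis lengths, every node ends up with ten neighbors, matching the ten ports per node stated in the text: two each on the X-, Y-, Z-, and B-axes and one each on the A- and C-axes.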
Tofu was implemented as an interconnect controller (ICC) chip with 80 lanes of signals for the network. All links were electric, and there was no optical link in the original Tofu interconnect. Each node had four Tofu network interfaces (TNIs), so that data could be transmitted simultaneously in four independent directions and received simultaneously from four independent directions. The injection bandwidth per node was 20 GB/s. The total injection bandwidth (which yields the theoretical peak performance of the nearest neighbor data exchange) of the K computer was 1.66 PB/s. The bisection bandwidth (which yields the theoretical peak performance of global data exchange) of the K computer was 46.1 TB/s for the physical mesh and the torus network, or 34.6 TB/s for the virtual torus network. In a large torus network, there are performance differences of one to two orders of magnitude depending on the communication pattern; therefore, topology-aware tuning of applications is important.

The TNI provided the communication functions of remote direct memory access (RDMA) Put/Get, system packet, and Tofu barrier. The system packet was used for system control and IP communication. The Tofu barrier handles the multiple stages of communication for barrier synchronization in hardware, which is unaffected by the OS jitter that severely degrades latency when software handles the communication. A barrier gate (BG) is a hard-wired module that synchronously communicates with other BGs. Specifically, each BG waits for signals from up to two preset BGs, and then transmits signals to up to two other preset BGs. There are two types of BG: start-and-end point and relay point. Each start-and-end point BG is permanently associated with an interface called a barrier channel (BCH). The MPI library allocates these communication resources at the creation of each communicator. The reduce-broadcast tree algorithm consumes one BCH and five BGs, whereas the recursive-doubling algorithm consumes one BCH and log2(n) BGs. A BG can perform the reduction operation, so the Tofu barrier can perform all-reduce collective communication that is limited to one element. In Tofu, the Tofu barrier was available only on TNI number 0, and there were 8 BCHs and 64 BGs; 8 BGs were for start-and-end points and 56 BGs were for relay points. Therefore, up to eight communicators per node could simultaneously use the Tofu barrier. When there were multiple processes on a node, the intra-node processes were synchronized by software and the representative process used a BCH for the inter-node synchronization.

B. Tofu Interconnect 2

The next version, Tofu interconnect 2 (Tofu2) [5][6], was designed for the PRIMEHPC FX100 system launched in [...]. Each node of FX100 had eight packages of hybrid memory cube (HMC), each of which contained a stack of memory dies.
In contrast, each node of the K computer and FX10 had eight inline memory modules, a type of module that had been in use for over 30 years. This transition from a wide memory module to a small memory package reduced the node footprint of FX100. To reduce the node footprint further, the Tofu2 implementation also shifted to processor chip integration from the independent ICC chip of Tofu. Considering the balance with the 128 collocated signal lanes for memory on the processor chip, Tofu2 halved the number of signal lanes to 40 from the 80 signal lanes of Tofu. To compensate for halving the number of signal lanes, Tofu2 significantly improved the data rate of the signals from 6.25 Gbps to [...] Gbps by introducing optical links. The link bandwidth and the injection bandwidth per node were increased to 12.5 GB/s and 50 GB/s, respectively. In the communication function of Tofu2, the following features were added: RDMA atomic read-modify-write, triggered communication (called session mode, for non-blocking collective communication), and RDMA for system use.

In FX100, not only was the number of compute cores increased to 32, but the recommended number of user processes per node was also increased from 1 to 2, because two NUMA domains called core-memory groups (CMGs) were introduced on a chip. Therefore, the number of RDMA communication resources called control queues (CQs) had to be increased to allocate a dedicated CQ to each user process. In Tofu, each TNI had three CQs, and one of the three CQs was reserved for system use. For one or two user processes per node, each process was assigned one dedicated CQ per TNI and the MPI communication library internally used four CQs simultaneously. When the number of processes per node exceeded two, the total number of assigned CQs for each process decreased. When the number of processes per node exceeded eight, CQs were shared by multiple processes. In Tofu2, the number of CQs per TNI increased from 3 to 12 to avoid sharing CQs even if the number of processes per node was 32.

C. The Post-K Computer

The post-K computer is a system developed to replace the K computer and will start operating around [...]. The post-K computer is designed to take full advantage of the assets of the K computer such as applications, users, tools, system operational knowledge, and the facility. The post-K is required not only to expand application domains, but also to significantly improve application performance, specifically to 100 times or more of that on the K. Fujitsu cooperates with the asset holder RIKEN and builds on the leading-edge technologies of FX100 to construct the post-K machine.

III. RELATED WORK

This section describes the system interconnects used in the recent world-class systems other than the Tofu interconnect family. All systems have the same level of bisection bandwidth, which represents the theoretical peak performance of global data exchange. On the other hand, the total injection bandwidth differs significantly depending on the type of network topology. Some systems have a total injection bandwidth close or equal to their own bisection bandwidth, and the other systems have a total injection bandwidth much higher than their own bisection bandwidth.

A. InfiniBand™

InfiniBand™ (IB) [7] is a standard specification of interconnect defined by the InfiniBand Trade Association. IB products have been widely used to build HPC clusters. The network interface is called a host channel adapter (HCA), and an ordinary HCA is implemented as a discrete chip and mounted on an adapter card. An ordinary IB network is constructed by using switch boxes. Constructing an interconnection network with independent components such as adapter cards and switch boxes is disadvantageous in terms of packaging density and power consumption. However, it has the advantage of flexibility of configuration. For example, a node configuration that has an increased number of HCAs enhances injection bandwidth and accelerates communication-intensive applications.
In the other example, the network configuration called a full-bisection bandwidth fat-tree, in which the bisection bandwidth is equivalent to the total injection bandwidth, suppresses variation in the execution time of applications not optimized for the network topology.

Mellanox's dual-rail EDR IB HCA will be used in the Summit system [8], which will start full operation in [...]. The injection bandwidth per node is 25 GB/s. The total injection or bisection bandwidth will be approximately 115 TB/s. The TaihuLight system, which started operation in 2016, also used Mellanox's IB HCAs and switch chips [9]. The Sunway network of TaihuLight was constructed as a four-stage tapered fat-tree. The total injection bandwidth was 512 TB/s and the bisection bandwidth was approximately 70 TB/s.

There was a rare example of IB HCA integration. Oracle's Sonoma processor [10] was designed for high-density scale-out servers and there were two built-in HCAs on a chip. The injection bandwidth per node was 13.6 GB/s.

B. Omni-Path

Omni-Path [11] is Intel's HPC interconnect family. In the first generation, the host fabric interface (HFI) is implemented as a discrete chip and mounted on an adapter card or integrated into a CPU package. Omni-Path is considered likely to be used in the future Aurora system [12]. The first-generation Omni-Path was used in the Oakforest-PACS system that became operational in [...]. The injection bandwidth per node was 12.5 GB/s. The total injection or bisection bandwidth was [...] TB/s.

C. Aries Interconnect

The Aries interconnect [13] developed by Cray is a highly scalable system interconnect that employs a Dragonfly-based topology. The network interface and the router were implemented together in a discrete chip. Each Aries chip had four network interfaces and connected four nodes. Each network interface had two ports to connect to the internal router ports. Each router port operated at a link throughput of 4.7 GB/s for global links or 5.25 GB/s in a group of 384 nodes. Therefore, the injection bandwidth per node was 10.5 GB/s. The upgraded Piz Daint system that started operation in 2016 used Aries.
The total injection bandwidth and the bisection bandwidth were 71 TB/s and 36 TB/s, respectively.

D. Blue Gene/Q Five-dimensional Torus

IBM Blue Gene/Q (BG/Q) was a highly scalable supercomputer that had a five-dimensional torus network [14][15]. Each node had 10 links for the torus network and each link provided 2.0 GB/s peak throughput. The injection bandwidth per node was 20 GB/s. The Sequoia system that started classified operations in 2013 was a BG/Q system with 98,304 nodes. The total injection bandwidth was 1.97 PB/s and the bisection bandwidth was 49.2 TB/s. The characteristics and performance of the BG/Q five-dimensional torus network were similar to those of the 6D mesh/torus network of the Tofu interconnect.

IV. DESIGN OF TOFUD

This section describes the design of TofuD, focusing on the differences from Tofu2.

A. Node Configuration

Figure 1 shows a block diagram of the post-K computer node. The number of CMGs increased to four from the two of Tofu2, and the number of TNIs also increased from four to six. The CMGs and the TNIs are connected by the network on chip (NOC). As the number of CMGs increases, the distance to the TNIs differs from one CMG to another. Two CMGs are far from the TNIs, and the other two CMGs are near the TNIs.

Figure 2 shows a prototype CPU memory unit (CMU). Two processor packages and three cable cages are cooled by water. One compute node consists of one package in which one processor chip and four stacks of high bandwidth memory (HBM) are integrated. As a trade-off for the use of the high-density memory packaging technology, the number of memory stacks per node is half that of FX100, which used eight packages of HMC. In order to balance with the halved number of memory stacks, TofuD again halved the number of signal lanes, to 20 from the 40 of Tofu2. To reduce the hardware cost, TofuD uses mainstream quad-lane active optical cables. Half of the CMUs in a shelf connect two optical cables for the X- and Y-axes, and the other half connect three optical cables for the X-, Y-, and Z-axes.
Each active optical cable is shared by two links in the same direction of the two compute nodes on the same CMU. Although the number of signals for each active optical cable is one-third of that of the board-mount optical assembly used in Tofu2, the number of optical modules on the board is reduced to 2.5 from the 8 of FX100 owing to the reductions in the optical link ratio, the number of high-speed signals per node, and the number of nodes per board.

Fig. 1. Block diagram of the post-K computer node. Labeled blocks: four CMGs, each with memory; the NOC; a PCIe controller; TNI0 to TNI5; and the Tofu network router with ports X+, X-, Y+, Y-, Z+, Z-, A, B+, B-, and C.


Fig. 2. Prototype CPU memory unit

B. Package Structure and Link Configuration

In a rack of the post-K computer, each of the upper and lower halves of the rack houses 192 nodes with the geometry (X, Y, Z, A, B, C) = (2, 2, 4, 2, 3, 2). Each half rack accommodates four building blocks called shelves, two on the front side and two on the rear side. The geometry of a shelf is (X, Y, Z, A, B, C) = (1, 1, 4, 2, 3, 2). Figure 3 shows a prototype rack of the post-K computer. Each side of the rack stores four shelves vertically. Each shelf houses 24 CPU memory units (CMUs), each of which carries two nodes connected along the C-axis. All connections within a half rack use electric links, and the connections leaving a half rack use optical links. Therefore, half of the connections in the X- and Y-axes and one fourth of the connections in the Z-axis use optical links. Because of the high-density packaging and the large structure of the half rack, the optical link ratio of TofuD is as low as 25%, a substantial decrease from the 66% of Tofu2, which used optical links for connections leaving a 2U chassis with the geometry (X, Y, Z, A, B, C) = (1, 1, 3, 2, 1, 2).

Fig. 3. Prototype rack of the post-K computer

C. Injection Rate per Node

Table I shows the comparison of node and link configurations within the Tofu family. TofuD uses a high-speed signal with a 28-Gbps data rate, which is approximately 9% faster than that of Tofu2. However, due to the reduction in the number of signals, TofuD reduces the link bandwidth to 6.8 GB/s, which is approximately 54% of that of Tofu2. To compensate for the reduction in the link bandwidth, TofuD increases the number of simultaneous communications from the 4 of Tofu2 to 6. As a result, the injection rate of TofuD is raised to approximately 80% of that of Tofu2. There are six adjacent nodes in the virtual 3D torus; therefore, topology-aware algorithms can use six simultaneous communications effectively. The logic circuits of TofuD operate at a 425-MHz clock frequency, which is about 9% faster than the clock frequency of Tofu2.
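The optical link ratios quoted in Section IV-B (25% for TofuD, 66% for Tofu2) can be cross-checked with a simple counting argument. The sketch below is our own reading, not a method stated in the paper: it assumes the X-, Y-, and Z-axes are much longer than the electrically wired block, so that one out of every k links along an axis with block extent k crosses a block boundary. The function name `optical_link_ratio` is hypothetical.

```python
from fractions import Fraction

# Ports per node on each axis (X, Y, Z, A, B, C), as given in Section II-A.
PORTS = (2, 2, 2, 1, 2, 1)
# Full length of the fixed-size axes; None marks the system-size-dependent X, Y, Z axes.
AXIS_LENGTH = (None, None, None, 2, 3, 2)


def optical_link_ratio(block):
    """Fraction of a node's links that leave an electrically wired block.

    `block` is the geometry (X, Y, Z, A, B, C) of the unit inside which all
    links are electric. Assumption: along an axis with block extent k, one
    out of every k links crosses a block boundary and is therefore optical.
    """
    optical = Fraction(0)
    for ports, extent, full in zip(PORTS, block, AXIS_LENGTH):
        if full is not None and extent == full:
            continue  # the whole axis fits inside the block: electric links only
        optical += ports * Fraction(1, extent)
    return optical / sum(PORTS)


if __name__ == "__main__":
    half_rack = (2, 2, 4, 2, 3, 2)    # TofuD half rack
    chassis_2u = (1, 1, 3, 2, 1, 2)   # Tofu2 2U chassis
    print(float(optical_link_ratio(half_rack)))
    print(float(optical_link_ratio(chassis_2u)))
```

With the two geometries from the text, this counting yields 1/4 for the TofuD half rack and 2/3 for the Tofu2 chassis, consistent with the quoted 25% and 66%.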
The width of the datapath decreases from 256 to 128 bits as the number of signal lanes decreases.

TABLE I. DATA RATES OF SIGNAL AND INJECTION RATES
(Columns: Tofu, Tofu2, TofuD. Rows: number of signal lanes per node; data rate (Gbps); link bandwidth (GB/s); number of TNIs per node; injection bandwidth per node (GB/s).)
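The relationship between link bandwidth, number of TNIs, and injection bandwidth per node can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the link bandwidths and TNI counts stated in the text; the dictionary layout and the function name `injection_bandwidth` are illustrative assumptions.

```python
# Peak link bandwidth [GB/s] and number of TNIs per node, as stated in the text.
FAMILY = {
    "Tofu": {"link_gbs": 5.0, "tnis": 4},
    "Tofu2": {"link_gbs": 12.5, "tnis": 4},
    "TofuD": {"link_gbs": 6.8, "tnis": 6},
}


def injection_bandwidth(name):
    """Peak injection bandwidth per node = link bandwidth x simultaneous transfers."""
    spec = FAMILY[name]
    return spec["link_gbs"] * spec["tnis"]


if __name__ == "__main__":
    for name in FAMILY:
        print(f"{name}: {injection_bandwidth(name):.1f} GB/s")
    ratio = injection_bandwidth("TofuD") / injection_bandwidth("Tofu2")
    print(f"TofuD / Tofu2 peak injection: {ratio:.1%}")
    link_ratio = FAMILY["TofuD"]["link_gbs"] / FAMILY["Tofu2"]["link_gbs"]
    print(f"TofuD / Tofu2 link bandwidth: {link_ratio:.1%}")
```

The products reproduce the 20 GB/s and 50 GB/s figures given for Tofu and Tofu2, and give 40.8 GB/s for TofuD, that is, roughly 82% of Tofu2, in line with the "approximately 80%" stated above.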

D. Communication Resources

TABLE II shows a comparison of the number of communication resources within the Tofu family. Both the number of compute cores and the number of TNIs per node increased by 1.5 times from Tofu2, and the number of CQs per TNI remained constant at 12. In Tofu2, there was no change in the Tofu barrier. In TofuD, the amount of communication resources for the Tofu barrier has increased as the number of CMGs has increased. To allocate a BCH from a different TNI to each CMG, the Tofu barrier becomes available on all TNIs in TofuD, and the number of resources per node increased significantly for both BCH and BG. The ratio of BCHs to BGs increased from 1:8 to 1:3 because the reduce-broadcast tree algorithm is assumed for the intra-node part of synchronization, which reduces the number of BGs to be used. The buffer size of each BG is also expanded so that the Tofu barrier can perform all-reduce of eight integer or three floating-point elements with one synchronization.

TABLE II. NUMA DOMAIN AND COMMUNICATION RESOURCES
(Columns: Tofu, Tofu2, TofuD. Rows: number of compute cores per node; number of CMGs per node; number of TNIs per node; number of CQs per node; number of BCHs per node; number of BGs per node.)

E. Dynamic Packet Slicing for Dual-rail Transfer

The physical coding sublayer (PCS) of Tofu2 was developed based on the 100Gb Ethernet technology. The packet transfer latency of Tofu2 increased to approximately 0.3 μs from approximately 0.1 μs for Tofu because of the complex transmission technology, including encoding, symbol detection, multi-lane distribution, and lane-to-lane deskew. Tofu2 also had an issue with its fault-tolerance feature. Tofu2 introduced the link degradation feature, which reduced the number of active lanes without losing a packet. However, once the link degraded, the number of lanes never recovered; therefore, there was no fault resilience. To address these issues, TofuD applies a new technique called dynamic packet slicing for dual-rail transfer.
To address the latency issue, TofuD implements an independent PCS for each signal lane and splits a packet in the data link layer. To address the fault-resilience issue, TofuD duplicates a packet and redundantly transfers it in both lanes, as opposed to reducing the number of active lanes. The data link layer adds information to the packet indicating whether the packet has been split or duplicated. The data link layer also monitors how frequently the receiver-side PCS detects CRC and other transmission errors, and adds the transmission quality status information to the packet as well. The data link layer determines the split mode of the packet depending on the received transmission quality status information.

Figure 4 shows the frame format, which includes a routing header, a transport layer packet (TLP), and padding space for a data link layer packet (DLLP). First, the data link layer stores a DLLP in the frame. Next, the data link layer simultaneously generates two slices from the frame. The routing header is duplicated to the two slices, the TLP and DLLP are split or duplicated, and the padding is removed. Finally, the two slices are distributed to two PCSs, and each PCS adds a preamble, a CRC code called FCS, and an inter-frame gap to the slice. Figure 5 shows the undivided slice format, which includes a routing header, the full TLP, the full DLLP, and control codes to envelop the payload. Figure 6 shows the divided slice formats, each of which includes a routing header, a split TLP, a split DLLP, and control codes to envelop the payload. The PAT field in a slice indicates the pattern of packet splitting, and the STAT field indicates the status of the observed transmission quality.
The PAT field is defined as a 3-bit field for future expansion to quad-lane.

Fig. 4. Frame format. Labeled regions: routing header (fields LEN, DABC1, DX, DY, DZ, DABC2, DI, B, S, VC); transport layer (TLP +0 to TLP +(32LEN+31)); padding reserved for the data link layer.
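The slice generation step can be sketched in Python as follows. This is an illustrative model, not the hardware implementation: the 8-byte TLP interleaving and 4-byte DLLP interleaving are our reading of the byte offsets drawn in the divided slice format (Fig. 6), the dictionary-based slice layout and the names `deal`, `undeal`, `make_slices`, and `merge_slices` are hypothetical, and the STAT, SEQ, preamble, FCS, and control-code fields are omitted.

```python
def deal(data: bytes, chunk: int):
    """Deal consecutive `chunk`-byte pieces alternately to lane 0 and lane 1."""
    lanes = (bytearray(), bytearray())
    for i in range(0, len(data), chunk):
        lanes[(i // chunk) % 2].extend(data[i:i + chunk])
    return bytes(lanes[0]), bytes(lanes[1])


def undeal(lane0: bytes, lane1: bytes, chunk: int) -> bytes:
    """Inverse of `deal`: re-interleave the pieces of the two lanes."""
    out = bytearray()
    for i in range(0, max(len(lane0), len(lane1)), chunk):
        out.extend(lane0[i:i + chunk])
        out.extend(lane1[i:i + chunk])
    return bytes(out)


def make_slices(header: bytes, tlp: bytes, dllp: bytes, split: bool):
    """Generate two slices from one frame (padding assumed already removed)."""
    if split:
        tlp_parts = deal(tlp, 8)    # assumed: 8-byte TLP pieces alternate between slices
        dllp_parts = deal(dllp, 4)  # assumed: 4-byte DLLP pieces alternate between slices
        pat = "split"
    else:
        tlp_parts = (tlp, tlp)      # duplicate mode: each lane carries the full payload
        dllp_parts = (dllp, dllp)
        pat = "duplicate"
    # The routing header is duplicated to both slices in either mode.
    return [
        {"header": header, "pat": pat, "tlp": t, "dllp": d}
        for t, d in zip(tlp_parts, dllp_parts)
    ]


def merge_slices(slices):
    """Rebuild (tlp, dllp) at the receiver from the slices that arrived."""
    first = slices[0]
    if first["pat"] == "duplicate":
        # Any single surviving copy is enough in duplicate mode.
        return first["tlp"], first["dllp"]
    second = slices[1]
    return (
        undeal(first["tlp"], second["tlp"], 8),
        undeal(first["dllp"], second["dllp"], 4),
    )
```

In this model, split mode halves the payload carried per lane, while duplicate mode lets the receiver recover the packet from either lane alone, which mirrors the latency and fault-resilience goals described above.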

Fig. 5. Undivided slice format for the duplicate mode. Labeled regions: preamble; routing header (fields LEN, DABC1, DX, DY, DZ, DABC2, DI, B, S, VC, PAT, STAT, SEQ); transport layer (TLP +0 to TLP +(32LEN+31)); data link layer (DLLP +0 to DLLP +15 and other control codes +0 to +7); FCS; inter-frame gap.

Fig. 6. Divided slice formats for the split mode. Both slices carry the preamble, the routing header with the PAT, STAT, and SEQ fields, the FCS, and the inter-frame gap. One slice carries TLP +0 to +7, TLP +16 to +23, and so on, together with DLLP +0 to +3, DLLP +8 to +11, and other control codes +0 to +3; the other slice carries TLP +8 to +15, TLP +24 to +31, and so on, together with DLLP +4 to +7, DLLP +12 to +15, and other control codes +4 to +7.

V. PERFORMANCE EVALUATION

This section gives early evaluation results of the fundamental performance of TofuD.

A. Evaluation Environment

The communication performance of TofuD was evaluated by system-level logic simulations. The simulation models were built using the production Verilog RTL code and included multiple nodes. The simulations were performed on Cadence's hardware emulators. The simulated processor cores executed test programs that used the TofuD hardware directly. The latency results were measured directly from the simulation waveforms; thus, we obtained one-way latencies without halving average round-trip latencies. The throughput results were derived from the measured latency values. For Tofu and Tofu2, the latency breakdown results were obtained from simulation waveforms in the same way as for TofuD. The other results of Tofu and Tofu2 were evaluated on actual machines using the low-level communication library. In these preliminary evaluations, the test programs included no communication software stack such as an MPI library; therefore, the evaluation results included no software overhead, and all test programs performed nearest-neighbor communication.

B. Latency

TABLE III shows the evaluated results of the latencies of Tofu, Tofu2, and TofuD. In each evaluation, it is assumed that a Put transfer is executed between the nearest neighbor nodes on the same board, and the time from when the initiator process started the Put transfer to when the target process read the data was measured. In Tofu, the direct descriptor feature reduced the latency by more than 0.2 μs. In Tofu2, the cache injection feature reduced the latency by nearly 0.2 μs. Both these reductions in Tofu and Tofu2 are the result of bypassing the main memory with the newly introduced features of the network interface. In TofuD, the latency is reduced by approximately 0.2 μs again. Overall, the latency has been reduced by 46% from Tofu and 31% from Tofu2. The reduction is mainly due to the overhaul of the transmission technology, such as the compensation for signal skew, and the reconsideration of the pipeline design of the datapaths. There is an additional penalty of approximately 0.05 μs if the initiator process runs on a far CMG in the initiator node and the target process also runs on a far CMG in the target node. Although the difference is small in TofuD, the increasing density and locality on the chip may impact the communication latency in future systems.

Figure 7 presents the latency breakdowns of the one-way, one-hop Put transfer. The latency value for each component was obtained from the simulation waveforms. In Tofu2, the packet transfer latency through one link and two switches increased by approximately 0.2 μs from Tofu due to the complex PCS derived from 100 Gb Ethernet. The packet transfer latency of TofuD is nearly the same as that of Tofu owing to the new dynamic packet slicing technique. In TofuD, the part of the one-way Put latency other than the packet transfer was almost the same as in Tofu2. In total, the one-way Put latency of TofuD has been reduced by approximately 0.2 μs compared with Tofu2.

C. Injection Rate

TABLE IV lists the evaluation results of injection rates and efficiencies of Tofu, Tofu2, and TofuD. In Tofu and Tofu2, four Put transfers in different directions were simultaneously executed and the total throughputs were evaluated. In TofuD, six Put transfers in different directions were executed.
The injection rate of TofuD is more than two times higher than that of Tofu and 17% lower than that of Tofu2. The efficiencies of Tofu are lower than that of a single Put transfer, because Tofu was not integrated in the processor chip, leading to a bottleneck in the bus that connects the processor chip and the interconnect controller chip. The relatively low efficiencies are mainly because of the packet size of the bus, which includes only one cache line of data. Tofu2 and TofuD are integrated into the processor chips, and the efficiencies of their injection rates are almost the same as that of the single Put transfer presented in the next subsection.

Fig. 7. Comparison of latency breakdowns of one-way Put transfer. Vertical axis: latency (nsec). Bars: Tofu, Tofu2, TofuD. Components: Tx CPU, Tx Host bus, Tx TNI, Packet Transfer, Rx TNI, Rx Host bus, Rx CPU.

TABLE IV. INJECTION RATES AND EFFICIENCIES OF SIMULTANEOUS PUT TRANSFERS OF TOFU FAMILY
(Columns: injection rate [GB/s]; efficiency [%]. Rows: Tofu (K), Tofu (FX10), Tofu2, TofuD.)

D. Throughput

TABLE V shows the evaluated results of Put throughputs and the efficiencies of Tofu, Tofu2, and TofuD. The throughput of TofuD is 33% higher than that of Tofu and 45% lower than that of Tofu2. The efficiencies exceed 90% for all versions. These high efficiencies are a distinctive characteristic of the Tofu interconnect family, and are due to the rather large packet size for an HPC interconnect. Although a larger packet size is costly in design, it also reduces the software overheads of system-wide communication protocols such as IP over Tofu.

TABLE III. ONE-WAY 8-BYTE PUT LATENCIES BETWEEN NEAREST NEIGHBOR NODES OF TOFU FAMILY

         Communication settings       Latency [μs]
Tofu     Descriptor on main memory    1.15
         Direct descriptor            0.91
Tofu2    Cache injection OFF          0.87
         Cache injection ON           0.71
TofuD    To/from far CMGs             0.54
         To/from near CMGs            0.49

TABLE V. THROUGHPUTS OF PUT TRANSFER AND EFFICIENCIES OF THE TOFU FAMILY
(Columns: throughput [GB/s]; efficiency [%]. Rows: Tofu, Tofu2, TofuD.)
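The latency reductions quoted in Section V-B follow directly from the values in TABLE III. The short sketch below recomputes them; the dictionary keys and the helper name `reduction` are illustrative choices, and the comparison uses the best-case setting of each generation, which is our assumption about how the percentages were derived.

```python
# One-way 8-byte Put latencies in microseconds, taken from TABLE III.
LATENCY_US = {
    ("Tofu", "descriptor on main memory"): 1.15,
    ("Tofu", "direct descriptor"): 0.91,
    ("Tofu2", "cache injection off"): 0.87,
    ("Tofu2", "cache injection on"): 0.71,
    ("TofuD", "far CMGs"): 0.54,
    ("TofuD", "near CMGs"): 0.49,
}


def reduction(old_us, new_us):
    """Relative latency reduction when going from `old_us` to `new_us`."""
    return (old_us - new_us) / old_us


if __name__ == "__main__":
    best_tofu = LATENCY_US[("Tofu", "direct descriptor")]
    best_tofu2 = LATENCY_US[("Tofu2", "cache injection on")]
    best_tofud = LATENCY_US[("TofuD", "near CMGs")]
    print(f"vs Tofu:  {reduction(best_tofu, best_tofud):.0%}")
    print(f"vs Tofu2: {reduction(best_tofu2, best_tofud):.0%}")
    far_penalty = LATENCY_US[("TofuD", "far CMGs")] - best_tofud
    print(f"far-CMG penalty: {far_penalty:.2f} us")
```

The computed values round to 46% and 31%, and the far-CMG penalty to 0.05 μs, which agrees with the figures given in the text.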

The efficiency of Tofu2 is slightly lower than that of Tofu and TofuD. This is mainly because of the overhead of data alignment. Tofu and TofuD were implemented with 128-bit datapaths and the data alignment was 16 bytes. Tofu2 was implemented with a 256-bit datapath and the alignment was 32 bytes.

E. Intra-node Latency of the Tofu Barrier

The Tofu barrier is extended for intra-node use in TofuD. This subsection presents the evaluated latency results of the intra-node Tofu barrier. First, the latency of each component was evaluated from the waveform of a simple test that uses only one BCH and two BGs connected in series. The latency of a BCH and a start-and-end BG was approximately 0.48 μs, and the latency of a relay BG was nearly 0.13 μs. Next, intra-node synchronization latencies using the Tofu barrier were evaluated using the test programs. The number of BCHs to be synchronized varied from 4 to 48. If the number of BCHs exceeded the number of TNIs, multiple BCHs were used in a TNI. The test programs used the reduce-broadcast tree algorithm for intra-TNI synchronization and the recursive doubling algorithm for inter-TNI synchronization. The total number of BGs used per node and the number of communication stages for each test program are shown in TABLE VI. In these test programs, one process operated all BCHs; therefore, the deviation of the synchronization start time was small compared with the actual usage condition, in which each BCH is operated by a different process.

Figure 8 shows the evaluated results and the estimated latencies. The minimum latencies were estimated so that the latency component of relay BGs increased in proportion to the log2 of the number of BCHs. However, as the number of BCHs per TNI increased beyond 1, the evaluation results became worse than the estimated minimum latencies. The waveform result showed that all BCHs and BGs were serially processed.
The latencies of the BCH and the BG at the start point overlapped between BCHs for 0.19 μs out of 0.48 μs, and the remaining 0.29 μs was serialized. The estimated latencies of processing the BG and the BCH serially were close to the evaluation results. The evaluation results showed that there was a latency penalty when allocating multiple BCHs from the same TNI to the same communicator. The MPI library should be implemented so that it uses the Tofu barrier while avoiding this penalty, as follows. If the number of processes in a node does not exceed six, the MPI library should allocate one BCH to each process from a different TNI. If the number of processes in a node exceeds six, the MPI library should allocate one BCH to each of six groups of processes. Each group of processes shares one BCH and synchronizes within the group via memory.

TABLE VI. CONFIGURATIONS OF THE TEST PROGRAMS OF THE TOFU BARRIER
(Rows: number of start-and-end points; number of TNIs; max. number of BCHs per TNI; max. number of BGs per TNI; number of communication stages.)

Fig. 8. Estimated and evaluated results of the Tofu barrier test programs. Horizontal axis: number of BCHs per node. Vertical axis: latency (μsec). Series: estimated latencies assuming serialization; evaluated results from waveform; estimated minimum latencies.

VI. CONCLUSION

In this paper, we introduced a new and highly scalable interconnect called Tofu Interconnect D that will be used in the post-K machine, which will be operational around [...]. The letter D represents high-density nodes and dynamic packet slicing for dual-rail transfer. This paper described the design of TofuD, including the package structure of the node, the rack, the link configuration between nodes, the injection rate per node, the increased communication resources, and a new packet transfer technique. This paper also presented the evaluation results of TofuD. The one-way 8-byte Put latency was 0.49 μs, a reduction of 31% from that of Tofu2. The injection rate per node was 38.1 GB/s, which was approximately 83% of the injection rate of Tofu2.
The link efficiency was as high as approximately 93%. Additionally, the evaluation results showed the constraints on the in-node usage of the Tofu barrier that must be observed to avoid a performance penalty.
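The BCH allocation guideline from Section V-E (one BCH per process from a different TNI for up to six processes per node, otherwise six process groups that each share one BCH) can be sketched as follows. This is a hypothetical illustration, not the MPI library's implementation: the function name `allocate_bch` and the choice to form groups from contiguous process ranks are assumptions.

```python
NUM_TNIS = 6  # TofuD has six TNIs per node


def allocate_bch(num_procs: int, num_tnis: int = NUM_TNIS):
    """Map TNI index -> list of intra-node process ranks sharing that TNI's BCH.

    Up to `num_tnis` processes: each process gets its own BCH on a distinct TNI.
    More processes: they are divided into `num_tnis` groups (contiguous ranks,
    an assumption), and each group shares one BCH and is expected to
    synchronize internally via memory.
    """
    if num_procs < 1:
        raise ValueError("num_procs must be at least 1")
    num_groups = min(num_procs, num_tnis)
    groups = {tni: [] for tni in range(num_groups)}
    for rank in range(num_procs):
        groups[rank * num_groups // num_procs].append(rank)
    return groups


if __name__ == "__main__":
    print(allocate_bch(4))   # one process per TNI
    print(allocate_bch(48))  # six groups sharing one BCH each
```

Either way, no TNI contributes more than one BCH to a communicator, which is the condition the evaluation identified for avoiding the serialization penalty.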

REFERENCES

[1] RIKEN Center for Computational Science, About the Project. [Online]. Available at: [...] [Accessed: 06-May-[...]]
[2] Y. Ajima, S. Sumimoto and T. Shimizu, "Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers," in IEEE Computer, vol. 42, no. 11, pp. 36-40, [...].
[3] Y. Ajima, Y. Takagi, T. Inoue, S. Hiramoto and T. Shimizu, "The Tofu Interconnect," IEEE 19th Annual Symposium on High Performance Interconnects (HOTI), pp. [...], [...].
[4] H. Miyazaki, Y. Kusano, N. Shinjo, F. Shoji, M. Yokokawa and T. Watanabe, "Overview of the K computer System," Fujitsu Scientific and Technical Journal, vol. 48, no. 3, pp. [...], [...].
[5] Y. Ajima et al., "Tofu Interconnect 2: System-on-Chip Integration of High-Performance Interconnect," in Proceedings of the 29th International Conference on Supercomputing (ISC14), pp. [...], [...].
[6] Y. Ajima et al., "The Tofu Interconnect 2," IEEE 22nd Annual Symposium on High-Performance Interconnects (HOTI), pp. [...], [...].
[7] InfiniBand Trade Association, "InfiniBand Architecture Specification Volume 1 Release 1.2.1," [...].
[8] Oak Ridge Leadership Computing Facility, Summit. [Online]. Available at: [...] [Accessed: 06-May-[...]]
[9] Jack Dongarra, "Report on the Sunway TaihuLight System." [Online]. Available at: [...] [Accessed: 06-May-[...]]
[10] B. Vinaik and R. Puri, "Oracle's Sonoma Processor: Advanced Low-cost SPARC Processor for Enterprise Workloads," HotChips 27, [...].
[11] M. S. Birrittella et al., "Intel Omni-path Architecture: Enabling Scalable, High Performance Fabrics," IEEE 23rd Annual Symposium on High-Performance Interconnects (HOTI), pp. 1-9, [...].
[12] Intel, Aurora Fact Sheet. [Online]. Available at: [...] [Accessed: 15-May-[...]]
[13] G. Faanes et al., "Cray Cascade: a scalable HPC system based on a Dragonfly network," in Proceedings of the International Conference on High Performance [...]
[14] D. Chen et al., "The IBM Blue Gene/Q Interconnection Network and Message Unit," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC 2012), Article 26, [...].
[15] D. Chen et al., "Looking under the hood of the IBM Blue Gene/Q network," 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-12, [...].
