Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201


Yoshiko Yasuda, Hiroaki Fujii, Hideya Akashi, Yasuhiro Inagami, Teruo Tanaka*, Junji Nakagoshi*, Hideo Wada* and Tsutomu Sumimoto*

Central Research Laboratory, Hitachi, Ltd., Higashi-koigakubo, Kokubunji, Tokyo 185, Japan. {yoshikoy, fujii, akashi, inagami}@crl.hitachi.co.jp

*General Purpose Computer Division, Hitachi, Ltd., 1, Horiyamashita, Hadano, Kanagawa, Japan.

Abstract

We have developed a hardware detour path selection facility for the Hitachi SR2201 parallel computer, which uses a multi-dimensional crossbar as its inter-processor network, to ensure operating efficiency and high reliability when a part of the network is faulty. When this hardware facility is used, packets are transmitted to their destination along alternative paths that avoid the fault. However, changing the routing may cause deadlock. This paper describes a deadlock-free fault-tolerant routing scheme that the detour path selection facility can use to avoid deadlock, and its implementation for the SR2201.

1. Introduction

In recent years, parallel computer systems with distributed memory [1-4] have dominated the quest for high-performance computing. Generally, these systems consist of a number of processing elements (PEs) and an inter-processor network composed of a combination of switches. In these machines, the number of switches composing the network is proportional to the scale of the system, and the rate of faults in the network likewise increases with scale. Thus, to maintain high reliability while the system is operational, it is very important to avoid any faults in the network.
The IBM SP2 parallel computer, which has a bidirectional multistage interconnection network [2] that provides redundancy, ensures high reliability by setting detour routes from a source node to any destination in the routing table of each PE and by changing the switching technique when a part of the network is faulty. However, in this system, if even one switch is faulty, all data transmission must be controlled by software, which may decrease the efficiency of data transmission.

The CRAY T3D parallel system has a three-dimensional torus network [3-4]. In this system, data transmission is controlled by a routing tag look-up table, which contains the routing information each node uses to create the routing tag in the header of the packet. When a part of the network is faulty, the routing information in the look-up table of each node is rewritten so that no packet passes the faulty point. All packets are then transmitted to their destinations according to this table.

The Hitachi SR2201 parallel computer system [5] has a multi-dimensional crossbar network [6-8], made by combining common crossbar switches (i.e., switches that provide direct connections from any input port to any output port) in a multi-dimensional arrangement, and uses the cut-through routing switching technique [9-10] to transmit packets with low latency and high throughput. It supports a hardware detour path selection facility that ensures operating efficiency and high reliability when a part of the network is faulty. When developing this facility, we wanted to minimize additional hardware and maintain the data transmission efficiency of the system while not affecting the user program. In the SR2201's fault-tolerant facility, to minimize the additional hardware, each switch has only the fault information of the switches to which it is physically connected. To avoid affecting the user program, the detour information is added to and deleted from packets automatically as they pass along the detour paths.
Furthermore, to fully utilize the system's data transmission ability, packets are transmitted around the faulty point without changing the switching technique. In such a case, though, the routing technique must be changed, and changing the routing is problematic because deadlock may occur [11-18]. Thus, we propose a deadlock-free fault-tolerant routing scheme suitable for the network topology and data transmission of the SR2201.

In this paper, we begin by briefly describing the Hitachi SR2201 parallel computer in Section 2. In Section 3, we describe the inter-processor network of the SR2201: the multi-dimensional crossbar network, and the data transmission used for point-to-point and broadcast communication, which relies on cut-through routing and dimension-order routing. In Section 4, we describe the hardware fault-tolerant facility and the routing scheme in the multi-dimensional crossbar network. Finally, we discuss the deadlock problem that arises in the hardware fault-tolerant facility when point-to-point communication and broadcast communication using the data transmission of the SR2201 occur at the same time, and demonstrate our deadlock-free fault-tolerant routing scheme.

2. Hitachi SR2201 Parallel Computer

The structure of the SR2201 is shown in Fig. 1. The SR2201 connects up to 2048 processing elements (PEs), which operate independently, through an inter-processor network. Each PE consists of a 150-MHz RISC microprocessor [22] based on the PA-RISC 1.1 architecture that provides a peak performance of 300 MFLOPS, up to 1 GB of DRAM local memory, a storage controller (SC), and a network interface adapter (NIA). The SC is connected to the RISC microprocessor, the NIA, and the DRAM memory, and it controls all reads from and writes to the memory. The NIA is connected to the network; it generates packets according to the instructions issued by the microprocessor and controls all data transmission between the network and the local memory. Thus, the network and the microprocessors operate independently. As its inter-processor network, the SR2201 uses a multi-dimensional crossbar network that enables data transfer among any of the PEs at 300 MB/s.

Figure 1. Structure of the SR2201. (Labels: PEs connected to the network; inter-processor network; network interface adapter; RISC microprocessor; storage controller; local memory; PE: processing element.)

3. Multi-dimensional Crossbar Network in the SR2201

3.1 Structure

The SR2201 uses a multi-dimensional (MD) crossbar network as its inter-processor network [6-8]. The d-dimensional crossbar network is defined as follows:

(a) The number of PEs (n) can be factorized as n = n1*n2*n3*...*nd, where ni is the number of PEs along the i-th dimension.
(b) Each PE corresponds to a lattice point of a d-dimensional solid, and the lattice points on a line are connected by a common crossbar switch that provides direct connections from any input port to any output port (thus, each PE is connected to d crossbars).
(c) Each PE is attached to a relay switch (router, RT) that connects the PE with the d crossbars. This relay switch is structured as a (d+1)x(d+1) crossbar switch.
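The definition in (a) to (c) can be made concrete with a small model. The following is a minimal sketch, not the SR2201 implementation: it assumes PEs are identified by coordinate tuples, and the function names (`pe_coordinates`, `hops`, `crossbar_count`, `router_ports`) are hypothetical names chosen for illustration.

```python
from itertools import product
from math import prod


def pe_coordinates(dims):
    """Enumerate the lattice points (one per PE) for sizes n1, n2, ..., nd."""
    return list(product(*(range(n) for n in dims)))


def hops(src, dst):
    """Crossbar hops between two PEs: one hop per coordinate that differs,
    because a single crossbar connects all PEs on a line in one dimension."""
    return sum(1 for s, t in zip(src, dst) if s != t)


def crossbar_count(dims):
    """One crossbar per line of lattice points in each dimension."""
    total = prod(dims)
    return sum(total // n for n in dims)


def router_ports(dims):
    """Each router is a (d+1)x(d+1) switch: d crossbars plus its own PE."""
    return len(dims) + 1


dims = (4, 3)                       # a 4x3 two-dimensional crossbar network
pes = pe_coordinates(dims)
print(len(pes))                                     # 12 PEs
print(max(hops(a, b) for a in pes for b in pes))    # 2 hops at most (= d)
print(crossbar_count(dims))                         # 7 crossbars (3 in X, 4 in Y)
print(router_ports(dims))                           # 3 ports per router
```

Counting hops as the number of differing coordinates is what gives the network a diameter equal to its number of dimensions, independent of the size of each dimension.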
For d=1, the MD crossbar network is equivalent to a conventional crossbar network, structured as an n x n crossbar switch. When d=log2n (n is a power of two), the routers are connected directly to each other; thus the MD crossbar network is equivalent to a hypercube network.

The structure of a two-dimensional (2D) 4x3 crossbar network is shown in Fig. 2. As shown in Fig. 2, the 2D crossbar network arranges the PEs in a 4x3 grid, and each PE connects to two crossbars (XBs), one per dimension.

Figure 2. A 4x3 Two-dimensional Crossbar Network. (Labels: Y-dimension crossbar switch; X-dimension crossbar switch; RT: router; PE: processing element.)

The MD crossbar network has many notable characteristics.

Short communication distances: In the MD crossbar network, any two PEs connected by the same crossbar switch can communicate in only one hop via the routers. Any two PEs on a d-dimensional crossbar network can communicate with a maximum of d hops on d crossbars via the routers. For example, in the 2D crossbar network shown in Fig. 2, any two PEs communicate with a maximum of only two hops on the crossbars of the network. Since an MD crossbar network with only a few dimensions can connect more than a thousand PEs, the diameter of the network remains sufficiently small.

Wide communication channels: In many parallel computers, each PE has a corresponding router that is built into the PE. To make systems with a large number of PEs possible, the PEs must be very compact. Thus, in practice, the number of input-output pins of the router, which is approximately the number of ports times the physical channel bandwidth, is physically limited. In large-scale numerical applications, widening the physical channel raises the communication throughput. The number of ports needed by a router of an MD crossbar is equal to one plus the number of dimensions. This means that the physical channel bandwidth can be made as wide as that of a mesh-connected network. By comparison, the router of a hypercube network needs log2n+1 ports, which limits the physical channel bandwidth.

Few network conflicts: A conventional crossbar network, structured as an n x n crossbar switch, has no conflicts in almost all communication patterns. The MD crossbar network is designed based on the conventional crossbar network. Thus, far fewer network conflicts occur in the MD crossbar network than in mesh-connected or torus networks, and the MD crossbar network also provides shorter transmission times and higher throughput [7].

Conflict-free remapping of other topologies: The high number of interconnections in an MD crossbar network allows many important topologies used in large-scale numerical applications to be efficiently mapped onto it. These topologies include ring, mesh, hypercube, and tree-connected networks. A program that generates no conflicts in these topologies will not generate conflicts when re-mapped onto the MD crossbar.

3.2 Data Transmission

Each packet consists of data and a header that contains routing information such as a receiving address and a route change (RC) bit, as shown in Fig. 3. The receiving address consists of d coordinates on the d-dimensional crossbar network. In the case of the 2D crossbar network shown in Fig. 2, the receiving address consists of an X-coordinate and a Y-coordinate. The RC bit is set to change the routing. The possible meanings of the RC bit are shown in Fig. 4. The receiving address is effective only when the RC bit equals 0. When the RC bit does not equal 0, packets are transmitted to their destinations by a special routing.

All packets are transmitted to their destination using cut-through routing, which offers low latency and high throughput, as the switching technique and dimension-order routing as the routing technique.
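The header fields and the normal routing order can be sketched as follows. This is an illustrative model under stated assumptions, not the hardware's packet layout: the class and function names (`RC`, `Packet`, `xy_route`) are hypothetical, the four RC values are the ones listed in Fig. 4 (so the field needs two bits), and X-Y is the dimension order the paper assumes for normal routing on a 2D network.

```python
from dataclasses import dataclass
from enum import IntEnum


class RC(IntEnum):
    """Route change values as listed in Fig. 4."""
    NORMAL = 0
    BROADCAST_REQUEST = 1
    BROADCAST = 2
    DETOUR = 3


@dataclass
class Packet:
    dest: tuple            # receiving address (x, y)
    rc: RC = RC.NORMAL     # route change field
    data: bytes = b""


def xy_route(src, packet):
    """Return the PE coordinates visited under dimension-order X-Y routing.

    The receiving address is only effective when RC is 0 (normal routing),
    so any other RC value is rejected here.
    """
    if packet.rc != RC.NORMAL:
        raise ValueError("receiving address is only effective when RC is 0")
    x, y = src
    dest_x, dest_y = packet.dest
    path = [(x, y)]
    if x != dest_x:                 # one hop on the X-dimension crossbar
        x = dest_x
        path.append((x, y))
    if y != dest_y:                 # then one hop on the Y-dimension crossbar
        y = dest_y
        path.append((x, y))
    return path


print(xy_route((0, 0), Packet(dest=(3, 2))))   # [(0, 0), (3, 0), (3, 2)]
```

Because each dimension is corrected in a single crossbar hop, the path never has more entries than the number of dimensions plus one.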
The cut-through routing approach proposed by Dally and Seitz [10] has been used in many recent parallel computers. In cut-through routing, a packet is divided into a sequence of fixed-size units of data, called flits. The size of a flit depends on system parameters, in particular the channel bandwidth. The header flit (or flits) of a packet governs the route. Each switch in the network starts forwarding a packet as soon as the header flit is received and the required output port is free. If the header flit encounters a port already in use, it is blocked until the port becomes available.

The routing order is set in the network hardware in advance. In this paper, we assume that the dimension-order (normal) routing is X-Y routing: first in the X-dimension, then in the Y-dimension. If a part of the network is faulty, however, the network hardware can change the routing order.

The SR2201 supports two kinds of communication: point-to-point communication and broadcast. Point-to-point communication packets are transmitted using dimension-order X-Y routing according to the receiving address. However, broadcast packets cannot be transmitted by the same routing used in point-to-point communication. When several broadcast communications start at the same time in a network using cut-through routing, deadlock occurs, because they try to acquire channels already secured by other packets.

An example of deadlock involving two broadcast packets on a 2D crossbar network is shown in Fig. 5. In Fig. 5, the crossbar switches are represented by the thin lines. Each thick line describes the flow of either broadcast packet 1 (BC 1) from PE3 or broadcast packet 2 (BC 2) from PE4. In broadcast communication, packets are first transmitted to all output ports of one of the XBs in the X dimension at the same time, and then transmitted to all output ports of all XBs in the Y dimension at the same time. As shown in Fig. 5, each packet can be transmitted through its own X-dimension XB independently. However, in the Y dimension, both packets have to reserve all Y-dimension crossbars. If each broadcast acquires some of these crossbars as shown in Fig. 5 (i.e., cyclic waiting), then deadlock occurs.

Figure 3. Packet Format. (Fields: receiving address (X, Y); route change (RC) bit; data.)

Figure 4. Meanings of the RC bit.

RC bit   Meaning
0        normal routing
1        broadcast request routing
2        broadcast routing
3        detour routing

Figure 5. An example of channel deadlock involving two broadcast communications.
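The cyclic waiting in the Fig. 5 scenario can be checked mechanically with a wait-for graph. The sketch below is an illustration under simplifying assumptions: each broadcast is treated as one agent that must hold every Y-dimension crossbar before it can complete, the crossbar labels `Y1`, `Y2`, `Y3` and the function names are hypothetical, and the cycle test is an ordinary depth-first search rather than anything the hardware performs.

```python
def wait_for_graph(holds, wants):
    """Build edges p -> q whenever p wants a resource currently held by q."""
    owner = {res: p for p, held in holds.items() for res in held}
    graph = {p: set() for p in holds}
    for p, needed in wants.items():
        for res in needed:
            q = owner.get(res)
            if q is not None and q != p:
                graph[p].add(q)
    return graph


def has_cycle(graph):
    """Depth-first search; reaching a node that is still on the stack means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GREY
        for m in graph[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


all_y = {"Y1", "Y2", "Y3"}
wants = {"BC1": all_y, "BC2": all_y}

# Both broadcasts start together and each grabs part of the Y dimension.
concurrent = {"BC1": {"Y1", "Y2"}, "BC2": {"Y3"}}
print(has_cycle(wait_for_graph(concurrent, wants)))   # True: cyclic waiting

# One broadcast acquires every Y crossbar first; the other simply waits.
serialized = {"BC1": set(all_y), "BC2": set()}
print(has_cycle(wait_for_graph(serialized, wants)))   # False: no cycle
```

The second case shows why ordering the broadcasts removes the cycle: a waiting packet that holds none of the contested crossbars cannot block the packet ahead of it.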

Generally, to avoid broadcast deadlock when using cut-through routing, conventional parallel computers limit broadcast communication either by using a separate tree-connected network [1] [19] or by performing the broadcast in software [20-21]. The SR2201, on the other hand, avoids deadlock by gathering and serializing broadcast packets at a specific crossbar switch (the serialized crossbar, the S-XB), which is one of the crossbars in the MD crossbar network, and then transmitting the packets from the S-XB to all PEs. The routing for this broadcast facility is set by the RC bit of the packets.

The routing when broadcast packets are transmitted from PE3 and PE4 at the same time is shown in Fig. 6, where one of the X-dimension crossbars is assumed to be the S-XB. Broadcast packets from PE3 and PE4 are first transmitted to the S-XB via the routers RT9 and RT7 by point-to-point communication, according to the RC bit 'broadcast request', at the same time (step 1). When these broadcast requests reach the S-XB, the S-XB changes the RC bit from 'broadcast request' to 'broadcast', then transmits the packets one by one, in order of arrival, to all routers (RT7, RT8, and RT9) connected to the S-XB (step 2). In this case, we assume that the packet from PE3 is transmitted and the packet from PE4 is made to wait in the S-XB. Since all packets in the SR2201 are transmitted by cut-through routing, the output ports along the path of the waiting packet, including those of RT4 and RT7, remain in use by the packet from PE4. The routers (RT7, RT8, and RT9) connected to the S-XB transmit the packet from PE3 to all Y-XBs (Y1-XB, Y2-XB, and Y3-XB) and to all PEs (PE7, PE8, and PE9) connected to those routers, in dimension-order X-Y routing according to the RC bit 'broadcast' (step 3). Finally, all Y-XBs (Y1-XB, Y2-XB, and Y3-XB) transmit the packet to all routers except the routers (RT7, RT8, and RT9) connected to the S-XB, and those routers then transmit the packet to the PEs connected to them (step 4). After the broadcast packet from PE3 has been transmitted to all PEs, all output ports of the S-XB can be used again. Then the packet from PE4 waiting in the S-XB can be transmitted.

Figure 6. Broadcast Routing on the 2D Crossbar Network. (Panels: step 1; step 2; step 3; step 4.)

Since broadcast packets are transmitted to all output ports of the S-XB (step 2) and are not transmitted again to the routers connected to the S-XB in step 4, this routing prevents deadlock. Since the broadcast packets on the SR2201 are transmitted via the S-XB, the broadcast routing becomes Y-X-Y routing, which differs from the X-Y routing of point-to-point communication.

4. Fault-tolerant Routing in the Multi-dimensional Crossbar Network

To ensure high reliability in the parallel processor system, it is very important to be able to continue operating the system even if a part of the network is faulty. In this section, we describe the hardware detour path selection facility of the multi-dimensional crossbar network, which can be used when there is a single faulty point in the network.

The hardware detour path selection facility of the 2D crossbar network is shown in Fig. 7. In Fig. 7, the thin line describes the routing when there is no faulty switch in the network and the thick line describes the routing when a part of the network is faulty. To implement the detour path selection in hardware, when a switch is faulty, fault information is set in advance in the switches to which it is connected. This information amounts to at most a few bits. For example, the routers hold the fault information of the XBs to which they are connected, and the XBs hold the fault information of the routers to which they are connected. If fault information is set on a switch, the network hardware of that switch changes the route change (RC) bit of the packet from 'normal routing' to 'detour routing', then transmits the packet to the detour point (the detour XB, the D-XB), which is determined by the network hardware in advance. When the packet reaches the D-XB, the network hardware changes the RC bit from 'detour' to 'normal', and then transmits the packet to its destination in dimension-order X-Y routing. The packet leaves no trace of the detour routing behind.

Figure 7. Hardware Detour Path Selection Facility of the 2D Crossbar Network. (Labels: source; destination; normal routing; detour routing.)

Next we describe the routing of the detour path selection facility.

Case 1: Normal

When there is no faulty switch in the network, no fault information is set on any switch. Point-to-point communication packets are transmitted in dimension-order X-Y routing according to their receiving address, since the RC bit is set to 'normal'. Broadcast packets are transmitted in Y-X-Y routing using the S-XB according to the RC bit (we described this routing in Section 3).

Case 2: A part of the network is faulty

(a) Point-to-point communication routing

First, the PE transmits the packet to the X-XB via its router in dimension-order X-Y routing according to the receiving address. Second, the X-XB sets the RC bit to 'detour' if any router to which it is connected is faulty, then transmits the packet to a router other than the faulty one according to the RC bit. Third, that router transmits the packet to the Y-XB according to the RC bit. Fourth, the Y-XB transmits the packet to the D-XB according to the RC bit. Finally, the D-XB changes the RC bit from 'detour' to 'normal', then transmits the packet to its destination in X-Y routing according to the receiving address.

(b) Broadcast routing

If the router connected to the S-XB is faulty, another XB that is not connected to the faulty router substitutes for the S-XB. The broadcast routing then becomes the same as in the no-fault case. If a switch is faulty, the network hardware stops transmission of packets to the faulty switch.

Using Fig. 8, we can describe the detour routing in detail.

Figure 8. Point-to-point communication routing when a part of the network is faulty.

Point-to-point communication from PE1 to PE5 when the router RT2 is faulty is shown in Fig. 8. Since RT2 is faulty, the fault information of RT2 is set in the X-XB connected to it, and one XB is designated in advance as the D-XB used for detouring. If the network had no fault, PE1 would transmit the packet to PE5 via RT2, because dimension-order X-Y routing is normally used. However, since RT2 is faulty, the packet is transmitted to the destination by using a detour path. That routing is as follows:

step 1: PE1 transmits the packet to the X-XB via RT1 in X-Y routing according to the receiving address.
step 2: The X-XB sets the RC bit to 'detour' and then, since the fault information of RT2 is set, transmits the packet to a specific router other than the faulty one (the detour router: RT3).
step 3: RT3 transmits the packet to the Y-XB.
step 4: The Y-XB transmits the packet to the D-XB via RT6 according to the RC bit.
step 5: The D-XB changes the RC bit from 'detour' to 'normal', then transmits the packet to the destination in X-Y routing according to the receiving address.

The main feature of this fault-tolerant routing is that the detour always uses a specific XB (the D-XB). Since this limitation does not allow cyclic waiting, it prevents deadlock.

5. Deadlock-free Fault-tolerant Routing Scheme Suitable for Data Transmission of the SR2201

When the detour path selection facility described in the previous section was implemented in the multi-dimensional crossbar network, we had to adjust this facility to the existing data transmission facilities, such as the hardware broadcast facility of the SR2201, that use the S-XB (described in Section 3). However, if broadcast communication and point-to-point communication occur at the same time and a part of the network is faulty, the detour routing that we have described allows deadlock to occur. This is because cyclic waiting occurs between the point-to-point communication and broadcast communication routes, since non-dimension-order routing is used in both the S-XB and the D-XB. We show an example of deadlock in Fig. 9.

Figure 9. An example of deadlock involving broadcast routing and detour routing. (Labels: broadcast routing; detour routing; deadlock; D-XB.)

Figure 9 shows data transmission when broadcast communication from PE4 and point-to-point communication from PE1 to PE5 occur at the same time and one of the routers is faulty.
(a) PE1 transmits point-to-point communication data to PE5 along the detour path, which passes through the D-XB and RT5.
(b) PE9 transmits a broadcast packet to the S-XB, which is the serialized crossbar switch, via RT9 by point-to-point communication, and then broadcasts to all Y-XBs.
(c) Since the point-to-point communication packet occupies a Y-XB's output port to RT6, the broadcast packet cannot be transmitted while the point-to-point communication uses that output port. On the other hand, the point-to-point communication packet cannot be transmitted while RT5's output port to PE5 is occupied by the broadcast communication. The broadcast packet cannot be transmitted to its destination until all output ports of all Y-XBs are free, and the point-to-point communication packet from PE1 cannot be transmitted to PE5 until RT5's output port to PE5 is free. Again, since each packet is transmitted by cut-through routing, cyclic waiting occurs and deadlock results.

To resolve this deadlock problem, we propose a deadlock-free fault-tolerant routing scheme. In this routing, the D-XB is set to the same XB as the S-XB. Figure 10 shows the case where there is a broadcast communication from PE9 and a point-to-point communication from PE1 to PE6 when RT2 is faulty.

Figure 10. Deadlock-free Fault-tolerant Routing. (Labels: broadcast routing; detour routing; D-XB = S-XB.)

(a) PE9 transmits the broadcast packet to the S-XB by point-to-point communication, then broadcasts the packet to the Y-XBs.
(b) PE1 transmits the packet to PE5 via the detour path. Since the S-XB and the D-XB are the same XB, RT3 transmits the packet to RT9, which is connected to the S-XB, by point-to-point communication. The packet from PE1 cannot proceed until an output port of the S-XB can be used, because the broadcast packet occupies all output ports of the S-XB.

This routing allows the packets to avoid deadlock, because both the detour transfer and the broadcast communication are serialized in the S-XB. There is only one point of non-dimension-order routing in this deadlock-free fault-tolerant routing, even if both point-to-point communication and broadcast communication occur at the same time and a part of the network is faulty, so there is no cyclic waiting between the two kinds of communication. Thus, this routing prevents deadlock.

6. Conclusion

We have described a deadlock-free fault-tolerant routing method for use in the hardware detour path selection facility in the Hitachi SR2201 parallel processor. This method ensures operating efficiency and high reliability when a part of the network is faulty. In this facility, fault information is set in advance in the switches connected to a faulty switch. Since each switch has only the fault information of its neighboring switches, the hardware cost is lower than the cost of adding a redundant network. Each packet is transmitted to its destination according to the fault information and the route information it contains. To avoid deadlock, the detour point is determined in advance according to the route information of the packet. This detour path selection facility on the SR2201 limits the detour point to a specific crossbar switch, which is part of the multi-dimensional network and is used for serializing broadcast communication. Thus, it avoids the deadlock that changing the routing would otherwise cause. In our future research, we intend to improve this facility to further increase the system reliability.

References

[1] C. E. Leiserson et al.: The Network Architecture of the Connection Machine CM-5, Proc. 4th Ann. ACM Symp. Parallel Algorithms and Architectures (SPAA), (1992).
[2] C. B. Stunkel et al.: The SP2 High-performance Switch, IBM Systems Journal, Vol. 34, No. 2, (1995).
[3] R. E. Kessler, J. L. Schwarzmeier: CRAY T3D: A New Dimension for Cray Research, Digest of Papers COMPCON 93, (1993).
[4] W. Oed: The Cray Research Massively Parallel Processor System T3D, Tech. Rep., Cray Research Inc., (1993).
[5] H. Fujii, Y. Yasuda, H. Akashi, Y. Inagami, M. Koga, O. Ishihara, M. Kashiyama, H. Wada, T. Sumimoto: Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System, Proc. of IPPS'97, (1997).
[6] N. Hamanaka, J. Nakagoshi, T. Tanaka: Reducing Network Hardware Quantity by Employing Multi-processor Cluster Structure in Distributed Memory Parallel Processors, COMPAR 92/VAPP V, (1992).
[7] Y. Yasuda, H. Fujii, T. Tanaka, Y. Inagami: Performance Evaluation of the Hyper Crossbar Network, Technical Report of IEICE, CPSY.93.25, (1993), (in Japanese).
[8] A. Murata, T. Boku, T. Harada, H. Amano: Structure and Performance of the MDX (Multi-Dimensional X'bar): A Network Class for Large Scale Multiprocessors, Proceedings of ISCA/IEEE International Conference on Parallel and Distributed Computing, (1996).
[9] P. Kermani, L. Kleinrock: Virtual Cut-through: A New Computer Communication Switching Technique, Computer Networks 3, (1979).
[10] W. J. Dally, C. L. Seitz: The Torus Routing Chip, J. Distributed Computing, Vol. 1, No. 3, (1986).
[11] D. H. Linder, J. C. Harden: An Adaptive and Fault Tolerant Wormhole Routing Strategy for k-ary n-cubes, IEEE Trans. on Computers, Vol. 40, No. 1, (1991).
[12] C. Su, K. G. Shin: Adaptive Deadlock-Free Routing in Multicomputers Using Only One Extra Virtual Channel, 1993 Int'l Conference on Parallel Processing, (1993).
[13] S. Chalasani, R. V. Boppana: Fault-Tolerant Wormhole Routing in Tori, Proc. International Conference on Supercomputing, (1994).
[14] J. Duato: A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks, IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 12, (1993).
[15] C. J. Glass, L. M. Ni: The Turn Model for Adaptive Routing, Proc. 19th Int'l Symp. Computer Architecture, (1992).
[16] P. T. Gaughan, S. Yalamanchili: Adaptive Routing Protocols for Hypercube Interconnection Networks, IEEE Computer, May (1993).
[17] W. J. Dally, H. Aoki: Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels, IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 4, (1993).
[18] R. V. Boppana, S. Chalasani: A Comparison of Adaptive Wormhole Routing Algorithms, Proc. 20th Ann. Int'l Symp. Computer Architecture, (1993).
[19] T. Shimizu, T. Horie, H. Ishihata: Low-latency Message Communication Support for the AP1000, Proc. 19th Ann. Int'l Symp. on Computer Architecture, (1992).
[20] R. Ponnusamy, R. Thakur, A. Choudhary, G. Fox: Scheduling Regular and Irregular Communication Patterns on the CM-5, Proc. of Supercomputing '92, (1992).
[21] S. L. Johnsson, C.-T. Ho: Optimum Broadcasting and Personalized Communication in Hypercubes, IEEE Trans. Computers, Vol. 38, No. 9, (1989).
[22] K. Saito, M. Hashimoto, H. Sawamoto, R. Yamagata, T. Kumagai, E. Kamada, K. Matsubara, T. Isobe, T. Hotta, T. Nakano, T. Shimizu, K. Nakazawa: A 150MHz Superscalar RISC Processor with Pseudo Vector Processing Feature, Proc. Notebook for Hot Chips VII, (1995).
