Performance Evaluation of k-ary Data Vortex Networks with Bufferless and Buffered Routing Nodes

Performance Evaluation of k-ary Data Vortex Networks with Bufferless and Buffered Routing Nodes Qimin Yang Harvey Mudd College, Engineering Department, Claremont, CA 91711, USA qimin_yang@hmc.edu ABSTRACT This paper studies generalized k-ary Data Vortex networks both with and without node buffering. The comparison study focuses on effects of buffering in 4-ary networks and in original binary networks. The results show that with buffers the 4-ary networks gain more performance benefit than that in the binary networks. This is due to significantly reduced traffic backpressure that is present in 4-ary networks in bufferless operations. Detailed comparisons between 4-ary and binary networks with two different buffering schemes are presented. The routing performance in throughput and latency are studied for different network and load conditions. For practical implementation, additional cost due to buffering should be justified for the desirable performance enhancement. Keywords: interconnection network, packet switching, data vortex network, switching fabric, buffering, routing nodes. 1. Introduction to Data Vortex architecture Packet switched optical interconnection networks become key subsystems in high capacity communication systems and multiprocessor computing systems [1-2]. While current optical fiber technology provides enormous bandwidth, there is very limited capability in optical processing and optical buffering techniques [3-4], therefore the main challenge of these interconnection networks is to handle traffic contention which has led to difficulty in adapting most existing switching fabric architectures for optical implementations. For example, networks such as Banyan and Butterfly are widely used as effective self-routing electrical switching fabrics networks, but it is rather challenging to implement them in the optical domain because of the lack of buffering at each node. Deflection based routing on the other hand results in much longer delays due to the need to travel around the network diameter to reach to the next open path to the target. As a result, not only delay but also throughput performance are greatly degraded, and more complex routing logics are often required to avoid deadlock and unfairness. The overall performance scales poorly with the increase of network sizes and network loads in these two-dimensional network topologies [5]. Data Vortex network is uniquely designed with three dimensional arrangements of the routing nodes. Due to the additional dimension, it allows for bufferless operation using deflection based routing while keeping minimal deflection penalty and minimal routing decision. The bufferless operation offers the simplest possible contention resolution in the optical domain [6-7]. This switching architecture implements a single-packet-routing rule at each node through a traffic control mechanism, so that the routing logic at each node is extremely simple. The topology provides multiple open paths to each target address so that deflected packets encounter a much smaller latency penalty (in 2 hops) that is also independent of the network diameter. These network characteristics allow for superior network scalability so that throughput and latency performance are reasonable even for very large network sizes. In the physical layer implementation, wavelength division multiplexing (WDM) is used for stacking high throughput data in the wavelength domain and it also allows for simple header bit extraction using wavelength encoding and decoding. Broadband fast switching devices such as Semiconductor Optical Amplifier (SOA) are critical to switching the short duration packet and maintaining its signal power level. More details on the physical layer implementation and system limitation can be found in [8-9]. Network Architectures, Management, and Applications VII, edited by Ken-ichi Sato, Yuefeng Ji, Lena Wosinska, Jing Wu, Proc. of SPIE-OSA-IEEE Asia Communications and Photonics, SPIE Vol. 7633, 76331K 2009 SPIE-OSA-IEEE CCC code: 0277-786X/09/$18 doi: 10.1117/12.851813 SPIE-OSA-IEEE/ Vol. 7633 76331K-1

c=0 (outermost) c=1 c=2 c=3 c=4 Fig.1. Routing nodes and intra-cylinder links within binary Data Vortex of A=4, H=16 and C=5. The Data Vortex network can be viewed as a multiple stage interconnection network (MIN). The routing nodes are arranged in concentric cylinders with A, H and C = log 2 H + 1 designating the number of nodes along three dimensions angle, height and cylinder respectively. An example of A=4, H=16 and C=5 Data Vortex network is shown in Fig.1, where the intra-cylinder link patterns are specifically shown at each cylinder from the outermost (c=0) to the innermost side (c=4). These links along the same cylinder route a packet back and forth between two height groups, i.e. with specific binary header bit alternating between 1 and 0 using a binary-tree header decoding. The inter-cylinder links (not shown in Fig.1) are to forward a packet to an inner neighbor cylinder while maintaining its height location, thus these paths appear to be parallel paths between the cylinders [6]. Fig.2 Routing node implementation at i th cylinder SPIE-OSA-IEEE/ Vol. 7633 76331K-2

The packet routing in the Data Vortex network is operated in a synchronous and slotted fashion. Packet length is typically chosen to be the same as the hop latency for a simple and efficient implementation. Packets are injected at the outermost cylinder at specified injection angles A in, and each travels towards the innermost cylinder at its shortest paths which depends on its target address as well as other data traffic. Each node processes a single arrival packet as shown in Fig.2. For example, a routing node at i th cylinder makes its routing decision based on the packet frame bit, which tells whether or not a packet is present at the incoming paths, the corresponding i th header bit as well as the electrical control bit sent from its inner competing node. Both the frame and header bits can be extracted by a small power tap and through passive filtering and low packet rate optical/electronic (O/E) conversion. The traffic control signal is generated at each node to properly permit or block the packet of the outer cylinder so that the single-packet-routing rule is always satisfied for bufferless operation for all the routing nodes [10]. Fig.3 Distributed control signal among routing nodes. Fig.3 shows the distributed control signals (in dashed line) among the routing nodes on different cylinders. As an example shown, node 3 and node 4 both send packets to node 5, therefore node 4 of the inner cylinder generates a control signal for node 3 to set up the single packet routing rule. Similarly node 3 generates a control signal for its outer cylinder competing node 7 since they both send packet to node 6. Packets receiving the blocking control are deflected to stay on the current cylinder which acts as virtual buffers with a two hop delay penalty. Once packet arrives at the correct target, it exits the network in the innermost cylinder. The last cylinder maintains the same height to provide additional optical buffering in addition to the electrical buffering present in the output ports. If not all angles at exit are connected to output ports, the last cylinder also deals with angular address resolution of the packet. More details on the angular resolution and choice of angles vs. network height or I/O ports have been discussed in [11]. As mentioned above the routing in the Data Vortex network is progressed through a series of binarytree decoding stages. Each node and each cylinder stage only decodes a single header bit in the binary target address that corresponds to the specific cylinder, and it decides to route the packet further in the current level (current cylinder) or forward the packet to the next level (inner cylinder). Successful forwarding is also dependent on the availability of open control signal due to the traffic condition of the inner cylinder. At different traffic conditions, overall traffic latency is accumulated through routing, deflection and forwarding hops. 2. EXTENDING TO GENERAL K-ARY DATA VORTEX ARCHITECTURE Since the header decoding is extremely simple with passive filtering and low speed electronics for detection, it is possible to allow for multiple header bits decoding at each stage without significantly SPIE-OSA-IEEE/ Vol. 7633 76331K-3

increase the complexity and cost of the overall switching system. In a previous study [12], we proposed implementing the Data Vortex network with such k-ary decoding at each routing node, where the binarytree decoding is the special case with k=2. For the ease of optical system implementation, we limited our exploration to the cases of bufferless operation in the study, and the single packet routing rule is maintained as that in the original binary networks. Network routing performance under various arrangements of the routing and forwarding paths have been compared and studied. In particular, the latency in forwarding hops and routing hops are each considered for detailed comparison study. In the case of binary tree decoding in regular Data Vortex networks, each node decodes one header bit, and the routing on a specified cylinder chooses one out of two groups (upper vs. lower group or specific header bit being 1 vs. 0 ), and the deflection latency penalty in the network is two hops. As we extend the concept to general k-ary decoding, each stage then decodes (log 2 bits, and each hop along the same cylinder allows for the packet to choose one out of k groups (i.e. specific (log 2 bits alternate among all possible k combinations). When the corresponding (log 2 header bits in the target address are matched with the (log 2 bits in the routing node height address, the packet will proceed to the next inner cylinder if the corresponding traffic control opens the path at the same time. The same traffic control mechanism is set up for the purpose as that in the original Data Vortex network. If inner traffic causes deflection, the outer packet stays on the current cylinder for further routing. Since no buffering is necessary the routing logic is kept as minimal as possible. As a result of such arrangement, the deflected packet in k-ary networks undergoes k hops delay penalty to return to the matched height or to the open path to the next cylinder routing. c=0 outermost c=1 c=2 Fig.4 Routing patterns at each of the three cylinders in a 4-ary decoding Data Vortex network. A=4, H=16, C = log 4 H + 1 = 3 SPIE-OSA-IEEE/ Vol. 7633 76331K-4

Fig.5 Routing node implementation at i th cylinder in 4-ary decoding network In order to allow for a complete permutation of k groups based on the corresponding (log 2 header bits, the intra-cylinder routing paths are slightly modified from the binary-tree decoding networks. In Fig.4 an example of 4-ary decoding network is shown where each hop decodes 2 header bits in a network of A=4, H=16 and the number of cylinders isc = log 4 H + 1 = 3. The routing node is shown in Fig.5 and the only difference is the requirement of two header bits detection. Note the interconnection patterns at different cylinders in the binary decoding Data Vortex network are combined and also reversed at proper angles to construct the 4-ary decoding networks. In such an arrangement, a packet takes k hops to go through all k possible groups along the cylinder, so the deflection latency penalty would increase to 4 hops in 4-ry decoding networks, which is the obvious disadvantage of using big value of k. Such disadvantage can also be shown in two dimensional routing topologies such as Banyan and Bufferfly networks where the deflection penalty equals the diameter of the network, and this greatly eliminates their potential for pure deflection routing based optical implementations in larger networks. On the other hand, in extended k-ary Data Vortex implementation, the number of cylinders required is C = log k H + 1, assuming the last cylinder maintains the same height just for output buffering purpose as that of regular Data Vortex network. As a result, the forwarding latency is much smaller with a larger value of k. The network height H is chosen so that log k H is kept an integer for simplicity. The comparison study between the 4-ary networks and the binary-decoding networks in bufferless operation have shown that the latter achieves overall better performance in throughput as well as in latency. The gain in reduction of forwarding hops is cancelled out by the increase in routing hops at each stage due to much greater traffic backpressure, and such backpressure tends to introduce more congestion and degrade data throughput at the same time. 3. NETWORKS WITH BUFFERING CAPABILITY WITHIN THE ROUTING NODES While bufferless Data Vortex networks achieve reasonable routing performance, a previous study [7] suggested incorporating an additional node buffer for enhanced performance. Such systems require additional switching devices and more complicated routing arbitration due to multiple packets scheduling and prioritizing buffered packets. The first proposed buffer keeps the single packet processing so that a two slot delay buffer is required to establish the necessary traffic control. In the second buffering scheme, the node can process two packets simultaneously where one is from the node buffer and the other one is from outside of node. In this case a single slot buffer is required but two parallel sets of switching elements are needed for the routing. The performance enhancement with the node buffer shown in [7] is reasonably significant especially with the second scheme, so if the additional hardware cost mainly the switching components can be justified or reduced with advanced integration technology, buffered routing node could serve as a feasible solution for applications that desire higher throughput or lower latency. SPIE-OSA-IEEE/ Vol. 7633 76331K-5

(a) (b) Fig.6 Buffered routing nodes with (a) single arrival packet (b) two arrival packets. The buffer implementation in [7] can easily be incorporated within the generalized k-ary networks in the same manner. Fig.6 shows such node in the example of 4-ary networks with two different buffering schemes. Figure (a) shows the buffered node with a single packet processing and (b) shows the nodes with two packets processing which require twice of the switching components to handle the routing. In both cases, the previously buffered packet will be given priority in routing so that it leaves the node before an un-buffered packet does. Since the study in bufferless k-ary networks suggested the disadvantages of traffic backpressure with high value of k, we expect the presence of buffering has more significant effect on performance enhancement in such networks. Therefore, this study would focus on such performance enhancement comparison and particularly compare the results in k-ary network to buffering effect reported in the binary networks. 4. PERFORMANCE EVALUATION OF K-ARY NETWORKS WITH NODE BUFFERING For performance evaluation, the previously developed C/C++ event simulator is extended to include networks with the two different methods of node buffering in k-ary networks. Not only the routing performance is compared with that in the bufferless operation, such difference in performance is also compared to that reported in the binary networks with and without node buffers. For comparison purpose, only networks with similar costs are studied with a range of traffic load and redundant conditions. The study is worthwhile because of the potential benefit in performance enhancement, even though the increased cost in implementation must be kept in consideration. The cost evaluation has been carried out in details in [7] for the binary networks, and such evaluation can be directly applied to the generalized k-ary networks due to the same implementation of buffering scheme. To focus on the buffering effect on the extended k-ary networks and compare that in the original Data Vortex, two networks M in binary (A=5, C=9, H=256) and N in 4-ary (A=9, C=5, H=256) are compared under different traffic and buffering conditions. We mainly focus our attention to 4-ary networks because of their comparable low deflection penalty. A medium redundant condition is considered when A in =3. M 0, M 1 and M 2 represent the case when network M under bufferless operation, buffered with one packet processing and buffered with two packets processing respectively. Similarly N 0, N 1 and N 2 represents network N under these buffering conditions. Since both networks share the same number of routing nodes and same number of I/O ports, we consider them to have the same cost for the same buffering conditions. SPIE-OSA-IEEE/ Vol. 7633 76331K-6

1 Throughput Performance Ain=3 0.8 Injection rate 0.6 M0: bufferlss 0.4 M1:buffered, one packet M2: buffered, tw o packets N0: bufferless 0.2 N1: buffered, one packet N2:buffered, tw o packets traffic load 0 0 0.2 0.4 0.6 0.8 1 Fig.7 (a) Throughput performance comparison 25 Delay Performance Ain=3 20 Average hops 15 10 5 0 M0: bufferlss M1:buffered, one packet M2: buffered, tw o packets N0: bufferless N1: buffered, one packet N2:buffered, tw o packets traffic load 0 0.2 0.4 0.6 0.8 1 Fig.7 (b) Delay performance comparison Fig.7 (a) and (b) show the results of routing performance comparison. The throughput is measured by the successful injection rate at the input angles, and the latency is the average hopes of packets that arrive at the output ports. The graph indicates that buffering with single packet processing brings more improvement in 4-ary networks if we compare the throughput performance from M 0 to M 1 to that from N 0 to N 1. The resulted N 1 s throughput is slightly better than that of M 1. This is mainly due to the much worse traffic backpressure in 4-ary bufferless networks, and the added node buffer relieves such pressure for significant enhancement compared to a relatively modest improvement in the binary networks. For buffering with two packets processing, the binary networks (M 2 ) outperforms the 4-ary networks (N 2 ) similar to the bufferless performance comparison (M 0 vs. N 0 ). Therefore in this case the effect of buffering is very similar if we compare the difference from M 0 to M 2 to that from N 0 to N 2. For delay performance, M 1 has much worse performance compared to M 0 due to its two-slot delay. While also worse at very high loads, the delay in N 1 is closer or slightly better than N 0 for other load conditions. The overall delay in N 1 is significantly better SPIE-OSA-IEEE/ Vol. 7633 76331K-7

than that of M 1. For buffering with two packets processing, N 2 also gains much significant latency improvement from N 0 compared to improvement from M 0 to M 2. In summary, the presence of the buffering at node tends to benefit the overall performance gain of 4-ary networks more compared to that in the original networks. 1 Throughput Performance Ain=5 0.8 Injection rate 0.6 0.4 M0: bufferlss M1:buffered, one packet M2: buffered, tw o packets 0.2 N0: bufferless N1: buffered, one packet N2:buffered, tw o packets traffic load 0 0 0.2 0.4 0.6 0.8 1 30 Fig. 8 (a) Throughput performance comparison for Ain=5 Delay Performance Ain=5 25 Average hops 20 15 10 M0: bufferlss M1:buffered, one packet M2: buffered, tw o packets 5 N0: bufferless N1: buffered, one packet 0 N2:buffered, tw o packets traffic load 0 0.2 0.4 0.6 0.8 1 Fig. 8 (b) Delay performance comparison for Ain=5 Similar trends and comparison results can be obtained for different redundant conditions. Fig.8 shows the comparison for network M and N when Ain=5. While the overall performance are worse due to effectively higher network loads, the performance difference follow very similar trend as discussed earlier. These are due to the inherent scalable nature of the Data Vortex network, whether it is in the binary SPIE-OSA-IEEE/ Vol. 7633 76331K-8

configuration or in k-ary configuration. We expect the same trend for larger network size comparison as well. 5. CONCLUSION In short, we have studied the implementation of node buffering in the extended k-ary Data Vortex network architecture. The network routing performance has been studied with the two different buffering schemes. In comparing to the performance reported in binary Data Vortex networks, the k-ary network specifically the 4-ary network gains much enhanced improvement in throughput and latency due to such employment of the node buffer. This is mainly because of the higher traffic backpressure in the bufferless operation in 4- ary networks. As a result, the lower forwarding latency or smaller number of cylinders does not contribute to overall better performance under bufferless operation. Once buffered, with the single packet processing nodes, the ultimate performance of 4-ary network achieves better throughput as well as lower latency than that in binary networks with the same buffering scheme. If buffered with the two packet processing nodes, the binary network achieves slightly better throughput performance, but the latency performance is worse than that in 4-ary networks. The combined performance gain more improvement in 4-ary networks than that in binary networks with these buffering capabilities. While the study focuses on the performance enhancement, additional switching cost associated with buffering must be considered for practical implementation. 6. REFERENCE: [1] H. Jonathan Chao and Ben Liu, High Performance Switches and Routers, Wiley-Interscience, ISBN 0470053674, 2007. [2] Gorgios I. Papadimitriou, Chrisoula Papazoglou and Andreas S. Pomportsis, Optical Switching: Switch Fabrics, Techniques and Architectures, Journal of Lightwave Technology, Vol.21, No. 2, pp.384-405, Feb 2003. [3] Haijun Yang and S.J. Ben Yoo, All-Optical Variable Buffering Strategies and Switch Fabric Architecture for Future All-Optical Data Routers, Journal of Lightwave Technology, Vol. 23, No.10, pp.3321-3330, Oct 2005. [4] Chowdhury, A. Yong-Kee Yeo, Jianjun Yu and Gee-Kung Chang, DWDM reconfigurable optical delay buffer for optical packet switched networks, IEEE Photonics Technology letters, Vol. 18, No.10, pp.1176-1178, May 2006. [5] B.E.Swekla and R.I.Macdonald, Tandem Banyan Switching Fabric with Dilation, Electronics Letters, Vol.27, No.19, pp.1770-1772, 1991. [6] Hawkins, C.; Small, B.A.; Wills, D.S.; Bergman, K.; The Data Vortex, an All Optical Path Multicomputer Interconnection Network, IEEE Transactions on Parallel and Distributed Systems, Vol. 18, No.3, pp. 409-420, March 2007. [7] Assaf Shacham and Keren Bergman, On contention resolution in the data vortex optical interconnection networks, Journal of Optical Networking, Vol.6, pp.777-788, 2007. [8] A. Shacham, B.A.Small, O.Liboiron-Ladouceur and K. Bergman, A fully implemented 12x12 data vortex optical packet switching interconnection networks, Journal of Lightwave Technology, vol.23, pp.3066-3075, 2005. [9] O. Liboiron-Ladouceur, B.A. Small, K. Bergman, "Physical Layer Scalability of WDM Optical Packet Interconnection Networks," Journal of Lightwave Technology, Vol. 24, No.1, pp. 262-270, Jan 2006. [10] Qimin Yang and Keren Bergman, Traffic Control and WDM Routing in the Data Vortex Packet Switch, IEEE Photonics Technology Letters, Vol. 14, No.2, pp. 236-238, Feb 2002. [11] Cory Hawkins and D.Scott Wills, Impact of Number of Angles on the Performance of the Data Vortex Optical Interconnection Networks, Journal of Lightwave Technology, Vol.24, pp.3288-3294, 2006. [12] Qimin Yang, Optimum routing and forwarding path arrangement in bufferless Data Vortex networks, IARIA Journal, Vol. 1, January 2009. SPIE-OSA-IEEE/ Vol. 7633 76331K-9