Study of Dynamically-Allocated Multi-Queue Buffers for NoC Routers


Yung-Chou Tsai, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan
Yarsun Hsu, Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan

Abstract—A large portion of the area and power in Network-on-Chip (NoC) routers is consumed by buffers, and hence these costly storage resources must be utilized well. However, much of the early related literature no longer suits modern NoC router architectures or the complicated traffic loads they carry. In this work, we refine the dynamically-allocated multi-queue (DAMQ) buffer organization and propose a new one that can accommodate more packets than the number of virtual channels, named DAMQ with multiple packets (DAMQ-MP). The DAMQ-MP scheme can solve certain data transmission issues under some circumstances, such as heavy network congestion or short packets, to improve performance. We also introduce two methods applicable to DAMQ-based buffers: adding priorities for switch allocation and reserving virtual channels for high-priority packets. Experimental results show that DAMQ-MP routers can achieve up to 24.52% higher saturated throughput than their SAMQ and DAMQ counterparts.

Keywords-buffer; virtual channel; router; Network-on-Chip;

I. INTRODUCTION

Buffers are usually added in NoC router nodes to provide temporary storage where arriving data can wait for the resources needed to traverse these nodes. In most cases, statically-allocated multi-queue (SAMQ) buffers in input-queued switches are adopted due to their simple flow control mechanism and small hardware overhead. However, each queue of a SAMQ buffer owns a fixed amount of buffer space and never shares it with other queues. This kind of static allocation of input buffers results in wasted storage resources for lack of flexibility in buffer management.
For the same input buffer, some queues may be overloaded and in need of more buffer space while others are nearly empty with plenty of buffer space unused. Since buffers consume a great portion of the power and area in a NoC router [1], these costly buffer resources must be utilized effectively when designing the router architecture. Tamir and Frazier [2] first proposed a novel buffer organization named DAMQ to provide better flexibility of buffer utilization. The DAMQ organization dynamically partitions buffer storage with linked lists to handle variable-length packets and achieve higher performance. Another approach using self-compacting buffers (SCB) to implement DAMQ switches was introduced in [3] to reduce hardware overhead and complexity. Liu et al. [4] improved DAMQ switches adopting SCB by letting two sets of virtual channels in different dimensions share buffer space. A special architecture called the high-performance input-queued switch (HIPIQS) also uses a DAMQ organization, pipelined access to multi-bank input buffers with chunks, and many small additional cross-point buffers, to deliver high performance [5]. However, current NoC architecture designs prevalently use wormhole switching to relax the constraints on buffer size, virtual channel flow control to avoid deadlocks and head-of-line (HoL) blocking, and mesh-based topologies to fit chip layouts. The approaches mentioned above no longer suit the requirements of modern NoCs and therefore have to be resurveyed. Rezazad et al. [6] showed that the optimal number of virtual channels and buffer length of mesh-based interconnection networks highly depend on the traffic pattern: under light traffic, the buffer structure should extend virtual channel depth for continual transfers to improve latency; under heavy traffic, it should provide many virtual channels for congestion avoidance to increase throughput.
Based on these concepts, two buffer structures that dynamically change their number of virtual channels were proposed [7][8]. However, both structures must put a lot of effort into managing the varying number of virtual channels, such as maintaining tracking tables, and that makes them hard to scale. In addition, the arbiters used in virtual channel allocation must be simplified to prevent the hardware overhead from increasing dramatically as the number of virtual channels goes up. In this paper, we propose the DAMQ-MP scheme, which allows more packets than the number of virtual channels to coexist inside a DAMQ-based input buffer. Because this scheme breaks the limitation of one packet per virtual channel in conventional virtual channel routers, it can overcome the issues resulting from waiting for switching packets, handling short packets, and HoL blocking. We believe that DAMQ-MP is a simple and feasible method to improve network performance and buffer utilization without sophisticated hardware design modifications. Besides, it is well suited to applications such as reserving virtual channels for high-priority packets.

II. MOTIVATION

A. Packet Switching Latencies

In conventional SAMQ and DAMQ buffers, one virtual channel typically holds only one packet at a time for easy control and to prevent HoL blocking. Therefore, if all virtual channels contain packets whose tail flits have entered but not yet left, the remaining free buffer resources are wasted until some packet is gone and another new packet arrives, as illustrated in Figure 1a. Even though the DAMQ mechanism lets the remaining buffer resources be used by packets that have not completed their delivery, as shown in Figure 1b, bringing in more flits of these long packets helps little with the gaps left after any other packet leaves. The duration between the departure of the tail flit of the current packet and that of the head flit of the next packet is fairly long. It includes the time for the following procedures: notifying the upstream router that this virtual channel has become free, allocating a new packet to this virtual channel, delivering the flits through the switch and physical link into this virtual channel, performing routing computation and output virtual channel allocation at the downstream router, and possibly incurring stalls due to contention. This kind of packet-switching latency cannot be neglected, because data transmission within the related virtual channel stops and contributes nothing to throughput during this period. Usually packet-switching latencies can be hidden by using multiple virtual channels: some virtual channels may still work well while others are reloading new packets. However, if the interconnection network is seriously congested and most virtual channels in the routers are halted by contention stalls, any loss of virtual channels caused by switching packets makes the situation worse, and the influence of these packet-switching latencies becomes more apparent.
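To make this dead time concrete, the stages listed above can be summed in a simple cycle-count model. This is only an illustrative sketch: the stage names and every cycle value are assumptions for exposition, not measurements from this paper.

```python
# Hedged model of the packet-switching latency between a tail flit leaving a
# virtual channel and the next packet's head flit becoming ready to compete
# for the switch. All cycle counts below are assumed, illustrative values.
STAGES = {
    "credit_return_to_upstream": 1,  # notify upstream that the VC is free
    "vc_allocation_upstream":    1,  # allocate the new packet to this VC
    "switch_and_link_traversal": 2,  # head flit crosses switch + physical link
    "route_computation":         1,  # RC at the downstream router
    "output_vc_allocation":      1,  # VA at the downstream router
}

def packet_switching_latency(contention_stalls=0):
    """Cycles during which the VC contributes nothing to throughput."""
    return sum(STAGES.values()) + contention_stalls

print(packet_switching_latency())   # 6 cycles in the uncontended case
print(packet_switching_latency(4))  # 10 cycles with four stall cycles
```

Under congestion the `contention_stalls` term dominates, which is exactly the regime where losing a virtual channel to packet switching hurts most.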
The way to solve this issue is to shorten the packet-switching latencies; one straightforward method is to bring the next incoming packet into the input buffer in advance, standing by without waiting for a virtual channel to be released. This helps halted virtual channels return to work as soon as possible, so that more packets are available to traverse the switch fabric.

B. Traffic Loads with Short Packets

In some scenarios, interconnection networks carry traffic loads that mix long data packets with short control packets (e.g., commands, acknowledgments, etc.). Each short packet still occupies one virtual channel but brings only a few flits, leaving the physical channel and unused buffers idle most of the time. Short packets also make virtual channels switch more frequently than long ones, and have more chances of leaving virtual channels halted while waiting for the arrival of new packets. When encountering blocking, each blocked short packet stays in its virtual channel and is thus distributed into one of many different routers. This situation can be improved by allowing several short packets that have stopped forwarding to be compacted into a few routers, as long as enough buffer space is available in those routers. Having more packets contained in the input buffers not only leads to a higher buffer utilization rate but also releases more virtual channel entities for transmitting more packets in the whole interconnection network.

Figure 1. Comparison of buffer organizations: (a) SAMQ, (b) DAMQ, (c) SAMQ-MP, (d) DAMQ-MP. Each flit is labeled with its type (H, B, and T standing for head flit, body flit, and tail flit respectively) and the packet it belongs to (i.e., the smaller letter within the parentheses).
C. Dynamic Input Buffer Management

Based on the reasons mentioned above, making an input buffer accommodate more packets than its number of virtual channels can bring some advantages. However, in order to keep the associated packet information as the number of packets increases, extra hardware such as registers for state fields, pointers, and control logic must be added. Applying this method to the SAMQ organization, yielding the SAMQ with multiple packets (SAMQ-MP) organization shown in Figure 1c, is obviously unsuitable and impractical. First of all, letting the same virtual channel be shared by multiple packets results in HoL blocking. Although moving a blocked packet from its current virtual channel to another free virtual channel can overcome the HoL blocking, the price in extra hardware cost and complexity is too high. In addition, just as buffer space in SAMQ buffers is partitioned statically, these extra registers for state fields and pointers are wasted along with the reserved buffer resources whenever their associated packets in the dedicated virtual channels are absent. On the contrary, the DAMQ organization is more suitable for this scheme due to its linked-list buffer structure, as shown in Figure 1d. In the original DAMQ buffer, each set of state fields and pointers is dedicated to one virtual channel. However, if the number of packets in a buffer is no longer limited by the number of virtual channels but by a quota of packets, an input buffer must reserve one set of state fields and pointers for each packet, not for each virtual channel.
Because every packet is stored as a linked list and buffer space as well as virtual channels are shared among packets, even if the number of packets in use is far less than the reserved quota of packets in a buffer, no buffer resources are wasted on unused packets except the associated sets of packet state fields and pointers.
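As a concrete illustration of this organization, the following is a minimal sketch of a shared flit memory managed as per-packet linked lists plus a free list, so that storage is reserved per packet rather than per virtual channel. The class and method names (`DAMQBuffer`, `push_flit`, `pop_flit`) are our own invention, not the paper's RTL.

```python
# Hedged sketch of a DAMQ-style input buffer: one shared flit memory,
# a free list of unused slots, and per-packet head/tail pointers.
class DAMQBuffer:
    def __init__(self, capacity):
        self.flit = [None] * capacity                   # shared flit storage
        self.next = list(range(1, capacity)) + [None]   # linked-list next pointers
        self.free_head = 0                              # head of the free list
        self.head = {}                                  # packet id -> oldest slot
        self.tail = {}                                  # packet id -> newest slot

    def push_flit(self, pkt, data):
        """Append a flit to a packet's linked list; False if the buffer is full."""
        if self.free_head is None:
            return False
        slot = self.free_head
        self.free_head = self.next[slot]
        self.flit[slot] = data
        self.next[slot] = None
        if pkt in self.tail:
            self.next[self.tail[pkt]] = slot            # link behind the packet's tail
        else:
            self.head[pkt] = slot                       # first flit of a new packet
        self.tail[pkt] = slot
        return True

    def pop_flit(self, pkt):
        """Remove and return a packet's oldest flit; the slot rejoins the free list."""
        slot = self.head.get(pkt)
        if slot is None:
            return None
        data = self.flit[slot]
        nxt = self.next[slot]
        self.next[slot] = self.free_head                # recycle the slot
        self.free_head = slot
        if nxt is None:
            del self.head[pkt], self.tail[pkt]          # packet fully drained
        else:
            self.head[pkt] = nxt
        return data
```

Because slots are shared, a busy packet can grow its queue while idle ones hold nothing, and in DAMQ-MP the number of packet IDs, not the number of virtual channels, bounds how many such lists coexist.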

III. THE DAMQ-MP SCHEME

A. Overview of DAMQ Buffers

A DAMQ buffer is structured as linked lists that dynamically adjust the depths of the virtual channels to utilize the input buffer more efficiently. In this way, the precious memory resources are shared by all virtual channels in the same input unit, and busy virtual channels can get more buffer space than idle ones. An evident difference between the SAMQ and DAMQ organizations is the mechanism of credit management. The DAMQ router must collect all credit information and put it together in the virtual channel allocator for centralized credit management with several extra counters. In practice, some restrictions also need to be applied to the DAMQ buffer allocation policies to avoid deadlock and load imbalance.

B. DAMQ-MP Router Architecture

In DAMQ-MP buffers, whenever the tail flit of a packet enters the input buffer via one of the virtual channel entries connected to the physical link, the packet will never use this virtual channel entry again and can therefore tear down the corresponding allocation relationship by asserting the notification signal to its upstream router. A new packet from the upstream router can then be sent out and received via this free virtual channel entry. The newly arriving packet begins its routing computation, and all following flits belonging to this packet are linked together. After the result of the routing computation comes out and is stored in its corresponding state field, the packet occupies one of the unused virtual channel exits connected to the switch fabric to request an output virtual channel and switch traversal bandwidth for departure. Here we use the terms entry and exit to distinguish the virtual channels for receiving packets from the virtual channels for sending packets. The virtual channel entries and exits are responsible only for the usage rights of the physical link and the switch fabric connected to the input buffer, respectively.
In the original DAMQ organization, one packet enters, occupies, and leaves its virtual channel, and the entry and exit it uses both belong to that same virtual channel; in the DAMQ-MP organization, however, an entry or an exit is just an access gateway to write or read flits, and entries and exits need not coexist in pairs. It is possible that a packet (like Packet D in Figure 2a), which has brought all its flits inside the DAMQ-MP buffer through an entry, merely stays there and waits for an exit to depart. As illustrated in Figure 2b, DAMQ-MP buffers, just like DAMQ buffers, need several pairs of head and tail pointers, one extra free-list pointer, and counters for all linked lists. Besides the original state fields used by the virtual channel exits to store their virtual channel status and assigned output virtual channel number, additional state fields for all packets are needed to keep the routing computation results and some other associated attributes, for instance the allocation priority of the packet. Essentially, every packet in a DAMQ-MP buffer exists individually and must possess a unique packet ID number. These packet ID numbers are uniformly assigned and managed by the centralized virtual channel allocator, and they are always carried with the head flits of their packets to the next routers for recognition while manipulating the input buffers, just like the virtual channel ID number carried with every flit.

Figure 2. DAMQ-MP buffers and linked-list memory space.

In order to keep track of which packets are currently using which access gateways to move in and out of the input buffer, each virtual channel entry and exit has to record the packet ID number of the packet that resides in it.
These packet ID numbers are stored in two arrays of registers named vc_entry and vc_exit respectively. Meanwhile, another array of registers named vc_line records the arrival order of the packets waiting for free virtual channel exits, and these registers help the input unit decide the allocation order of the virtual channel exits. There are two copies of these three register arrays for managing one input buffer: one is built inside the input buffer itself for buffer storage handling, and the other is located in the virtual channel allocator of its upstream router for buffer storage allocation. The information in these three register arrays (i.e., packet ID numbers) is continuously updated to keep the copies consistent according to the transitions of the vc_free and pkt_left signals as well as the head flits of packets. The vc_free signals are identical to the same signals in the DAMQ router, notifying the upstream router that there is a free virtual channel ready to be used again, but now they refer only to the occupation of the virtual channel entries in a DAMQ-MP input buffer. The pkt_left signals are similar to the vc_free signals: they indicate the departures of packets and respond only to the availability of the virtual channel exits by notifying the upstream router. The information about mapping a packet to a virtual channel entry is carried with the head flit of the packet and passed to the input unit of its downstream router when the head flit is transmitted. Comparing the proposed DAMQ-MP router with the original DAMQ router, they have almost identical hardware components and behavior in appearance, but there are two major differences inside the input units and the virtual channel allocator. The first difference is the novel packet number system. Every packet existing in an input buffer must have a unique ID number to distinguish itself from others.
Any operations on packets need these packet ID numbers. All packet ID numbers belonging to the same input buffer are centrally managed and distributed by the virtual channel allocator of

its upstream router. When a packet in a virtual channel exit requests an output virtual channel in the input buffer of its downstream router, it also needs to be assigned an available packet ID number for that input buffer. This assigned packet ID number is used until the associated packet has completely left the input buffer, and then the input unit must notify the virtual channel allocator of its upstream router via the pkt_left signal that this packet ID number is no longer used and available again. The second difference is that the DAMQ-MP router introduces the concept of decoupling the virtual channel entries and exits. Although a DAMQ or DAMQ-MP router has the same number of virtual channel entries and exits, a packet in the DAMQ router passes through the input buffer via an entry and an exit belonging to the identical virtual channel, whereas any pair of a virtual channel entry and a virtual channel exit in the DAMQ-MP input buffer is unrelated and handles packets independently. Managing all of these virtual channel entries and exits in DAMQ-MP buffers relies on the above-mentioned three register arrays (i.e., vc_entry, vc_exit, and vc_line) as well as the two notification signals (i.e., vc_free and pkt_left).

C. Characteristics

Because of the intrinsic characteristics of the linked-list data structure, a DAMQ router can be enhanced to hold more packets more easily than a SAMQ router. As mentioned previously, one major cause of the performance improvement is that the DAMQ-MP mechanism reduces the idle time of reloading a new packet into a halted virtual channel exit whose contained packet has just left. Another cause is that DAMQ-MP buffers let all virtual channel entries keep receiving packets all the time if there are enough packets coming from the upstream router, especially for short packets.
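The decoupled entry/exit bookkeeping described above can be sketched as a small input-unit model that pairs a packet-ID pool with the vc_entry, vc_exit, and vc_line structures. The method names and the plain FIFO policy for vc_line are our assumptions (the cut-in-line variant for high-priority packets is discussed later), and routing computation is omitted for brevity.

```python
# Hedged sketch of DAMQ-MP input-unit bookkeeping: packet IDs are drawn from
# a pool managed by the upstream virtual channel allocator; entries free as
# soon as a tail flit arrives (vc_free), exits free when a packet departs
# (pkt_left), and vc_line holds packets waiting for an exit in arrival order.
class InputUnit:
    def __init__(self, num_vcs, max_packets):
        self.free_ids = list(range(max_packets))  # pool of unused packet IDs
        self.vc_entry = [None] * num_vcs          # packet being received per entry
        self.vc_exit = [None] * num_vcs           # packet departing per exit
        self.vc_line = []                         # packets waiting for a free exit

    def packet_arrives(self, entry):
        """Head flit arrives: bind a unique packet ID to this entry."""
        pkt = self.free_ids.pop(0)
        self.vc_entry[entry] = pkt
        return pkt

    def tail_received(self, entry):
        """Tail flit in: the entry frees immediately (vc_free to upstream)."""
        pkt = self.vc_entry[entry]
        self.vc_entry[entry] = None
        self.vc_line.append(pkt)                  # queue for a virtual channel exit
        return pkt

    def exit_freed(self, exit_):
        """A packet left via this exit: recycle its ID (pkt_left) and refill."""
        done = self.vc_exit[exit_]
        if done is not None:
            self.free_ids.append(done)
        self.vc_exit[exit_] = self.vc_line.pop(0) if self.vc_line else None
        return done
```

Note how a packet can sit in vc_line with no entry and no exit bound to it, which is precisely the decoupling that lets more packets than virtual channels coexist.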
Therefore, the DAMQ-MP organization is beneficial in the cases of short packets, large buffer capacity (relative to packet lengths), heavy traffic congestion, and a small number of virtual channels. Because the DAMQ-MP scheme handles buffer resource allocation in units of packets instead of virtual channels, the added hardware costs are just several registers for pointers, counters, state fields, and virtual channel mapping arrays, as well as the extra signals and control logic for packet system management and virtual channel decoupling. Owing to the additional mapping transformations from virtual channels to packets and vice versa, the data access delay may be prolonged and degrade performance. Certainly, we can add more sophisticated hardware such as register pre-fetch units to avoid or compensate for the performance loss; these are trade-offs in designing DAMQ-MP routers. Another possible problem is that too many packets residing in a crowded but small DAMQ-MP buffer compress the space each packet can possibly get. If many flits belonging to one packet already exist in the input buffer, bringing in one more of its remaining flits still cannot move it forward until all of its preceding flits have left. Therefore, one simple method is to add priorities to the switch allocation process so as to always first bring in a flit that is most likely to depart the input buffer soon. The priority levels can be estimated by counting the minimum number of flits that are in front of the candidate flit and may leave the input buffer before it. The switch allocator then sets the priorities of all requesting packets based on the counting results. Imposing these priorities on the switch allocation can effectively solve the load imbalance problem occurring inside a DAMQ-based input buffer and keep data transmission uninterrupted.
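The priority estimate described above can be sketched as follows. As a simplification we count only the flits already buffered ahead of the candidate in its own linked list, and the three-level thresholds (0 to 3, 4 to 5, greater than 5) are taken from the experimental section; the function names are ours.

```python
# Hedged sketch of switch-allocation priorities: a requesting packet whose
# candidate flit has few flits ahead of it is most likely to depart soon,
# so it gets the best (lowest) priority level.
def flits_ahead(candidate_pkt, exit_queues):
    """Simplified count: flits already buffered ahead of the candidate flit
    in its own packet's linked list."""
    return len(exit_queues[candidate_pkt])

def priority_level(count):
    """Three levels, using the 0-3 / 4-5 / >5 thresholds from Section IV-D."""
    if count <= 3:
        return 0   # highest priority: almost out of flits, needs supplies
    if count <= 5:
        return 1
    return 2       # plenty of flits buffered already; can wait

def pick_winner(requests, exit_queues):
    """Switch allocator: grant the request with the best priority level."""
    return min(requests, key=lambda p: priority_level(flits_ahead(p, exit_queues)))
```

A real allocator would break ties among equal levels with its usual arbitration (e.g., round-robin); `min` here simply takes the first best request.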
D. Case Discussion: Cut-in-Line for High-Priority Packets

In most cases, packets encapsulating control information are shorter than packets encapsulating raw data. In addition, these control packets are usually much more important than raw-data packets and demand faster transmission by possessing a higher priority level. When all packets share the same network fabric, the simplest way to prevent high-priority packets from being blocked by other packets is to reserve some network resources, for instance at least one virtual channel, for high-priority packets so that fast transfer routes are available at all times. The proposed DAMQ-MP scheme is especially suitable for such a reservation method due to its linked-list data structure and the short lengths of high-priority packets. Because the occurrence probability and the amount of high-priority packets are both very small, the high-priority dedicated virtual channels can keep the minimum number of reserved flit buffers while idle most of the time and can dynamically be given more flit buffers while in use. Furthermore, if a DAMQ-MP input buffer has more than one high-priority packet inside, switching between high-priority packets in the virtual channel exits can complete immediately without leaving any exit idle. Because many packets appear at the same time in the DAMQ-MP input buffer, a conceptual cut-in-line behavior can happen to shorten the waiting time of the high-priority packets for the virtual channel exits. As shown in Figure 3, when a virtual channel exit is freed up (by Packet B), the control logic has to choose the foremost of the high-priority packets (Packet A) to take that virtual channel exit, whether or not any low-priority packets (such as Packet D) are in front of the chosen high-priority packet. This kind of cut-in-line behavior can easily be made by inserting a newly arriving high-priority packet into the place after all present high-priority packets and before all other packets in the waiting line vc_line, rather than directly adding it to the place right behind the last packet in the waiting line.

Figure 3. Examples of the cut-in-line behavior. Each flit is labeled with its priority (H and L standing for high-priority and low-priority respectively) and the packet it belongs to (i.e., the smaller letter within the parentheses).

IV. EXPERIMENTAL RESULTS

We build a cycle-accurate flit-level simulator in SystemC to carry out all experiments. The simulated interconnection network is an 8 x 8 mesh topology adopting a 4-stage router pipeline, X-Y routing, and wormhole switching flow control. We set a warm-up phase of 1, cycles to let the network reach its steady state, and then sample data during a measurement phase of 1, cycles. Unless otherwise specified, all buffer schemes are tested under the synthetic uniformly distributed random traffic pattern, and the maximum number of packets in a DAMQ-MP buffer is unlimited. The latency of a packet is the time interval measured from the time the head flit of the packet is generated by the traffic generator of a source node to the time the last flit of the packet leaves the network. The throughput is the average amount of traffic accepted by a destination node per cycle.

Figure 4. Performance of routers with 12-flit buffers carrying traffic loads of 8-flit packets under different traffic patterns: (a) Uniform Random, (b) Bit Complement, (c) Transpose.
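The cut-in-line rule of Section III-D amounts to a simple insertion policy over the vc_line waiting list: a new high-priority packet goes after any high-priority packets already waiting, but ahead of every low-priority one. The function name and the priority map are illustrative assumptions.

```python
# Hedged sketch of the cut-in-line insertion into vc_line (Section III-D).
# `prio` maps a packet ID to 'H' (high priority) or 'L' (low priority).
def insert_into_vc_line(vc_line, pkt, prio):
    if prio[pkt] == "H":
        pos = 0
        while pos < len(vc_line) and prio[vc_line[pos]] == "H":
            pos += 1                # keep FIFO order among high-priority packets
        vc_line.insert(pos, pkt)    # cut in ahead of all low-priority packets
    else:
        vc_line.append(pkt)         # low-priority packets wait at the back
    return vc_line
```

With Packet D (low priority) already waiting, a newly arriving high-priority Packet A lands at the front, a second high-priority packet queues behind A but still ahead of D, and low-priority arrivals keep joining the back.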
A. Traffic Pattern

Figure 4 shows the performance of routers with a buffer capacity of 12 flits and 3 virtual channels conveying 8-flit packets under different traffic patterns. No matter what the traffic pattern is, the DAMQ-MP organization always has the best performance among the three buffer structures. Especially when the network starts to saturate, the dynamic buffer allocation and the reduction of packet-switching latencies in the DAMQ-MP organization let the whole network accommodate more flits and reach a higher throughput. Owing to the balanced loads of the uniform random traffic distribution, the saturated throughput of the DAMQ-MP routers is steadily about 8.53% higher than that of the other two. Even under the bit complement and transpose traffic patterns, the peak throughput improvements are still 9.56% and 1.56% respectively.

B. Packet Length and Buffer Structure

Compared with the result under random traffic in Figure 4a, we perform a series of experiments changing the ratio of packet length to buffer capacity, which determines the expected number of packets that can reside in a buffer. As the results in Figure 5a show, when the packet length is small relative to the buffer capacity, the DAMQ-MP routers can hold as many short packets as possible, but the SAMQ and DAMQ ones cannot due to the limitation of one packet per virtual channel. The saturated throughput of the DAMQ-MP routers is 24.52% higher than that of the other two for 4-flit packets. If the buffer size is then increased to 18 flits with the same number of virtual channels, the performance gap between the DAMQ-MP routers and the other two expands as the input buffer capacity grows. As shown in Figure 5b, the improvement of the DAMQ-MP routers relative to the DAMQ ones in saturated throughput rises to 11.24%. This proves that providing more buffer space for DAMQ-MP routers to accommodate more packets at the same time lets them achieve better performance.
If the buffers keep the same average of 4 flits per virtual channel, the performance improvement of DAMQ-MP over the others shrinks as more virtual channels are added. As shown in Figure 5c, the improvement in saturated throughput drops to 4.79%. The multiple-packet mechanism in the DAMQ-MP routers helps the input unit quickly reload new packets into any idle

Figure 5. Performance of routers with short packets, large buffer size, and more virtual channels: (a) 4-flit packets, (b) 18-flit buffer size, (c) 4 virtual channels.

virtual channel exits and keep as many virtual channels running as possible. However, if there are already many virtual channels within a buffer, these efforts of the DAMQ-MP scheme become less apparent relative to the DAMQ organization.

C. Maximum Number of Packets

Intuitively, the setting of the maximum-number-of-packets parameter must depend on the buffer capacity and the packet length. Hence, we choose a large 32-flit buffer with 4 virtual channels for this experiment. In Figure 6a, the results show that there is almost no performance improvement when the maximum number of packets is greater than 6. This is because the probability that more than one virtual channel entry freed up by these long packets becomes available at the same moment is pretty low. Therefore, choosing a maximum number of packets one or two larger than the number of virtual channels as standby is basically enough in most common cases, which means that building the DAMQ-MP router needs only a few additional registers and control logics.
D. Priorities for Switch Allocation

We pick a small 12-flit buffer with 3 virtual channels to test the proposed method of adding priorities for switch allocation. The main purpose of this method is to steer the limited free buffer resources to packets that are almost running out of flits and in need of supplies. The priorities are divided into three levels according to counting values of 0 to 3, 4 to 5, and greater than 5. In Figure 6b, this scheme seems to have no effect on the DAMQ routers, but it improves the performance of the DAMQ-MP routers when the traffic loads are highly congested. Although the saturated throughput of the DAMQ-MP routers increases only 1.65% after applying the priority scheme, this simple method can indeed prevent load imbalance and be easily integrated into any router designed to handle packet priorities.

E. Virtual Channel Reservation for High-Priority Packets

As discussed in the previous case study, DAMQ-based buffers should fit the method of reserving certain high-priority dedicated virtual channels better than SAMQ-based ones. Furthermore, high-priority packets inside DAMQ-MP buffers have bigger chances of cutting in line to pass through as fast as possible. Hence we design this experiment to compare the performance of DAMQ and DAMQ-MP routers with or without virtual channel reservations for high-priority packets. The buffers are set to 16 flits with 4 virtual channels, and the traffic loads are made up of 1% 1-flit high-priority and 99% 8-flit low-priority packets. Under the high-priority reservation scheme, at least one of the virtual channels is reserved for high-priority packets; otherwise, all virtual channels are identical and can be used by any kind of packet. In Figure 6c, the average latencies of high-priority and low-priority packets saturate simultaneously because both kinds of packets injected by the same router node come from the identical source queue.
There are no differences among the performance results of these four sets while the packet injection rates are low. This observation proves that the dynamic buffer allocation mechanism of the DAMQ and DAMQ-MP organizations indeed reduces the performance impact of taking some virtual channels for high-priority reservation. Besides, the reservation scheme helps the interconnection network keep the average latency of high-priority packets almost constant until the injection rate reaches a higher value. The curves of average latency for low-priority packets in both organizations adopting the reservation scheme rise earlier than their counterparts without the scheme. It is

because the maximum number of virtual channels that low-priority packets can use decreases under the reservation scheme; this also affects the saturated accepted throughputs. DAMQ-MP routers can sustain low-latency transfer service for high-priority packets at a higher traffic injection rate than DAMQ routers once the reservation scheme is applied. This is because a DAMQ-MP buffer not only accommodates more high-priority packets but also lets cut-in-line behavior speed up high-priority packets passing through the buffer internally.

Figure 6. Performance of DAMQ-based routers with different schemes; HP and LP stand for high-priority and low-priority, respectively. (a) Maximum number of packets. (b) Priorities for switch allocation. (c) VC reservation for HP packets.

V. CONCLUSION

In this paper, we introduced a novel DAMQ-MP organization for input buffers, which can accommodate more packets than the number of virtual channels.
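To make the organization concrete, here is a minimal behavioral model (our own illustrative assumption, not the authors' implementation; the class and parameter names such as `DamqMpBuffer` and `max_packets_per_vc` are invented for this sketch) of a DAMQ-style shared flit pool in which one virtual-channel queue may hold several packets:

```python
# Sketch: a DAMQ-style input buffer keeps one shared pool of flit slots;
# each VC is a queue of packets drawn from that pool.  The DAMQ-MP twist
# modeled here is that a VC queue may hold several packets, not just one.
from collections import deque

class DamqMpBuffer:
    def __init__(self, num_slots, num_vcs, max_packets_per_vc=2):
        self.free = deque(range(num_slots))           # shared free-slot list
        self.vcs = [deque() for _ in range(num_vcs)]  # each entry: slot list of one packet
        self.max_pkts = max_packets_per_vc

    def can_accept(self, vc, pkt_len):
        return len(self.vcs[vc]) < self.max_pkts and len(self.free) >= pkt_len

    def enqueue(self, vc, pkt_len):
        """Claim pkt_len shared slots for a new packet on this VC."""
        if not self.can_accept(vc, pkt_len):
            return False
        self.vcs[vc].append([self.free.popleft() for _ in range(pkt_len)])
        return True

    def dequeue(self, vc):
        """Release the head packet's slots back to the shared pool."""
        self.free.extend(self.vcs[vc].popleft())
```

With 12 shared slots and a per-VC cap of two packets, one virtual channel can buffer two 4-flit packets back to back, which a plain DAMQ organization with its one-packet-per-VC restriction cannot do.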
Because DAMQ-MP buffers are not constrained by the one-packet-per-virtual-channel limitation of the older SAMQ and DAMQ buffers, they can make good use of scarce buffer storage and physical link resources. In addition, the DAMQ-MP scheme can bring standby packets forward in advance, reducing the waiting latencies of packets switching at virtual channel exits. The original DAMQ organization solves the problem of shallow virtual channel depth in small SAMQ input buffers, and the proposed DAMQ-MP scheme further solves the problem of too few working virtual channels in DAMQ routers under heavy traffic loads. We also discussed two methods applicable to DAMQ-based buffers: adding priorities for switch allocation and reserving virtual channels exclusively for high-priority packets. We performed experiments on the three buffer organizations and showed that the DAMQ-MP scheme offers advantages in many scenarios. These experimental results showed that DAMQ-MP routers can achieve up to 24.52% higher saturated throughput than their SAMQ and DAMQ counterparts.

ACKNOWLEDGMENT

The authors thank the NSC for its support under two grants and the MOEA for its support under grant 12-C S1-22.

REFERENCES

[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Micro, vol. 27, no. 5, 2007.
[2] Y. Tamir and G. L. Frazier, "Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches," IEEE Transactions on Computers, vol. 41, no. 6, 1992.
[3] J. Park, B. W. O'Krafka, S. Vassiliadis, and J. Delgado-Frias, "Design and Evaluation of a DAMQ Multiprocessor Network with Self-Compacting Buffers," in Proceedings of the ACM/IEEE Conference on Supercomputing, 1994.
[4] J. Liu and J. G. Delgado-Frias, "A Shared Self-Compacting Buffer for Network-on-Chip Systems," in Proceedings of the IEEE International Midwest Symposium on Circuits and Systems, pp. 26-30, 2006.
[5] R. Sivaram, C. B. Stunkel, and D. K. Panda, "HIPIQS: A High-Performance Switch Architecture Using Input Queuing," IEEE Transactions on Parallel and Distributed Systems, vol. 13, no. 3, 2002.
[6] M. Rezazad and H. Sarbazi-Azad, "The Effect of Virtual Channel Organization on the Performance of Interconnection Networks," in Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, 2005.
[7] C. A. Nicopoulos, D. Park, J. Kim, N. Vijaykrishnan, M. S. Yousif, and C. R. Das, "ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers," in Proceedings of the International Symposium on Microarchitecture, 2006.
[8] M. Lai, Z. Wang, L. Gao, H. Lu, and K. Dai, "A Dynamically-Allocated Virtual Channel Architecture with Congestion Awareness for On-Chip Routers," in Proceedings of the ACM/IEEE Design Automation Conference, 2008.


More information

Chapter 4 NETWORK HARDWARE

Chapter 4 NETWORK HARDWARE Chapter 4 NETWORK HARDWARE 1 Network Devices As Organizations grow, so do their networks Growth in number of users Geographical Growth Network Devices : Are products used to expand or connect networks.

More information

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks Department of Computer Science and Engineering, Texas A&M University Technical eport #2010-3-1 seudo-circuit: Accelerating Communication for On-Chip Interconnection Networks Minseon Ahn, Eun Jung Kim Department

More information

Analyzing the Receiver Window Modification Scheme of TCP Queues

Analyzing the Receiver Window Modification Scheme of TCP Queues Analyzing the Receiver Window Modification Scheme of TCP Queues Visvasuresh Victor Govindaswamy University of Texas at Arlington Texas, USA victor@uta.edu Gergely Záruba University of Texas at Arlington

More information

Boosting the Performance of Myrinet Networks

Boosting the Performance of Myrinet Networks IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. Y, MONTH 22 1 Boosting the Performance of Myrinet Networks J. Flich, P. López, M. P. Malumbres, and J. Duato Abstract Networks of workstations

More information

An Efficient Scheduling Scheme for High Speed IEEE WLANs

An Efficient Scheduling Scheme for High Speed IEEE WLANs An Efficient Scheduling Scheme for High Speed IEEE 802.11 WLANs Juki Wirawan Tantra, Chuan Heng Foh, and Bu Sung Lee Centre of Muldia and Network Technology School of Computer Engineering Nanyang Technological

More information

Wide area networks: packet switching and congestion

Wide area networks: packet switching and congestion Wide area networks: packet switching and congestion Packet switching ATM and Frame Relay Congestion Circuit and Packet Switching Circuit switching designed for voice Resources dedicated to a particular

More information

WITH THE CONTINUED advance of Moore s law, ever

WITH THE CONTINUED advance of Moore s law, ever IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 11, NOVEMBER 2011 1663 Asynchronous Bypass Channels for Multi-Synchronous NoCs: A Router Microarchitecture, Topology,

More information

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK

A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER. A Thesis SUNGHO PARK A VERIOG-HDL IMPLEMENTATION OF VIRTUAL CHANNELS IN A NETWORK-ON-CHIP ROUTER A Thesis by SUNGHO PARK Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC

CAD System Lab Graduate Institute of Electronics Engineering National Taiwan University Taipei, Taiwan, ROC QoS Aware BiNoC Architecture Shih-Hsin Lo, Ying-Cherng Lan, Hsin-Hsien Hsien Yeh, Wen-Chung Tsai, Yu-Hen Hu, and Sao-Jie Chen Ying-Cherng Lan CAD System Lab Graduate Institute of Electronics Engineering

More information

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques

Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Design of Reconfigurable Router for NOC Applications Using Buffer Resizing Techniques Nandini Sultanpure M.Tech (VLSI Design and Embedded System), Dept of Electronics and Communication Engineering, Lingaraj

More information

CSE 123A Computer Networks

CSE 123A Computer Networks CSE 123A Computer Networks Winter 2005 Lecture 14 Congestion Control Some images courtesy David Wetherall Animations by Nick McKeown and Guido Appenzeller The bad news and the good news The bad news: new

More information

Congestion in Data Networks. Congestion in Data Networks

Congestion in Data Networks. Congestion in Data Networks Congestion in Data Networks CS420/520 Axel Krings 1 Congestion in Data Networks What is Congestion? Congestion occurs when the number of packets being transmitted through the network approaches the packet

More information

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control

What Is Congestion? Effects of Congestion. Interaction of Queues. Chapter 12 Congestion in Data Networks. Effect of Congestion Control Chapter 12 Congestion in Data Networks Effect of Congestion Control Ideal Performance Practical Performance Congestion Control Mechanisms Backpressure Choke Packet Implicit Congestion Signaling Explicit

More information

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors Govindan Ravindran Newbridge Networks Corporation Kanata, ON K2K 2E6, Canada gravindr@newbridge.com Michael

More information

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,

More information

Addresses in the source program are generally symbolic. A compiler will typically bind these symbolic addresses to re-locatable addresses.

Addresses in the source program are generally symbolic. A compiler will typically bind these symbolic addresses to re-locatable addresses. 1 Memory Management Address Binding The normal procedures is to select one of the processes in the input queue and to load that process into memory. As the process executed, it accesses instructions and

More information

Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek Raj.K 1 Prasad Kumar 2 Shashi Raj.K 3

Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek Raj.K 1 Prasad Kumar 2 Shashi Raj.K 3 IJSRD - International Journal for Scientific Research & Development Vol. 2, Issue 02, 2014 ISSN (online): 2321-0613 Design and Implementation of a Packet Switched Dynamic Buffer Resize Router on FPGA Vivek

More information

Mark Sandstrom ThroughPuter, Inc.

Mark Sandstrom ThroughPuter, Inc. Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom

More information