ScienceDirect. Packet-based Adaptive Virtual Channel Configuration for NoC Systems

Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 34 (2014 ) 552 558 2014 International Workshop on the Design and Performance of Network on Chip (DPNoC 2014) Packet-based Adaptive Virtual Channel Configuration for NoC Systems Masoud Oveis Gharan and Gul N. Khan* Electrical and Computer Engineering, Ryerson University, Toronto, ON M5B 2K3 Canada Abstract Growing number of on-chip cores requires the introduction of an efficient communication structure such as NoC. In NoC design, the channel buffer organization facilitates the use of Virtual Channels (VC) for on-chip communication. A VC structure can be categorized as static or dynamic. In a dynamic VC structure, variable number of buffer-slots can be employed by each VC according to different traffic conditions in the NoC. We introduce a Packet-Based Virtual Channel (PBVC) scheme, where a VC is reserved when a packet enters the router and released when the packet leaves. A VC will hold the flits of only one packet at a time that subsequently removes the Head-of-Line blocking. PBVC technique is an amended version of dynamically allocated multi-queue schemes where, an input or output port employs a centralized buffer whose slots are dynamically allocated to VCs. The experimental results show that our approach improves network latency and throughput as compared to other VC designs. 2014 The Elsevier Authors. B.V. Published This an by open Elsevier access B.V. article under the CC BY-NC-ND license Peer-review (http://creativecommons.org/licenses/by-nc-nd/3.0/). under responsibility of the Program Chairs of FNC-2014. Selection and peer-review under responsibility of Conference Program Chairs Keywords: adaptive virtual channels; head-of-line blocking; NoC; on-chip communication; VC organization. 1. Introduction In NoC (Network-on-Chip) domain, wormhole routing is mainly employed for communication among various cores of NoC 1. The flits of a packet are stored in the channel buffers. The header flit of a packet starts from the source core router and passes through a series of routers reserving the route path. This type of routing can cause traffic congestion. The blocking of one packet leads to the blocking of other packets queued for the channel. This blocking is known as Head-of-Line (HoL) blocking that increases the latency and results in lower throughput and buffer utilization. HoL problem can be alleviated by employing Virtual Channels 2. The VC mechanism enables the multiplexing and buffering of several packets in a single router channel 3. However, it does not remove HoL blocking * Corresponding author. Tel.: +1-416-979-5000 x6084; fax: +1-416-979-5280. E-mail address: gnkhan@ryerson.ca 1877-0509 2014 Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Selection and peer-review under responsibility of Conference Program Chairs doi:10.1016/j.procs.2014.07.069

Masoud Oveis Gharan and Gul N. Khan / Procedia Computer Science 34 ( 2014 ) 552 558 553 completely 4,5,6. Queue structure is an important component of NoC router that stores the flits of message packets. It is the main storage of a router structure that temporarily stores packet flits by following the First-Come-First-Serve (FCFS) order until network resources become available. Sometimes queue and FIFO terms are used interchangeably. Many queue architectures have been proposed such as FIFO queue, circular queue, Dynamically Allocated Multi-Queue (DAMQ) and their variants 7,8. DAMQ is a single storage array that maintains multiple FIFO queues 4. It can provide a solution to the HoL blocking problem 10. The packets are stored in the queues of a multiqueue of the output port for routing. Therefore, in case the current packet faces a blockage, the other packets destined to that output port can become blocked. Choi and Pinkston claimed that this type of blocking is not HoL 10. However, we can argue that in fact this is HoL blocking. Even, HoL blocking can also happen inside the queue of a DAMQ buffer 2,11. We present a new methodology that entirely removes the effect of HoL blocking. 2. DAMQ based VC organization A DAMQ buffer organization scheme for all the virtual channels (VCs) has been presented in the past 6. In Fig. 1, two buffer slots are reserved for each virtual channel before the buffer accepts an incoming flit. The rest of the buffer slots are free to be employed by any of the VCs according to the traffic demand. Various research groups have presented the design and implementation of DAMQ buffers 2,4,5. These designs are either expensive in terms of hardware or inefficient due to data dependencies. A centralized shared buffer architecture called Virtual Channel Regulator (ViChaR) is introduced by Nicopoulos and others 2. ViChaR requires a large size of arbiter that can create latency bottlenecks in the critical path and may limit the overall NoC frequency 11. It dynamically allocates VCs on a First-Come-First-Served basis, and there is no priority for the new packets. In the case of blocking, a packet can occupy all the slots of a channel buffer and prevents any new packet to pass through the router. If the blocking of that packet continues, all the upstream routers will be occupied by the packet, and a new packet may not pass through the routers on that route. Another drawback of ViChaR is its huge hardware for some configurations. Two more approaches are introduced by Evripidou et.al involving DAMQ mechanism 11. These are named as Mask-based and Link-List-based techniques. The Mask-based approach is cheaper in terms of hardware but it is of synchronous nature and its performance is low. The Link-List-based approach that mimics DAMQ organization is expensive however, it leads to higher performance. All the existing DAMQ based VC buffer structures suffer from HoL blocking except ViChaR that impose high latency and expensive hardware overhead. Therefore, a general solution to remove HoL blocking from DAMQ buffer organization is needed. All of the above points have motivated us to investigate and design the PBVC approach for avoiding HoL blocking. 2.1. Conventional wormhole VC organization Conventional Wormhole Virtual Channel (CWVC) mechanism is a common approach used in most of the existing VC based NoCs being researched. It is usually used as a basic approach to be compared with the other new approaches 5,8,9,12,13. In Fig. 2, the VC implementation of a physical channel is illustrated for a conventional static queue structure. The buffer slots in static queue structure are statically allocated to incoming packets where the buffer slots are dynamically allocated to incoming packets in DAMQ organization. In CWVC mechanism, when the header flit of a packet enters a VC buffer, the VC is reserved by the packet. The reservation of the VC is kept until the tail flit arrives. Then the VC can accept a new packet and it might have flits from previous packet. In this way, it can contain two parts of two packets at a time. When a VC has at least two parts of two packets, the blocking of HoL packet blocks the 2nd packet even if the route of the other packet is open. This is the main source of HoL blocking. Physical channel P 0 P 1 P 1 P 3 P 3 P 1 P 1 VC0 VC1 VC2 VC3 Free Fig. 1. Reserved buffer space for virtual channels.

554 Masoud Oveis Gharan and Gul N. Khan / Procedia Computer Science 34 ( 2014 ) 552 558 Write-Pointer VC3 Read-Pointer VC3 Write-Pointer VC2 Read-Pointer VC2 Write-Pointer VC1 Read-Pointer VC1 Write-Pointer VC0 Read-Pointer VC0 P 0 P 0 P 0 Physical Channel 32 P 1 P 1 P 1 P 1 P 2 P 2 P 2 P 2 P 2 P 2 P 2 P 2 32 Fig. 2. Input-port with static queue. For DAMQ based VC buffers, the depth of VC dynamically varies according to the traffic situation. To better illustrate the structure and operation of DAMQ buffers, two different cases are illustrated in Fig. 3. In the first case, the depth of VC0 is thirteen while in the 2 nd case, VC0 depth is zero. The varying depth of VC0 can be due to different traffic conditions. This property of DAMQ buffer where a VC can occupy all the buffer space can lead to lower performance. In the case of DAMQ based CWVC organization, there may be two situations of lower performance. First of all, when a packet is blocked in a VC and the other packets can move into that VC buffer, the VC becomes bigger and occupies all the space of buffer. This prevents the other VCs to perform efficiently due to lack of buffer space. Secondly, this blockage leads to the blocking in upstream routers (related to the blocked packets). In other words, the buffer blockage will spread to the other parts of NoC causing global congestion. 2.2. PBVC approach Our PBVC approach provides an effective solution to remove HoL blocking. The methodology is suitable for DAMQ buffer organization. In this approach, when the header flit of a packet enters a VC, the VC is reserved by the packet. The VC becomes free when the tail flit of the packet exits. The VC does not accept a new packet until the VC buffer is holding any flit of the existing packet. In this way, our PBVC methodology removes HoL blocking completely as only one packet can occupy a VC at a particular instance. Its main features are listed below: The chances of getting a free VC for unblocked packets of a channel in the PBVC approach are much higher than the CWVC mechanism. Consider two situations of Figures 4 and 5. In Fig. 4 that represents CWVC organization, the packet 4 and 5 remain blocked until packet-0 is facing a blockage. For PBVC organization, when one of the VCs (i.e. VC1, VC2 or VC3) becomes free, the packet 4 and 5 can occupy the buffer as illustrated in Fig. 5. In the case of our PBVC approach, when a packet faces a blockage, its VC gets minimum space from the channel buffer. Consider a situation of packet blockage as shown in Fig. 6. For the CWVC scheme, the upstream router continues sending new packets to the VC and it continues allocating more buffers and can occupy all the free area of DAMQ buffer as illustrated in Fig. 6a. In the case of PBVC, the new packets stay in the upstream routers until a VC becomes empty. In fact, more free space is available for the other unblocked VCs as shown in Fig. 6b. Consequently, PBVC performance and buffer utilization is better under such scenarios. We expect that the packets reach their destinations in a sequential order. In the traditional CWVC mechanism, if any packet of a series is faced with HoL blocking, the following packets of the series can travel via a free VC and reach the destination before the blocked packet. However, our PBVC approach avoids HoL blocking and each packet of a series will reach the destination in order. H 3 B 2 T 1 H 6 T 5 B 5 B 5 H 5 T 4 B 4 B 4 H 4 T 0 B 0 B 0 H 0 Physical Channel (a) first case VC3 VC2 VC1 VC0 T 3 B 3 B 3 H 3 T 2 B 2 B 2 T 1 B 1 B 1 H 1 P 4 P 4 P 5 P 4 P 4 P 3 P 6 P 5 P 2 P 5 P 5 P 0 P 1 P 0 P 0 P 0 Physical Channel P 3 P 3 P 2 P 3 P 2 P 2 P 3 P 1 P 1 P 1 P 1 (b) 2nd case Free Header flit: H Body flit: B Tail Flit: T Fig. 3. Two different cases in a DAMQ buffer for 4 VCs/channel and 4 flits/packet.

Masoud Oveis Gharan and Gul N. Khan / Procedia Computer Science 34 ( 2014 ) 552 558 555 VC3 VC2 VC1 VC0 H 3 T 2 B 2 B 2 H 2 T 1 B 1 H 5 T 4 B 4 B 4 H 4 T 0 B 0 B 0 H 0 Free H 3 T 2 B 2 B 2 H 2 T 1 B 1 T 0 B 0 B 0 H 0 Physical channel P 4 P 4 P 5 P 4 P 4 P 3 P 2 P 2 P 2 P 2 P 1 P 0 P 1 P 0 P 0 P 0 P 3 P 2 P 2 P 2 P 2 P 1 P 0 P 1 P 0 P 0 P 0 Fig. 4. CWVC: Packet-0 blockage causes HoL blocking for P4 & P5. Fig. 5. PBVC: Packet-0 blocked but no HoL blocking. H 3 B 2 T 1 H 6 T 5 B 5 B 5 H 5 T 4 B 4 B 4 H 4 T 0 B 0 B 0 H 0 T 3 B 3 B 3 H 3 T 2 B 2 B 2 T 1 T 0 B 0 B 0 H 0 Free P 4 P 4 P 5 P 4 P 4 P 3 P 5 P 5 P 5 P 2 P 6 P 0 P 1 P 0 P 0 P 0 (a) CWVC The buffer utilization of PBVC can be a bit lower than the CWVC approach. However, due to the adaptive nature of buffer, its utilization can be compensated. Moreover, in NoCs where the communication involves a large number of HoL blockings, PBVC mechanism will show better buffer utilization. 3. PBVC router structure Fig. 6. HoL blocking due to a blocked Packet-0. P 3 P 3 P 2 P 3 P 2 P 2 P 3 P 0 P 1 P 0 P 0 P 0 (b) PBVC A DAMQ based CWVC 5x5 router architecture for mesh topology is given in Fig. 7. Generally, the microarchitecture of a router consists of input and/or output ports, an arbiter and a crossbar switch. Each input or output port can utilize VCs to control flow and to enable the sharing of a NoC communication channel. We assume in this paper that a router has VCs at the input-ports level. The NoC router organization and architecture is similar for both CWVC and PBVC approaches. The only difference is the organization of their input-ports. The micro-architecture of a typical CWVC input-port is illustrated in Fig. 8. It contains an SRAM, five arrival/departure tables and other logic and control circuits. The SRAM module serves as the channel buffer. The word size of SRAM is equal to the packetflit size. The data pointed by the Read-Address will always appeared at the SRAM output. When credit-in is active, the data is stored in the SRAM slot pointed by the Write-Address. Five tables are used to implement a dynamic Link-List based CWVC mechanism. The table keeps the records of the occupied VCs. The Header-List table keeps the addresses of channel buffer (SRAM) that point to the header flits of VCs. The Tail-List table keeps the addresses of SRAM buffers that point to the tail flits of VCs. The Link-List table keeps the address of next slot of each buffer slot in the SRAM. In fact, it links the address of flits of each VC in a FIFO manner. The Slot-State table keeps the records of occupied slots in the SRAM. In a CWVC router with reserved space for each VC, a credit signal is sent to upstream routers indicating the in the down-stream router. When the capacity of a VC is full, the credit signal will change to close the VC. Assume that one slot is reserved per VC, the capacity of each VC is dynamic and it varies from one to M buffer slots dynamically, where: M = SRAM slots Total VCs 1 Assuming a total of 16 buffer slots and four virtual channels, the dynamic capacity of each VC varies from 1 slot to 13 slots. In the dynamic buffer implementation, the credits are regulated only by two conditions as given below: if {(Current VC is Empty) OR (Free Slots < Free VCs)} then credit = ON In other words, a VC is open if it is empty or at least one slot is reserved for each free VC. These conditions guarantee that at least one slot is dedicated to each VC. Subsequently, the rest of the slots are dynamically used for all the VCs. It will also remove any starvation and protocol-level deadlocks in the NoC communication. A little-bit of additional hardware is required to implement the PBVC approach in each router. In fact, the coding of two "if statements" is required to open and close each VC as illustrated in Fig. 9a. The first "if statement" is used to close each VC when the arriving flit is a tail flit. The second "if statement" is used to open each VC when the exiting flit is a tail flit. To open and close VC, we use a single bit register, PBVC-Blk. It is set in the case of open and reset otherwise. PBVC-Blk output is inverted and ANDed with the credit out signal of the related VC. Therefore, when PBVC-Blk is set, the VC will be closed as shown in Fig. 9b. PBVC and CWVC routers are also implemented for two platforms: SYNOPSYS and FPGA. The SYNOPSYS

556 Masoud Oveis Gharan and Gul N. Khan / Procedia Computer Science 34 ( 2014 ) 552 558 platform uses 0.25μ technology to determine the power and area overhead for implementing PBVC approach. Both routers use a dual-ported SRAM memory for data buffering. The SRAMs have 16 slots with 16 bits in each slot. The power and chip area values of both routers are listed in Table 1. The PBVC router requires an extra 0.006% chip area and consumes 0.24% more power than CWVC. We have also investigated the amount of extra hardware for PBVC router when implemented in the FPGA platform. The PBVC router requires 0.4% more combinational logic and 0.9% more registers as compared to CWVC router. VC-ID VC-ID-local Arrival/Departure Tables Write-pointer Credit-in Grant Read-pointer Clk Link-list slot state 0 3 1 4.. 15 0 VC-State 0 1 2 1 3 0 Slot-State slot state 0 1 15 1 Tail-List 0 6 1 5 2 7 3 - Header-List 0 2 2 7 3 - Header-List Read-pointer 0 2 2 7 3 - VC-ID-local VC-block Request Buffer-full Credit in Link N Link E Link S Link W Link L Router Input Port N Input Port E Input Port S Input Port W Input Port L Req N In N out N In E out E Crossbar In S out S In W In L grant N Arbiter (VC and SW Allocator) VC_block Aselect VC_full config out W out L Buffer_full Credit_out Link N Link E Link S LinkW Link L Buffer-full Data_in Credit_in Slot-State slot state 0 0.... 15 1 VC-State 0 1 2 1 3 0 Read-pointer Write-pointer De coder 4 3 Read-Address SRAM Write-Address 2 Data Out Data In 1 0 00001 Grant << Flit- info Out Reg. In Req Data_out VC_ID VC_ID Fig. 7. A 5 5 router architecture for 2D-mesh NoCs. Fig. 8. Micro-architecture of a CWVC router input-port. 1 Credit-in grant tail 0 tail PBVC-Blk In Out PBVC-Blk One-bit Register >... PBVC-Blk Credit-out (a) logic for if statements (b) logic for Credit-out Fig. 9. Extra hardware per VC for PBVC router input-port. Table 1. PBVC and CWVC router design. ASIC FPGA Router Area Power Combinational Registers Type (μm 2 ) (mw) Logic (elements) (bits) CWVC 3420235 231 8991 2244 PBVC 3420435 231.6 9032 2264 Extra Hardware 0.006% 0.24% 0.4% 0.9% 4. Experimental results The conventional DAMQ organization with a slot reserved per VC is employed in the CWVC mechanism and the same terminology is used. We simulated two different traffic patterns such as Random and HoL specific traffic. For Random traffic, all the destination cores are chosen randomly. HoL specific traffic creates a situation of higher HoL blocking conditions. By evaluating the results of these traffic patterns, the efficiency of our PBVC approach is demonstrated in this section. We setup our simulator for PBVC and CWVC modes in a DAMQ Link-List based router and input-port micro-architecture. We change the number of traffic packets and VCs to measure the

Masoud Oveis Gharan and Gul N. Khan / Procedia Computer Science 34 ( 2014 ) 552 558 557 performance in terms of throughput and latency. The NoC topology selected is a 4 4 mesh with XY routing methodology. The communication of packets is in the form of wormhole switching where the channel width is equal to the flit size (32 bits). Each packet is made of 16 flits and the buffer depth for each input-port is 16 slots of a packet-flit each. Each flit is sent or received by a source core/router in two clock cycles. We assume that the link delays between routers are negligible as compared to the router delay. The performance of PBVC is compared with the CWVC approach during the following experiments and results are presented in Figures 10, 11, 12 and 13. In the case of Random traffic, all the sources, destinations and routers are clocked at the same rate (e.g. 1nsec). In the second experiment, the HoL specific traffic pattern is employed, where all the source cores send their first packet to one destination. The destination is set to be two times slower than the other destination cores. After sending the first packet, the rest of the packets of all the sources are sent randomly to all the destination cores. This condition increases the HoL blocking especially when the second or later packets are transferred. Our simulator is coded in Verilog and simulation is done by using the ModelSim for the Altera FPGA platform. In these experiments, the throughput is measured by the rate of packets received to the maximum number of packets (ideally) sent at a specific time. The latency is measured in terms of time that a specific number of packets are sent and received by the NoCs. 70% 60% 1200 1000 Throghput (rate of receiving) 50% 40% 30% 20% 10% Average Latency ( ns ) 800 600 400 200 % 132 256 512 1024 1536 2048 Time (ns) Fig. 10. Throughput for random traffic. 0 64 128 256 512 1024 Packets Sent Fig. 11. Average latency for random traffic. Throghput (rate of receiving) 45% 40% 35% 30% 25% 20% 15% 10% Average Latency ( ns ) 2500 2000 1500 1000 500 5% % 132 256 512 1024 1536 2048 Time (ns) 0 64 128 256 512 1024 Packets Sent Fig. 12. Throughput for HoL specific traffic. Fig. 13. Average latency for HoL specific traffic.

558 Masoud Oveis Gharan and Gul N. Khan / Procedia Computer Science 34 ( 2014 ) 552 558 Figures 10 and 11 show the throughput and latency results in the case of Random Traffic. In the beginning of simulation (around 132 nsec), the performance of PBVC is much higher than that of CWVC, and as the time passes this advantage diminishes. This is due the fact that in the beginning of simulation, the traffic is not crowded, and when the HoL blocking occurs in a channel, the incoming packet can move out of the channel. This situation will improve the performance of PBVC approach. In the case of four virtual channel (i.e. VC4 case in Figures 10 and 11), the throughput of PBVC is 2.6% higher and the latency is 8.2% lower than those of CWVC. In the second part of the experiment, both models are evaluated in a high contention environment. Figures 12 and 13 show the throughput and latency results for the HoL specific traffic where two scenarios are investigated. In the first scenario, the first packets of all the sources are intended for one destination, and afterward the packets travel to all the destinations randomly. In the second condition, the destination is two times slower than the other destinations. In this traffic pattern, the PBVC performance improvement is much better than CWVC. For 1024 packets, the average latency is 40% less than CWVC, and the average throughput is 23% higher than CWVC for 2048 nsecs. This is due to a lot of HoL blocking occurring in the beginning of simulation. As the time passes the occurrence of HoL blockings will reduce, and the throughputs of two methods are going to be close to each other. Another important point is that as the number of VCs is reduced to two, the advantage of PBVC also diminishes. This is due the fact that when the HoL blocking occurs in PBVC and there are free VCs, the new packet passes through these free VCs that improves the performance. 5. Conclusions The architecture and structure of packet-based virtual channel (PBVC) approach is presented. PBVC buffer has its root in dynamically allocated multi-queue (i.e. DAMQ) buffers. We conclude that PBVC is able to completely remove HoL blockings in NoCs. To verify our claims PBVC and the conventional wormhole VCs (CWVC) are implemented using DAMQ buffers. In the experiments, two traffic patterns i.e. random and HoL specific traffic are applied to PBVC and CWVC based NoCs. Throughput and latency for PBVC based NoC are compared with those of CWVC based NoC. The performance results are obtained in varying number of VCs and traffic conditions. The PBVC results are better on average as compared to the CWVC. In the case of HoL specific traffic, the average latency and throughput improve for our PBVC approach as compared to traditional CWVC. References 1. Yoo HJ, Lee K, Kim JK, Network-on-Chip based SoC. In: Low-Power NoC for High-Performance SoC Design. Boca Raton: CRC Press; 2008. p. 142 5. 2. Nicopoulos CA, Dongkook P, Jongman K, Vijaykrishnan N, Yousif MS, Das CR. ViChaR: A dynamic virtual channel regulator for Network-on-Chip routers. In Proc. 39th Annual IEEE/ACM International Symposium on Microarchitecture, 2006. p. 333 46. 3. Dally WJ. Virtual-channel flow control. IEEE Transactions on Parallel and Distributed Systems, 1992; 3:194 205. 4. Frazier GL, Tamir Y. The design and implementation of a multiqueue buffer for VLSI communication switches. In Proc. IEEE International Conference on Computer Design: VLSI in Computers and Processors, 1989. p. 466 71. 5. Liu J, Delgado-Frias JG. DAMQ self-compacting buffer schemes for systems with Network-on-Chip. In Proc. International Conference on Computer Design, 2005. p. 97 103. 6. Tamir Y, Frazier GL. Dynamically-Allocated Multi-Queue Buffers for VLSI Communication Switches. IEEE Transactions on Computers, 1992; 41:725 37. 7. Park J, O Krafka BW, Vassiliadis S, Delgado-Frias J. Design and evaluation of a DAMQ multiprocessor network with self-compacting buffers. In Proc. Supercomputing, 1994. p. 713 22. 8. Benini L, Micheli GD. Register designs for queuing buffer. In: Networks on Chips: Technology And Tools. San fransisco: Morgan Kaufmann Publishers; 2006. 9. Choi Y, Pinkston TM. Evaluation of queue designs for true fully adaptive routers. Journal of Parallel and Distributed Computing, 2004; 64:606 16. 10. Xu Y, Zhao B, Zhang Y, Yang J. Simple virtual channel allocation for high throughput and high frequency on-chip routers. In Proc. International Symposium on High Performance Computer Architecture, 2010. p. 1 11. 11. Evripidou M, Nicopoulos C, Soteriou V, Kim J. Virtualizing virtual vhannels for increased Network-on-Chip robustness and upgradeability. In Proc. IEEE Computer Society Annual Symposium on VLSI, 2012. p. 21 6. 12. Zhang H, Wang K, Dai Y, Liu L. A Multi-VC Dynamically Shared Buffer with Prefetch for Network on Chip. In Proc. IEEE 7th International Conference on Networking, Architecture and Storage, 2012. p. 320 7. 13. Oveis-Gharan M, Khan GN. A novel virtual channel implementation technique for multi-core on-chip communication. In Proc. IEEE Symp. Computer Architecture and High Performance Computing (WAMCA 12), 2012. p. 36 40.