Deadlock-Avoidance Technique for Fault-Tolerant 3D-OASIS-Network-on-Chip


Akram Ben Ahmed, Abderazek Ben Abdallah
The University of Aizu, Graduate School of Computer Science and Engineering, Aizu-Wakamatsu, Japan

Abstract

During the last few decades, 3-dimensional Networks-on-Chip (3D-NoCs) have been proposed as a promising architecture that combines the high parallelism of the Network-on-Chip interconnect paradigm with the high performance and lower interconnect power of 3-dimensional integrated circuits (3D-ICs). However, 3D-NoC systems are exposed to a variety of manufacturing and design factors that make them vulnerable to faults, which can cause corrupted message transfers or even catastrophic system failures. A 3D-NoC system should therefore tolerate both transient malfunctions and permanent physical damage. Most of the existing 3D-NoC systems rely on routing algorithms to ensure fault tolerance; however, one of the serious problems that routing algorithms may face is deadlock, which can block some routers in the network or even the entire system. Consequently, deadlock should be either avoided or detected and removed. In this paper, we present a low-cost deadlock-recovery technique for fault-tolerant 3D-NoC systems. The proposed technique detects the presence of deadlock in the network and removes it with no considerable performance drop. It was implemented on our previously designed 3D-NoC system, named 3D-OASIS-NoC, which adopts the Look-Ahead-Fault-Tolerant routing algorithm (LAFT). Evaluation results show that...

Keywords: 3D NoC; Concurrent; Fault-tolerant; Routing; Deadlock

I. Introduction

Based on a simple and scalable architecture platform, Network-on-Chip (NoC) [1], [2] connects processors, memories and other custom designs together using switches that distribute packets on a hop-by-hop basis. This increases bandwidth and performance and solves the interconnect bottleneck of traditional bus-based systems.
At the same time, three-dimensional integrated circuits (3D-ICs) [3], [4] have attracted a lot of attention as a potential solution to the interconnect bottleneck. Thanks to the reduced average interconnect length, 3D-ICs can achieve higher performance and lower interconnect power consumption [5], [6]. Moreover, circuitry is more immune to noise in 3D-IC chips [4], and the realization of mixed technology has become possible [7], [8]. Combining the NoC structure with the benefits of 3D integration offers a promising 3D-NoC architecture. This combination opens a new horizon for NoC designs to satisfy the high requirements of future large-scale applications.

Due to the complex nature of 3D-IC fabrics and the continuing shrinkage of semiconductor components, 3D-NoC systems are becoming increasingly vulnerable to failures caused by physical defects (permanent faults) and to transient faults caused by some component failures [9]. Faults can affect many parts of a 3D-NoC system, such as routers, IPs and links. As Lehtonen et al. stated in [10], the majority of failures (80%) are caused by transient faults, while the rest originate mainly in permanent and intermittent faults. As a safety requirement, these faults should not cause a complete system failure: a 3D-NoC should keep running and deliver correct messages to their destination nodes, even with degraded performance. This can be done either by employing a fault-tolerant mechanism that avoids or deactivates the faulty components, or by reconfiguring the system without causing any significant performance drop.

Figure 1. Deadlock example in fault-tolerant 3D-NoC system.

As is the case for every adaptive routing algorithm, the deadlock problem may arise with fault-tolerant routing schemes. Deadlock is one of the major issues in NoC

systems. It occurs when packets in different buffers are unable to progress because they depend on each other, forming a dependency cycle. In fault-tolerant algorithms, deadlock is more likely to occur because the presence of faults adds more restrictions to the routing decision. Figure 1 illustrates a deadlock example in a fault-tolerant 3D-NoC system. The dependency is caused by the exchange of flits between R002 and R001. Due to the presence of faults, the choices for minimal routing are limited and the two communications depend on each other; thus, neither of them can make progress through the network. In the same figure, we can see that flits Dest010 and Dest000, stored in the input-ports of R011 and R001 respectively, are victims of this deadlock: even though their output-channel is free, they have to wait in the buffer until the blocking is resolved.

Most of the existing 3D-NoC systems [11], [12], [13], [14] use Virtual-Channel (VC) [15] as a deadlock-avoidance technique. As illustrated in Fig. 2, VC divides the input-buffer into smaller queues which are independent of each other and managed by an arbiter. When a blockage happens in one VC, the other ones are not affected and continue to request their corresponding output-channel. In this fashion, nonblocked requests are served and their slots are freed to host other incoming flits. The number of VCs depends on the complexity of the algorithm and on how likely deadlock is to occur; thus, some architectures use two VCs [11], others three [12], and some even four [13], [14] to ensure deadlock-freedom for their fault-tolerant routing algorithms.

Figure 2. 3D VC router architecture.

Another technique used for deadlock avoidance is Virtual-Output-Queue (VOQ) [16]. In VOQ, as shown in Fig. 3, the input-buffer is divided into different queues that store incoming flits according to their requested output-channel; i.e., VOQ(i,j) stores flits coming from input-port i and wishing to access output-port j. For each output-channel, a 7x1 crossbar(i) is dedicated to handling the traversal of flits coming from the different input-channels and requesting the grant for output-channel(j). According to [17], VOQ can achieve less switch delay than VC with the same efficiency.

Figure 3. 3D VOQ router architecture.

Both VC and VOQ ensure deadlock-freedom; however, employing such techniques is costly in terms of hardware and implementation complexity. This cost comes from the arbitration needed to handle the different requests coming from the multiple VCs/VOQs at each input-port.

In another work, Pasricha et al. [18] extended a 2D turn model for partially adaptive routing to the third dimension. The proposed scheme combines the 4N-FIRST and 4P-FIRST schemes into a lightweight 4NP-FIRST scheme. However, this turn model introduces routing restrictions to prevent deadlock. These restrictions lead to nonminimal routing, where in some cases a packet may need many additional hops to reach its destination.

Based on these facts, in this paper we propose a low-cost deadlock-recovery technique, named Random-Access-Buffer (RAB). RAB first detects the occurrence of deadlock, then drops the blocking request and looks for other requests that can be served, in order to free some slots in the buffer and break the dependency cycle causing the deadlock. RAB was implemented on our fault-tolerant 3D-OASIS-NoC system [19], [20], [21], which employs the Look-Ahead-Fault-Tolerant routing algorithm (LAFT) [22]. LAFT boosts the performance of 3D-OASIS-NoC while guaranteeing fault tolerance with no considerable performance degradation.

The rest of the paper is organized as follows. In Section II, the 3D-OASIS-NoC system architecture is overviewed, including the adopted Look-Ahead-Fault-Tolerant routing algorithm (LAFT).
The proposed Random-Access-Buffer (RAB) for deadlock recovery is introduced in Section III. Section IV is dedicated to the evaluation methodology and results, and finally Section V concludes the paper and discusses future work.
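The dependency cycle that defines deadlock in the Introduction can be made concrete with a small wait-for graph. The Python sketch below is only an illustration and is not part of the proposed hardware: the function name `find_dependency_cycle` and the buffer labels are hypothetical, and each edge simply means "the packet at the head of this buffer is waiting for a slot in that buffer".

```python
from typing import Dict, List, Optional


def find_dependency_cycle(waits_for: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle of the wait-for graph as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color: Dict[str, int] = {node: WHITE for node in waits_for}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in waits_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: a dependency cycle exists.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(waits_for):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical labels loosely modeled on Figure 1: the input buffers of R001
# and R002 wait on each other, and a third buffer is a victim of that cycle.
example = {
    "R001.in": ["R002.in"],
    "R002.in": ["R001.in"],
    "R011.in": ["R001.in"],
}
print(find_dependency_cycle(example))
```

In this toy graph the victim buffer is not part of the cycle itself, which mirrors the observation that some flits are blocked even though their own output-channel is free.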

II. Fault-Tolerant 3D-OASIS-NoC System Overview

A. Look-Ahead-Fault-Tolerant routing algorithm

To keep the benefits of look-ahead routing [20], [23], the Look-Ahead-Fault-Tolerant routing algorithm (LAFT) [22] should be able to perform the routing decision for the next node, taking into consideration that node's link status, and select the best minimal path. Before explaining LAFT, two important assumptions should be mentioned. First, the links connecting the PE to the local input and output ports are always nonfaulty. Second, we assume that there exists at least one minimal path between a (source, destination) pair. These assumptions are natural and necessary to deliver any flit from source to destination.

Figure 4. Look-Ahead-Fault-Tolerant routing algorithm flow chart.

We employed a simple fault detection mechanism based on a single multiplexer in each input-port that reads the incoming flit and verifies whether it is corrupted or not. Depending on this verification, the fault-control module sends a single-bit signal to the upstream node: 0 for valid or 1 for faulty. Each router sends the collected information about its own fault status to each of its six neighboring nodes and also to the Network-Interface of the attached PE. This information is carried by a six-bit signal representing the router's link status in each direction (North, East, Up, South, West and Down).

It is important to mention that we chose control signals rather than control flits to transfer the fault information in order to enhance performance. Control flits would increase congestion in the router, since data and control flits would compete for the router resources. We also avoided using registers to store this information, and instead used signals, to limit the area overhead that additional registers might cause.
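The six-bit link status signal described above can be illustrated with a short Python sketch. It is a software model only, under stated assumptions: the paper gives the direction order (North, East, Up, South, West, Down) and the convention 0 = valid, 1 = faulty, but the mapping of North to bit 0 and the helper names `encode_link_status` and `nonfaulty_directions` are assumptions made here for illustration.

```python
from typing import Iterable, List

# Direction order as listed in the text; assigning North to bit 0 is an assumption.
DIRECTIONS = ("North", "East", "Up", "South", "West", "Down")


def encode_link_status(faulty: Iterable[str]) -> int:
    """Pack the faulty directions into a six-bit word (1 = faulty, 0 = valid)."""
    faulty_set = set(faulty)
    status = 0
    for bit, direction in enumerate(DIRECTIONS):
        if direction in faulty_set:
            status |= 1 << bit
    return status


def nonfaulty_directions(status: int) -> List[str]:
    """Decode a six-bit status word into the list of usable directions."""
    return [d for bit, d in enumerate(DIRECTIONS) if not (status >> bit) & 1]


status = encode_link_status({"East"})
print(format(status, "06b"), nonfaulty_directions(status))
```

A neighbor or the Network-Interface receiving this word only needs to test one bit per direction, which is consistent with the choice of plain signals over control flits.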
The fault information is read by each input-port, where LAFT is executed. Figure 4 illustrates the flow chart of this algorithm. The first phase of the algorithm calculates the next-node address from the next-port identifier read from the flit. For a given node wishing to send a flit to a given destination, there exist at most three possible directions, one along each of the X, Y and Z dimensions. In the second phase, LAFT computes these three directions concurrently by comparing the x, y and z coordinates of the current and destination nodes. While these directions are being computed, the fault-control module reads the next-port identifier from the flit and sends the appropriate fault information to the corresponding input-port. By the end of this second phase, LAFT knows the fault status of the next node and the three possible directions for minimal routing.

In the next phase, the routing selection is performed. For this decision, we adopted a set of prioritized conditions to ensure fault tolerance and high performance both in the presence and in the absence of faults:
1) The selected direction should ensure a minimal path; this condition has the highest priority in the routing selection.
2) We should select the direction with the largest next-hop path diversity.
3) The congestion status is given the lowest priority.

Depending on these priorities, LAFT reads the fault status of the next node received from the fault-control module and checks the number of possible nonfaulty minimal directions. As illustrated in Fig. 4, if only one nonfaulty minimal direction is obtained, this direction will be selected as the out-port for


the next node. If more than one minimal direction is available, the algorithm selects the direction that leads to the node with the higher path diversity. The diversity value of a given node is the number of possible directions leading to the destination through a minimal path. A node with high diversity offers more routing choices. When faults are present, this means that the probability of finding a nonfaulty link is greater. When no faults are detected in the system, selecting the direction with the highest diversity gives more choices for finding the least congested direction. As stated in [18], to obtain directions with high diversity, we should select those leading to nodes located in the center of the mesh and avoid routing to the edges of the network.

When the three possible directions are minimal and have the same diversity, the routing selection is made depending on the congestion of each output port. This congestion information is obtained from the stop signal issued by the flow control used in our 3D-OASIS-NoC system. When there is no valid minimal route available, LAFT chooses a nonminimal route while still considering the second and third priorities (path diversity and congestion), as illustrated in Fig. 4.

To better understand how LAFT works, consider Fig. 5. Assume that the current node (labeled C) received an incoming flit whose next-port identifier, calculated in the previous node, indicates that the out-port for this flit is East (red arrow). The next-node address is calculated (labeled N). From N, three minimal directions are possible: East, North or Up. The East direction will not be selected since the link in this direction is faulty. Therefore, either North or Up can be selected, since both are minimal and nonfaulty. In this case, the diversity priority is taken into consideration. If Up is selected, the node reached in this direction is on one of the network edges and its diversity value is 2 (two possible minimal directions: East or North). However, if North is selected, the diversity value is 3 (East, North or Up). Having the highest diversity, the North out-port (green arrow) is selected for the next node and is embedded in the flit to be used in the downstream node, which allows the routing calculation and the switch allocation to be performed in parallel.

Figure 5. Look-Ahead-Fault-Tolerant routing algorithm example.

B. System architecture

The 3D-OASIS-NoC system architecture [19], [20], [21] is represented in Fig. 6. This figure also depicts the router block diagram and its three main pipeline stages: Buffer Writing (BW), Routing Calculation/Switch Arbitration (RC/SA) and Crossbar Traversal (CT). The router is the backbone component of the whole 3D-OASIS-NoC design. Each router has at most 7 input and 7 output ports: four ports connect to the neighboring routers, one port connects the switch to the local computation tile, and the remaining two ports connect the router to the upper and lower layers to ensure inter-layer communication. The router contains seven Input-port modules, one for each direction, in addition to the Switch-Allocator (where the STALL-GO flow control and the matrix-arbiter scheduler can be found) and the Crossbar module that handles the transfer of flits to the next neighboring node.

Figure 7. Input-port module architecture.

The Input-port module is represented in Fig. 7. It is composed of two main elements: the Input-buffer and the Route module. Incoming flits from the neighboring routers, or from the connected computation tile, are first stored in the Input-buffer, where they wait to be processed. This step is the first pipeline stage (BW) of the flit's life cycle. Each input-buffer can host up to 4 flits.
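As a rough illustration of the LAFT selection priorities from Section II-A (nonfaulty minimal direction first, then path diversity, then congestion), the following Python sketch reproduces the decision of the Fig. 5 example in software. It is a simplified model under several assumptions that are not in the paper: the axis mapping (East = +x, North = +y, Up = +z), all function and parameter names, and the nonminimal fallback, which here just picks any nonfaulty in-mesh direction and omits the livelock restrictions.

```python
from typing import List, Optional, Set, Tuple

Coord = Tuple[int, int, int]

# Assumed axis mapping (hypothetical): x = East/West, y = North/South, z = Up/Down.
STEP = {
    "East": (1, 0, 0), "West": (-1, 0, 0),
    "North": (0, 1, 0), "South": (0, -1, 0),
    "Up": (0, 0, 1), "Down": (0, 0, -1),
}
AXES = (("East", "West"), ("North", "South"), ("Up", "Down"))


def minimal_directions(node: Coord, dest: Coord) -> List[str]:
    """Directions that move `node` closer to `dest` (at most one per axis)."""
    dirs = []
    for axis, (positive, negative) in enumerate(AXES):
        if dest[axis] > node[axis]:
            dirs.append(positive)
        elif dest[axis] < node[axis]:
            dirs.append(negative)
    return dirs


def step(node: Coord, direction: str) -> Coord:
    dx, dy, dz = STEP[direction]
    return (node[0] + dx, node[1] + dy, node[2] + dz)


def diversity(node: Coord, dest: Coord) -> int:
    """Number of minimal directions available from `node` toward `dest`."""
    return len(minimal_directions(node, dest))


def in_mesh(node: Coord, mesh_size: Coord) -> bool:
    return all(0 <= c < n for c, n in zip(node, mesh_size))


def laft_select(next_node: Coord, dest: Coord, faulty: Set[str],
                congested: Set[str], mesh_size: Coord) -> Optional[str]:
    """Pick the out-port to use at `next_node` (look-ahead decision)."""
    if next_node == dest:
        return "Local"
    # Priority 1: keep only nonfaulty minimal directions.
    candidates = [d for d in minimal_directions(next_node, dest) if d not in faulty]
    if not candidates:
        # Simplified nonminimal fallback (livelock restrictions omitted).
        candidates = [d for d in STEP
                      if d not in faulty and in_mesh(step(next_node, d), mesh_size)]
    if not candidates:
        return None
    # Priority 2: highest diversity of the node reached; priority 3: not congested.
    return max(candidates,
               key=lambda d: (diversity(step(next_node, d), dest), d not in congested))


# Fig. 5-like situation: East is faulty, North keeps diversity 3, Up only 2.
print(laft_select((0, 0, 0), (1, 2, 1), {"East"}, set(), (3, 3, 3)))
```

The tuple used as the sort key encodes the priority order directly: diversity is compared first, and the congestion flag only breaks ties.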
After being stored, the flit is fetched from the buffer and advances to the next pipeline stage. The flit is decoded to extract the destination address (xdest, ydest and zdest) in addition to the Next-Port identifier pre-calculated in the upstream node. These values, together with the fault information, are sent to the Route circuit, where LAFT is executed to determine the New-next-Port direction for the downstream node while taking into consideration its link fault status. At the same time, the Next-Port identifier is used to generate the request to the Switch-Allocator, asking for a grant to use the selected output port via the sw-req and port-req signals.

In order to enable the bypass technique [21], [24], the buffer issues two signals that report its occupancy status: fifo-empty and fifo-nearly-empty. When fifo-empty is asserted, the input-buffer is empty, so a flit arriving at the input port does not need to be stored in the buffer. The flit can therefore skip the buffering stage and advance directly to the next stage (RC and SA).

The sw-req and port-req signals issued by each Input-port module, which indicate the desired output-port, are transmitted to the Switch-Allocator module, which arbitrates between the different requests. This process is done in parallel with the routing computation performed in the Input-port, and together they form the second pipeline stage (RC/SA). At the end, the Switch-Allocator sends the sw-cntrl signal, which contains all the information about the scheduling result needed by the Crossbar circuit. The Crossbar, which forms the last pipeline stage (CT), reads the corresponding flit from the granted Input-port and sends it to its allocated output-channel. More details about the 3D-OASIS-NoC architecture can be found in [20], [21].

Figure 6. 3D-OASIS-NoC system architecture.

III. Random-Access-Buffer for Deadlock-recovery

As is the case for every adaptive routing algorithm, the deadlock and livelock issues may arise.
As mentioned in the Introduction, most of the existing routing algorithms either use virtual channels or add restrictions to the routing selection to avoid deadlock. These solutions suffer either from high implementation complexity or from the additional delay of the nonminimal approach. In our case, we implemented a technique that is similar to virtual channels but much simpler. This technique, named Random-Access-Buffer (RAB), first detects the flit that is causing the deadlock in the buffer, drops its request, and then looks for another flit whose request can be granted, in order to free some slots in the buffer and break the dependency. Instead of managing many requests at the same time, as virtual channels do at the cost of additional arbitration complexity and delay, RAB handles one request at a time.

Figure 8 shows an example of how RAB works. In each input-port, a buffer-controller (BC) manages the detection of deadlock and handles the assignment of the head and tail addresses. The detection mechanism is based on a timer: the BC monitors the grant signal received from the Switch-Allocator (sw-gr), and if the request being processed has not been served after a given period of time, a flag is raised to signal the presence of a deadlock (Figure 8 (1)). In this case, the BC reads the head of the next packet in the buffer and checks whether its requested out-port differs from the one flagged as blocked. When it finds a request whose channel is free (Figure 8 (2)), it sends that request to the Switch-Allocator to be served. When the request is granted, the flits of the granted packet are dequeued from the buffer and the freed slots can be used to host another incoming packet (Figure 8 (3)). After new flits are enqueued in the buffer, the blocked packet is checked again (Figure 8 (4)). In the example, the BC then receives a grant for the requested direction (North) and the packet is dequeued from the buffer.
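The RAB behavior described above can be approximated by a small cycle-level software model. This is a minimal sketch, not the Verilog implementation: the class and field names (`Packet`, `RandomAccessBuffer`, `timeout`, `blocked_port`) are hypothetical, the grant signal is modeled as a callback, and a granted packet is removed in a single step rather than flit by flit.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Packet:
    name: str
    out_port: str   # requested output port, e.g. "North"
    flits: int      # packet length in flits (not used by this simplified model)


@dataclass
class RandomAccessBuffer:
    timeout: int                                  # cycles before a deadlock is flagged
    packets: List[Packet] = field(default_factory=list)
    waited: int = 0                               # timer of the request being processed
    blocked_port: Optional[str] = None            # out-port flagged as blocked

    def cycle(self, granted: Callable[[str], bool]) -> Optional[Packet]:
        """Advance one cycle; return the packet dequeued in this cycle, if any."""
        if not self.packets:
            return None
        head = self.packets[0]
        if granted(head.out_port):
            # Normal case: the head request is served.
            self.waited = 0
            self.blocked_port = None
            return self.packets.pop(0)
        self.waited += 1
        if self.waited < self.timeout:
            return None
        # Timer expired: flag the head's out-port as blocked and look for
        # another packet whose requested out-port differs and can be granted.
        self.blocked_port = head.out_port
        for index in range(1, len(self.packets)):
            candidate = self.packets[index]
            if candidate.out_port != self.blocked_port and granted(candidate.out_port):
                return self.packets.pop(index)
        return None


# North is blocked for a while; the East packet behind it is served first.
rab = RandomAccessBuffer(timeout=2, packets=[Packet("A", "North", 2), Packet("B", "East", 2)])
north_blocked = lambda port: port != "North"
first = rab.cycle(north_blocked)       # timer still running, nothing leaves
second = rab.cycle(north_blocked)      # timeout reached, packet B is served
third = rab.cycle(lambda port: True)   # North is free again, packet A leaves
print(first, second.name, third.name)
```

Because only one request is examined at a time, the model needs no arbiter across queues, which is the simplification over virtual channels that the text emphasizes.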
Despite the delay penalty introduced by the timer to detect deadlock, this technique is still faster and simpler to implement than virtual channels. As long as the chosen route is minimal, livelock cannot occur either. However, it can be observed when a nonminimal direction is selected. For this reason,

some restrictions are added when selecting the nonminimal route, in addition to the one mentioned above. The first restriction forbids the flit from turning back in the direction it came from. The second one forbids selecting a path in the opposite direction of the faulty link (i.e., if East is faulty, then West should not be selected). Adopting these restrictions guarantees the livelock freedom of LAFT: a flit will continue to advance and search for a route until it finds a valid link.

Figure 8. Panels (1), (2), (3), (4). Performance evaluation: (a) Stall count evaluation with (b) Latency per flit (c) Throughput.

IV. Evaluation

A. Evaluation methodology

The proposed deadlock-recovery technique is implemented on the 3D-OASIS-NoC system [19], [20], which was designed in Verilog HDL, then synthesized and prototyped with commercial CAD tools and an FPGA board [25]. We evaluate the hardware complexity of the LAFT router in terms of area utilization, power consumption (static and dynamic) and speed. To evaluate the performance of the proposed algorithm, we selected Matrix-multiplication [26], [27] as a real benchmark, together with two traffic patterns: Transpose [28] and Uniform [29]. We chose Matrix-multiplication because it is one of the most fundamental problems in computer science and mathematics and forms the core of many important algorithms, for example in engineering and image processing applications [26], [27]. To evaluate the performance of the 3D-OASIS-NoC system with Matrix-multiplication, we set the matrix size to 6x6 and computed from 1 to 100 different matrices at the same time. The aim is to increase the number of flits traveling through the network simultaneously and to observe the impact of congestion on the performance of the proposed system under different traffic loads.

The Transpose traffic pattern is a communication method based on matrix transposition. Each node sends messages to another node with the address of the reversed dimension index [28].
The Transpose workload is often used to evaluate NoC throughput and power consumption, since it creates a bottleneck due to the long communication distance between (transmitter, receiver) pairs. The Uniform traffic pattern is a standard benchmark used in on-chip and off-chip network routing studies, and can be considered the traffic model for well-balanced shared-memory computations [29]. Each node sends messages to

other nodes with equal probability (i.e., destination nodes are chosen randomly using a uniform probability distribution function). In our evaluation with the two traffic patterns, we set the network size to 4x4x4, where every node acts as both a transmitter and a receiver. Each transmitter node injects from $10^2$ to $10^5$ flits into the network, while the receiver nodes verify the correctness of the received flits. Using these three benchmarks, we evaluated the latency per flit and the throughput of each application.

We observed the performance variation of the proposed system under different link fault rates (0%, 1%, 5%, 10%, 15% and 20%). The number of links in each system can be calculated using this formula [30]:

$$\#links = N_1 N_2 (N_3 - 1) + N_1 N_3 (N_2 - 1) + N_2 N_3 (N_1 - 1) \qquad (1)$$

where $N_1$, $N_2$ and $N_3$ are the network's X, Y and Z dimensions, respectively. For example, a 4x4x4 network has $3 \times (4 \times 4 \times 3) = 144$ links. During the evaluation, we divided the faults into two categories: half of the faults are permanent (present during the whole simulation time) and the other half are transient (they start and end randomly during the simulation). In addition, as the fault rate increases, we placed more faults on the flits' paths to force nonminimal routing and observe the system behavior in a worst-case environment. All the results obtained with the proposed technique (LAFT with RAB) were compared with our previously proposed algorithm LAFT [22] and with Dimension Order Routing XYZ [31], [32]. Table I lists the configuration parameters used for our evaluation.

Table I. Simulation configuration.
Parameters / System             Benchmark      LAFT-based      LAFT-based+RAB   XYZ-based
Network size (Mesh)             JPEG           2x2x2           2x2x2            2x2x2
                                Matrix (3x3)   3x3x3           3x3x3            3x3x3
                                Matrix (6x6)   3x6x6           3x6x6            3x6x6
                                Transpose      3x3x3           3x3x3            3x3x3
Flit size                       JPEG           34 bits         27 bits          30 bits
                                Matrix         38 bits         31 bits          33 bits
                                Transpose      38 bits         31 bits          33 bits
Header size                     JPEG           16 bits         9 bits           12 bits
                                Matrix         16 bits         9 bits           12 bits
                                Transpose      16 bits         9 bits           12 bits
Payload size                    JPEG           16 bits         16 bits          16 bits
                                Matrix         21 bits         21 bits          21 bits
                                Transpose      21 bits         21 bits          21 bits
Buffer depth
Switching                                      Wormhole-like   Wormhole-like    Wormhole-like
Flow control                                   Stall-Go        Stall-Go         Stall-Go
Scheduling                                     Matrix-Arbiter  Matrix-Arbiter   Matrix-Arbiter
Routing                                        LA-XYZ          XYZ              RPM
Target FPGA device                             Stratix III     Stratix III      Stratix III
Target Structured-ASIC device                  HardCopy III    HardCopy III     HardCopy III

B. Hardware complexity evaluation

Table II. Hardware complexity comparison results.
Columns: Target device; System; Area (ALUTs); Static Power (mW); Speed (MHz)
Target devices: FPGA; Structured-ASIC
Systems: LAFT-based; LAFT-based+RAB; XYZ-based; Look-ahead; Local; Hybrid

C. Performance evaluation

1) Communication latency evaluation:

2) Throughput evaluation:

V. Conclusion

References

[1] F. N. Sibai. On-Chip Network for Interconnecting Thousands of Cores. IEEE Transactions on Parallel and Distributed Systems, 23(2): , February
[2] A. Ben Abdallah and M. Sowa. Basic Network-on-Chip Interconnection for Future Gigascale MCSoCs Applications: Communication and Computation Orthogonalization. Proceedings of The TJASSST2006 Symposium on Science, December
[3] X. Wu, W. Zhao, M. Nakamoto, C. Nimmagadda, D. Lisk, S. Gu, R. Radojcic, M. Nowak and Y. Xie. Electrical Characterization for Intertier Connections and Timing Analysis for 3-D ICs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 20(1): , January
[4] G. Philip, B. Christopher, and P. Ramm. Handbook of 3D Integration: Technology and Applications of 3D Integrated Circuits. Wiley-VCH,
[5] Y. Xie, G. H. Loh, B. Black and K. Bernstein. Design Space Exploration for 3D Architectures. ACM Journal on Emerging Technologies in Computing Systems, 2(2):65-103, April
[6] A. W. Topol, J. D. C. La Tulipe, L. Shi, D. J. Frank, K. Bernstein, S. E. Steen, A. Kumar, G. U. Singco, A. M. Young, K. W. Guarini and M. Ieong. Three-dimensional Integrated Circuits. IBM Journal of Research and Development, 50(4/5): , July
[7] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li and Y. Chen. Circuit and Microarchitecture Evaluation of 3D Stacking Magnetic RAM (MRAM) as a Universal Memory Replacement. Proceedings of the 45th Annual Design Automation Conference, pages , June
[8] G. Sun, X. Dong, Y. Xie, J. Li and Y. Chen. A Novel 3D Stacked MRAM Cache Architecture for CMPs. IEEE 15th International Symposium High Performance Computer Architecture, pages , February
[9] L. Benini and G. De Micheli. Networks on Chips: Technology and Tools. Morgan Kaufmann,
[10] T. Lehtonen, P. Liljeberg and J. Plosila. Online Reconfigurable Self-timed links for Fault Tolerant NoC. VLSI Design, (2007):1-13, 2007.
[11] A.-M. Rahmani, K. R. Vaddina, K. Latif, P. Liljeberg, J. Plosila and H. Tenhunen. Design and Management of High-performance, Reliable and Thermal-aware 3D Networks-on-Chip. IET Circuits, Devices & Systems, 6(5): , September
[12] A. A. Chien and J. H. Kim. Planar-adaptive Routing: Low-cost Adaptive Networks for Multiprocessors. The 19th Annual International Symposium on Computer Architecture, pages ,
[13] J. Wu. Fault-tolerant Adaptive and Minimal Routing in Mesh-connected Multicomputer Using Extended Safety Levels. IEEE Transactions on Parallel and Distributed Systems, 11(2): , February
[14] J. Wu. A Fault-tolerant Adaptive and Minimal Routing Approach in 3-D Meshes. The 7th International Conference on Parallel and Distributed Systems, pages , July
[15] W. J. Dally. Virtual-channel flow control. IEEE Trans. on Parallel and Distributed Systems, 3(2): , March
[16] Y. Tamir and G. L. Frazier. High-performance multiqueue buffers for VLSI communication switches. 15th Annual International Symposium on Computer Architecture, pages , May-June
[17] Y. Zhang and J. Hu. A DFTR Router Architecture for 3D Network on Chip. 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), pages , July 2010
[18] S. Pasricha and Y. Zou. A Low Overhead Fault Tolerant Routing Scheme for 3D Networks-on-Chip. The 12th International Symposium on Quality Electronic Design, pages 1-8, March
[19] A. Ben Ahmed, A. Ben Abdallah and K. Kuroda. Architecture and Design of Efficient 3D Network-on-Chip (3D NoC) for Custom Multicore SoC. IEEE Proceedings of the 5th International Conference on Broadband, Wireless Computing, Communication and Applications, pages 67-73, November
[20] A. Ben Ahmed and A. Ben Abdallah. LA-XYZ: Low Latency, High Throughput Look-Ahead Routing Algorithm for 3D Network-on-Chip (3D-NoC) Architecture. The 6th IEEE International Symposium on Embedded Multicore SoCs, pages , September
[21] A. Ben Ahmed and A. Ben Abdallah. Low-overhead Routing Algorithm for 3D Network-on-Chip. IEEE Proceedings of The Third International Conference on Networking and Computing, December
[22] A. Ben Ahmed and A. Ben Abdallah. Architecture and Design of High-throughput, Low-latency, and Fault-Tolerant Routing Algorithm for 3D-Network-on-Chip (3D-NoC). To be published in the Journal of Supercomputing, DOI: /s
[23] A. Kumar, P. Kundu, A. P. Singh, L.-S. Peh and N. K. Jha. A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS. Proceedings of the 2007 IEEE International Conference on Computer Design, pages 63-70, October
[24] L. Xin and C.-S. Choy. A Low-latency NoC Router with Lookahead Bypass. IEEE International Symposium on Circuits and Systems, pages , May-June
[25]
[26] P. Chan, K. Dai, D. Wu, J. Rao and X Zou. The Parallel Algorithm Implementation of Matrix Multiplication Based on ESCA. IEEE ASIA Pacific Conference on Circuits and Systems, pages , December
[27] A. S. Zekri and S. G. Sedukin. The General Matrix Multiply-Add Operation on 2D Torus. In the 20th IEEE International Parallel and Distributed Processing Symposium, April
[28] A. A. Chien and J. H. Kim. Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors. Journal of the ACM, 42(1):91-123, January
[29] A. M. Rahmani, A. A. Kusha and M. Pedram. NED: A Novel Synthetic Traffic Pattern for Power/Performance Analysis of Network-on-Chips Using Negative Exponential Distribution. Journal of Low Power Electronics American Scientific Publishers, 5(3): ,
[30] B. Feero and P. P. Pande. Performance Evaluation for Three-Dimensional Networks-on-Chip. Proceedings of IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages , May
[31] H. Sullivan and T. R. Bashkow. Large Scale, Homogeneous, Fully Distributed Parallel Machine. Annual Symposium on Computer Architecture, ACM Press, pages , March
[32] C. H. Chao, K. Y. Jheng, H. Y. Wang, J. C. Wu and A.-Y. Wu. Traffic and Thermal-aware Run-time Thermal Management Scheme for 3D NoC Systems. In Proceedings of the ACM/IEEE International Symposium on Networks-on-Chip (NoCS), pages , May
