A Novel 3D Layer-Multiplexed On-Chip Network


Rohit Sunkam Ramanujam
University of California, San Diego
rsunkamr@ucsd.edu

Bill Lin
University of California, San Diego
billlin@ece.ucsd.edu

ABSTRACT

Recently, a near-optimal oblivious routing algorithm for 3D mesh networks called Randomized Partially-Minimal (RPM) routing was proposed [12], which works by load-balancing traffic across vertical layers and routing minimally on each horizontal layer. It achieves optimal worst-case throughput when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd, and it achieves significantly lower latencies than Valiant routing (VAL) [18], the best previously known optimal worst-case throughput algorithm. This paper presents a novel layer-multiplexed (LM) architecture for 3D on-chip networks that exploits the optimality of RPM together with the short inter-layer wiring delays enabled in 3D technology. The LM architecture replaces the one-layer-per-hop routing in a 3D mesh with simpler vertical demultiplexing and multiplexing structures. The proposed LM architecture can achieve the same worst-case throughput as a 3D mesh by adapting RPM routing to the LM architecture. However, the LM architecture consumes 27% less power, occupies 27% less area, attains 145% higher average throughput, and achieves a lower worst-case hop count for a symmetric mesh topology. On an asymmetric mesh, the LM architecture achieves comparable average-case throughput to a 3D mesh, but consumes 26% less power, takes up 27% less area and attains 2% lower worst-case hop count.

Categories and Subject Descriptors
C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors): Interconnection Architectures

General Terms
Algorithms, Design, Performance

Keywords
On-Chip Interconnection Networks, 3D ICs, Oblivious Routing

A short version of this paper appears in Embedded Systems Letters, vol. 1, no. 1, 2009.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ANCS'09, October 19-20, 2009, Princeton, New Jersey, USA. Copyright 2009 ACM.

1 INTRODUCTION

Three-dimensional (3D) integration [7, ] has attracted significant attention in recent years because of its potential benefits, including smaller chip footprints, higher transistor packing density, shorter wiring delays and significantly higher communication bandwidth. As 3D technology matures and becomes more viable, it opens up new opportunities for chip architecture innovations. One direction is to extend 2D tiled chip-multiprocessor architectures [16, 19, 1] into three dimensions [5, 8]. Many proposed 2D chip-multiprocessor architectures have relied on a 2D mesh network topology as the underlying communication fabric. Extending mesh-based tiled chip-multiprocessor architectures into three dimensions represents a natural progression for exploiting 3D integration.

As with 2D mesh networks, throughput and latency are important performance metrics in the design of routing algorithms for 3D meshes. Ideally, a routing algorithm should maximize worst-case and average-case throughput and minimize average latency. Dimension Ordered Routing (DOR) [15] is minimal in terms of latency but is known to suffer from poor worst-case and average-case throughput due to the lack of path diversity. ROMM [1] provides another alternative minimal routing algorithm, but it too suffers from poor worst-case throughput. An oblivious routing algorithm called O1TURN [1] is known to achieve optimal worst-case throughput with minimal hop count for 2D meshes. However, its performance surprisingly degrades tremendously for the 3D case, as noted in [1]. Just like O1TURN, the worst-case throughput of other minimal routing algorithms like DOR and ROMM also degrades severely when extended to 3D.

Until recently, the best known oblivious routing algorithm that achieves optimal worst-case throughput for 3D meshes was Valiant (VAL) routing [18], which globally load-balances traffic across all dimensions to achieve optimal worst-case throughput, but at the expense of destroying locality. VAL is a factor of 2x of DOR in average hop count. Recently, a near-optimal oblivious routing algorithm for conventional 3D mesh networks called Randomized Partially-Minimal (RPM) routing was proposed [12]. Essentially, RPM works by load-balancing traffic uniformly along one of the three

dimensions and routing minimally in the two remaining dimensions. The main result shown in [12] is that RPM achieves optimal worst-case throughput when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd. RPM achieves a factor of 1x of DOR in average hop count, which is much better than a factor of 2x for VAL. Figure 1 shows the worst-case throughput of RPM in comparison to DOR, ROMM, O1TURN and VAL. The worst-case throughput of DOR, ROMM and O1TURN degrades tremendously with increasing radix. At k = 14, the worst-case throughput of RPM is 14 times higher than that of one of these minimal algorithms and 526 times higher than that of the other two. Therefore, a load-balanced, non-minimal routing algorithm like RPM is necessary to provide high worst-case throughput guarantees in 3D mesh networks as the network radix increases.

[Figure 1: Normalized worst-case throughput of oblivious routing algorithms for 3D mesh networks. Horizontal axis: network radix (k). Vertical axis: normalized throughput.]

In this paper, we present a novel layer-multiplexed (LM) architecture for 3D mesh networks that exploits the optimality of RPM together with the short inter-layer wiring delays enabled in 3D technology. One way of implementing RPM in 3D meshes is by load-balancing traffic uniformly along the vertical dimension to a random intermediate layer and routing minimally within each layer. This requires two phases of minimal routing along the vertical dimension. In contrast to a conventional 3D mesh, the LM architecture replaces the one-layer-per-hop routing in the vertical dimension with simpler vertical demultiplexing and multiplexing structures. In doing so, the LM architecture retains the same worst-case throughput as a conventional 3D mesh by adapting RPM routing to the LM architecture. However, in comparison to a conventional 3D mesh, the LM architecture consumes 27% less power, occupies 27% less area, attains 145% higher average throughput and achieves a lower worst-case hop count for a symmetric 4 × 4 × 4 mesh. For an asymmetric mesh topology, the LM architecture attains comparable average throughput to a conventional 3D mesh, but consumes 26% less power, takes up 27% less area and achieves 2% lower worst-case hop count.

Two factors are primarily responsible for the power reduction in the LM architecture. First, the 7-port routers of a 3D mesh are reorganized into smaller 5-port 2D mesh routers for intra-layer communication, integrated with demultiplexing and multiplexing structures for inter-layer communication. This router reorganization lowers the power consumption when a layer load-balanced routing algorithm like RPM is used. Second, single-hop transfer of packets in the vertical dimension is much more power efficient than one-layer-per-hop communication due to the short inter-layer distances in 3D ICs. In addition to reducing power, the demultiplexing and multiplexing structures also increase the inter-layer communication bandwidth, thereby increasing the average-case throughput of the LM architecture compared to a 3D mesh.

In the rest of the paper, Section 2 outlines related work. Section 3.1 provides a brief background on performance metrics. Section 3.2 discusses layer load-balanced routing on 3D mesh networks. Section 4 provides microarchitectural details of the LM architecture and analyzes its worst-case throughput, and Section 5 evaluates the LM architecture. Finally, Section 6 concludes the paper.

2 RELATED WORK

This section briefly relates recently proposed 3D on-chip network architectures to the LM architecture proposed in this paper. Li et al. [8] proposed a NoC-bus hybrid architecture in the context of Network-in-Memory for 3D chip-multiprocessors, which uses a shared bus for vertical (inter-layer) communication. However, under adversarial traffic conditions, e.g., when the sources and destinations are on different layers, the vertical shared bus becomes the throughput bottleneck. The LM architecture uses a demultiplexing crossbar at packet injection and buffered multiplexers on each layer at packet ejection for inter-layer communication. Such structures can sustain higher throughput than a shared bus. This is crucial
for layer load-balanced routing using RPM.

Kim et al. [6] proposed a true 3D crossbar design called DimDe. As noted in [6], its design can only support routing using DOR, which is known to suffer from poor worst-case throughput. Using a routing algorithm other than DOR with the DimDe architecture requires a very complex multi-stage hierarchical arbitration scheme. Matsutani et al. [9] proposed an architecture called XNOT that is also based on the idea of vertical switching. However, it requires large 2k × 2k vertical switches, which can be costly since crossbar power grows quadratically with the number of ports. Park et al. [11] proposed MIRA, which is based on implementing a 2D mesh chip-multiprocessor architecture in three dimensions. It assumes the processor cores are designed in 3D, which makes it difficult to reuse existing highly optimized 2D processor core designs.

3 BACKGROUND

3.1 Preliminaries

In this section, we provide a brief overview of the analytical methods used to evaluate worst-case and average-case throughput. In particular, we elaborate on the concept of network capacity and a method to compute worst-case throughput for oblivious routing algorithms. We then elaborate on a method to compute average-case throughput. To simplify the discussion on throughput analysis, in this section we ignore flow control issues, and we assume single-flit packets that are routed from node to node in a single cycle.

Network capacity is defined by the maximum channel load $\gamma$ that a channel at the bisection of the network needs to sustain under uniformly distributed traffic. For any k-ary n-mesh [2], independent of n,

$$\gamma = \begin{cases} \dfrac{k}{4}, & k \text{ is even} \\[1ex] \dfrac{k^2 - 1}{4k}, & k \text{ is odd} \end{cases}$$

The network capacity is the inverse of $\gamma$.

The maximum channel load $\gamma(R, \Lambda)$ for a routing algorithm R and traffic matrix $\Lambda$ is the expected traffic load crossing the heaviest loaded channel under R and $\Lambda$. The worst-case channel load $\gamma_{wc}(R)$ for a routing algorithm R is the maximum channel load that can be caused by any admissible traffic. Admissible traffic is defined to be any doubly sub-stochastic matrix $\Lambda$ with all row and column sums bounded by 1. As shown in [17], an admissible traffic matrix that can cause the worst-case channel load for a routing algorithm R can be found by solving a derived maximum weighted matching problem. The worst-case saturation throughput for a routing algorithm R is the inverse of the worst-case channel load. The normalized worst-case saturation throughput, $\Theta_{wc}(R)$, is defined as the worst-case saturation throughput normalized to the network capacity:

$$\Theta_{wc}(R) = \frac{\gamma}{\gamma_{wc}(R)}$$

Unless otherwise noted, we will simply refer to $\Theta_{wc}(R)$ as the worst-case throughput of R.

Using the methodology used in [17, 1], the average-case throughput for a routing algorithm R can be computed by averaging the saturation throughputs over T, a large set of random traffic patterns:

$$\Theta_{avg}(R) = \frac{1}{|T|} \sum_{\Lambda \in T} \frac{\gamma}{\gamma(R, \Lambda)}$$

In this paper, we used |T| = one million.

3.2 Layer load-balanced routing on a 3D Mesh

A routing algorithm called RPM is presented in [12], which uses the concept of load-balancing traffic across all vertical layers to achieve optimal worst-case throughput for 3D mesh networks. Figure 2 depicts how RPM works. Suppose Z is the vertical dimension and X and Y are the two horizontal dimensions. RPM first routes packets in the Z
dimension to a randomly chosen intermediate XY plane. It then routes packets on each XY plane using either minimal XY or YX routing with equal probability. Finally, it routes packets to their final destinations along the Z dimension. So, RPM requires two phases of vertical routing to load-balance traffic across the layers. Figure 2 depicts two possible routing paths. Let $(x_s, y_s, z_s)$ be the source, $(x_d, y_d, z_d)$ be the destination, and $\hat{z}$ be the randomly chosen intermediate Z position. The two corresponding Z-XY-Z and Z-YX-Z routing paths employed by RPM are

$$(x_s, y_s, z_s) \rightarrow (x_s, y_s, \hat{z}) \rightarrow (x_d, y_s, \hat{z}) \rightarrow (x_d, y_d, \hat{z}) \rightarrow (x_d, y_d, z_d) \tag{1}$$

and

$$(x_s, y_s, z_s) \rightarrow (x_s, y_s, \hat{z}) \rightarrow (x_s, y_d, \hat{z}) \rightarrow (x_d, y_d, \hat{z}) \rightarrow (x_d, y_d, z_d) \tag{2}$$

[Figure 2: Examples of RPM routing. (a) Z-XY-Z path. (b) Z-YX-Z path.]

RPM can be equivalently defined by uniformly load-balancing along either the X or Y dimension and routing minimally in the remaining two dimensions. For symmetric 3D mesh networks, load-balancing traffic along each of the three dimensions with equal probability (and routing minimally in the two remaining dimensions) achieves higher average-case throughput than load-balancing along a single dimension. For asymmetric meshes, uniform load-balancing along the shortest dimension achieves the best throughput results. The following theorem has been proved in [12]:

Theorem 1. RPM achieves optimal worst-case throughput for a k × k × k mesh when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd.

The throughput evaluation of RPM and its comparison to existing oblivious routing algorithms is presented in Section 5.2. RPM achieves near-optimal worst-case throughput and is far better than existing mesh routing algorithms, especially under adversarial traffic. Also, considering the tremendous worst-case throughput degradation of minimal routing algorithms for 3D mesh networks with increasing network radix (see Figure 1), it becomes essential to employ a layer load-balanced routing algorithm like RPM.

4 LAYER-MULTIPLEXED 3D ARCHITECTURE

A 3D mesh is a natural extension of the 2D mesh architecture. Two extra ports are needed at each router in the 3D case to support up/down communication, resulting in a 7 × 7 crossbar compared to a 5 × 5 crossbar in the 2D case. However, since the crossbar power and area costs increase quadratically with the number of ports, 3D routers have higher power consumption and a larger area footprint than 2D routers. The increased number of ports in 3D routers further increases contention in switches and the complexity of arbiters. Also, in a 3D mesh, packets are routed in a one-layer-per-hop manner along the vertical dimension. When RPM routing is used, which involves two phases of vertical routing, a packet may need to traverse 2k layers (and hence 2k router hops) in the worst case, resulting in long worst-case latencies. Along with long delay, one-layer-per-hop communication is also inefficient in terms of power consumption, as a flit needs to traverse a router pipeline at each hop despite

the short inter-layer distances in 3D designs. These factors motivated us to propose a new Layer-Multiplexed (LM) 3D on-chip network architecture that exploits the throughput optimality of RPM together with the short inter-layer wiring delays and the abundance of vertical wiring in 3D designs. The LM architecture has a lower power cost, a smaller area footprint and higher performance compared to a 3D mesh, as shown in Section 5. We describe this LM architecture in Section 4.1 and show that it indeed achieves the same worst-case throughput as RPM on a conventional 3D mesh in Section 4.2.

[Figure 3: Packet injection and ejection stages of the layer-multiplexed architecture, showing processors P1 to P4 and layers L1 to L4.]

4.1 Architecture

In the LM architecture, the one-layer-per-hop communication in a 3D mesh is replaced with simpler demultiplexing and multiplexing stages to take advantage of the short vertical wiring delays and the abundance of vertical wiring in 3D designs. The high-level architecture of these stages is shown in Figure 3 for k = 4 layers. These demultiplexing and multiplexing stages are used to implement the layer load-balancing method employed by RPM. RPM requires two phases of routing along the vertical dimension to load-balance traffic across the layers. In the LM architecture, the two phases of vertical routing are restricted to packet injection and packet ejection, respectively. At packet injection, flits¹ are demultiplexed uniformly to the k layers using the packet injection stage. The packet injection demultiplexer gives every processor direct access to all the layers, thus removing the necessity for one-layer-per-hop communication. Once demultiplexed to a horizontal layer, packets are routed
on this plane using either minimal XY or YX routing with equal probability. At the destination (X, Y) coordinates, packets from different layers are multiplexed at the destination processor in the packet ejection stage.

With this architecture, each horizontal plane is effectively a conventional 2D mesh with just 5-port routers. Rather than connecting directly to these 5-port routers, each processor is connected to a corresponding packet injection stage at the same (X, Y) location. This packet injection stage is in turn connected to k 5-port routers at the same (X, Y) location. This is depicted in Figure 3 for k = 4, with processors P1 to P4 demultiplexing to layers L1 to L4. Similarly, on the packet ejection side, a packet ejection multiplexer is used at each processor to multiplex flits arriving from layers L1 to L4. We refer to this adaptation of the RPM routing algorithm on the LM architecture as RPM-LM. As we will see in Section 5.3, the combined costs of these demultiplexers and multiplexers with the 5-port router costs are significantly lower than the 7-port router costs of a conventional 3D mesh, both in terms of power and area. The microarchitectural details of these packet injection and ejection stages and the routers are described next.

¹ A flit is a fixed-size portion of a packetized message.

[Figure 4: Packet injection/demultiplexing stage.]

4.1.1 Packet Injection

The details of the packet injection stage are shown in Figure 4 for k = 4 layers. It is essentially a 4-port switch with a typical router pipeline consisting of route selection, virtual channel (VC) allocation, switch arbitration, switch traversal, and link traversal. However, route selection is modified to implement layer load-balancing, which has to ensure that traffic does indeed get statistically uniformly distributed across all k layers. This is achieved by using a set of flit-counters for each ordered pair of input and output ports in the load-balancing logic. A flit-counter (i, j) records the total number of
flits sent from input i to output j. When a new head flit is injected from processor P_i, the route selection logic selects the output layer L_j with the lowest flit-count (i, j). Ties between layers are broken by a rotating output priority pointer for P_i that is advanced at each route selection. The route selection logic keeps track of the output ports that are currently in use in order to avoid allocating the same output to more than one input, which would reduce throughput. The layer selection is done as follows. Each input port is given a unique priority that rotates every cycle over all inputs.
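The per-input choice described so far (lowest flit-count among the outputs not in use, with ties broken by a rotating pointer) can be sketched in Python as follows. This is only an illustrative software model, not the hardware of the paper: the class name `LayerSelector`, its method names, and the use of a Python set to represent the currently free outputs are assumptions introduced here.

```python
from typing import List, Optional, Set


class LayerSelector:
    """Hypothetical model of the flit-counter logic for ONE injection input P_i."""

    def __init__(self, num_layers: int) -> None:
        # flit_count[j] plays the role of flit-counter (i, j) for this input.
        self.flit_count: List[int] = [0] * num_layers
        # Rotating output priority pointer used only to break ties.
        self.pointer = 0

    def select(self, free: Set[int]) -> Optional[int]:
        """Return the free layer with the lowest flit count, or None if none is free."""
        k = len(self.flit_count)
        # Visit layers in rotation order so that min() resolves ties
        # in favour of the layer closest to the pointer.
        order = [(self.pointer + d) % k for d in range(k)]
        candidates = [j for j in order if j in free]
        # The pointer is advanced at each route selection.
        self.pointer = (self.pointer + 1) % k
        if not candidates:
            return None
        return min(candidates, key=lambda j: self.flit_count[j])

    def record(self, layer: int, num_flits: int) -> None:
        """Account for a packet of num_flits flits sent on the chosen layer."""
        self.flit_count[layer] += num_flits


selector = LayerSelector(num_layers=4)
chosen = []
for _ in range(4):
    layer = selector.select(free={0, 1, 2, 3})
    selector.record(layer, num_flits=5)
    chosen.append(layer)
```

With equal-sized packets and all outputs free, this model spreads consecutive packets over all four layers before reusing any of them, which is the uniform distribution the counters are meant to enforce.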

When two or more inputs enter the route selection stage in the same cycle, the highest-priority input is served first. Suppose input i requests an output port and F is the set of free outputs in that cycle after the higher-priority inputs have been served. The route selection logic simply chooses an output port from F that has received the least number of flits from input i. This strategy guarantees that every input is always granted a unique output.

[Figure 5: Microarchitecture of a 2D intra-layer router on layer j.]

It must be noted that route computation is only performed on the head flit of a packet. Once a layer is chosen by the route selection stage for the head flit, all remaining flits of the same packet are routed on the same layer. The VC allocation stage further allocates a virtual channel on the injection port of the 5-port router on the selected layer. Together, flits belonging to the same packet are guaranteed to be routed on the same layer through the same set of virtual channels, thus ensuring their in-order arrival at the destination. Finally, we found that a single VC with a small amount of buffering (e.g., 5 flits) is sufficient at each input of the packet injection demultiplexing switch.

For implementation, the input buffers can be located on the same layer as the
corresponding processor. The crossbar of the packet injection demultiplexer can be spread across the k layers, i.e., the cross points for a particular layer can be located at that layer. The route selection, VC allocation and switch arbitration logic can be located on the middle layer (or one of the middle layers when k is even) so that the wiring for the control signals is minimized and also uniformly distributed across the layers.

4.1.2 Router Microarchitecture

Figure 5 depicts the microarchitectural details of the 5-port horizontal plane routers used in the LM architecture. Once a packet is injected into a router on one of the layers, RPM-LM uses either minimal XY or YX paths with equal probability to reach the (X, Y) coordinates of its destination. This is essentially O1TURN [1] routing on the horizontal plane. In turn, the microarchitecture of each 5-port router on the 2D plane is very similar to that of the O1TURN router [1]. The injection port receives flits from the corresponding output port of the vertical switch based on the layer at which it is located. The virtual channels at each input port are divided into two sets: one set for XY routing and another for YX routing. After being injected into one of the routing layers (XY or YX), a packet is restricted to remain in the virtual channels of that layer while being routed on the 2D plane. The routing and VC allocation stages are duplicated as in the O1TURN router to independently handle the corresponding decisions in the XY and YX routings. The switch arbitration stage is common to both XY and YX routings since flits from all virtual channels at each input contend for the same crossbar.

[Figure 6: Packet ejection/multiplexing stage.]

4.1.3 Packet Ejection

The ejection port of each router is connected through packet ejection multiplexers to all processors at the same (X, Y) coordinate, as depicted in Figure 6. In particular, each horizontal plane router sees at its ejection port four virtual channels, each of which corresponds to an ejection queue of a
processor connected to the router. In the example depicted in Figure 6, a router on the first layer (L1) sees four virtual channels, namely L1-P1, L1-P2, L1-P3, and L1-P4, as its ejection channels located at processors P1, P2, P3 and P4, respectively. The route selection stage of the router chooses the appropriate output VC for a packet based on its destination. After leaving the horizontal plane router, the VC ID field of a flit is used to determine the packet ejection multiplexer into which the flit is inserted.

For implementation, a packet ejection multiplexer can be implemented on the corresponding processor layer. A flit ejected from a horizontal plane router can be broadcast to the packet ejection multiplexers on all layers, and the decision to accept a flit can be made locally at the input buffer of each multiplexer, based on the flit's VC ID. Finally, each packet ejection multiplexer can independently choose among its input queues which flit to forward to the corresponding destination processor at each cycle. Each multiplexer also handles flow control for its input buffers. We found that only a small amount of buffering (e.g., 5 flits) is sufficient at each of these input queues.
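The end-to-end path taken by a packet in this section (one injection step onto a layer, a minimal XY or YX route on that layer, one ejection step to the destination processor) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names `minimal_path` and `rpm_lm_route` are hypothetical, and the layer is drawn uniformly at random as a stand-in for the flit-counter logic of the injection stage.

```python
import random
from typing import List, Tuple

Coord = Tuple[int, int]


def minimal_path(src: Coord, dst: Coord, x_first: bool) -> List[Coord]:
    """Minimal dimension-ordered path on one layer: XY if x_first, else YX."""
    x, y = src
    xd, yd = dst
    path = [(x, y)]
    for axis in (("x", "y") if x_first else ("y", "x")):
        if axis == "x":
            while x != xd:
                x += 1 if xd > x else -1
                path.append((x, y))
        else:
            while y != yd:
                y += 1 if yd > y else -1
                path.append((x, y))
    return path


def rpm_lm_route(src: Coord, dst: Coord, num_layers: int,
                 rng: random.Random) -> Tuple[int, List[Coord]]:
    """Pick a layer and an XY/YX order, then route minimally on that layer.

    Assumption: a uniform random layer choice replaces the flit-counter
    based selection performed by the packet injection stage.
    """
    layer = rng.randrange(num_layers)   # injection: demultiplex to one layer
    x_first = rng.random() < 0.5        # XY or YX with equal probability
    return layer, minimal_path(src, dst, x_first)


layer, path = rpm_lm_route((0, 0), (2, 1), num_layers=4, rng=random.Random(0))
```

In this model the source and destination layers do not appear at all: the vertical position of the processors only determines which injection demultiplexer input and which ejection multiplexer are used, while the router hops are confined to the chosen layer.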

4.2 Analysis

Claim 1. The worst-case throughput of RPM-LM is equal to the worst-case throughput of RPM on a 3D mesh, which is optimal when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd.

Proof. As mentioned in Section 3.1, optimal worst-case throughput analysis is based on ideal load analysis, which assumes ideal routers with infinite buffering. It is well known that Valiant routing (VAL) [18] achieves optimal worst-case throughput, and its maximum channel load is twice that of routing uniform traffic using dimension-ordered routing, which is $k/2$ when k is even and $(k^2 - 1)/2k$ when k is odd for a mesh network with radix k. To demonstrate that a routing algorithm R is worst-case throughput optimal, it is sufficient to show that the maximum channel load of R under all admissible traffic is the same as that of VAL.

In [12], it was shown that RPM on a 3D mesh is optimal when k is even and within a factor of $1/k^2$ of optimal when k is odd. This was proved by showing that the uniform load-balancing of RPM in the vertical dimension ensures that the 2D traffic that traverses any XY plane is doubly sub-stochastic if the overall 3D traffic matrix $\Lambda$ is doubly sub-stochastic. Given that RPM uses O1TURN [1] for routing on an XY plane, and that the 2D traffic on an XY plane is guaranteed to be admissible, it follows from the O1TURN result that the maximum channel load along the horizontal dimensions is $k/2$, which is optimal when k is even. When k is odd, the ratio of the maximum channel loads of VAL over RPM is $[(k^2 - 1)/2k] / [k/2] = 1 - 1/k^2$, which is within a factor of $1/k^2$ of optimal.

For RPM-LM routing on the LM architecture, traffic is also uniformly load-balanced across the layers. Using the same analysis, the 2D traffic that traverses any XY plane is doubly sub-stochastic under an admissible $\Lambda$. Therefore, the maximum channel load on the XY planes is the same as for RPM. Given that the demultiplexing packet injection stage and the multiplexing packet ejection stage are both non-blocking under ideal load
analysis, the maximum channel load is dictated by the load on the channels in the XY planes. Therefore, RPM-LM on the LM architecture achieves the same worst-case throughput as RPM on a 3D mesh. This analysis holds for both symmetric and asymmetric meshes.

5 EVALUATION

This section is divided into three parts. Section 5.1 describes the simulation framework used for performance evaluation. Section 5.2 compares the performance of RPM on a conventional 3D mesh with existing mesh routing algorithms. Section 5.3 then evaluates the benefits of using RPM-LM on the LM architecture over RPM on a conventional 3D mesh in terms of power, area and performance.

5.1 Simulation Framework

We evaluate two 3D mesh configurations, a symmetric 4 × 4 × 4 mesh and an asymmetric 8 × 8 × 4 mesh topology. Both 3D configurations have four device layers. In practice, 3D mesh networks are not expected to be symmetric in 3D chip designs. The number of device layers is expected to be much smaller than the number of processor tiles that can be placed along the edge of a device layer. Hence, we chose an asymmetric mesh topology for our evaluation.

Throughput evaluation is carried out in two stages. We initially perform a simplified throughput analysis assuming an ideal single-cycle router with infinite buffers and then back these results with more realistic flit-level simulations. Worst-case and average-case throughputs are computed using the techniques described in Section 3.1.

We modified the PoPnet [14] on-chip network simulator to perform our flit-level simulations. The PoPnet simulator models a five-stage router pipeline corresponding to route selection, VC allocation, switch arbitration, switch traversal, and link traversal. It uses credit-based flow control. We used the simulator to evaluate the average routing delays under different injection loads. For each simulation, we ran the simulator for 5, cycles. The latency of a packet is measured as the delay between the time the header flit is injected into the network and the time the tail flit is consumed at the
destination. We assume 8 virtual channels (VCs) per physical channel and buffers of size 5 flits per VC at each input port for our performance evaluations. We include 8 VCs in our setup because it is well known that VCs improve the throughput of any routing algorithm by reducing head-of-line blocking and enabling better statistical multiplexing of flits. So, having a reasonably large number of VCs lets us compare the best performance of routing algorithms. For the LM architecture, as mentioned in Section 4, we include 5 flit buffers at each injection port of the vertical demultiplexing switch and 20 flit buffers (4 queues of 5 flits each) at the input of each multiplexer in the packet ejection stage. The injected packets have a constant size of 5 flits. The simulations were performed on the four traffic patterns shown in Table 1 (Transpose, Complement, DOR-WC and Uniform random). The traffic pattern described as DOR-WC is a worst-case traffic pattern for DOR.

5.2 Throughput Evaluation of RPM

In this section, we compare the performance of RPM on a 3D mesh with four other oblivious mesh routing algorithms, namely, DOR [15], ROMM [1], O1TURN [1] and VAL [18]. The results of the simplified throughput analysis are shown in Table 2. The normalized saturation throughput (normalized to the network capacity) for each oblivious routing algorithm on each traffic pattern of Table 1 is shown in Table 2 for two different 3D mesh configurations. RPM achieves the same optimal worst-case throughput as VAL for both the symmetric and asymmetric mesh topologies. For the symmetric mesh topology, RPM outperforms minimal routing algorithms like DOR, ROMM and O1TURN in worst-case throughput by %, 14% and 1%, respectively. Similarly, for the asymmetric mesh topology, the worst-case throughput of RPM surpasses DOR, ROMM and O1TURN by 4%, 182% and 1%, respectively. RPM also achieves the highest average-case throughput averaged over a million random traffic patterns. It can sustain higher throughput than existing oblivious routing algorithms under adversarial traffic (transpose, DOR-WC) and comparable
throughput under benign traffic (uniform, complement). This is further substantiated by flit-level simulation results

shown in Figures 7 and 8. The plots validate the fact that RPM performs consistently well over a wide range of network traffic conditions and is superior to existing oblivious routing algorithms in handling adversarial traffic like transpose and DOR-WC. It can also be inferred from the plots that RPM has a strictly lower latency and strictly higher throughput than VAL for all traffic patterns considered.

Table 1: Traffic patterns evaluated

Worst-Case (WC): Worst-case traffic that causes the lowest throughput.
Average-Case: Average throughput over a million random traffic matrices.
Transpose: Packet at (x, y, z) is sent to (y, z, x). On the asymmetric mesh, the destination is obtained by left-shifting the concatenated bit representation of the source xyz to yzx and repartitioning the result.
Complement (Compl): Packet at (x, y, z) is sent to $(k_x - x - 1, k_y - y - 1, k_z - z - 1)$.
DOR-WC: Packet at (x, y, z) is sent to $(k - z - 1, k - y - 1, k - x - 1)$. On the asymmetric mesh, if x is represented using $b_x$ bits, the destination node of (x, y, z) is obtained by interchanging the positions of the first $b_x$ bits of the concatenated bit representation of the source with the last $b_x$ bits, repartitioning the result, and taking its complement.
Uniform random: Packet is sent to a destination chosen uniformly at random.

5.3 Evaluation of the LM Architecture

Next, we evaluate the LM architecture in comparison to a conventional 3D mesh in terms of power, area and performance. For performance evaluation, we present a simplified throughput analysis, detailed flit-level simulations and an analysis of worst-case packet latency.

5.3.1 Power Comparison

In this section we compare the power of the LM and 3D mesh architectures using the power models in Orion 2.0 [4]. The power models used are for a 65 nm process at 1V and an operating frequency of 1GHz. The power reported includes both the static and dynamic power components. A flit is assumed to be 128 bits wide. SRAMs are used as input buffers and crossbars are built using multiplexer trees. A two-stage round-robin arbiter is used for VC allocation and matrix
arbiters are used for switch allocation. Since the buffer power constitutes a significant portion of the router power, we only use 4 VCs (instead of the 8 VCs used for performance evaluation) with 5-flit buffers per VC at each input port of the 2D routers (LM) and 3D routers (3D mesh). This makes the power numbers less favorable for the LM architecture, as it diminishes the advantage of using fewer buffers in the demultiplexing and multiplexing structures compared to the buffering used in the routers.

A typical flit injection rate (in flits/cycle) is assumed at the injection queue of each processor for both the symmetric and asymmetric topologies. Based on this injection rate and the average hop count of a flit in the LM and 3D mesh architectures, the average flit arrival rate per port is computed for the routers in the two architectures. For the symmetric topology, the LM architecture requires fewer hops on average to route a flit from its source to its destination (as shown in Table 3), and the lower hop count translates into a lower flit arrival rate per port in the routers of the LM architecture than in 3D mesh routers. Similarly, flits need to traverse more routers on average in the 8 × 8 × 4 topology than in the symmetric topology, which increases the flit arrival rate per port of the routers in the 8 × 8 × 4 network. The per-port flit arrival rate affects the dynamic power consumption of the routers.

Table 4 compares the average router power cost per processor tile for the two architectures. The routers in the LM architecture have only 5 ports, compared to the 7 ports needed by routers in a 3D mesh. However, the LM architecture incurs the cost of an extra vertical demultiplexing switch for every 4 routers and a buffered multiplexer at every processor. The additional hardware adds to the amortized router cost per tile. As shown in Table 4, with the costs of the demultiplexing and multiplexing structures added, the LM architecture still consumes 27% less power than a 3D mesh for the symmetric topology and 26% less power for the 8 × 8 × 4 topology.

Table 2: Throughput evaluation (one block per network; columns: WC, Avg-Case, Transpose, Compl, RPM-WC, Uniform)

Table 3: Average hop count on LM and 3D mesh architectures (columns: Topology; LM: 5-port router hops, Demux hops, Mux hops; 3D mesh: 7-port router hops)

Table 4: Comparison of router power per tile (columns: Router Power, Amortized Demux Power, Mux Power, Total Router Power, all in mW; rows: 3D mesh and LM for each of the two mesh topologies)
3D mesh: 187, n/a, n/a, 187
3D mesh: 2528, n/a, n/a, 2528

5.3.2 Area Comparison
Next, we compare the router area per processor tile for the LM and 3D mesh architectures. In a 3D mesh, the router area per processor is the area occupied by a 7-port router. The router area per processor for the LM architecture comprises the area of the 5-port routers, the amortized demultiplexer area, the multiplexer area and the area overhead due to the extra vertical wiring needed for inter-layer demultiplexing and multiplexing of flits. The areas of the 7-port routers, 5-port routers, demultiplexer and multiplexer were estimated using the Orion 2.0 [4] area models. The technology and router parameters used were the same as those used for power estimation.

Since we assume inter-layer communication using Through-Silicon Vias (TSVs), the vertical wiring occupies some area on each device layer. For estimating the area overhead due to the extra wiring needed in the LM architecture, we assumed a vertical via pitch of 4 μm [6] and eight 144-bit bundles (128-bit flits and 16-bit control signals; four bundles for demultiplexing and four for multiplexing) running vertically through all four device layers. The extra wiring overhead was just 1% of the area of a 5-port router. As shown in Table 5, even after including all the area overheads of the LM architecture, the total router area per processor is still 27% less than the area of a 7-port router needed in a 3D mesh. This comparison holds for both the symmetric and asymmetric topologies, which have the same number of device layers.

Figure 7: Performance of RPM on a mesh. Panels: (a) Uniform Random, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

Figure 8: Performance of RPM on a mesh. Panels: (a) Uniform Random, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

Table 5: Comparison of router area per tile (columns: Router Area, Amortized Demux Area, Mux Area, Wire Area, Total Router Area, all in mm^2; rows: 3D mesh and LM)
3D mesh: 9, n/a, n/a, n/a, 9

Table 6: Throughput of RPM vs. RPM-LM (columns: 3D mesh and LM for each of the two mesh topologies; rows: Worst-Case, Average-Case, Transpose, Complement, RPM-WC, Uniform)

5.3.3 Throughput Evaluation of the LM Architecture
In this section, we compare
the saturation throughput of RPM-LM on the LM architecture with that of RPM on a 3D mesh. These routing algorithms have been shown to be worst-case throughput optimal for the two architectures, and RPM is known to be better than existing algorithms for 3D meshes, especially under adversarial traffic, as shown in Section 5.2. Table 6 presents the simplified throughput analysis results. RPM-LM has the same worst-case throughput as RPM on a 3D mesh, as analyzed in Section 4.2. For the symmetric mesh topology, RPM-LM outperforms RPM by 14.5% in average-case throughput, which is a significant improvement. For this topology, although RPM performs slightly better than RPM-LM on transpose traffic, the two perform comparably on the complement and RPM-WC traffic patterns. On uniform traffic, RPM-LM outperforms RPM for the symmetric mesh topology. This is because the LM architecture makes better use of the inter-layer bandwidth available in 3D ICs to overcome bottlenecks caused by two-phase routing along the vertical dimension.

The simplified throughput analysis results of RPM and RPM-LM are very similar for the asymmetric mesh topology. RPM-LM and RPM have the same worst-case throughput and comparable average-case throughput. The saturation throughputs for the different traffic patterns evaluated are also very close. This is because, for the asymmetric topology, throughput is mainly constrained by the load on the horizontal channels, and the links along the short vertical dimension are relatively less loaded.

Figure 9: Performance of RPM on 3D mesh and LM architectures for a mesh topology. Panels: (a) Uniform, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

Figure 10: Performance of RPM on 3D mesh and LM architectures for a mesh topology. Panels: (a) Uniform, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

We use flit-level simulations to gain realistic insight into the actual throughput that can be sustained by the two architectures under different traffic conditions. The simulations were performed on the four traffic patterns shown in Table 1. Figures 9 and 10 show the simulation results. These results follow the same trend as the simplified throughput analysis. RPM-LM clearly outperforms RPM under uniform traffic for the symmetric mesh topology. On the remaining traffic patterns, the saturation throughputs of the two algorithms are
comparable. On the asymmetric topology, RPM-LM outperforms RPM by a narrow margin on all traffic patterns. The plots also give an accurate idea of the number of cycles saved by using the LM architecture over a 3D mesh. The latency of RPM-LM is lower than that of RPM for all four traffic patterns. The reduction is a consequence of the single-hop vertical communication and the fewer pipeline stages in the packet ejection stage of the LM architecture, as compared to a conventional router.

5.3.4 Worst-Case Latency Analysis
Minimizing the worst-case number of router hops is very important for latency-critical applications. Since RPM on a 3D mesh requires two-phase routing along one of the three dimensions and minimal routing in the two remaining dimensions, the worst-case hop count of RPM on a symmetric 3D mesh is $4(k-1)$, $k$ being the network radix. The worst-case hop count of RPM-LM on the LM architecture is $2(k-1)+2$, where the $+2$ corresponds to the extra demultiplexing and multiplexing hops. For a symmetric mesh with $k = 4$, these counts are 12 and 8, so RPM-LM on the LM architecture achieves a 33% lower worst-case hop count than RPM on a 3D mesh. For an asymmetric mesh, the worst-case hop count of RPM is $(k_x-1)+(k_y-1)+2(k_z-1)$ and the worst-case hop count of RPM-LM is $(k_x-1)+(k_y-1)+2$, where $k_x$, $k_y$ and $k_z$ are the lengths of the X, Y and Z dimensions, respectively. For the asymmetric 8 × 8 × 4 topology, these counts are 20 and 16, so the worst-case hop count of RPM-LM is 20% less than that of RPM on a 3D mesh.

6 CONCLUSIONS
The increasing viability of three-dimensional silicon integration technology has opened new opportunities for chip architecture innovations. The main contribution of this paper is a novel layer-multiplexed architecture for 3D on-chip interconnection networks that exploits the throughput optimality of an oblivious routing algorithm called RPM together with the short inter-layer wiring delays and high vertical communication bandwidth enabled in 3D technology. The LM architecture reorganizes the large 7-port routers in a 3D mesh into smaller 5-port 2D routers for
intra-layer communication, integrated with demultiplexing and multiplexing structures for inter-layer communication. This takes advantage of the short inter-layer distances in 3D designs and replaces the one-layer-per-hop routing in the vertical dimension with simpler vertical demultiplexing and multiplexing hops. The LM architecture retains the same worst-case throughput as a 3D mesh by adapting RPM routing to the LM architecture. However, in comparison to a conventional 3D mesh, the LM architecture consumes 27% less power, occupies 27% less area, attains 14.5% higher average throughput and achieves a 33% lower worst-case hop count for a symmetric mesh. For an asymmetric 8 × 8 × 4 mesh topology, the LM architecture attains comparable average throughput to a conventional 3D mesh, but consumes 26% less power, takes up 27% less area and achieves a 20% lower worst-case hop count.

7 REFERENCES
[1] A. Agarwal et al. Tile processor: Embedded multicore for networking and multimedia. Hot Chips, Stanford, CA, Aug. 2007.
[2] W. J. Dally, B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.
[3] W. R. Davis et al. Demystifying 3D ICs: The pros and cons of going vertical. IEEE Design and Test of Computers, 22(6):498-510, 2005.
[4] A. Kahng, B. Li, L.-S. Peh, K. Samadi. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. Design, Automation and Test in Europe (DATE), Nice, France, April 2009.
[5] T. Kgil et al. PICOSERVER: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor. International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[6] J. Kim et al. A novel dimensionally-decomposed router for on-chip communication in 3D architectures. International Symposium on Computer Architecture, 2007.
[7] K. Lee et al. Three-dimensional shared memory fabricated using wafer stacking technology. IEDM Technical Digest, pages 165-168, December 2000.
[8] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, M. Kandemir. Design and management of 3D chip multiprocessors using network-in-memory. International Symposium on Computer Architecture (ISCA), pages 130-141, 2006.
[9] H. Matsutani, M. Koibuchi, H. Amano. Tightly-coupled multi-layer topologies for 3-D NoCs. International Conference
on Parallel Processing, 2007.
[10] T. Nesson, S. L. Johnsson. ROMM routing on mesh and torus networks. ACM Symposium on Parallel Algorithms and Architectures, Santa Barbara, CA, 1995.
[11] D. Park et al. MIRA: A multi-layered on-chip interconnect router architecture. International Symposium on Computer Architecture, 2008.
[12] R. S. Ramanujam, B. Lin. Near-optimal oblivious routing on three-dimensional mesh networks. International Conference on Computer Design, 2008.
[13] D. Seo, A. Ali, W.-T. Lim, N. Rafique, M. Thottethodi. Near-optimal worst-case throughput routing for two-dimensional mesh networks. International Symposium on Computer Architecture, 2005.
[14] L. Shang, L.-S. Peh, N. K. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. IEEE International Symposium on High-Performance Computer Architecture, February 2003.
[15] H. Sullivan, T. R. Bashkow. A large scale, homogeneous, fully distributed parallel machine. Annual Symposium on Computer Architecture, 1977.
[16] M. B. Taylor, W. Lee, S. Amarasinghe, A. Agarwal. Scalar operand networks: On-chip interconnect for ILP in partitioned architectures. International Symposium on High-Performance Computer Architecture (HPCA), pages 341-353, Anaheim, CA, 2003.
[17] B. Towles, W. J. Dally. Worst-case traffic for oblivious routing functions. ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, Manitoba, Canada, 2002.
[18] L. G. Valiant, G. J. Brebner. Universal schemes for parallel communication. ACM Symposium on the Theory of Computing, pages 263-277, Milwaukee, WI, 1981.
[19] S. Vangal et al. An 80-tile sub-100-W teraflops processor in 65 nm CMOS. IEEE Journal of Solid-State Circuits, Jan. 2008.
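
The worst-case latency analysis earlier in this section reduces to simple arithmetic on the mesh dimensions. The following minimal sketch assumes the hop-count expressions stated in that analysis, takes Z as the dimension routed in two phases on the conventional mesh, and writes RPM for Randomized Partially-Minimal routing. The function names are hypothetical and do not come from the paper. It evaluates the expressions for the symmetric mesh with k = 4 and for the asymmetric 8 × 8 × 4 mesh.

```python
def mesh_worst_case_hops(kx: int, ky: int, kz: int) -> int:
    """Worst-case router hops of RPM on a conventional 3D mesh.

    Assumption: minimal routing in X and Y, two-phase routing in Z
    (up to kz - 1 hops in each of the two phases).
    """
    return (kx - 1) + (ky - 1) + 2 * (kz - 1)


def lm_worst_case_hops(kx: int, ky: int, kz: int) -> int:
    """Worst-case hops of RPM-LM on the layer-multiplexed architecture.

    Minimal routing in X and Y on a single layer, plus one
    demultiplexing hop and one multiplexing hop. The layer count kz
    does not appear because vertical transfers take a single hop.
    """
    return (kx - 1) + (ky - 1) + 2


def hop_reduction_percent(kx: int, ky: int, kz: int) -> float:
    """Relative reduction in worst-case hop count, in percent."""
    mesh = mesh_worst_case_hops(kx, ky, kz)
    lm = lm_worst_case_hops(kx, ky, kz)
    return 100.0 * (mesh - lm) / mesh


if __name__ == "__main__":
    for dims in [(4, 4, 4), (8, 8, 4)]:
        print(dims,
              mesh_worst_case_hops(*dims),
              lm_worst_case_hops(*dims),
              round(hop_reduction_percent(*dims), 1))
```

For the two topologies evaluated, the sketch gives 12 versus 8 hops and 20 versus 16 hops, which is where the roughly one-third and one-fifth reductions in worst-case hop count come from.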

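
The deterministic traffic patterns of Table 1 can be written as destination functions. The sketch below covers only the symmetric forms and is an illustration, not the simulator used in the paper. The function names are hypothetical, and the symmetric worst-case pattern is assumed to send (x, y, z) to (k - z - 1, k - y - 1, k - x - 1), which is the symmetric special case of the bit-level definition in the table (swap the x and z fields, then complement).

```python
from itertools import product


def transpose(src):
    """Transpose traffic: (x, y, z) -> (y, z, x)."""
    x, y, z = src
    return (y, z, x)


def complement(src, dims):
    """Complement traffic: each coordinate c becomes k_c - c - 1."""
    return tuple(k - c - 1 for c, k in zip(src, dims))


def rpm_wc(src, k):
    """Assumed symmetric worst-case pattern: swap x and z, then complement."""
    x, y, z = src
    return (k - z - 1, k - y - 1, k - x - 1)


def is_permutation(pattern, k):
    """True if the pattern maps the k*k*k nodes one-to-one onto themselves."""
    nodes = list(product(range(k), repeat=3))  # already in sorted order
    return sorted(pattern(n) for n in nodes) == nodes


if __name__ == "__main__":
    k = 4
    src = (0, 1, 2)
    print("transpose :", transpose(src))
    print("complement:", complement(src, (k, k, k)))
    print("worst-case:", rpm_wc(src, k))
    print("permutations:",
          is_permutation(transpose, k),
          is_permutation(lambda n: complement(n, (k, k, k)), k),
          is_permutation(lambda n: rpm_wc(n, k), k))
```

Each pattern is a permutation of the node set, so every node sends to exactly one destination and receives from exactly one source, which is what makes these patterns useful as adversarial test cases for oblivious routing.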
A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin

A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin 50 IEEE EMBEDDED SYSTEMS LETTERS, VOL. 1, NO. 2, AUGUST 2009 A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin Abstract Programmable many-core processors are poised

More information

Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks

Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks 2080 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 20, NO. 11, NOVEMBER 2012 Randomized Partially-Minimal Routing: Near-Optimal Oblivious Routing for 3-D Mesh Networks Rohit Sunkam

More information

Oblivious Routing Design for Mesh Networks to Achieve a New Worst-Case Throughput Bound

Oblivious Routing Design for Mesh Networks to Achieve a New Worst-Case Throughput Bound Oblivious Routing Design for Mesh Networs to Achieve a New Worst-Case Throughput Bound Guang Sun,, Chia-Wei Chang, Bill Lin, Lieguang Zeng Tsinghua University, University of California, San Diego State

More information

INTERCONNECTION networks are used in a variety of applications,

INTERCONNECTION networks are used in a variety of applications, 1 Randomized Throughput-Optimal Oblivious Routing for Torus Networs Rohit Sunam Ramanujam, Student Member, IEEE, and Bill Lin, Member, IEEE Abstract In this paper, we study the problem of optimal oblivious

More information

Interconnection Networks: Topology. Prof. Natalie Enright Jerger

Interconnection Networks: Topology. Prof. Natalie Enright Jerger Interconnection Networks: Topology Prof. Natalie Enright Jerger Topology Overview Definition: determines arrangement of channels and nodes in network Analogous to road map Often first step in network design

More information

Near-Optimal Worst-case Throughput Routing for Two-Dimensional Mesh Networks

Near-Optimal Worst-case Throughput Routing for Two-Dimensional Mesh Networks Near-Optimal Worst-case Throughput Routing for Two-Dimensional Mesh Networks Daeho Seo Akif Ali Won-Taek Lim Nauman Rafique Mithuna Thottethodi School of Electrical and Computer Engineering Purdue University

More information

Near-Optimal Worst-case Throughput Routing for Two-Dimensional Mesh Networks

Near-Optimal Worst-case Throughput Routing for Two-Dimensional Mesh Networks Near-Optimal Worst-case Throughput Routing for Two-Dimensional Mesh Networks Daeho Seo Akif Ali Won-Taek Lim Nauman Rafique Mithuna Thottethodi School of Electrical and Computer Engineering Purdue University

More information

Topology basics. Constraints and measures. Butterfly networks.

Topology basics. Constraints and measures. Butterfly networks. EE48: Advanced Computer Organization Lecture # Interconnection Networks Architecture and Design Stanford University Topology basics. Constraints and measures. Butterfly networks. Lecture #: Monday, 7 April

More information

A Survey of Techniques for Power Aware On-Chip Networks.

A Survey of Techniques for Power Aware On-Chip Networks. A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology

More information

Deadlock-free XY-YX router for on-chip interconnection network

Deadlock-free XY-YX router for on-chip interconnection network LETTER IEICE Electronics Express, Vol.10, No.20, 1 5 Deadlock-free XY-YX router for on-chip interconnection network Yeong Seob Jeong and Seung Eun Lee a) Dept of Electronic Engineering Seoul National Univ

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching.

Switching/Flow Control Overview. Interconnection Networks: Flow Control and Microarchitecture. Packets. Switching. Switching/Flow Control Overview Interconnection Networks: Flow Control and Microarchitecture Topology: determines connectivity of network Routing: determines paths through network Flow Control: determine

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

Finding Worst-case Permutations for Oblivious Routing Algorithms

Finding Worst-case Permutations for Oblivious Routing Algorithms Stanford University Concurrent VLSI Architecture Memo 2 Stanford University Computer Systems Laboratory Finding Worst-case Permutations for Oblivious Routing Algorithms Brian Towles Abstract We present

More information

Lecture 3: Topology - II

Lecture 3: Topology - II ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 3: Topology - II Tushar Krishna Assistant Professor School of Electrical and

More information

Evaluation of NOC Using Tightly Coupled Router Architecture

Evaluation of NOC Using Tightly Coupled Router Architecture IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 18, Issue 1, Ver. II (Jan Feb. 2016), PP 01-05 www.iosrjournals.org Evaluation of NOC Using Tightly Coupled Router

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

Topologies. Maurizio Palesi. Maurizio Palesi 1

Topologies. Maurizio Palesi. Maurizio Palesi 1 Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and

More information

A Novel Energy Efficient Source Routing for Mesh NoCs

A Novel Energy Efficient Source Routing for Mesh NoCs 2014 Fourth International Conference on Advances in Computing and Communications A ovel Energy Efficient Source Routing for Mesh ocs Meril Rani John, Reenu James, John Jose, Elizabeth Isaac, Jobin K. Antony

More information

Early Transition for Fully Adaptive Routing Algorithms in On-Chip Interconnection Networks

Early Transition for Fully Adaptive Routing Algorithms in On-Chip Interconnection Networks Technical Report #2012-2-1, Department of Computer Science and Engineering, Texas A&M University Early Transition for Fully Adaptive Routing Algorithms in On-Chip Interconnection Networks Minseon Ahn,

More information

Prediction Router: Yet another low-latency on-chip router architecture

Prediction Router: Yet another low-latency on-chip router architecture Prediction Router: Yet another low-latency on-chip router architecture Hiroki Matsutani Michihiro Koibuchi Hideharu Amano Tsutomu Yoshinaga (Keio Univ., Japan) (NII, Japan) (Keio Univ., Japan) (UEC, Japan)

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Interconnection Networks: Routing. Prof. Natalie Enright Jerger

Interconnection Networks: Routing. Prof. Natalie Enright Jerger Interconnection Networks: Routing Prof. Natalie Enright Jerger Routing Overview Discussion of topologies assumed ideal routing In practice Routing algorithms are not ideal Goal: distribute traffic evenly

More information

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank

More information

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Bhavya K. Daya, Li-Shiuan Peh, Anantha P. Chandrakasan Dept. of Electrical Engineering and Computer

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel Hyoukjun Kwon and Tushar Krishna Georgia Institute of Technology Synergy Lab (http://synergy.ece.gatech.edu) hyoukjun@gatech.edu April

More information

Evaluating Bufferless Flow Control for On-Chip Networks

Evaluating Bufferless Flow Control for On-Chip Networks Evaluating Bufferless Flow Control for On-Chip Networks George Michelogiannakis, Daniel Sanchez, William J. Dally, Christos Kozyrakis Stanford University In a nutshell Many researchers report high buffer

More information

DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip

DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip DLABS: a Dual-Lane Buffer-Sharing Router Architecture for Networks on Chip Anh T. Tran and Bevan M. Baas Department of Electrical and Computer Engineering University of California - Davis, USA {anhtr,

More information

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation

Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Achieving Lightweight Multicast in Asynchronous Networks-on-Chip Using Local Speculation Kshitij Bhardwaj Dept. of Computer Science Columbia University Steven M. Nowick 2016 ACM/IEEE Design Automation

More information

Lecture 22: Router Design

Lecture 22: Router Design Lecture 22: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO 03, Princeton A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip

More information

Slim Fly: A Cost Effective Low-Diameter Network Topology

Slim Fly: A Cost Effective Low-Diameter Network Topology TORSTEN HOEFLER, MACIEJ BESTA Slim Fly: A Cost Effective Low-Diameter Network Topology Images belong to their creator! NETWORKS, LIMITS, AND DESIGN SPACE Networks cost 25-30% of a large supercomputer Hard

More information

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers

Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Dynamic Packet Fragmentation for Increased Virtual Channel Utilization in On-Chip Routers Young Hoon Kang, Taek-Jun Kwon, and Jeff Draper {youngkan, tjkwon, draper}@isi.edu University of Southern California

More information

Design of a high-throughput distributed shared-buffer NoC router

Design of a high-throughput distributed shared-buffer NoC router Design of a high-throughput distributed shared-buffer NoC router The MIT Faculty has made this article openly available Please share how this access benefits you Your story matters Citation As Published

More information

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip

Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip Nasibeh Teimouri

More information

Lecture 2: Topology - I

Lecture 2: Topology - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and

More information

Lecture: Interconnection Networks

Lecture: Interconnection Networks Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet

More information

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics

Lecture 16: On-Chip Networks. Topics: Cache networks, NoC basics Lecture 16: On-Chip Networks Topics: Cache networks, NoC basics 1 Traditional Networks Huh et al. ICS 05, Beckmann MICRO 04 Example designs for contiguous L2 cache regions 2 Explorations for Optimality

More information

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management Marina Garcia 22 August 2013 OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management M. Garcia, E. Vallejo, R. Beivide, M. Valero and G. Rodríguez Document number OFAR-CM: Efficient Dragonfly

More information

WITH the development of the semiconductor technology,

WITH the development of the semiconductor technology, Dual-Link Hierarchical Cluster-Based Interconnect Architecture for 3D Network on Chip Guang Sun, Yong Li, Yuanyuan Zhang, Shijun Lin, Li Su, Depeng Jin and Lieguang zeng Abstract Network on Chip (NoC)

More information

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks Department of Computer Science and Engineering, Texas A&M University Technical eport #2010-3-1 seudo-circuit: Accelerating Communication for On-Chip Interconnection Networks Minseon Ahn, Eun Jung Kim Department

More information

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew

More information

Destination-Based Adaptive Routing on 2D Mesh Networks

Destination-Based Adaptive Routing on 2D Mesh Networks Destination-Based Adaptive Routing on 2D Mesh Networks Rohit Sunkam Ramanujam University of California, San Diego rsunkamr@ucsdedu Bill Lin University of California, San Diego billlin@eceucsdedu ABSTRACT

More information

Low-Power Interconnection Networks

Low-Power Interconnection Networks Low-Power Interconnection Networks Li-Shiuan Peh Associate Professor EECS, CSAIL & MTL MIT 1 Moore s Law: Double the number of transistors on chip every 2 years 1970: Clock speed: 108kHz No. transistors:

More information

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin, Greg Nazario, Xiangyao Yu*, Kevin Chang, Rachata Ausavarungnirun, Onur Mutlu Carnegie Mellon University *CMU

More information

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip

Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Routing Algorithms, Process Model for Quality of Services (QoS) and Architectures for Two-Dimensional 4 4 Mesh Topology Network-on-Chip Nauman Jalil, Adnan Qureshi, Furqan Khan, and Sohaib Ayyaz Qazi Abstract

More information

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal

EECS 570. Lecture 19 Interconnects: Flow Control. Winter 2018 Subhankar Pal Lecture 19 Interconnects: Flow Control Winter 2018 Subhankar Pal http://www.eecs.umich.edu/courses/eecs570/ Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk,

More information
