A Novel 3D Layer-Multiplexed On-Chip Network


Rohit Sunkam Ramanujam
University of California, San Diego
rsunkamr@ucsd.edu

Bill Lin
University of California, San Diego
billlin@ece.ucsd.edu

ABSTRACT

Recently, a near-optimal oblivious routing algorithm for 3D mesh networks called Randomized Partially-Minimal (RPM) routing was proposed [12], which works by load-balancing traffic across vertical layers and routing minimally on each horizontal layer. It achieves optimal worst-case throughput when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd, and it achieves significantly lower latencies than Valiant routing (VAL) [18], the best previously known optimal worst-case throughput algorithm. This paper presents a novel layer-multiplexed (LM) architecture for 3D on-chip networks that exploits the optimality of RPM together with the short inter-layer wiring delays enabled in 3D technology. The LM architecture replaces the one-layer-per-hop routing in a 3D mesh with simpler vertical demultiplexing and multiplexing structures. The proposed LM architecture can achieve the same worst-case throughput as a 3D mesh by adapting RPM routing to the LM architecture. However, the LM architecture consumes 27% less power, occupies 27% less area, attains 145% higher average throughput, and achieves % lower worst-case hop count for a symmetric mesh topology. On an asymmetric mesh, the LM architecture achieves comparable average-case throughput to a 3D mesh, but consumes 26% less power, takes up 27% less area and attains 2% lower worst-case hop count.

Categories and Subject Descriptors

C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors): Interconnection Architectures

General Terms

Algorithms, Design, Performance

Keywords

On-Chip Interconnection Networks, 3D ICs, Oblivious Routing

A short version of this paper appears in Embedded Systems Letters, vol. 1, no. 1, 2009.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ANCS'09, October 19-20, 2009, Princeton, New Jersey, USA. Copyright 2009 ACM /9/1$1

1 INTRODUCTION

Three-dimensional (3D) integration [7, ] has attracted significant attention in recent years because of its potential benefits, including smaller chip footprints, higher transistor packing density, shorter wiring delays and significantly higher communication bandwidth. As 3D technology matures and becomes more viable, it opens up new opportunities for chip architecture innovations. One direction is to extend 2D tiled chip-multiprocessor architectures [16, 19, 1] into three dimensions [5, 8]. Many proposed 2D chip-multiprocessor architectures have relied on a 2D mesh network topology as the underlying communication fabric. Extending mesh-based tiled chip-multiprocessor architectures into three dimensions represents a natural progression for exploiting 3D integration.

As with 2D mesh networks, throughput and latency are important performance metrics in the design of routing algorithms for 3D meshes. Ideally, a routing algorithm should maximize worst-case and average-case throughput and minimize average latency. Dimension Ordered Routing (DOR) [15] is minimal in terms of latency but is known to suffer from poor worst-case and average-case throughput due to the lack of path diversity. ROMM [1] provides another alternative minimal routing algorithm, but it too suffers from poor worst-case throughput. An oblivious routing algorithm called O1TURN [1] is known to achieve optimal worst-case throughput with minimal hop count for 2D meshes. However, its performance surprisingly degrades tremendously for the 3D case, as noted in [1]. Just like O1TURN, the worst-case throughput of other minimal routing algorithms like DOR and ROMM
also degrades severely when extended to 3D. Until recently, the best known oblivious routing algorithm that achieves optimal worst-case throughput for 3D meshes was Valiant (VAL) routing [18], which globally load-balances traffic across all dimensions to achieve optimal worst-case throughput, but at the expense of destroying locality. VAL is a factor of 2x of DOR in average hop count.

Recently, a near-optimal oblivious routing algorithm for conventional 3D mesh networks called Randomized Partially-Minimal (RPM) routing was proposed [12]. Essentially, RPM works by load-balancing traffic uniformly along one of the three dimensions and routing minimally in the two remaining dimensions. The main result shown in [12] is that RPM achieves optimal worst-case throughput when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd. RPM achieves a factor of 1.33x of DOR in average hop count, which is much better than a factor of 2x for VAL.

Figure 1: Normalized worst-case throughput of oblivious routing algorithms for 3D mesh networks (normalized throughput versus network radix k).

Figure 1 shows the worst-case throughput of RPM in comparison to DOR, ROMM, O1TURN and VAL. The worst-case throughput of DOR, ROMM and O1TURN degrades tremendously with increasing radix. At k = 14, the worst-case throughput of RPM is 14 times higher than and 526 times higher than and. Therefore, a load-balanced, non-minimal routing algorithm like RPM is necessary to provide high worst-case throughput guarantees in 3D mesh networks as the network radix increases.

In this paper, we present a novel layer-multiplexed (LM) architecture for 3D mesh networks that exploits the optimality of RPM together with the short inter-layer wiring delays enabled in 3D technology. One way of implementing RPM in 3D meshes is by load-balancing traffic uniformly along the vertical dimension to a random intermediate layer and routing minimally within each layer. This requires two phases of minimal routing along the vertical dimension. In contrast to a conventional 3D mesh, the LM architecture replaces the one-layer-per-hop routing in the vertical dimension with simpler vertical demultiplexing and multiplexing structures. In doing so, the LM architecture retains the same worst-case throughput as a conventional 3D mesh by adapting RPM routing to the LM architecture. However, in comparison to a conventional 3D mesh, the LM architecture consumes 27% less power, occupies 27% less area, attains 145% higher average throughput and achieves % lower worst-case hop count for a symmetric 4×4×4 mesh. For an asymmetric mesh topology, the LM architecture attains comparable average throughput to a conventional 3D mesh, but consumes 26% less power, takes up 27% less area and achieves 2% lower worst-case hop count.

Two factors are primarily responsible for the power reduction in the LM architecture. First, 7-port routers of a 3D mesh are reorganized into smaller 5-port 2D mesh routers for intra-layer communication, integrated with demultiplexing and multiplexing structures for inter-layer communication. This router reorganization lowers the power consumption when a layer load-balanced routing algorithm like RPM is used. Second, single-hop transfer of packets in the vertical dimension is much more power efficient than one-layer-per-hop communication due to the short inter-layer distances in 3D ICs. In addition to reducing power, the demultiplexing and multiplexing structures also increase the inter-layer communication bandwidth, thereby increasing the average-case throughput of the LM architecture compared to a 3D mesh.

In the rest of the paper, Section 2 outlines related work. Section 3.1 provides a brief background on performance metrics. Section 3.2 discusses layer load-balanced routing on 3D mesh networks. Section 4 provides microarchitectural details of the LM architecture and analyzes its worst-case throughput, and Section 5 evaluates the LM architecture. Finally, Section 6 concludes the paper.

2 RELATED WORK

This section briefly relates recently proposed 3D on-chip network architectures to the LM architecture proposed in this paper. Li et al. [8] proposed a NoC-Bus hybrid architecture in the context of Network-in-Memory for 3D chip-multiprocessors, which uses a shared bus for vertical (inter-layer) communication. However, under adversarial traffic conditions, e.g. when the source and destinations are on different layers, the vertical shared bus becomes the throughput bottleneck. The LM architecture uses a demultiplexing crossbar at packet injection and buffered multiplexers on each layer at packet ejection for inter-layer communication. Such structures can sustain higher throughput than a shared bus. This is crucial
for layer load-balanced routing using RPM.

Kim et al. [6] proposed a true 3D crossbar design called DimDe. As noted in [6], its design can only support routing using DOR, which is known to suffer from poor worst-case throughput. Using a routing algorithm other than DOR with the DimDe architecture requires a very complex -stage hierarchical arbitration scheme.

Matsutani et al. [9] proposed an architecture called XNOT that is also based on the idea of vertical switching. However, it requires large 2k × 2k vertical switches, which can be costly since crossbar power grows quadratically with the number of ports.

Park et al. [11] proposed MIRA, which is based on implementing a 2D mesh chip-multiprocessor architecture in three dimensions. It assumes the processor cores are designed in 3D, which makes it difficult to reuse existing highly optimized 2D processor core designs.

3 BACKGROUND

3.1 Preliminaries

In this section, we provide a brief overview of the analytical methods used to evaluate worst-case and average-case throughput. In particular, we elaborate on the concept of network capacity and a method to compute worst-case throughput for oblivious routing algorithms. We then elaborate on a method to compute average-case throughput. To simplify the discussion on throughput analysis, in this section we ignore flow control issues, and we assume single-flit packets that are routed from node to node in a single cycle.

Network capacity is defined by the maximum channel load $\gamma^*$ that a channel at the bisection of the network needs to sustain under uniformly distributed traffic. For any k-ary n-mesh [2], independent of n,

$$\gamma^* = \begin{cases} \dfrac{k}{4}, & k \text{ is even} \\[4pt] \dfrac{k^2 - 1}{4k}, & k \text{ is odd} \end{cases}$$

The network capacity is the inverse of $\gamma^*$.

The maximum channel load $\gamma(R, \Lambda)$ for a routing algorithm R and traffic matrix $\Lambda$ is the expected traffic load crossing the heaviest loaded channel under R and $\Lambda$. The worst-case channel load $\gamma_{wc}(R)$ for a routing algorithm R is the maximum channel load that can be caused by any admissible traffic. Admissible traffic is defined to be any doubly substochastic matrix $\Lambda$ with all row and column sums bounded by 1. As shown in [17], an admissible traffic matrix that can cause the worst-case channel load for a routing algorithm R can be found by solving a derived maximum weighted matching problem. The worst-case saturation throughput for a routing algorithm R is the inverse of the worst-case channel load. The normalized worst-case saturation throughput, $\Theta_{wc}(R)$, is defined as the worst-case saturation throughput normalized to the network capacity:

$$\Theta_{wc}(R) = \frac{\gamma^*}{\gamma_{wc}(R)}$$

Unless otherwise noted, we will simply refer to $\Theta_{wc}(R)$ as the worst-case throughput of R.

Using the methodology used in [17, 1], the average-case throughput for a routing algorithm R can be computed by averaging the saturation throughputs over T, a large set of random traffic patterns:

$$\Theta_{avg}(R) = \frac{1}{|T|} \sum_{\Lambda \in T} \frac{\gamma^*}{\gamma(R, \Lambda)}$$

In this paper, we used $|T|$ = one million.

3.2 Layer load-balanced routing on a 3D Mesh

A routing algorithm called RPM is presented in [12], which uses the concept of load-balancing traffic across all vertical layers to achieve optimal worst-case throughput for 3D mesh networks. Figure 2 depicts how RPM works. Suppose Z is the vertical dimension and X and Y are the two horizontal dimensions. RPM first routes packets in the Z
dimension to a randomly chosen intermediate XY plane. It then routes packets on each XY plane using either minimal XY or YX routing with equal probability. Finally, it routes packets to their final destinations along the Z dimension. So, RPM requires two phases of vertical routing to load-balance traffic across the layers.

Figure 2 depicts two possible routing paths. Let $(x_s, y_s, z_s)$ be the source, $(x_d, y_d, z_d)$ be the destination, and $\hat{z}$ be the randomly chosen intermediate Z position. The two corresponding Z-XY-Z and Z-YX-Z routing paths employed by RPM are

$$(x_s, y_s, z_s) \rightarrow (x_s, y_s, \hat{z}) \rightarrow (x_d, y_s, \hat{z}) \rightarrow (x_d, y_d, \hat{z}) \rightarrow (x_d, y_d, z_d) \qquad (1)$$

and

$$(x_s, y_s, z_s) \rightarrow (x_s, y_s, \hat{z}) \rightarrow (x_s, y_d, \hat{z}) \rightarrow (x_d, y_d, \hat{z}) \rightarrow (x_d, y_d, z_d) \qquad (2)$$

Figure 2: Examples of RPM routing: (a) Z-XY-Z path, (b) Z-YX-Z path.

RPM can be equivalently defined by uniformly load-balancing along either the X or Y dimensions and routing minimally in the remaining two dimensions. For symmetric 3D mesh networks, load-balancing traffic along each of the three dimensions with equal probability (and routing minimally in the two remaining dimensions) achieves higher average-case throughput than load-balancing along a single dimension. For asymmetric meshes, uniform load-balancing along the shortest dimension achieves the best throughput results. The following theorem has been proved in [12]:

Theorem 1. RPM achieves optimal worst-case throughput for a k × k × k mesh when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd.

The throughput evaluation of RPM and its comparison to existing oblivious routing algorithms is presented in Section 5.2. RPM achieves near-optimal worst-case throughput and is far better than existing mesh routing algorithms, especially under adversarial traffic. Also, considering the tremendous worst-case throughput degradation of minimal routing algorithms for 3D mesh networks with increasing network radix (see Figure 1), it becomes essential to employ a layer load-balanced routing algorithm like RPM.

4 LAYER-MULTIPLEXED 3D ARCHITECTURE

A 3D mesh is a natural extension to the 2D mesh architecture. Two extra ports are needed at each router in the 3D case to support up/down communication, resulting in a 7×7 crossbar compared to a 5×5 crossbar in the 2D case. However, since the crossbar power and area costs increase quadratically with the number of ports, 3D routers have higher power consumption and a larger area footprint than 2D routers. The increased number of ports in 3D routers further increases contention in switches and the complexity of arbiters. Also, in a 3D mesh, packets are routed in a one-layer-per-hop manner along the vertical dimension. When RPM routing is used, which involves two phases of vertical routing, a packet may need to traverse 2k layers (and hence 2k router hops) in the worst case, resulting in long worst-case latencies. Along with long delay, one-layer-per-hop communication is also inefficient in terms of power consumption, as a flit needs to traverse a router pipeline at each hop despite

the short inter-layer distances in 3D designs.

Figure 3: Packet injection and ejection stages of the layer-multiplexed architecture.

These factors motivated us to propose a new Layer-Multiplexed (LM) 3D on-chip network architecture that exploits the throughput optimality of RPM together with the short inter-layer wiring delays and the abundance of vertical wiring in 3D designs. The LM architecture has a lower power cost, a smaller area footprint and higher performance compared to a 3D mesh, as shown in Section 5. We describe this LM architecture in Section 4.1 and show that it indeed achieves the same worst-case throughput as RPM on a conventional 3D mesh in Section 4.2.

4.1 Architecture

In the LM architecture, the one-layer-per-hop communication in a 3D mesh is replaced with simpler demultiplexing and multiplexing stages to take advantage of the short vertical wiring delays and the abundance of vertical wiring in 3D designs. The high-level architecture of these stages is shown in Figure 3 for k = 4 layers. These demultiplexing and multiplexing stages are used to implement the layer load-balancing method employed by RPM. RPM requires two phases of routing along the vertical dimension to load-balance traffic across the layers. In the LM architecture, the two phases of vertical routing are restricted to packet injection and packet ejection, respectively. At packet injection, flits¹ are demultiplexed uniformly to the k layers using the packet injection stage. The packet injection demultiplexer gives every processor direct access to all the layers, thus removing the necessity for one-layer-per-hop communication. Once demultiplexed to a horizontal layer, packets are routed
on this plane using either minimal XY or YX routing with equal probability. At the destination (X, Y) coordinates, packets from different layers are multiplexed at the destination processor in the packet ejection stage.

With this architecture, each horizontal plane is effectively a conventional 2D mesh with just 5-port routers. Rather than connecting directly to these 5-port routers, each processor is connected to a corresponding packet injection stage at the same (X, Y) location. This packet injection stage is in turn connected to k 5-port routers at the same (X, Y) location. This is depicted in Figure 3 for k = 4, with processors P1 to P4 demultiplexing to layers L1 to L4. Similarly, on the packet ejection side, a packet ejection multiplexer is used at each processor to multiplex flits arriving from layers L1 to L4. We refer to this adaptation of the RPM routing algorithm on the LM architecture as RPM-LM. As we will see in Section 5.3.1, the combined costs of these demultiplexers and multiplexers with the 5-port router costs are significantly lower than the 7-port router costs of a conventional 3D mesh, both in terms of power and area. The microarchitectural details of these packet injection and ejection stages and the routers are described next.

¹ A flit is a fixed-size portion of a packetized message.

4.1.1 Packet Injection

The details of the packet injection stage are shown in Figure 4 for k = 4 layers. It is essentially a 4-port switch with a typical router pipeline consisting of route selection, virtual channel (VC) allocation, switch arbitration, switch traversal, and link traversal. However, route selection is modified to implement layer load-balancing, which has to ensure that traffic does, statistically, get uniformly distributed across all k layers. This is achieved by using a set of flit-counters for each ordered pair of input and output ports in the load-balancing logic. A flit-counter (i, j) records the total number of flits sent from input i to output j. When a new head flit is injected from processor Pi, the route selection logic selects the output layer Lj with the lowest flit-count (i, j). Ties between layers are broken by a rotating output priority pointer for Pi that is advanced at each route selection. The route selection logic keeps track of the output ports that are currently in use in order to avoid allocating the same output to more than one input, which would reduce throughput.

Figure 4: Packet injection/demultiplexing stage.

The layer selection is done as follows. Each input port is given a unique priority that rotates every cycle over all inputs.
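The selection rule described above (lowest flit count among the free outputs, with ties broken by a rotating pointer) can be sketched in Python as follows. This is a minimal behavioural model under stated assumptions, not the hardware implementation: the names `LayerSelector`, `select_layer` and `send_flit` are hypothetical, and advancing the tie-break pointer by exactly one position per selection is an assumption made for illustration.

```python
class LayerSelector:
    """Behavioural sketch of flit-counter layer selection (names are hypothetical)."""

    def __init__(self, num_inputs, num_layers):
        self.num_layers = num_layers
        # flit_count[i][j]: total flits sent from input (processor) i to output layer j
        self.flit_count = [[0] * num_layers for _ in range(num_inputs)]
        # One rotating tie-break pointer per input
        self.tie_pointer = [0] * num_inputs
        # Output layers currently held by an in-flight packet
        self.busy = set()

    def select_layer(self, i):
        """Pick an output layer for a new head flit from input i, or None if all are busy."""
        free = [j for j in range(self.num_layers) if j not in self.busy]
        if not free:
            return None
        lowest = min(self.flit_count[i][j] for j in free)
        start = self.tie_pointer[i]
        # Assumption: the pointer advances by one position at every route selection.
        self.tie_pointer[i] = (start + 1) % self.num_layers
        # Scan from the pointer so that ties rotate among the layers.
        for offset in range(self.num_layers):
            j = (start + offset) % self.num_layers
            if j not in self.busy and self.flit_count[i][j] == lowest:
                self.busy.add(j)
                return j

    def send_flit(self, i, j, is_tail=False):
        """Account for one flit sent from input i to layer j; a tail flit frees the output."""
        self.flit_count[i][j] += 1
        if is_tail:
            self.busy.discard(j)
```

In this model, injecting four 5-flit packets back to back from one processor sends one packet to each of the four layers, so the per-layer flit counts for that processor stay equal.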

When two or more inputs enter the route selection stage in the same cycle, the highest priority input is served first. Suppose input i requests an output port and F is the set of free outputs in that cycle after the higher priority inputs have been served. The route selection logic simply chooses an output port from F that has received the least number of flits from input i. This strategy guarantees that every input is always granted a unique output.

It must be noted that route computation is only performed on the head flit of a packet. Once a layer is chosen by the route selection stage for the head flit, all remaining flits of the same packet will be routed on the same layer. The VC allocation stage further allocates a virtual channel on the injection port of the 5-port router on the selected layer. Together, flits belonging to the same packet are guaranteed to be routed on the same layer through the same set of virtual channels, thus ensuring their in-order arrival at the destination. Finally, we found that a single VC with a small amount of buffering (e.g. 5 flits) is sufficient at each input of the packet injection demultiplexing switch.

For implementation, the input buffers can be located on the same layer as the corresponding processor. The crossbar of the packet injection demultiplexer can be spread across the k layers, i.e., the cross points for a particular layer can be located at that layer. The route selection, VC allocation and switch arbitration logic can be located on the middle layer (or one of the middle layers when k is even) so that the wiring for the control signals is minimized and also uniformly distributed across the layers.

4.1.2 Router Microarchitecture

Figure 5: Microarchitecture of a 2D intra-layer router on layer j.

Figure 5 depicts the microarchitectural details of the 5-port horizontal plane routers used in the LM architecture. Once a packet is injected into a router on one of the layers, RPM-LM uses either minimal XY or YX paths with equal probability to reach the (X, Y) coordinates of its destination. This is essentially O1TURN [1] routing on the horizontal plane. In turn, the microarchitecture of each 5-port router on the 2D plane is very similar to that of the O1TURN router [1]. The injection port receives flits from the corresponding output port of the vertical switch based on the layer at which it is located. The virtual channels at each input port are divided into two sets: one set for XY routing and another for YX routing. After being injected into one of the routing layers (XY or YX), a packet is restricted to remain in the virtual channels of that layer while being routed on the 2D plane. The routing and VC allocation stages are duplicated as in the O1TURN router to independently handle the corresponding decisions in the XY and YX routings. The switch arbitration stage is common to both XY and YX routings since flits from all virtual channels at each input contend for the same crossbar.

4.1.3 Packet Ejection

Figure 6: Packet ejection/multiplexing stage.

The ejection port of each router is connected through packet ejection multiplexers to all processors at the same (X, Y) coordinate, as depicted in Figure 6. In particular, each horizontal plane router sees at its ejection port four virtual channels, each of which corresponds to an ejection queue of a processor connected to the router. In the example depicted in Figure 6, a router on the first layer (L1) sees four virtual channels, namely L1-P1, L1-P2, L1-P3, and L1-P4, as its ejection channels located at processors P1, P2, P3 and P4, respectively. The route selection stage of the router chooses the appropriate output VC for a packet based on its destination. After leaving the horizontal plane router, the VC ID field of a flit is used to determine the packet ejection multiplexer into which the flit is inserted.

For implementation, a packet ejection multiplexer can be implemented on the corresponding processor layer. A flit ejected from a horizontal plane router can be broadcast to the packet ejection multiplexers on all layers, and a decision to accept a flit can be made locally at the input buffer of each multiplexer, based on the flit's VC ID. Finally, each packet ejection multiplexer can independently choose among its input queues which flit to forward to the corresponding destination processor at each cycle. Each multiplexer also handles flow control for its input buffers. We found that only a small amount of buffering (e.g. 5 flits) is sufficient at each of these input queues.
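Putting the three stages of Section 4.1 together, the route taken by a single packet under RPM-LM can be sketched in Python as below. This is an illustrative model only: the function name `rpm_lm_route` and the `(x, y, p)` address layout (tile coordinates plus processor layer index) are assumptions, and a uniform random layer choice stands in for the flit-counter-based balancing performed by the injection stage.

```python
import random


def rpm_lm_route(src, dst, num_layers, rng=random):
    """Return the (stage, position) steps of one packet under RPM-LM (illustrative sketch).

    src and dst are (x, y, p): tile coordinates plus the processor's layer index.
    """
    xs, ys, _ = src
    xd, yd, pd = dst
    # Assumption: a uniform random layer stands in for the flit-counter balancing.
    layer = rng.randrange(num_layers)
    # Minimal XY or YX routing on the chosen layer, with equal probability.
    order = rng.choice(("XY", "YX"))
    # Injection: one demultiplexer hop to the router at (xs, ys) on the chosen layer.
    steps = [("inject", (xs, ys, layer))]
    pos = {"X": xs, "Y": ys}
    goal = {"X": xd, "Y": yd}
    for dim in order:
        while pos[dim] != goal[dim]:
            pos[dim] += 1 if goal[dim] > pos[dim] else -1
            steps.append(("route", (pos["X"], pos["Y"], layer)))
    # Ejection: one multiplexer hop to the destination processor on layer pd.
    steps.append(("eject", (xd, yd, pd)))
    return steps
```

For example, `rpm_lm_route((0, 0, 1), (3, 2, 0), num_layers=4)` returns one injection step, five horizontal router-to-router steps that all stay on the same layer, and one ejection step, regardless of how far apart the source and destination layers are.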

4.2 Analysis

Claim 1. The worst-case throughput of RPM-LM is equal to the worst-case throughput of RPM on a 3D mesh, which is optimal when the network radix k is even and within a factor of $1/k^2$ of optimal when k is odd.

Proof. As mentioned in Section 3.1, optimal worst-case throughput analysis is based on ideal load analysis, which assumes ideal routers with infinite buffering. It is well known that Valiant routing (VAL) [18] achieves optimal worst-case throughput, and its maximum channel load is twice that of routing uniform traffic using dimension-ordered routing, which is $k/2$ when k is even and $(k^2 - 1)/2k$ when k is odd for a mesh network with radix k. To demonstrate that a routing algorithm R is worst-case throughput optimal, it is sufficient to show that the maximum channel load of R under all admissible traffic is the same as that of VAL.

In [12], it was shown that RPM on a 3D mesh is optimal when k is even and within a factor of $1/k^2$ of optimal when k is odd. This was shown by showing that the uniform load-balancing of RPM in the vertical dimension ensures that the 2D traffic that will traverse any XY plane will be doubly sub-stochastic if the overall 3D traffic matrix $\Lambda$ is doubly sub-stochastic. Given that RPM uses O1TURN [1] for routing on an XY plane, and that the 2D traffic on an XY plane is guaranteed to be admissible, it follows from the O1TURN result that the maximum channel load along the horizontal dimensions is $k/2$, which is optimal when k is even. When k is odd, the ratio of the maximum channel loads of VAL over RPM is $[(k^2 - 1)/2k]/[k/2] = (1 - 1/k^2)$, which is within a factor of $1/k^2$ of optimal.

For RPM-LM routing on the LM architecture, traffic is also uniformly load-balanced across the layers. Using the same analysis, the 2D traffic that will traverse any XY plane will be doubly sub-stochastic under an admissible $\Lambda$. Therefore, the maximum channel load on the XY planes will be the same as for RPM. Given that the demultiplexing packet injection stage and the multiplexing packet ejection stage are both non-blocking under ideal load analysis, the maximum channel load is dictated by the load on the channels in the XY planes. Therefore, RPM-LM on the LM architecture achieves the same worst-case throughput as RPM on a 3D mesh. This analysis holds for both symmetric and asymmetric meshes.

5 EVALUATION

This section is divided into three parts. Section 5.1 describes the simulation framework used for performance evaluation. Section 5.2 compares the performance of RPM on a conventional 3D mesh with existing mesh routing algorithms. Section 5.3 then evaluates the benefits of using RPM-LM on the LM architecture over RPM on a conventional 3D mesh in terms of power, area and performance.

5.1 Simulation Framework

We evaluate two 3D mesh configurations, a symmetric 4×4×4 mesh and an asymmetric 8×8×4 mesh topology. Both 3D configurations have four device layers. In practice, 3D mesh networks are not expected to be symmetric in 3D chip designs. The number of device layers is expected to be much less than the number of processor tiles that can be placed along the edge of a device layer. Hence, we chose an asymmetric mesh topology for our evaluation.

Throughput evaluation is carried out in two stages. We initially perform a simplified throughput analysis assuming an ideal single-cycle router with infinite buffers and then back these results with more realistic flit-level simulations. Worst-case and average-case throughputs are computed using the techniques described in Section 3.1.

We modified the PoPnet [14] on-chip network simulator to perform our flit-level simulations. The PoPnet simulator models a five-stage router pipeline corresponding to route selection, VC allocation, switch arbitration, switch traversal, and link traversal. It uses credit-based flow control. We used the simulator to evaluate the average routing delays under different injection loads. For each simulation, we ran the simulator for 5, cycles. The latency of a packet is measured as the delay between the time the header flit is injected into the network and the time the tail flit is consumed at the destination.

We assume 8 virtual channels (VCs) per physical channel and buffers of size 5 flits per VC at each input port for our performance evaluations. We include 8 VCs in our setup because it is well known that VCs improve the throughput of any routing algorithm by reducing head-of-line blocking and enabling better statistical multiplexing of flits. So, having a reasonably large number of VCs lets us compare the best performance of the routing algorithms. For the LM architecture, as mentioned in Section 4, we include 5 flit buffers at each injection port of the vertical demultiplexing switch and 20 flit buffers (4 queues of 5 flits each) at the input of each multiplexer in the packet ejection stage. The injected packets have a constant size of 5 flits. The simulations were performed on the four traffic patterns shown in Table 1 (Transpose, Complement, DOR-WC and Uniform random). The traffic pattern described as DOR-WC is a worst-case traffic pattern for DOR.

5.2 Throughput Evaluation of RPM

In this section, we compare the performance of RPM on a 3D mesh with four other oblivious mesh routing algorithms, namely, DOR [15], ROMM [1], O1TURN [1] and VAL [18]. The results of the simplified throughput analysis are shown in Table 2. The normalized saturation throughput (normalized to the network capacity) for each oblivious routing algorithm on each traffic pattern of Table 1 is shown in Table 2 for two different 3D mesh configurations. RPM achieves the same optimal worst-case throughput as VAL for both the symmetric and asymmetric mesh topologies. For the symmetric mesh topology, RPM outperforms minimal routing algorithms like, and in worst-case throughput by %, 14% and 1%, respectively. Similarly, for the asymmetric mesh topology, the worst-case throughput of RPM surpasses, and by 4%, 182% and 1%, respectively. RPM also achieves the highest average-case throughput averaged over a million random traffic patterns. It can sustain higher throughput than existing oblivious routing algorithms under adversarial traffic (transpose, DOR-WC) and comparable throughput under benign traffic (uniform, complement). This is further substantiated by flit-level simulation results.
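The simplified (ideal-router) throughput analysis used in this section can be reproduced in miniature for a single algorithm. The sketch below counts channel loads for X-then-Y-then-Z dimension-ordered routing on a k × k × k mesh and normalizes the result to the network capacity of Section 3.1. It is an illustration only, not the evaluation code behind Table 2; the function names and the choice of exact `Fraction` arithmetic are assumptions, and the uniform pattern here includes self-addressed traffic.

```python
from fractions import Fraction
from itertools import product


def dor_path(src, dst):
    """Directed channels visited by X-then-Y-then-Z dimension-ordered routing."""
    cur = list(src)
    channels = []
    for dim in range(3):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            nxt = cur.copy()
            nxt[dim] += step
            channels.append((tuple(cur), tuple(nxt)))
            cur = nxt
    return channels


def max_channel_load(k, traffic):
    """Heaviest channel load; traffic(src) returns a dict {dst: injection rate}."""
    load = {}
    for src in product(range(k), repeat=3):
        for dst, rate in traffic(src).items():
            for ch in dor_path(src, dst):
                load[ch] = load.get(ch, 0) + rate
    return max(load.values())


def capacity_load(k):
    """Bisection channel load under uniform traffic: k/4 (k even), (k^2-1)/(4k) (k odd)."""
    return Fraction(k, 4) if k % 2 == 0 else Fraction(k * k - 1, 4 * k)


def normalized_throughput(k, traffic):
    """Saturation throughput of DOR under `traffic`, normalized to network capacity."""
    return capacity_load(k) / max_channel_load(k, traffic)


def transpose(k):
    """Symmetric transpose pattern: (x, y, z) sends all its traffic to (y, z, x)."""
    return lambda src: {(src[1], src[2], src[0]): Fraction(1)}


def uniform(k):
    """Uniform random pattern: every node spreads its traffic evenly over all nodes."""
    nodes = list(product(range(k), repeat=3))
    return lambda src: {dst: Fraction(1, len(nodes)) for dst in nodes}
```

Calling `normalized_throughput(4, uniform(4))` returns 1 in this model, which is consistent with the definition of network capacity, while `normalized_throughput(4, transpose(4))` gives the corresponding figure for the symmetric transpose pattern.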

7 Table 1: Traffic patterns evaluated Worst-Case (WC) Worst-case traffic that causes lowest throughput Average-Case Average throughput over a million random traffic matrices Transpose Packet at (x, y, z) sent to (y, z, x) (asymmetric) Destination obtained by left shifting the concatenated bit representation of the source xyz to yzx and repartitoning the result Complement (Compl) Packet at (x, y, z) sentto(k x x 1,k y y 1,k z z 1) -WC Packet at (x, y, z) sentto(k z 1,k y 1,k x 1) (asymmetric) If x is represented using b x bits, destination node of (x, y, z) is obtained by interchanging the positions of the first b x bits of the concatenated bit representation of the source with the last b x bits, repartitioning the result, and taking its complement Uniform random Packet sent to a destination chosen at uniform random shown in Figures 7 and 8 The plots validate the fact that performs consistently well over a wide range of network traffic conditions and is superior to existing oblivious routing algorithms in handling adversarial traffic like transpose and -WC It can also be inferred from the plots that has a strictly lower latency and strictly higher throughput than for all traffic patterns considered 5 Evaluation of the LM Architecture Next, we evaluate the LM architecture in comparison to a conventional D mesh in terms of power, area and performance For performance evaluation, we present simplified throughput analysis, detailed flit-level simulations and an analysis of worst-case packet latency 51 Power Comparison In this section we compare the power of the LM and D mesh architectures using the power models in Orion 2 [4] The power models used are for a 65 nm process at 1V and an operating frequency of 1GHz The power reported includes both the static and dynamic power components A flit is assumed to be 128 bits wide SRAMs are used as input buffers and crossbars are built using multiplexer trees A two-stage round robin arbiter is used for VC allocation and matrix 
arbiters are used for switch allocation. Since buffer power constitutes a significant portion of router power, we use only 4 VCs (instead of the 8 VCs used for performance evaluation) with 5-flit buffers per VC at each input port of the 2D routers (LM) and 3D routers (3D mesh). This makes the power numbers less favorable for the LM architecture, as it diminishes the advantage of using fewer buffers in the demultiplexing and multiplexing structures compared to the buffering used in the routers.

Table 2: Throughput evaluation
Columns: Network, WC, Avg-Case, Transpose, Compl, RPM-WC, Uniform

Table 3: Average hop count of RPM on LM and 3D mesh architectures
Columns: Topology; LM (5-port router hops, Demux hops, Mux hops); 3D Mesh (7-port router hops)

A typical flit injection rate of flits/cycle is assumed at the injection queue of each processor for both the 4×4×4 and 8×8×4 topologies. Based on this injection rate and the average hop count of a flit in the LM and 3D mesh architectures, the average flit arrival rate per port is computed for the routers in the two architectures. For the symmetric topology, the LM architecture requires fewer hops on average to route a flit from its source to its destination (as shown in Table 3), and the lower hop count translates into a lower flit arrival rate per port in the routers of the LM architecture than in 3D mesh routers. Similarly, flits need to traverse more routers on average in the 8×8×4 topology than in the 4×4×4 topology, which increases the flit arrival rate per port of the routers in the 8×8×4 network. The per-port flit arrival rate affects the dynamic power consumption of the routers.

Table 4 compares the average router power cost per processor tile for the two architectures. The routers in the LM architecture have only 5 ports, compared to the 7 ports needed by routers in a 3D mesh. However, the LM architecture incurs the cost of an extra vertical demultiplexing switch for every 4 routers and a buffered multiplexer at every processor. The
additional hardware adds to the amortized router cost per tile. As shown in Table 4, with the costs of the demultiplexing and multiplexing structures added, the LM architecture still consumes 27% less power than a 3D mesh for the 4×4×4 topology and 26% less power for the 8×8×4 topology.

Table 4: Comparison of router power per tile
Columns: Router Power (mW), Amortized Demux Power (mW), Mux Power (mW), Total Router Power (mW)
3D mesh: 187, na, na, 187
LM:
3D mesh: 2528, na, na, 2528
LM:

Figure 7: Performance of RPM on a Mesh. Panels: (a) Uniform Random, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

Figure 8: Performance of RPM on a Mesh. Panels: (a) Uniform Random, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

5.2 Area Comparison

Next, we compare the router area per processor tile for the LM and 3D mesh architectures. In a 3D mesh, the router area per processor is the area occupied by a 7-port router. The router area per processor for the LM architecture comprises the area of the 5-port routers, the amortized demultiplexer area, the multiplexer area, and the area overhead due to the extra vertical wiring needed for inter-layer demultiplexing and multiplexing of flits. The areas of the 7-port routers, 5-port routers, demultiplexer and multiplexer were estimated using the Orion 2.0 [4] area models. The technology and router parameters used were the same as those used for power estimation.

Since we assume inter-layer communication using Through-Silicon Vias (TSVs), the vertical wiring occupies some area on each device layer. To estimate the area overhead due to the extra wiring needed in the LM architecture, we assumed a vertical via pitch of 4 µm [6] and eight 144-bit bundles (128-bit flits and 16-bit control signals; four bundles for demultiplexing and four for multiplexing) running vertically through all four device layers. The extra wiring overhead was just 1% of the area of a 5-port router. As shown in Table 5, even after including all the area overheads of the LM architecture, the total router area per processor is still 27% less than the area of a 7-port router needed in a 3D mesh. This comparison holds for both the symmetric and asymmetric topologies, which have the same number of device layers.

Table 5: Comparison of router area per tile
Columns: Router Area (mm^2), Amortized Demux Area (mm^2), Mux Area (mm^2), Wire Area (mm^2), Total Router Area (mm^2)
3D mesh: 9, na, na, na, 9
LM:

5.3 Throughput Evaluation of the LM Architecture

Table 6: Throughput of RPM vs RPM-LM
Columns: 3D mesh, LM, 3D mesh, LM
Rows: Worst-Case, Average-Case, Transpose, Complement, RPM-WC, Uniform

In this section, we compare
the saturation throughput of RPM-LM on the LM architecture with that of RPM on a 3D mesh. These routing algorithms have been shown to be worst-case throughput optimal for the two architectures, and RPM is known to be better than existing algorithms for 3D meshes, especially under adversarial traffic, as shown in Section 52. Table 6 presents the simplified throughput analysis results. RPM-LM has the same worst-case throughput as RPM on a 3D mesh, as analyzed in Section 42. For the symmetric mesh topology, RPM-LM outperforms RPM by 145% in average-case throughput, which is a significant improvement. For this topology, although RPM performs slightly better than RPM-LM on transpose traffic, the two perform comparably on complement and RPM-WC traffic patterns. On uniform traffic, RPM-LM outperforms RPM by % for the symmetric mesh topology. This is because the LM architecture makes better use of the inter-layer bandwidth available in 3D ICs to overcome bottlenecks caused by two-phase routing along the vertical dimension.

The simplified throughput analysis results of RPM and RPM-LM are very similar for the asymmetric mesh topology. RPM-LM and RPM have the same worst-case throughput and comparable average-case throughput. The saturation throughputs for the different traffic patterns evaluated are also very close. This is because, for the asymmetric topology, throughput is mainly constrained by the load on the horizontal channels, and the links along the short vertical dimension are relatively lightly loaded.

Figure 9: Performance of RPM on 3D mesh and LM architectures for a topology. Panels: (a) Uniform, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

Figure 10: Performance of RPM on 3D mesh and LM architectures for a topology. Panels: (a) Uniform, (b) Transpose, (c) Complement, (d) RPM-Worst Case.

We use flit-level simulations to get a realistic insight into the actual throughput that can be sustained by the two architectures under different traffic conditions. The simulations were performed on the four traffic patterns shown in Table 1. Figures 9 and 10 show the simulation results. These results follow the same trend as the simplified throughput analysis. RPM-LM clearly outperforms RPM under uniform traffic for the 4×4×4 mesh topology. On the remaining traffic patterns, the saturation throughputs of the two algorithms are
comparable. On the asymmetric topology, RPM-LM outperforms RPM by a narrow margin on all traffic patterns. The plots also give an accurate idea of the number of cycles saved by using the LM architecture over a 3D mesh. The latency of RPM-LM is lower than that of RPM for all four traffic patterns. The reduction is a consequence of the single-hop vertical communication and the fewer pipeline stages in the packet ejection stage of the LM architecture, as compared to a conventional router.

5.4 Worst-Case Latency Analysis

Minimizing the worst-case number of router hops is very important for latency-critical applications. Since RPM on a 3D mesh requires two-phase routing along one of the three dimensions and minimal routing in the two remaining dimensions, the worst-case hop count of RPM on a symmetric 3D mesh is 4(k - 1), k being the network radix. The worst-case hop count of RPM-LM on the LM architecture is 2(k - 1) + 2, where the +2 corresponds to the extra demultiplexing and multiplexing hops. For a symmetric mesh with k = 4, this gives 12 hops for RPM and 8 hops for RPM-LM, so RPM-LM on the LM architecture achieves a 33% lower worst-case hop count than RPM on a 3D mesh. For an asymmetric mesh, the worst-case hop count of RPM is (k_x - 1) + (k_y - 1) + 2(k_z - 1) and that of RPM-LM is (k_x - 1) + (k_y - 1) + 2, where k_x, k_y and k_z are the lengths of the X, Y and Z dimensions, respectively. For the asymmetric 8×8×4 topology, this gives 20 hops for RPM and 16 hops for RPM-LM, so the worst-case hop count of RPM-LM is 20% less than that of RPM on a 3D mesh.

6 CONCLUSIONS

The increasing viability of three-dimensional silicon integration technology has opened new opportunities for chip architecture innovations. The main contribution of this paper is a novel layer-multiplexed architecture for 3D on-chip interconnection networks that exploits the throughput optimality of an oblivious routing algorithm called RPM together with the short inter-layer wiring delays and high vertical communication bandwidth enabled in 3D technology. The LM architecture reorganizes the large 7-port routers in a 3D mesh into smaller 5-port 2D routers for
intra-layer communication, integrated with demultiplexing and multiplexing structures for inter-layer communication. This takes advantage of the short inter-layer distances in 3D designs and replaces the one-layer-per-hop routing in the vertical dimension with simpler vertical demultiplexing and multiplexing hops. The LM architecture retains the same worst-case throughput as a 3D mesh by adapting RPM routing to the LM architecture. However, in comparison to a conventional 3D mesh, the LM architecture consumes 27% less power, occupies 27% less area, attains 145% higher average throughput, and achieves a 33% lower worst-case hop count for a symmetric mesh. For an asymmetric 8×8×4 mesh topology, the LM architecture attains comparable average throughput to a conventional 3D mesh, but consumes 26% less power, takes up 27% less area, and achieves a 20% lower worst-case hop count.

7 REFERENCES

[1] A. Agarwal et al. Tile processor: Embedded multicore for networking and multimedia. Hot Chips, Stanford, CA, Aug. 2007.
[2] W. J. Dally, B. Towles. Principles and practices of interconnection networks. Morgan Kaufmann, 2003.
[3] W. R. Davis et al. Demystifying 3D ICs: The pros and cons of going vertical. IEEE Design and Test of Computers, 22(6):498-510, 2005.
[4] A. Kahng, B. Li, L.-S. Peh, K. Samadi. ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. Design, Automation and Test in Europe (DATE), Nice, France, April 2009.
[5] T. Kgil et al. PICOSERVER: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor. International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[6] J. Kim et al. A novel dimensionally-decomposed router for on-chip communication in 3D architectures. International Symposium on Computer Architecture, 2007.
[7] K. Lee et al. Three-dimensional shared memory fabricated using wafer stacking technology. IEDM Technical Digest, December 2000.
[8] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, M. Kandemir. Design and management of 3D chip multiprocessors using network-in-memory. International Symposium on Computer Architecture (ISCA), pages 130-141, 2006.
[9] H. Matsutani, M. Koibuchi, H. Amano. Tightly-coupled multi-layer topologies for 3-D NoCs. International Conference
on Parallel Processing, 2007.
[10] T. Nesson, S. L. Johnsson. ROMM routing on mesh and torus networks. ACM Symposium on Parallel Algorithms and Architectures, Santa Barbara, CA, 1995.
[11] D. Park et al. MIRA: A multi-layered on-chip interconnect router architecture. International Symposium on Computer Architecture, 2008.
[12] R. S. Ramanujam, B. Lin. Near-optimal oblivious routing on three-dimensional mesh networks. International Conference on Computer Design, 2008.
[13] D. Seo, A. Ali, W.-T. Lim, N. Rafique, M. Thottethodi. Near-optimal worst-case throughput routing for two-dimensional mesh networks. International Symposium on Computer Architecture, 2005.
[14] L. Shang, L.-S. Peh, N. K. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. IEEE International Symposium on High-Performance Computer Architecture, February 2003.
[15] H. Sullivan, T. R. Bashkow. A large scale, homogeneous, fully distributed parallel machine. Annual Symposium on Computer Architecture, 1977.
[16] M. B. Taylor, W. Lee, S. Amarasinghe, A. Agarwal. Scalar operand networks: On-chip interconnect for ILP in partitioned architectures. International Symposium on High-Performance Computer Architecture (HPCA), pages 341-350, Anaheim, CA, 2003.
[17] B. Towles, W. J. Dally. Worst-case traffic for oblivious routing functions. ACM Symposium on Parallel Algorithms and Architectures, Winnipeg, Manitoba, Canada, 2002.
[18] L. G. Valiant, G. J. Brebner. Universal schemes for parallel communication. ACM Symposium on the Theory of Computing, pp. 263-277, Milwaukee, WI, 1981.
[19] S. Vangal et al. An 80-tile sub-100-W teraflops processor in 65nm CMOS. IEEE Journal of Solid-State Circuits, Jan. 2008.
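
The worst-case hop counts quoted in the worst-case latency analysis can be checked with a few lines of arithmetic. The sketch below only evaluates the closed-form expressions given in the text. The function names are illustrative and do not come from the paper, and the symmetric formulas are treated as the special case k_x = k_y = k_z = k of the asymmetric ones.

```python
def rpm_worst_case_hops(kx: int, ky: int, kz: int) -> int:
    """Worst-case hops of RPM on a kx x ky x kz 3D mesh.

    Minimal routing in X and Y, two-phase routing along Z:
    (kx - 1) + (ky - 1) + 2 * (kz - 1).
    """
    return (kx - 1) + (ky - 1) + 2 * (kz - 1)


def rpm_lm_worst_case_hops(kx: int, ky: int) -> int:
    """Worst-case hops of RPM-LM on the LM architecture.

    Minimal routing in X and Y plus one demultiplexing hop and one
    multiplexing hop, independent of the number of layers.
    """
    return (kx - 1) + (ky - 1) + 2


def reduction_percent(baseline: int, improved: int) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    for kx, ky, kz in [(4, 4, 4), (8, 8, 4)]:
        mesh = rpm_worst_case_hops(kx, ky, kz)
        lm = rpm_lm_worst_case_hops(kx, ky)
        print(f"{kx}x{ky}x{kz}: RPM {mesh} hops, RPM-LM {lm} hops, "
              f"{reduction_percent(mesh, lm):.1f}% fewer")
```

For the 4×4×4 mesh this gives 12 versus 8 hops (about 33% fewer), and for the 8×8×4 mesh 20 versus 16 hops (20% fewer), matching the figures stated in the text.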

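The destination mappings listed in Table 1 can be made concrete with a small sketch. It rests on several assumptions that are not stated in the paper: every dimension is a power of two, each coordinate is encoded in the minimum number of bits, and the "left shift" of the asymmetric transpose is read as a left rotation by b_x bits. All function names are hypothetical.

```python
from typing import Tuple

Coord = Tuple[int, int, int]


def transpose(src: Coord) -> Coord:
    """Symmetric transpose: (x, y, z) is sent to (y, z, x)."""
    x, y, z = src
    return (y, z, x)


def complement(src: Coord, dims: Coord) -> Coord:
    """Complement: (x, y, z) is sent to (kx - x - 1, ky - y - 1, kz - z - 1)."""
    x, y, z = src
    kx, ky, kz = dims
    return (kx - x - 1, ky - y - 1, kz - z - 1)


def rpm_wc(src: Coord, k: int) -> Coord:
    """Symmetric RPM-WC: (x, y, z) is sent to (k - z - 1, k - y - 1, k - x - 1)."""
    x, y, z = src
    return (k - z - 1, k - y - 1, k - x - 1)


def transpose_asymmetric(src: Coord, dims: Coord) -> Coord:
    """Asymmetric transpose, assuming power-of-two dimensions.

    Concatenate the bits of x, y, z, rotate left by b_x bits (xyz -> yzx),
    then repartition the word into fields of b_x, b_y and b_z bits.
    """
    x, y, z = src
    bx, by, bz = ((k - 1).bit_length() for k in dims)
    total = bx + by + bz
    word = (x << (by + bz)) | (y << bz) | z
    rotated = ((word << bx) | (word >> (total - bx))) & ((1 << total) - 1)
    dx = rotated >> (by + bz)
    dy = (rotated >> bz) & ((1 << by) - 1)
    dz = rotated & ((1 << bz) - 1)
    return (dx, dy, dz)
```

Under these assumptions the bit-level version reduces to the coordinate swap on a symmetric mesh, for example `transpose_asymmetric((1, 2, 3), (4, 4, 4))` equals `transpose((1, 2, 3))`, and it remains a permutation of the nodes on an 8×8×4 mesh.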
A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin

50 IEEE EMBEDDED SYSTEMS LETTERS, VOL. 1, NO. 2, AUGUST 2009 A Layer-Multiplexed 3D On-Chip Network Architecture Rohit Sunkam Ramanujam and Bill Lin Abstract Programmable many-core processors are poised