Interconnection Network Project EE482 Advanced Computer Organization May 28, 1999


Group Members: Tom Fountain, T.J. Giuli, Paul Lassa, Derek Taylor

Overview

In this project we design a 1024-port interconnection network with 10 Gb/s per port. Using the specifications and constraints outlined in the project assignment, the design aims to minimize cost while sustaining a 25% duty factor on each port with an average latency of 125 ns (50 clock cycles of 2.5 ns) for random traffic with a message length of 16 bytes. The network also performs well under most permutation traffic patterns. The network handles packets from 8 bytes to 256 bytes long and guarantees delivery of every packet. Table 1 summarizes our network design.

The first part of the report presents the network topology and the chosen layout, which minimizes cost while providing reasonable performance. Then the routing and flow-control strategy for the network is outlined. Next, we detail the micro-architecture of the router; its key elements are the input and output ports, the routing relation, the allocators, and the crossbar switch. Finally, we describe the C++ simulator we built for the network and the simulation results we recorded, as well as the largely complete network model we created in Verilog.

Table 1: Interconnection Network Summary
Topology: 7-ary 3-cube (4:1 concentration)
Cost: $88,
Speedup: ~2 (accounting for overhead)
Routing Algorithm: Duato's algorithm with dimension-ordered escape channels
Flow Control: Virtual channel
Number of VCs: 8
No. of Escape Channels: 2
Flit Size: 32 bits
Phit Size: 32 bits (or width of channel)
Buffers/VC: 4 x 32 bits = 128 bits
Total Buffers/Input Port: 8 x 128 bits = 1 Kbit

Table of Contents

1. Specifications and Constraints
2. Topology and Cost Selection
   Dimension Selection
   Cost Comparison
   Why a Torus?
   Speedup and Final Cost
3. Routing Design
   Routing Relation
   Router Logic Block Diagram
4. Flow-Control Design
   Buffer State Space
   Buffer Flit Space
   Input Port (Switch Bandwidth) -- Optimistic Design
   Output Port
5. Router Micro-Architecture
   Input/Output Ports
   VC Allocator Design
   Routing Relation Logic
   Switch
6. C++ Network Simulator
7. Verilog Simulator
   Message Interface (msg_interface.v)
   Router (router.v)
   Routing Relation (rt_rel.v)
   VC Allocator (vc_alloc.v)
   Switch Allocator (sw_alloc.v)
   Crossbar Switch (sw_cbar.v)
8. Simulation Results
9. Conclusion
Appendix

1. Specifications and Constraints

The network is a 1024-port, 10 Gb/s-per-port interconnection network. The specifications and constraints follow directly from the project assignment; we summarize them in Table 2 and Table 3.

Table 2: Network Specifications
Number of Input Ports: 1024
Number of Output Ports: 1024
Port Data Rate: 10 Gb/s
Clock Rate: 400 MHz (2.5 ns clock)
Minimum Duty Factor: 25%
Maximum Latency (random): 125 ns (50 clock cycles) for 16-byte messages
Traffic Pattern: Random or port permutation
Latency/Throughput: Same for random as for port permutation
Packet Length: 8 bytes to 256 bytes
Quality of Service (QoS): No packets dropped

Table 3: Network Constraints
Chip Max Pins: 512 signal pins
Chip Signals: Differential/bidirectional (2 pins/signal)
Chip Pin Bandwidth: 1 Gb/s
Chip Clock Rate: 400 MHz (2.5 ns clock)
Chip Memory: 500 Kb single-port (drops quadratically)
Combinational Logic: 2.5 ns clock
Cross-Chip Wire Delay: 2.5 ns (one clock cycle)
Memory Access: 64 Kbits (read or write) per 2.5 ns cycle
Chip Cost: $200
Board Chip Count: 32 chips (8 in x 16 in board)
Board Max Bisection: 128 wires/inch
Board Connectors: 40 pins/inch (20 signals/inch) on opposite edges
Board Connector Cost: $0.10/pin
Board Cost: $400
Backplane Size: 16 in x 16 in
Backplane Bisection: 128 wires/inch (both dimensions)
Backplane Connectors: 16 board connectors with 40 pins
Backplane Connector Cost: $0.10/pin
Backplane Cost: $800
Cable Cost (w/connector): $0.05 x P x (D + 4)
Signal Limits (1 Gb/s signal): 1 m PC board or 4 m cable (reduced rate if longer)
Signal Delay: 2 ns/ft (6 ns/m)

2. Topology and Cost Selection

Dimension Selection

We selected our topology to provide optimum performance for both random and permutation traffic. Our selection process considered four topology families in the search for a minimal-cost network: the k-ary n-fly (and the Beneš network), the k-ary n-cube (torus and mesh), and the fat tree. For each family, we analyzed the networks theoretically to determine the optimum k and n, making reasonable assumptions in the process. We then compared the costs of the network types at their optimum k and n. Table 4 shows the analysis for the torus network; for brevity, we show only the torus with 4:1 concentrators, but we generated an identical table for each of the topologies named above.

Table 4: 256 x 256 k-ary n-cube with 4:1 Concentrators (table body not reproduced; its columns gave, for each candidate n, k, and backplane count, the widths w1, w2, and w, the channel load gamma, the throughput Thr, and the latency La). In the original table, dark shaded cells are pin limited and light shaded cells are pin and bisection limited simultaneously. Latency is in cycles and throughput in bits/cycle; multiply the throughput values by the signaling frequency to get the actual throughput.

In each analysis, we considered a network with a pin limitation of Wn = 256 (the maximum number of differential pin signals), a wire-bisection limitation of Ws = 1024 x (number of backplanes), and a message length of L = 128 bits. We analyzed the networks over a range of bisection-width limitations, since we can choose this limit via the number of backplanes or via cables; we chose limits that are multiples of the per-backplane bisection because the cable limitation is not as well defined. In the tables, w1 is the signal constraint on the channel (signals/channel), w2 is the wire constraint on the channel (signals/channel), w is the minimum of w1 and w2, Thr is the throughput of the network, and La is the latency.

We also considered using concentrators in our design. Since each port has a 25% duty factor, we can place 4:1 concentrators at each input and build a 256 x 256-port network to save cost. Wherever we consider concentrators, we account for the additional latency by adding two cycles.
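To make the analysis concrete, here is a minimal sketch of the kind of zero-load calculation behind Table 4. It is not the project's code, and the formulas are assumptions in the style of standard k-ary n-cube analysis: pin-limited width w1 = Wn / (2(n+1)) (2n network ports plus injection/ejection), bisection-limited width w2 = Ws*k / (2N) (a torus has 2N/k channels crossing the bisection), random-traffic channel load gamma = k/8, and average hop count H = nk/4.

```cpp
// Sketch of the zero-load analysis behind Table 4 (not the project's code).
// All formulas are assumptions; see the lead-in text.
#include <algorithm>
#include <cstdio>

long long ipow(int b, int e) { long long r = 1; while (e--) r *= b; return r; }

int main() {
    const int Wn = 256;  // differential signals per chip (512 pins / 2)
    const int L  = 128;  // message length in bits (16 bytes)
    const int N  = 256;  // nodes after 4:1 concentration

    for (int n = 2; n <= 4; ++n) {
        int k = 2;
        while (ipow(k, n) < N) ++k;             // smallest radix with k^n >= N
        for (int bp = 1; bp <= 8; bp *= 2) {    // number of backplanes
            int Ws = 1024 * bp;                 // wire-bisection limit
            int w1 = Wn / (2 * (n + 1));        // pin-limited channel width
            int w2 = Ws * k / (2 * N);          // bisection-limited channel width
            int w  = std::min(w1, w2);
            double gamma = k / 8.0;             // channel load, random traffic
            double thr   = w / gamma;           // bits/cycle (x f for actual rate)
            double la    = n * k / 4.0          // average hop count
                         + double(L) / w        // serialization latency
                         + 2;                   // two cycles for the concentrators
            std::printf("n=%d k=%d Ws=%d w=%d gamma=%.3f Thr=%.1f La=%.1f\n",
                        n, k, Ws, w, gamma, thr, la);
        }
    }
}
```

Multiplying the printed throughput by the 400 MHz signaling rate gives the data rate per node; for n = 3 the sketch reproduces the pin-limited width of w = 32 and the gamma of 0.875 used in the comparison below.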

Cost Comparison

We then analyzed the cost of each of the optimum configurations from the preceding tables. Finally, we selected our network by balancing cost against the zero-load performance analysis, including an estimate of network performance under permutation traffic. Table 5 summarizes our cost calculations. From this table, it was immediately clear that a network with concentrators was a big win. We decided that our primary choice was between the 256-port butterfly and the 256-port cube, for cost reasons.

Table 5: Overall Cost Summary (total cost per candidate network; the original table also listed each configuration's radix k and dimension n)
1024-port butterfly: $302,
256-port butterfly: $77,
256-port Beneš: $130,
1024-port cube: $392,
256-port cube: $88,

Why a Torus?

Since the cost of the 256-port butterfly was comparable to that of the 256-port cube, we made the choice based on the effects of permutation traffic and channel loading. For random traffic in the butterfly configuration, the loading factor (gamma) on each channel is one. However, a bit-reversal permutation produces a channel loading of 8, because there is no path diversity! Furthermore, the duty factor of this network is now 1, since we are using the concentrators. From an analysis similar to that in Table 4, the widest possible channel for the butterfly is w = 32, due to pin limitations (assuming no slicing). Since the network requires at least 10 signals to handle a loading factor of 1, the butterfly has a relative loading factor of 32/10 = 3.2.

Conversely, for the torus, the 2-cube requires k = 16, giving gamma = 2. This implies channels up to 20 signals wide to handle random-traffic loading. Since the 2-cube allows channels up to 64 wide, it can handle traffic permutations with a relative loading of 64/20 = 3.2 over a random permutation's loading. The 3-cube requires k = 7, giving gamma = 0.875, which implies channel widths of approximately 9. If we design our 3-cube with 32-wide channels, the relative loading is 32/9, about 3.6. The 4-cube has a relative loading of 16/5 = 3.2, which is lower than the 3-cube and equal to the 2-cube; we calculate the 4-cube with 16-wide channels, since a 32-wide channel leaves no I/O pins available for the entry/exit port. Thus, there is a clear advantage to the 3-cube over the 2-cube or the 4-cube. Furthermore, since the relative loading factor of the best torus is far better than the butterfly's, and since the path diversity of the torus should be multiplied by the loading factor for an accurate comparison, the torus is clearly a big win over the butterfly. Thus, we selected the 256-port torus for our network.

Speedup and Final Cost

Before costing the network, we needed to decide how wide to make the channels. A channel width of 32 offered a significant speedup of 3 over a more conservative channel width of 16. We chose the width of 32, however, because we determined that the critical packaging limitation was the off-board bandwidth (320 signals per side); a channel width of 32 let us use the maximum bandwidth possible, since we could send up to 10 channels off the board on each side. In retrospect, this speedup of 3 was probably excessive: we planned to provide enough routing flexibility to exploit the path diversity, so our channel loading would not require this bandwidth.
However, we do not quite get a speedup of 3, since we use virtual channels with 4 signals dedicated to credits and 5 signals committed to overhead on every flit. Thus, our effective speedup is closer to 2 than 3. Figure 1 shows the actual chip and board layout configuration that we use for our network.

Figure 1: Chip and Board Configuration for the 256-port Torus (chip layout with I/O and the X, Y, and Z dimensions; board layout)

Table 6 summarizes the cost of the 256-port network. Using this chip and board configuration, we have 256 network chips and 256 concentrator chips, along with 64 concentrator boards and 64 network boards. We connect all the concentrator outputs to backplanes, requiring 16,384 board connectors. Additionally, we use every connector possible on the boards, for 81,920 board connector pins. We require 4 backplanes for the concentrator network. If we connect the network X-Y planes within backplanes and the Z dimension over cables, we require 1 backplane for each of the 7 XY planes, for 7 network backplanes in total. We connect the concentrator backplanes to the network backplanes, which means connecting 49 channels from the concentrator backplanes to every network backplane, for 21,952 backplane connector pins. Additionally, we can calculate all the board connectors as if we were using only 8 backplanes to connect to all the boards instead of 11; we then require 10,240 connectors per backplane, for 81,920 board connector pins. Finally, since all Z-dimension wiring is via cables, and every node has two Z-dimension channels, we have 16,384 pins of cable cost.

Table 6: 7-ary 3-cube Cost Summary
Chips: $38,
Boards: $14,
Board Connectors ($0.10/pin, 98,304 pins): $18,
Backplanes: $3,200
Backplane Connectors ($0.10/pin, 103,872 pins): $10,
Cabling ($0.05 x P x (D + 4), 16,384 pins): $4,096
Total Cost: $88,

3. Routing Design

This section explains our routing relation and routing design.

Routing Relation

To gain the advantages of path diversity while avoiding deadlock, we use Duato's algorithm, which divides the channels between two routing relations: R1 is a deadlock-free routing relation, and C1 is the set of channels belonging to R1; R1 over C1 can be thought of as the escape channels. R2 is the routing relation that governs all other channels in the network (C2). For R1, we use simple dimension-ordered routing: first X, then Y, then Z. For R2, we use a selection algorithm that randomly picks any possible next channel along a minimal route; we do not use congestion information to bias this random selection. We require at least 2 virtual channels for C1 and use the remaining channels for C2. We rely on Duato's proof to establish that the network is deadlock-free.
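As an illustration of this routing relation, here is a minimal C++ sketch, ours rather than the authors' code, of how a header's candidate output channels can be computed in a 7-ary 3-cube: the adaptive relation R2 may request any minimal direction, while the escape relation R1 is restricted to the first unresolved dimension in X, Y, Z order. Encoding directions as +/-1 per dimension is an illustrative choice.

```cpp
// Sketch: candidate output channels under Duato's algorithm with a
// dimension-ordered (X, then Y, then Z) escape relation.
#include <array>
#include <cstdio>
#include <utility>
#include <vector>

constexpr int K = 7; // radix of the 7-ary 3-cube

// Preferred direction in one dimension: the shorter way around the ring.
// Returns +1, -1, or 0 (dimension already resolved).
int preferredDir(int cur, int dest) {
    int delta = ((dest - cur) % K + K) % K; // 0..6
    if (delta == 0) return 0;
    return (delta < 4) ? +1 : -1; // 4 is the median of the 7 nodes
}

struct Candidates {
    std::vector<std::pair<int, int>> adaptive; // (dim, dir) pairs for C2
    std::pair<int, int> escape;                // the single legal C1 channel
};

Candidates route(std::array<int, 3> cur, std::array<int, 3> dest) {
    Candidates c{{}, {-1, 0}};
    for (int dim = 0; dim < 3; ++dim) {
        int dir = preferredDir(cur[dim], dest[dim]);
        if (dir == 0) continue;                 // no hop needed in this dimension
        c.adaptive.push_back({dim, dir});       // R2: any minimal direction
        if (c.escape.first < 0) c.escape = {dim, dir}; // R1: lowest unresolved dim
    }
    return c;
}

int main() {
    Candidates c = route({0, 0, 0}, {2, 5, 0});
    for (auto [dim, dir] : c.adaptive)
        std::printf("adaptive: dim %d, dir %+d\n", dim, dir);
    std::printf("escape:   dim %d, dir %+d\n", c.escape.first, c.escape.second);
}
```

Randomly choosing among the adaptive candidates, as the design does, requires no congestion state; the dimension-ordered escape candidate remains available as the deadlock-free fallback that Duato's proof relies on.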

For our design, we also found it convenient to provide a filtering module between the preferred-direction vector and the VC allocator that removes all requests for channels that are already assigned. This way, the VC allocator need not consider that information when making assignments.

Router Logic Block Diagram

Figure 2 shows how the router logic is implemented. When a packet arrives at an input port, it bids for all of the possible output VCs the packet could travel over, in accordance with the rules of Algorithm 2. The figure shows the logic needed at each input port to determine which output VCs the packet can use.

Figure 2: Routing Relation Logic Block Diagram (the incoming header feeds the direction-vector computation; a quadrant analysis returns the available C1 channels while a parallel block computes the available C2 channels; the combined set passes through the filter to the VC allocator)

4. Flow-Control Design

Our network uses virtual-channel flow control, extending the virtual channels that our routing mechanism already uses to avoid deadlock via Duato's algorithm. Table 7 gives the specification of our virtual-channel configuration.

Table 7: Virtual Channel Specifications
Number of VCs: 8
Flit Size: 32 bits
Phit Size: 32 bits (or width of channel)
Buffers/VC: 4 x 32 bits = 128 bits
Total Buffers/Input Port: 8 x 128 bits = 1 Kbit

Essentially, the flow-control mechanism answers the question of how to allocate resources. In our design there are four essential resources to allocate: (1) the input buffer state space, (2) the input buffer flit space, (3) the switch bandwidth (equivalent to obtaining a switch input port), and (4) the output channel bandwidth. Figure 3 illustrates the resources that must be allocated under virtual-channel flow control.

Buffer State Space

We allocate the buffer state space to header flits as they arrive on the input channel. Figure 3 shows only 4 input buffer state spaces per input port; our design allows 8 input virtual channels per input port. For each input header flit, the router requires that the packet receive a VC assignment before it can be routed through the switch. As illustrated in Figure 2, each header flit passes its destination address into logic that implements the routing relation (R). The routing-relation logic produces an array of requests for all the output ports that would correctly route the packet. The VC allocator receives the requests from all the input ports in a cycle and allocates output VCs according to the allocation heuristic explained in the following section.

Figure 3: Resource Diagram for Virtual-Channel Flow Control (routing relation R, VC allocator, and switch allocator above the buffer state space, buffer flit space, and switch, between the input and output ports)

In order to ensure that the upstream router does not send to the current router unless input buffer state space is available, the upstream router maintains a data structure associated with each output port (as does this router with its own output ports). The data structure contains two fields: the G field and the C field. The G field holds the global state, either Idle, meaning the output port is not in use, or Active, meaning the output port is allocated to an input port. The C field counts the available credits, that is, the empty buffer slots at the downstream router. When the C field is zero, the output port will not send any flits downstream. Furthermore, the global state (G field) of the output port does not return to Idle until the count reaches 4 (or, in general, the number of buffers on a VC). Thus we ensure that buffer state space exists before sending any more flits downstream.

Buffer Flit Space

Each virtual channel has enough buffers to cover a little more than a complete round-trip delay of a flit. In Figure 3, each buffer has 4 slots to store flits. When a virtual channel is allocated to an incoming packet, the buffer is guaranteed to be empty: a VC is not allocated unless its state is Idle, and it does not go Idle unless its buffers are empty. Since the source does not send unless it has an available credit, and since we return exactly one credit for each flit that successfully arbitrates for an input port of the switch, the allocation scheme ensures that we always have sufficient buffer flit space.

In our actual design, we allocate 5 slots to store flits. We chose this number because a 16-byte packet requires 5 flits, and we benefit from being able to buffer an entire packet (of the size used in our simulation) within a flit buffer. However, the network still works for packets of any size larger than 2 flits.
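The credit bookkeeping above is compact enough to state in code. The following is a minimal sketch of the per-output-VC state, assumed structure rather than the authors' implementation, with the G and C fields and the three events that touch them: granting the VC, sending a flit downstream, and receiving a credit back. Four buffer slots are used for concreteness; the actual design uses five.

```cpp
// Sketch: credit-based state for one output VC (G and C fields as described).
#include <cassert>

struct OutputVCState {
    enum class G { Idle, Active } g = G::Idle; // G field: global state
    int c = 4;            // C field: credits = free buffer slots downstream
    bool tailSent = false; // assumed detail: a VC cannot go Idle mid-packet

    void allocate() {                  // VC allocator grants this VC to a packet
        assert(g == G::Idle && c == 4);
        g = G::Active;
        tailSent = false;
    }

    bool trySendFlit(bool isTail) {    // forward one flit downstream
        if (g != G::Active || c == 0) return false; // no credit: stall
        --c;
        tailSent = isTail;
        return true;
    }

    void receiveCredit() {             // downstream freed one buffer slot
        assert(c < 4);
        ++c;
        if (tailSent && c == 4) g = G::Idle; // buffer empty again: free the VC
    }
};

int main() {
    OutputVCState vc;
    vc.allocate();
    vc.trySendFlit(false);             // header
    vc.trySendFlit(true);              // tail
    vc.receiveCredit();
    vc.receiveCredit();                // all credits back: vc.g is Idle again
}
```

The assertion in receiveCredit captures the invariant that makes the scheme safe: credits never exceed the downstream buffer depth, so a flit is only ever sent into guaranteed buffer space.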

Input Port (Switch Bandwidth) -- Optimistic Design

We allocate switch input bandwidth to flits every cycle. Our design provides a 56 x 7 crossbar switch, which lets us arbitrate for each output port of the switch rather than allocate it. This also allows optimistic routing, which reduces the latency through our router to only 3 cycles: every ready flit transmits across the switch in every cycle in which it bids. The switch allocator sets the MUXes of the switch in the current cycle so that the winners pass through the switch in that same cycle. The switch allocator then sends acknowledgements back to the winning ports, and the ports that receive an acknowledgement send credits back to the upstream router.

Output Port

The switch allocator also handles allocation of the output port. Since the allocator matches input ports to output ports, there is no contention here, and a flit is ready for transmission exactly one cycle after sailing through the router's switch.

5. Router Micro-Architecture

This section describes the router micro-architecture. The router consists of five main elements: the input and output ports, the VC allocator, the switch allocator, the routing relation logic block, and the switch.

Input/Output Ports

Figure 5 illustrates the structure of and interfaces between the main functional blocks, showing the communication paths and state logic used in performing the routing function. White blocks represent input/output port logic; grey blocks represent functional elements described in later sections.

Input ports: The router has 7 input ports (X+, X-, Y+, Y-, Z+, Z-, and Injection), each of which contains buffering for incoming packets. Associated with each VC of each input port is an Input Buffer State Vector (IBSV) containing the following fields:
(G) Global state {Idle, Routing, Active}
(R) Routing info {preferred-direction vector}
(O) Output virtual channel port (awarded by the VC allocator)

Output ports: Outgoing flits are received from the switch, staged, and sent out to the downstream router. Associated with each VC of each output port is an Output Buffer State Vector (OBSV) containing the following fields:
(G) Global state {Idle, Active}
(I) Source input port (awarded by the VC allocator)
(C) Output credits available (in the downstream router)

Figure 4 shows the traversal of a two-flit packet (head and tail) through the router.

Figure 4: Timing Diagram through the Router
Cycle 1: Header arrives; VC allocated
Cycle 2: Tail arrives; header bids at the switch allocator and traverses the switch
Cycle 3: Header leaves the output port; tail traverses the switch
Cycle 4: Tail leaves the output port

At time 1, the packet header arrives at the input port, and an IBSV is initialized (G set to the Routing state) for the intended input VC. The header is simultaneously stored in the first flit buffer slot for this VC (buffer pointer = 1) and sent through the routing relation to the VC allocator, which allocates the packet an output VC in this cycle. At time 2, the tail flit arrives and queues up behind the head flit in the flit buffer. In addition, the packet header optimistically advances through the switch while bidding for its output port at the switch allocator. In this example, the head flit wins its request and advances to the output port. At time 3, the head flit passes out of the port onto the wire to the next router, and the tail flit passes through the switch. At time 4, the tail flit passes out of the output port.

VC Allocator Design

The job of the VC allocator is to assign one of a collection of requested resources to as many requesters as possible, solving a bipartite matching problem. Said differently, given a set of options for each input, the allocator must match each input with an available output such that the maximum number of inputs receive a valid output. Figure 6 illustrates the bipartite matching problem.

To get a near-maximal matching as quickly as possible, we use a heuristic that approaches perfect assignment. The allocation is solved in three phases. In the first phase, the allocator counts the number of input ports requesting each output port and sends the counts back to the input ports. In the second phase, each input port bids on the output port with the lowest request count. In the final phase, a matrix arbiter assigns an output port to each input port based on the bids, with last-served-loses fairness. For example, given the requests of Figure 6, at the end of the first phase A would bid for 1, B would bid for 1, C would bid for 2, and D would bid for 3. Since A and B both want 1, the final phase arbitrates between the two, and the one served most recently loses. Figure 7 illustrates the three phases.

Since there are 8 channels per port, there can be as many as 56 requesters. This poses a problem for the heuristic, because we cannot complete the allocation over so many bits in a single clock cycle. We therefore add an additional phase up front that introduces a separable scheme: it reduces the number of input requests from 56 to 7, one per input port. The scheme rests on the observations that at most one header arrives at each input port per cycle and that, in the usual case, a new packet is assigned a VC on its first request; it therefore offers good performance at low complexity. Figure 8 shows a block diagram of the allocator.

The first-phase logic counts the number of input channels requesting each output channel and passes the counts back to the input ports. Since each bit of the 56-bit field represents one output channel, a simple counter suffices: count the set bits in each bit-slice of the input array. In the second phase, each input port bids for the output channel with the lowest count; the min selector outputs a 56-bit field with the bit of that channel set. Finally, the third phase is an arbiter that resolves all contention for output ports among the input bids.
If more than one bit in any bit-slice is set, the arbiter uses a round-robin token to choose which input is granted the output.
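To make the three-phase heuristic concrete, here is a minimal software sketch, ours rather than the authors' hardware. Phase 1 counts requests per output, phase 2 has each input bid on its least-contended option, and phase 3 resolves conflicts with a rotating round-robin token. The request sets in main are an invented pattern consistent with the Figure 6 example (A and B both want only resource 1; C and D have alternatives).

```cpp
// Sketch of the three-phase allocation heuristic (software model).
#include <climits>
#include <cstdio>
#include <vector>

// requests[i] = output channels input i may use; returns a grant per input
// (-1 if none). `token` carries the round-robin fairness of phase 3.
std::vector<int> allocate(const std::vector<std::vector<int>>& requests,
                          int numOutputs, int& token) {
    std::vector<int> count(numOutputs, 0);
    for (const auto& req : requests)            // phase 1: count requesters
        for (int out : req) ++count[out];

    int n = requests.size();
    std::vector<int> bid(n, -1);
    for (int i = 0; i < n; ++i) {               // phase 2: bid on least contended
        int best = INT_MAX;
        for (int out : requests[i])
            if (count[out] < best) { best = count[out]; bid[i] = out; }
    }

    std::vector<int> grant(n, -1), owner(numOutputs, -1);
    for (int k = 0; k < n; ++k) {               // phase 3: round-robin arbitration
        int i = (token + k) % n;                // start from the token holder
        if (bid[i] >= 0 && owner[bid[i]] < 0) { owner[bid[i]] = i; grant[i] = bid[i]; }
    }
    token = (token + 1) % n;                    // rotate priority for fairness
    return grant;
}

int main() {
    int token = 0;
    std::vector<std::vector<int>> req = {{1}, {1}, {1, 2}, {2, 3}};
    for (int g : allocate(req, 4, token)) std::printf("%d ", g);
    std::printf("\n"); // prints "1 -1 2 3": A gets 1, B loses, C gets 2, D gets 3
}
```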

Figure 5: Router Control Blocks

Figure 6: Bipartite Matching Problem of the Allocator (requesters A, B, C, and D bid for a set of output resources). If B is assigned to 1, then A cannot be assigned; and since only one of the requesters can receive 3, one of the requesters must go idle.

Figure 7: Three-Phase Allocator (a separable filter reduces the requested VCs; each input queries the output VCs and bids for the resource with the lowest request count; an arbiter resolves multiply-requested resources and returns the granted resource)

Table 8: Result Vector Meaning (mapping the comparator outputs A and B of Figure 9 to the preferred direction in the dimension; table body not reproduced)

Figure 8: Block Diagram of the Allocator Logic (the input bit fields feed per-output bit-counter logic; min selectors turn the counts into output bids; a round-robin arbiter produces the enabled output bit fields)

Routing Relation Logic

The preferred-direction-vector logic calculates the preferred direction in a single dimension. The first step is to subtract the local router position from the destination position. The result is then compared with 4, the median of the number of nodes in a single dimension, to determine whether the packet should be routed in the positive or the negative direction. Table 8 summarizes the meaning of the result vector, and Figure 9 illustrates the preferred-direction-vector calculator.

Figure 9: Preferred Direction Vector Calculator, single dimension (the destination position minus the router position feeds a comparator against 4 and a zero detector, whose outputs A and B form the result written to the R field of the input buffer state)
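In equation form (our reading of the comparator in Figure 9, assuming the subtraction wraps modulo the radix k = 7), the single-dimension decision is:

```latex
\Delta = (d - s) \bmod 7, \qquad
A = [\Delta = 0], \qquad
B = [\Delta < 4], \qquad
\mathrm{dir} =
\begin{cases}
\text{resolved (no hop)} & \text{if } A = 1,\\
+1 \text{ (positive)}    & \text{if } A = 0 \text{ and } B = 1,\\
-1 \text{ (negative)}    & \text{if } A = 0 \text{ and } B = 0,
\end{cases}
```

where d is the destination coordinate and s is the local router coordinate. For k = 7, every nonzero Delta below 4 is strictly shorter going in the positive direction, which is why the comparator tests against the median, 4.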

Three single-dimension direction-vector blocks run in parallel to calculate all three dimensions simultaneously. The calculated direction vector is written into the R field of the input buffer state of the requesting input VC.

The VC global allocation state is a global state register that tracks which output VCs are available for allocation. Whenever the VC allocator allocates an output VC, the corresponding bit in the register is cleared to 0; when a VC is freed, its bit is set to 1.

After the preferred-direction vector has been calculated, it is fed into the control logic. Since we are using a 3-cube, a flit can leave the router on at most 3 output ports. The output-port selection logic masks out the output VCs belonging to ports that cannot be taken and passes the VCs of legal output ports to the next logic block. In the escape-channel trimming block, the VCs associated with the escape channels are examined, and the escape channels that cannot be taken yet are masked out: our routing algorithm uses x-first, y-next, z-last dimension-ordered routing for the escape channels, so the trimming logic prevents two of the three escape channels from being requested.

Figure 10: VC Selection Logic (the direction vector and the VC global allocation state feed the control logic: the output-port selection logic, then the escape-channel trimming logic, producing the 56-bit request vector sent to the VC allocator)

After the trimming stage, the bits that remain high represent the requested VCs and are sent to the VC allocator. The routing logic checks which VCs are available for allocation before sending its request, to remove some of the processing burden from the VC allocator stage.
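The masking pipeline amounts to a few bitwise operations on a 56-bit request field (7 output ports x 8 VCs). The following is an assumed software rendering, not the authors' logic: the bit layout (VC v of port p at bit p*8+v) and the convention that one escape VC sits at VC 0 of each port (the design reserves two escape VCs; one is shown for brevity) are illustrative choices.

```cpp
// Sketch: VC selection as bit masking over a 56-bit field
// (7 output ports x 8 VCs; port p, VC v -> bit p*8 + v).
#include <cstdint>
#include <cstdio>

constexpr int kVCsPerPort = 8;

// All VC bits of one output port.
constexpr uint64_t portMask(int port) {
    return 0xFFull << (port * kVCsPerPort);
}

// legalPorts: bitmask of output ports on a minimal route (at most 3 set).
// escapePort: the single dimension-ordered escape port (x-first, then y, then z).
// freeVCs:    the VC global allocation state, 1 = VC available.
uint64_t selectVCs(uint64_t freeVCs, unsigned legalPorts, int escapePort) {
    uint64_t req = 0;
    for (int p = 0; p < 7; ++p)               // output-port selection logic
        if (legalPorts & (1u << p)) req |= portMask(p);

    // Escape-channel trimming: clear every escape VC (assumed to be VC 0 of
    // each port) except the one on the dimension-ordered escape port.
    for (int p = 0; p < 7; ++p)
        if (p != escapePort) req &= ~(1ull << (p * kVCsPerPort));

    // Filter: drop already-allocated VCs (the filtering module before the
    // VC allocator described earlier).
    return req & freeVCs;
}

int main() {
    // Ports 0 and 2 are minimal directions; the escape route uses port 0.
    uint64_t req = selectVCs(~0ull >> 8, 0b101u, 0); // all 56 VCs free
    std::printf("request field: %014llx\n", (unsigned long long)req);
}
```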

Figure 11: Overview of Input Port Routing Logic (one direction-vector calculation block is shared by the port and feeds per-VC selection logic for each of the 8 input buffer state words, whose outputs go to the VC allocator)

For one input port, the logic blocks are connected as shown above. Only one header can arrive at an input port per cycle, so only one copy of the direction-vector calculation logic is needed. However, requests for VCs can be denied, so the VC selection logic must be duplicated for each input state word. The VC selections are then sent to the VC allocator.

Switch

The switch is a 56 x 7 crossbar that connects all 56 input virtual channels to all of the possible output ports. We chose this configuration because our design was not constrained by on-chip area. Furthermore, it allows the simplest switch control: a simple arbitration at each output port. Finally, we wanted to provide optimistic switch arbitration, and allowing all 56 input VCs to compete at every output port was the easiest way to implement it. Figure 12 illustrates the implementation logic.

Figure 12: 56 x 7 Crossbar Switch using 56:1 MUXes (the inputs X+, X-, Y+, Y-, Z+, Z-, and Inject contribute 8 channels each; a 56:1 MUX, set by the control, drives each of the outputs X+, X-, Y+, Y-, Z+, Z-, and Eject)

6. C++ Network Simulator

We completed a C++ simulator of our network to provide performance analysis under both random traffic and randomly selected bit-permutation traffic. Each component of the network is abstracted into a separate C++ class, forcing strict interfaces between modules. State information is maintained as instance variables within each class, and interfaces are exercised through method invocation. The classes represent the major components of an interconnection network, as shown in Figure 13. Several implementation tool classes were also created to provide system-wide constants, abstract lists and utilities, and messages.

The simulator performs a synchronized time step at every node in the network, including the concentrators, for a specified number of steps. A key issue in building the simulator was handling the race condition that could arise when passing flits and credits between routers. Our simulation models the time of flight over a channel as one cycle, so we wanted to ensure that a flit is never transmitted and read in the same time step. For example, if node A executes and sends a flit to node B, and then B executes and reads its flit buffer, node B could pick up the new flit and process it a cycle too early if it did not already have another flit waiting in its buffer. We handle this race condition by always sending exactly one flit and one credit every cycle: if a router has nothing to send on a port in a given cycle, it sends a null flit or null credit instead. Since all buffers in the router initialize with exactly one null flit and credit, the system maintains a synchronous state across all connections. In the previous example, node B would not pick up node A's newly transmitted flit, because a null flit (from the previous time step) would already be waiting in the flit buffer and would be picked up instead.
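The null-flit convention is easy to model with a channel primed to hold exactly one value. The sketch below is ours, not the simulator's actual API (the Channel class and its method names are illustrative); it shows why priming each link with one null flit makes every link behave as one cycle of flight time regardless of the order in which the two routers execute within a time step.

```cpp
// Sketch: a channel primed with one null flit, giving one-cycle flight time
// independent of the order in which the two routers execute.
#include <cstdio>
#include <deque>
#include <optional>

struct Flit { int payload; };                  // illustrative flit type
using MaybeFlit = std::optional<Flit>;         // nullopt models a null flit

class Channel {
    std::deque<MaybeFlit> q{std::nullopt};     // primed with one null flit
public:
    // Receiver pops the value written in the previous cycle ...
    MaybeFlit pop() { MaybeFlit f = q.front(); q.pop_front(); return f; }
    // ... and the sender pushes exactly one (possibly null) flit per cycle.
    void push(MaybeFlit f) { q.push_back(f); }
};

int main() {
    Channel ab;                                // channel from router A to B
    for (int cycle = 1; cycle <= 3; ++cycle) {
        // Whichever of A and B runs first, B always reads the previous
        // cycle's flit, never the one A sends in the current cycle.
        MaybeFlit got = ab.pop();
        ab.push(cycle == 1 ? MaybeFlit{Flit{42}} : std::nullopt);
        std::printf("cycle %d: B received %s\n", cycle, got ? "a flit" : "null");
    }
    // cycle 1: null (the primer); cycle 2: the flit sent in cycle 1; cycle 3: null
}
```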

Figure 13: C++ Network Simulator Design (main drives a Network of Routers; the router-architecture classes are the Router, VC Allocator, Switch Allocator, Routing Relation, and Switch; the implementation tools are Utilities, List, SystemConstants, and Messages)

7. Verilog Simulator

We also implemented the main router logic in Verilog. A complete network was not constructed; however, we did implement the following modules. The sections below describe each module and its current working state.

Message Interface (msg_interface.v): Links the network infrastructure to our topology of router modules.

Router (router.v): The core of our design. All external inputs and outputs are managed in this module, and all of the remaining modules are instantiated within router.v.

Input Flit Buffer (IFB_mem.v): 5-deep FIFOs (one per VC) for received flits.

Input Buffer State Vector (IBSV_mem.v): Manages the state bits for each VC of each physical input channel; accepts commands and reports state to other modules. The preferred direction is stored here.

Output Buffer State Vector (OBSV_mem.v): Manages the state bits for each VC of each physical output channel; accepts commands and reports state to other modules. Credits are stored here.

Routing Relation (rt_rel.v): Receives the destination address from the header flit and generates the preferred-direction vector that is input to the VC allocator.

VC Allocator (vc_alloc.v): Receives all contending VC requests from the IBSV module and performs the allocation, awarding available output VC channels to up to 7 winners.

Credit Manager (cred_mgr.v): Receives control signals from the downstream router, the OBSV_mem, the IFB_mem, and the switch allocator; manages local credits and sends VC-tagged credits to the upstream router.

VC Ready Combiner (vc_rdy.v): Collects state bits from the IFB_mem, the IBSV_mem, and the OBSV_mem, and combinationally generates a uniform request vector for the switch allocator.

Switch Allocator (sw_alloc.v): Receives the VC-ready vector from the combiner, generates the current cycle's 7 winners (and sends upstream credits), and decrements the local credit count in the OBSV_mem. The previous cycle's 7 winners actually pass through the switch this cycle.

Crossbar Switch (sw_cbar.v): Passes the previous cycle's 7 winners from the IFB_mem through the 56 x 7 crossbar to the physical output ports.

8. Simulation Results

Our simulator is still being debugged; we simply ran out of time. We will deliver simulation results as an addendum as soon as the simulator is finished.

9. Conclusion

Our team feels that this project was very helpful in forcing us to come to grips with the complex issues surrounding the implementation of a modern router and interconnection network. In retrospect, we should have decided on our final topology and concentration ratio sooner, so that we could have spent more time simplifying and analyzing different routing and flow-control options. We probably designed in an excessive amount of speedup: we could have driven the cost of the network down with narrower channels and more nodes per board, since the board turned out to be the limiting resource for channel width. However, we designed the excess speedup to handle a wider array of permutation traffic, and the network simulation results verify that the design handles permutation traffic well.

Appendix

List of Figures
Figure 1: Chip and Board Configuration for the 256-port Torus
Figure 2: Routing Relation Logic Block Diagram
Figure 3: Resource Diagram for Virtual-Channel Flow Control
Figure 4: Timing Diagram through the Router
Figure 5: Router Control Blocks
Figure 6: Bipartite Matching Problem of the Allocator
Figure 7: Three-Phase Allocator
Figure 8: Block Diagram of the Allocator Logic
Figure 9: Preferred Direction Vector Calculator (single dimension)
Figure 10: VC Selection Logic
Figure 11: Overview of Input Port Routing Logic
Figure 12: 56 x 7 Crossbar Switch using 56:1 MUXes
Figure 13: C++ Network Simulator Design

List of Tables
Table 1: Interconnection Network Summary
Table 2: Network Specifications
Table 3: Network Constraints
Table 4: 256 x 256 k-ary n-cube with 4:1 Concentrators
Table 5: Overall Cost Summary
Table 6: 7-ary 3-cube Cost Summary
Table 7: Virtual Channel Specifications
Table 8: Result Vector Meaning


More information

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics

Lecture 14: Large Cache Design III. Topics: Replacement policies, associativity, cache networks, networking basics Lecture 14: Large Cache Design III Topics: Replacement policies, associativity, cache networks, networking basics 1 LIN Qureshi et al., ISCA 06 Memory level parallelism (MLP): number of misses that simultaneously

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011

CS252 Graduate Computer Architecture Lecture 14. Multiprocessor Networks March 9 th, 2011 CS252 Graduate Computer Architecture Lecture 14 Multiprocessor Networks March 9 th, 2011 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel

More information

Lecture 7: Flow Control - I

Lecture 7: Flow Control - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 7: Flow Control - I Tushar Krishna Assistant Professor School of Electrical

More information

Resource allocation in networks. Resource Allocation in Networks. Resource allocation

Resource allocation in networks. Resource Allocation in Networks. Resource allocation Resource allocation in networks Resource Allocation in Networks Very much like a resource allocation problem in operating systems How is it different? Resources and jobs are different Resources are buffers

More information

Lecture 21. Reminders: Homework 6 due today, Programming Project 4 due on Thursday Questions? Current event: BGP router glitch on Nov.

Lecture 21. Reminders: Homework 6 due today, Programming Project 4 due on Thursday Questions? Current event: BGP router glitch on Nov. Lecture 21 Reminders: Homework 6 due today, Programming Project 4 due on Thursday Questions? Current event: BGP router glitch on Nov. 7 http://money.cnn.com/2011/11/07/technology/juniper_internet_outage/

More information

CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION. Stephen Chui Bachelor of Engineering Ryerson University, 2012.

CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION. Stephen Chui Bachelor of Engineering Ryerson University, 2012. CONGESTION AWARE ADAPTIVE ROUTING FOR NETWORK-ON-CHIP COMMUNICATION by Stephen Chui Bachelor of Engineering Ryerson University, 2012 A thesis presented to Ryerson University in partial fulfillment of the

More information

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing?

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing? Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing? J. Flich 1,P.López 1, M. P. Malumbres 1, J. Duato 1, and T. Rokicki 2 1 Dpto. Informática

More information

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing

A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 727 A Dynamic NOC Arbitration Technique using Combination of VCT and XY Routing 1 Bharati B. Sayankar, 2 Pankaj Agrawal 1 Electronics Department, Rashtrasant Tukdoji Maharaj Nagpur University, G.H. Raisoni

More information

Priority Traffic CSCD 433/533. Advanced Networks Spring Lecture 21 Congestion Control and Queuing Strategies

Priority Traffic CSCD 433/533. Advanced Networks Spring Lecture 21 Congestion Control and Queuing Strategies CSCD 433/533 Priority Traffic Advanced Networks Spring 2016 Lecture 21 Congestion Control and Queuing Strategies 1 Topics Congestion Control and Resource Allocation Flows Types of Mechanisms Evaluation

More information

Interconnect Technology and Computational Speed

Interconnect Technology and Computational Speed Interconnect Technology and Computational Speed From Chapter 1 of B. Wilkinson et al., PARAL- LEL PROGRAMMING. Techniques and Applications Using Networked Workstations and Parallel Computers, augmented

More information

Communication Performance in Network-on-Chips

Communication Performance in Network-on-Chips Communication Performance in Network-on-Chips Axel Jantsch Royal Institute of Technology, Stockholm November 24, 2004 Network on Chip Seminar, Linköping, November 25, 2004 Communication Performance In

More information

Removing the Latency Overhead of the ITB Mechanism in COWs with Source Routing Λ

Removing the Latency Overhead of the ITB Mechanism in COWs with Source Routing Λ Removing the Latency Overhead of the ITB Mechanism in COWs with Source Routing Λ J. Flich, M. P. Malumbres, P. López and J. Duato Dpto. of Computer Engineering (DISCA) Universidad Politécnica de Valencia

More information

Asynchronous Bypass Channel Routers

Asynchronous Bypass Channel Routers 1 Asynchronous Bypass Channel Routers Tushar N. K. Jain, Paul V. Gratz, Alex Sprintson, Gwan Choi Department of Electrical and Computer Engineering, Texas A&M University {tnj07,pgratz,spalex,gchoi}@tamu.edu

More information

Switching Hardware. Spring 2015 CS 438 Staff, University of Illinois 1

Switching Hardware. Spring 2015 CS 438 Staff, University of Illinois 1 Switching Hardware Spring 205 CS 438 Staff, University of Illinois Where are we? Understand Different ways to move through a network (forwarding) Read signs at each switch (datagram) Follow a known path

More information

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services

Overview Computer Networking What is QoS? Queuing discipline and scheduling. Traffic Enforcement. Integrated services Overview 15-441 15-441 Computer Networking 15-641 Lecture 19 Queue Management and Quality of Service Peter Steenkiste Fall 2016 www.cs.cmu.edu/~prs/15-441-f16 What is QoS? Queuing discipline and scheduling

More information

WITH THE CONTINUED advance of Moore s law, ever

WITH THE CONTINUED advance of Moore s law, ever IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 11, NOVEMBER 2011 1663 Asynchronous Bypass Channels for Multi-Synchronous NoCs: A Router Microarchitecture, Topology,

More information

Fast Scalable FPGA-Based Network-on-Chip Simulation Models

Fast Scalable FPGA-Based Network-on-Chip Simulation Models We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations and support. Computer Architecture Lab at Carnegie Mellon Fast Scalable FPGA-Based Network-on-Chip Simulation

More information

INTERCONNECTION NETWORKS LECTURE 4

INTERCONNECTION NETWORKS LECTURE 4 INTERCONNECTION NETWORKS LECTURE 4 DR. SAMMAN H. AMEEN 1 Topology Specifies way switches are wired Affects routing, reliability, throughput, latency, building ease Routing How does a message get from source

More information

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank

More information

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks Department of Computer Science and Engineering, Texas A&M University Technical eport #2010-3-1 seudo-circuit: Accelerating Communication for On-Chip Interconnection Networks Minseon Ahn, Eun Jung Kim Department

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

EEC-484/584 Computer Networks

EEC-484/584 Computer Networks EEC-484/584 Computer Networks Lecture 13 wenbing@ieee.org (Lecture nodes are based on materials supplied by Dr. Louise Moser at UCSB and Prentice-Hall) Outline 2 Review of lecture 12 Routing Congestion

More information

Interconnection Network

Interconnection Network Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network

More information

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture

More information

Performance Analysis of a Minimal Adaptive Router

Performance Analysis of a Minimal Adaptive Router Performance Analysis of a Minimal Adaptive Router Thu Duc Nguyen and Lawrence Snyder Department of Computer Science and Engineering University of Washington, Seattle, WA 98195 In Proceedings of the 1994

More information

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management

OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management Marina Garcia 22 August 2013 OFAR-CM: Efficient Dragonfly Networks with Simple Congestion Management M. Garcia, E. Vallejo, R. Beivide, M. Valero and G. Rodríguez Document number OFAR-CM: Efficient Dragonfly

More information