Interconnection Network Project
EE482 Advanced Computer Organization
May 28, 1999


Group Members:
Tom Fountain (fountain@cs.stanford.edu)
T.J. Giuli (giuli@cs.stanford.edu)
Paul Lassa (lassa@relgyro.stanford.edu)
Derek Taylor (dat@leland.stanford.edu)

Overview

In this project we design a 1024-port, 10 Gb/s-per-port interconnection network. Using the specifications and constraints outlined in the project assignment, the design aims to minimize cost while sustaining a 25% duty factor on each port with an average latency of 125 ns (50 2.5-ns clock cycles) for random traffic with a message length of 16 bytes. The network also performs well under most permutation traffic patterns. Finally, the network handles packets from 8 bytes to 256 bytes in length and guarantees delivery of every packet. Table 1 summarizes our network design.

The first part of the report presents the network topology and the chosen layout that minimizes cost while providing reasonable performance. Then the routing and flow-control strategy for the network is outlined. Next, we detail the micro-architecture of the router; its key elements are the Input and Output Ports, the Routing Relation, the Allocator, and the Crossbar Switch. Finally, we built a C++ simulator for the network and recorded simulation results. We also created much of a complete network model in Verilog.

Table 1: Interconnection Network Summary

Topology                  7-ary 3-cube (4:1 concentration)
Cost                      $88,915.20
Speedup                   ~2 (accounting for overhead)
Routing Algorithm         Duato's algorithm with dimension-ordered escape channels
Flow Control              Virtual channel
Number of VCs             8
No. Escape Channels       2
Flit Size                 32 bits
Phit Size                 32 bits (or width of channel)
Buffers/VC                4 × 32 bits = 128 bits
Total Buffers/Input Port  8 × 128 bits = 1 Kbit

Table of Contents

1. SPECIFICATIONS AND CONSTRAINTS
2. TOPOLOGY AND COST SELECTION
   Dimension Selection
   Cost Comparison
   Why a Torus?
   Speedup and Final Cost
3. ROUTING DESIGN
   Routing Relation
   Router Logic Block Diagram
4. FLOW-CONTROL DESIGN
   Buffer State Space
   Buffer Flit Space
   Input Port (Switch Bandwidth) -- Optimistic Design
   Output Port
5. ROUTER MICRO-ARCHITECTURE
   Input/Output Ports
   VC Allocator Design
   Routing Relation Logic
   Switch
6. C++ NETWORK SIMULATOR
7. VERILOG SIMULATOR
   Message Interface (msg_interface.v)
   Router (router.v)
   Routing Relation (routing_relation.v)
   Switch (switch.v)
   VC Allocator (vc_allocator.v)
   Switch Allocator (switch_allocator.v)
8. SIMULATION RESULTS
9. CONCLUSION
APPENDIX

1. Specifications and Constraints

The basic router is a 1024-port, 10 Gb/s interconnection network. The exact specifications and constraints follow directly from the project assignment; we summarize them in Table 2 and Table 3.

Table 2: Network Specifications

Category                     Specification
Number of Input Ports        1024
Number of Output Ports       1024
Port Data Rate               10 Gb/s
Clock Rate                   400 MHz (2.5 ns clock)
Minimum Duty Factor          25%
Maximum Latency (random)     125 ns (50 clock cycles) @ 16 bytes
Traffic Pattern              Random or port permutation
Latency/Throughput           Same for random as for port permutation
Packet Length                8 bytes to 256 bytes
Quality of Service (QoS)     No packets dropped

Table 3: Network Constraints

Category                     Constraint
Chip Max Pins                512 signal pins
Chip Signals                 Differential/bidirectional (2 pins/signal)
Chip Pin Bandwidth           1 Gb/s
Chip Clock Rate              400 MHz (2.5 ns clock)
Chip Memory                  500 Kb single-port (drops quadratically)
Combinational Logic          20 levels @ 2.5 ns clock
Cross-Chip Wire Delay        2.5 ns clock cycle
Memory Access                64 Kbits (read or write) in a 2.5 ns cycle
Chip Cost                    $200
Board Chip Count             32 chips (8" x 16")
Board Max Bisection          128 wires/inch
Board Connectors             40 pins/inch (20 signals/inch) on opposite edges
Board Connector Cost         $0.10/pin
Board Cost                   $400
Backplane Size               16" x 16"
Backplane Bisection          128 wires/inch (both dimensions)
Backplane Connectors         16 board connectors with 40 pins
Backplane Connector Cost     $0.10/pin
Backplane Cost               $800
Cable Cost (w/connector)     $0.05 × (P × (D + 4))
Signal Limits (1 Gb/s)       1 m PC board or 4 m cable (reduced rate if longer)
Signal Delay                 2 ns/ft (6 ns/m)

2. Topology and Cost Selection

Dimension Selection

We selected our topology to provide optimum performance for both random and permutation traffic. Our selection process considered four different topology families in the search for a minimal-cost network: the k-ary n-fly (and Beneš), the k-ary n-cube (torus and mesh), and the fat tree. For each family, we analyzed the networks theoretically to determine the optimum k and n, making reasonable assumptions in the process. Then we compared the costs of each network type at its optimum k and n. Table 4 shows the analysis table for the Torus network. For brevity, we show only the analysis for the Torus with 4:1 concentrators; we generated an identical table for each of the other topologies.

Table 4: 256x256 k-ary n-cube with 4:1 concentrators. Dark-shaded cells are pin limited; light-shaded cells are pin and bisection limited simultaneously. Multiply the throughput values by the signaling frequency to get the actual throughput. Latency is in cycles; throughput is in bits/cycle.

n        8       4       3        2      1
k        2       4       7*       16     256
w1       16      32      42       64     128
w2a      4       8       10       32     512
w2b      8       16      21       64     1024
w2c      16      32      42       128    2048
w2d      32      64      84       256    4096
w2e      64      128     167      512    8192
Wa       4       8       10       32     128
Wb       8       16      21       64     128
Wc       16      32      42       64     128
Wd       16      32      42       64     128
We       16      32      42       64     128
Gamma    0.25    0.5     0.875    2      32
Thra     16      16      11.9417  16     4
Thrb     32      32      23.8834  32     4
Thrc     64      64      47.7667  32     4
Thrd     64      64      48       32     4
Thre     64      64      48       32     4
Laa      38      22      19.5     14     67
Lab      22      14      13.375   12     67
Lac      14      10      10.3125  12     67
Lad      14      10      10.2976  12     67
Lae      14      10      10.2976  12     67

(The row suffixes a through e correspond to 1, 2, 4, 8, and 16 backplanes, i.e., bisection limits Ws of 1024, 2048, 4096, 8192, and 16384 wires.)

In each analysis, we considered a network with a pin limitation of Wn = 256 (the maximum number of differential pin signals), a wire-bisection limitation of Ws = 1024 × (number of backplanes), and a message length of L = 128 bits. We analyzed the networks over a range of bisection-width limitations, since we can set this limit via the number of backplanes or via cables; we chose limits that are multiples of the per-backplane bisection because the cable limitation is less well defined. In the tables, w1 is the pin constraint on channel width (signals/channel), w2 is the wire constraint on channel width (signals/channel), W is the minimum of w1 and w2, Thr is the throughput of the network, and La is the latency.

We also considered using concentrators in our design. Since each port has a 25% duty factor, we can place 4:1 concentrators at each input and build a 256x256-port network to save cost. Wherever we use concentrators, we account for the additional latency by adding two cycles.
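To make the analysis concrete, the following C++ sketch reproduces the kind of calculation behind Table 4. The formulas are our reconstruction of a standard zero-load analysis (pin-limited width Wn/(2n), bisection-limited width Ws/(2·k^(n-1)), torus channel load gamma = k/8 for random traffic, plus the two-cycle concentrator penalty mentioned above); the printed values track the table only up to rounding.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Sketch of the zero-load analysis behind Table 4 (k-ary n-cube, 4:1
// concentration).  Assumes the pin/bisection channel-width bounds and
// the torus channel load gamma = k/8 stated in the lead-in.
int main() {
    const double Wn = 256;   // max differential pin signals per chip
    const double L  = 128;   // message length in bits
    const int    dims[]  = {8, 4, 3, 2, 1};
    const int    radix[] = {2, 4, 7, 16, 256};

    for (int i = 0; i < 5; ++i) {
        int n = dims[i], k = radix[i];
        double w1 = Wn / (2 * n);                       // pin-limited channel width
        for (int backplanes = 1; backplanes <= 16; backplanes *= 2) {
            double Ws = 1024.0 * backplanes;            // bisection limit in wires
            double w2 = Ws / (2 * std::pow(k, n - 1));  // bisection-limited width
            double W  = std::min(w1, w2);               // usable channel width
            double gamma = k / 8.0;                     // channel load, random traffic
            double thr   = W / gamma;                   // throughput, bits/cycle/port
            double lat   = n * k / 4.0 + L / W + 2;     // hops + serialization + concentrator
            std::printf("n=%d k=%-3d Ws=%5.0f  W=%6.2f  Thr=%6.2f  La=%6.2f\n",
                        n, k, Ws, W, thr, lat);
        }
    }
    return 0;
}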

Cost Comparison

We then analyzed the cost of each of the optimum configurations from the preceding tables and selected our network by balancing cost against the zero-load performance analysis, including an estimate of network performance under permutation traffic. Table 5 summarizes our cost calculations. From this table it was immediately clear that a network with concentrators was a big win, and we decided that, for cost reasons, our primary choice was between the 256-port Butterfly and the 256-port cube.

Table 5: Overall Cost Summary

Attribute      1024-port butterfly   256-port butterfly   256-port Beneš   1024-port cube   256-port cube
Radix (k)      4                     4                    4                11               7
Dimension (n)  5                     4                    4                3                3
Total Cost     $302,361.60           $77,238.00           $130,697.60      $392,960.00      $88,915.20

Why a Torus?

Since the cost of the 256-port Butterfly was comparable to the cost of the 256-port cube, we made the choice based on the effects of permutation traffic and channel loading. For random traffic in the Butterfly configuration, the loading factor (gamma) on each channel is one. However, a bit-reversal permutation produces a channel loading of 8, because there is no path diversity. Furthermore, the duty factor of this network is now 1, since we are using the concentrators. From an analysis similar to that in Table 4, the widest possible channel for the Butterfly is w = 32, due to pin limitations (assuming no slicing). Since the network requires at least 10 signals to handle a loading factor of 1, the Butterfly has a relative loading factor of 32/10, or 3.2.

Conversely, for the Torus, the 2-cube requires k = 16, or gamma = 2. This implies channels up to 20 signals wide to handle random-traffic loading. Since the 2-cube allows channels up to 64 wide, it can handle traffic permutations with a relative loading of 3.2 (64/20) over a random permutation's loading. The 3-cube requires k = 7, or gamma = 0.875, implying channel widths of approximately 9; if we design our 3-cube with 32-wide channels, the relative loading is 3.6 (32/9). The 4-cube has a relative loading of 3.2 (16/5), which is lower than the 3-cube's but equal to the 2-cube's; we calculate the 4-cube with 16-wide channels because a 32-wide channel would leave no I/O pins for the entry/exit port. Thus, there is a clear advantage in designing the 3-cube over the 2-cube or the 4-cube. Furthermore, since the relative loading factor for the best Torus is far better than for the Butterfly, and since the path diversity of the Torus should be multiplied by the loading factor to get an accurate comparison, the Torus is a clear win over the Butterfly. We therefore select the 256-port Torus for our network.

Speedup and Final Cost

Before costing the network, we needed to decide how wide to make the channels. A channel width of 32 offered a significant speedup of 3 over a more conservative channel width of 16. We chose the width of 32 because we determined that the critical packaging limitation was the off-board bandwidth (320 signals per side); a 32-wide channel let us obtain the maximum bandwidth possible, since we could send up to 10 channels off the board on each side. In retrospect, this speedup of 3 was probably excessive, since we planned to provide enough routing flexibility to exploit the path diversity, so our channel loading would not require this bandwidth.
However, we don't quite achieve a speedup of 3, since we use virtual channels with 4 signals dedicated to credits and 5 signals committed to overhead on every flit. Thus, our effective speedup is closer to 2 than to 3. Figure 1 shows the actual chip and board layout configuration that we use for our network.

Figure 1: Chip and Board Configuration for the 256-port Torus (chip layout and board layout, with the I/O and the X, Y, and Z dimensions indicated).

Table 6 summarizes the cost of the 256-port network. Using this chip and board configuration, we have 256 network chips and 256 concentrator chips, along with 64 concentrator boards and 64 network boards. We connect all the concentrator outputs to backplanes, requiring 16,384 board connectors. Additionally, we use every connector possible on the boards, for 81,920 board connector pins. We require 4 backplanes for the concentrator network. If we connect the network X-Y planes within backplanes and the Z dimension over cables, we require 1 backplane for each of the X-Y planes, for 7 network backplanes total. We connect the concentrator backplanes to the network backplanes, which requires connecting 49 channels from the concentrator backplanes to every network backplane, or 21,952 backplane connector pins. Additionally, we can calculate all the board connectors as if we were using only 8 backplanes to connect to all the boards instead of 11; we then require 10,240 connectors per backplane, for 81,920 board connector pins. Finally, since all Z-dimension wiring is via cables and there are two Z-dimension channels for every node, we have 16,384 pins in cable costs.

Table 6: 7-ary 3-cube Cost Summary

Type                   Cost/Unit             Number of Units   Total Cost
Chips                  $200                  512               $38,400.00
Boards                 $400                  128               $14,400.00
Board Connectors       $0.10/pin             98,304            $18,432.00
Backplanes             $800                  11                $3,200.00
Backplane Connectors   $0.10/pin             103,872           $10,387.20
Cabling                $0.05 × (P × D + 4)   16,384            $4,096.00
Total Cost                                                     $88,915.20

3. Routing Design

This section explains our routing relation and routing design.

Routing Relation

To gain the advantages of path diversity while avoiding deadlock, we use Duato's algorithm to split the channels between two routing relations. R1 is a deadlock-free routing relation, and C1 is the set of channels belonging to R1; C1 can be thought of as the escape channels. R2 is a routing relation that governs all other channels (C2) in the network, i.e., those not in C1. For R1 we use simple dimension-ordered routing: first X, then Y, then Z. For R2 we use a selection algorithm that randomly picks any path along a minimal route; we do not use congestion information to bias this random choice. We require at least 2 virtual channels for C1 and use the remaining channels for C2. We rely on Duato's proof to demonstrate that the network is deadlock free.
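As an illustration of this routing relation (a sketch of ours, not code from the project; the helper names and coordinate handling are assumptions), the candidate output directions for a packet in the 7-ary 3-cube could be computed as follows:

#include <array>
#include <vector>

// R1/C1: dimension-ordered (X, then Y, then Z) escape channels.
// R2/C2: any productive (minimal) direction, chosen at random upstream.
constexpr int K = 7;  // radix per dimension

struct Direction { int dim; int sign; };  // sign = +1 or -1

// Preferred sign in one dimension of the torus: go the short way around.
static int preferredSign(int cur, int dst) {
    int diff = (dst - cur + K) % K;
    if (diff == 0) return 0;
    return (diff <= K / 2) ? +1 : -1;
}

// C2 candidates: every dimension in which the packet still has to move.
std::vector<Direction> adaptiveCandidates(const std::array<int, 3>& cur,
                                          const std::array<int, 3>& dst) {
    std::vector<Direction> out;
    for (int d = 0; d < 3; ++d)
        if (int s = preferredSign(cur[d], dst[d]))
            out.push_back({d, s});
    return out;
}

// C1 (escape) candidate: the lowest unfinished dimension, X first, then Y, then Z.
Direction escapeCandidate(const std::array<int, 3>& cur,
                          const std::array<int, 3>& dst) {
    for (int d = 0; d < 3; ++d)
        if (int s = preferredSign(cur[d], dst[d]))
            return {d, s};
    return {-1, 0};  // already at the destination: eject
}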

For our design, we also found it convenient to provide a filtering module between the preferred direction vector and the VC Allocator that removes all requests for channels that are already assigned. This way, the VC Allocator does not need to consider this information when making assignments.

Router logic block diagram

Figure 2 shows how the router logic is implemented. When a packet arrives at an input port, it bids for all of the possible output VCs that the packet could travel over, in accordance with the rules of Algorithm 2. The diagram shows the logic needed at each input port to determine which output VCs the packet can possibly use.

Figure 2: Routing Relation Logic Block Diagram (the incoming destination feeds the direction vector computation; quadrant analysis returns the available C1 channels and a parallel block computes the available C2 channels; the combined requests pass through the filter to the VC allocator).

4. Flow-Control Design

Our network uses virtual-channel flow control, extending the virtual channels already used by our routing mechanism to avoid deadlock via Duato's algorithm. Table 7 gives the specifications of our virtual channel configuration.

Table 7: Virtual Channel Specifications

Number of VCs             8
Flit Size                 32 bits
Phit Size                 32 bits (or width of channel)
Buffers/VC                4 × 32 bits = 128 bits
Total Buffers/Input Port  8 × 128 bits = 1 Kbit

Essentially, the flow-control mechanism answers the question of how to allocate resources. In our design there are four essential resources to allocate: 1) the input buffer state space, 2) the input buffer flit space, 3) the switch bandwidth (equivalent to obtaining a switch input port), and 4) the output channel bandwidth. Figure 3 illustrates the resources that must be allocated under virtual-channel flow control.

Buffer State Space

We allocate the buffer state space to header flits as they arrive on the input channel. Figure 3 shows only 4 input buffer state spaces per input port; our design allows for 8 input virtual channels per input port. For each header flit, the router requires that the packet receive a VC assignment before it can be routed through the switch. As illustrated in Figure 2, each header flit passes its destination address into logic that implements the routing relation (R). The routing relation logic produces an array of requests for all the possible output ports that could route the packet. The VC allocator receives the requests from all the input ports in a cycle and allocates output VCs according to the allocation heuristic explained in the following section.

Figure 3: Resource Diagram for Virtual Channel Flow Control (input ports with buffer state space and buffer flit space, the routing relation R, the VC Allocator, the Switch Allocator, the switch, and the output ports).

To ensure that the source router does not send to the current router unless an input buffer state space is available, the source router maintains a data structure for each output port (as this router does for its own output ports). The data structure contains two fields: the G field and the C field. The G field specifies the global state, either Idle, meaning the output port is not in use, or Active, meaning the output port is allocated to an input port. The C field is a count of the available credits, i.e., the empty buffer spaces at the destination. When the C field is zero, the output port will not send any flits to the destination. Furthermore, the global state (G field) of the output port does not return to Idle until the count reaches 4 (the number of buffers on a VC). Thus, we ensure that buffer state space exists before sending any more flits to the destination.

Buffer Flit Space

Each virtual channel has enough buffers to cover a little more than a complete round-trip delay of a flit. Figure 3 shows each buffer with 4 slots for flits. When a virtual channel is allocated to an incoming flit, the buffer is guaranteed to be empty, since a VC is not allocated unless its state is Idle, and it does not go Idle unless its buffers are empty. Since the source does not send unless it has an available credit, the allocation scheme ensures sufficient buffer flit space by returning exactly one credit for each flit that arbitrates successfully for an input port of the switch.

In our actual design, we allocate 5 slots for flits. We chose this number because a 16-byte packet requires 5 flits, and we benefit from being able to buffer an entire packet (of the size used in our simulation) within a flit buffer. The network still works for packets of any size larger than 2 flits.
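The following C++ sketch illustrates the G/C bookkeeping described above. It is our illustration rather than the project's code, and the condition that the tail flit must have been sent before the VC returns to Idle is our addition.

// Per-output-VC state (G and C fields) with simple credit bookkeeping,
// assuming 4 flit buffers per VC in the downstream router.
struct OutputVCState {
    enum class Global { Idle, Active };
    static constexpr int kBuffersPerVC = 4;

    Global g = Global::Idle;          // G field: Idle or Active
    int    credits = kBuffersPerVC;   // C field: free buffers downstream

    // Called when the VC allocator assigns this output VC to a packet.
    void allocate() { g = Global::Active; }

    // A flit may only be forwarded when a credit is available.
    bool canSend() const { return g == Global::Active && credits > 0; }
    void sendFlit()      { --credits; }   // consume one downstream buffer

    // Credit returned by the downstream router when it frees a buffer.
    void receiveCredit(bool tailSent) {
        ++credits;
        // The VC is released only once every downstream buffer is free again.
        if (tailSent && credits == kBuffersPerVC) g = Global::Idle;
    }
};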

Input Port (Switch Bandwidth) -- Optimistic Design

We allocate the input port to flits every cycle. Our design provides a 56x7 crossbar switch, which allows us to arbitrate for each output port of the switch rather than allocate it, and allows us to use optimistic routing, reducing the latency through our router to only 3 cycles. Essentially, every flit that is ready transmits across the switch in every cycle in which it bids. The Switch Allocator sets the MUXes of the switch in the current cycle so that the winners pass through the switch in that same cycle. The Switch Allocator then sends acknowledgements back to the winning ports, and the ports that receive an acknowledgement send credits back to the source router.

Output Port

The Switch Allocator handles allocation of the output port. Since the allocator matches the input port to the output port, there is no contention here, and the flit is prepared for transmission exactly one cycle after passing through the router's switch.

5. Router Micro-Architecture

This section describes the router micro-architecture. The router design consists of five main elements: the input and output ports, the VC Allocator, the Switch Arbiter, the Routing Relation logic block, and the Switch.

Input/Output Ports

The Router Control Blocks diagram in Figure 5 illustrates the structure of and interfaces between the main functional blocks, and shows the communication paths and state logic used in performing the routing function. White blocks represent input/output port logic; grey blocks represent functional elements described in later sections.

Input ports: The router contains 7 input ports (X+, X-, Y+, Y-, Z+, Z-, and Injection), each of which contains buffering for incoming packets. Associated with each VC of each input port is an Input Buffer State Vector (IBSV) containing the following fields:
(G) Global state {Idle, Routing, Active}
(R) Routing info {preferred direction vector}
(O) Output virtual channel port (awarded by the VC Allocator)

Output ports: Outgoing flits are received from the switch, staged, and then sent out to the downstream router. Associated with each VC of each output port is an Output Buffer State Vector (OBSV) containing the following fields:
(G) Global state {Idle, Active}
(I) Source input port (awarded by the VC Allocator)
(C) Output credits available (in the downstream router)

Figure 4 shows the traversal of a two-flit packet (head and tail) through the router.

Event             1        2        3        4
Arrive            Header   Tail
VC Allocate       Header
Switch Allocate            Header   Tail
Switch                     Header   Tail
Output                              Header   Tail

Figure 4: Timing diagram through the router.

At time 1, the packet header arrives at the input port, and an IBSV is initialized (G set to the Routing state) for the intended input VC. The header is simultaneously stored in the first flit buffer slot for this VC (buffer pointer = 1) and sent through the Routing Relation to the VC Allocator, which allocates an output VC to the packet in this cycle. At time 2, the Tail flit arrives and queues up behind the Head flit in the flit buffer. In addition, the packet header optimistically advances through the Switch while bidding for its output port at the Switch Arbiter; in this example, the Head flit's request is granted and it advances to the output port. At time 3, the Head flit passes out of the port on a wire to the next router, and the Tail flit passes through the Switch. At time 4, the Tail flit passes out of the output port.

VC Allocator Design

The job of the VC allocator is to match each requester with one of the resources it requests, satisfying as many requesters as possible; this is a bipartite matching problem. Said differently, given a set of options for each input, the allocator must pair each input with an available output such that the maximum number of inputs receive a valid output. Figure 6 illustrates the bipartite matching problem.

To approach maximal matching as quickly as possible, we use a heuristic that approximates a perfect assignment. The allocation is performed in three phases. In the first phase, the allocator calculates the number of input ports requesting each output port; this count is then sent back to each input port. In the second phase, each input port bids on the requested output port with the lowest request count. In the final phase, a matrix arbiter assigns an output port to each input port based upon the bids and last-used, last-served fairness. For example, for the requests of Figure 6, at the end of the first phase A would request 1, B would request 1, C would request 2, and D would request 3. Since A and B both want 1, the arbitration phase would choose between the two, and the one served most recently would lose. Figure 7 illustrates the three phases.

Since there are 8 channels per port, there can be as many as 56 requesters. This poses a problem for the heuristic, because we cannot complete the allocation in a single clock cycle for such a large number of bits. Thus, we add an additional phase up front that introduces a separable scheme: we reduce the number of input requests from 56 to 7, one per input port. This scheme is based on the observations that at most one header arrives on each input port at a time and that, in the usual case, a new packet is assigned a VC on its first request. The separable scheme therefore offers good performance at low complexity. Figure 8 shows a block diagram of the allocator.

The logic for the first phase of the allocator counts the number of input channels that request each output channel; these counts are passed back to the input ports. Since each bit of the 56-bit field represents one of the output channels, we can build a simple counter by counting the set bits in each slice of the input array. In the second phase, each input port bids for the output channel with the lowest count; the Min Selector outputs a 56-bit field with the bit of the lowest-count channel set. Finally, the third phase is an arbiter that resolves all contention for output ports among the input bids.
If more than one bit in any bit slice is set, the arbiter uses a round-robin token to assign the output register.
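For illustration, here is a C++ sketch of the three-phase heuristic after the separable reduction to 7 requesters. The data structures and names are ours; the real design is combinational logic, not software.

#include <array>
#include <bitset>

constexpr int kInputs  = 7;    // one request set per input port (separable stage)
constexpr int kOutputs = 56;   // 7 output ports x 8 VCs

std::array<int, kInputs> allocate(
    const std::array<std::bitset<kOutputs>, kInputs>& requests,
    std::array<int, kOutputs>& rrToken)   // per-output round-robin pointer
{
    // Phase 1: count how many inputs request each output VC.
    std::array<int, kOutputs> count{};
    for (const auto& r : requests)
        for (int o = 0; o < kOutputs; ++o) count[o] += r[o];

    // Phase 2: each input bids on its least-requested output VC.
    std::array<int, kInputs> bid;
    bid.fill(-1);
    for (int i = 0; i < kInputs; ++i)
        for (int o = 0; o < kOutputs; ++o)
            if (requests[i][o] && (bid[i] < 0 || count[o] < count[bid[i]]))
                bid[i] = o;

    // Phase 3: arbitrate conflicting bids with a round-robin token per output.
    std::array<int, kInputs> grant;
    grant.fill(-1);
    for (int o = 0; o < kOutputs; ++o) {
        int winner = -1;
        for (int step = 0; step < kInputs; ++step) {
            int i = (rrToken[o] + step) % kInputs;
            if (bid[i] == o) { winner = i; break; }
        }
        if (winner >= 0) {
            grant[winner] = o;
            rrToken[o] = (winner + 1) % kInputs;  // move token past the winner
        }
    }
    return grant;  // grant[i] = output VC awarded to input port i, or -1
}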

Figure 5: Router Control Blocks

Figure 6: Bipartite matching problem of the allocator (requesters A, B, C, and D contend for resources 1, 2, 3, and 4). If B is assigned to 1, then A cannot be assigned; likewise, only one of the requesters can receive 3, so one of the requesters must remain idle.

Figure 7: Three-phase allocator (a separable filter reduces the requested VCs; each requester queries the output VCs and bids for the resource with the lowest request count; an arbiter resolves multiply-requested resources and the granted resource is returned).

Table 8: Result Vector Meaning

A   B   Meaning
0   0   +
0   1   -
1   0   0
1   1   0

Figure 8: Block Diagram of the Allocator Logic (the input bit fields feed bit-counter logic; min selectors produce the output bids, which a round-robin arbiter resolves into the output bit fields).

Routing Relation Logic

The preferred direction vector logic calculates the preferred direction in a given dimension. The first step is to subtract the local router position from the destination position. The result is then compared with 4, the median of the number of nodes in a single dimension, to determine whether the packet should be routed in the positive or the negative direction. Table 8 summarizes the meaning of the result vector, and Figure 9 illustrates the preferred direction vector calculator.

Figure 9: Preferred Direction Vector Calculator (single dimension): the router position is subtracted from the destination position; the difference is checked for zero and compared against 4, producing the A and B bits of the result vector written to the R field of the input buffer state.
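A small C++ sketch of our reading of Figure 9 and Table 8 follows; the assignment of the A and B bits is inferred from the table rather than stated explicitly in the report.

// One dimension of the 7-ary cube: subtract the router position from the
// destination, check the difference for zero (bit A), and compare it
// against 4, the median (bit B).  A = 1 means "stay"; otherwise B selects
// the + or - direction, per Table 8.
struct ResultVector { bool a; bool b; };   // (A, B)

ResultVector preferredDirection(int routerPos, int destPos) {
    int diff = (destPos - routerPos + 7) % 7;  // position difference mod k
    ResultVector r;
    r.a = (diff == 0);   // A: difference is zero, no movement needed
    r.b = (diff >= 4);   // B: shorter to wrap the other way (negative direction)
    return r;            // (0,0) = +, (0,1) = -, (1,x) = 0
}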

Three single-dimension direction vector blocks run in parallel to calculate all three dimensions simultaneously. The calculated direction vector is written into the R field of the input buffer state of the requesting input VC.

The VC global allocation state is a global state register that keeps track of which output VCs are available for allocation. Whenever the VC allocator allocates an output VC, the corresponding bit in the global state register is cleared to 0; when a VC is freed, its bit is set back to 1.

After the preferred direction vector has been calculated, it is fed into the control logic. Since we are using a 3-cube, a flit can leave the router through at most 3 output ports. The output port selection logic masks out the output VCs belonging to output ports that cannot be taken and passes the VCs of legal output ports to the next logic block. In the escape channel trimming logic block, the VCs associated with the escape channels are examined, and the escape channels that cannot be taken yet are masked out. Our routing algorithm uses X-first, Y-next, Z-last dimension-ordered routing for the escape channels, so the trimming logic prevents two of the three escape channels from being requested.

Figure 10: VC Selection Logic (the direction vector and the VC global allocation state (49 bits) feed the output port selection logic and the escape channel trimming logic, which produce the 56-bit request vector sent to the VC allocator).

After the trimming stage, the bits that remain high represent the requested VCs, which are sent to the VC allocator. The routing logic checks which VCs are available for allocation before sending its request to the allocator, removing some of the processing burden from the VC allocator stage.
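The following C++ sketch summarizes this VC selection path (output port selection, escape trimming, and the availability filter). The port and VC numbering, and the assumption that VC 0 of each port is an escape channel, are ours.

#include <bitset>

constexpr int kVCsPerPort = 8;
constexpr int kPorts      = 7;                 // X+, X-, Y+, Y-, Z+, Z-, Eject
using VCField = std::bitset<kVCsPerPort * kPorts>;

// dirWanted[p] is true if output port p is productive for this packet
// (at most one of +/- per dimension, derived from the direction vector).
VCField selectRequestableVCs(const bool dirWanted[kPorts],
                             const VCField& vcFree /* 1 = unallocated */) {
    VCField req;

    // Output port selection: enable every VC on each productive port.
    for (int p = 0; p < kPorts; ++p)
        if (dirWanted[p])
            for (int v = 0; v < kVCsPerPort; ++v)
                req.set(p * kVCsPerPort + v);

    // Escape channel trimming: keep the escape VC only on the lowest
    // unfinished dimension (X first, then Y, then Z).
    int escapePort = -1;
    for (int dim = 0; dim < 3 && escapePort < 0; ++dim)
        for (int sign = 0; sign < 2; ++sign)
            if (dirWanted[2 * dim + sign]) { escapePort = 2 * dim + sign; break; }
    for (int p = 0; p < 6; ++p)
        if (p != escapePort)
            req.reset(p * kVCsPerPort + 0);    // VC 0 = escape VC (assumed)

    // Filter: never request a VC the allocator has already handed out.
    return req & vcFree;
}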

Figure 11: Overview of Input Port Routing Logic (a single direction vector calculation block is shared by the port; the VC selection logic is duplicated for each of the 8 input buffer state words, each sending its selection to the VC allocator).

For one input port, the logic blocks are connected together as shown in Figure 11. Only one header can arrive at an input port per cycle, so only one copy of the direction vector calculation logic is needed. However, requests for VCs can be denied, so the VC selection logic must be duplicated for each input state word. The VC selections are then sent to the VC allocator.

Switch

The switch is a 56x7 crossbar that connects all 56 input virtual channels (8 per input port) to all of the possible output ports. We chose this configuration because our design was not constrained by on-chip area. Furthermore, it allows for the simplest switch control: a simple arbitration at each output port. Finally, we wanted to provide optimistic switch arbitration, and allowing all 56 input VCs to compete at every output port was the easiest way to implement this. Figure 12 illustrates the implementation logic.

Figure 12: 56x7 Crossbar Switch using 56:1 MUXes (each input port, X+, X-, Y+, Y-, Z+, Z-, and Inject, contributes 8 channels; a 56:1 MUX under switch control drives each output port, X+, X-, Y+, Y-, Z+, Z-, and Eject).

6. C++ Network Simulator

We completed a C++ simulator of our network to provide performance analysis on both random traffic and randomly selected bit-permutation traffic. Each component of the network is abstracted into a separate C++ class, forcing strict interfaces between modules. State information is maintained as instance variables within each class, and interfaces are exercised through method invocation. The classes represent the major components of an interconnection network, as shown in Figure 13. Several implementation tool classes were also created to provide system constants, abstract lists and utilities, and messages.

The network simulator performs a synchronized time step at every node in the network, including the concentrators, for a specified number of steps. A key issue in building the simulator was handling the race condition that could arise when passing flits and credits between routers. Our simulation models the time of flight over a channel as one cycle, so we wanted to make sure that a flit was not transmitted and read in the same time step. For example, if node A executes and sends a flit to node B, and then B executes and reads its flit buffer, node B could pick up the new flit and process it incorrectly if it did not already have another flit waiting in its flit buffer. We handle the race condition by always sending one flit and one credit every cycle: if a router has nothing to send on a port in a given cycle, it sends a null credit or null flit instead. Since all buffers in the router are initialized with exactly one null flit and credit, the system maintains a synchronous state across all connections. To clarify, in the previous example, node B would not pick up node A's newly transmitted flit, because a null flit from the previous time step would already be waiting in the flit buffer and would be picked up instead.
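The null-flit/null-credit convention can be illustrated with a small C++ sketch (our illustration; the class and method names are not the simulator's actual interfaces):

#include <deque>

// A channel is modeled as a queue primed with one null entry, so a value
// written in cycle t is only read in cycle t+1, regardless of the order
// in which the two routers execute within a time step.
struct Flit   { bool null = true; /* ... payload fields ... */ };
struct Credit { bool null = true; /* ... VC tag ... */ };

template <typename T>
class Channel {
public:
    Channel() { q_.push_back(T{}); }         // primed with one null entry

    // Sender side: called exactly once per cycle, even with nothing to send.
    void send(const T& v) { q_.push_back(v); }
    void sendNull()       { q_.push_back(T{}); }

    // Receiver side: called exactly once per cycle; always returns the value
    // written in the previous cycle, never the one written this cycle.
    T receive() { T v = q_.front(); q_.pop_front(); return v; }

private:
    std::deque<T> q_;
};

// Each cycle, every router calls send(...) or sendNull() on its outgoing
// flit and credit channels, then receive() on its incoming ones.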

Figure 13: C++ Network Simulator Design (main and Network drive the Router; the router architecture classes are the Router, VC Allocator, Switch Allocator, Routing Relation, and Switch; the implementation tool classes are Utilities, List, SystemConstants, and Messages).

7. Verilog Simulator

We also implemented the main router logic in Verilog. A complete network was not constructed; however, we did implement the following modules. The following paragraphs describe our implementation of each module and its current working state.

Message Interface (msg_interface.v)
The message interface links the network infrastructure to our topology of Router modules.

Router (router.v)
This module is the core of our design. All external inputs and outputs are managed here, and all of the remaining modules are instantiated within router.v.

Input Flit Buffer (IFB_mem.v)
This module consists of 5-deep FIFOs (one per VC) for received flits.

Input Buffer State Vector (IBSV_mem.v)
This module manages the state bits for each VC of each physical input channel. It accepts commands and reports state to other modules. The preferred direction is stored here.

Output Buffer State Vector (OBSV_mem.v)
This module manages the state bits for each VC of each physical output channel. It accepts commands and reports state to other modules. Credits are stored here.

Routing Relation (rt_rel.v)
This module receives the destination address from the header flit and generates the preferred direction vector to be input to the VC Allocator.

VC Allocator (vc_alloc.v)
The VC Allocator module receives all contending VC requests from the IBSV module and performs the allocation, awarding available output VCs to up to 7 winners.

Credit Manager (cred_mgr.v)
The credit manager receives control signals from the downstream router, the OBSV_mem, the IFB_mem, and the Switch Allocator; it manages local credits and sends VC-tagged credits to the upstream router.

VC Ready Combiner (vc_rdy.v)
The combiner collects state bits from the IFB_mem, the IBSV_mem, and the OBSV_mem, and combinationally generates a uniform request vector for the Switch Allocator.

Switch Allocator (sw_alloc.v)
The Switch Allocator receives the VC-ready vector from the combiner, generates the current cycle's 7 winners (and sends upstream credits), and decrements the local credit count in the OBSV_mem. The previous cycle's 7 winners actually pass through the switch this cycle.

Crossbar Switch (sw_cbar.v)
The switch passes the previous cycle's 7 winners from the IFB_mem through the 56x7 crossbar to the physical output ports.

8. Simulation Results

Our simulator is still being debugged; we ran out of time. We will email the simulation results as an addendum as soon as they are finished.

9. Conclusion

Our team feels that this project was very helpful in forcing us to come to an understanding of the complex issues surrounding the implementation of a modern router and interconnection network. In retrospect, we should have decided on our final topology and concentration ratio sooner, so that we could have spent more time simplifying and analyzing different routing and flow-control options.

We probably designed in an excessive amount of speedup. We think we could have driven the cost of the network down by designing narrower channels and placing more on a board; the board turned out to be the limiting resource in terms of channel-width limitations. However, we designed with excess speedup to handle a wider array of permutation traffic, and we expect the network simulation results to confirm that the design handles permutation traffic well.

Appendix

List of Figures
Figure 1: Chip and Board Configuration for 256-port Torus
Figure 2: Routing Relation Logic Block Diagram
Figure 3: Resource Diagram for Virtual Channel Flow Control
Figure 4: Timing Diagram through Router
Figure 5: Router Control Blocks
Figure 6: Bipartite Matching Problem of the Allocator
Figure 7: Three-Phase Allocator
Figure 8: Block Diagram of the Allocator Logic
Figure 9: Preferred Direction Vector Calculator (single dimension)
Figure 10: VC Selection Logic
Figure 11: Overview of Input Port Routing Logic
Figure 12: 56x7 Crossbar Switch using 56:1 MUXes
Figure 13: C++ Network Simulator Design

List of Tables
Table 1: Interconnection Network Summary
Table 2: Network Specifications
Table 3: Network Constraints
Table 4: 256x256 k-ary n-cube with 4:1 Concentrators
Table 5: Overall Cost Summary
Table 6: 7-ary 3-cube Cost Summary
Table 7: Virtual Channel Specifications
Table 8: Result Vector Meaning