Implementation of Message Scheduling on TDM Virtual Circuits for Network-on-Chip


Master Thesis KTH/ICT/ECS/2008-98

Implementation of Message Scheduling on TDM Virtual Circuits for Network-on-Chip

Master of Science Thesis in Electronic System Design
by Xinyou Tian
Stockholm, September 2008

Supervisor: Dr. Zhonghai Lu
Examiners: Dr. Zhonghai Lu and Prof. Axel Jantsch

Abstract

Due to the increasing capacity and complexity of semiconductor technology, current System-on-Chip (SoC) design faces big challenges from many aspects, such as Deep Submicron (DSM) effects, global synchrony and the communication architecture. Network-on-Chip (NoC) has emerged in recent years to tackle the communication crisis within large VLSI systems implemented on a single silicon chip. How to schedule the traffic flows between modules such as processor cores, memories and specialized intellectual property (IP) blocks has become an important problem in NoC research.

This thesis presents a Time-Division Multiplexing (TDM) virtual-circuit based strategy that minimizes resource usage by exploiting the routing flexibility offered by modern NoCs. Previous scheduling solutions have two major limitations: they often do not consider all the routing flexibility of a NoC, and they make bandwidth reservations for throughput/latency-guaranteed connections that are unnecessarily conservative. To address the first point, we use a Breadth-First Search (BFS) with branch and bound routing scheme to explore all candidate paths of minimum length. For the second point, we propose a TDM approach that does not reserve bandwidth for guaranteed service permanently, but only during certain intervals. Compared with a greedy strategy, our experiments show that adding backtracking increases the number of feasible solutions by exploiting the routing flexibility of the NoC. The results also show that resource utilization is improved compared with approaches without TDM, and they illustrate the effect of the slot-table size on the number of feasible schedules.

Acknowledgments

This thesis work was carried out at the Department of Electronic, Computer and Software Systems, KTH. First of all, I would like to thank my advisor, Dr. Zhonghai Lu, for his seasoned guidance. Without his helpful suggestions, this thesis could not have been done. During my thesis project, I received many valuable suggestions from him, both on network algorithms and on C++ programming. Such knowledge also helped me reach my current position as a software designer at Ericsson. I would like to thank my examiner, Professor Axel Jantsch, who gave me the opportunity to work in his high-performance NoC research group. I also would like to thank my parents and my wife for their constant encouragement and support. Last but not least, I want to thank all of my friends in Sweden for their support from day one.

Table of Contents

Chapter 1 Introduction
  1.1 Network-on-Chip (NoC)
    1.1.1 Why Network-on-Chip
    1.1.2 Topology
    1.1.3 Switching strategy
    1.1.4 Routing algorithm
    1.1.5 Network flow control
    1.1.6 Quality of Service
  1.2 TDM virtual circuits and message scheduling
  1.3 Thesis work process and outline
    1.3.1 Thesis work process
    1.3.2 Thesis outline
Chapter 2 Problem Definition and Scheduling Method
  2.1 Problem statement
  2.2 Problem definitions and modeling
    2.2.1 Application modeling
    2.2.2 TDM communication system modeling
    2.2.3 Message scheduling with TDM
  2.3 Scheduling design flow
    2.3.1 Input
    2.3.2 Routing
    2.3.3 Scheduling
    2.3.4 Output
Chapter 3 Implementation and Experiment Results
  3.1 C++ Implementation
    3.1.1 Classes data structure
    3.1.2 Program flow
  3.2 Experiment
    3.2.1 Experiment purpose and settings
    3.2.2 Experiment results
Chapter 4 Conclusions and Future Work
  4.1 Conclusions
  4.2 Future work

List of Figures

Figure 1-1 NoC-based MPSoC architecture
Figure 1-2 Topology of 4×4 mesh network
Figure 1-3 Topology of 6×6 mesh network
Figure 1-4 Units for resource allocation
Figure 1-5 X-Y routing
Figure 1-6 Routing with branch and bound
Figure 2-1 Sample of bandwidth reservation
Figure 2-2 Contention-free routing: a network switch at slot s=1, and its slot table
Figure 2-3 The input of message streams
Figure 2-4 Routing via 3×3 mesh network
Figure 2-5 Maze
Figure 2-6 Routing and scheduling information for the message stream from N1 to N6, without message m11 delay
Figure 2-7 Routing and scheduling information for the message stream from N1 to N6, with message m11 delay
Figure 2-8 Conflict occurs when using the greedy approach
Figure 2-9 Conflict is avoided by the backtracking strategy
Figure 2-10 Output of the algorithm
Figure 3-1 Tile-based architecture
Figure 3-2 Message parameters
Figure 3-3 Vector of messages and scheduling entities
Figure 3-4 Path of the message sent from node 1 to node 8 in a 3×3 mesh network
Figure 3-5 Example of the status of the network, which records a specified switch history
Figure 3-6 Program design flow
Figure 3-7 CDFG of the greedy scheduling algorithm
Figure 3-8 Backtracking
Figure 3-9 Network for experiment
Figure 3-10 Hotspot traffic in 4×4 mesh
Figure 3-11 Percentage of feasible scheduling of hotspot traffic in 4×4 mesh
Figure 3-12 Hotspot traffic in 6×6 mesh
Figure 3-13 Percentage of feasible scheduling of hotspot traffic in 6×6 mesh
Figure 3-14 Uniform traffic in 4×4 mesh
Figure 3-15 Percentage of feasible scheduling of uniform traffic in 4×4 mesh
Figure 3-16 Uniform traffic in 6×6 mesh
Figure 3-17 Percentage of feasible scheduling of uniform traffic in 6×6 mesh
Figure 3-18 Split 1 second into 2 time slots
Figure 3-19 Size of slot table versus feasible schedules in 4×4 mesh, hotspot traffic
Figure 3-20 Size of slot table versus feasible schedules in 6×6 mesh, hotspot traffic
Figure 3-21 Size of slot table versus feasible schedules in 4×4 mesh, uniform traffic
Figure 3-22 Size of slot table versus feasible schedules in 6×6 mesh, uniform traffic

Abbreviations

BFS   Breadth-first search
DFS   Depth-first search
DSM   Deep submicron
FIFO  First in, first out
NI    Network interface
NoC   Network-on-Chip
OOP   Object-oriented programming
QoS   Quality of service
SoC   System-on-Chip
TDM   Time-division multiplexing
UML   Unified Modeling Language
VLSI  Very large scale integration
VC    Virtual channel / virtual circuit

Chapter 1 Introduction

1.1 Network-on-Chip (NoC)

1.1.1 Why Network-on-Chip

System-on-Chip (SoC) design challenges: System-on-Chip (SoC) is the idea of integrating all components of a computer or other electronic system into a single integrated circuit (chip). It may contain digital, analog, mixed-signal, and often radio-frequency functions, all on one chip. SoC is believed to be more cost effective since it increases the fabrication yield and simplifies packaging. With the development of IC technology, we can integrate more and more complex applications into a single chip; one typical approach is the embedded system. However, current SoC design faces big challenges from different aspects [1]: Deep Submicron (DSM) effects, global synchrony, communication architecture, power and thermal management, verification, productivity, etc.

Network-on-Chip as a SoC platform: The Network-on-Chip (NoC) is emerging as a communication architecture that addresses these issues by providing better structure and modularity [3, 4, 5]. NoC-based systems can accommodate the multiple asynchronous clock domains that many of today's complex SoC designs use. The NoC solution brings a networking method to on-chip communication and achieves notable improvements over conventional bus systems [6].

As shown in Figure 1-1, the NoC architecture provides the communication infrastructure for the resources. The resources in the nodes may contain processor cores, DSP cores, memory banks, specialized I/O blocks such as Ethernet or Bluetooth protocol stack implementations, graphics processors, FPGA blocks, etc. The size of a resource can range from a bare processor core to a local cluster of several processors and memories connected via a local bus.

In a NoC, resources are attached to the network interfaces (NI) of the communication network in order to communicate with other resources. The NoC platform is physically built up from switches, channels and network nodes in an ordered way. As shown in Figure 1-1, each network node is composed of one switch and one resource. In this way, it is possible to develop the hardware of the resources independently and then connect the resources via network interfaces to the channels of the network. Moreover, the scalable and configurable network is a flexible platform that can be adapted to the needs of different workloads, while maintaining the generality of application development methods and practices.

Figure 1-1 NoC-based MPSoC architecture

With the possibility to develop the hardware of resources independently, is it also possible to reuse the backbone of the network? The answer is yes. In the last decade, the reuse of processor cores was developed under the old bus architectures. When the NoC concept was introduced, a new reuse method was also developed, since NoC allows integrating an arbitrary number of resources into the network. In a NoC system, the switches, the interconnects and the lower-level communication protocols only need to be designed, optimized, verified and implemented once, because they do not change when the high-level application changes. Thus, it is possible to reuse the network in a large number of products. In an old bus-based system, adding a new resource degrades the performance of the rest of the system, because the same communication resource is now shared among more resources. In a NoC system, adding new resources also adds new communication capacity in the form of new switches and interconnects.

Although complex communication protocols and flow control are needed to ensure the discipline and scalability of a NoC system, it is important to notice that a NoC system has the potential to provide an arbitrary-composability property, with tremendous benefits for design productivity [2].

1.1.2 Topology

The topology refers to the physical structure of the network graph, for example how the network nodes (switches or routers) are physically connected. It defines the connectivity between nodes and thus has a fundamental impact on the network performance as well as on the switch structure, i.e., the number of ports and the port width. The tradeoff between generality and customization is an important issue when determining a network topology. Generality facilitates the re-usability and scalability of the communication platform; customization aims at performance and resource optimality. Most of today's NoCs use a mesh or torus as a starting point, since these are intuitive for two-dimensional systems on a silicon surface, but fully connected, fat-tree and honeycomb networks also have advantages in specific applications [1].

Two-dimensional mesh and torus topologies provide a way to route data between nodes. These two topologies are direct multi-hop networks; they allow for continuous connections and reconfiguration around broken or blocked paths by hopping from node to node until the destination is reached. In a mesh/torus network with n nodes, there are n switches. Each switch is connected to up to 4 neighboring switches and 1 resource via input and output channels. A resource is a computation or storage unit, and the switches route and buffer messages between the resources.

A two-dimensional mesh interconnection topology is the simplest way to resemble the layout of a chip. Moreover, the local interconnections between resources and switches in a two-dimensional mesh are independent of the size of the network. Routing in a mesh network is easy to manage and can result in relatively small switches, high bandwidth, short clock cycles, and overall scalability. As chip area becomes more and more expensive, it is important to use a proper topology to keep the NoC as small as possible. When a large number of resources need to be connected on chip, a two-dimensional mesh/torus topology is the best choice in terms of cost efficiency. The only topological difference between the mesh and the torus is that the torus has wires connecting the first and last switches of each row and column. The torus topology provides more communication bandwidth since its average hop count is smaller than that of the mesh; on the other hand, a torus is more complex to build and costs more than a mesh topology.

We use a mesh topology in our evaluation. Nodes located at the edge of the mesh are restricted in the links that can be used, since at least one direction is not available because of the topology. In a 3×3 mesh this holds for all nodes except one. In a 4×4 mesh there are 12 edge nodes and 4 non-edge nodes, and a 6×6 mesh has 20 edge nodes and 16 non-edge nodes (in an n×n mesh, (n-2)² nodes are interior and n² - (n-2)² lie on the edge). The ratio of edge to non-edge nodes can possibly influence the scheduling strategies. To study this effect, problem sets are generated for a 4×4 and a 6×6 mesh. Because a 3×3 mesh is too small, we did not simulate on that platform. Figure 1-2 and Figure 1-3 show the topology of the 4×4 mesh and of the 6×6 mesh, respectively.

Figure 1-2 Topology of 4×4 mesh network

Figure 1-3 Topology of 6×6 mesh network

1.1.3 Switching strategy

The switching strategy determines how a message traverses its route in the network. Circuit switching and packet switching are the two main switching techniques used in interconnection networks. In circuit switching, a connection needs to be set up between the source and destination before a packet can be delivered. In packet switching, messages that need to be routed are divided into small packets, and the packets are routed between nodes over data links shared with other traffic. In this thesis, we use the packet-switching strategy to switch the messages.

In a packet-switching network there are several resource allocation units, namely the message, the packet and the flit, and there are clear distinctions between them. As illustrated in Figure 1-4, at the top level a message is a contiguous group of bits (the complete data) that is transmitted from a source node to a destination node. Because messages may be too long, a message is divided into packets for the allocation of control state, and into flow control digits (flits) for the allocation of channel bandwidth and buffer capacity. A packet is the basic unit of routing and sequencing in packet switching. A packet consists of three parts: the head flit, the body flits and the tail flit. The head flit contains the routing information for the packet, while the tail flit contains the error-checking pattern and acts as a flag to signal the end of the packet. A packet should have a restricted length; as a consequence, the size and time duration of the resource allocation is also restricted, which is important for the performance of the packet-switching mechanism. For simplicity, the messages consist of only one flit in this thesis.

In packet switching, the data is encapsulated into packets with some header information and passed to the network for delivery, and the sender can then forget about it. It is like the mail system in real life: address the package and give it to the postman, and the mail system will deliver the package later. Examples of well-known packet-switching techniques are store-and-forward, virtual cut-through and wormhole switching [9].

Figure 1-4 Units for resource allocation

1.1.4 Routing algorithm

The routing algorithm determines the routing paths the packets follow through the network graph. It usually restricts the set of possible paths to a smaller set of valid paths. There are three main classes of routing algorithms in NoC: deterministic routing, oblivious routing and adaptive routing.

In deterministic routing, the route between a given source and destination remains the same over time. The routing path is chosen before the packets are sent into the network, and the same path is always chosen for a given pair of source and destination nodes. Deterministic routing ignores the path diversity of the network and is not sensitive to the network state. This may cause load imbalance in the network, but it is simple and inexpensive to implement. Besides, it is usually an easy way to keep the order of the network traffic. The switches are easy to implement because the routes are set by a minimal deterministic routing function, and all routing directions can be pre-implemented in the route tables of the switches. Moreover, it is easier to design the routes in the switches so as to avoid deadlocks in the network. For some topologies, simple deterministic approaches can actually balance the load as well as other minimal routing algorithms do. However, deterministic routing does not utilize the network to its full potential when the network is heavily loaded. When the messages carry the complete routing information, the routing computation time decreases, but the size of the messages increases.

One example of deterministic routing is XY routing (dimension-order routing) [9]. The XY routing strategy can be applied to regular two-dimensional mesh topologies without obstacles. The position of the mesh nodes and their attached network components is described by coordinates: the x-coordinate for the horizontal direction and the y-coordinate for the vertical direction. A packet is first routed to the correct horizontal position and then to the correct vertical position. XY routing produces minimal paths without redundancy, and it yields a deadlock-free NoC at the cost of available bandwidth. Figure 1-5 shows a simple XY route in a 3×3 mesh network, where the source node is node 1 and the destination node is node 9. XY routing finds the path {1, 2, 3, 6, 9} between them (we use the nodes along the path to represent a route in this thesis).
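As an illustration only (this is not code from the thesis), the following C++ sketch computes an XY route in an n×n mesh using the row-major node numbering of Figure 1-5; the function name and data representation are chosen for this example.

    #include <vector>

    // Return the sequence of nodes visited by XY routing from src to dst
    // in an n x n mesh whose nodes are numbered 1..n*n in row-major order.
    std::vector<int> xyRoute(int src, int dst, int n) {
        auto col = [n](int v) { return (v - 1) % n; };   // x coordinate
        auto row = [n](int v) { return (v - 1) / n; };   // y coordinate

        std::vector<int> path{src};
        int cur = src;
        // First correct the horizontal (X) position...
        while (col(cur) != col(dst)) {
            cur += (col(dst) > col(cur)) ? 1 : -1;
            path.push_back(cur);
        }
        // ...then the vertical (Y) position.
        while (row(cur) != row(dst)) {
            cur += (row(dst) > row(cur)) ? n : -n;
            path.push_back(cur);
        }
        return path;   // e.g. xyRoute(1, 9, 3) yields {1, 2, 3, 6, 9}
    }

For the example of Figure 1-5, xyRoute(1, 9, 3) reproduces the path {1, 2, 3, 6, 9}.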

Figure 1-5 X-Y routing in a 3×3 mesh (source node 1, destination node 9)

Oblivious routing, which includes deterministic routing as a subset, considers all possible paths from the source node to the destination node, but it does not take the network state into account when making routing decisions. In this thesis, a breadth-first search with branch and bound is used to find all minimum-length paths from the source node to the destination node for the messages. Oblivious routing is the main routing approach used in this thesis, and we will say more about it in a later chapter. Figure 1-6 shows routing with the branch-and-bound method in a 3×3 mesh network, where the source node is node 1 and the destination node is node 8. We obtain the set of minimum paths {1, 2, 5, 8}, {1, 4, 5, 8} and {1, 4, 7, 8}.
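To make the route enumeration concrete, the sketch below generates all minimum-length routes between two nodes of an n×n mesh. The thesis uses a breadth-first live-node list with branch and bound (Section 2.3.2); purely for brevity, this sketch uses a recursive depth-first expansion with the same pruning rule: a move is kept only if it reduces the Manhattan distance to the destination, so only minimal paths survive. All names are invented for this illustration.

    #include <cstdlib>
    #include <vector>

    // Enumerate all minimum-length routes from src to dst in an n x n mesh
    // (nodes numbered 1..n*n, row-major). Moves that do not reduce the
    // Manhattan distance to the destination are pruned (the "bound").
    void expand(int cur, int dst, int n, std::vector<int>& path,
                std::vector<std::vector<int>>& routes) {
        if (cur == dst) { routes.push_back(path); return; }
        int cx = (cur - 1) % n, cy = (cur - 1) / n;
        int dx = (dst - 1) % n, dy = (dst - 1) / n;
        int step[4]   = { +1, -1, +n, -n };                    // E, W, S, N
        bool legal[4] = { cx < n - 1, cx > 0, cy < n - 1, cy > 0 };
        for (int k = 0; k < 4; ++k) {
            if (!legal[k]) continue;
            int nxt = cur + step[k];
            int nx = (nxt - 1) % n, ny = (nxt - 1) / n;
            int before = std::abs(dx - cx) + std::abs(dy - cy);
            int after  = std::abs(dx - nx) + std::abs(dy - ny);
            if (after >= before) continue;          // bound: non-minimal move
            path.push_back(nxt);
            expand(nxt, dst, n, path, routes);
            path.pop_back();                        // undo and try the next branch
        }
    }

    std::vector<std::vector<int>> minimalRoutes(int src, int dst, int n) {
        std::vector<std::vector<int>> routes;
        std::vector<int> path{src};
        expand(src, dst, n, path, routes);
        return routes;
    }

Here minimalRoutes(1, 8, 3) reproduces exactly the three routes listed above.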

Figure 1-6 Routing with branch and bound in a 3×3 mesh (source node 1, destination node 8)

An adaptive routing algorithm uses information about the network state, typically queue occupancies, to select among alternative paths for delivering a packet. Adaptive routing best serves networks with a heavy load: packets are routed in a way that balances the load of the whole network. If congestion is detected on the routing path, the packets are routed along another, congestion-free path towards the destination node. Note that the adaptation is intended to allow as many routes as possible to remain valid in response to changes. While adding information about the network state can potentially improve routing performance, it also adds considerable complexity, and if not done carefully it can lead to performance degradation. For example, the switches become more complicated and consume more chip area. Deadlocks are harder to avoid in adaptive routing because of the complex flow control, and packets may encounter livelock during routing: packets might never reach their destination due to congestion, which causes communication failure.

1.1.5 Network flow control

Network flow control governs how packets are forwarded in the network, concerning shared resource allocation and contention resolution.

The shared resources are buffers and links (physical channels). Essentially, a flow control mechanism coordinates the sending and receiving of the packets currently being delivered. Due to the limited buffers and link bandwidth, packets may be blocked by contention. Whenever two or more packets attempt to use the same network resources, one of the packets may be stalled in place, shunted into buffers, detoured to an unfavored link, or simply dropped. For packet-switched networks, there exist bufferless and buffered flow control, and the flow control scheme may be coupled with the switching strategy. In this thesis, we use wormhole switching and buffered flow control, with one flit buffer in each direction of the switch. The flits traverse the network in a pipelined fashion.

1.1.6 Quality of Service

In the field of packet-switched networks, the traffic engineering term Quality of Service (QoS) refers to control mechanisms that can provide different priorities to different users or data flows, or guarantee a certain level of performance to a data flow in accordance with requests from the application program or the internet service provider policy. QoS guarantees are important if the network capacity is limited, for example in cellular data communication, and especially for real-time streaming multimedia applications such as voice over IP and IP-TV, since these often require a fixed bit rate and may be delay sensitive. Roughly classified, NoC researchers have proposed best-effort, guaranteed and differentiated services for on-chip packet communication. In this thesis, we establish virtual circuits with a TDM approach to serve guaranteed-service traffic; the virtual circuits are implemented by means of time slots.

1.2 TDM virtual circuits and message scheduling

A TDM (Time-Division Multiplexing) VC (Virtual Circuit) is a VC that shares buffers and link bandwidth in a time-division fashion. A VC is simplex. Once a VC is established, packets delivered on it, called VC packets, encounter no contention and have guaranteed bandwidth and latency. In this thesis, we use a branch-and-bound method to find candidate VC paths, exploiting the full path diversity [7, 11].

The messages are injected into the network periodically. Message scheduling consists of finding a start time for each message and allocating the slot tables along its path. This is only a short description of message scheduling; we discuss it further in the later chapters.

1.3 Thesis work process and outline

1.3.1 Thesis work process

This thesis project can be divided into five phases:
- Literature study and project plan, including studying the C++ programming language
- System, message and scheduling entity modeling
- Scheduling messages with different strategies and evaluation with case studies
- Case study and experiments
- Writing of the Master thesis

1.3.2 Thesis outline

Chapter 1 gives an introduction to Network-on-Chip. It presents the issues related to NoC design, such as SoC design challenges, NoC topology, routing algorithms, switching techniques, flow control and quality of service (QoS). We also give a short description of TDM VCs and message scheduling.

In Chapter 2, we state the problem of on-chip message routing and scheduling, and we define and model the application, the TDM communication system and the message. We then propose the design flow, from the input of the network, through routing and scheduling, to the output information.

Chapter 3 presents the details of the C++ implementation of the TDM scheduling. We give a specific description of the structure of the program, including the network architecture, the message modeling and the algorithm design flow. The simulation results compare message scheduling under the two different strategies, and we also present the effect of the slot-table size on the number of feasible schedules.

Chapter 4 gives the conclusions of the project and proposes future work.

Chapter 2 Problem Definition and Scheduling Method

2.1 Problem statement

Current NoCs like Æthereal [12] and Nostrum [7] use virtual circuit switching to create connections through the NoC that offer timing guarantees. Today's routing and scheduling solutions, however, (1) often do not consider all the routing flexibility of the NoC, and (2) reserve bandwidth for time/throughput-guaranteed connections in an unnecessarily conservative way [4].

To address the first point, we use a branch-and-bound algorithm to enumerate the paths of minimum length [16]. Modern NoCs do allow more flexible routing with non-minimal paths; the reason we only select this subset of all feasible paths is simplicity and network efficiency, and in most cases routing over minimum-length paths already provides enough path diversity.

To illustrate the second point, we consider a simple case, shown in Figure 2-1, with three links, link 1, link 2 and link 3, in a NoC [8, 10].

Figure 2-1 Sample of bandwidth reservation

The data streams from link 1 and link 2 want to share the same link, link 3, via the switch. With the traditional scheduling approach for guaranteed service, we would have to reserve the sum of the bandwidths of link 1 and link 2, which requires a large bandwidth on the shared link. The way to improve on the traditional method is not to reserve bandwidth for guaranteed-throughput connections permanently, for the whole execution of an active application, but only during certain intervals. That is the motivation for the proposed TDM approach, which divides the bandwidth over time intervals in order to minimize resource usage.

2.2 Problem definitions and modeling

2.2.1 Application modeling

All messages in a stream are assigned a starting time t, a message size sz and a duration d. Tasks exchange messages with each other via the data streams. The tasks are mapped onto the processors in the system; whenever multiple tasks are mapped onto a single processor, the execution order of these tasks is fixed by some scheduling methodology. Properly allocating these messages into the time slots is a major concern of this thesis.

2.2.2 TDM communication system modeling

Each node along a VC's path has N inputs, N outputs and a time-sliced slot table, which reserves time slots for input packets to use the output links. The slot table is used to:
1. avoid contention on a link,
2. divide up the bandwidth of each link between connections, and
3. switch data to the correct output.

Every slot table T has S time slots and N switch outputs. There is a logical notion of synchronicity: all switches in the network occupy the same fixed-duration slot. In a slot s, a network node (that is, a router or network interface) can read and write at most one flit per input and output port, respectively. In the next slot, (s + 1) modulo S, the network node writes the flits it read to their appropriate output ports. Flits thus propagate in a pipelined fashion and cannot deadlock. The latency a flit incurs per switch equals the duration of a slot, and the slot reservations guarantee bandwidth in multiples of the block size per S slots.

The slot-table entries map outputs to inputs for every slot: T(s, o) = i means that a flit from input i (if present) proceeds to output o in slot s and, because the table repeats, in every slot s + kS, k ∈ N. An entry is empty when there is no reservation for that output in that slot. By construction there is no contention, because there is at most one input per output in each slot [12].

Figure 2-2 Contention-free routing: a network switch at slot s = 1, and its slot table

Figure 2-2 illustrates contention-free routing with a snapshot of a switch and its corresponding slot table at slot s = 1. The size of the slot table is S = 4. The four gray arrows represent connections carrying flits. The switch forwards a flit from input iS to output oE because T(1, oE) = iS, and a flit from input iE to output oN because T(1, oN) = iE.
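A minimal sketch of such a slot table is given below, assuming an S×N table whose entry (s, o) stores the input reserved for output o in slot s, with -1 meaning the entry is empty; the class and its member names are invented for this illustration.

    #include <vector>

    // Per-switch TDM slot table: entry (s, o) stores the input reserved for
    // output o in slot s, i.e. T(s, o) = i, and -1 when the entry is empty.
    // Because the table repeats, a reservation in slot s also holds in every
    // slot s + k*S.
    class SlotTable {
    public:
        SlotTable(int numSlots, int numOutputs)
            : S(numSlots), table(numSlots, std::vector<int>(numOutputs, -1)) {}

        bool isFree(int slot, int output) const {
            return table[slot % S][output] < 0;
        }

        // Reserve output 'output' in slot 'slot' for flits arriving on 'input'.
        // Fails if another input already owns that (slot, output) entry.
        bool reserve(int slot, int output, int input) {
            if (!isFree(slot, output)) return false;
            table[slot % S][output] = input;
            return true;
        }

        void release(int slot, int output) { table[slot % S][output] = -1; }

    private:
        int S;                                // number of slots in the table
        std::vector<std::vector<int>> table;  // table[s][o] = reserved input, or -1
    };

For the switch of Figure 2-2 (S = 4), the two reservations shown would correspond to reserve(1, oE, iS) and reserve(1, oN, iE), for whatever integer encoding of the port names an implementation chooses.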

2.2.3 Message scheduling with TDM

In this thesis, we try to find a feasible schedule for messages sent, within their time constraints, between different nodes of the network. The messages are injected into the network periodically. We assume that there is only one flit in each message, so message and flit mean the same thing in this thesis. A message stream consists of several messages flowing from one task to another task [10]. A message, which specifies the timing constraints of a data communication, is defined by the following parameters:
- u: source vertex in the network
- v: destination vertex in the network
- P: the period with which the messages of the stream are sent
- s: identity of the message stream (the stream communicating between two tasks)
- n: sequence number of the message within the stream
- τ: the predefined earliest start time at which the communication may start
- Δ: the maximum allowable wait time for a slot
- δ: δ = |r| + Δ, the predefined maximum duration of the communication after the earliest start time, so τ + δ is the deadline of the communication
- id: the name of the message

The schedule of a message specifies the actual start time, the duration, the route and the slot allocation. We call a feasible schedule of a message a VC; it consists of the following factors:
- st: the slot allocated on the first switch of the path
- t: the actual start time of the message
- d: the duration of the communication on a single link; because there is only one flit in a message, d = 1
- r: the route of the message

The slot given by st is claimed on the first link of the route at time t. On the next link, the slot reservation is cyclically shifted one time unit later, i.e. to t + 1, for the same duration. The complete message is received by the destination at time t + |r|.
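To make this notation concrete, the following plain structs (a sketch only; the field names simply mirror the symbols above) collect the message parameters and the factors of a scheduling entity, together with a simple check of the timing constraint t + |r| ≤ τ + δ.

    #include <string>
    #include <vector>

    // Message parameters as defined above; a message may start no earlier than
    // tau and must be completely received no later than tau + delta.
    struct Message {
        int u;              // source vertex
        int v;              // destination vertex
        int P;              // period of the stream
        int s;              // identity of the stream
        int n;              // sequence number within the stream
        int tau;            // earliest start time
        int Delta;          // maximum allowable wait time for a slot
        int delta;          // = |r| + Delta; tau + delta is the deadline
        std::string id;     // name of the message
    };

    // A feasible schedule of a message (a VC).
    struct ScheduleEntity {
        int st;             // slot allocated on the first switch of the path
        int t;              // actual start time
        int d = 1;          // duration on a single link (one flit per message)
        std::vector<int> r; // route, given as the nodes along the path
    };

    // The message starts no earlier than tau and is received (at time t + |r|)
    // no later than the deadline tau + delta.
    inline bool withinTimingConstraints(const Message& m, const ScheduleEntity& e) {
        return e.t >= m.tau &&
               e.t + static_cast<int>(e.r.size()) <= m.tau + m.delta;
    }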

2.3 Scheduling design flow

In this part, the design flow of the TDM-based scheduling methodology for Network-on-Chip is introduced. It contains four aspects: the input of message streams, finding a route for a message, finding a schedule for the message, and outputting the successful schedules.

2.3.1 Input

The input of the simulation is a set of message streams. We designed a message generator to create the message streams periodically. Here we assume that there is only one flit in each message, so message and flit mean the same thing. A message stream has a number of messages flowing from one task to another task. Figure 2-3 shows the input of the simulation: a set of message streams, where each stream i contains the finite sequence of messages mi1, mi2, ..., mim.

Figure 2-3 The input of message streams

2.3.2 Routing

We use breadth-first search (BFS) with branch and bound to find all the minimum routes for a message that starts at vertex u and ends at vertex v in a 2D network. The solution space is searched in a breadth-first manner beginning at a start node, the source node. This start node is both a live node and the E-node (expansion node). From the E-node we try to move to a new node; if we can move to a new node from the current E-node, we do so. Each live node becomes an E-node exactly once. When a node becomes an E-node, all new nodes that can be reached by a single move are generated. Generated nodes that cannot possibly lead to an (optimal) feasible solution are discarded (i.e., those nodes die). The remaining nodes are added to the list of live nodes, and then one node from this list is selected to become the next E-node.

The selected node is extracted from the list of live nodes and expanded. This expansion process continues until either the answer is found or the list of live nodes becomes empty [16]. There are two common ways to select the next E-node (though other possibilities exist): First In, First Out (FIFO) and Least Cost / Max Profit. The FIFO scheme extracts nodes from the list of live nodes in the same order as they were put into it, so the live-node list behaves as a queue. In this thesis, we do not discuss the Least Cost / Max Profit approach in detail.

Figure 2-4 Routing via a 3×3 mesh network

Consider the 3×3 mesh network given by the matrix of Figure 2-5(a). We will search this maze using the solution architecture of Figure 2-4.

Figure 2-5 Maze

The search begins at the source, position (1,1), which is the only live node at this time. It is also the E-node. To avoid going through this position again, we set maze(1,1) to 1. From this position we can move to either (1,2) or (2,1). For the particular instance we are dealing with, both moves are feasible, as the maze has a 0 in each position. Suppose we choose to move to (1,2); maze(1,2) is set to 1 to avoid going through it again. The status of the maze is then as in Figure 2-5(b). At this time we have two live nodes, (1,1) and (1,2), and (1,2) becomes the E-node.

From the current E-node, three moves are possible in the graph of Figure 2-4. Two of these moves are infeasible, as the maze has a 1 in those positions. The only feasible move is to (1,3). We move to this position and set maze(1,3) to 1 to avoid going through it again. The maze of Figure 2-5(c) is obtained, and (1,3) becomes the E-node. The graph of Figure 2-4 indicates two possible moves from the new E-node. Neither of these moves is feasible, so the E-node (1,3) dies and we back up to the most recently seen live node, which is (1,2). No feasible moves from here remain, and this node also dies. The only remaining live node is (1,1). This node becomes the E-node again, and we have an untried move that gets us to position (2,1). The live nodes now are (1,1) and (2,1). Continuing in this way, we reach position (3,3). At this time the list of live nodes is (1,1), (2,1), (3,1), (3,2), (3,3); this list also gives the path to the destination.

2.3.3 Scheduling

After we have obtained a route, we try to find a scheduling entity for each message that needs to communicate. Here is a simple example for a message stream sent from node 1 to node 6 in the mesh network shown in Figure 2-4. After the route (R1, R2, R3, R6) has been fixed, we check the slot tables along the path to see whether there are free slots for scheduling the message. We begin with the source node N1 and check whether s = 0 is free; then we check s = 1, s = 2 and s = 3 at N2, N3 and N6, respectively. Finally, we allocate the first slot in the slot table of router R1. When the message gets to R2, we allocate the second slot in the slot table of router R2, and we do the same until the message reaches the destination node N6.

Figure 2-6 shows the routing and scheduling information, without delay, for the message stream from N1 to N6. As shown in the figure, 4 messages are sent periodically (P = 2) within the stream, and the size of the slot table is 8. Message m11 starts at time 0 and occupies slot s = 0 at node 1. When it reaches node 2, the time is 1 and the occupied slot is shifted one time unit, to s = 1. This process continues until it reaches the destination node, if no conflict occurs.

Figure 2-6 Routing and scheduling information for the message stream from N1 to N6, without message m11 delay

If there is a conflict, for example as shown in Figure 2-7, slot 0 in R1 for m11 is occupied by another message. In that case we delay the start time of m11 by one time unit and then check whether slot s = 1 in N1 is free; if so, we check s = 2 in N2, and likewise the other nodes along the path. Finally, we allocate the slot-table entries for m11; here we allocate s = 1 in N1.

Figure 2-7 Routing and scheduling information for the message stream from N1 to N6, with message m11 delay

To find the scheduling entities, we also introduce the following constraints for the scheduling of the messages [10]:
1. u = src(r): the communication takes place from the correct source node.
2. v = dst(r): the communication ends at the correct destination node.
3. t >= τ: the message does not start before its earliest start time.
4. t + |r| <= τ + δ: the communication ends before the deadline.
5. The number of slots reserved in the slot tables by entity e between t and t + d must be greater than 0, to ensure that enough slots are reserved to send the message.
6. f(e1, l) != f(e2, l): two scheduling entities never use the same slot on the same link, i.e. they are contention free; here f(e, lk) denotes the slot that entity e uses from the slot table of the k-th link on its route r.
7. t1 + d1 < t2 and t1 + |r1| < t2 + |r2|: the ordering of the messages within a stream is preserved.

Given a set of messages, a scheduling strategy finds a scheduling entity e for each message, as sketched below.
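As a sketch of how these constraints can be checked for one candidate route (the names and data layout are assumed for this illustration only), the function below tries the start times t = τ, τ+1, ..., τ+Δ and tests whether slot (t + k) mod S is free at the k-th node of the route, which is exactly the one-slot-per-hop shift described in Section 2.3.3.

    #include <vector>

    // used[v][s] == true means that slot s in the slot table of node v is
    // already reserved (a full implementation would also track the reserved
    // output, as in the SlotTable sketch of Section 2.2.2).
    // 'route' is the list of nodes along the path; the flit occupies slot
    // (t + k) mod S at the k-th node, i.e. reservations shift one slot per hop.
    // Returns the first feasible start time in [tau, tau + Delta], or -1 if
    // none exists; together with delta = |r| + Delta this enforces the timing
    // constraints 3 and 4 and the contention-free requirement.
    int findStartTime(const std::vector<int>& route,
                      const std::vector<std::vector<bool>>& used,
                      int S, int tau, int Delta) {
        for (int t = tau; t <= tau + Delta; ++t) {
            bool free = true;
            for (int k = 0; k < static_cast<int>(route.size()); ++k) {
                if (used[route[k]][(t + k) % S]) { free = false; break; }
            }
            if (free) return t;
        }
        return -1;
    }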

First, a greedy approach is proposed, which gives a quick solution and is easy to implement. Then we add backtracking to the greedy method, which improves the scheduling quality but at the same time makes the design more complex.

Greedy: Using the above constraints, we schedule the messages with a greedy approach. The greedy strategy always tries to schedule the earliest arriving message on the shortest route. This strategy explores only a small part of the solution space; as a result it has a small run time, but it may miss solutions or find non-optimal ones in terms of resource usage. For each message, we try to find a scheduling entity (that is, to fix the values of the parameters st, t and r) as follows:
1. The scheduling strategy should avoid sending data in bursts, as this increases the chance of congestion. Therefore, the start time t of e is set to the earliest possible time respecting the third and last constraints above: select a route r and set t = τ.
2. Set d = 1, because in our experiments only single-flit messages are used.
3. Select a route r from the set of routes R, which consists of all the minimum routes between the source and destination node.
4. Select a set of slots st that offers sufficient room to send the message. It is possible that no set of slots can be found that offers enough room to send the message within the timing constraints; in that case, the next route in R must be tried.
5. Now all factors of e (st, t, r) have been set; the new set of scheduling entities becomes E ∪ {e}.
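The five steps above can be condensed into the following loop. This is only an outline of the strategy, not the thesis implementation (Chapter 3 describes that); it builds on the Message, ScheduleEntity, minimalRoutes() and findStartTime() sketches given earlier, and it assumes the messages are already ordered by earliest start time.

    #include <vector>

    // Mark the slots used by entity e: slot (t + k) mod S at the k-th node of
    // its route (cf. Section 2.3.3).
    void reserveAlongRoute(const ScheduleEntity& e,
                           std::vector<std::vector<bool>>& used, int S) {
        for (int k = 0; k < static_cast<int>(e.r.size()); ++k)
            used[e.r[k]][(e.t + k) % S] = true;
    }

    // Greedy scheduling loop: for each message, try the minimal routes in turn
    // and reserve the first route that admits a feasible start time.
    bool scheduleGreedy(const std::vector<Message>& msgs,
                        std::vector<std::vector<bool>>& used,
                        int S, int meshN,
                        std::vector<ScheduleEntity>& E) {
        for (const Message& m : msgs) {
            bool placed = false;
            for (const auto& route : minimalRoutes(m.u, m.v, meshN)) {
                int t = findStartTime(route, used, S, m.tau, m.Delta);
                if (t < 0) continue;          // no room on this route, try the next in R
                ScheduleEntity e;
                e.t  = t;
                e.st = t % S;                 // slot claimed on the first switch
                e.r  = route;
                reserveAlongRoute(e, used, S);
                E.push_back(e);               // E = E u {e}
                placed = true;
                break;
            }
            if (!placed) return false;  // the greedy strategy gives up here; the
                                        // backtracking variant described next would
                                        // instead release the entity of the previous
                                        // message, reroute it, and retry this message.
        }
        return true;
    }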

Backtracking: This strategy inherits from the greedy strategy discussed above, and it guarantees that all problems feasible for the greedy strategy can also be solved with this approach. Moreover, as soon as a conflict occurs, that is, no scheduling entity e_i can be found for a message m_i, a previously created scheduling entity e_{i-1} for message m_{i-1} is removed from the set of feasible scheduling entities E; another path is selected for m_{i-1} and it is scheduled again. After rescheduling m_{i-1} we check whether a feasible schedule for m_i now exists. If yes, we go on to schedule the next message; otherwise we keep rescheduling m_{i-1} until its last available path has been tried. If we still cannot obtain a feasible solution for message m_i, we reschedule the message before m_{i-1}, and so on, until we obtain a feasible schedule. After that, the messages whose scheduling entities were removed are re-scheduled in last-out, first-in order. On a new conflict, the backtracking mechanism is activated again.

We use the following example to show the difference between the greedy and backtracking strategies. Figure 2-8 shows a conflict that occurs when message m_{i-1} (from node 1 to node 5) and message m_i (from node 1 to node 3) compete for the link marked in red. Because of the time constraints of the two messages, they have to use the link at the same time. With the greedy approach, there is no feasible schedule for message m_i after message m_{i-1} has been scheduled.

Figure 2-8 Conflict occurs when using the greedy approach (paths of m_i and m_{i-1} in a 3×3 mesh)

As soon as we introduce the backtracking strategy, the scheduling entity for message m_{i-1} is removed from the set of feasible schedules, and m_i is scheduled before m_{i-1}. As can be seen from Figure 2-9, a time slot is allocated for m_i on the link between node 1 and node 2.

Later on, we reschedule message m_{i-1} and choose another path for it, since the link between node 1 and node 2 is occupied at the time m_{i-1} also wants to use it. Since there is no conflict on the newly selected path, we allocate the slot for m_{i-1} and continue to schedule the next message.

Figure 2-9 Conflict avoided by the backtracking strategy (paths of m_i and m_{i-1} in a 3×3 mesh)

We can conclude from these two figures that the backtracking strategy can yield more feasible solutions for message scheduling, but at the same time the rescheduling and recalculation of routes considerably increase the complexity of the process and the computation time.

2.3.4 Output

The output of the simulation is a set of successful scheduling entities, with the factors (st, t, r), for the input message streams. Each input message/flit should have a corresponding output scheduling entity. Figure 2-10 shows the output of the algorithm: the scheduling entity associated with each corresponding message.

Figure 2-10 Output of the algorithm: each message m_ij of stream i is associated with a scheduling entity e_ij

Chapter 3 Implementation and Experiment Results

3.1 C++ Implementation

3.1.1 Classes data structure

In this project, we implemented the scheduling methodologies in C++, an object-oriented programming (OOP) language. The thesis work requires an OOP language that can implement the algorithms efficiently and allows functions to be reused, so we chose C++ as the implementation language [17]. We defined three classes, graph, message and schedule, to model the network, the messages and the scheduling entities, respectively. First, we built a general network platform, using the main program to specify the size and type of the topology. Then, we modeled the messages with a set of parameters denoting the properties of each message. Finally, the class schedule was defined to find the scheduling entities.

Class Graph

This is a fairly general graph class, although it does not have every bell and whistle one might hope for. We templatize Graph on two types: one is the type used to represent a vertex, and the other is the type of the values associated with each edge. Note that we do not always need values associated with edges (not all graphs are weighted), which suggests that this interface has more functionality than certain uses require. The NoC topology graph construction in the C++ program is not limited to mesh topologies, but we stick to the mesh topology to be specific.
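To make the description concrete, the following is a minimal sketch of such a templated graph class; it reuses the member names that appear later in this chapter, but it is an illustration, not the actual thesis code.

    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    // Graph templatized on the vertex type and on the value attached to each
    // edge; an unweighted graph can simply instantiate EdgeValue with a dummy
    // type.
    template <typename Vertex, typename EdgeValue>
    class Graph {
    public:
        void InsertVertex(const Vertex& v) { adj[v]; }            // add an isolated vertex
        void InsertEdge(const Vertex& from, const Vertex& to,
                        EdgeValue value = EdgeValue()) {
            adj[from].push_back(std::make_pair(to, value));       // directed edge
        }
        std::size_t Num_vertices() const { return adj.size(); }
        std::size_t Num_edges() const {
            std::size_t n = 0;
            for (const auto& kv : adj) n += kv.second.size();
            return n;
        }
    private:
        // Adjacency list: vertex -> list of (neighbour, edge value) pairs.
        std::map<Vertex, std::vector<std::pair<Vertex, EdgeValue>>> adj;
    };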

The regular tile-based NoC architecture [13] was proposed to mitigate these complex on-chip communication problems [14, 15]. A network-on-chip is used to interconnect the different tiles. As shown in Figure 3-1, a tile-based architecture consists of a grid of regular tiles, where each tile can be a general-purpose processor, a DSP, a memory subsystem, etc. A switch is embedded within each tile to connect it to the neighboring tiles, and the connections between different tiles are called links. This structure is easy to build and has good symmetry, which leads to well-controlled electrical parameters. Moreover, these well-controlled properties help reduce power dissipation and propagation delay. Together, these advantages make this structure very promising.

Figure 3-1 Tile-based architecture (3×3 grid of tiles, numbered 1(1,1) to 9(3,3))

Based on the properties of the tile-based architecture, a mesh network structure was implemented. There are two basic elements and several related parameters in the mesh implementation. The two basic elements are nodes and edges; the related parameters include the flit buffers, the estimated delay, the size of the mesh, the size of the slot table and the routing algorithm. Nodes are the same as tiles in the tile-based architecture, and edges are the interconnections between the nodes. To simplify the implementation of the edges, each bi-directional edge was split into two unidirectional edges. As shown in Figure 3-1, splitting the bi-directional edges helps reduce the difficulty of the C++ implementation and also reduces the complexity of the routing strategy. Although a bi-directional edge is split into two unidirectional edges, the bandwidth of each edge remains the same.

This is based on the assumption that traffic in different directions does not affect each other; the estimated delay property is treated under the same assumption.

The size of the mesh is described as follows. The topology of the mesh network extends in two dimensions, horizontal and vertical. In the C++ program, the horizontal and vertical sizes are set to be equal, e.g. 3×3 or 4×4. We use Mesh_size, an integer, to represent the total number of nodes, and N to represent the number of nodes in one dimension, where N = sqrt(Mesh_size).

The main functions are:
- InsertVertex(): inserts the Mesh_size nodes of the network.
- InsertEdge(): inserts an edge between two nodes.
- Num_vertices(): returns the total number of nodes in the network.
- Num_edges(): returns the total number of edges.
- InsertCord(): inserts the coordinates of the vertices.
- Route_Candidate(): returns the number of all candidate routes and puts all the routes into a container.
- InsertSlot_East(), InsertSlot_West(), InsertSlot_North(), InsertSlot_South() and InsertSlot_Dest(): insert the slot-table buffers for the current vertex; there are 5 buffers for the 5 directions of each switch (east, west, north, south and destination).

The algorithm that builds the mesh is as follows:

    // Loop over all the nodes in the network (numbered 1..Mesh_size)
    for (int i = 1; i <= Mesh_size; i++) {
        /* insert the edges vertically */
        if (i <= Mesh_size - N) {
            InsertEdge(i, i + N);
            InsertEdge(i + N, i);
        }
        /* insert the edges horizontally */
        if (i % N != 0) {
            InsertEdge(i, i + 1);
            InsertEdge(i + 1, i);
        }
    }

Class Message

This is a simple class; we just define some values to represent the parameters of a message:
- u_src: source node of the message
- v_dst: destination node of the message
- Period: the period with which the messages are sent within the stream
- s_id: identity of the stream
- t_start: the earliest time at which the message may start to communicate
- delta: the predefined maximum waiting time for a slot
- nth: sequence number of the message within the stream
- mid: identity of the message in the stream

All the parameters of a message are read from a text file, as Figure 3-2 shows.

Figure 3-2 Message parameters (source node, destination node, period, message id, sequence number, delta, start time, stream id)
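As an illustration of how such a file could be read line by line into the message container described next, the sketch below uses a simplified stand-in struct with the fields listed above; the actual class and the exact column order of the input file may differ from what is assumed here.

    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Simplified stand-in for the Message class described above; the real class
    // and the exact field order in the text file may differ.
    struct MessageRecord {
        int u_src, v_dst, Period, s_id, t_start, delta, nth, mid;
    };

    // Read the message file line by line and store the messages in a vector,
    // which serves as the message container of Figure 3-3.
    std::vector<MessageRecord> readMessages(const std::string& filename) {
        std::vector<MessageRecord> container;
        std::ifstream in(filename);
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream fields(line);
            MessageRecord m{};
            if (fields >> m.u_src >> m.v_dst >> m.Period >> m.s_id
                       >> m.t_start >> m.delta >> m.nth >> m.mid)
                container.push_back(m);
        }
        return container;
    }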

The messages are read in line by line and stored into a message container, as Figure 3-3 shows, before scheduling. We use a vector of messages as the data structure for the message container.

Figure 3-3 Vector of messages and scheduling entities (each message m_ij is paired with its scheduling entity e_ij)

Class Schedule

The definition of this class is not complex; two scheduling algorithms constitute the main part of the class: greedy scheduling and greedy scheduling with backtracking. Both methods try to find a set of scheduling entities for the input messages. The output scheduling entities are also put into a container, as shown in Figure 3-3, which is a vector of scheduling entities. The main functions are:
- Scheduling(): schedules the messages with the greedy strategy.
- Scheduling_backtracking(): finds the schedules for the messages with backtracking.

To monitor the status of the network, we print the scheduling information and the network information into two text files. Figure 3-4 shows an example in which a message is sent from node 1 to node 8 in a 3×3 mesh network, starting from time 0. The first column shows the nodes along the path, the second column shows the arrival time at each node, and the last column shows the slot-table allocation at the corresponding node.

Figure 3-4 Path of the message sent from node 1 to node 8 in a 3×3 mesh network (columns: node along the route, arrival time at each node, slot allocated)

Figure 3-5 shows another example of the status of the network, which records the history of a specified switch, including the time, the slot allocation, the corresponding message, and its switching information. From this file we can see which messages passed the specified switch, when each one arrived, where it came from and where it went.

Figure 3-5 Example of the status of the network, recording the history of a specified switch (columns: time, slot allocated, message passed, upstream node of the message, downstream node of the message)