Final Year Project Report Spring Multi-Processor System Design on FPGA


Final Year Project Report, Spring 2005-2006
Multi-Processor System Design on FPGA
Tarek Darwish, Sany Kabbani, Acile Sleiman
Supervisor: Dr. Mazen Saghir

Table of Contents

Table of Contents
List of Figures
1. Introduction
Related Work
Automated Platform Generator
Three Levels of Specification
Automated XPS Project Files Generator
System Level Inputs
Register Transfer Level Outputs
Implementation Problems
Hardware Design
General View of the System
Connections within a single FPGA
System Implementation
Creating a simple design using XPS
Addition of peripheral
Creating a peripheral
FSL Timing Constraints
Connection of the peripheral
Inter-Processor Communication
Communication through Ethernet
Communication through FSL
Unified Communication Library
Header File Description
C File Description
Blocking and Non-Blocking Send
Blocking and Non-Blocking Receive
Testing
Customized Hardware Unit
Multiplying two matrices
Matrix Distribution among MicroBlazes
Matrix Multiplication Using Co-Processors
Implementation of NonBlockingSend() in our library
Problems and Future Work
Evaluation of Design and Constraints
Appendix A: Technical Background
A.1. Field Programmable Gate Array (FPGA)
A.2. MicroBlaze: a soft reconfigurable microprocessor core
A.3. Types of Buses
On Chip Connections
Off Chip Connections
Appendix B: The Message Passing Interface
Point to Point Communication
Blocking Send
Appendix C: Codes
References

List of Figures

Figure 1: System Design Flow
Figure 2: Bus Topology
Figure 3: Ring Topology
Figure 4: Tree Topology
Figure 5: Mesh Topology
Figure 6: Hardware Design
Figure 7: Connections within a Single FPGA
Figure 8: FSL Write Operation
Figure 9: FSL Read Operation
Figure 10: EMAC Control Register
Figure 11: Ethernet Frame Format
Figure 13: Virtex-II V2MB
Figure 14: MicroBlaze Architecture
Figure 15: LMB

1. Introduction

The exponentially growing number of transistors on chip predicted by Moore's law has opened up the possibility of implementing increasingly complex systems on chip. These systems require high performance levels and low power consumption. In multiprocessor systems, each processor functions at a moderate speed and the overall throughput of the system is increased by the multiplicity of processors. Even though a single processor runs at higher clock speeds, the cost of designing a multiprocessor system-on-chip remains smaller than that of designing a single processor [1]. The multiplicity of processors, as well as the importance of re-configurability in error correction and system enhancement, has made the design and implementation of a multiprocessor system on a Field Programmable Gate Array (FPGA) a hot research topic. The specification of a multiprocessor system needs to be described at the lowest level in order to be downloaded onto the FPGA. Commercial state-of-the-art tools take as input a higher level description of the system, the Register Transfer Level (RTL), and convert it to code that can be downloaded onto the FPGA. Nevertheless, the increasing complexity of today's applications and systems makes creating an RTL description of these computationally demanding systems inadequate. Therefore, there is a need to move to a higher level of specification. However, this creates a gap between the traditionally used RTL level and the higher level of implementation, the system level. The main purpose of the project proposed in the fall term was to put forward a tool that automates the transition between the system level and the RTL level. This tool would enable designers to build complex multiprocessor systems at a higher level, saving time and effort.
However, upon attempting to implement a multiprocessor system on the Virtex-II FPGA boards available in the lab, we reached the conclusion that it is not possible to accommodate more than one processor on each FPGA board due to the lack of BRAM blocks. Therefore, the creation of a platform generator for multiprocessor systems on FPGA was not possible. Nevertheless, the amount of logic available on the chip allows adding customized hardware units that emulate the functionality of a MicroBlaze. Moreover, it is still possible to build multiprocessor systems whose Processing Elements (PEs) are implemented on multiple FPGAs that are distributed over multiple boards. We therefore decided to approach the issue of communication among these different types of processors spanning several FPGA boards and connected through a variety of interconnect types.

Within a single FPGA, PEs can be interconnected using a number of commercial or proprietary interconnects such as the IBM CoreConnect On-Chip Peripheral Bus (OPB) and the Xilinx Fast Simplex Link (FSL). Larger systems that span multiple FPGA boards can be connected through well established or emerging high-speed links such as Ethernet or RocketIO. This variety in the types of interconnects increases the complexity of implementing inter-processor communication. As the number of processors, the types of topologies and the choice of interconnects grow, the task of efficiently and reliably implementing point-to-point communication becomes cumbersome. From the perspective of an application programmer, the underlying system architecture and interconnection topology should remain abstract, and the programmer should be presented with a unified programming interface and model. We have therefore developed a programming interface which provides the programmer with unified MPI-like send and receive operations.

The system built for testing the unified communication library consists of multiple FPGA boards connected through Ethernet. Each board hosts a MicroBlaze (MB) and three customized hardware units connected to the MicroBlaze through FSL. Therefore, upon sending data from one processor to another, there is a need to determine the location of each processor, as well as the type of connection between them. Depending on the bus type, different send and receive operations are to be used. These operations differ in syntax, operands, and functionality. The library provides the user with a unified send/receive set of instructions which can be used to send data between any two processors, regardless of the topology or interconnection types.

This report is divided into seven main sections describing the design choices, implementation and analysis. Section 2 presents an overview of the related work conducted in the field of multi-processor system design on FPGA. Section 3 describes the automated platform we were planning to implement in the fall term and explains the problems we faced. Section 4 presents and explains the design of the multiprocessor system created in the spring term. The details of implementing this system using the Xilinx Platform Studio (XPS) are described in Section 5. Section 6 provides a detailed explanation of inter-processor communication mechanisms through FSL and Ethernet. A detailed description of the unified communication library created in this project is available in Section 7. Section 8 provides an overview of the application and multiprocessor system created to test the library. The last two sections analyse the design choices and constraints and present the problems we faced as well as propositions for future work.
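To make the idea of choosing an operation by processor location concrete, the following is a minimal sketch in C, not the library's actual code. It assumes the system described above (one MicroBlaze per board, hardware units attached only to their local MicroBlaze). The names pe_t, link_t and select_link are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a processing element (PE). */
typedef struct {
    int board_id;       /* FPGA board hosting the PE */
    int is_microblaze;  /* 1 = MicroBlaze, 0 = customized hardware unit */
} pe_t;

/* Type of direct link between two PEs (illustrative names). */
typedef enum { LINK_NONE, LINK_FSL, LINK_ETHERNET } link_t;

/* Return the direct interconnect between src and dst, assuming one
 * MicroBlaze per board and hardware units that are attached only to
 * the MicroBlaze on their own board. */
link_t select_link(const pe_t *src, const pe_t *dst)
{
    assert(src != NULL && dst != NULL);

    if (src->board_id != dst->board_id) {
        /* Only MicroBlazes own an Ethernet controller in this model. */
        return (src->is_microblaze && dst->is_microblaze)
                   ? LINK_ETHERNET : LINK_NONE;
    }
    /* Same board: a MicroBlaze talks to a hardware unit over FSL. */
    if (src->is_microblaze != dst->is_microblaze)
        return LINK_FSL;

    /* Same kind of PE on the same board: no direct link in this model. */
    return LINK_NONE;
}
```

A unified send operation could call such a function once and then dispatch to the FSL or Ethernet routine. A LINK_NONE result would mean the message has to be forwarded through a MicroBlaze.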

2. Related Work

The design of multiprocessor systems on an FPGA is a hot research topic since it is ideal for implementing parallelizable applications efficiently. The implementation of a multiprocessor IPv4 forwarding router on a Xilinx Virtex-II FPGA showed that when no specialized hardware exists, a soft multiprocessor solution allows a quick and cost-effective deployment for many applications [2]. This 2-port router was composed of 6 MicroBlazes. The routing table was stored in the on-chip Block RAM. The router achieved a throughput of 1.8 Gbps. This soft multiprocessor solution was compared to a software implementation on the Intel IXP2800 network processor, a state-of-the-art multiprocessor specialized for packet forwarding applications. The result showed that the soft multiprocessor solution lost only 2.6X in performance compared to a specialized programmable implementation. The problem, however, remains the limited number of Block RAMs available on chip. This was a limiting factor in the work of Hubner, Paulsson and Becker, who created a system of five MicroBlazes on a single FPGA but were not able to add more MicroBlazes due to the lack of BRAMs on chip [3]. These processors were connected together using Fast Simplex Link buses. Each MicroBlaze had a direct connection to on-chip local block RAM through local memory buses. The MicroBlaze was also connected to a timer through the OPB bus. The communication between MicroBlazes was controlled by routing algorithms that were part of an operating system, the Xilkernel, that ran on each processor. A solution to the BRAM problem would be to design systems that span multiple boards. The ESPAM tool, developed at the Leiden Embedded Research Center (LERC), is similar to the Automated Platform Generator [4] described in Section 3.
It was created by K. Huang and J. Gu as part of their Master's thesis at Leiden University, The Netherlands, in August. This tool takes as input an application that is parallelized and divided into processes, and generates the program code files needed by the Xilinx Platform Studio commercial synthesizer to generate gate-level code downloadable on an FPGA. ESPAM also uses MicroBlaze processors on a Virtex-II board. Moreover, one of the most recent papers on multiprocessor systems on FPGAs [5] compared the efficiency of five different network topologies in terms of logic utilization, logic distribution (area), maximum clock frequency, number of nets, and place and route time. It showed that for the same size of FPGA, the maximum difference between 16-node ring, star, mesh and hypercube topologies is less than 8% of the total routing resources used, and that they can all be mapped to run at about 180 MHz, the maximum speed of the processor. Even a fully-connected topology of up to 16 nodes was implemented, but with a much higher overhead and a slight loss in performance. The paper also mentions that an automated platform was created to help in the design of multiprocessor hardware. This small difference between topologies will encourage programmers to test their applications on different topologies. This increases the importance of having a library that allows the programmer to implement inter-processor communication in a unified way regardless of the underlying topology. Such a library improves scalability by making the send/receive operations transparent across different types of interconnects.

3. Automated Platform Generator

The applications currently used to implement simple systems do not scale well when dealing with more complex structures. The objective during the fall term was to facilitate the design of multiprocessor systems on chip. In this section we present in detail the problem faced and a possible solution.

3.1. Three Levels of Specification

The design of state-of-the-art systems includes the following three levels of specification: the System-Level specification, the RTL-Level specification, and the Gate-Level specification. These levels are illustrated in Figure 1. The Gate level is the lowest specification level. It describes the functionality of the logic blocks on the FPGA and how they are connected. The gate level specification is usually described in bitstreams. It is very hard to manually write the description of an application at this level. Thus the need for a higher level of description arises. This higher level is the Register Transfer Level (RTL). The specification at this level is written by the designer in a hardware description language such as Verilog or VHDL. This hand-coded design made it possible to describe hardware in a human-readable language and increased the level of abstraction. Simple embedded systems consisting of one processor and several hardware peripherals can be implemented using this level of specification. This new level of specification was not created to replace the gate-level specification. In fact, synthesizers translate VHDL code into a lower level that is downloadable on the FPGA. Several commercial synthesizers can be found, such as the EDK and Platform Studio from Xilinx. The Embedded Development Kit (EDK) and the Platform Studio integrate both hardware and software implementations on the FPGA. They also include automatic generation of device drivers, application code, and Board Support Packages for the Virtex-II model.
This traditional hand-coded design is not efficient enough to keep up with the complexity of new design practices. The complexity of the applications and platforms used in state-of-the-art system designs makes creating RTL descriptions of these systems inadequate. These new systems consist of an increasing number of processing elements. Manually describing the whole system and connecting the processing elements becomes much slower and more susceptible to errors. Moreover, the logic simulation traditionally used to test and verify a complex design represented using RTL has become extremely time-consuming and costly [4]. Therefore, there is a need to use a higher level of abstraction in designing the system specifications, since the use of the low level RTL specification is a cumbersome task. This higher level, called the System level, would only require a general description of the system at hand. The designer would only need to specify the processing elements and their interconnection, without the need to manually connect functional blocks together. For example, the user would specify a system consisting of three processors connected together through a shared OPB bus without having to connect each processor to this bus and other peripherals.

This new level of abstraction creates a gap, called the Implementation Gap [4], between the traditional low level RTL and the System-Level. Since there is no available commercial tool to fill this gap, there is a need to find an automated, systematic way to make this transition in an effective and efficient manner. The tool we are creating takes system level inputs and translates them to RTL level specifications that are mapped into a gate level bitstream by the currently available synthesizer, XPS.

Figure 1: System Design Flow

3.2. Automated XPS Project Files Generator

The Automated XPS Project Files Generator shown in Figure 1 is the tool that we will develop in order to cover the gap between the system level and the RTL level specifications.

The inputs to this program consist of the following system level specifications:
- Number of processors (up to four)
- Topology (bus, ring, star, tree, mesh)
- Bus types (OPB or FSL)
- Processor type for each processor (MicroBlaze or other)
- Memory interface for each processor (on-chip BRAM or off-chip DDR)

The outputs are the four files that define an XPS project:
- Xilinx Microprocessor Project (*.xmp)
- Microprocessor Hardware Specification (*.mhs)
- Microprocessor Software Specification (*.mss)
- User Constraint File (*.ucf)

The following sections describe these inputs and outputs in detail.

System Level Inputs

The system level specification shown in Figure 1 consists of the topology and the processor specifications entered by the user. An important function of the tool we are developing is to test the performance of a certain application running on different multiprocessor implementations. The user can use this platform to test an application on several inter-processor connection topologies. In other words, the Automated XPS Project Files Generator takes as input the topology to be used, the type of buses used in this topology, the number of processors, and the size and type of the memory of each processor. All these parameters constitute the system level specification. There are many restrictions that need to be taken into account. First of all, the Virtex-II V2MB1000 Development Board we are using has a limited number of gates and Block RAMs. Therefore, the number of processors is limited by these two constraints. We are planning to restrict the number of processors on the FPGA to four. As discussed in [3], it is possible to implement four processors on the FPGA board we are using. However, exceeding this number will eventually lead to a lack of BRAM space. Moreover, the topology chosen by the user as well as the bus types entail new restrictions on the number of processors.
We can interconnect the MicroBlazes using two types of buses: OPBs or FSLs. Each MicroBlaze has eight FSL ports and eight OPB ports. Therefore, if the user wishes to implement a certain topology, the number of processors will be restricted by the number of FSL or OPB ports available, depending on the type of buses specified. In fact, each MicroBlaze can be connected to a maximum of four other MicroBlazes using FSLs and eight MicroBlazes using OPBs. These restrictions will be discussed in detail later.
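The port limits above can be turned into a simple feasibility check. The following is a rough sketch in C, not part of the tool. It takes the limits of four FSL neighbours and eight OPB neighbours from the text, and assumes a worst-case neighbour count per topology (for example, a tree whose root is connected to every other processor). All type and function names are hypothetical.

```c
#include <assert.h>

typedef enum { BUS_OPB, BUS_FSL } bus_t;
typedef enum { TOPO_BUS, TOPO_RING, TOPO_TREE, TOPO_MESH } topo_t;

/* Neighbour limits quoted in the text: 4 over FSL, 8 over OPB. */
int max_neighbours(bus_t bus)
{
    return bus == BUS_FSL ? 4 : 8;
}

/* Worst-case number of direct connections one processor needs in a
 * system of n processors (assumed model, for illustration only). */
int required_degree(topo_t topo, int n)
{
    assert(n >= 1);
    switch (topo) {
    case TOPO_BUS:  return 1;                  /* one attachment to the shared bus */
    case TOPO_RING: return n > 2 ? 2 : n - 1;  /* two neighbours once n > 2 */
    case TOPO_TREE: return n - 1;              /* root linked to all other nodes */
    case TOPO_MESH: return n - 1;              /* fully connected */
    }
    return -1; /* unknown topology */
}

/* 1 if every processor has enough ports for the topology, else 0. */
int topology_fits(topo_t topo, bus_t bus, int n)
{
    return required_degree(topo, n) <= max_neighbours(bus);
}
```

Under these assumptions, a four-processor mesh needs three neighbours per processor and fits on both bus types, while a six-processor mesh would exceed the FSL limit.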

The topologies to be implemented are the bus, ring, tree and mesh architectures. They are described in detail below.

The Bus architecture shown in Figure 2 connects all processors via a single bus. This bus is the backbone of the network. Each processor is connected to it through an interface connector. If processor A wants to send a message to processor B, it broadcasts the message on the bus and all processors can see the data. However, only processor B actually reads and processes the data [6]. This topology does not restrict the number of processors to be used. The maximum number of processors that can be implemented using a bus topology is defined by the maximum number of processors our tool implements, which is four. A shared memory model is also supported, but many memory accesses by different processors will keep the bus busy and thus slow down communication.

Figure 2: Bus Topology

The Ring topology shown in Figure 3 connects each processor to exactly two neighbours. There is a unique direction of data transfer, either clockwise or counter-clockwise [6]. In order to communicate with processor D, processor A sends the message in the direction of data transfer. The message will be received by the next processor in the ring and forwarded until it reaches its final destination. Therefore, only the recipient will actually process the data. The fact that each processor has exactly two neighbours imposes no restrictions on the number of processors that can be used. Some enhancements can be added to the ring topology. For example, different segments of the ring can be used simultaneously. Furthermore, with the use of FSLs, another ring can be added to create a bi-directional ring.

Figure 3: Ring Topology

The Tree topology is a concatenation of several star topologies. The controller processor is the parent and the processors connected to it are children.
Moreover, each of the children can itself be a parent to other children, as shown in Figure 4. In this topology, a processor can be either a parent or a child or both. If the processor is the root of the tree, then it has the same restrictions that were discussed in the star topology. If the processor is a leaf, then it has only one other processor connected to it, its parent. Finally, if a processor is both a child and a parent, then the number of children it can support is restricted. Since it is a child itself, it has one parent processor connected to it. Therefore, the number of children it can have cannot exceed seven using OPBs, and it cannot exceed three using FSLs. Since in our case the maximum number of processors is four, there is no need to add restrictions. The maximum number of children that a root can have is three and the maximum number of children that a child can have is two. Therefore, both OPBs and FSLs can be used to implement the tree architecture.

Figure 4: Tree Topology

The Mesh topology is a fully connected topology, as shown in Figure 5. Every processor is connected to all other processors. Therefore, this topology requires a lot of connections between processors. Since in our case the maximum number of processors is four, each processor needs to be connected to three other processors. Therefore, FSLs or OPBs can be used without further restrictions.

Figure 5: Mesh Topology

In conclusion, the number of processors is restricted by the topology and the bus types. However, since we have restricted the number of processors to four due to the lack of memory space on the FPGA, all topologies are realizable using both OPBs and FSLs. The Automated XPS Project Files Generator's inputs discussed previously are user defined. Therefore, there is a need to create a user interface in order to prompt the user for the inputs. Although our current design only supports homogeneous multiprocessor implementations (MicroBlazes), we have chosen to take the processor type as an input. We are planning to investigate upgrading our tool in order to support heterogeneous multiprocessor systems. Depending on the complexity of such a design and on our progress during the spring term, we will decide whether to use this input or not. Another important point is that even though we have set the maximum number of processors to four, one can upgrade the tool by increasing this number. In this case, one should resort to off-chip DDR. Our tool allows the user to connect each processor to different memory types.

Register Transfer Level Outputs

The outputs of our Automated XPS Project Files Generator are the four files (*.xmp, *.mhs, *.mss, *.ucf). These files describe the number of processors, their types, their memory interfaces and their interconnections. All the components of the system are listed and all connections between them are specified. These files describe a complete XPS project. Inputting these files into the XPS will generate the gate level netlist used to program the FPGA. Therefore, after generating these files, the XPS is used to generate a bitstream and download it onto the FPGA through the serial port [4]. To summarize, our tool takes the design of complicated systems to a higher level of abstraction. This level is less time consuming and easier to manipulate since it frees the designer from RTL details. The designer can always change the system using the XPS. In fact, the tool only generates a general platform describing the system, where application specific customization might be needed.

Implementation Problems

Upon implementing the platform generator we faced memory problems that forced us to change our design. The FPGA used in our design is the Virtex-II XC2V1000 [13]. It is located on the V2MB1000 development board, which includes a 16M x 16 DDR memory, two clock sources, an RS-232 port, and additional support circuits. This FPGA can support up to 90 Kbytes (KB) of Block RAM (BRAM) [14]. These BRAMs are the memory used by the soft processor implemented on the FPGA, namely the MicroBlaze. Including several MicroBlazes on a single FPGA requires each processor to have a separate BRAM of size 8 KB, 16 KB, 32 KB or 64 KB. In order for the executable file and C code of a processor to fit in its BRAM, the size of each MicroBlaze's BRAM cannot be less than 64 KB.
However, both the Virtex-II XC2V1000 and Virtex-II Pro XC2VP7 [15] boards have a limited number of BRAMs, and thus a maximum of one MicroBlaze can be implemented on each FPGA. Thus we cannot host multiple processors on one FPGA, and this defeats the purpose of our platform generator. If only one MicroBlaze fits onto an FPGA, then the complexity of the system is reduced and there is no need for a platform generator. We have tried several ways to solve this problem. First of all, we considered purchasing another FPGA development board with a higher gate density and more BRAM blocks, to be able to fit more than one MicroBlaze on it. We looked up alternative boards on the Xilinx and Memec Design websites. This solution turned out to be costly, and we would have had to wait around a month to get the board. This was not possible since we would lose a lot of time.
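The BRAM limit described above comes down to an integer division. The short C sketch below is only an illustration: it uses the figures quoted in the text (90 KB of BRAM on the XC2V1000, 64 KB per MicroBlaze) and ignores any BRAM consumed by other parts of the design. The function name is hypothetical.

```c
#include <assert.h>

/* Number of MicroBlazes whose local memories fit in the available
 * BRAM, assuming each processor needs its own bram_per_cpu_kb block
 * and nothing else on the chip uses BRAM. */
int max_microblazes(int total_bram_kb, int bram_per_cpu_kb)
{
    assert(total_bram_kb >= 0);
    assert(bram_per_cpu_kb > 0);
    return total_bram_kb / bram_per_cpu_kb; /* integer division rounds down */
}
```

With the values from the text, max_microblazes(90, 64) evaluates to 1, which matches the conclusion that only one MicroBlaze fits on the FPGA.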

We considered other alternatives, such as using the off-chip SDRAM to store the executable file of the C code for the MicroBlaze. However, this would mean that each instruction is fetched from an external memory, which is very time consuming. It would defeat the purpose of speed-up, which is a goal of a multiprocessor system. Therefore we could not create a multiprocessor system of MicroBlazes on one FPGA. This obliged us to consider an alternative for a multiprocessor system design. There were two alternatives to overcome this problem: creating a customized hardware unit to emulate the work of a MicroBlaze, and connecting several boards together. The customized hardware unit would be a VHDL-coded peripheral that emulates the work of a MicroBlaze in a specific application. It would act as a co-processor to speed up a certain computation. Its advantage is that it is written in VHDL and therefore takes some of the FPGA's gate resources but does not need any BRAM. We can thus have one MicroBlaze and several hardware units on the same FPGA. We can also connect several boards together, having one MicroBlaze on each FPGA. The processors will be connected in this case through other interconnects such as Ethernet. This would allow having a multiprocessor system on distributed FPGAs. However, a platform generator is still not needed in this case, since each FPGA needs its own platform to be downloaded, and the system on each FPGA still consists of one MB and is therefore simple. The complexity of the system on each FPGA is not increased, yet the larger system contains several processors.

4. Hardware Design

4.1. General View of the System

The hardware design of a multiprocessor can be viewed as a system consisting of edges and nodes. The nodes are the Processing Elements (PEs) while the edges are the interconnections between them. In general, the PEs vary from a soft CPU like the MicroBlaze, to a hardwired processor like the PowerPC, to a customized hardware unit that can be described using VHDL. The advantage of the MicroBlaze over the PowerPC is its reconfigurability. The MicroBlaze can be customized according to the needs of the application, and some of its data path or control functions can be removed [7]. The specialized hardware units are not only customized according to the application but also perform arithmetic functions in hardware faster than software. The system we built is shown below:

Figure 6: Hardware Design

The PEs in our system span three FPGA boards. Three boards were chosen to demonstrate the scalability of the system. The PEs chosen are the MicroBlaze and a specialized hardware unit called MatrixMult. Since it is important to have several PEs on the same board but it is not possible to fit several processors on the same FPGA, we decided to create specialized hardware units. The edges in our system can be separated into two groups, depending on whether the connection is within a single FPGA or between several boards.

The most widely used interconnection buses within a single chip are the OPB and the FSL. The OPB interface provides a slow connection to both on-chip and off-chip peripherals and memory [8]. It usually includes a parameterized OPB Arbiter that makes sure that the communication on the OPB runs smoothly. On the other hand, the FSL bus provides a point-to-point communication channel between an output FIFO and an input FIFO [9]. The FSL buses are uni-directional, non-arbitrated, dedicated communication channels. Although only up to 8 master and slave FSL interfaces are available on the MicroBlaze, this type of connection provides high clock and data rates compared to the OPB, and thus it was chosen in our system to connect the MicroBlaze to the hardware units.

Connecting several boards together can be done through well established or emerging high-speed links such as Ethernet or RocketIO. Ethernet is the well known IEEE 802.3 protocol [10]. It is a frame-based computer networking technology, connected to the processor on the FPGA by the OPB. The importance of Ethernet is that we can connect several FPGAs together using a hub and provide a 10 Megabits per second (Mbps) connection. On the other hand, the RocketIO Multi-Gigabit Transceiver (MGT) is designed to operate at any baud rate in the range of 622 Mbps to 3.125 Gbps per channel, depending on the standard used [11]. In our design the boards were connected through a hub using the Ethernet protocol because of the absence of documentation on the RocketIO and the limited number of available boards that support this new technology. The hub adds scalability to the project by allowing the user to connect several boards without additional configuration. Alternatively, the boards could be connected through a switch, yet this would require the user to perform the network configuration.

4.2. Connections within a single FPGA

Within a single FPGA, there are several additional connections needed but not mentioned in the section above.
Several controllers are connected to one shared bus, the OPB. The MicroBlaze has two different connections to the OPB due to its Harvard Architecture Model [7]: it exchanges data and instructions with the OPB as a master on two separate connections. To debug the hardware and/or software created, an XMD debugger should be connected to the OPB as a slave. Another device connected to the OPB as a slave is the RS232 interface, which sends information from the board to the computer, allowing us to read the obtained results on a HyperTerminal through a serial port connected to the computer. The EMAC controller is also connected to the OPB as a slave. This controller is the interface between the MicroBlaze and the Ethernet port [10]. Finally, the interrupt controller is also connected as a slave on the OPB. All of these modules are allocated address ranges. Given the memory-mapped IO property of the system, these address ranges are used whenever the modules are accessed.
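As an illustration of memory-mapped IO, the sketch below shows in C how an address could be matched against the range allocated to each OPB slave. The address values are invented for the example and are not the ones assigned in our project (the real ranges are listed in the .mhs file). The names addr_range_t, opb_map and decode_address are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One address range allocated to an OPB slave. */
typedef struct {
    const char *name;
    uint32_t base;  /* first address of the range */
    uint32_t high;  /* last address of the range (inclusive) */
} addr_range_t;

/* Hypothetical address map: the values are placeholders only. */
static const addr_range_t opb_map[] = {
    { "debug_module",         0x80000000u, 0x800000FFu },
    { "rs232",                0x80000100u, 0x800001FFu },
    { "interrupt_controller", 0x80000200u, 0x800002FFu },
    { "ethernet_mac",         0x80010000u, 0x80013FFFu },
};

/* Return the index in opb_map of the module that owns addr,
 * or -1 if the address is not mapped to any module. */
int decode_address(uint32_t addr)
{
    size_t i;
    for (i = 0; i < sizeof opb_map / sizeof opb_map[0]; i++) {
        assert(opb_map[i].base <= opb_map[i].high); /* well-formed range */
        if (addr >= opb_map[i].base && addr <= opb_map[i].high)
            return (int)i;
    }
    return -1;
}
```

On the real system this decoding is performed in hardware by the bus logic. Software simply reads or writes a pointer into the peripheral's range.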

In addition to the main OPB bus, two additional buses exist on the same board: the LMB and the FSL described previously. The LMB, or Local Memory Bus, is a fast local bus for connecting the MicroBlaze instruction and data ports to high-speed peripherals, primarily on-chip BRAMs (Block Random Access Memory). In our design two LMBs are used for connecting the main MicroBlaze to one BRAM block. Finally, six FSL buses are added to connect the MicroBlaze to the three hardware units. The figure below describes the hardware on a single FPGA.

[Figure 7 block labels: BRAM, LMB, MicroBlaze, FSL (x3), MatrixMult (x3), OPB Bus, XMD Debugger, RS232, EMAC Controller, Interrupt Controller]

Figure 7: Connections within a Single FPGA
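Because each FSL is a uni-directional FIFO channel, a two-way connection between the MicroBlaze and one hardware unit needs two links, which is consistent with six FSLs for three units. The C sketch below is a software model of one such FIFO with non-blocking write and read. It is an illustration only and is neither the Xilinx FSL interface nor our library code. The depth of 16 words and all names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define FSL_DEPTH 16u  /* assumed FIFO depth, for illustration */

/* Software model of one uni-directional FSL channel. */
typedef struct {
    uint32_t data[FSL_DEPTH];
    unsigned head;   /* index of the oldest stored word */
    unsigned count;  /* number of words currently stored */
} fsl_fifo_t;

void fsl_init(fsl_fifo_t *f)
{
    assert(f != 0);
    f->head = 0;
    f->count = 0;
}

/* Non-blocking write: returns 1 on success, 0 if the FIFO is full. */
int fsl_try_put(fsl_fifo_t *f, uint32_t word)
{
    assert(f != 0);
    if (f->count == FSL_DEPTH)
        return 0;
    f->data[(f->head + f->count) % FSL_DEPTH] = word;
    f->count++;
    return 1;
}

/* Non-blocking read: returns 1 and stores the oldest word in *word,
 * or returns 0 if the FIFO is empty. */
int fsl_try_get(fsl_fifo_t *f, uint32_t *word)
{
    assert(f != 0 && word != 0);
    if (f->count == 0)
        return 0;
    *word = f->data[f->head];
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count--;
    return 1;
}
```

A blocking variant would simply retry fsl_try_put or fsl_try_get until it succeeds. In the model, the master side only ever calls the put function and the slave side only the get function.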

5. System Implementation

5.1. Creating a simple design using XPS

The Xilinx Platform Studio was used to create the hardware system of our design. This program offers the option to create systems using a wizard, the Base System Builder (BSB) Wizard. One of the first design choices prompted by the Wizard is the target development board. Our hardware is designed to be downloaded on the Memec Virtex-II V2MB1000 Development Board with P160 Comm2 Module, Revision 3. This board supports only the MicroBlaze as a processor since the hardwired PowerPC is not available on the board. The frequency of the processor and the bus is chosen to be equal to the reference clock frequency which is MHz. A software and hardware debugger, the XMD, is also added to the system, with 64 KB of BRAM for local data and instruction memory. The XMD provides the option of downloading the C code from a terminal connected serially to the board. In addition, the cache is disabled. Next, a list of IO interfaces and ports is provided, from which several devices were chosen, such as the RS232. The Ethernet Controller is found on the P160 Comm2 module. The interrupt of the Controller is enabled whereas the DMA is disabled. The final step of the Wizard is to specify the standard input and output (STDIN and STDOUT), which in our system are the RS232. No sample C application was created to test memory or peripherals. The following points summarize the steps needed to create the system:

1- Create a new project using Base System Builder
2- Select the appropriate board
3- Select the MicroBlaze as the processor for the system and configure it as follows:
   a. Set the clock frequency and the processor-bus clock frequency
   b. Select the program download mode as XMD with s/w debug stub
   c. Set the data and instruction memory each to be 64 KB
   d. Disable the cache
4- Configure some IO interfaces that allow us to view the results. In our case, we will need:
   a. RS232 interface
   b. EMAC controller with no DMA and interrupt driven
5- Select the STDIN and STDOUT peripherals to be RS232
6- Generate the system in order to obtain the system data files

The files that are generated are the system.ucf file (constraint file), the system.mhs file (microprocessor hardware specification file), and the system.mss file (microprocessor software specification file). Upon generating the hardware designed by the BSB Wizard, all connections, address allocations and parameter settings are created. External connections are shown in the .ucf file. The external ports included in the file are the clock and reset pins of the system and several ports of the Ethernet controller. A detailed description of each signal can be found in the EMAC controller datasheet.

    #####################################################################
    ##
    ## This system.ucf file is generated by Base System Builder based on
    ## the settings in the selected Xilinx Board Definition file. Please
    ## add other user constraints to this file based on customer design
    ## specifications.
    #####################################################################
    ##
    Net sys_clk_pin LOC=B11;
    Net sys_rst_pin LOC=B6;
    ## System level constraints
    Net sys_clk_pin TNM_NET = sys_clk_pin;
    TIMESPEC TS_sys_clk_pin = PERIOD sys_clk_pin ps;
    Net sys_rst_pin TIG;
    ## FPGA pin constraints
    Net fpga_0_rs232_rx_pin LOC=B7;
    Net fpga_0_rs232_tx_pin LOC=A7;
    Net fpga_0_ethernet_mac_phy_tx_er_pin LOC=C21;
    Net fpga_0_ethernet_mac_phy_tx_clk_pin LOC=C11;
    Net fpga_0_ethernet_mac_phy_rx_clk_pin LOC=G22;
    Net fpga_0_ethernet_mac_phy_crs_pin LOC=J19;
    Net fpga_0_ethernet_mac_phy_dv_pin LOC=D22;
    Net fpga_0_ethernet_mac_phy_rx_data_pin<0> LOC=E21;
    Net fpga_0_ethernet_mac_phy_rx_data_pin<1> LOC=E22;
    Net fpga_0_ethernet_mac_phy_rx_data_pin<2> LOC=F21;
    Net fpga_0_ethernet_mac_phy_rx_data_pin<3> LOC=F22;
    Net fpga_0_ethernet_mac_phy_col_pin LOC=J20;
    Net fpga_0_ethernet_mac_phy_rx_er_pin LOC=C22;
    Net fpga_0_ethernet_mac_phy_tx_en_pin LOC=L19;
    Net fpga_0_ethernet_mac_phy_tx_data_pin<0> LOC=L20;
    Net fpga_0_ethernet_mac_phy_tx_data_pin<1> LOC=K18;
    Net fpga_0_ethernet_mac_phy_tx_data_pin<2> LOC=K20;
    Net fpga_0_ethernet_mac_phy_tx_data_pin<3> LOC=K19;
    Net fpga_0_ethernet_mac_phy_mii_clk_pin LOC=G21;
    Net fpga_0_ethernet_mac_phy_rst_n_pin LOC=K17;
    Net fpga_0_ethernet_mac_phy_mii_data_pin LOC=D11;

In addition to these external ports, several internal connections are described in the .mhs file. The clocks and resets of the MicroBlaze, debug module, RS232, EMAC controller, OPB and FSL are all connected to the system clock and system reset respectively. The interrupt controller has one port connected to the MicroBlaze and another to the Ethernet controller. This connection enables the EMAC to be interrupt driven.

All the parameters of the modules in the system can be modified according to the application at hand. In our system, some parameter changes needed to accommodate the added peripheral will be explained later.

5.2. Addition of peripheral

Creating a peripheral

The Create/Import Peripheral tool is used to create a template for a new EDK peripheral. The template created is compliant with the port and parameter interface of the EDK. The peripheral can be attached to the OPB, FSL or PLB bus. In the hardware design of our project the hardware is connected to the FSL. This new IP could have an input FSL only, or it could have both an input and an output to the FSL. The size of the input and output FIFO buffer should be set to a value that is equal to the FSL FIFO depth that can be as low as 1 or as high as In our design the depth is set to 16 to avoid lengthy bitstream creation.

The tool used to create the peripheral generates a folder having the same name as the hardware under the pcore folder. This folder includes three folders: data, devl and hdl. The data folder contains two important files, the .pao and the .mpd.

- The pao (Peripheral Analyze Order) file lists all the VHDL files of the unit in the order in which they will be synthesized, so that component architectures can be resolved.
- The mpd (Microprocessor Peripheral Description) file contains the external interface of the core to the MicroBlaze.

The hdl folder contains a vhdl folder that has all the VHDL files related to the pcore.

When the peripheral is created, two important tasks should be done. First, the VHDL description of this peripheral should be modified according to the hardware unit that we are creating. Second, the peripheral should be added to the hardware system using Add/Edit Cores like any other IP core. It is important to note that while changing the VHDL description of the hardware, the handshaking functionality of the peripheral with the FSL should not change. In fact, there are several timing constraints that should be met when designing hardware connected to the FSL.

FSL Timing Constraints

Writing to the FSL bus is controlled by the FSL_M_Write signal. When FSL_M_Write is set to 1 at the current clock edge, the data and control signals, FSL_M_Data and FSL_M_Control, are pushed onto the FSL FIFO on the next rising clock edge. The timing diagram in Figure 8 depicts a back-to-back write operation on the FSL bus. The FIFO_Data and FIFO_Control signals denote the data and control signals of the FSL FIFO. When the FSL FIFO is full, the FSL_M_Full signal is set to 1 [9].
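The write-side behaviour described above can be mimicked in software. The following C sketch is only a behavioural model and is not Xilinx code: the type fsl_fifo and the functions fsl_fifo_init, fsl_fifo_full and fsl_clock_write are hypothetical names, the depth of 16 is the value chosen in our design, and ignoring a write while the FIFO is full is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FSL_DEPTH 16u  /* FIFO depth chosen in this design */

/* Hypothetical behavioural model of the master side of one FSL FIFO. */
typedef struct {
    uint32_t data[FSL_DEPTH];     /* models FIFO_Data */
    bool     control[FSL_DEPTH];  /* models FIFO_Control */
    size_t   head;                /* index of the oldest stored word */
    size_t   count;               /* number of words currently stored */
} fsl_fifo;

void fsl_fifo_init(fsl_fifo *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Models FSL_M_Full: true when no further word can be accepted. */
bool fsl_fifo_full(const fsl_fifo *f)
{
    return f->count == FSL_DEPTH;
}

/* Models one rising clock edge on the master side. If m_write was
 * asserted and the FIFO is not full, the data and control values are
 * pushed. Returns true if the word was stored (assumption: a write
 * attempted while full is simply ignored). */
bool fsl_clock_write(fsl_fifo *f, bool m_write, uint32_t m_data, bool m_control)
{
    assert(f != NULL);
    if (!m_write || fsl_fifo_full(f))
        return false;
    size_t tail = (f->head + f->count) % FSL_DEPTH;
    f->data[tail] = m_data;
    f->control[tail] = m_control;
    f->count++;
    return true;
}
```

Sixteen back-to-back writes fill the model, after which fsl_fifo_full() reports 1 and further writes are rejected until a word is read.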

Figure 8: FSL Write Operation

The read end of the FIFO always contains the last unpopped data and control signals, FSL_S_Data and FSL_S_Control. When the FIFO is empty, the FSL_S_Exists signal is set to 0; otherwise it is set to 1. When the slave-side peripheral of the FSL has finished reading the data, the FSL_S_Read acknowledge signal must be set to 1 for one clock cycle. The data and control values are then popped out of the top of the FIFO [9].

Figure 9: FSL Read Operation

Connection of the peripheral

The main peripheral added to our system is a hardware unit named MatrixMult. The VHDL code of this core was not able to meet the timing constraint of the system. Thus there was a need to create and add a new core, a clock divider. This core is a simple peripheral that takes the system clock as input and outputs a divided clock. After adding the two peripherals, MatrixMult and ClkDivider, the following steps were completed:

1- Change the number of FSL links of the MicroBlaze to 3 in order to connect three MatrixMult hardware units.
2- Add six FSL buses.
3- On three FSL buses, connect the MicroBlaze FSL link as a Master and the MatrixMult as a Slave; on the remaining three, connect the MicroBlaze as a Slave and the MatrixMult as a Master.
4- Change the parameter of the FSL buses to asynchronous mode.
5- Connect the clock port of the MatrixMult to the divided clock.
6- Connect the reset port of the FSL buses to the net_gnd pin. Note: make sure not to connect the reset port of the FSL to the system reset.
7- Connect the FSL_M_Clk of the FSL to the system clock whenever the master on the bus is the MicroBlaze, and to the divided clock whenever the master is a hardware unit.
8- Connect the FSL_S_Clk of the FSL to the system clock whenever the slave on the bus is the MicroBlaze, and to the divided clock whenever the slave is a hardware unit.

Upon completion of these steps, the bitstream of the hardware system described earlier is ready to be generated. The next step is to create a software project associated with the MicroBlaze. For each built-in peripheral used in the project, a folder is created containing all the C and header files of the functions related to the peripheral. These folders are created under the folder libsrc found in MicroBlaze_0. The source file is then added to the project. Each C file and header file included in the source file is copied to a new folder, include, under MicroBlaze_0. However, if the programmer needs to create a new C or header file that will be called by the source file, it should be created in the root folder of the project. At this point, all that remains is to write the program that will be downloaded on the FPGA.

6. Inter-Processor Communication

In order to implement our unified send and receive instructions, there is a need to understand the different inter-processor communication mechanisms. The type of communication between two processors depends on the type of each processor, the type of interconnection between them, and the underlying topology. The focus of our project is communication between several MicroBlazes through Ethernet, as well as communication between a MicroBlaze and a hardware unit through FSL.

6.1. Communication through Ethernet

Sending data between MicroBlazes connected through Ethernet is implemented using the following Xilinx send and receive operations [10]:

    XEmac_FifoSend(XEmac *InstancePtr, Xuint8 *FramePtr, Xuint32 Size);
    XEmac_FifoRecv(XEmac *InstancePtr, Xuint8 *FramePtr, Xuint32 *ByteCountPtr);

InstancePtr is a pointer to the XEmac instance to be worked on, which needs to be initialized before starting to send or receive data through Ethernet. In the XEmac_FifoSend() instruction, FramePtr is a pointer to a 32-bit aligned buffer containing the frame that needs to be sent and Size is the total size, in bytes, of the frame including the header. In the XEmac_FifoRecv() instruction, FramePtr is a pointer to a 32-bit aligned buffer into which the received Ethernet frame will be copied, and ByteCountPtr is both an input and an output parameter. It is a pointer to a 32-bit word that contains the size of the buffer on entry into the function and the size of the received frame on return from the function [10].

Prior to using these send and receive functions, the EMAC controller needs to be initialized, the MAC address set and the interrupt controller started. The code shown below illustrates the necessary initializations prior to sending frames through Ethernet.
    /*************************** Constant Definitions *************************/
    #define EMAC_HDR_SIZE 14 /* size of Ethernet header */
    #define MAC_ADDR_SIZE 6  /* size of MAC address */
    #define MAX_FRAME_SIZE 1500
    #define MAX_FRAME_SIZE_IN_WORDS ((MAX_FRAME_SIZE / sizeof(Xuint32)) + 1)
    #define EMAC_BASEADDR #define INTC_BASEADDR 0x40c x

    /**************************** Function Prototypes *************************/
    static void BuildFrame(Xuint8 *FramePtr, int Size);

    /************************* Function Prototypes Added **********************/
    static void FifoRecvHandler(void *CallBackRef);
    static void FifoSendHandler(void *CallBackRef);
    static void ErrorHandler(void *CallBackRef, XStatus Code);
    static XStatus SetupInterruptSystem(XEmac *InstancePtr);

    /**************************** Variable Definitions ************************/
    static Xuint8 LocalAddress[MAC_ADDR_SIZE] = { 0x06, 0x06, 0x07, 0x08, 0x09, 0x04 };
    static Xuint8 FriendAddress[MAC_ADDR_SIZE] = { 0x04, 0x09, 0x08, 0x07, 0x06, 0x05 };
    static XEmac Emac;
    static int Global_length;
    static volatile int flag; /* set inside the receive handler, polled in main code */
    static Xuint8 TxFrameBuf[MAX_FRAME_SIZE];
    static Xuint8 RxFrameBuf[MAX_FRAME_SIZE];

    /*********************************** MAIN *********************************/
    int main ()
    {
        //initializing the Emac device
        XEmac *InstancePtr = &Emac;
        Xuint16 DeviceId = XPAR_EMAC_DEVID; /* from xparameters.h */
        XStatus Result;
        Result = XEmac_Initialize(InstancePtr, DeviceId);
        if (Result != XST_SUCCESS)
            return -1;

        //setting the EMAC control register
        Xuint32 setting_control;
        setting_control= ;
        XEmac_mWriteReg(EMAC_BASEADDR, XEM_ECR_OFFSET, setting_control);

        //setting the MAC address
        XEmac_mSetMacAddress(EMAC_BASEADDR, LocalAddress);

        //Set the FIFO callbacks and error handler.
        XEmac_SetFifoSendHandler(InstancePtr, InstancePtr, FifoSendHandler);
        XEmac_SetFifoRecvHandler(InstancePtr, InstancePtr, FifoRecvHandler);
        XEmac_SetErrorHandler(InstancePtr, InstancePtr, ErrorHandler);

        //Connect to the interrupt controller and enable interrupts
        Result = SetupInterruptSystem(InstancePtr);
        if (Result != XST_SUCCESS)
            return -1;

        //Start the device, which enables the transmitter and receiver
        Result = XEmac_Start(InstancePtr);
        if (Result != XST_SUCCESS)
            return -1;
    }

The first part shows the constant definitions, function prototypes and variable definitions used in the code. The MAC addresses of the source and destination processors need to be defined in the C code. The transmit and receive buffers, which hold the data sent or received, also need to be defined.
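The Xilinx send and receive functions expect a 32-bit aligned buffer, whereas C gives no alignment guarantee for a plain byte array. One way to obtain the alignment, and presumably the reason why MAX_FRAME_SIZE_IN_WORDS is defined, is to declare the buffers as arrays of 32-bit words and to access them through byte pointers. The sketch below uses the standard uint8_t and uint32_t types instead of the Xilinx types so that it compiles off-target; the names TxFrameWords, RxFrameWords, tx_frame_bytes, rx_frame_bytes and is_word_aligned are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FRAME_SIZE 1500
#define MAX_FRAME_SIZE_IN_WORDS ((MAX_FRAME_SIZE / sizeof(uint32_t)) + 1)

/* Word arrays are aligned for 32-bit accesses by construction. */
static uint32_t TxFrameWords[MAX_FRAME_SIZE_IN_WORDS];
static uint32_t RxFrameWords[MAX_FRAME_SIZE_IN_WORDS];

/* Returns 1 if the pointer is aligned on a 32-bit boundary. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p % sizeof(uint32_t)) == 0;
}

/* Byte view of the transmit buffer, suitable for filling in a frame. */
uint8_t *tx_frame_bytes(void)
{
    assert(is_word_aligned(TxFrameWords));
    return (uint8_t *)TxFrameWords;
}

/* Byte view of the receive buffer. */
uint8_t *rx_frame_bytes(void)
{
    assert(is_word_aligned(RxFrameWords));
    return (uint8_t *)RxFrameWords;
}
```

The frame is then built through the byte pointer exactly as with a byte array, while the pointer handed to the driver is known to be word aligned.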

The first step in the code consists of initializing the XEmac driver by using the XEmac_Initialize(InstancePtr, DeviceId) function. InstancePtr is a pointer to the XEmac instance to be worked on and DeviceId is the unique id of the device controlled by this XEmac instance. Passing in a device id associates the generic XEmac instance with a specific device, as chosen by the caller or application developer. The XEmac_Initialize() function initializes the fields of the XEmac structure and the IPIF component with its register base address, clears the Ethernet statistics for this device, and configures the FIFO components with their register base addresses. If the device is configured with DMA, the DMA channel components are configured with their register base addresses. The function also resets the Ethernet MAC. It returns XST_SUCCESS if the initialization was successful and XST_DEVICE_NOT_FOUND if the device configuration information was not found for a device with the supplied device ID.

After initializing the EMAC device, the 32 bits of the EMAC control register shown in Figure 10 need to be set [10]. These bits allow the user to control the operation of the EMAC by enabling and disabling transmission and reception options such as transmit auto pad or FCS insertion, internal loop-back and unicast or broadcast addressing.

Figure 10: EMAC Control Register

The next step is to set the 48-bit MAC address for the EMAC driver/device. The code above uses the low-level macro XEmac_mSetMacAddress(), which takes the base address of the EMAC and a pointer to a 6-byte MAC address. The equivalent driver function, XEmac_SetMacAddress(), is provided with InstancePtr, a pointer to the XEmac instance to be worked on, and AddressPtr, a pointer to a 6-byte MAC address. It returns XST_SUCCESS if the MAC address was set successfully and XST_DEVICE_IS_STARTED if the device has not yet been stopped.

Since Ethernet is interrupt driven, the appropriate function needs to be called whenever an interrupt occurs. Thus there is a need to set the callback functions.
XEmac_SetFifoSendHandler(), XEmac_SetFifoRecvHandler() and XEmac_SetErrorHandler() set the callback functions that handle confirmation of transmitted frames, received frames and errors when the device is configured for direct memory-mapped I/O using FIFOs. The callbacks are invoked by the driver within interrupt context, so they need to do their job quickly. If there are potentially slow operations within a callback, these should be done at task level. The first parameter is a pointer to the XEmac instance to be worked on. The second parameter is a reference pointer that is passed back to the adapter in the callback, which helps the adapter correlate the callback to a particular driver. Finally, the third parameter is the pointer to the callback function.

Prior to starting the device, the function SetupInterruptSystem() sets up the interrupt system so that interrupts can occur for the EMAC. InstancePtr contains a pointer to the instance of the EMAC component which is going to be connected to the interrupt controller. This function is application-specific since the actual system may or may not have an interrupt controller. The EMAC could be directly connected to a processor without an interrupt controller. It is the task of the user to modify this function to fit the application.

The last step of initialization consists of starting the Ethernet controller using the XEmac_Start() function. If the device is not in polled mode, the internal interrupt enable registers are set up appropriately and interrupts within the device itself are enabled. The transmitter and the receiver are also enabled.

In addition to the initialization steps described before, the user needs to build the Ethernet frame to be sent as shown in Figure 11 [10]. The preamble and SFD fields are always automatically inserted by the EMAC and should never appear in the packet data provided to the EMAC. The header consists of the destination and source addresses as well as the number of bytes in the following data field. The data field may vary from 0 to 1500 bytes in length for a normal frame. It is the task of the programmer to make sure that the size of the data field of a single frame does not exceed 1500 bytes. The pad field is used to ensure that the frame is at least 64 bytes long (the preamble and SFD fields are not considered part of the frame for this calculation), which is required for successful CSMA/CD operation. If the pad field is supplied as part of the transmit packet, the FCS may be inserted by the EMAC or provided as part of the packet to the EMAC. If the pad field is inserted by the EMAC, the FCS field will also be calculated and inserted by the EMAC. This is necessary to ensure proper FCS calculation over the pad field.

Figure 11: Ethernet Frame Format
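The frame layout described above can be illustrated with a short C sketch. It is not the code of our library: build_frame and the constants MAX_DATA_SIZE and MIN_FRAME_NO_FCS are hypothetical names, padding is done in software purely for illustration (the EMAC can insert the pad itself), and the sketch assumes that the 4-byte FCS is appended later by the EMAC, so that header, data and pad together must reach 60 bytes. The length field is written high byte first, as is usual for Ethernet.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAC_ADDR_SIZE    6
#define EMAC_HDR_SIZE    14    /* destination + source + length */
#define MAX_DATA_SIZE    1500  /* largest data field of a normal frame */
#define MIN_FRAME_NO_FCS 60    /* 64-byte minimum minus the 4-byte FCS */

/* Builds header, data and pad into frame, which must have room for
 * EMAC_HDR_SIZE + MAX_DATA_SIZE bytes. Returns the number of bytes
 * written (FCS excluded), or 0 if the data field is too long. */
size_t build_frame(uint8_t *frame,
                   const uint8_t dest[MAC_ADDR_SIZE],
                   const uint8_t src[MAC_ADDR_SIZE],
                   const uint8_t *data, size_t len)
{
    assert(frame != NULL && dest != NULL && src != NULL);
    if (len > MAX_DATA_SIZE)
        return 0;

    memcpy(frame, dest, MAC_ADDR_SIZE);                 /* bytes 0..5  */
    memcpy(frame + MAC_ADDR_SIZE, src, MAC_ADDR_SIZE);  /* bytes 6..11 */
    frame[12] = (uint8_t)(len >> 8);                    /* length, high byte */
    frame[13] = (uint8_t)(len & 0xFF);                  /* length, low byte  */
    if (len > 0)
        memcpy(frame + EMAC_HDR_SIZE, data, len);

    size_t total = EMAC_HDR_SIZE + len;
    if (total < MIN_FRAME_NO_FCS) {
        /* Pad field: zero bytes up to the minimum frame length. */
        memset(frame + total, 0, MIN_FRAME_NO_FCS - total);
        total = MIN_FRAME_NO_FCS;
    }
    return total;
}
```

With a 3-byte data field the function returns 60, since 43 pad bytes are added; with a 100-byte data field no pad is needed and it returns 114.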

6.2. Communication through FSL

Communication through FSL is implemented through the following set of instructions [9]:

    // Blocking Data Read and Write to FSL number id
    microblaze_bread_datafsl(val, id)
    microblaze_bwrite_datafsl(val, id)
    // Non-blocking Data Read and Write to FSL number id
    microblaze_nbread_datafsl(val, id)
    microblaze_nbwrite_datafsl(val, id)

These instructions perform blocking and non-blocking reads and writes of the 32-bit data value specified in val on the FSL port whose number is specified in id. They can be used for communication between two MicroBlazes or between a MicroBlaze and a hardware unit within a single FPGA. They allow sending only one word at a time. In order to send multiple words of data, the programmer needs to implement an appropriate loop. Synchronization issues also need to be taken into consideration. When sending more data than the FSL FIFO buffer can hold, the user should insert a certain waiting time in order to avoid losing information [9].
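The loop needed to send several words can be sketched as follows. So that the example compiles on a host machine, the FSL write is replaced by a stand-in, fsl_put_word, which only records the words; on the MicroBlaze this call would be microblaze_bwrite_datafsl(val, id). The names fsl_put_word, fsl_send_bytes, sim_fsl_words and sim_fsl_count are hypothetical, and packing the bytes most significant byte first with zero padding of the last word is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_FSL_CAPACITY 64

/* Host-side record of the words "written" to the FSL. */
static uint32_t sim_fsl_words[SIM_FSL_CAPACITY];
static size_t   sim_fsl_count;

/* Stand-in for microblaze_bwrite_datafsl(val, id) on the target. */
static void fsl_put_word(uint32_t val, unsigned id)
{
    (void)id;  /* the simulated link ignores the FSL number */
    assert(sim_fsl_count < SIM_FSL_CAPACITY);
    sim_fsl_words[sim_fsl_count++] = val;
}

/* Sends len bytes over FSL number id, one 32-bit word at a time.
 * Bytes are packed most significant byte first and the last word is
 * zero padded. Returns the number of words written. */
size_t fsl_send_bytes(unsigned id, const uint8_t *data, size_t len)
{
    size_t words = 0;
    for (size_t i = 0; i < len; i += 4) {
        uint32_t w = 0;
        for (size_t b = 0; b < 4; b++) {
            w <<= 8;
            if (i + b < len)
                w |= data[i + b];
        }
        fsl_put_word(w, id);
        words++;
    }
    return words;
}
```

Because the blocking write stalls while the FIFO is full, a loop of this kind needs no extra waiting time; with the non-blocking write the sender would have to check that each word was accepted and retry otherwise.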

7. Unified Communication Library

The functions described earlier pertain to communication through Ethernet between two MicroBlazes or communication through FSL between MicroBlazes and hardware units. The processing elements integrated in the system designed to test our library make use of these operations. However, if other types of processing elements and interconnects are integrated into the system, a wider variety of instructions would be needed to implement inter-processor communication. The complexity of implementing this communication lies in the fact that the programmer has to be aware of all the details of the underlying topology. The type and location of every processor, as well as the type of link connecting every two processors, determine the instruction to be used for sending and receiving data. In the case of communication over Ethernet, the MAC addresses of the source and destination processors should be known, and in the case of communication through FSL, the id of the FSL bus connecting the processors needs to be known. Therefore, the programmer needs to be knowledgeable about the communication protocols available on FPGA such as Ethernet, RocketIO, OPB and FSL.

The library we created allows the programmer to send and receive data between any two processors without being provided with any details about the topology. The programmer only needs to know the ids of the source and destination processors in order to send and receive data. Upon building the system, a header file describing the hardware design is created for each processor.
The library consists of a C file containing functions which consult the header file in order to retrieve details about the underlying topology.

Header File Description

The information contained in the header file includes:

- The processor's id: PE_Numb
- The number of processing elements in the system: PE_Nb
- The type and location of each processing element
- The MAC address of each processing element (if connected to Ethernet)
- The ids of the FSL buses and which processors they connect

Each processor has its own header file describing information specific to it as well as information about the topology and the other processors. The header file for a certain processor A defines PE_Numb, an integer between 0 and PE_Nb - 1 specifying the id of processor A. The type of every processor in the system is specified in an array Types[]. In order to determine the type of a certain processor whose id is x, the array element Types[x] is checked. The system we created contains two types of processors and therefore each element of Types[] contains one of two values: MB or HU. Another array, Locations[], specifies the location of each of the processors in the system. In the case of the system we designed, there are three FPGA boards and therefore the elements of this array can take one of three values: B1, B2 or B3. For instance, if processor 5 is located on board 2 then Locations[5]=2.

In order to implement communication through Ethernet, there is a need to determine if a certain processor is connected to the Ethernet bus. The array MACorNOT[] provides this information by storing 1 if the processor is connected to Ethernet and 0 otherwise. Moreover, the 6-byte MAC addresses of all processing elements are stored in six arrays, PE_MAC_Addr1[] through PE_MAC_Addr6[]. PE_MAC_Addr1[x] contains the first byte of the MAC address of processor x, PE_MAC_Addr2[x] contains the second byte of the MAC address of processor x, and so on. If a processor is not connected to Ethernet, the appropriate array elements are set to 0. The arrays containing information about the topology (PE types and locations) and the MAC addresses are the same for all processors. In other words, the header files of all the processors contain the same values in the elements of these arrays.

On the other hand, communication through FSL requires determining the id of the FSL bus connecting the source processor to the destination. Thus, an array FSL[] containing PE_Nb elements was created. This array is specific to each processor; it does not necessarily contain the same values in the header files of different processors. In the header file of a certain processor y, if FSL[x] is set to -1 then x and y are not connected through FSL. Otherwise, FSL[x] specifies the id of the FSL bus connecting x and y.

Therefore, the header file implemented in this project contains all the information needed about the topology of a multiprocessor system. Sending data through FSL or Ethernet can now be implemented through functions that consult the constants defined in the header file. In order to implement inter-processor communication through OPB or RocketIO, the header file can be extended appropriately. The format of this file is also scalable to more complex topologies containing more processors of different types such as the PowerPC.

C File Description

As described earlier, the main purpose of the library created in this project is to allow the programmer to send and receive data between any two processors without being provided with any information about the topology.
In other words, the programmer does not need to know the type or location of the processors or the type of interconnect between them. The send and receive instructions created in the library take as input the id of the destination or source processor respectively, and they consult the header file for details about the topology. The library includes two types of send/receive operations, blocking and non-blocking, as well as an Initialize() function which is called at the beginning of the source C code of every processor. If the processor is connected to the Ethernet bus, this function takes care of initializing the EMAC controller as described in section 6. Processors connected only through FSL do not require any initialization. This function can be modified to include any initialization needed to support OPB and RocketIO communication.

The operands of the send/receive instructions are the same as those of the MPI_Send and MPI_Recv instructions of the Message Passing Interface standard. The Message Passing Interface (MPI; for more information refer to Appendix B) is a standardized and portable message passing design that defines the syntax and semantics of a core of library routines. It is widely used in writing portable message passing programs in Fortran, C and C++ for distributed or shared memory parallel computers as well as networks of workstations [16]. This design choice allows ease of scalability and ensures universality of the library. Programmers are familiar with the MPI standard for interprocess communication and will easily be able to use our library.

Blocking and Non-Blocking Send

The blocking send sends data and waits until an acknowledgement is received from the destination processor. If no acknowledgement is received after a certain timeout period, the processor resends the data and waits again. In the non-blocking version of the send instruction there is no need to wait for an acknowledgement. As soon as the transmission is complete, the execution of the program is resumed. The library also implements blocking and non-blocking receive functions that will be discussed later. The blocking and non-blocking send instructions have the following syntax:

    BlockingSend(int Dest, Xuint8 *DataPtr, int Size, int Datatype, int Tag, int Comm)
    NonBlockingSend(int Dest, Xuint8 *DataPtr, int Size, int Datatype, int Tag, int Comm)

The Dest field is an integer that determines the id of the destination processor. The processor ids are integers ranging from 0 to the total number of processors minus 1. Therefore each processor is assigned a unique id upon creation of the system. DataPtr points to the first byte of the data to be sent and Size specifies the number of bytes to be sent. Datatype specifies the type of the data in the entries. Tag is the message tag, which can be used to track and order multiple received messages from the same sender. The last parameter, Comm, is a communicator which, in MPI, specifies a communication domain for the communication. In the BlockingSend() instruction, this field is used to indicate the type of the send instruction (blocking or non-blocking).

The first step in both blocking and non-blocking send instructions consists of checking the locations of the sender and receiver.
Since the array Locations[] specifies the location of each processor in the system, Locations[PE_Numb] is compared to Locations[Dest]. Whenever the destination processor and the source processor of the message are on the same FPGA, the communication is called on-chip communication. Generally speaking, this communication could take place over the OPB or the FSL, between two MicroBlazes or any other processing elements. However, in our case, on-chip communication takes place only between the MicroBlaze and a hardware unit over the FSL. Each hardware unit has its own architecture and its own input/output ports, so the communication to or from a hardware unit differs from one hardware unit to another. This communication is governed most importantly by the functionality of the hardware unit and the depth of the FIFO buffer. In fact, in our case the hardware unit must receive its inputs in a specific order to compute its result. Since the hardware created relates directly to the application, we leave the discussion of the implementation of the blocking and non-blocking send for communication through FSL to the testing section.
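The location check can be sketched in C as follows. The array names follow the header file description, but their contents are invented for the illustration: a hypothetical system of four processing elements on two boards, seen from processor 0. The function select_link and the type link_t are hypothetical names, and the board constants are assumed to be small integers. The case in which the data has to be forwarded by another processor is not modelled.

```c
#include <assert.h>

#define PE_Nb   4  /* number of processing elements (illustrative) */
#define PE_Numb 0  /* id of the processor owning this header (illustrative) */

enum { B1 = 1, B2 = 2 };  /* board identifiers (assumed values) */

/* Illustrative topology: PEs 0 and 1 on board 1, PEs 2 and 3 on board 2. */
static const int Locations[PE_Nb] = { B1, B1, B2, B2 };
/* 1 if the PE is connected to the Ethernet bus, 0 otherwise. */
static const int MACorNOT[PE_Nb]  = { 1, 0, 1, 0 };
/* As seen from PE 0: id of the FSL bus towards each PE, -1 if none. */
static const int FSL[PE_Nb]       = { -1, 0, -1, -1 };

typedef enum { LINK_NONE, LINK_FSL, LINK_ETHERNET } link_t;

/* Chooses the link used to reach dest directly from this processor. */
link_t select_link(int dest)
{
    assert(dest >= 0 && dest < PE_Nb);

    /* Same board: on-chip communication over FSL, if a bus exists. */
    if (Locations[PE_Numb] == Locations[dest])
        return (FSL[dest] >= 0) ? LINK_FSL : LINK_NONE;

    /* Different boards: direct Ethernet only if both ends are attached. */
    if (MACorNOT[PE_Numb] && MACorNOT[dest])
        return LINK_ETHERNET;

    /* Reaching dest would require forwarding, which is not modelled. */
    return LINK_NONE;
}
```

With these tables, processor 0 reaches processor 1 over FSL bus 0 and processor 2 over Ethernet, while processor 3 has no direct link.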

In the case of communication between two processors located on two different boards, Ethernet is used to send the data to the destination. If both the source and the destination are connected to the Ethernet bus, then the data can be sent directly. However, in some cases there might be a need to send data between processors which are located on two different boards and which are not both connected to the Ethernet bus. If the source processor is not connected to the Ethernet bus, then FSL should be used to send the data to another processor located on the same board and connected to the Ethernet bus. This processor will in turn forward the data through Ethernet to the appropriate board. On the other hand, if the destination processor is not connected to the Ethernet bus, the data is sent to a processor located on the destination board and connected to the Ethernet bus. This processor then forwards the data to the destination processor through FSL. Note that in the system created to test the library, all the MicroBlazes are connected to the Ethernet bus and therefore they can communicate directly.

Sending data through Ethernet requires building the frame to be sent by adding the Ethernet header to the data. The library also inserts five additional header fields into the data part of the Ethernet frame. These include the tag, datatype and comm fields as well as two fields specifying the final destination and the original sender of the frame. The first three fields are the parameters specified upon calling the BlockingSend() or NonBlockingSend() functions. The final destination field is needed if the final destination is not connected to the Ethernet bus. This case, described earlier, requires the processor that receives the frame to forward it to its final destination. This is taken care of in the receive instruction. The field specifying the original sender is consulted in the receive functions in order to determine if the packet received is the one expected.
Finally, XEmac_FifoSend() is used to send the frame. The following code describes the process of building the Ethernet frame and sending it.

    /****************************Ethernet Header************************/
    Xuint32 Index;
    Xuint8 *FramePtr;
    FramePtr = (Xuint8 *) TxFrameBuf;
    // The length field counts the internal header plus the data
    Xuint32 Sizevalue = (Xuint32) Size + INTERNAL_HDR_SIZE;

    // Destination MAC address
    FramePtr[0] = PE_MAC_Addr1[Dest];
    FramePtr[1] = PE_MAC_Addr2[Dest];
    FramePtr[2] = PE_MAC_Addr3[Dest];
    FramePtr[3] = PE_MAC_Addr4[Dest];
    FramePtr[4] = PE_MAC_Addr5[Dest];
    FramePtr[5] = PE_MAC_Addr6[Dest];

    // Source MAC address
    FramePtr[6] = PE_MAC_Addr1[PE_Numb];
    FramePtr[7] = PE_MAC_Addr2[PE_Numb];
    FramePtr[8] = PE_MAC_Addr3[PE_Numb];
    FramePtr[9] = PE_MAC_Addr4[PE_Numb];
    FramePtr[10] = PE_MAC_Addr5[PE_Numb];
    FramePtr[11] = PE_MAC_Addr6[PE_Numb];

    // Length of Data in Bytes, high byte first
    FramePtr[12] = (Xuint8) (Sizevalue >> 8);
    FramePtr[13] = (Xuint8) (Sizevalue & 0xFF);

    /***************************Internal Header*************************/
    //Set up the Tag
    FramePtr[14] = Tag;
    //Set up the Datatype
    FramePtr[15] = Datatype;
    //Set up the Comm field
    FramePtr[16] = Comm;
    //Set up the Final Destination
    FramePtr[17] = Dest;
    //Set up the Original Sender
    FramePtr[18] = PE_Numb;

    //Copy the data into the frame
    for (Index = 0; Index < Sizevalue - INTERNAL_HDR_SIZE; Index++)
        FramePtr[Index + 19] = DataPtr[Index];

    /******************************* SEND ******************************/
    //Send using the Ethernet Send Function
    XStatus Result;
    Result = XEmac_FifoSend(InstancePtr, (Xuint8 *) TxFrameBuf, (Sizevalue + EMAC_HDR_SIZE));

In the case of sending data of size larger than 1495 bytes, there is a need to divide the data into multiple frames which are sent consecutively within the BlockingSend() or NonBlockingSend() function. This case is not applicable to our system due to the memory constraints described earlier and thus it was not implemented in the library.

The NonBlockingSend() function is completed upon sending the frame, whereas BlockingSend() needs to wait for an acknowledgement. This is implemented by creating a flag that is set by the receive handler. A while loop is inserted after sending in order to wait until a receive interrupt occurs and sets the flag to 1. This ensures that execution will not be resumed prior to receiving an acknowledgement. The following code illustrates the mechanism of waiting for an acknowledgement in the blocking send.

    /******************* Wait for the appropriate Ack ******************/
    while (1)
    {
        //loop to wait for an interrupt to occur
        while (test == 0)
            test = flag;
        //when an interrupt occurs, reset the flag
        flag = 0;
        //check that the received frame is the appropriate ack
        if (RxFrameBuf[14] == 1 && RxFrameBuf[18] == Dest && RxFrameBuf[17] == PE_Numb)
            break;
    }

Blocking and Non-Blocking Receive

The blocking and non-blocking receive operations we implemented take five arguments and return the size of the data received. On one hand, the blocking receive stalls the program until the appropriate data is successfully received. On the other hand, the non-blocking receive operation retrieves the data received and resumes execution without checking whether this data is valid. The syntax of the receive functions is shown below. The first argument is the id of the sender and the rest are similar to those in the send function. The two functions return the size of the data received upon successful completion.

    Xuint32 BlockingReceive(int Src, Xuint8 *DataPtr, int *Datatype, int *Tag, int *Comm)
    Xuint32 NonBlockingReceive(int Src, Xuint8 *DataPtr, int *Datatype, int *Tag, int *Comm)

The first step in a blocking or non-blocking receive instruction is similar to the first step in the send instruction: the sender and destination locations are compared. Accordingly, either the FSL functions are used so that the MicroBlaze can communicate with the hardware unit, or the Ethernet functions are used for inter-processor communication between two MicroBlazes. The implementation of the receive functions for communication between the MicroBlaze and the hardware unit is described in the testing section, for the same reasons given for the corresponding send functions.

In the case of inter-processor communication between two MicroBlazes using Ethernet, it is important to remember that the Ethernet send and receive functions are interrupt driven. Thus, as soon as an Ethernet frame is received, the function FifoRecvHandler() is called automatically. The main part of this function is shown below.

    FrameLen = XEM_MAX_FRAME_SIZE;
    Result = XEmac_FifoRecv(EmacPtr, (Xuint8 *)RxFrameBuf, &FrameLen);

This function saves the entire packet, including the FCS field, in the RxFrameBuf buffer using XEmac_FifoRecv(), which also stores the length of the entire packet, including the four additional bytes of the FCS, in a global integer, FrameLen.

The FifoRecvHandler() also sets a flag to indicate that new data has been received. Note that if several frames are sent before being read by the user, they need to be saved in different buffers, but since our library does not support sending several frames successively, this part of the library was not implemented.

When the NonBlockingReceive() function is called there are several tasks to be fulfilled. First, the memory locations pointed to by the tag, datatype and comm arguments are filled with the corresponding fields saved in RxFrameBuf.

    *Tag = RxFrameBuf[14];
    *Datatype = RxFrameBuf[15];
    *Comm = RxFrameBuf[16];

Second, the size of the data is saved in ByteCountPtr, whose content is returned when the function exits.

    *ByteCountPtr = Global_length - EMAC_HDR_SIZE - INTERNAL_HDR_SIZE - 4;

Next, the data is transferred from RxFrameBuf to DataPtr in a simple loop that takes into account that RxFrameBuf contains some fields not related to the data (the Ethernet header and the internal header).

    for (Index = 0; Index < *ByteCountPtr; Index++)
        DataThingy[Index] = RxFrameBuf[Index + EMAC_HDR_SIZE + INTERNAL_HDR_SIZE];

Finally, the Comm field is checked in order to determine the type of the matching send instruction. If it is blocking, an acknowledgement (ack) is sent to the source through a NonBlockingSend() operation. The ack carries no meaningful data but its tag field is set to 1. Note that the tag is usually used for acknowledging different frames; however, since in our case only one frame is sent, we were able to define a tag equal to 0 as a normal data frame and a tag equal to 1 as an acknowledgement frame. Lastly, the flag indicating that new data is available is cleared since the data has been read.

On the other hand, BlockingReceive() loops at the beginning of the function until new data is received and saved in RxFrameBuf, as shown below. The flag that the function loops over is set in the FifoRecvHandler() function. BlockingReceive() also makes sure that the data was received from the Src processor.

    while (1)
    {
        while (test == 0)
            test = flag;
        if (RxFrameBuf[18] == Src)
            break;
    }
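The frame layout used by the send and receive functions (a 14-byte Ethernet header, then the tag, datatype, comm, destination and sender bytes, then the data) can be summarised in a small host-side sketch. It uses standard C types instead of the Xilinx ones, and the names pack_frame, unpack_frame, is_ack_for and InternalHdr are illustrative assumptions, not functions of the library; the MAC addresses and the length field are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EMAC_HDR_SIZE     14u /* 6 destination MAC + 6 source MAC + 2 length */
#define INTERNAL_HDR_SIZE 5u  /* tag, datatype, comm, destination, sender */
#define FCS_SIZE          4u  /* frame check sequence appended on reception */
#define TAG_DATA          0u
#define TAG_ACK           1u

/* Hypothetical view of the five internal header bytes (frame offsets 14..18). */
typedef struct {
    uint8_t tag, datatype, comm, dest, sender;
} InternalHdr;

/* Writes the internal header and payload after the Ethernet header.
 * Returns the number of frame bytes used, or 0 if the buffer is too small. */
size_t pack_frame(uint8_t *frame, size_t cap, const InternalHdr *h,
                  const uint8_t *data, size_t len)
{
    assert(frame != NULL && h != NULL);
    size_t total = EMAC_HDR_SIZE + INTERNAL_HDR_SIZE + len;
    if (total > cap)
        return 0;
    frame[14] = h->tag;
    frame[15] = h->datatype;
    frame[16] = h->comm;
    frame[17] = h->dest;
    frame[18] = h->sender;
    if (len > 0)
        memcpy(frame + EMAC_HDR_SIZE + INTERNAL_HDR_SIZE, data, len);
    return total;
}

/* frame_len is the received length including the 4 FCS bytes.
 * Returns the payload length, or -1 if the frame or output buffer is too short. */
int unpack_frame(const uint8_t *frame, size_t frame_len, InternalHdr *h,
                 uint8_t *out, size_t out_cap)
{
    assert(frame != NULL && h != NULL && out != NULL);
    if (frame_len < EMAC_HDR_SIZE + INTERNAL_HDR_SIZE + FCS_SIZE)
        return -1;
    size_t len = frame_len - EMAC_HDR_SIZE - INTERNAL_HDR_SIZE - FCS_SIZE;
    if (len > out_cap)
        return -1;
    h->tag = frame[14];
    h->datatype = frame[15];
    h->comm = frame[16];
    h->dest = frame[17];
    h->sender = frame[18];
    memcpy(out, frame + EMAC_HDR_SIZE + INTERNAL_HDR_SIZE, len);
    return (int)len;
}

/* Same test as the blocking send: an ack from `peer` addressed to `self`. */
int is_ack_for(const uint8_t *frame, uint8_t peer, uint8_t self)
{
    return frame[14] == TAG_ACK && frame[18] == peer && frame[17] == self;
}
```

Keeping the offsets in one place makes it harder for the send and receive sides to disagree about where the data starts.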

8. Testing

The design used to test our library consists of three FPGA boards connected through Ethernet. Each FPGA contains one MicroBlaze and three co-processors. These hardware units communicate with the MicroBlaze through FSL and perform addition and multiplication operations.

The application used for testing is a simple matrix multiplication. Matrix multiplication is an easily parallelizable application, so we can manually divide the tasks over the different PEs and distribute them. This application, though simple, is important, especially in image and video processing. It is a computationally demanding operation, especially when dealing with large matrices.

One of the MicroBlazes is the root of the whole system. In this root the matrices are created and initialized. It is also responsible for gathering all the information from the system and returning the result matrix. It sends a third of the rows of matrix A and the whole of matrix B to each of the other MBs through Ethernet. Thus each MB has to complete a third of the task. Each MicroBlaze is also a local root on its own FPGA. It takes its third of matrix A and matrix B and divides the computation among the co-processors. Each co-processor receives a row and a column and returns the result of their multiplication. The MicroBlazes then gather the results from their co-processors and send them to the root MicroBlaze, which concatenates the received data to create the final result matrix. The steps in the matrix multiplication are summarized below.

1. Define matrices A and B in the system's root.
2. Send a third of A (A') and the entire B from the system's root to MicroBlaze 2.
3. Send A' and B from the system's root to MicroBlaze 3.
4. Send a third of A' (A'') and B from the local root to one hardware unit.
5. Send A'' and B from the local root to the second hardware unit.
6. Send A'' and B from the local root to the third hardware unit.
7. Each local root gathers the results from all three hardware units.
8. The system's root gathers the results in a predefined resulting matrix.

A detailed description of the customized hardware unit and the C codes written to run this testing application is given next.
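The two-level division of work summarized above can be illustrated with a purely sequential sketch. It assumes square, row-major matrices of 32-bit integers whose size is a multiple of 9, as in our tests; the function names and the NUM_BOARDS and UNITS_PER_BOARD constants are hypothetical, and no actual communication takes place.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BOARDS      3 /* one MicroBlaze per FPGA board */
#define UNITS_PER_BOARD 3 /* co-processors attached to each MicroBlaze */

/* Computes rows [row0, row0 + rows) of C = A * B for n x n row-major matrices.
 * This stands in for the work given to one hardware unit. */
void multiply_rows(const int32_t *a, const int32_t *b, int32_t *c,
                   size_t n, size_t row0, size_t rows)
{
    for (size_t i = row0; i < row0 + rows; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t sum = 0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Splits A into thirds (A') per board, then each third into thirds (A'')
 * per unit. Writing into disjoint row ranges of C plays the role of the
 * final concatenation at the root. */
void partitioned_multiply(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    assert(n % (NUM_BOARDS * UNITS_PER_BOARD) == 0);
    size_t rows_per_board = n / NUM_BOARDS;
    size_t rows_per_unit = rows_per_board / UNITS_PER_BOARD;

    for (size_t board = 0; board < NUM_BOARDS; board++) {
        for (size_t unit = 0; unit < UNITS_PER_BOARD; unit++) {
            size_t row0 = board * rows_per_board + unit * rows_per_unit;
            multiply_rows(a, b, c, n, row0, rows_per_unit);
        }
    }
}
```

Because every unit works on its own rows of A and on all of B, no intermediate results need to be exchanged between units, which is why only the final gathering steps involve communication.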

8.1. Customized Hardware Unit

The customized hardware unit was created using the Create/Import Peripheral Wizard as explained previously. The hardware unit is connected to the MicroBlaze using a 32-bit wide FSL link. The FIFO buffers can hold up to 16 words. The chosen HDL is VHDL. The hardware template is thus created and its description is given below.

The VHDL code starts with the port definitions in the entity. These port definitions are created by the wizard and they are the ports needed for the interface with the FSL link. The main ports used are described below:

    FSL_Clk: input, clock to the hardware unit
    FSL_Rst: input, reset
    FSL_S_Read: output, signal asserted when a value is being read
    FSL_S_Data: input, 32-bit word containing the value read from the FIFO
    FSL_S_Exists: input, signal set to 1 when the input FIFO is not empty
    FSL_M_Write: output, signal set to 1 when there is data to be written
    FSL_M_Data: output, 32-bit word used to push a value into the output FIFO
    FSL_M_Full: input, signal set to 1 when the output buffer is full

In the architecture section, some signals and new types are created. The first new type is STATE_TYPE, used for the state machine. The second is an array of 32-bit words used to hold rows and columns. After the definitions, some signals are set according to the FSL handshaking protocol and timing constraints.

FSL_S_Read is set equal to FSL_S_Exists when in the Read_Row, Read_Column or Read_Value state, and to 0 otherwise. Therefore, if there is data in the input FIFO and the state is one of the reading states, FSL_S_Read is set to show that a read is being performed, as explained in the timing constraints of the FSL.

FSL_M_Write is set to not FSL_M_Full when in the Write_Output or Write_ACK state, and to 0 otherwise. Therefore, if the output FIFO buffer is not full and the state is one of the writing states, it is set to show that the data is ready to be pushed onto the output FIFO.

FSL_M_Data is set to the 32-bit output to be pushed into the output buffer when in the Write_Output state, and to zero otherwise.

The states mentioned above are those of the state machine shown in Figure 12. It is created to run the program while taking into consideration all synchronization and FSL protocol issues. The details of the functionality of the hardware as well as the states are described next.
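The three signal assignments described above are purely combinational, so they can be modelled in a few lines of C. This is only a software model of the behaviour described in the text, not the VHDL itself; the State enumeration, the FslOutputs structure and the fsl_outputs function are names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* States of the hardware unit as named in the text. */
typedef enum {
    IDLE, READ_VALUE, READ_ROW, WRITE_ACK, READ_COLUMN, CALC_OUTPUT, WRITE_OUTPUT
} State;

/* Outputs driven towards the FSL link. */
typedef struct {
    bool s_read;     /* models FSL_S_Read  */
    bool m_write;    /* models FSL_M_Write */
    uint32_t m_data; /* models FSL_M_Data  */
} FslOutputs;

/* s_exists models FSL_S_Exists, m_full models FSL_M_Full and
 * output is the value computed by the unit. */
FslOutputs fsl_outputs(State state, bool s_exists, bool m_full, uint32_t output)
{
    assert(state <= WRITE_OUTPUT);
    bool reading = state == READ_ROW || state == READ_COLUMN || state == READ_VALUE;
    bool writing = state == WRITE_OUTPUT || state == WRITE_ACK;

    FslOutputs o;
    o.s_read = reading && s_exists;                  /* pop only if data exists   */
    o.m_write = writing && !m_full;                  /* push only if there is room */
    o.m_data = (state == WRITE_OUTPUT) ? output : 0; /* the ack carries zero       */
    return o;
}
```

Gating the read on FSL_S_Exists and the write on FSL_M_Full is what prevents the unit from popping an empty FIFO or pushing into a full one.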

Figure 12: Customized Hardware Unit State Diagram

The process has the FSL_Clk port as the only element in its sensitivity list. Thus, on the rising edge of the clock, the process migrates from one state to the next. The first state is Idle. In this state the sum is initialized to zero. If data exists in the FSL FIFO, Read_Value becomes the new state. In this state, one element is popped from the FIFO and saved as the size of the array, and therefore the number of values to be read next. The state then moves to Read_Row, during which data in the buffer is popped at each rising edge of the clock and saved directly in the array. The number of remaining reads is decreased at each clock cycle, and when it reaches zero the state changes to Write_Ack. An ack of value zero is sent back through FSL to acknowledge that the whole row was read and saved. The state then changes to Read_Column and the number of reads is set to the size of the array. In the Read_Column state each value is read from the FIFO and multiplied by the corresponding value of the array previously saved. The result of each multiplication is added to the sum. When all the values are read, sum holds the final result of the multiplication of the row and column. The state then changes to Calc_Output, where the computed sum is moved to the output value. This state ensures that the output is correct and ready on time. Next comes the Write_Output state, where the output is pushed into the output FIFO and thus sent. Finally, the state changes back to Idle.

Compiling the VHDL code was successful, yet when the code was included in the project it generated a flow error upon bitstream generation. This turned out to be a timing constraint error. The multiply and add function takes more than half a clock cycle when running at 100 MHz. The process would thus run again (due to a clock change) before the operation is complete. This means that the clock used should be slower. However, this would mean that the clock of the whole system becomes slower, which would affect performance. To solve this problem, a different clock should be fed into the customized hardware unit. Thus another peripheral, called ClockDivider, was created. It is a simple piece of hardware that takes the system clock as input and outputs a clock at half its frequency. The output clock changes at the rising edge of the input clock. The new clock is then fed into the co-processor.

However, another synchronization issue arises. The FSL connected between the MicroBlaze and the co-processor links two processing elements running at different clocks. Therefore we used the asynchronous mode of the FSL link as described previously. In this mode, the FSL runs on two different clocks instead of one. It has a clock on the master side and another on the slave side, each connected to the system clock or to the newly generated clock depending on the processing element at that end.

After connecting it using the clock divider, the hardware unit was heavily tested. It first receives the size of the row and column to multiply, then it receives a row, returns an acknowledgement, receives a column and returns the value of their multiplication. The rows that were tested had around 15 to 20 elements. The row size is bounded by the depth of the FSL link, which can reach up to 8K [9]; in that case the row size can reach 8K and the hardware unit will still be functional.

Multiplying two matrices

In order to implement the specified application, several C codes are created to be downloaded onto each MicroBlaze. In fact, there are two main C codes. The first is that of the root MicroBlaze, which creates and initializes the matrices and divides the work amongst the others, as well as distributing work to the co-processors connected to it.
The second is that of the two remaining MicroBlazes, which receive parts of the job to be done and, in turn, distribute the task over the co-processors connected to them. Each of the C codes starts by including some libraries, namely the unified communication library that we have created. Note that the header file is slightly different when added to different MBs, as there are some fields specific to each PE.

Matrix Distribution among MicroBlazes

Both codes start with a wait statement, which is essential to deal with synchronization issues for the Ethernet connection. The Initialize() function, which is defined in our library, is then called to take care of all initialization issues.

    wait ( );
    /******************************Initialize*****************************/
    int Initialize_result = Initialize();
    if (Initialize_result == 0)
        xil_printf("initialization Complete\r\n");
    else
        xil_printf("initialization Failed\r\n");

The matrices A and B are created as square matrices with a variable size which can be defined. Note that while testing our program, we made sure to always have two matrices of size 9k by 9k, where k is a positive integer, so that the work load is divided equally among all PEs. A and B can be defined at the start of the program, but due to memory restrictions we decided to create and save them directly in the data frame we are sending. The data frame to be sent from the root MicroBlaze to the other MicroBlazes is now created.

    DataPtr[0] = NbRowsMB;
    DataPtr[1] = NbColA;
    DataPtr[2] = NbRowsB;
    DataPtr[3] = NbColB;
    Xuint32 Index;
    Xuint8 *counter = malloc(4*sizeof(Xuint8));
    *((Xuint32 *) counter) = NbRowsMB*NbColA;
    //Filling A.
    for (Index = 4; Index < NbRowsMB+4; Index += 4)
    {
        DataPtr[Index] = counter[0];
        DataPtr[Index+1] = counter[1];
        DataPtr[Index+2] = counter[2];
        DataPtr[Index+3] = counter[3];
        *((Xuint32 *) counter) = *((Xuint32 *) counter) + 1;
    }
    //Reset the counter value (not the pointer) before filling B
    *((Xuint32 *) counter) = 0;
    //Filling B
    for (Index = NbRowsMB*NbColA+4; Index < NbRowsMB*NbColA+4+MatrixSize; Index += 4)
    {
        DataPtr[Index] = counter[0];
        DataPtr[Index+1] = counter[1];
        DataPtr[Index+2] = counter[2];
        DataPtr[Index+3] = counter[3];
        *((Xuint32 *) counter) = *((Xuint32 *) counter) + 1;
    }

The data is referenced through a pointer of type Xuint8, which is a Xilinx-defined type for byte values. The first four bytes are set to the sizes of the matrices sent: number of rows of A, number of columns of A, number of rows of B, number of columns of B. Then the values of matrices A and B to be sent are placed in the data frame. They are stored sequentially, row by row. Therefore the first row of A is added, followed by the next, and so on. It is then followed by the rows of B. The matrices to be sent are thus transformed into one-dimensional arrays. It is important to note that the values in the matrices are 32-bit values and therefore each one occupies four consecutive bytes in the frame.

NonBlockingSend() is then called. It takes as parameters the destination PE, a pointer to the data frame and the size of the data frame in bytes; the rest are set to zero.

    NonBlockingSend(4, (Xuint8 *) DataPtr, PtrSizeinBytes, 0, 0, 0);

Another data frame is then prepared with another third of matrix A. This frame is also sent using a NonBlockingSend() instruction, to the second MicroBlaze.

    *((Xuint32 *) counter) = 2*NbRowsMB*NbColA;
    //Filling A.
    for (Index = 4; Index < NbRowsMB+4; Index += 4)
    {
        DataPtr[Index] = counter[0];
        DataPtr[Index+1] = counter[1];
        DataPtr[Index+2] = counter[2];
        DataPtr[Index+3] = counter[3];
        *((Xuint32 *) counter) = *((Xuint32 *) counter) + 1;
    }
    //Reset the counter value (not the pointer) before filling B
    *((Xuint32 *) counter) = 0;
    //Filling B
    for (Index = NbRowsMB*NbColA+4; Index < NbRowsMB*NbColA+4+MatrixSize; Index += 4)
    {
        DataPtr[Index] = counter[0];
        DataPtr[Index+1] = counter[1];
        DataPtr[Index+2] = counter[2];
        DataPtr[Index+3] = counter[3];
        *((Xuint32 *) counter) = *((Xuint32 *) counter) + 1;
    }
    NonBlockingSend(8, (Xuint8 *) DataPtr, PtrSizeinBytes, 0, 0, 0);

The tasks have now been divided and distributed, so each MicroBlaze, including the root, does the same work from now on but using different data.

Matrix Multiplication Using Co-Processors

At this point each MicroBlaze has a third of matrix A, or A', and the entire matrix B.
In fact the root MicroBlaze has this data since it created and distributed it, while the others have received it through Ethernet using BlockingReceive(). Knowing that each MB has the same amount of data to work on, the same code was used on the three MicroBlazes. This code splits A' into three new matrices A'' and sends each of them, together with matrix B, to a hardware unit using the NonBlockingSend() function.

    NbRowsMB = DataArray[0];
    NbColA = DataArray[1];
    NbRowsB = DataArray[2];
    NbColB = DataArray[3];
    Xuint32 SizetoFSL = (NbRowsMB/3)*NbColA + NbRowsB*NbColB + 1;
    Xuint32 SizeADoublePrimeBytes = (4*(NbRowsMB/3)*NbColA);
    Xuint8 *SizetoFSLPtr = malloc(sizeof(Xuint32));
    *((Xuint32 *)SizetoFSLPtr) = SizetoFSL*4;
    Xuint8 *DataToFSL = malloc(SizetoFSL*sizeof(Xuint32));
    if (DataToFSL == NULL)
        xil_printf("DataToFSL initialization error \r\n");
    DataToFSL[0] = NbRowsMB/3;
    //First A'' followed by B, sent to the first hardware unit
    for (Index = 1; Index < 4+SizeADoublePrimeBytes; Index++)
        DataToFSL[Index] = DataArray[Index];
    for (Index = 4+SizeADoublePrimeBytes; Index < 4 + SizeADoublePrimeBytes + 4*NbRowsB*NbColB; Index++)
        DataToFSL[Index] = DataArray[Index+2*SizeADoublePrimeBytes];
    NonBlockingSend(5, DataToFSL, SizetoFSLPtr, 0, 0, 0);
    //Second A'' followed by B, sent to the second hardware unit
    for (Index = 4; Index < 4+SizeADoublePrimeBytes; Index++)
        DataToFSL[Index] = DataArray[Index+SizeADoublePrimeBytes];
    for (Index = 4+SizeADoublePrimeBytes; Index < 4 + SizeADoublePrimeBytes + 4*NbRowsB*NbColB; Index++)
        DataToFSL[Index] = DataArray[Index+2*SizeADoublePrimeBytes];
    NonBlockingSend(6, DataToFSL, SizetoFSLPtr, 0, 0, 0);
    //Third A'' followed by B, sent to the third hardware unit
    for (Index = 4; Index < 4+SizeADoublePrimeBytes; Index++)
        DataToFSL[Index] = DataArray[Index+2*SizeADoublePrimeBytes];
    for (Index = 4+SizeADoublePrimeBytes; Index < 4 + SizeADoublePrimeBytes + 4*NbRowsB*NbColB; Index++)
        DataToFSL[Index] = DataArray[Index+2*SizeADoublePrimeBytes];
    NonBlockingSend(7, DataToFSL, SizetoFSLPtr, 0, 0, 0);

The sizes of the rows and columns of the matrices are first extracted and saved. These are needed as bounds for many of the loops. Matrix A' is then divided into three parts A''; the entire matrix B is appended to each part, and the result is sent to one of the hardware units using the NonBlockingSend() function.

Implementation of NonBlockingSend() in our library

As mentioned previously, the implementation of the NonBlockingSend() and BlockingSend() functions is application specific whenever the communication takes place between the MicroBlaze and the hardware unit. The hardware unit in our system multiplies a row by a column. It first reads a row from the MicroBlaze. Then the values of the column are sent and the multiplication result is sent from the hardware to the MicroBlaze. In case the row size exceeds the depth of the FSL FIFO, the data is divided and sent in chunks less than or equal to the depth, and the final result is the sum of the partial results received.

Whenever NonBlockingSend() recognizes that the data is being sent to the hardware unit through FSL, it first finds the corresponding FSL ID linking the two PEs.

    Xuint32 FSLID = PE_FSL[Dest];
    if (FSLID == -1)
    {
        xil_printf("error, not connected to FSL\r\n");
        return;
    }

In our application, the programmer sends two matrices to be multiplied in the hardware unit. However, our unit only takes as input a row of at most 15 values followed by a column of at most 15 values. Thus our library needs to rearrange the data into a succession of rows and columns rather than two matrices one after the other. We start by creating an outer loop that runs over the rows of A. Inside this loop another loop runs through the columns of matrix B. The row of A and the column of B that are currently being processed are saved in two arrays.

    //Looping over all the rows of A.
    for (RowAlooping = 0; RowAlooping < NbRowsA; RowAlooping++)
    {
        //Filling The current row in an array
        for (RowCntr = 0; RowCntr < NbColA; RowCntr++)
        {
            Row[RowCntr] = *((Xuint32 *) DataRowDummy);
            DataRowDummy += 4;
        }
        //Looping over all the columns of B
        for (ColBLoop = 0; ColBLoop < NbColB; ColBLoop++)
        {
            //Filling the current Column in array
            for (ColCntr = 0; ColCntr < NbRowsB; ColCntr++)
            {
                Column[ColCntr] = *((Xuint32 *) DataColDummy);
                DataColDummy += 4*NbColB;
            }
            DataColDummy = DataColDummy - 4*NbColB*NbRowsB + 4;
            //The row and column are sent to the hardware unit here, in chunks
        }
    }

At this stage, all that is left to do is to divide the row and the column into chunks of 15 values or less and to send the size of each chunk before it. The number of loops depends on the row/column size.

    //looping for the purpose of the Hardware
    for (BigLoop = 0; BigLoop <= NbLoop; BigLoop++)
    {
        if (BigLoop == NbLoop)
            NrSendLoops = Remainder;
        else
            NrSendLoops = FSL_BUFFER;
        //Send the chunk size, then the chunk of the row
        microblaze_nbwrite_datafsl(NrSendLoops, 0);
        for (SendLoop = 0; SendLoop < NrSendLoops; SendLoop++)
        {
            ArrVal = BigLoop*FSL_BUFFER + SendLoop;
            DataVal = Row[ArrVal];
            microblaze_nbwrite_datafsl(DataVal, 0);
        }
        //Read the ack returned after the row
        microblaze_nbread_datafsl(indata, 0);
        //Send the matching chunk of the column
        for (SendLoop = 0; SendLoop < NrSendLoops; SendLoop++)
        {
            ArrVal = BigLoop*FSL_BUFFER + SendLoop;
            DataVal = Column[ArrVal];
            microblaze_nbwrite_datafsl(DataVal, 0);
        }
        //Read the partial result and accumulate it
        microblaze_bread_datafsl(indata, 0);
        FSL_flag = 1;
        sum = sum + indata;
    }

The difference between BlockingSend() and NonBlockingSend() whenever the communication is through FSL is that microblaze_bwrite_datafsl() is used instead of microblaze_nbwrite_datafsl(). Finally, since the hardware unit returns the results directly, microblaze_bread_datafsl() is used in the BlockingSend() case while microblaze_nbread_datafsl() is used in the other. The results of these read operations are saved in an array, RxFSL. This array is read by the MicroBlaze using the BlockingReceive() or the NonBlockingReceive() function, which puts the matrix in a pointer to a data frame and returns the size of the matrix.

The same cycle is performed over the remaining hardware units, and the result matrices are concatenated and sent to the root. The root at this point joins all the matrices it received to get the final result, which is matrix C, the product of the two large matrices A and B.
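The chunking idea, in which a long row and column are cut into pieces of at most 15 values and the partial results are added, can be checked on a host machine with the sketch below. The function unit_multiply is only a software stand-in for the hardware unit, and chunked_dot is a hypothetical name; neither uses the FSL macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FSL_BUFFER 15u /* largest chunk accepted by the unit in our tests */

/* Software stand-in for the hardware unit: multiplies one chunk of a row
 * by the matching chunk of a column and returns the partial sum. */
static uint32_t unit_multiply(const uint32_t *row, const uint32_t *col, size_t n)
{
    assert(n <= FSL_BUFFER);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += row[i] * col[i];
    return sum;
}

/* Splits the row and column into chunks of at most FSL_BUFFER values
 * and accumulates the partial results, as the library does over FSL. */
uint32_t chunked_dot(const uint32_t *row, const uint32_t *col, size_t n)
{
    uint32_t sum = 0;
    for (size_t off = 0; off < n; off += FSL_BUFFER) {
        size_t chunk = (n - off < FSL_BUFFER) ? (n - off) : FSL_BUFFER;
        sum += unit_multiply(row + off, col + off, chunk);
    }
    return sum;
}
```

Unlike a loop written with a separate remainder pass, this form never issues an empty chunk when the length is an exact multiple of the chunk size.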

The application chosen involves a lot of communication between the different processing elements. Knowing that the system is heterogeneous, writing the code for such an application would normally require a lot of time. It would require the programmer to know details about the underlying system. For instance, the programmer needs to know the type of PE being communicated with as well as the bus used. Having a large number of PEs and a variety of buses would require the programmer to know how the different PEs and buses work in order to use the specific send and receive functions. With the help of our library, the programmer does not need to know much about the underlying topology and system design. It can be noted from the C codes explained above that all the communication between the PEs was done using the same functions.

Furthermore, some send and receive functions require several steps to prepare the data to be sent. Ethernet, for example, needs the data in a specific frame format. So the programmer not only needs to know the different functions, but also needs to write a block of code to prepare the data. With our library, the programmer uses only one instruction to send and one to receive.
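The way a single send call can hide the choice of bus is sketched below. The lookup table mirrors the idea of PE_FSL[] (an entry of -1 meaning that the PE is not reached through FSL), but the table contents, the stub transports and the names select_route and unified_send are assumptions made for this illustration and do not reproduce the library code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PE 4
#define NO_FSL (-1)

typedef enum { ROUTE_FSL, ROUTE_ETHERNET } Route;

/* Hypothetical topology: index = PE id, value = FSL link id or NO_FSL. */
static const int pe_fsl[NUM_PE] = { NO_FSL, 0, 1, NO_FSL };

/* Records which transport the last send used (for demonstration only). */
static Route last_route;

/* Stub transports: a real system would drive the FSL link or the EMAC here. */
static size_t send_over_fsl(int link, const uint8_t *data, size_t len)
{
    (void)link; (void)data;
    last_route = ROUTE_FSL;
    return len;
}

static size_t send_over_ethernet(int dest, const uint8_t *data, size_t len)
{
    (void)dest; (void)data;
    last_route = ROUTE_ETHERNET;
    return len;
}

/* Chooses the bus from the topology table. */
Route select_route(int dest)
{
    assert(dest >= 0 && dest < NUM_PE);
    return (pe_fsl[dest] != NO_FSL) ? ROUTE_FSL : ROUTE_ETHERNET;
}

/* The only call the application needs: the bus is picked internally. */
size_t unified_send(int dest, const uint8_t *data, size_t len)
{
    if (select_route(dest) == ROUTE_FSL)
        return send_over_fsl(pe_fsl[dest], data, len);
    return send_over_ethernet(dest, data, len);
}
```

With this structure, supporting another interconnect would mean adding a table and a transport function without changing any application code.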

9. Problems and Future Work

Throughout the fall term, the main difficulty we faced was getting familiar with the EDK tools. We were provided with four experiments performed at the University of Toronto on the Virtex-II using EDK. Performing these experiments turned out to be much more time consuming than we expected. We faced many problems at both the hardware and the software level. Our lack of prior experience with XPS was a major drawback to our progress. We were faced with problems that we did not understand, and we could not find enough documentation to solve them. Moreover, bitstream generation was extremely time consuming and we had to wait for hours every time we generated a new bitstream. The problem that we could not overcome by the end of the fall term was that we were unable to integrate more than one MicroBlaze on a single FPGA board. At the beginning of the spring term we understood that the reason behind this was the lack of BRAM blocks on the chip. We therefore decided to create a system spanning multiple FPGA boards and to add the customized hardware units.

At this stage there was a need to get familiar with the Create/Import Peripheral tool provided by XPS. It took us a lot of time to get the hardware units to work properly. On one hand, we had to generate a new bitstream every time we modified the VHDL code. We also needed to get familiar with the *.pao and *.mpd files, and we had to understand and preserve the handshaking mechanisms of the FSL bus. Another problem we faced was that certain instructions in the VHDL code did not meet the timing constraints of the FSL bus. We spent a lot of time trying to solve this problem in several ways and finally decided to create a clock divider as described in section 8. Finally, finding out that the reset ports of the FSL bus should be connected to net_gnd and not to the system reset was not straightforward.
On the other hand, we had to get familiar with the Ethernet communication mechanism. In order to implement communication through Ethernet we had to understand the functionality of the underlying hardware (EMAC controller, interrupt controller...). We went into the details of the Ethernet and interrupt control registers. We spent a lot of time tracking down the values of every bit in these registers in order to understand the functionality of the Ethernet send/receive instructions. We also experimented with multiple types of Ethernet send/receive operations provided by XPS until we found the ones needed for our library. Another difficulty was that we were using the P160 Comm Module 3 instead of the P160 Comm Module 2, which led to losing a lot of time and effort on debugging the code.

Overall, the fact that we had to change the topic of our FYP at the beginning of the spring term was a major drawback. During the spring term, prior to designing our library, we had to learn how to create and design hardware units, communicate through FSL and communicate through Ethernet. We also had to get familiar with VHDL coding in order to implement the functionality of the hardware units.

Although we were faced with a lot of problems, we were able to overcome most of them and to create the system described in section 4. We also implemented the parts of the send/receive operations that are used by our system. However, due to the limited time available, there are several issues which we did not tackle but which should not be neglected. First, sending and receiving data that does not fit in one Ethernet frame was not implemented. This feature can be added to the functions we created by implementing a loop that creates and sends multiple packets and by making sure that received packets do not overwrite each other. Second, the fact that communication through FSL occurs between a MicroBlaze and a hardware unit led us to implement a communication protocol for FSL that is specific to the hardware unit and the application. We were also restricted by the depth of the FSL bus, as described in section 8. However, the library can be upgraded to support communication through FSL between two MicroBlazes or other types of PEs, as well as communication through OPB and RocketIO. If larger boards are available, larger and more heterogeneous systems could be created, and the appropriate sections in the send/receive instructions could be modified to support these types of communication.

Also, the platform generator we designed in the fall term could be implemented and used to create several topologies of multiprocessor systems. This would allow testing resource utilization for different topologies as well as the operation of our send/receive instructions on more complex topologies. The performance of the matrix multiplication application could also be tested on different topologies.

It is important to note that another way to upgrade the send/receive instructions we created is to implement data forwarding and shortest-path calculations. This would allow efficient communication between any two processors. We are aware that efficiency was not a major criterion in our design. Networking protocols could be used to implement highly efficient inter-processor communication.
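The multi-frame extension mentioned above could start from a helper that describes how a large message is cut into frame-sized pieces. The sketch below only computes offsets and lengths; the 1495-byte limit is the one quoted earlier for a single frame, while the Fragment structure and the function names are hypothetical and no such code exists in the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAYLOAD 1495u /* largest data size that fits in one frame */

/* Describes one piece of a message that is too large for a single frame. */
typedef struct {
    size_t offset; /* position of the piece in the original message */
    size_t length; /* number of data bytes carried by this frame */
    uint8_t last;  /* 1 if this is the final frame of the message */
} Fragment;

/* Number of frames needed for len bytes (an empty message still uses one). */
size_t frames_needed(size_t len)
{
    return (len == 0) ? 1 : (len + MAX_PAYLOAD - 1) / MAX_PAYLOAD;
}

/* Fills *out for the frame with the given index.
 * Returns 0 on success, or -1 if index is out of range. */
int fragment_at(size_t len, size_t index, Fragment *out)
{
    assert(out != NULL);
    size_t count = frames_needed(len);
    if (index >= count)
        return -1;
    out->offset = index * MAX_PAYLOAD;
    size_t remaining = len - out->offset;
    out->length = (remaining < MAX_PAYLOAD) ? remaining : MAX_PAYLOAD;
    out->last = (uint8_t)(index + 1 == count);
    return 0;
}
```

On the receiving side, copying each piece to its own offset in the destination buffer is one way of making sure that received packets do not overwrite each other.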

10. Evaluation of Design and Constraints

In order to evaluate our design choices and the tasks we accomplished throughout the fall and spring semesters, the objective of our project needs to be stated. The main objective of the library we created is to abstract the underlying topology from the application programmer. This objective was met, since we were able to implement communication between the different PEs of the system we created using our send/receive instructions.

The performance of our library can be measured in terms of how much it reduces the code and how much time and effort it saves. We compared the size of the code required to send a packet through Ethernet before and after using our library. The results showed that a single Initialize() instruction at the beginning of the code replaced about 80 lines of code. Another 50 lines of code were replaced by a single send instruction. Even though this number of lines may not appear to be very large, it is important to take into account the information that was specified throughout these lines. Prior to using the unified communication library, the user needed to know all the details about the topology and the communicating processors. Using the send/receive instructions allows the programmer to send data regardless of the underlying topology. Application programming is now independent of the hardware design. Therefore, a lot of the time and effort that was spent looking into the details of the topology is now saved. This criterion can be considered an economic criterion, since time is money and saving time saves money.

It is important to note that if the library we created were to be upgraded to support more PE and interconnect types, it could be used along with a platform generator for multiprocessor systems. This would allow creating complex heterogeneous systems used for simulation purposes in all domains. In fact, the parallelization of applications is widely investigated nowadays and is used mainly in the sciences, where real-life simulations require high-speed processing.

Other real-life constraints such as manufacturability were taken into account. The choice of the FPGA is based on the fact that it is a reprogrammable device. Once the user purchases the device and the appropriate software tools, the device can be reprogrammed as many times as necessary to achieve the required results. Finally, as discussed in the previous section, the library we designed can easily be upgraded to support more types of PEs and interconnects. Throughout the process of designing our library, we tried to make it scalable and upgradable.

Appendix A: Technical Background

A.1. Field Programmable Gate Array (FPGA)

A Field-Programmable Gate Array is a semiconductor device that contains programmable logic components and programmable interconnects. The programmable logic components implement the functionality of basic logic gates (such as AND, OR, XOR and NOT), which can then be connected together to implement more complicated combinational functions (such as decoders or simple arithmetic functions). In most FPGAs, the programmable logic components, known as logic blocks, also include memory elements, which may be simple flip-flops or more complete memory blocks [17]. The typical FPGA logic block consists of a 4-input lookup table (LUT) and a flip-flop. The designer is free to choose how the logic blocks of the FPGA are connected, so in principle any digital system that fits within the device's resources can be implemented on it. The FPGA is comparable to a single-chip programmable breadboard [17]. The logic blocks and interconnects are programmed by the designer so that the FPGA performs the required logic function.

The FPGA was created in response to the need for re-programmability, and it is often compared to the Application-Specific Integrated Circuit (ASIC). The advantages of FPGAs over ASICs are their shorter design time, their re-programmability, which helps to fix bugs in the field, and their lower non-recurring engineering cost (the cost incurred for hardware changes, which is much higher for an ASIC). FPGAs are also available in a range of versions and prices, which is considered a further advantage over ASICs. The main disadvantages of an FPGA are its limited ability to handle very complex designs and its high power consumption. Thus, FPGAs are mainly used in the development and testing of a design; the final system is then implemented on an ASIC.

More recently, FPGAs have been developed to accommodate a complete system on a programmable chip.
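The 4-input LUT mentioned above can be modelled in a few lines of C. This is only an illustrative software model, with hypothetical names (lut4, lut4_program, lut4_eval); it assumes that the LUT is a 16-entry truth table indexed by the four inputs, which is the usual way such blocks are described. Programming the LUT amounts to filling in the 16 bits; evaluating it is a table lookup.

```c
#include <assert.h>
#include <stdint.h>

/* A 4-input LUT stores one output bit for each of the 16 input combinations. */
typedef struct {
    uint16_t truth_table;
} lut4;

/* "Program" the LUT by evaluating a boolean function on all 16 inputs.
   Input a is index bit 0, b is bit 1, c is bit 2, d is bit 3. */
lut4 lut4_program(int (*fn)(int a, int b, int c, int d))
{
    lut4 lut = { 0 };
    assert(fn != 0);
    for (unsigned idx = 0; idx < 16; idx++) {
        int a = (int)(idx & 1u);
        int b = (int)((idx >> 1) & 1u);
        int c = (int)((idx >> 2) & 1u);
        int d = (int)((idx >> 3) & 1u);
        if (fn(a, b, c, d)) {
            lut.truth_table |= (uint16_t)(1u << idx);
        }
    }
    return lut;
}

/* Evaluate the LUT: the four inputs select one stored bit. */
int lut4_eval(lut4 lut, int a, int b, int c, int d)
{
    unsigned idx = (unsigned)(a & 1) | ((unsigned)(b & 1) << 1) |
                   ((unsigned)(c & 1) << 2) | ((unsigned)(d & 1) << 3);
    return (lut.truth_table >> idx) & 1;
}

/* Two example functions that a single LUT can implement. */
int and4(int a, int b, int c, int d) { return a & b & c & d; }
int xor4(int a, int b, int c, int d) { return a ^ b ^ c ^ d; }
```

The same structure implements a 4-input AND or a 4-input XOR; only the stored 16-bit pattern differs. This is why one kind of logic block is enough to build arbitrary combinational logic once the blocks are connected.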
A system on a programmable chip combines the logic blocks and interconnects of the FPGA with embedded microprocessors and related peripherals. An example of these hybrid technologies is the Xilinx Virtex-II Pro, which includes one or more PowerPC processors embedded within the FPGA's logic fabric [17].

Creating a system on an FPGA involves several steps. First, Hardware Description Language (HDL) code needs to be generated. Then an electronic design automation tool, such as XPS, is used to create a binary file, or bitstream, which is downloaded onto the FPGA to implement the logic design by configuring and connecting the LUTs in a specific way.

The FPGA used in our design is the Virtex-II, located on the Memec V2MB1000 development board. This FPGA provides one million system gates. The system board includes a 16M x 16 DDR memory, two clock sources, an RS-232 port, and additional support circuits. The figure below shows the board and its features [13].

Figure 13: Virtex-II V2MB1000

The FPGA on this board is the Virtex-II XC2V1000-4FG456C. The Virtex-II family is a platform FPGA developed for high-performance, low- to high-density designs utilizing IP cores and customized modules. The DDR memory is a 32 MB memory implemented using the Micron MT46V16M16TG-75 16M x 16 DDR device. Clock generation is provided by two on-board oscillators running at 100 MHz and 24 MHz. Communication between the board and an external device, such as a computer, is possible through the RS-232 port, whereas the board is programmed through the JTAG port using a JTAG connector [13].

A.2. MicroBlaze: a soft reconfigurable microprocessor core

The processor used to design our multiprocessor system is the MicroBlaze soft processor. The MicroBlaze is a soft RISC (Reduced Instruction Set Computer) processor optimized for use on Xilinx FPGAs. Being a soft processor gives it many advantages over traditional hardwired processors [7]. The main advantage of soft processors is that they are reconfigurable and can be customized according to the needs of the application. The major difference between a soft processor and an ordinary microprocessor is the ability to make substantial changes to the data path itself, in addition to the control flow. Since it is implemented on an FPGA, it can always be updated on the same board at no additional cost.

A block diagram of the MicroBlaze core is shown in the figure below [7]:

Figure 14: MicroBlaze Architecture

The MicroBlaze embedded soft core is highly configurable: users can select the features they need according to the application. It has a fixed feature set that includes thirty-two 32-bit general purpose registers, a 32-bit instruction word with three operands and two addressing modes, a 32-bit address bus and a single-issue pipeline. Other features are optional and are listed in the table below [7].

Table 1: Optional features

MicroBlaze uses Big-Endian, bit-reversed format to represent data. The hardware-supported data types are word, half word, and byte. In addition to the thirty-two general purpose registers, which are reset on bitstream download, there are five special purpose registers. The Program Counter holds the address of the next instruction; the Machine Status Register contains control and status bits for the processor; the Exception Address Register stores the full load/store address that caused the exception; the Exception Status Register contains status bits for the processor; and the Floating Point Status Register contains status bits for the floating point unit [7].

The three stages of its pipeline are fetch, decode and execute. For most instructions, each stage requires one clock cycle to complete.

MicroBlaze has a Harvard memory architecture, i.e. instructions and data are located in separate address spaces. Each address space has a 32-bit range (i.e. up to 4 gigabytes each of instruction and data memory). The instruction and data memory ranges can overlap if they are both mapped to the same physical memory. MicroBlaze uses memory-mapped I/O, so it does not distinguish between I/O and memory accesses. The processor has up to three interfaces for memory accesses: Local Memory Bus (LMB), On-Chip Peripheral Bus (OPB), and Xilinx CacheLink (XCL). It has a floating point unit based on the IEEE 754 standard [7].

MicroBlaze can be configured with up to eight Fast Simplex Link (FSL) interfaces, each consisting of one input and one output port. The FSL channels are dedicated unidirectional point-to-point data streaming interfaces [7].
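The big-endian, bit-reversed data format mentioned above can be illustrated with a short host-side sketch in C. The function names are hypothetical, and the sketch assumes the usual reading of the convention: the most significant byte of a word sits at the lowest address, and bit 0 denotes the most significant bit while bit 31 denotes the least significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit word in big-endian order:
   the most significant byte goes to the lowest address. */
void store_word_be(uint8_t mem[4], uint32_t word)
{
    mem[0] = (uint8_t)(word >> 24);
    mem[1] = (uint8_t)(word >> 16);
    mem[2] = (uint8_t)(word >> 8);
    mem[3] = (uint8_t)word;
}

/* Reassemble a 32-bit word from four big-endian bytes. */
uint32_t load_word_be(const uint8_t mem[4])
{
    return ((uint32_t)mem[0] << 24) | ((uint32_t)mem[1] << 16) |
           ((uint32_t)mem[2] << 8)  |  (uint32_t)mem[3];
}

/* Bit-reversed numbering (assumed here): bit 0 is the most
   significant bit, bit 31 the least significant bit. */
int get_bit_msb0(uint32_t word, unsigned n)
{
    assert(n < 32);
    return (int)((word >> (31u - n)) & 1u);
}
```

For the word 0x12345678, the byte at the lowest address is 0x12 and the byte at the highest address is 0x78. This matters when data is exchanged over Ethernet or FSL with a processor or host that uses a different byte order.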
