A Dynamically Reconfigurable FPGA-based Content Addressable Memory for IP Characterization


Master Thesis ELE/ESK/

A Dynamically Reconfigurable FPGA-based Content Addressable Memory for IP Characterization

Supervisors: Axel Jantsch, Kjell Torkelsson
Examiner: Axel Jantsch

Master of Science Thesis in Electronic System Design by Johan M. Ditmar
Stockholm, March 2000

Abstract

IP characterization is the process of classifying IP packets into groups depending on information in the header. This report describes three implementations of FPGA-based, dynamically reconfigurable Content Addressable Memories (CAMs) for Internet Protocol version 6 characterization. These CAMs are characterized by a wide search word, a relatively small number of CAM words, and the fact that these words may contain don't cares.

To implement the CAMs, the CAM words were divided into a number of reconfigurable match blocks. In the first implementation, called the fixed length CAM, the number of these blocks is equal for all words. A more advanced architecture was developed as well, in which blocks that merely store don't cares are omitted, leading to a varying number of reconfigurable blocks for each word. By placing these blocks in a smart way, more CAM words can be stored. This CAM is referred to as the variable length CAM. In the last implementation an explicit priority mechanism was added, with which the priority can be programmed for each CAM word. This eliminates the slow insertion and deletion times without adding significant hardware costs.

The CAMs were implemented on a Xilinx Virtex FPGA, and the reconfiguration of this device is done dynamically from a Java environment. A user interface for changing the contents of the CAM was developed, together with a hardware interface that lets the software communicate with the FPGA. It has been shown that, using this technology, a CAM containing over 100 words of 320 bits can be implemented that is able to perform more than 7 million look-ups per second.

Acknowledgements

I would like to thank Kjell Torkelsson at CadLab for providing daily support and helping me with the problems that occurred. I would also like to thank Axel Jantsch (Kungliga Tekniska Högskolan) for supervising the project. I thank Sabih Gerez (University of Twente) for professional input from the home front. Other people from my university in Holland who have helped me are Jaap Hofstede and Professor Herrmann.

This is also a good moment to thank my father, who has supported me in both personal and financial matters during my studies in Holland and abroad. Without his help, this would not have been possible. Finally, thanks to Anna for being so patient and waiting for me when I worked late again.

List of Abbreviations

API     Application Program Interface
CAM     Content Addressable Memory
CLB     Configurable Logic Block
CRC     Cyclic Redundancy Checking
DLL     Dynamic Link Library
FPGA    Field Programmable Gate Array
FSM     Finite State Machine
GUI     Graphical User Interface
IOB     Input/Output Block
IP      Internet Protocol
IPv4 / IPv6   Internet Protocol version 4 / version 6
LC      Logic Cell
LUT     Look Up Table
PAR     Place And Route
PCI     Peripheral Component Interconnect
STM     Synchronous Transport Module
TBUF    Tristate Buffer
TCP     Transmission Control Protocol
TLU     Table Look Up
UCF     User Constraints File
UDP     User Datagram Protocol
VHDL    Very high speed integrated circuit Hardware Description Language
XHWIF   Xilinx Hardware Interface

Table of Contents

Introduction
    Purpose and Motives
    Outline
Chapter 1: IP Characterization
    1.1 Internet Protocol Description
        1.1.1 IP header
        1.1.2 IP routing
    1.2 Specification and Requirements
        1.2.1 Specification
        1.2.2 Priority
        1.2.3 Performance requirements
Chapter 2: Design Methodology
    2.1 Dynamic Reconfiguration
    2.2 Hardware Environment
    2.3 Software Environment
Chapter 3: Target Technology
    3.1 Virtex Architecture
        3.1.1 Global architecture
        3.1.2 Configurable logic block
        3.1.3 Tri-state buffers
        3.1.4 Block RAM
        3.1.5 General routing
    3.2 Virtex Configuration
        3.2.1 Configuration memory
        3.2.2 Programming bitstream
Chapter 4: JBits
    4.1 Introduction
    4.2 JBits Programming Model
        4.2.1 Constructor
        4.2.2 Reading the bitstream
        4.2.3 Setting a resource
        4.2.4 Writing the bitstream
        4.2.5 Getting the resource configuration
Chapter 5: CAM Structures
    5.1 RAM-based versus CAM-based Look Up
    5.2 Different CAM Types
        5.2.1 Binary versus Ternary CAMs
        5.2.2 Return value
        5.2.3 Prioritizing
    5.3 Fixed Length CAM
        5.3.1 Global structure
        5.3.2 Match line
    5.4 Variable Length CAM
        5.4.1 Global structure
        5.4.2 Mapping
        5.4.3 Placing
    5.5 Explicit Priority Encoder
        5.5.1 Different implementations
        5.5.2 Global structure
        5.5.3 Switch box
        5.5.4 Output decoder
        5.5.5 Method to program the return value
    5.6 Device Utilization
        5.6.1 Fixed length CAM
        5.6.2 Variable length CAM
        5.6.3 Inherent priority encoder
        5.6.4 Explicit priority encoder
        5.6.5 Summary
Chapter 6: Implementation of the Board Interface
    6.1 Port Description
    6.2 VHDL Description
Chapter 7: Hardware Implementation of the CAM
    7.1 Implementation of the Fixed Length CAM
        7.1.1 Virtex primitives
        7.1.2 Stitcher
    7.2 Implementation of the Variable Length CAM
        7.2.1 Shift Register
        7.2.2 Multiplexer
        7.2.3 Stitcher
        7.2.4 Switch
    7.3 Implementation of the Priority Encoder
        7.3.1 Inherent priority encoder
        7.3.2 Explicit priority encoder
Chapter 8: Synthesis and Place & Route
    8.1 Method
        8.1.1 Synthesis
        8.1.2 Place & Route
    8.2 Physical Structure
        8.2.1 Fixed length CAM
        8.2.2 Variable length CAM
        8.2.3 Explicit priority encoder
    8.3 Results of the Fixed Length CAM
        8.3.1 FPGA editor view of the CAM
        8.3.2 Device utilization
        8.3.3 Timing analysis
    8.4 Results of the Variable Length CAM
        8.4.1 FPGA editor view of the CAM
        8.4.2 Device utilization
        8.4.3 Timing analysis
    8.5 Results of the Variable Length CAM with Explicit Priority
        8.5.1 FPGA editor view of the CAM
        8.5.2 Device utilization
        8.5.3 Timing analysis
    8.6 Summary
Chapter 9: Software Implementation
    9.1 JAVA User Application
        9.1.1 General description
        9.1.2 GUI of the fixed length CAM
        9.1.3 GUI of the variable length CAM
        9.1.4 GUI of the variable length CAM with explicit priority
    9.2 Hardware Interface
        9.2.1 Native interface
        9.2.2 Remote interface
    9.3 JBits Integration
        9.3.1 Components, resources and values
        9.3.2 Describing the CAM structure
        9.3.3 Initialization
        9.3.4 Converting SRL16 to LUT
        9.3.5 Cyclic Redundancy Checking (CRC)
    9.4 Preprocessing
        9.4.1 Logical and physical priority
        9.4.2 Relation between CAM words
        9.4.3 Checking for conflicts
        9.4.4 Adding a new entry
        9.4.5 Utilization of empty match blocks
Chapter 10: Conclusions and Recommendations
    10.1 Summary
    10.2 Conclusions
    10.3 Recommendations
        10.3.1 Recommendation concerning the CAM
        10.3.2 Recommendations concerning the tools
References
List of Used Tools
Appendices
    Appendix A: Port Definitions of the Board Interface
    Appendix B: VHDL Code of the Board Interface
    Appendix C: FPGA Editor View of Various CAMs
    Appendix D: User Interface Manual of Various CAM Implementations
    Appendix E: Implementation of the Native Interface
    Appendix F: Remote Hardware Interface
    Appendix G: CAMConstants.java for the Variable Length CAM with Explicit Priority

Introduction

Purpose and Motives

The Internet Protocol (IP) provides the basis for the interconnections of the Internet. Its application field is growing very rapidly, with requirements doubling every three months. In the future, IP will not only be used to interconnect computers; all kinds of equipment will use this protocol to communicate with each other, including base stations for cellular communication. Due to the increasing demand for high bandwidth, much effort is being put into making faster IP handling systems. Not only speed but also flexibility is an important factor here, since new standards and applications have to be supported at all times. One way to gain both speed and flexibility is to move critical software functions to reconfigurable hardware.

One of these critical functions is IP characterization, as done in firewalls and routers. IP characterization is the process of classifying packets into groups that require special treatment. A subset of IP characterization is IP filtering, a security feature that restricts IP traffic by permitting or denying packets according to certain rules. This way users can be restricted to specific domains or applications on the Internet.

To do characterization, the IP headers that reach a router are compared with patterns stored in a table, and an output is generated. At present this table is stored in memory and matching is done entirely in software. Due to growing requirements, software is becoming too slow and alternative implementations need to be considered. Semiconductor companies have responded by producing full custom Content Addressable Memories that are fast and can store a large amount of data.

The goal of this project is to implement such a CAM in a Field Programmable Gate Array (FPGA). This way matching is done purely in hardware and is therefore faster than the software solution. In addition, this approach has several advantages over the full custom solution:

1. FPGAs are used in IP handling systems for other applications besides characterization. The CAM functionality can therefore be integrated with other logic on the same chip.
2. Implementing a CAM in FPGA technology makes it possible to add extra features. It is, for example, possible to add logic that gathers statistical data by counting the number of packets that satisfy certain rules.

Since IP characterization is a dynamic process, the content of the FPGA will need to be updated regularly. This can be done in several ways, one of which is dynamic reconfiguration. The goal of this project is to design and implement an FPGA-based Content Addressable Memory based on the idea described above. This report describes the design methodology and summarizes the results.

[Figure: an IP packet header is fed to the FPGA CAM, which outputs the group.]

Outline

The report starts with some theory on the Internet Protocol, together with a description of IP characterization and the specifications of the CAM. This forms chapter 1.

Chapter 2 describes the methodology that is used to design the CAM. This includes the hardware that was available, the tools that have been used and the design flow. This chapter also gives a general discussion of dynamic reconfiguration.

Chapter 3 describes the architecture of the target technology in which the CAM is implemented, in this case a Xilinx Virtex FPGA. It also gives an overview of the available FPGA resources and some information on the programming bitstream.

JBits, the tool used for dynamically reconfiguring the FPGA, is described in chapter 4. Here a general description of the JBits programming model and its ability to change a configuration is given.

Chapter 5 starts with some theory on CAMs and discusses alternatives to using a CAM. Then two proposed CAM structures are described, the fixed and the variable length CAM, together with the way they are mapped onto the Virtex architecture. The structure of an explicit priority encoder is described as well; it makes it possible to program not only the value of each CAM word, but also its priority with respect to other entries.

In chapter 6 the hardware implementation of the board interface is summarized. This interface is implemented on the FPGA and enables communication between the CAM and the board of which the FPGA is part.

Chapter 7 first describes how the two CAM structures of chapter 5 (fixed and variable length CAM) have been implemented in hardware. The explicit priority encoder described in chapter 5 was integrated with the variable length CAM to form a new implementation. This leads to a total of three different CAM implementations.

Chapter 8 discusses the synthesis and place & route process of the three CAM implementations. It describes how the tools have been used and what measures had to be taken for a successful implementation. The physical layout of the CAM structures is described as well, and finally the results are given with respect to timing and hardware utilization.

Chapter 9 describes how the software part of the CAM has been implemented, focusing on the integration of JBits in the design. The graphical user interface of the CAMs and some other software parts, such as the interface to the hardware and a remote interface to access the hardware via a network, are only mentioned in this chapter. A more detailed description of these parts can be found in the appendices.

Conclusions and recommendations are given in chapter 10. References, a list of used tools and a set of appendices form the last part of this report. The appendices contain Java and VHDL code and describe the parts of the software that have been implemented but are outside the scope of this report.

Chapter 1: IP Characterization

1.1: Internet Protocol Description

The Internet consists of a number of interconnected networks, supporting communication among host computers using certain protocols. These networks are required to provide only packet (connectionless) transport, and the protocol that is responsible for the movement of packets around the network is called the Internet Protocol (IP). According to the IP specification, packets can be delivered out of order, be lost or duplicated, and/or contain errors [1].

1.1.1 IP header

An IP packet is of variable length and consists of two parts:

1. Header: contains information about addressing and control.
2. Payload: the data encapsulated in the IP packet, following the header. It contains a higher level protocol (TCP, UDP) packet with its own header and payload.

An IP packet, together with the header format for IP version 6, is given below in figure 1-1 [2].

[Figure 1-1: IP packet with header format for IP version 6. The packet consists of a 40-byte header followed by the payload. The header is drawn in 32-bit rows (bits 0 to 31) holding the fields Version, Traffic Class, Flow Label, Payload Length, Next Header, Hop Limit, Source Address and Destination Address.]

The fields of the header are:

Version: 4 bits wide, equal to 4 or 6.
Traffic Class: an 8-bit field, of which 6 bits are currently used. It is available to distinguish between different classes or priorities of IP packets.
Flow Label: 20 bits, which may be used by a source to label sequences of packets for which it requires special handling by the IPv6 routers.
Payload Length: a 16-bit integer giving the length of the payload in octets.
Next Header: the type of the header that is encapsulated inside the payload of the IPv6 packet, i.e. the type of header immediately following the IPv6 header.
Hop Limit: an 8-bit integer, decremented by 1 every time a router handles the packet. If the Hop Limit is decremented to zero, the packet is discarded, which prevents packets from running in circles forever and flooding a network.
Source Address: the 128-bit address of the originator of the packet.
Destination Address: the 128-bit address of the intended recipient of the packet.

1.1.2 IP Routing

Routing is the selection of paths for packets in a network.

[Figure 1-2: a number of nodes attached in a mesh network (nodes A to D, links a to d).]

Figure 1-2 shows a network consisting of 4 nodes A to D and 4 links a to d. Suppose a packet is travelling from A to D. The packet will arrive at node B via the incoming link a and can leave via outgoing link b or c. Each node maintains a routing table, which is used to select an outgoing link from the Destination Address field in the IP header [3]. This is done by the Table Look Up (TLU) function, as depicted in figure 1-3, which shows a schematic view of a router. After TLU, a packet passes an IP characterization function that decides whether to discard the IP packet, to send it to software for further processing, or to send it to the outgoing link selected by the TLU function. Characterization is also done to obtain statistical data, for example how many IPv4 packets have arrived.

[Figure 1-3: Schematic view of a router, showing table look up and IP characterization. Labels: incoming link, Table Look up, IP Characterization, outgoing links to other nodes, SW.]
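As an illustration of the IPv6 header fields described earlier in this chapter, the following minimal sketch decodes the fixed 40-byte header from a byte array. The field widths are the ones given in the text; the class name `Ipv6HeaderSketch` and its field names are hypothetical and are not part of the thesis software.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of the thesis code: decodes the fixed IPv6 header.
class Ipv6HeaderSketch {
    final int version;        // 4 bits
    final int trafficClass;   // 8 bits
    final int flowLabel;      // 20 bits
    final int payloadLength;  // 16 bits, length of the payload in octets
    final int nextHeader;     // 8 bits, type of the encapsulated header
    final int hopLimit;       // 8 bits
    final byte[] source = new byte[16];       // 128-bit source address
    final byte[] destination = new byte[16];  // 128-bit destination address

    Ipv6HeaderSketch(byte[] packet) {
        if (packet.length < 40) {
            throw new IllegalArgumentException("an IPv6 header needs 40 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(packet); // ByteBuffer is big-endian by default
        int word0 = buf.getInt();                 // version, traffic class and flow label
        version = (word0 >>> 28) & 0xF;
        trafficClass = (word0 >>> 20) & 0xFF;
        flowLabel = word0 & 0xFFFFF;
        payloadLength = buf.getShort() & 0xFFFF;
        nextHeader = buf.get() & 0xFF;
        hopLimit = buf.get() & 0xFF;
        buf.get(source);
        buf.get(destination);
    }

    public static void main(String[] args) {
        byte[] packet = new byte[40];
        packet[0] = 0x60;  // version 6, everything else zero
        packet[6] = 6;     // next header: TCP
        packet[7] = 64;    // hop limit
        Ipv6HeaderSketch h = new Ipv6HeaderSketch(packet);
        System.out.println("version=" + h.version + " nextHeader=" + h.nextHeader
                + " hopLimit=" + h.hopLimit);
    }
}
```

In a hardware characterizer these fields are simply wired out of the incoming header; the sketch only makes the bit positions explicit.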

In paragraph 1.2, IP characterization will be discussed in more detail.

1.2: Specification and Requirements

1.2.1 Specification

The packet classification is based on some fields in the header and on the higher level protocol that is encapsulated in the payload of the IP packet. These attributes are given in table 1-1, together with their size in bits. They are presumed to be available and do not need to be generated.

Field                        Number of bits
IP Source Address            128
IP Destination Address       128
Incoming Link (a)            6
Outgoing Link (a)            6
Next Header                  8
Traffic Class                6
TCP/UDP Source Port          16
TCP/UDP Destination Port     16
TCP/UDP Syn/Ack              1
Total number of bits         315

Table 1-1: IP characterization attributes and their size in bits.
(a) These attributes are not part of the IP packet, but are information from the router.

The input of the IP characterizer is thus a 315-bit vector. The output of the system decides what actions should be performed for further processing of the packet. Instead of letting the output point directly to the class to which the incoming IP packet belongs, the system returns an index, which is processed in software to perform the necessary actions.

The IP characterization function can therefore be seen as a Content Addressable Memory (CAM). A CAM is a memory which can be viewed as the opposite of a normal SRAM. For an SRAM the address is used as input and data is provided as output. A CAM works the other way around: the data is the input, all the memory words (or entries) are searched, and when a match is found the address of that word is given as output. The CAM should be implemented in a ternary way, which means that the entries stored in it may contain don't cares. For this application a CAM of at least 128 entries should be implemented. The description given above is summarized in the figure below.

[Figure 1-3: schematic view of the IP characterization function. Labels: Data (input), entry 0 to entry 127, Address, Match.]

1.2.2 Priority

As mentioned before, the input of the CAM can contain both source and destination address and other information. In some cases, this other information may take priority over the addressing information altogether. This priority effect also exists with hierarchically structured addresses that contain don't cares. Here, one part of an address takes priority over another part in the matching decision, depending on which address has the fewest don't cares [4]. This means that when two or more entries match a certain input, the index of the most specific entry, i.e. the entry whose address contains the fewest x's, should be returned. This requires a priority scheme for the entries.

1.2.3 Performance requirements

The time available for characterizing an IP packet is equal to the time it takes to transfer this packet over the communication channel. The minimum look-up rate of the CAM is then equal to:

$$ f_{\min} = \frac{B}{8L} $$

with B the bandwidth of the communication channel in bits/s and L the length of the IP packet in bytes.

There are two possible schemes for calculating the timing requirements. The first scheme calculates the requirements using an average IP datagram length of 200 bytes. This implementation requires buffering. Since the communication channel is used to transfer voice and video as well, latency should be minimized where possible, and buffering is therefore not favourable. The second scheme uses a (worst case) minimum datagram length, which is equal to 40 bytes, the length of the header only.

Table 1-2 shows the required average and worst case look-up rates of the CAM for different standardized bandwidths [5]. The STM-4 standard is currently in development and is to be considered as the minimum requirement.
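The ternary matching and the "fewest don't cares wins" rule of the specification and priority paragraphs can be sketched in software as follows. This is only a behavioural model under stated assumptions: the class `TernaryCamSketch`, the pattern notation using '0', '1' and 'x', and the sequential search are illustrative choices. In the hardware CAM all entries are compared in parallel.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical behavioural model of a ternary CAM with "most specific entry wins" priority.
class TernaryCamSketch {
    /** One CAM word: a value plus a mask whose 1-bits mark positions that must match. */
    private static final class Entry {
        final BigInteger value;
        final BigInteger careMask;

        Entry(BigInteger value, BigInteger careMask) {
            this.value = value.and(careMask);
            this.careMask = careMask;
        }

        boolean matches(BigInteger key) {
            return key.and(careMask).equals(value);
        }

        int careBits() {
            return careMask.bitCount(); // more care bits means fewer don't cares
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Stores a pattern of '0', '1' and 'x' (don't care), MSB first; returns its index. */
    int add(String pattern) {
        BigInteger value = BigInteger.ZERO;
        BigInteger mask = BigInteger.ZERO;
        for (char c : pattern.toCharArray()) {
            value = value.shiftLeft(1);
            mask = mask.shiftLeft(1);
            if (c == '1') {
                value = value.setBit(0);
                mask = mask.setBit(0);
            } else if (c == '0') {
                mask = mask.setBit(0);
            } else if (c != 'x') {
                throw new IllegalArgumentException("bad symbol: " + c);
            }
        }
        entries.add(new Entry(value, mask));
        return entries.size() - 1;
    }

    /** Returns the index of the matching entry with the fewest don't cares, or -1. */
    int lookup(BigInteger key) {
        int best = -1;
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            if (e.matches(key)
                    && (best < 0 || e.careBits() > entries.get(best).careBits())) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TernaryCamSketch cam = new TernaryCamSketch();
        cam.add("10xx");
        cam.add("1011");
        System.out.println(cam.lookup(BigInteger.valueOf(0b1011))); // both match; index 1 is more specific
    }
}
```

`BigInteger` is used so that the same sketch works for the 315-bit search word; a real implementation would use a fixed-width representation.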

Bandwidth [Mbit/s]    Required look-up rate (average) [klookups/s]    Required look-up rate (worst case) [klookups/s]
155 (STM-1)           …                                               …
… (STM-4)             388                                             1,…
… (STM-16)            1,550                                           7,750

Table 1-2: required look-up rates for different bandwidths.
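The look-up rate formula of the performance requirements, f_min = B / (8 L), can be evaluated with a few lines of code. The sketch below is an illustration only. The 155 Mbit/s figure and the packet lengths of 200 and 40 bytes come from the text; the value 2,480 Mbit/s is an assumption, chosen because it reproduces the 1,550 and 7,750 klookups/s that appear in Table 1-2. The class and method names are hypothetical.

```java
// Hypothetical helper that evaluates f_min = B / (8 * L).
class LookupRateSketch {
    /** Minimum look-up rate in look-ups per second for bandwidth B [bit/s] and packet length L [bytes]. */
    static double minLookupRate(double bandwidthBitsPerSecond, int packetLengthBytes) {
        return bandwidthBitsPerSecond / (8.0 * packetLengthBytes);
    }

    public static void main(String[] args) {
        // 155 Mbit/s is taken from the text; 2480 Mbit/s is an assumed value (see lead-in).
        double[] bandwidthsMbit = {155.0, 2480.0};
        for (double mbit : bandwidthsMbit) {
            double average = minLookupRate(mbit * 1e6, 200) / 1e3;   // average datagram, 200 bytes
            double worstCase = minLookupRate(mbit * 1e6, 40) / 1e3;  // header only, 40 bytes
            System.out.printf("%.0f Mbit/s: average %.0f klookups/s, worst case %.0f klookups/s%n",
                    mbit, average, worstCase);
        }
    }
}
```

Because the worst case packet is five times shorter than the average one, the worst case rate is always five times the average rate.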

Chapter 2: Design Methodology

2.1: Dynamic Reconfiguration

Dynamically reconfigurable hardware allows a flexible adaptation to the needs of the application by reprogramming it at run-time [6]. The idea is that a general-purpose hardware agent is configured to carry out a specific task, but can be reconfigured on demand to carry out other specific tasks. This definition is rather general, and one can distinguish between different levels of reconfiguration:

evolutionary reconfiguration: the reconfiguration is initiated by developers in response to evolving system requirements. Examples are requirement changes and bug fixes. This type of dynamic reconfiguration is suitable for non-stop applications that do not tolerate a break in service.
adaptive reconfiguration: the structure of the application is changed in response to application or system events. In this case, the application is adapted depending on data or changing parameters.
evolving reconfiguration: the structure of the application is changed depending on the result of a previous stage in the execution.

It should be noted that other definitions of dynamic reconfiguration exist [7].

Dynamically reconfigurable systems can be implemented using reconfigurable Field Programmable Gate Arrays (FPGAs). These are SRAM-based circuits that can be programmed with a specific function during system operation. They may support partial reconfiguration, which means that it is possible to reprogram only a part of the FPGA. Partial reconfiguration can be non-disruptive or disruptive. Non-disruptive means that the portions of the system which are not being reconfigured remain fully operational during reconfiguration. If the reconfiguration affects other portions of the system, it is called disruptive and a clock hold is needed.

IP characterization is based on adaptive reconfiguration, since it reacts to a change in classification rules by adapting its structure. Ideally the design should be reconfigured in a non-disruptive way. However, partial reconfiguration is not supported by the development environment at hand and is therefore not feasible at this moment.

2.2: Hardware Environment

The implementation is targeted at a Xilinx Virtex XCV1000 FPGA. This FPGA is situated on a PCI card, the RC1000-PP made by Embedded Solutions Ltd. A block diagram of the architecture of the board is given in figure 2-1 [8]. The board has four memory banks, each with 2 MBytes of asynchronous SRAM. The FPGA has four 32-bit memory ports, one for each memory bank, and each bank has separate data, address and control signals. The FPGA can therefore access all four banks simultaneously and independently.

There are three methods of transferring data or communicating between the FPGA and the PCI host:

1. The memory banks are used to perform bulk data transfers.
2. There are two unidirectional ports for direct communication between the host and the FPGA, using two 8-bit paths with handshaking signals, depicted in the figure as Ctrl_Reg and Stat_Reg.
3. Two unidirectional ports, GPO and GPI, provide single-bit communication without handshaking.

Finally, there is a programmable clock that is controlled by the host.

[Figure 2-1: block diagram of PCI card. Labels: PCI bus, Reset, GPO, GPI, Stat_Reg, Ctrl_Reg, SRAM 0 to SRAM 3, Data0 to Data3, FPGA XCV1000, clk, Programmable Clock, Configuration.]

This board is used in a PC with a 450 MHz Pentium II processor and 320 MB RAM, running Windows NT.

2.3: Software Environment

The design tool flow describes the tools that have been used and the data flow between them, and is shown in figure 2-2. It consists of a static and a dynamic part [9]. The static part is used to implement the part of the logic that is not dynamically changed. This includes I/O and logic that interfaces to the dynamic part. Furthermore, it initializes the dynamic part of the design, either by generating a basic structure that will be changed in a later phase, or by reserving an empty area in which the dynamic part is to be placed. The static part is designed using VHDL synthesis in combination with the tools available from Xilinx for place and route and bitstream generation.

[Figure 2-2: design tool flow for dynamic reconfigurable logic. Labels: static, dynamic, VHDL, Synthesis, EDIF, Place and Route, Generate bitstream for configuration, JBits, BoardScope, User JAVA Application, XHWIF Hardware Interface, Bitstream.]

The dynamic part of the design controls the configuration of the FPGA during operation of the application, and a tool called JBits is used for this. With JBits, the programming bitstream of the FPGA can be altered easily with relatively simple commands, as explained in chapter 4. The JBits functionality is used in a main JAVA application that implements the user interface of the CAM and controls the reconfiguration. This application has to communicate with the hardware, and for this the Xilinx Hardware Interface (XHWIF) is used [10]. It permits simple porting of JBits to the hardware. It includes methods for reading and writing bitstreams to FPGAs, incrementing the on-board clock, and reading from and writing to the on-board memory. Via the hardware interface, the vendor-specific C functions for communicating with the board can be used in the user Java application.

BoardScope is a tool that can read a configuration back from the FPGA, including the state of instantiated flip-flops. This tool is very useful for debugging purposes, because simulation of the dynamic part is not possible.

Chapter 3: Target Technology

This chapter describes the architecture of the Xilinx Virtex FPGA series.

3.1 Virtex Architecture

3.1.1 Global architecture

Xilinx Virtex has a regular structure, consisting of configurable logic blocks (CLBs) surrounded by programmable input/output blocks (IOBs) [11]. This is shown in figure 3-1.

[Figure 3-1: global architecture of Virtex FPGA. A central array of CLBs, with BlockRAM columns beside it, surrounded by IOBs.]

The CLBs provide the functional elements for constructing logic, and the IOBs provide the interface between the package pins and the CLBs. By connecting the IOBs and CLBs together using general routing resources, a complex circuit can be built. Besides the CLBs and IOBs, Virtex also contains integrated SRAM blocks, called BlockRAM, and 3-state buffers (TBUFs). The CLBs, BlockRAM and TBUFs will be discussed in the next paragraphs.

3.1.2 Configurable logic block

A schematic view of the Virtex CLB is given in figure 3-2. It consists of two identical parts, called slices, and each slice has two logic cells (LCs). An LC includes a 4-input look-up table (LUT), carry logic and a storage element. The LUTs can be configured in different ways:

Any combinatorial function of four inputs;
16x1 synchronous RAM;
16-bit shift register.
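The first LUT mode listed above, any combinatorial function of four inputs, amounts to a 16-entry truth table addressed by the four inputs. The following sketch models that behaviour and shows how such a table could be filled so that the LUT matches a 4-symbol ternary pattern. It is an illustration only: the class `Lut4Sketch`, the bit ordering and the pattern notation are assumptions, and this is not the match block design described in the later chapters.

```java
// Hypothetical behavioural model of a 4-input LUT as a 16-bit truth table.
class Lut4Sketch {
    final int truthTable; // bit i holds the output for input combination i

    Lut4Sketch(int truthTable) {
        this.truthTable = truthTable & 0xFFFF;
    }

    /** Evaluates the LUT; f1 is the least significant address bit, f4 the most significant. */
    boolean eval(boolean f1, boolean f2, boolean f3, boolean f4) {
        int address = (f1 ? 1 : 0) | (f2 ? 2 : 0) | (f3 ? 4 : 0) | (f4 ? 8 : 0);
        return ((truthTable >>> address) & 1) == 1;
    }

    /** Builds a LUT that outputs 1 only for inputs matching a pattern such as "1x0x" (written F4..F1). */
    static Lut4Sketch matcher(String pattern) {
        if (!pattern.matches("[01x]{4}")) {
            throw new IllegalArgumentException("pattern must be 4 symbols of 0, 1 or x");
        }
        int table = 0;
        for (int address = 0; address < 16; address++) {
            boolean match = true;
            for (int bit = 0; bit < 4; bit++) {
                char c = pattern.charAt(3 - bit);      // rightmost symbol belongs to input F1
                int input = (address >>> bit) & 1;
                if (c != 'x' && (c - '0') != input) {
                    match = false;
                }
            }
            if (match) {
                table |= 1 << address;
            }
        }
        return new Lut4Sketch(table);
    }

    public static void main(String[] args) {
        Lut4Sketch lut = matcher("1x0x");
        System.out.println(Integer.toHexString(lut.truthTable));
        System.out.println(lut.eval(true, false, true, true));
    }
}
```

A don't care in the pattern simply doubles the number of truth-table entries that are set, which is why a single LUT can compare four ternary symbols at once.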

[Figure 3-2: Virtex CLB. Slice 0 and slice 1 each contain a G-LUT (inputs G1 to G4) and an F-LUT (inputs F1 to F4), each followed by carry logic and a storage element (D, Q), with inputs BX and BY, outputs X, XQ, Y and YQ, and carry signals C in and C out.]

The storage elements in the Virtex slice can be configured either as edge-triggered D flip-flops or as level-sensitive latches. The carry logic, shown in figure 3-2, can be used for fast arithmetic functions, but also for cascading LUTs to implement wide logic functions. The Virtex 1000 has an array of 64 x 96 CLBs.

3.1.3 Tri-State Buffers

Each Virtex CLB contains two 3-state buffers (TBUFs) that can drive on-chip buses. These on-chip buses are provided by horizontal routing resources, and four bus lines are provided per CLB row, as shown in figure 3-3.

[Figure 3-3: TBUFs connected to dedicated horizontal buses.]

The TBUFs on the Virtex are not true 3-state buffers; instead they are implemented using a logic circuit that emulates the behaviour of a true 3-state buffer. This way, several TBUFs may drive a line simultaneously without the device getting damaged. When at least one TBUF drives a 0 on a line, the logic value of that line becomes 0, no matter what the output values of the other TBUFs are.

3.1.4 Block RAM

The Virtex contains 32 Block RAMs, organized in two columns, one along each vertical edge of the chip. Each such memory cell is a dual-ported 4096-bit RAM with independent control signals for each port, as illustrated in figure 3-4. The data widths of the two ports can be configured independently according to the table in the figure.

[Figure 3-4: Dual-Port BlockRAM with possible port configurations. Port A signals: WEA, ENA, RSTA, CLKA, ADDRA<#:0>, DIA<#:0>, DOA<#:0>. Port B signals: WEB, ENB, RSTB, CLKB, ADDRB<#:0>, DIB<#:0>, DOB<#:0>. A table lists the width and depth of the possible port configurations.]

3.1.5 General Routing

Apart from dedicated routing, such as carry and 3-state lines, there is also general routing that is used to interconnect the CLBs. The general routing uses two kinds of wires: singles and hexes. Singles that start at a certain CLB terminate at an adjacent CLB, while hexes terminate at CLBs 6 positions over. Singles should be used to transport data between local CLBs, whereas hexes should be used to transport data to non-local CLBs. The singles and hexes are each grouped into buses that extend in four primary directions: north, east, south and west. The connections to neighbouring CLBs are straightforward. A north single connects directly to a south single in the CLB above it. A hex west wire connects directly to a hex east wire on the 6th CLB over. Switch boxes are used to connect lines together. A schematic view of a CLB, with the different wires and switch boxes, is given in figure 3-5.

[Figure 3-5: Schematic view of general routing of a Virtex CLB. Labels: CLB, Main Switch Box, Single Switch Box, Hex Switch Box, Single North/East/South/West (24), Hex North/East/South/West (12).]

The main switch box allows the singles and the hexes to be connected to each other and to the CLB. Some hex wires can only drive data into the CLB; these are unidirectional in. Some hex wires can only drive data out of the CLB; these are unidirectional out. Other hex wires can drive data both in and out; these are bidirectional. Circuits should, however, drive data on the bidirectional lines in only one direction, not both, since driving both leads to contention, which can damage or destroy the device.

3.2 Virtex Configuration

As mentioned before, the FPGA is programmed by means of a programming bitstream. This bitstream is generated by a Xilinx tool called BitGen, which is run after place and route. This paragraph gives some information on the Virtex configuration and the programming bitstream. A bit-level description of the bitstream will not be given; instead its general format will be discussed, together with the possibility of partial reconfiguration.

22 Chapter 3: Target Technology Configuration Memory The configuration of an FPGA is stored in the configuration memory. This memory controls the switch boxes that are used to connect routes, and multiplexers to connect internal resources in the slices. Normally, this configuration memory is only written once during configuration and is not used explicitly used by the application. The power of using JBits is that not only the regular logic is available during operation, but also the configuration logic since one has access to the configuration memory. The internal configuration memory is partitioned into segments, called frames. The number and size of frames varies with device size. The Virtex 1000, that is used in this application, has 4909 frames of 1248 bits each Programming Bitstream The programming bitstream consists of a series of packets, where each packet consists of a packet header and data. Some packets are used for a special purpose, such as checksum checking or sending options to the FPGA. Other packets are used to write to the configuration memory. This kind of packet has a header that contains a frame address and each frame can be addressed separately this way. This is a novel way to access the configuration memory, and different from older FPGA configuration methods where the configuration of an FPGA component was spread over the entire bitstream. In the initial bitstream generated by BitGen, every frame is written to initialize the whole configuration memory. After this it is possible to only reconfigure those frames, that actually have changed. For example: if an entry is added to the CAM, only the frames that are responsible for that entry could be written. This is called partial reconfiguration. The advantage of that approach is that the number of bits that need to be sent to the FPGA is a lot smaller. To give an idea about the time it takes reconfigure the whole FPGA, one can calculate the time it takes to write to the configuration memory. 
The time for configuration is equal to:

$$t_{\text{configuration}} = \frac{\text{number of frames} \times \text{bits per frame}}{f_{\text{configuration}}} \quad [\text{s}]$$

The Virtex is configured via an 8-bit bus at a clock frequency of 50 MHz, so $f_{\text{configuration}}$ = 400 Mbits/s. The time it takes to reconfigure the entire FPGA is then equal to:

$$t_{\text{configuration}} = \frac{4909 \times 1248}{400 \times 10^{6}}\ \text{s} \approx 15.3\ \text{ms}$$

The actual programming time is slightly longer, due to handshaking and special purpose packets. Due to the high number of frames, the time for configuration is decreased significantly when doing partial reconfiguration. The board that is available for this application does not support partial reconfiguration, in the sense that the device driver of the board does not allow partial bitstreams to be sent to the FPGA. More about Virtex configuration can be found in [12].
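The calculation above can be reproduced with a few lines of Java. This is only an illustrative sketch: the class and method names are made up for this example, and the frame count used for the partial update (48) is a hypothetical value, not a figure from this report.

```java
// Sketch: configuration time = frames * bits per frame / configuration bandwidth.
final class ConfigTime {
    /** Time in seconds to write `frames` frames of `bitsPerFrame` bits at `bitsPerSecond`. */
    static double seconds(long frames, long bitsPerFrame, double bitsPerSecond) {
        return (double) (frames * bitsPerFrame) / bitsPerSecond;
    }

    public static void main(String[] args) {
        double bandwidth = 8 * 50e6;                   // 8-bit bus at 50 MHz = 400 Mbit/s
        double full = seconds(4909, 1248, bandwidth);  // whole XCV1000, figures from the text
        double partial = seconds(48, 1248, bandwidth); // hypothetical 48-frame partial update
        System.out.printf("full: %.1f ms, partial: %.3f ms%n", full * 1e3, partial * 1e3);
    }
}
```

Because the time is linear in the number of frames written, any update that touches only a small fraction of the 4909 frames is shorter by the same fraction, ignoring handshaking overhead.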

Chapter 4: JBits

4.1 Introduction

JBits is a set of Java classes which provide an Application Program Interface (API) into the Xilinx FPGA bitstream [13]. This interface operates either on bitstreams generated by design tools, or on bitstreams read back from actual hardware. This provides the capability of designing and modifying the logic on an FPGA, including modifying it dynamically. JBits makes it possible to manually place, route and reconfigure the FPGA on a CLB level with relatively simple commands. This makes it very suitable for dynamically reconfiguring regular structures, such as the CAM.

4.2 JBits Programming Model

The diagram in figure 4-1 illustrates the essential steps involved in the development of a JBits application.

[Figure 4-1: JBits programming model. Steps: Create JBits Object, Read Bitstream, Modify Bitstream, Write Bitstream.]

Constructor

A JBits object must be constructed before anything can be done. The constructor is very simple and takes a single parameter, the device type. This constructor builds the device model for the selected part and performs various initializations. The prototype for the constructor is:

    JBits(int devicetype);

For example,

    JBits jbits = new JBits(Devices.XCV1000);

This builds the device model for the Virtex device XCV1000.

Reading the bitstream

This method takes a single parameter, a string containing the name of the bitstream file to be read. It loads the bitstream into the constructed JBits object and maps the bitstream data into the device model. Once a bitstream has been loaded, configuration data in the form of bits may be read and written. The method prototype for reading the bitstream is:

    void JBits.read(String infilename);

For example,

    jbits.read("infile.bit");

This reads in the bitstream file "infile.bit".

Setting a resource

This method writes the configuration data to a given FPGA resource. Examples of these resources are LUTs and CLB inputs; the configuration data can, for example, be the logical function which a LUT is set with, or a specific wire (single, hex) that is connected to a CLB input. The CLB where the resource is situated is identified by a CLB row and a CLB column. The resource in the selected CLB is then identified by a constant. These constants are defined in the Java classes containing the configurable objects. For instance, setting the configuration of the resource SLICE0 F1 input is accomplished by using the S0F1 constant in the S0F1 class, that is S0F1.S0F1. An array of integers supplying the configuration bits is passed as the final parameter for the set method. As with the resource, this data is nearly always a pre-defined constant. For instance, to set the S0F1.S0F1 input to the value of SLICE1 X output, the constant used is S0F1.S1_X. To summarize, the set() method is used to identify the CLB and the resource associated with it, and to specify the predefined constant value applicable for that resource. The method prototype for setting a resource to a value (bits) is:

    void JBits.set(int row, int column, int[][] resource, int[] bits);

For example,

    jbits.set(clbrow, clbcol, S0F1.S0F1, S0F1.S1_X);

This connects the X-output of slice 1 to the F1 input of slice 0 (see figure 3-2).

Writing the bitstream

Similar to the read bitstream method, the write bitstream method takes a single parameter, a string containing the name of the bitstream file to be written. This method writes the bitstream from the constructed JBits object into a file. The method prototype for the write bitstream is:

    int JBits.write(String outfilename);

For example,

    jbits.write("outfile.bit");

This writes the modified configuration data to the bitstream file, "outfile.bit".
In a running system, one is not always interested in writing the new bitstream to a file; instead one wants to use the bitstream to configure the FPGA directly from memory. To do this, the following command can be used:

    byte[] jbits.getallpackets();

This method returns all packets contained in the bitstream, and these can be sent directly to the FPGA.

Getting the resource configuration

The get() method is used to read the configuration of a given resource in a CLB. The resource is identified using the same convention mentioned in the set() method. For the most part, the data obtained using the get() method may be interpreted and used by other portions of a JBits application. The method prototype for get() is:

    int[] JBits.get(int row, int column, int[][] resource);

For example,

    int[] Value = jbits.get(clbrow, clbcol, S0F1.S0F1);

This returns the value set for the resource S0F1.S0F1, that is the wire connected to the SLICE0 F1 input.

The code pieces mentioned before are assembled below into a simple JBits application. It essentially sets the F1 input of the SLICE0 F LUT in the CLB in row 5, column 4 to the SLICE1 X output.

    // Build the device model for the XCV1000.
    JBits jbits = new JBits(Devices.XCV1000);
    // Load the bitstream produced by the design tools.
    jbits.read("infile.bit");
    // Connect the SLICE1 X output to the SLICE0 F1 input of the CLB at row 5, column 4.
    jbits.set(5, 4, S0F1.S0F1, S0F1.S1_X);
    // Write the modified bitstream back to a file.
    jbits.write("outfile.bit");

Chapter 5: CAM Structures

This chapter starts with a general overview of CAMs, including the definition of explicit priority. Then two different CAM structures are discussed, together with the way they can be mapped onto the Virtex architecture. In the first structure, the same amount of area is reserved for all entries. This will be referred to as the fixed length CAM. The second structure shows much similarity with the fixed length CAM, but the area that each entry occupies is variable and depends on the number of don't cares. This structure is called the variable length CAM. Both these implementations utilize dynamic reconfiguration for updating their content. Other CAM structures that do not use dynamic reconfiguration, but are suitable for implementation on FPGA, can be found in [14]-[17]. Next the structure of an explicit priority encoder is described, which can be integrated with either of the earlier described CAM implementations. The chapter ends with an estimate of the hardware resources that are consumed by each of these structures.

5.1 RAM-based versus CAM-based Look Up

There are a number of algorithms used to perform the look up function using standard Random Access Memory (RAM):

1. A RAM can perform the look up in a single cycle if the data being searched (i.e. the information from the packet header) is used as a direct index into memory. In this case the size of the RAM is determined by the size of the search field. The number of words stored in a RAM has no effect on this size and cost. Thus, if there were only 256 words, each with a 16-bit search field, the RAM must still have 64K words. The size and cost of the RAM when used with a direct index grows exponentially with the width of the search field. Since the size of the search field in IP characterization is 315 bits, the practical limit of an economic RAM-based look up function is exceeded.

2. A linear search is the most memory-efficient algorithm for table look up, requiring only one entry per active address. If the entries in the routing table are searched in order of highest priority first, then the first match will be the best match. Of course, the linear search runs in time O(N), where N is the number of entries, and so can take considerable time.

3. A faster approach is to form a tree search, using a binary tree or a Patricia tree. In general, these trees can push the search time towards log(N), where the log base is 2. However, since the search fields for IP characterization are longer than the number of entries that needs to be stored, the worst case number of cycles needed for matching is as large as the number of entries. This search time can therefore still be excessive, and tree search algorithms require complex controlling hardware.

4. Under good conditions, a hash function can execute the look up function in constant time, only slightly slower than direct access. The worst case search time, however, can be considerably worse. The performance is a function of the size of the hash memory and the number of addresses that must be searched in a given time window (after which a hashed entry will be timed out). While the number of stored entries might be relatively small, the number of addresses that might potentially be searched is large. This number depends on packet traffic patterns, which can be hard to predict. Therefore, the amount of memory might be unacceptably large.
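The first two options in the list above can be made concrete with a small sketch. It computes the number of RAM words a direct index needs for a given search field width, and performs a linear search over a table stored highest priority first. The class and method names are illustrative assumptions, and exact-match keys are used for brevity (don't cares are not modelled here).

```java
import java.math.BigInteger;

// Sketch: cost of a direct-index RAM versus a priority-ordered linear search.
final class RamLookup {
    /** Number of RAM words a direct index needs for a search field of the given width. */
    static BigInteger directIndexWords(int searchBits) {
        return BigInteger.ONE.shiftLeft(searchBits);   // 2^searchBits
    }

    /** Entries are stored highest priority first, so the first hit is the best match. */
    static int linearSearch(long[] entries, long key) {
        for (int i = 0; i < entries.length; i++) {
            if (entries[i] == key) return i;
        }
        return -1;   // no entry matches
    }

    public static void main(String[] args) {
        System.out.println(directIndexWords(16));               // prints 65536
        System.out.println(directIndexWords(315).bitLength());  // prints 316
        System.out.println(linearSearch(new long[] {7, 3, 7}, 3)); // prints 1
    }
}
```

A 16-bit field already needs 64K words regardless of how many are in use, while a 315-bit field needs a word count that is itself a 316-bit number, which illustrates why the direct index is ruled out.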

The RAM-based look up algorithms described above either consume too many hardware resources, or are too slow to fulfil current speed requirements. At higher speeds it is necessary to go with the faster and well-bounded search time of a CAM. CAM look up solutions can offer superior performance compared to even the most sophisticated RAM-based search algorithms [18].

5.2 Different CAM Types

Binary versus Ternary CAMs

A binary CAM stores only one of two states ('0' and '1') in each memory location (i.e. in each bit of a word), whereas a ternary CAM stores one of three states in each memory location. These three states are represented by '0', '1' and 'X'. Ternary CAMs may have a global mask as well. This allows the search pattern (i.e. the bit vector that is used as an input of the CAM) to contain X's too. This is especially useful when the width of the search pattern is small, such that two or more entries can be stored in the same CAM location.

Return value

The entries in a CAM have two parts. The most important part is the search field, which is the part of the entry that is matched with the search pattern. The CAM entries also contain a return field, which is the information returned during a read. This contains either related information or an index. In some cases, one is able to write not only to the search field, but also to the return field, so that the return value can be programmed per entry.

Prioritizing

Since the entries stored in the CAM may contain don't cares, there is a possibility that two or more entries give a match at the same time. As mentioned in chapter 1, the entries should be prioritized, and the address of the entry with the highest priority should be returned to resolve this. Two priority schemes are possible [18]:

1. Inherent priority: inherent priority exploits the CAM's predictable ordering when reading multiple matched data. In this case, the system stores the entries in order of priority. By using a priority encoder, the top address of the CAM has the highest priority (0) and the bottom address has the lowest priority (127).

2. Explicit priority: the inherent priority can be replaced with an explicit priority field added to each CAM word. In case of a multiple match, the entry with the highest explicit priority as stored in the priority field is returned.

The advantage of explicit priority is that updating the CAM becomes easier, since a new entry can always be added at the end. When using inherent prioritizing, new entries are not always added at the end of the CAM; an address has to be reserved by shifting down other entries and updating the memory that is addressed by the CAM.
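The difference between the two priority schemes can be sketched in software. In the model below each ternary entry is a value plus a care mask (a 0 bit in the mask stands for X). The names are illustrative, entries are limited to 64 bits for brevity, and it is an assumption of this sketch that a smaller number in the priority field means a higher priority, in line with address 0 being the highest inherent priority.

```java
// Sketch: ternary matching with inherent versus explicit priority.
final class TernaryCam {
    final long[] value;     // bit pattern of each entry
    final long[] care;      // 1 = bit must match, 0 = don't care (X)
    final int[] priority;   // explicit priority field, smaller = higher (assumption)

    TernaryCam(long[] value, long[] care, int[] priority) {
        this.value = value.clone();
        this.care = care.clone();
        this.priority = priority.clone();
    }

    boolean matches(int entry, long key) {
        return ((key ^ value[entry]) & care[entry]) == 0;
    }

    /** Inherent priority: the lowest matching address wins. */
    int lookupInherent(long key) {
        for (int i = 0; i < value.length; i++) {
            if (matches(i, key)) return i;
        }
        return -1;
    }

    /** Explicit priority: the matching entry with the best priority field wins. */
    int lookupExplicit(long key) {
        int best = -1;
        for (int i = 0; i < value.length; i++) {
            if (matches(i, key) && (best < 0 || priority[i] < priority[best])) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        // Entry 0 = 1010 (exact), entry 1 = 1XXX; entry 1 has the better explicit priority.
        TernaryCam cam = new TernaryCam(
            new long[] {0b1010, 0b1000}, new long[] {0b1111, 0b1000}, new int[] {5, 1});
        System.out.println(cam.lookupInherent(0b1010));  // prints 0
        System.out.println(cam.lookupExplicit(0b1010));  // prints 1
    }
}
```

With explicit priority the storage address no longer matters, so a new entry can be appended without shifting the existing ones.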

5.3 Fixed Length CAM

Global structure

The global structure of the fixed length CAM is given in figure 5-1. It is a PLA structure, consisting of match lines and an encoder. All match lines together are contained in a match field. The data that is coming in (Indata) is matched with the CAM words that are stored in the match lines. If Indata matches the word stored in a match line, the output of that line becomes 1, else it becomes 0. The encoder is used to translate the outputs of the match lines to the address of the line that gave a match. This can be either an inherent or an explicit priority encoder.

[Figure 5-1: Global structure of the fixed length CAM. Indata (n bits) feeds the match lines of the match field; their outputs feed the encoder, which produces Address and Match.]

Match Line

Due to the limited width of the memory that is available on the board, it is not possible to match all bits at once. Therefore the match lines are divided into match blocks, separated by registers as shown in figure 5-2. Each block outputs 1 when its input is 1 and a partial match occurs. The number of clock cycles needed for a complete match is then equal to the number of match blocks. The block size was chosen to be 64 bits, so that 5 clock cycles are needed for a complete match. Since the required lookup rate is 1.937 million lookups/sec, the minimal clock frequency at which the system should run is 1.937 * 5 ≈ 10 MHz, which is feasible in FPGA technology. A block size of 64 bits means that the CAM reads from two memory banks simultaneously. The block size cannot be chosen much higher, due to the limited number of memory banks (one of the four banks is already used to write the result to).
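As a rough software model of the match line described above, the sketch below matches a 320-bit word as five 64-bit blocks, one block per clock cycle, with a boolean standing in for the register between blocks. The names are illustrative; the value/care representation of a ternary block and the lookup rate of 1.937 million per second used in the example are assumptions of this sketch.

```java
// Sketch: a match line that consumes one 64-bit block per clock cycle.
final class MatchLine {
    static final int BLOCKS = 5;   // 320-bit word split into 64-bit blocks

    /** care bit 0 = don't care. `partial` plays the role of the register between blocks. */
    static boolean match(long[] value, long[] care, long[] indata) {
        boolean partial = true;    // the input of the first block is 1
        for (int cycle = 0; cycle < BLOCKS; cycle++) {
            boolean blockHit = ((indata[cycle] ^ value[cycle]) & care[cycle]) == 0;
            partial = partial && blockHit;   // 1 only if the input is 1 and the block matches
        }
        return partial;
    }

    /** Minimal clock frequency: one lookup needs BLOCKS clock cycles. */
    static double minClockHz(double lookupsPerSecond) {
        return lookupsPerSecond * BLOCKS;
    }

    public static void main(String[] args) {
        long[] value = {1, 2, 3, 4, 5};
        long[] care = {-1L, -1L, 0L, -1L, -1L};           // third block is all don't cares
        long[] indata = {1, 2, 99, 4, 5};
        System.out.println(match(value, care, indata));    // prints true
        System.out.println(minClockHz(1.937e6) / 1e6 + " MHz");
    }
}
```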

[Figure 5-2: Schematic view of a match line, divided into blocks and registers. Indata<63:0> feeds every block; the first block has input 1 and the last block produces Match.]

A smaller block size causes the fanout of signal Indata to increase, which negatively influences the speed. The match blocks can be mapped on the Virtex FPGA using LUTs and carry logic. This is shown in figure 5-3.

[Figure 5-3: 64-bit match block implemented using LUTs and carry logic. Indata[63:56], Indata[55:48], ..., Indata[7:0] feed pairs of LUTs that are chained from C in to C out.]

The LUTs are configured in such a way that they output 1 when the corresponding bits on their inputs match, else they output 0. Initially the carry is equal to C in, and going from bit 63 to 0 the carry chain will propagate this signal as long as the LUTs output 1. If all the LUTs in a particular match block output 1, then C out will be equal to C in, else the block will output 0.

5.4 Variable Length CAM

The variable length CAM is a CAM where the stored entries have variable length, depending on the number of don't cares they contain. The reason for implementing this kind of CAM is that the number of don't cares is quite large in general. This has two reasons:

1. The total size of the header fields that are used for matching is 315 bits. This does not mean that all of these fields are used for matching an entry. An example is filtering packets from a certain host. In this case only the source address of the forbidden IP packets needs to be matched. Another example is counting the number of IP version x packets that arrive, which only needs four bits to be stored.

2. IP addresses are often not completely specified, meaning that the packet is to be sent to a net or subnet rather than a host. This means that the 128-bit source and destination address fields in the entries often have don't cares at the end.

To save area, don't cares are left out so that entries take less space to store. By placing these reduced entries in a smart way, more entries can be stored in the CAM.

Global Structure

The global structure of the variable length CAM with a maximum of 16 entries is given in figure 5-4. It consists of a long chain of match blocks and shift registers, separated by switches and placed into match lines of four blocks each.

[Figure 5-4: Schematic view of CAM with entries of variable length. Legend: match block, shift register, switch, multiplexer; the multiplexer outputs feed the priority encoder, which produces Address and Match.]
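The match blocks used in both CAM structures are built from LUTs and a carry chain (figure 5-3). The sketch below models one such block in software: every 4-input LUT gets a 16-bit truth table derived from four ternary bits, and the carry is propagated from bit 63 down to bit 0. The class name, the value/care representation and the order of the LUT inputs are assumptions of this sketch; the real truth tables would be written into the device through JBits.

```java
// Sketch: a 64-bit match block as 16 four-input LUTs joined by a carry chain.
final class LutMatchBlock {
    final int[] lutInit = new int[16];   // one 16-bit truth table per LUT; LUT 0 sees bits 3:0

    /** Configure the block for a 64-bit ternary pattern (care bit 0 = don't care). */
    LutMatchBlock(long value, long care) {
        for (int lut = 0; lut < 16; lut++) {
            int v = (int) (value >>> (4 * lut)) & 0xF;
            int c = (int) (care >>> (4 * lut)) & 0xF;
            for (int addr = 0; addr < 16; addr++) {
                // The LUT outputs 1 for every input combination that agrees on the care bits.
                if (((addr ^ v) & c) == 0) lutInit[lut] |= 1 << addr;
            }
        }
    }

    /** The carry survives from bit 63 down to bit 0 only while every LUT outputs 1. */
    boolean carryOut(boolean carryIn, long indata) {
        boolean carry = carryIn;
        for (int lut = 15; lut >= 0; lut--) {
            int addr = (int) (indata >>> (4 * lut)) & 0xF;
            boolean lutOut = ((lutInit[lut] >>> addr) & 1) == 1;
            carry = carry && lutOut;
        }
        return carry;
    }

    public static void main(String[] args) {
        LutMatchBlock block = new LutMatchBlock(0x1234L, 0xFFFFL);  // upper 48 bits are X
        System.out.println(block.carryOut(true, 0x1234L));   // prints true
        System.out.println(block.carryOut(true, 0x1235L));   // prints false
    }
}
```

A LUT that covers only don't cares ends up with the truth table 0xFFFF, so a block consisting entirely of don't cares simply passes C in to C out. This is why such blocks can be omitted in the variable length CAM.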

Programming the CAM is done by mapping entries on match blocks and shift registers, and placing the entries on the chain starting at match block 1. Blocks within an entry are connected together by closing a switch, which causes the carry signal of a block to be propagated to the next block. An open switch on the input of a block means that its input becomes 1 and a new entry starts. The multiplexers are used to connect the outputs of the entries to the priority encoder. Programming the CAM consists of two steps:

1. Mapping: dividing entries into 64-bit blocks separated by delays, leaving out blocks that merely contain don't cares.

2. Placing: placing the mapped entries on the actual CAM structure and connecting their outputs to the priority encoder.

Mapping

First the 320-bit entry is divided into 5 blocks of 64 bits each. In case none of the blocks is empty (i.e. none consists only of X's), no block can be removed and the entry is mapped as follows:

[Table: block numbers and the delay following each block]

Every clock cycle a block is matched, and one clock cycle later the resulting output of that match block is propagated to the next block. Now suppose that block 3 of the entry is empty, i.e. consists entirely of don't cares. If this block were simply omitted, block 4 would already be matched in clock cycle 3. To solve this, an extra delay needs to be inserted between block 2 and block 4. The entry is thus mapped on four blocks, as shown below:

[Table: block numbers and the delay following each block, with block 3 omitted]

Placing

After mapping an entry on blocks and delays, the entry can be placed by mapping blocks on match blocks and delays on shift registers of length equal to the delay. Each match line contains two multiplexers. These are used to connect the outputs of two of the four shift registers in the match line to the priority encoder.
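The mapping step can be sketched as follows. The code keeps the non-empty blocks of an entry and gives each kept block a delay equal to the distance to the next kept block, so that block 4 is still matched in clock cycle 4 when block 3 is omitted. The class names are illustrative. Two points are assumptions of this sketch rather than statements from the text: a block is treated as empty when its care mask is zero, and the last kept block always gets a delay of one cycle (the handling of empty blocks at the end of an entry is not modelled).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: map a 5-block entry onto match blocks and delays, omitting empty blocks.
final class EntryMapper {
    /** One retained block plus the delay (in clock cycles) that follows it. */
    static final class Mapped {
        final int block;   // 1-based block number within the 320-bit entry
        final int delay;   // length of the shift register placed after the block
        Mapped(int block, int delay) { this.block = block; this.delay = delay; }
    }

    /** care[i] == 0 means block i+1 holds only don't cares and is left out. */
    static List<Mapped> map(long[] care) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < care.length; i++) {
            if (care[i] != 0) kept.add(i + 1);
        }
        List<Mapped> mapped = new ArrayList<>();
        for (int k = 0; k < kept.size(); k++) {
            int current = kept.get(k);
            // Assumption: the last kept block is followed by a single-cycle delay.
            int next = (k + 1 < kept.size()) ? kept.get(k + 1) : current + 1;
            mapped.add(new Mapped(current, next - current));
        }
        return mapped;
    }

    public static void main(String[] args) {
        for (Mapped m : map(new long[] {-1L, -1L, 0L, -1L, -1L})) {
            System.out.println("block " + m.block + ", delay " + m.delay);
        }
    }
}
```

For an entry whose third block is empty this yields blocks 1, 2, 4 and 5, with the delay after block 2 doubled. Empty blocks at the beginning add no delay, and an entry without any non-empty block maps to an empty list.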

There are a few special cases that need to be looked at separately:

1. Empty entries: if a whole entry consists of don't cares, no match blocks are used to store this entry. Instead the multiplexer to which the entry would have been connected is programmed to output 1 independent of the input.

2. Beginning of entry is empty: if one or more blocks at the beginning of the entry are empty, these blocks do not cause any delay to be incremented.

3. More than two outputs in a line: since there are two multiplexers available per match line, a maximum of two entries can have their output in a line. When a new entry is added which would cause the number of outputs in a match line to become three, the output of this entry is shifted to the next line. This leads to match blocks that are not used.

5.5 Explicit Priority Encoder

As described before, there are two mechanisms for prioritizing: inherent priority and explicit priority. In the latter case, not only the words that are searched are added to the CAM, but also their priority. This way, new words are always added at the end or at empty places, and shifting other entries is not necessary.

Different implementations

To implement explicit priority, several schemes are possible. The common way to do explicit encoding is by adding an explicit priority field to each CAM word [17]. Each cycle, the system combines the search word with a different priority word. In the first cycle of the search, the system sets the priority word to the highest priority. If a match occurs, the address of the matching entry is returned, else the system combines the search word with the next highest priority. This procedure is repeated until either a match occurs, or the lowest priority is reached. The advantage of this algorithm is that it is easy to implement and no extra hardware design effort is needed. On the other hand, the algorithm is not very efficient and the matching process can take many clock cycles, depending on the number of possible priority values.

Using dynamic reconfiguration, other implementations are possible. One of these possibilities is using a regular priority encoder in combination with a switch box. This switch box routes every output of the CAM to the correct input of the priority encoder, and the configuration of the switch box is controlled by JBits. Although this method is efficient in time, it would consume too much hardware for the CAM size at hand. This problem can be solved by reducing the number of priority classes. The number of priority classes is defined as the number of explicit priority values that a search word can have. In case of an inherent priority encoder, this value is equal to the number of entries. By reducing this number, the amount of hardware is reduced, but there is a risk that more priority classes than available are needed for a certain CAM configuration. To solve this, a combined explicit/inherent priority encoder is proposed, where the priority can be set explicitly for each entry, but in case two entries have the same explicit priority, their priority is determined inherently. This way, entries can be added even when all priority classes have been used, just as with the inherent priority encoder.

Global structure

The global structure of the n-to-log2(n) explicit priority encoder with eight priority classes is given in figure 5-5. Estimating the number of priority classes that is needed is difficult and again requires information about the actual content of the CAM. A number of eight has been chosen, since this can be mapped efficiently on the Virtex architecture, as will be demonstrated.

The priority encoder consists of two basic blocks. First there is a priority decoder. The input of this block comes from the match lines and contains 1's at all matching positions. The output is a bit vector with 1's only at those positions that match and have the highest priority. In case different priority classes have been used for all overlapping entries (i.e. entries that may match simultaneously), the output of the priority decoder contains no more than one 1. The output of the priority decoder is connected to a regular n-to-log2(n) priority encoder, which is needed to decide the return value when two or more overlapping entries with the same explicit priority match simultaneously.

The priority decoder consists of three parts, as shown in figure 5-5, and works as follows. The n-to-8 switch box connects each of the n input lines to one of 8 priority lines. These lines, each representing a priority class, are connected to an 8-to-3 priority encoder, which checks whether there is a match and, if there is, extracts the value of the highest priority of all the input lines that match. The output decoder propagates the value of each input line only if the priority that this line is set with corresponds to the highest priority; otherwise 0 is propagated at this bit position.

[Figure 5-5: Global structure of an n-to-log2(n) explicit priority encoder with eight priority classes. Input <n-1:0> feeds the priority decoder (n-to-8 switch box, 8-to-3 priority encoder, output decoder); its n outputs feed the n-to-log2(n) priority encoder, which produces Match and Output <log2(n)-1:0>.]

To program the explicit priority encoder, the switch box and the output decoder need to be programmed using JBits. The implementation of these blocks is discussed next.
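The data flow through the explicit priority encoder can be followed in a small behavioural model. Each stage of figure 5-5 appears as one step in the method below. The names are illustrative, and it is an assumption of this sketch that class 0 is the highest priority class and that the regular priority encoder favours the lowest input index.

```java
// Sketch: behavioural model of the combined explicit/inherent priority encoder.
final class ExplicitPriorityEncoder {
    static final int CLASSES = 8;
    final int[] classOf;   // priority class programmed per input line; 0 = highest (assumption)

    ExplicitPriorityEncoder(int[] classOf) {
        this.classOf = classOf.clone();
    }

    /** Returns the index of the winning input line, or -1 when no line matches. */
    int encode(boolean[] match) {
        // n-to-8 switch box: a priority line is active if any matching input is routed to it.
        boolean[] line = new boolean[CLASSES];
        for (int i = 0; i < match.length; i++) {
            if (match[i]) line[classOf[i]] = true;
        }
        // 8-to-3 priority encoder: find the highest priority class that is active.
        int top = -1;
        for (int c = 0; c < CLASSES && top < 0; c++) {
            if (line[c]) top = c;
        }
        if (top < 0) return -1;
        // Output decoder: only matches in the top class are passed on.
        // Regular priority encoder: among those, the lowest index wins (inherent priority).
        for (int i = 0; i < match.length; i++) {
            if (match[i] && classOf[i] == top) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        ExplicitPriorityEncoder enc = new ExplicitPriorityEncoder(new int[] {3, 1, 1, 0});
        System.out.println(enc.encode(new boolean[] {true, true, true, false}));  // prints 1
        System.out.println(enc.encode(new boolean[] {true, false, false, true})); // prints 3
    }
}
```

In the first call inputs 1 and 2 share class 1, so the tie is resolved inherently in favour of input 1. This is the case the combined explicit/inherent scheme is meant to cover.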

Switch box

The switch box connects the input lines to the eight priority lines and is built out of several smaller 8-to-8 switch boxes. In figure 5-6 an example with two of these 8-to-8 switch boxes is shown, thus implementing a 16-to-8 switch box. The eight priority lines have been implemented using the tri-state lines that are available on Virtex. This way, several input lines may drive a certain priority line at the same time without causing damage. The tri-state lines are connected to a pull-up, meaning that the priority lines are high when nothing is driving them. In case an entry matches, the priority line to which the matching entry is connected is driven low. This way a wide NOR function is created without using many hardware resources. Since a selected priority line is driven low, the input lines of the 8-to-3 priority encoder are active low.

[Figure 5-6: Schematic view of a 16-to-8 switch box, connected to eight priority lines. Input<15:0> is split over two 8-to-8 switch boxes that drive the shared priority lines.]

[Figure 5-7: Schematic view of the multiplexer, connecting one or more switch box input lines to priority line k. Switch box input <7:0> feeds two LUTs cascaded with carry logic; the output controls the TBUF that drives the priority line with 0.]

To connect an input line to a certain priority line, the output lines of the 8-to-8 switch boxes each have an 8-input multiplexer at their input. This multiplexer can select one or more switch box inputs to propagate their value to the output line. The multiplexer is implemented using two LUTs, cascaded with carry logic. A schematic view of this configuration is given in figure 5-7. If a certain input line is to be connected to a priority line, the LUT to which the input line is connected should output 0 if this input line is 1. In that case, a 0 is propagated by the carry chain to the output that controls the TBUF.

Output decoder

The output decoder consists merely of LUTs, one for each input bit of the explicit priority encoder. Every LUT has one of these input bits (to check for a match) and the three bits from the 8-to-3 priority encoder (the highest priority of all matching entries) connected to its inputs.

Suppose that a certain LUT in the output decoder is connected to input k. Then this LUT is to be configured such that it outputs 1 if the entry to which input k is connected gives a match and the priority of this entry is equal to the highest priority of all matching entries at that moment.

Method to program the return value

Although not implemented, it is worth mentioning a method to program not only the priority, but also the return value for each entry. Suppose that the inherent priority encoder that is part of the explicit encoder is omitted, and a maximum of eight overlapping entries is allowed. If these overlapping entries are each assigned a different explicit priority, then only one of the output bits of the output decoder can become 1 at a time. These output bits can then be used to drive a number of tri-state lines, equal to the width of the return value of the CAM. This is shown in figure 5-8 for a return value that is two bits wide. Here there are four input bits, coming from the output decoder. Depending on which input bit is high, the tri-state lines are driven with a different value. The logical value on the input port of each TBUF decides what value is returned when a certain input bit becomes 1, and this can be programmed using JBits.

[Figure 5-8: Schematic view of an encoder, driving a different value on the output for each input bit. Inputs input <0> to input <3> enable TBUFs that drive output <0> and output <1>.]

This implementation would be significantly smaller, since the TBUFs do not consume any slice logic and the inherent priority encoder is omitted. An implementation of the priority encoder where overlapping entries can have the same explicit priority and where the output can be programmed for each entry uses the on-board BlockRAM. The index returned by the regular explicit priority encoder is then used as a memory address to look up the return value.
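The TBUF-based return value scheme can be modelled in a few lines. In the sketch below each output bit of the output decoder has a programmed return value, and the single active bit selects which value appears on the tri-state lines. The class name is illustrative. Reporting an empty result when no input is active, and rejecting inputs that are not one-hot, are choices made for this sketch; the text does not specify what the lines carry in those cases.

```java
import java.util.OptionalInt;

// Sketch: one-hot decoder outputs selecting a programmable return value.
final class ReturnValueEncoder {
    final int[] programmed;   // value presented at the TBUF inputs for each decoder output

    ReturnValueEncoder(int[] programmed) {
        this.programmed = programmed.clone();
    }

    /** At most one input may be high; an empty result means no TBUF drives the lines. */
    OptionalInt read(boolean[] oneHot) {
        OptionalInt result = OptionalInt.empty();
        for (int i = 0; i < oneHot.length; i++) {
            if (!oneHot[i]) continue;
            if (result.isPresent()) throw new IllegalStateException("inputs are not one-hot");
            result = OptionalInt.of(programmed[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Four inputs and a two-bit return value, as in figure 5-8.
        ReturnValueEncoder enc = new ReturnValueEncoder(new int[] {2, 0, 3, 1});
        System.out.println(enc.read(new boolean[] {false, false, true, false}).getAsInt()); // prints 3
    }
}
```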

5.6 Device Utilization

Fixed Length CAM

Each Virtex slice is able to match 8 bits. The fixed length CAM consists of 128 match lines, each containing 5 blocks matching 64 bits each. The number of slices taken by the match field is then equal to 128 x 5 x 8 = 5120 slices. Since the total number of slices is 12288, 42% of the slices are consumed, not counting the encoder. The flip-flops on the output of each match block have been neglected, because these consume few resources and can be combined with other logic in the same slice.

Variable length CAM

The number of match lines, blocks and multiplexers were chosen from a design point of view, without taking into account what the actual format of the entries is. In case there are a lot of entries containing only one block after mapping, the number of multiplexers should be increased to minimize the number of unused blocks. Another problem is choosing the size of the priority encoder. If there are many long expressions, there are a lot of inputs that are not used, and it is necessary to have a large priority encoder or longer match lines. If there are many short expressions, a smaller encoder can be used. From this it becomes clear that knowledge about the actual content of the CAM is necessary to implement it in an efficient way, and this knowledge is not available for IP version 6 yet.

The number of match lines has been chosen to be 128, and with two multiplexers per line the CAM can therefore store up to 256 entries. The variable length CAM has 128 match lines of 4 match blocks each. In the worst case, all entries that are stored have 5 blocks after mapping and only 128 x 4/5 = 102 entries can be stored, but this is very unlikely. Each Virtex slice is able to match 8 bits, so 8 slices are needed per match block. The CAM consists of 128 x 4 = 512 blocks, so a total of 512 x 8 = 4096 Virtex slices (33%) are consumed. Each shift register consumes one slice, since it is not possible to use the second slice for something else. There are 512 shift registers in the design, meaning a utilization of 4%. The two multiplexers that each match line has to select the output also have to be mapped in separate slices, as explained elsewhere in this report. These multiplexers therefore consume 2 x 128 = 256 slices (2%). The total slice utilization of the variable length CAM without the encoder is then 39%.

Inherent Priority Encoder

The device utilization of the inherent priority encoder depends on the CAM structure it is used with. The fixed length CAM requires a 128-to-7 priority encoder, while the variable length CAM requires an encoder that is twice as wide. The utilization by the inherent priority encoder has been determined by synthesizing both sizes and examining the mapping report. The results are:

128-to-7 bits: 179 slices (1%).
256-to-8 bits: 709 slices (5%).

Explicit priority encoder

The device utilization of the explicit priority encoder has also been estimated for use with both the fixed and the variable length CAM. The fixed length CAM has 128 outputs, i.e. a 128-to-8 explicit priority encoder is needed. The 128-to-8 switch box is divided into 16 8-to-8 switch boxes, each containing eight 8-input multiplexers that use 1 slice each. The total number of slices for the switch box is then 128 (1%). The output decoder consumes 128 LUTs. Normally two LUTs can be placed in one slice, but it is not possible to constrain a LUT to a certain position within a slice. In the case of a carry chain, the order of the LUTs is set implicitly by the direction of the carry chain. However, there is no carry chain here, and it is therefore necessary to place the LUTs in separate slices. The output decoder therefore takes 128 slices (1%). The utilization by the inherent priority encoder has been determined above and is equal to 171 slices (1%). The total utilization by the 128-bit explicit priority encoder is then 3%, where the slices consumed by the 8-to-3 priority encoder have been neglected. Repeating this calculation for the 256-bit explicit priority encoder leads to a utilization of 9%.

Summary

Table 5-1 gives an estimation of the device utilization for the fixed and variable length CAM combined with either the inherent or the explicit priority encoder.

                       Inherent priority   Explicit priority
Fixed length CAM       43 %                45 %
Variable length CAM    44 %                48 %

Table 5-1: Device utilization for the fixed and variable length CAM with either an inherent or an explicit priority encoder.

From this table it follows that the hardware utilization of the fixed length CAM and that of the variable length CAM are about equal for these dimensions. Furthermore, it follows that the hardware cost of using an explicit priority encoder instead of inherent priority is small, which makes the explicit encoder an interesting option.
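The slice counts of the two match fields in this section can be re-derived with the short calculation below, which may be useful when trying other dimensions. The class and method names are illustrative. The encoder contributions are left out, since they were obtained from synthesis rather than from a formula.

```java
// Sketch: slice utilization of the match fields, without the priority encoder.
final class Utilization {
    static final int TOTAL_SLICES = 12288;   // total number of slices stated in the text
    static final int SLICES_PER_BLOCK = 8;   // 64 bits per block, 8 bits per slice

    static int fixedLengthSlices(int lines, int blocksPerLine) {
        return lines * blocksPerLine * SLICES_PER_BLOCK;
    }

    static int variableLengthSlices(int lines, int blocksPerLine, int muxesPerLine) {
        int blocks = lines * blocksPerLine;
        return blocks * SLICES_PER_BLOCK   // match blocks
             + blocks                      // one shift register (one slice) per block
             + lines * muxesPerLine;       // output multiplexers, one slice each
    }

    /** Worst case: every stored entry needs all five blocks. */
    static int worstCaseEntries(int lines, int blocksPerLine) {
        return lines * blocksPerLine / 5;
    }

    static double percent(int slices) {
        return 100.0 * slices / TOTAL_SLICES;
    }

    public static void main(String[] args) {
        System.out.println(fixedLengthSlices(128, 5));        // prints 5120
        System.out.println(variableLengthSlices(128, 4, 2));  // prints 4864
        System.out.println(worstCaseEntries(128, 4));         // prints 102
    }
}
```

The exact share of the variable length match field is 4864 / 12288, about 39.6%. The 39% quoted above is the sum of the individually rounded contributions (33% + 4% + 2%).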

Chapter 6: Implementation of the Board Interface

The board interface takes care of the communication between the FPGA, the host and the onboard memory and is situated on the FPGA. This chapter contains a description of the board interface and how it has been implemented, together with simulation results.

6.1 Port Description

In section 2.2 a general overview of the board was given, together with a description of the ports that are available for communication between host, memory and FPGA. To use the board as part of the CAM application, these ports were each assigned a function. These functions are summarized in table 6-1, together with the direction of the signals viewed from the FPGA side.

Port       Direction   Function          No. of bits
clk        In          System clock      1
Reset      In          Reset CAM         1
Ctrl_Reg   In          Start Matching    8
Stat_Reg   Out         Matching Ready    8
Data 0     In          Indata [31:0]     32
Data 1     In          Indata [63:32]    32
Data 2     Out         Match, Address    32

Table 6-1: Function assignments for FPGA ports

Besides these ports, other signals are necessary for controlling the memory, the control register and the status register. The respective ports are given in appendix A. A detailed description of these signals, together with their pin locations, can be found in [8]. A schematic view of the board interface and its connections to the parts of the system that are not inside the FPGA is also given in appendix A.

6.2 VHDL Description

The VHDL description of the board interface is given in appendix B. The board interface is a finite state machine (FSM) that repeatedly reads data from memory to the CAM and writes the result from the CAM to memory. The behaviour of the board interface has been simulated and the result is given in figure 6-1. First a reset is applied, which initializes the control signals and brings the FSM into state IDLE. Then the value 1 is written to the control register, which is interpreted as a start command.
The board interface sends a memory request for bank 0, bank 1 and bank 2 to the onboard memory arbiter and waits until all banks have been granted. Next, the board interface starts reading from bank 0 and bank 1. After 7 clock cycles (5 for processing by the CAM and 2 for latency due to the registers before and after the CAM) the result is written to bank 2. From this moment Indata is read every clock cycle and the result is written every 5 clock cycles until Addr0 is greater than Buffer_Size. For debugging purposes, LEDs are turned on and off depending on the state of the FSM.

Figure 6-1: Simulation results for the board interface (annotated events: reset, start, requests memory, memory granted, starts reading, starts writing)
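The timing described above can be sketched as a small cycle-count model. This is an illustration under simplifying assumptions, not a model of the actual FSM: cycles are counted from the first read, memory arbitration time is ignored, and search words are assumed to arrive back to back. The names used are hypothetical.

```python
CAM_CYCLES = 5        # clock cycles the CAM needs to process one search word
REGISTER_LATENCY = 2  # register stages before and after the CAM

def result_write_cycles(num_words):
    """Cycle numbers (counted from the first read) at which results are written."""
    first = CAM_CYCLES + REGISTER_LATENCY
    return [first + CAM_CYCLES * n for n in range(num_words)]

def total_cycles(num_words):
    """Cycles needed to process num_words search words back to back."""
    return result_write_cycles(num_words)[-1] if num_words else 0

print(result_write_cycles(4))
print(total_cycles(1000))
```

Under these assumptions the first result appears after 7 cycles and every further result 5 cycles later, so the initial latency becomes negligible for long buffers.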

Chapter 7: Hardware Implementation of the CAM

In this chapter, the VHDL implementations of the fixed length CAM, the variable length CAM and the explicit priority encoder are discussed. This chapter is meant to show how the designs have been described in a structural style, using the hardware primitives available for Virtex. This type of VHDL description is necessary in applications that use dynamic reconfiguration, since full control over the implementation of specific parts of the design is required.

7.1 Implementation of the Fixed Length CAM

The structure of the VHDL description of the fixed length CAM is shown in figure 7-1. It shows the entities that are used and how they relate to each other.

Figure 7-1: Structure of the VHDL description of the fixed length CAM (entities: CAM, Encoder, Match_Line, Register_64, Stitcher, Match_Block, DecLut; Virtex primitives: FD, MUXCY_L, SRL16)

Entity DecLut defines the LUTs that are part of the match lines and store the actual entries. Encoder is the priority encoder.

Virtex Primitives

The match lines are built entirely out of Virtex primitives (structural VHDL). The following VHDL primitives from the Virtex library have been instantiated:

    component FD -- D flip flop
      port (
        Q : out std_logic;
        D : in  std_logic;
        C : in  std_logic
      );
    end component;

    component MUXCY_L -- 2-to-1 mux
      port (
        LO : out std_logic;
        CI : in  std_logic;
        DI : in  std_logic;
        S  : in  std_logic
      );
    end component;

    component SRL16 -- 16 bits shift register
      port (
        Q   : out std_logic;
        A0  : in  std_logic;
        A1  : in  std_logic;
        A2  : in  std_logic;
        A3  : in  std_logic;
        D   : in  std_logic;
        CLK : in  std_logic
      );
    end component;

When a LUT with some logical function is instantiated in VHDL, the Xilinx place and route (PAR) tools swap the four inputs and change the logical function accordingly. This is done for optimization reasons and the way the inputs are swapped is not easily predictable. Normally this is not a problem, but when using JBits to change the content of the LUT, one needs to know exactly how the inputs are mapped on the LUT primitive. To solve this, shift register SRL16 was instantiated instead of a LUT. This way, PAR cannot swap the inputs (or else the behaviour would change). A more detailed description of primitive SRL16 is given in section 7.2. After place and route, the shift registers can be transformed to LUTs using JBits. This is an easy process, since a shift register is in principle a LUT configured in a special way. This transformation is discussed elsewhere in this report.

Stitcher

From figure 7-1 it follows that a match line does not only contain D flip flops and match blocks, but also an entity called stitcher. This entity has been instantiated because the place and route tool is not able to place a register between two blocks automatically. More precisely, the tool is not able to connect the output of a flip flop directly to a carry chain that is situated near the flip flop. The stitcher routes the output of the flip flop manually to the carry chain. This is shown in figure 7-2.

Figure 7-2: Simplified schematics showing the stitcher function (two carry chains, each preceded by a flip flop FD, connected through the stitcher)
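Why the input order matters when a LUT is rewritten after place and route can be illustrated with a small software model of a 4-input LUT. This is a sketch for illustration only. The permutation used here is an arbitrary example, and the function names are not part of the design or of JBits.

```python
from itertools import product

def lut_eval(init, a3, a2, a1, a0):
    """Evaluate a 4-input LUT whose 16-bit truth table is 'init'."""
    return (init >> (8 * a3 + 4 * a2 + 2 * a1 + a0)) & 1

def swap_inputs(init, perm):
    """Truth table that computes the same function after logical input i
    has been moved to physical pin perm[i] (pin 0 = A0 ... pin 3 = A3)."""
    new_init = 0
    for bits in product((0, 1), repeat=4):  # bits[i] = value of logical input i
        old_addr = sum(b << i for i, b in enumerate(bits))
        new_addr = sum(b << perm[i] for i, b in enumerate(bits))
        new_init |= ((init >> old_addr) & 1) << new_addr
    return new_init

# Example: a LUT that matches the 4-bit pattern 0011 (only address 3 returns 1).
original = 1 << 0b0011
# Hypothetical swap chosen by the tools: A0 <-> A2 and A1 <-> A3.
swapped = swap_inputs(original, perm=[2, 3, 0, 1])

print(f"{original:016b}")
print(f"{swapped:016b}")
```

If the software wrote the unswapped truth table into a LUT whose inputs had been permuted in this way, the LUT would match pattern 1100 instead of 0011. Instantiating SRL16 avoids this, because the tools then have to keep the input order.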

Register_64

Signal Indata (see figure 5-1) is the 64-bit input data of the CAM, which is connected to all match blocks in the design. The fixed length CAM has a total of 128 x 5 = 640 match blocks, meaning that the Indata net has a very high fanout and is spread over the total CAM area. This leads to low performance. Indata comes from two 32-bit registers placed between the actual CAM and the board memory (Data0_Reg and Data1_Reg in figure A-1). By replicating these registers, the fanout is decreased. Replication is done automatically by the synthesis tools, but this does not lead to good results: although the fanout of Indata decreases, each copy is still connected to match blocks spread over the entire CAM area. To solve this, Data0_Reg and Data1_Reg have been implemented as a 64-bit register (entity Register_64) which is replicated manually. This is described in the VHDL code of the CAM structure, but since it is functionally part of the board interface, it is shown in figure A-1 as well. Replication is done by instantiating 16 such registers, and every register is connected to the match blocks contained in eight consecutive match lines. This way, the fanout of Indata is decreased to 40 (8 match lines x 5 match blocks), and by placing each Register_64 near the eight match lines that it is connected to, long routes are prevented.

7.2 Implementation of the Variable Length CAM

The structure of the VHDL code of the variable length CAM is much like that of the fixed length CAM and is given in figure 7-3. The main differences are that two multiplexers were added for each match line (entity Mux4) and that flip flops were replaced by shift registers.

Figure 7-3: Structure of the VHDL description of the variable length CAM (entities: CAM, Encoder, Register_64, Stitcher, Match_Line, Match_Block, DecLut, Mux4; Virtex primitives: MUXCY_L, SRL16)

Shift Register

In the fixed length CAM, primitive SRL16 was used to implement a LUT whose inputs are not swapped.
In the variable length CAM this primitive is instantiated to be used as a shift register as well. A description of SRL16 is given below:

(Symbol of primitive SRL16 with address inputs A0, A1, A2 and A3, data input D and output Q)

The data (D) is loaded into the first bit of the shift register, and during subsequent Low-to-High clock transitions data is shifted to the next bit position as new data is loaded. The data appears on the Q output when the shift register length determined by the address inputs is reached. The length of the shift register can be changed dynamically and is equal to (8*A3) + (4*A2) + (2*A1) + A0 + 1. In the VHDL code, the length is initialized at 1 by driving 0 on all address inputs. To change the length during operation, some address inputs need to drive 1, and this is done by disconnecting these inputs. This way, these inputs are not driven and a pull up causes them to become High.

Multiplexer

To implement the multiplexers that connect the entries to the priority encoder, a 4-input look up table is used. Since this LUT is changed by JBits, it is again important that the inputs are not swapped. Therefore an SRL16 was instantiated, which is converted to a LUT in JBits.

Stitcher

An extra stitcher was added at the beginning of each match line. In the fixed CAM design, the carry was always set to High at the start of a new match line. This signal was generated inside the first LUT of the match line. In the variable CAM design, a stitcher is needed to be able to connect to the carry signal from the previous match line.

Switch

As mentioned in chapter 5, each match block has a switch to connect its carry chain to either logical High or the output of the previous block. To switch between these two states, a multiplexer is used that is available in the Virtex carry logic and connects the carry chain to either input signal BX or Cin (see figure 3-2). This multiplexer is controlled by the configuration memory and can therefore be changed using JBits. In the VHDL code, the multiplexer is configured to connect to Cin.
To switch to a logical High, BX has to be driven High. This is done as follows: when a shift register is instantiated in VHDL, input signal D comes in via port BX. As SRL16 is only instantiated to implement a LUT, signal D is not used and can be set to 1.

7.3 Implementation of the Priority Encoder

Explicit priority encoder

The structure of the VHDL description of the explicit priority encoder is given in figure 7-4, showing all entities and Virtex primitives that have been instantiated. The two priority encoders have been described in a behavioural style, while the switch box and the output decoder were implemented in a structural way, since these need to be configured by JBits. Entity SwitchBox_8x8 uses primitive TBUF, which refers to a tri-state buffer on Virtex.

Figure 7-4: Structure of the VHDL description of the explicit priority encoder (entities: ExplicitEncoder, PriorityEncoder, SwitchBox, PriorityEncoder_8x3, OutputDecoder, SwitchBox_8x8; Virtex primitives: SRL16, MUXCY_L, BUFT)

Inherent priority encoder

The inherent priority encoder is described in behavioural VHDL. The corresponding VHDL code is given below.

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;
    USE ieee.std_logic_arith.ALL;  -- provides conv_std_logic_vector

    ENTITY Encoder IS
      GENERIC( Width : integer := 7;
               Size  : integer := 128);
      PORT( Input  : IN  std_logic_vector (Size-1 DOWNTO 0);
            Output : OUT std_logic_vector (Width-1 DOWNTO 0);
            Match  : OUT std_logic);
    END Encoder;

    ARCHITECTURE Behave OF Encoder IS
    BEGIN
      PROCESS(Input)
        VARIABLE temp_output : std_logic_vector(Width-1 DOWNTO 0);
        VARIABLE temp_match  : std_logic;
      BEGIN
        -- Default values: no match found.
        temp_output := (OTHERS => '0');
        temp_match  := '0';
        -- Scan from the highest index down, so that the lowest
        -- matching index is assigned last and therefore wins.
        FOR i IN (Size-1) DOWNTO 0 LOOP
          IF (Input(i) = '1') THEN
            temp_output := conv_std_logic_vector(i, Width);
            temp_match  := '1';
          END IF;
        END LOOP;
        Output <= temp_output;
        Match  <= temp_match;
      END PROCESS;
    END Behave;
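The behaviour of this encoder can be mirrored in a short software model, which may help when checking test vectors against the hardware. The sketch below is an illustration only. It follows the same loop direction as the VHDL, so that the lowest matching index has the highest priority, and it represents the match vector as a plain list of bits.

```python
def priority_encode(match_vector):
    """Return (match, address) for a list of match-line bits.

    The loop runs from the highest index down to 0, like the VHDL loop,
    so the lowest set index is assigned last and determines the address.
    """
    match, address = 0, 0
    for i in reversed(range(len(match_vector))):
        if match_vector[i] == 1:
            address = i
            match = 1
    return match, address

lines = [0] * 128
lines[9] = 1
lines[42] = 1
print(priority_encode(lines))      # match lines 9 and 42 are active
print(priority_encode([0] * 128))  # no match line is active
```

With match lines 9 and 42 both active, the model reports address 9. When no line is active, the match flag is 0 and the address is not meaningful.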

Chapter 8: Synthesis and Place & Route

In this chapter synthesis and place & route (PAR) of three CAM implementations are discussed:

1. Fixed length CAM with inherent priority
2. Variable length CAM with inherent priority
3. Variable length CAM with explicit priority

Section 8.1 describes the method for synthesis and PAR. Section 8.2 gives the physical structure of these three CAM implementations in terms of CLB locations. Sections 8.3, 8.4 and 8.5 give the results for the three implementations, and these results are summarized in section 8.6.

8.1 Method

Synthesis

During synthesis, the VHDL description is mapped to Virtex primitives. These primitives are those described in 7.1.1, together with LUTs, IOBs, input buffers and clock buffers. When mapping the design on Virtex primitives, Synplify also performs several optimizations. In most cases these optimizations are useful, but when synthesizing the CAM structure, two optimizations have to be avoided:

1. Synplify recognizes signal Ctrl_ACK of the board interface as a clock, since this port is edge sensitive, and adds a clock buffer to this signal. Since only four dedicated pins have this buffer and the Ctrl_ACK pin is not one of them, this clock buffer has to be removed by the Synplify constraint:

    define_attribute {Ctrl_ACK} syn_noclockbuf {1}

2. Another optimization is omitting redundant logic. The fixed length CAM consists of 128 match lines that are equal and have similar input signals, since the content is written in a later stage by JBits. Synplify therefore removes all match lines except one.
To prevent Synplify from doing this, a syn_keep attribute must be added in the VHDL code to signal Indata for every match line:

    Column : FOR i IN Size-1 DOWNTO 0 GENERATE
      SIGNAL Temp : std_logic_vector(63 DOWNTO 0);
      ATTRIBUTE syn_keep OF Temp : SIGNAL IS TRUE;
    BEGIN
      Temp <= InData;
      Match_Line : Match_Line
        PORT MAP(InData => Temp, Match => Dec_Out(i), clk => clk);
    END GENERATE Column;

This allows Synplify to optimize and omit redundant logic within a match line, but the optimization does not extend across match lines, so that all match lines are optimized individually and none is optimized away.

Place and Route

During PAR, the Virtex primitives are placed on the FPGA array and connected together. Usually it is up to the tool to decide where components are placed, but in this case placement constraints are applied for three reasons:

1. The CAM is programmed by letting JBits change the content of the LUTs. In the variable length implementation the length of the shift registers (SRLs) and the output multiplexers need to be programmed as well. For JBits to do this, it needs to know exactly which LUTs and SRLs to write to and where they are placed in the FPGA array. This requires location constraints on the LUTs and SRLs that are part of the match lines.
2. The pin locations of the ports described in 6.1 were decided by the board manufacturer. The PAR tool needs to be aware of these locations, so that the IOBs are placed on the right locations. This requires pin-location constraints on the ports.
3. The CAM has a very regular structure, consisting of vertically placed carry chains. This leads to a design that can be placed in a very compact way, with a high utilization density. Since the PAR tool is not aware of this system knowledge, better performance is reached with manual floorplanning. One of the main things that has to be enforced manually is that the two LUTs (primitive SRL16) and the two multiplexers that they control (primitive MUXCY_L) are placed in the same slice.

These constraints are applied to the design via a User Constraints File (UCF). A description of the UCF syntax can be found in [19]. Since all components are to be placed individually, it is infeasible to write the constraints manually. For this reason a C program was written that generates the constraints. Absolute CLB locations were used to place all components.
It is possible to use relative locations as well, where the location of each component is expressed as its position relative to some origin that can be situated anywhere on the FPGA (so called Relationally Placed Macros). This is a convenient way to constrain components, since the whole design can be moved by simply changing the location of the origin, instead of the location of all individual components. It turned out, though, that the Xilinx tools generate an error when two multiplexers and two shift registers are placed in one slice this way.

Timing constraints

To increase the performance of the design, timing constraints were used. These constraints are passed to the synthesis tools and specified in the UCF of the PAR tools. They can limit the delay on some critical nets. First the synthesis tools minimize the number of logic levels of the design in order to meet the timing constraints, for example by logic replication. Then the PAR tools try to place and route components in such a way that routing delays are minimized until the timing constraints are met. In the CAM implementations, the minimum clock frequency at which the designs should operate was constrained. For optimal results, the desired clock frequency was chosen just above the value that could be met by the tools.
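The generation of absolute location constraints mentioned in the Place and Route section can be sketched as follows. The thesis used a C program for this. The Python version below is only an illustrative stand-in, and several things in it are assumptions: the instance names, the starting row and column, and the exact form of the constraint line, which follows the Virtex-style `INST ... LOC = CLB_RxCy.Sz` notation and should be checked against the UCF description in [19].

```python
def slice_loc(row, col, slice_index):
    """Format an absolute slice location, e.g. CLB_R14C8.S1 (assumed notation)."""
    return f"CLB_R{row}C{col}.S{slice_index}"

def match_block_ucf(line, block, top_row, col, slice_index, height=8):
    """Return the constraint lines for one match block of 8 stacked slices.

    Every slice holds two SRL16 primitives and the two MUXCY_L primitives
    they control. All instance names below are hypothetical.
    """
    constraints = []
    for k in range(height):
        loc = slice_loc(top_row + k, col, slice_index)
        for prim in ("srl_f", "srl_g", "mux_f", "mux_g"):
            inst = f"cam/line{line}/block{block}/s{k}/{prim}"
            constraints.append(f'INST "{inst}" LOC = {loc};')
    return constraints

ucf = match_block_ucf(line=0, block=0, top_row=14, col=8, slice_index=1)
print(len(ucf))
print(ucf[0])
```

Generating the file from a few layout parameters keeps the placement regular and makes it easy to regenerate all constraints when the floorplan changes, which is the main benefit that relative placement would otherwise have offered.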

8.2 Physical Structure

Fixed length CAM

The physical structure of the fixed length CAM on the CLB array is shown in figure 8-1. The CAM is placed as a rectangle in the middle of the FPGA. Locations in the array are denoted by the coordinate of the CLB (CLB col, CLB row) and the slice within the CLB (slice), which is one of the two available slices S0 and S1. Every match block is mapped on 8 slices, and a vertical space of 2 slices is left free between two blocks, in which a flip flop followed by a stitcher is placed. To allow the Xilinx tools to place logic within the CAM structure, vertical columns have been left empty. This leads to better timing and faster PAR run times. In this implementation one CLB column is left empty every 8 match lines.

Figure 8-1: CLB locations of the various components in the match field of the fixed length CAM (horizontal axis: <CLB col>.<slice>, from 8.S1 to 86.S0; vertical axis: <CLB row>; a match line consists of registers, match blocks and stitchers, together with the block of 64-bit registers; a legend marks the components that are not constrained)

A block that is 8 CLBs high has been reserved for the registers that connect the 64-bit input to the various match blocks. Only the coordinates of these 64-bit registers and of the match blocks have been constrained. The positions of the stitchers, the register at the output of each match block and the priority encoder are decided by the place & route tool.

Variable length CAM

The physical placement of the variable length CAM is given in figure 8-2. The CAM is again placed as a rectangle in the middle of the FPGA. In the fixed length CAM implementation two slices were needed to place the stitcher and flip flop between each block. In the variable length implementation, the flip flops were replaced by shift registers, whose inputs can be connected to a carry chain within the same slice. Therefore only one slice was reserved between each match block. The two line multiplexers are placed in separate slices. The reason for this is that when placing them in one slice, it is not clear which multiplexer is placed in which LUT. Placing them in separate slices gives the possibility to configure both LUTs in each slice with the same value, without knowing which LUT the multiplexer is mapped on. All components were constrained. Again one CLB column was left empty per 8 match lines for the place & route tools to place logic and to provide extra routing resources, and an area of 8 CLBs high was reserved for the 64-bit registers that connect the input of the CAM to all match blocks.

Figure 8-2: CLB locations of the various components in the match field of the variable length CAM (horizontal axis: <CLB col>.<slice>, from 8.S1 to 86.S0; vertical axis: <CLB row>; a match line consists of multiplexer 1, multiplexer 0, shift registers, match blocks and stitchers, together with the block of 64-bit registers)

Explicit priority encoder

The physical structure of the explicit priority encoder is shown in figure 8-3. As mentioned before, the explicit priority encoder is used together with the variable length CAM, whose structure is left unchanged. The priority encoder should be constrained in such a way that there is no conflict between the two designs. It is placed above the variable length match field and the two are aligned horizontally. The priority multiplexers are part of the switch box and each connect one or more inputs to one of the eight priority lines. These multiplexers are numbered between 0 and 7, referring to the priority lines their outputs are connected to: 0 means highest priority, 7 lowest.

Figure 8-3: CLB locations of the various components of the explicit priority encoder (horizontal axis: <CLB col>.<slice>, from 8.S1 to 86.S0; vertical axis: <CLB row>; components: output decoder LUT1, output decoder LUT0 and the priority multiplexers)

Output decoder LUT 0 and LUT 1 are part of the output decoder, and since the variable length CAM has two outputs per slice column, two of these LUTs are needed per column. The LUTs have been placed in separate slices, so that JBits is able to distinguish between them (see 8.2.2).

8.3 Results of the Fixed Length CAM

FPGA editor view of the CAM

After place and route, the design can be made visible via FPGA editor. This Xilinx tool gives a view of the CLB array, with all placed components and routes. This tool also makes the slice internals and the configuration of the LUTs visible. In appendix C, figure C-1, the FPGA editor view of the entire fixed length CAM structure including the board interface is given. It shows the densely routed match field in the middle and the IOBs on the edges of the FPGA. The priority encoder is placed above the match field. In figure 8-4 the FPGA editor view of the internals of a slice is given. This slice is part of the carry chain of a match block.
It shows the two LUTs configured as shift registers, the two multiplexers controlled by the LUTs and the carry routes. In this slice, the two available registers have not been utilized, and this is the case for all slices that are part of a carry chain. The reason for this is that the input of such a register is controlled by the output of the LUT in the same logic cell. Since this LUT is used to control the carry logic, its output cannot be connected to the flip flop as well. Another problem is that the output of a flip flop cannot connect to a carry chain within the same slice, or to the Cin of adjacent slices. Therefore it was necessary to reserve two slices between two match blocks: one for the stitcher and one for the register. The multiplexer denoted with "switch" is used in the variable length CAM implementation and is discussed in 8.4.

Figure 8-4: FPGA editor view showing internals of a slice that is part of the carry chain of a match block (annotations: switch, SR-line (see 9.3.4))

Device utilization

Table 8-1 gives a summary of the FPGA resource utilization of the whole design, including the board interface. The data was taken from the mapping report.

Resource          Resources used   Resources available   Utilization
Slices            7,084            12,288                58%
Flip Flops        1,871            24,576                8% (a)
LUTs                               24,576 (b)            2%
Shift registers   10,240           24,576 (b)            42%
IOBs

Table 8-1: FPGA resource utilization of fixed length CAM.

a. From this data it follows that only 8% of the flip flops is used. It should be noted, though, that not all flip flops that are unused now can actually be used. Only flip flops situated in slices that are not part of a carry chain can be instantiated, which makes the effective utilization 8 + 42 = 50%.
b. This number is misleading, since both LUTs and shift registers are mapped on the same resource. Therefore the total number of LUTs + shift registers is equal to 24,576.

Table 8-2 shows the device utilization, where a distinction is made between the board interface, the match field and the priority encoder. In this table the LUTs and shift registers were merged, since they represent the same physical resource.

Resource     Board Interface   Encoder   Match Field
Slices       6 %               2 %       49 %
Flip Flops   5 %               0 %       3 %
LUTs         1 %               1 %       42 %

Table 8-2: FPGA resource utilization per component for the fixed length CAM.

Timing Analysis

To find the critical path in the design, an advanced design analysis has been performed by the Xilinx tools. Below a fragment of the resulting timing report is given:

    Delay: ns match_vector(9) to data2(1)
    ns Total path delay (26.538ns delay plus 1.499ns setup)
    0.141ns clock skew

The critical path is 28.2 ns (the 26.538 ns delay, 1.499 ns setup and 0.141 ns clock skew in the report add up to 28.178 ns) and runs through the priority encoder, from the output of match line 9 to the output register that is connected to memory bank 2. The fixed length CAM can operate at a maximum frequency of 35.4 MHz. Since a complete match takes 5 clock cycles, the CAM is able to perform 35.4/5 = 7.1 Mlookups/s.

8.4 Results of the Variable Length CAM

FPGA editor view of the CAM

The FPGA editor view of the whole variable length CAM is shown in figure C-2 in the appendix. The match field is smaller than in the fixed length CAM and the priority encoder is clearly visible as a dense area on top of the match field. Since the match blocks are implemented the same way as in the fixed length CAM, the slice internals of the carry chain are similar to figure 8-4. In this figure the multiplexer denoted with "switch" switches between Cin and BX and is controlled by JBits to either start a new carry chain or to propagate the carry of the previous match block.

Device Utilization

Table 8-3 summarizes the resource utilization for the board interface, encoder and match field of the variable length CAM.

Resource     Board Interface   Encoder   Match Field
Slices       6 %               7 %       40 %
Flip Flops   5 %               0 %       0 %
LUTs         1 %               4 %       36 %

Table 8-3: FPGA resource utilization per component for the variable length CAM with inherent priority.

The match field consumes fewer resources than in the fixed length CAM, because of the decreased number of match blocks.

Timing Analysis

An advanced timing analysis has been performed and the critical path is equal to 52.3 ns. The critical path again runs through the priority encoder and is longer than in the fixed length CAM. This is caused by the increased size of the encoder. The maximum look up rate of the variable length CAM with inherent priority then becomes 3.8 Mlookups/s at a clock frequency of 19.1 MHz.

8.5 Results of the Variable Length CAM with Explicit Priority

FPGA editor view of the CAM

The FPGA editor view of the variable length CAM together with the explicit priority encoder is given in figure C-3 in the appendix.

Device utilization

Table 8-4 summarizes the resource utilization for the board interface, encoder and match field of the variable length CAM with explicit priority.

Resource     Board Interface   Encoder   Match Field
Slices       6 %               9 %       40 %
Flip Flops   5 %               0 %       0 %
LUTs         1 %               6 %       36 %

Table 8-4: FPGA resource utilization per component for the variable length CAM with explicit priority.

Comparing the utilization of this implementation with that of the variable length CAM with inherent priority shows no significant differences. The extra logic needed for explicit priority is only 2 % and therefore does not add significant hardware costs.

Timing analysis

An advanced timing analysis has been performed and the critical path is equal to 58.0 ns. The critical path again runs through the priority encoder, but is somewhat longer than in the variable length CAM with inherent priority. This increase is caused by the extra propagation delay in the logic that was added in the explicit priority encoder. The maximum look up rate of the variable length CAM with explicit priority then becomes 3.4 Mlookups/s at a clock frequency of 17.2 MHz.

8.6 Summary

Table 8-5 summarizes the implementation results of the three CAM implementations and of another design that has been implemented for comparison. This design is a variable length CAM that outputs only one bit that tells if a match occurred, but does not return the matching address. Such a CAM is used for IP filtering, where this bit decides whether an IP packet is to be forwarded or not. Only the hardware part of this design has been implemented, to compare speed and utilization in the absence of the priority encoder.

Table 8-5: Speed and utilization of different CAM implementations (columns: CAM type; No. of 64 bits match blocks; Device utilization [% slices] (a); Max. clock frequency [MHz]; Max. look up rate [Mlookups/s]; rows: Fixed length, inherent priority; Var. length, inherent priority; Var. length, explicit priority; Var. length, match returned)

a. The device utilization is given for the CAMs without the board interface.

The size of the CAMs is given as the number of match blocks, since this is an indication of how many CAM words of a certain length can be stored. The fixed length CAM can store a total of 128 words of 320 bits. The variable length CAM stores at least 102 words of 320 bits, but by reducing the CAM words as described before, it is able to store a maximum of 256 words of 128 bits each. The variable length CAM is slower than the fixed length CAM. This is the result of the difference in size of the priority encoder that each implementation contains.
The performance of the variable length CAM with explicit priority is lower than that of the same CAM with inherent priority. This is caused by the fact that the critical path runs through the priority encoder, which has more logic levels in the explicit case. The variable length CAM that only returns a match bit is significantly faster than the other designs, since the priority encoder that is responsible for the critical path in the other designs has been omitted. From this it can be concluded that the priority encoder significantly limits the performance of the whole CAM and that a performance increase of more than 200% is reached by leaving it out.
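The relation between critical path, clock frequency and look up rate used throughout this chapter can be retraced with a short calculation. The sketch below is an illustration. It uses the critical paths quoted in sections 8.3 to 8.5 and assumes that every implementation needs 5 clock cycles per look up, which matches the figures in the text. The last digit may differ from the quoted values depending on how they were rounded.

```python
CYCLES_PER_LOOKUP = 5  # a complete match takes 5 clock cycles

def max_clock_mhz(critical_path_ns):
    """Maximum clock frequency in MHz for a given critical path in ns."""
    return 1000.0 / critical_path_ns

def lookup_rate_mlookups(critical_path_ns, cycles=CYCLES_PER_LOOKUP):
    """Maximum look up rate in million look ups per second."""
    return max_clock_mhz(critical_path_ns) / cycles

critical_paths_ns = {
    "fixed length, inherent priority": 28.2,
    "variable length, inherent priority": 52.3,
    "variable length, explicit priority": 58.0,
}

for name, path in critical_paths_ns.items():
    print(f"{name:36s}{max_clock_mhz(path):6.1f} MHz"
          f"{lookup_rate_mlookups(path):6.1f} Mlookups/s")
```

The calculation makes clear that the look up rate scales inversely with the critical path, so any shortening of the path through the priority encoder translates directly into a higher look up rate.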

Chapter 9: Software Implementation

In this chapter the implementation of the software part of the CAM is discussed. This is done in three parts. The first part gives a short description of the Java user application and the Graphical User Interface (GUI) of the three CAM implementations. The second part gives a description of the hardware interface that has been written to let the Java application communicate with the board. The last part describes how JBits is integrated in the program. This chapter is not meant to give a detailed description of all the software functions, but is focused on the communication between the hardware and the software.

9.1 Java User Application

General description

The Java user application forms the interface between user and CAM. It is used to edit the contents of the CAM and to test its functionality. The program depends on the following main classes:

1. Sun's Swing library for implementing graphics.
2. Class esl, which is the Java interface to the board. This interface is discussed in 9.2.
3. The Xilinx JBits library, used to manipulate the FPGA bitstream. This is discussed in 9.3.

A good introduction to the Java language as well as a language reference can be found in [20].

GUI of the fixed length CAM

The fixed length CAM user application gives access to all 128 entries of the CAM. Changing or adding entries can be done from the program itself, or by reading from a configuration file. The configuration of the CAM and the way it has been mapped on the LUTs is shown graphically. Testing is done by reading packets from a file and processing these by the hardware, where the result is written to another file. A full description of the GUI, including all functionality and menus, is given in appendix D.

GUI of the variable length CAM

The GUI of the variable length CAM differs from that of the fixed length CAM in that more information is given to the user about the configuration of the CAM.
Not only the contents of the LUTs, but also the state of the switches, the multiplexers and the shift registers is shown. The configuration cannot be changed from within the program as in the fixed length CAM; instead a configuration file has to be read. A full description of the GUI of the variable length CAM is given in appendix D.

GUI of the variable length CAM with explicit priority

In the user applications of the CAMs that use the inherent priority mechanism, the user was responsible for adding entries in the right order, and there was no support for automatically adding a single entry at the correct location depending on its priority. When using the explicit priority mechanism, not only the order but also the priority values of the CAM words need to be controlled in order to add a new entry while changing the locations of other entries as little as possible. This process has been automated in this implementation. The user simply gives a list of CAM words that need to be added, together with their priority, which can be any number. The Java application then maps every entry on the CAM structure while changing the locations of as few other entries as possible. This feature is part of preprocessing and is described in 9.4.

The GUI of the variable length CAM with explicit priority has been changed so that the program is controlled by a command file. This file may contain commands for adding/deleting entries, searching for test packets and several other options. The GUI and the format of the command file are described in appendix D.

9.2 Hardware Interface

Native interface

Java, as the language is defined, is hardware independent. While this is a great benefit to most of its users, it provides no mechanism for interfacing to either hardware or non-Java code. In this case, we would like the Java user interface to communicate with the board containing the FPGA. The board came with device drivers and a C library, PP1000, that implements all necessary functions for communication. To use this library in Java, a native interface was implemented.
More details about this native interface and how it has been implemented can be found in appendix E.

Remote interface

Java is very suitable for the implementation of distributed systems and has an extensive library of routines for working with TCP/IP protocols. This makes it possible to implement a remote hardware interface, such that the board can be accessed from any computer connected to the host computer via a network. A remote hardware interface has been written using the Remote Method Invocation (RMI) mechanism, such that the Java user application can be used on any computer. This gives the freedom to communicate with the board not only from other computers, but also from other platforms such as UNIX. A description of this remote hardware interface is given in appendix F.

9.3 JBits Integration

Components, resources and values

Components are those parts of the CAM that can be configured by JBits. In the fixed length CAM, these are the LUTs in the match blocks, which are written with the content of the entries.

The variable length CAM has four or six components, depending on the priority mechanism that is used. For inherent priority these are the LUTs in the match blocks, the multiplexers used to connect match blocks to the priority encoder, the switches and the shift registers. For explicit priority, the encoder needs to be configured as well. The components contained in the explicit priority encoder are the 8-input multiplexers in the switch box and the LUTs in the output decoder. Every component is characterized by the FPGA resource that it uses. In chapter 4 it was mentioned that changing the configuration of a component is done by writing a value to its resource:

set(int row, int column, int[][] resource, int[] bits);

An overview of the resource and possible values of all components used in the three CAM implementations is given below. Since the multiplexers in the variable length CAM and all components in the explicit priority encoder are implemented in a LUT, these are not discussed separately.

1. LUTs
Resource: LUT.SLICE<slice number>_<lut>
where slice number denotes one of slices 0 and 1 and lut denotes one of LUTs F and G.
Value: Init
The type of Init is integer and its format is explained below. A LUT is a 4-to-1 boolean function that can be represented by a truth table with length 16. To represent this boolean function with a single value, the 16 output bits of the truth table are combined in a bit vector, inverted and converted to an integer. Example: the output column of the truth table of a 4-input OR-gate is 1111111111111110 (the only 0 is for input combination 0000). To program a LUT as an OR-gate, these bits are inverted to 0000000000000001 and converted to an integer. Init then becomes 1. The LUT can be configured as a four-input multiplexer this way, as needed in the variable length CAM.

2. Switches
The resource that the switches are mapped on is the multiplexer that propagates either the carry of the previous match block (Cin) or signal BX:
Resource: S<slice number>control.cin.cin
Value: S<slice number>control.cin (closed state)
       S<slice number>control.bx (open state)
where slice number denotes one of slices 0 and 1.

3. Shift registers
The resources that decide the length of the shift registers are the four inputs A0-A3 of SRL16:
Resource: S<slice number><lut><input>.s<slice number><lut><input>
Value: S<slice number><lut><input>.off (input is driven High)
where slice number denotes one of slices 0 and 1, lut denotes one of LUTs F and G and input denotes one of inputs 1-4. To drive an input Low, the input has to be driven by the original route as decided by the place & route tools; this route can be read during initialization using method JBits.get(int row, int column, int[][] resource);

Describing the CAM structure

To reconfigure the FPGA, JBits needs to be aware of the physical structure of the CAM. This layout is defined in a separate class CAMConstants, which is given in appendix G for the variable length CAM with explicit priority. It contains several constants and vectors describing the number of match lines and match blocks and the relative physical CLB locations of the different components. For example:

int[] SwitchOffset = {0, 9, 18, 27};

in the variable length CAM gives the relative y-locations of the switches in a match line, starting at the least significant match block. The coordinates of the origin of the CAM structure are also defined in this class, so that the whole CAM structure can be moved easily.

Initialization

The first action that is performed by JBits is reading the bitstream of the CAM structure, described in chapter 6. This is done with command:

jbits.read(userconstants.infilename);

If reading has been successful, JBits starts building the initial CAM structure. For every component in the design, a class is instantiated that contains the CLB location of that component, its FPGA resource and value. In the fixed length design, these are only LUTs, but in the variable length CAM, switches, multiplexers and shift registers are instantiated as well.
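As an aside on the LUT value format described under "Components, resources and values", the Init encoding can be illustrated with a small, self-contained sketch. It assumes that bit i of the value corresponds to input combination i, which is consistent with the OR-gate example (Init = 1). The class and method names are hypothetical and are not part of JBits.

```java
// Hypothetical helper, not part of JBits: packs a 16-entry LUT truth table
// into the integer Init value described in the text (outputs inverted).
final class LutInit {

    // outputs[i] is the LUT output for input combination i (0000 .. 1111).
    // Assumption: bit i of the result belongs to input combination i.
    static int initFromTruthTable(boolean[] outputs) {
        if (outputs.length != 16) {
            throw new IllegalArgumentException("a 4-input LUT has 16 truth table entries");
        }
        int init = 0;
        for (int i = 0; i < 16; i++) {
            if (!outputs[i]) {          // invert: a 0 output sets the bit
                init |= 1 << i;
            }
        }
        return init;
    }

    public static void main(String[] args) {
        boolean[] or4 = new boolean[16];
        for (int i = 1; i < 16; i++) {
            or4[i] = true;              // OR is 1 for every input except 0000
        }
        System.out.println(initFromTruthTable(or4));              // prints 1
        System.out.println(initFromTruthTable(new boolean[16]));  // prints 65535
    }
}
```

The second call corresponds to a LUT that outputs 0 for every input combination; under the stated bit-order assumption all 16 bits of Init are then set.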
Example: The variable length CAM has a class called switch, of which a fragment is shown below:

class switch {
    ...
    public int[][] Resource;
    public int[] Open;
    public int[] Closed;
    public int CLBy;
    public int CLBx;
    public int State;
    public int OldState;
    public boolean Changed;
    public JBits JBits;
}

Variable State is either 0 (open) or 1 (closed) and OldState is used to determine whether the state of the switch has changed. Resource denotes the FPGA resource of the switch, and Open and Closed denote the values that should be written to Resource to change the state of the switch.

The last step is initializing all components in such a way that the CAM is empty. This is done in the constructor of every instantiated component. LUTs are initialized such that they output 0 independent of the four input values, shift registers are initialized with a delay of 1 and all switches are set to open.

Converting SRL16 to LUT

After the whole structure has been built and all components have been initialized, JBits starts converting the 16-bit shift registers (SRL16), which were instantiated in the VHDL code to disable the input swapping, to regular LUTs. This is done with the following method:

jbits.set(y, x, S<slice number>ram.lut_mode, S<slice number>ram.on);

where (y, x) is the coordinate in the CLB array and slice number denotes one of slices 0 and 1. This is done for the LUTs in the match blocks and, in the variable length CAM, also for the multiplexers that connect to the priority encoder and the LUTs that are used in the explicit priority encoder. Using this method not only converts SRLs to LUTs, but also sets the Synchronous Reset (SR) line of the slice flip-flops to low. This line is depicted in figure 8.4. This causes registers that are placed together with an SRL in the same slice to be reset.
This problem is solved by making sure that the SR lines are not inverted (1) and that they are not connected to any net (2):

(1) JBits.set(y, x, S<slice number>control.srwenotinvert, S<slice number>control.off);
(2) JBits.set(y, x, S<slice number>sr.s<slice number>sr, S<slice number>sr.off);

Cyclic Redundancy Checking (CRC)

Virtex configuration utilizes a standard 16-bit CRC checksum algorithm to verify bitstream integrity during configuration. An initial CRC checksum is calculated by the Xilinx tools while generating the bitstream after place and route. In addition, a special purpose packet is added to the programming bitstream that tells the FPGA to do a CRC check. When the bitstream is read by JBits, JBits removes a part of the bitstream that is not necessary, but does not update the CRC value.

This means that a checksum error is generated when configuring the FPGA with this bitstream. A way to solve this problem is to replace the packet that tells the FPGA to do a CRC check with a dummy packet. This is done in the code below:

Packet packet = null;
int h = jbits.getpacketcount();
for (int k=0; k<h; k++) {
    packet = jbits.get(k);              // Read packet(k) in bitstream
    if ((packet.getword(0) == 0x ))     // compare header
    {
        packet.setword(0, 0x );         // Set to Write RCRC
        packet.setword(1, 0x07);
    }
}

The program searches for CRC command packets, which can be recognized by the 32-bit header 0x . If such a packet is found, its data field is set to 0x07. Instead of telling the FPGA to do a CRC check, the packet now only resets the registers in the CRC circuit. It is also possible to actually calculate the CRC in software and load this value into the CRC circuit. This way the CRC check is done with the correct value, which is recommended in a production version where bitstream integrity is necessary.

9.4 Preprocessing

Preprocessing is part of the variable length CAM with explicit priority and consists of a series of actions that are performed when adding an entry to the CAM. These actions are:

1. Checking for conflicts with CAM words that are already in the CAM.
2. Mapping the user specified priority to a physical priority and location.
3. Utilization of match blocks that are left empty after deletion.

Logical and physical priority

During preprocessing, a distinction is made between logical and physical priority. The logical priority is the priority that is defined by the user and can be any integer. A larger logical priority means a higher priority. The physical priority is the value that the explicit priority encoder is set with for a specific entry.
Due to the limited number of priority classes, this is an integer between 0 and 7, where 0 is the highest priority.

Relation between CAM words

To analyse the relation between two CAM words a and b, these are modelled by collections A and B. A is the collection of all incoming packets for which a gives a match and B is the collection of packets for which b gives a match. When comparing two CAM words a and b, there are four possible outcomes:

1. Equality (A = B): CAM words a and b are identical.
2. Hierarchical overlap (A ∩ B = A or A ∩ B = B, and not 1): a and b are not identical, but if a packet matches CAM word a (b), then this packet matches b (a) as well.
3. Partial overlap (A ∩ B ≠ ∅, and not 2): CAM words a and b are not hierarchical, but there are packets for which both a and b match.
4. No overlap (A ∩ B = ∅): there are no packets for which a and b both match.

Checking for conflicts

When checking for conflicts, the new CAM word is compared with the words that are already present in the CAM, and the software determines the relation between the new CAM word and each of these words. There are three cases in which a conflict can occur:

- There is partial overlap between the new CAM word and another CAM word, while their logical priority is equal. In this case the program cannot decide which entry should have the highest priority.
- There is hierarchical overlap between the new CAM word and another CAM word, where the new CAM word is covered by the other word, but its logical priority is lower. Because the other entry covers the new entry, the address of the new entry would never be returned.
- There is hierarchical overlap between the new CAM word and another word in the CAM, where the new CAM word entirely covers the other CAM word, but its logical priority is higher. Because the new entry covers the other entry, the address of the other entry would never be returned.

If a conflict occurs, then the new entry is not added to the CAM.

Adding a new entry

To add a new entry to the CAM, its logical priority has to be translated into a physical priority and a location. This is the objective of preprocessing, and it should be done while moving as few other entries as possible. The way an entry is added to the CAM depends on its relation with other entries in the CAM and these cases are discussed separately. For each case, a simple example is given that shows the old CAM contents on the left side and the new contents on the right side.
Each time, the entries are given as a 4-bit CAM word together with their logical priority (any integer) and physical priority (integer between 0 and 7). The new CAM word is depicted on the right side of the arrow in a bold font. It should be noted that not all cases are covered by the examples.
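Adding an entry starts from the relation between the new word and each stored word. A minimal sketch of that classification and of the three conflict rules is given here, assuming that CAM words are represented as strings over '0', '1' and 'x' (don't care). The class, enum and method names are hypothetical and are not taken from the thesis software.

```java
// Hypothetical sketch: CAM words are strings over '0', '1' and 'x' (don't care).
final class CamWordRelation {

    enum Relation { EQUALITY, HIERARCHICAL_OVERLAP, PARTIAL_OVERLAP, NO_OVERLAP }

    // True if every packet that matches b also matches a (B is a subset of A).
    static boolean covers(String a, String b) {
        for (int i = 0; i < a.length(); i++) {
            char ca = a.charAt(i);
            if (ca != 'x' && ca != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // True if no packet can match both words: some bit position is fixed
    // to different values in a and b.
    static boolean disjoint(String a, String b) {
        for (int i = 0; i < a.length(); i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca != 'x' && cb != 'x' && ca != cb) {
                return true;
            }
        }
        return false;
    }

    static Relation classify(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("CAM words must have equal width");
        }
        if (a.equals(b)) {
            return Relation.EQUALITY;
        }
        if (disjoint(a, b)) {
            return Relation.NO_OVERLAP;
        }
        if (covers(a, b) || covers(b, a)) {
            return Relation.HIERARCHICAL_OVERLAP;
        }
        return Relation.PARTIAL_OVERLAP;
    }

    // Conflict rules from the text; a larger logical priority is a higher priority.
    static boolean conflicts(String newWord, int newPrio, String oldWord, int oldPrio) {
        switch (classify(newWord, oldWord)) {
            case PARTIAL_OVERLAP:
                return newPrio == oldPrio;
            case HIERARCHICAL_OVERLAP:
                // New word covered by the old one: conflict if its priority is lower.
                // New word covering the old one: conflict if its priority is higher.
                return covers(oldWord, newWord) ? newPrio < oldPrio : newPrio > oldPrio;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("100x", "x0x1"));            // PARTIAL_OVERLAP
        System.out.println(classify("10xx", "100x"));            // HIERARCHICAL_OVERLAP
        System.out.println(classify("1111", "0x1x"));            // NO_OVERLAP
        System.out.println(conflicts("x0x1", 47, "100x", 47));   // true
    }
}
```

Because each word describes a set of packets with independent bit positions, all four relations can be decided position by position without enumerating packets.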

1. No overlap: if there is no overlap between the new entry and any of the other entries, then the entry is simply added at the first available position. Its physical priority is set to 3.

[Example: Add 1111, 47]

2. Partial overlap: if there is partial overlap between the new entry and one or more other entries, then its physical priority is determined from the physical priorities of the overlapping entries.

[Example: Add x0x1, 47]

If this physical priority is occupied by other overlapping entries, then the physical priorities of these entries are changed to make room for the physical priority of the new entry.

[Example: Add x0x1, 47, with the physical priority of an existing entry changed]

Due to the limited number of priority classes, it might be necessary to move some entries.

3. Hierarchical overlap: if there is hierarchical overlap between the new entry and another entry, then the new entry is added as if there were partial overlap. If the new entry and the other entry have equal logical priority, the priority is decided by the number of don't cares. The more specific an entry, the higher its priority becomes.

[Example: Add 100x]

4. Equality: if the new entry already exists in the CAM, then the existing entry is deleted if it has a different logical priority. Then the new entry is added using rules 1 to 3.

[Example: Add 1100, 50]

The physical priorities of the new entries are set in such a way that their values are equally divided over the range between 0 and 7. This means that if a priority can be set on an interval from A to B, then the value is set in the middle of this interval. This way, the chance that a priority is occupied by other entries when adding a new entry is minimized.
Therefore the physical priority of an entry that does not overlap with any other entry is set to 3 (in the middle between 0 and 7).

Utilization of empty match blocks

When an entry is deleted from the CAM, the match blocks and the multiplexer that the output of the CAM word was connected to are marked unused, which creates an empty fragment in the CAM content. When a new entry is added to the CAM, the software finds the first available fragment of match blocks in which the new CAM word fits. The advantage of this approach is that no other entries are moved when deleting a CAM word and large unused fragments are prevented. But since the algorithm finds the first available fragment and not the fragment in which the new CAM word fits best, the CAM still becomes fragmented. Periodic defragmentation of the CAM is therefore necessary.
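The first-fit search for an empty fragment can be sketched as follows. This is an illustration only: it assumes that the match blocks of one match line can be modelled as a linear array of used/unused flags, and the class and method names are hypothetical.

```java
// Hypothetical sketch of first-fit allocation over a linear array of match blocks.
final class FirstFit {

    // Returns the index of the first run of `needed` consecutive unused blocks,
    // or -1 if no such fragment exists.
    static int findFragment(boolean[] used, int needed) {
        if (needed <= 0) {
            throw new IllegalArgumentException("needed must be positive");
        }
        int runStart = 0;
        int runLength = 0;
        for (int i = 0; i < used.length; i++) {
            if (used[i]) {
                runLength = 0;          // fragment interrupted, restart after this block
                runStart = i + 1;
            } else {
                runLength++;
                if (runLength == needed) {
                    return runStart;    // first fragment that is large enough
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // true = block in use, false = block left empty after a deletion
        boolean[] used = {true, false, false, true, false, false, false, true};
        System.out.println(findFragment(used, 2));   // prints 1
        System.out.println(findFragment(used, 3));   // prints 4
        System.out.println(findFragment(used, 4));   // prints -1
    }
}
```

In this example a two-block word lands in the first gap even though it would also fit in the larger gap; placing small words wherever they first fit is what gradually fragments the free space and motivates the periodic defragmentation mentioned above.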

Chapter 10: Conclusions and Recommendations

10.1 Summary

The goal of this project was to implement an FPGA-based CAM for IP version 6 characterization. This CAM should be able to store at least 128 words with a maximum width of 315 bits, and the CAM words may contain don't cares. It should be part of a 622 Mbit/s communication channel, which means that a lookup rate of 1.9 million lookups per second is required.

Three different CAM structures were implemented. The first implementation, called the fixed length CAM, can store 128 entries of 320 bits, and all CAM words consume the same amount of resources. In this implementation an inherent priority mechanism is used, meaning that when several CAM words match simultaneously, the matching word at the lowest address is selected. The second implementation, called the variable length CAM, can store up to twice as many entries in the same area as the fixed length CAM. This is done by dividing the CAM words into five match blocks; when such a match block merely contains don't cares, the block is omitted. The third implementation is based on the variable length CAM, but uses a more advanced priority mechanism where not only the CAM words, but also their priority can be programmed.

To implement the CAMs, a specific design methodology was used, consisting of a static and a dynamic part. The static part is used to implement the basic structure of the CAM, from a VHDL description to the programming bitstream. The dynamic part is a Java application, used to change this bitstream for updating the CAM.

The three implementations were implemented in a Xilinx Virtex device on a PCI-based board and for each implementation, a Java-based user interface has been developed to configure the CAM. The fixed length CAM has been implemented successfully; it can contain 128 CAM words and perform 7.1 million lookups/sec.
The variable length CAM that has been implemented can contain up to 256 CAM words and searching can be done at a rate of 3.8 million lookups/sec. The explicit priority scheme added in the third implementation allows fast adding/deleting of CAM words, and it was shown that this added no significant hardware costs. Searching can be done at a rate of 3.4 million lookups/sec.

Conclusions

When implementing an FPGA-based CAM, its architecture has to be considered in order to efficiently exploit its resources. This is because, unlike custom circuits, the architecture of the FPGA is fixed a priori and therefore the permitted programmability, connectivity and routability are constrained by that architecture.

Dynamic reconfiguration is a good way to implement FPGA-based CAMs. It was shown that flexible circuits can be implemented without adding hardware costs. Since critical functions (searching the CAM) and non-critical functions (changing the CAM) can be divided and implemented in hardware and software respectively, the final hardware implementation becomes both faster and smaller than regular FPGA implementations.

The performance of dynamically reconfigurable FPGA-based CAMs is sufficient for IP characterization. Between 3.4 and 7.1 million packets can be characterized per second, depending on which of the three implementations described in this report is used. Since the required search rate is less than 1.9 million lookups/s, all three implementations are suitable. With progressing FPGA performance, it is expected that the three CAM structures can be used in future communication channels with more stringent requirements as well.

The CAMs that have been implemented consume approximately half of the hardware resources that are available on the Virtex FPGA and can be integrated with other logic on the same chip, either to extend their functionality or to implement other functions.

The design methodology that was used to implement the CAMs, consisting of a static and a dynamic flow, proved to be successful. The static part is used to implement those parts of the design that are not dynamically reconfigured. Since it is not possible to reserve an empty area on the FPGA in the tools for integrating reconfigurable cores, an initial structure of the dynamic part of the design must be implemented in the static design flow as well.

Recommendations

Recommendations concerning the CAM

The output of the CAM should be programmable, such that the return value can be programmed as well. This way, the memory that is indexed by the CAM can be omitted and the correct signals to control the rest of the system are generated directly by the CAM. A method to do this is described in

Besides returning a value when a match occurs, other actions could be performed. An example of such an action is incrementing a counter for acquiring statistical data. Another example is waiting for another entry to match. This way sequences of packets can be discovered. This concept is called Matching Machine (MaMa).
The structure of a MaMa consists of a CAM structure, logic and memory and is very suitable for implementation on an FPGA, since all the necessary parts are already there. With dynamic reconfiguration, the MaMa can be configured with the correct CAM words, actions and return values and could become very valuable in protocol processing. It is recommended to investigate how the CAMs and methodology described in this report can be extended to implement such a function.

Recommendations concerning the tools

To have full control over the implementation of the dynamically reconfigurable part of a design, one would like to use JBits to build this part. To do this, one should be able to reserve an area in the Xilinx tools for JBits to place and route certain regular structures. This area should be empty and one should be able to define ports on the edges of the area for JBits to route the area and connect to the other parts of the circuit (i.e. board interface and random logic). It is recommended that this is integrated in the place & route tool.

If one is able to reserve an empty area, then routing in JBits can be done both manually and automatically. In some cases one would like to be able to use both methods, for example when there are both critical and non-critical nets. In the present version of JBits (2.1) one can only use one of these methods, since the automatic routing tool does not check for conflicts with manually placed routes. It is recommended that both methods can be used together.

It should be possible to add an attribute to the Virtex LUT primitive that prevents the Xilinx tools from swapping the address lines on the LUT (see 7.1.1).

In time-critical applications it is important that reconfiguration is done fast, possibly while a part of the system is still running. This can be achieved by partial reconfiguration. It is recommended to have support for this both on the board and in the CAM software.

The synthesis tool that was used for implementing the CAMs is advanced and suitable for synthesizing behavioural descriptions. Since the static part of the design is written in a structural way, a simpler synthesis tool can be used. This would be cheaper and probably faster than using Synplify, which can take many hours.

References

[1] W. Richard Stevens, TCP/IP Illustrated, vol. 1, Addison-Wesley.
[2] S. Deering, R. Hinden, Internet Protocol, Version 6 Specification, RFC 2460.
[3] J. Walrand, Communication Networks, A First Course, Aksen Associates.
[4] M. Mansour, A. Kayssi, FPGA-based Internet Protocol Version 6 Router, Proceedings of IEEE International Conference on Computer Design, p. 334, vol. 2.
[5] Ericsson Telecom, Telia, Att Förstå Telekommunikation, Studentlitteratur.
[6] R. Kress, High-Level Synthesis for Dynamically Reconfigurable Hardware/Software Systems, Proceedings of the 8th International Workshop of Field Programmable Logic and Applications FPL '98, p. 288, Springer Lecture Notes in Computer Science.
[7] I. Warren, Dynamic Configuration Abstraction, Proceedings of the 5th European Software Engineering Conference (ESEC '95), Springer Lecture Notes in Computer Science.
[8] C. Sweeney, B. Blyth, RC1000-PP Hardware Reference Manual, version 2.1, Embedded Solutions.
[9] P. Bellows, B. Hutchings, JHDL - A HDL for Reconfigurable Systems, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, p. 175.
[10] S. Guccione, D. Levi, Run-Time Parameterizable Cores, Proceedings of the 9th International Workshop of Field Programmable Logic and Applications FPL '99, p. 215, Springer Lecture Notes in Computer Science.
[11] Xilinx Inc, The Programmable Logic Data Book, Xilinx Databook.
[12] C. Carmichael, Virtex™ FPGA Series Configuration and Readback, Xilinx Application Note 138.
[13] Xilinx Inc, JBits Xilinx Reconfigurable Computing Platform, JBits 2.0 Tutorial.
[14] M. Defossez, Content Addressable Memory (CAM) in ATM applications, Xilinx Application Note 202.
[15] J. Brelet, B. New, Designing Flexible, Fast CAMs with Virtex Slices, Xilinx Application Note 203.
[16] J. Brelet, Using Block SelectRAM+ for High-Performance Read/Write CAMs, Xilinx Application Note 204.
[17] S. V. Kartalopoulos, An Associative RAM-based CAM and Its Application to Broad-Band Communications Systems, IEEE Transactions on Neural Networks, p. 1036, vol. 9.
[18] A.J. McAuley, P. Francis, Fast Routing Table Lookup Using CAMs, Proceedings of IEEE Infocom '93, p. 1382.
[19] Xilinx Inc, Xilinx Library Guide, Alliance 2.1i Software Manual.
[20] C. Horstmann, G. Cornell, Core JAVA 1.2 vol 1: Fundamentals, Sun Microsystems Press.
[21] S. Guccione, Portable Native Methods in JAVA, Embedded Systems.
[22] S. McPherson, Java Servlets and Serialization with RMI, Java Developers Connection.
[23] J.R. Jackson, A.L. McClellan, JAVA 1.2 by Example, Third edition, Sun Microsystems Press.

List of Used Tools

Software

This paragraph gives an overview of the programs used to design and implement the CAM on the two platforms Solaris 2.6 and Windows NT 4.0. For each program it is described what it has been used for (Application).

Solaris 2.6:

Name             Version     Application
cc               -           C-compiler for automatic generation of UCF files for constraining the design.
Framemaker       5.5         Writing the report.
Synplify         5.3 FAE 1   Synthesis of the initial CAM structure.
TextEdit         3.6         Editing the VHDL code.
Xilinx Alliance  2.1i sp3    Place and route of the design and generation of the programming bitstream.

Windows NT 4.0:

Name             Version     Application
EditPlus         1.25        Editing the Java code.
JBits            2.1         Java library for manipulation of the bitstream and dynamic reconfiguration of the FPGA.
MS Visual C                  C-editor and compiler for generating the hardware interface.
SUN JDK          1.2         SUN Java Developers Kit for compiling and running Java code.

Hardware

This paragraph gives a description of all the hardware components that have been used, together with a more detailed specification.

Name                 Type                          Specification
PC                   Hewlett Packard Kayak XA      Pentium II at 450 MHz, 320 MB, running Windows NT 4.0.
Unix Workstation     SUN Sparc Station 10          Dual Sparc processor at 66 MHz, 192 MB.
Unix Compute Server  SUN Sparc Enterprise 450      Dual Sparc processor at 350 MHz, 512 MB.
FPGA Board           Embedded Solutions RC1000-PP  PCI board with XCV1000-4, 8 MB.

Appendix A: Port Definitions of Board Interface

Port        Direction   Function
CE<n>       Out         SRAM <n> Chip Enable
WE<n>       Out         SRAM <n> Write Enable
OE<n>       Out         SRAM <n> Output Enable
Addr<n>     Out         SRAM <n> Address
Req<n>      Out         SRAM <n> Request
Gnt<n>      In          SRAM <n> Granted
Ctrl_ACK    Out         Data Acknowledge, Control Register
Ctrl_VLD    In          Data Valid, Control Register
Stat_ACK    In          Data Acknowledge, Status Register
Stat_VLD    Out         Data Valid, Status Register
LEDs        Out         Onboard LEDs on/off

Table A-1: FPGA control ports for memory and registers

[Figure A-1: schematic view of the board interface and its connections to other parts in the system. Blocks shown: PCI host, SRAM 0, SRAM 1, SRAM 2, CAM, Board Interface Logic, Memory Arbiter, LEDs. Note: signal clk is not shown explicitly.]

Appendix B: VHDL Code of Board Interface

--**************************************************************************
-- Module: board_interface.vhd
-- Version:
-- Date: November
-- Author: Johan Ditmar
-- Family: Virtex
-- Description: board interface to be used with the variable length
--              CAM structure.
--**************************************************************************

PACKAGE MEM IS
  -- States of main FSM.
  TYPE memory_states IS (IDLE, REQUEST, GRANTED, READ_FIRST, WRITE_FIRST,
                         RD_WRT, READ_READY, WRITE_READY, VALID_READY);
  -- States of FSM for reading the control register.
  TYPE ctrl_states IS (IDLE, DATA_VALID);
  -- Number of packets to be processed.
  CONSTANT BUFFER_SIZE : INTEGER := 1000;
  -- Maximum number of blocks per entry.
  CONSTANT NBlocks : INTEGER := 5;
END MEM;

LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.std_logic_arith.all;
USE ieee.std_logic_unsigned.all;
USE work.mem.all;

ENTITY Board_Interface IS
  -- The different ports can be found in appendix A.
  PORT(
    clk : IN std_logic;
    Res : IN std_logic;
    Stat_Reg : OUT std_logic_vector(7 DOWNTO 0);
    Stat_VLD : OUT std_logic;
    Stat_ACK : IN std_logic;
    Ctrl_Reg : IN std_logic_vector(7 DOWNTO 0);
    Ctrl_VLD : IN std_logic;
    Ctrl_ACK : OUT std_logic;
    Addr0, Addr1, Addr2 : OUT std_logic_vector(20 DOWNTO 0);
    Data0, Data1 : IN std_logic_vector(31 DOWNTO 0);
    Data2 : OUT std_logic_vector(31 DOWNTO 0);
    CE0, CE1, CE2 : OUT std_logic_vector (3 DOWNTO 0);
    WE0, WE1, WE2 : OUT std_logic;
    OE0, OE1, OE2 : OUT std_logic;
    Req0, Req1, Req2: OUT std_logic;
    Gnt0, Gnt1, Gnt2: IN std_logic;
    LEDS: OUT std_logic_vector(3 DOWNTO 0);
    T_LED1, T_LED2: OUT std_logic
  );
END Board_Interface;

ARCHITECTURE Behave OF Board_Interface IS
  SIGNAL Addr_0, Addr_1, Addr_2 : INTEGER := 0;

  SIGNAL Data0_Reg, Data1_Reg, Data2_Reg : std_logic_vector(31 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Reset, Reset_Int: std_logic := '0';
  SIGNAL CE_0, CE_1, CE_2 : std_logic := '1';
  SIGNAL Start, Stat_ACK_Reg : std_logic := '0';
  SIGNAL Address: std_logic_vector (7 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Match_Vector: std_logic_vector (15 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Match: std_logic := '0';
  SIGNAL InData_temp : std_logic_vector(63 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Priority_temp : std_logic_vector(7 DOWNTO 0);

  COMPONENT CAM
    GENERIC(
      NLines : INTEGER := 64;
      Width : INTEGER := 7
    );
    PORT(
      InData : IN std_logic_vector(63 downto 0);
      Address : OUT std_logic_vector(width-1 DOWNTO 0);
      Match_Vector : OUT std_logic_vector(15 DOWNTO 0);
      Priority_Vector : OUT std_logic_vector(7 DOWNTO 0);
      Match : OUT std_logic;
      clk : IN std_logic
    );
  END COMPONENT;

BEGIN

  Addr0 <= conv_std_logic_vector(addr_0,21);
  Addr1 <= conv_std_logic_vector(addr_1,21);
  Addr2 <= conv_std_logic_vector(addr_2,21);
  CE0 <= (OTHERS => CE_0);
  CE1 <= (OTHERS => CE_1);
  CE2 <= (OTHERS => CE_2);
  Data2_Reg(31 DOWNTO 16) <= Match_Vector;             -- First 15 priority encoder inputs
  Data2_Reg(15) <= Match;                              -- Match signal
  Data2_Reg(14 DOWNTO 8) <= Priority_temp(6 DOWNTO 0); -- 7 of 8 priority lines
  Data2_Reg(7 DOWNTO 0) <= Address;                    -- Return address from priority encoder
  Reset <= NOT Res;

  -- FSM for reading the control register.
  Read_Ctrl : PROCESS (Ctrl_VLD, clk)
    VARIABLE Current_State : ctrl_states;
  BEGIN
    IF( (clk = '1') AND clk'event ) THEN
      IF (Reset = '1') THEN
        Current_State := IDLE;
      ELSE
        CASE Current_State is
          WHEN IDLE =>
            Ctrl_ACK <= '1';

            IF (Ctrl_VLD='0') THEN
              Start <= Ctrl_Reg(0);
              Current_State := DATA_VALID;
            ELSE
              Current_State := IDLE;
            END IF;
          WHEN DATA_VALID =>
            Ctrl_ACK <= '0';
            Current_State := IDLE;
        END CASE;
      END IF;
    END IF;
  END PROCESS;

  -- Process that waits for an acknowledgment from the host that
  -- the status register has been read.
  Store_Ack : PROCESS (Stat_ACK, Reset, Reset_Int)
  BEGIN
    IF ( (Reset = '1') OR (Reset_Int = '1') ) THEN
      Stat_Ack_Reg <= '0';
      T_LED1 <= '1';
      T_LED2 <= '0';
    ELSIF ( (Stat_ACK = '0') AND Stat_ACK'EVENT) THEN
      Stat_Ack_Reg <= '1';
      T_LED1 <= '0';
      T_LED2 <= '1';
    END IF;
  END PROCESS;

  -- Process that asserts a reset as a response to an acknowledgement
  -- of the status register.
  Assert_Reset : PROCESS (clk, Stat_Ack_Reg)
  BEGIN
    IF ( (clk='1') AND clk'event) THEN
      IF (Stat_Ack_Reg = '1') THEN
        Reset_Int <= '1';
      ELSE
        Reset_Int <= '0';
      END IF;
    END IF;
  END PROCESS;

  -- Main FSM for reading packets from memory banks 0 and 1 and writing
  -- the result to bank 2.
  Read_Write : PROCESS (clk, Start, Reset)
    VARIABLE Current_State : memory_states;
    VARIABLE Cnt : INTEGER RANGE 0 TO NBlocks+2;

  BEGIN
    IF ( (clk = '1') AND clk'event ) THEN
      IF (Reset = '1') THEN
        Current_State := IDLE;
      ELSE
        CASE Current_State IS
          -- Idle state, wait for 'Start' from control register.
          WHEN IDLE =>
            LEDS <= "0001";
            OE0 <= '1'; WE0 <= '1'; CE_0 <= '1'; Req0 <= '1'; Addr_0 <= 0;
            OE1 <= '1'; WE1 <= '1'; CE_1 <= '1'; Req1 <= '1'; Addr_1 <= 0;
            OE2 <= '1'; WE2 <= '1'; CE_2 <= '1'; Req2 <= '1'; Addr_2 <= 0;
            Stat_VLD <= '1';
            Stat_Reg <= (OTHERS => '0');
            Cnt := 0;
            IF (Start='1') THEN
              Current_State := REQUEST;
            ELSE
              Current_State := IDLE;
            END IF;
          -- Request memory banks.
          WHEN REQUEST =>
            LEDS <= "0010";
            Req0 <= '0';
            Req1 <= '0';
            Req2 <= '0';
            IF (Gnt0='0' AND Gnt1='0' AND Gnt2='0') THEN
              Current_State := GRANTED;
            ELSE
              Current_State := REQUEST;
            END IF;
          -- All banks have been granted.
          WHEN GRANTED =>
            LEDS <= "0011";
            OE0 <= '0'; CE_0 <= '0';
            OE1 <= '0'; CE_1 <= '0';
            Current_State := READ_FIRST;

          -- Read first packet from bank 0 and 1.
          WHEN READ_FIRST =>
            LEDS <= "0100";
            Addr_0 <= Addr_0+1;
            Addr_1 <= Addr_1+1;
            Cnt := Cnt+1;
            IF (Cnt > NBlocks) THEN
              Cnt := NBlocks-1;
              Current_State := WRITE_FIRST;
            ELSE
              Current_State := READ_FIRST;
            END IF;
          -- Write first result to bank 2.
          WHEN WRITE_FIRST =>
            LEDS <= "0101";
            Addr_0 <= Addr_0+1;
            Addr_1 <= Addr_1+1;
            Addr_2 <= 0;
            WE2 <= '0';
            CE_2 <= '0';
            Current_State := RD_WRT;
          -- Process all packets.
          WHEN RD_WRT =>
            LEDS <= "0110";
            Addr_0 <= Addr_0+1;
            Addr_1 <= Addr_1+1;
            Cnt := Cnt+1;
            IF (Cnt >= NBlocks) THEN
              Addr_2 <= Addr_2+1;
              Cnt := 0;
            END IF;
            IF (Addr_0>=Buffer_Size-2) THEN
              Current_State := READ_READY;
            ELSE
              Current_State := RD_WRT;
            END IF;
          -- All packets read from bank 0 and 1.
          WHEN READ_READY =>
            LEDS <= "0111";
            Req0 <= '1'; OE0 <= '1'; CE_0 <= '1';
            Req1 <= '1'; OE1 <= '1'; CE_1 <= '1';
            Cnt := Cnt + 1;
            IF (Cnt >= NBlocks) THEN
              Addr_2 <= Addr_2+1;
              Cnt := 0;
            END IF;

            IF (Addr_2>=Buffer_Size/NBlocks-1) THEN
              Current_State := WRITE_READY;
            ELSE
              Current_State := READ_READY;
            END IF;
          -- All packets have been processed.
          WHEN WRITE_READY =>
            LEDS <= "1000";
            Stat_Reg(0) <= '1';
            Req2 <= '1';
            WE2 <= '1';
            CE_2 <= '1';
            Current_State := VALID_READY;
          -- Write 'Ready' to status register and wait for acknowledgement.
          WHEN VALID_READY =>
            LEDS <= "1001";
            Stat_VLD <= '0';
            IF (Stat_Ack_Reg='1') THEN
              Current_State := IDLE;
            ELSE
              Current_State := VALID_READY;
            END IF;
          WHEN OTHERS =>
            LEDS <= "0000";
            Current_State := IDLE;
        END CASE;
      END IF;
    END IF;
  END PROCESS;

  InData_temp(63 DOWNTO 32) <= Data1;
  InData_temp(31 DOWNTO 0) <= Data0;

  -- Instantiation of the CAM structure.
  CAM : CAM
    GENERIC MAP(NLines => 128, Width => 8)
    PORT MAP (InData => InData_temp,
              Address => Address,
              Match_Vector => Match_Vector,
              Priority_Vector => Priority_temp,
              Match => Match,
              clk => clk);
END Behave;
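The board interface above packs four fields into each 32-bit result word written to memory bank 2 (see the Data2_Reg assignments). As a rough illustration, a host-side decoder for that word could look like the sketch below. The class name ResultWord and its field names are hypothetical and not part of the thesis software; only the bit positions are taken from the VHDL listing.

```java
// Hypothetical host-side helper (not from the thesis). The bit positions
// follow the Data2_Reg assignments in board_interface.vhd.
final class ResultWord {
    final int matchVector; // bits 31..16: Match_Vector, raw match lines before encoding
    final boolean match;   // bit 15: Match signal
    final int priority;    // bits 14..8: 7 of the 8 priority lines
    final int address;     // bits 7..0: return address from the priority encoder

    ResultWord(int word) {
        matchVector = (word >>> 16) & 0xFFFF;
        match = ((word >>> 15) & 1) == 1;
        priority = (word >>> 8) & 0x7F;
        address = word & 0xFF;
    }

    public static void main(String[] args) {
        ResultWord r = new ResultWord(0x00038305);
        System.out.println("match=" + r.match
                + " address=" + r.address
                + " priority=" + r.priority
                + " matchVector=0x" + Integer.toHexString(r.matchVector));
    }
}
```

The unsigned shift (>>>) is used so that a result word with bit 31 set does not sign-extend into the match vector.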


Appendix C: FPGA Editor View of Various CAMs

Figure C-1: FPGA Editor view of fixed length CAM with inherent priority encoder.

Figure C-2: FPGA Editor view of variable length CAM with inherent priority encoder.

Figure C-3: FPGA Editor view of variable length CAM with explicit priority encoder.

Appendix D: User Interface Manual of Various CAMs

With the Java user interface, the contents of the CAM can be changed and its functionality can be tested. This manual describes the functions of the user interface and explains the menus of the three CAM implementations: the fixed length and the variable length CAM, both with inherent priority, and the variable length CAM with explicit priority. First it describes how to start the program. Then the user interface of each of the three CAMs is described, including how to view the contents of the CAM, how to add and delete entries, and how to test the functionality of the CAM.

D.1 Starting the program

Before starting the program, make sure that Sun's JDK 1.2 or higher has been installed properly. The three CAM applications can be started from the directory that contains their respective class files by typing:

java -classpath <path to JBits>;. cam <option>

options:
-demo: runs the application in demo mode, without connecting to the board.
-remote <IP address>: runs the application remotely. <IP address> is the 32-bit IP address of the server containing the board (ex ). Be sure to start both the register and the server application first!

D.2 User Interface of Fixed Length CAM with Inherent Priority

The graphical user interface of the fixed length CAM with inherent priority is given in figure D-1.

D.2.1 Viewing the content of the CAM

The content view panel (1) gives a graphical representation of the CAM with blocks, registers and priority encoder. Every block is divided into two 32-bit words. Each word is divided into eight 4-bit words, each of which represents an FPGA LUT. The first entry has the highest priority. Selecting is done by clicking with the mouse in the panel, and only complete 32-bit words can be selected. Note that it is not possible to select empty entries, except the first one.

The content of an entry is shown at bit level by means of 1's, 0's and x's, where each complete entry contains 320 bits. To view the bit values, make sure that box (2) is checked. By default, empty locations are drawn gray and used locations yellow. The LUTs whose contents were altered while changing the contents of the CAM are drawn red. These are the LUTs that need to be updated when reconfiguring the FPGA.
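Since every 4-bit word of an entry corresponds to one FPGA LUT, a 320-bit entry spans 10 words of 8 LUTs each, i.e. 80 LUTs. The sketch below illustrates how a 4-bit ternary pattern (1, 0 or x per bit) could be turned into the 16-bit truth table of a 4-input LUT that outputs 1 only on a match. The class and method names are hypothetical, and the bit ordering and polarity that JBits actually uses for Virtex LUT contents may differ; this only shows the principle.

```java
// Hypothetical illustration (not from the thesis): truth table of a
// 4-input LUT that matches a 4-bit ternary pattern.
final class LutPattern {
    /**
     * pattern: exactly 4 characters out of '0', '1', 'x', most significant
     * bit first. Bit i of the returned value is the LUT output for input
     * value i (0..15); it is 1 when input i matches the pattern.
     */
    static int truthTable(String pattern) {
        if (pattern.length() != 4) {
            throw new IllegalArgumentException("pattern must have 4 characters");
        }
        int table = 0;
        for (int input = 0; input < 16; input++) {
            boolean matches = true;
            for (int bit = 0; bit < 4; bit++) {
                char c = pattern.charAt(3 - bit);   // character for this input bit
                int value = (input >> bit) & 1;
                if (c == '0') {
                    matches &= (value == 0);
                } else if (c == '1') {
                    matches &= (value == 1);
                } else if (c != 'x') {
                    throw new IllegalArgumentException("invalid character: " + c);
                }                                   // 'x' matches both values
            }
            if (matches) {
                table |= 1 << input;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(truthTable("1x0x")));
    }
}
```

A LUT that stores only don't cares gets an all-ones table, which is why such LUTs never need to take part in the match.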


Since the number of entries and their width is so large, an extra panel was implemented that gives a global view of the CAM and shows which word has been selected (3). The square shows which part of the CAM is viewed in the content view panel. Finally, there is a status field (4) where messages, warnings and errors are printed.

Figure D-1: Graphical User Interface of Fixed Length CAM with inherent priority

D.2.2 Changing the content from within the program

Changing the contents of the CAM can be done either from within the program or from a file. In the program, three functions have been implemented to add, delete and replace entries.

To add an entry to the CAM, the location where the entry is to be added should be selected in the content view panel. Only complete entries can be added, and a location is selected by clicking on one of the blocks in the row where the new entry should be placed. If the selected location is already used by another entry, then this entry and all entries below it are shifted down. The value of the new entry is defined in the Packet and Mask fields, which contain 32-bit hexadecimal numbers. The Packet field supplies the 1's and 0's, while the Mask field determines where the don't cares are. A 0 in the Mask field means x. For example, if the packet is equal to F0F0F0F0 and the mask is equal to FF00FF00, then the value F0xxF0xx will be added. When a valid value has been entered in the Packet and Mask fields and the Add button (5) is pressed, all words at the selected location become equal to the added value.

Deleting an entry is done by selecting a location that is not empty and pressing the Delete button (6). The whole entry at that location is then deleted, and if there are non-empty entries at locations below the deleted entry, these are shifted up.

Replacing an entry is done by selecting a word at a location that is not empty, writing a value

in the Packet and Mask fields and pressing Replace (7). Only the word that has been selected is replaced, not the whole entry.

D.2.3 Reading the content from file

Besides changing the CAM in the program as described above, it is also possible to read the configuration from a file. This is done via File->Open Configuration. The new configuration overwrites the existing configuration when using this method. The file should have the following format:

packet (1) mask (1) packet (2) ... packet (NEntries) mask (NEntries)

where ( packet (n), mask (n) ) is a packet-mask pair of 320 bits each, divided into 10 words of 32 bits in hexadecimal notation and separated by a space. If the number of entries in the file is larger than the number of entries that can be added to the CAM, the excess entries are ignored. Empty lines and lines that start with # are ignored, so that the configuration file can be structured and commented. An appropriate error is generated when a syntax error occurs while reading the file.

After changing the contents of the CAM, the FPGA is updated by pressing Configure (8). This causes JBits to change the bitstream and write the new bitstream to the FPGA. All LUTs that were drawn red in the content view panel become gray or yellow again.

D.2.4 Testing the functionality of the CAM

To test the functionality, packets can be read from a file and processed by the CAM. To read a file, go to File->Open Testcases. This file should contain packets of 320 bits, divided into 10 words of 32 bits in hexadecimal notation and separated by a space. Every line should contain a separate packet. Empty lines and lines that start with # are ignored, so that the test file can be structured and commented. An appropriate error is generated when a syntax error occurs while reading the file. The actual testing is done by pressing the Test button (9).
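The line format used by the test file (and, assuming each packet and each mask sits on its own line, by the configuration file) can be sketched as follows, together with the packet/mask convention of D.2.2. The class CamFileFormat and its methods are hypothetical and are not the thesis implementation; the sketch only reflects the rules stated above: 10 hexadecimal 32-bit words per line, empty lines and # lines skipped, and a 0 mask bit meaning don't care.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not the thesis code) of the file and mask conventions.
final class CamFileFormat {
    static final int WORDS_PER_LINE = 10; // 10 x 32 bits = 320 bits

    /** Parses lines of 10 hexadecimal words; skips empty lines and '#' comments. */
    static List<int[]> parseLines(List<String> lines) {
        List<int[]> result = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] tokens = line.split("\\s+");
            if (tokens.length != WORDS_PER_LINE) {
                throw new IllegalArgumentException("expected 10 words: " + line);
            }
            int[] words = new int[WORDS_PER_LINE];
            for (int i = 0; i < WORDS_PER_LINE; i++) {
                // Throws NumberFormatException on a malformed hexadecimal word.
                words[i] = Integer.parseUnsignedInt(tokens[i], 16);
            }
            result.add(words);
        }
        return result;
    }

    /** Renders one 32-bit packet/mask word bit by bit; a 0 mask bit becomes 'x'. */
    static String ternary(int packet, int mask) {
        StringBuilder sb = new StringBuilder(32);
        for (int bit = 31; bit >= 0; bit--) {
            if (((mask >>> bit) & 1) == 0) {
                sb.append('x');
            } else {
                sb.append(((packet >>> bit) & 1) == 1 ? '1' : '0');
            }
        }
        return sb.toString();
    }
}
```

For the example from D.2.2, ternary(0xF0F0F0F0, 0xFF00FF00) gives the bit-level form of F0xxF0xx.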
After processing, the program asks for a filename in which to store the result. Every line in the result file contains the packet that was searched, the 32-bit word that was read from memory bank 2 (where the result was stored) in hexadecimal notation, and the resulting return address. The 32-bit word that is read from the memory bank is printed for debugging purposes and contains more information besides the return address:

Bits 31-16: match output<15:0>
Bit 15: Match
Bits 14-7: Not used
Bits 6-0: Address

match output<15:0> denotes the outputs of the 16 entries that have the highest priority (before encoding). Sequential tests can be done with the same test cases without loading a new test file.

D.3 User Interface of Variable Length CAM with Inherent Priority

The graphical user interface of the variable length CAM with inherent priority is given in figure D-2.

Figure D-2: Graphical User Interface of Variable Length CAM

D.3.1 Viewing the content of the variable length CAM

The content view panel (1) gives a graphical representation of the CAM with match blocks, shift registers and priority encoder. It consists of 32 horizontal match lines, each containing 4 match blocks; every match block is 64 bits wide and consists of 16 FPGA LUTs. The priority encoder returns an address between 0 and 63, where 0 has the highest priority. Selecting is done by clicking with the mouse in the panel, and only complete 64-bit blocks can be selected. The content of an entry is shown at bit level by means of 1's, 0's and x's, where

each complete entry contains a maximum of 320 bits. To view the bit values, make sure that box (2) is checked. In the variable length CAM, the connections between blocks and priority encoder are not fixed. For this reason, the routes between match blocks within the same entry and the routes from blocks to the encoder are shown as wires above the respective match line. The shift registers that are placed between the match blocks are shown as well, together with their delay.

Since the number of entries and their width is so large, an extra panel was implemented that gives a global view of the CAM and shows which word has been selected (3). The square shows which part of the CAM is viewed in the content view panel. Finally, there is a status field (4) where messages, warnings and errors are printed.

D.3.2 Changing the content of the CAM

Changing the contents of the variable length CAM is done by reading a configuration file as described in D.2.3. After this, the FPGA is updated by pressing Configure (5).

D.3.3 Testing the functionality of the CAM

Testing the variable length CAM with inherent priority is done in the same way as for the fixed length CAM with inherent priority. A test file containing a series of packets is read, and testing is started by pressing the Test button (6).

D.4 User Interface of Variable Length CAM with Explicit Priority

The graphical user interface of the variable length CAM with explicit priority is given in figure D-3. The program is entirely controlled from a command file, and therefore the buttons and menu items for testing and configuring the FPGA are omitted. The information that is given about the contents of the CAM is the same as for the variable length CAM with inherent priority, except that for each entry not only the return address but also its physical priority is shown.

D.4.1 Command File

The command file is read via File->Open Command File. Through the command file, entries can be added and deleted, and packets can be searched to test the functionality of the CAM. The commands in the command file can be divided into different categories, which are described below. Empty lines and lines that start with # are ignored, so that the command file can be structured and commented.

D.4.2 Changing the contents of the CAM

Clearing the CAM:
Clear


Figure D-3: Graphical User Interface of Variable Length CAM with explicit priority (entries annotated with return address / physical priority).

Adding an entry:
Add <packet label> <packet> <mask> <logical priority>
where <packet label> is a string that identifies the entry, ( packet, mask ) is a packet-mask pair of 320 bits each, divided into 10 words of 32 bits in hexadecimal notation and separated by a space, and <logical priority> is the logical priority of the entry, which can be any integer. An Add command can be followed by several entries.

Deleting an entry:
Delete <packet label>
where <packet label> is the label of the entry that is deleted. A Delete command can be followed by several packet labels.
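To make the command syntax above more concrete, the sketch below parses Clear, Add and Delete lines. It is a hypothetical illustration, not the parser used in the thesis software. It assumes that a command occupies a single line, that keywords are written exactly as shown, and that an Add line carries exactly one entry (label, 10 packet words, 10 mask words, logical priority); how several entries after one Add command are laid out is not specified here, so that case is not handled.

```java
import java.util.Arrays;

// Hypothetical command-line parser (not from the thesis software).
final class CamCommand {
    enum Kind { CLEAR, ADD, DELETE }

    final Kind kind;
    final String[] labels; // one label for ADD, one or more for DELETE
    final int[] packet;    // 10 x 32-bit words, ADD only
    final int[] mask;      // 10 x 32-bit words, ADD only
    final int priority;    // logical priority, ADD only

    private CamCommand(Kind kind, String[] labels, int[] packet, int[] mask, int priority) {
        this.kind = kind;
        this.labels = labels;
        this.packet = packet;
        this.mask = mask;
        this.priority = priority;
    }

    /** Parses one line; returns null for empty lines and '#' comments. */
    static CamCommand parse(String raw) {
        String line = raw.trim();
        if (line.isEmpty() || line.startsWith("#")) {
            return null;
        }
        String[] t = line.split("\\s+");
        switch (t[0]) {
            case "Clear":
                if (t.length != 1) {
                    throw new IllegalArgumentException("Clear takes no arguments");
                }
                return new CamCommand(Kind.CLEAR, new String[0], null, null, 0);
            case "Add":
                // Add + label + 10 packet words + 10 mask words + priority = 23 tokens
                if (t.length != 23) {
                    throw new IllegalArgumentException("malformed Add command: " + line);
                }
                return new CamCommand(Kind.ADD, new String[] { t[1] },
                        hexWords(t, 2), hexWords(t, 12), Integer.parseInt(t[22]));
            case "Delete":
                if (t.length < 2) {
                    throw new IllegalArgumentException("Delete needs at least one label");
                }
                return new CamCommand(Kind.DELETE,
                        Arrays.copyOfRange(t, 1, t.length), null, null, 0);
            default:
                throw new IllegalArgumentException("unknown command: " + t[0]);
        }
    }

    private static int[] hexWords(String[] tokens, int from) {
        int[] words = new int[10];
        for (int i = 0; i < 10; i++) {
            words[i] = Integer.parseUnsignedInt(tokens[from + i], 16);
        }
        return words;
    }
}
```

Because the logical priority can be any integer, it is parsed as a signed value, so negative priorities are accepted.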
