A Dynamically Reconfigurable FPGA-based Content Addressable Memory for IP Characterization

Master Thesis ELE/ESK/

Supervisors: Axel Jantsch, Kjell Torkelsson
Examiner: Axel Jantsch

Master of Science Thesis in Electronic System Design by Johan M. Ditmar
Stockholm, March 2000

Abstract

IP characterization is the process of classifying IP packets into groups depending on information in the header. In this report three implementations of FPGA-based dynamically reconfigurable Content Addressable Memories (CAMs) are described for Internet Protocol version 6 characterization. These CAMs are characterized by a large search word width, a relatively small number of CAM words and the fact that these words may contain don't cares. To implement the CAMs, the CAM words were divided into a number of reconfigurable match blocks. In the first implementation, called the fixed length CAM, the number of these blocks is equal for all words. A more advanced architecture was developed as well, in which blocks that merely store don't cares are omitted, leading to a varying number of reconfigurable blocks per word. By placing these blocks in a smart way, more CAM words can be stored. This CAM is referred to as the variable length CAM. In the last implementation an explicit priority mechanism was added, with which the priority can be programmed for each CAM word. This eliminates the slow insertion and deletion times without adding significant hardware costs. The CAMs were implemented on a Xilinx Virtex FPGA, and the reconfiguration of this device is done dynamically from a Java environment. A user interface for changing the contents of the CAM was developed, together with a hardware interface to let the software communicate with the FPGA. It has been shown that, using this technology, a CAM containing over 100 words of 320 bits can be implemented that is able to perform more than 7 million look-ups per second.

Acknowledgements

I would like to thank Kjell Torkelsson at CadLab for providing daily support and helping me with the problems that occurred. I would also like to thank Axel Jantsch (Kungliga Tekniska Högskolan) for supervising the project. I thank Sabih Gerez (University of Twente) for professional input from the home front. Other people from my university in Holland who have helped me are Jaap Hofstede and Professor Herrmann. This is also a good moment to thank my father, who has supported me in both personal and financial matters during my studies in Holland and abroad. Without his help, this would not have been possible. Finally, thanks to Anna for being so patient and waiting for me when I worked late again.

List of Abbreviations

API    Application Program Interface
CAM    Content Addressable Memory
CLB    Configurable Logic Block
CRC    Cyclic Redundancy Checking
DLL    Dynamic Link Library
FPGA   Field Programmable Gate Array
FSM    Finite State Machine
GUI    Graphical User Interface
IOB    Input/Output Block
IP     Internet Protocol
IPv4 / IPv6   Internet Protocol version 4 / version 6
LC     Logic Cell
LUT    Look Up Table
PAR    Place And Route
PCI    Peripheral Component Interconnect
STM    Synchronous Transport Module
TBUF   Tristate Buffer
TCP    Transmission Control Protocol
TLU    Table Look Up
UCF    User Constraints File
UDP    User Datagram Protocol
VHDL   Very high speed integrated circuit Hardware Description Language
XHWIF  Xilinx Hardware Interface

Table of Contents

Introduction
  Purpose and Motives
  Outline
Chapter 1: IP Characterization
  1.1 Internet Protocol Description
    1.1.1 IP header
    1.1.2 IP routing
  1.2 Specification and Requirements
    1.2.1 Specification
    1.2.2 Priority
    1.2.3 Performance requirements
Chapter 2: Design Methodology
  2.1 Dynamic Reconfiguration
  2.2 Hardware Environments
  2.3 Software Environment
Chapter 3: Target Technology
  3.1 Virtex Architecture
    3.1.1 Global architecture
    3.1.2 Configurable logic block
    3.1.3 Tri-state buffers
    3.1.4 Block RAM
    3.1.5 General routing
  3.2 Virtex Configuration
    3.2.1 Configuration memory
    3.2.2 Programming bitstream
Chapter 4: JBits
  4.1 Introduction
  4.2 JBits Programming Model
    4.2.1 Constructor
    4.2.2 Reading the bitstream
    4.2.3 Setting a resource
    4.2.4 Writing the bitstream
    4.2.5 Getting the resource configuration
Chapter 5: CAM Structures
  5.1 RAM-based versus CAM-based Look Up
  5.2 Different CAM Types
    5.2.1 Binary versus Ternary CAMs
    5.2.2 Return value
    5.2.3 Prioritizing
  5.3 Fixed Length CAM
    5.3.1 Global structure
    5.3.2 Match line
  5.4 Variable Length CAM
    5.4.1 Global structure
    5.4.2 Mapping
    5.4.3 Placing
  5.5 Explicit Priority Encoder
    5.5.1 Different implementations
    5.5.2 Global structure
    5.5.3 Switch box
    5.5.4 Output decoder
    5.5.5 Method to program the return value
  5.6 Device Utilization
    5.6.1 Fixed length CAM
    5.6.2 Variable length CAM
    5.6.3 Inherent priority encoder
    5.6.4 Explicit priority encoder
    5.6.5 Summary
Chapter 6: Implementation of the Board Interface
  6.1 Port Description
  6.2 VHDL Description
Chapter 7: Hardware Implementation of the CAM
  7.1 Implementation of the Fixed Length CAM
    7.1.1 Virtex primitives
    7.1.2 Stitcher
  7.2 Implementation of the Variable Length CAM
    7.2.1 Shift Register
    7.2.2 Multiplexer
    7.2.3 Stitcher
    7.2.4 Switch
  7.3 Implementation of the Priority Encoder
    7.3.1 Inherent priority encoder
    7.3.2 Explicit priority encoder
Chapter 8: Synthesis and Place & Route
  8.1 Method
    8.1.1 Synthesis
    8.1.2 Place & Route
  8.2 Physical Structure
    8.2.1 Fixed length CAM
    8.2.2 Variable length CAM
    8.2.3 Explicit priority encoder
  8.3 Results of the Fixed Length CAM
    8.3.1 FPGA editor view of the CAM
    8.3.2 Device utilization
    8.3.3 Timing analysis
  8.4 Results of the Variable Length CAM
    8.4.1 FPGA editor view of the CAM
    8.4.2 Device utilization
    8.4.3 Timing analysis
  8.5 Results of the Variable Length CAM with Explicit Priority
    8.5.1 FPGA editor view of the CAM
    8.5.2 Device utilization
    8.5.3 Timing analysis
  8.6 Summary
Chapter 9: Software Implementation
  9.1 JAVA User Application
    9.1.1 General description
    9.1.2 GUI of the fixed length CAM
    9.1.3 GUI of the variable length CAM
    9.1.4 GUI of the variable length CAM with explicit priority
  9.2 Hardware Interface
    9.2.1 Native interface
    9.2.2 Remote interface
  9.3 JBits Integration
    9.3.1 Components, resources and values
    9.3.2 Describing the CAM structure
    9.3.3 Initialization
    9.3.4 Converting SRL16 to LUT
    9.3.5 Cyclic Redundancy Checking (CRC)
  9.4 Preprocessing
    9.4.1 Logical and physical priority
    9.4.2 Relation between CAM words
    9.4.3 Checking for conflicts
    9.4.4 Adding a new entry
    9.4.5 Utilization of empty match blocks
Chapter 10: Conclusions and Recommendations
  10.1 Summary
  10.2 Conclusions
  10.3 Recommendations
    10.3.1 Recommendation concerning the CAM
    10.3.2 Recommendations concerning the tools
References
List of Used Tools
Appendices
  Appendix A: Port Definitions of the Board Interface
  Appendix B: VHDL Code of the Board Interface
  Appendix C: FPGA Editor View of Various CAMs
  Appendix D: User Interface Manual of Various CAM Implementations
  Appendix E: Implementation of the Native Interface
  Appendix F: Remote Hardware Interface
  Appendix G: CAMConstants.java for the Variable Length CAM with Explicit Priority

Introduction

Purpose and Motives

The Internet Protocol (IP) provides the basis for the interconnections of the Internet. Its application field grows very rapidly, with requirements doubling every three months. In the future, IP will be used not only to interconnect computers: all kinds of equipment, including base stations for cellular communication, will use this protocol to communicate with each other. Due to the increasing demand for high bandwidth, many efforts are being made to build faster IP handling systems. Not only speed but also flexibility is an important factor here, since new standards and applications have to be supported at all times.

A way to gain speed and flexibility is to move critical software functions to reconfigurable hardware. One of these critical functions is IP characterization, as done in firewalls and routers. IP characterization is the process of classifying packets into groups that require special treatment. A subset of IP characterization is IP filtering. IP filtering is a security feature that restricts IP traffic by permitting or denying packets according to certain rules. This way users can be restricted to specific domains or applications on the Internet.

To do characterization, IP headers that reach a router need to be compared with patterns stored in a table, and an output is generated. Nowadays, this table is stored in memory and matching is done entirely in software. Due to growing requirements, software becomes too slow and alternative implementations need to be considered. Semiconductor companies responded to this by producing full custom Content Addressable Memories (CAMs) that are fast and can store a large amount of data.

The goal of this project is to implement such a CAM in a Field Programmable Gate Array (FPGA). This way matching is done purely in hardware and is therefore faster than the software solution. In addition, this approach has several advantages over the full custom solution:

1. FPGAs are used in IP handling systems for other applications besides characterization. The CAM functionality can therefore be integrated with other logic on the same chip.
2. Implementing a CAM in FPGA technology gives the possibility to add extra features. It is for example possible to add logic to obtain statistical data by counting the number of packets that satisfy certain rules.

Since IP characterization is a dynamic process, the content of the FPGA will need to be updated regularly. This can be done in several ways, one of which is dynamic reconfiguration. The goal of this project is to design and implement an FPGA-based Content Addressable Memory, based on the idea described above. This report describes the design methodology and summarizes the results.

[Figure: an IP packet header is applied to the FPGA CAM, which returns the group.]

Outline

The report starts with some theory on the Internet Protocol, together with a description of IP characterization and the specifications of the CAM. This forms chapter 1.

Chapter 2 describes the methodology that is used to design the CAM. This includes the hardware that was available, the tools that have been used and the design flow. This chapter also gives a general discussion of dynamic reconfiguration.

Chapter 3 describes the architecture of the target technology in which the CAM is implemented, in this case a Xilinx Virtex FPGA. It also gives an overview of the available FPGA resources and some information on the programming bitstream.

JBits, the tool used for dynamically reconfiguring the FPGA, is described in chapter 4. Here a general description of the JBits programming model and its ability to change a configuration is given.

Chapter 5 starts with some theory on CAMs and discusses alternatives to using a CAM. Then two proposed CAM structures, the fixed and the variable length CAM, are described, together with the way they are mapped on the Virtex architecture. The structure of an explicit priority encoder is described as well. This encoder makes it possible to program not only the value of each CAM word, but also its priority with respect to other entries.

In chapter 6 the hardware implementation of the board interface is summarized. This interface is implemented on the FPGA and makes communication possible between the CAM and the board of which the FPGA is part.

Chapter 7 first describes how the two CAM structures of chapter 5 (fixed and variable length CAM) have been implemented in hardware. To implement the explicit priority encoder described in chapter 5, it was integrated with the variable length CAM to form a new implementation. This leads to a total of three different CAM implementations.

Chapter 8 discusses the synthesis and place & route process of the three CAM implementations. It describes how the tools have been used and what measures had to be taken for successful implementation. The physical layout of the CAM structures is described as well, and finally the results are given with respect to timing and hardware utilization.

Chapter 9 describes how the software part of the CAM has been implemented, focussing on the integration of JBits in the design. The graphical user interface of the CAMs and some other software parts, like the interface to the hardware and a remote interface to access the hardware via a network, are only mentioned in this chapter. A more detailed description of these parts can be found in the appendices.

Conclusions and recommendations are given in chapter 10. References, a list of used tools and a set of appendices form the last part of this report. The appendices contain Java and VHDL code and describe the parts of the software that have been implemented but are outside the scope of this report.

Chapter 1: IP Characterization

1.1: Internet Protocol Description

The Internet consists of a number of interconnected networks, supporting communication among host computers using certain protocols. These networks are required to provide only packet (connectionless) transport, and the protocol that is responsible for the movement of packets around the network is called the Internet Protocol (IP). According to the IP specification, packets can be delivered out of order, be lost or duplicated, and/or contain errors [1].

1.1.1 IP header

An IP packet is of variable length and consists of two parts:

1. Header: contains information about addressing and control.
2. Payload: the data encapsulated in the IP packet, following the header. It contains a higher level protocol (TCP, UDP) packet with its own header and payload.

An IP packet, together with the header format for IP version 6, is given below in figure 1-1 [2].

[Figure 1-1: IP packet with header format for IP version 6. The packet consists of a 40-byte header followed by the payload. The header, drawn 32 bits wide (bits 0 to 31), contains the fields Version, Traffic Class, Flow Label, Payload Length, Next Header, Hop Limit, Source Address and Destination Address.]

The header fields have the following meaning:

- The protocol Version is 4 bits wide and is equal to 4 or 6.
- The Traffic Class is an 8-bit field, of which 6 bits are currently used. It is available to distinguish between different classes or priorities of IP packets.
- The Flow Label is composed of 20 bits and may be used by a source to label sequences of packets for which it requires special handling by the IPv6 routers.
- The Payload Length is a 16-bit integer, giving the length of the payload in octets.
- The Next Header gives the type of the header that is encapsulated inside the payload of the IPv6 packet, i.e. the type of header immediately following the IPv6 header.
- The Hop Limit is an 8-bit integer, decremented by 1 every time a router handles the packet. If the Hop Limit is decremented to zero, the packet is discarded, preventing packets from running in circles forever and flooding a network.
- The Source Address is the 128-bit address of the originator of the packet.
- The Destination Address is the 128-bit address of the intended recipient of the packet.

1.1.2 IP routing

Routing is the selection of paths for packets in a network.

[Figure 1-2: a number of nodes attached in a mesh network (nodes A to D, links a to d).]

Figure 1-2 shows a network consisting of 4 nodes A to D and 4 links a to d. Suppose a packet is travelling from A to D. The packet will arrive at node B via the incoming link a and can leave via either of the outgoing links b and c. Each node maintains a routing table, which is used to select an outgoing link from the Destination Address field in the IP header [3]. This is done by the Table Look Up (TLU) function, as depicted in figure 1-3, which shows a schematic view of a router. After TLU, a packet passes an IP characterization function that decides whether to discard the IP packet, to send it to software for further processing, or to send the packet to the outgoing link selected by the TLU function. Characterization is done to obtain statistical data as well, for example how many IPv4 packets have arrived.

[Figure 1-3: Schematic view of a router, showing table look up and IP characterization. Packets from the incoming link pass through Table Look up and IP Characterization and are forwarded to the outgoing links (to other nodes) or to software (SW).]
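As an illustration of the header layout described in section 1.1, the following Java sketch extracts the fields from the first 40 bytes of an IPv6 packet. It is not part of the thesis implementation, where the fields are presumed to be available in hardware; the class name `Ipv6Header` and its field names are hypothetical, and the field widths are those listed above.

```java
import java.nio.ByteBuffer;

/** Illustrative parser for the fixed 40-byte IPv6 header (hypothetical class, not from the thesis). */
final class Ipv6Header {
    static final int HEADER_LENGTH = 40; // bytes

    final int version;        // 4 bits
    final int trafficClass;   // 8 bits
    final int flowLabel;      // 20 bits
    final int payloadLength;  // 16 bits, payload size in octets
    final int nextHeader;     // 8 bits
    final int hopLimit;       // 8 bits
    final byte[] sourceAddress = new byte[16];      // 128 bits
    final byte[] destinationAddress = new byte[16]; // 128 bits

    Ipv6Header(byte[] packet) {
        if (packet.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("an IPv6 header needs 40 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(packet); // big-endian (network order) by default
        int firstWord = buf.getInt();             // Version, Traffic Class and Flow Label
        version = firstWord >>> 28;
        trafficClass = (firstWord >>> 20) & 0xFF;
        flowLabel = firstWord & 0xFFFFF;
        payloadLength = buf.getShort() & 0xFFFF;
        nextHeader = buf.get() & 0xFF;
        hopLimit = buf.get() & 0xFF;
        buf.get(sourceAddress);
        buf.get(destinationAddress);
    }

    public static void main(String[] args) {
        byte[] packet = new byte[HEADER_LENGTH];
        packet[0] = 0x60; // version 6, traffic class and flow label zero
        packet[6] = 6;    // next header: TCP
        packet[7] = 64;   // hop limit
        Ipv6Header h = new Ipv6Header(packet);
        System.out.println("version=" + h.version + " nextHeader=" + h.nextHeader
                + " hopLimit=" + h.hopLimit);
    }
}
```

Reading the first 32-bit word in one step keeps the three sub-byte fields (Version, Traffic Class, Flow Label) aligned with the way figure 1-1 draws them.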

In paragraph 1.2, IP characterization will be discussed in more detail.

1.2: Specification and Requirements

1.2.1 Specification

The packet classification is based on some fields in the header and on the higher level protocol that is encapsulated in the payload of the IP packet. These attributes are given in table 1-1, together with their size in bits. They are presumed to be available and do not need to be generated.

Field                         Number of bits
IP Source Address             128
IP Destination Address        128
Incoming Link (a)             6
Outgoing Link (a)             6
Next Header                   8
Traffic Class                 6
TCP/UDP Source Port           16
TCP/UDP Destination Port      16
TCP/UDP Syn/Ack               1
Total number of bits          315

Table 1-1: IP characterization attributes and their size in bits.
(a) These attributes are not part of the IP packet, but are information supplied by the router.

The input of the IP characterizer is thus a 315-bit vector. The output of the system decides what actions should be performed for further processing of the packet. Instead of letting the output point directly to the class to which the incoming IP packet belongs, the system returns an index, which is processed in software to perform the necessary actions. The IP characterization function can therefore be seen as a Content Addressable Memory (CAM).

A CAM is a memory which can be viewed as the opposite of a normal SRAM. For SRAMs the address is used as an input and data is provided as an output. A CAM works the other way: the data is the input, all the memory words (or entries) are searched, and when a match is found, the address of that word is given as an output. The CAM should be implemented in a ternary way, which means that the entries that are stored in it may contain don't cares. For this application a CAM of at least 128 entries should be implemented. The description given above is summarized in the figure below.
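The behaviour of a ternary CAM as specified above can be sketched in software. The Java model below is only an illustration under stated assumptions: the class `TernaryCam`, the pattern notation with `x` for a don't care, and the sequential search are all hypothetical choices made here, whereas the hardware CAM compares all entries in parallel. Keys and patterns are assumed to have the same width.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

/** Software model of a ternary CAM (illustrative; names are hypothetical). */
final class TernaryCam {
    /** One CAM word: a value plus a mask whose 0 bits mark don't-care positions. */
    private static final class Entry {
        final BigInteger value;
        final BigInteger careMask;

        Entry(String pattern) {
            if (!pattern.matches("[01x]+")) {
                throw new IllegalArgumentException("pattern may only contain 0, 1 and x");
            }
            // Every 0 or 1 in the pattern becomes a 1 in the mask, every x a 0.
            careMask = new BigInteger(pattern.replace('0', '1').replace('x', '0'), 2);
            value = new BigInteger(pattern.replace('x', '0'), 2);
        }

        boolean matches(BigInteger key) {
            return key.and(careMask).equals(value);
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Stores a pattern such as "10xx" and returns the address it was stored at. */
    int add(String pattern) {
        entries.add(new Entry(pattern));
        return entries.size() - 1;
    }

    /** Returns the address of the first entry matching the key bits, or -1 if none matches. */
    int lookup(String keyBits) {
        BigInteger key = new BigInteger(keyBits, 2);
        for (int address = 0; address < entries.size(); address++) {
            if (entries.get(address).matches(key)) {
                return address;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        TernaryCam cam = new TernaryCam();
        cam.add("10xx");
        cam.add("1xxx");
        System.out.println(cam.lookup("1011")); // data in, address out
    }
}
```

`BigInteger` is used so that the same sketch works for the full 315-bit search word as well as for short test patterns.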

[Figure 1-3: schematic view of the IP characterization function. The input Data is compared with entry 0 to entry 127; the outputs are Address and Match.]

1.2.2 Priority

As mentioned before, the input of the CAM can contain both source and destination address and other information. In some cases, this other information may take priority over the addressing information altogether. This priority effect also exists with hierarchically structured addresses that contain don't cares. Here, one part of an address takes priority over another part in the matching decision, depending on which address has the fewest don't cares [4]. This means that when two or more entries give a match for a certain input, the index of the most specific entry, i.e. the entry whose address contains the fewest x's, should be returned. This requires a priority scheme for the entries.

1.2.3 Performance requirements

The time available for characterizing an IP packet is equal to the time it takes to transfer this packet over the communication channel. The minimum look up rate of the CAM is then equal to:

$$f_{\min} = \frac{B}{8L}$$

with B the bandwidth of the communication channel in bits/s and L the length of the IP packet in bytes.

There are two possible schemes for calculating the timing requirements. The first scheme calculates the requirements using an average IP datagram length of 200 bytes. This implementation requires buffering. Since the communication channel is used to transfer voice and video as well, latency should be minimized where possible and buffering is therefore not favourable. The second scheme uses a (worst case) minimum datagram length, which is equal to 40 bytes, the length of the header only.

Table 1-2 shows the required average and worst case look up rates of the CAM for different standardized bandwidths [5]. The STM-4 standard is currently in development and is to be considered as the minimum requirement.

Bandwidth [Mbit/sec]    Required lookup-rate (average)    Required lookup-rate (worst case)
                        [klookups/sec]                    [klookups/sec]
155 (STM-1)             [...]                             [...]
[...] (STM-4)           388                               1,[...]
[...] (STM-16)          1,550                             7,750

Table 1-2: required lookup rates for different bandwidths
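The look up rate formula of the performance requirements, f_min = B / (8 L), can be evaluated with a few lines of Java. This is a minimal sketch; the class and method names are hypothetical, and the 622 Mbit/s used in the example is an assumed nominal STM-4 rate, not a value taken from table 1-2.

```java
/** Evaluates the minimum look up rate f_min = B / (8 L) (illustrative helper, hypothetical names). */
final class LookupRate {
    static final int AVERAGE_DATAGRAM_BYTES = 200;   // average length used by the first scheme
    static final int WORST_CASE_DATAGRAM_BYTES = 40; // header only, used by the second scheme

    /**
     * @param bandwidthBitsPerSecond channel bandwidth B in bits/s
     * @param packetLengthBytes      packet length L in bytes
     * @return required look ups per second
     */
    static double minLookupsPerSecond(double bandwidthBitsPerSecond, int packetLengthBytes) {
        if (packetLengthBytes <= 0) {
            throw new IllegalArgumentException("packet length must be positive");
        }
        return bandwidthBitsPerSecond / (8.0 * packetLengthBytes);
    }

    public static void main(String[] args) {
        double bandwidth = 622e6; // assumed nominal STM-4 rate in bits/s
        System.out.println("average:    "
                + minLookupsPerSecond(bandwidth, AVERAGE_DATAGRAM_BYTES) + " look ups/s");
        System.out.println("worst case: "
                + minLookupsPerSecond(bandwidth, WORST_CASE_DATAGRAM_BYTES) + " look ups/s");
    }
}
```

Because the worst case packet is five times shorter than the average one, the worst case requirement is always five times the average requirement for the same bandwidth.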

Chapter 2: Design Methodology

2.1: Dynamic Reconfiguration

Dynamically reconfigurable hardware allows a flexible adaptation to the needs of the application by reprogramming it at run-time [6]. The idea is that a general-purpose hardware agent is configured to carry out a specific task, but can be reconfigured on demand to carry out other specific tasks. This definition is rather general and one can distinguish between different levels of reconfiguration:

- evolutionary reconfiguration: the reconfiguration is initiated by developers in response to evolving system requirements. Examples are requirement changes and bug fixes. This type of dynamic reconfiguration is suitable for non-stop applications that do not tolerate a break in service.
- adaptive reconfiguration: the structure of the application is changed in response to application or system events. In this case, the application is adapted depending on data or changing parameters.
- evolving reconfiguration: the structure of the application is changed, depending on the result of a previous stage in the execution.

It should be noted that other definitions of dynamic reconfiguration exist [7].

Dynamically reconfigurable systems can be implemented using reconfigurable Field Programmable Gate Arrays (FPGAs). These are SRAM-based circuits that can be programmed with a specific function during system operation. They may support partial reconfiguration, which means that it is possible to reprogram only a part of the FPGA. The partial reconfiguration can be non-disruptive or disruptive. Non-disruptive means that the portions of the system which are not being reconfigured remain fully operational during reconfiguration. If the reconfiguration affects other portions of the system, it is called disruptive and a clock hold is needed.

IP characterization is based on adaptive reconfiguration, since it reacts to a change in classifying rules by adapting its structure.
Ideally the design should be reconfigured in a non-disruptive way. However, partial reconfiguration is not supported by the development environment at hand and is therefore not feasible at this moment.

2.2: Hardware Environment

The implementation is targeted to a Xilinx Virtex XCV1000 FPGA. This FPGA is situated on a PCI card, the RC1000-PP made by Embedded Solutions Ltd. A block diagram of the architecture of the board is given in figure 2-1 [8]. It has four memory banks and each bank has 2 MBytes of asynchronous SRAM. The FPGA has four 32-bit memory ports, one for each memory bank, and each bank has separate data, address and control signals. The FPGA can therefore access all four banks simultaneously and independently.

There are three methods of transferring data or communicating between the FPGA and the PCI host:

1. The memory banks are used to perform bulk data transfers.
2. There are two unidirectional ports for direct communication between the host and the FPGA using two 8-bit paths with handshaking signals, depicted in the figure as Ctrl_Reg and Stat_Reg.
3. Two unidirectional ports GPO and GPI provide for single-bit communications without handshaking.

Finally there is a programmable clock that is controlled by the host.

[Figure 2-1: block diagram of PCI card. The XCV1000 FPGA is connected to SRAM 0 to SRAM 3 via Data0 to Data3, and to the PCI bus via Reset, GPO, GPI, Stat_Reg and Ctrl_Reg. The figure also shows the programmable clock (clk) and the configuration path.]

This board is used in a PC with a 450 MHz Pentium II processor and 320 MB RAM, running Windows NT.

2.3: Software Environment

The design tool flow describes the tools that have been used and the data flow between them and is shown in figure 2-2. It consists of a static and a dynamic part [9]. The static part is used to implement the part of the logic that is not dynamically changed. This includes I/O and logic that interfaces to the dynamic part. Furthermore it initializes the dynamic part of the design, either by generating a basic structure that will be changed in a later phase, or by reserving an empty area in which the dynamic part is to be placed. The static part is designed using VHDL synthesis in combination with the tools available from Xilinx for place and route and bitstream generation.

[Figure 2-2: design tool flow for dynamic reconfigurable logic. Static part: VHDL, synthesis, EDIF, place and route, generation of the bitstream for configuration. Dynamic part: user JAVA application, JBits, BoardScope and the XHWIF hardware interface, which passes the bitstream to the hardware.]

The dynamic part of the design controls the configuration of the FPGA during operation of the application, and a tool called JBits is used for this. With JBits, the programming bitstream of the FPGA can be altered easily with relatively simple commands, as explained in chapter 4. The JBits functionality is used in a main Java application that implements the user interface of the CAM and controls the reconfiguration. This application has to communicate with the hardware, and for this the Xilinx Hardware Interface (XHWIF) is used [10]. It permits simple porting of JBits to the hardware. It includes methods for reading and writing bitstreams to FPGAs, incrementing the on-board clock and reading and writing to and from the on-board memory. Via the hardware interface, the vendor-specific C functions for communicating with the board can be used in the user Java application. BoardScope is a tool that can read a configuration back from the FPGA, including the state of instantiated flip-flops. This tool is very useful for debugging purposes, because simulation of the dynamic part is not possible.

Chapter 3: Target Technology

This chapter describes the architecture of the Xilinx Virtex FPGA series.

3.1 Virtex Architecture

3.1.1 Global architecture

Xilinx Virtex has a regular structure, consisting of configurable logic blocks (CLBs) surrounded by programmable input/output blocks (IOBs) [11]. This is shown in figure 3-1.

[Figure 3-1: global architecture of Virtex FPGA. An array of CLBs with BlockRAM at its sides, surrounded by IOBs.]

The CLBs provide the functional elements for constructing logic and the IOBs provide the interface between the package pins and the CLBs. By connecting the IOBs and CLBs together using general routing resources, a complex circuit can be built. Besides the CLBs and IOBs, Virtex also contains integrated SRAM blocks, called BlockRAM, and 3-state buffers (TBUFs). The CLBs, BlockRAM and TBUFs will be discussed in the next paragraphs.

3.1.2 Configurable logic block

A schematic view of the Virtex CLB is given in figure 3-2. It consists of two identical parts, called slices, and each slice has two logic cells (LCs). An LC includes a 4-input look-up table (LUT), carry logic and a storage element. The LUTs can be configured in different ways:

- Any combinatorial function of four inputs;
- 16x1 synchronous RAM;
- 16-bit shift register.
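To make the first LUT configuration mode concrete: a 4-input LUT is a 16-entry truth table, so any combinatorial function of four inputs is described by 16 configuration bits. The Java sketch below models this. The class `Lut4` is hypothetical, and the convention that bit i of the table holds the output for input combination i is an assumption made for this illustration, not a statement about the Virtex bitstream format.

```java
import java.util.function.IntPredicate;

/** Model of a 4-input look-up table as a 16-bit truth table (illustrative, hypothetical class). */
final class Lut4 {
    private final int truthTable; // bit i is the output for input combination i (assumed convention)

    Lut4(int truthTable) {
        this.truthTable = truthTable & 0xFFFF;
    }

    /** Builds the truth table of an arbitrary function of four inputs, given as bits 0..3 of an int. */
    static Lut4 of(IntPredicate function) {
        int table = 0;
        for (int inputs = 0; inputs < 16; inputs++) {
            if (function.test(inputs)) {
                table |= 1 << inputs;
            }
        }
        return new Lut4(table);
    }

    /** Looks up the output for the given four input bits. */
    boolean evaluate(int inputs) {
        return ((truthTable >>> (inputs & 0xF)) & 1) == 1;
    }

    int truthTable() {
        return truthTable;
    }

    public static void main(String[] args) {
        Lut4 and4 = Lut4.of(in -> in == 0xF); // output is 1 only when all four inputs are 1
        System.out.println(Integer.toHexString(and4.truthTable()));
        System.out.println(and4.evaluate(0b1111) + " " + and4.evaluate(0b0111));
    }
}
```

Changing the function implemented by a LUT therefore amounts to rewriting these 16 bits, which is what makes LUTs attractive targets for run-time reconfiguration.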

[Figure 3-2: Virtex CLB. Slice 0 and slice 1 each contain a G-LUT (inputs G1 to G4) and an F-LUT (inputs F1 to F4), carry logic (C in, C out), the inputs BX and BY, and two D storage elements with the outputs X, XQ, Y and YQ.]

The storage elements in the Virtex slice can be configured either as edge-triggered D flip-flops or as level-sensitive latches. The carry logic, shown in figure 3-2, can be used for fast arithmetic functions, but also for cascading LUTs to implement wide logic functions. The Virtex 1000 has an array of 64 x 96 CLBs.

3.1.3 Tri-state buffers

Each Virtex CLB contains two 3-state buffers (TBUFs) that can drive on-chip buses. These on-chip buses are provided by horizontal routing resources, and four bus lines are provided per CLB row, as shown in figure 3-3.

[Figure 3-3: TBUFs connected to dedicated horizontal buses.]

The TBUFs as implemented on the Virtex are not true TBUFs; instead they are implemented using a logic circuit that emulates the behaviour of a true 3-state buffer. This way, several TBUFs may drive a line simultaneously without the device getting damaged. When at least one TBUF drives a 0 on a line, the logic value of that line becomes 0, no matter what the output values of the other TBUFs are.

3.1.4 Block RAM

The Virtex contains 32 Block RAMs, organized in two columns, one along each vertical edge of the chip. Each such memory cell is a dual-ported 4096-bit RAM with independent control signals for each port, as illustrated in figure 3-4. The data widths of the two ports can be configured independently according to the table in the figure.

[Figure 3-4: Dual-Port BlockRAM with possible port configurations. Port A has the signals WEA, ENA, RSTA, CLKA, ADDRA, DIA and DOA; port B has the signals WEB, ENB, RSTB, CLKB, ADDRB, DIB and DOB. The figure includes a table of the width and depth of the possible port configurations.]

3.1.5 General routing

Apart from dedicated routing, such as carry and 3-state lines, there is also general routing that is used to interconnect the CLBs. The general routing uses two kinds of wires: singles and hexes. Singles starting at a certain CLB terminate at an adjacent CLB, while hexes terminate at CLBs 6 positions over. Singles should be used to transport data between local CLBs, whereas hexes should be used to transport data to non-local CLBs. The singles and hexes are each grouped into buses that extend in four primary directions: north, east, south and west. The connections to neighbouring CLBs are straightforward. A north single connects directly to a south single in the CLB above it. A hex west wire connects directly to a hex east wire on the 6th CLB over. Switch boxes are used to connect lines together. A schematic view of a CLB, with the different wires and switch boxes, is given in figure 3-5.

[Figure 3-5: Schematic view of general routing of a Virtex CLB. The CLB is connected through the main switch box to the single switch box and the hex switch box, with 24 single wires and 12 hex wires in each of the directions north, east, south and west.]

The main switch box allows the singles and the hexes to be connected to each other and to the CLB. Some hex wires can only drive data into the CLB; these are unidirectional in. Some hex wires can only drive data out of the CLB; these are unidirectional out. Other hex wires can drive data both in and out; these are bidirectional. Circuits, however, should drive data on the bidirectional lines in only one direction, not both, since driving in both directions leads to contention, which can damage or destroy the device.

3.2 Virtex Configuration

As mentioned before, the FPGA is programmed by means of a programming bitstream. This bitstream is generated by a Xilinx tool called BitGen, which is run after place and route. This paragraph gives some information on the Virtex configuration and the programming bitstream. A bit-level description of the bitstream will not be given; instead its general format will be discussed, together with the possibility of partial reconfiguration.

3.2.1 Configuration memory

The configuration of an FPGA is stored in the configuration memory. This memory controls the switch boxes that are used to connect routes, and the multiplexers that connect internal resources in the slices. Normally, this configuration memory is only written once during configuration and is not used explicitly by the application. The power of using JBits is that not only the regular logic is available during operation, but also the configuration logic, since one has access to the configuration memory. The internal configuration memory is partitioned into segments, called frames. The number and size of frames vary with device size. The Virtex 1000, which is used in this application, has 4909 frames of 1248 bits each.

3.2.2 Programming bitstream

The programming bitstream consists of a series of packets, where each packet consists of a packet header and data. Some packets are used for a special purpose, such as checksum checking or sending options to the FPGA. Other packets are used to write to the configuration memory. This kind of packet has a header that contains a frame address, so that each frame can be addressed separately. This is a novel way to access the configuration memory, and different from older FPGA configuration methods where the configuration of an FPGA component was spread over the entire bitstream.

In the initial bitstream generated by BitGen, every frame is written to initialize the whole configuration memory. After this it is possible to reconfigure only those frames that have actually changed. For example, if an entry is added to the CAM, only the frames that are responsible for that entry need to be written. This is called partial reconfiguration. The advantage of that approach is that the number of bits that need to be sent to the FPGA is a lot smaller. To give an idea of the time it takes to reconfigure the whole FPGA, one can calculate the time it takes to write to the configuration memory.
The time for configuration is equal to: t configuration = The Virtex is configured via an 8 bits bus on a clock frequency of 50 MHz, so f configuration = 400 Mbits/s. The time it takes to reconfigure the entire FPGA is then equal to: 4909*1248/400 = 15.3 [ms] number of frames bits per frame [s] f configuration The actual programming time is slightly longer, due to handshaking and special purpose packets. Due to the high number of frames, the time for configuration is decreased significantly when doing partial reconfiguration. The board that is available for this application does not support partial reconfiguration in the sense that the device driver of the board does not allow partial bitstreams to be sent to the FPGA. More about Virtex configuration can be found in [12]
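The configuration-time estimate above can be checked with a few lines of Java. This is only a sketch of the arithmetic: the class and method names are illustrative, and the 10-frame case is a hypothetical example of a partial update, not a figure from the measurements.

```java
// Sketch: configuration time = (frames * bits per frame) / (bus width * clock).
class ConfigTime {
    /** Time in seconds needed to write the given number of frames. */
    static double seconds(long frames, long bitsPerFrame, int busWidthBits, double clockHz) {
        double bitsPerSecond = busWidthBits * clockHz;   // 8 bits * 50 MHz = 400 Mbits/s
        return (double) (frames * bitsPerFrame) / bitsPerSecond;
    }

    public static void main(String[] args) {
        // Full configuration of the Virtex 1000: about 15.3 ms.
        double full = seconds(4909, 1248, 8, 50e6);
        // Hypothetical partial update that touches only 10 frames.
        double partial = seconds(10, 1248, 8, 50e6);
        System.out.println("full configuration [s]: " + full);
        System.out.println("10 frames          [s]: " + partial);
    }
}
```

Handshaking and special purpose packets are not modelled, so the real programming time is somewhat longer than this estimate.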

Chapter 4: JBits

4.1 Introduction

JBits is a set of Java classes which provide an Application Program Interface (API) into the Xilinx FPGA bitstream [13]. This interface operates either on bitstreams generated by design tools, or on bitstreams read back from actual hardware. This makes it possible to design, modify and dynamically reconfigure the logic on an FPGA. JBits allows the FPGA to be placed, routed and reconfigured manually at CLB level with relatively simple commands. This makes it very suitable for dynamically reconfiguring regular structures, such as the CAM.

4.2 JBits Programming Model

The diagram in figure 4-1 illustrates the essential steps involved in the development of a JBits application.

Figure 4-1: JBits programming model (Create JBits Object, Read Bitstream, Modify Bitstream, Write Bitstream)

Constructor

A JBits object must be constructed before anything can be done. The constructor is very simple and takes a single parameter, the device type. It builds the device model for the selected part and performs various initializations. The prototype for the constructor is:

    JBits(int devicetype);

For example,

    JBits jbits = new JBits(Devices.XCV1000);

This builds the device model for the Virtex device XCV1000.

Reading the bitstream

This method takes a single parameter, a string containing the name of the bitstream file to be read. It loads the bitstream into the constructed JBits object and maps the bitstream data into the device model. Once a bitstream has been loaded, configuration data in the form of bits may be read and written. The method prototype for reading the bitstream is:

    void JBits.read(String infilename);

For example,

    jbits.read("infile.bit");

This reads in the bitstream file "infile.bit".

Setting a resource

This method writes the configuration data to a given FPGA resource. Examples of these resources are LUTs and CLB inputs. The configuration data can, for example, be the logical function that a LUT implements, or the specific wire (single, hex) that is connected to a CLB input. The CLB in which the resource is situated is identified by a CLB row and a CLB column. The resource in the selected CLB is then identified by a constant. These constants are defined in the Java classes containing the configurable objects. For instance, setting the configuration of the SLICE0 F1 input is accomplished by using the S0F1 constant in the S0F1 class, that is S0F1.S0F1. An array of integers supplying the configuration bits is passed as the final parameter of the set method. As with the resource, this data is nearly always a pre-defined constant. For instance, to set the S0F1.S0F1 input to the value of the SLICE1 X output, the constant used is S0F1.S1_X. To summarize, the set() method is used to identify the CLB and the resource associated with it, and to specify the predefined constant value applicable to that resource. The method prototype for setting a resource to a value (bits) is:

    void JBits.set(int row, int column, int[][] resource, int[] bits);

For example,

    jbits.set(clbrow, clbcol, S0F1.S0F1, S0F1.S1_X);

This connects the X output of slice 1 to the F1 input of slice 0 (see figure 3-2).

Writing the bitstream

Similar to the read bitstream method, the write bitstream method takes a single parameter, a string containing the name of the bitstream file to be written. This method writes the bitstream from the constructed JBits object into a file. The method prototype for writing the bitstream is:

    int JBits.write(String outfilename);

For example,

    jbits.write("outfile.bit");

This writes the modified configuration data to the bitstream file "outfile.bit".

In a running system, one is not always interested in writing the new bitstream to a file; instead one wants to use the bitstream to configure the FPGA directly from memory. To do this, the following command can be used:

    byte[] jbits.getallpackets();

This method returns all packets contained in the bitstream, and these can be sent directly to the FPGA.

Getting the resource configuration

The get() method is used to read the configuration of a given resource in a CLB. The resource is identified using the same convention as in the set() method. For the most part, the data obtained using the get() method may be interpreted and used by other portions of a JBits application. The method prototype for get() is:

    int[] JBits.get(int row, int column, int[][] resource);

For example,

    int[] Value = jbits.get(clbrow, clbcol, S0F1.S0F1);

This returns the value set for the resource S0F1.S0F1, that is, the wire connected to the SLICE0 F1 input.

The code pieces mentioned before are assembled below into a simple JBits application. It sets the F1 input of the SLICE0 F LUT in the CLB in row 5, column 4 to the SLICE1 X output.

    JBits jbits = new JBits(Devices.XCV1000);   // build the device model
    jbits.read("infile.bit");                   // load the existing bitstream
    jbits.set(5, 4, S0F1.S0F1, S0F1.S1_X);      // row 5, column 4: S0F1 input driven by SLICE1 X output
    jbits.write("outfile.bit");                 // write the modified bitstream

Chapter 5: CAM Structures

This chapter starts with a general overview of CAMs, including the definition of explicit priority. Then two different CAM structures are discussed, together with the way they can be mapped onto the Virtex architecture. In the first structure, the same amount of area is reserved for all entries. This will be referred to as the fixed length CAM. The second structure is very similar to the fixed length CAM, but the area that each entry occupies is variable and depends on the number of don't cares. This structure is called the variable length CAM. Both implementations use dynamic reconfiguration for updating their content. Other CAM structures that do not use dynamic reconfiguration, but are suitable for implementation on an FPGA, can be found in [14]-[17]. Next the structure of an explicit priority encoder is described, which can be integrated with either of the CAM implementations described earlier. The chapter ends with an estimate of the hardware resources consumed by each of these structures.

5.1 RAM-based versus CAM-based Look Up

There are a number of algorithms that perform the look up function using standard Random Access Memory (RAM):

1. A RAM can perform the look up in a single cycle if the data being searched (i.e. the information from the packet header) is used as a direct index into memory. In this case the size of the RAM is determined by the size of the search field. The number of words stored in the RAM has no effect on its size and cost. Thus, if there were only 256 words, each with a 16-bit search field, the RAM must still have 64K words. The size and cost of a RAM used with a direct index grow exponentially with the width of the search field. Since the size of the search field in IP characterization is 315 bits, the practical limit of an economic RAM-based look up function is exceeded.

2. A linear search is the most memory-efficient algorithm for table look up, requiring only one entry per active address. If the entries in the routing table are searched in order of highest priority first, then the first match will be the best match. Of course, the linear search runs in time O(N), where N is the number of entries, and so can take considerable time.

3. A faster approach is a tree search, using a binary tree or a Patricia tree. In general, these trees can push the search time towards log(n), where the log base is 2. However, since the search fields in IP characterization are longer than the number of entries that needs to be stored, the worst case number of cycles needed for matching is as large as the number of entries. This search time can therefore still be excessive, and tree search algorithms require complex controlling hardware.

4. Under good conditions, a hash function can execute the look up function in constant time, only slightly slower than direct access. The worst case search time, however, can be considerably worse. The performance is a function of the size of the hash memory and the number of addresses that must be searched in a given time window (after which a hashed entry will be timed out). While the number of stored entries might be relatively small, the number of addresses that might potentially be searched is large. This number depends on packet traffic patterns, which can be hard to predict. Therefore, the amount of memory might be unacceptably large.
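As an illustration of the linear search in item 2, the sketch below stores each entry as a value plus a "care" mask (a 0 bit in the mask is a don't care) and returns the first match, so entries must be stored highest priority first. It is a software model only: 64-bit words are used instead of 315-bit search fields to keep it short, and the class and method names are assumptions made for this example.

```java
// Sketch: O(N) linear search over ternary entries, highest priority first.
class LinearLookup {
    /**
     * Returns the index of the first entry that matches the key, or -1.
     * Bit i of care[n] is 1 if bit i of value[n] must match, 0 for a don't care.
     */
    static int lookup(long[] value, long[] care, long key) {
        for (int i = 0; i < value.length; i++) {
            if (((key ^ value[i]) & care[i]) == 0) {
                return i;          // first match is the best match
            }
        }
        return -1;                 // no entry matched
    }

    public static void main(String[] args) {
        long[] value = {0x12340000L, 0x12000000L, 0L};
        long[] care  = {0xFFFF0000L, 0xFF000000L, 0L};   // last entry matches everything
        System.out.println(lookup(value, care, 0x1234ABCDL)); // matches entry 0
        System.out.println(lookup(value, care, 0x12FF0000L)); // falls through to entry 1
        System.out.println(lookup(value, care, 0x99000000L)); // only the catch-all entry 2
    }
}
```

The loop makes the O(N) cost explicit: every additional entry adds one comparison to the worst case, which is what a CAM avoids by comparing all entries in parallel.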

The RAM-based look up algorithms described above either consume too many hardware resources, or are too slow to fulfil current speed requirements. At higher speeds it is necessary to use the faster and well-bounded search time of a CAM. CAM look up solutions can offer superior performance, compared to even the most sophisticated RAM-based search algorithms [18].

5.2 Different CAM Types

Binary versus Ternary CAMs

A binary CAM stores only one of two states ('0' and '1') in each memory location (i.e. in each bit of a word); a ternary CAM stores one of three states in each memory location. These three states are represented by '0', '1' and 'X'. Ternary CAMs may have a global mask as well. This allows the search pattern (i.e. the bit vector that is used as input to the CAM) to contain Xs too. This is especially useful when the width of the search pattern is small, so that two or more entries can be stored in the same CAM location.

Return value

The entries in a CAM have two parts. The most important part is the search field, which is the part of the entry that is matched with the search pattern. The CAM entries also contain a return field, which is the information returned during a read. This contains either related information or an index. In some cases, one is able to write not only to the search field, but also to the return field, so that the return value can be programmed per entry.

Prioritizing

Since the entries stored in the CAM may contain don't cares, it is possible that two or more entries give a match at the same time. As mentioned in chapter 1, to solve this the entries should be prioritized and the address of the entry with the highest priority should be returned. Two priority schemes are possible [18]:

1. Inherent priority: inherent priority exploits the CAM's predictable ordering when reading multiple matched data. In this case, the system stores the entries in order of priority. By using a priority encoder, the top address of the CAM has the highest priority (0) and the bottom address has the lowest priority (127).

2. Explicit priority: the inherent priority can be replaced with an explicit priority field added to each CAM word. In case of a multiple match, the entry with the highest explicit priority, as stored in the priority field, is returned.

The advantage of explicit priority is that updating the CAM becomes easier, since a new entry can always be added at the end. With inherent priority, new entries are not always added at the end of the CAM; an address has to be freed by shifting down other entries and updating the memory that is addressed by the CAM.
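The difference in update cost between the two schemes can be made concrete with a small software model. The sketch below is an assumption-laden illustration, not the thesis implementation: the CAM is modelled as a Java list, entries are plain strings, and the method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: cost of adding an entry under inherent versus explicit priority.
class PriorityUpdate {
    /**
     * Inherent priority: the address is the priority, so the new entry must be
     * placed at its rank. Returns how many existing entries had to move down.
     */
    static int insertInherent(List<String> cam, int address, String entry) {
        int shifted = cam.size() - address;
        cam.add(address, entry);          // every later entry shifts by one address
        return shifted;
    }

    /**
     * Explicit priority: the entry is appended and its priority is stored
     * alongside it, so no existing entry moves.
     */
    static int insertExplicit(List<String> cam, List<Integer> priorities,
                              String entry, int priority) {
        cam.add(entry);
        priorities.add(priority);
        return 0;
    }

    public static void main(String[] args) {
        List<String> inherent = new ArrayList<>(Arrays.asList("A", "B", "C"));
        int moved = insertInherent(inherent, 1, "N");
        System.out.println(moved + " entries shifted, CAM = " + inherent);

        List<String> explicit = new ArrayList<>(Arrays.asList("A", "B", "C"));
        List<Integer> prio = new ArrayList<>(Arrays.asList(0, 2, 3));
        moved = insertExplicit(explicit, prio, "N", 1);
        System.out.println(moved + " entries shifted, CAM = " + explicit);
    }
}
```

In the reconfigurable CAM every shifted entry means rewriting configuration frames, which is why the shifting case is the expensive one.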

5.3 Fixed Length CAM

Global structure

The global structure of the fixed length CAM is given in figure 5-1. It is a PLA structure, consisting of match lines and an encoder. All match lines together form the match field. The incoming data (Indata) is matched with the CAM words that are stored in the match lines. If Indata matches the word stored in a match line, the output of that line becomes 1, else it becomes 0. The encoder translates the outputs of the match lines to the address of the line that gave a match. This can be either an inherent or an explicit priority encoder.

Figure 5-1: Global structure of the fixed length CAM

Match Line

Due to the limited width of the memory that is available on the board, it is not possible to match all bits at once. Therefore the match lines are divided into match blocks, separated by registers as shown in figure 5-2. Each block outputs 1 when its input is 1 and a partial match occurs. The number of clock cycles needed for a complete match is then equal to the number of match blocks. The block size was chosen to be 64 bits, so that 5 clock cycles are needed for a complete match. Since the required lookup rate is 1.937 × 10^6 lookups/sec, the minimal clock frequency at which the system should run is 1.937 × 5 ≈ 10 MHz, which is feasible in FPGA technology. A block size of 64 bits means that the CAM reads from two memory banks simultaneously. The block size cannot be chosen much larger, due to the limited number of memory banks (one of the four banks is already used to write the result to).
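The behaviour of one match line can be sketched in software as follows. This is a functional model only, under the assumption that each 64-bit block is described by a value and a "care" mask (0 bits are don't cares); the registers between the blocks are represented by the loop variable that carries the partial result from one clock cycle to the next. The class and field names are illustrative.

```java
// Sketch: a match line of five 64-bit match blocks, evaluated one block per cycle.
class MatchLine {
    static final int BLOCKS = 5;                 // 320-bit word / 64-bit blocks

    final long[] value = new long[BLOCKS];       // stored bits per block
    final long[] care = new long[BLOCKS];        // 1 = must match, 0 = don't care

    /** One match block: outputs 1 only if its input is 1 and its 64 bits match. */
    static boolean block(boolean in, long data, long value, long care) {
        return in && ((data ^ value) & care) == 0;
    }

    /** Feeds the five 64-bit slices of a search word through the line. */
    boolean match(long[] indata) {
        boolean partial = true;                  // input of the first block is 1
        for (int cycle = 0; cycle < BLOCKS; cycle++) {
            // The register after each block holds 'partial' until the next cycle.
            partial = block(partial, indata[cycle], value[cycle], care[cycle]);
        }
        return partial;                          // valid after BLOCKS clock cycles
    }

    public static void main(String[] args) {
        MatchLine line = new MatchLine();
        line.value[0] = 0xABL;
        line.care[0] = 0xFFL;                    // only the low byte of block 0 matters
        System.out.println(line.match(new long[] {0x1ABL, 1, 2, 3, 4}));
        System.out.println(line.match(new long[] {0xACL, 1, 2, 3, 4}));
    }
}
```

Because a complete match takes one cycle per block, the lookup rate of this structure is the clock frequency divided by the number of blocks.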

Figure 5-2: Schematic view of a match line, divided into blocks and registers

A smaller block size causes the fanout of the signal Indata to increase, which negatively influences the speed. The match blocks can be mapped onto the Virtex FPGA using LUTs and carry logic. This is shown in figure 5-3.

Figure 5-3: 64-bit match block implemented using LUTs and carry logic

The LUTs are configured in such a way that they output 1 when the corresponding bits on their inputs match, else they output 0. Initially the carry is equal to C in, and going from bit 63 to bit 0 the carry chain propagates this signal as long as the LUTs output 1. If all the LUTs in a particular match block output 1, then C out will be equal to C in, else the block will output 0.

5.4 Variable Length CAM

The variable length CAM is a CAM in which the stored entries have variable length, depending on the number of don't cares they contain. The reason for implementing this kind of CAM is that the number of don't cares is in general quite large. This has two reasons:

1. The total size of the header fields that are used for matching is 315 bits. This does not mean that all of these fields are used for matching an entry. An example is filtering packets from a certain host. In this case only the source address of the forbidden IP packets needs to be matched. Another example is counting the number of IP version x packets that arrive, for which only four bits need to be stored.

2. IP addresses are often not completely specified, meaning that the packet is to be sent to a net or subnet rather than to a host. This means that the 128-bit source and destination address fields in the entries often end in don't cares.

To save area, don't cares are left out so that entries take less space to store. By placing these reduced entries in a smart way, more entries can be stored in the CAM.

Global Structure

The global structure of the variable length CAM with a maximum of 16 entries is given in figure 5-4. It consists of a long chain of match blocks and shift registers, separated by switches and placed into match lines of four blocks each.

Figure 5-4: Schematic view of CAM with entries of variable length (match blocks, shift registers, switches, multiplexers and priority encoder)

Programming the CAM is done by mapping entries onto match blocks and shift registers and placing the entries on the chain, starting at match block 1. Blocks within an entry are connected together by closing a switch, which causes the carry signal of a block to be propagated to the next block. An open switch on the input of a block means that its input becomes 1, which starts a new entry. The multiplexers are used to connect the outputs of the entries to the priority encoder. Programming the CAM consists of two steps:

1. Mapping: dividing entries into 64-bit blocks separated by delays, leaving out blocks that contain only don't cares.

2. Placing: placing the mapped entries on the actual CAM structure and connecting their outputs to the priority encoder.

Mapping

First the 320-bit entry is divided into 5 blocks of 64 bits each. If none of the blocks is empty (i.e. none consists only of 'x'), then no block can be removed and the entry is mapped as follows:

[Mapping table: block, delay]

Every clock cycle a block is matched, and one clock cycle later the resulting output of that match block is propagated to the next block. Now suppose that block 3 of the entry is empty, i.e. consists entirely of don't cares. If this block were simply omitted, block 4 would already be matched in clock cycle 3. To solve this, an extra delay needs to be inserted between block 2 and block 4. The entry is thus mapped onto four blocks, as shown below:

[Mapping table: block, delay]

Placing

After an entry has been mapped onto blocks and delays, it can be placed by mapping blocks onto match blocks and delays onto shift registers of length equal to the delay. Each match line contains two multiplexers. These are used to connect the outputs of two of the four shift registers in the match line to the priority encoder.
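The mapping step can be sketched as a small routine that drops empty blocks and computes the delay that must follow each remaining block. This is a hedged reconstruction from the description above, not the thesis code: a block is taken to be empty when its care mask is zero, the delay after a block is the distance to the next kept block, and the delay after the last kept block is chosen so that the result appears in the same cycle as for a full entry. That last choice is an assumption; the text does not state how trailing empty blocks are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: map a 5-block entry onto match blocks and delays, omitting empty blocks.
class EntryMapper {
    static final int BLOCKS = 5;

    /** A kept block (1-based position in the entry) and the delay that follows it. */
    static final class Mapped {
        final int block;
        final int delay;

        Mapped(int block, int delay) {
            this.block = block;
            this.delay = delay;
        }

        @Override
        public String toString() {
            return "block " + block + ", delay " + delay;
        }
    }

    /** care[i] == 0 means block i+1 contains only don't cares and is omitted. */
    static List<Mapped> map(long[] care) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < BLOCKS; i++) {
            if (care[i] != 0) {
                kept.add(i + 1);
            }
        }
        List<Mapped> mapped = new ArrayList<>();
        for (int k = 0; k < kept.size(); k++) {
            int here = kept.get(k);
            // Assumption: after the last kept block, pad up to the end of the entry.
            int next = (k + 1 < kept.size()) ? kept.get(k + 1) : BLOCKS + 1;
            mapped.add(new Mapped(here, next - here));
        }
        return mapped;
    }

    public static void main(String[] args) {
        // Block 3 is empty: block 2 gets a delay of 2 so that block 4 is matched in cycle 4.
        long[] care = {-1L, -1L, 0L, -1L, -1L};
        for (Mapped m : map(care)) {
            System.out.println(m);
        }
    }
}
```

Empty blocks at the start of an entry simply do not appear in the result, and an entry that consists only of don't cares maps to an empty list, so no match block is used for it.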

There are a few special cases that need to be looked at separately:

1. Empty entries: if a whole entry consists of don't cares, no match blocks are used to store this entry. Instead the multiplexer to which the entry would have been connected is programmed to output 1 independent of its input.

2. Beginning of entry is empty: if one or more blocks at the beginning of the entry are empty, these blocks do not add any delay.

3. More than two outputs in a line: since there are two multiplexers available per match line, at most two entries can have their output in a line. When a new entry is added that would cause the number of outputs in a match line to become three, the output of this entry is shifted to the next line. This leads to match blocks that are not used.

5.5 Explicit Priority Encoder

As described before, there are two mechanisms for prioritizing: inherent priority and explicit priority. In the latter case, not only the words that are searched are added to the CAM, but also their priority. This way, new words are always added at the end or at empty places, and shifting other entries is not necessary.

Different implementations

Several schemes are possible for implementing explicit priority. The common way to do explicit encoding is by adding an explicit priority field to each CAM word [17]. Each cycle, the system combines the search word with a different priority word. In the first cycle of the search, the system sets the priority word to the highest priority. If a match occurs, the address of the matching entry is returned, else the system combines the search word with the next highest priority. This procedure is repeated until either a match occurs, or the lowest priority is reached. The advantage of this algorithm is that it is easy to implement and no extra hardware design effort is needed. On the other hand, the algorithm is not very efficient and the matching process can take many clock cycles, depending on the number of possible priority values.

Using dynamic reconfiguration, other implementations are possible. One of these possibilities is using a regular priority encoder in combination with a switch box. This switch box routes every output of the CAM to the correct input of the priority encoder, and the configuration of the switch box is controlled by JBits. Although this method is efficient in time, it would consume too much hardware for the CAM size at hand.

This problem can be solved by reducing the number of priority classes. The number of priority classes is defined as the number of explicit priority values that a search word can have. In the case of an inherent priority encoder, this value is equal to the number of entries. By reducing this number, the amount of hardware is reduced, but there is a risk that more priority classes are needed for a certain CAM configuration than are available. To solve this, a combined explicit/inherent priority encoder is proposed, where the priority can be set explicitly for each entry, but in case two entries have the same explicit priority, their priority is determined inherently. This way, entries can be added even when all priority classes have been used, just as with the inherent priority encoder.

Global structure

The global structure of the n-to-log2(n) explicit priority encoder with eight priority classes is given in figure 5-5. Estimating the number of priority classes that is needed is difficult and again requires information about the actual content of the CAM. Eight classes have been chosen, since this number can be mapped efficiently onto the Virtex architecture, as will be demonstrated.

The priority encoder consists of two basic blocks. First there is a priority decoder. The input of this block comes from the match lines and contains 1s at all matching positions. The output is a bit vector with 1s only at those positions that match and have the highest priority. If different priority classes have been used for all overlapping entries (i.e. entries that may match simultaneously), the output of the priority decoder contains no more than one 1. The output of the priority decoder is connected to a regular n-to-log2(n) priority encoder, which is needed to decide the return value when two or more overlapping entries with the same explicit priority match simultaneously.

The priority decoder consists of three parts, as shown in figure 5-5, and works as follows. The n-to-8 switch box connects each of the n input lines to one of 8 priority lines. These lines, each representing a priority class, are connected to an 8-to-3 priority encoder, which checks whether there is a match and, if there is, extracts the value of the highest priority of all the input lines that match. The output decoder propagates the value of each input line only if the priority that this line is set to corresponds to the highest priority; otherwise 0 is propagated at this bit position.

Figure 5-5: Global structure of an n-to-log2(n) explicit priority encoder with eight priority classes (n-to-8 switch box, 8-to-3 priority encoder and output decoder forming the priority decoder, followed by an n-to-log2(n) priority encoder).

To program the explicit priority encoder, the switch box and the output decoder need to be programmed using JBits. The implementation of these blocks is discussed next.
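The data flow through the combined explicit/inherent priority encoder can be followed in the behavioural sketch below. It models the four stages in software only. Two points are assumptions made for this example and are not stated in the text: priority class 0 is taken to be the highest class, and the regular priority encoder is taken to prefer the lowest address, as with inherent priority.

```java
// Sketch: behavioural model of the explicit priority encoder with eight classes.
class ExplicitPriorityEncoder {
    static final int CLASSES = 8;

    /**
     * match[i] is the output of match line i, prioClass[i] its programmed class (0..7).
     * Returns the address of the winning entry, or -1 if nothing matched.
     */
    static int encode(boolean[] match, int[] prioClass) {
        // n-to-8 switch box: every matching input asserts its priority line.
        boolean[] line = new boolean[CLASSES];
        for (int i = 0; i < match.length; i++) {
            if (match[i]) {
                line[prioClass[i]] = true;
            }
        }
        // 8-to-3 priority encoder: highest asserted class (assumed: class 0 is highest).
        int highest = -1;
        for (int c = 0; c < CLASSES; c++) {
            if (line[c]) {
                highest = c;
                break;
            }
        }
        if (highest < 0) {
            return -1;                            // no match at all
        }
        // Output decoder: keep only matching inputs that belong to the highest class.
        boolean[] decoded = new boolean[match.length];
        for (int i = 0; i < match.length; i++) {
            decoded[i] = match[i] && prioClass[i] == highest;
        }
        // Regular n-to-log2(n) priority encoder: inherent tie-break on address.
        for (int i = 0; i < decoded.length; i++) {
            if (decoded[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        boolean[] match = {false, true, true, true};
        int[] prioClass = {0, 3, 1, 1};
        // Entries 2 and 3 share the highest matching class; entry 2 wins inherently.
        System.out.println(encode(match, prioClass));
    }
}
```

The last loop is only reached with more than one candidate when overlapping entries share a class, which is exactly the case the inherent encoder is kept for.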

Switch box

The switch box connects the input lines to the eight priority lines and is built out of several smaller 8-to-8 bits switch boxes. Figure 5-6 shows an example with two of these 8-to-8 bits switch boxes, which together implement a 16-to-8 switch box. The eight priority lines have been implemented using the tri-state lines that are available on Virtex. This way, several input lines may drive a certain priority line at the same time without causing damage. The tri-state lines are connected to a pull-up, meaning that the priority lines are high when nothing is driving them. When an entry matches, the priority line to which the matching entry is connected is driven low. This way a wide NOR function is created without using many hardware resources. Since a selected priority line is driven low, the input lines of the 8-to-3 priority encoder are active low.

Figure 5-6: Schematic view of a 16-to-8 switch box, connected to eight priority lines.

Figure 5-7: Schematic view of the multiplexer, connecting one or more switch box input lines to priority line k.

To connect an input line to a certain priority line, the output lines of the 8-to-8 bits switch boxes each have an 8-input multiplexer at their input. This multiplexer can select one or more switch box inputs to propagate their value to the output line. The multiplexer is implemented using two LUTs, cascaded with carry logic. A schematic view of this configuration is given in figure 5-7. If a certain input line is to be connected to a priority line, the LUT to which the input line is connected should output 0 if this input line is 1. In that case, a 0 is propagated by the carry chain to the output that controls the TBUF.

Output decoder

The output decoder consists merely of LUTs, one for each input bit of the explicit priority encoder. Every LUT has one of these input bits (to check for a match) and the three bits from the 8-to-3 priority encoder (the highest priority of all matching entries) connected to its inputs.

Suppose that a certain LUT in the output decoder is connected to input k. This LUT is then configured such that it outputs 1 if the entry to which input k is connected gives a match and the priority of this entry is equal to the highest priority of all matching entries at that moment.

Method to program the return value

Although it has not been implemented, it is worth mentioning a method to program not only the priority, but also the return value for each entry. Suppose that the inherent priority encoder that is part of the explicit encoder is omitted, and that a maximum of eight overlapping entries is allowed. If these overlapping entries are each assigned a different explicit priority, then only one of the output bits of the output decoder can become 1 at a time. These output bits can then be used to drive a number of tri-state lines, equal to the width of the return value of the CAM. This is shown in figure 5-8 for a return value that is two bits wide. Here there are four input bits, coming from the output decoder. Depending on which input bit is high, the tri-state lines are driven with a different value. The logical value on the input port of each TBUF decides what value is returned when a certain input bit becomes 1, and this can be programmed using JBits.

Figure 5-8: Schematic view of an encoder, driving a different value on the output for each input bit.

This implementation would be significantly smaller, since the TBUFs do not consume any slice logic and the inherent priority encoder is omitted. An implementation of the priority encoder in which overlapping entries can have the same explicit priority, and in which the output can still be programmed for each entry, uses the on-board BlockRAM. The index returned by the regular explicit priority encoder is then used as a memory address to look up the return value.
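For the output decoder described earlier in this section, the configuration of a single LUT can be written down as a 16-entry truth table. The sketch below computes that table for an entry with a given priority class. The ordering of the LUT inputs (match bit on address bit 0, the three priority bits on address bits 1 to 3) is an assumption made for this example, and the class and method names are illustrative; the actual bit order depends on how the LUT inputs are routed.

```java
// Sketch: truth table of one output-decoder LUT (4 inputs, 16 configuration bits).
class OutputDecoderLut {
    /**
     * Bit 'addr' of the result is the LUT output for input combination 'addr',
     * where addr bit 0 = match input k and addr bits 3..1 = highest matching priority.
     */
    static int truthTable(int entryPriority) {
        int table = 0;
        for (int addr = 0; addr < 16; addr++) {
            boolean match = (addr & 1) != 0;
            int highest = addr >> 1;
            if (match && highest == entryPriority) {
                table |= 1 << addr;               // output 1 only for this combination
            }
        }
        return table;
    }

    /** Evaluates a configured LUT for the given inputs. */
    static boolean eval(int table, boolean match, int highest) {
        int addr = (highest << 1) | (match ? 1 : 0);
        return ((table >> addr) & 1) != 0;
    }

    public static void main(String[] args) {
        int table = truthTable(3);                // entry programmed with priority class 3
        System.out.println(Integer.toBinaryString(table));
        System.out.println(eval(table, true, 3)); // match at the highest priority
        System.out.println(eval(table, true, 2)); // match, but another class is highest
    }
}
```

Exactly one of the 16 configuration bits is set, so changing the priority of an entry amounts to rewriting a single LUT.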

5.6 Device Utilization

Fixed Length CAM

Each Virtex slice is able to match 8 bits. The fixed length CAM consists of 128 match lines, each containing 5 blocks matching 64 bits each. The number of slices taken by the match field is then equal to 128 x 5 x 8 = 5120 slices. Since the total number of slices is 12288, 42% of the slices are consumed, not counting the encoder. The flip-flops on the output of each match block have been neglected, because these consume few resources and can be combined with other logic in the same slice.

Variable length CAM

The numbers of match lines, blocks and multiplexers were chosen from a design point of view, without taking into account what the actual format of the entries is. If there are a lot of entries containing only one block after mapping, the number of multiplexers should be increased to minimize the number of unused blocks. Another problem is choosing the size of the priority encoder. If there are many long expressions, a lot of inputs are not used and it is necessary to have a large priority encoder or longer match lines. If there are many short expressions, a smaller encoder can be used. From this it becomes clear that knowledge about the actual content of the CAM is necessary to implement it in an efficient way, and this knowledge is not yet available for IP version 6.

The number of match lines has been chosen to be 128, and the CAM can therefore store up to 256 entries. The variable length CAM has 128 match lines of 4 match blocks each. In the worst case, all entries that are stored have 5 blocks after mapping and only 128 x 4/5 = 102 entries can be stored, but this is very unlikely. Each Virtex slice is able to match 8 bits, so that 8 slices are needed per match block. The CAM consists of 128 x 4 = 512 blocks, so a total of 512 x 8 = 4096 Virtex slices (33%) are consumed. Each shift register consumes one slice, since it is not possible to use the second slice for something else. There are 512 shift registers in the design, meaning a utilization of 4%. The two multiplexers with which each match line selects its outputs have to be mapped in separate slices as well, which is explained elsewhere in this report. These multiplexers therefore consume 2 x 128 = 256 slices (2%). The total slice utilization of the variable length CAM without the encoder is then 39%.

Inherent Priority Encoder

The device utilization of the inherent priority encoder depends on the CAM structure it is used with. The fixed length CAM requires a 128-to-7 priority encoder, while the variable length CAM requires an encoder that is twice as wide. The utilization by the inherent priority encoder has been determined by synthesizing both sizes and examining the mapping report. The results are:

128-to-7 bits: 179 slices (1%).
256-to-8 bits: 709 slices (5%).

Explicit priority encoder

The device utilization of the explicit priority encoder has also been estimated for use with both the fixed and the variable length CAM. The fixed length CAM has 128 outputs, i.e. a 128-to-8 explicit priority encoder is needed. The 128-to-8 bits switch box is divided into 16 8-to-8 switch boxes, each containing 8 input multiplexers that use 1 slice each. The total number of slices for the switch box is then 128 (= 1%). The output decoder consumes 128 LUTs. Normally two LUTs can be placed in one slice, but it is not possible to constrain a LUT to a certain position within a slice. In the case of a carry chain, the order of the LUTs is set implicitly by the direction of the carry chain. However, there is no carry chain here and it is therefore necessary to place the LUTs in separate slices. The output decoder therefore takes 128 slices (= 1%). The utilization by the inherent priority encoder has been determined in the section on the inherent priority encoder above and is equal to 171 slices (= 1%). The total utilization by the 128 bits explicit priority encoder is then 3%, where the slices consumed by the 8-to-3 priority encoder have been neglected. Repeating this calculation for the 256 bits explicit priority encoder leads to a utilization of 9%.

Summary

Table 5-1 gives an estimate of the device utilization for the fixed and variable length CAM, combined with either the inherent or the explicit priority encoder.

                         Inherent priority    Explicit priority
    Fixed length CAM           43 %                 45 %
    Variable length CAM        44 %                 48 %

Table 5-1: Device utilization for the fixed and variable length CAM with either an inherent or an explicit priority encoder.

From this table it follows that the hardware utilization of the fixed length CAM and of the variable length CAM are about equal for these dimensions. Furthermore, the hardware cost of using an explicit priority encoder instead of inherent priority is small, which makes the explicit encoder an interesting option.
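The slice counts of the two match fields can be recomputed with the short sketch below, which only restates the arithmetic of this section (8 slices per 64-bit match block, one slice per shift register, one slice per multiplexer, 12288 slices in total). The class and method names are illustrative. Note that the percentages in the text are obtained by rounding each component before adding them, so the exact total for the variable length CAM (4864 slices, about 39.6%) differs slightly from the quoted 39%.

```java
// Sketch: slice utilization of the match field, without the priority encoder.
class Utilization {
    static final int TOTAL_SLICES = 12288;       // slices in the Virtex 1000
    static final int SLICES_PER_BLOCK = 8;       // 64 bits per block, 8 bits per slice

    /** Fixed length CAM: every match line has the same number of blocks. */
    static int fixedLength(int matchLines, int blocksPerLine) {
        return matchLines * blocksPerLine * SLICES_PER_BLOCK;
    }

    /** Variable length CAM: blocks, one shift register per block, multiplexers per line. */
    static int variableLength(int matchLines, int blocksPerLine, int muxesPerLine) {
        int blocks = matchLines * blocksPerLine;
        int matchSlices = blocks * SLICES_PER_BLOCK;
        int shiftRegisterSlices = blocks;        // one slice per shift register
        int muxSlices = matchLines * muxesPerLine;
        return matchSlices + shiftRegisterSlices + muxSlices;
    }

    static double percent(int slices) {
        return 100.0 * slices / TOTAL_SLICES;
    }

    public static void main(String[] args) {
        int fixed = fixedLength(128, 5);
        int variable = variableLength(128, 4, 2);
        System.out.println("fixed length:    " + fixed + " slices, " + percent(fixed) + " %");
        System.out.println("variable length: " + variable + " slices, " + percent(variable) + " %");
    }
}
```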

Chapter 6: Implementation of the Board Interface

The board interface takes care of the communication between the FPGA, the host and the onboard memory and is situated on the FPGA. This chapter contains a description of the board interface and how it has been implemented, together with simulation results.

6.1 Port Description

In 2.2 a general overview of the board was given, together with a description of the ports that are available for communication between host, memory and FPGA. To use the board as part of the CAM application, these ports were each assigned a function. These functions are summarized in table 6-1, together with the direction of the signals viewed from the FPGA side. Besides these ports, other signals are necessary for controlling the memory, control register and status register. The respective ports are given in appendix A. A detailed description of these signals, together with their pin locations, can be found in [8]. A schematic view of the board interface and its connections to the parts of the system that are not inside the FPGA is also given in appendix A.

Port       Direction   Function          No. of bits
clk        In          System clock      1
Reset      In          Reset CAM         1
Ctrl_Reg   In          Start Matching    8
Stat_Reg   Out         Matching Ready    8
Data 0     In          Indata [31:0]     32
Data 1     In          Indata [63:32]    32
Data 2     Out         Match, Address    32

Table 6-1: Function assignments for FPGA ports

6.2 VHDL Description

The VHDL description of the board interface is given in appendix B. The board interface is a finite state machine (FSM) that repeatedly reads data from memory to the CAM and writes the result from the CAM to memory. The behaviour of the board interface has been simulated and the result is given in figure 6-1. First a reset is applied, which initializes the control signals and brings the FSM into state IDLE. Then value 1 is written to the control register, which is interpreted as start.
The board interface sends a memory request for bank 0, bank 1 and bank 2 to the onboard memory arbiter and waits until all banks have been granted. Next the board interface starts reading from bank 0 and bank 1. After 7 clock cycles (5 for processing by the CAM and 2 for latency due to the registers before and after the CAM) the result is written to bank 2. From this moment Indata is read every clock cycle and the result is written every 5 clock cycles until Addr0 is greater than Buffer_Size. For debugging purposes, LEDs are turned on and off depending on the state of the FSM.

Figure 6-1: Simulation results for the board interface (events marked in the waveform: reset, start, requests memory, memory granted, starts reading, starts writing)

Chapter 7: Hardware Implementation of the CAM

In this chapter, the VHDL implementations of the fixed length CAM, the variable length CAM and the explicit priority encoder are discussed. This chapter is meant to show how the designs have been described in a structural style, using the hardware primitives available for Virtex. This type of VHDL description is necessary in applications that use dynamic reconfiguration, since full control over the implementation of specific parts of the design is required.

7.1 Implementation of the Fixed Length CAM

The structure of the VHDL description of the fixed length CAM is shown in figure 7-1. It shows the entities that are used and how they relate to each other.

Figure 7-1: Structure of the VHDL description of the fixed length CAM (entities: CAM, Encoder, Match_Line, Register_64, Stitcher, Match_Block, DecLut; Virtex primitives: FD, MUXCY_L, SRL16)

Entity DecLut defines the LUTs that are part of the match lines and store the actual entries. Encoder is the priority encoder.

Virtex Primitives

The match lines are built entirely out of Virtex primitives (structural VHDL). The following VHDL primitives from the Virtex library have been instantiated:

    component FD -- D flip flop
      port ( Q : out std_logic;
             D : in  std_logic;
             C : in  std_logic );
    end component;

    component MUXCY_L -- 2-to-1 mux
      port ( LO : out std_logic;
             CI : in  std_logic;
             DI : in  std_logic;
             S  : in  std_logic );
    end component;

    component SRL16 -- 16 bits shift register
      port ( Q   : out std_logic;
             A0  : in  std_logic;
             A1  : in  std_logic;
             A2  : in  std_logic;
             A3  : in  std_logic;
             D   : in  std_logic;
             CLK : in  std_logic );
    end component;

When a LUT with some logical function is instantiated in VHDL, the Xilinx place and route (PAR) tools may swap its four inputs and change the stored logical function accordingly. This is done for optimization reasons and the way the inputs are swapped is not easily predictable. Normally this is not a problem, but when JBits is used to change the content of the LUT, one needs to know exactly how the inputs are mapped on the LUT primitive. To solve this, shift register SRL16 was instantiated instead of a LUT. This way, PAR cannot swap the inputs (or else the behaviour would change). A more detailed description of primitive SRL16 is given in section 7.2. After place and route, the shift registers can be transformed to LUTs using JBits. This is an easy process, since a shift register is in principle a LUT configured in a special way. This transformation is discussed in chapter 9.

Stitcher

From figure 7-1 it follows that a match line does not only contain D flip flops and match blocks, but also an entity called stitcher. This entity has been instantiated because the place and route tool is not able to place a register between two blocks automatically. More precisely, the tool is not able to connect the output of a flip flop directly to a carry chain that is situated near the flip flop. The stitcher routes the output of the flip flop manually to the carry chain. This is shown in figure 7-2.

Figure 7-2: Simplified schematics showing the stitcher function

Register_64

Signal Indata (see figure 5-1) is the 64-bits input data of the CAM, which is connected to all the match blocks in the design. The fixed length CAM has a total of 128x5 = 640 match blocks, meaning that the Indata net has a very high fanout and is spread over the total CAM area. This leads to low performance. Indata comes from two 32-bits registers placed between the actual CAM and the board memory (Data0_Reg and Data1_Reg in figure A-1). By replicating these registers, the fanout is decreased. Replication is done automatically by the synthesis tools, but this does not lead to good results: although the fanout of Indata decreases, each replica is still connected to match blocks spread over the entire CAM area. To solve this, Data0_Reg and Data1_Reg have been implemented as a 64-bits register (entity Register_64) which is replicated manually. This is described in the VHDL code of the CAM structure, but since it is functionally part of the board interface, it is shown in figure A-1 as well. Replication is done by instantiating 16 such registers, where every register is connected to the match blocks contained in eight consecutive match lines. This way, the fanout of Indata is decreased to 40, and by placing each Register_64 near the eight match lines that it is connected to, long routes are prevented.

7.2 Implementation of the Variable Length CAM

The structure of the VHDL code of the variable length CAM is much like that of the fixed length CAM and is given in figure 7-3. The main differences are that two multiplexers were added for each match line (entity Mux4) and that flip flops were replaced by shift registers.

Figure 7-3: Structure of the VHDL description of the variable length CAM (entities: CAM, Encoder, Register_64, Stitcher, Match_Line, Match_Block, DecLut, Mux4; Virtex primitives: MUXCY_L, SRL16)

Shift Register

In the fixed length CAM, primitive SRL16 was used to implement a LUT whose inputs are not swapped. In the variable length CAM this primitive is instantiated to be used as a shift register as well. A description of SRL16 is given below:

SRL16 has address inputs A0, A1, A2 and A3, data input D and output Q. The data (D) is loaded into the first bit of the shift register, and during subsequent Low-to-High clock transitions data is shifted to the next bit position as new data is loaded. The data appears on the Q output when the shift register length determined by the address inputs is reached. The length of the shift register can be changed dynamically and is equal to: (8*A3) + (4*A2) + (2*A1) + A0 + 1. In the VHDL code, the length is initialized at 1 by driving 0 on all address inputs. To change the length during operation, some address inputs need to drive 1, and this is done by disconnecting these inputs. This way, these inputs are not driven and a pull up causes them to become High.

Multiplexer

To implement the multiplexers that connect the entries to the priority encoder, a 4-input look up table is used. Since this LUT is changed by JBits, it is again important that the inputs are not swapped. Therefore SRL16 was instantiated, which is converted to a LUT in JBits.

Stitcher

An extra stitcher was added at the beginning of each match line. In the fixed CAM design, the carry was always set to High at the start of a new match line. This signal was generated inside the first LUT of the match line. In the variable CAM design, a stitcher is needed to be able to connect to the carry signal from the previous match line.

Switch

As mentioned in chapter 5, each match block has a switch to connect its carry chain to either logical High or the output of the previous block. To switch between these two states, a multiplexer is used that is available in the Virtex carry logic and connects the carry chain to either input signal BX or Cin (see figure 3-2). This multiplexer is controlled by the configuration memory and can therefore be changed using JBits. In the VHDL code, the multiplexer is configured to connect to Cin. To switch to a logical High, BX has to be driven High, which is done as follows: when a shift register is instantiated in VHDL, input signal D comes in via port BX. As SRL16 is only instantiated to implement a LUT, signal D is not used and can be set to High.

7.3 Implementation of the Priority Encoder

Explicit priority encoder

The structure of the VHDL description of the explicit priority encoder is given in figure 7-4, showing all entities and Virtex primitives that have been instantiated. The two priority encoders have been described in a behavioural style, while the switch box and the output decoder were implemented in a structural way, since these need to be configured by JBits. Entity SwitchBox_8x8 uses primitive BUFT, which refers to a tri-state buffer on Virtex.

Figure 7-4: Structure of the VHDL description of the explicit priority encoder (entities: ExplicitEncoder, PriorityEncoder, SwitchBox, PriorityEncoder_8x3, OutputDecoder, SwitchBox_8x8; Virtex primitives: SRL16, MUXCY_L, BUFT)

Inherent priority encoder

The inherent priority encoder is described in behavioural VHDL. The corresponding VHDL code is given below.

    LIBRARY ieee;
    USE ieee.std_logic_1164.ALL;
    USE ieee.std_logic_arith.ALL;  -- provides conv_std_logic_vector

    ENTITY Encoder IS
      GENERIC( Width : integer := 7;
               Size  : integer := 128);
      PORT( Input  : IN  std_logic_vector (Size-1 DOWNTO 0);
            Output : OUT std_logic_vector (Width-1 DOWNTO 0);
            Match  : OUT std_logic);
    END Encoder;

    ARCHITECTURE Behave OF Encoder IS
    BEGIN
      PROCESS(Input)
        VARIABLE temp_output : std_logic_vector(Width-1 DOWNTO 0);
        VARIABLE temp_match  : std_logic;
      BEGIN
        -- Start from "no match" on every evaluation of the process
        temp_output := (OTHERS => '0');
        temp_match  := '0';
        -- Scan from the highest to the lowest index; a later hit
        -- overwrites an earlier one, so the lowest matching index wins
        FOR i IN (Size-1) DOWNTO 0 LOOP
          IF (Input(i) = '1') THEN
            temp_output := conv_std_logic_vector(i, Width);
            temp_match  := '1';
          END IF;
        END LOOP;
        Output <= temp_output;
        Match  <= temp_match;
      END PROCESS;
    END Behave;
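The behaviour of the encoder listed above can be checked quickly with a software model. The following Python sketch is not part of the design; it only mirrors the loop of the VHDL process so that the priority rule (the lowest matching index wins) becomes visible. The function name `priority_encode` is chosen for this example.

```python
def priority_encode(match_lines, width=7):
    """Model of the behavioural encoder: returns (match, address bits).

    The indices are visited from high to low and every hit overwrites
    the previous one, so the lowest matching index determines the output.
    """
    address, match = 0, 0
    for i in range(len(match_lines) - 1, -1, -1):
        if match_lines[i] == 1:
            address, match = i, 1
    return match, format(address, f"0{width}b")


if __name__ == "__main__":
    lines = [0] * 128
    lines[9] = 1    # two entries match at the same time
    lines[77] = 1
    print(priority_encode(lines))      # address of line 9 is reported
    print(priority_encode([0] * 128))  # no match: match bit stays 0
```

With match lines 9 and 77 active, the model reports line 9, which is the ordering that the inherent priority mechanism relies on.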

Chapter 8: Synthesis and Place & Route

In this chapter synthesis and place & route (PAR) of three CAM implementations are discussed:

1. Fixed length CAM with inherent priority
2. Variable length CAM with inherent priority
3. Variable length CAM with explicit priority

8.1 describes the method for synthesis and PAR. 8.2 gives the physical structure of these three CAM implementations in terms of CLB locations. 8.3, 8.4 and 8.5 give the results for the three implementations and these results are summarized in 8.6.

8.1 Method

Synthesis

During synthesis, the VHDL description is mapped to Virtex primitives. These primitives are those described in 7.1.1, together with LUTs, IOBs, input buffers and clock buffers. When mapping the design on Virtex primitives, Synplify also performs several optimizations. In most cases these optimizations are useful, but when synthesizing the CAM structure, two optimizations have to be avoided:

1. Synplify recognizes signal Ctrl_Ack of the board interface as a clock, since this port is edge sensitive, and adds a clock buffer to this signal. Since only four dedicated pins have this buffer and the Ctrl_ACK pin is not one of them, this clock buffer has to be removed by the Synplify constraint:

    define_attribute {Ctrl_ACK} syn_noclockbuf {1}

2. Another optimization is omitting redundant logic. The fixed length CAM consists of 128 match lines that are equal and have the same input signals, since the content is written in a later stage by JBits. Synplify therefore removes all match lines except one. To prevent Synplify from doing this, a syn_keep attribute must be added in the VHDL code to signal Indata for every match line:

    column : FOR i IN Size-1 DOWNTO 0 GENERATE
      SIGNAL Temp : std_logic_vector(63 DOWNTO 0);
      ATTRIBUTE syn_keep OF Temp : SIGNAL IS TRUE;
    BEGIN
      Temp <= InData;
      Match_Line : Match_Line
        PORT MAP(InData => Temp, Match => Dec_Out(i), clk => clk);
    END GENERATE column;

This allows Synplify to optimize and omit redundant logic within a match line, but the optimization does not extend to other match lines, so that all match lines are optimized individually and none is optimized away.

Place and Route

During PAR, the Virtex primitives are placed on the FPGA array and connected together. Usually it is up to the tool to decide where components are placed, but in this case placement constraints are applied for three reasons:

1. The CAM is programmed by letting JBits change the content of the LUTs. In the variable length implementation the length of the shift registers (SRLs) and the output multiplexers need to be programmed as well. For JBits to do this, it needs to know exactly which LUTs and SRLs to write to and where they are placed in the FPGA array. This requires location constraints on the LUTs and SRLs that are part of the match lines.

2. The pin locations of the ports that are described in chapter 6 were decided by the board manufacturer. The PAR tool needs to be aware of these locations, so that the IOBs are placed on the right locations. This requires pin-location constraints on the ports.

3. The CAM has a very regular structure, consisting of vertically placed carry chains. This leads to a design that can be placed in a very compact way, with a high utilization density. Since the PAR tool is not aware of this system knowledge, better performance is reached with manual floorplanning. One of the main things to be placed manually are the two LUTs (primitive SRL16) and the two multiplexers that they control (primitive MUXCY_L), which have to end up in the same slice.

These constraints are applied to the design via a User Constraints File (UCF). A description of the UCF syntax can be found in [19]. Since all components are to be placed individually, it is infeasible to write the constraints manually. For this reason a C program was written that generates the constraints. Absolute CLB locations were used to place all components.
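To give an impression of what such a constraint generator does, a minimal sketch is shown below. It is written in Python rather than C and is not the program used in this work. The instance names, the row numbering and the function name `match_block_constraints` are hypothetical, and the `INST "..." LOC = CLB_RrCc.Sn;` form is assumed here to be the UCF notation for an absolute slice location; the syntax reference in [19] is authoritative.

```python
def match_block_constraints(line, block, clb_col, slice_name, first_row,
                            slices_per_block=8):
    """Return UCF LOC constraints for the SRL16s of one match block.

    Instance names and row numbering are hypothetical; a real design
    must use the hierarchical names produced by the synthesis tool.
    """
    constraints = []
    for k in range(slices_per_block):
        # One slice per CLB row, stacked vertically along the carry chain
        loc = f"CLB_R{first_row + k}C{clb_col}.{slice_name}"
        for lut in ("lut0", "lut1"):  # the two SRL16s that share one slice
            inst = f"cam/line{line}/block{block}/slice{k}/{lut}"
            constraints.append(f'INST "{inst}" LOC = {loc};')
    return constraints


if __name__ == "__main__":
    for text in match_block_constraints(0, 0, 8, "S1", 14)[:4]:
        print(text)
```

Both SRL16s of a slice receive the same location, in line with the observation that a LUT can be constrained to a slice but not to a position within it.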
It is possible to use relative locations as well, where the location of each component is expressed as its position relative to some origin that can be situated anywhere on the FPGA (so called Relationally Placed Macros). This is a convenient way to constrain components, since the whole design can be moved by simply changing the location of the origin, instead of the location of all individual components. It turned out, though, that the Xilinx tools generate an error when two multiplexers and two shift registers are placed in one slice this way.

Timing constraints

To increase the performance of the design, timing constraints were used. These constraints are passed to the synthesis tools and specified in the UCF for the PAR tools. They can limit the delay on some critical nets. First the synthesis tools minimize the logic levels of the design in order to meet the timing constraints, for example by logic replication. Then the PAR tools try to place and route components in such a way that routing delays are minimized, until the timing constraints are met. In the CAM implementations, the minimum clock frequency at which the designs should operate was constrained. For optimal results, the desired clock frequency was chosen just above the value that could be met by the tools.

8.2 Physical Structure

Fixed length CAM

The physical structure of the fixed length CAM on the CLB array is shown in figure 8-1. The CAM is placed as a rectangle in the middle of the FPGA. Locations in the array are denoted by the coordinate of the CLB (CLB col, CLB row) and the slice within the CLB (slice), which is one of the two available slices S0 and S1. Every match block is mapped on 8 slices, and a vertical space of 2 slices is left between two blocks for a flip flop followed by a stitcher. To allow the Xilinx tools to place logic within the CAM structure, vertical columns have been left empty. This leads to better timing and faster PAR run times. In this implementation one CLB column is left empty every 8 match lines.

Figure 8-1: CLB locations of the various components in the match field of the fixed length CAM

A block that is 8 CLBs high has been reserved for the registers that connect the 64-bits input to the various match blocks. Only the coordinates of these 64-bits registers and the match blocks have been constrained. The positions of the stitchers, the register at the output of each match block and the priority encoder are decided by the place & route tool.

Variable length CAM

The physical placement of the variable length CAM is given in figure 8-2. The CAM is again placed as a rectangle in the middle of the FPGA. In the fixed length CAM implementation two slices were needed to place the stitcher and flip flop between each block. In the variable length implementation, the flip flops were replaced by shift registers, whose inputs can be connected to a carry chain within the same slice. Therefore only one slice was reserved between each match block. The two line multiplexers are placed in separate slices. The reason for this is that when they are placed in one slice, it is not clear which multiplexer is placed in which LUT. Placing them in separate slices makes it possible to configure both LUTs in each slice with the same value, without knowing which LUT the multiplexer is mapped on. All components were constrained. Again one CLB column was left empty per 8 match lines, to let the place & route tools place logic and to supply extra routing resources, and an area of 8 CLBs high was reserved for the 64-bits registers that connect the input of the CAM to all match blocks.

Figure 8-2: CLB locations of the various components in the match field of the variable length CAM

Explicit priority encoder

The physical structure of the explicit priority encoder is shown in figure 8-3. As mentioned before, the explicit priority encoder is used together with the variable length CAM, whose structure is left unchanged. The priority encoder should be constrained in such a way that there is no conflict between the two designs. It is placed above the variable length match field and the two are aligned horizontally. The priority multiplexers are part of the switch box and each connect one or more inputs to one of the eight priority lines. These multiplexers are numbered between 0 and 7, referring to the priority lines their outputs are connected to. 0 means highest priority, 7 lowest.

Figure 8-3: CLB locations of the various components of the explicit priority encoder.

Output decoder LUT 0 and 1 are part of the output decoder, and since the variable length CAM has two outputs per slice column, two of these LUTs are needed per column as well. The LUTs have been placed in separate slices, so that JBits is able to distinguish between them (see 8.2.2).

8.3 Results of the Fixed Length CAM

FPGA editor view of the CAM

After place and route, the design can be made visible via FPGA editor. This Xilinx tool gives a view of the CLB array, with all placed components and routes. This tool also makes the slice internals and the configuration of the LUTs visible. In appendix C, figure C-1, the FPGA editor view of the entire fixed length CAM structure including the board interface is given. It shows the densely routed match field in the middle and the IOBs on the edges of the FPGA. The priority encoder is placed above the match field. In figure 8-4 the FPGA editor view of the internals of a slice is given. This slice is part of the carry chain of a match block.
It shows the two LUTs configured as shift registers, the two multiplexers controlled by the LUTs and the carry routes. In this slice, the two available registers have not been utilized, and this is the case for all slices that are part of a carry chain. The reason for this is that the input of such a register is controlled by the output of the LUT in the same logic cell. Since this LUT is used to control the carry logic, its output cannot be connected to the flip flop as well. Another problem is that the output of a flip flop cannot connect to a carry chain within the same slice, or to the Cin of adjacent slices. Therefore it was necessary to reserve two slices between two match blocks: one for the stitcher and one for the register. The multiplexer denoted with switch is used in the variable length CAM implementation and is discussed in 8.4.

Figure 8-4: FPGA editor view showing internals of a slice that is part of the carry chain of a match block (labels in the figure: switch, SR-line (9.3.4))

Device utilization

Table 8-1 gives a summary of the FPGA resource utilization of the whole design, including board interface. The data was taken from the mapping report.

Resource          Resources used   Resources available   Utilization
Slices            7,084            12,288                58%
Flip Flops        1,871            24,576                8% (a)
LUTs                               24,576 (b)            2%
Shift registers   10,240           24,576 (b)            42%
IOBs

Table 8-1: FPGA resource utilization of fixed length CAM.

a. From this data it follows that only 8% of the flip flops is used. It should be noted, though, that not all flip flops that are unused now can actually be used. Only flip flops situated in slices that are not part of a carry chain can be instantiated, which makes the effective utilization 8+42 = 50%.

b. This number is misleading, since both LUTs and shift registers are mapped on the same resource. The LUTs and shift registers therefore have to be added together to obtain the total use of this resource.

Table 8-2 shows the device utilization, where distinction is made between the board interface, the match field and the priority encoder. In this table the LUTs and shift registers were merged, since they represent the same physical resource.

Resource     Board Interface   Encoder   Match Field
Slices       6 %               2 %       49 %
Flip Flops   5 %               0 %       3 %
LUTs         1 %               1 %       42 %

Table 8-2: FPGA resource utilization per component for the fixed length CAM.

Timing Analysis

To find the critical path in the design, an advanced design analysis has been performed by the Xilinx tools. Below a fragment of the resulting timing report is given:

    Delay: ns match_vector(9) to data2(1)
    ns Total path delay (26.538ns delay plus 1.499ns setup)
    0.141ns clock skew

The critical path is 28.2 ns and runs through the priority encoder, from the output of match line 9 to the output register that is connected to memory bank 2. The fixed length CAM can operate at a maximum frequency of 35.4 MHz. Since a complete match takes 5 clock cycles, the CAM is able to perform 35.4/5 = 7.1 Mlookups/s.

8.4 Results of the Variable Length CAM

FPGA editor view of CAM

The FPGA editor view of the whole variable length CAM is shown in figure C-2 in the appendix. The match field is smaller than in the fixed length CAM and the priority encoder is clearly visible as a dense area on top of the match field. Since the match blocks are implemented the same way as in the fixed length CAM, the slice internals of the carry chain are similar to figure 8-4. In this figure the multiplexer denoted with switch switches between Cin and BX and is controlled by JBits to either start a new carry chain or to propagate the carry of the previous match block.

Device Utilization

Table 8-3 summarizes the resource utilization for the board interface, encoder and match field of the variable length CAM.

Resource     Board Interface   Encoder   Match Field
Slices       6 %               7 %       40 %
Flip Flops   5 %               0 %       0 %
LUTs         1 %               4 %       36 %

Table 8-3: FPGA resource utilization per component for the variable length CAM with inherent priority.

The match field consumes fewer resources than in the fixed length CAM, because of the decreased number of match blocks.

Timing Analysis

An advanced timing analysis has been performed and the critical path is equal to 52.3 ns. The critical path again runs through the priority encoder and is longer than in the fixed length CAM. This is caused by the increased size of the encoder. The maximum look up rate of the variable length CAM with inherent priority then becomes 3.8 Mlookups/s at a clock frequency of 19.1 MHz.

8.5 Results of the Variable Length CAM with Explicit Priority

FPGA editor view of the CAM

The FPGA editor view of the variable length CAM together with the explicit priority encoder is given in figure C-3 in the appendix.

Device utilization

Table 8-4 summarizes the resource utilization for the board interface, encoder and match field of the variable length CAM with explicit priority.

Resource     Board Interface   Encoder   Match Field
Slices       6 %               9 %       40 %
Flip Flops   5 %               0 %       0 %
LUTs         1 %               6 %       36 %

Table 8-4: FPGA resource utilization per component for the variable length CAM with explicit priority.

Comparing the utilization of this implementation with that of the variable length CAM with inherent priority shows no significant differences. The extra logic needed for explicit priority is only 2 % and therefore does not add significant hardware costs.

Timing analysis

An advanced timing analysis has been performed and the critical path is equal to 58.0 ns. The critical path again runs through the priority encoder, but is somewhat longer than in the variable length CAM with inherent priority. This increase is caused by the extra propagation delay in the logic that was added in the explicit priority encoder. The maximum look up rate of the variable length CAM with explicit priority then becomes 3.4 Mlookups/s at a clock frequency of 17.2 MHz.

8.6 Summary

Table 8-5 summarizes the implementation results of the three CAM implementations and of another design that has been implemented for comparison. This design is a variable length CAM that outputs only one bit that tells if a match occurred, but does not return the matching address. Such a CAM is used for IP filtering, where this bit decides whether an IP packet is to be forwarded or not. Only the hardware part of this design has been implemented, to compare speed and utilization in the absence of the priority encoder.

Table 8-5: Speed and utilization of different CAM implementations.
Columns: CAM type; No. of 64 bits match blocks; Device utilization [% slices] (a); Max. clock frequency [MHz]; Max. look up rate [Mlookups/s].
Rows: Fixed length, inherent priority; Variable length, inherent priority; Variable length, explicit priority; Variable length, match returned.
a. The device utilization is given for the CAMs without the board interface.

The size of the CAMs is given as the number of match blocks, since this is an indication of how many CAM words of a certain length can be stored. The fixed length CAM can store a total of 128 words of 320 bits. The variable length CAM stores at least 102 words of 320 bits, but by reducing the CAM words as described before, it is able to store a maximum of 256 words of 128 bits each. The variable length CAM is slower than the fixed length CAM. This is the result of the difference in size of the priority encoder that each implementation contains.
The performance of the variable length CAM with explicit priority is lower than that of the same CAM with inherent priority. This is caused by the fact that the critical path runs through the priority encoder, which contains more logic levels in the explicit case. The variable length CAM that only returns a match bit is significantly faster than the other designs, since the priority encoder that is responsible for the critical path in the other designs has been omitted. From this it can be concluded that the priority encoder significantly limits the performance of the whole CAM and that a performance increase of more than 200% is reached by leaving it out.
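The relation between critical path, clock frequency and look up rate used in 8.3 to 8.5 can be written out as a short calculation. The Python sketch below only restates the numbers given in those sections; the function names are chosen for this example, and the 5 clock cycles per look up are taken from the timing analysis of the fixed length CAM and assumed to hold for the variable length designs as well.

```python
CYCLES_PER_LOOKUP = 5  # a complete match takes 5 clock cycles (see 8.3)


def max_clock_mhz(critical_path_ns: float) -> float:
    """Highest clock frequency that a given critical path allows."""
    return 1000.0 / critical_path_ns


def lookup_rate(clock_mhz: float) -> float:
    """Look up rate in Mlookups/s at a given clock frequency."""
    return clock_mhz / CYCLES_PER_LOOKUP


# (critical path in ns, reported clock frequency in MHz) per implementation
designs = {
    "fixed length, inherent priority": (28.2, 35.4),
    "variable length, inherent priority": (52.3, 19.1),
    "variable length, explicit priority": (58.0, 17.2),
}

for name, (path_ns, clock_mhz) in designs.items():
    print(f"{name}: path limit {max_clock_mhz(path_ns):.2f} MHz, "
          f"{lookup_rate(clock_mhz):.1f} Mlookups/s at {clock_mhz} MHz")
```

Because every look up costs a fixed number of clock cycles, any reduction of the critical path translates directly into a proportionally higher look up rate, which is why removing the priority encoder has such a large effect.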

Chapter 9: Software Implementation

In this chapter the implementation of the software part of the CAM is discussed. This is done in three parts. The first part gives a short description of the Java user application and the Graphical User Interface (GUI) of the three CAM implementations. The second part gives a description of the hardware interface that has been written to let the Java application communicate with the board. The last part gives a description of how JBits is integrated in the program. This chapter is not meant to give a detailed description of all the software functions, but is focused on the communication between the hardware and the software.

9.1 JAVA User Application

General description

The JAVA user application forms the interface between user and CAM. It is used to edit the contents of the CAM and to test its functionality. The program depends on the following main classes:

1. Sun's Swing library for implementing graphics.
2. Class esl, which is the JAVA interface to the board. This interface is discussed in 9.2.
3. The Xilinx JBits library, used to manipulate the FPGA bitstream. This is discussed in 9.3.

A good introduction to the JAVA language as well as a language reference can be found in [20].

GUI of the fixed length CAM

The fixed length CAM user application gives access to all 128 entries of the CAM. Changing and adding entries can be done from the program itself, or by reading from a configuration file. The configuration of the CAM and the way it has been mapped on the LUTs is shown graphically. Testing is done by reading packets from a file and processing these by the hardware, where the result is written to another file. A full description of the GUI, including all functionality and menus, is given in appendix D.

GUI of the variable length CAM

The GUI of the variable length CAM differs from that of the fixed length CAM in that more information is given to the user about the configuration of the CAM. Not only the contents of the LUTs, but also the state of the switches, the multiplexers and the shift registers is shown. The configuration cannot be changed from within the program as in the fixed length CAM; instead a configuration file should be read. A full description of the GUI of the variable length CAM is given in appendix D.

GUI of the variable length CAM with explicit priority

In the user applications of the CAMs that use the inherent priority mechanism, the user was responsible for adding entries in the right order; there was no support for automatically adding a single entry at the correct location depending on its priority. When using the explicit priority mechanism, not only the order but also the priority values of the CAM words need to be controlled in order to add a new entry while changing the locations of other entries as little as possible. This process has been automated in this implementation. The user simply gives a list of CAM words that need to be added, together with their priority, which can be any number. The Java application then maps every entry on the CAM structure while changing the locations of as few other entries as possible. This feature is part of preprocessing and is described in 9.4.

The GUI of the variable length CAM with explicit priority has been changed such that the program is controlled by a command file. This file may contain commands for adding/deleting entries, searching for test packets and several other options. The GUI and the format of the command file are described in appendix D.

9.2 Hardware Interface

Native interface

Java, as the language is defined, is hardware independent. While this is a great benefit to most of its users, it provides no mechanism for interfacing to either hardware or non-Java code. In this case, we would like the Java user interface to communicate with the board containing the FPGA. The board came with device drivers and a C library, PP1000, that implements all necessary functions for communication. To use this library in Java, a native interface was implemented.
More details about this native interface and how it has been implemented can be found in appendix E.

Remote interface

Java is very suitable for the implementation of distributed systems and has an extensive library of routines for coping with TCP/IP protocols. This makes it possible to implement a remote hardware interface such that the board can be accessed from any computer connected to the host computer via a network. A remote hardware interface has been written using the Remote Method Invocation (RMI) mechanism, such that the Java user application can be used on any computer. This gives the freedom to communicate with the board not only from other computers, but also from other platforms such as UNIX. A description of this remote hardware interface is given in appendix F.

9.3 JBits Integration

Components, resources and values

Components are those parts of the CAM that can be configured by JBits. In the fixed length CAM, these are the LUTs in the match blocks, which are written with the content of the entries.

The variable length CAM has four or six components, depending on the priority mechanism that is used. For inherent priority these are the LUTs in the match blocks, the multiplexers used to connect match blocks to the priority encoder, the switches and the shift registers. For explicit priority, the encoder needs to be configured as well. The components contained in the explicit priority encoder are the 8-input multiplexers in the switch box and the LUTs in the output decoder.

Every component is characterized by the FPGA resource that it uses. In chapter 4 it was mentioned that changing the configuration of a component is done by writing a value to its resource:

set(int row, int column, int[][] resource, int[] bits);

An overview of the resource and possible values of all components used in the three CAM implementations is given below. Since the multiplexers in the variable length CAM and all components in the explicit priority encoder are implemented in a LUT, these are not discussed separately.

1. LUTs
Resource: LUT.SLICE<slice number>_<lut>
where slice number denotes one of slices 0 and 1 and lut denotes one of LUTs F and G.
Value: Init
The type of Init is integer and its format is explained below. A LUT implements a 4-to-1 boolean function that can be represented by a truth table of length 16. To represent this boolean function with a single value, the 16 output bits of the truth table are combined in a bit vector, inverted and converted to an integer.
Example: the output column of the truth table of a 4-input OR-gate is 1111111111111110 (the only 0 is for input combination 0000). To program a LUT as an OR-gate, these bits are inverted and converted to an integer. Init then becomes 1. The LUT can be configured as a four-input multiplexer this way, as needed in the variable length CAM.

2.
Switches
The resource that the switches are mapped on is the multiplexer that propagates either the carry of the previous match block (C_in) or signal BX:
Resource: S<slice number>control.cin.cin
Value: S<slice number>control.cin (closed state)
       S<slice number>control.bx (open state)
where slice number denotes one of slices 0 and 1.

3. Shift registers
The resources that decide the length of the shift registers are the four inputs A0-A3 of SRL16:
Resource: S<slice number><lut><input>.s<slice number><lut><input>
Value: S<slice number><lut><input>.off (input is driven High)
where slice number denotes one of slices 0 and 1, lut denotes one of LUTs F and G and input denotes one of inputs 1-4. To drive an input Low, the input has to be driven by the original route as decided by the place & route tools. This route can be read during initialization using method JBits.get(int row, int column, int[][] resource);

Describing the CAM structure

To reconfigure the FPGA, JBits needs to be aware of the physical structure of the CAM. This layout is defined in a separate class CAMConstants, which is given in appendix G for the variable length CAM with explicit priority. It contains several constants and vectors describing the number of match lines and match blocks and the relative physical CLB locations of the different components. For example:

int[] SwitchOffset = {0, 9, 18, 27};

in the variable length CAM gives the relative y-locations of the switches in a match line, starting at the least significant match block. The coordinates of the origin of the CAM structure are also defined in this class so that the whole CAM structure can be moved easily.

Initialization

The first action that is performed by JBits is reading the bitstream of the CAM structure, described in chapter 6. This is done with the command:

jbits.read(userconstants.infilename);

If reading has been successful, JBits starts building the initial CAM structure. For every component in the design, a class is instantiated that contains the CLB location of that component, its FPGA resource and value. In the fixed length design, these are only LUTs, but in the variable length CAM, switches, multiplexers and shift registers are instantiated as well.
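Two pieces of arithmetic from this section can be made concrete in a short sketch: the Init value of a LUT, and the absolute CLB rows obtained from relative offsets such as SwitchOffset. The class and method names below (ComponentMath, lutInit, absoluteRows) are hypothetical and are not part of JBits or of the CAM software. The sketch assumes that the truth table output for input combination 0000 is the least significant bit of the bit vector, which is consistent with the OR-gate example giving Init = 1, and that an absolute location is simply the origin plus the relative offset.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of JBits: illustrates the arithmetic only. */
final class ComponentMath {

    /**
     * Computes a LUT Init value from a 16-entry truth table.
     * Assumption: outputs[i] is the LUT output for input value i,
     * so input 0000 maps to the least significant bit.
     */
    static int lutInit(boolean[] outputs) {
        if (outputs.length != 16) {
            throw new IllegalArgumentException("a 4-input LUT has 16 truth table entries");
        }
        int vector = 0;
        for (int i = 0; i < 16; i++) {
            if (outputs[i]) {
                vector |= 1 << i;          // collect the output column in a bit vector
            }
        }
        return ~vector & 0xFFFF;           // invert and keep the 16 relevant bits
    }

    /** Assumption: absolute row = row of the CAM origin + relative offset. */
    static int[] absoluteRows(int originRow, int[] relativeOffsets) {
        int[] rows = new int[relativeOffsets.length];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = originRow + relativeOffsets[i];
        }
        return rows;
    }

    public static void main(String[] args) {
        boolean[] or4 = new boolean[16];
        Arrays.fill(or4, true);
        or4[0] = false;                    // only input 0000 yields 0
        System.out.println(lutInit(or4));  // Init of a 4-input OR gate
        // Origin row 10 is an arbitrary illustrative value.
        System.out.println(Arrays.toString(absoluteRows(10, new int[] {0, 9, 18, 27})));
    }
}
```

Under the same assumed encoding, a LUT that outputs 0 for every input, as used for an empty CAM, would receive Init 0xFFFF.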
Example: the variable length CAM has a class called switch, of which a fragment is shown below:

class switch {
    ...
    public int[][] Resource;

    public int[] Open;
    public int[] Closed;
    public int CLBy;
    public int CLBx;
    public int State;
    public int OldState;
    public boolean Changed;
    public JBits JBits;
}

Variable State is either 0 (open) or 1 (closed), and OldState is used to determine whether the state of the switch has changed. Resource denotes the FPGA resource of the switch, and Open and Closed denote the values that should be written to Resource to change the state of the switch.

The last step is initializing all components in such a way that the CAM is empty. This is done in the constructor of every instantiated component. LUTs are initialized such that they output 0 independent of the four input values, shift registers are initialized with a delay of 1 and all switches are set to open.

Converting SRL16 to LUT

After the whole structure has been built and all components have been initialized, JBits starts converting the 16-bit shift registers (SRL16), which were instantiated in the VHDL code to disable the input swapping, to regular LUTs. This is done with the following method:

jbits.set(y, x, S<slice number>ram.lut_mode, S<slice number>ram.on);

where (y, x) is the coordinate in the CLB array and slice number denotes one of slices 0 and 1. This is done for the LUTs in the match blocks and, in the variable length CAM, also for the multiplexers that connect to the priority encoder and the LUTs that are used in the explicit priority encoder.

Using this method not only converts SRLs to LUTs, but also sets the Synchronous Reset (SR) line of the slice flip-flops to low. This line is depicted in figure 8.4. This causes registers that are placed together with an SRL in the same slice to be reset.
This problem is solved by making sure that the SR lines are not inverted (1) and that they are not connected to any net (2):

(1) JBits.set(y, x, S<slice number>control.srwenotinvert, S<slice number>control.off);
(2) JBits.set(y, x, S<slice number>sr.s<slice number>sr, S<slice number>sr.off);

Cyclic Redundancy Checking (CRC)

Virtex configuration utilizes a standard 16-bit CRC checksum algorithm to verify bitstream integrity during configuration. An initial CRC checksum is calculated by the Xilinx tools while generating the bitstream after place and route. In addition, a special purpose packet is added to the programming bitstream that tells the FPGA to do a CRC check. When the bitstream is read by JBits, JBits removes the part of the bitstream that is not necessary but does not update the CRC value.

This means that a checksum error is generated when configuring the FPGA with this bitstream. A way to solve this problem is to replace the packet that tells the FPGA to do a CRC check with a dummy packet. This is done in the code below:

Packet packet = null;
int h = jbits.getpacketcount();
for (int k=0; k<h; k++) {
    packet = jbits.get(k);              // Read packet(k) in bitstream
    if ((packet.getword(0) == 0x ))     // compare header
    {
        packet.setword(0, 0x );         // Set to Write RCRC
        packet.setword(1, 0x07);
    }
}

The program searches for CRC command packets, which can be recognized by the 32-bit header 0x . If such a packet is found, its data field is set to 0x07. Instead of telling the FPGA to do a CRC check, the packet now only resets the registers in the CRC circuit. It is also possible to actually calculate the CRC in software and load this value into the CRC circuit. This way the CRC check is done with the correct value, which is recommended in a production version where bitstream integrity is necessary.

9.4 Preprocessing

Preprocessing is part of the variable length CAM with explicit priority and consists of a series of actions that are performed when adding an entry to the CAM. These actions are:

1. Checking for conflicts with CAM words that are already in the CAM.
2. Mapping the user specified priority to a physical priority and location.
3. Utilization of match blocks that are left empty after deletion.

Logical and physical priority

During preprocessing, a distinction is made between logical and physical priority. The logical priority is the priority that is defined by the user and can be any integer. A larger logical priority means a higher priority. The physical priority is the value that the explicit priority encoder is set with for a specific entry.
Due to the limited number of priority classes, this is an integer between 0 and 7, where 0 is the highest priority.

Relation between CAM words

To analyse the relation between two CAM words a and b, these are modelled by sets A and B. A is the set of all incoming packets for which a gives a match and B is the set of packets for which b gives a match. When comparing two CAM words a and b, there are four possible outcomes:

1. Equality: CAM words a and b are identical (A = B).
2. Hierarchical overlap: a and b are not identical, but if a packet matches CAM word a (b), then this packet matches b (a) as well (A ∩ B = A or A ∩ B = B, and not 1).
3. Partial overlap: CAM words a and b are not hierarchical, but there are packets for which both a and b match (A ∩ B ≠ ∅, and not 2).
4. No overlap: there are no packets for which a and b both match (A ∩ B = ∅).

Checking for conflicts

When checking for conflicts, the new CAM word is compared with the words that are already present in the CAM, and the software determines the relation between the new CAM word and each of these words. There are three cases in which a conflict can occur:

There is partial overlap between the new CAM word and another CAM word, while their logical priority is equal. In this case the program cannot decide which entry should have the highest priority.

There is hierarchical overlap between the new CAM word and another CAM word, where the new CAM word is covered by the other word, but its logical priority is lower. Because the other entry covers the new entry, the address of the new entry will never be returned.

There is hierarchical overlap between the new CAM word and another word in the CAM, where the new CAM word entirely covers the other CAM word, but its logical priority is higher. Because the new entry covers the other entry, the address of the other entry will never be returned.

If a conflict occurs, the new entry is not added to the CAM.

Adding a new entry

To add a new entry to the CAM, its logical priority has to be translated into a physical priority and a location. This is the objective of preprocessing, which moves as few other entries as possible. The way an entry is added to the CAM depends on its relation with other entries in the CAM, and these cases are discussed separately. For each case, a simple example is given that shows the old CAM contents on the left side and the new contents on the right side. Each time, the entries are given as a 4-bit CAM word together with their logical priority (any integer) and physical priority (integer between 0 and 7). The new CAM word is depicted on the right side of the arrow in a bold font. It should be noted that not all cases are covered by the examples.
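The relation between two CAM words and the resulting conflict test can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis software: CAM words are modelled as equal-length strings over '0', '1' and 'x' (don't care), and the names CamWordRelation, covers, overlaps, relation and conflicts are hypothetical. The conflict test treats three situations as conflicts: partial overlap with equal logical priority, a new word that is covered by an existing word of higher priority, and a new word that covers an existing word of lower priority.

```java
/** Hypothetical sketch: CAM words are strings over '0', '1' and 'x' (don't care). */
final class CamWordRelation {

    enum Relation { EQUAL, HIERARCHICAL, PARTIAL, NONE }

    /** True if every packet matched by b is also matched by a (B is a subset of A). */
    static boolean covers(String a, String b) {
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != 'x' && a.charAt(i) != b.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** True if at least one packet matches both words. */
    static boolean overlaps(String a, String b) {
        for (int i = 0; i < a.length(); i++) {
            char ca = a.charAt(i);
            char cb = b.charAt(i);
            if (ca != 'x' && cb != 'x' && ca != cb) {
                return false;              // both bits specified and different
            }
        }
        return true;
    }

    static Relation relation(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("CAM words must have equal width");
        }
        if (a.equals(b)) {
            return Relation.EQUAL;
        }
        if (!overlaps(a, b)) {
            return Relation.NONE;
        }
        if (covers(a, b) || covers(b, a)) {
            return Relation.HIERARCHICAL;
        }
        return Relation.PARTIAL;
    }

    /** Conflict test; a larger logical priority means a higher priority. */
    static boolean conflicts(String newWord, int newPriority, String other, int otherPriority) {
        switch (relation(newWord, other)) {
            case PARTIAL:
                return newPriority == otherPriority;
            case HIERARCHICAL:
                if (covers(other, newWord)) {
                    return newPriority < otherPriority;   // new word would never be returned
                }
                return newPriority > otherPriority;       // other word would never be returned
            default:
                return false;
        }
    }
}
```

For instance, under this model 100x and x0x1 overlap partially (packet 1001 matches both), while 10xx covers 100x.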

1. No overlap: if there is no overlap between the new entry and any of the other entries, then the entry is simply added at the first available position. Its physical priority is set to 3.
Example: x x x1x 49 3 Add 1111, 47 0x1x

2. Partial overlap: if there is partial overlap between the new entry and one or more other entries, then its physical priority is determined from the physical priorities of the overlapping entries.
Example: 100x x x Add x0x1, 47 1x x0x
If this physical priority is occupied by other overlapping entries, then the physical priorities of these entries are changed to make room for the physical priority of the new entry.
Example: 100x 45 4 x00x 45 5 (changed) 1x Add x0x1, 47 00xx 49 3 x0x
Due to the limited number of priority classes, it might be necessary to move some entries.

3. Hierarchical overlap: if there is hierarchical overlap between the new entry and another entry, then the new entry is added as if there were partial overlap. If the new entry and the other entry have equal logical priority, the priority is decided by the number of don't cares: the more specific an entry, the higher its priority.
Example: (changed) 10xx xx 45 3 Add 100x, x

4. Equality: if the new entry already exists in the CAM, then the existing entry is deleted if it has a different logical priority. Then the new entry is added using rules 1 to 3.
Example: xx 49 3 Add 1100, 50 10xx 49 3

The physical priorities of the new entries are set in such a way that their values are evenly distributed over the range between 0 and 7. This means that if a priority can be set on an interval from A to B, then the value is set in the middle of this interval. This way, the chance that a priority is occupied by other entries when adding a new entry is minimized.
Therefore the physical priority of an entry that does not overlap with any other entry is set to 3 (in the middle between 0 and 7).

Utilization of empty match blocks

When an entry is deleted from the CAM, the match blocks and the multiplexer that the output of the CAM word was connected to are marked unused, which creates an empty fragment in the CAM content. When a new entry is added to the CAM, the software finds the first available fragment of match blocks in which the new CAM word fits. The advantage of this approach is that no other entries are moved when deleting a CAM word and large unused fragments are prevented. However, since the algorithm finds the first available fragment and not the fragment in which the new CAM word fits best, the CAM still becomes fragmented. Periodic defragmentation of the CAM is therefore necessary.
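The first-fit search for an empty fragment can be illustrated with a minimal sketch. It assumes, for illustration only, that the match blocks are modelled as a one-dimensional array of used/unused flags; the actual CAM software also has to track multiplexers and the two-dimensional placement, which is not shown. The names FragmentAllocator and firstFit are hypothetical.

```java
/** Hypothetical sketch of first-fit allocation over a row of match blocks. */
final class FragmentAllocator {

    /**
     * Returns the index of the first run of 'needed' consecutive unused
     * match blocks, or -1 if no such fragment exists.
     */
    static int firstFit(boolean[] used, int needed) {
        if (needed <= 0) {
            throw new IllegalArgumentException("needed must be positive");
        }
        int runStart = 0;
        int runLength = 0;
        for (int i = 0; i < used.length; i++) {
            if (used[i]) {
                runLength = 0;             // fragment interrupted by a used block
            } else {
                if (runLength == 0) {
                    runStart = i;          // a new empty fragment starts here
                }
                runLength++;
                if (runLength == needed) {
                    return runStart;       // first fragment that is large enough
                }
            }
        }
        return -1;
    }
}
```

Because the search stops at the first fragment that is large enough rather than the tightest one, small leftover gaps accumulate over time, which is the fragmentation described above.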

Chapter 10: Conclusions and Recommendations

10.1 Summary

The goal of this project was to implement an FPGA-based CAM for IP version 6 characterization. This CAM should be able to store at least 128 words with a maximum width of 315 bits, and the CAM words may contain don't cares. It should be part of a 622 Mbit/s communication channel, which means that a rate of 1.9 million lookups per second is required.

Three different CAM structures were implemented. The first implementation, called the fixed length CAM, can store 128 entries of 320 bits, and all CAM words consume the same amount of resources. In this implementation an inherent priority mechanism is used, meaning that when several CAM words match simultaneously, the matching word at the lowest address is selected. The second implementation, called the variable length CAM, can store up to twice as many entries on the same area as the fixed length CAM. This is done by dividing the CAM words into five match blocks; when such a match block merely contains don't cares, this block is omitted. The third implementation is based on the variable length CAM, but uses a more advanced priority mechanism where not only the CAM words, but also their priority can be programmed.

To implement the CAMs, a specific design methodology was used, consisting of a static and a dynamic part. The static part is used to implement the basic structure of the CAM, from a VHDL description to the programming bitstream. The dynamic part is a Java application, used to change this bitstream for updating the CAM.

The three implementations were realized in a Xilinx Virtex device on a PCI-based board, and for each implementation a Java-based user interface was developed to configure the CAM. The fixed length CAM has been implemented successfully; it can contain 128 CAM words and perform 7.1 million lookups/sec.
The variable length CAM that has been implemented can contain up to 256 CAM words and searching can be done at a rate of 3.8 million lookups/sec. The explicit priority scheme added in the third implementation allows fast adding/deleting of CAM words, and it was shown that this added no significant hardware costs. Searching can be done at a rate of 3.4 million lookups/sec.

Conclusions

When implementing an FPGA-based CAM, the architecture of the FPGA has to be considered in order to exploit its resources efficiently. This is because, unlike custom circuits, the architecture of the FPGA is fixed a priori, and therefore the permitted programmability, connectivity and routability are constrained by that architecture.

Dynamic reconfiguration is a good way to implement FPGA-based CAMs. It was shown that flexible circuits can be implemented without adding hardware costs. Since critical functions (searching the CAM) and non-critical functions (changing the CAM) can be separated and implemented in hardware and software respectively, the final hardware implementation becomes both faster and smaller than regular FPGA implementations.

The performance of dynamically reconfigurable FPGA-based CAMs is sufficient for IP characterization. Between 3.4 and 7.1 million packets can be characterized per second, depending on which of the three implementations described in this report is used. Since the required search rate is 1.9 million lookups/s, all three implementations are suitable. With progressing FPGA performance, it is expected that the three CAM structures can be used in future communication channels with more stringent requirements as well.

The CAMs that have been implemented consume approximately half of the hardware resources that are available on the Virtex FPGA and can be integrated with other logic on the same chip, either for extending their functionality or for implementing other functions.

The design methodology that was used to implement the CAMs, consisting of a static and a dynamic flow, proved to be successful. The static part is used to implement those parts of the design that are not dynamically reconfigured. Since it is not possible to reserve an empty area on the FPGA in the tools for integrating reconfigurable cores, an initial structure of the dynamic part of the design must be implemented in the static design flow as well.

Recommendations

Recommendations concerning the CAM

The output of the CAM should be programmable, such that the return value can be programmed as well. This way, the memory that is indexed by the CAM can be omitted and the correct signals to control the rest of the system are generated directly by the CAM. A method to do this is described in

Besides returning a value when a match occurs, other actions could be performed. An example of such an action is incrementing a counter for acquiring statistical data. Another example is waiting for another entry to match. This way sequences of packets can be discovered. This concept is called Matching Machine (MaMa).
The structure of a MaMa consists of a CAM structure, logic and memory and is very suitable for implementation on an FPGA, since all the necessary parts are already there. With dynamic reconfiguration, the MaMa can be configured with the correct CAM words, actions and return values and could become very valuable in protocol processing. It is recommended to investigate how the CAMs and methodology described in this report can be extended to implement such a function.

Recommendations concerning the tools

To have full control over the implementation of the dynamically reconfigurable part of a design, one would like to use JBits to build this part. To do this, one should be able to reserve an area in the Xilinx tools for JBits to place and route certain regular structures. This area should be empty and one should be able to define ports on the edges of the area for JBits to route the area and connect to the other parts of the circuit (i.e. board interface and random logic). It is recommended that this is integrated in the place & route tool.

If one is able to reserve an empty area, then routing in JBits can be done both manually and automatically. In some cases one would like to be able to use both methods, for example when there are both critical and non-critical nets. In the present version of JBits (2.1) one can only use one of these methods, since the automatic routing tool does not check for conflicts with manually placed routes. It is recommended that both methods can be used together.

It should be possible to add an attribute to the Virtex LUT primitive that prevents the Xilinx tools from swapping the address lines on the LUT (see 7.1.1).

In time-critical applications it is important that reconfiguration is done fast, possibly while a part of the system is still running. This can be achieved by partial reconfiguration. It is recommended to have support for this both on the board and in the CAM software.

The synthesis tool that was used for implementing the CAMs is advanced and suitable for synthesizing behavioural descriptions. Since the static part of the design is written in a structural way, a simpler synthesis tool can be used. This is cheaper and probably faster than using Synplify, which can take many hours.

References

[1] W. Richard Stevens, TCP/IP Illustrated, vol. 1, Addison-Wesley.
[2] S. Deering, R. Hinden, Internet Protocol, Version 6 Specification, RFC 2460.
[3] J. Walrand, Communication Networks, A First Course, Aksen Associates.
[4] M. Mansour, A. Kayssi, FPGA-based Internet Protocol Version 6 Router, Proceedings of IEEE International Conference on Computer Design, p. 334, vol. 2.
[5] Ericsson Telecom, Telia, Att Förstå Telekommunikation, Studentlitteratur.
[6] R. Kress, High-Level Synthesis for Dynamically Reconfigurable Hardware/Software Systems, Proceedings of the 8th International Workshop of Field Programmable Logic and Applications FPL 98, p. 288, Springer Lecture Notes in Computer Science.
[7] I. Warren, Dynamic Configuration Abstraction, Proceedings of the 5th European Software Engineering Conference (ESEC 95), Springer Lecture Notes in Computer Science.
[8] C. Sweeney, B. Blyth, RC1000-PP Hardware Reference Manual, version 2.1, Embedded Solutions.
[9] P. Bellows, B. Hutchings, JHDL - A HDL for Reconfigurable Systems, Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, p. 175.
[10] S. Guccione, D. Levi, Run-Time Parameterizable Cores, Proceedings of the 9th International Workshop of Field Programmable Logic and Applications FPL 99, p. 215, Springer Lecture Notes in Computer Science.
[11] Xilinx Inc, The Programmable Logic Data Book, Xilinx Databook.
[12] C. Carmichael, VIRTEX™ FPGA Series Configuration and Readback, Xilinx Application Note 138.
[13] Xilinx Inc, JBits Xilinx Reconfigurable Computing Platform, JBits 2.0 Tutorial.
[14] M. Defossez, Content Addressable Memory (CAM) in ATM applications, Xilinx Application Note 202.
[15] J. Brelet, B. New, Designing Flexible, Fast CAMs with Virtex Slices, Xilinx Application Note 203.
[16] J. Brelet, Using Block SelectRAM+ for High-Performance Read/Write CAMs, Xilinx Application Note 204.
[17] S. V. Kartalopoulos, An Associative RAM-based CAM and Its Application to Broad-Band Communications Systems, IEEE Transactions on Neural Networks, p. 1036, vol. 9.
[18] A.J. McAuley, P. Francis, Fast Routing Table Lookup Using CAMs, Proceedings of IEEE Infocom 93, p. 1382.
[19] Xilinx Inc, Xilinx Library Guide, Alliance 2.1i Software Manual.
[20] C. Horstmann, G. Cornell, Core JAVA 1.2 vol 1: Fundamentals, Sun Microsystems Press.
[21] S. Guiccione, Portable Native Methods in JAVA, Embedded Systems.
[22] S. McPherson, Java Servlets and Serialization with RMI, Java Developers Connection.
[23] J.R. Jackson, A.L. McClellan, JAVA 1.2 by Example, Third edition, Sun Microsystems Press.

List of Used Tools

Software

This paragraph gives an overview of the programs used to design and implement the CAM on the two platforms Solaris 2.6 and Windows NT 4.0. For each program it is described what it has been used for (Application).

Solaris 2.6:

Name              Version      Application
cc                -            C-compiler for automatic generation of UCF files for constraining the design.
Framemaker        5.5          Writing the report.
Synplify          5.3 FAE 1    Synthesis of the initial CAM structure.
TextEdit          3.6          Editing the VHDL code.
Xilinx Alliance   2.1i sp3     Place and route of the design and generation of the programming bitstream.

Windows NT 4.0:

Name              Version      Application
EditPlus          1.25         Editing the Java code.
JBits             2.1          Java library for manipulation of the bitstream and dynamic reconfiguration of the FPGA.
MS Visual C                    C-editor and compiler for generating the hardware interface.
SUN JDK           1.2          SUN Java Developers Kit for compiling and running Java code.

Hardware

This paragraph gives a description of all the hardware components that have been used, together with a more detailed specification.

Name                  Type                            Specification
PC                    Hewlett Packard Kayak XA        Pentium II at 450 MHz, 320 MB, running Windows NT 4.0.
Unix Workstation      SUN Sparc Station 10            Dual Sparc processor at 66 MHz, 192 MB.
Unix Compute Server   SUN Sparc Enterprise 450        Dual Sparc processor at 350 MHz, 512 MB.
FPGA Board            Embedded Solutions RC1000-PP    PCI board with XCV1000-4, 8 MB.

Appendix A: Port Definitions of Board Interface

Port        Direction   Function
CE<n>       Out         SRAM <n> Chip Enable
WE<n>       Out         SRAM <n> Write Enable
OE<n>       Out         SRAM <n> Output Enable
Addr<n>     Out         SRAM <n> Address
Req<n>      Out         SRAM <n> Request
Gnt<n>      In          SRAM <n> Granted
Ctrl_ACK    Out         Data Acknowledge, Control Register
Ctrl_VLD    In          Data Valid, Control Register
Stat_ACK    In          Data Acknowledge, Status Register
Stat_VLD    Out         Data Valid, Status Register
LEDs        Out         Onboard LEDs on/off

Table A-1: FPGA control ports for memory and registers

[Figure A-1: schematic view of the board interface and its connections to other parts in the system (PCI host, SRAM 0, SRAM 1, SRAM 2, CAM, board interface logic, memory arbiter and LEDs). Note: signal clk is not shown explicitly.]

Appendix B: VHDL Code of Board Interface

--**************************************************************************
-- Module: board_interface.vhd
-- Version:
-- Date: November
-- Author: Johan Ditmar
-- Family: Virtex
-- Description: board interface to be used with the variable length
--              CAM structure.
--**************************************************************************
PACKAGE MEM IS
  -- States of main FSM.
  TYPE memory_states IS (IDLE, REQUEST, GRANTED, READ_FIRST, WRITE_FIRST,
                         RD_WRT, READ_READY, WRITE_READY, VALID_READY);
  -- States of FSM for reading the control register.
  TYPE ctrl_states IS (IDLE, DATA_VALID);
  -- Number of packets to be processed.
  CONSTANT BUFFER_SIZE : INTEGER := 1000;
  -- Maximum number of blocks per entry.
  CONSTANT NBlocks : INTEGER := 5;
END MEM;

LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.std_logic_arith.all;
USE ieee.std_logic_unsigned.all;
USE work.mem.all;

ENTITY Board_Interface IS
  -- The different ports can be found in appendix A.
  PORT(
    clk : IN std_logic;
    Res : IN std_logic;
    Stat_Reg : OUT std_logic_vector(7 DOWNTO 0);
    Stat_VLD : OUT std_logic;
    Stat_ACK : IN std_logic;
    Ctrl_Reg : IN std_logic_vector(7 DOWNTO 0);
    Ctrl_VLD : IN std_logic;
    Ctrl_ACK : OUT std_logic;
    Addr0, Addr1, Addr2 : OUT std_logic_vector(20 DOWNTO 0);
    Data0, Data1 : IN std_logic_vector(31 DOWNTO 0);
    Data2 : OUT std_logic_vector(31 DOWNTO 0);
    CE0, CE1, CE2 : OUT std_logic_vector (3 DOWNTO 0);
    WE0, WE1, WE2 : OUT std_logic;
    OE0, OE1, OE2 : OUT std_logic;
    Req0, Req1, Req2: OUT std_logic;
    Gnt0, Gnt1, Gnt2: IN std_logic;
    LEDS: OUT std_logic_vector(3 DOWNTO 0);
    T_LED1, T_LED2: OUT std_logic
  );
END Board_Interface;

ARCHITECTURE Behave OF Board_Interface IS
  SIGNAL Addr_0, Addr_1, Addr_2 : INTEGER := 0;

  SIGNAL Data0_Reg, Data1_Reg, Data2_Reg : std_logic_vector(31 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Reset, Reset_Int: std_logic := '0';
  SIGNAL CE_0, CE_1, CE_2 : std_logic := '1';
  SIGNAL Start, Stat_ACK_Reg : std_logic := '0';
  SIGNAL Address: std_logic_vector (7 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Match_Vector: std_logic_vector (15 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Match: std_logic := '0';
  SIGNAL InData_temp : std_logic_vector(63 DOWNTO 0) := (OTHERS => '0');
  SIGNAL Priority_temp : std_logic_vector(7 DOWNTO 0);

  COMPONENT CAM
    GENERIC(
      NLines : INTEGER := 64;
      Width : INTEGER := 7
    );
    PORT(
      InData : IN std_logic_vector(63 downto 0);
      Address : OUT std_logic_vector(width-1 DOWNTO 0);
      Match_Vector : OUT std_logic_vector(15 DOWNTO 0);
      Priority_Vector : OUT std_logic_vector(7 DOWNTO 0);
      Match : OUT std_logic;
      clk : IN std_logic
    );
  END COMPONENT;

BEGIN
  Addr0 <= conv_std_logic_vector(addr_0,21);
  Addr1 <= conv_std_logic_vector(addr_1,21);
  Addr2 <= conv_std_logic_vector(addr_2,21);
  CE0 <= (OTHERS => CE_0);
  CE1 <= (OTHERS => CE_1);
  CE2 <= (OTHERS => CE_2);
  Data2_Reg(31 DOWNTO 16) <= Match_Vector;             -- First 15 priority encoder inputs
  Data2_Reg(15) <= Match;                              -- Match signal
  Data2_Reg(14 DOWNTO 8) <= Priority_temp(6 DOWNTO 0); -- 7 of 8 priority lines
  Data2_Reg(7 DOWNTO 0) <= Address;                    -- Return address from priority encoder
  Reset <= NOT Res;

  -- FSM for reading the control register.
  Read_Ctrl : PROCESS (Ctrl_VLD, clk)
    VARIABLE Current_State : ctrl_states;
  BEGIN
    IF( (clk = '1') AND clk'event ) THEN
      IF (Reset = '1') THEN
        Current_State := IDLE;
      ELSE
        CASE Current_State is
          WHEN IDLE =>
            Ctrl_ACK <= '1';

            IF (Ctrl_VLD='0') THEN
              Start <= Ctrl_Reg(0);
              Current_State := DATA_VALID;
            ELSE
              Current_State := IDLE;
            END IF;
          WHEN DATA_VALID =>
            Ctrl_ACK <= '0';
            Current_State := IDLE;
        END CASE;
      END IF;
    END IF;
  END PROCESS;

  -- Process that waits for an acknowledgment from the host that
  -- the status register has been read.
  Store_Ack : PROCESS (Stat_ACK, Reset, Reset_Int)
  BEGIN
    IF ( (Reset = '1') OR (Reset_Int = '1') ) THEN
      Stat_Ack_Reg <= '0';
      T_LED1 <= '1';
      T_LED2 <= '0';
    ELSIF ( (Stat_ACK = '0') AND Stat_ACK'EVENT) THEN
      Stat_Ack_Reg <= '1';
      T_LED1 <= '0';
      T_LED2 <= '1';
    END IF;
  END PROCESS;

  -- Process that asserts a reset as a response to an acknowledgement
  -- of the status register.
  Assert_Reset : PROCESS (clk, Stat_Ack_Reg)
  BEGIN
    IF ( (clk='1') AND clk'event) THEN
      IF (Stat_Ack_Reg = '1') THEN
        Reset_Int <= '1';
      ELSE
        Reset_Int <= '0';
      END IF;
    END IF;
  END PROCESS;

  -- Main FSM for reading packets from memory banks 0 and 1 and writing
  -- the result to bank 2.
  Read_Write : PROCESS (clk, Start, Reset)
    VARIABLE Current_State : memory_states;
    VARIABLE Cnt : INTEGER RANGE 0 TO NBlocks+2;

  BEGIN
    IF ( (clk = '1') AND clk'event ) THEN
      IF (Reset = '1') THEN
        Current_State := IDLE;
      ELSE
        CASE Current_State IS

          -- Idle state, wait for 'Start' from control register.
          WHEN IDLE =>
            LEDS <= "0001";
            OE0 <= '1'; WE0 <= '1'; CE_0 <= '1'; Req0 <= '1'; Addr_0 <= 0;
            OE1 <= '1'; WE1 <= '1'; CE_1 <= '1'; Req1 <= '1'; Addr_1 <= 0;
            OE2 <= '1'; WE2 <= '1'; CE_2 <= '1'; Req2 <= '1'; Addr_2 <= 0;
            Stat_VLD <= '1';
            Stat_Reg <= (OTHERS => '0');
            Cnt := 0;
            IF (Start='1') THEN
              Current_State := REQUEST;
            ELSE
              Current_State := IDLE;
            END IF;

          -- Request memory banks.
          WHEN REQUEST =>
            LEDS <= "0010";
            Req0 <= '0'; Req1 <= '0'; Req2 <= '0';
            IF (Gnt0='0' AND Gnt1='0' AND Gnt2='0') THEN
              Current_State := GRANTED;
            ELSE
              Current_State := REQUEST;
            END IF;

          -- All banks have been granted.
          WHEN GRANTED =>
            LEDS <= "0011";
            OE0 <= '0'; CE_0 <= '0';
            OE1 <= '0'; CE_1 <= '0';
            Current_State := READ_FIRST;

          -- Read first packet from bank 0 and 1.
          WHEN READ_FIRST =>
            LEDS <= "0100";
            Addr_0 <= Addr_0+1;
            Addr_1 <= Addr_1+1;
            Cnt := Cnt+1;
            IF (Cnt > NBlocks) THEN
              Cnt := NBlocks-1;
              Current_State := WRITE_FIRST;
            ELSE
              Current_State := READ_FIRST;
            END IF;

          -- Write first result to bank 2.
          WHEN WRITE_FIRST =>
            LEDS <= "0101";
            Addr_0 <= Addr_0+1;
            Addr_1 <= Addr_1+1;
            Addr_2 <= 0;
            WE2 <= '0'; CE_2 <= '0';
            Current_State := RD_WRT;

          -- Process all packets.
          WHEN RD_WRT =>
            LEDS <= "0110";
            Addr_0 <= Addr_0+1;
            Addr_1 <= Addr_1+1;
            Cnt := Cnt+1;
            IF (Cnt >= NBlocks) THEN
              Addr_2 <= Addr_2+1;
              Cnt := 0;
            END IF;
            IF (Addr_0>=Buffer_Size-2) THEN
              Current_State := READ_READY;
            ELSE
              Current_State := RD_WRT;
            END IF;

          -- All packets read from bank 0 and 1.
          WHEN READ_READY =>
            LEDS <= "0111";
            Req0 <= '1'; OE0 <= '1'; CE_0 <= '1';
            Req1 <= '1'; OE1 <= '1'; CE_1 <= '1';
            Cnt := Cnt + 1;
            IF (Cnt >= NBlocks) THEN
              Addr_2 <= Addr_2+1;
              Cnt := 0;
            END IF;

            IF (Addr_2>=Buffer_Size/NBlocks-1) THEN
              Current_State := WRITE_READY;
            ELSE
              Current_State := READ_READY;
            END IF;

          -- All packets have been processed.
          WHEN WRITE_READY =>
            LEDS <= "1000";
            Stat_Reg(0) <= '1';
            Req2 <= '1'; WE2 <= '1'; CE_2 <= '1';
            Current_State := VALID_READY;

          -- Write 'Ready' to status register and wait for acknowledgement.
          WHEN VALID_READY =>
            LEDS <= "1001";
            Stat_VLD <= '0';
            IF (Stat_Ack_Reg='1') THEN
              Current_State := IDLE;
            ELSE
              Current_State := VALID_READY;
            END IF;

          WHEN OTHERS =>
            LEDS <= "0000";
            Current_State := IDLE;

        END CASE;
      END IF;
    END IF;
  END PROCESS;

  InData_temp(63 DOWNTO 32) <= Data1;
  InData_temp(31 DOWNTO 0) <= Data0;

  -- Instantiation of the CAM structure.
  CAM : CAM
    GENERIC MAP(NLines => 128, Width => 8)
    PORT MAP (InData => InData_temp,
              Address => Address,
              Match_Vector => Match_Vector,
              Priority_Vector => Priority_temp,
              Match => Match,
              clk => clk);

END Behave;
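The way the board interface packs its result into Data2_Reg can be mirrored on the host side when reading results back from memory bank 2. The following is a minimal Python sketch of that unpacking, assuming the bit positions of the Data2_Reg assignments in the listing above; the function name decode_result_word and the dictionary keys are hypothetical and are not part of the thesis software, which is written in Java.

```python
def decode_result_word(word: int) -> dict:
    """Split a 32-bit result word into the fields packed into Data2_Reg."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("result word must fit in 32 bits")
    return {
        "match_vector": (word >> 16) & 0xFFFF,  # Data2_Reg(31 DOWNTO 16)
        "match": (word >> 15) & 0x1,            # Data2_Reg(15)
        "priority": (word >> 8) & 0x7F,         # Data2_Reg(14 DOWNTO 8)
        "address": word & 0xFF,                 # Data2_Reg(7 DOWNTO 0)
    }


# Hypothetical result word: one match line set, match asserted,
# priority 3, return address 5.
fields = decode_result_word(0x00018305)
print(fields)
```

Keeping the field positions in one function means a change in the VHDL packing only has to be mirrored in a single place on the host.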

Appendix C: FPGA Editor View of Various CAMs

Figure C-1: FPGA Editor view of fixed length CAM with inherent priority encoder.

Figure C-2: FPGA Editor view of variable length CAM with inherent priority encoder.

Figure C-3: FPGA Editor view of variable length CAM with explicit priority encoder.

Appendix D: User Interface Manual of Various CAMs

With the Java user interface, the contents of the CAM can be changed and its functionality can be tested. This manual describes the functions of the user interface and explains the menus of three CAM implementations: the fixed length and the variable length CAM, both with inherent priority, and the variable length CAM with explicit priority. First it is described how to start the program. Then the user interface of each of the three CAMs is described, including how to view the contents of the CAM, how to add and delete entries, and how to test the functionality of the CAM.

D.1 Starting the program

Before starting the program, make sure that Sun's JDK 1.2 or higher has been installed properly. The three CAM applications can be started from the directory that contains their respective class files by typing:

java -classpath <path to JBits>;. cam <option>

options:
-demo: runs the application in demo mode, without connecting to the board.
-remote <IP address>: runs the application remotely. <IP address> is the 32-bit IP address of the server containing the board (ex ). Be sure to start both the register and the server application first!

D.2 User Interface of Fixed Length CAM with Inherent Priority

The graphical user interface of the fixed length CAM with inherent priority is given in figure D-1.

D.2.1 Viewing the content of the CAM

The content view panel (1) gives a graphical representation of the CAM with blocks, registers and priority encoder. Every block is divided into two 32-bit words. Each word is divided into eight 4-bit words, each of which represents an FPGA LUT. The first entry has the highest priority. Selecting is done by clicking with the mouse in the panel, and only complete 32-bit words can be selected. Note that it is not possible to select empty entries, except the first one.

The content of an entry is shown on bit level by means of 1's, 0's and x's, where each complete entry contains 320 bits. To view the bit values, make sure that box (2) is checked. By default, empty locations are drawn in gray and used locations in yellow. The LUTs whose contents were altered while changing the contents of the CAM are drawn in red. These are the LUTs that need to be updated when reconfiguring the FPGA.

Since the number of entries and their width is so large, an extra panel was implemented that gives a global view of the CAM and shows which word has been selected (3). The square shows which part of the CAM is viewed in the content view panel. Finally, there is a status field (4) where messages, warnings and errors are printed.

Figure D-1: Graphical User Interface of Fixed Length CAM with inherent priority

D.2.2 Changing the content from within the program

Changing the contents of the CAM can be done either from within the program or from a file. In the program, three functions have been implemented to add, delete and replace entries.

To add an entry to the CAM, the location where the entry is to be added should be selected in the content view panel. Only complete entries can be added, and a location is selected by clicking on one of the blocks in the row where the new entry should be placed. If the selected location is used by another entry, then this entry and all entries below it are shifted down. The value of the new entry is defined in the Packet and Mask fields, which contain 32-bit hexadecimal numbers. The Packet field supplies the 1's and 0's, while the Mask field determines where the don't cares are. A 0 in the Mask field means x. For example, if the packet is equal to F0F0F0F0 and the mask is equal to FF00FF00, then the value F0xxF0xx will be added. When a valid value has been entered in the Packet and Mask fields and the Add button (5) is pressed, all words at the selected location become equal to the added value.

Deleting an entry is done by selecting a location that is not empty and pressing the Delete button (6). The whole entry at that location is then deleted, and any non-empty entries on locations below the deleted entry are shifted up.

Replacing an entry is done by selecting a word at a location that is not empty, writing a value in the Packet and Mask fields and pressing Replace (7). Only the word that has been selected is replaced, not the whole entry.

D.2.3 Reading the content from file

Besides changing the CAM in the program as described above, it is also possible to read the configuration from a file. This is done via File->Open Configuration. The new configuration overwrites the existing configuration when using this method. The file should have the following format:

packet (1)
mask (1)
packet (2)
...
packet (NEntries)
mask (NEntries)

where ( packet (n), mask (n) ) is a packet-mask pair of 320 bits each, divided into 10 words of 32 bits in hexadecimal notation and separated by a space. If the number of entries in the file is larger than the number of entries that can be added to the CAM, then the excess entries are ignored. Empty lines and lines that start with # are ignored, so that the configuration file can be structured and commented. An appropriate error is generated when a syntax error occurs while reading the file.

After changing the contents of the CAM, the FPGA is updated by pressing Configure (8). This causes JBits to change the bitstream and write the new bitstream to the FPGA. All LUTs that were drawn red in the content view panel become gray or yellow again.

D.2.4 Testing the functionality of the CAM

To test the functionality, packets can be read from a file and processed by the CAM. To read a file, go to File->Open Testcases. This file should contain packets of 320 bits, divided into 10 words of 32 bits in hexadecimal notation and separated by a space. Every line should contain a separate packet. Empty lines and lines that start with # are ignored, so that the test file can be structured and commented. An appropriate error is generated when a syntax error occurs while reading the file. The actual testing is done by pressing the Test button (9).
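The file conventions of D.2.3 and D.2.4 and the packet/mask rule of D.2.2 can be illustrated with a short Python sketch. It is not the thesis implementation (which is a Java application); all function names are hypothetical. It assumes that every 320-bit value sits on its own line, that each word is written with exactly eight hexadecimal digits, and that packet and mask lines alternate in a configuration file.

```python
from typing import List, Tuple

WORDS_PER_VALUE = 10       # 10 words of 32 bits = 320 bits
HEX_DIGITS_PER_WORD = 8    # assumption: words are always written with 8 digits
HEX_CHARS = set("0123456789abcdefABCDEF")


def parse_value(line: str) -> List[int]:
    """Parse one 320-bit value written as 10 space-separated hex words."""
    tokens = line.split()
    if len(tokens) != WORDS_PER_VALUE:
        raise ValueError(f"expected {WORDS_PER_VALUE} words, got {len(tokens)}")
    for token in tokens:
        if len(token) != HEX_DIGITS_PER_WORD or not set(token) <= HEX_CHARS:
            raise ValueError(f"not a 32-bit hexadecimal word: {token!r}")
    return [int(token, 16) for token in tokens]


def read_values(text: str) -> List[List[int]]:
    """Return every value in the text, skipping empty lines and '#' comments."""
    values = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            values.append(parse_value(line))
    return values


def read_configuration(text: str) -> List[Tuple[List[int], List[int]]]:
    """Pair alternating packet and mask values into (packet, mask) entries."""
    values = read_values(text)
    if len(values) % 2 != 0:
        raise ValueError("last packet has no mask")
    return list(zip(values[0::2], values[1::2]))


def ternary_bits(packet: int, mask: int, width: int = 32) -> str:
    """Render a word as 1, 0 and x; a 0 bit in the mask yields x."""
    chars = []
    for bit in range(width - 1, -1, -1):
        if (mask >> bit) & 1:
            chars.append(str((packet >> bit) & 1))
        else:
            chars.append("x")
    return "".join(chars)


config = ("# one entry\n\n"
          + " ".join(["F0F0F0F0"] * 10) + "\n"
          + " ".join(["FF00FF00"] * 10) + "\n")
packet, mask = read_configuration(config)[0]
print(ternary_bits(packet[0], mask[0]))  # 11110000xxxxxxxx11110000xxxxxxxx
```

A test-case file as described in D.2.4 has no masks, so read_values alone would be enough for it.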
After processing, the program asks for a filename in which to store the result. Every line in the result file contains the packet that was searched, the 32-bit word that was read from memory bank 2 (where the result was stored) in hexadecimal notation, and the resulting return address. The 32-bit word that is read from the memory bank is printed for debugging purposes and contains more information besides the return address:

match output<15:0> | Not used | Address | Match

match output<15:0> denotes the outputs of the 15 entries that have the highest priority (before encoding). Sequential tests can be done with the same test cases without loading a new test file.

D.3 User Interface of Variable Length CAM with Inherent Priority

The graphical user interface of the variable length CAM with inherent priority is given in figure D-2.

Figure D-2: Graphical User Interface of Variable Length CAM

D.3.1 Viewing the content of the variable length CAM

The content view panel (1) gives a graphical representation of the CAM with match blocks, shift registers and priority encoder. It consists of 32 horizontal match lines, each containing 4 match blocks that are 64 bits wide and each consist of 16 FPGA LUTs. The priority encoder returns an address between 0 and 63, where 0 has the highest priority. Selecting is done by clicking with the mouse in the panel, and only complete 64-bit blocks can be selected. The content of an entry is shown on bit level by means of 1's, 0's and x's, where each complete entry contains a maximum of 320 bits. To view the bit values, make sure that box (2) is checked.

In the variable length CAM, the connections between blocks and priority encoder are not fixed. For this reason, the routes between match blocks within the same entry and the routes from blocks to the encoder are shown as wires above the respective match line. The shift registers that are placed between the match blocks are shown as well, together with their delay.

Since the number of entries and their width is so large, an extra panel was implemented that gives a global view of the CAM and shows which word has been selected (3). The square shows which part of the CAM is viewed in the content view panel. Finally, there is a status field (4) where messages, warnings and errors are printed.

D.3.2 Changing the content of the CAM

Changing the contents of the variable length CAM is done by reading a configuration file as described in D.2.3. After this, the FPGA is updated by pressing Configure (5).

D.3.3 Testing the functionality of the CAM

Testing the variable length CAM with inherent priority is done in the same way as for the fixed length CAM with inherent priority. A test file containing a series of packets is read, and the CAM starts testing when the Test button (6) is pressed.

D.4 User Interface of Variable Length CAM with Explicit Priority

The graphical user interface of the variable length CAM with explicit priority is given in figure D-3. The program is entirely controlled from a command file, and therefore the buttons and menu items for testing and configuring the FPGA are omitted. The information that is given about the contents of the CAM is the same as for the variable length CAM with inherent priority, except that for each entry not only the return address but also its physical priority is shown.
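In the variable length CAMs of D.3 and D.4, an entry occupies at most 320 bits, that is, at most five 64-bit match blocks, and blocks that would store only don't cares are omitted. The sketch below shows one way to decide which blocks of an entry need to be stored, given its mask. It is an illustration only: the function name is hypothetical, and the assumption that block i covers words 2i and 2i+1 of the mask is not taken from the thesis.

```python
from typing import List

WORDS_PER_BLOCK = 2  # one 64-bit match block = two 32-bit words


def blocks_to_store(mask_words: List[int]) -> List[int]:
    """Return the indices of the 64-bit blocks that hold at least one cared-for bit."""
    if len(mask_words) % WORDS_PER_BLOCK != 0:
        raise ValueError("mask must consist of complete 64-bit blocks")
    kept = []
    for block in range(len(mask_words) // WORDS_PER_BLOCK):
        start = block * WORDS_PER_BLOCK
        # A block whose mask words are all zero contains only don't cares.
        if any(mask_words[start:start + WORDS_PER_BLOCK]):
            kept.append(block)
    return kept


# Hypothetical mask: only the first and the last block contain cared-for bits.
mask = [0xFFFFFFFF, 0, 0, 0, 0, 0, 0, 0, 0, 0x00000001]
print(blocks_to_store(mask))  # [0, 4]
```

An entry whose mask selects only two blocks then needs two match blocks instead of five.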
D.4.1 Command File

The command file is read via File->Open Command File. Via the command file, entries can be added and deleted, and packets can be searched to test the functionality of the CAM. The commands in the command file can be divided into different categories, which are described below. Empty lines and lines that start with # are ignored, so that the command file can be structured and commented.

D.4.2 Changing the contents of the CAM

Clearing the CAM:

Clear

Figure D-3: Graphical User Interface of Variable Length CAM with explicit priority (callout: return address / physical priority).

Adding an entry:

Add <packet label> <packet> <mask> <logical priority>

where <packet label> is a string that identifies the entry, ( packet, mask ) is a packet-mask pair of 320 bits each, divided into 10 words of 32 bits in hexadecimal notation and separated by a space, and <logical priority> is the logical priority of the entry, which can be any integer. An Add-command can be followed by several entries.

Deleting an entry:

Delete <packet label>

where <packet label> is the label of the entry that is deleted. A Delete-command can be followed by several packet labels.
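The Clear, Add and Delete commands described above can be read with a small tokenizer. The Python sketch below is an assumption-laden illustration, not the parser of the thesis software: the names parse_commands and build_command are hypothetical, commands are assumed to be whitespace-separated tokens that may span lines, packet labels are assumed never to equal a command keyword, and only the three commands described in this section are recognised.

```python
from typing import List, Tuple

KEYWORDS = {"Clear", "Add", "Delete"}
WORDS = 10                          # 10 hex words of 32 bits per packet or mask
ENTRY_TOKENS = 1 + WORDS + WORDS + 1  # label, packet, mask, logical priority


def build_command(name: str, args: List[str]) -> Tuple[str, list]:
    """Turn a keyword and its argument tokens into a (name, payload) tuple."""
    if name == "Clear":
        if args:
            raise ValueError("Clear takes no arguments")
        return (name, [])
    if name == "Delete":
        if not args:
            raise ValueError("Delete needs at least one packet label")
        return (name, list(args))
    # Add: one or more entries of label + packet + mask + priority.
    if not args or len(args) % ENTRY_TOKENS != 0:
        raise ValueError("Add needs complete entries of 22 tokens each")
    entries = []
    for start in range(0, len(args), ENTRY_TOKENS):
        chunk = args[start:start + ENTRY_TOKENS]
        entries.append({
            "label": chunk[0],
            "packet": [int(w, 16) for w in chunk[1:1 + WORDS]],
            "mask": [int(w, 16) for w in chunk[1 + WORDS:1 + 2 * WORDS]],
            "priority": int(chunk[-1]),
        })
    return (name, entries)


def parse_commands(text: str) -> List[Tuple[str, list]]:
    """Parse a command file, skipping empty lines and lines starting with '#'."""
    tokens: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            tokens.extend(line.split())
    commands = []
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if name not in KEYWORDS:
            raise ValueError(f"unexpected token {name!r}")
        j = i + 1
        while j < len(tokens) and tokens[j] not in KEYWORDS:
            j += 1
        commands.append(build_command(name, tokens[i + 1:j]))
        i = j
    return commands
```

A command file that clears the CAM, adds one entry labelled web with logical priority 7 and later deletes it would consist of a Clear line, an Add line with 22 tokens after the keyword, and a Delete web line.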

EECS150 - Digital Design Lecture 16 Memory 1

EECS150 - Digital Design Lecture 16 Memory 1 EECS150 - Digital Design Lecture 16 Memory 1 March 13, 2003 John Wawrzynek Spring 2003 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: Whenever a large collection of state elements is required. data &

More information

EECS150 - Digital Design Lecture 16 - Memory

EECS150 - Digital Design Lecture 16 - Memory EECS150 - Digital Design Lecture 16 - Memory October 17, 2002 John Wawrzynek Fall 2002 EECS150 - Lec16-mem1 Page 1 Memory Basics Uses: data & program storage general purpose registers buffering table lookups

More information

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP

CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 133 CHAPTER 6 FPGA IMPLEMENTATION OF ARBITERS ALGORITHM FOR NETWORK-ON-CHIP 6.1 INTRODUCTION As the era of a billion transistors on a one chip approaches, a lot of Processing Elements (PEs) could be located

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

FPGA: What? Why? Marco D. Santambrogio

FPGA: What? Why? Marco D. Santambrogio FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much

More information

The Virtex FPGA and Introduction to design techniques

The Virtex FPGA and Introduction to design techniques The Virtex FPGA and Introduction to design techniques SM098 Computation Structures Lecture 6 Simple Programmable Logic evices Programmable Array Logic (PAL) AN-OR arrays are common blocks in SPL and CPL

More information

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes.

! Program logic functions, interconnect using SRAM. ! Advantages: ! Re-programmable; ! dynamically reconfigurable; ! uses standard processes. Topics! SRAM-based FPGA fabrics:! Xilinx.! Altera. SRAM-based FPGAs! Program logic functions, using SRAM.! Advantages:! Re-programmable;! dynamically reconfigurable;! uses standard processes.! isadvantages:!

More information

Programmable Logic. Simple Programmable Logic Devices

Programmable Logic. Simple Programmable Logic Devices Programmable Logic SM098 Computation Structures - Programmable Logic Simple Programmable Logic evices Programmable Array Logic (PAL) AN-OR arrays are common blocks in SPL and CPL architectures Implements

More information

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011 FPGA for Complex System Implementation National Chiao Tung University Chun-Jen Tsai 04/14/2011 About FPGA FPGA was invented by Ross Freeman in 1989 SRAM-based FPGA properties Standard parts Allowing multi-level

More information

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses Today Comments about assignment 3-43 Comments about assignment 3 ASICs and Programmable logic Others courses octor Per should show up in the end of the lecture Mealy machines can not be coded in a single

More information

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued)

Virtex-II Architecture. Virtex II technical, Design Solutions. Active Interconnect Technology (continued) Virtex-II Architecture SONET / SDH Virtex II technical, Design Solutions PCI-X PCI DCM Distri RAM 18Kb BRAM Multiplier LVDS FIFO Shift Registers BLVDS SDRAM QDR SRAM Backplane Rev 4 March 4th. 2002 J-L

More information

Summary. Introduction. Application Note: Virtex, Virtex-E, Spartan-IIE, Spartan-3, Virtex-II, Virtex-II Pro. XAPP152 (v2.1) September 17, 2003

Summary. Introduction. Application Note: Virtex, Virtex-E, Spartan-IIE, Spartan-3, Virtex-II, Virtex-II Pro. XAPP152 (v2.1) September 17, 2003 Application Note: Virtex, Virtex-E, Spartan-IIE, Spartan-3, Virtex-II, Virtex-II Pro Xilinx Tools: The Estimator XAPP152 (v2.1) September 17, 2003 Summary This application note is offered as complementary

More information

Topics. Midterm Finish Chapter 7

Topics. Midterm Finish Chapter 7 Lecture 9 Topics Midterm Finish Chapter 7 ROM (review) Memory device in which permanent binary information is stored. Example: 32 x 8 ROM Five input lines (2 5 = 32) 32 outputs, each representing a memory

More information

Implementation and Design of High Speed FPGA-based Content Addressable Memory Anupkumar Jamadarakhani 1 Shailesh Kumar Ranchi 2 1, 2

Implementation and Design of High Speed FPGA-based Content Addressable Memory Anupkumar Jamadarakhani 1 Shailesh Kumar Ranchi 2 1, 2 IJSRD - International Journal for Scientific Research & Development Vol. 1, Issue 9, 2013 ISSN (online): 2321-0613 Implementation and Design of High Speed FPGA-based Content Addressable Memory Anupkumar

More information

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function.

FPGA. Logic Block. Plessey FPGA: basic building block here is 2-input NAND gate which is connected to each other to implement desired function. FPGA Logic block of an FPGA can be configured in such a way that it can provide functionality as simple as that of transistor or as complex as that of a microprocessor. It can used to implement different

More information

Field Programmable Gate Array (FPGA)

Field Programmable Gate Array (FPGA) Field Programmable Gate Array (FPGA) Lecturer: Krébesz, Tamas 1 FPGA in general Reprogrammable Si chip Invented in 1985 by Ross Freeman (Xilinx inc.) Combines the advantages of ASIC and uc-based systems

More information

Basic FPGA Architecture Xilinx, Inc. All Rights Reserved

Basic FPGA Architecture Xilinx, Inc. All Rights Reserved Basic FPGA Architecture 2005 Xilinx, Inc. All Rights Reserved Objectives After completing this module, you will be able to: Identify the basic architectural resources of the Virtex -II FPGA List the differences

More information

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline

CPE/EE 422/522. Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices. Dr. Rhonda Kay Gaede UAH. Outline CPE/EE 422/522 Introduction to Xilinx Virtex Field-Programmable Gate Arrays Devices Dr. Rhonda Kay Gaede UAH Outline Introduction Field-Programmable Gate Arrays Virtex Virtex-E, Virtex-II, and Virtex-II

More information

Index Terms- Field Programmable Gate Array, Content Addressable memory, Intrusion Detection system.

Index Terms- Field Programmable Gate Array, Content Addressable memory, Intrusion Detection system. Dynamic Based Reconfigurable Content Addressable Memory for FastString Matching N.Manonmani 1, K.Suman 2, C.Udhayakumar 3 Dept of ECE, Sri Eshwar College of Engineering, Kinathukadavu, Coimbatore, India1

More information

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router

Overview. Implementing Gigabit Routers with NetFPGA. Basic Architectural Components of an IP Router. Per-packet processing in an IP Router Overview Implementing Gigabit Routers with NetFPGA Prof. Sasu Tarkoma The NetFPGA is a low-cost platform for teaching networking hardware and router design, and a tool for networking researchers. The NetFPGA

More information

EE178 Lecture Module 2. Eric Crabill SJSU / Xilinx Fall 2007

EE178 Lecture Module 2. Eric Crabill SJSU / Xilinx Fall 2007 EE178 Lecture Module 2 Eric Crabill SJSU / Xilinx Fall 2007 Lecture #4 Agenda Survey of implementation technologies. Implementation Technologies Small scale and medium scale integration. Up to about 200

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

What is Xilinx Design Language?

What is Xilinx Design Language? Bill Jason P. Tomas University of Nevada Las Vegas Dept. of Electrical and Computer Engineering What is Xilinx Design Language? XDL is a human readable ASCII format compatible with the more widely used

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

INTRODUCTION TO FPGA ARCHITECTURE

INTRODUCTION TO FPGA ARCHITECTURE 3/3/25 INTRODUCTION TO FPGA ARCHITECTURE DIGITAL LOGIC DESIGN (BASIC TECHNIQUES) a b a y 2input Black Box y b Functional Schematic a b y a b y a b y 2 Truth Table (AND) Truth Table (OR) Truth Table (XOR)

More information

Virtex-II Architecture

Virtex-II Architecture Virtex-II Architecture Block SelectRAM resource I/O Blocks (IOBs) edicated multipliers Programmable interconnect Configurable Logic Blocks (CLBs) Virtex -II architecture s core voltage operates at 1.5V

More information

Synthesis vs. Compilation Descriptions mapped to hardware Verilog design patterns for best synthesis. Spring 2007 Lec #8 -- HW Synthesis 1

Synthesis vs. Compilation Descriptions mapped to hardware Verilog design patterns for best synthesis. Spring 2007 Lec #8 -- HW Synthesis 1 Verilog Synthesis Synthesis vs. Compilation Descriptions mapped to hardware Verilog design patterns for best synthesis Spring 2007 Lec #8 -- HW Synthesis 1 Logic Synthesis Verilog and VHDL started out

More information

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices

Basic FPGA Architectures. Actel FPGAs. PLD Technologies: Antifuse. 3 Digital Systems Implementation Programmable Logic Devices 3 Digital Systems Implementation Programmable Logic Devices Basic FPGA Architectures Why Programmable Logic Devices (PLDs)? Low cost, low risk way of implementing digital circuits as application specific

More information

Review: Timing. EECS Components and Design Techniques for Digital Systems. Lec 13 Storage: Regs, SRAM, ROM. Outline.

Review: Timing. EECS Components and Design Techniques for Digital Systems. Lec 13 Storage: Regs, SRAM, ROM. Outline. Review: Timing EECS 150 - Components and Design Techniques for Digital Systems Lec 13 Storage: Regs,, ROM David Culler Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~culler

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch

RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,

More information

CSE140L: Components and Design Techniques for Digital Systems Lab

CSE140L: Components and Design Techniques for Digital Systems Lab CSE140L: Components and Design Techniques for Digital Systems Lab Tajana Simunic Rosing Source: Vahid, Katz, Culler 1 Announcements & Outline Lab 4 due; demo signup times listed on the cse140l site Check

More information

Logic Synthesis. EECS150 - Digital Design Lecture 6 - Synthesis

Logic Synthesis. EECS150 - Digital Design Lecture 6 - Synthesis Logic Synthesis Verilog and VHDL started out as simulation languages, but quickly people wrote programs to automatically convert Verilog code into low-level circuit descriptions (netlists). EECS150 - Digital

More information

CSE140L: Components and Design

CSE140L: Components and Design CSE140L: Components and Design Techniques for Digital Systems Lab Tajana Simunic Rosing Source: Vahid, Katz, Culler 1 Grade distribution: 70% Labs 35% Lab 4 30% Lab 3 20% Lab 2 15% Lab 1 30% Final exam

More information

Introduction to Partial Reconfiguration Methodology

Introduction to Partial Reconfiguration Methodology Methodology This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Define Partial Reconfiguration technology List common applications

More information

Internetwork Protocols

Internetwork Protocols Internetwork Protocols Background to IP IP, and related protocols Internetworking Terms (1) Communications Network Facility that provides data transfer service An internet Collection of communications

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

TSEA44 - Design for FPGAs

TSEA44 - Design for FPGAs 2015-11-24 Now for something else... Adapting designs to FPGAs Why? Clock frequency Area Power Target FPGA architecture: Xilinx FPGAs with 4 input LUTs (such as Virtex-II) Determining the maximum frequency

More information

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) Bill Jason P. Tomas Dept. of Electrical and Computer Engineering University of Nevada Las Vegas FIELD PROGRAMMABLE ARRAYS Dominant digital design
