
UNIVERSITY OF CALIFORNIA, SAN DIEGO

An Overview and Benchmark Study of the Starbridge Reconfigurable Computing Platform

A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Computer Science

by

David Tamjidi

Committee in charge:

Professor Allan E. Snavely, Chair
Professor Rajesh Gupta
Professor Alex Orailoglu

2005

Copyright
David Tamjidi, 2005
All rights reserved

The thesis of David Tamjidi is approved:

Chair

University of California, San Diego
2005

TABLE OF CONTENTS

SIGNATURE PAGE
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1: INTRODUCTION
  1.1 The Reconfigurable Computing Platform
  1.2 Current Technology State
    1.2.1 Hardware
    1.2.2 Software
  1.3 Organization of the Thesis

CHAPTER 2: RELATED WORK
  2.1 SRC Computing Platform
  2.2 Annapolis Microsystems Computing Platform

CHAPTER 3: THE STARBRIDGE RECONFIGURABLE COMPUTING PLATFORM
  3.1 Hardware
    3.1.1 General Architecture
    3.1.2 Memory Subsystem
    3.1.3 Communications Subsystem
    3.1.4 Comparisons/Contrasts
  3.2 Software - VIVA
    3.2.1 Types with Context
    3.2.2 Polymorphic Objects
    3.2.3 Recursive Synthesis
    3.2.4 Synchronization System
    3.2.5 Dynamic Reconfiguration
    3.2.6 External Program Interface
    3.2.7 Evaluation of VIVA

CHAPTER 4: MEMORY/COMMUNICATIONS BENCHMARKING
  4.1 Methodology
  4.2 Benchmarks
  4.3 Experimental Evaluation
    4.3.1 PCI Bus Maximum Sustainable Performance
    4.3.2 DRAM Maximum Sustainable Performance
    4.3.3 STREAM Benchmark
    4.3.4 RANDOM Benchmark
  4.4 Analysis

CHAPTER 5: COMPUTATIONAL BENCHMARKING
  5.1 Methodology
  5.2 Benchmarks
  5.3 Experimental Evaluation
    5.3.1 DGEMM
    5.3.2 Triple-DES
    5.3.3 Conway's Game of Life
    5.3.4 Smith-Waterman Algorithm
  5.4 Analysis

CHAPTER 6: SYNTHESIS BENCHMARKING
  6.1 Methodology
  6.2 Experimental Evaluation
    6.2.1 Triple-DES
    6.2.2 Conway's Game of Life
  6.3 Analysis

CHAPTER 7: CONCLUSION
  7.1 Summary of the Starbridge RCS
  7.2 Future Work

REFERENCES

LIST OF ABBREVIATIONS

AMS - Annapolis Microsystems Inc.
ASIC - Application Specific Integrated Circuit
A/D converters - Analog to digital converters
CA - Cellular Automaton
COM - Component Object Model
DDR - Double Data Rate
DGEMM - Double Precision General Matrix Multiply Benchmark
DRAM - Dynamic Random Access Memory
DSP - Digital Signal Processing
EDIF - Electronic Design Interchange Format
FIFO - First In First Out Synchronization Queue
FPGA - Field Programmable Gate Array
GDBW - Go, Done, Busy, Wait paradigm
GUI - Graphical User Interface
GUPS - Giga-updates per second
HDL - Hardware Description Language
IP - Intellectual Property
LUT - Look-Up Table
MPGA - Mask Programmable Gate Array
PAR - Place and Route
PCI Bus - Peripheral Component Interconnect Bus
PEs - Processing Elements
PLD - Programmable Logic Device
RC - Reconfigurable Computing
RCS - Reconfigurable Computing System
ROM - Read Only Memory
SRAM - Static Random Access Memory
STREAM - Memory Stress Test Benchmark
VHDL - VHSIC Hardware Description Language
VHSIC - Very High Speed Integrated Circuit
VME Bus - Versa Module Europa Bus

LIST OF FIGURES

Figure 1: Typical system elements of a FPGA based RC system
Figure 2: One configuration of the SRC system
Figure 3: Compilation Process of the SRC system
Figure 4: Hardware components of the WildStar II board
Figure 5: Screenshot of CoreFire GUI
Figure 6: General component layout of the Starbridge HC-62 RCS board
Figure 7: Layout of one quad of the Starbridge RCS
Figure 8: Screenshot of Viva design environment
Figure 9: Recursive Synthesis example
Figure 10: Sustained Bandwidth Measurements for PCI Bus and DRAM
Figure 11: Percentage Stall Time of PCI Bus and DRAM
Figure 12: Sustained Bandwidth Measurements for STREAM Benchmarks
Figure 13: Percent Stall Time Measurements for STREAM Benchmarks
Figure 14: DGEMM Basic Block Outline
Figure 15: Bandwidth performance for Triple-DES Implementation
Figure 16: Percent Stall Time for Triple-DES Implementation
Figure 17: Wave Computation of Smith-Waterman Algorithm

LIST OF TABLES

Table 1: Comparison of various parameters of the different RCSs
Table 2: GDBW signal explanations
Table 3: Memory Benchmark Explanations
Table 4: STREAM Benchmark Results
Table 5: Smith-Waterman Benchmark Results
Table 6: Triple-DES/DES Synthesis Results
Table 7: Conway's Game of Life Synthesis Results

ACKNOWLEDGMENTS

I would like to acknowledge that this thesis could not have been completed without the help and experience of a few select people throughout my graduate school career. First and foremost I would like to thank Dr. Allan Snavely for putting faith in a young undergraduate and first introducing me to the world of reconfigurable computing. Without his support and patience, for me and for this new technology, this thesis would not have had the chance to be written. I would also like to acknowledge a close friend and partner in crime throughout the graduate school process, Siddarth Gidwani. Sid's boundless optimism and outlook on graduate life helped to put the proper responsibilities in perspective at the right times. Another big acknowledgment that must be made is to Esmail Chitalwala of StarBridge Systems. Without Esmail's help and guidance on the subtleties and nuances of the Starbridge System, I cannot begin to imagine the extra work that would have been required to complete this thesis. I would also like to thank Dr. Gupta and Dr. Orailoglu for the wonderful experiences that they have provided to me over the course of my graduate school career. Without these chances given to write papers and teach undergraduates, my graduate school experience would not have been as rich. Lastly, I would like to thank my family for being there to support me during my entire graduate school career and for providing me with the necessary motivation, both verbal and emotional, when a little nudge was needed to get back on track.

ABSTRACT OF THE THESIS

An Overview and Benchmark Study of the Starbridge Reconfigurable Computing Platform

by

David Tamjidi

Master of Science in Computer Science

University of California, San Diego, 2005

Professor Allan E. Snavely, Chair

Reconfigurable computing platforms have only recently come to the forefront as high-performance, versatile computing solutions. These platforms, driven by new advances in field programmable gate array (FPGA) technology, now offer the memory and power to target computing problems that once could only be solved with specially designed hardware. This thesis offers an overview of the current state of reconfigurable computing, giving an in-depth analysis of three current reconfigurable computing platforms manufactured by Starbridge Systems, Annapolis Microsystems, and SRC Computers. Following this overview, the focus shifts to the Starbridge HC-62, and an extensive summary of the platform's hardware architecture and software tool suite is provided to allow a better understanding of this particular system. A benchmark study of the Starbridge HC-62 is then presented, with emphasis placed on benchmarks that measure the memory bandwidth, computational resource effectiveness, and synthesizer efficiency of the Starbridge system as a whole. Comparisons are drawn between the Starbridge platform's capabilities and those of traditional computing platforms. This thesis shows that while the Starbridge system has enough physical resources for reconfigurable computing applications, memory bandwidth remains an issue that hampers performance throughout many of the benchmarks. Throughout this thesis, comparisons are also made to traditional CPU-microarchitecture-based systems to highlight both the advances of reconfigurable computing and its remaining deficiencies. Finally, this thesis concludes with issues that should be addressed for the further proliferation of reconfigurable computing systems.

CHAPTER 1: INTRODUCTION

The hardware item that is called the FPGA (Field Programmable Gate Array) today, while now a mature technology, has a short life story compared to the traditional microprocessor. Born of humble beginnings in the field of programmable logic two decades past, the FPGA in its infancy never had the ability to be its own computing platform. Rather, it was considered an intermediary between the existing PLD (Programmable Logic Device) devices of the time and MPGAs (Mask Programmable Gate Arrays). The design of the FPGA could be considered an upgrade that allowed not only re-programmability but also greater design flexibility. Over time the FPGA began to transform from a device used primarily for glue logic to one with more substantial roles. These included being programmed as simulators and rapid prototyping tools for ASIC (Application Specific Integrated Circuit) chip development. Then, during the following years, the complexity and capacity of FPGAs increased thanks to the greater hardware resources afforded by Moore's law. FPGAs whose gates initially numbered in the tens of thousands now range into the millions of gates and run at speeds in excess of 100 MHz. With ever increasing capacity and clock speed, today's FPGAs offer a great amount of performance with minimal engineering costs. The engineering and research communities recognized the flexibility that the FPGA offered, and today there are many studies and efforts that look at the feasibility of

expanding the role of the FPGA into arenas where CPUs, ASICs, and other specialized chips worked before.1 Over the years, with added hardware resources and greater synthesizer performance, the role of the FPGA has again shifted from that of an auxiliary resource to that of a resource with its own computing platform.

1.1 The Reconfigurable Computing Platform

According to [1], the term Reconfigurable Computing (RC) as it is used in current research refers to systems incorporating some form of hardware programmability, customizing how the hardware is used using a number of physical control paths. In general, a Reconfigurable Computing System (RCS) refers to a large class of different hardware devices whose primary purpose is to accelerate overall algorithm execution. These hardware devices can take on many different forms but typically are partitioned into distinct parallel PEs (Processing Elements) spread across the reconfigurable system. These PEs generally (but are not required to) have the same granularity in their processing ability. The granularity of a PE can vary in different aspects: the size of the input that can be processed by the PE and the reconfigurability of the PE to perform different types of operations. In [2] several different bit-level granularity RC systems are described; in these systems each PE is composed of reconfigurable datapath units.2 This is in contrast to other RCSs which

1 Examples include DSP (Digital Signal Processing) chips for signal processing and ASICs for specialized computation such as encryption.
2 One can consider these PEs similar in scale and complexity to reconfigurable ALUs in a sense.

16 3 use a more advanced system level granularity for their PEs 3. While there are advantages and disadvantages for each granularity level, this thesis will focus on FPGA granularity reconfigurable systems. Today, the major role played by FPGA-based reconfigurable systems is still the acceleration of algorithms that do not map well to the execution domain of the typical CPU. Generally, it is too complex to create a completely autonomous FPGA reconfigurable computing system. Therefore, most FPGA based RCSs act as an accelerator to a CPU, and the design of the overall system reflects this. Fig.1. Typical system elements of a FPGA based RC system As shown in Fig. 1, a typical FPGA RCS has these components: one or more FPGA based PEs, memory associated with each PE (external to the FPGA itself), internal communication lines between PEs, external I/O lines, and an external bus with associated 3 An example of this granularity can include having a FPGA for each PE.

17 4 controller. Some advanced systems even incorporate FPGAs used solely for routing signals between different networks of PEs. The other required component of the FPGA RCS is the software used to synthesize the bit files that program the FPGAs. The software for each RCS is different and specific manufacturer instructions must be followed to program the system. However, most systems follow a three step process. In the first step, a design independent EDIF (Electronic Design Interchange Format) file is synthesized which describes the planned FPGA implementation. The EDIF file uses only the particular resources available to the FPGA used in the system. The next step typically calls the FPGAs manufacturer specific PAR (Place and Route) tools to map the design to the FPGA. Lastly, the system will have a driver or method to program the FPGAs for runtime operation. 1.2 Current Technology State On the market today there are a variety of different FPGA based RCSs 4. The main distinction between most of these systems being the amount of FPGA resources offered and the flavor of the software tool suite provided. Some systems can be classified in the realm of add-on cards, where the FPGA based RCS is targeted as a card which is installed into a regular host CPU system, to perform execution acceleration. The corresponding software of these systems reflects this design decision. On the other hand, two systems at this time [5][7] are targeted at being stand-alone FPGA RCSs. The main 4 See [3]-[7] for a sample of current commercial systems offered.

18 5 distinctions of this category are the amount and architecture of FPGA resources offered and the capabilities of the tool suite to support broader design development Hardware The hardware that is present in current FPGA RCSs consists of FPGA PE elements, FPGA controller chips, external memory, external I/O, and bus connectivity circuits. The manufacturer offerings mentioned in [3][4][5][7] all chose the Xilinx VirtexII as the FPGA used for the computational PEs in their systems. For the current time frame, the Xilinx VirtexII certainly seems to be the dominating choice for PEs for most FPGA RCSs. Some of the systems presented above also contain additional routing and controller chips which help to program and route signals throughout the overall RCS. The majority of these systems again use Xilinx brand chips, with lesser resources and speed than the main computational counterparts. Another hardware element used in these RCSs is external memory. This memory typically consists of DRAM (Dynamic Random Access Memory) modules with associated refresh functionality or static SRAM (Static Random Access Memory) memory. The external memory can be shared between different PEs or, depending on the configuration of the RCS, directly attached to individual units for exclusive use. The RCSs presented above also offer connections for external I/O. These connections are through pins or connectors placed directly on the RC board itself. They allow the RCS to communicate with external sensors, modules, etc. without needing the data to be routed through the host CPU environment. The last hardware component is the bus connection that allows data to be transferred between

19 6 the RCS and the host CPU environment. The majority of the systems use the standard PCI bus interface to achieve this communication. Some ancillary buses used include the VME (Versa Module Europa) bus and other specialized buses developed to achieve high data transfer bandwidth Software The software portion of the different RCSs has an even greater variety in its offerings than the hardware implementations. Depending on the purpose of the RCS the different software offerings range from versions that simply load the RCS with pre-programmed algorithm implementations all the way up to allowing the end user to completely define a new base system as a synthesis target. The different software packages also vary in what type of user interface they use and what input languages are available for design entry. User interfaces range from graphical dataflow input to text based and command-line compilers. Input languages are also diverse, including implementations that support standard VHDL (VHSIC Hardware Description Language) and Verilog, an implementation that allows compiled C code to mix with HDL languages, and even support for more recent languages such as SystemC. 1.3 Organization of the Thesis This thesis is organized into seven different chapters. With the closing of this introductory section it is hoped that a general overview of the basics of FPGA based RCSs has been imparted. The rest of the thesis will focus on giving an overview and then presenting a detailed benchmark study of the Starbridge reconfigurable computing platform. However, before introducing the Starbridge system an overview of two similar

systems manufactured by SRC Computers and Annapolis Microsystems will be presented. This will allow for greater comparisons and contrasts throughout this thesis; these systems will be presented in chapter two. In chapter three an in-depth overview of the hardware and software components of the Starbridge system will be discussed. Chapter four begins the benchmarking section of the thesis; in this chapter the operating limits of the memory subsystem of the Starbridge RCS are explored. Chapter five continues benchmarking but now considers the computational power of the system by examining speedups in certain algorithms and programs. Chapter six then takes a different approach and examines the synthesizer efficiency of the Starbridge tool suite versus another commercial synthesizer. Lastly, in chapter seven a summary of results and a conclusion are provided, along with directions that can be taken for future work. This thesis hopes to contribute to and raise awareness of the research in the field of reconfigurable computing systems. The specific contributions that this thesis provides to the field consist of four main areas. First, this thesis has completed a survey of state-of-the-art FPGA-based reconfigurable computing systems available today. Second, this thesis aims to provide a thorough benchmarking of the memory system of one representative RCS, the HC-62 from Starbridge Systems. Memory subsystem performance is the bottleneck for overall performance in many applications, and this thesis aims to explore this issue. Third, this thesis will conduct an examination of speedups achievable on certain important programs using the Starbridge hardware. Lastly, a benchmarking and comparison of the Starbridge software tool suite's performance will be made against another commercially available synthesizer.

CHAPTER 2: RELATED WORK

Though the field of RC is still new and major players in the industry (Sun, SGI, etc.) have yet to produce a complete RCS, there is still competition from those that began work first. This chapter will address two different RCSs, SRC Computers' MAP-based computing solutions and the WildStar II Pro add-on board developed by Annapolis Microsystems.

2.1 SRC Computing Platform

The SRC Computers RCS differs the most in terms of hardware and software from the other general purpose RCS vendors (Annapolis Microsystems [3] and Starbridge Systems [7]). First a discussion of the hardware components of the SRC system will be presented, then a look at the software interface will follow. The SRC hardware [5] is based on a modular architecture using its MAP processor as the basic building block. Using the language specified earlier, the MAP processor can be considered one self-contained subset of an overall RCS. Each MAP processor is an add-on card that is fitted into the SRC server; multiple MAP cards can be added to expand system performance. Each MAP processor contains two FPGA chips and seven banks of shared dual-ported four-megabyte memory, totaling twenty-eight megabytes. The MAP processor is connected to the rest of the system via a special bus that SRC Computers created, called SNAP. The SNAP bus is highly optimized for data transfers and, according to SRC, has four times the bandwidth of a regular PCI-X133 bus found in

22 9 normal CPU systems. Every component in the SRC system: the CPU module, extra memory module, the bus switch, and the MAP processors are all connected via the SNAP interface allowing high bandwidth communication. As mentioned above another component of the system is a module to store extra memory; called the Common Memory module, it allows the MAP and CPU modules access to eight gigabytes of DRAM memory. The memory is accessible at the high bandwidth provided by the SNAP bus. The last component to mention is a specially designed bus switch based on the SNAP interface called the Hi-Bar switch. This switch is used to connect together the different modules so that communications across the SNAP bus can be arbitrated. Shown below is a figure of one configuration of these different modules. Fig.2. One configuration of the SRC system showing the different components available, reproduced from [5] While this is one configuration of the system, the configuration most typically mentioned in the literature is that of one to two MAP processors per server. The software tool suite behind the SRC system, the Carte programming environment, is also an interesting comparison between the different manufacturers tool suites. The SRC system tries to hide most of the implementation and communication details that would normally be needed to handle computations and transferring data between the CPU and the MAP processor board. It does this in a unique way by directly

implementing this data/control transfer in a compiler similar to the ones used for normal CPU source code compilation. The figure below shows the outline of this process.

Fig. 3. Compilation Process of the SRC system, reproduced from [9]

The programming paradigm introduced is one that is inherently familiar to any normal software engineer, that of function calls. Normal function calls are compiled to use the standard CPU resources; however, certain function calls can be marked to target the MAP's reconfigurable hardware. The SRC compiler then acts as a wrapper, taking function calls targeted at the MAP and putting in the necessary steps to transport data/control to and from the MAP. A standard suite of macros is provided, and the user is also able to create their own macros using standard HDL languages. The SRC compiler also tries, through control flow graph analysis, to automatically generate a proper ordering and pipelining of the hardware components used in the design to maximize throughput.
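To make the function-call paradigm concrete, the sketch below shows in plain C++ the shape of code a user might write and the kind of plumbing the compiler wraps around it. The map_* helper names and the in-memory "MAP" model are invented for illustration only and are not the real Carte interface.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for the data/control plumbing that the Carte
// compiler inserts automatically around a MAP-targeted function; none of
// these names come from the real SRC tool chain.
static std::vector<double> map_obm;                      // models MAP on-board memory
void map_copy_in(const std::vector<double>& v) { map_obm = v; }
void map_copy_out(std::vector<double>& v)      { v = map_obm; }
void map_execute_scale(double alpha)           { for (double& x : map_obm) x *= alpha; }

// Ordinary CPU version of the routine.
void scale_cpu(std::vector<double>& v, double alpha) {
    for (double& x : v) x *= alpha;
}

// The same routine "marked" for the MAP: the caller still sees a plain
// function call, while the wrapper moves data over the interconnect,
// fires the hardware, and collects the result.
void scale_map(std::vector<double>& v, double alpha) {
    map_copy_in(v);
    map_execute_scale(alpha);   // stands in for the synthesized FPGA logic
    map_copy_out(v);
}

int main() {
    std::vector<double> v{1, 2, 3};
    scale_map(v, 2.0);
    std::printf("%g %g %g\n", v[0], v[1], v[2]);   // 2 4 6
}
```

The point of this design is that the calling code looks identical whether the body runs on the CPU or on the MAP; only the generated wrapper changes.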

In the end, the SRC compiler's advantage is that it provides the infrastructure and functionality needed when combining reconfigurable hardware with sequential CPU code. By contrast, the other RCSs do not automate this functionality to the same degree and often it needs to be explicitly designed into the application code on the CPU side. While the full capabilities and coverage for automated RCS design that the SRC software tool suite provides have not been further studied in this thesis, several sources can be referenced for input [5][8][9][10].

2.2 Annapolis Microsystems Computing Platform

The offerings of Annapolis Microsystems (AMS) more closely resemble the Starbridge system, so a discussion is given here to show contrasts between the two systems. While AMS manufactures more than one RCS board, the board that will be considered is their WildStar II Pro board [3]. The board itself interfaces through a standard PCI slot on the CPU host system. The board has the normal characteristics of an RCS, in this case: two PEs, up to seven different external memory modules for each PE, internal PE to PE communication lines, external I/O lines, and a PCI bus interface. Shown below is a figure illustrating the hardware components of the WildStar II Pro board.

25 12 Fig. 4. Hardware components of the WildStar II board from AMS, reproduced from [3] One feature of the WildStar II Pro board is the addition of flash memory chips to allow automatic programming of the PEs without intervention from the CPU host system. Also, a big draw of the WildStar II is the availability of pre-made add-on I/O boards. These add-on boards connect to the WildStar II through the external I/O connectors present on

26 13 the side of each PE; the add-on boards are designed to handle the conversion of different types of signal inputs into standard logic levels for PE processing. The WildStar II also shows similarities with SRC s MAP board, in that the general layout of system components (PEs, external memory, and external I/O) is comparable. This layout will again be seen in the discussion of the Starbridge hardware. The software tool suite provided by AMS for the WildStar II board is their CoreFire development platform. The CoreFire platform is a GUI based development platform that uses a schematic type input system to create designs. A screenshot of the GUI is shown below. Fig. 5. Screenshot of CoreFire GUI, reproduced from [11] The CoreFire system aims for fast development cycles by providing pre-designed IP cores specifically made for the different AMS boards. Synchronization between different

library objects is automatically handled by the CoreFire software. The software suite also provides a graphical waveform debugger along with the ability to debug values from the hardware in real time. Overall, the boards offered by AMS give the impression of being designed for solving specific problem domains and not necessarily for use as general purpose RCSs. The reasoning behind this statement is mostly based upon the software. The main purpose of the CoreFire design environment seems to be to allow the end-user to rapidly prototype system designs from the provided base IP; the ease of, and features for, constructing entirely new base components have not been made clear.

CHAPTER 3: THE STARBRIDGE RECONFIGURABLE COMPUTING PLATFORM

This chapter gives a comprehensive overview of the Starbridge RCS platform. An in-depth look at the hardware of the Starbridge platform and then at the software will be made, with attempts to compare and contrast with the previously described systems.

3.1 Hardware

Currently, Starbridge Systems offers two different RCS models, the HC-36 and HC-62, which differ only in the number of attached PEs, with the HC-62 having double the number of dedicated computational PEs of the HC-36. The figure below shows the general layout of the Starbridge RCS board.

Fig. 6. General component layout of the Starbridge HC-62 RCS board, reproduced from [12]

3.1.1 General Architecture

The general architecture of the Starbridge HC-62 is shown above and described in [12]. The HC-62 has the most PE resources of any of the RCSs mentioned earlier, with a total of eight FPGA-based PEs. The system has all the common components of an RCS, consisting of: PEs, external memory modules for each PE, internal PE-to-PE communication lines, external I/O lines, and a PCI bus interface. The Starbridge RCS also has three other FPGAs used in specialized controller roles. One FPGA, the bus controller, arbitrates communication access to the PCI bus and implements FIFOs for buffered communication with each PE. The other two FPGAs, the router and cross-point chips, are used to distribute signals between PEs in the different sections of the system. As shown above, the system is broken up into two different quads, where a quad consists of a group of four PEs.

Fig. 7. Layout of one quad of the Starbridge RCS, reproduced from [12]

Each PE in the quad is directly connected to a number of different components. Inside every quad, each PE is connected to every other PE using direct wire connections. Also, each PE is able to connect to any other non-local PE in the other

quad using routing resources available from either the router or cross-point control chips. Lastly, each PE also has some connectivity to the external I/O pads.

3.1.2 Memory Subsystem

The memory subsystem of the HC-62 RCS is similar to the previous two systems in that each PE has exclusive access to external memory. However, unlike the SRC system, there is no shared external memory. Each PE has access to four different DRAM memory modules,1 all four of which can be accessed in parallel. Each DRAM module is controlled via circuitry synthesized on the actual PE. The modules are operated with DDR timing such that each request returns a word size based upon the number of memory chips in each DRAM module. The DRAM refresh circuitry is built into the controller logic synthesized into the PE.

3.1.3 Communications Subsystem

Communication with the CPU host system is done via a standard PCI bus interface. Software drivers give both a Unix and Windows compatible API [13] for blocking and non-blocking stream based communications. Using the API, individual PEs can be targeted for read operations while groups of PEs can be targeted for write operations.

3.1.4 Comparisons/Contrasts

1 Each individual DRAM memory module has either a 512MB or 1GB capacity.

Out of the three RCSs presented so far, the Starbridge system offers the greatest amount of PE resources. It also offers the most external memory available to each PE, with an order of magnitude increase over the other systems described.2 In terms of host CPU to PE communication, the Starbridge system and the AMS system are faced with the same limitation of using the slower PCI bus, while the SRC system has an advantage due to its specially designed high-bandwidth SNAP bus. All systems share a generous amount of external I/O connectivity. The table below summarizes some of the different characteristics of the systems mentioned.

Table 1. Comparison of various parameters of the different RCSs

Parameter | WildStar II Pro | SRC MAP Based | HC-62
Manufacturer | Annapolis Microsystems | SRC Computers | Starbridge Systems
Total # of PEs | 2 | 2 per MAP board | 8
PE FPGA Type | XC2VP70 or XC2VP100 | XC2VP100 | XC2V6000
Total Logic Cells (note 3) | 148,896 or 198,432 | 198,432 per MAP | 608,256 in PEs
Total External Memory per PE | Up to 128MB DRAM, up to 8MB x 6 SRAM | 4MB x 7 (shared between PEs) | Up to 512MB x 4 DRAM
Total Parallel Memory Modules per PE | 7 | Up to 7 if not shared | 4, more if XPoint memory used
Host CPU Comm. Interface | PCI Based Bus | SNAP Based Bus | PCI Based Bus

3.2 Software - VIVA

While the hardware in the Starbridge RCS offers a great amount of resources and memory, the real advantage comes with its software tool suite, Viva. The Viva synthesizer is a GUI based dataflow entry type design tool created for designing general

2 It should be noted that this memory is DRAM memory; the Starbridge RCS does not offer external SRAM based memory, which is usually faster and easier to design for.
3 Information gathered from [14].

32 19 purpose RC implementations. The basic structure of the Viva design environment is shown below. Fig. 8. Screenshot of Viva design environment The design environment is fairly straightforward to use. The center portion of the screen is the workbench area where inputs, outputs, and objects are dragged onto the screen and connected via transports 4 to create larger hierarchical objects. The upper right portion of the screen shows the project navigator which contains the current active sheets that each project contains. Viva synthesizes in a hierarchical manner, so an entire design must be contained in one top level sheet. Objects can be drilled into by double clicking 4 A transport in Viva is another name for a connecting wire between two points, also called a net in other tool suites.

33 20 on their box representation to traverse one step down into the hierarchy. Immediately below the project navigator is the library area where library objects are retrieved. As mentioned before, Viva uses a dataflow style of design entry. Each design is created by first defining the inputs then laying out the appropriate logic and/or predefined objects and finally connecting the outputs. Afterwards, the sheet can be converted into an object to be used at a different level in the hierarchy. Eventually, the design is synthesized, during which the overall design is converted to EDIF format, and then given to the standard manufacturer PAR tools to produce the FPGA programming file. These procedures are all fairly standard for any dataflow based toolset but the Viva synthesizer has a number of features which set it apart from the other tool suites. These include an advanced input/output type paradigm, an exposed synchronization system, dynamic reconfiguration, and an external program interfacing scheme using standard COM objects. It is also important to mention the flexibility for different hardware configurations that Viva was designed to accommodate. By modifying a special software resource called a System Description, Viva can, theoretically, synthesize to any given hardware configuration (for example either the HC-62 architecture or a different architecture entirely). The basic goal of the system description is to encapsulate the hardware resources and required hardware constructs for proper system operation. Hardware resources that are recorded include basic information such as the number of gates and internal memory available on the system architecture among other things. While the required hardware constructs include keeping track of such tasks as properly initializing the global FPGA clock signals and initializing proper host CPU to RCS bus communications. Viva also provides a special system description called X86, which

34 21 targets synthesis to CPU emulated logic instead of real FPGA logic. This allows quick debugging of designs by not having to go through the PAR tool sequence Types with Context The most powerful feature of Viva that permeates throughout the rest of its features is that of its input/output types with context. While other tool suites have rigidly defined types of input/output transports 5 with the only context being the bit width of the type, in Viva there are a rich abundance of different types for inputs/outputs. These types range from the standard bit, to predefined common sizes of bit groupings (byte, word, double word), and also include contextual types that add context to a normal bit grouping such as the types integer and floating point which provide a distinction as to which type of hardware gets synthesized. There are even abstract types such as list which acts as a data structure, holding sets of different elements in one transport. The list type and its counterparts effectively allow a designer to package and manipulate complex datasets in a user friendly manner. The last type to consider is a special type called Variant. The variant type when applied to an input/output transport tells the design environment that currently the specific input/output type is unknown, and therefore the current transport can conditionally be allowed to attach to any other type of input/output. During actual synthesis all Variant types eventually get resolved, and any resulting type mismatches will bring up an error condition. However, the behavior of the Variant type before synthesis 5 For example, two of the most common types of input/output transports in other tool suites would be a single bit line or multiple bits held together in a bus structure.

allows some powerful design paradigms to be used. These include polymorphic object overloading and recursive synthesis of objects.

3.2.2 Polymorphic Objects

Polymorphic objects in Viva support the same functionality as polymorphic functions in a sequential programming language, namely the ability of one function/object with the same number of arguments/inputs and outputs as other functions/objects to be properly identified as the correct choice during compilation/synthesis. This is done via the Variant data type in Viva. For example, if one creates two objects with the same object name and the same number but different types of inputs and outputs, then Viva will automatically create an object with Variant inputs/outputs for those that had different types. Effectively, this lets a designer use one generic object in a design versus having to pick a specific implementation of that object. A good example of the utility of this feature can be illustrated with a typical Adder object. Instead of having to choose a specific Adder with certain inputs for the current circumstances and change it later if the input types change, the generic adder can be used and will automatically adapt at synthesis to whatever input types are present.6

3.2.3 Recursive Synthesis

Continuing with the idea of using Variant data types and polymorphic objects, the actual synthesis steps through which a higher level object is transformed to its

6 This is, of course, dependent on the condition that the implementation of the Adder for the specified types exists somewhere in the object library. If there is no implementation for the specified types then an error is raised and synthesis is aborted.

eventual hardware representation can be finely controlled. This idea is called recursive synthesis. Recursive synthesis is controlled through the use of extra, designer-created objects that mimic the desired synthesis path. The notion is similar to using recursive functions in sequential programming languages, where the same function calls itself until a base case is reached and the recursion is stopped. At synthesis time, the same procedure applies; the synthesizer recursively breaks the higher level object into smaller copies of itself by calling the different polymorphic objects available. This continues until a base case hardware representative object is found,7 then all the objects are wired together in the correct manner. This is best shown through the use of an example. Consider an 8 bit adder that needs to be synthesized to hardware; the designer makes a decision to only allow the use of two bit adders in the final hardware implementation. Typically, if doing this exercise by hand, one could construct a ripple carry adder from the 2-bit adder pieces and correctly tie them together to form the 8 bit adder. In Viva, the designer would only have to create a couple of simple objects for the synthesis to perform correctly. Two of these objects are illustrated below side by side for comparison.8

7 A base case object is an object that does not have any Variant inputs, meaning it is fully defined.
8 They would not normally be represented in this manner in Viva; they are shown this way to illustrate the different objects and their implementations.

Fig. 9. Recursive Synthesis example

The top object is our hardware base case, a simple two bit adder having a bit type on all its inputs and outputs; this is what our synthesizer eventually uses. The bottom object is our recursive polymorphic object.9 In the bottom object what is happening is: using the ExposeLSB objects, one bit is being pulled off of each input and fed into the upper polymorphic Adder object.10 The Sum bits are then collected using the CollectLSB object, which collects the two inputs to form one output, while the CarryOut of the first Adder is routed to the CarryIn of the next Adder in a ripple carry fashion. At synthesis time, the recursive nature can be seen: at the initial level of

9 Note: while hard to see, the inputs to the bottom object both have type Variant.
10 Again, while hard to see in the picture, the inputs to the individual Adder objects in the lower design are defined as Variants, which means any input to them is legal at design time.

recursion, the lower overall design is entered and a set of bits is pulled off our eight bit inputs and attached to the uppermost polymorphic Adder object. The synthesizer understands that it has a base case object with bit type inputs that match the types present and instantiates the 2-bit adder hardware for this; the rest of the 7 bits then go into the lowermost polymorphic Adder object. The synthesizer then replays this same scenario, now with an input of 7 bits instead of 8. This cycle recursively continues until finally a 2-bit input is split into individual bits and two 2-bit adders are synthesized; the process is then complete. It could also be shown that, depending on the library objects present, different hardware implementations will result. For example, by creating an Adder implementation that might target some specific FPGA resource (such as special fast adder circuitry) and wrapping it up in a polymorphic Adder object, the synthesizer could automatically choose this implementation at synthesis time. The rule is for the synthesizer to choose the biggest base case object it can find and that the implementation allows. Overall, recursive synthesis is a powerful tool which allows a savvy designer to leverage the synthesizer against brute force techniques in many design situations.11

11 This will be seen in the implementation of Conway's Game of Life algorithm.
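As a rough software analogy only (Viva designs are graphical objects, not C++), the same peel-one-bit-and-recurse structure can be expressed with templates, where compile-time instantiation plays the role of the synthesizer's recursion and a 1-bit full adder stands in for the fully defined base case object. All names below are invented for illustration.

```cpp
#include <bitset>
#include <cstdio>

// Base case: a 1-bit full adder, standing in for the fully typed adder object
// (all inputs are concrete, no Variants left to resolve).
inline bool full_add(bool a, bool b, bool carry_in, bool& carry_out) {
    bool sum = a ^ b ^ carry_in;
    carry_out = (a && b) || (carry_in && (a ^ b));
    return sum;
}

// Recursive case: an N-bit adder is "synthesized" from one base-case adder
// plus an (N-1)-bit adder, the way the polymorphic Adder peels one bit off
// its Variant inputs and feeds the rest back into itself.
template <unsigned N>
std::bitset<N> ripple_add(const std::bitset<N>& a, const std::bitset<N>& b,
                          bool carry_in, bool& carry_out) {
    std::bitset<N> sum;
    bool carry = carry_in;
    // ExposeLSB / CollectLSB analogue: handle bit 0, then recurse on the rest.
    sum[0] = full_add(a[0], b[0], carry, carry);
    if constexpr (N > 1) {
        std::bitset<N - 1> ahi, bhi;
        for (unsigned i = 1; i < N; ++i) { ahi[i - 1] = a[i]; bhi[i - 1] = b[i]; }
        std::bitset<N - 1> shi = ripple_add<N - 1>(ahi, bhi, carry, carry);
        for (unsigned i = 1; i < N; ++i) sum[i] = shi[i - 1];
    }
    carry_out = carry;
    return sum;
}

int main() {
    bool carry_out = false;
    std::bitset<8> s = ripple_add<8>(std::bitset<8>(100), std::bitset<8>(55), false, carry_out);
    std::printf("%lu carry=%d\n", s.to_ulong(), carry_out);   // 155 carry=0
}
```

Just as in the Viva example, swapping in a wider base case (say, a 4-bit adder overload) would change what the "synthesizer" instantiates without touching the recursive definition.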

3.2.4 Synchronization System

Synchronization support is also an important part of Viva's features. In Viva's library most objects have an asynchronous implementation as an option to use. However, to be able to meet FPGA timing constraints, a single chain of asynchronous object inputs and outputs cannot extend too long without a register or pipeline stage. In Viva, this is where synchronous objects come into play. Synchronous objects contain registers which allow them to break up long timing paths; they are identical in function to their asynchronous counterparts, with the addition of two inputs and two outputs: the inputs Go and Wait, and the outputs Done and Busy. Using these inputs and outputs, control flow can automatically be passed along pipeline stages of arbitrary processing times; no external control state machine is needed to monitor the whole design. The Go, Done, Busy, Wait (GDBW) paradigm is described below.

Table 2. GDBW signal explanations

Signal | Type of port | Description
Go | Input | Signals the object that input values are correct and to start processing
Done | Output | Signals the environment that the outputs are correct and the object has finished processing
Busy | Output | When the object is processing it outputs a constant high value through the Busy port
Wait | Input | Input signal given to the object to instruct it not to begin processing and to ignore any Go signals; the Wait is fed further downstream through the Busy output port

Using these synchronization primitives, a designer is able to create true dataflow designs that have complex functionality without the need for a central control state machine. However, it should be noted that in some design situations the addition of synchronization can add extra resource overhead, dependent on the granularity level at which the synchronization controls are used.
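The following toy model (ordinary C++, not Viva or an HDL) illustrates how the four handshake signals let two chained objects of different latencies pass control along without any central state machine. The timing behavior here is a simplified assumption for illustration, not the exact Viva semantics.

```cpp
#include <cstdio>

// Toy model of one synchronous object with the GDBW handshake; 'latency'
// models the object's internal processing time in clock cycles.
struct Stage {
    int latency;
    int remaining = 0;    // cycles left for the value currently in flight
    bool done = false;    // pulses high when the output is valid
    bool busy() const { return remaining > 0; }

    // One clock edge: 'go' means valid input this cycle, 'wait' means the
    // downstream consumer cannot accept a result yet.
    void clock(bool go, bool wait) {
        done = false;
        if (remaining > 0) {
            if (remaining == 1 && !wait) { remaining = 0; done = true; }
            else if (remaining > 1)      { --remaining; }
            // remaining == 1 && wait: hold the result until wait drops
        } else if (go && !wait) {
            remaining = latency;
        }
    }
};

int main() {
    Stage a{2}, b{3};          // two chained objects with different latencies
    bool feed = true;          // present one input to the head of the chain
    for (int cycle = 0; cycle < 12; ++cycle) {
        // Done of 'a' drives Go of 'b'; Busy of 'b' feeds back as Wait on 'a'.
        bool a_done_prev = a.done;
        a.clock(feed, b.busy());
        b.clock(a_done_prev, /*wait=*/false);
        feed = false;
        std::printf("cycle %2d  a: busy=%d done=%d   b: busy=%d done=%d\n",
                    cycle, a.busy(), a.done, b.busy(), b.done);
    }
}
```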

3.2.5 Dynamic Reconfiguration

Another feature of Viva is the ability to perform targeted dynamic reconfiguration at runtime. This feature is implemented in a straightforward manner using a few provided library objects. The basic idea is to create and synthesize independent design sheets, each of which is used to specify a PE configuration that is used for reconfiguration. In each design sheet a special start object called Go is placed, which fires a signal after the design has properly been loaded into the PE following a reconfiguration. Consequently, a stop object called Stop is then used to signal to Viva that the current design has finished its processing. To control the flow of reconfigurations, the highest level sheet in the hierarchy is used. On this sheet special objects named Spawn are placed; each Spawn object signifies a reconfiguration and takes as input a synthesized Viva design. Using these Spawn objects on the top level sheet and the stop and go objects in the different designs, a complete runtime reconfiguration system can be implemented. It should also be noted that the Spawn and the start/stop objects allow data to be transferred as initial conditions/final results throughout the reconfiguration cycles.

3.2.6 External Program Interface

The last feature to mention is the interface provided by Viva for communicating with external programs during runtime. Viva provides this interface by supporting the Microsoft Component Object Model (COM) interface and marshaling system. The COM system was designed to allow different software modules to have a standardized way to share and discover exported functionality and events. The COM system does this through the use of special structures called interfaces; by acquiring the interface of the desired object, the calling object can discover and execute all exported functions the called object had wished to expose. Viva works in a similar

41 28 manner, by loading the desired object code 12 and then retrieving its interface, it has the ability to use any functionality offered by the component. All of this is done through the graphical GUI of Viva, where desired modules are put as subsection in a special part of the overall object library. Then after examining the modules interface, individual function calls are portrayed as regular library objects, and can even be dragged into a design sheet and synthesized so that the hardware can trigger their execution. The functions are actually executed by providing a pulse to the input named Go of the function call object 13. This ability to interface with externally compiled code makes possible whole new levels of system design. Infrastructures can 14 be created to allow many new useful operations between software code and hardware. A prime example is that of data transfers between software code and the hardware. Using these infrastructures would allow legacy code packages to be modified to pass data to the hardware for accelerated execution with the result then returned to the software 15. Infrastructures such as these mentioned are not automatically created, and currently, are the responsibility of the designer to make. However, Viva does provide a basic C code API which allows data to be sent and received across the PCI bus. It is therefore possible to build another communication 12 By either loading a DLL or instantiating an ActiveX object. 13 The function call objects use the previously mentioned GDBW paradigm, however, since the object is masking a function call only the Go (used to call the function) and Done (used to signify function return) of the paradigm are modeled. 14 And have been created for this thesis! 15 Note: this infrastructure capability mentioned is basically what is already built into and happens automatically in the SRC systems design process.

infrastructure around this API instead of using the previously mentioned COM system.

3.2.7 Evaluation of VIVA

Overall, both the hardware model of the HC-62 and the software tool suite Viva provide a great combination for the design of high performance RCSs. Viva, in particular, is a powerful tool suite in terms of design flexibility and control, offering many options and features, such as polymorphic objects and recursive synthesis, which are not available in other tool suites. However, all these features come with a hidden cost: a fair amount of experience with the Viva system is necessary before one can become proficient at making effective designs. Oftentimes, it seems, too much control is given to the designer and too many details are exposed. Without a solid understanding of what the final design should be, a designer has a greater chance of becoming lost in the details, going through many design iterations. Viva is a powerful system that needs a powerful user to create comparatively efficient designs. The features and advantages of Viva mentioned here all depict the current state of the software suite. This author's experience with Viva dates back a couple of years, and only recently does it seem to offer most of what it has been advertised to do. Among other things, a stable base object library was not always available. However, with the release of Viva 3.0 this seems to be changing. Also, for a long time reliable PCI to RCS communication was not available. Again, this has now been resolved. The current state of Viva is usable to a good degree and with time it can only improve.

CHAPTER 4: MEMORY/COMMUNICATIONS BENCHMARKING

In this chapter the results of a benchmark study of the memory subsystem and bus communication interfaces of the Starbridge HC-62 will be presented. The objective of this benchmark study was to locate the data transfer paths of the HC-62 that most greatly affect its performance, and also to determine when this performance decrease occurs. The specific transfer paths studied include PCI bus to PE and PE to external (DRAM) memory. No effort was made to study transfer paths involving the internal FPGA RAM because its latency is always equal to one clock cycle. Even if a very large internal FPGA RAM object were synthesized its latency would still be one clock cycle; the only side effect of this action would be a lower overall operating FPGA clock frequency. However, given the constraints that Viva and most FPGA designs place on available clock frequencies, the useful information obtained from such a study would be limited.

4.1 Methodology

Before carrying out this study a means of access for sending and receiving data across the PCI bus had to be created. As mentioned previously, Viva offers two ways to coordinate such an effort, either through the use of its C API or through COM components interfaced with Viva library objects. The approach taken for this study was a mixture of both: C API usage for the actual transfer of data, and a COM object interface to initiate communications and capture timing information. All code was

44 31 encapsulated into one COM component; access to the COM component was placed on the top level design sheet to facilitate function calling. Then, functions were created and subsequently called which started the process of sending or receiving data to/from the PEs and simultaneously measuring the associated transfer time. This technique effectively provides good coverage for timing measurements on the host CPU environment. However, the most valuable measurements are those made in the hardware. Before explaining how the measurements were made a brief overview of the DRAM and PCI bus systems will be given. On the HC-62 when data is sent across the PCI bus, the first stop is actually not the destination PE. Directly off the PCI bus sits the bus controller FPGA of the RCS board, after this controller receives the intended data it is put into one of two FIFOs 1 created separately for each PE. A separate object, the object encapsulating the PCI bus, is then notified of this data. This PCI bus object has an associated library object representation which supports the GDBW interface. The procedure for receiving data is: if no Wait input is high at the time data is received it will output the data with a corresponding Done signal, otherwise the data will sit in the FIFO until the Wait signal goes low 2. To effectively measure which component of the overall system could be causing any performance decreases, counters 3 were strategically placed around the 1 For each PE there is a separate FIFO for both sending and receiving data to/from the PCI bus. 2 This process mentioned is for reading data from the PCI bus, for writing data a pulse is issued to the Go port of the library PCI bus object. Also no data is allowed to be or will be written if the busy output is high. 3 These counters counted FPGA clock cycles.

45 32 GDBW ports of the PCI bus library object to account for various types of delays. This allowed measurements of the duration of the delays caused when the PCI bus could not accept data to be sent because it was not in a ready state and also when data was not being issued from the PCI bus when expected. One important consideration when designing a test for the PCI bus is to minimize any traffic across of it. Apart from regular system traffic in the host CPU that a designer has no control over, there are other aspects to consider. In Viva, when a transport is placed to an output port on the top-level sheet, a communication connection is automatically made through the PCI bus that transfers these output values on a real-time basis to the host CPU for debugging output. If one is not careful in designing the outputs of the test sheet for the PCI bus, then the very outputs which give the final performance data will affect the PCI bus by using up bandwidth unnecessarily transferring this data before the test is complete. With the use of input enabled registers these outputs can be tied to control test logic and made available only at the conclusion of the test so that avoidable bandwidth consumption can be minimized. For tests dealing with the DRAM, it should be pointed out that the DRAM interface object also employs the same GDBW paradigm to reading/writing operations as the PCI Bus. It can be considered similar to the PCI bus object interface with an added input for RAM addressing and a read/write port; therefore, timing measurements can be made in a similar manner to the PCI bus object. However, one particular caveat of the DRAM object is that it needs to periodically refresh its DRAM cells. During this refresh cycle the DRAM becomes unusable and sets its busy signal high; efforts have been made to capture when the DRAM is in this refresh state versus a normal busy state, so the DRAM refresh periods contribution to performance degradation can be quantified.
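As a concrete sketch of the host-side half of this methodology, the code below wraps a single transfer in a wall-clock measurement and converts it to bandwidth. The hc_write call is a hypothetical stand-in (here it only walks the buffer so the example runs); it is not the actual name or signature of the Viva C API.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the Viva C API call that pushes a buffer to a PE
// over the PCI bus; the stub below only touches the buffer so the example runs.
void hc_write(int pe, const unsigned char* buf, std::size_t bytes) {
    (void)pe;
    volatile unsigned char sink = 0;
    for (std::size_t i = 0; i < bytes; ++i) sink = buf[i];
}

// Time one host-to-PE transfer and report sustained bandwidth in Mbytes/s,
// mirroring what the COM timing component does around the real API call.
double timed_write_mbytes_per_s(int pe, std::size_t bytes) {
    std::vector<unsigned char> buf(bytes, 0xA5);
    auto t0 = std::chrono::steady_clock::now();
    hc_write(pe, buf.data(), buf.size());
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0) seconds = 1e-9;   // guard against timer resolution
    return (bytes / 1.0e6) / seconds;
}

int main() {
    for (std::size_t n : {1u << 12, 1u << 18, 1u << 24})
        std::printf("%9zu bytes -> %8.1f Mbytes/s\n", n, timed_write_mbytes_per_s(0, n));
}
```

The hardware-side counterpart of this measurement is the cycle counters placed around the GDBW ports described above.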

4.2 Benchmarks

The actual benchmark tests employed were used to measure a variety of different configurations. The first benchmarks considered focused on observing the maximum sustained transfer rate for reading and writing data across the PCI Bus and across the external DRAM modules that surround a particular PE. These benchmarks simply consist of reading or writing streams of differing lengths of data to determine burst and sustained performance. In addition to these benchmarks for extracting maximum performance figures, the STREAM and RANDOM benchmarks were also used. These benchmarks are fully described in [15]; however, a brief overview is provided in the table below. The STREAM benchmark aims to test streaming memory and computational performance by reading an associated memory address and then writing it back to its location. While this is the most basic test, further tests of the STREAM benchmark include multiplying the memory value retrieved by a scalar before writing it back, adding the value to another retrieved value before writing it back, and combinations of the above. The RANDOM benchmark aims to disrupt normal prefetching of data by testing the performance of updates from random memory locations. In the tests conducted here, an LFSR style number generator was created for random number generation using information obtained from [16].

Table 3. Memory Benchmark explanations

Benchmark | Algorithm | Measurement Metric
STREAM Copy | c = a | Gigabytes/s
STREAM Scale | b = α*c | Gigabytes/s
STREAM Add | c = a + b | Gigabytes/s
STREAM Triad | a = b + α*c | Gigabytes/s
RANDOM | x = x xor random_address | GUPS (Giga-updates per second)
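For reference, the kernels in Table 3 correspond to the simple loops below, written here in C++ as the software equivalents of the hardware dataflow; the LFSR tap mask shown is illustrative only and is not the generator taken from [16].

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Software renderings of the Table 3 kernels.
void stream_copy(std::vector<double>& c, const std::vector<double>& a) {
    for (std::size_t i = 0; i < c.size(); ++i) c[i] = a[i];
}
void stream_scale(std::vector<double>& b, const std::vector<double>& c, double alpha) {
    for (std::size_t i = 0; i < b.size(); ++i) b[i] = alpha * c[i];
}
void stream_add(std::vector<double>& c, const std::vector<double>& a, const std::vector<double>& b) {
    for (std::size_t i = 0; i < c.size(); ++i) c[i] = a[i] + b[i];
}
void stream_triad(std::vector<double>& a, const std::vector<double>& b, const std::vector<double>& c, double alpha) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] + alpha * c[i];
}

// RANDOM (GUPS-style) update loop: a Galois LFSR supplies pseudo-random
// addresses so sequential-access and prefetch tricks cannot help.
// The seed must be nonzero; the tap mask is illustrative, not maximal-length.
void random_update(std::vector<std::uint64_t>& table, std::uint32_t lfsr, std::size_t updates) {
    for (std::size_t i = 0; i < updates; ++i) {
        lfsr = (lfsr >> 1) ^ ((lfsr & 1u) ? 0xA3000000u : 0u);
        table[lfsr % table.size()] ^= lfsr;
    }
}

int main() {
    std::vector<double> a(1024, 1.0), b(1024, 2.0), c(1024, 0.0);
    stream_triad(a, b, c, 3.0);
    std::vector<std::uint64_t> table(1 << 10, 0);
    random_update(table, 0xACE1u, 1 << 16);
    std::printf("a[0]=%g  table[0]=%llu\n", a[0], (unsigned long long)table[0]);
}
```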

The benchmarks4 conducted are listed below:

- Maximum sustained transfer performance for reading/writing data on the PCI bus
- Maximum sustained transfer performance for reading/writing sequential addresses to external DRAM
- The different subsets of the STREAM benchmark using external DRAM
- The RANDOM benchmark using external DRAM

All maximum performance tests on both the PCI bus and DRAM were constructed with the idea of using a pipeline to ensure adequate data availability when testing sending operations. Also, to ensure maximum performance during receive operations, all received data is stored5 then immediately discarded to avoid bottlenecks that could stall receive operations. This ensures that the tests measure the true throughput of the communication channel and not of the components feeding or receiving its data.

4.3 Experimental Evaluation

Below, results from the maximum sustainable performance benchmarks, the STREAM benchmarks, and the RANDOM benchmarks will be presented.

4.3.1 PCI Bus Maximum Sustainable Performance

4 The choice of benchmarks was also inspired by [9].
5 Data that is received is first registered through the use of a standard clock enable register; this register's output is then connected to an output transport allowing the value to be seen on the runtime interface during test execution. This is done so that the data signal is not automatically removed during synthesis and will have a chance to affect any timing constraints contained on the associated PCI Bus or DRAM object.

Results for the maximum sustained performance rates for both sending and receiving through the PCI Bus and DRAM are shown below.

Fig. 10. Sustained Bandwidth Measurements for PCI Bus and DRAM

The figure above displays the bandwidth measured for both the PCI Bus and the DRAM over different amounts of data transferred. These measurements were taken using test hardware that introduced no latency of its own into the sending and receiving process, so that all influences on sustainable bandwidth are directly linked to either the hardware structure itself or its software support 6. Another important piece of information needed for a clear picture of overall sustainable performance is the stall time that the PCI Bus and DRAM devices exhibit; this is shown in the figure below.

6 In this case, the software reference is to the software driver that runs on the host CPU machine to control sending and receiving data to the RC board.

Fig. 11. Percentage Stall Time of PCI Bus and DRAM

Examining the PCI Bus shows some interesting results. In the case of PCI Bus receives (host CPU system to PE), a peak bandwidth of about 500 Mbytes/s is seen with small data transfer sizes; this is essentially the maximum bandwidth available from the PCI Bus itself 7. However, as the data transfer sizes increase, a large drop in sustainable bandwidth is seen, eventually settling around 66 Mbytes/s. Interestingly, this drop in sustainable bandwidth is also accompanied by a decrease in the time that the PCI Bus was in a stalled state.

7 The PCI-X bus that the Starbridge CPU host system uses allows for different transfer rates and clock frequencies. However, in its current configuration, with a 66 MHz clock and 64 bits transferred per clock, the maximum bandwidth available is 533 Mbytes/s.

The initial 100% stalled state of the PCI Bus can most likely be attributed to the need for the PCI Bus controller chip on the RC board to respond to the request for it to begin receiving data. This process takes time to occur, and with small data transfers it dominates the total transfer time, creating the initial high stall percentages shown. With larger data transfer sets this initial setup time is negligible, and the percent stall time reaches a steady state controlled by other external factors.

In the other case, PCI Bus sends (PE to host CPU system), even more interesting behavior is noted. Initially the sustainable bandwidth is again very close to the peak bandwidth of the PCI Bus, then it quickly drops to almost zero before increasing again. The percent stall time, however, is the inverse of the receiving case: here the stall time starts off at zero percent, sharply increases, then drops off and stabilizes. Both of these behaviors can be explained by the availability (and lack thereof) of buffering present in the different systems (RC board and host CPU) and by the amount of contention for PCI Bus access. In the sending case described above, the increase and subsequent decrease in percent stall time can be explained by the presence of a sending FIFO 8 (First In First Out queue) on the PCI Bus controller chip on the RC board. During the smaller data transfers this FIFO does not completely fill up, and therefore all data transfers appear to have completed immediately; the data, however, is actually sitting in the FIFO waiting for free PCI Bus transfer time to allow it to be transmitted.

8 This FIFO is basically a queue that sits next to the PCI bus inside the PCI Bus controller FPGA. It allows some data to be queued while the PCI bus is busy servicing other memory requests.

The big increase in sending stall time from zero to 85 percent occurs when this FIFO becomes filled for the first time. After this point, and once a steady state has been reached, the percent stall time and sustainable bandwidth are functions of contention on the PCI Bus and buffer issues on the receiving side 9. In the end, to achieve better sustainable bandwidth there are design decisions to consider. First, all buffers must be sized appropriately for the expected PCI Bus availability time and data transfer sizes (to avoid filling completely during one PCI Bus transfer session). If both buffers have been sized correctly and do not fill up during a PCI Bus receive session, then contention on the PCI Bus is most likely to blame, by not allowing data to be transmitted fast enough. Looking at the percent stall times, it is seen that PCI Bus receives have a 150% increase in stall time over PCI Bus sends, and consequently the bandwidth of PCI Bus receives drops by almost a similar amount. Whether this stall time increase is due to contention or to inadequate buffer sizes 10 is not known. However, there are differences in the amount of contention present between the two ends.

9 In this case the receiving end refers to the host CPU system and O/S (operating system), and this leads to a couple of factors that can affect performance. One factor is how much buffer space was allocated by the driver and how it is handled. Another factor is contention among the different O/S drivers for PCI Bus access, which is a function of what priority the driver is given in the overall system for receiving data and updating its buffers. On the sending end (the RC board in this case), the size of the FIFO can also affect performance if it is too small relative to the average amount of data that can be transferred during the available PCI Bus access cycles.
10 Both buffers can be sized inadequately. If the send buffer is too small it will run out of data to send before the PCI Bus transfer session is over, and if the receive buffer is too small it will run out of room before the session ends, limiting overall bandwidth.
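The jump in send-side stall time once the FIFO first fills can be reproduced with a small queue model. The sketch below is a toy simulation under assumed parameters (the FIFO depth and the bus availability pattern are invented for illustration); it demonstrates only the qualitative effect described above, not the actual controller's behavior.

```python
# Toy model of the send-side FIFO on the PCI Bus controller: the PE sees no
# stall cycles at all until the FIFO first fills; afterwards the stall rate is
# set by how often the bus can actually drain queued words.

def send_stall_percent(words_to_send, fifo_depth, bus_free):
    """bus_free(cycle) -> True when the PCI bus can accept one word this cycle."""
    fifo = sent = stall = cycles = 0
    while sent < words_to_send or fifo > 0:
        if bus_free(cycles) and fifo > 0:
            fifo -= 1                    # the bus drains one queued word
        if sent < words_to_send:
            if fifo < fifo_depth:
                fifo += 1                # the PE's write is absorbed immediately
                sent += 1
            else:
                stall += 1               # FIFO full: the PE sees a stall cycle
        cycles += 1
    return 100.0 * stall / cycles

# With the bus free one cycle in eight, a transfer smaller than the FIFO
# reports no stalls at all, while a long transfer stalls most of the time:
print(send_stall_percent(100, 512, lambda c: c % 8 == 0))      # ~0%
print(send_stall_percent(100_000, 512, lambda c: c % 8 == 0))  # ~87%
```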

Looking at the software driver, there is a higher level of contention present. The software driver has two different levels of contention to deal with: first, internal contention among the other drivers trying to access I/O on the PCI and other buses, and then the actual contention of the PCI Bus itself being busy. On the other hand, the hardware PCI Bus controller chip on the RC board has only one level of contention, that of the PCI Bus being busy. More than likely, the software driver having to share bus access inside the O/S leads to the reduced sustainable bandwidth for PCI Bus receives.

4.3.2 DRAM Maximum Sustainable Performance

The two figures above also show the DRAM sustainable bandwidth and percent stall time 11. Unlike the PCI Bus measurements, the DRAM sustainable bandwidth and stall time both exhibit the same simple behavior, with bandwidth reaching a steady state and stall time increasing to a steady-state value. The DRAM objects used in this test included error checking circuitry to identify whether data stored in the DRAM became faulty. The DRAM controller itself is synthesized logic located on the FPGA that directly controls the functionality of the memory module attached to the PE. A DRAM read or write operation takes on average nine cycles to complete.

11 In this case the DRAM device stalls because it needs to periodically refresh its memory cells. During DRAM refresh no further read or write requests are allowed. Refresh time was measured by counting all cycles in which the wait output of the DRAM was high and subtracting from that all cycles in which a read or write was in progress (since refreshing cannot occur during a read or write transaction).

Due to the DRAM controller being synthesized as logic on the FPGA, in its current implementation it operates at the same average clock frequency used by designs (in this case 66 MHz). While this simplifies interfacing to and using the DRAM module in a design, it reduces the sustainable bandwidth because of the slower operating speed of the controller. Overall, the additional error checking circuitry, plus the fact that the controller is still a fairly recent addition to the object library, probably contributes to the poor bandwidth performance compared to the PCI Bus bandwidth. However, the most important factor affecting its performance is the latency of the controller in terms of its logic and clocked speed. Efforts should be taken to increase the operating frequency of the controller to improve bandwidth performance if possible.
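A rough back-of-the-envelope estimate shows why the controller's latency and clock dominate: using the nine cycles per operation and the 66 MHz design clock quoted above, and assuming one 120-bit module word moves per operation (the word size noted with the STREAM tests below), the implied per-module rate sits well below the PCI Bus's 533 Mbytes/s peak.

```python
# Implied per-module DRAM bandwidth from the figures in the text (an estimate
# only; it assumes exactly one 120-bit word transfers per nine-cycle operation).
clock_hz       = 66e6        # synthesized controller runs at the design clock
cycles_per_op  = 9           # average cycles per read or write
bytes_per_word = 120 / 8     # module word size

est_bytes_per_s = clock_hz / cycles_per_op * bytes_per_word
print(f"~{est_bytes_per_s / 1e6:.0f} Mbytes/s per DRAM module")   # ~110
```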

4.3.3 STREAM Benchmark

For the STREAM benchmarks, each of the four configurations (copy, scale, add, and triad) was implemented and its bandwidth measured. Normally STREAM is run across multiple CPUs and memory banks in shared memory systems. Borrowing from this idea, each implementation of the four different STREAM tests used at least two different memory modules, with the last two tests using three. The idea was to create as parallel a memory system as possible to gain bandwidth over the total memory address space. All implementations featured pipelining, issuing read and write memory requests through queues to help alleviate possible bottlenecks caused by individual DRAM units entering a refresh cycle. The sustainable bandwidth and percent stall times for the different STREAM tests are shown below; each test was run three times, at ¼, ½, and the full address space of one individual DRAM module 12.

Fig. 12. Sustained Bandwidth Measurements for STREAM Benchmarks

12 Each DRAM module (there are four DRAM modules per PE) has an address space of 2^26 = 67,108,864 different addresses with a word size of 120 bits.

Fig. 13. Percent Stall Time Measurements for STREAM Benchmarks

Looking at the figures, it is noticeable that two pairs of the benchmarks almost completely overlap each other in terms of both bandwidth and percent stall time. This is due to the similar layout of both pairs of benchmarks: the copy and scale benchmarks each use two different memory modules with pipelining placed in between, and the add and triad benchmarks each use three memory modules (two read modules and one write module) with the associated pipelining between them. The difference in the bandwidth shown simply accounts for the extra read of one data item in the latter two benchmarks (add and triad) versus the first two (copy and scale). The difference in stall time is another effect of the design decisions made. The reason the add and triad benchmarks have such a greater stall time is that the algorithms for these two benchmarks require two reads. In the implementation of these benchmarks it was decided that both read results would be needed before the pipeline stage would be allowed to continue. This means that there is twice as much chance for the pipeline to stall (due to DRAM refreshing), contributing to the greater overall percent stall time.

Comparisons can be made between the results here and the HPC Challenge results [26], either in terms of the individual bandwidths reported above or in terms of theoretical maximum bandwidth 13. This gives the results below.

Table 4. STREAM Benchmark Results

  STREAM Benchmark   Single Instance Bandwidth   Theoretical Maximum Bandwidth
  STREAM copy        Gigabytes/s                 3.22 Gigabytes/s
  STREAM scale       Gigabytes/s                 3.22 Gigabytes/s
  STREAM add         Gigabytes/s                 2.41 Gigabytes/s
  STREAM triad       Gigabytes/s                 2.41 Gigabytes/s

Using the theoretical maximum bandwidth results, the Starbridge HC-62 scores in the top half of the HPC Challenge submissions (which consist of many powerful traditional supercomputers with specially engineered memory subsystems). Most of the competing entries in the HPC Challenge that closely match these results are 64-processor systems with one process running on each processor. The shared memory semantics of these systems are not specified; however, if the assumption is made that one processor directly controls one bank of memory and performance can be divided evenly among processors, then a rough correlation of 4-8 processors per PE can be made.

13 Theoretical maximum bandwidth simply takes the results above and multiplies them by the number of total available memory modules on the HC-62 system (assuming they would all be running the STREAM tests in parallel). For the copy and scale benchmarks the bandwidth number would be multiplied by 16; for the add and triad benchmarks it would conservatively be multiplied by 8.

4.3.4 RANDOM Benchmark

The last benchmark undertaken is the RANDOM benchmark. As explained earlier, the RANDOM benchmark seeks to remove any performance advantage of memory systems with prefetch or look-ahead capability by using a random memory stride. A new memory address is calculated for each round of the test, and the contents of that memory address are updated by XORing the old value with the current address. The measurement metric for the benchmark is GUPS (Giga-updates per second). The implementation of this benchmark used one memory module, which was sequentially read and then updated with the new value. Since only one memory module was used, no pipelining could be done. A sustained GUPS rate was measured for the HC-62 using one memory module. The theoretical maximum GUPS would be achieved by having all 32 memory modules in the different PEs of the HC-62 operating in parallel; comparing that maximum to the HPC Challenge results would put the HC-62 in the top one fourth of all entries.
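The address stream for the RANDOM test came from an LFSR rather than a software random number generator. The sketch below shows a conventional Galois LFSR of the kind described in [16]; the 16-bit width and the tap positions here are a familiar textbook choice, not necessarily those used in the actual design, which addressed a 2^26-word module.

```python
def lfsr16(seed=0xACE1):
    """16-bit Galois LFSR (feedback polynomial x^16 + x^14 + x^13 + x^11 + 1).
    Yields a maximal-length sequence of non-zero 16-bit values."""
    state = seed & 0xFFFF
    while True:
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB400   # toggle the tap bits when a 1 shifts out
        yield state

# RANDOM-style updates driven by the LFSR address stream:
mem = [0] * (1 << 16)
gen = lfsr16()
for _ in range(1000):
    addr = next(gen)
    mem[addr] ^= addr
```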

4.4 Analysis

Overall, the memory subsystem of the Starbridge HC-62 has both high and low points. The size of the DRAM memory, along with the configurability offered through the number of independent controllers, gives a designer great versatility. Also, the ability to have multiple parallel memory systems rather than one big address space helped the HC-62 earn higher marks in the benchmarks presented in this section. While the HC-62 does offer an order of magnitude greater memory quantity than the other systems discussed earlier, its bandwidth in key memory areas such as the PCI Bus and DRAM interfaces should be improved. Comparing the Starbridge HC-62 to the SRC Computers RCS, memory channel bandwidths and latencies can be seen to vary greatly (see [24][25]), with the SRC RCS having a 2-3x greater bandwidth between the host CPU system and the RCS 14. In most of today's algorithms the limiting resource for effective speedup on an RCS is not memory quantity but sustainable memory bandwidth [25]. In this regard the HC-62 could benefit from work that improves bandwidth across the PCI Bus and, more importantly, to the DRAM modules.

14 This is most likely attributable to the special bus interface mentioned earlier in chapter two.

CHAPTER 5: COMPUTATIONAL BENCHMARKING

In chapter four the memory subsystem of the HC-62 was evaluated; in this chapter the aim is to evaluate the computational performance of the HC-62 using several different algorithms. The algorithms include DGEMM, the Triple-DES encryption algorithm, an implementation of Conway's Game of Life, and lastly an implementation of the Smith-Waterman algorithm (described in [15], [17], [18], and [19] respectively). Each of these algorithms has been chosen to serve a different and common purpose. The DGEMM algorithm was chosen to provide a direct comparison with an operation commonly used to benchmark conventional CPU systems: double precision floating point matrix multiplication. The Triple-DES algorithm, due to the ease with which it can be pipelined, was a good choice for exploiting the throughput capabilities of RCSs and for gaining insight into the limiting factors of RCSs in the area of computational versus memory bandwidth. Conway's Game of Life, a cellular automaton based system, and the Smith-Waterman algorithm were both chosen to show the power of parallel computation that the RCS provides, in addition to showing the level of localized data communication and routing that is possible.

5.1 Methodology

The basic methodology for all the algorithms is very similar. All algorithms will be initialized with, or streamed, their input data via the PCI bus. Counters will then record the total time that execution takes to finish, along with measuring when stalls halt computation.

Varying sizes of inputs will be provided, and execution times and throughput will be compared to normal CPU systems and other RCS systems to see how much speedup occurs and when.

5.2 Benchmarks

The benchmarks tested include:

DGEMM
In this benchmark a theoretical maximum GFlops/s is calculated for the HC-62. This is done by creating and synthesizing the base components used in the DGEMM algorithm and evaluating the limiting bandwidth factors, whether computational or memory. These limits, together with the synthesis results, allow a theoretical throughput to be calculated.

Triple-DES
In this benchmark a pre-existing implementation of the DES algorithm is converted to a Triple-DES implementation. Data to be encrypted is streamed over the PCI Bus from the host CPU, through the pipelined encryptor, and finally streamed back to the host. The limiting factors in achieving maximum throughput will be explored and maximum performance comparisons to other implementations will be made.

Conway's Game of Life
Conway's Game of Life is a benchmark intended to show the parallel processing capabilities of RCSs. In this benchmark an array of CAs (cellular automata) interact with each other to simulate a biological environment of neighboring cells. Following certain rules, cells can either die or be reincarnated, changing the overall environment with every generational step 1.

The goal in our case is to reach a stable environment and compare the execution times of our RCS and a traditional CPU system. Traditional CPU systems simulate the game's behavior by stepping through memory arrays that track the condition of each cell and then performing the rule calculations separately on each memory location. The RCS implementation, however, allows all cells to be interconnected to their closest neighbors, allowing an entire generational step to occur every clock cycle.

Smith-Waterman
The Smith-Waterman benchmark follows in the footsteps of the Game of Life benchmark by calculating its result through an array of interconnected comparators. In this case, the benchmark calculates the overall match score between two different biological sequences. The two input sequences are aligned along two edges of the array, and computation is performed with the immediate neighbors on a character by character (or cell by cell) basis. If characters match, mismatch, or follow certain rules, a predefined amount is added to or subtracted from the overall score at that specific point in the array. In this manner the data flows through the system in a wave-like fashion. The implementation of this benchmark considered here was a pre-existing implementation developed by Starbridge Systems; their performance data is reported.

1 One pass over the entire array and the subsequent updating of cell status is considered a generation.
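The per-cell scoring rule behind the Smith-Waterman array is the standard local-alignment recurrence, sketched below as a plain software reference. Each cell depends only on its upper, left, and upper-left neighbors, which is exactly what allows the hardware to evaluate a whole anti-diagonal of cells in parallel. The scoring constants here are arbitrary illustrative values, not the ones used in the Starbridge design.

```python
def smith_waterman_score(seq1, seq2, match=2, mismatch=-1, gap=-1):
    """Reference Smith-Waterman scoring with a linear gap penalty; returns the
    best local alignment score between seq1 and seq2."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if seq1[i - 1] == seq2[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best
```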

5.3 Experimental Evaluation

Each experimental benchmark is explained and its results given below.

5.3.1 DGEMM

The DGEMM benchmark aims to compute the maximum GFlops/s for a matrix multiply followed by an addition; the function D ← βC + αAB models this behavior, where A, B, C, and D are matrices and α and β are scalars. In this benchmark only an approximation of the GFlops/s value was computed, using synthesis results and performance values from hardware blocks synthesized to mimic the behavior of a fully implemented DGEMM algorithm. The route taken to compute this value for the Starbridge hardware was to determine a fundamental basic block 2 hardware unit that could be used to distribute the matrix multiplications and additions into individual units, each of which calculates one position of the final output matrix D. Each basic block would be provided the values of α and β and the corresponding element of C that it needs for the addition. The remaining data would be streamed over the PCI bus and either pre-stored in each basic block or consumed on the fly, depending on the size of the incoming matrices. Each basic block would begin by computing a dot product using a row of A and a column of B. This result would then be multiplied by α and added to the value of β multiplied by the element of C, which when completed produces one element of the final output matrix D. Each DGEMM Basic Block consists of one multiplier, one adder, and one accumulator register. The basic block outline is shown below.

2 Inspiration for the primary hardware structures used in the DGEMM algorithm was received from [28].

Fig. 14. DGEMM Basic Block Outline

Additional control circuitry (not shown) contains a counter that drives the control inputs of the MUXs. Initially the values A_y and B_y are multiplied together and then added to a running total that accumulates the dot product of the row of matrix A and the column of matrix B. Next, the feedback input is fed into the top MUX and α is fed into the lower MUX to compute the product of α and the newly computed dot product. Then C_x and β are fed into the pipeline and the result is added to the accumulator to obtain the final value, one new element of the output matrix D.
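The sequence carried out by one Basic Block can be restated as the short software reference below. The function and argument names are illustrative only; in hardware the same multiplier and adder are time-shared across all three phases under control of the MUXs.

```python
def dgemm_basic_block(a_row, b_col, c_elem, alpha, beta):
    """Software model of one DGEMM Basic Block: one element of D = beta*C + alpha*A*B."""
    acc = 0.0
    for a, b in zip(a_row, b_col):   # dot-product phase: one multiply-add per iteration
        acc += a * b
    acc *= alpha                     # alpha scaling through the feedback path
    acc += beta * c_elem             # beta*C contribution added last
    return acc
```

A full result matrix would then be formed by assigning one such block (or one block reused in sequence) to each element of D.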

Without knowing in advance the size of the matrices used in the DGEMM computation, the maximum GFlops/s cannot be calculated. However, the minimum GFlops/s can be computed; it is a function of the minimum time required for one iteration of the overall calculation to complete in the DGEMM Basic Block. This minimum time calculation requires the latency, in cycles, of the multiplier and adder. From synthesis testing, multiplying double precision floating point numbers takes 11 cycles and addition takes 6, with the overall design operating at 66 MHz. Therefore 17 cycles per iteration are needed, resulting in a bandwidth requirement of 59 Mbytes/s 3. Consequently, this is within the bandwidth limit that the PCI Bus can supply to the RCS board. Therefore, at a minimum, each DGEMM Basic Block can perform two floating point operations every 17 cycles, which at 66 MHz corresponds to a minimum of roughly 0.008 GFlops/s. If some assumptions are made, then a maximum GFlops/s value can also be estimated. Assuming that the matrices used in the calculation are so large that every DGEMM Basic Block requires the same columns of matrix B to be streamed for computation (after a row of matrix A has been preloaded into each Basic Block), the available PCI Bus bandwidth is no longer a limiting factor, because for very large matrices all DGEMM Basic Blocks receive the same values for computation. This allows as many DGEMM Basic Blocks as a PE can hold to operate in parallel. Synthesis tests have shown that a conservative estimate of around eight DGEMM Basic Blocks can fit onto one PE. The Starbridge HC-62 has eight PEs in total, bringing the total DGEMM Basic Block count to 64. If these assumptions hold, then 64 times the per-block rate could be reached, giving a burst rate of roughly 0.5 GFlops/s.

3 This bandwidth requirement results from needing two 64-bit double precision floating point numbers every 17 cycles.
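The arithmetic behind these figures is simple enough to write out; all of the inputs below are taken from the text above, and the "roughly" values follow directly from them.

```python
# Back-of-the-envelope DGEMM rate for the HC-62, using the figures in the text.
clock_hz   = 66e6    # design clock
cycles     = 17      # 11-cycle multiply + 6-cycle add per iteration
flops_iter = 2       # one multiply and one add
blocks     = 64      # ~8 Basic Blocks per PE x 8 PEs (synthesis estimate)

per_block = flops_iter * clock_hz / cycles   # ~7.8 MFlops/s per Basic Block
peak      = blocks * per_block               # ~0.5 GFlops/s burst, all blocks busy
print(f"{per_block / 1e9:.4f} GFlops/s per block, {peak / 1e9:.2f} GFlops/s peak")
```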

Compared to the HPC Challenge results [26], this peak GFlops/s figure ranks towards the bottom of the list. The reason for this poor GFlops/s rating stems directly from the execution time of the multiplier: because it sits in the middle of the arithmetic chain of the DGEMM Basic Block, no pipelining can be used to hide its latency. This long processing cycle count, along with the much slower clock speed of the HC-62 compared to traditional CPU systems, lessens the effect of the parallel capabilities of the RCS.

5.3.2 Triple-DES

The next benchmark, Triple-DES, was implemented by modifying the original DES algorithm that had been created by Starbridge. The basic modification consisted of placing three of the DES modules together in series, along with associated queues and control objects to allow interfacing with the PCI Bus. Tests then followed that sent differing numbers of packets to be encrypted, and the resulting bandwidth was measured 4. Theoretically, if the Triple-DES pipeline were able to run at maximum capacity, a bandwidth of 533 Mbytes/s 5 would be expected for data streams long enough to completely fill the pipeline. The actual throughput bandwidth observed and the percent stall time for the Triple-DES implementation are shown below.

4 Note: the bandwidth measured did not include the time required to set up and program the PE unit or to transfer the encryption keys; only the transfer time of plaintext and ciphertext data across the PCI Bus and the execution time of the Triple-DES algorithm were captured.
5 A full Triple-DES pipeline would generate one 64-bit encrypted output every clock cycle at 66 MHz, corresponding to a throughput bandwidth of 533 Mbytes/s.

Fig. 15. Bandwidth performance for Triple-DES Implementation

Figures 15 and 16 present a clear picture of a memory bandwidth limitation. Almost immediately after the data stream is initiated across the PCI Bus, the stall rate of the PCI Bus receive operation climbs to the same level shown in chapter four, figure 11; this most likely happens because the pipeline of the Triple-DES implementation starts to accept all the data that the bandwidth-limited PCI Bus has received. Around 1,000 data packets, the pipeline has had the chance to fill completely and settle into steady-state operation; at this point the PCI Bus is not able to supply the required bandwidth and the throughput drops as pipeline stalls are introduced. Whenever these data receive stalls occur, the pipeline must be refilled and the startup latency cost is paid again, so the percentage of time spent doing actual Triple-DES computation falls. Overall, the actual sustained bandwidth of Triple-DES (around 37 Mbytes/s) is lower than the actual sustained bandwidth of the PCI Bus receive (around 61 Mbytes/s) observed in chapter four, due to this pipeline startup penalty.

One side effect of this is that a Triple-DES implementation with fewer queues (and therefore a shorter overall pipeline) might actually perform better, since it would pay a smaller pipeline startup cost once the pipeline begins to stall constantly.

Fig. 16. Percent Stall Time for Triple-DES Implementation

A comparison between the Starbridge platform and the SRC platform for Triple-DES can be made by examining [10]. While the operating frequency of the FPGA is not explicitly mentioned, the sustained bandwidth there climbs to around 88 Mbytes/s, with the peak bandwidth of their Triple-DES implementation reaching 800 Mbytes/s when no bandwidth-limited communication occurs. The results show that the data bus on the SRC machine is either not as heavily bandwidth limited or is run at a faster clock than the Starbridge implementation is able to reach.

5.3.3 Conway's Game of Life

The implementation of Conway's Game of Life was one of the few designs created that made use of Viva's advanced recursive synthesis feature mentioned in chapter three. By building a recursive object consisting of many individual basic Game of Life cells, an arbitrary N x M array of cells could be created simply by changing input values at synthesis time. The game, as mentioned before, consists of an array of cells, where each cell monitors its nearest neighbors and changes its condition based upon the output of its neighbors. Following a few simple rules, certain initial patterns can produce many generations of output. The game is considered done when a stable pattern 6 is eventually reached. Runtime performance of the algorithm is easy to calculate: it consists of the number of generations required for the entered pattern to reach a stable state, plus a few clock cycles for the stable signal to pass through its pipeline and be output. For these comparisons a software-based Game of Life simulator [29] 7 was used. Its source code was modified by inserting timing function calls to record how long actual execution took, and this was compared to the hardware implementation.

6 Here a stable pattern is defined in the hardware as the output of all individual cells not changing over the course of three clock cycles. Each cell has a stable signal, which then proceeds through a pipeline that ANDs it together with the other stable signals, finally converging into one stable output for the whole array.
7 Note: the PC Game of Life simulator differs slightly from the RCS version in how the array playing field is maintained and therefore might execute more operations than necessary. However, its implementation is much faster than a normal software-based Game of Life implementation, and this should even out any disadvantages the program might incur.
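For reference, one generational step and the stability test can be written as the short software model below. It mirrors the rules the hardware cells apply, but it is only an illustration: the hardware evaluates every cell in parallel each clock cycle, and the three-step stability window follows footnote 6 above. Treating edge cells as having dead neighbors is an assumption about the boundary behavior.

```python
def life_step(grid):
    """One Game of Life generation on a list-of-lists of 0/1 cells (dead borders)."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            n = sum(grid[r + dr][c + dc]
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols)
            nxt[r][c] = 1 if n == 3 or (grid[r][c] and n == 2) else 0
    return nxt

def generations_to_stability(grid, window=3, limit=1_000_000):
    """Generations until the array stops changing for `window` consecutive steps."""
    count = unchanged = 0
    while unchanged < window and count < limit:
        nxt = life_step(grid)
        unchanged = unchanged + 1 if nxt == grid else 0
        grid = nxt
        count += 1
    return count
```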

The table below lists the different execution times.

                                            Life Pattern 1   Life Pattern 2   Life Pattern 3   Life Pattern 4
  Generations Required To Reach Stability
  Execution Time RCS 8                      μs               μs               μs               μs
  Execution Time CPU 9                      3,000 μs         4,000 μs         7,000 μs         8,000 μs
  Speedup                                   x5,              x1,              x                x251

Here it is interesting to note that as the number of generations required grows, the speedup decreases. This is probably due to the much faster memory bandwidth and processing speed of the CPU compared to the RCS: the longer the computation takes, the more chance the CPU has to catch up to the RCS because of the RCS's low clock speed. This is the case only in simple computations such as the Game of Life, where the CPU's resources are not under extreme pressure from many long and complex computations; the RCS is at a disadvantage due to its much lower clock speed.

8 Does not include the time required for data transfer or retrieval, just running time in cycles (the number of generations plus the time for the stable output to flow through the stability pipeline). The design was run with a 100 MHz clock and the execution time was calculated accordingly.
9 While the resolution of the software timer in a CPU system cannot match that of the cycle counter in the RCS, it gives a good idea of the order of magnitude of the difference involved.

5.3.4 Smith-Waterman Algorithm

Before discussing the results of the Starbridge Smith-Waterman implementation, an overview of its operation will be given. The Starbridge implementation of the Smith-Waterman algorithm used the idea of a sliding computation window. As shown in the figure below, the Smith-Waterman algorithm allows a wave of computation, as cells have data dependencies only on the cells above them, to their left, and diagonally above and to the left. The diagonal arrows in the figure below indicate the direction of the computation wave, while the horizontal arrows show the direction in which computation proceeds after a predefined area of the array has been computed.

Fig. 17. Wave Computation of Smith-Waterman Algorithm

The specific Starbridge implementation followed this wave pattern within predefined areas of computation (indicated by the square boxes enclosing regions of the overall array). Each PE in the system is assigned a particular box to compute, and as one PE finishes its execution the next PEs are allowed to begin computing as the computation wave's front expands. The finished PE is then reassigned a new area of computation and either immediately starts computing (if all of its data dependencies are met) or waits for other PEs to complete and supply its data dependencies. This method of computing continues until the lower right box is computed and the final score is submitted. Memory bandwidth is


Choosing an Intellectual Property Core

Choosing an Intellectual Property Core Choosing an Intellectual Property Core MIPS Technologies, Inc. June 2002 One of the most important product development decisions facing SOC designers today is choosing an intellectual property (IP) core.

More information

Product Obsolete/Under Obsolescence

Product Obsolete/Under Obsolescence APPLICATION NOTE Adapting ASIC Designs for Use with Spartan FPGAs XAPP119 July 20, 1998 (Version 1.0) Application Note by Kim Goldblatt Summary Spartan FPGAs are an exciting, new alternative for implementing

More information

Pricing of Derivatives by Fast, Hardware-Based Monte-Carlo Simulation

Pricing of Derivatives by Fast, Hardware-Based Monte-Carlo Simulation Pricing of Derivatives by Fast, Hardware-Based Monte-Carlo Simulation Prof. Dr. Joachim K. Anlauf Universität Bonn Institut für Informatik II Technische Informatik Römerstr. 164 53117 Bonn E-Mail: anlauf@informatik.uni-bonn.de

More information

Reconfigurable Computing. Design and implementation. Chapter 4.1

Reconfigurable Computing. Design and implementation. Chapter 4.1 Reconfigurable Computing Design and implementation Chapter 4.1 Prof. Dr.-Ing. Jürgen Teich Lehrstuhl für Hardware-Software Software-Co-Design Reconfigurable Computing In System Integration Reconfigurable

More information

Stratix vs. Virtex-II Pro FPGA Performance Analysis

Stratix vs. Virtex-II Pro FPGA Performance Analysis White Paper Stratix vs. Virtex-II Pro FPGA Performance Analysis The Stratix TM and Stratix II architecture provides outstanding performance for the high performance design segment, providing clear performance

More information

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013)

4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) 1 4DM4 Lab. #1 A: Introduction to VHDL and FPGAs B: An Unbuffered Crossbar Switch (posted Thursday, Sept 19, 2013) Lab #1: ITB Room 157, Thurs. and Fridays, 2:30-5:20, EOW Demos to TA: Thurs, Fri, Sept.

More information

PHY 351/651 LABORATORY 1 Introduction to LabVIEW

PHY 351/651 LABORATORY 1 Introduction to LabVIEW PHY 351/651 LABORATORY 1 Introduction to LabVIEW Introduction Generally speaking, modern data acquisition systems include four basic stages 1 : o o A sensor (or transducer) circuit that transforms a physical

More information

Optimizing Emulator Utilization by Russ Klein, Program Director, Mentor Graphics

Optimizing Emulator Utilization by Russ Klein, Program Director, Mentor Graphics Optimizing Emulator Utilization by Russ Klein, Program Director, Mentor Graphics INTRODUCTION Emulators, like Mentor Graphics Veloce, are able to run designs in RTL orders of magnitude faster than logic

More information

Advanced Design System DSP Synthesis

Advanced Design System DSP Synthesis Advanced Design System 2002 DSP Synthesis February 2002 Notice The information contained in this document is subject to change without notice. Agilent Technologies makes no warranty of any kind with regard

More information

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Subash Chandar G (g-chandar1@ti.com), Vaideeswaran S (vaidee@ti.com) DSP Design, Texas Instruments India

More information

Hardware describing languages, high level tools and Synthesis

Hardware describing languages, high level tools and Synthesis Hardware describing languages, high level tools and Synthesis Hardware describing languages (HDL) Compiled/Interpreted Compiled: Description compiled into C and then into binary or directly into binary

More information

CENG4480 Lecture 09: Memory 1

CENG4480 Lecture 09: Memory 1 CENG4480 Lecture 09: Memory 1 Bei Yu byu@cse.cuhk.edu.hk (Latest update: November 8, 2017) Fall 2017 1 / 37 Overview Introduction Memory Principle Random Access Memory (RAM) Non-Volatile Memory Conclusion

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

VHDL-MODELING OF A GAS LASER S GAS DISCHARGE CIRCUIT Nataliya Golian, Vera Golian, Olga Kalynychenko

VHDL-MODELING OF A GAS LASER S GAS DISCHARGE CIRCUIT Nataliya Golian, Vera Golian, Olga Kalynychenko 136 VHDL-MODELING OF A GAS LASER S GAS DISCHARGE CIRCUIT Nataliya Golian, Vera Golian, Olga Kalynychenko Abstract: Usage of modeling for construction of laser installations today is actual in connection

More information

Developing Measurement and Control Applications with the LabVIEW FPGA Pioneer System

Developing Measurement and Control Applications with the LabVIEW FPGA Pioneer System Developing Measurement and Control Applications with the LabVIEW FPGA Pioneer System Introduction National Instruments is now offering the LabVIEW FPGA Pioneer System to provide early access to the new

More information

System-on-a-Programmable-Chip (SOPC) Development Board

System-on-a-Programmable-Chip (SOPC) Development Board System-on-a-Programmable-Chip (SOPC) Development Board Solution Brief 47 March 2000, ver. 1 Target Applications: Embedded microprocessor-based solutions Family: APEX TM 20K Ordering Code: SOPC-BOARD/A4E

More information

Navigating the RTL to System Continuum

Navigating the RTL to System Continuum Navigating the RTL to System Continuum Calypto Design Systems, Inc. www.calypto.com Copyright 2005 Calypto Design Systems, Inc. - 1 - The rapidly evolving semiconductor industry has always relied on innovation

More information