Title: Reconfigurable Logic and Hardware Software Codesign
Class: EEC282
Author: Marty Nicholes
Date: 12/06/2003

Abstract. This is a review paper covering various aspects of reconfigurable logic. The focus is on hardware that assists a general-purpose processor. Some of the solutions discussed reconfigure dynamically, while others are more static in nature. Dynamic reconfigurable computing is the pinnacle of this field: it would allow hardware to be customized on the fly for the task at hand. Some discussion of static reconfiguration is still important, because that work provides the foundational algorithms needed to configure the hardware efficiently for the task. The main paper referenced is [a], which describes Programmable Active Memories (PAMs). The work it describes has become the basis for much subsequent research.

1 Introduction

Reconfigurable logic is central to hardware/software codesign, because this logic is used either to prototype an ASIC solution or to implement the final hardware assist solution. The area of reconfigurable logic spans a variety of implementation techniques. One scale on which to measure the techniques is the frequency of reconfiguration. [f] describes a fanciful handheld device that is able to reconfigure on the fly for new network protocols or security algorithms, and even to deconfigure unused logic to save power. Although the article does not describe any unique research, it does provide a vivid picture of future uses of reconfigurable logic.

At the low end of reconfiguration frequency is the traditional use of the FPGA as assist hardware, where reconfiguration happens only to fix bugs or slightly enhance functionality. Next comes hardware that is configured for a particular program and is dedicated to that application. Use of the FPGA as a prototyping vehicle for ASIC development falls in this category as well.
The next area is fault-tolerant hardware, which can be reconfigured during operation to replace a defective part of the hardware. This is also called embryonics, because the hardware is structured like living cells [e]. [a] reviews the PAM hardware system, which can be reconfigured for each application. The paper asserts that a PAM could be time-shared among 12 applications, but it offers no supporting evidence for this claim. The final stage is reached with evolvable systems, where the hardware can be reconfigured while it is in use. This will require many aspects of the various hardware systems described in this paper to work perfectly. This final stage exhibits the most important benefits of reconfigurable logic, including flexibility in how the available hardware resources are used. This flexibility allows the system to adjust the hardware at run time to trade off the critical factors of performance and power consumption. As [f] describes, the
move is to mobile computing, and reconfigurable logic has much to offer in this area.

The rest of the paper is organized as follows. Section 2 provides a quick overview of the FPGA. Section 3 covers the various techniques used to provide reconfigurable logic. Section 4 describes some of the applications for these techniques. Section 5 raises the issues preventing faster progress in this field. Section 6 is the conclusion.

2 FPGA Basic Building Block

The field-programmable gate array (FPGA) is the hardware basis for most of the papers reviewed. This chip contains logic that can be wired up on the fly to implement a design. The work in this area uses FPGAs in a variety of ways within the associated configurable hardware subsystem. FPGAs have good characteristics for this application: they are flexible and, of course, reprogrammable. Their drawbacks are that they can be slow to reconfigure, they are expensive, and they require a complex tool chain to calculate the bitstream needed for reconfiguration. The solutions discussed in the next section have interesting ways of working around these limitations.

3 Reconfiguration Techniques

Using reconfigurable logic requires some common steps: 1) profile the software, 2) find interesting code sections, 3) implement the interesting code in hardware, 4) modify the software to use the hardware, and 5) run the partitioned system. Of course, many decisions must also be made about the amount of hardware resources to make available for reconfiguration.

In [a], the PAM prototype P1 uses a driver to provide access to hardware reconfiguration. A 1.5 Mb bitstream reconfigures the hardware. The hardware consists of 23 Xilinx FPGAs (5 switch FPGAs, 2 controller FPGAs, and 16 FPGAs in a matrix), 4 blocks of SRAM of 1 MB each, and 2 FIFOs. Figure 1 shows the structure of the P1 design.
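As a rough illustration of steps 1 to 3 in the list of common steps, and of the resource decisions they involve, the sketch below picks which profiled code regions to move into hardware under a fixed area budget and estimates the resulting overall speedup with Amdahl's law. It is only a sketch under stated assumptions: the `Region` fields, the greedy selection rule, and all of the numbers are hypothetical and are not taken from [a] or any other referenced paper.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One profiled code region (all fields are illustrative assumptions)."""
    name: str
    time_fraction: float  # share of total software run time, from profiling
    hw_speedup: float     # assumed speedup if the region is moved to hardware
    area: int             # assumed hardware cost in arbitrary units


def time_saved(region):
    """Fraction of total run time saved by moving this region to hardware."""
    return region.time_fraction * (1.0 - 1.0 / region.hw_speedup)


def choose_regions(regions, area_budget):
    """Greedy choice: best time saved per unit area first, within the budget."""
    chosen = []
    remaining = area_budget
    ranked = sorted(regions, key=lambda r: time_saved(r) / r.area, reverse=True)
    for region in ranked:
        if time_saved(region) > 0 and region.area <= remaining:
            chosen.append(region)
            remaining -= region.area
    return chosen


def overall_speedup(regions, chosen):
    """Amdahl's law: 1 / (new run time as a fraction of the old run time)."""
    chosen_names = {r.name for r in chosen}
    new_time = sum(
        r.time_fraction / r.hw_speedup if r.name in chosen_names else r.time_fraction
        for r in regions
    )
    return 1.0 / new_time


# Hypothetical profile: the time fractions sum to 1.0.
regions = [
    Region("inner_loop", 0.60, 10.0, 4),
    Region("filter", 0.25, 5.0, 5),
    Region("control", 0.15, 1.0, 1),
]
chosen = choose_regions(regions, area_budget=6)
speedup = overall_speedup(regions, chosen)
print([r.name for r in chosen], round(speedup, 2))
```

With these made-up numbers only `inner_loop` fits in the budget, and the estimated overall speedup is about 2.17 even though the loop itself is assumed to run 10 times faster. The software left on the processor bounds the gain, which is why the profiling step matters.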
Figure 1 P1 PAM Design

The programming language chosen was C++ with enhancements to describe nets. A simulation environment was also available. Results showed that non-EE students could use the toolchain successfully within a few weeks, whereas comparable results with ASICs require highly skilled engineers. The main design guideline described in [a] is: cast the inner loop in PAM hardware; let the software handle the rest!

Figure 2 Dynamic HW/SW Partitioning System Architecture [c]

[c] describes a self-contained system which is a processor module that contains the following: a general-purpose microprocessor, memory, configurable
logic, and a dynamic partitioning module. Figure 2 shows the overall structure of the processor module, while Figure 3 shows details of the special logic. This early prototype attempts at run time to locate candidate code loops by snooping instruction fetches from main memory. The system then disassembles the code and creates control and data flow graphs. Using this information, it creates a bitfile for hardware reconfiguration and patches the software binary to trigger the hardware. While the reconfigured hardware is executing, the processor transitions to a low-power state. This prototype has many limitations: 1) it supports 1 cycle loops only, 2) memory accesses must be sequential, 3) it provides only basic hardware logic, and 4) it requires manual binary patching. However, this avenue is promising, because it offers the possibility of conserving power while improving performance on various algorithms.

Figure 3 Dynamic Partitioning and Configurable Logic Module Detail [c]

More promising in the runtime reconfiguration space is the PipeRench design discussed in [d]. It consists of processing elements (PEs) attached to a reconfigurable data path. Figure 4 shows how the hardware reconfigures as needed, taking only 1 clock cycle for each stripe, which is the basic building block. Figure 5 shows the internal structure of a stripe. The big advantage of the PipeRench design is that the hardware is abstracted from the software, allowing software to be ported between different PipeRench hardware implementations. The performance and energy efficiency are very impressive. [d] compares PipeRench running at 120 MHz with an 800 MHz Pentium III processor. The algorithm is for encryption, and the PipeRench hardware outperforms the processor by a factor of 5.
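The stripe-by-stripe reconfiguration described above can be pictured with a toy schedule. The sketch below is a simplified model and not the controller from [d]: it assumes that exactly one physical stripe is configured per clock cycle, in round-robin order, and that an application is a fixed sequence of "virtual" stripes. The function name `stripe_schedule` and its parameters are hypothetical.

```python
def stripe_schedule(num_virtual, num_physical, cycles):
    """Toy model of time-multiplexing virtual stripes onto physical stripes.

    Assumption: one stripe is configured per clock cycle, round-robin.
    Returns a list of (cycle, virtual_stripe, physical_stripe) tuples
    naming the stripe that is configured in each cycle.
    """
    schedule = []
    for cycle in range(cycles):
        virtual = cycle % num_virtual    # next stage of the application
        physical = cycle % num_physical  # hardware stripe that is reused
        schedule.append((cycle, virtual, physical))
    return schedule


# The same 5-stage application mapped onto two different hardware sizes.
small_hw = stripe_schedule(num_virtual=5, num_physical=3, cycles=6)
large_hw = stripe_schedule(num_virtual=5, num_physical=5, cycles=6)

for cycle, virtual, physical in small_hw:
    print(f"cycle {cycle}: configure virtual stripe {virtual} "
          f"into physical stripe {physical}")
```

In this model the application description (5 virtual stripes) does not change between the 3-stripe and 5-stripe hardware; only the mapping does. That is the sense in which the hardware is abstracted from the software.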
Figure 4 PipeRench Overview

[b] describes related work that assists with software analysis to determine which code should be implemented in hardware. A tool called LOOAN detects critical loops in the software. The loop code is then recoded in a special C language called SA-C (single assignment C). The toolchain flow is shown in Figure 6. The target architecture for this work is a processor and an FPGA connected on a memory bus. One interesting aspect of the work in [b] is the result achieved in the area of energy improvement. The combined FPGA/processor system achieved an average speedup of 1.6 together with an average energy savings of 25%. This is very promising, since it not only allows a lower power solution, but also allows a
slower processor to be designed into the system, which saves design cost. However, the cost of the FPGA must be factored into the full analysis.

Figure 5 PipeRench Dynamic Reconfiguration

Figure 6 Design Flow for Hardware/software Partitioning [b]

4 Applications

[a] describes various PAM applications. One is RSA encryption and decryption, which ran faster than any previous implementation by an order of magnitude. Another is genetic applications such as DNA matching: a company called Compugen sells a PAM that speeds up biological searches and looks like a co-processor to the host. Applications like the heat and Laplace equations are perfect for a PAM. For example, a PAM at 20 MHz can achieve 5 G operations (add and shift) each second. [a] compares this result with a supercomputer, which would have to operate at 20 B instructions per second to match it. This illustrates the power of parallel operations.

Further PAM examples are similar. A Boltzmann algorithm to minimize quadratic equations (used in circuit placement) is, again, a formulation that allows a high degree of parallelism. Similarly, the video compression application relies on the fact that the operation is highly suited to pipelining, operating on small squares of video frames [a]. In high-energy physics, images from particle collisions must be evaluated; the algorithms, like those for video compression, operate on small images and so lend themselves to pipelining. In physics, the images are 20x20x32 b and must be processed every 10 microseconds.

[a] describes one interesting application that lays out the speed difference between the various implementations used in correlating pairs of images for stereo vision. Software performs the operation in 59 seconds on a SPARC-Station II, a hardware design using 4 DSPs takes 9.6 seconds, and the P1 takes 0.28 seconds. It would be interesting to compare the power used in each of these three implementations.
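To put the figures quoted above side by side, the short calculation below derives the implied speedups for the stereo vision example and the number of operations per clock cycle implied by 5 G operations per second at 20 MHz. The input numbers are the ones reported in the text from [a]; the variable names are only illustrative.

```python
# Stereo vision correlation times, in seconds, as quoted from [a].
times_s = {
    "SPARC-Station II software": 59.0,
    "4-DSP hardware": 9.6,
    "P1 PAM": 0.28,
}

baseline = times_s["SPARC-Station II software"]
speedups = {name: baseline / t for name, t in times_s.items()}

# Heat/Laplace example: 5 G operations per second at a 20 MHz clock.
ops_per_second = 5e9
clock_hz = 20e6
ops_per_cycle = ops_per_second / clock_hz

for name, s in speedups.items():
    print(f"{name}: {s:.1f}x relative to software")
print(f"P1 vs 4-DSP design: {times_s['4-DSP hardware'] / times_s['P1 PAM']:.1f}x")
print(f"Operations per clock cycle: {ops_per_cycle:.0f}")
```

On these numbers the P1 is roughly 211 times faster than the software version and roughly 34 times faster than the DSP design, and the 20 MHz PAM must complete about 250 operations in every clock cycle. The low clock rate is offset entirely by parallelism.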
[a] continues with more examples: sound synthesis and, finally, a Viterbi encoder/decoder with large constraint length codes. So, what is the common thread in all the examples? The algorithms
operate on large quantities of data in a repetitive fashion. Larger speedups are possible when the operation is computationally expensive for a general-purpose processor, such as multiplication or division.

5 Issues

One critical issue is the lack of a transparent tool chain to support development for a platform with a reconfigurable hardware subsystem. Both [c] and [d] make some progress in this area. The approach used in [c] is to place a very simple tool chain in the hardware itself. As hardware density continues to increase, this may become a more viable solution. [d] chooses to limit the runtime configuration to the datapath connections, and so is able to achieve single-cycle reconfiguration. This simplification also allows the customization of the hardware to be done in the application code. [d] makes progress in the most critical area: the application is built on an application programming interface that abstracts the hardware implementation from the software code.

[a] states that a PAM could be time-shared among 12 applications, but does not discuss how this would be implemented. This is the main problem with trying to combine a general-purpose machine running many applications with some reconfigurable hardware. How will it be possible to share that hardware when jobs are being swapped in and out? The time to reconfigure the hardware will be a large issue, as will the ability to swap out the internal state of the reconfigurable logic.

Finally, [f] covers some of the hardware issues that are limiting this work. PLDs use expensive SRAM, which raises the price of the parts. In addition, these devices use more power than ASICs and run at slower speeds.

6 Conclusion

It would be interesting to combine the work of [c] with [b]. The architecture targeted with [b] is the same architecture used in [c].
The biggest issue would be to take the more extensive loop analyzer and logic synthesis capabilities and place them into hardware. Both designs have the processor in a low-power state while the hardware assist is operating.

The field of reconfigurable logic is very exciting. This will be a critical area for continued research and development, as vendors seek to increase system performance while keeping clock speed and power constrained. Reconfigurable logic may be the answer.

7 References

[a] J. Vuillemin, P. Bertin, et al., Programmable Active Memories: Reconfigurable Systems Come of Age, IEEE Transactions on VLSI Systems, March 1996
[b] J. Villarreal, D. Suresh, G. Stitt, F. Vahid, W. Najjar, Improving Software Performance with Configurable Logic, Design Automation for Embedded Systems, 2002
[c] G. Stitt, R. Lysecky, F. Vahid, Dynamic Hardware/Software Partitioning: A First Approach, DAC, June 2003
[d] H. Schmit, D. Whelihan, A. Tsai, M. Moe, B. Levin, R. Taylor, PipeRench: A Virtualized Programmable Datapath in 0.18 Micron Technology, CICC, 2002
[e] G. De Micheli, R. Ernst, W. Wolf, Readings in Hardware/Software Co-Design, Morgan Kaufmann Publishers, 2002
[f] N. Tredennick, B. Shimamoto, Go Reconfigure: Programmable logic devices will give us a handheld that does everything-well, IEEE Spectrum, 10/01/2003