
Designing a partially reconfigured system

J. D. Hadley and B. L. Hutchings
Dept. of Electrical and Computer Eng., Brigham Young University, Provo, UT

ABSTRACT

Run-time reconfiguration (RTR) is an implementation approach that divides an application into a series of sequentially executed stages, with each stage implemented as a separate circuit module. System operation then consists of sequencing through these modules at run-time, one configuration at a time. Partial RTR extends this approach by partitioning these stages and designing their circuitry such that they exhibit a high degree of functional and physical commonality. By leaving common circuitry resident, transitioning between configurations can then be accomplished by updating only the differences between configurations. This significantly enhances overall performance by reducing the amount of time the RTR application spends configuring. This paper presents the methodology used to design the partial RTR system RRANN2, a partial RTR artificial neural network.

1 INTRODUCTION

Since its introduction, the Field Programmable Gate Array (FPGA) has received increasing attention due to its proficiency as a reconfigurable logic device. Its merits include not only the ability to implement arbitrary logic functions, but also the fact that it can be reprogrammed an unlimited number of times during its lifetime. These characteristics have led to the incorporation of FPGAs into several rapid prototyping and flexible computing systems [1-4]. Most applications running on these FPGA-based systems are implemented using a single configuration per FPGA [5-7]. These applications configure the FPGAs before the beginning of their execution, and those configurations remain active until the application is completed. Thus the functionality of the circuit does not change while the application is running. Such an application can be referred to as Compile-Time Reconfigurable (CTR) because the entire configuration is determined at compile-time and does not change throughout system operation. Another implementation strategy is to implement an application with multiple configurations per FPGA [8-10]. In this scenario the application is divided into time-exclusive operations that need not (or cannot) operate concurrently. Each operation is implemented as a distinct configuration which can be downloaded into the FPGA as necessary at run-time during application operation. This approach is referred to as Run-Time Reconfiguration (RTR). Thus, whereas CTR applications configure the FPGAs once during system operation, RTR applications typically reconfigure them many times during the normal operation of a single application. This paper outlines a design methodology for implementing RTR systems that partially reconfigure FPGA devices.

By partially reconfiguring FPGA resources, reconfiguration overhead can be reduced and overall performance significantly enhanced. This design methodology was developed during the design and implementation of RRANN2, an artificial neural network implemented on FPGAs with partial RTR. This paper proceeds by providing background on the RRANN2 project and discussing the design methodology in detail. Finally, it draws some conclusions about the overall design process and the CAD tools necessary to support it. RRANN2 is being implemented on the National Semiconductor CLAy FPGA, a fine-grained SRAM-based FPGA that supports partial configuration.

2 BACKGROUND

The RRANN2 project is a follow-on to an earlier research effort: RRANN (run-time reconfigurable artificial neural network) [11]. RRANN was a proof-of-concept prototype system constructed to demonstrate that the functional density of FPGAs could be enhanced through run-time reconfiguration. It implemented the popular backpropagation training algorithm as three time-exclusive FPGA configurations: feed-forward, back-propagation, and update. System operation consisted of sequencing through these three configurations at run-time, one configuration at a time. Each FPGA configuration followed the same general architecture, consisting of a global controller and many nearly identical neural processors. As one circuit module finished (indicating the completion of the corresponding stage), all FPGA hardware was reconfigured with the next stage's circuit module. RRANN demonstrated that RTR can increase the functional density of a neural network by 500% when compared to FPGA-based implementations that do not use RTR [11]. This density enhancement was obtained by eliminating idle circuitry from each stage and then implementing five additional neurons with the reclaimed FPGA resources. Additionally, once the neural network had completed the training process, the update and back-propagation configurations no longer needed to be loaded and the FPGAs could remain in the feed-forward configuration. This eliminated the need to reconfigure and further increased performance while maintaining the original density enhancements.

3 RRANN2 GOALS

The basic goal of the RRANN2 project is to show how partial reconfiguration can enhance the performance of RTR systems. RRANN successfully demonstrated that RTR could enhance the functional density of FPGAs; however, the break-even point (the point at which RTR implementations began to outperform non-RTR implementations) was relatively high. For RRANN, the break-even point was defined as the point where the number of weight-updates per second (wps) for the RTR version of the neural network met or exceeded the wps for a non-RTR version of the same neural network. This break-even point occurred when the total neuron count per layer exceeded 138 (23 FPGAs). Below this break-even point, RRANN's overall computational performance lagged behind its non-RTR counterparts. This was a direct result of the overhead incurred by reconfiguring between each phase of the algorithm. Thus one of the main goals of the RRANN2 project is to enhance the performance of RTR systems such that they reach this break-even point sooner, i.e., with smaller systems. For RRANN2, the specific goal is to reach the break-even point with fewer neurons per layer. Other related goals are to develop a design methodology for partial reconfiguration and to demonstrate the overall benefits and drawbacks of partial reconfiguration.
The most effective way to enhance performance in RTR applications is to reduce the amount of time spent performing configuration. This is because RTR applications often spend more time configuring than computing. For example, at the break-even point RRANN spent 80% of its time reconfiguring and 20% of its time computing results. A 10% reduction in configuration time would have resulted in a net reduction of 8% in overall execution time, thereby lowering the break-even point to approximately 124 neurons per layer. Reductions in computation time do not have nearly as much impact; achieving the same effect by reducing computation time would have required a 40% reduction in computation time.
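To make the arithmetic above concrete, the following sketch (a hypothetical illustration written for this discussion, not code from the RRANN project) computes the overall execution-time reduction obtained by shrinking either the configuration or the computation portion of an RTR application.

```python
def overall_reduction(config_fraction, config_cut=0.0, compute_cut=0.0):
    """Fractional reduction in total execution time when the configuration
    portion is cut by config_cut and the computation portion by compute_cut
    (all arguments are fractions between 0 and 1)."""
    compute_fraction = 1.0 - config_fraction
    new_total = (config_fraction * (1.0 - config_cut)
                 + compute_fraction * (1.0 - compute_cut))
    return 1.0 - new_total

# At RRANN's break-even point: 80% of the time configuring, 20% computing.
print(overall_reduction(0.80, config_cut=0.10))   # 0.08 -> an 8% overall reduction
print(overall_reduction(0.80, compute_cut=0.40))  # 0.08 -> the same 8% needs a 40% compute cut
```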

Configuration time is reduced in RRANN2 by carefully organizing the functional and physical layout so that large sections of the circuitry can remain resident throughout application operation. Transitions between configurations can then be accomplished by partially reconfiguring small sections of the chip. Reconfiguring only a small portion of the chip reduces the size of the reconfiguration bit-stream that must be downloaded, thereby significantly reducing reconfiguration time.

4 DESIGN METHODOLOGY

The overall goal of the design methodology is to maximize static circuitry and to minimize dynamic circuitry. Static circuitry is circuitry that remains resident when transitioning from one configuration to the next; dynamic circuitry is circuitry that changes during reconfiguration. The design methodology maximizes static circuitry by carefully partitioning the application into functional blocks that are, for the most part, common to all of the configurations used to implement the application. These blocks represent those parts of the configurations that do not change and can therefore be implemented with static circuitry, removing their circuit descriptions entirely from the reconfiguration process (except for initialization). The designer resorts to dynamic circuitry only when functional commonality cannot be found between configurations.

4.1 Static circuitry

When examining the configurations of RRANN2, two types of fully static logic blocks were identified. The first consisted entirely of combinational logic. These blocks represented logic functions such as adders, multipliers, comparators, and control functions that were used in several of the configurations. Storage devices were the other type of static logic block. By preserving these blocks between configurations, not only does the configuration of the block remain, but the current value of the storage device remains as well. Thus, if the storage device contains intermediate information that is needed from one configuration to the next, the preservation of this type of block also preserves its value for use in the next configuration. This increases performance in three ways. First, the time needed to reproduce the block through the configuration process is saved. This is true whether or not the storage device contains an intermediate result. Second, the routing and control logic needed to store and retrieve the value from some external storage device is eliminated. Depending on how the storage and retrieval logic is implemented, this can reduce the size of the design by freeing up valuable resources. And third, the execution time needed to store and retrieve the value is also eliminated. The second and third benefits depend greatly on the nature of the application.
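To make the partitioning step described at the beginning of this section concrete, the sketch below (a hypothetical illustration with made-up block names and contents, not the RRANN2 design database) models each configuration as a set of named logic blocks and separates those that are identical in every configuration (static) from those that are not (dynamic).

```python
# Hypothetical model of the static/dynamic partitioning step described above.
# Each configuration maps block names to a description of the block's
# implementation; the names and descriptions are illustrative only.
configurations = {
    "feed_forward":    {"multiplier": "5-bit unsigned",  "accumulator": "16-bit", "controller": "FSM-A"},
    "backpropagation": {"multiplier": "8-bit signed",    "accumulator": "16-bit", "controller": "FSM-B"},
    "update":          {"multiplier": "10-bit unsigned", "accumulator": "16-bit", "controller": "FSM-C"},
}

def partition(configs):
    """Split block names into fully static blocks (identical in every
    configuration) and dynamic blocks (different somewhere)."""
    names = set.intersection(*(set(c) for c in configs.values()))
    static = {n for n in names
              if len({c[n] for c in configs.values()}) == 1}
    return static, names - static

static_blocks, dynamic_blocks = partition(configurations)
print(static_blocks)   # {'accumulator'}: stays resident, never reconfigured
print(dynamic_blocks)  # {'multiplier', 'controller'}: candidates for partial updates
```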
Besides fully static logic blocks that are identical across all configurations, there are other logic blocks that are "mostly static", i.e., blocks that change only slightly when transitioning from one configuration to the next. These blocks save configuration time because only their differences need to be transmitted during reconfiguration. In the RRANN2 configurations, there are four basic types of "mostly static" logic blocks. These types are described in terms of their intrinsic differences: precision, constant value, function, and subset.

Precision. Two blocks differ in their precision if they are functionally the same except for the number of bits they manipulate. In the RRANN2 configurations, the serial-parallel multipliers are one example of this type of difference. Each neuron in every configuration contains a serial-parallel multiplier. Across the three configurations the size of the multiplier changes depending on the size of the parallel operand: 5 bits for the activation used in feed-forward, 8 bits for the weight used in back-propagation, and 10 bits for the activation/learning-constant combination used in update.

Since the multipliers are constructed by serially linking a set of identical multiplier stages to a header stage, their size can easily be adjusted by adding or removing the appropriate number of stages at the end of the chain. Figure 1 illustrates this idea.

Figure 1: Varying the Multiplier Size.

The five solid multiplier stages represent the implementation of a 5-bit multiplier. The three dashed stages represent the circuitry needed to change the 5-bit multiplier to an 8-bit multiplier. Since the five solid blocks do not need to be changed in order to make this conversion, they can be left out of the configuration process; only the three additional blocks need to be downloaded.

Constant Value. Two blocks differ by a constant value if a constant is the only difference between them. In RRANN2, the controlling state machines contain a Johnson counter that exhibits this type of difference. This counter is used to determine the current state of the controlling state machine. As the state machine transitions from one state to the next, the Johnson counter is incremented. To implement a loop, or jump back to a previous state in the state machine, the Johnson counter must be loadable with the value of the previous state. Since the states which loop back vary from one configuration to the next, the Johnson counter must be loaded with different values in each of the configurations. Except for the different preloaded values, the operation of the counter remains the same. Thus, in order to convert between counters, only the constant needs to be updated. Figure 2 shows a block-level diagram of the Johnson counter used in feed-forward. By asserting the Load line, the counter is preloaded with the next-state value on the next clock cycle. Also indicated on the diagram is how the first three bits of the constant value need to be updated in order to change this counter to the one used in back-propagation. Update uses the same constant value as feed-forward; therefore the first three bits would have to be reverted to change the back-propagation counter to update's. No change is required between update and feed-forward.

Figure 2: Changing a Constant Value.
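The loadable Johnson counter just described can be modeled behaviorally as follows. This is a hypothetical software sketch written for illustration, not the CLAy schematic, and the widths and preload values are invented.

```python
class JohnsonCounter:
    """Behavioral model of a loadable Johnson (twisted-ring) counter.

    On each clock: if load is asserted, the preload constant becomes the new
    state; otherwise the register shifts, feeding the complement of the last
    bit back into the first. Only the preload constant differs between the
    feed-forward, back-propagation, and update configurations.
    """
    def __init__(self, width, preload):
        self.preload = list(preload)      # the per-configuration constant
        self.state = [0] * width

    def clock(self, load=False):
        if load:                          # jump back to a previous state
            self.state = list(self.preload)
        else:                             # normal increment: twisted shift
            self.state = [1 - self.state[-1]] + self.state[:-1]
        return self.state

# Hypothetical 4-bit counters; only the preload constant changes between them.
ff_counter = JohnsonCounter(4, preload=[1, 1, 0, 0])
bp_counter = JohnsonCounter(4, preload=[0, 1, 0, 0])  # leading bits differ, the rest is identical
```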

Function. Two blocks are said to differ in their function if they perform logically different functions but their construction is almost identical. Take, for example, a bit-serial adder and a bit-serial subtracter. Structurally, these two units are almost identical. The only difference in their construction is an inverter and the value of the carry register when it is reset. However, since these two units implement different functions, they are said to be functionally different. In RRANN2, besides differing in precision, the serial-parallel multipliers also differ in function. In the feed-forward and update phases, the parallel input to the multiplier is presented with an unsigned value, the activation and the activation/learning-constant combination respectively. However, in the back-propagation phase, this input, the weight, is a signed value. Thus, depending on the configuration, the parallel input to the multiplier has to accept either a signed or an unsigned value. Structurally these two multipliers differ only in a simple modification to the first unit in the multiplier chain. Thus, in order to change an unsigned multiplier to a signed multiplier, or vice versa, only the first unit in the multiplier needs to be updated. Figure 3 illustrates this idea. Pictured is the block diagram for a 5-bit unsigned multiplier. As shown, in order to convert this to a 5-bit signed multiplier, the first multiplier stage must be updated. Since the remaining four stages do not need to be changed, they can be left out of the reconfiguration process.

Figure 3: Signed to Unsigned Multiplier.

Subsets. One block is said to be a subset of another block if it is structurally and functionally contained within the bounds of its counterpart. In RRANN2 there are many examples of this, including a counter that needed to be preloaded in one configuration but not another, and a register that needs to shift its output only part of the time. In order to use this type of similar block as a common logic block in a partial RTR design, one of two options can be chosen. First, each block can be implemented in its original form. This would require the differences between the blocks to be added or removed, as needed, when the configurations change. Or, second, the super-block (the one containing all the functionality) can be implemented in both configurations. This would make the blocks identical, eliminating all need for their reconfiguration. While the second option produces the greatest reduction in configuration time, it also introduces idle circuitry into the sub-block's configuration. One of the original incentives for constructing the RTR design was to increase the utilization of the available silicon resources. If the introduction of idle circuitry consumes resources that could be used for some other purpose, then the benefits of using an RTR design are jeopardized. In RRANN2, it was found that, due to routing and configuration limitations, the super-block could often be implemented in an area that was the same as, or only slightly larger than, what would have been required to implement the sub-block. In these cases the super-block was used, since it has little or no impact on the size of the system. For blocks that resulted in a large increase in system size, the first option was used.
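One way to think about the subset trade-off above is as a simple cost comparison: implement the super-block everywhere (no reconfiguration of that block, but possibly wasted area) or implement each sub-block and reconfigure the difference. The sketch below is a hypothetical heuristic with invented cost fields, not a rule taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class SubsetBlock:
    name: str
    sub_area: int           # cells needed by the smaller (subset) block
    super_area: int         # cells needed by the super-block
    diff_config_bytes: int  # bytes to reconfigure the difference at run time

def choose_implementation(block, area_slack_cells):
    """Prefer the super-block when its area penalty fits in the available
    slack, eliminating reconfiguration of this block entirely; otherwise
    keep the subset form and pay the reconfiguration cost."""
    area_penalty = block.super_area - block.sub_area
    if area_penalty <= area_slack_cells:
        return "super-block (static, no reconfiguration)"
    return f"subset (reconfigure {block.diff_config_bytes} bytes per transition)"

# Hypothetical examples: a loadable counter and a shifting register.
print(choose_implementation(SubsetBlock("counter", 12, 13, 6), area_slack_cells=4))
print(choose_implementation(SubsetBlock("shift_reg", 20, 34, 9), area_slack_cells=4))
```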
4.2 Physical design issues

Partitioning the system into a set of static logic blocks is only the first step in developing a partial RTR design. This set represents only those parts of the configurations that could be used to reduce the system's reconfiguration time. The second step is to physically map the blocks onto the device. Before the benefits of partial reconfiguration can be realized, each common logic block's implementation and location have to be physically constrained. Unless a block has the same physical implementation and occupies precisely the same position on the device in both configurations, an overlay of the two configurations will not show commonality at the block's location. And unless these commonalities exist, they cannot be removed from the reconfiguration process.

Besides the implementation and location constraints required for partial reconfiguration, a common logic block is also constrained by the physical context of its surroundings. All the usual design issues of global interconnect, global placement, and the position and interconnection of neighboring circuitry have to be addressed. Complicating the design process further is the fact that many of these constraints are not known at design time. If the neighboring circuitry is not yet designed, the constraints (such as physical size and interconnection points) imposed by that neighbor are not known. This lack of knowledge at design time makes implementation of the design a difficult, iterative process. Even with all the common logic blocks successfully implemented and positioned within their surrounding circuitry, a reduced reconfiguration time may not result. If the static portions of the design are too small or too widely spread, the overhead needed in the reconfiguration bit-stream to address the neighboring dynamic circuitry will surpass the savings of removing the static circuitry's configuration data. Take, for example, the case where a small block of static circuitry lies within a dynamic circuit block. In order to remove this static circuitry from the reconfiguration bit-stream, a new configuration "window" must be created for the dynamic circuitry following the static block (the circuitry before the block is contained in the previous window definition). This window specification requires five bytes of header information, which include the starting and ending addresses for the next section of configuration data. If the amount of reconfiguration data needed to specify the static block totals less than five bytes, then the removal of the static circuitry actually increases the length of the bit-stream. In order to ensure that the removal of static circuitry will decrease the resulting bit-stream, further constraints have to be placed on the design to group the static circuitry as much as possible.
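The five-byte window header described above implies a simple break-even rule: excluding a run of static configuration data from the bit-stream only pays off if that run is longer than the header needed to open the next window. The sketch below (hypothetical byte counts; the CLAy header format is reduced to its size only) makes that rule explicit.

```python
WINDOW_HEADER_BYTES = 5  # starting and ending addresses for the next data section

def bytes_saved_by_excluding(static_run_bytes):
    """Net change in bit-stream length from cutting one static run out of a
    dynamic region: the static data is dropped, but a new window header must
    be added for the dynamic data that follows it. Positive means the
    bit-stream shrinks."""
    return static_run_bytes - WINDOW_HEADER_BYTES

for run in (3, 5, 8, 64):
    delta = bytes_saved_by_excluding(run)
    verdict = "worth excluding" if delta > 0 else "keep it in the window"
    print(f"static run of {run:3d} bytes: net {delta:+d} bytes -> {verdict}")
```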
Another factor contributing to the difficulty of the design process is the current lack of effective design tools. Because the partial RTR design methodology is not well understood, most of the tools that can be used to develop partial RTR designs are modifications of existing tools originally developed with different purposes in mind. To be effective for partial RTR designs, a tool must not only support the definition and enforcement of the physical implementation and location constraints required by partial RTR designs, but must also allow for a structured approach that can be used to define the inter-configuration relationships that exist. Current schematic capture tools lack the ability to define and enforce the physical constraints. Place and route tools lack a structured approach. And no tool currently available, especially among simulation tools, allows for the definition and enforcement of inter-configuration relationships. The lack of effective design tools only complicates the design process and leads to the introduction of errors.

5 THE RRANN2 ARCHITECTURE

Using the design methodology described above, the RRANN2 system successfully incorporates partial reconfiguration into the RRANN design. The system maintains the same general architecture as its predecessor, dividing the backpropagation training algorithm into three sequentially executed stages known as feed-forward, back-propagation, and update. Execution commences by preloading the FPGAs with the configurations corresponding to the feed-forward stage. This requires loading complete configurations, since prior to this time the FPGAs were unconfigured. Once the feed-forward circuitry runs to completion, the FPGAs are reconfigured to implement the back-propagation stage. This can be accomplished by reconfiguring only the differences between the feed-forward and back-propagation circuit designs; thus a partial configuration can be used that represents those changes. After the back-propagation circuitry has executed, the FPGAs are once again reconfigured, through the use of partial reconfiguration, to implement the update circuitry. Since the backpropagation algorithm requires multiple iterations of these three stages, after the update circuitry finishes, partial configuration is used to reconfigure the FPGAs to implement feed-forward and the process repeats. Figure 4 illustrates this process. The only time entire configurations are used is when the system is initialized.
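The execution sequence just described can be summarized as a small control loop on the host: one full configuration at start-up, then only partial bit-streams in a repeating feed-forward / back-propagation / update cycle. The sketch below is a hypothetical host-side outline; the download and wait calls are placeholder callbacks, not the CLAy SDK API.

```python
def run_rrann2(full_bitstream, partial_bitstreams, iterations, download, wait_for_stage):
    """Hypothetical host-side sequencing loop.

    full_bitstream: complete feed-forward configuration, used once at start-up.
    partial_bitstreams: dict keyed by (from_stage, to_stage) holding the diffs.
    download, wait_for_stage: placeholder callbacks standing in for the real
    board interface (e.g. routines built on the vendor's development kit).
    """
    cycle = ["feed_forward", "backpropagation", "update"]
    download(full_bitstream)                 # the only complete configuration load
    for _ in range(iterations):
        for i, stage in enumerate(cycle):
            wait_for_stage(stage)            # stage reports completion to the host
            nxt = cycle[(i + 1) % len(cycle)]
            # reconfigure only the differences between the two stage designs
            download(partial_bitstreams[(stage, nxt)])
```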

Figure 4: The RRANN2 Partial Reconfiguration Process.

Figure 5: The General RRANN2 Architecture.

Each phase of the backpropagation algorithm is implemented with a global controller occupying one FPGA and several neural processors (one per FPGA) occupying the balance of the available FPGAs (see Figure 5). The global controller is static and does not need to be reconfigured. It is responsible for controlling the execution of the local hardware subroutines contained on the neural processors by supplying them with key data and timing information. The neural processors, on the other hand, are dynamic and must be partially reconfigured between each algorithmic stage. Each processor contains nine hardware neurons and other local hardware subroutines, which are implemented using a state machine. They are responsible for performing all the calculations required by the backpropagation algorithm. Associated with each FPGA is a local RAM. Both the neural processors and the global controller use this RAM to store any required information and as a scratch pad to hold temporary values.

6 IMPLEMENTATION AND PERFORMANCE

The RRANN2 architecture was built and tested using a modified version of National Semiconductor's CLAy Development Board (CDB) hosted in an IBM-compatible PC. The host PC serves three purposes. First, it stores all the necessary configuration information for the FPGAs. This includes both the complete and partial bitstreams required to implement the RRANN2 system. Next, it is used to monitor the progress of each system stage during execution. As each stage finishes its execution, the host PC is informed of its completion over the PC ISA bus. Finally, after the completion of each stage, the host PC supplies the appropriate reconfiguration data to the board in order to implement the next stage's circuitry. This requires supplying only the changes between the two configurations, i.e., the partial bitstreams. All the controlling software used by the PC to communicate with the CDB board was developed using National Semiconductor's CLAy System Development Kit (CLAy SDK). The actual circuit modules used to build the RRANN2 FPGA configurations were first designed and simulated using ViewLogic's schematic capture system. These circuits were then implemented on the CLAy31 FPGAs through the use of National Semiconductor's ClayTools. This implementation required two steps. In the first step, the circuit modules were placed and routed by hand to physically map the schematics to corresponding FPGA resources.

Table 1 lists the FPGA resources used in each of the four configurations. The resources are divided into four categories: flip-flops, FPGA programmable cells, user-programmable input/output pins, and interconnection busses. For each of these categories, both the total number of the resource used and its percentage of the total available on the FPGA are given. The last column summarizes this information by giving an equivalent gate rating for each design. In each of the neural processor configurations, this value exceeds the published rating of 5000 equivalent gates for the device. This is mainly due to a careful design style and manual placement and routing, and it allowed nine hardware neurons to be implemented on each neural processor.

Table 1: FPGA Resource Utilization

                    Flip-Flops        Cells             I/O Pins          Busses            Equivalent
Configuration       Total   % Used    Total   % Used    Total   % Used    Total   % Used    Gates
Global Controller   121     4%                          74      69%                         1870
Feed-Forward                                            72      67%                         8204
Backpropagation                                         72      67%
Update                                                  72      67%                         9257

Figure 6 shows the physical layout of the neural processor used in the back-propagation stage of RRANN2. Its footprint, which remains constant across all three configurations, divides the available resources into two main areas of circuitry. The left third of the chip is used to implement the controlling state machine for the processor. It is distinguishable by its relatively sparse use of system resources. The remaining two-thirds of the chip contains nine identical back-propagation neurons and the bus and memory interfaces. The neurons and interfaces together represent the processor's data path. The actual neurons are implemented as four-cell-wide columns of circuitry with the interfaces sitting on their top and bottom edges. After the designs were physically mapped to the FPGA hardware, their physical representations had to be converted to downloadable configuration bitstreams. Partial bitstreams are generated by comparing two complete bitstreams through the use of a "windowing" tool. This tool compares two bitstreams and creates a third that contains only their differences. Table 2 shows the sizes of the RRANN2 partial bitstreams as compared to a complete bitstream. The bitstream sizes are specified in total number of bytes and in configuration windows, the number of configuration data blocks in the bitstream. Partial bitstreams have a greater number of windows since portions of the bitstream were removed, while a complete bitstream has one window covering the entire device. Also specified is the time it would take to download each bitstream at 10 MHz. As can be seen from the table, the partial bitstreams reduce reconfiguration time by an average of 53.5%.

Table 2: Partial Bitstream Sizes

Bitstream                          Size      Configuration   Configuration Time   Percent
                                   (bytes)   Windows         (at 10 MHz)          Reduction
Complete Configuration                       1                                    0%
Feed-Forward to Backpropagation                                                   48.5%
Backpropagation to Update                                                         45.8%
Update to Feed-Forward                                                            66.2%

The CDB board hosts five National Semiconductor CLAy31 FPGAs. The reconfiguration time for the CLAy31 is approximately 600 µs. This raw reconfiguration time is roughly 1/10 of that found for the devices used in RRANN. Combined with an additional reduction of 53.5% due to partial reconfiguration techniques and a rated operating frequency of 20 MHz, RRANN2 achieves over 4 times the training performance of RRANN. (The actual hardware ran at 33 MHz, giving an even greater training performance.) In addition, RRANN2 implements 50% more neurons per FPGA than RRANN, due mainly to manual placement and routing.
This enhances both training-mode and operational-mode performance, as shown in Table 3.
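The "windowing" comparison described in this section can be sketched as a simple byte-level diff: walk two complete bit-streams of equal length, emit a window (start address, end address, data) for every run of differing bytes, and skip identical runs. This is a hypothetical illustration of the idea only; the real CLAy bit-stream format and tool are not reproduced here.

```python
def diff_windows(old, new):
    """Compare two complete bitstreams (equal-length byte sequences) and
    return a list of (start, end, data) windows covering only the bytes
    that changed. Identical regions are skipped entirely."""
    assert len(old) == len(new)
    windows, i = [], 0
    while i < len(new):
        if old[i] == new[i]:
            i += 1
            continue
        start = i
        while i < len(new) and old[i] != new[i]:
            i += 1
        windows.append((start, i, new[start:i]))
    return windows

def partial_size(windows, header_bytes=5):
    """Approximate partial bit-stream size: window data plus one header each."""
    return sum(len(data) + header_bytes for _, _, data in windows)

# Hypothetical 16-byte configurations differing in two places.
old = bytes(16)
new = bytes([0, 0, 7, 7, 0, 0, 0, 0, 0, 0, 0, 9, 9, 9, 0, 0])
w = diff_windows(old, new)
print(len(w), partial_size(w))  # 2 windows, 15 bytes versus 16 for a full reload
```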

Figure 6: Back-Propagation Phase Layout.

In Table 3, performance is measured as speed-up relative to the well-known PDP back-propagation simulator [12] running on a 125 MHz Hewlett-Packard 735 workstation (135.7 SPECint92, SPECfp92). For training mode, RRANN achieved a speedup of approximately 0.08 per FPGA for a network composed of 20 FPGAs (120 neurons per layer), whereas it is anticipated that RRANN2 will achieve a speedup of approximately 0.33 per FPGA. For operational mode (strictly running feed-forward, no reconfiguration), RRANN achieved a speedup of approximately 2.1 per FPGA, while it is estimated that RRANN2 will achieve a speedup of 2.7 per FPGA.

Table 3: HP735 Performance Comparison (speedup per FPGA)

Mode           RRANN    RRANN2
Operational    2.1      2.7
Training       0.08     0.33

7 SUMMARY AND CONCLUSION

As an extension of the RTR approach, partial RTR systems exhibit many of the same advantages found in their RTR counterparts. While the extent of these advantages is still being explored, one distinct advantage that has received notable attention is an RTR system's ability to increase functional density. This was previously demonstrated with the implementation of the RRANN system. Partial reconfiguration extends the benefits of RTR systems by reducing the system's reconfiguration time and allowing for the retention of intermediate values on the programmable device. In RRANN2, careful organization of the design's functional and physical layouts allowed for the identification of large sections of static circuitry. This static circuitry was then removed from the reconfiguration process. By reducing the amount of the device that needed to be reconfigured, the reconfiguration time for the system was reduced. Furthermore, the static circuitry was also used to retain intermediate values on the programmable device. This eliminated the routing and control circuitry otherwise needed to store them in a static storage device. The reclaimed circuitry was then used to implement additional neurons. RRANN2 demonstrated these benefits with a 25% reduction in configuration time and a 50% increase in neuron density. If partial RTR designs are to become a viable alternative to static system design, continued research is needed into their design methodology. Additional research systems need to be developed that will help define how partial RTR designs should be developed and what the design trade-offs are. Furthermore, additional tools need to be developed to aid in the design process. Due to the physical constraints that need to be defined, tools that support both top-down and bottom-up design equally well are probably needed. Also needed are simulation tools that take the reconfiguration process into consideration. Finally, research into what makes a good reconfigurable device for partial RTR systems must continue.

8 ACKNOWLEDGMENTS

This work was supported by ARPA/CSTO under contract number DABT63-94-C-0085 under a subcontract to National Semiconductor. The authors would like to express appreciation to Tim Garverick, Edson Gomersall, and Harry Holt at National Semiconductor for their interest in and support of this project.

9 REFERENCES

[1] P. M. Athanas and H. F. Silverman. Processor reconfiguration through instruction-set metamorphosis. Computer, 26(3):11-18, March.
[2] P. Bertin, D. Roncin, and J. Vuillemin. Programmable active memories: a performance assessment. In G. Borriello and C. Ebeling, editors, Research on Integrated Systems: Proceedings of the 1993 Symposium, pages 88-102.
[3] J. M. Arnold, D. A. Buell, and E. G. Davis. Splash 2. In Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 316-324, June.
[4] T. A. Petersen, D. A. Thomae, and D. E. Van den Bout. The Anyboard: a rapid-prototyping system for use in teaching digital circuit design. In First International Workshop on Rapid System Prototyping, pages 25-32.
[5] F. Furtek. A field-programmable gate array for systolic computing. In G. Borriello and C. Ebeling, editors, Research on Integrated Systems: Proceedings of the 1993 Symposium, pages 183-199.
[6] D. T. Hoang. Searching genetic databases on Splash 2. In D. A. Buell and K. L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 185-191, Napa, CA, April.
[7] B. K. Fawcett. Applications of reconfigurable logic. In W. Moore and W. Luk, editors, More FPGAs: Proceedings of the 1993 International Workshop on Field-Programmable Logic and Applications, pages 57-69, Oxford, England, September.
[8] D. Ross, O. Vellacott, and M. Turner. An FPGA-based hardware accelerator for image processing. In W. Moore and W. Luk, editors, More FPGAs: Proceedings of the 1993 International Workshop on Field-Programmable Logic and Applications, pages 299-306, Oxford, England, September.
[9] P. Lysaght, J. Stockwood, J. Law, and D. Girma. Artificial neural network implementation on a fine-grained FPGA. In R. Hartenstein and M. Z. Servit, editors, Field-Programmable Logic: Architectures, Synthesis and Applications (4th International Workshop on Field-Programmable Logic and Applications), pages 421-431, Prague, Czech Republic, September. Springer-Verlag.
[10] P. C. French and R. W. Taylor. A self-reconfiguring processor. In D. A. Buell and K. L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 50-59, Napa, CA, April.
[11] J. G. Eldredge and B. L. Hutchings. Density enhancement of a neural network using FPGAs and run-time reconfiguration. In D. A. Buell and K. L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 180-188, Napa, CA, April.
[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Parallel and Distributed Processing, 1:318-362, 1986.


More information

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as An empirical investigation into the exceptionally hard problems Andrew Davenport and Edward Tsang Department of Computer Science, University of Essex, Colchester, Essex CO SQ, United Kingdom. fdaveat,edwardgessex.ac.uk

More information

Developing a Data Driven System for Computational Neuroscience

Developing a Data Driven System for Computational Neuroscience Developing a Data Driven System for Computational Neuroscience Ross Snider and Yongming Zhu Montana State University, Bozeman MT 59717, USA Abstract. A data driven system implies the need to integrate

More information

Wj = α TD(P,Wj) Wj : Current reference vector W j : New reference vector P : Input vector SENSITIVITY REGION. W j= Wj + Wj MANHATTAN DISTANCE

Wj = α TD(P,Wj) Wj : Current reference vector W j : New reference vector P : Input vector SENSITIVITY REGION. W j= Wj + Wj MANHATTAN DISTANCE ICANN96, Springer-Verlag,1996. FPGA Implementation of an Adaptable-Size Neural Network Andres Perez-Uribe and Eduardo Sanchez Logic Systems Laboratory Swiss Federal Institute of Technology CH{1015 Lausanne,

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety Data Parallel Programming with the Khoros Data Services Library Steve Kubica, Thomas Robey, Chris Moorman Khoral Research, Inc. 6200 Indian School Rd. NE Suite 200 Albuquerque, NM 87110 USA E-mail: info@khoral.com

More information

A B. A: sigmoid B: EBA (x0=0.03) C: EBA (x0=0.05) U

A B. A: sigmoid B: EBA (x0=0.03) C: EBA (x0=0.05) U Extending the Power and Capacity of Constraint Satisfaction Networks nchuan Zeng and Tony R. Martinez Computer Science Department, Brigham Young University, Provo, Utah 8460 Email: zengx@axon.cs.byu.edu,

More information

Very high operating frequencies 100MHz for RapidS 85MHz for SPI Clock-to-output time (t V ) of 5ns maximum

Very high operating frequencies 100MHz for RapidS 85MHz for SPI Clock-to-output time (t V ) of 5ns maximum AT25DL6 6-Mbit,.65V Minimum SPI Serial Flash Memory with Dual-I/O Support DATASHEET Features Single.65V.95V supply Serial Peripheral Interface (SPI) compatible Supports SPI Modes and 3 Supports RapidS

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

More information

Monolithic 3D IC Design for Deep Neural Networks

Monolithic 3D IC Design for Deep Neural Networks Monolithic 3D IC Design for Deep Neural Networks 1 with Application on Low-power Speech Recognition Kyungwook Chang 1, Deepak Kadetotad 2, Yu (Kevin) Cao 2, Jae-sun Seo 2, and Sung Kyu Lim 1 1 School of

More information

Design Guidelines for Optimal Results in High-Density FPGAs

Design Guidelines for Optimal Results in High-Density FPGAs White Paper Introduction Design Guidelines for Optimal Results in High-Density FPGAs Today s FPGA applications are approaching the complexity and performance requirements of ASICs. In some cases, FPGAs

More information

signature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1

signature i-1 signature i instruction j j+1 branch adjustment value if - path initial value signature i signature j instruction exit signature j+1 CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme

More information

PARAS: System-Level Concurrent Partitioning and Scheduling. University of Wisconsin. Madison, WI

PARAS: System-Level Concurrent Partitioning and Scheduling. University of Wisconsin. Madison, WI PARAS: System-Level Concurrent Partitioning and Scheduling Wing Hang Wong and Rajiv Jain Department of Electrical and Computer Engineering University of Wisconsin Madison, WI 53706 http://polya.ece.wisc.edu/~rajiv/home.html

More information

(RC) utilize CAD tools to perform the technology mapping of a extensive amount of time is spent for compilation by the CAD

(RC) utilize CAD tools to perform the technology mapping of a extensive amount of time is spent for compilation by the CAD Domain Specic Mapping for Solving Graph Problems on Recongurable Devices? Andreas Dandalis, Alessandro Mei??, and Viktor K. Prasanna University of Southern California fdandalis, prasanna, ameig@halcyon.usc.edu

More information

A framework for automatic generation of audio processing applications on a dual-core system

A framework for automatic generation of audio processing applications on a dual-core system A framework for automatic generation of audio processing applications on a dual-core system Etienne Cornu, Tina Soltani and Julie Johnson etienne_cornu@amis.com, tina_soltani@amis.com, julie_johnson@amis.com

More information

Design Tools for 100,000 Gate Programmable Logic Devices

Design Tools for 100,000 Gate Programmable Logic Devices esign Tools for 100,000 Gate Programmable Logic evices March 1996, ver. 1 Product Information Bulletin 22 Introduction The capacity of programmable logic devices (PLs) has risen dramatically to meet the

More information

Reconfigurable co-processor for Kanerva s sparse distributed memory

Reconfigurable co-processor for Kanerva s sparse distributed memory Microprocessors and Microsystems 28 (2004) 127 134 www.elsevier.com/locate/micpro Reconfigurable co-processor for Kanerva s sparse distributed memory Marcus Tadeu Pinheiro Silva a, Antônio Pádua Braga

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

DE2 Board & Quartus II Software

DE2 Board & Quartus II Software January 23, 2015 Contact and Office Hours Teaching Assistant (TA) Sergio Contreras Office Office Hours Email SEB 3259 Tuesday & Thursday 12:30-2:00 PM Wednesday 1:30-3:30 PM contre47@nevada.unlv.edu Syllabus

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

CHAPTER 5. CHE BASED SoPC FOR EVOLVABLE HARDWARE

CHAPTER 5. CHE BASED SoPC FOR EVOLVABLE HARDWARE 90 CHAPTER 5 CHE BASED SoPC FOR EVOLVABLE HARDWARE A hardware architecture that implements the GA for EHW is presented in this chapter. This SoPC (System on Programmable Chip) architecture is also designed

More information

PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA. Laurent Lemarchand. Informatique. ea 2215, D pt. ubo University{ bp 809

PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA. Laurent Lemarchand. Informatique. ea 2215, D pt. ubo University{ bp 809 PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA Laurent Lemarchand Informatique ubo University{ bp 809 f-29285, Brest { France lemarch@univ-brest.fr ea 2215, D pt ABSTRACT An ecient distributed

More information

Hardware Modeling using Verilog Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Hardware Modeling using Verilog Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Hardware Modeling using Verilog Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture 01 Introduction Welcome to the course on Hardware

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

A Time-Multiplexed FPGA

A Time-Multiplexed FPGA A Time-Multiplexed FPGA Steve Trimberger, Dean Carberry, Anders Johnson, Jennifer Wong Xilinx, nc. 2 100 Logic Drive San Jose, CA 95124 408-559-7778 steve.trimberger @ xilinx.com Abstract This paper describes

More information

Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration

Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration 24.2 Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration Edson L. Horta, Universidade de San Pãulo Escola Politécnica - LSI San Pãulo, SP, Brazil edson-horta@ieee.org John W. Lockwood,

More information

32-Mbit 2.7V Minimum Serial Peripheral Interface Serial Flash Memory

32-Mbit 2.7V Minimum Serial Peripheral Interface Serial Flash Memory Features Single 2.7V - 3.6V Supply Serial Peripheral Interface (SPI) Compatible Supports SPI Modes and 3 Supports RapidS Operation Supports Dual-Input Program and Dual-Output Read Very High Operating Frequencies

More information

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier CHAPTER 3 METHODOLOGY 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier The design analysis starts with the analysis of the elementary algorithm for multiplication by

More information

2. BLOCK DIAGRAM Figure 1 shows the block diagram of an Asynchronous FIFO and the signals associated with it.

2. BLOCK DIAGRAM Figure 1 shows the block diagram of an Asynchronous FIFO and the signals associated with it. Volume 115 No. 8 2017, 631-636 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu DESIGNING ASYNCHRONOUS FIFO FOR LOW POWER DFT IMPLEMENTATION 1 Avinash

More information

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control.

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control. Neuro-Remodeling via Backpropagation of Utility K. Wendy Tang and Girish Pingle 1 Department of Electrical Engineering SUNY at Stony Brook, Stony Brook, NY 11794-2350. ABSTRACT Backpropagation of utility

More information

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream Agent Roles in Snapshot Assembly Delbert Hart Dept. of Computer Science Washington University in St. Louis St. Louis, MO 63130 hart@cs.wustl.edu Eileen Kraemer Dept. of Computer Science University of Georgia

More information

Design Methodologies and Tools. Full-Custom Design

Design Methodologies and Tools. Full-Custom Design Design Methodologies and Tools Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores)

More information

Convolutional Neural Networks for Object Classication in CUDA

Convolutional Neural Networks for Object Classication in CUDA Convolutional Neural Networks for Object Classication in CUDA Alex Krizhevsky (kriz@cs.toronto.edu) April 16, 2009 1 Introduction Here I will present my implementation of a simple convolutional neural

More information

Frank Mueller. Dept. of Computer Science. Florida State University. Tallahassee, FL phone: (904)

Frank Mueller. Dept. of Computer Science. Florida State University. Tallahassee, FL phone: (904) Static Cache Simulation and its Applications by Frank Mueller Dept. of Computer Science Florida State University Tallahassee, FL 32306-4019 e-mail: mueller@cs.fsu.edu phone: (904) 644-3441 July 12, 1994

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture Robert S. French April 5, 1989 Abstract Computational origami is a parallel-processing concept in which

More information