
Designing a partially reconfigured system

J. D. Hadley and B. L. Hutchings
Dept. of Electrical and Computer Eng., Brigham Young University, Provo, UT

ABSTRACT

Run-time reconfiguration (RTR) is an implementation approach that divides an application into a series of sequentially executed stages, with each stage implemented as a separate circuit module. System operation then consists of sequencing through these modules at run-time, one configuration at a time. Partial RTR extends this approach by partitioning these stages and designing their circuitry such that they exhibit a high degree of functional and physical commonality. By leaving common circuitry resident, transitioning between configurations can then be accomplished by updating only the differences between configurations. This significantly enhances overall performance by reducing the amount of time the RTR application spends configuring. This paper presents the methodology used to design the partial RTR system RRANN2, a partial RTR artificial neural network.

1 INTRODUCTION

Since its introduction, the Field Programmable Gate Array (FPGA) has received increasing attention due to its proficiency as a reconfigurable logic device. Its merits include not only the ability to implement arbitrary logic functions, but also the fact that it can be reprogrammed an unlimited number of times during its lifetime. These characteristics have led to the incorporation of FPGAs into several rapid prototyping and flexible computing systems [1-4]. Most applications running on these FPGA-based systems are implemented using a single configuration per FPGA [5-7]. These applications configure the FPGAs before the beginning of their execution, and those configurations remain active until the application is completed. Thus the functionality of the circuit does not change while the application is running. Such an application can be referred to as Compile-Time Reconfigurable (CTR) because the entire configuration is determined at compile-time and does not change throughout system operation. Another implementation strategy is to implement an application with multiple configurations per FPGA [8-10]. In this scenario the application is divided into time-exclusive operations that need not (or cannot) operate concurrently. Each operation is implemented as a distinct configuration which can be downloaded into the FPGA as necessary at run-time during application operation. This approach is referred to as Run-Time Reconfiguration (RTR). Thus, whereas CTR applications configure the FPGAs once during system operation, RTR applications typically reconfigure them many times during the normal operation of a single application. This paper outlines a design methodology for implementing RTR systems that partially reconfigure FPGA devices.

By partially reconfiguring FPGA resources, reconfiguration overhead can be reduced and overall performance significantly enhanced. This design methodology was developed during the design and implementation of RRANN2, an artificial neural network implemented on FPGAs with partial RTR. This paper proceeds by providing background on the RRANN2 project and discussing the design methodology in detail. Finally, it draws some conclusions about the overall design process and the CAD tools necessary to support it. RRANN2 is being implemented on the National Semiconductor CLAy FPGA, a fine-grained SRAM-based FPGA that supports partial configuration.

2 BACKGROUND

The RRANN2 project is a follow-on to an earlier research effort: RRANN (run-time reconfigurable artificial neural network) [11]. RRANN was a proof-of-concept prototype system constructed to demonstrate that the functional density of FPGAs could be enhanced through run-time reconfiguration. It implemented the popular backpropagation training algorithm as three time-exclusive FPGA configurations: feed-forward, back-propagation, and update. System operation consisted of sequencing through these three configurations at run-time, one configuration at a time. Each FPGA configuration followed the same general architecture, consisting of a global controller and many nearly identical neural processors. As one circuit module finished (indicating the completion of the corresponding stage), all FPGA hardware was reconfigured with the next stage's circuit module. RRANN demonstrated that RTR can increase the functional density of a neural network by 500% when compared to FPGA-based implementations that do not use RTR [11]. This density enhancement was obtained by eliminating idle circuitry from each stage and then implementing five additional neurons with the reclaimed FPGA resources. Additionally, once the neural network had completed the training process, the update and back-propagation configurations no longer needed to be loaded and the FPGAs could remain in the feed-forward configuration. This eliminated the need to reconfigure and further increased performance while maintaining the original density enhancements.

3 RRANN2 GOALS

The basic goal of the RRANN2 project is to show how partial reconfiguration can enhance the performance of RTR systems. RRANN successfully demonstrated that RTR could enhance the functional density of FPGAs; however, the break-even point (the point at which RTR implementations began to outperform non-RTR implementations) was relatively high. For RRANN, the break-even point was defined as the point where the number of weight-updates per second (wps) for the RTR version of the neural network met or exceeded the wps for a non-RTR version of the same neural network. This break-even point occurred when the total neuron count per layer exceeded 138 (23 FPGAs). Below this break-even point, RRANN's overall computational performance lagged behind its non-RTR counterparts. This was a direct result of the overhead incurred by reconfiguring between each phase of the algorithm. Thus one of the main goals of the RRANN2 project is to enhance the performance of RTR systems such that they reach this break-even point sooner, i.e., with smaller systems. For RRANN2, the specific goal is to reach the break-even point with fewer neurons per layer. Other related goals are to develop a design methodology for partial reconfiguration and to demonstrate the overall benefits and drawbacks of partial reconfiguration.
The most effective way to enhance performance in RTR applications is to reduce the amount of time spent performing configuration. This is because RTR applications often spend more time configuring than computing. For example, at the break-even point RRANN spent 80% of its time reconfiguring and 20% of its time computing results. A 10% reduction in configuration time would have resulted in a net reduction of 8% in overall execution time, thereby lowering the break-even point to approximately 124 neurons per layer. Reductions in computation time do not have nearly as much impact; achieving the same effect by reducing computation time would have required a 40% reduction in computation time.
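To make the arithmetic above concrete, the following sketch (a hypothetical illustration written for this discussion, not code from the RRANN project) computes the overall execution-time reduction obtained by shrinking either the configuration or the computation portion of an RTR application.

```python
def overall_reduction(config_fraction, config_cut=0.0, compute_cut=0.0):
    """Fractional reduction in total execution time when the configuration
    portion is cut by config_cut and the computation portion by compute_cut
    (all arguments are fractions between 0 and 1)."""
    compute_fraction = 1.0 - config_fraction
    new_total = (config_fraction * (1.0 - config_cut)
                 + compute_fraction * (1.0 - compute_cut))
    return 1.0 - new_total

# At RRANN's break-even point: 80% of the time configuring, 20% computing.
print(overall_reduction(0.80, config_cut=0.10))   # 0.08 -> an 8% overall reduction
print(overall_reduction(0.80, compute_cut=0.40))  # 0.08 -> the same 8% needs a 40% compute cut
```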

Configuration time is reduced in RRANN2 by carefully organizing the functional and physical layout so that large sections of the circuitry can remain resident throughout application operation. Transitions between configurations can then be accomplished by partially reconfiguring small sections of the chip. Reconfiguring only a small portion of the chip reduces the size of the reconfiguration bit-stream that must be downloaded, thereby significantly reducing reconfiguration time.

4 DESIGN METHODOLOGY

The overall goal of the design methodology is to maximize static circuitry and to minimize dynamic circuitry. Static circuitry is circuitry that remains resident when transitioning from one configuration to the next; dynamic circuitry is circuitry that changes during reconfiguration. The design methodology maximizes static circuitry by carefully partitioning the application into functional blocks that are, for the most part, common to all of the configurations used to implement the application. These blocks represent those parts of the configurations that do not change and can therefore be implemented with static circuitry, removing their circuit descriptions entirely from the reconfiguration process (except for initialization). The designer resorts to dynamic circuitry only when functional commonality cannot be found between configurations.

4.1 Static circuitry

When examining the configurations of RRANN2, two types of fully static logic blocks were identified. The first consisted entirely of combinational logic. These blocks represented logic functions such as adders, multipliers, comparators, and control functions that were used in several of the configurations. Storage devices were the other type of static logic block. By preserving these blocks between configurations, not only does the configuration of the block remain, but the current value of the storage device remains as well. Thus, if the storage device contains intermediate information that is needed from one configuration to the next, the preservation of this type of block also preserves its value for use in the next configuration. This increases performance in three ways. First, the time needed to reproduce the block through the configuration process is saved. This is true whether or not the storage device contains an intermediate result. Second, the routing and control logic needed to store and retrieve the value from some external storage device is eliminated. Depending on how the storage and retrieval logic is implemented, this can reduce the size of the design by freeing up valuable resources. And third, the execution time needed to store and retrieve the value is also eliminated. The second and third benefits depend greatly on the nature of the application.
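To make the partitioning step described at the beginning of this section concrete, the sketch below (a hypothetical illustration with made-up block names and contents, not the RRANN2 design database) models each configuration as a set of named logic blocks and separates those that are identical in every configuration (static) from those that are not (dynamic).

```python
# Hypothetical model of the static/dynamic partitioning step described above.
# Each configuration maps block names to a description of the block's
# implementation; the names and descriptions are illustrative only.
configurations = {
    "feed_forward":    {"multiplier": "5-bit unsigned",  "accumulator": "16-bit", "controller": "FSM-A"},
    "backpropagation": {"multiplier": "8-bit signed",    "accumulator": "16-bit", "controller": "FSM-B"},
    "update":          {"multiplier": "10-bit unsigned", "accumulator": "16-bit", "controller": "FSM-C"},
}

def partition(configs):
    """Split block names into fully static blocks (identical in every
    configuration) and dynamic blocks (different somewhere)."""
    names = set.intersection(*(set(c) for c in configs.values()))
    static = {n for n in names
              if len({c[n] for c in configs.values()}) == 1}
    return static, names - static

static_blocks, dynamic_blocks = partition(configurations)
print(static_blocks)   # {'accumulator'}: stays resident, never reconfigured
print(dynamic_blocks)  # {'multiplier', 'controller'}: candidates for partial updates
```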
Besides fully static logic blocks that are identical across all configurations, there are other logic blocks that are "mostly static", i.e., blocks that change only slightly when transitioning from one configuration to the next. These blocks save configuration time because only their differences need to be transmitted during reconfiguration. In the RRANN2 configurations, there are four basic types of "mostly static" logic blocks. These types are described in terms of their intrinsic differences: precision, constant value, function, and subset.

Precision. Two blocks differ in their precision if they are functionally the same except for the number of bits they manipulate. In the RRANN2 configurations, the serial-parallel multipliers are one example of this type of difference. Each neuron in every configuration contains a serial-parallel multiplier. Across the three configurations the size of the multiplier changes depending on the size of the parallel operand: 5 bits for the activation used in feed-forward, 8 bits for the weight used in back-propagation, and 10 bits for the activation/learning-constant combination used in update.

Since the multipliers are constructed by serially linking a set of identical multiplier stages to a header stage, their size can easily be adjusted by adding or removing the appropriate number of stages at the end of the chain. Figure 1 illustrates this idea.

Figure 1: Varying the Multiplier Size.

The five solid multiplier stages represent the implementation of a 5-bit multiplier. The three dashed stages represent the circuitry needed to change the 5-bit multiplier to an 8-bit multiplier. Since the five solid blocks do not need to be changed in order to make this conversion, they can be left out of the configuration process; only the three additional blocks need to be downloaded.

Constant Value. Two blocks differ by a constant value if a constant is the only difference between them. In RRANN2, the controlling state machines contain a Johnson counter that exhibits this type of difference. This counter is used to determine the current state of the controlling state machine. As the state machine transitions from one state to the next, the Johnson counter is incremented. To implement a loop, or jump back to a previous state in the state machine, the Johnson counter must be loadable with the value of the previous state. Since the states which loop back vary from one configuration to the next, the Johnson counter must be loaded with different values in each of the configurations. Except for the different preloaded values, the operation of the counter remains the same. Thus, in order to convert between counters, only the constant needs to be updated. Figure 2 shows a block-level diagram of the Johnson counter used in feed-forward. By asserting the Load line, the counter is preloaded with the next-state value on the next clock cycle. Also indicated on the diagram is how the first three bits of the constant value need to be updated in order to change this counter to the one used in back-propagation. Update uses the same constant value as feed-forward; therefore the first three bits would have to be reverted to change the back-propagation counter to update's. No change is required between update and feed-forward.

Figure 2: Changing a Constant Value.
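The loadable Johnson counter just described can be modeled behaviorally as follows. This is a hypothetical software sketch written for illustration, not the CLAy schematic, and the widths and preload values are invented.

```python
class JohnsonCounter:
    """Behavioral model of a loadable Johnson (twisted-ring) counter.

    On each clock: if load is asserted, the preload constant becomes the new
    state; otherwise the register shifts, feeding the complement of the last
    bit back into the first. Only the preload constant differs between the
    feed-forward, back-propagation, and update configurations.
    """
    def __init__(self, width, preload):
        self.preload = list(preload)      # the per-configuration constant
        self.state = [0] * width

    def clock(self, load=False):
        if load:                          # jump back to a previous state
            self.state = list(self.preload)
        else:                             # normal increment: twisted shift
            self.state = [1 - self.state[-1]] + self.state[:-1]
        return self.state

# Hypothetical 4-bit counters; only the preload constant changes between them.
ff_counter = JohnsonCounter(4, preload=[1, 1, 0, 0])
bp_counter = JohnsonCounter(4, preload=[0, 1, 0, 0])  # leading bits differ, the rest is identical
```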

Function. Two blocks are said to differ in their function if they perform logically different functions but their construction is almost identical. Take, for example, a bit-serial adder and a bit-serial subtracter. Structurally, these two units are almost identical. The only difference in their construction is an inverter and the value of the carry register when it is reset. However, since these two units implement different functions, they are said to be functionally different. In RRANN2, besides differing in precision, the serial-parallel multipliers also differ in function. In the feed-forward and update phases, the parallel input to the multiplier is presented with an unsigned value, the activation and the activation/learning-constant combination respectively. However, in the back-propagation phase, this input, the weight, is a signed value. Thus, depending on the configuration, the parallel input to the multiplier has to accept either a signed or an unsigned value. Structurally these two multipliers differ only in a simple modification to the first unit in the multiplier chain. Thus, in order to change an unsigned multiplier to a signed multiplier, or vice versa, only the first unit in the multiplier needs to be updated. Figure 3 illustrates this idea. Pictured is the block diagram for a 5-bit unsigned multiplier. As shown, in order to convert this to a 5-bit signed multiplier, the first multiplier stage must be updated. Since the remaining four stages do not need to be changed, they can be left out of the reconfiguration process.

Figure 3: Signed to Unsigned Multiplier.

Subsets. One block is said to be a subset of another block if it is structurally and functionally contained within the bounds of its counterpart. In RRANN2 there are many examples of this, including a counter that needed to be preloaded in one configuration but not another, and a register that needs to shift its output only part of the time. In order to use this type of similar block as a common logic block in a partial RTR design, one of two options can be chosen. First, each block can be implemented in its original form. This would require the differences between the blocks to be added or removed, as needed, when the configurations change. Or, second, the super-block (the one containing all the functionality) can be implemented in both configurations. This would make the blocks identical, eliminating all need for their reconfiguration. While the second option produces the greatest reduction in configuration time, it also introduces idle circuitry into the sub-block's configuration. One of the original incentives for constructing the RTR design was to increase the utilization of the available silicon resources. If the introduction of idle circuitry consumes resources that could be used for some other purpose, then the benefits of using an RTR design are jeopardized. In RRANN2, it was found that, due to routing and configuration limitations, the super-block could often be implemented in an area that was the same as, or only slightly larger than, what would have been required to implement the sub-block. In these cases the super-block was used, since it has little or no impact on the size of the system. For blocks that resulted in a large increase in system size, the first option was used.
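One way to think about the subset trade-off above is as a simple cost comparison: implement the super-block everywhere (no reconfiguration of that block, but possibly wasted area) or implement each sub-block and reconfigure the difference. The sketch below is a hypothetical heuristic with invented cost fields, not a rule taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class SubsetBlock:
    name: str
    sub_area: int           # cells needed by the smaller (subset) block
    super_area: int         # cells needed by the super-block
    diff_config_bytes: int  # bytes to reconfigure the difference at run time

def choose_implementation(block, area_slack_cells):
    """Prefer the super-block when its area penalty fits in the available
    slack, eliminating reconfiguration of this block entirely; otherwise
    keep the subset form and pay the reconfiguration cost."""
    area_penalty = block.super_area - block.sub_area
    if area_penalty <= area_slack_cells:
        return "super-block (static, no reconfiguration)"
    return f"subset (reconfigure {block.diff_config_bytes} bytes per transition)"

# Hypothetical examples: a loadable counter and a shifting register.
print(choose_implementation(SubsetBlock("counter", 12, 13, 6), area_slack_cells=4))
print(choose_implementation(SubsetBlock("shift_reg", 20, 34, 9), area_slack_cells=4))
```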
4.2 Physical design issues

Partitioning the system into a set of static logic blocks is only the first step in developing a partial RTR design. This set represents only those parts of the configurations that could be used to reduce the system's reconfiguration time. The second step is to physically map the blocks onto the device. Before the benefits of partial reconfiguration can be realized, each common logic block's implementation and location have to be physically constrained. Unless a block has the same physical implementation and occupies precisely the same position on the device in both configurations, an overlay of the two configurations will not show commonality at the block's location. And unless these commonalities exist, they cannot be removed from the reconfiguration process.

Besides the implementation and location constraints required for partial reconfiguration, a common logic block is also constrained by the physical context of its surroundings. All the usual design issues of global interconnect, global placement, and the position and interconnection of neighboring circuitry have to be addressed. Complicating the design process further is the fact that many of these constraints are not known at design time. If the neighboring circuitry is not yet designed, the constraints (such as physical size and interconnection points) imposed by that neighbor are not known. This lack of knowledge at design time makes implementation of the design a difficult, iterative process. Even with all the common logic blocks successfully implemented and positioned within their surrounding circuitry, a reduced reconfiguration time may not result. If the static portions of the design are too small or too widely spread, the overhead needed in the reconfiguration bit-stream to address the neighboring dynamic circuitry will surpass the savings of removing the static circuitry's configuration data. Take, for example, the case where a small block of static circuitry lies within a dynamic circuit block. In order to remove this static circuitry from the reconfiguration bit-stream, a new configuration "window" must be created for the dynamic circuitry following the static block (the circuitry before the block is contained in the previous window definition). This window specification requires five bytes of header information, which include the starting and ending addresses for the next section of configuration data. If the amount of reconfiguration data needed to specify the static block totals less than five bytes, then the removal of the static circuitry actually increases the length of the bit-stream. In order to ensure that the removal of static circuitry will decrease the resulting bit-stream, further constraints have to be placed on the design to group the static circuitry as much as possible.
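The five-byte window header described above implies a simple break-even rule: excluding a run of static configuration data from the bit-stream only pays off if that run is longer than the header needed to open the next window. The sketch below (hypothetical byte counts; the CLAy header format is reduced to its size only) makes that rule explicit.

```python
WINDOW_HEADER_BYTES = 5  # starting and ending addresses for the next data section

def bytes_saved_by_excluding(static_run_bytes):
    """Net change in bit-stream length from cutting one static run out of a
    dynamic region: the static data is dropped, but a new window header must
    be added for the dynamic data that follows it. Positive means the
    bit-stream shrinks."""
    return static_run_bytes - WINDOW_HEADER_BYTES

for run in (3, 5, 8, 64):
    delta = bytes_saved_by_excluding(run)
    verdict = "worth excluding" if delta > 0 else "keep it in the window"
    print(f"static run of {run:3d} bytes: net {delta:+d} bytes -> {verdict}")
```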
Another factor contributing to the difficulty of the design process is the current lack of effective design tools. Because the partial RTR design methodology is not well understood, most of the tools that can be used to develop partial RTR designs are modifications of existing tools originally developed with different purposes in mind. To be effective for partial RTR designs, a tool must not only support the definition and enforcement of the physical implementation and location constraints required by partial RTR designs, but must also allow for a structured approach that can be used to define the inter-configuration relationships that exist. Current schematic capture tools lack the ability to define and enforce the physical constraints. Place and route tools lack a structured approach. And no tool currently available, especially among simulation tools, allows for the definition and enforcement of inter-configuration relationships. The lack of effective design tools only complicates the design process and leads to the introduction of errors.

5 THE RRANN2 ARCHITECTURE

Using the design methodology described above, the RRANN2 system successfully incorporates partial reconfiguration into the RRANN design. The system maintains the same general architecture as its predecessor, dividing the backpropagation training algorithm into three sequentially executed stages known as feed-forward, back-propagation, and update. Execution commences by preloading the FPGAs with the configurations corresponding to the feed-forward stage. This requires loading complete configurations, since prior to this time the FPGAs were unconfigured. Once the feed-forward circuitry runs to completion, the FPGAs are reconfigured to implement the back-propagation stage. This can be accomplished by reconfiguring only the differences between the feed-forward and back-propagation circuit designs; thus a partial configuration can be used that represents those changes. After the back-propagation circuitry has executed, the FPGAs are once again reconfigured, through the use of partial reconfiguration, to implement the update circuitry. Since the backpropagation algorithm requires multiple iterations of these three stages, after the update circuitry finishes, partial configuration is used to reconfigure the FPGAs to implement feed-forward and the process repeats. Figure 4 illustrates this process. The only time entire configurations are used is when the system is initialized.
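The execution sequence just described can be summarized as a small control loop on the host: one full configuration at start-up, then only partial bit-streams in a repeating feed-forward / back-propagation / update cycle. The sketch below is a hypothetical host-side outline; the download and wait calls are placeholder callbacks, not the CLAy SDK API.

```python
def run_rrann2(full_bitstream, partial_bitstreams, iterations, download, wait_for_stage):
    """Hypothetical host-side sequencing loop.

    full_bitstream: complete feed-forward configuration, used once at start-up.
    partial_bitstreams: dict keyed by (from_stage, to_stage) holding the diffs.
    download, wait_for_stage: placeholder callbacks standing in for the real
    board interface (e.g. routines built on the vendor's development kit).
    """
    cycle = ["feed_forward", "backpropagation", "update"]
    download(full_bitstream)                 # the only complete configuration load
    for _ in range(iterations):
        for i, stage in enumerate(cycle):
            wait_for_stage(stage)            # stage reports completion to the host
            nxt = cycle[(i + 1) % len(cycle)]
            # reconfigure only the differences between the two stage designs
            download(partial_bitstreams[(stage, nxt)])
```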

Figure 4: The RRANN2 Partial Reconfiguration Process.

Figure 5: The General RRANN2 Architecture.

Each phase of the backpropagation algorithm is implemented with a global controller occupying one FPGA and several neural processors (one per FPGA) occupying the balance of the available FPGAs (see Figure 5). The global controller is static and does not need to be reconfigured. It is responsible for controlling the execution of the local hardware subroutines contained on the neural processors by supplying them with key data and timing information. The neural processors, on the other hand, are dynamic and must be partially reconfigured between each algorithmic stage. Each processor contains nine hardware neurons and other local hardware subroutines, which are implemented using a state machine. They are responsible for performing all the calculations required by the backpropagation algorithm. Associated with each FPGA is a local RAM. Both the neural processors and the global controller use this RAM to store any required information and as a scratch pad to hold temporary values.

6 IMPLEMENTATION AND PERFORMANCE

The RRANN2 architecture was built and tested using a modified version of National Semiconductor's CLAy Development Board (CDB) hosted in an IBM-compatible PC. The host PC serves three purposes. First, it stores all the necessary configuration information for the FPGAs. This includes both the complete and partial bitstreams required to implement the RRANN2 system. Next, it is used to monitor the progress of each system stage during execution. As each stage finishes its execution, the host PC is informed of its completion over the PC ISA bus. Finally, after the completion of each stage, the host PC supplies the appropriate reconfiguration data to the board in order to implement the next stage's circuitry. This requires supplying only the changes between the two configurations, i.e., the partial bitstreams. All the controlling software used by the PC to communicate with the CDB board was developed using National Semiconductor's CLAy System Development Kit (CLAy SDK). The actual circuit modules used to build the RRANN2 FPGA configurations were first designed and simulated using ViewLogic's schematic capture system. These circuits were then implemented on the CLAy31 FPGAs through the use of National Semiconductor's ClayTools. This implementation required two steps. In the first step, the circuit modules were placed and routed by hand to physically map the schematics to corresponding FPGA resources.

Table 1 lists the FPGA resources used in each of the four configurations. The resources are divided into four categories: flip-flops, FPGA programmable cells, user-programmable input/output pins, and interconnection busses. For each of these categories, both the total number of the resource used and its percentage of the total available on the FPGA are given. The last column summarizes this information by giving an equivalent gate rating for each design. In each of the neural processor configurations, this value exceeds the published rating of 5000 equivalent gates for the device. This is mainly due to a careful design style and manual placement and routing, and it allowed nine hardware neurons to be implemented on each neural processor.

Table 1: FPGA Resource Utilization

                    Flip-Flops        Cells             I/O Pins          Busses            Equivalent
Configuration       Total   % Used    Total   % Used    Total   % Used    Total   % Used    Gates
Global Controller   121     4%                          74      69%                         1870
Feed-Forward                                            72      67%                         8204
Backpropagation                                         72      67%
Update                                                  72      67%                         9257

Figure 6 shows the physical layout of the neural processor used in the back-propagation stage of RRANN2. Its footprint, which remains constant across all three configurations, divides the available resources into two main areas of circuitry. The left third of the chip is used to implement the controlling state machine for the processor. It is distinguishable by its relatively sparse use of system resources. The remaining two-thirds of the chip contains nine identical back-propagation neurons and the bus and memory interfaces. The neurons and interfaces together represent the processor's data path. The actual neurons are implemented as four-cell-wide columns of circuitry with the interfaces sitting on their top and bottom edges. After the designs were physically mapped to the FPGA hardware, their physical representations had to be converted to downloadable configuration bitstreams. Partial bitstreams are generated by comparing two complete bitstreams through the use of a "windowing" tool. This tool compares two bitstreams and creates a third that contains only their differences. Table 2 shows the sizes of the RRANN2 partial bitstreams as compared to a complete bitstream. The bitstream sizes are specified in total number of bytes and in configuration windows, the number of configuration data blocks in the bitstream. Partial bitstreams have a greater number of windows since portions of the bitstream were removed, while a complete bitstream has one window covering the entire device. Also specified is the time it would take to download each bitstream at 10 MHz. As can be seen from the table, the partial bitstreams reduce reconfiguration time by an average of 53.5%.

Table 2: Partial Bitstream Sizes

Bitstream                          Size      Configuration   Configuration Time   Percent
                                   (bytes)   Windows         (at 10 MHz)          Reduction
Complete Configuration                       1                                    0%
Feed-Forward to Backpropagation                                                   48.5%
Backpropagation to Update                                                         45.8%
Update to Feed-Forward                                                            66.2%

The CDB board hosts five National Semiconductor CLAy31 FPGAs. The reconfiguration time for the CLAy31 is approximately 600 µs. This raw reconfiguration time is roughly 1/10 of that found for the devices used in RRANN. Combined with an additional reduction of 53.5% due to partial reconfiguration techniques and a rated operating frequency of 20 MHz, RRANN2 achieves over 4 times the training performance of RRANN. (The actual hardware ran at 33 MHz, giving an even greater training performance.) In addition, RRANN2 implements 50% more neurons per FPGA than RRANN, due mainly to manual placement and routing.
This enhances both training-mode and operational-mode performance, as shown in Table 3.
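The "windowing" comparison described in this section can be sketched as a simple byte-level diff: walk two complete bit-streams of equal length, emit a window (start address, end address, data) for every run of differing bytes, and skip identical runs. This is a hypothetical illustration of the idea only; the real CLAy bit-stream format and tool are not reproduced here.

```python
def diff_windows(old, new):
    """Compare two complete bitstreams (equal-length byte sequences) and
    return a list of (start, end, data) windows covering only the bytes
    that changed. Identical regions are skipped entirely."""
    assert len(old) == len(new)
    windows, i = [], 0
    while i < len(new):
        if old[i] == new[i]:
            i += 1
            continue
        start = i
        while i < len(new) and old[i] != new[i]:
            i += 1
        windows.append((start, i, new[start:i]))
    return windows

def partial_size(windows, header_bytes=5):
    """Approximate partial bit-stream size: window data plus one header each."""
    return sum(len(data) + header_bytes for _, _, data in windows)

# Hypothetical 16-byte configurations differing in two places.
old = bytes(16)
new = bytes([0, 0, 7, 7, 0, 0, 0, 0, 0, 0, 0, 9, 9, 9, 0, 0])
w = diff_windows(old, new)
print(len(w), partial_size(w))  # 2 windows, 15 bytes versus 16 for a full reload
```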

Figure 6: Back-Propagation Phase Layout.

In Table 3, performance is measured as speed-up relative to the well-known PDP back-propagation simulator [12] running on a 125 MHz Hewlett-Packard 735 workstation (135.7 SPECint92, SPECfp92). For training mode, RRANN achieved a speedup of approximately 0.08 per FPGA for a network composed of 20 FPGAs (120 neurons per layer), whereas it is anticipated that RRANN2 will achieve a speedup of approximately 0.33 per FPGA. For operational mode (strictly running feed-forward, no reconfiguration), RRANN achieved a speedup of approximately 2.1 per FPGA, while it is estimated that RRANN2 will achieve a speedup of 2.7 per FPGA.

Table 3: HP735 Performance Comparison (speedup per FPGA)

Mode           RRANN    RRANN2
Operational    2.1      2.7
Training       0.08     0.33

7 SUMMARY AND CONCLUSION

As an extension of the RTR approach, partial RTR systems exhibit many of the same advantages found in their RTR counterparts. While the extent of these advantages is still being explored, one distinct advantage that has received notable attention is an RTR system's ability to increase functional density. This was previously demonstrated with the implementation of the RRANN system. Partial reconfiguration extends the benefits of RTR systems by reducing the system's reconfiguration time and allowing for the retention of intermediate values on the programmable device. In RRANN2, careful organization of the design's functional and physical layouts allowed for the identification of large sections of static circuitry. This static circuitry was then removed from the reconfiguration process. By reducing the amount of the device that needed to be reconfigured, the reconfiguration time for the system was reduced. Furthermore, the static circuitry was also used to retain intermediate values on the programmable device. This eliminated the routing and control circuitry otherwise needed to store them in a static storage device. The reclaimed circuitry was then used to implement additional neurons. RRANN2 demonstrated these benefits with a 25% reduction in configuration time and a 50% increase in neuron density. If partial RTR designs are to become a viable alternative to static system design, continued research is needed into their design methodology. Additional research systems need to be developed that will help define how partial RTR designs should be developed and what the design trade-offs are. Furthermore, additional tools need to be developed to aid in the design process. Due to the physical constraints that need to be defined, tools that support both top-down and bottom-up design equally well are probably needed. Also needed are simulation tools that take the reconfiguration process into consideration. Finally, research into what makes a good reconfigurable device for partial RTR systems must continue.

8 ACKNOWLEDGMENTS

This work was supported by ARPA/CSTO under contract number DABT63-94-C-0085 under a subcontract to National Semiconductor. The authors would like to express appreciation to Tim Garverick, Edson Gomersall, and Harry Holt at National Semiconductor for their interest in and support of this project.

9 REFERENCES

[1] P. M. Athanas and H. F. Silverman. Processor reconfiguration through instruction-set metamorphosis. Computer, 26(3):11-18, March.
[2] P. Bertin, D. Roncin, and J. Vuillemin. Programmable active memories: a performance assessment. In G. Borriello and C. Ebeling, editors, Research on Integrated Systems: Proceedings of the 1993 Symposium, pages 88-102.
[3] J. M. Arnold, D. A. Buell, and E. G. Davis. Splash 2. In Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 316-324, June.
[4] T. A. Petersen, D. A. Thomae, and D. E. Van den Bout. The Anyboard: a rapid-prototyping system for use in teaching digital circuit design. In First International Workshop on Rapid System Prototyping, pages 25-32.
[5] F. Furtek. A field-programmable gate array for systolic computing. In G. Borriello and C. Ebeling, editors, Research on Integrated Systems: Proceedings of the 1993 Symposium, pages 183-199.
[6] D. T. Hoang. Searching genetic databases on Splash 2. In D. A. Buell and K. L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 185-191, Napa, CA, April.
[7] B. K. Fawcett. Applications of reconfigurable logic. In W. Moore and W. Luk, editors, More FPGAs: Proceedings of the 1993 International Workshop on Field-Programmable Logic and Applications, pages 57-69, Oxford, England, September.
[8] D. Ross, O. Vellacott, and M. Turner. An FPGA-based hardware accelerator for image processing. In W. Moore and W. Luk, editors, More FPGAs: Proceedings of the 1993 International Workshop on Field-Programmable Logic and Applications, pages 299-306, Oxford, England, September.
[9] P. Lysaght, J. Stockwood, J. Law, and D. Girma. Artificial neural network implementation on a fine-grained FPGA. In R. Hartenstein and M. Z. Servit, editors, Field-Programmable Logic: Architectures, Synthesis and Applications (4th International Workshop on Field-Programmable Logic and Applications), pages 421-431, Prague, Czech Republic, September. Springer-Verlag.
[10] P. C. French and R. W. Taylor. A self-reconfiguring processor. In D. A. Buell and K. L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 50-59, Napa, CA, April.
[11] J. G. Eldredge and B. L. Hutchings. Density enhancement of a neural network using FPGAs and run-time reconfiguration. In D. A. Buell and K. L. Pocek, editors, Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pages 180-188, Napa, CA, April.
[12] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Parallel and Distributed Processing, 1:318-362, 1986.


More information

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as

Andrew Davenport and Edward Tsang. fdaveat,edwardgessex.ac.uk. mostly soluble problems and regions of overconstrained, mostly insoluble problems as An empirical investigation into the exceptionally hard problems Andrew Davenport and Edward Tsang Department of Computer Science, University of Essex, Colchester, Essex CO SQ, United Kingdom. fdaveat,edwardgessex.ac.uk

More information

Developing a Data Driven System for Computational Neuroscience

Developing a Data Driven System for Computational Neuroscience Developing a Data Driven System for Computational Neuroscience Ross Snider and Yongming Zhu Montana State University, Bozeman MT 59717, USA Abstract. A data driven system implies the need to integrate

More information

Wj = α TD(P,Wj) Wj : Current reference vector W j : New reference vector P : Input vector SENSITIVITY REGION. W j= Wj + Wj MANHATTAN DISTANCE

Wj = α TD(P,Wj) Wj : Current reference vector W j : New reference vector P : Input vector SENSITIVITY REGION. W j= Wj + Wj MANHATTAN DISTANCE ICANN96, Springer-Verlag,1996. FPGA Implementation of an Adaptable-Size Neural Network Andres Perez-Uribe and Eduardo Sanchez Logic Systems Laboratory Swiss Federal Institute of Technology CH{1015 Lausanne,

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety Data Parallel Programming with the Khoros Data Services Library Steve Kubica, Thomas Robey, Chris Moorman Khoral Research, Inc. 6200 Indian School Rd. NE Suite 200 Albuquerque, NM 87110 USA E-mail: info@khoral.com

More information

A B. A: sigmoid B: EBA (x0=0.03) C: EBA (x0=0.05) U

A B. A: sigmoid B: EBA (x0=0.03) C: EBA (x0=0.05) U Extending the Power and Capacity of Constraint Satisfaction Networks nchuan Zeng and Tony R. Martinez Computer Science Department, Brigham Young University, Provo, Utah 8460 Email: zengx@axon.cs.byu.edu,

More information

Very high operating frequencies 100MHz for RapidS 85MHz for SPI Clock-to-output time (t V ) of 5ns maximum

Very high operating frequencies 100MHz for RapidS 85MHz for SPI Clock-to-output time (t V ) of 5ns maximum AT25DL6 6-Mbit,.65V Minimum SPI Serial Flash Memory with Dual-I/O Support DATASHEET Features Single.65V.95V supply Serial Peripheral Interface (SPI) compatible Supports SPI Modes and 3 Supports RapidS

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano

THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL. Jun Sun, Yasushi Shinjo and Kozo Itano THE IMPLEMENTATION OF A DISTRIBUTED FILE SYSTEM SUPPORTING THE PARALLEL WORLD MODEL Jun Sun, Yasushi Shinjo and Kozo Itano Institute of Information Sciences and Electronics University of Tsukuba Tsukuba,

More information

Monolithic 3D IC Design for Deep Neural Networks

Monolithic 3D IC Design for Deep Neural Networks Monolithic 3D IC Design for Deep Neural Networks 1 with Application on Low-power Speech Recognition Kyungwook Chang 1, Deepak Kadetotad 2, Yu (Kevin) Cao 2, Jae-sun Seo 2, and Sung Kyu Lim 1 1 School of

More information

Design Guidelines for Optimal Results in High-Density FPGAs

Design Guidelines for Optimal Results in High-Density FPGAs White Paper Introduction Design Guidelines for Optimal Results in High-Density FPGAs Today s FPGA applications are approaching the complexity and performance requirements of ASICs. In some cases, FPGAs

More information

signature i-1 signature i instruction j j+1 branch adjustment value "if - path" initial value signature i signature j instruction exit signature j+1

signature i-1 signature i instruction j j+1 branch adjustment value if - path initial value signature i signature j instruction exit signature j+1 CONTROL FLOW MONITORING FOR A TIME-TRIGGERED COMMUNICATION CONTROLLER Thomas M. Galla 1, Michael Sprachmann 2, Andreas Steininger 1 and Christopher Temple 1 Abstract A novel control ow monitoring scheme

More information

PARAS: System-Level Concurrent Partitioning and Scheduling. University of Wisconsin. Madison, WI

PARAS: System-Level Concurrent Partitioning and Scheduling. University of Wisconsin. Madison, WI PARAS: System-Level Concurrent Partitioning and Scheduling Wing Hang Wong and Rajiv Jain Department of Electrical and Computer Engineering University of Wisconsin Madison, WI 53706 http://polya.ece.wisc.edu/~rajiv/home.html

More information

(RC) utilize CAD tools to perform the technology mapping of a extensive amount of time is spent for compilation by the CAD

(RC) utilize CAD tools to perform the technology mapping of a extensive amount of time is spent for compilation by the CAD Domain Specic Mapping for Solving Graph Problems on Recongurable Devices? Andreas Dandalis, Alessandro Mei??, and Viktor K. Prasanna University of Southern California fdandalis, prasanna, ameig@halcyon.usc.edu

More information

A framework for automatic generation of audio processing applications on a dual-core system

A framework for automatic generation of audio processing applications on a dual-core system A framework for automatic generation of audio processing applications on a dual-core system Etienne Cornu, Tina Soltani and Julie Johnson etienne_cornu@amis.com, tina_soltani@amis.com, julie_johnson@amis.com

More information

Design Tools for 100,000 Gate Programmable Logic Devices

Design Tools for 100,000 Gate Programmable Logic Devices esign Tools for 100,000 Gate Programmable Logic evices March 1996, ver. 1 Product Information Bulletin 22 Introduction The capacity of programmable logic devices (PLs) has risen dramatically to meet the

More information

Reconfigurable co-processor for Kanerva s sparse distributed memory

Reconfigurable co-processor for Kanerva s sparse distributed memory Microprocessors and Microsystems 28 (2004) 127 134 www.elsevier.com/locate/micpro Reconfigurable co-processor for Kanerva s sparse distributed memory Marcus Tadeu Pinheiro Silva a, Antônio Pádua Braga

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

DE2 Board & Quartus II Software

DE2 Board & Quartus II Software January 23, 2015 Contact and Office Hours Teaching Assistant (TA) Sergio Contreras Office Office Hours Email SEB 3259 Tuesday & Thursday 12:30-2:00 PM Wednesday 1:30-3:30 PM contre47@nevada.unlv.edu Syllabus

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

CHAPTER 5. CHE BASED SoPC FOR EVOLVABLE HARDWARE

CHAPTER 5. CHE BASED SoPC FOR EVOLVABLE HARDWARE 90 CHAPTER 5 CHE BASED SoPC FOR EVOLVABLE HARDWARE A hardware architecture that implements the GA for EHW is presented in this chapter. This SoPC (System on Programmable Chip) architecture is also designed

More information

PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA. Laurent Lemarchand. Informatique. ea 2215, D pt. ubo University{ bp 809

PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA. Laurent Lemarchand. Informatique. ea 2215, D pt. ubo University{ bp 809 PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA Laurent Lemarchand Informatique ubo University{ bp 809 f-29285, Brest { France lemarch@univ-brest.fr ea 2215, D pt ABSTRACT An ecient distributed

More information

Hardware Modeling using Verilog Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Hardware Modeling using Verilog Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Hardware Modeling using Verilog Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture 01 Introduction Welcome to the course on Hardware

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

A Time-Multiplexed FPGA

A Time-Multiplexed FPGA A Time-Multiplexed FPGA Steve Trimberger, Dean Carberry, Anders Johnson, Jennifer Wong Xilinx, nc. 2 100 Logic Drive San Jose, CA 95124 408-559-7778 steve.trimberger @ xilinx.com Abstract This paper describes

More information

Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration

Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration 24.2 Dynamic Hardware Plugins in an FPGA with Partial Run-time Reconfiguration Edson L. Horta, Universidade de San Pãulo Escola Politécnica - LSI San Pãulo, SP, Brazil edson-horta@ieee.org John W. Lockwood,

More information

32-Mbit 2.7V Minimum Serial Peripheral Interface Serial Flash Memory

32-Mbit 2.7V Minimum Serial Peripheral Interface Serial Flash Memory Features Single 2.7V - 3.6V Supply Serial Peripheral Interface (SPI) Compatible Supports SPI Modes and 3 Supports RapidS Operation Supports Dual-Input Program and Dual-Output Read Very High Operating Frequencies

More information

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier CHAPTER 3 METHODOLOGY 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier The design analysis starts with the analysis of the elementary algorithm for multiplication by

More information

2. BLOCK DIAGRAM Figure 1 shows the block diagram of an Asynchronous FIFO and the signals associated with it.

2. BLOCK DIAGRAM Figure 1 shows the block diagram of an Asynchronous FIFO and the signals associated with it. Volume 115 No. 8 2017, 631-636 ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version) url: http://www.ijpam.eu ijpam.eu DESIGNING ASYNCHRONOUS FIFO FOR LOW POWER DFT IMPLEMENTATION 1 Avinash

More information

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control.

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control. Neuro-Remodeling via Backpropagation of Utility K. Wendy Tang and Girish Pingle 1 Department of Electrical Engineering SUNY at Stony Brook, Stony Brook, NY 11794-2350. ABSTRACT Backpropagation of utility

More information

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream

Steering. Stream. User Interface. Stream. Manager. Interaction Managers. Snapshot. Stream Agent Roles in Snapshot Assembly Delbert Hart Dept. of Computer Science Washington University in St. Louis St. Louis, MO 63130 hart@cs.wustl.edu Eileen Kraemer Dept. of Computer Science University of Georgia

More information

Design Methodologies and Tools. Full-Custom Design

Design Methodologies and Tools. Full-Custom Design Design Methodologies and Tools Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores)

More information

Convolutional Neural Networks for Object Classication in CUDA

Convolutional Neural Networks for Object Classication in CUDA Convolutional Neural Networks for Object Classication in CUDA Alex Krizhevsky (kriz@cs.toronto.edu) April 16, 2009 1 Introduction Here I will present my implementation of a simple convolutional neural

More information

Frank Mueller. Dept. of Computer Science. Florida State University. Tallahassee, FL phone: (904)

Frank Mueller. Dept. of Computer Science. Florida State University. Tallahassee, FL phone: (904) Static Cache Simulation and its Applications by Frank Mueller Dept. of Computer Science Florida State University Tallahassee, FL 32306-4019 e-mail: mueller@cs.fsu.edu phone: (904) 644-3441 July 12, 1994

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture Robert S. French April 5, 1989 Abstract Computational origami is a parallel-processing concept in which

More information