The AMDREL Project in Retrospective

K. Siozios 1, G. Koutroumpezis 1, K. Tatas 1, N. Vassiliadis 2, V. Kalenteridis 2, H. Pournara 2, I. Pappas 2, D. Soudris 1, S. Nikolaidis 2, S. Siskos 2, and A. Thanailakis 1

1 Dept. ECE, Democritus University of Thrace, 67100, Xanthi, Greece
2 Aristotle University of Thessaloniki, 54006 Thessaloniki, Greece
ksiop@ee.duth.gr and dsoudris@ee.duth.gr

Abstract

The design of a novel embedded fine-grain reconfigurable hardware architecture (FPGA) is introduced. The architecture features a number of circuit-level low-power techniques, since power consumption was considered a primary design goal. Additionally, the EX-VPR and DAGGER software tools (part of the MEANDER framework) are presented. The developed tool set design flow is used for mapping logic to the FPGA platform. The novel energy-efficient FPGA architecture was implemented in 0.18μm STM CMOS technology. The efficiency of the entire system (FPGA and tools) was demonstrated by comparisons with existing contemporary commercial and academic FPGA systems.

1. FPGA architecture

FPGAs have recently benefited from technology process advances to become significant alternatives to Application Specific Integrated Circuits (ASICs). In this paper the FPGA architecture, which can be configured using the developed toolset, is presented. The main design goal is energy minimization under delay constraints, while maintaining a reasonable silicon area. The popular island-style architecture, in which an array of logic blocks is surrounded by routing channels, is adopted in the proposed FPGA. The I/O pads are evenly distributed around the perimeter of the FPGA. For demonstration purposes an 8x8 FPGA array was designed and simulated in STM 0.18μm CMOS technology.

1.1. CLB architecture

The Configurable Logic Block (CLB) has been described in detail in [9, 15], but for the sake of completeness it is briefly repeated here.
The interconnect network and configuration architecture are described in detail, and simulation results are presented for the 0.18μm STM process. The design is mostly focused on minimizing energy dissipation without significantly degrading delay and area. The design of the CLB architecture is crucial to the CLB granularity, performance, and power consumption. Based on the exploration results and those reported in [5, 15], a CLB architecture was chosen with energy minimization as the primary criterion. This CLB was used for all the following explorations. The features of the selected CLB are:
- Cluster of 5 Basic Logic Elements (BLEs)
- 4-input LUT per BLE
- One double edge-triggered Flip-Flop per BLE
- One Gated Clock signal per BLE and CLB
- 12 inputs and 5 outputs provided by each CLB
- All 5 outputs can be registered
- A fully connected CLB, resulting in 17-to-1 multiplexing at every input of a LUT
- One asynchronous Clear signal for the whole CLB
- One Clock signal for the whole CLB

1.2. Interconnection architecture

The exploration procedure for determining the best candidate interconnection architecture requires mapping a number of representative benchmark circuits onto various configurations of the FPGA, together with a mechanism for estimating the propagation delay and the energy consumption so that the various solutions can be evaluated. The tools that were developed, which are presented in the next section, were used. Various benchmarks from ITC'99 [14] (part of the MCNC benchmarks) were employed.

1.2.1 Exploration

Exploration was performed for the switch block type, the segment length, the connectivity factor and the population factor. Three different switch block types (supported by VPR) were examined: Disjoint, Wilton and Universal. These SBs were routed with four different segment lengths: L1, L2, L4 and L8. The SB with the most desirable characteristics was selected and routed with mixed segment lengths corresponding to 50% L1 and 50% L2, 50% L1 and 50% L4, and 50% L2 and 50% L4.
Segment length L8 was rejected because of its poor results. The two best candidates in terms of energy consumption were chosen and used to study the effect of the connection box connectivity factor (F). F values ranging from 100% down to 25% for the input and the output connection boxes were

examined. Finally, the effect of reduced connection and switch box population was explored.

As shown in Fig. 1, for small segment lengths the Disjoint and Universal switch boxes present almost the same energy-delay product, with a small advantage for the Disjoint topology. The lowest energy-delay product is obtained for L1 and L2, while the use of L8 leads to a prohibitive energy-delay product. Since the Disjoint topology presents the lowest energy-delay product and is simpler to implement, it was used as the switch box across all the following experiments.

Figure 1. Energy-delay product (sec*Joule, average over benchmarks) of the Disjoint, Wilton and Universal switch boxes for segment lengths L1, L2, L4 and L8

Table 1 summarizes the results of our exploration for Energy, Delay and Area. Since the L4 architecture, with F=1 for input and output and fully populated, gave the best performance, all the results are reported relative to this architecture.

Table 1. Exploration Results

Fin=1, Fout=1, Full Population
Segment Length    Energy   Delay   Area
L4 (reference)    0%       0%      0%
L1                -22%     7%      -5%
L1&L2             -18%     3%      -5%
L2                -15%     1%      -3%
L1&L4             -12%     1%      -4%
L2&L4             -9%      0%      -3%

L1&L2, Full Population
F(in&out)         Energy   Delay   Area
1&1               -18%     3%      -5%
1&0.75            -18%     2%      -4%
0.75&0.75         -18%     4%      -4%
0.75&0.5          -19%     3%      -3%
0.5&0.5           -18%     1%      -3%
0.5&0.25          -18%     4%      -2%
0.25&0.25         -16%     0%      3%

L1, Full Population
F(in&out)         Energy   Delay   Area
1&1               -2%      7%      -5%
1&0.75            -21%     5%      -4%
0.75&0.75         -21%     9%      -4%
0.75&0.5          -21%     6%      -4%
0.5&0.5           -2%      8%      -3%
0.5&0.25          -21%     6%      -1%
0.25&0.25         -19%     6%      2%

L1&L2, Fin=0.5, Fout=0.5, Reduced Population
Box Type          Energy   Delay   Area
SB                -19%     4%      -2%
CB                -17%     2%      -2%
CB&SB             -18%     3%      -2%

1.2.2. Selected Interconnect Network Architecture

Based on the results presented above and the fact that the design goal was low energy consumption under delay constraints, the interconnect network architecture was selected.
Precisely, the architecture characteristics are: a) segment length L1, b) full population for connection and switch boxes, c) F=1 for the input and the output connection boxes, and d) Disjoint switch box with Fs=3. In addition, based on the application requirements, the number of tracks in the routing channel was selected to be 2 and the array dimensions 8x8 CLBs. Finally, for communication with the FPGA, I/O pads were placed on each side of the FPGA. One pad was placed for each CLB on the perimeter of the FPGA, each containing three I/O pins. Each pin can connect to each of the routing tracks through a configurable pass transistor.

A RAM-based, island-style interconnection architecture [5] was designed; this style of FPGA interconnect is also employed by Xilinx [1], Lucent Technologies [6] and the Vantis VF1 [2]. More specifically, the logic blocks are surrounded by vertical and horizontal metal routing tracks, which connect the logic blocks via programmable routing switches. These switches contribute significant capacitance and, combined with the metal wire capacitance, are responsible for the greatest share of the dissipated power. Routing switches are either pass transistors or pairs of tri-state buffers (one in each direction) and allow wire segments to be joined in order to form longer connections [15]. The effect of the routing switches on power, performance and area was explored in [4, 12]. Alternative configurations for different segment lengths and for three types of Switch Box (SB) [4, 12], namely Disjoint, Wilton and Universal, were explored. A number of ITC'99 benchmark circuits [14] were mapped on these architectures and the energy, delay and area requirements were measured. Another important parameter is the routing segment length. A number of general benchmarks were mapped on FPGA arrays of various sizes and segment lengths and the results were evaluated [5, 9, 15].
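To make the Disjoint switch box with Fs=3 more concrete, the following minimal Python sketch enumerates its programmable connections. It assumes the usual definition of the Disjoint (Subset) topology, in which track i on one side can connect only to track i on each of the other three sides. The function names and the track count used in the example are hypothetical and are not taken from the AMDREL tools.

```python
from itertools import combinations

# The four sides of a switch box in an island-style FPGA.
SIDES = ("left", "top", "right", "bottom")


def disjoint_switch_box(num_tracks):
    """Enumerate the connections of a Disjoint (Subset) switch box.

    Assumption: track i on one side may connect only to track i on the
    other three sides. Each connection is an unordered pair of
    (side, track) endpoints.
    """
    connections = set()
    for track in range(num_tracks):
        for side_a, side_b in combinations(SIDES, 2):
            connections.add(frozenset({(side_a, track), (side_b, track)}))
    return connections


def flexibility(connections, endpoint):
    """Count how many other wire ends a given endpoint can reach (Fs)."""
    return sum(1 for pair in connections if endpoint in pair)


# Illustrative channel width only; not the value used in the paper.
sb = disjoint_switch_box(num_tracks=4)
print(len(sb), flexibility(sb, ("left", 2)))
```

Every endpoint reaches exactly three other wire ends, which is what Fs=3 expresses, and the number of switches grows linearly with the channel width (six per track in this sketch).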
The exploration results for energy consumption, performance and area for the Disjoint switch box topology, for various FPGA array sizes and wire segments, are shown in Fig. (2)-(4).

Figure 2. Energy consumption exploration results (energy in Joule, average over benchmarks, for array sizes 8x8, 10x10, 12x12, 14x14 and 16x16 and segment lengths L1, L1&L2, L1&L4, L2, L2&L4, L4 and L8)

Figure 2. Performance exploration results (delay in sec, average over benchmarks, for the same array sizes and segment lengths)
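The energy-delay product used as the selection metric in this exploration can be sketched as follows. This is only an illustration of the ranking step: the function names are hypothetical, and the energy and delay numbers in the example are placeholders chosen for readability, not measurements from the paper.

```python
def energy_delay_product(energy_joule, delay_sec):
    """Energy-delay product of one candidate architecture (Joule * sec)."""
    return energy_joule * delay_sec


def rank_by_edp(candidates):
    """Return candidate names ordered from lowest to highest EDP.

    `candidates` maps a name to an (energy_joule, delay_sec) pair, each
    averaged over the benchmark set.
    """
    return sorted(
        candidates,
        key=lambda name: energy_delay_product(*candidates[name]),
    )


# Placeholder values for illustration only (not results from the paper).
example = {
    "segment_A": (2.0e-10, 2.0e-8),
    "segment_B": (1.5e-10, 2.1e-8),
    "segment_C": (3.0e-10, 1.9e-8),
}
print(rank_by_edp(example))
```

Ranking on the product rather than on energy alone keeps a candidate from being selected when its energy savings come at the cost of a disproportionate delay increase.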

Figure 3. Area exploration results (area in um^2, average over benchmarks, for array sizes 8x8, 10x10, 12x12, 14x14 and 16x16 and segment lengths L1, L1&L2, L1&L4, L2, L2&L4, L4 and L8)

An interconnect architecture with the following features was selected:
- Disjoint Switch-Box topology with Fs=3
- Segment length L1
- Connection-Box (CB): connectivity equal to one (F=1) for input and output Connection-Boxes
- Full population for Switch and Connection-Boxes
- The size of the CB output and SB transistors is $W_n/L_n = 1 \times 0.28/0.18$

1.3. Configuration Architecture

The realization of an FPGA requires, apart from the definition of the CLB and interconnection architecture, the determination and implementation of the configuration architecture. The proposed configuration architecture consists of the following components: the memory cell, where the programming bits are stored; the local storage element for each tile (a tile consists of a CLB with its input and output connection boxes, a Switch Box, plus the memory for its configuration); and the decoder, which controls the configuration procedure of the whole FPGA.

2. MEANDER Design Framework

Equally important to an FPGA platform is a tool set that supports the implementation of digital logic on the proposed FPGA. Therefore, such a design flow was realized. It comprises a sequenced set of steps employed in programming an FPGA chip, as shown in Fig. 4. The input is the RTL-VHDL circuit description, while the output of the CAD flow is the bitstream file that can be used to configure the FPGA. Because the DIVINER tool, which performs the basic function of RTL synthesis, so far supports only a subset of VHDL, the MEANDER flow offers the option to synthesize complex HDL benchmarks in a commercial tool (Leonardo, Synplicity, etc.) and then continue the design in the MEANDER flow through the DRUID tool. Additionally, the proposed tool framework can be used in architecture-level exploration, i.e. in finding the appropriate FPGA array size (number of CLBs) and routing track parameters (SB, CB, etc.) for the optimal implementation of a target application. The tools are available at the AMDREL website [13].

Figure 4. The proposed design framework

The Graphical User Interface (GUI) lets the designer easily use all (or some) of the tools included in the proposed design flow. To date, there is no other academic implementation of such a complete graphical design chain. It can be run from a local PC or through the Internet/Intranet, and the source code can be easily modified in order to add more tools. The tools can also be executed on-line at http://vlsi.ee.duth.gr:881. An extensive description of the whole design flow can be found in [5, 8, 10, 13]. Here, we describe only the features of the new tools EX-VPR and DAGGER, which concern placement and routing and reconfiguration bitstream generation, respectively. Qualitative and quantitative comparison results for the whole design flow, as well as for certain tools, are given in Section 3.

2.1. EX-VPR

The EX-VPR tool, which places and routes the circuit onto the FPGA, is based on VPR [3, 4, 12]. It was extended by adding a silicon area model that estimates the area of the device in um^2, assuming STM 0.18μm technology. Another very important extension is the addition of user-defined full-custom switch boxes, whereas the original version of VPR supported only three types of switch boxes, namely Subset (similar to the one used in Xilinx XC4000 devices) [12], Wilton [12] and Universal [12]. This feature is possible because EX-VPR handles devices with switch boxes in which the acceptable connections among routing tracks are defined by the designer. For demonstration purposes, besides the three switch boxes already available in VPR, we have implemented three additional switch boxes. The first one is described in Fig. 2(a) of [16], the second one is shown in Fig. 1(a) of [17], while the last one is defined in Fig. 4(b) of [18].

In addition, EX-VPR has the ability to integrate IP cores. This feature allows the user to reserve a region inside the FPGA with specific (x,y) coordinates for the placement of IP modules (e.g. CPUs, memories). The main advantage is that the designer can realize a composite system on the FPGA architecture and can therefore perform rapid prototyping of a new design. The power consumption of an FPGA is calculated by the extended version of PowerModel [11], which takes into account the new FPGA components (new SBs, IP cores, etc.). Finally, the last extension concerns a procedure for post-placement optimization. This routine tries to re-place the circuit while keeping the same length of the routing segments between CLBs. This results in a more efficient utilization of FPGA resources and maximum routability between the remaining CLBs, which can be used later through partial or dynamic reconfiguration.

2.2. DAGGER

DAGGER (DEMOCRITUS UNIVERSITY OF THRACE E-FPGA BITSTREAM GENERATOR) is a new FPGA configuration bitstream generator. This tool has been designed and developed from scratch. To our knowledge there is no other available academic implementation with such features. The first version of the DAGGER tool [8, 10] supported features such as compression, encryption and error detection. The second version of the tool is completely technology independent: the bitstream file is generated in a general form and then written out as the appropriate configuration file, taking into consideration the technology parameters of the device (such as SB, routing tracks, CLB parameters, etc.). Another critical difference between the two versions is the ability to perform partial and run-time reconfiguration. Besides this, the new version performs optimal placement of the circuit implemented by partial reconfiguration, leading to efficient utilization of FPGA resources (number of CLBs) and allowing maximum routability of the remaining resources.
Finally, DAGGER employs some low-power techniques both in the source code of the tool and in the way that the FPGA is programmed.

2.2.1. Supported types of reconfiguration

The DAGGER tool can handle both run-time and partial reconfiguration, if they are supported by the target device. Using selective reconfiguration can greatly reduce the amount of configuration data that must be transferred to the FPGA device. Several run-time reconfigurable systems are based upon a partially reconfigurable design, including Chimaera, PipeRench, NAPA, the Xilinx 6200 and Virtex FPGAs.

2.2.2. Tool Characteristics

The tool characteristics can be summarized as follows:

Bitstream re-allocation (Defragmentation): A number of systems use run-time relocation [8], among them Chimaera, PipeRench and Garp. The DAGGER tool incorporates the ability to defragment the reconfigurable device.

Compression: The effectiveness of a compression technique is characterized by the achieved compression ratio, that is, the ratio of the size of the compressed data to the size of the original data. However, depending on the application, metrics such as processing rate, implementation cost, and adaptability may become critical performance issues. From the exploration results, DAGGER reduces the initial bitstream file by an average ratio of 8%, while it is about 8% better than Run Length Encoding and 86% better than the TAR compression tool. In contrast, it compresses the file 9% less than the ZIP algorithm and 16% less than GZIP.

Error Detection: The DAGGER tool incorporates the Cyclic Redundancy Checking (CRC) algorithm (CRC-16 = X^16 + X^15 + X^2 + 1) for checking the data written to any configuration register.

Read-Back Technique: Read-back is the process of reading all the data in the internal configuration memory from the FPGA device.

Encryption Algorithm: The DAGGER output file can be encrypted, for the security of both the FPGA device and the program running on it.
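The error-detection step can be illustrated with a small bitwise CRC-16 routine for the polynomial X^16 + X^15 + X^2 + 1 (0x8005). The paper does not state the initial register value, the bit ordering or how the checksum is attached to the configuration data, so the choices below (zero initial value, most-significant-bit first, checksum appended big-endian) are assumptions made for this sketch, and the function names are hypothetical rather than part of DAGGER.

```python
CRC16_POLY = 0x8005  # x^16 + x^15 + x^2 + 1; the x^16 term is implicit


def crc16(data: bytes, crc: int = 0x0000) -> int:
    """Bitwise CRC-16, MSB first, no reflection, no final XOR (assumed)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ CRC16_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def append_crc(frame: bytes) -> bytes:
    """Return the frame followed by its 16-bit CRC (big-endian)."""
    return frame + crc16(frame).to_bytes(2, "big")


def frame_is_valid(frame_with_crc: bytes) -> bool:
    """With the conventions above, a correct frame leaves a zero remainder."""
    return crc16(frame_with_crc) == 0


protected = append_crc(b"configuration data")
corrupted = bytes([protected[0] ^ 0x01]) + protected[1:]
print(frame_is_valid(protected), frame_is_valid(corrupted))
```

Under these assumptions the receiver only needs to run the same routine over the data plus checksum and compare the result with zero, which is why this form is convenient for checking writes to configuration registers.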
Low-Power Techniques: DAGGER employs some low-power techniques both in the source code of the tool and in the way that the FPGA is programmed. The source code is written in a way that minimizes the I/O requests to the memories. Similarly, the bitstream file contains bits that have no meaning and can have value 0 or 1 without affecting the circuit functionality.

3. Comparisons

Qualitative comparisons in terms of provided features among the MEANDER (the proposed one), Xilinx [1], Toronto [3] and Alliance [7] tool frameworks are provided in Table 2. Each feature is marked either as available in the design framework or as not supported by it. The (-) symbol indicates that the corresponding feature is not provided, but is not necessary for the completeness of that framework either. Table 2 shows that the MEANDER design framework provides implementation from as high-level a description as possible (RTL) down to the FPGA configuration file, while it also provides power consumption estimation and configuration bitstream generation, which the other academic frameworks do not. It also features a GUI (which the other academic frameworks do not) and remote access to it (which no other framework, commercial or academic, does). The only limitation of the proposed framework is that it does not currently

support back-annotation, but no other academic tool framework does either.

Table 2. Qualitative comparison among tool frameworks

FEATURE                MEANDER        Xilinx         Toronto   Alliance
Input Format           VHDL/Verilog   VHDL/Verilog   BLIF      VHDL
Synthesis
Power estimation
Area estimation
Archit. description
Placement
Routing
Bitstream
Back-annotation
GUI
Access through HTTP
User Manual

It is evident that the MEANDER framework is the most complete academic tool framework and is, at least in terms of provided features, comparable with commercial tools. It contains the only known academic implementation of an architecture-independent configuration bitstream generation tool. Additionally, the remote access to the GUI allows the user to run the framework without having the tools installed on his/her own computer.

A qualitative comparison between VPR [3, 4, 12] and EX-VPR is shown in Table 3. It is clear that EX-VPR provides the VPR features, while it also incorporates the flexibility of full-custom switch box definition, the IP handling option, the silicon area calculation, and finally remote access. The remote access to EX-VPR allows the user to run the tool (and consequently the MEANDER framework) without having the tools installed on his/her own computer.

Table 3. Qualitative comparison between VPR and EX-VPR

Feature                            VPR [3]                      EX-VPR
Placement
Routing
Supported Switch Boxes (SBs)       Subset, Wilton, Universal    Subset, Wilton, Universal, User specified
IP core
Power Estimation
Timing info (sec)
Silicon Area (um^2)
Application specific FPGA design
GUI
Graphic architecture description
Run through HTTP

Various benchmarks from ITC'99 [14] (part of the MCNC benchmarks) were implemented in the AMDREL FPGA array using the proposed design framework, and in Xilinx devices of similar resources using the Xilinx ISE tools. The benchmarks range from a few gates to tens of thousands and include combinational, sequential and Finite State Machine circuits. Figure 5 shows the maximum frequencies obtained by the two frameworks and devices.
Figure 5. Maximum frequency comparison (MHz per benchmark, AMDREL vs. Xilinx)

Fig. 6 provides power consumption figures for some of the benchmarks mentioned above.

Figure 6. Power consumption comparison (mW per benchmark, AMDREL vs. Xilinx)

Fig. 7 shows the average contribution of each power consumption component to the total power budget per benchmark. It can be noticed that the major share of the consumption for the AMDREL designs came from the logic blocks, while for Xilinx it came from the signals. The reason is that the open-source synthesis tools are not as efficient as the commercial ones in terms of LUTs.

Figure 7. The power budget pie of AMDREL and Xilinx implementations

Fig. 8 shows the results of applying the DAGGER strategy for partial bitstream reconfiguration to the proposed FPGA array for a number of benchmarks. The initial bitstream size (the left bar for each benchmark in Fig. 8) is the number of bits in the bitstream file derived from the EX-VPR tool that are required for reconfiguration of the FPGA array. The DAGGER bitstream size is the size of the configuration file produced by the DAGGER tool, employing features such as compression and partial reconfiguration. The DAGGER bitstream file, being smaller than the initial one, needs fewer memory cells for storing the FPGA configuration and achieves better hardware resource utilization, as it programs only the functional CLBs in the partial reconfiguration procedure. From this figure, it is clear that as the benchmark grows in number of gates,

the gains in bitstream size achieved by the DAGGER tool, compared to the initial bitstream size, become larger.

Figure 8. DAGGER bitstream file size (number of bits per benchmark, initial bitstream size vs. DAGGER bitstream size)

Fig. 9 shows the bitstream file size for programming the AMDREL FPGA compared to a number of Xilinx devices (Spartan2-E, Virtex, Virtex-E, and Virtex-II). These sizes are derived from the DAGGER tool and the Xilinx ISE framework. All the benchmarks are placed and routed in the smallest device that fits. In Fig. 9, a change in curve slope means that the corresponding benchmark was implemented in a larger device of the same family (which may require a bigger or smaller configuration file). Based on the results, it is apparent that DAGGER produces the smallest bitstream file compared to all the available commercial devices from Xilinx. As a result, the AMDREL device needs fewer memory cells to store the FPGA configuration bits. Although the MEANDER framework needs more LUTs than the Xilinx tools to implement the same benchmark (because our synthesizer is not as efficient as the commercial one), the DAGGER bitstream is smaller than the one produced for Xilinx devices.

Figure 9. Bitstream size comparison (number of bits per benchmark for DAGGER, Spartan2-E, Virtex, Virtex-E and Virtex-II)

4. Conclusions

A novel FPGA architecture (CLB, interconnect and configuration architecture) with low-power features was presented, together with the EX-VPR and DAGGER tools (part of the MEANDER framework), which are used for implementing logic on this platform. The proposed system of FPGA (implemented in 0.18μm STM technology) and tool framework showed promising results when compared with commercial products using a number of benchmarks.
Acknowledgement

This work was partially supported by the project IST-34379-AMDREL, which is funded by the European Commission.

5. References

[1] http://direct.xilinx.com/bvdocs/publications/ds003.pdf
[2] http://www.vantis.com
[3] http://www.eecg.toronto.edu/~vaughn/vpr/vpr.html
[4] V. Betz, J. Rose and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, 1999.
[5] K. Tatas, K. Siozios, N. Vasiliadis, D. J. Soudris, S. Nikolaidis, S. Siskos, and A. Thanailakis, FPGA Architecture Design and Toolset for Logic Implementation, 13th International Workshop, PATMOS 2003, Turin, Italy, 2003, pp. 607-616.
[6] http://www.lucent.com
[7] http://www-asim.lip6.fr/recherche/alliance/
[8] K. Siozios, G. Koutroumpezis, K. Tatas, D. Soudris and A. Thanailakis, DAGGER: A Novel Generic Methodology for FPGA Bitstream Generation and its Software Tool Implementation, 12th Reconfigurable Architectures Workshop (RAW 2005), Colorado, USA, April 4-5, 2005.
[9] V. Kalenteridis, et al., An Integrated FPGA Design Framework: Custom Designed FPGA Platform and Application Mapping Toolset Development, 11th Reconfigurable Architectures Workshop (RAW 2004), Santa Fe, New Mexico, USA, April 26-27, 2004, pp. 138a.
[10] K. Siozios et al., A Novel FPGA Configuration Bitstream Generation Algorithm and Tool Development, in Proc. FPL 2003, Antwerp, Belgium, Aug. 30 - Sep. 1, 2004, pp. 1116-1118.
[11] K. Poon, A. Yan, S. Wilton, A Flexible Power Model for FPGAs, in Proc. FPL 2002, Montpellier, France, 2002, pp. 312-321.
[12] G. Varghese and J. M. Rabaey, Low-Energy FPGAs: Architecture and Design, Kluwer Academic Publishers, 2001.
[13] http://vlsi.ee.duth.gr/amdrel
[14] Ken McElvain, Benchmarks tests files, in Proc. MCNC International Workshop on Logic Synthesis, 1993.
[15] H. Kalenteridis et al., A complete platform and toolset for system implementation on fine-grain reconfigurable hardware, Microprocessors and Microsystems, Vol. 29, pp. 247-259, 2005.
[16] Hongbing Fan, et al., "Reduction Design for Generic Universal Switch Blocks", in ACM Trans. on Design Automation of Electronic Systems, Vol. 7, No. 4, Oct. 2002, pp. 526-546.
[17] Yao-Wen Chang, D. F. Wong and C. K. Wong, "Universal switch modules for FPGA design", in ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 1, Issue 1, pp. 80-101, Jan. 1996.
[18] Yao-Wen Chang, D. F. Wong, and C. K. Wong, "Universal Switch-Module Design for Symmetric-Array-Based FPGAs", 4th Intern. ACM Symposium on FPGAs (FPGA'96), pp. 80-86, Feb. 11-13, 1996.