Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt
ELIS department, Computer Systems Lab, Ghent University
Sint-Pietersnieuwstraat 41, Ghent B-9000, Belgium
Email: {Amit.Kulkarni, Tom.Davidson, Karel.Heyse, Dirk.Stroobandt}@UGent.be

Abstract

Dynamic Circuit Specialization (DCS) is an optimization technique used for implementing a parameterized application on an FPGA. An application is said to be parameterized when some of its inputs, called parameters, change infrequently compared to the other inputs. Instead of implementing these parameter inputs as regular inputs, the DCS approach implements them as constants and optimizes the design for these constants. When the parameter values change, the design is re-optimized for the new constant values by reconfiguring the FPGA. Earlier investigations have shown that run-time reconfiguration speed is the limiting factor of DCS implementations on Xilinx FPGAs. We propose to constrain the design's placement and to use a custom Xilinx HWICAP driver to improve reconfiguration speed at the cost of a small reduction in design performance. We use Xilinx Virtex-5 and Zynq-7000 devices as experimental platforms, and an 8-bit FIR filter with different tap configurations as our parameterized design, whose filter coefficients are the infrequently changing inputs. The reconfiguration speed improves by a factor of 14 with only a 6% decrease in performance.

I. INTRODUCTION

Partial run-time reconfiguration is the ability to modify some logic blocks of an FPGA while the rest of it remains active. One of the commercially available technologies, developed by Xilinx, is called Partial Reconfiguration (PR) and has been on the market for quite a while. However, its reconfiguration overhead greatly diminishes the advantage of using PR.
The authors of [1] developed Dynamic Circuit Specialization (DCS), a partial reconfiguration technique tailored to parameterized applications. DCS uses run-time reconfiguration to specialize the parameterized design depending on the values of the infrequently changing inputs (the parameters). Hence, for every change in a parameter value, a new specialized bitstream is generated and the FPGA is reconfigured with it. A detailed implementation of the DCS tool flow on a self-reconfigurable platform is described in [2].

The DCS tool flow consists of two stages: the generic stage and the specialization stage. In the generic stage, the design with parameterized inputs, described in a Hardware Description Language (HDL), is processed to yield a Partial Parameterized Configuration (PPC), which contains the bitstream expressed in the form of boolean functions. In the specialization stage, the Specialized Configuration Generator (SCG) evaluates these boolean functions for a specific parameter value to generate a specialized bitstream. The SCG is usually implemented on an embedded processor, which is also responsible for swapping the specialized bitstream into the configuration memory using the HWICAP.

Our experiments with DCS implementations on a self-reconfigurable platform have shown that the HWICAP is the main bottleneck for the reconfiguration speed, since its throughput is not high enough to match the speed of the embedded processor used during the reconfiguration process. However, experiments described in [3] have shown that the bottleneck depends on the experimental setup and on the components that participate in the reconfiguration process. The Xilinx HWICAP driver function "XhwIcap_setClb_bits" is used to reconfigure the truth table entries of a single LookUp Table (LUT) at run time.
However, with existing Xilinx column based FPGA architectures, we propose to reconfigure multiple LUTs at the same time. We do this by using design placement constraints to cluster the bits that have to be changed in the same reconfiguration columns and by customizing the "XhwIcap_setClb_bits" function. This gives a significant improvement in reconfiguration speed, at the cost of a slight reduction in the performance of the design. In this paper we show the trade-off between design performance and reconfiguration speed achieved by employing placement constraints and a custom HWICAP driver. We use the custom HWICAP driver along with the placement constraints on the Xilinx Virtex-5 and Zynq-7000 FPGAs to implement 8-bit FIR filters using DCS.

In Section II, we describe the reconfiguration process of DCS. A brief overview of the column based Virtex-5 and Zynq-7000 FPGA architectures is presented in Section III. In Section IV, the details of the Xilinx HWICAP driver used for reconfiguration (the "XhwIcap_setClb_bits" function) are described. In Section V, the use of placement constraints for the parameterized design is described. In Section VI, we present the main idea of improving the "XhwIcap_setClb_bits" driver. In Section VII, we briefly describe the experiments with parameterized designs, tabulate the improved reconfiguration speeds, and compare and discuss the trade-off between reconfiguration speed and design performance. Finally, we conclude in Section VIII.

978-1-4799-5944-0/14/$31.00 © 2014 IEEE

II. RUN-TIME RECONFIGURATION FOR DYNAMIC CIRCUIT SPECIALIZATION

In this section, we briefly explain how run-time reconfiguration is used in Dynamic Circuit Specialization. In [4], it is explained how the parameterized design is mapped onto virtual LUTs called Tunable LUTs (TLUTs). TLUTs are virtual versions of conventional LUTs whose truth table entries are expressed as boolean functions of the parameters. The bitstream of a parameterized design is thus expressed as a boolean function of the parameters, resulting in a parameterized configuration. For every change in the parameter input values, the SCG generates a new specialized bitstream by evaluating the corresponding boolean functions. Usually, the SCG is implemented on an embedded hard-core processor such as the PowerPC or on an embedded soft-core processor such as the MicroBlaze. The specialized bitstream represents the truth table entries of the TLUTs.

Once the specialized bitstream is generated, it has to be swapped into the FPGA configuration memory to reconfigure the LUTs that correspond to the virtual TLUTs. The swapping is done by using the HWICAP as a configuration interface on a Xilinx FPGA. The HWICAP is accessed for reconfiguration through its driver function "XhwIcap_setClb_bits" [5]. More information on this driver is found in Section IV. The main advantage of this driver is that it can reconfigure a specific LUT given its location co-ordinates, so any LUT can be reached through this function. Its only disadvantage is that "XhwIcap_setClb_bits" has to be called once for every single LUT, even though there are good opportunities to reconfigure multiple LUTs with a single function call. To understand how this driver works, it is necessary to understand the Xilinx column based FPGA architecture first.

TABLE I. XILINX FPGA DEVICE DETAILS

Device name                          XC5VFX70T-FFG1136           XC7Z020-CLG484-1
Board name                           ML507 Evaluation Platform   ZedBoard
Hard-core processor                  PowerPC 440 Core            ARM Cortex-A9
  Clock frequency                    400 MHz                     667 MHz
Soft-core processor                  MicroBlaze (8.20.b)         MicroBlaze (8.40.a)
  Clock frequency                    100 MHz                     100 MHz
LUT inputs                           6                           6
LUT entries                          64                          64
HWICAP type                          XPS HWICAP (5.01.a)         AXI HWICAP (2.03.a)
HWICAP clock (MHz)                   100                         100
HWICAP throughput, non-DMA (MB/s)    19                          19
HWICAP port width (bits)             32                          32
Number of clock regions              16                          6
Number of CLBs in one CLB column     20                          50
Frame size (32-bit words)            41                          101

III. XILINX COLUMN BASED ARCHITECTURE

We consider the modern column based FPGA architectures from Xilinx for our experiments. Our experiments are limited to the Virtex-5 and the Zynq-7000 FPGAs only. However, the idea of improving the reconfiguration speed can be applied to any column based Xilinx FPGA. The specifications related to reconfiguration are tabulated in Table I.

The Xilinx FPGA contains an array of Configurable Logic Blocks (CLBs), which encapsulate LUTs, flip-flops and multiplexers. Each CLB contains 8 LUTs and is capable of realizing combinational and sequential logic. The array of CLBs is divided into a number of clock regions. Each clock region contains CLB columns with a fixed number of CLBs, and the height of the CLB column is the same in all clock regions. Multiple CLB columns lie adjacent to each other, thus forming CLB rows as shown in Figure 1. Other columns, such as DSP and BRAM columns, exist in between the CLB columns.

Fig. 1. Column based FPGA architecture

Frame Structure

A frame is the smallest addressable element of an FPGA configuration. It can be viewed as a vertical stack of a fixed number of bits spanning the complete height of a row [6] [7]. A fixed data size of 2 words (1 word = 32 bits) is assigned to each CLB within the frame. This means that a set of LUT entries present in one CLB can be configured within those 2 words. However, the complete configuration data of an entire CLB containing multiple LUTs spans multiple frames, and each frame has its own unique frame address [6]. Note that each column has one extra word per frame, called the "HCLK config word", as shown in Figure 2.
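The frame layout described above (2 words per CLB plus one HCLK word) can be checked against the frame sizes in Table I with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the transfer-time estimate assumes that 1 MB equals 10^6 bytes and that the 19 MB/s HWICAP throughput of Table I is sustained without any driver overhead.

```python
WORD_BITS = 32  # one configuration word, as stated in the text


def frame_words(clbs_per_column: int, words_per_clb: int = 2, hclk_words: int = 1) -> int:
    """Words in one frame: 2 words per CLB plus the HCLK config word."""
    return clbs_per_column * words_per_clb + hclk_words


def frame_transfer_us(words: int, throughput_mb_s: float = 19.0) -> float:
    """Idealized time (microseconds) to move one frame through the HWICAP.

    Assumption: 1 MB = 10**6 bytes and no software overhead.
    """
    frame_bytes = words * WORD_BITS // 8
    return frame_bytes / (throughput_mb_s * 1e6) * 1e6


# CLBs per CLB column, taken from Table I
CLBS_PER_COLUMN = {"XC5VFX70T": 20, "XC7Z020": 50}

for device, clbs in CLBS_PER_COLUMN.items():
    words = frame_words(clbs)
    print(f"{device}: {words} words per frame, about {frame_transfer_us(words):.1f} us per frame")
```

Under these assumptions a frame of the XC7Z020 takes roughly 2.5 times as long to transfer as a frame of the XC5VFX70T, simply because it holds 101 words instead of 41.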

Fig. 2. Frame structure of column based Xilinx FPGA

A single frame can contain truth table entries of multiple LUTs that are located in a single CLB column. In the Virtex-5 there are 20 CLBs in one column, and hence a total of 20 × 2 + 1 = 41 words exist in one frame. Similarly, in the Zynq-7000 family there are 50 CLBs in one column, so a total of 50 × 2 + 1 = 101 words exist in one frame. The frame size plays an important role during the reconfiguration process. Since a frame is the smallest addressable element, at least one frame has to be accessed via the HWICAP for every reconfiguration. Thus the time taken to reconfigure a LUT is affected by the frame size. For a fixed HWICAP throughput, an increase in frame size results in an increase in reconfiguration time and thus reduces the reconfiguration speed.

IV. THE "XhwIcap_setClb_bits" DRIVER

This is the HWICAP driver function used to reconfigure the actual LUTs that serve as virtual TLUTs in the DCS implementation. It accepts the TLUT location co-ordinates and the specialized bits (truth table values) as inputs. The function first generates a frame address from the given TLUT location co-ordinates, which targets the frames that contain the truth table entries of the corresponding TLUT. The complete reconfiguration occurs in 3 steps:

1) Read frames: With the help of the frame address, multiple frames containing all the truth table entries of one TLUT are read from the configuration memory.
2) Modify frames: The current truth table entries of the TLUT are replaced with the specialized truth table bits.
3) Write back the frames: With the help of the same frame address, the specialized truth table values are updated in the TLUT by swapping the modified frames into the configuration memory of the FPGA.

The frames are accessed through the HWICAP, which has a fixed throughput. All 3 steps of the reconfiguration process have to be executed to reconfigure a single TLUT, and this is inefficient. The HWICAP with its fixed throughput proves to be a bottleneck and hence limits the reconfiguration speed. Our approach is to improve "XhwIcap_setClb_bits" so that multiple TLUTs can be modified within a single read and write activity on the frames.

V. PLACEMENT CONSTRAINTS TO IMPROVE RECONFIGURATION SPEED

The main aim of using placement constraints is to force multiple TLUTs to cluster all their truth table entries in a minimal number of frames. Placement constraints restrict where the design's logic is placed by forcing the placer to use a certain area of the FPGA. We have described the correlation between the CLB columns and the frame structure in Section III. Our approach is to force more TLUTs to be placed in a single CLB column so that their truth table entries can be reconfigured with a minimal number of frame accesses.

We have used the "AREA_GROUP" constraint [8]. This constraint allows us to specify that certain parts of the design can only be placed in a pre-determined rectangular region of the FPGA's CLBs. To determine the exact size of this rectangular region, the maximum length of the CLB column and the minimum width of the CLB rows have to be considered. The maximum length of the CLB column is equal to its height in a given clock region (50 for the Zynq-7000 and 20 for the Virtex-5) and ensures that more TLUTs fit in the specified area, while the minimum width in CLB rows ensures that we use the minimal number of CLB columns possible. The exact area constraint differs between the two targeted FPGAs.

We first used the constraint to place the TLUTs in the exact minimum number of CLB columns, determined by the number of LUTs present in a column. For example, in the Zynq-7000 each column has 200 LUTs. Therefore, to place the 64-tap FIR filter (1536 TLUTs), 8 columns are sufficient. However, with 8 columns the router was not able to route the design. Hence we increased the width of the rectangular area by adding columns until the router was able to route the whole design. The width of the rectangular area in terms of CLB columns for the different configurations of the FIR filter is tabulated in Table III.

For a 64-tap FIR filter, the average number of TLUTs clustered in a single CLB column of the Zynq-7000 is 110, which is 52% of the total LUTs available in a single CLB column, and a maximum of 156 TLUTs are clustered in a single column, which is 75%. The remaining LUTs are not part of the reconfiguration process and hence are used for the non-reconfigurable parts of the design. Similarly, for the Virtex-5, the average number of TLUTs clustered in a single CLB column is 41, which is 55% of the total LUTs available in a single CLB column, and a maximum of 60 TLUTs are clustered in a single column, which is 78%. Table II shows the percentage of TLUTs clustered.

TABLE II. TLUTS CLUSTER RATE OF 64-TAP FIR FILTER IN A SINGLE CLB COLUMN

                   Virtex-5             Zynq-7000
                   Average   Maximum    Average   Maximum
Clustered TLUTs    55%       78%        52%       75%
Remaining LUTs     45%       22%        48%       25%
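The column arithmetic used in this section can be reproduced with a small Python sketch. The helper names are hypothetical and not part of any Xilinx tool; the inputs (1536 TLUTs, 200 LUTs per column, 14 columns actually used for the 64-tap filter on the 50-CLB-high device) are the figures quoted in the text and in Table III.

```python
import math


def min_columns(tluts: int, luts_per_column: int) -> int:
    """Lower bound on CLB columns if every LUT in a column could hold a TLUT."""
    return math.ceil(tluts / luts_per_column)


def average_tluts_per_column(tluts: int, columns_used: int) -> float:
    """Average number of TLUTs per column once the region has been widened."""
    return tluts / columns_used


tluts_64_tap = 1536
print(min_columns(tluts_64_tap, 200))               # theoretical minimum width
print(average_tluts_per_column(tluts_64_tap, 14))   # width that could be routed
```

The gap between the theoretical minimum and the width that actually routes is what leaves part of each column free for non-reconfigurable logic.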

VI. IMPROVING THE "XhwIcap_setClb_bits" DRIVER

Once multiple TLUTs are placed within a single column, the "XhwIcap_setClb_bits" driver can be modified to exploit the frame structure of the column based Xilinx FPGA architecture. If multiple TLUTs of a parameterized design are placed in a single column, then a single frame holds a certain subset of the truth table entries of each of those TLUTs. However, all 64 entries of a single TLUT are spread over multiple frames. We have modified "XhwIcap_setClb_bits" and renamed it "XhwIcap_custom_setClb_bits". The reconfiguration process takes place in 3 steps:

1) Read frames: With the help of the frame address, multiple frames containing all the truth table entries of multiple TLUTs are read from the configuration memory. Since multiple TLUTs are placed in a single column, the truth table values of multiple TLUTs are read with a single read activity.
2) Modify frames: The current truth table entries of multiple TLUTs are replaced with the specialized truth table bits, which are generated by the SCG. Thus multiple TLUTs are specialized in a single attempt.
3) Write back the frames: With the help of the same frame address, the specialized truth table values are updated in multiple TLUTs by swapping the modified frames into the configuration memory of the FPGA. This updates all the truth table entries of the TLUTs that are placed in a single column.

Hence, for a single read frames activity, multiple TLUTs can be reconfigured. This is efficient because, in contrast to the conventional "XhwIcap_setClb_bits" driver, reading and writing back the frames for each individual TLUT is avoided. If the number of TLUTs in a parameterized design is higher than what fits in a single CLB column, then multiple CLB columns, each containing multiple TLUTs, can be used to achieve the gain in reconfiguration speed.
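The difference between the two read-modify-write schemes can be illustrated with a toy model in Python. It is a sketch under simplifying assumptions, not the driver code: `FakeConfigMemory`, `read_column` and `write_column` are hypothetical names, and the model treats all frames of one CLB column as a single unit that is read and written in one go.

```python
from collections import defaultdict


class FakeConfigMemory:
    """Toy stand-in for the configuration memory; counts column-level accesses."""

    def __init__(self):
        self.columns = defaultdict(dict)  # column -> {row: truth table bits}
        self.reads = 0
        self.writes = 0

    def read_column(self, col):
        self.reads += 1
        return dict(self.columns[col])

    def write_column(self, col, data):
        self.writes += 1
        self.columns[col] = dict(data)


def reconfigure_per_lut(mem, updates):
    """One read-modify-write per TLUT, as with the conventional driver."""
    for (col, row), bits in updates.items():
        frames = mem.read_column(col)   # 1) read frames
        frames[row] = bits              # 2) modify one TLUT
        mem.write_column(col, frames)   # 3) write back


def reconfigure_clustered(mem, updates):
    """One read-modify-write per CLB column, covering all TLUTs placed in it."""
    by_column = defaultdict(dict)
    for (col, row), bits in updates.items():
        by_column[col][row] = bits
    for col, rows in by_column.items():
        frames = mem.read_column(col)   # 1) read frames once
        frames.update(rows)             # 2) modify every TLUT of the column
        mem.write_column(col, frames)   # 3) write back once


# Hypothetical workload: 12 TLUTs clustered into 2 columns of 6 TLUTs each.
updates = {(col, row): 0xFFFF0000FFFF0000 ^ row for col in range(2) for row in range(6)}

conventional, custom = FakeConfigMemory(), FakeConfigMemory()
reconfigure_per_lut(conventional, updates)
reconfigure_clustered(custom, updates)

print(conventional.reads, conventional.writes)  # accesses grow with the TLUT count
print(custom.reads, custom.writes)              # accesses grow with the column count
```

Both schemes leave the same contents in the model memory; only the number of accesses differs, which is why clustering TLUTs into few columns matters when the interface throughput is fixed.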
The main concern with using placement constraints is the design performance. Strict placement constraints hinder the design performance, so there is a trade-off between reconfiguration speed and design performance that needs to be investigated.

VII. EXPERIMENTS AND RESULTS

In this section, we present our experiments and their results and compare them to the conventional DCS implementation. We used an 8-bit FIR filter with three different tap configurations as the parameterized design. Each filter tap contains two 4-bit multipliers, and each multiplier is mapped onto 12 TLUTs [2]. The filter configurations are listed in Table IV. Figure 3 shows the structure of the filter: all coefficients are parameterized inputs. For every infrequent change in a coefficient value, a specialized bitstream is generated and the filter taps containing the multiplications are reconfigured accordingly. The reconfiguration time is tabulated in Table V and the corresponding bar graph is depicted in Figure 4.

Fig. 3. k-taps, 8-bit FIR filter

TABLE III. DIMENSIONS FOR THE PLACEMENT CONSTRAINTS

                                   16-tap FIR   32-tap FIR   64-tap FIR
Number of TLUTs to be clustered    384          768          1536
Zynq-7000                          50 × 5       50 × 11      50 × 14
Virtex-5                           20 × 13      20 × 27      20 × 38

Note: Above dimensions are in the form of Length × Width of the CLB columns.

TABLE IV. FIR FILTER CONFIGURATIONS

Taps   Multipliers   TLUTs
16     32            384
32     64            768
64     128           1536

TABLE V. RECONFIGURATION TIME IN MILLISECONDS

16-tap FIR (384 TLUTs)   32-tap FIR (768 TLUTs)   64-tap FIR (1536 TLUTs)
37.7 / 4.1               75.4 / 8.31              150.7 / 14.8
45.3 / 4.2               90.6 / 8.7               181.2 / 15.6
90.1 / 4.4               180.1 / 9.0              360.1 / 16.4
120.1 / 18.7             241.1 / 37.6             438.8 / 72.9

Note: Above values are in the form of Without placement constraints / With placement constraints.

Figure 4 shows that the FIR implementation without placement constraints needs less reconfiguration time for the Virtex-5 than for the Zynq-7000.
The main reason is the larger frame size of the Zynq-7000 compared to the Virtex-5, and thus the higher number of words to be reconfigured for the Zynq-7000 [5]. A significant improvement in reconfiguration speed can be noticed after introducing the placement constraints and using the "XhwIcap_custom_setClb_bits" driver. On average, the reconfiguration time is reduced by a factor of 14 because far fewer frame read and write calls are needed than with the "XhwIcap_setClb_bits" driver.

We used the placement constraints so that the TLUTs are placed within the minimal number of CLB columns possible. The dimensions of the rectangular region of the placement constraints are tabulated in Table III. Since there are more CLBs in the CLB columns of the Zynq-7000 than in those of the Virtex-5, the Zynq-7000 can accommodate more TLUTs within a column. Therefore we notice in Table III that the number of columns (the width of the constrained region) used to constrain the TLUTs in the Zynq-7000 is lower than in the Virtex-5.

Fig. 4. Reconfiguration time comparison

Fig. 5. Clock frequency of the FIR filter with various tap configurations

TABLE VI. MAXIMUM CLOCK (MHZ) THE DESIGN CAN SUPPORT

16-tap FIR (384 TLUTs)   32-tap FIR (768 TLUTs)   64-tap FIR (1536 TLUTs)
106.3 / 102.3            106.3 / 101.6            106.3 / 100.9
105.2 / 101.4            105.2 / 101.2            105.2 / 100.5
108.6 / 102.8            108.6 / 102.2            108.6 / 101.2
106.7 / 101.9            106.7 / 101.2            106.7 / 100.3

Note: Above values are in the form of Without placement constraints / With placement constraints.

The improvement in reconfiguration speed comes at the cost of a reduction in design performance. Introducing the placement constraints gives the design a longer critical path than the conventional implementation. This decreases the maximum clock frequency the design can support, as observed in Table VI. Figure 5 shows the bar graph of the design performance of the various profiles. Figure 6 depicts the variation of clock frequency as a function of the number of TLUTs for a FIR filter implementation with a hard-core processor. Clearly, an increase in the number of TLUTs decreases the design performance. The overall average deterioration in design performance is about 6 MHz (or a deterioration of 6%). The same kind of response is observed in the other implementation.

Fig. 6. Design performance of FIR filter

Functional Density

The effect of introducing the placement constraints to improve the reconfiguration speed in DCS can best be explained using the functional density curve [9]. The functional density is defined as the number of computations (N) that can be performed per unit area (A) and unit time (T), as shown in equation (1).

$$F_d = \frac{N}{A\,T} \qquad (1)$$

In our experiments, the computations are all the operations in the FIR filter. The value of A depends on the resources of the FPGA used by the FIR filter (mainly TLUTs). The value of T is composed of the reconfiguration time, the execution time and the time to specialize.
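To make the role of the three time components in equation (1) concrete, the following Python sketch evaluates the functional density for two reconfiguration times. The function and all numeric inputs are hypothetical illustrations chosen for readability; they are not measurements from the experiments.

```python
def functional_density(computations, area, t_exec, t_reconf=0.0, t_spec=0.0):
    """F_d = N / (A * T), with T = execution + reconfiguration + specialization time."""
    total_time = t_exec + t_reconf + t_spec
    return computations / (area * total_time)


# Hypothetical numbers: same work, same area, same execution and specialization
# time; only the reconfiguration time differs between the two cases.
slow = functional_density(computations=1000, area=100, t_exec=10.0, t_reconf=14.0, t_spec=1.0)
fast = functional_density(computations=1000, area=100, t_exec=10.0, t_reconf=1.0, t_spec=1.0)

print(f"slow reconfiguration: {slow:.3f}")
print(f"fast reconfiguration: {fast:.3f}")
```

With everything else held equal, a shorter reconfiguration time reduces T and therefore raises the functional density; the effect is largest when parameters change often, because the reconfiguration time is then a large share of T.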
A higher functional density signifies a more efficient usage of implementation area. The functional density curve is plotted against the rate of change of the input parameters. We plot the functional density of the FIR filter in three different forms:

1) Generic: FIR filter implementation without DCS.
2) DCS without placement constraints: FIR filter implementation using DCS without placement constraints.
3) DCS with placement constraints: FIR filter implementation using DCS with placement constraints.

Figure 7 depicts the corresponding three curves. The x-axis represents the average time (in clock cycles) between two parameter value changes. The Generic implementation shows no variation in functional density since it uses a fixed number of resources. The functional density of the DCS with placement constraints rises well before that of the DCS without placement constraints. This shows that improving the reconfiguration speed allows the parameters to change faster, with the same gain in area, than in the DCS whose reconfiguration speed is slow. However, since the design performance is slightly reduced, the functional density beyond point B is somewhat lower than that of the DCS without placement constraints, which forms the main trade-off. Hence it makes sense to use the placement constraints in the range of parameter changes between point A and point B. If the parameters change too fast, then the generic implementation is the suitable choice.

Fig. 7. Functional Density

In our future research work, we will try to push the crossover point A of the functional density of the DCS towards the left, so that the curve rises earlier than the other curves. This results in a significantly higher functional density for shorter intervals (expressed in clock cycles) between parameter changes, and it can be achieved by further improving the reconfiguration speed.

VIII. CONCLUSION

To improve the reconfiguration speed in DCS implementations using parameterized reconfiguration, we constrained the TLUTs of the FIR filter within the minimal number of columns possible. We also modified the existing Xilinx HWICAP driver so that the frames are read and written only once to reconfigure multiple TLUT entries. We have shown that the reconfiguration speed improves drastically, but that this comes at the cost of a slight reduction in the performance of the design. Functional density curves were used to discuss the impact of the improved reconfiguration speed and the slight reduction in design performance. The experiments were done on the Virtex-5 and the Zynq-7000 platforms. In typical cases, if the FPGA resources are underutilized in the DCS implementation, it is suitable to use placement constraints to improve the reconfiguration speed. This allows the parameters of the design to change more frequently than in the conventional DCS implementation. Note that the design performance degrades slightly; designers should check whether this is acceptable within the given timing budget.

REFERENCES

[1] K. Bruneel, W. Heirman, and D. Stroobandt, "Dynamic data folding with parameterizable configurations," ACM Transactions on Design Automation of Electronic Systems, vol. 16, no. 4, 2011.
[2] K. Bruneel, F. Abouelella, and D. Stroobandt, "Automatically mapping applications to a self-reconfiguring platform," in Design, Automation & Test in Europe Conference & Exhibition, 2009. DATE '09., April 2009, pp. 964-969.
[3] K. Papadimitriou, A. Dollas, and S. Hauck, "Performance of partial reconfiguration in FPGA systems: A survey and a cost model," ACM Trans. Reconfigurable Technol. Syst., vol. 4, no. 4, pp. 36:1-36:24, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2068716.2068722
[4] K. Heyse, T. Davidson, E. Vansteenkiste, K. Bruneel, and D. Stroobandt, "Efficient implementation of virtual coarse grained reconfigurable arrays on FPGAs," in Proceedings of the 23rd International Conference on Field Programmable Logic and Applications. Piscataway, NJ, USA: IEEE, 2013, pp. 1-8.
[5] A. Kulkarni, K. Heyse, T. Davidson, and D. Stroobandt, "Performance Evaluation of Dynamic Circuit Specialization on Xilinx FPGAs," in Proceedings of the 11th FPGAworld Conference, ser. FPGAworld '14, 2014.
[6] "Virtex-5 FPGA Configuration User Guide (ug191)," http://www.xilinx.com/support/documentation/user_guides/ug191.pdf, accessed: 2014-05-14.
[7] "7 Series FPGAs Configuration User Guide (ug470)," http://www.xilinx.com/support/documentation/user_guides/ug470_7series_config.pdf, accessed: 2014-05-14.
[8] "Constraints Guide (cgd 10.1)," http://www.xilinx.com/itp/xilinx10/books/docs/cgd/cgd.pdf, accessed: 2014-05-16.
[9] A. DeHon, "Reconfigurable architectures for general-purpose computing," Cambridge, MA, USA, Tech. Rep., 1996.