Domain-Specific Modeling for Rapid System-Wide Energy Estimation of Reconfigurable Architectures


Seonil Choi (1), Ju-wook Jang (2), Sumit Mohanty (1), Viktor K. Prasanna (1)
(1) Dept. of Electrical Engg., Univ. of Southern California, Los Angeles, CA, U.S.A., {seonilch, mohanty, prasanna}@usc.edu
(2) Dept. of Electronic Engg., Sogang University, Seoul, Korea, jjang@sogang.ac.kr

This work is supported by the DARPA Power Aware Computing and Communication Program under contract F33615-C monitored by Wright Patterson Air Force Base and in part by the National Science Foundation under award No. Ju-wook Jang's work is supported by LG Yonam Foundation.

Abstract

Reconfigurable architectures such as FPGAs are flexible alternatives to DSPs or ASICs used in mobile devices for which energy is a key performance metric. Reconfigurable architectures offer several parameters such as operating frequency, precision, amount of memory, number of computation units, etc. These parameters define a large design space that must be explored to find energy efficient solutions. Efficient traversal of such a large design space requires high-level modeling to facilitate rapid estimation of system-wide energy. However, FPGAs do not exhibit a high-level structure like, for example, a RISC processor for which high-level as well as low-level energy models are available. To address this scenario, we propose a domain-specific modeling technique that exploits the knowledge of the algorithm and the target architecture family for a given problem to develop a high-level model. This model captures architecture and algorithm features, parameters affecting power performance, and power estimation functions based on these parameters. A system-wide energy function is derived based on the power functions and cycle specific power state of each building block of the architecture. This model can be used to understand the impact of various parameters on system-wide energy and can be a basis for the design of energy efficient algorithms. Our high-level model can be used to quickly obtain fairly accurate estimates of the system-wide energy of data paths configured using FPGAs. We demonstrate our modeling methodology by applying it to two domains.

Keywords: domain modeling, energy estimation, energy optimization

I. Introduction

Dramatic increases in the density and speed of FPGAs make them attractive for complex applications. The state-of-the-art Virtex-II Pro FPGAs from Xilinx deliver over 0.3 Tera MACs/sec. at an operating frequency of 300 MHz. Table I [16] shows the peak performance capabilities of the Virtex FPGAs compared with the fastest DSPs available last year. With such an available processing power, FPGAs are an attractive fabric for implementing complex and compute intensive applications such as signal processing kernels for mobile devices. Mobile devices operate in power constrained environments. Therefore, in addition to time performance, power performance is a key performance metric [11]. Studies show that optimizations at the algorithmic level have a much higher impact on total energy dissipation of a system than those at the RTL or gate level. It is reported that the impact (on energy optimization) ratio is 20 : 2.5 : 1 for algorithmic, register, and circuit levels [13]. In this context, there is a need for a high-level energy model which not only enables algorithmic level optimizations but also provides rapid and reasonably accurate energy estimates.

TABLE I
Performance Comparison of FPGA and DSP (Xilinx Inc.)
Function                                    Fastest DSP            Virtex-II
8 × 8 MAC                                   4.4 billion MACs       600 billion MACs
FIR, 256 taps, 16-bit data/coefficients     17 MSPS (1.1 GHz)      180 MSPS (180 MHz)
1024 point FFT (16 bit data)                7.7 µsec. (800 MHz)    .1 µsec. (140 MHz)

Several issues must be addressed in developing a high-level energy model for FPGAs. There are numerous ways to map an algorithm onto an FPGA as opposed to mapping onto a traditional processor such as a RISC processor or a DSP, for which the architecture and the components such as ALU, data path, memory, etc. are well defined.
For FPGAs, the basic element is the lookup table (LUT), which is too low-level an entity to be considered for high-level modeling. Besides, the architecture design depends heavily on the algorithm. Therefore, no single high-level model can capture the energy behavior of all feasible designs implemented on FPGAs. In addition, to elevate the level of abstraction, high-level models do not capture all the details of a system and consider only a small set of key parameters that affect energy. This lowers the accuracy of energy estimation.

In order to address the issues discussed above, we propose a domain-specific modeling technique (Figure 1). This technique facilitates high-level energy modeling for a specific domain. A domain corresponds to a family of architectures and algorithms that implements a given kernel. For example, a set of algorithms implementing matrix multiplication on a linear array is a domain. Detailed knowledge of the domain is exploited to identify the architecture parameters for the analysis of the energy dissipation of the resulting designs in the domain. By restricting our modeling to a specific domain, we reduce the number of architecture parameters and their ranges, thereby significantly reducing the design space. A limited number of architecture parameters also facilitates development of power functions that estimate the power dissipated by each component (a building block of a design). For a specific design, the component specific power functions, parameter values associated with the design, and the cycle specific power state of each component are combined to specify a system-wide energy function.

Fig. 1. Domain-Specific Modeling

Our approach is a top-down approach in contrast with other approaches that exploit low-level simulations and estimations for each component and accumulate these results to estimate overall energy dissipation. The advantage of our approach is the ability to rapidly evaluate the system-wide energy using energy functions for different designs within a domain. Our high-level energy model also facilitates algorithmic level energy optimization through identification of appropriate values for architecture parameters such as frequency, number of components, precision, etc., early in system design.

The organization of the paper is as follows. The next section describes the domain-specific modeling technique. Section III describes the methodology to estimate the power functions. A detailed description of high-level modeling and energy estimation based on the proposed model for two specific domains is presented in Section IV. We discuss some applications of the high-level model in Section V. Related efforts are discussed in Section VI. Section VII concludes the paper.

II. Domain-Specific Energy Modeling

The goal of our domain-specific modeling (Figure 2) is to represent energy dissipation of the designs specific to a domain in terms of parameters associated with this domain. For a given domain, only those parameters which can significantly affect system-wide energy dissipation and can be varied at the algorithmic level are chosen for the high-level energy model. As a result, our model a) facilitates algorithmic level optimization of energy performance, b) provides rapid and fairly accurate estimates of the energy performance, and c) provides energy distribution profiles for individual components to identify candidates for further optimization. First, we define the high-level energy model. Then we provide details of energy estimation using this model.

A. High-level Energy Model

Our high-level energy model consists of RModules, Interconnects, component specific parameters and power functions, component power state matrices, and a system-wide energy function.

Relocatable Module (RModule) is a high-level architecture abstraction of a computation or storage module. It is either a CLB-based logic or a "larger" module composed of multiple RModules and Interconnects. For example, a register can be a RModule if the number of registers varies in the design depending on algorithmic level choices. One important assumption about RModules is that energy performance of an instance of a RModule is independent of its location on the device. While this assumption can introduce small errors in energy estimation, it greatly simplifies the model.

Interconnect represents the connection resources used for data transfer between the RModules. The power consumed in a given Interconnect depends on its length, width, and switching activity. Interconnects can be of various types. For example, in Virtex-II FPGAs, there are several Interconnects such as long lines, hex lines, double lines, and single connections which differ in their lengths [16]. In the rest of the paper, we use component to refer to both RModules and Interconnects.
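One way to picture the RModule and Interconnect abstraction is as a small recursive data structure. The sketch below is only an illustration under assumptions: the class and field names (`RModule`, `Interconnect`, `children`, `leaves`) and the example PE composition are hypothetical and are not taken from the MILAN framework or from the model definition above.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Interconnect:
    """Connection resource; its power depends on length, width, and switching activity."""
    name: str
    length: int  # hypothetical unit, e.g. number of CLBs spanned
    width: int   # number of bits carried


@dataclass
class RModule:
    """Relocatable module: either a leaf block or a composition of components."""
    name: str
    children: List[Union["RModule", Interconnect]] = field(default_factory=list)

    def leaves(self) -> Iterator[Union["RModule", Interconnect]]:
        """Yield the leaf components of this (possibly composite) RModule."""
        if not self.children:
            yield self
            return
        for child in self.children:
            if isinstance(child, RModule):
                yield from child.leaves()
            else:
                yield child


# Hypothetical composite: a processing element built from a MAC, registers, and a bus.
pe = RModule("PE", children=[
    RModule("MAC"),
    RModule("register-file"),
    Interconnect("local-bus", length=1, width=8),
])
leaf_names = [component.name for component in pe.leaves()]
```

Because an RModule is assumed to dissipate the same energy wherever it is placed, the structure carries no placement coordinates; only the composition and the interconnect geometry are recorded.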
Component specific parameters depend on the characteristics of the component and its relationship to the algorithm. For example, operating frequency or precision of a multiplier RModule can be chosen as parameters if they are varied by the algorithm. Possible candidate parameters include operating frequency (f), input switching activity (sa), word precision (w), power states (ps), number of RModules of type i (n_i), etc.

Component specific power functions capture the effect of component specific parameters on the average power dissipation of the component. The power functions are obtained by implementing sample designs of individual components and simulating them using low-level simulators (see Section III).

Component Power State (CPS) matrices capture the power state for all the components in each cycle. For example, consider a design that contains k different types of components (C_1, ..., C_k) with n_i components of type i. If the design has the latency of T cycles, then k two dimensional matrices are constructed where the i-th matrix is of size T × n_i (Figure 3). An entry in a CPS matrix represents the power state of a component during a specific cycle and is determined by the algorithm.

System-wide energy function represents the energy dissipation of the designs belonging to a specific domain as a function of the parameters associated with the domain.

B. Energy Estimation

Power dissipation by a RModule or Interconnect in a particular state is captured as a power function of a set of parameters. These functions are typically constructed through curve fitting based on some sample low-level simulations (described in Section III). These functions may also be provided by the vendor. CPS matrices contain cycle specific power state information for each component. The entries in the CPS matrices are determined by the algorithm.

Fig. 2. Domain-Specific Modeling and System-wide Energy Estimation

Fig. 3. Component Power State Matrices

The domain-specific nature of our energy modeling is exploited when the designer identifies the level of architecture abstraction (RModules and Interconnects) appropriate to the domain and/or chooses the parameters to be used in the component specific power functions. This is a human-in-the-loop process and exploits the designer's expertise in the algorithm and the architecture family that constitute the domain. Well-known power models based on capacitance, voltage, and switching frequency can be more accurate and are generic enough to be applicable across many domains. However, they do not provide a designer a clear understanding of the impact of his/her algorithmic level design choices on the energy performance. Our modeling enables the designer to rapidly explore a large design space based on the understanding of the effect of the design choices on the overall energy performance.

To handle modeling complexity we follow a hierarchical approach. Each RModule can be recursively divided into RModules and Interconnects.
This hierarchical nature allows the designer to capture the details of the architecture in the design at various levels of abstraction to define parameters affecting performance.

Combining the CPS matrices and component specific power functions (see Section III) for individual components, the total energy of the complete system is obtained by summing the energy dissipation of individual components in each cycle. The system-wide energy function SE is obtained as:

$$SE = \frac{1}{f} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \sum_{t=1}^{T} C_i.p.ps, \quad \text{where } ps = CPS(i, t, j) \tag{1}$$

C_i.p.ps is the power dissipated in the j-th component (j = 1, ..., n_i) of type i during cycle t (t = 1, ..., T) and f is the operating frequency. CPS(i, t, j) is the power state of the j-th component of the i-th type during the t-th cycle.

Since the system-wide energy function is derived using component specific power functions, the energy distribution among various components (the fraction of the total energy consumed by each component) can be obtained. This information is used to identify candidate components to be considered by the designer for energy optimization. Details can be found in Section IV.

Due to the high-level nature of the model, we can rapidly estimate the system-wide energy. In the worst case, the complexity of energy estimation is O(T · Σ_{i=1}^{k} n_i) (see Equation 1), which corresponds to iterating over the elements of the CPS matrices and adding the energy dissipation by each component in each cycle. However, typically, there is a repeating pattern of state changes for a component (for example, due to loop structures within the algorithm). Also, different components of the same type dissipate the same amount of energy during each cycle.
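The worst-case evaluation of Equation 1 can be sketched directly as a loop over the CPS matrices. The code below is a minimal illustration, not the authors' implementation: the function name `system_wide_energy`, the dictionary layout, and every power value are hypothetical placeholders chosen only to make the example run.

```python
from typing import Callable, Dict, List

# CPS matrix for one component type: cps[t][j] is the power state of the
# j-th component of that type during cycle t (T rows, n_i columns).
CPSMatrix = List[List[str]]


def system_wide_energy(
    cps: Dict[str, CPSMatrix],
    power: Dict[str, Callable[[str], float]],
    f_hz: float,
) -> Dict[str, float]:
    """Evaluate Equation 1 per component type; energies are in joules."""
    energy = {}
    for ctype, matrix in cps.items():
        # Sum the power (watts) of every component in every cycle ...
        watt_cycles = sum(power[ctype](state) for row in matrix for state in row)
        # ... and scale by the cycle time 1/f.
        energy[ctype] = watt_cycles / f_hz
    return energy


# Placeholder power values in watts (assumptions, not measurements from the paper).
PE_POWER_W = {"on": 0.020, "off": 0.005}
IC_POWER_W = {"on": 0.001}

# Hypothetical design: 2 PEs and 1 interconnect over T = 4 cycles.
cps = {
    "PE": [["on", "off"], ["on", "on"], ["off", "on"], ["off", "off"]],
    "IC": [["on"], ["on"], ["on"], ["on"]],
}
power = {"PE": PE_POWER_W.__getitem__, "IC": IC_POWER_W.__getitem__}

per_type = system_wide_energy(cps, power, f_hz=166e6)
total = sum(per_type.values())
distribution = {ctype: e / total for ctype, e in per_type.items()}
```

Keeping the result per component type gives the energy distribution as a by-product, which is the quantity used to pick candidates for optimization.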

Therefore, based on these observations, the time to compute the energy is better than the worst case complexity of energy estimation stated above. Further, even if we compute the system-wide energy based on each cycle, we do not analyze the activities at the level of individual gates. Typically, there are only a few distinct components within a domain that affect energy dissipation of the designs in that domain. Indeed, for the illustrative examples considered in this paper, the time for energy estimation does not depend on the problem size.

III. Estimating Component Specific Power Functions

We use the MILAN framework [1] to derive the component specific power functions associated with the high-level energy model. MILAN is a Model based Integrated simulation framework for embedded system design and optimization that integrates various simulators and tools into a unified environment. In order to use the framework, the designer first models the target system using the modeling paradigms provided in MILAN. The designer provides the architecture and the parameters (with their possible ranges) that significantly affect the power dissipation of the component. Model interpreters (MI) in MILAN are used to drive the integrated tools and simulators. Model interpreters translate the information captured in the models into the format required by the low-level simulators and tools.

Let z(p_1, ..., p_n) be the component specific power function and p_1, ..., p_n be the parameters associated with the component. Figure 4 illustrates the process of deriving component specific power functions. This process involves estimation of power dissipation through low-level simulation of the component at different design points (a design point is a unique combination of parameter values). For low-level simulation, we have integrated simulators such as XPower [16] and ModelSim [8] into the MILAN framework.

Fig. 4. Power Function Estimation using MILAN
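For a component with a single parameter, turning sampled design points into a power function z(p) could look like the sketch below. It assumes a linear form z(p) = a·p + b fitted by ordinary least squares; the text only says "curve fitting", so the linear form, the function name `fit_linear_power_function`, and the sample values are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def fit_linear_power_function(
    samples: List[Tuple[float, float]],
) -> Callable[[float], float]:
    """Least-squares fit of z(p) = a*p + b to (parameter value, power) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two design points")
    mean_p = sum(p for p, _ in samples) / n
    mean_z = sum(z for _, z in samples) / n
    sxx = sum((p - mean_p) ** 2 for p, _ in samples)
    if sxx == 0:
        raise ValueError("design points must differ in the parameter value")
    sxz = sum((p - mean_p) * (z - mean_z) for p, z in samples)
    a = sxz / sxx
    b = mean_z - a * mean_p
    return lambda p: a * p + b


# Hypothetical low-level simulation results: (memory size, power in mW).
design_points = [(1, 12.0), (2, 14.0), (4, 18.0), (8, 26.0)]
z = fit_linear_power_function(design_points)
estimate_mw = z(6)  # power estimate at a design point that was not simulated
```

The fitted function is what the designer reuses afterwards, so the cost of the low-level simulations is paid once per component rather than once per design.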
The switching activity for the inputs to the component can be provided by the designer or specified as some default value, depending on the desired accuracy. One can choose appropriate values based on prior experience for the ease of analysis. However, if higher accuracy is needed, behavioral simulation of the complete application over expected input vectors to the whole system can be performed to obtain exact values for switching activity at the inputs of each component.

Low-level simulation is performed at each of the chosen design points to estimate the power dissipation. These power estimates are fed to the power function builder. A typical low-level simulation for power estimation of a sample design point proceeds as follows. The chosen sample VHDL design is synthesized using Synopsys FPGA Express on Xilinx ISE 4.1i. The place-and-route file (.ncd file) is obtained for the target FPGA device, Virtex-II XC2V1500. Mentor ModelSim 5.5e is used to simulate the module and generate simulation results (.vcd file). These two files are then provided to the Xilinx XPower tool to estimate the energy dissipation.

The power function builder is driven by an MI from the MILAN framework. For components with a single parameter, the power function can be obtained from curve-fitting on sample simulation results. In case of a larger number of parameters, surface fitting can be used. Currently, we only focus on building component specific power functions with at most two parameters. The resulting power functions are provided back to the designer.

The component specific power function of an interconnect depends on its length, operating frequency, and the switching activity. We use Equation 2 to estimate power dissipation in an interconnect. Φ.p denotes the power dissipation of a cluster of k RModules connected through the candidate interconnect and M.p_i represents the power dissipation of the i-th RModule. The power dissipated by the cluster is obtained by low-level simulation.
$$IC.p = \Phi.p - \sum_{i=1}^{k} M.p_i \tag{2}$$

While the initial effort to build the component specific power functions might be costly compared with ad hoc approaches, the benefits are noticeable when the same components are re-used in different designs within and across domains.

IV. Illustrative Examples of Domain-Specific Energy Modeling

To illustrate our domain-specific energy modeling, we apply the techniques discussed in the previous sections to define a high-level model for two domains implementing matrix multiplication, a frequently used kernel operation in a wide variety of signal processing algorithms. For each domain we identify the components and the component specific parameters, identify the power functions for each component, and finally derive a system-wide energy function. Two architecture families, a uniprocessor architecture and a linear array architecture, are chosen to demonstrate our approach.


A. Example 1: Uniprocessor Architecture

We define a uniprocessor (PE) implementing the usual block matrix multiplication as the first domain. The PE has one MAC (multiplier and accumulator), a cache of size c, and I/O ports (see Figure 5). For the sake of illustration, we assume that all operations have a single cycle latency and the data matrices are stored in an external memory. For n × n matrix multiplication, the computational complexity of the algorithm is O(n³). Block matrix multiplication (BMM) is performed with a block size of √c × √c. The I/O complexity (amount of traffic between the PE and external memory) is O(n³/√c). This complexity corresponds to the minimum achievable I/O complexity for matrix multiplication [5]. It can be observed that a large cache decreases the I/O complexity and as a result improves the energy performance for I/O.

Fig. 5. Uniprocessor Architecture

A.1 Identifying Components and Parameters

We identified the MAC and the cache as RModules and the I/O as an Interconnect. The size of the cache can be controlled by the algorithm. The RModules have w-bit precision and the cache has one more parameter, c, the cache size. For the sake of illustration, the cache size is the only variable parameter and w = 8 is used. The component specific power functions for the MAC (M.p), cache (R.p), and I/O (IO.p) are obtained through low-level simulations. The MAC is implemented using a CLB-based multiplier and the cache is realized using register modules provided in the Virtex-II library. M.p and IO.p are constants. The power function for the cache is a function of c: R.p(c) = [coefficients missing] c (mW).

A.2 System-wide Energy Function

We consider the energy dissipated by the PE on an FPGA. We do not consider the energy dissipated by the external memory. The system-wide energy function (SE) is:

$$SE(c) = \frac{1}{f} \left( n^3 \cdot M.p + n^3 \cdot R.p(c) + \frac{n^3}{\sqrt{c}} \cdot IO.p \right)$$

Note that as c varies, we obtain a family of architectures, each implementing matrix multiplication using BMM with a different block size. The operating frequency f is 166 MHz. Figure 6 shows how different values of c affect the system-wide energy and the energy distribution between the components of the complete system for a 6 × 6 matrix multiplication. As c increases, the energy for performing I/O decreases but the energy dissipated in the cache increases. Initially, the system-wide energy decreases as c increases, but later, for large values of c, the system-wide energy goes up.

Fig. 6. System-wide Energy for a PE and Energy Distribution as a Function of Cache Size

B. Example 2: Linear Array Architecture

For the second domain, we considered a linear array of processing elements (PEs) with constant I/O bandwidth (independent of the problem size) as the architecture family to perform matrix multiplication. The following discussion is based on optimal algorithms for matrix multiplication on a linear array family of architectures [12].

B.1 Defining Components and Parameters

The structure of the linear array is shown in Figure 7. It consists of two components: processing elements (PEs) and interconnects connecting adjacent PEs. For the purpose of high-level modeling, we identified the PE as an RModule and the bus between two PEs as an Interconnect.

Fig. 7. Linear Array of PEs

In order to identify the component specific parameters, we analyze the structure of each PE. The PE (see Figure 8) has a MAC of precision w and a memory of size s. The memory is realized by using register modules provided in the FPGA library. The PE has two power states, ON and OFF. During the ON state the multiplier is on and thus the PE dissipates more energy than in the OFF state, when the multiplier is off. The power state of the multiplier is controlled by gated clocking. The PE also includes 6 registers and 3 multiplexers of w bits. The key parameters affecting energy are precision (w), amount of memory within a PE (s), and power states (ps).

A matrix multiplication algorithm for linear array architectures is proposed in [12]. There are several constraints imposed by the algorithm which are exploited to identify component specific parameters and their ranges.
Also, to achieve the minimum latency, the minimum number of PEs needed for an n × n matrix multiplication is n [12]. Therefore, the range of s is given by 1 ≤ s ≤ n. To achieve the minimal I/O complexity O(n²), the total amount of memory across all PEs should be n². Therefore, the total number of PEs (pe) is n⌈n/s⌉. The latency (T) of this design using n⌈n/s⌉ PEs and s memory per PE is [12]:

$$T = \frac{1}{f} \left( n^2 + 2n \lceil n/s \rceil - \lceil n/s \rceil + 1 \right) \tag{3}$$

Fig. 8. The details of the PE

We consider problems of size 1 ≤ n ≤ 16. For the sake of illustration, we fixed w at 8. The parameters and their ranges are shown in Table II. Note that the parameters of interest are pe, ps, and s. The system-wide energy function is specified using these three parameters.

TABLE II
Model Parameters
Parameter    Value or range
s            1 ≤ s ≤ n
pe           1 ≤ pe ≤ n⌈n/s⌉
w            8
ps           on, off

B.2 Estimating Power Functions

To estimate the power function of the PE, we applied the technique described in Section III. The amount of memory per PE (s) was varied. In the sample simulations, the input data to the component in estimating the power dissipation was randomly generated and its switching activity (sa) was found to be 25%. We chose the operating frequency as 166 MHz since we compare our designs with the matrix multiplication provided in the Xilinx library, which operates at the same frequency. The comparison can be found in Section V. The power function for the PE is given by:

$$PE.p.ps = \begin{cases} [\text{expression missing}] \ \text{(mW)}, & ps = \text{on} \\ [\text{expression missing}] \ \text{(mW)}, & ps = \text{off} \end{cases} \tag{4}$$

The interconnect power function is constant. It is estimated using Equation 2 since the interconnect between the PEs is localized in the design and is regular. We implemented two PEs and the interconnect, and measured the power dissipation while both PEs are in the ON state. The power dissipated in the interconnect is IC.p = [value missing] mW.

Thus, our high-level model for matrix multiplication on linear array architectures consists of PEs, Interconnects, component specific parameters and their ranges as shown in Table II, the power function for the PE (Equation 4), and the power function for the interconnect.

B.3 Specifying System-wide Energy Function

To derive the system-wide energy function (SE) (shown in Equation 5), we use Equation 1. In this domain, we can separate the equation into three parts: the total energy dissipated in the PEs when the multiplier is on (E_PE_on), the total energy dissipated in the PEs when the multiplier is off (E_PE_off), and the total energy dissipated in the interconnects (E_IC). Therefore,

$$SE(n, s) = E_{PE\_on} + E_{PE\_off} + E_{IC} \tag{5}$$

Note that the power dissipation of all the PEs (at a given state of the multiplier) is identical. In each PE, the multiplier is on for a duration of T/⌈n/s⌉ [12]. E_PE_on is shown in Equation 6. In each PE, the multiplier is off for a duration of T(1 − 1/⌈n/s⌉). E_PE_off is shown in Equation 7. Equation 8 shows the total energy of the interconnects. PE.p.ps refers to the power dissipation of a PE when its multiplier is in state ps (see Equation 4).

$$E_{PE\_on}(n, s) = \frac{T}{\lceil n/s \rceil} \cdot n \lceil n/s \rceil \cdot PE.p.ps\big|_{ps=\text{on}} = n \, T \cdot PE.p.ps\big|_{ps=\text{on}} \tag{6}$$

$$E_{PE\_off}(n, s) = T \left( 1 - \frac{1}{\lceil n/s \rceil} \right) \cdot n \lceil n/s \rceil \cdot PE.p.ps\big|_{ps=\text{off}} = T \left( n \lceil n/s \rceil - n \right) \cdot PE.p.ps\big|_{ps=\text{off}} \tag{7}$$

$$E_{IC}(n, s) = T \left( n \lceil n/s \rceil - 1 \right) \cdot IC.p \tag{8}$$

In order to verify the accuracy of our high-level energy modeling, we performed the following experiment. We considered several sample designs with various problem sizes (n = 3, 6, 8, 9, 12, 16). We set s = n. We used the system-wide energy function to estimate the system-wide energy dissipation of these designs. We compared this result with a complete VHDL simulation using the Xilinx tools. From Equation 3 the latency is n² + 2n clock cycles. Further, the PE, when s = n, is always in the ON state. Therefore, the system-wide energy function simplifies to SE(n, n) = E_PE_on + E_IC. The system-wide energy for n × n matrix multiplication with w = 8 is given in the column labeled "Estimated" in Table III. We also implemented each of the sample designs in VHDL and simulated them using Xilinx ISE 4.1i and ModelSim 5.5e. The energy estimation is performed using Xilinx XPower.
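The energy function of this domain is cheap to evaluate because it only involves the closed forms of Equations 3 and 5 to 8. The sketch below evaluates them under one reading of those equations: pe = n⌈n/s⌉ PEs, each with its multiplier on for T/⌈n/s⌉. The function names and all three power values are hypothetical placeholders, since the fitted power functions are not reproduced here.

```python
from typing import Dict


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def latency_cycles(n: int, s: int) -> int:
    """Cycle count of Equation 3 (divide by f to obtain seconds)."""
    q = ceil_div(n, s)
    return n * n + 2 * n * q - q + 1


def linear_array_energy(
    n: int, s: int, f_hz: float,
    pe_on_w: float, pe_off_w: float, ic_w: float,
) -> Dict[str, float]:
    """Evaluate Equations 5-8; power inputs in watts, energies in joules."""
    if not 1 <= s <= n:
        raise ValueError("s must satisfy 1 <= s <= n")
    q = ceil_div(n, s)                      # ceil(n/s)
    pe = n * q                              # total number of PEs
    t = latency_cycles(n, s) / f_hz         # Equation 3, in seconds
    e_on = n * t * pe_on_w                  # Equation 6
    e_off = t * (pe - n) * pe_off_w         # Equation 7
    e_ic = t * (pe - 1) * ic_w              # Equation 8
    return {"E_PE_on": e_on, "E_PE_off": e_off, "E_IC": e_ic,
            "SE": e_on + e_off + e_ic}      # Equation 5


# Placeholder powers (assumptions): 20 mW on, 5 mW off, 1 mW per interconnect.
result = linear_array_energy(6, 6, 166e6, pe_on_w=0.020, pe_off_w=0.005, ic_w=0.001)
```

For s = n the off-state term vanishes and the latency reduces to n² + 2n cycles, which matches the simplification SE(n, n) = E_PE_on + E_IC used in the accuracy experiment.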
The input switching activity to these designs is the same (25%) as used during evaluation of component specific power functions. The measured energy values are shown in the column labeled "Measured" in Table III. Table III also shows the error percentage of our high-level estimation method when compared with energy estimation values obtained through low-level simulation. The error on the average is within 6.4% and is 7.4% in the worst case. The time needed to perform high-level estimation (assuming the power functions are pre-computed) is on the order of minutes on a Pentium III Xeon running at 700 MHz, whereas the time needed for low-level simulation and power estimation was 3 hours per design on the same machine.

TABLE III
Illustration of Accuracy of Our Modeling in Example 2
Columns: Problem size (n); Energy (nJ), Estimated; Energy (nJ), Measured; Error (%)
[table values missing]

V. Design Methodology Using the Model and Energy Optimization

Our domain-specific modeling provides an energy estimation methodology to facilitate design decisions in the early phase of the design cycle. The system-wide energy function captures the impact of the architecture parameters on the system-wide energy at the algorithmic level. Using this, the designer identifies trade-offs among area, latency, and energy. The designer explores a domain and identifies an appropriate design based on selection criteria. The detailed design methodology can be found in [10].

To demonstrate the energy efficiency of our data path designs, we compare them with the designs provided by the Xilinx library [16]. Xilinx provides a module for 3 × 3 matrix multiplication. To perform C = A × B, the module uses one set of 3 registers for the A matrix to store one row and another set of 9 registers to store the complete B matrix. A single multiplier is used. A row of data from A and the complete B are brought into the module to compute the first row of C. This process is repeated for the other two rows of A to generate the complete C. For larger problem sizes, block matrix multiplication with block size 3 × 3 is used. The operating frequency of the module is 166 MHz. The energy dissipation was measured using XPower.

Table IV shows the area, latency, and the system-wide energy of the designs. For our design, we estimate the energy dissipation using the system-wide energy function (Equation 5) with the same operating frequency as the Xilinx module, f = 166 MHz. The other parameters are w = 8, n = 6, and s = 6. Our design uses more area compared with the Xilinx design since we use 6 multipliers. However, the larger I/O requirement of the Xilinx library results in higher energy dissipation. Our design dissipates about 30% less energy.

TABLE IV
Comparison for 6 × 6 matrix multiplication
Metric                Xilinx design [16]    Design based on our methodology
Area (# of slices)    179                   1,074
Latency (µsec.)       [value missing]       [value missing]
Energy (nJ)           [value missing]       [value missing]

Also, using our model and design methodology, the designer can further optimize a chosen design by improving the performance of the components that consume the most energy. Initially, the system-wide energy function is analyzed to identify the distribution of energy dissipation among various types of components (see Figure 6). Components with a higher percentage of energy dissipation are chosen as possible candidates for design modification. The detailed optimization technique can be found in [10].

VI. Related Work

Several research efforts have focused on rapid energy estimation of a design on FPGAs. Shang and Jha [14] proposed a black-box approach to estimate energy based on input and output signal statistics. This approach is suitable for estimation of the average power dissipation of a RT-level component to be embedded into a system. However, it is not applicable for algorithm level power analysis. On the other hand, our model captures various architecture parameters that can be manipulated at the algorithmic level for energy optimization.

XPower, the power estimation tool provided by Xilinx [16], estimates the energy dissipation of FPGAs based on low-level simulation. The input to the tool is LUT-level place-and-route information along with details of switching activity for LUT-level components.
While its accuracy is comparable with the actual execution of the design, it does not support energy estimation early in the design phase when the complete system description in some HDL is not available.

Stammermann et al. presented ORINOCO, a software tool for power dissipation analysis and optimization at the algorithmic level from C/C++ and VHDL descriptions [15]. However, C/C++ or VHDL descriptions do not capture parameters affecting system-wide energy, and a designer requires complete knowledge of the final system before the code can be generated in these languages. Both ORINOCO and XPower are essentially estimation tools and can be used in our methodology to perform the low-level sample simulations necessary for specifying our component specific power functions. We have compared our estimation accuracy against XPower.

Chou et al. proposed a hardware/software co-synthesis CAD tool (IPChinook) [4] primarily targeted towards ASIC design from a single high-level specification (for both hardware and software) and hardware/software co-simulation. This tool does not capture the effect of parameter variation on energy dissipation for individual components, which is essential for algorithmic level analysis.

In [2] a regression tree [3] is used to improve the power estimation of a RT-level component. Starting with candidate variables (I/O bits), the variable v_i which has the maximum impact on the power dissipation is identified. Then the sample power dissipation results of power measurement are split into two subsets based on this variable. The splitting is recursively performed to build a regression tree which ranks variables by their significance with respect to the power. It is a bottom-up approach that starts from a low-level implementation and ends in identifying significant variables affecting the power. In contrast, our model starts with candidate parameters chosen from a high-level view of the architecture and algorithm. The effect of the parameters on the system-wide energy is captured in the component specific power functions. The component specific power functions are used to obtain parameter values for optimal power performance by traversing the design space at an algorithmic level.

VII. Conclusion

This paper introduced domain-specific energy modeling for rapid system-level energy estimation and algorithmic level optimization for reconfigurable architectures. The modeling captures the details of the architecture and the algorithm to identify parameters affecting the power performance, hence facilitating derivation of a system-wide energy function. Matrix multiplication on a uniprocessor and on a linear array architecture were chosen as two domains to illustrate construction of a high-level model, derivation of power functions for individual components, and their combination into a system-wide energy function. For one specific domain, the system-wide energy function was tested on several sample designs for its accuracy against time-consuming low-level energy estimation. The reference low-level energy evaluations were obtained on a Virtex-II chip through synthesis using Xilinx ISE 4.1i, simulation by ModelSim 5.5e, and power estimation by Xilinx XPower.
The error in the system-wide energy estimation using the high-level model was within 6.4% and was 7.4% in the worst case. Using our modeling, the time needed to evaluate the system-wide energy function is in the order of minutes on a Pentium III Xeon running at 700 MHz, while the low-level energy estimation takes more than 3 hours (on the average) for each sample design.

Our modeling methodology provides a virtual malleable data path. It is virtual because at the time of performance estimation we do not implement the design on the target FPGA. It is malleable because there are several parameters that can be varied to understand the trade-offs between different performance metrics such as energy, area, and latency. Such a characteristic makes our proposed model suitable for use during large application synthesis, where several different kernels are integrated to implement a system such as MPEG encoding or Software Defined Radio. In such scenarios, once we have models for various kernel implementations, we can exploit the multi-level design space exploration (DSE) technique provided in the MILAN framework [1]. It evaluates the system-level design space against user specified constraints [9] to identify an appropriate design.

Currently, analytical techniques are being developed for the various options for interconnection resources available in state-of-the-art FPGA devices to predict their energy behavior. Families of architectures that contain interconnection networks such as hypercubes and binary trees are being considered for future analysis.

References

[1] A. Agrawal, A. Bakshi, J. Davis, B. Eames, A. Ledeczi, S. Mohanty, V. Mathur, S. Neema, G. Nordstrom, V. Prasanna, C. Raghavendra, and M. Singh, "MILAN: A Model Based Integrated Simulation for Design of Embedded Systems," Languages, Compilers, and Tools for Embedded Systems.
[2] A. Bogliolo, L. Benini, and G. Micheli, "Regression-based RTL Power Modeling," ACM Transactions on Design Automation of Electronic Systems, Vol. 5, No. 3.
[3] B. L. Bowerman and R. T. O'Connell, "Linear Statistical Models: An Applied Approach," 2nd Edition, Brooks/Cole Pub Co.
[4] P. Chou, R. Ortega, K. Hines, K. Partridge, and G. Borriello, "IPCHINOOK: An Integrated IP-based Design Framework for Distributed Embedded Systems," Design Automation Conference.
[5] J.-W. Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebbling Game," ACM Symposium on Theory of Computing (STOC).
[6] J. Jang, S. Choi, and V. K. Prasanna, "Energy-Efficient Matrix Multiplication on FPGAs," submitted to International Conference on Field Programmable Logic and Applications.
[7] V. Mathur and V. K. Prasanna, "A Hierarchical Simulation Framework for Application Development on System-on-Chip Architectures," IEEE Intl. ASIC/SOC Conference.
[8] ModelSim, Model Technologies.
[9] S. Mohanty, V. K. Prasanna, S. Neema, and J. Davis, "Rapid Design Space Exploration of Heterogeneous Embedded Systems using Symbolic Search and Multi-Granular Simulation," to appear in Languages, Compilers, and Tools for Embedded Systems.
[10] S. Mohanty, S. Choi, J. Jang, and V. K. Prasanna, "A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs," to appear in IEEE Intl. Conference on Application-specific Systems, Architectures and Processors.
[11] T. Mudge, "Power: A First-Class Architectural Design Constraint," IEEE Computer, Vol. 34, April.
[12] V. K. Prasanna Kumar and Y. Tsai, "On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication," IEEE Transactions on Computers, Vol. 40, No. 6.
[13] A. Raghunathan, N. K. Jha, and S. Dey, "High-level Power Analysis and Optimization," Kluwer Academic Publishers, 1998.
[14] L. Shang and N. K. Jha, "High-Level Power Modeling of CPLDs and FPGAs," International Conference on Computer Design.
[15] A. Stammermann, L. Kruse, W. Nebel, and A. Pratsch, "System Level Optimization and Design Space Exploration for Low Power," Proc. of ISSS.
[16] Xilinx Application Note: Virtex-II Series and Xilinx ISE 4.1i Design Environment.
