Synthesis of Signal Processing Structured Datapaths for FPGAs Supporting RAMs and Busses


Baher Haroun and Behzad Sajjadi
Department of Electrical and Computer Engineering, Concordia University
1455 de Maisonneuve Blvd. W., Montreal, Quebec H3G 1M8

ABSTRACT

A novel approach is presented for transforming a given scheduled and bound signal processing algorithm for a multiplexer based datapath to a BUS/RAM based FPGA datapath. A datapath model is introduced that allows maximum flexibility in scheduling bus transfers independent of operation scheduling. A novel integer linear programming (ILP) formulation is presented that optimally selects and assigns data-transfers to busses while scheduling the bus transfers to minimize a linear combination of the number of busses, bus loading in terms of tristate drivers and fanouts, registers and register file storage (RAM) locations. We demonstrate that our resulting optimal datapaths compare favorably to others for signal processing synthesis benchmarks such as single and multiple elliptic filters and the fast discrete-cosine-transform (FDCT).

1 Introduction

FPGAs are being used in signal processing applications as re-configurable computation accelerators. Two of the families of FPGAs that are specifically suitable for such applications are the Xilinx XC4000 series and the ORCA ATT2C series, which support fast arithmetic functions as well as efficient storage of variables in SRAMs [2]. Moreover, these FPGAs have global wires that can be driven by tri-state drivers to implement either wide muxes or chip wide busses. Other structural characteristics that influence the structure of a datapath are:

(a) The limited interconnection resources for FPGAs. This makes interconnection a primary resource for optimization, especially for large bit-width datapaths. Otherwise, larger (more costly) FPGAs may be required. Also for FPGAs, the interconnection delay is large due to the configurable series switch resistance and capacitive effects. Hence, limiting interconnection loading is critical in improving the cycle time of the architecture.
(b) The abundance of registers and of storage RAM bits (in relation to the number of logic gates that implement the computational functional units (FUs)). Two hardware mechanisms can be used to store data in an XC4000 FPGA: 1) using registers at the output of each combinational logic block (CLB) (two look-up tables (LUTs) and two registers per CLB), and 2) using RAMs by re-configuring the LUTs. One CLB can store up to 32 bits of data. Therefore, a CLB can potentially store 16 times as much if it is configured as a RAM instead of as registers. The use of SRAMs instead of registers in variable storage can result in significant savings in storage area.

(c) CLBs are composed of programmable logic (LUTs) followed by a register. Hence, datapaths which utilize this property of having registers in front of logic stand to maximize the usage of the FPGA resources.

There are a number of synthesis approaches targeting a random topology multiplexer based datapath, i.e., a point to point connectivity model using muxes and registers. Examples are Cathedral-III [8], which uses heuristics, and optimal (ILP) approaches, e.g. [1], [5], [9], [3] and [6]. These approaches can be suitable for multiplexer based FPGA architectures but do not explicitly address the above three concerns for FPGAs supporting busses and RAMs. Others targeting bus based architectures (e.g. Cathedral-II [7], HYPER [6], SPAID-X [3], STAR [4], and [5]) use an architecture model that has bus/RAM access time in series with functional unit delays in one cycle. For such architectures cycle times are long, and they may be more suited for ASIC like implementations, where special techniques can be used to reduce the effect of bus/RAM delays [2], but not in the case of FPGAs.

2 Synthesis System Overview

The following aspects of synthesizing FPGA datapaths are the contributions of this paper:

1- To present an architectural model that can be used for bus/RAM based FPGA data-paths and that allows for efficient interconnection and data storage with an architecture commensurate with the FPGA logic blocks.
2- To show that the complexity of architectural search for FPGA bus/RAM based architectures can be effectively handled as follows: a first step for scheduling and operation binding, as done with multiplexer based architectures, followed by a second step for bus transfer scheduling and assignment, and then a third and last step for performing interconnection and storage binding. This is performed provided that a structural complexity [1] is minimized in the first step, register and RAM costs together with bus loading are minimized in the second step, and physical floorplanning, routing and delay modeling (system clock cycle duration) are optimized in the last step [10]. An overview of the synthesis system is shown in Figure 1.

[Figure 1 (flow diagram).
Inputs: CDFG with scheduling and binding constraints; FPGA floorplan seed and routing constraints.
STEP-1: ILP; paper [1]. Scheduling and binding with structuring, where structuring = minimize [interconnection + mux + storage + #var (life-time < 2)].
STEP-2: ILP; this paper. Bus transfer creation and allocation, bus transfer scheduling, storage minimization, bus loading minimization.
STEP-3: Stochastic; [10]. Data storage assignment, interconnect minimization, bus loading minimization, actual floorplanning and routing (constrained), clock cycle minimization.
Outputs: data path; hierarchical controller.]

Figure 1: Our tool, OSTA, uses an ILP approach for scheduling and bus assignment followed by a stochastic storage and interconnection binding. Paper [1] describes the scheduling and binding for structured multiplexer based architectures for FPGAs. Structuring increases the existence of data transfers with life time > 2 that are specifically suitable to be assigned to busses, hence reducing the number of local interconnections by merging them with busses in step-2 of the synthesis. In [10], the interconnection and storage binding is done in conjunction with floor planning to ensure that the true physical constraints of the implementation technology are taken into account.

3- To present an ILP formulation that concurrently minimizes the number of busses, bus loading and data storage while optimally assigning bus transfers to busses, scheduling these transfers and minimally allocating busses. Our proposed architectural model (see above and Section 3) allows this flexible bus transfer scheduling.

4- To show that ILP handles small/medium size applications and is a viable approach to architecture synthesis of bus/RAM based architectures. On the other hand, for large problems, heuristic/stochastic approaches may be suitable. In this case, this paper provides proof that our flexible bus transfer datapath and synthesis model, producing structured architectures, is an effective approach. Hence our introduced model, the constraints and the optimization criteria can be suitable for other search (e.g. stochastic) techniques.
Section 3 presents the architectural model and the synthesis data transfer model used in our optimization of bus transfers. Section 4 presents the ILP formulation and Section 5 shows the effectiveness of our bus/RAM based datapath synthesis approach.

3 Modeling and Optimization Criteria

3.1 Underlying Architectural Model Supporting RAMs and Busses

The model of the architecture that is used in our synthesis technique follows the general structure of Figure 2. For an example of a full data path see Figure 6. The different types of modules used are: register modules, register-file modules (implemented as a RAM), functional unit (FU) modules and FU multiplexer sub-modules. If the number of mux inputs is small, the FU multiplexer module can be part of the function of the CLBs implementing the FU module. On the other hand, when the number of mux inputs is large, extra mux sub-modules are implemented using extra CLBs.

A two phase clock (Φ1 and Φ2) can be used to define the data transfer between the register-file (RAM) and the registers of the data path. For data transfers between the registers of the data path only one clock phase (Φ1) is used. The data moves from a data path register to the FU inputs through a FU output, then the data moves from the FU output to one or more of the data path registers. For the transfers between the registers and the register file, Φ1 and Φ2 are used. The data path registers are considered as master registers and the slaves are the storage locations inside the register-file (RAM). This clocking is explained in Figure 3 (b). Note that the RAM is written in the beginning of the cycle (Φ1) and read at the end (Φ2 (or Φ )). Hence, the cycle time is determined by either the critical path between any of the data path registers or the read and write time plus the bus delay of a register file. Input/output ports to the system can be considered as registers (R).

[Figure 2 (block diagram). Labels: FUi, FU output interconnect, FU mux sub-module, one of the pipelined busses, register mux, FU mux, tristate bus driver.]

Figure 2: A typical connection of 2 FUs with busses, local connections with registers and a chaining connection.
All muxes and tri-state drivers may not necessarily be present in every architecture. The clocking scheme is shown in Figure 3.b.

[Figure 2 labels, continued: register file (RAM) module, chaining mux (optional), FU module FUj, register interconnect, register module, clock phases Φ1 and Φ2.]

3.2 RAM and Bus Support

Our architecture model allows both individual registers and grouped storage in register files which are implemented as RAMs, hence reducing storage area. Support of RAMs is essential, especially in the case of handling signal processing algorithms requiring large storage (e.g. multi-channel filters). Figure 3 (a) shows the conventional register file model, where the register file is accessed by a non-pipelined bus (e.g. in [3][2][3] and [5]). Figure 3 (b) shows our proposed clocking scheme, where a register file based on a RAM implementation is accessed by a pipelined bus. In the case of non-pipelined bus access (the conventional case), the read and write times of the register file array are part of the system cycle. Chaining the read and write times in one cycle increases the cycle time, which reduces the clock rate and hence may result in a large performance reduction. By pipelining the bus, the read and write are scheduled in separate cycles preceding and following the operation respectively.

The decision to store the data in a large RAM can then only be made if the life time of a variable is at least two cycles. By allowing any variable to exist in more than one storage location, that is in a register and in a register file location, and by supporting flexible bus transfers, our approach has removed the bus delay from the critical path. The busses and their associated register files are, hence, treated as functional units and their influence on the clock cycle duration is independent of other functional unit delays. Therefore, the clock cycle of the datapath is controlled by either a critical path through a FU or through a register file and a bus, but not by adding both, as is the case in conventional bus based datapaths. Hence, our architecture model has faster clock rates than conventional approaches and also uses fewer busses and resources, as will be shown.

3.3 The Register and Bus Binding Model

Given the above architectural model, there are a number of different cases for which an edge in the Data Flow Graph (a data transfer) can be bound. These cases are encompassed by the two generic transfers shown in Figure 4. Notice that the bus in Figure 3 (b) (see also Figure 6.b) communicates only with registers. In Figure 4, a bubble indicates the clock cycle at which the data is transferred from a register (Rn) to the bus and the square indicates the cycle where the data is read from the register file through a bus to a register (Rp). These cycles are determined in step-2 of Figure 1, together with bus allocation. In a sense, a scheduling of bus transfers is performed at step-2. This flexibility of scheduling bus transfers as shown in Figure 4 is very different from all previous architecture models in the literature using busses. All other models have to transfer data immediately at the end of an operation, or upon its start. Our model allows relaxing this constraint, which is one of the contributions of this paper. This relaxation in using busses can result in a significant reduction in the number of busses required (see Section 5).
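The clock-cycle argument of Section 3.2 can be illustrated with a small numeric sketch. The delay values and function names below are hypothetical and are not taken from the paper; the two formulas only restate the text. In the conventional model the register file read, write and bus delays add to the FU delay, whereas in the pipelined-bus model the cycle is the larger of the FU path and the register file path.

```python
def conventional_cycle(fu_delay, ram_read, ram_write, bus_delay):
    """Non-pipelined bus: RAM access and bus delay are chained with the FU delay."""
    return fu_delay + ram_read + ram_write + bus_delay


def pipelined_cycle(fu_delay, ram_read, ram_write, bus_delay):
    """Pipelined bus: the cycle is set by the slower of the two independent paths."""
    return max(fu_delay, ram_read + ram_write + bus_delay)


# Hypothetical delays in nanoseconds, chosen only for illustration.
delays = dict(fu_delay=30, ram_read=8, ram_write=8, bus_delay=6)
print(conventional_cycle(**delays))  # 52
print(pipelined_cycle(**delays))     # 30
```

With these assumed numbers the FU path dominates, so the register file and bus add nothing to the cycle time in the pipelined model.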
4 STEP-2: ILP Search for an Optimal Bus Based Architecture

4.1 The ILP Bus Assignment and Scheduling

We assume that operation scheduling and binding have been performed (step-1, Figure 1) while minimizing interconnections and maximizing the number of transfers with life-time > 1 using the structural complexity measure [1]. The objective of step-2 is to minimize:

(a) the number of parallel bus transfers, which reduces the number of busses allocated,
(b) the maximum overlap in registers, which reduces the number of registers allocated,
(c) the maximum overlap in life-time of variables assigned to busses, which reduces register file storage locations,
(d) the number of registers having tri-state access to the busses, hence minimizing tri-state output loading on each bus, and
(e) the number of destination registers that a bus is connected to, which also minimizes bus loading.

Note that in step-2 no register binding is made, only the scheduling of the bus transfers and the selection of which transfer goes on a bus and which bus that transfer uses.

[Figure 3 (a), conventional scheme. Labels: register file, registers inside, slaves inside, RAM array, data in, data out, address, pass gate or tristate buffer, bus, bus latch.]

The following are the notations and variables used in our ILP formulation, given in Table 1, for the bus scheduling and binding.

E: Set of edges with lifetime >= 2 cycles.
N_B: Max. number of busses allowed.
D(Bus): Represents the delay of the bus.
Multoutput: Set of operations with multiple edges at their output.
Single: Set of edges that do not belong to multiple output operations.
Edgeoverlap(s): Set of edges that overlap at each cycle s.
WB(e,s,n): = 1 only if clock cycle s is used to write the variable (edge) e in the register file through bus n, otherwise = 0.
RB(e,s,n): = 1 only if clock cycle s is used to read the variable (edge) e from the register file through bus n, otherwise = 0.
WBM(op,s,n): = 1 only if clock cycle s is used to write any of the variables (edges) at the output of operation op in the register file through bus n, otherwise = 0.
RBM(op,s,n): = 1 only if clock cycle s is used to read any of the variables (edges) at the output of operation op from the register file through bus n, otherwise = 0.
Range(e_X): The cycles in which the Z-O variable X exists (X could be RB or WB or ...).
Regused(op,s): = 1 only if any of the edges at the output of operation op is alive at clock cycle s, otherwise = 0.
Minreg: An integer variable used to count the lower bound on the number of registers needed to implement the architecture.
Totreg: An integer variable used to count the maximum lifetime of all the registers.
MaxRegFile_n: An integer variable used to count the maximum required number of RAM locations per register file.
Tristate: The number of tristate drivers for every bus is calculated and the maximum of them is assigned to this integer variable.

Table 1: ILP formulation for the bus scheduling and binding (s and p index clock cycles, n indexes busses).

(1) $\sum_{n=1}^{N_B} \sum_{s \in Range(e_{WB})} WB_{e,s,n} \le 1, \quad \forall e \in E$

(2) $\sum_{n=1}^{N_B} \sum_{s \in Range(e_{RB})} RB_{e,s,n} \le 1, \quad \forall e \in E$

(3) $\sum_{s \in Range(e_{WB})} WB_{e,s,n} = \sum_{s \in Range(e_{RB})} RB_{e,s,n}, \quad \forall e \in E,\ n = 1 \dots N_B$

(4) Read after write, for every edge $e \in E$: over the window from $ASAP(e_{RB})$ to $ALAP(e_{WB}) + D(Bus)$, the read variables $RB_{e,p,n}$ may be set only after the write variables $WB_{e,p,n}$, with the bus delay $D(Bus)$ in between.

(5) $WB_{e,s,n} - WBM_{op,s,n} \le 0, \quad \forall op \in Multoutput,\ \forall e \in op,\ \forall s \in Range(e_{WB})$

(6) $RB_{e,s,n} - RBM_{op,s,n} \le 0, \quad \forall op \in Multoutput,\ \forall e \in op,\ \forall s \in Range(e_{RB})$

(7) $\sum_{e \in Single \cap Edgeoverlap(s)} WB_{e,s,n} + \sum_{op \in Multoutput} WBM_{op,s,n} \le 1$

(8) $\sum_{e \in Single \cap Edgeoverlap(s)} RB_{e,s,n} + \sum_{op \in Multoutput} RBM_{op,s,n} \le 1$

(9) $1 - \sum_{n=1}^{N_B} \sum_{p=Asap(e_{WB})}^{s} WB_{e,p,n} + \sum_{n=1}^{N_B} \sum_{p=Asap(e_{RB})}^{s} RB_{e,p,n} - Regused_{op,s} \le 0, \quad \forall op \in Multoutput,\ \forall e \in op,\ \forall s \in Range(e_{WB}) \cup Range(e_{RB})$

(10) For every cycle $s$: the Single edges of $Edgeoverlap(s)$ that are held in registers (alive, and not between their write $WB_{e,p,n}$ and their read $RB_{e,p,n}$), plus $\sum_{op \in Multoutput} Regused_{op,s}$, plus the term $reg$, are bounded above by $Minreg$.

(11) The same register occupancy, accumulated over all edges $e$, is bounded above by $Totreg$.

(12) $\sum_{e \in Edgeoverlap(s)} \left( \sum_{p=Asap(e_{WB})}^{s} WB_{e,p,n} - \sum_{p=Asap(e_{RB})}^{s} RB_{e,p,n} \right) - MaxRegFile_n \le 0$

(13) For every bus $n$: the reads $RB_{e,p,n}$ of the Single edges, taken up to $Alap(e_{RB})$, plus the reads $RBM_{op,p,n}$ of the Multoutput operations, taken up to $Alap(op_{RBM})$, are bounded above by $Tristate$.

Objective function:

$\text{Minimize} \quad C_1 \cdot Minreg + C_2 \cdot Totreg + C_3 \sum_{n=1}^{N_B} MaxRegFile_n + C_4 \cdot Tristate + C_5 \sum_{op \in Multoutput} \sum_{n=1}^{N_B} \sum_{s} WBM_{op,s,n} + C_6 \sum_{op \in Multoutput} \sum_{n=1}^{N_B} \sum_{s} RBM_{op,s,n} + C_7 \sum_{op \in Multoutput} \sum_{s} Regused_{op,s}$

[Figure 3 (b), our scheme with a two phase clock. Labels: address, ζ, ζ.φ.]

Figure 3: (a) A non-pipelined bus with a register file where read and write access delays are chained in one cycle. (b) Our clocking scheme: a pipelined bus and register file (note the two phase clocking) where a write is followed by a read in one cycle. The bits ζ generated from the controller are ANDed with the clock (i.e. ζ.φ) to indicate the latching time. Our scheme requires that data stored in a register file have a life time greater than or equal to 2 cycles.

[Figure 4 (diagrams): (a) FUx to FUy through register Rn; (b) FUx to FUy through Rn, register file storage, and Rp, with the clock cycles of each step marked.]

Figure 4: (a) Shows a data transfer using local interconnections through individual register Rn. (b) Shows a data transfer using a pipelined bus. Rn and Rp do not have to be different, but have to have at least a life time of one cycle. The variable is stored in a register file between its write cycle and clock cycle k through Bus q.

[Figure 5 (diagram): an operation op1 with output edges e1, e2 and e3 and, for each clock cycle, the variables WB and RB (per edge) and WBM, RBM and Regused (per operation).]

Figure 5: The list of Z-O variables generated for an operation with multiple edges at the output.

The first two constraints of Table 1 ensure that only one clock cycle is used to write the variable in the register file through one bus and one cycle to read it back from the register file through one (same or another) bus. The third constraint ensures that every variable assigned to a bus transfer is written in and read from the same register file through one bus. Without constraint 3, the use of multiported register files is enabled, where one variable is written through one port (from one bus) and read from another port (and another bus). Inequality 4 ensures that a read occurs after a write to a register file for any variable.

Constraints 5 and 6 are intended for operations with multiple edges at their output. We assign a new set of zero-one (0-1) variables (WBM, RBM) for these output edges (as shown in Figure 5). These 0-1 variables are forced to 1 if there is a read from a register file or a write to a register file at any cycle for any of the multiple output edges. For instance in Figure 5, WBM_{1,2} is set to 1 when at least one of the variables WB_{1,2}, WB_{2,2}, WB_{3,2} is equal to 1. The same argument applies to the RB and RBM variables.

To ensure that at every cycle only one data variable can be read and only one data variable written through any one specific bus, we enforce constraints 7 and 8. In these constraints we use WB and RB as 0-1 variables for edges that do not belong to multiple output operations, and WBM and RBM as 0-1 variables for the edges of the multiple output operations.

To account for the register cost in the cost function, edges of multiple output operations are assigned a set of 0-1 variables (Regused_{op,s}) per cycle. These 0-1 variables are set to 1 using constraint 9 when any of the multiple edges for an operation is alive in cycle s. Constraint 10 determines a lower bound (integer variable Minreg) on the number of registers that can be assigned in step-3. Constraint 11 indirectly ensures that the total register life-time is reduced. Constraint 12 counts the maximum required number of RAM locations per register file.
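The constraints described so far can be sketched as a checker for a candidate bus-transfer schedule. This is a minimal illustration, not the ILP itself and not the authors' tool: the data layout, the function names and the example schedule are hypothetical, and the read-after-write test assumes that a read must come at least `bus_delay` cycles after the write. Edges of one multiple-output operation are treated as one value sharing a bus slot, mirroring the role of WBM and RBM.

```python
from collections import defaultdict


def check_bus_schedule(transfers, bus_delay=1, producer=None):
    """Return a list of violations for a candidate bus-transfer schedule.

    transfers: {edge: (write_cycle, read_cycle, bus)}
    producer:  {edge: operation} for edges of multiple-output operations;
               such edges carry the same value and may share a bus slot.
    """
    producer = producer or {}
    errors = []
    writes = defaultdict(set)  # (bus, cycle) -> values written in that slot
    reads = defaultdict(set)   # (bus, cycle) -> values read in that slot
    for edge, (w, r, bus) in transfers.items():
        # Read-after-write, in the spirit of inequality 4 (assumed form).
        if r < w + bus_delay:
            errors.append(f"{edge}: read in cycle {r} is too early for write in cycle {w}")
        value = producer.get(edge, edge)
        writes[(bus, w)].add(value)
        reads[(bus, r)].add(value)
    # One value written and one value read per bus per cycle (constraints 7 and 8).
    for kind, slots in (("write", writes), ("read", reads)):
        for (bus, cycle), values in sorted(slots.items()):
            if len(values) > 1:
                errors.append(f"bus {bus}, cycle {cycle}: {len(values)} values on one {kind} slot")
    return errors


def ram_locations(transfers):
    """Peak number of variables held in each bus's register file (constraint 12)."""
    occupancy = defaultdict(int)
    peak = defaultdict(int)
    for w, r, bus in transfers.values():
        for cycle in range(w, r):  # stored from the write cycle until the read cycle
            occupancy[(bus, cycle)] += 1
            peak[bus] = max(peak[bus], occupancy[(bus, cycle)])
    return dict(peak)


# Hypothetical schedule: three edges on a single bus (bus 0).
schedule = {"e1": (1, 3, 0), "e2": (2, 5, 0), "e3": (2, 4, 0)}
print(check_bus_schedule(schedule))
print(check_bus_schedule(schedule, producer={"e2": "op1", "e3": "op1"}))
print(ram_locations(schedule))
```

In the first call e2 and e3 collide on the write slot of cycle 2; declaring them outputs of the same operation removes the conflict, since they then count as one write.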
For all the busses, constraint 13 counts the maximum bus loading due to tri-state drivers per bus. The objective function to be minimized has seven components. The first four terms contribute directly to the final datapath structure. The last three terms are essential for the correct assignment of the Z-O variables in the ILP formulation.

4.2 Register and Multiplexer Binding with Datapath Generation

In this paper, we report the register binding obtained by using the tool in [10]. In summary, the tool performs the register binding while at the same time performing a floorplanning of the datapath, so as to be able to compute the routing delay and area cost and its effect on the clock cycle duration. Such a tool produces better results than independently determining a register binding followed by a floorplanning. It uses a stochastic (simulated annealing) search while continuously minimizing the critical path that determines the clock cycle for the datapath, together with layout area.

5 Results

Elliptic Filter: We use the 5th order elliptic filter to demonstrate a number of points regarding our architecture features of pipelined busses, the operation binding using the structuring approach and our bus transfer scheduling. Figure 6 (a) shows a 2 adder, one multiplier schedule with 7 cycles. The operation schedule is very similar to the one obtained by ALPS [3] (this filter is retimed). Our ILP tool, OSTA, running on a SPARC10, produced a scheduling and operation binding (step-1) in 2.3 CPU seconds. The bus scheduling and binding (step-2) took .7 seconds. The detailed register binding was done in conjunction with floorplanning (step-3) and took 20 CPU seconds [10] (and 700 sec for ILP [7]). All results were proven optimal. Note that only one pipelined bus is used, compared to an optimal of 7 busses for the OASIC [5] architectural model and 4 busses for the SPAID-X style architecture used in [3][5], which is equivalent to a 400% saving in the number of busses.
Because these busses require global wires, which are not abundant in FPGAs, such a saving, which is a direct consequence of our approach, is very significant for the routability of a datapath.

Table 2: Architectures for EWF, 2 adders, 1 pipelined multiplier.
Column order: design | #cycle | #TS (#BC) | #mux i/p | #Bus (net) | #Reg (#lc) | #CLB/bit
OASIC[2]: 8 | na | na | 7(na)
InSyn[6]: 9 | na | na | 4(na) | 8+(5) | 6.5
SPAIDX[3]: (3) | -(2) | 5
IP[5]: 9 | 2(23) | 4(0) | (0) | 4
[7]-LM1: 9 | 6(2) | 9 | 3 (8) | 5.5
[7]-LM2: 9 | na | 25 | - () | 5.5
STAR[4]: 9 | 6(28) | 7 | 5 (na)
OSTA (ours): 7 | 2 (6) | 22 | (7) | 7(6) | 4

Only nets with fanout > 1 are counted. Our estimate for the number of CLBs for storage includes recursive storage. If recursive edges are added to the other solutions, up to an extra four registers or 2 CLB/bit of word length are needed. #TS = number of tristate drivers, #BC = number of bus connections as in [5]. #CLB/bit is the number of CLBs used per bit of the word width of the datapath. #lc is the number of latches or RAM locations.

To show how our model and synthesis compare with others, we highlight a number of measures: number of busses, bus loading, number of networks with fanout > 1 (indicates complex interconnections), number of tristates, fanout of networks, and components of delay on the critical path.

Table 2 compares different synthesized datapaths for the EWF.

[Figure 6 (diagrams): (a) schedule with variables a to i between IN and OUT; (b) datapath with register file, bus, RF write-through and RF read-through paths.]

Figure 6: (a) Details of the scheduling and binding for the elliptic filter example. (b) The resulting architecture schematic. This architecture accounts for all recursive variable storage.

For the EWF architecture shown in [5], the maximum bus loading is 3 inputs and three tri-state drivers, which is the same as for the architecture in Figure 6. Note that in the architecture in [5], bus delay and RAM access times are added to the delay of a functional unit and registers to obtain the critical path determining the cycle time, while for the architecture in Figure 6 the RAM access and FU delays are independent, as discussed before. It is evident that our architecture is simpler and uses fewer interconnections.

Regarding storage cost, our architecture uses a total of 7 registers and 6 register file locations (4 CLB per bit of datapath width), compared to an optimum of 9 registers for a mux based (without busses [1]) datapath, which is equivalent to using 4.5 CLB per bit of data path width for storage. It is important to note that we include the cost of storage of all recursive variables (z^-1 delays in the EWF filter specification), unlike almost all other published solutions (the 8 recursive variables i-b, Figure 6, are not bound to storage in other references). The results by Li & Mowchenko [8] account for the recursive edges and use 11 registers, or 5.5 CLB per bit of data path width, for storage. This data path can either be viewed as a bus based architecture (LM-1) or a multiplexer based architecture (LM-2) in Table 2. We use fewer multiplexer inputs (> 2%), fewer CLBs for storage (> 40%) and 300% fewer busses. Our architecture has a maximum loading of three drivers on the bus and 3 outputs (mux inputs). The maximum fanout of any register output is 4. The maximum number of mux inputs is 4. These values are the best values for any published EWF architecture.
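The storage figures quoted above follow from simple arithmetic, sketched here. The function name and its parameters are hypothetical. The sketch assumes two registers per CLB (1/2 CLB per bit per register) and that one LUT used as RAM holds up to 16 locations, so a register file costs 1/2 CLB per bit for every started group of 16 locations; the grouping rule for larger files is an assumption, not something the paper states.

```python
import math


def storage_clb_per_bit(registers, register_file_locations=(), locations_per_lut=16):
    """CLBs of storage per bit of datapath width.

    registers:               number of individual datapath registers
    register_file_locations: RAM locations used in each register file
    locations_per_lut:       assumed depth of one LUT used as RAM (32 bits per CLB, two LUTs)
    """
    register_cost = 0.5 * registers  # two registers per CLB
    ram_cost = sum(0.5 * math.ceil(n / locations_per_lut)
                   for n in register_file_locations if n > 0)
    return register_cost + ram_cost


print(storage_clb_per_bit(7, [6]))  # bus/RAM datapath of Figure 6: 4.0
print(storage_clb_per_bit(9))       # mux based datapath with 9 registers: 4.5
```

Under these assumptions the 7 registers plus one 6-location register file cost 4 CLB per bit, against 4.5 CLB per bit for 9 registers.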
Cascaded-Elliptic Filter: An alternative example that requires significantly more storage, and is suitable to highlight the advantage of RAM/bus based architectures, is a filter composed of two elliptic filters connected in cascade (the output of the first is directly the input of the second). The operation scheduling and binding (solution time 58 sec.) as well as the bus scheduling (solution time 33 sec.) are shown in Figure 7. For storing all variables, including the recursive edges, our resulting datapath uses 11 registers at a cost of 1/2 CLB/bit/register, and register file (RAM) locations at a cost of 1/2 CLB/bit for all locations. In comparison, a multiplexer based solution would use 18 registers at a cost of 1/2 CLB/bit/register. Hence, the bus based solution saves an equivalent of 3 CLB per bit of the word length.

[Figure 7 (schedule diagram): first filter and second filter, variables a to r, output out.]

Figure 7: The scheduling and binding of two elliptic filters connected in series (the output of the first is the input of the second). The architecture uses two adders and one pipelined multiplier. The operation schedule and binding was done using our ILP formulation for step-1. The bus transfer scheduling was done using our step-2 ILP formulation for a one bus solution.

For a 32 bit width data path our estimate for our datapath is 64 CLB for the adders and their multiplexers, 206 CLB for a 2x6 booth re-coded multiplier (synthesized), and 192 CLB for storage, for a total of 462 CLB. An extra 3x32 = 96 CLB are required for a multiplexer based solution, with a total of 558 CLB. This is a saving of 20% in terms of the total number of CLBs required for our datapath. Since our architecture uses fewer interconnections and only one bus, the efficiency due to routing is higher. For a two bus solution (not shown) the number of registers used in our solution was 8 and the total number of register file locations is 4. This results in a saving of 4 CLB per bit of the word length. For a 32 bit width datapath our estimate is 430 CLB and the saving is 30% in terms of the total number of CLBs required for this datapath. Note that if a more efficient multiplier is used, these percentage gains can be increased.

Fast Discrete Cosine Transform: This example is used to demonstrate that ILP can handle medium size graphs that have high parallelism (parallelism is limited in the EWF example). Step-1 took 33 seconds. For a mux based solution the number of registers used is 11, while for a step-2 bus binding using one bus (which took .3 seconds), the architecture uses 8 registers and 4 register file locations. This results in a saving of 1 CLB per bit of the data path width over a multiplexer based datapath that does not use busses or RAMs. More details about this and the values of the cost function coefficients used can be found in [7].

It is to be noted that the running time of the ILP solution for the bus scheduling and assignment is very small (compared to operation scheduling and binding). This is due to 3 factors:
1- The range of each bus transfer is smaller than in the original graph, since all operations are already scheduled.
2- All edges with life time < 2 are eliminated from the bus search.
3- The ILP formulation used is tight. This is evident from the number of branch and bound trials, which did not exceed tens of branches taken in all preceding examples. In some instances where the register costs are not added to the cost function, no branching was observed and the integer optimal solution was obtained directly from the linear program solution.

6 Conclusions

Our synthesis approach, when applied to benchmark examples, produces architectural solutions that outperform all other published architectures regarding structure, loading (hence, performance) and resources used. This result is due to flexible data transfers on pipelined busses in our architectural model, and also due to the synthesis approach that structures the architecture and efficiently schedules the bus transfers while minimizing bus loading.
We have demonstrated that any schedule and binding for a multiplexer based data path can be transformed to support a bus/RAM based datapath by properly scheduling the bus transfers. In our methodology, by using structural complexity reduction during scheduling and binding, we have shown that bus binding and storage allocation can be delayed to later steps of synthesis and produce good results. Hence, we have proposed an efficient split of the architectural synthesis procedure. We have also shown that ILP techniques can be effectively used in conjunction with complex cost functions, for small to medium size DFGs, to tie together different levels and various tasks in architectural design. We have also shown by our low running times the tightness of the formulation we have presented. Moreover, our constraints and cost functions can be extended to other heuristic architecture search techniques that may have better running times for larger problems. Future research will focus on addressing the very large memories required for loop execution and multiple FPGA computation accelerator systems, as well as methods of relegating some of the bus binding decisions to step-3 (the register binding and floorplanning) of our synthesis approach.

References

[1] B. Haroun, B. Sajjadi, ILP Synthesis of Signal Processing Architectures with minimum Structural Complexity, CICC, May 1994, pp.
[2] J. Rose, A. El Gamal, A. Sangiovanni-Vincentelli, Architecture of Field Programmable Gate Arrays, Proc. IEEE, Vol. 81, No. 7, July 1993.
[3] C.T. Hwang, J.H. Lee, Y.C. Hsu, A Formal Approach to the Scheduling Problem in HLS, IEEE Trans. CAD, Vol-10, No. 4, April 1991, pp.
[4] F.S. Tsai, Y.C. Hsu, STAR: An Automatic Data Path Allocator, IEEE Tr. CAD, Vol. 11, No. 9, September 1992.
[5] C. Gebotys, M.I. Elmasry, Optimal Synthesis of High Performance Architectures, JSSC, March 1992, pp.
[6] M. Rim, R. Jain, R. De Leone, Optimal Allocation & Binding in HLS, DAC-92, pp.
[7] G. Goossens, J. Rabaey, J. Vandewalle, H. De Man, An Efficient Microcode Compiler for Application Specific DSP Processors, IEEE Transactions on CAD, Vol. 9, No. 9, pp., 1990.
[8] S. Note, W. Geurts, F. Catthoor, H. De Man, Cathedral-III: Architecture-Driven High-Level Synthesis for High Throughput DSP Applications, 28th Design Automation Conference, 1991, pp.
[9] C. Gebotys, M.I. Elmasry, Global Optimization Approach for Architectural Synthesis, IEEE Tr. on CAD, Vol-12, No. 9, Sept. 1993, pp.
[10] A. Safir, B. Haroun, A Floorplanner for Datapath Optimization, submitted to Tr. VLSI Systems.
[11] B. Haroun et al., VLSI Architecture Synthesis and Implementation of HiFI Digital Filters, Proc. Canadian Conference on VLSI, pp.07-4, Oct
[12] C. Gebotys, Synthesizing Optimal Application Specific DSP Architectures, in: VLSI Design Methodologies for DSP Architectures, ed. M. Bayoumi, pp.43-92, Kluwer Academic Publishers, Boston, MA, 1994.
[13] B. Haroun et al., Synthesis of Multiple Bus Architectures For DSP Applications, in: VLSI Design Methodologies for DSP Architectures, ed. M. Bayoumi, pp.93-30, Kluwer Academic Publishers, Boston, MA, 1994.
[14] C. Gebotys, Synthesizing Optimal Register file Architectures for FPGA Technology, CICC, May 1994, pp.
[15] J. J. Rabaey et al., Fast Proto-typing of data path Intensive Architectures, IEEE Design & Test, Vol. 8, No. 2, pp.405, 1991.
[16] B. Sajjadi, Architectural Synthesis for FPGA Based Signal Processing Systems, M.Sc. Thesis in preparation, Concordia University, 1995.
[17] T. Li, J. Mowchenko, Applying Simulated Evolution to High Level Synthesis, IEEE Tr. CAD, March 1993, pp.
