High-level synthesis under I/O Timing and Memory constraints


To cite this version: Philippe Coussy, Gwenolé Corre, Pierre Bomel, Eric Senn, Eric Martin. High-level synthesis under I/O Timing and Memory constraints. IEEE, pp. 680-683, 2005. <hal-00077297>

HAL Id: hal-00077297 https://hal.archives-ouvertes.
Submitted on 30 May 2006

Philippe Coussy, Gwenolé Corre, Pierre Bomel, Eric Senn, Eric Martin
LESTER LAB, UBS University, CNRS FRE 2734

Abstract

Designing complex Systems-on-Chip requires taking communication and memory access constraints into account when integrating dedicated hardware accelerators. In this paper, we present a methodology and a tool that allow the High-Level Synthesis of DSP algorithms under both I/O timing and memory constraints. Based on formal models and a generic architecture, this tool helps the designer find a reasonable trade-off between the required I/O timing behavior and the internal memory access parallelism of the circuit. The interest of our approach is demonstrated on the case study of an FFT algorithm.

I. INTRODUCTION

Electronic design complexity has increased hugely since the birth of integrated circuits. Over recent years, system-level technologies have moved from Application Specific Integrated Circuits (ASICs) and Application Specific Signal Processors (ASSPs) to complete System-on-Chip (SoC) designs. This increase in chip complexity requires an equivalent shift in the design methodology and a more direct path from the functionality down to the silicon.

In [1]-[3], the authors propose system synthesis approaches where the algorithms of the functional specification correspond to predesigned components in a library. Macro generators produce the RTL architecture of hardware blocks by using the generic / generate VHDL mechanisms: the synthesis process can hence be summarized as a block instantiation. However, though such components may be parameterizable, they rely on fixed architectural models with very restricted customization capabilities. This lack of flexibility in RTL blocks is especially true for both the communication unit, whose I/O scheduling and/or I/O timing requirements are fixed, and the memory unit, whose data distribution is set. High-Level Synthesis (HLS) can be used to reduce this lack of flexibility.
For example, SystemC Compiler [4] from Synopsys and Monet from Mentor Graphics propose a set of I/O scheduling modes (cycle-fixed, superstate, free-floating) that make it possible to target alternative architectural solutions. Communication is specified using wait statements and is mixed with the computation specification, which limits the flexibility of the input behavioral description. In these two tools, memory accesses are represented as multi-cycle operations in a Control and Data Flow Graph (CDFG). Memory vertices are scheduled as operative vertices by considering conflicts among data accesses. In practice, the number of nodes in their input specifications must be limited to obtain a realistic and satisfying architectural solution. This limitation is mainly due to the complexity of the algorithms used for the scheduling. Only a few works really schedule the memory accesses [5], [6]. They include precise temporal models of those accesses and try to improve performance without considering the possibility of simultaneous accesses, which would ease the subsequent task of register and memory allocation.

In the domain of real-time and data-intensive applications, processing resources have to deal with ever-growing data streams. The system/architecture design therefore has to focus on avoiding bottlenecks in the buses and I/O buffers used for data transfer, while reducing the cost of data storage and satisfying strict timing constraints and high data rates. A methodology that permits such a design must rely on (1) constraint modeling for both I/O timing and internal data memory, (2) constraint analysis for feasibility checking and (3) high-level synthesis.

In [7] and [8], we proposed a methodology for SoC design that is based on the reuse of algorithmic descriptions. Our approach is based on high-level synthesis techniques under I/O timing constraints and aims to optimally design the corresponding component by taking into account the system integration constraints: the data rate, the technology, and the I/O timing properties.
In [9], we introduced a new approach that takes the memory architecture and the memory mapping into account in the behavioral synthesis of real-time VLSI circuits. A memory-mapping file was used to include those memory constraints in our HLS tool GAUT [10]. In this paper, we propose a design flow based on formal models that allows high-level synthesis under both I/O timing and memory constraints for digital signal processing algorithms. DSP system designers specify the I/O timing, the computation latency, the memory distribution and the application's data rate requirements, which are the constraints for the synthesis of the hardware components.

This paper is organized as follows. In Section II we formulate the problem of synthesis under I/O timing and memory constraints. Section III presents the main steps of our approach and its underlying formal models. In Section IV, we demonstrate the efficiency of our approach with the didactic example of the Fast Fourier Transform (FFT).

II. PROBLEM FORMULATION

In this section, we illustrate the interdependency between the memory access parallelism and the timing performance, as well as the influence of these two parameters on the resulting component architecture. Let us consider a hardware component based on a generic architecture composed of two main functional units: one memory unit MU and one processing unit PU. Suppose the computation to be processed is $c = (a \times v1 + v3) - (b \times v2 + v4)$, where v1, v2, v3 and v4 are variable values stored in memory. Fig. 1(a) shows the Signal Flow Graph (SFG) of this algorithm. This component receives input data a and b from the environment through an input port and sends its result c on the output port. All the data used and produced by the processing unit are respectively read and written in a fixed order, the sequence S = (a, b, c), i.e. t_a < t_b < t_c.

[Fig. 1: (a) Signal Flow Graph SFG, (b) Timing behavior]

The read sequence of the two variables v1 and v2 is completely deterministic, i.e. t_v1 < t_v2, with t_v1 = t_a and t_v2 = t_b. However, a scheduling choice is needed to access data v3 and v4, since a single memory bank is available in the component. In our example, we choose to access v3 before v4. In this context, the minimum latency is therefore equal to 5 cycles (Fig. 1(b)). Fig. 2 presents a possible corresponding architecture of the processing unit, which includes 1 multiplier, 1 adder, 1 subtractor and 8 registers.

[Fig. 2: Sequential architecture]

Let us now consider the following data transfer sequence, in which a and b arrive simultaneously: S_busses = (a ∥ b, c), i.e. t_a = t_b < t_c. If the latency required to produce the result c is long enough (at least 5 cycles) to allow a reordering (serialization) of input data a and b, then the previously designed architecture including one memory bank can be reused. However, this solution requires designing an input wrapper composed of 1 register, 1 multiplexer and 1 controller. If the required latency is not long enough (e.g. 3 cycles), the designer must design a new component including 2 multipliers, 2 adders, 11 registers and 2 memory banks (see Fig. 3). In such a case, because of their restricted customization capabilities, neither a predesigned component nor a macro generator would be flexible enough to respond to the new design constraints.

[Fig. 3: Parallel architecture]

As stated before, a new design flow, based on synthesis under constraints, is needed to gain flexibility and to make DSP component design easier.
This includes (1) modeling styles to represent I/O timing and memory constraints, (2) analysis steps to check the feasibility of the constraints and (3) methods and techniques for optimal synthesis.

III. DESIGN APPROACH OVERVIEW

The input of our HLS tool [10] is an algorithmic description that specifies the functionality while disregarding implementation details. This initial description is compiled to obtain an intermediate representation: the Signal Flow Graph SFG (see Fig. 4).

A. Timing Constraint Graph

In a first step, we generate an Algorithmic Constraint Graph (ACG) from the operator latencies and the data dependencies expressed in the SFG. The latencies of the operators are assigned to the operation vertices of the ACG during the operator selection step of the behavioral synthesis flow. Starting from the system description and its architectural model, the integrator specifies, for each bus or port that connects the component to others in the SoC, the I/O rates, the data sequence orders and the transfer timing information. We defined a formal model named IOCG (I/O Constraint Graph) that supports the expression of integration constraints for each bus (resp. port) of the component. Finally, we generate a Global Constraint Graph (GCG) by merging the ACG with the IOCG. Merging is done by mapping the vertices and associated constraints of the IOCG onto the input and output vertex sets of the ACG. A minimum timing constraint on an output vertex of the IOCG (earliest date for data transfer) is transformed in the GCG into a maximum timing constraint (latest date for data computation/production).

After having described the behavior of the component and the design constraints in a formal model, we analyze the feasibility between the application rate and the data dependencies of the algorithm, as a function of the technological constraints. We analyze the I/O timing specifications against the algorithmic ones: we check whether the required constraints on output data are always satisfied given the behavior specified for input data.
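The feasibility analysis of minimum and maximum timing constraints can be sketched with a textbook longest-path computation on a constraint graph. This is only an illustration under stated assumptions, not the analysis implemented in GAUT: the function name `feasible_dates`, the tuple encoding of constraints and the unit operator latencies are all hypothetical. A minimum constraint (u, v, w) means t_v >= t_u + w; a maximum constraint (u, v, w) means t_v <= t_u + w and is encoded as a backward edge of weight -w. The constraints are feasible exactly when the graph has no positive cycle.

```python
def feasible_dates(nodes, min_constraints, max_constraints, source):
    """Return the earliest date of every node, or None if infeasible.

    Hypothetical helper: Bellman-Ford style longest paths from `source`.
    """
    # A max constraint t_v <= t_u + w becomes the edge v -> u with weight -w.
    edges = list(min_constraints) + [(v, u, -w) for u, v, w in max_constraints]
    date = {n: float("-inf") for n in nodes}
    date[source] = 0
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if date[u] + w > date[v]:
                date[v] = date[u] + w
                changed = True
        if not changed:
            break
    # An edge that can still be relaxed reveals a positive cycle.
    if any(date[u] + w > date[v] for u, v, w in edges):
        return None
    return date


# Running example c = (a*v1 + v3) - (b*v2 + v4), unit latencies assumed.
NODES = ["a", "b", "x1", "x2", "x3", "x4", "c"]
DATA_DEPENDENCIES = [
    ("a", "x1", 1), ("b", "x2", 1),    # multiplications
    ("x1", "x3", 1), ("x2", "x4", 1),  # additions
    ("x3", "c", 1), ("x4", "c", 1),    # subtraction
]
SIMULTANEOUS_INPUTS = [("a", "b", 0)]  # t_a = t_b (used as min and max)


def check(latency):
    return feasible_dates(
        NODES,
        DATA_DEPENDENCIES + SIMULTANEOUS_INPUTS,
        SIMULTANEOUS_INPUTS + [("a", "c", latency)],
        "a",
    )
```

Under these assumptions, `check(3)` returns dates with `c` at 3, while `check(2)` returns `None` because the latency bound is shorter than the critical path. Memory access conflicts are not modeled here.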
The entry point of the IP core design task is the global constraint graph GCG.

B. Memory Constraint Graph

As outlined in the previous subsection, a Signal Flow Graph (SFG) is first generated from the algorithmic specification. In our approach, this SFG is parsed and a memory table is created: all data vertices are extracted from the SFG to construct it. The designer can choose the data to be placed in memory and defines a memory mapping.

A Memory Constraint Graph is a cyclic directed polar graph MCG(V', E', W'), where V' = {v'0, ..., v'n} is the set of data vertices placed in memory. A Memory Constraint Graph contains |V'| = n+1 vertices, which represent the memory size in terms of memory elements. The set of edges E' = (v'i, v'j) represents the possible consecutive memory accesses, and W' is a function that gives the access delay between two data nodes. W' has only two possible values: Wseq (sequential) for an access to the adjacent memory address, or Wrand (random) for a non-adjacent memory access.

For every memory in the memory table, we construct a weighted Memory Constraint Graph (MCG). It represents the conflicts and scheduling possibilities between all the nodes placed in this memory. The MCG is constructed from the SFG and the memory mapping file. It will be used during the scheduling process.
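The construction of a weighted MCG from a memory mapping can be sketched as follows. This is a minimal illustration, assuming a mapping given as a dictionary from variable name to (bank, address), hypothetical numeric values for Wseq and Wrand, and one edge for every ordered pair of data placed in the same bank. The function names are not taken from GAUT.

```python
W_SEQ, W_RAND = 1, 2  # assumed delays; only W_SEQ < W_RAND matters here


def build_mcg(mapping):
    """Build one weighted graph per bank: {bank: {(src, dst): weight}}."""
    mcg = {}
    for src, (bank_src, addr_src) in mapping.items():
        for dst, (bank_dst, addr_dst) in mapping.items():
            if src == dst or bank_src != bank_dst:
                continue
            # Sequential access: dst sits at the address right after src.
            weight = W_SEQ if addr_dst == addr_src + 1 else W_RAND
            mcg.setdefault(bank_src, {})[(src, dst)] = weight
    return mcg


def sequence_cost(graph, sequence):
    """Total access delay of visiting `sequence` in order within one bank."""
    return sum(graph[(a, b)] for a, b in zip(sequence, sequence[1:]))


# v1..v4 placed consecutively in a single bank (bank 0).
MAPPING = {"v1": (0, 0), "v2": (0, 1), "v3": (0, 2), "v4": (0, 3)}
BANK0 = build_mcg(MAPPING)[0]
```

With this mapping, the access sequence v1, v2, v3, v4 follows three sequential edges, whereas any sequence that goes backwards or skips an address pays the random-access delay on that step.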

[Fig. 4: Proposed Synthesis Flow]

Fig. 6(b) shows an MCG for the presented example with one single-port memory bank. The variables v1, v2, v3 and v4 are placed consecutively in one bank. Dotted edges represent sequential accesses (two adjacent memory addresses) and plain edges represent random accesses (non-adjacent addresses). Further information about the formal models and the memory design can be found in [7], [8], [9].

C. Scheduling under I/O and Memory Constraints

The classical list scheduling algorithm relies on heuristics in which ready operations (operations to be scheduled) are listed in priority order. In our tool, an early scheduling is performed on the GCG. In this scheduling, the priority function depends on the mobility criterion. For operations that have the same mobility, the priority is defined using the operation margin. Next, operations are scheduled and bound to operators (see Fig. 5).

    Scheduling_Function
      Operation_Mobility_computing(GCG)
      For (time = 0; time < End; time = time + t_cycle)
        List = Operation_Priority_listing(GCG)
        Ready_Ops = Find_schedulable_operation(List, time)
        Binding(Ready_Ops, operators_set, MCG, time)
      End for

    Binding_Function
      While (Ready_Ops != NULL)
        Ops_low_mobility = Get_first(Ready_Ops)
        If (Ops_low_mobility->margin > 0)
          If (Find_mem_conflict(MCG, Ops_low_mobility) = FALSE)
            If (operators_set != NULL)
              Ops_Binding(sch_list, operator)
          else // no operator or memory conflict
            Postponed(Ops_low_mobility)
        else // margin = 0
          If (Find_mem_conflict(MCG, Ops_low_mobility) = FALSE)
            Operator_creation()
            Ops_Binding(sch_list, operator)
          else
            Exit(cycle, operator, operation, memory bank)
        end if
      End while

Fig. 5: Pseudo code of the scheduling algorithm

An operation can be scheduled if the current cycle is greater than its ASAP time.
Whenever two ready operations need to access the same resource (a so-called resource conflict), the operation with the lower mobility has the highest priority and is scheduled. The other is postponed. When the mobility is equal to zero, a new operator is allocated to the operation.

To perform scheduling under memory constraints, we introduce memory access operators and add an accessibility criterion based on the MCG. A memory has as many access operators as access ports. The list of ready operations is still organized according to the mobility criterion, but all the operations that do not match the accessibility condition are removed from this list. Hence, when an operation whose mobility is equal to zero cannot access the memory, the synthesis process exits and the designer has to target an alternative solution for the component architecture by reviewing the memory mapping and/or modifying some communication features.

Our scheduling technique is illustrated in Fig. 6 using the previously presented example, where the timing constraints are now the following: S = (a ∥ b, c), i.e. t_a = t_b < t_c. The memory table (Fig. 6(a)) is extracted from the SFG and is used by the designer to define the memory mapping. Internal data v1, v2, v3 and v4 are respectively placed at addresses @0, @1, @2 and @3 in bank0. Our tool constructs one Memory Constraint Graph MCG (Fig. 6(b)). In addition to the mapping constraint, the designer also specifies two latencies, Lat1 = 5 cycles and Lat2 = 3 cycles.

For latency Lat1, the sequential access sequence is v1, v2, v3, v4: it includes 3 dotted edges (with weight Wseq). To deal with memory bank access conflicts, we define a table of accesses for each port of a memory bank. In our example, the table has only one line, for the single memory bank0. The table of memory accesses has Data_rate / Sequential_access_time elements. The value of each element of the table indicates whether a memory access operator is idle or not at the current time (control step c_step). We use the MCG to produce a schedule that accesses the memory in burst mode.
If two operations have the same priority (here a margin of 1 cycle under Lat1) and request the same memory bank, the operation that is scheduled is the one whose access address follows the address of the preceding access. For example, the multiplication operations (a × v1) and (b × v2) have the same mobility. At c_step s_1, they are both executable and both operands v1 and v2 are stored in bank0. The MCG indicates that the sequence v1, v2 is shorter than v2, v1. We then schedule (a × v1) at c_step s_1 and (b × v2) at c_step s_2 to favor the sequential access (see Fig. 6(c)). At c_step s_3, the additions (x1 + v3) and (x2 + v4) have the same mobility, and the MCG indicates that the sequence v2, v3 is shorter than v2, v4. Addition (x1 + v3) is scheduled at c_step s_3 and (x2 + v4) at c_step s_4.

[Fig. 6: Scheduling under I/O timing and latency constraint. (a) Memory table: v1, v2, v3 and v4 in bank 0 at addresses 0, 1, 2 and 3; (b) MCG; (c) Scheduling table]

For latency Lat2, the multiplication operations (a × v1) and (b × v2) have the same mobility, which is null. Both operations must then be scheduled in c_step s_1. Because of the memory access conflict, there is no solution to the scheduling problem: the designer hence has to review the design constraints, and can target an alternative solution by adding one memory bank or by increasing the computing latency.
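The scheduling behavior described in this subsection can be sketched on the running example. The code below is a simplified illustration, not the GAUT scheduler: it assumes unit operator latencies, counts latency in control steps, models only the single memory port (operator binding and allocation are left out), and breaks ties between equally urgent operations with assumed Wseq/Wrand costs plus a one-step look-ahead. All names (`Op`, `schedule`, `access_cost`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

W_SEQ, W_RAND = 1, 2  # assumed access delays


@dataclass
class Op:
    name: str
    preds: tuple                 # operations that must complete first
    mem: Optional[str] = None    # variable read from the single-port bank


def access_cost(addr, prev, nxt):
    """W_SEQ when `nxt` sits right after `prev` in memory, else W_RAND."""
    if prev is not None and addr[nxt] == addr[prev] + 1:
        return W_SEQ
    return W_RAND


def schedule(ops, addr, latency):
    """Return {op: control step} or None. `ops` must be topologically sorted."""
    alap = {}
    for op in reversed(ops):
        succs = [s for s in ops if op.name in s.preds]
        alap[op.name] = min((alap[s.name] for s in succs), default=latency + 1) - 1
    start, last = {}, None
    for step in range(1, latency + 1):
        ready = [o for o in ops if o.name not in start
                 and all(p in start and start[p] < step for p in o.preds)]
        if any(alap[o.name] < step for o in ready):
            return None  # an operation already missed its latest date
        mem_ready = [o for o in ready if o.mem is not None]

        def priority(o):
            # Lower margin first, then the cheapest access sequence.
            others = [x for x in mem_ready if x is not o]
            ahead = min((access_cost(addr, o.mem, x.mem) for x in others), default=0)
            return (alap[o.name] - step, access_cost(addr, last, o.mem) + ahead)

        for rank, o in enumerate(sorted(mem_ready, key=priority)):
            if rank == 0:                    # the single port serves one access
                start[o.name], last = step, o.mem
            elif alap[o.name] - step <= 0:   # zero margin and port busy: give up
                return None
        for o in ready:
            if o.mem is None:
                start[o.name] = step
    return start if len(start) == len(ops) else None


# c = (a*v1 + v3) - (b*v2 + v4), with v1..v4 at addresses 0..3 of one bank.
EXAMPLE_OPS = [
    Op("mul1", (), "v1"), Op("mul2", (), "v2"),
    Op("add1", ("mul1",), "v3"), Op("add2", ("mul2",), "v4"),
    Op("sub", ("add1", "add2")),
]
EXAMPLE_ADDR = {"v1": 0, "v2": 1, "v3": 2, "v4": 3}
```

Under these assumptions, `schedule(EXAMPLE_OPS, EXAMPLE_ADDR, 5)` places the four memory accesses in the order v1, v2, v3, v4 on steps 1 to 4 and the subtraction on step 5, while a latency of 3 returns `None` because both multiplications have zero margin and compete for the same port.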

IV. EXPERIMENTAL RESULTS

In the two previous sections, we described our synthesis design flow and the scheduling under I/O timing and memory constraints. We now present the results of synthesis under constraints obtained using the HLS tool GAUT [10]. The algorithm used for this experiment is a Fast Fourier Transform (FFT). This FFT reads 128 real inputs Xr(k) and produces the output Y(k), composed of two parts: one real, Yr(k), and one imaginary, Yi(k). The SFG includes 8451 vertices. Several syntheses have been carried out using a 200 MHz clock frequency and a technological library in which the multiplier latency is 2 cycles and the latency of the adder and the subtractor is 1 cycle.

A. Experiment 1: Synthesis under I/O timing constraints

In this first experiment, we synthesized the FFT component under I/O timing constraints and analyzed the requirements on memory banks. In order to generate a global constraint graph GCG, minimum and maximum timing constraints have been introduced between I/O vertices of the ACG using the IOCG model. The FFT latency is defined by a maximum timing constraint between the first input and the first output vertices. The specified latency (the shortest one according to the data dependencies and the operator latencies) corresponds to a delay of 261 cycles. The FFT component is constrained to read one Xr sample and to produce one Y sample every cycle. The resulting FFT component contains 20 multipliers, 8 adders and 10 subtractors (see Exp#1 in Table 1). 8 memory banks are required for those I/O timing constraints. However, the internal coefficients are mapped in memory following a non-linear scheme. A large number of memory banks is needed to get enough parallel accesses to reach the specified latency. Moreover, coefficients can possibly be located in multiple banks, which requires the design of a complex memory unit.

B. Experiment 2: Synthesis under memory constraints

In this second experiment, we synthesized an FFT component under memory constraints only.
Only the maximal number of concurrent accesses to the memory banks limits the minimal latency. Thus, with a large number of operators, a latency equal to the critical path delay of the SFG could be obtained. For this reason, we synthesized the FFT with the same number of operators as in the first experiment. Then, we analyzed the requirements on I/O ports and computation latency. The memory constraints are the following: 2 memory banks respecting a simple mapping constraint, with the 128 real coefficients Wr in bank0 and the 128 imaginary coefficients Wi in bank1. The shortest latency imposed by the memory mapping and the number of operators corresponds to a delay of 215 cycles (Exp#2 in Table 1). This delay is shorter than the delay obtained in the previous experiment. This architecture requires 36 input busses and 14 output busses. However, a large number of busses with non-trivial data ordering (non-linear data index progression) is needed. If the environment imposes the exchange of data over a smaller number of I/O busses, a communication unit should be designed. This unit would add extra latency to serialize data.

C. Experiment 3: Synthesis under I/O timing and memory constraints

In this last experiment, we synthesized the FFT component under both I/O timing and memory constraints. We kept the memory mapping used for the second experiment and searched for the shortest latency that respects the I/O rates defined in the first experiment. The resulting architecture contains 17 multipliers, 8 adders and 10 subtractors (see Exp#3 in Table 1). It produces its first result after 343 cycles.

[TABLE 1: SYNTHESIS RESULTS. Columns: memory banks, input busses, output busses, subtractors, adders, multipliers, latency (in cycles). Rows: Exp#1, Exp#2, Exp#3]

Because of both the memory mapping and the I/O constraints, the latency is greater than in experiments 1 and 2. However, the architecture complexity is equivalent to the previous ones in terms of operators.
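For reference, the latencies reported in cycles for the three experiments can be converted to time using the 200 MHz clock frequency stated for the syntheses. This is plain arithmetic on the figures quoted in the text; the symbols $T_{\text{clk}}$ and $T_1$, $T_2$, $T_3$ are introduced here for illustration only.

```latex
\begin{align*}
T_{\text{clk}} &= \frac{1}{200\ \text{MHz}} = 5\ \text{ns} \\
T_1 &= 261 \times 5\ \text{ns} = 1.305\ \mu\text{s} && \text{(I/O timing constraints only)} \\
T_2 &= 215 \times 5\ \text{ns} = 1.075\ \mu\text{s} && \text{(memory constraints only)} \\
T_3 &= 343 \times 5\ \text{ns} = 1.715\ \mu\text{s} && \text{(both constraints)}
\end{align*}
```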
Hene, it appears that synthesis under both I/O timing and memory onstraints allows to manage both the system s ommuniation and memory, while keeping a reasonable arhiteture omplexity. V. CONCLUSION In this paper, a design methodology for DSP omponent under I/O timing and memory onstraints is presented. This approah, that relies on onstraints modeling, onstraints analysis, and synthesis, helps the designer to effiiently implement omplex appliations. Experimental results in the DSP domain show the interest of the methodology and modeling, that allow tradeoffs between the lateny, I/O rate and memory mapping. We are urrently working on heuristi rules that ould help the designer in exploring more easily different arhitetural solutions, while onsidering memory mapping and I/O timing requirements. ACKNOWLEDGEMENTS These works have been realized within the Frenh RNRT Projet ALIPTA. REFERENCES [1] J. RuizAmaya, and Al., MATLAB/SIMULINKBased High Level Synthesis of DisreteTime and ContinuousTime Σ Modulators, In Pro. of DATE [2] L. Reyneri, F. Cuinotta, A. Serra, and L. Lavagno. A hardware/software odesign flow and IP library based on Simulink, In Pro. of DAC, [3] Codesimulink, [4] H. Ly, D. Knapp, R. Miller, and D. MMillen, Sheduling using behavioral templates, in Pro. Design Automation Conferene DAC'95, June 1995 [5] N. Passos, and al, Multidimensional interleaving for timeandmemory design optimization, in Pro. of ICCD, 1995 [6] A. Niolau and S. Novak, Trailblazing a hierarhial approah to perolation sheduling, in Pro. ICPP'93, 1993, [7] P. Coussy, A. Baganne, E. Martin, " Communiation and Timing Constraints Analysis for IP Design and Integration", In Pro. of IFIP WG 10.5 VLSISOC Conferene, [8] P. Coussy, D. Gnaedig, and al., A Methodology for IP Integration into DSP SoC: A Case Study of a MAP Algorithm for Turbo Deoder, In Pro. of ICASSP, 2004 [9] G. Corre, E. Senn, and al., Memory aesses Management During High Level Synthesis, In Pro. 
of CODESISSS, [10] GAUT HLS Tool for DSP,
