Fractional Rate Dataflow Model for Efficient Code Synthesis

Size: px

Start display at page:

Download "Fractional Rate Dataflow Model for Efficient Code Synthesis"

Gervase Hampton
5 years ago
Views:

1 Fractional Rate Dataflow Model for Efficient Code Synthesis Hyunok Oh Soonhoi Ha The School of Electrical Engineering and Computer Science Seoul National University Seoul , KOREA TEL : {oho,sha}@comp.snu.ac.kr ABSTRACT Automatic code synthesis from dataflow program graphs is a promising high-level design methodology for rapid prototyping of multimedia embedded systems. Memory efficient code synthesis from dataflow models has been an active research subject to reduce the gap in terms of memory requirements between the synthesized code and the hand-optimized code. However, existent dataflow models have inherent difficulty of efficiently handling data structures. In this paper, we propose a new dataflow extension called fractional rate dataflow (FRDF) in which fractional number of samples can be produced and consumed. In the proposed FRDF model, a constituent data type is considered as a fraction of the composite data type. Existent integer rate dataflow models can be easily extended to incorporate the fractional rates without loosing analytical properties. In this paper, the SDF model is extended to include FRDF, which can reduce the buffer memory requirements significantly, up to 70%, for some multimedia applications. Extended SDF model with fractional rate has been implemented in our system design environment called PeaCE(Ptolemy extension as Codesign Environment). 1. INTRODUCTION As system complexity increases and fast design turn-around time becomes important, automatic code synthesis from dataflow program graphs is a promising high level design methodology for rapid prototyping of multimedia embedded systems. COSSAP [1], GRAPE [2], and Ptolemy [3] are well-known design environments, especially for digital signal processing applications, with automatic code synthesis facility from graphical dataflow programs. The main obstacle of using this methodology for practical design problems is the code efficiency in terms of cycle counts and memory requirements of the synthesized code compared with the hand-optimized code. In a hierarchical dataflow program graph, a node, or a block, represents a function that transforms input data streams into output streams. The functionality of an atomic node is described in a high-level language such as C or VHDL. An arc represents a channel that carries streams of data samples from the source node to the destination node. The number of samples 1

2 produced (or consumed) per node firing is called the output (or the input) sample rate of the output (or the input) port. The simplest dataflow model is a single-rate dataflow (SRDF) [4] in which all sample rates are unity. When a port may consume or produce multiple samples, we call it a multirate dataflow (MRDF), which is assumed in this paper as well as in other design environments. We also assume static dataflow model in which the sample rates of all nodes are known at compile time. Static dataflow models permit compile time analysis that determines the node execution schedule and the memory requirements in the synthesized code from the graph. Hence most design environments use static dataflow models such as synchronous dataflow (SDF) [5] and cyclo-static dataflow (CSDF) [6]. In the SDF model, all sample rates are fixed while they can be changed periodically in the CSDF model. Memory efficient code synthesis from static dataflow models has been an active research subject to reduce the gap in terms of memory requirements between the automatically synthesized code and the manually optimized code [7][8][9]. The function body of a node is presumably highly optimized a priori and predefined in a block library. In the synthesized code, the function codes of atomic nodes are stitched together according to the execution order determined at compile time. Then, the cycle count of the synthesized code is not much inferior because the computation intensive kernel is encapsulated within a node. On the other hand, each arc of a dataflow graph is assigned a buffer memory whose size is at least the maximum amount of samples accumulated on the arc at run time. To minimize the buffer requirements in the synthesized code, many works have been performed to minimize the buffer requirements by optimal scheduling [7][8] and by buffer sharing [10][11]. While most previous works do not differentiate the composite type from the primitive type data, we take special account for the composite data type and achieve significant reduction on buffer memory requirement. Composite data types, such as video frame or network packet, are used extensively in recent networked multimedia applications, and become the major consumer of scarce memory resource. Existent dataflow models have inherent difficulty of efficiently expressing the mixture of a composite data type and its constituents: for example, a video frame and macro blocks. A video frame is regarded as a unit of data sample in integer rate dataflow graphs, and should be broken down into multiple macro blocks explicitly consuming extra memory space. To overcome this difficulty, we propose a new dataflow extension called fractional rate dataflow (FRDF) in which fractional number of samples can be produced or consumed In the proposed FRDF model, a constituent data type is considered as a fraction of the composite data type. An FRDF graph can be synthesized into efficient codes with shorter latency and less buffer memory than integer rate dataflow graphs. Moreover, the FRDF model can be adopted in the existent dataflow models. In the next section, we motivate the fractional rate dataflow model with an H.263 encoder example. In section 3, we show how the analytical properties of the static dataflow models can be extended to the FRDF model. Section 4 discusses the experimental results with an H.263 encoder example. In section 5, we extend the proposed fractional rate concept to the primitive type data samples. Some real experiments demonstrate in section 6 that fraction rate extension to the primitive type data also reduces the memory requirements. Code generation from the proposed FRDF model is discussed in section 7. Concluding remarks follow in section 8. 2

3 2. MOTIVATION Figure 1(a) shows an SDF subgraph of an H.263 encoder example. The ME node receives the current and the previous frames to compute the motion vectors and pixel differences, and the D node divides the input frame into 99 macro blocks to be encoded by the next EN node. Since the ME node regards a 176x144 video frame as the unit data sample while the EN node a 16x16 macro block, the D node is inserted to convert data type explicitly. The schedule of node execution becomes 1(ME,D)99(EN) in this subgraph, which executes ME and D once and EN 99 times. Hence two buffers of frame size are required for arc α and β both. The cyclo-static dataflow (CSDF) model gives a better representation, where the D node can produce a macro block per execution. The annotated notation, 1,98x0, on arc α in Figure 1(b) means that the D node consumes 1 data sample of frame size at the first execution and no sample at the subsequent 98 executions. The sample rates vary one firing to the next in a cyclic pattern. Because the pattern is periodic and predictable, it is possible to statically construct a periodic schedule: 1(ME)99(D,EN). As a result, the buffer size on arc β is reduced to the macro block size from the frame size of SDF model. The memory for arc α still has the frame size since the ME node generates a frame output. Compared with the hand-written code for the program graph, we observe that the motion estimation node need not produce the frame-size output at once. Since the ME node is the most time consuming and runs for each macro block, it had better generate output samples at the unit of macro block for shorter latency. The proposed FRDF allows implicit type conversion from the video frame type to the macro block type at the input arc of the ME node (Figure 1(c)). The fraction rate 1/99 indicates that the macro block is 1/99 of the video frame in its size. How to divide the fraction depends the frame type and should be specified by the programmer before run time. For each execution of the ME node, it consumes a macro block from the input frame, and produces a macro block output. Thus, we do not need the D node any more since type conversion is already made inside the ME node. A possible schedule becomes 99(ME,EN). Now, the buffer size becomes the macro-block size. Šœ Œ Œ œš tl `` k lu α β OˆP Šœ Œ Œ œš tl S`_ŸW ``Ÿ k lu α β O P Šœ Œ Œ œš V`` Šœ Œ S`_ŸW V`` tl lu S`_ŸW tl ``Ÿ lu Œ œš OŠP O P Figure 1. A subgraph of an H.263 encoder example: (a) SDF representation, (b) CSDF representation, (c) FRDF representation, and (d) revised CSDF representation. Even though the FRDF representation permits more intuitive representation, Figure 1(d) shows a CSDF graph with the same buffer requirements. The ME node consumes a whole frame at the first execution followed by 98 executions without input sample but produces one macro block at each execution. In case the input frame of the previous node is prepared at once, this modified CSDF and the FRDF solutions are the same in terms of memory requirements. However, there is a difference. The 3

4 FRDF solution can execute the ME node before the entire current frame arrives at the input arc while the CSDF solution can execute the node only after the entire current frame is available. 3. STATIC ANALYSIS Existent dataflow models can be extended to support fractional rates without losing any analytical property. In this section, we extend the SDF model to support fractional rates and discuss its static analyzability. The similar discussion can be made to other dataflow models like CSDF. We start our discussion stating the following definition. Definition 1: Equivalent node For an FRDF node A, an equivalent SDF node (A ) is defined such that Q executions of the FRDF node A is equivalent to a single execution of A, where Q is the minimum number of executions to make all sample rates integer. The sample rates of the equivalent node are Q times those of the FRDF node. Q is called an equivalence factor. We denote this equivalent relation as A ~ A. Q p i qi SDF node exists. And the function body of the SDF node performs Q executions of the FRDF node at each execution. For sample rates { p i q i } of a FRDF node, the equivalent SDF node has sample rates. Since q i divides Q, such an We also define the equivalence relation between an FRDF extension of an SDF graph and a pure SDF graph as follows. Definition 2: Equivalent graph For a given FRDF extension of an SDF graph, the equivalent graph is defined as the SDF graph obtained after all FRDF nodes are replaced with their equivalent SDF nodes as stated in definition 1. The key analytical property of the SDF model is that the node execution schedule can be constructed at compile time if the graph is consistent [5] meaning the repeated execution of the schedule does not overflow buffers. The number of executions of node A within a schedule is called the repetition number x(a) of the node. For a given arc e, we denote src(e) as the source node and snk(e) as the destination node. The output sample rate of src(e) onto the arc is denoted as prod(e) and the input sample rate of snk(e) as cons(e). Then, the following equation, called a balance equation, should be held for a consistent SDF graph. x ( src( e)) prod ( e) = x( snk ( e)) cons( e) for each e. (1) Now, we arrive at the following theorem. 4

5 Theorem 1. An FRDF extension of an SDF graph is consistent if the equivalent SDF graph is consistent. <proof> Suppose that an SDF node is substituted for an equivalent FRDF node with Q equivalence factor and equation (1) holds for the equivalent SDF graph. Then, the sample rate p i of each port becomes p i Q in the equivalent FRDF node. Let the repetition number of the FRDF node be Q times the repetition number of the SDF node. Then equation (1) becomes prod( e) (2) Q { Q x( src( e)) } = x( snk( e)) cons( e) for an output arc e of the FRDF node. Similar argument holds for any input arc. Thus if the equivalent SDF graph is consistent, the FRDF graph is consistent. Q.E.D. The theorem tells how to compute the repetition numbers of FRDF nodes in the extended SDF graph. The repetition number of an FRDF node is the product of the equivalence factor of the node and the repetition number of the equivalent SDF node in the equivalent SDF graph. Consider an FRDF example of Figure 2(a). The equivalent SDF graph is drawn in Figure 2(b) where equivalence factor of nodes B and C are 4 and 2 respectively. The key difference between these two graphs lies in the execution schedule of nodes. The schedule of the SDF graph becomes 4(A)B C meaning that node A is executed 4 times before node B and C are executed in sequence. If we replace B and C with the original FRDF nodes B and C, the schedule becomes as displayed in Figure 2(d). If we allow fractional rate, however, we may have more buffer-efficient schedule. After executing node B twice, half of the data is filled out, which can be consumed at node C. Therefore, we can obtain the following schedule 2(2(AB)C) as shown in Figure 2(c). We save the half of the buffer size on arc BC from this schedule with careful buffer sharing. A 1 1 B 1/4 1/2 C (a) Schedule: (2(2(AB)C) (c) (B ~ 4B) (C ~ 2C) A 1 4 B 1 1 C (b) Schedule: 4A(B =4B)(C =2C) (d) Figure 2. (a) An FRDF graph, (b) the equivalent SDF graph, (c) a schedule result with fractional sample rate, and (d) another schedule result with integer sample rate 5

6 4. H.263 ENCODER EAMPLE In this section, we compare the hand-coded reference code and the synthesized codes from three dataflow models with H.263 encoder example handling composite-type video frame data: SDF, CSDF, and FRDF. We use tmn (ver.2.0) H.263 encoder as the reference code, which has been developed by Telenor Research and Development [12]. For fair comparison, we use the same function body for each block as the associated function definition in the reference code. Since the dataflow models implement only the basic mode defined in the standard, we manually remove the extra code sections for the advanced coding mode from the reference code. As a result, the code size and the execution cycles are not much different from each other. Simplified graph specifications with SDF and CSDF model are shown in Figure 3 and Figure 4 respectively. We already discussed this example in section 2 to motivate the FRDF specification. Remind that the key difference lies in the type conversion between the video frame type and the macro-block type. In the SDF and CSDF representations, the Distributor node performs this type conversion explicitly, thus requires the frame-size buffer space on the input arc. yœˆ m kœ ŠŒ t `` tˆš k š œ i Š lš ˆ l Š Ž `` }ˆ ˆ Œ sœ Ž j Ž t `` j Œ šˆ tˆš i Š kœš Ž ~ Œ { kœ ŠŒ Figure 3. H.263 encoder in SDF yœˆ m kœ ŠŒ t lš ˆ S`_ŸW tˆš k š œ i Š l Š Ž ``Ÿ }ˆ ˆ Œ sœ Ž j Ž `_ŸWS `_ŸWS t j Œ šˆ ``Ÿ tˆš i Š kœš Ž ~ Œ { kœ ŠŒ Figure 4. H.263 encoder in CSDF A fractional rate dataflow specification is shown in Figure 5, where the type conversion occurs at the input port of the Motion Estimation node by the notion of fractional rate 1/99. Therefore, we do without the Distributor node in the FRDF specification, saving the buffer space of the frame size on the output arc of the Motion Estimation node. yœˆ m kœ ŠŒ V`` V`` t lš ˆ tˆš i Š l Š Ž }ˆ ˆ Œ sœ Ž j Ž V`` V`` t j Œ šˆ tˆš i Š kœš Ž ~ Œ { kœ ŠŒ V`` Figure 5. H.263 encoder in SDF with fractional rate 6

7 Table 1 compares the buffer sizes between the reference code and the synthesized codes from three dataflow specifications: SDF(Figure 3), CSDF(Figure 4), and FRDF(Figure 5). The buffer requirements are largely proportional to the number of frame instances since frame data structure is significantly larger than other data types. Note that the reference code is not fully optimized, so it maintains five frame structures throughout the encoding procedure. On the other hand, in the case of FRDF specification, the minimum number of required frame instances is statically analyzed at compile-time. H.263 in extended SDF with fractional rate reduces data memory requirements by 67% and 23% than in SDF and CSDF respectively. Table 1. Buffer size comparison between the reference code and the synthesized codes from SDF, CSDF, FRDF graphs Reference Code SDF CSDF FRDF Buffer size 361KB 686KB 291KB 225KB No. of frame-type data We can reduce the buffer size further if the output sampling rate of the Read From Device node becomes fractional. Assume that the output sampling rate is reduced to 1/2. Then, the buffer size on the associated arc can be reduced to half of the frame size. In other words, we do not need a full frame size memory even though the data unit has the frame size. Thus, fractional rate has advantages of less latency and less buffer memory. 5. ETENDED INTERPRETATION OF FRACTIONAL RATE The notion of fractional rate is understandable when a composite data type is composed of multiple constituent data types. Then, what is the meaning of the factional rate for an atomic type such as integer or float? The notion of fractional rate could be found in the Boolean Dataflow (BDF) model of computation introduced by Buck et al. [13]. The Switch node in a BDF graph has two outputs whose sample rates are dynamically determined by the value of the control data sample (Figure 6). They denote the sample rate of the TRUE arc as p which is a fractional value unknown at compile time. Their interpretation of the fractional rate is the ensemble average of the output sample rates. Our interpretation is similar to theirs for atomic data types. Unlike their approach, however, the fraction value is known and fixed at compile-time so that the fully static analysis of the graph is still possible. O{y l ˆ ŠP O ˆ ˆP zž Š T Omhszl ˆ ŠP OŠ P Figure 6. Switch node in the Boolean Dataflow model. The value p indicates the fractional sample rate. Such statistical interpretation opens another possibility of buffer memory minimization. Figure 7(a) shows an SDF-style multiplexer node where it consumes two inputs at once and routes one sample to each output at each execution. Since the input port needs two samples per execution, the buffer memory of the input arc should be no less than two. An equivalent 7

8 node of the CSDF model saves one buffer space by varying the output sample rate periodically as displayed in Figure 7(b): the upper output port produces one sample at the first invocation and no sample at the next while the lower output port produces no sample at the first and one sample at the next invocation. Figure 7(c) shows an FRDF representation of the multiplexer node, where the average sample rate of each output port becomes 0.5. Since a sample is guaranteed to be produced after two firings of the node, this node behaves like the SDF-style multiplexer. However, we need only one buffer at the input arc of the multiplexer. Y t YŸ t SW t VY WS VY OˆP O P OŠP Figure 7. A multiplexer node: (a) an SDF representation, (b) a CSDF representation, and (c) an FRDF representation A subtle difference can be found between the FRDF-style multiplexer and the CSDF-style multiplexer. In the FRDF-style multiplexer, the order of sample production is not specified. In other words, the output sample rate of the upper arc can be 0, 1, 1, 0, 0, 1, 0, 1, as long as 2k executions of the node generate k output samples for all integer k. Hence the output sample rates may be aperiodic in the FRDF. The same is true for the SDF-style multiplexer. This implies how to determine the fractional sample rate of each port in a block. The body of an FRDF block will have a periodic behavior in terms of how many samples will be consumed or produced. Unlike the CSDF, however, the pattern of the sample rate variation for each port need not be fixed nor periodic. Consider a down-sampling FRDF block whose input sample rate is 1 and the output sample rate is 1/3. The fractional rate only indicates that the block generates one sample per every 3 executions. For example, the output sample rate may become {1, 2x0}, {0,1,0}, or {2x0,1} in the CSDF context. The inclusion relationship between SDF, CSDF, and FRDF extension of SDF is shown in figure 8. jzkm zkm mykm ŒŸ Œ š Figure 8. Inclusion relationship between dataflow models It should also be noted that sample rates 1/2 and 2/4 are different. The former generates one sample per every 2 executions while the latter may generate no sample at the first two executions if it generates two samples during the last two executions. At the first look, it is counter-intuitive. But it is not actually because the sample rate is determined by the periodic behavior of the function. 8

9 For a fractional sample rate p i /q i for port i, q i indicates the period of the repeated node behavior and p i indicates the number of samples consumed or produced during period q i. We define q i as the repetition period of the port. In case an FRDF node has multiple ports with different repetition periods, the period of its repeated behavior is defined as the I/O period of the node. Definition 3: I/O period For an FRDF node, the I/O period is the period during which the node consumes or produces the same number of samples over iterations. It is the least common multiple of the repetition periods of all ports: I/O period = LCM (repetition periods of all ports). The I/O period of the node is in fact equivalent to the equivalence factor Q defined in Definition 1 if all fractions are irreducible. Consider a node with a fraction rate 2/4. The I/O period of the node becomes 4 while the equivalence factor can be 2. Therefore, the equivalence factor of Definition 1 should be replaced with the I/O period in the extended interpretation of fractional rate for atomic type. When we construct a schedule, we have to examine the firing condition of each node depending on the data types it handles: composite and atomic. Revisit the example of Figure 2(a). Now, we assume that the data type on arc BC is atomic. Then, the fraction rate 1/4 only means that one output sample will be produced after 4 executions of node B. So, node B should be scheduled 4 times before scheduling node C. Schedule 2(d) satisfies this condition so that it is a valid schedule. However, we have a better schedule 4(AB)2(C). The buffer size between nodes A and B is reduced to 1 from 4. Therefore, using the fractional rate is still beneficial for the atomic type data samples. We summarize the firing condition of FRDF node using the extended interpretation with the following lemma. Lemma 1. Firing condition of an FRDF node An FRDF node is fireable, or executable, if all input arcs satisfy the following condition depending on the data type: (condition 1) If the data type is atomic, there must be at least as many samples stored as the numerator value of the fractional sample rate. An integer sample rate is a special fractional sample rate with hidden denominator 1. (condition 2) If the data type is composite, there must be at least as large fraction of samples stored as the fractional input sample rate. In summary, fractional rate has different meanings for atomic data types and composite data types. For atomic data types, a fractional rate means the ensemble average during the I/O period of the associated node. For composite data types, it means the portion of the constituent data samples that compose the composite data type. The most popular data type of this sort is an array-type data: vector or matrix. If the data type does not have well-defined meaning of fractional rate, we regard the type as atomic and use the statistical interpretation of fractional rate. 9

10 Such interpretation should be consistent between the consumer and the producer of the data samples. Suppose that there is a two-dimensional array and the producer regards it as an array of row vectors while the consumer regards it as an array of column vectors. In this case, the two-dimensional array may not be regarded as a composite type data. Figure 9 illustrates this fact pictorially. œšœ VZ VY Š šœ Œ Figure 9. The consumer and the producer should have the same interpretation on the fraction of the composite data type. If they do not agree, the data type should be considered as atomic. 6. EPERIMENTS WITH ATOMIC DATA SAMPLES In this section, we compare three dataflow models with a compact disk to digital audiotape converter (CD2DAT) algorithm and a CELP algorithm handling atomic data samples. Figure 10 illustrates a CD2DAT specified in an SDF that converts 44.1kHz sampling data to 48KHz data. In SDF, FIR3 produces 5 samples at once when 7 samples are available on the input port. A looped schedule with minimum buffer requirements is 7(7(3(FIR1)2(FIR2))8(FIR3))40(FIR4). Then the required buffer size on each arc becomes 6, 56 and 280 summing up to 342. Y Z [ ^ \ ^ [ mpy mpyy mpyz mpy[ Figure 10. A CD2DAT algorithm in SDF Figure 11 illustrates the CD2DAT in SDF with fractional rate. In fractional rate dataflow, FIR2 consumes a sample every execution and produces four samples at once every three executions. Similarly, FIR3 generates 5 samples every 7 executions while FIR4 produces 4 samples. The transformation from FIR3 to 7(FIR3 ) as well as FIR4 to 7(FIR4 ) enables a better schedule to be 7(7(3(FIR1) 2(3(FIR2 )4(FIR3 ))) 40(FIR4 )), which requires 6, 4 and 40 buffer memory summing up to 50. The reduction rate is 85% compared with the SDF case. Y [VZ \V^ [V^ mpy mpyy mpyz mpy[ Figure 11. CD2DAT in SDF with fractional rate Figure 12 represents the CD2DAT implementation in CSDF. Even though the computation of the accurate sample rate is complicated, as many buffers are required as specified in FRDF if a loop schedule is applied to the application. 10

11 SW YŸ ZŸSW [Ÿ ^Ÿ WSSSWSSS mpy mpyy mpyz WSSWSSWSS mpy[ ^Ÿ Figure 12. CD2DAT in CSDF Figure 13 illustrates an SDF sub-system for evaluating codebooks in a code excited linear prediction (CELP) algorithm[6]. In SDF, LPC Anal is fired only after 80 samples are available and 80 samples are produced at once. A looped schedule with minimum buffer requirements is 80(fork)(LPC Coeff)(LPC Anal)2((Code Search)10(Channel)(Code Lookup))(LPC Synth). Then, the required buffer sizes on arcs {a,b,c,d,e} become {80, 80, 10, 10, 80} summing up to 260. _W swj j Œ ˆ ŒŠŒ Œ š Œ šœ Œ š Œ _W swj _W [W j Œ W W j Œ [W _W j ˆ Œ h ˆ zœˆ Š s œ Š Œ swj z U _W Figure 13. A subgraph of a CELP algorithm in SDF The CSDF specification of the CELP algorithm gives a more efficient implementation in term of buffer requirements as shown in figure 14. LPC Anal is fired for each data sample and produces one output sample at each execution. When 40 samples are produced, the next Code Search block can be executed. Therefore CSDF has benefits of short latency and less buffer memory over SDF. A possible schedule is 80(fork)(LPC Coeff)2(40(LPC Anal)(Code Search)10(Channel)(Code Lookup)(LPC Synth)). The required buffer sizes on arcs become {80, 40, 10, 10, 40} summing up to 180. Note that the buffer size on the output arc of the LPC Anal block is reduced to 40. ŒŠŒ Œ š Œ _W swj j Œ ˆ S^`ŸW šœ Œ š Œ S^`ŸW _WŸ swj _WŸ [W j Œ W W j Œ [W _WŸ j ˆ Œ h ˆ zœˆ Š s œ Š Œ swj z U _WŸ Figure 14. CELP specification of figure 13 in CSDF Figure 15 represents an FRDF extension of the SDF graph. It has the same advantage over the pure SDF representation as the CSDF specification has in terms of buffer requirements. However, we can use the simpler analysis technique of SDF model to obtain the memory-optimized looped schedule than CSDF. 11

12 _W swj j Œ ˆ V_W ŒŠŒ Œ š Œ šœ Œ š Œ V_W swj [W j Œ W W j Œ [W j ˆ Œ h ˆ zœˆ Š s œ Š Œ swj z U Figure 15. CELP algorithm in SDF with fractional rate 7. CODE GENERATION To use fractional rate, the behavior of a block has to be changed accordingly. In this section, we show that fractional rate extension requires at most a negligible amount of code overhead to maintain the invocation index in the I/O period of the block. First we discuss block definitions with the composite data structure. Figure 16 (a) and (b) represent an SDF-style motion estimation block and its internal code definition. In the code definition, variable i is used as the index number of the output macroblock to compute the x and y coordinates. Figure 16(c) shows a part of the generated code from figure 3. Note that the buffer sizes of input and output ports are both frame-size since the output port consists of 99 macroblocks of which the total buffer size is equal to the frame size ME input output (a) for(i=0; i<99; i++) { x = (16* i)%144; y = 16*((16* i)/144); motionestimation(&input,x,y, &output[i]); } (b) main() { struct { unsigned char s[176*144]; } input; struct { char s[16*16]; } output[99]; { //ReadFromDevice } { // MotionEstimation for(i=0; i<99; i++) { x = (16* i)%144; y = 16*((16* i)/144); motionestimation(&input,x,y, &output[i]); }} } (c) Figure 16. (a) SDF-style motion estimation node, (b) block definition, and (c) the generated code from figure 3 The behavior of the equivalent FRDF block is depicted in figure 17(b). Two modifications from the SDF definition are distinguished. First, the FRDF definition describes the loop body without loop structure. The loop structure will be automatically generated from the execution schedule as shown in figure 17(c). Second, the loop index of the SDF definition is replaced with a phase variable. In FRDF model, a phase variable is used to represent how many fractional samples are produced or consumed. The phase variable is incremented at every execution and reset at every I/O period. For instance, in the ME node the 1/99 sample rate of input port leads to reset the phase variable to 0 every 99 executions. If the sample rate is not a fractional number we do not use phase variables since the phase variables will be reset every execution. 12

13 Figure 17(c) illustrates the generated code that is similar to figure 16(c) except the overhead of phase variable management. To remove the overhead further, we can replace the variable ME_phase_input with the loop index (figure 17(d)). The buffer size of output port reduces to macroblock size in FRDF because the subsequent block to process each macroblock is executed inside the same loop. 1/99 1 ME input output (a) x = (16*$phase(input))%144; y = 16*((16*$phase(input))/144); motionestimation(&input,x,y, &output); (b) main() { struct { unsigned char s[176*144]; } input; struct { char s[16*16]; } output; { // ReadFromDevice } for(int loop=0; loop<99; loop++) { { // MotionEstimation x = (16*ME_phase_input)%144; y = 16*((16*ME_phase_input)/144); motionestimation(&input,x,y, &output); ME_phase_input = (ME_phase_input+1)%99; } } } (c) main() { struct { unsigned char s[176*144]; } input; struct { char s[16*16]; } output; { // ReadFromDevice } for(int loop=0; loop<99; loop++) { { // MotionEstimation x = (16*loop)%144; y = 16*((16*loop)/144); motionestimation(&input,x,y, &output); } } } (d) Figure 17. FRDF-style motion estimation: (a) node specification, (b) code definition, (c) the generated code without and (d) with phase variable sharing The implicit specification of the phase variables in FRDF model requires additional code overhead in the block definition especially for primitive data. Consider the multimplexer example that is already mentioned in Figure 7. Figure 18 (b) and (d) represent the block definition of the SDF-style multiplexer node and FRDF-style respectively. The SDF-style code is simpler than the FRDF-style code since the buffer size of the input port is two. The input buffer stores all input samples and distributes them to output ports at once, which provides simple code definition. On the other hand, in the code definition of the FRDF-style multiplexer node, we need a phase variable to specify the invocation index within an I/O period. In this example, an input sample is mapped to the output1 port at the first execution and another sample is to the output2 port at second execution. The output ports in FRDF-style multiplexer node can produce a sample every other execution. input 2 Mux (a) 1 1 output1 output2 output1 = input[0]; output2 = input[1]; (b) 1 input Mux (c) 1/2 output1 1/2 output2 if($phase(output1)==0) output1 = input; else output2 = input; (d) Figure 18. (a) SDF multiplexer node, (b) block definition of the SDF node, (c) FRDF multiplexer node, and (d) the block definition of the FRDF node Figure 19(a) shows an example using the FRDF-style multiplexer node and Figure 19(b) represents the generated code. In this code, we use only one buffer, A_output, for input port of the multiplexer node while we need two buffers if we use the SDFstyle multiplexer node. 13

14 A 1 1/2 1 1 Mux 1/2 1 2(A,Mux) (B) B main(){ int A_output, Mux_output1, Mux_output2; while(1) { for(i=0;i<2;i++) { {A} {if(i==0) Mux_output1 = A_output; else Mux_output2 = A_output; }} {B}}} (b) (a) Figure 19. Code generation from FRDF: (a) an FRDF graph, and (b) its generated code 8. CONCLUSION In this paper, we propose a new dataflow extension called fractional rate dataflow in which fractional number of samples can be produced or consumed. The notion of fractional sampling rate matches well with multimedia applications where type conversion between composite data types and constituent data types are needed in the dataflow representation. Existent integer rate dataflow models can be easily extended to incorporate the fractional rates without loosing analytical properties. In this paper, the SDF model is extended to include FRDF, which can reduce the buffer memory requirements significantly (up to 70%) for multimedia applications. Even for the atomic type data, use of fractional sampling rate may reduce the buffer requirement by large amount, which is demonstrated with a CD2DAT rate conversion and a CELP example. The SDF dataflow model extended with fractional rates has been implemented in our high-level system synthesis environment called PeaCE(Ptolemy extension as Codesign Environment) [14]. The buffer size of an H.263 encoder specified in our environment is reduced down to what the hand-optimized code would require. This result will bring the proposed high level code synthesis methodology from dataflow specification closer to the practical use, we believe. While the FRDF representation could reduce the buffer requirements, finding out the buffer-optimal schedule is another research subject we are currently pursuing. 9. REFERENCES [1] COSSAP: A stream driven simulator, in Proc. IEEE Int. Workshop Microelectronics Commun., Interlaken, Switzerland, Mar [2] R. Lauwereins, M. Engels, J. A. Peperstraete, E. Steegmans, and J. Van Ginderdeuren, GRAPE: A CASE Tool for Digital Signal Parallel Processing, IEEE ASSP Magazine, vol. 7, (no.2):32-43, April, 1990 [3] J. T. Buck, S. Ha, E. A. Lee, and D. G. Messerschimitt, Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems, Int. Journal of Computer Simulation, special issue on Simulation Software Development, vol. 4, pp , April, [4] J. A. Sharp, Data Flow Computing. West Sussex, U.K.: Ellis Horwood, [5] E. A. Lee and D. G. Messerschmitt, Static scheduling of synchronous dataflow programs for digital signal processing, IEEE Trans. Comput., vol. C-36, no. 1, pp , Jan

15 [6] G. Bilsen, M. Engles, R. Lauwereins, and J. A. Peperstraete, Cyclo-Static Dataflow, IEEE Trans. Signal Processing, vol. 44, no. 2, Feb [7] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee, APGAN and RPMC: Complementary Heuristics for Translating DSP Block Diagrams into Efficient Software Implementations, DAES, vol. 2, no. 1, pp , January, [8] W. Sung and S. Ha, "Memory Efficient Software Synthesis using Mixed Coding Style from Dataflow Graph", IEEE Transactions on VLSI Systems, Vo.8, pp , Oct, [9] E. De Greef, F. Cattohoor, and H. DE Man, Array placement for storage size reduction in embedded multimedia systems, in Proc. Int. Conf. Applications Specific Systems, Architectures, Processors, Zurich, Switzerland, July 1997, pp [10] S. Ritz, M. Willems, and H. Meyr, Scheduling for optimum data memory compaction in block diagram oriented software synthesis, in Proc. ICASSP, Detroit, MI, May 1995, pp [11] P. K. Murthy, and S. S. Bhattacharyya, Shared Buffer Implementations of Signal Processing Systems Using Lifetime Analysis Techniques, IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 2, Feb., [12] Telenor Research, TMN (H.263) Encoder/Decoder, Version 2.0, Download via anonymous ftp to bonde.nta.no, June [13] J. T. Buck, Scheduling dynamic dataflow graphs with bounded memory using the token flow model, Ph. D. Dissertation, Dept. Elect. Eng. Comput. Sci., Univ. of California, Berkeley, Sept [14] PeaCE : Codesign Environment. 15

Efficient Code Synthesis from Extended Dataflow Graphs for Multimedia Applications

Efficient Code Synthesis from Extended Dataflow Graphs for Multimedia Applications Hyunok Oh The School of Electrical Engineering and Computer Science Seoul National University Seoul 5-742, KOREA TEL :