Combining Module Selection and Replication for Throughput-Driven Streaming Programs

Size: px

Start display at page:

Download "Combining Module Selection and Replication for Throughput-Driven Streaming Programs"

Candace Jefferson
5 years ago
Views:

1 Combining Module Selection and Replication or Throughput-Driven Streaming Programs Jason Cong, Muhuan Huang, Bin Liu, Peng Zhang and Yi Zou Computer Science Department, University o Caliornia, Los Angeles Abstract Streaming processing is widely adopted in many data-intensive applications in various domains. FPGAs are commonly used to realize these applications since they can exploit inherent data parallelism and pipelining in the applications to achieve a better perormance. In this paper we investigate the design space exploration problem (DSE) when mapping streaming applications onto FPGAs. Previous works narrowly ocus on using techniques like replication or module selection to meet the throughput target. We propose to combine these two techniques together to guide the design space exploration. A ormal ormulation and solution to this combined problem is presented in this paper. Our objective is to optimize the total area cost subject to the throughput constraint. In particular, we are able to handle the eedback loops in the streaming programs, which, to the best o our knowledge, has never been discussed in previous work. Our methodology is evaluated with high-level synthesis tools, and we demonstrate our worklow on a set o benchmarks that vary rom module kernel design such as FFT to large designs such as an MPEG- decoder. I. INTRODUCTION The primary goal o implementing streaming applications is to meet throughput requirements. Streaming programs are usually modeled in graphs, where each node in the graph represents an actor. Previous approaches to increasing throughput ocus on replicating the bottleneck actors [] [7]. Those approaches point out that stateless actors, which have no dependence between consecutive executions, oer much data parallelism opportunities that can be exploited by replication o these actors. This strategy is targeted at a multicore architecture [] [6] or a multicore architecture with accelerators such as GPU [7]. For example, a greedy algorithm or mapping actors to dierent cores was proposed in []. I there are ewer actors than cores, the partitioner considers the actors in decreasing order o their computational requirements and replicates the candidate actors until all the cores are occupied. Conversely, i there are more actors than cores, the partitioner repeatedly uses the least demanding actors until all actors it in the multicore. In the context o the FPGA platorm, a similar idea was also exploited. [8] [9] extend the idea o actor replication to the FPGA platorm with consideration o resource limitations. For example, in [8] the proposed algorithm essentially perorms maximal replication o all the stateless actors bounded only by the FPGA resource, then uses the actors that do not aect the throughput. While replication does increase the throughput, such a mechanism is /DATE/ 0 EDAA quite resource-consuming since occupied resources increase quickly as the number o replicas increases. High resource utilization might lead to a timing closure problem in FPGA implementation. Another disadvantage o this scheme is that it does not evaluate dierent possible implementation options or the actors to improve the perormance; thus, only limited design space is explored. In this paper we address the ollowing issue: how do we design area-eicient FPGA implementations or streaming programs under the throughput constraints? We notice that most actors in streaming programs could have dierent implementations that trade o the processing throughput and area. Their combinations provide great opportunities or optimization since much area can be saved by selecting slower implementations or non-critical actors. This idea o module selection has been extensively explored in the past (or example in [0] []). However, these studies target a ine-grained circuit level, and thus replication techniques are not employed. Moreover, most studies ocus on the tradeo between area and time, and do not consider throughput as a constraint. In this work we are interested in building a library o implementations and selecting the best implementation rom the library to replicate. Thus, our strategy could be viewed as a combination o module replication and module selection. As ar as we know, we are the irst to tackle the DSE problem by combining these two techniques. In contrast to previous work on design space exploration [] [8] [9] that considers only streaming programs without eedback loops, we also take eedback loops into consideration since these are usually signiicant throughput bottlenecks in streaming programs [5]. The remainder o the paper is organized as ollows. Section II provides background and our motivating example. Section III details the ormulation and our methodology or the design space exploration problem. Section IV presents experiment results, and we conclude in Section V. II. BACKGROUND AND MOTIVATING EXAMPLE In this work we use a synchronous data low (SDF) [6] as our computation model. Streaming applications are represented as SDF graphs (SDFGs.) Nodes in the graph are actors, and the edges are called channels. In general, actor replication is not an area-eicient design technique or increasing throughput because control logics, which do not directly contribute to throughput, are also duplicated. Moreover, some computing logics might be wasteul ater replication. For example, consider an actor that contains

2 (, 6059) (, 33) (, 3556) (8, 5) (6, 668) SLICE DSP (3, 5) (5, 75) Initiation Interval Fig.. Tradeo between perormance and area o one actor in benchmark ilterbank. two add and one multiply operation, while the slowest implementation would use one adder and one multiplier to do the computation. When we duplicate such an implementation, two adders and two multipliers are generated, and one multiplier is wasted. When mapping an actor in streaming programs onto FPGAs, by setting dierent design goals and deploying dierent optimization conigurations we can generate unctionally equivalent implementations that dier in area and perormance. Thereore, we could generate a library o implementations or each actor so that design space is enlarged. For example, initiation interval (II) is a well-known parameter that trades o between area and throughput. In streaming applications, the initiation interval speciies the number o cycles between the consecutive iring o an actor and indicates to what extent the actor is pipelined. A pipelined actor oers high throughput, but usually at a large area cost because resource sharing opportunities are reduced. Fig. plots the initiation interval and area cost o our FPGA-based implementations or one actor in the StreamIt benchmark ilterbank [7]. The initiation interval o these design points are,,, 8, 6 and 5, respectively. We can see rom the igure that decreasing the initiation interval results in more DSP logics and slices. To achieve a throughput o one iring per cycle, we could either replicate the slowest implementation (with II = 5) into 5 copies, or use the implementation with II = directly. Obviously the latter option is more area eicient. It s worthwhile to mention that pipelining does not guarantee that the throughput target can be satisied, and replication is still needed in cases where a ully pipelined design cannot satisy the high throughput target. There are also situations when hardware designers choose to use the predeined IP cores to avoid high development cost, and they do not have many choices or dierent implementations. III. PROBLEM FORMULATION AND SOLUTIONS Consider a stream graph that has M actors. When mapping actor m in the stream graph to a physical module on FPGAs, we can have dierent implementations Pm, Pm,...,P S m m, where each implementation Pm s can perorm the unctionality o m with area cost A(Pm), s initiation interval δ(pm), s and execution time t(pm). s S m denotes the number o implementations we have or actor m. Moreover, to meet the throughput target, m in ( m ) in ( m ) in3 ( m ) m 8 Fig.. out ( m )=max{ in ( m )/, in ( m )/, in3 ( m )/} * out ( m )=max{ in ( m )/, in ( m )/, in3 ( m )/} * 8 Throughput target propagation. can be implemented as several replicas o one implementation selected rom the library. The DSE problem or streaming applications requires selecting implementations or each actor under throughput constraint while minimizing the total area cost. Such a throughput-driven DSE problem or streaming applications diers rom traditional DSE problems in that we need to maintain a balance between the consecutive actors; i.e., it is not beneicial to increase the throughput o a producer actor when its producer rate is higher than the highest consumer rate that the ollowing actor can sustain. A. Throughput Calculation For actor m and its implementation P s m, the maximal input throughput θ in (P s m) and output throughput θ out (P s m) o m are calculated as θ in (P s m) = In( m) δ(p s m), θ out(p s m) = Out( m) δ(p s m), () where In( m ) and Out( m ) equals the number o data tokens that m consumes on the input data channel and produces on the output data channel during each iring. And δ(p s m) is the initial interval o P s m. This deinition is dierent rom [8] in which the authors substitute δ(p s m) with t(p s m), because their deinition does not consider pipelined implementations. I an actor has multiple input channels, the channel with the lowest data consumer rate is used in unction (). The system throughput is measured at the irst actor in the SDFG. It measures how many data tokens the system can consume in one clock cycle. We notice that or the SDFG that does not contain eedback loops, we are able to express the relationship between the system throughput target and the throughput target o each actor. Thus, we can break down the original problem into independent module selection subproblems or each actor and greatly reduce the problem size. For a SDFG with eedback loops, system throughput is related to the total execution time o the actors on the eedback path, and thus we need to consider the operation semantics o these actors. We will describe the detailed ormulations o these two scenarios in the ollowing two subsections. B. SDFGs Without Feedback Loops Let us irst discuss how to propagate the input throughput target to the output throughput target or a single actor. Denote the number o input and output channels or actor m as numin( m ) and numout( m ). The throughput target on the input/output channel is denoted as θ j in ( m)/θout( k m ), where j numin( m ) and k numout( m ). Now, given the input throughput target θ j in ( m), the output throughput target

3 (a) 5 (c) 3 5 (b) 5 3 Fig. 3. (a) An example o the simplest orm o eedback loop. (b) An example o a more complicated orm o eedback loop. (c) and (d) are two ways to transorm (b) into a simpler orm. In (c), and are merged into one actor. In (d), assume has a longer execution time than, then we only need to consider in the eedback path. θout( k m ) is calculated as { θ k out( m ) = max j θ j in ( m) In j ( m ) } (d) Out k ( m ), k numout( m ). () The use o max presumes that the selected physical implementations o m could sustain the highest input throughput target. Fig. illustrates the throughput target propagation process. The system throughput target is enorced on the input channels o the irst actor. Using ormulation (), we can propagate the throughput target to each input and output channel in the graph. We can also derive the initial interval target (δ tgt ( m )) or actor m as { } δ tgt ( m ) = min j In j ( m ) θ j in ( m). (3) Thereore, given an SDFG and the system throughput target, we can compute the initial interval target or each actor. Then, the DSE problem can be described as ollows: Given a throughput (or initial interval) constraint or an actor, which implementation should be selected and how many replicas are needed so that the aggregated throughput o all the replicas is no less than the throughput constraint? That is, or actor m we want to get binary integers x, x,...,x Sm indicating which implementation is selected, and integer u m indicating number o replicas needed. Then the ormulation o the problem is: minimize u m Sm A(Pm)x i i i= subject to u m Sm S m i= i= x i = δ(p i m)x i δ tgt ( m ) where δ tgt ( m ) is calculated as in unction (3). To solve this problem, we could simply enumerate all the possible values o x i, compute the number o replicas needed based on the throughput constraint, compute the area cost, () 3 (a) 3 (b) Fig.. Precedence graphs o eedback loop in Fig. 3(a). According to the data dependency, in each iteration actor could ire once ater every two irings o actor. In Fig. (a), each actor has only one physical implementation, thus each pair o consecutive irings is connected with a dashed edge. In Fig. (b), actor has two physical modules. According to the cyclic scheduling policy, the irst and third irings are scheduled to one physical implementation, while the second and the ourth irings are scheduled to another physical implementation. and select the implementation that consumes the least area. The complexity o this method is O(S m ), where S m is the number o implementations in the library or actor m. C. SDFGs With Feedback Loops ) Feedback Loops: In this section we will discuss the simplest orm o eedback loop actors are sequentially connected with each other along the eedback path. Despite strong restrictions, many SDFGs that do not meet such requirements can still be transormed into this simple orm as shown in Fig. 3. Fig. 3(a) shows examples o the SDFGs that we are dealing with. Fig. 3(b) is an SDFG that violates the requirement it has two actors iring concurrently rather than sequentially. We can either merge these two actors (Fig. 3(c)) or only consider the bottleneck actor in the loop (Fig. 3(d).) The iteration o a eedback loop is a set o actor irings with as many irings as in the corresponding entry in the repetition vector q [8]. Ater one iteration, the SDFG returns to the same state; i.e. the number o tokens on each channel remains the same. In Fig. 3(a), the repetition actor q is [;;]. We also assume that the initial data tokens on the eedback loop are located at the input edge o the irst actor in the eedback path. Those eedback tokens, or in other words, dependency distance, are intrinsic in the algorithm. In act, such distance is in many streaming applications. Thereore, we also assume that the number o eedback tokens is just enough or the irst actor to ire or one iteration. ) Formulation: In the eedback loop, the throughput target imposes a constraint on the execution time o all the actors in the loop. Semantics o how the actors are ired are the key to estimate the latency o the loop. To delineate such a iring sequence, we construct a precedence graph as in [9]. The weight o edge denotes by how many cycles two irings should be separated. Fig. shows the precedence graph o a eedback loop in Fig. 3(a). Each node in the graph represents an actor iring. Solid edges denote the data dependency between adjacent actors. Dashed edges relect the physical resource conlict between consecutive irings o the same actor. Physical resource conlict depends on the number o physical modules

4 (implementations) selected, their initiation intervals, and how the irings are scheduled to these physical modules. We adopt the cyclic scheduling policy, which assigns a iring to the irst available physical module. Two irings that are scheduled to the same physical module should be separated by at least as many cycles as the initiation interval. In the precedence graph, edges with a dot denote the dependence across the iteration. Our problem has two parts: module selection and scheduling. In module selection, we select one implementation and decide the number o replicas or each actor, thus inalizing the area cost, module execution time, initiation interval and speciic structure o the precedence graph. In module scheduling, we schedule the start time o each node in the precedence graph to see i it can satisy the throughput target. Usually scheduling is perormed ater the module selection, but in our proposed ormulation we perorm module selection concurrently with scheduling. Moreover, we should generate a scheduling that is easible among all iterations, although we are only scheduling the irings in one iteration. Thus, we unold the precedence graph to include the s irings in the second iteration (as shown in Fig. 5). The reason is that i has the same iring timeline in iteration as in iteration, then all the other actors would also have the same timeline in the irst two iterations. And i the second iteration has the same iring timeline as the irst iteration, we are guaranteed to get a periodical scheduling. The total area cost, execution time, and initiation interval o the selected implementation or actor m is denoted as A m, t m and δ m respectively. We introduce binary integers x i m to denote i implementation P i m is selected and use u m to denote the number o replicas. Then we have A m = u m Sm A(Pm)x i i m, (5) i= δ m = S m δ(pm)x i i m, (6) i= t m = S m t(pm)x i i m, (7) i= S m i= x i m =. (8) In the unolded precedence graph, the set o nodes in the graph is V, and the set o edges is E. For e(u, v) E, denote the weight as w(u, v). I u and v are the irings o dierent actors m and m+, then I u and v are dierent irings m k,j m, then w(u, v) can be expressed as: j, j, j < j, w( k,j m, k,j m ) = w(u, v) = t m. (9) and k,j m o the same actor { δm, i j j = u m 0, otherwise. (0) Since u m is an unknown variable, every two irings o the same actor can be connected with an edge. And since the number o irings or m in one iteration is q[m], the number o such constraints is O(q[m] ). In realistic cases, we usually Fig t t t t 3 Unolded precedence graph o eedback loop in Fig. 3(a). know the maximum possible number o replicas ( m ). Thus the number o constraints is O(q[m] m ). Denote the start time o the jth iring o actor m in iteration k as T (m k,j ). We can ormulate the problem as ollows, M minimize A m, subject to m= j {,,..., q[m] }, T (m,(j+) ) T (m,j ) = T (m,(j+) ) T (m,j ),() j {,,..., q[]}, T (,j,j ) T ( ) In( ) q[] θ tgt in (, () ) e(u, v) E, T (v) T (u) w(u, v). (3) where θ tgt in ( ) is the throughput target o the eedback loop, which is measured at actor. Constraints () and () guarantee that the schedule is periodical and should meet the system throughput. Constraint (3) ensures the dependency that iring v ollows u by at least w(u, v) cycles. w(u, v) is calculated as shown in unctions (9) and (0). Functions (5) to (3) present our ormulation o the DSE problem or an SDFG with eedback loop. 3) Solution: The above set o ormulas are diicult to solve because unction (5) and (0) contain nonlinear expressions. Thereore, we propose the ollowing techniques to transorm the ormulation into linear expression so that we can use an ILP solver to solve the problem. Let us look at unction (0) irst. We could enumerate the possible values o u m as,,..., m and introduce binary integer variables ym i to denote which one is the actual value o u m. Thus w(m k,j, m k,j ) and u m is: w( k,j m, m k,j ) = δ m y j j m, () m u m = i ym, i (5) m i= i= Combining unction (6) and (), we have: w( k,j m, k,j m ) = ( S m i= 3 y i m =. (6) δ(p i m)x i m) y j j m. (7) Due to the product o unknown variables x i m and y j j m, the unction is not linear. Thus, we propose the ollowing theorem:

5 Theorem : Function F () is a unction o integer variable n x and binary variable y,y,...,y n. y i =, n >. Given i= the value o binary variables, i F () is a linear unction o x, then F (x, y, y,..., y n ) 0 is equivalent to the ollowing set o unctions which are all in the linear orm: F (x,, 0,..., 0) ( y ) in x F (x,, 0,..., 0) F (x, 0,,..., 0) ( y ) in x F (x, 0,,..., 0)... F (x, 0, 0,..., ) ( y n ) in x F (x, 0, 0,..., ) (8) where in means the inimum o the unction. We can also derive a similar set o unctions or the case where (x, y, y,..., y n ) 0 (which is omitted here due to page limitation). This theorem can be easily proved by enumerating all the possible values o binary variables. In unction (7), the variables x i m and y j j m correspond to the binary variables y i and x in the above theorem. We could also apply this theorem to unction (5); thus all o the ormulation o our problem is in a linear orm, and can be solved with an ILP solver. The total number o variables is O( M m= O( M m= S m + M m= m q[m]). m ), and the number o constraints is IV. EXPERIMENT RESULTS The experiment is carried out in two parts. For the streaming programs without eedback loops, we evaluate our strategies using StreamIt benchmarks [7]. For those containing eedback loops we examine our methodology using the MPEG- decoder. Implementation o these benchmarks on an FPGA is carried out by the high-level synthesis tool Autopilot [0] rom AutoESL Xilinx (version: AutoESL 0.). It takes C code as input and generates synthesizable RTL code which can be ed into a Xilinx ISE design suite. Thus, most benchmarks are rewritten into C code. The FPGA device we use is the Xilinx Virtex6 XC6VLX0T board. The target FPGA clock cycle is set at 00 MHz. Initiation intervals are speciied in the C program by Autopilot pragmas. Estimated execution time and resource usage (i.e., DSP, Block RAM, FF and LUT) are provided by the Autopilot synthesis report. The area metric is empirically estimated as the maximum utilization percentage o any o these our kinds o resources. A. Results on StreamIt Benchmarks StreamIt benchmarks provide us with high-level structure and behavior or realistic streaming applications. The actors in StreamIt benchmarks are restricted to have only one input/output channel, and data distribution and merging are implemented as split-join nodes. Since we allow multiple input/output channels in SDF, we modiy the benchmarks to embed the split-join nodes into the actors. For example, i an actor m is ollowed by a split node, we embed the split node into m. Within m, data tokens are distributed to multiple output channels. We compare our methodology with two baselines: ) the strategy in previous work [8] sys Parser 6 PP 58 Syn. Node 58 CC_MC TU (a) IDCT Fig sys Parser Syn. Node 58 pass CC_MC TU through 6 6 PP IDCT (b) SDFGs o MPEG- decoder. [9] that simply replicates the non-pipelined actors to reach high throughput; ) the strategy that all the actors are ully pipelined. In the irst case, throughput is usually achieved at a large area cost, while in the second case, non-bottleneck actors are over-perormed and thus occupy more area than they really need. The comparison results are shown in Table I. For each benchmark we show the area o baseline, baseline and our method that combines module selection and replication (CMSR or abbreviation). We also show the area reduction. B. Results on MPEG- Decoder MPEG- is a collection o methods and standards to perorm audio and video coding. The reerence C code or the MPEG- decoder is provided by Xilinx. The C code has already been developed with synthesis in mind, and most unsynthesizable constructs (such as dynamic allocation) are avoided. The design description and block diagram are presented in []. The modules are connected by either FIFOs or shared memory. The MPEG- decoder can be represented in an SDFG as shown in Fig. 6(a). There are ive actors in the graph: Parser, CC MC (copy control and motion compensations), TU (texture update), PP (pre-processor) and IDCT. The data ormat is set as CIF and the data token granularity is set at the macroblock (6x6 pixels) level. So, a CIF-size image rame (70x576) corresponds to x36 macroblocks. The iring o CC MC requires the output data rom TU in the previous rame; thus there is a synchronization node between CC MC and TU. While maintaining the correctness o the algorithm, we can transorm the SDFG in Fig. 6(a) into Fig. 6(b) a pass-through actor is added. There are two eedback paths in the new SDFG, and we select the path that takes the longer execution time (estimated rom the Autopilot synthesis report) as the eedback loop. We use unction block pipelining inside some actors to generate hardware modules with dierent II. For example, IDCT is urther broken into two modules, IDCT row and IDCT col, and they can be pipelined. The initial interval o IDCT is the larger value between the execution time o IDCT row and IDCT col. The implementation library we generated or Parser, PP, IDCT, CC MC and TU is shown in Table II. In Table III we show that the minimum area cost or a given throughput target that varies rom 30ps to 60ps. For each actor, we list the selected implementation, number o replicas, and total area usage or the actor. As we can see rom the table, bottleneck actors are Parser, PP and IDCT, which require a larger number o replicas and implementation with smaller II. It is worthwhile to point out that Autopilot uses the worst case latency as the estimated ex-

6 TABLE I DESIGN SPACE EXPLORATION FOR THREE STREAMIT BENCHMARKS. Filterbank FFT Autocor Throughput (# o data per cycle) Baseline Area Baseline Area CMSR Area Reduct -9.5% -6.8% -.% -6.6% % -3.8% 0 Reduct -36.5% -6.% -77.6% -3.7% -5.8% -6.% 0-39.% -59.% TABLE II IMPLEMENTATION LIBRARY FOR MPEG- DECODER KERNELS Parser PP IDCT CC MC TU v v v v v3 v v v3 v v v3 v v II FF LUT TABLE III DESIGN SPACE EXPLORATION FOR MPEG- DECODER. Throughput Parser PP IDCT CC MC TU Total Area (ps) impl rep area % impl rep area % impl rep area % impl rep area % impl rep area % % 60 v v 3.07 v 3.7 v 0.8 v v v 3.07 v 0.98 v 0.8 v v 3.9 v3.55 v 0.98 v 0.8 v v 3.7 v.53 v 0.98 v 0.8 v ecution time, thus may underestimate the actor s perormance in reality. Parser is such a case. The ILP solver we use is GLPK [], and results are generated on the order o minutes. V. CONCLUSION In this paper we studied the design space exploration problem o mapping streaming applications onto FPGAs. Our method diers rom existing methods because it combines both module selection and replicationto ind a better design point in design space. The method is veriied with both small designs like FFT and large designs like the MPEG- decoder, and results can be computed in minutes using an ILP solver. In the uture, we would like to investigate more on the communication cost in this design space exploration problem. ACKNOWLEDGMENT This research was partially supported by the NSF Expedition in Computing Award CCF-0967 (Center or Domain- Speciic Computing), Altera and Mentor Graphics. REFERENCES [] M. I. Gordon, W. Thies, and S. Amarasinghe, Exploiting coarse-grained task, data, and pipeline parallelism in stream programs, SIGOPS, vol. 0, pp. 5 6, 006. [] M. I. Gordon et al., A stream compiler or communication-exposed architectures, SIGARCH, vol. 30, pp , 00. [3] M. Kudlur and S. Mahlke, Orchestrating the execution o stream programs on multicore platorms, SIGPLAN, vol. 3, pp., 008. [] A. Hormati et al., MacroSS: macro-simdization o streaming applications, SIGARCH, vol. 38, pp , 00. [5] A. Hormati et al., Flextream: Adaptive compilation o streaming applications or heterogeneous architectures, in PACT, 009, pp. 3. [6] S. Liao et al., Data and computation transormations or Brook streaming applications on multiprocessors, in CGO, 006, pp [7] A. Udupa, R. Govindarajan, and M. J. Thazhuthaveetil, Synergistic execution o stream programs on multicores with accelerators, SIGPLAN, vol., pp , 009. [8] A. Hagiescu et al., A computing origami: Folding streams in FPGAs, in DAC, 009, pp [9] F. Plavec, Z. Vranesic, and S. Brown, Enhancements to FPGA design methodology using streaming, in FPL 009, 009, pp [0] M. Ishikawa and G. De Micheli, A module selection algorithm or high-level synthesis, in ISCAS, 99, pp vol.3. [] I. Ahmad, M. Dhodhi, and C. Chen, Integrated scheduling, allocation and module selection or design-space exploration in high-level synthesis, CDT, vol., no., pp. 65 7, 995. [] K. Ito, L. Lucke, and K. Parhi, ILP-based cost-optimal DSP synthesis with module selection and data ormat conversion, VLSI, vol. 6, no., pp , 998. [3] W. Sun, M. J. Wirthlin, and S. Neuendorer, FPGA pipeline synthesis design exploration using module selection and resource sharing, TCAD, vol. 6, no., pp. 5 65, 007. [] D. Chen, J. Cong, and J. Xu, Optimal module and voltage assignment or low-power, in ASP-DAC, 005, pp [5] W. Thies and S. Amarasinghe, An empirical characterization o stream programs and its implications or language and compiler design, in PACT, 00, pp [6] E. Lee and D. Messerschmitt, Synchronous data low, Proceedings o the IEEE, vol. 75, no. 9, pp. 35 5, 987. [7] StreamIt benchmarks, benchmarks.shtml. [8] E. A. Lee and D. G. Messerschmitt, Static scheduling o synchronous data low programs or digital signal processing, Computers, IEEE Transactions on, vol. C-36, no., pp. 35, 987. [9] Govindarajan et al., Rate-optimal schedule or multi-rate DSP computations, J. VLSI Signal Process. Syst., vol. 9, pp. 3, 995. [0] J. Cong et al., High-level synthesis or pgas: From prototyping to deployment, TCAD, vol. 30, no., pp. 73 9, 0. [] P. Schumacher et al., A scalable, multi-stream mpeg- video decoder or conerencing and surveillance applications, in ICIP 005, vol., 005, pp. II [] GNU Linear Programming Kit,

FPGA PLB EVALUATION USING QUANTIFIED BOOLEAN SATISFIABILITY

FPGA PLB EVALUATION USING QUANTIFIED BOOLEAN SATISFIABILITY Andrew C. Ling Electrical and Computer Engineering University o Toronto Toronto, CANADA email: aling@eecg.toronto.edu Deshanand P. Singh, Stephen