Combining Module Selection and Replication for Throughput-Driven Streaming Programs

Size: px
Start display at page:

Download "Combining Module Selection and Replication for Throughput-Driven Streaming Programs"

Transcription

1 Combining Module Selection and Replication or Throughput-Driven Streaming Programs Jason Cong, Muhuan Huang, Bin Liu, Peng Zhang and Yi Zou Computer Science Department, University o Caliornia, Los Angeles Abstract Streaming processing is widely adopted in many data-intensive applications in various domains. FPGAs are commonly used to realize these applications since they can exploit inherent data parallelism and pipelining in the applications to achieve a better perormance. In this paper we investigate the design space exploration problem (DSE) when mapping streaming applications onto FPGAs. Previous works narrowly ocus on using techniques like replication or module selection to meet the throughput target. We propose to combine these two techniques together to guide the design space exploration. A ormal ormulation and solution to this combined problem is presented in this paper. Our objective is to optimize the total area cost subject to the throughput constraint. In particular, we are able to handle the eedback loops in the streaming programs, which, to the best o our knowledge, has never been discussed in previous work. Our methodology is evaluated with high-level synthesis tools, and we demonstrate our worklow on a set o benchmarks that vary rom module kernel design such as FFT to large designs such as an MPEG- decoder. I. INTRODUCTION The primary goal o implementing streaming applications is to meet throughput requirements. Streaming programs are usually modeled in graphs, where each node in the graph represents an actor. Previous approaches to increasing throughput ocus on replicating the bottleneck actors [] [7]. Those approaches point out that stateless actors, which have no dependence between consecutive executions, oer much data parallelism opportunities that can be exploited by replication o these actors. This strategy is targeted at a multicore architecture [] [6] or a multicore architecture with accelerators such as GPU [7]. For example, a greedy algorithm or mapping actors to dierent cores was proposed in []. I there are ewer actors than cores, the partitioner considers the actors in decreasing order o their computational requirements and replicates the candidate actors until all the cores are occupied. Conversely, i there are more actors than cores, the partitioner repeatedly uses the least demanding actors until all actors it in the multicore. In the context o the FPGA platorm, a similar idea was also exploited. [8] [9] extend the idea o actor replication to the FPGA platorm with consideration o resource limitations. For example, in [8] the proposed algorithm essentially perorms maximal replication o all the stateless actors bounded only by the FPGA resource, then uses the actors that do not aect the throughput. While replication does increase the throughput, such a mechanism is /DATE/ 0 EDAA quite resource-consuming since occupied resources increase quickly as the number o replicas increases. High resource utilization might lead to a timing closure problem in FPGA implementation. Another disadvantage o this scheme is that it does not evaluate dierent possible implementation options or the actors to improve the perormance; thus, only limited design space is explored. In this paper we address the ollowing issue: how do we design area-eicient FPGA implementations or streaming programs under the throughput constraints? We notice that most actors in streaming programs could have dierent implementations that trade o the processing throughput and area. Their combinations provide great opportunities or optimization since much area can be saved by selecting slower implementations or non-critical actors. This idea o module selection has been extensively explored in the past (or example in [0] []). However, these studies target a ine-grained circuit level, and thus replication techniques are not employed. Moreover, most studies ocus on the tradeo between area and time, and do not consider throughput as a constraint. In this work we are interested in building a library o implementations and selecting the best implementation rom the library to replicate. Thus, our strategy could be viewed as a combination o module replication and module selection. As ar as we know, we are the irst to tackle the DSE problem by combining these two techniques. In contrast to previous work on design space exploration [] [8] [9] that considers only streaming programs without eedback loops, we also take eedback loops into consideration since these are usually signiicant throughput bottlenecks in streaming programs [5]. The remainder o the paper is organized as ollows. Section II provides background and our motivating example. Section III details the ormulation and our methodology or the design space exploration problem. Section IV presents experiment results, and we conclude in Section V. II. BACKGROUND AND MOTIVATING EXAMPLE In this work we use a synchronous data low (SDF) [6] as our computation model. Streaming applications are represented as SDF graphs (SDFGs.) Nodes in the graph are actors, and the edges are called channels. In general, actor replication is not an area-eicient design technique or increasing throughput because control logics, which do not directly contribute to throughput, are also duplicated. Moreover, some computing logics might be wasteul ater replication. For example, consider an actor that contains

2 (, 6059) (, 33) (, 3556) (8, 5) (6, 668) SLICE DSP (3, 5) (5, 75) Initiation Interval Fig.. Tradeo between perormance and area o one actor in benchmark ilterbank. two add and one multiply operation, while the slowest implementation would use one adder and one multiplier to do the computation. When we duplicate such an implementation, two adders and two multipliers are generated, and one multiplier is wasted. When mapping an actor in streaming programs onto FPGAs, by setting dierent design goals and deploying dierent optimization conigurations we can generate unctionally equivalent implementations that dier in area and perormance. Thereore, we could generate a library o implementations or each actor so that design space is enlarged. For example, initiation interval (II) is a well-known parameter that trades o between area and throughput. In streaming applications, the initiation interval speciies the number o cycles between the consecutive iring o an actor and indicates to what extent the actor is pipelined. A pipelined actor oers high throughput, but usually at a large area cost because resource sharing opportunities are reduced. Fig. plots the initiation interval and area cost o our FPGA-based implementations or one actor in the StreamIt benchmark ilterbank [7]. The initiation interval o these design points are,,, 8, 6 and 5, respectively. We can see rom the igure that decreasing the initiation interval results in more DSP logics and slices. To achieve a throughput o one iring per cycle, we could either replicate the slowest implementation (with II = 5) into 5 copies, or use the implementation with II = directly. Obviously the latter option is more area eicient. It s worthwhile to mention that pipelining does not guarantee that the throughput target can be satisied, and replication is still needed in cases where a ully pipelined design cannot satisy the high throughput target. There are also situations when hardware designers choose to use the predeined IP cores to avoid high development cost, and they do not have many choices or dierent implementations. III. PROBLEM FORMULATION AND SOLUTIONS Consider a stream graph that has M actors. When mapping actor m in the stream graph to a physical module on FPGAs, we can have dierent implementations Pm, Pm,...,P S m m, where each implementation Pm s can perorm the unctionality o m with area cost A(Pm), s initiation interval δ(pm), s and execution time t(pm). s S m denotes the number o implementations we have or actor m. Moreover, to meet the throughput target, m in ( m ) in ( m ) in3 ( m ) m 8 Fig.. out ( m )=max{ in ( m )/, in ( m )/, in3 ( m )/} * out ( m )=max{ in ( m )/, in ( m )/, in3 ( m )/} * 8 Throughput target propagation. can be implemented as several replicas o one implementation selected rom the library. The DSE problem or streaming applications requires selecting implementations or each actor under throughput constraint while minimizing the total area cost. Such a throughput-driven DSE problem or streaming applications diers rom traditional DSE problems in that we need to maintain a balance between the consecutive actors; i.e., it is not beneicial to increase the throughput o a producer actor when its producer rate is higher than the highest consumer rate that the ollowing actor can sustain. A. Throughput Calculation For actor m and its implementation P s m, the maximal input throughput θ in (P s m) and output throughput θ out (P s m) o m are calculated as θ in (P s m) = In( m) δ(p s m), θ out(p s m) = Out( m) δ(p s m), () where In( m ) and Out( m ) equals the number o data tokens that m consumes on the input data channel and produces on the output data channel during each iring. And δ(p s m) is the initial interval o P s m. This deinition is dierent rom [8] in which the authors substitute δ(p s m) with t(p s m), because their deinition does not consider pipelined implementations. I an actor has multiple input channels, the channel with the lowest data consumer rate is used in unction (). The system throughput is measured at the irst actor in the SDFG. It measures how many data tokens the system can consume in one clock cycle. We notice that or the SDFG that does not contain eedback loops, we are able to express the relationship between the system throughput target and the throughput target o each actor. Thus, we can break down the original problem into independent module selection subproblems or each actor and greatly reduce the problem size. For a SDFG with eedback loops, system throughput is related to the total execution time o the actors on the eedback path, and thus we need to consider the operation semantics o these actors. We will describe the detailed ormulations o these two scenarios in the ollowing two subsections. B. SDFGs Without Feedback Loops Let us irst discuss how to propagate the input throughput target to the output throughput target or a single actor. Denote the number o input and output channels or actor m as numin( m ) and numout( m ). The throughput target on the input/output channel is denoted as θ j in ( m)/θout( k m ), where j numin( m ) and k numout( m ). Now, given the input throughput target θ j in ( m), the output throughput target

3 (a) 5 (c) 3 5 (b) 5 3 Fig. 3. (a) An example o the simplest orm o eedback loop. (b) An example o a more complicated orm o eedback loop. (c) and (d) are two ways to transorm (b) into a simpler orm. In (c), and are merged into one actor. In (d), assume has a longer execution time than, then we only need to consider in the eedback path. θout( k m ) is calculated as { θ k out( m ) = max j θ j in ( m) In j ( m ) } (d) Out k ( m ), k numout( m ). () The use o max presumes that the selected physical implementations o m could sustain the highest input throughput target. Fig. illustrates the throughput target propagation process. The system throughput target is enorced on the input channels o the irst actor. Using ormulation (), we can propagate the throughput target to each input and output channel in the graph. We can also derive the initial interval target (δ tgt ( m )) or actor m as { } δ tgt ( m ) = min j In j ( m ) θ j in ( m). (3) Thereore, given an SDFG and the system throughput target, we can compute the initial interval target or each actor. Then, the DSE problem can be described as ollows: Given a throughput (or initial interval) constraint or an actor, which implementation should be selected and how many replicas are needed so that the aggregated throughput o all the replicas is no less than the throughput constraint? That is, or actor m we want to get binary integers x, x,...,x Sm indicating which implementation is selected, and integer u m indicating number o replicas needed. Then the ormulation o the problem is: minimize u m Sm A(Pm)x i i i= subject to u m Sm S m i= i= x i = δ(p i m)x i δ tgt ( m ) where δ tgt ( m ) is calculated as in unction (3). To solve this problem, we could simply enumerate all the possible values o x i, compute the number o replicas needed based on the throughput constraint, compute the area cost, () 3 (a) 3 (b) Fig.. Precedence graphs o eedback loop in Fig. 3(a). According to the data dependency, in each iteration actor could ire once ater every two irings o actor. In Fig. (a), each actor has only one physical implementation, thus each pair o consecutive irings is connected with a dashed edge. In Fig. (b), actor has two physical modules. According to the cyclic scheduling policy, the irst and third irings are scheduled to one physical implementation, while the second and the ourth irings are scheduled to another physical implementation. and select the implementation that consumes the least area. The complexity o this method is O(S m ), where S m is the number o implementations in the library or actor m. C. SDFGs With Feedback Loops ) Feedback Loops: In this section we will discuss the simplest orm o eedback loop actors are sequentially connected with each other along the eedback path. Despite strong restrictions, many SDFGs that do not meet such requirements can still be transormed into this simple orm as shown in Fig. 3. Fig. 3(a) shows examples o the SDFGs that we are dealing with. Fig. 3(b) is an SDFG that violates the requirement it has two actors iring concurrently rather than sequentially. We can either merge these two actors (Fig. 3(c)) or only consider the bottleneck actor in the loop (Fig. 3(d).) The iteration o a eedback loop is a set o actor irings with as many irings as in the corresponding entry in the repetition vector q [8]. Ater one iteration, the SDFG returns to the same state; i.e. the number o tokens on each channel remains the same. In Fig. 3(a), the repetition actor q is [;;]. We also assume that the initial data tokens on the eedback loop are located at the input edge o the irst actor in the eedback path. Those eedback tokens, or in other words, dependency distance, are intrinsic in the algorithm. In act, such distance is in many streaming applications. Thereore, we also assume that the number o eedback tokens is just enough or the irst actor to ire or one iteration. ) Formulation: In the eedback loop, the throughput target imposes a constraint on the execution time o all the actors in the loop. Semantics o how the actors are ired are the key to estimate the latency o the loop. To delineate such a iring sequence, we construct a precedence graph as in [9]. The weight o edge denotes by how many cycles two irings should be separated. Fig. shows the precedence graph o a eedback loop in Fig. 3(a). Each node in the graph represents an actor iring. Solid edges denote the data dependency between adjacent actors. Dashed edges relect the physical resource conlict between consecutive irings o the same actor. Physical resource conlict depends on the number o physical modules

4 (implementations) selected, their initiation intervals, and how the irings are scheduled to these physical modules. We adopt the cyclic scheduling policy, which assigns a iring to the irst available physical module. Two irings that are scheduled to the same physical module should be separated by at least as many cycles as the initiation interval. In the precedence graph, edges with a dot denote the dependence across the iteration. Our problem has two parts: module selection and scheduling. In module selection, we select one implementation and decide the number o replicas or each actor, thus inalizing the area cost, module execution time, initiation interval and speciic structure o the precedence graph. In module scheduling, we schedule the start time o each node in the precedence graph to see i it can satisy the throughput target. Usually scheduling is perormed ater the module selection, but in our proposed ormulation we perorm module selection concurrently with scheduling. Moreover, we should generate a scheduling that is easible among all iterations, although we are only scheduling the irings in one iteration. Thus, we unold the precedence graph to include the s irings in the second iteration (as shown in Fig. 5). The reason is that i has the same iring timeline in iteration as in iteration, then all the other actors would also have the same timeline in the irst two iterations. And i the second iteration has the same iring timeline as the irst iteration, we are guaranteed to get a periodical scheduling. The total area cost, execution time, and initiation interval o the selected implementation or actor m is denoted as A m, t m and δ m respectively. We introduce binary integers x i m to denote i implementation P i m is selected and use u m to denote the number o replicas. Then we have A m = u m Sm A(Pm)x i i m, (5) i= δ m = S m δ(pm)x i i m, (6) i= t m = S m t(pm)x i i m, (7) i= S m i= x i m =. (8) In the unolded precedence graph, the set o nodes in the graph is V, and the set o edges is E. For e(u, v) E, denote the weight as w(u, v). I u and v are the irings o dierent actors m and m+, then I u and v are dierent irings m k,j m, then w(u, v) can be expressed as: j, j, j < j, w( k,j m, k,j m ) = w(u, v) = t m. (9) and k,j m o the same actor { δm, i j j = u m 0, otherwise. (0) Since u m is an unknown variable, every two irings o the same actor can be connected with an edge. And since the number o irings or m in one iteration is q[m], the number o such constraints is O(q[m] ). In realistic cases, we usually Fig t t t t 3 Unolded precedence graph o eedback loop in Fig. 3(a). know the maximum possible number o replicas ( m ). Thus the number o constraints is O(q[m] m ). Denote the start time o the jth iring o actor m in iteration k as T (m k,j ). We can ormulate the problem as ollows, M minimize A m, subject to m= j {,,..., q[m] }, T (m,(j+) ) T (m,j ) = T (m,(j+) ) T (m,j ),() j {,,..., q[]}, T (,j,j ) T ( ) In( ) q[] θ tgt in (, () ) e(u, v) E, T (v) T (u) w(u, v). (3) where θ tgt in ( ) is the throughput target o the eedback loop, which is measured at actor. Constraints () and () guarantee that the schedule is periodical and should meet the system throughput. Constraint (3) ensures the dependency that iring v ollows u by at least w(u, v) cycles. w(u, v) is calculated as shown in unctions (9) and (0). Functions (5) to (3) present our ormulation o the DSE problem or an SDFG with eedback loop. 3) Solution: The above set o ormulas are diicult to solve because unction (5) and (0) contain nonlinear expressions. Thereore, we propose the ollowing techniques to transorm the ormulation into linear expression so that we can use an ILP solver to solve the problem. Let us look at unction (0) irst. We could enumerate the possible values o u m as,,..., m and introduce binary integer variables ym i to denote which one is the actual value o u m. Thus w(m k,j, m k,j ) and u m is: w( k,j m, m k,j ) = δ m y j j m, () m u m = i ym, i (5) m i= i= Combining unction (6) and (), we have: w( k,j m, k,j m ) = ( S m i= 3 y i m =. (6) δ(p i m)x i m) y j j m. (7) Due to the product o unknown variables x i m and y j j m, the unction is not linear. Thus, we propose the ollowing theorem:

5 Theorem : Function F () is a unction o integer variable n x and binary variable y,y,...,y n. y i =, n >. Given i= the value o binary variables, i F () is a linear unction o x, then F (x, y, y,..., y n ) 0 is equivalent to the ollowing set o unctions which are all in the linear orm: F (x,, 0,..., 0) ( y ) in x F (x,, 0,..., 0) F (x, 0,,..., 0) ( y ) in x F (x, 0,,..., 0)... F (x, 0, 0,..., ) ( y n ) in x F (x, 0, 0,..., ) (8) where in means the inimum o the unction. We can also derive a similar set o unctions or the case where (x, y, y,..., y n ) 0 (which is omitted here due to page limitation). This theorem can be easily proved by enumerating all the possible values o binary variables. In unction (7), the variables x i m and y j j m correspond to the binary variables y i and x in the above theorem. We could also apply this theorem to unction (5); thus all o the ormulation o our problem is in a linear orm, and can be solved with an ILP solver. The total number o variables is O( M m= O( M m= S m + M m= m q[m]). m ), and the number o constraints is IV. EXPERIMENT RESULTS The experiment is carried out in two parts. For the streaming programs without eedback loops, we evaluate our strategies using StreamIt benchmarks [7]. For those containing eedback loops we examine our methodology using the MPEG- decoder. Implementation o these benchmarks on an FPGA is carried out by the high-level synthesis tool Autopilot [0] rom AutoESL Xilinx (version: AutoESL 0.). It takes C code as input and generates synthesizable RTL code which can be ed into a Xilinx ISE design suite. Thus, most benchmarks are rewritten into C code. The FPGA device we use is the Xilinx Virtex6 XC6VLX0T board. The target FPGA clock cycle is set at 00 MHz. Initiation intervals are speciied in the C program by Autopilot pragmas. Estimated execution time and resource usage (i.e., DSP, Block RAM, FF and LUT) are provided by the Autopilot synthesis report. The area metric is empirically estimated as the maximum utilization percentage o any o these our kinds o resources. A. Results on StreamIt Benchmarks StreamIt benchmarks provide us with high-level structure and behavior or realistic streaming applications. The actors in StreamIt benchmarks are restricted to have only one input/output channel, and data distribution and merging are implemented as split-join nodes. Since we allow multiple input/output channels in SDF, we modiy the benchmarks to embed the split-join nodes into the actors. For example, i an actor m is ollowed by a split node, we embed the split node into m. Within m, data tokens are distributed to multiple output channels. We compare our methodology with two baselines: ) the strategy in previous work [8] sys Parser 6 PP 58 Syn. Node 58 CC_MC TU (a) IDCT Fig sys Parser Syn. Node 58 pass CC_MC TU through 6 6 PP IDCT (b) SDFGs o MPEG- decoder. [9] that simply replicates the non-pipelined actors to reach high throughput; ) the strategy that all the actors are ully pipelined. In the irst case, throughput is usually achieved at a large area cost, while in the second case, non-bottleneck actors are over-perormed and thus occupy more area than they really need. The comparison results are shown in Table I. For each benchmark we show the area o baseline, baseline and our method that combines module selection and replication (CMSR or abbreviation). We also show the area reduction. B. Results on MPEG- Decoder MPEG- is a collection o methods and standards to perorm audio and video coding. The reerence C code or the MPEG- decoder is provided by Xilinx. The C code has already been developed with synthesis in mind, and most unsynthesizable constructs (such as dynamic allocation) are avoided. The design description and block diagram are presented in []. The modules are connected by either FIFOs or shared memory. The MPEG- decoder can be represented in an SDFG as shown in Fig. 6(a). There are ive actors in the graph: Parser, CC MC (copy control and motion compensations), TU (texture update), PP (pre-processor) and IDCT. The data ormat is set as CIF and the data token granularity is set at the macroblock (6x6 pixels) level. So, a CIF-size image rame (70x576) corresponds to x36 macroblocks. The iring o CC MC requires the output data rom TU in the previous rame; thus there is a synchronization node between CC MC and TU. While maintaining the correctness o the algorithm, we can transorm the SDFG in Fig. 6(a) into Fig. 6(b) a pass-through actor is added. There are two eedback paths in the new SDFG, and we select the path that takes the longer execution time (estimated rom the Autopilot synthesis report) as the eedback loop. We use unction block pipelining inside some actors to generate hardware modules with dierent II. For example, IDCT is urther broken into two modules, IDCT row and IDCT col, and they can be pipelined. The initial interval o IDCT is the larger value between the execution time o IDCT row and IDCT col. The implementation library we generated or Parser, PP, IDCT, CC MC and TU is shown in Table II. In Table III we show that the minimum area cost or a given throughput target that varies rom 30ps to 60ps. For each actor, we list the selected implementation, number o replicas, and total area usage or the actor. As we can see rom the table, bottleneck actors are Parser, PP and IDCT, which require a larger number o replicas and implementation with smaller II. It is worthwhile to point out that Autopilot uses the worst case latency as the estimated ex-

6 TABLE I DESIGN SPACE EXPLORATION FOR THREE STREAMIT BENCHMARKS. Filterbank FFT Autocor Throughput (# o data per cycle) Baseline Area Baseline Area CMSR Area Reduct -9.5% -6.8% -.% -6.6% % -3.8% 0 Reduct -36.5% -6.% -77.6% -3.7% -5.8% -6.% 0-39.% -59.% TABLE II IMPLEMENTATION LIBRARY FOR MPEG- DECODER KERNELS Parser PP IDCT CC MC TU v v v v v3 v v v3 v v v3 v v II FF LUT TABLE III DESIGN SPACE EXPLORATION FOR MPEG- DECODER. Throughput Parser PP IDCT CC MC TU Total Area (ps) impl rep area % impl rep area % impl rep area % impl rep area % impl rep area % % 60 v v 3.07 v 3.7 v 0.8 v v v 3.07 v 0.98 v 0.8 v v 3.9 v3.55 v 0.98 v 0.8 v v 3.7 v.53 v 0.98 v 0.8 v ecution time, thus may underestimate the actor s perormance in reality. Parser is such a case. The ILP solver we use is GLPK [], and results are generated on the order o minutes. V. CONCLUSION In this paper we studied the design space exploration problem o mapping streaming applications onto FPGAs. Our method diers rom existing methods because it combines both module selection and replicationto ind a better design point in design space. The method is veriied with both small designs like FFT and large designs like the MPEG- decoder, and results can be computed in minutes using an ILP solver. In the uture, we would like to investigate more on the communication cost in this design space exploration problem. ACKNOWLEDGMENT This research was partially supported by the NSF Expedition in Computing Award CCF-0967 (Center or Domain- Speciic Computing), Altera and Mentor Graphics. REFERENCES [] M. I. Gordon, W. Thies, and S. Amarasinghe, Exploiting coarse-grained task, data, and pipeline parallelism in stream programs, SIGOPS, vol. 0, pp. 5 6, 006. [] M. I. Gordon et al., A stream compiler or communication-exposed architectures, SIGARCH, vol. 30, pp , 00. [3] M. Kudlur and S. Mahlke, Orchestrating the execution o stream programs on multicore platorms, SIGPLAN, vol. 3, pp., 008. [] A. Hormati et al., MacroSS: macro-simdization o streaming applications, SIGARCH, vol. 38, pp , 00. [5] A. Hormati et al., Flextream: Adaptive compilation o streaming applications or heterogeneous architectures, in PACT, 009, pp. 3. [6] S. Liao et al., Data and computation transormations or Brook streaming applications on multiprocessors, in CGO, 006, pp [7] A. Udupa, R. Govindarajan, and M. J. Thazhuthaveetil, Synergistic execution o stream programs on multicores with accelerators, SIGPLAN, vol., pp , 009. [8] A. Hagiescu et al., A computing origami: Folding streams in FPGAs, in DAC, 009, pp [9] F. Plavec, Z. Vranesic, and S. Brown, Enhancements to FPGA design methodology using streaming, in FPL 009, 009, pp [0] M. Ishikawa and G. De Micheli, A module selection algorithm or high-level synthesis, in ISCAS, 99, pp vol.3. [] I. Ahmad, M. Dhodhi, and C. Chen, Integrated scheduling, allocation and module selection or design-space exploration in high-level synthesis, CDT, vol., no., pp. 65 7, 995. [] K. Ito, L. Lucke, and K. Parhi, ILP-based cost-optimal DSP synthesis with module selection and data ormat conversion, VLSI, vol. 6, no., pp , 998. [3] W. Sun, M. J. Wirthlin, and S. Neuendorer, FPGA pipeline synthesis design exploration using module selection and resource sharing, TCAD, vol. 6, no., pp. 5 65, 007. [] D. Chen, J. Cong, and J. Xu, Optimal module and voltage assignment or low-power, in ASP-DAC, 005, pp [5] W. Thies and S. Amarasinghe, An empirical characterization o stream programs and its implications or language and compiler design, in PACT, 00, pp [6] E. Lee and D. Messerschmitt, Synchronous data low, Proceedings o the IEEE, vol. 75, no. 9, pp. 35 5, 987. [7] StreamIt benchmarks, benchmarks.shtml. [8] E. A. Lee and D. G. Messerschmitt, Static scheduling o synchronous data low programs or digital signal processing, Computers, IEEE Transactions on, vol. C-36, no., pp. 35, 987. [9] Govindarajan et al., Rate-optimal schedule or multi-rate DSP computations, J. VLSI Signal Process. Syst., vol. 9, pp. 3, 995. [0] J. Cong et al., High-level synthesis or pgas: From prototyping to deployment, TCAD, vol. 30, no., pp. 73 9, 0. [] P. Schumacher et al., A scalable, multi-stream mpeg- video decoder or conerencing and surveillance applications, in ICIP 005, vol., 005, pp. II [] GNU Linear Programming Kit,

FPGA PLB EVALUATION USING QUANTIFIED BOOLEAN SATISFIABILITY

FPGA PLB EVALUATION USING QUANTIFIED BOOLEAN SATISFIABILITY FPGA PLB EVALUATION USING QUANTIFIED BOOLEAN SATISFIABILITY Andrew C. Ling Electrical and Computer Engineering University o Toronto Toronto, CANADA email: aling@eecg.toronto.edu Deshanand P. Singh, Stephen

More information

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given

More information

Automated Space/Time Scaling of Streaming Task Graphs. Hossein Omidian Supervisor: Guy Lemieux

Automated Space/Time Scaling of Streaming Task Graphs. Hossein Omidian Supervisor: Guy Lemieux Automated Space/Time Scaling of Streaming Task Graphs Hossein Omidian Supervisor: Guy Lemieux 1 Contents Introduction KPN-based HLS Tool for MPPA overlay Experimental Results Future Work Conclusion 2 Introduction

More information

A Proposed Approach for Solving Rough Bi-Level. Programming Problems by Genetic Algorithm

A Proposed Approach for Solving Rough Bi-Level. Programming Problems by Genetic Algorithm Int J Contemp Math Sciences, Vol 6, 0, no 0, 45 465 A Proposed Approach or Solving Rough Bi-Level Programming Problems by Genetic Algorithm M S Osman Department o Basic Science, Higher Technological Institute

More information

An Efficient Configuration Methodology for Time-Division Multiplexed Single Resources

An Efficient Configuration Methodology for Time-Division Multiplexed Single Resources An Eicient Coniguration Methodology or Time-Division Multiplexed Single Resources Benny Akesson 1, Anna Minaeva 1, Přemysl Šůcha 1, Andrew Nelson 2 and Zdeněk Hanzálek 1 1 Czech Technical University in

More information

A fast and area-efficient FPGA-based architecture for high accuracy logarithm approximation

A fast and area-efficient FPGA-based architecture for high accuracy logarithm approximation A ast and area-eicient FPGA-based architecture or high accuracy logarithm approximation Dimitris Bariamis, Dimitris Maroulis, Dimitris K. Iakovidis Department o Inormatics and Telecommunications University

More information

13. Power Optimization

13. Power Optimization 13. Power Optimization May 2013 QII52016-13.0.0 QII52016-13.0.0 The Quartus II sotware oers power-driven compilation to ully optimize device power consumption. Power-driven compilation ocuses on reducing

More information

Meta-Data-Enabled Reuse of Dataflow Intellectual Property for FPGAs

Meta-Data-Enabled Reuse of Dataflow Intellectual Property for FPGAs Meta-Data-Enabled Reuse of Dataflow Intellectual Property for FPGAs Adam Arnesen NSF Center for High-Performance Reconfigurable Computing (CHREC) Dept. of Electrical and Computer Engineering Brigham Young

More information

MATRIX ALGORITHM OF SOLVING GRAPH CUTTING PROBLEM

MATRIX ALGORITHM OF SOLVING GRAPH CUTTING PROBLEM UDC 681.3.06 MATRIX ALGORITHM OF SOLVING GRAPH CUTTING PROBLEM V.K. Pogrebnoy TPU Institute «Cybernetic centre» E-mail: vk@ad.cctpu.edu.ru Matrix algorithm o solving graph cutting problem has been suggested.

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Larger K-maps. So far we have only discussed 2 and 3-variable K-maps. We can now create a 4-variable map in the

Larger K-maps. So far we have only discussed 2 and 3-variable K-maps. We can now create a 4-variable map in the EET 3 Chapter 3 7/3/2 PAGE - 23 Larger K-maps The -variable K-map So ar we have only discussed 2 and 3-variable K-maps. We can now create a -variable map in the same way that we created the 3-variable

More information

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer

More information

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms

A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms A Methodology for Energy Efficient FPGA Designs Using Malleable Algorithms Jingzhao Ou and Viktor K. Prasanna Department of Electrical Engineering, University of Southern California Los Angeles, California,

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

2. Recommended Design Flow

2. Recommended Design Flow 2. Recommended Design Flow This chapter describes the Altera-recommended design low or successully implementing external memory interaces in Altera devices. Altera recommends that you create an example

More information

EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation

EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation University of Michigan December 3, 2012 Guest Speakers Today: Daya Khudia and Mehrzad Samadi nnouncements & Reading Material Exams

More information

Joint Congestion Control and Scheduling in Wireless Networks with Network Coding

Joint Congestion Control and Scheduling in Wireless Networks with Network Coding Title Joint Congestion Control and Scheduling in Wireless Networks with Network Coding Authors Hou, R; Wong Lui, KS; Li, J Citation IEEE Transactions on Vehicular Technology, 2014, v. 63 n. 7, p. 3304-3317

More information

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17, Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms SAMOS XIV July 14-17, 2014 1 Outline Introduction + Motivation Design requirements for many-accelerator SoCs Design problems

More information

Synthesis of DSP Systems using Data Flow Graphs for Silicon Area Reduction

Synthesis of DSP Systems using Data Flow Graphs for Silicon Area Reduction Synthesis of DSP Systems using Data Flow Graphs for Silicon Area Reduction Rakhi S 1, PremanandaB.S 2, Mihir Narayan Mohanty 3 1 Atria Institute of Technology, 2 East Point College of Engineering &Technology,

More information

10. SOPC Builder Component Development Walkthrough

10. SOPC Builder Component Development Walkthrough 10. SOPC Builder Component Development Walkthrough QII54007-9.0.0 Introduction This chapter describes the parts o a custom SOPC Builder component and guides you through the process o creating an example

More information

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips Overview CSE372 Digital Systems Organization and Design Lab Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,

More information

MPEG RVC AVC Baseline Encoder Based on a Novel Iterative Methodology

MPEG RVC AVC Baseline Encoder Based on a Novel Iterative Methodology MPEG RVC AVC Baseline Encoder Based on a Novel Iterative Methodology Hussein Aman-Allah, Ehab Hanna, Karim Maarouf, Ihab Amer Laboratory of Microelectronic Systems (GR-LSM), EPFL CH-1015 Lausanne, Switzerland

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications

Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu He, Pei Zhang, and Huazhong Yang 2013/01/23 Outline

More information

Binary recursion. Unate functions. If a cover C(f) is unate in xj, x, then f is unate in xj. x

Binary recursion. Unate functions. If a cover C(f) is unate in xj, x, then f is unate in xj. x Binary recursion Unate unctions! Theorem I a cover C() is unate in,, then is unate in.! Theorem I is unate in,, then every prime implicant o is unate in. Why are unate unctions so special?! Special Boolean

More information

Method estimating reflection coefficients of adaptive lattice filter and its application to system identification

Method estimating reflection coefficients of adaptive lattice filter and its application to system identification Acoust. Sci. & Tech. 28, 2 (27) PAPER #27 The Acoustical Society o Japan Method estimating relection coeicients o adaptive lattice ilter and its application to system identiication Kensaku Fujii 1;, Masaaki

More information

Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y.

Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y. Mapping real-life applications on run-time reconfigurable NoC-based MPSoC on FPGA. Singh, A.K.; Kumar, A.; Srikanthan, Th.; Ha, Y. Published in: Proceedings of the 2010 International Conference on Field-programmable

More information

Logic Debugging of Arithmetic Circuits

Logic Debugging of Arithmetic Circuits Logic Debugging o Arithmetic Circuits Samaneh Ghandali, Cunxi Yu, Duo Liu, Walter Brown, Maciej Ciesielski University o Massachusetts, Amherst, USA {samaneh, ycunxi, duo, webrown, ciesiel}@umass.edu Abstract

More information

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg)

More information

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 ISSCC 26 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 22.1 A 125µW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications Tsu-Ming Liu 1, Ting-An Lin 2, Sheng-Zen Wang 2, Wen-Ping Lee

More information

First Video Streaming Experiments on a Time Driven Priority Network

First Video Streaming Experiments on a Time Driven Priority Network First Video Streaming Experiments on a Driven Priority Network Mario Baldi and Guido Marchetto Department o Control and Computer Engineering Politecnico di Torino (Technical University o Torino) Abstract

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN Xiaoying Li 1 Fuming Sun 2 Enhua Wu 1, 3 1 University of Macau, Macao, China 2 University of Science and Technology Beijing, Beijing, China

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Automated Planning for Feature Model Configuration based on Functional and Non-Functional Requirements

Automated Planning for Feature Model Configuration based on Functional and Non-Functional Requirements Automated Planning or Feature Model Coniguration based on Functional and Non-Functional Requirements Samaneh Soltani 1, Mohsen Asadi 1, Dragan Gašević 2, Marek Hatala 1, Ebrahim Bagheri 2 1 Simon Fraser

More information

StreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06.

StreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06. StreamIt on Fleet Amir Kamil Computer Science Division, University of California, Berkeley kamil@cs.berkeley.edu UCB-AK06 July 16, 2008 1 Introduction StreamIt [1] is a high-level programming language

More information

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING 1 DSP applications DSP platforms The synthesis problem Models of computation OUTLINE 2 DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: Time-discrete representation

More information

A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms

A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms Shuoxin Lin, Yanzhou Liu, William Plishker, Shuvra Bhattacharyya Maryland DSPCAD Research Group Department of

More information

CS485/685 Computer Vision Spring 2012 Dr. George Bebis Programming Assignment 2 Due Date: 3/27/2012

CS485/685 Computer Vision Spring 2012 Dr. George Bebis Programming Assignment 2 Due Date: 3/27/2012 CS8/68 Computer Vision Spring 0 Dr. George Bebis Programming Assignment Due Date: /7/0 In this assignment, you will implement an algorithm or normalizing ace image using SVD. Face normalization is a required

More information

Energy Optimization of FPGA-Based Stream- Oriented Computing with Power Gating

Energy Optimization of FPGA-Based Stream- Oriented Computing with Power Gating Energy Optimization of FPGA-Based Stream- Oriented Computing with Power Gating Mohammad Hosseinabady and Jose Luis Nunez-Yanez Department of Electrical and Electronic Engineering University of Bristol,

More information

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:

More information

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

6.189 IAP Lecture 12. StreamIt Parallelizing Compiler. Prof. Saman Amarasinghe, MIT IAP 2007 MIT

6.189 IAP Lecture 12. StreamIt Parallelizing Compiler. Prof. Saman Amarasinghe, MIT IAP 2007 MIT 6.89 IAP 2007 Lecture 2 StreamIt Parallelizing Compiler 6.89 IAP 2007 MIT Common Machine Language Represent common properties of architectures Necessary for performance Abstract away differences in architectures

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Reliable Embedded Multimedia Systems?

Reliable Embedded Multimedia Systems? 2 Overview Reliable Embedded Multimedia Systems? Twan Basten Joint work with Marc Geilen, AmirHossein Ghamarian, Hamid Shojaei, Sander Stuijk, Bart Theelen, and others Embedded Multi-media Analysis of

More information

Cache Aware Optimization of Stream Programs

Cache Aware Optimization of Stream Programs Cache Aware Optimization of Stream Programs Janis Sermulins, William Thies, Rodric Rabbah and Saman Amarasinghe LCTES Chicago, June 2005 Streaming Computing Is Everywhere! Prevalent computing domain with

More information

A Stream Compiler for Communication-Exposed Architectures

A Stream Compiler for Communication-Exposed Architectures A Stream Compiler for Communication-Exposed Architectures Michael Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali Meli, Andrew Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

AN 608: HST Jitter and BER Estimator Tool for Stratix IV GX and GT Devices

AN 608: HST Jitter and BER Estimator Tool for Stratix IV GX and GT Devices AN 608: HST Jitter and BER Estimator Tool or Stratix IV GX and GT Devices July 2010 AN-608-1.0 The high-speed communication link design toolkit (HST) jitter and bit error rate (BER) estimator tool is a

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information

Comparison of Two Interactive Search Refinement Techniques

Comparison of Two Interactive Search Refinement Techniques Comparison o Two Interactive Search Reinement Techniques Olga Vechtomova Department o Management Sciences University o Waterloo 200 University Avenue West, Waterloo, Canada ovechtom@engmail.uwaterloo.ca

More information

KANGAL REPORT

KANGAL REPORT Individual Penalty Based Constraint handling Using a Hybrid Bi-Objective and Penalty Function Approach Rituparna Datta Kalyanmoy Deb Mechanical Engineering IIT Kanpur, India KANGAL REPORT 2013005 Abstract

More information

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased

Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased Optimized architectures of CABAC codec for IA-32-, DSP- and FPGAbased platforms Damian Karwowski, Marek Domański Poznan University of Technology, Chair of Multimedia Telecommunications and Microelectronics

More information

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089

More information

A Complete Data Scheduler for Multi-Context Reconfigurable Architectures

A Complete Data Scheduler for Multi-Context Reconfigurable Architectures A Complete Data Scheduler for Multi-Context Reconfigurable Architectures M. Sanchez-Elez, M. Fernandez, R. Maestre, R. Hermida, N. Bagherzadeh, F. J. Kurdahi Departamento de Arquitectura de Computadores

More information

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors

A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors A Software LDPC Decoder Implemented on a Many-Core Array of Programmable Processors Brent Bohnenstiehl and Bevan Baas Department of Electrical and Computer Engineering University of California, Davis {bvbohnen,

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

Parallel Implementation of Arbitrary-Shaped MPEG-4 Decoder for Multiprocessor Systems

Parallel Implementation of Arbitrary-Shaped MPEG-4 Decoder for Multiprocessor Systems Parallel Implementation of Arbitrary-Shaped MPEG-4 oder for Multiprocessor Systems Milan Pastrnak *,a,c, Peter H.N. de With a,c, Sander Stuijk c and Jef van Meerbergen b,c a LogicaCMG Nederland B.V., RTSE

More information

Counting Interface Automata and their Application in Static Analysis of Actor Models

Counting Interface Automata and their Application in Static Analysis of Actor Models Counting Interace Automata and their Application in Static Analysis o Actor Models Ernesto Wandeler Jörn W. Janneck Edward A. Lee Lothar Thiele Abstract We present an interace theory based approach to

More information

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna

ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:

More information

Early Performance-Cost Estimation of Application-Specific Data Path Pipelining

Early Performance-Cost Estimation of Application-Specific Data Path Pipelining Early Performance-Cost Estimation of Application-Specific Data Path Pipelining Jelena Trajkovic Computer Science Department École Polytechnique de Montréal, Canada Email: jelena.trajkovic@polymtl.ca Daniel

More information

2. Getting Started with the Graphical User Interface

2. Getting Started with the Graphical User Interface February 2011 NII52017-10.1.0 2. Getting Started with the Graphical User Interace NII52017-10.1.0 The Nios II Sotware Build Tools (SBT) or Eclipse is a set o plugins based on the popular Eclipse ramework

More information

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Polyhedral-Based Data Reuse Optimization for Configurable Computing Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February

More information

Scheduling in Multihop Wireless Networks without Back-pressure

Scheduling in Multihop Wireless Networks without Back-pressure Forty-Eighth Annual Allerton Conerence Allerton House, UIUC, Illinois, USA September 29 - October 1, 2010 Scheduling in Multihop Wireless Networks without Back-pressure Shihuan Liu, Eylem Ekici, and Lei

More information

The Efficient Implementation of Numerical Integration for FPGA Platforms

The Efficient Implementation of Numerical Integration for FPGA Platforms Website: www.ijeee.in (ISSN: 2348-4748, Volume 2, Issue 7, July 2015) The Efficient Implementation of Numerical Integration for FPGA Platforms Hemavathi H Department of Electronics and Communication Engineering

More information

The Synthesis of Cyclic Combinational Circuits

The Synthesis of Cyclic Combinational Circuits The Synthesis o Cyclic Combinational Circuits Marc D. Riedel riedel@paradise.caltech.edu Caliornia Institute o Technology Mail Code 136 93, Pasadena, CA 91125 Jehoshua Bruck bruck@paradise.caltech.edu

More information

Modeling and Analysis of Workflow Based on TLA

Modeling and Analysis of Workflow Based on TLA JOURNAL OF COMPUTERS, VOL. 4, NO. 1, JANUARY 2009 27 Modeling and Analysis o Worklow Based on TLA CHEN Shu Wu Han University Computer Science Department, Wu Han China Email: Chenshu181@yahoo.com.cn WU

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Resource-Aware Throughput Optimization for High-Level Synthesis

Resource-Aware Throughput Optimization for High-Level Synthesis Resource-Aware Throughput Optimization for High-Level Synthesis Peng Li 1 Peng Zhang 1 Louis-Noël Pouchet 2,1 Jason Cong 1 {pengli, pengzh, pouchet, cong}@cs.ucla.edu 1 Computer Science Department, University

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

SUPER RESOLUTION IMAGE BY EDGE-CONSTRAINED CURVE FITTING IN THE THRESHOLD DECOMPOSITION DOMAIN

SUPER RESOLUTION IMAGE BY EDGE-CONSTRAINED CURVE FITTING IN THE THRESHOLD DECOMPOSITION DOMAIN SUPER RESOLUTION IMAGE BY EDGE-CONSTRAINED CURVE FITTING IN THE THRESHOLD DECOMPOSITION DOMAIN Tsz Chun Ho and Bing Zeng Department o Electronic and Computer Engineering The Hong Kong University o Science

More information

A new approach for ranking trapezoidal vague numbers by using SAW method

A new approach for ranking trapezoidal vague numbers by using SAW method Science Road Publishing Corporation Trends in Advanced Science and Engineering ISSN: 225-6557 TASE 2() 57-64, 20 Journal homepage: http://www.sciroad.com/ntase.html A new approach or ranking trapezoidal

More information

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation Journal of Automation and Control Engineering Vol. 3, No. 1, February 20 A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation Dam. Minh Tung and Tran. Le Thang Dong Center of Electrical

More information

Energy-Aware Scheduling for Acyclic Synchronous Data Flows on Multiprocessors

Energy-Aware Scheduling for Acyclic Synchronous Data Flows on Multiprocessors Journal of Interconnection Networks c World Scientific Publishing Company Energy-Aware Scheduling for Acyclic Synchronous Data Flows on Multiprocessors DAWEI LI and JIE WU Department of Computer and Information

More information

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework System Modeling and Implementation of MPEG-4 Encoder under Fine-Granular-Scalability Framework Literature Survey Embedded Software Systems Prof. B. L. Evans by Wei Li and Zhenxun Xiao March 25, 2002 Abstract

More information

Structural Gate Decomposition for Depth- Optimal Technology Mapping in LUT- Based FPGA Designs

Structural Gate Decomposition for Depth- Optimal Technology Mapping in LUT- Based FPGA Designs Structural Gate Decomposition or Depth- Optimal Technology Mapping in LUT- Based FPGA Designs JASON CONG and YEAN-YOW HWANG Uniersity o Caliornia In this paper we study structural gate decomposition in

More information

Modeling of an MPEG Audio Layer-3 Encoder in Ptolemy

Modeling of an MPEG Audio Layer-3 Encoder in Ptolemy Modeling of an MPEG Audio Layer-3 Encoder in Ptolemy Patrick Brown EE382C Embedded Software Systems May 10, 2000 $EVWUDFW MPEG Audio Layer-3 is a standard for the compression of high-quality digital audio.

More information

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation Doug Johnson, Applications Consultant Chris Eddington, Technical Marketing Synopsys 2013 1 Synopsys, Inc. 700 E. Middlefield Road Mountain

More information

2. Design Planning with the Quartus II Software

2. Design Planning with the Quartus II Software November 2013 QII51016-13.1.0 2. Design Planning with the Quartus II Sotware QII51016-13.1.0 This chapter discusses key FPGA design planning considerations, provides recommendations, and describes various

More information

REDUCING THE CODE SIZE OF RETIMED SOFTWARE LOOPS UNDER TIMING AND RESOURCE CONSTRAINTS

REDUCING THE CODE SIZE OF RETIMED SOFTWARE LOOPS UNDER TIMING AND RESOURCE CONSTRAINTS REDUCING THE CODE SIZE OF RETIMED SOFTWARE LOOPS UNDER TIMING AND RESOURCE CONSTRAINTS Noureddine Chabini 1 and Wayne Wolf 2 1 Department of Electrical and Computer Engineering, Royal Military College

More information

AES Core Specification. Author: Homer Hsing

AES Core Specification. Author: Homer Hsing AES Core Specification Author: Homer Hsing homer.hsing@gmail.com Rev. 0.1.1 October 30, 2012 This page has been intentionally left blank. www.opencores.org Rev 0.1.1 ii Revision History Rev. Date Author

More information

RECENTLY, researches on gigabit wireless personal area

RECENTLY, researches on gigabit wireless personal area 146 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 55, NO. 2, FEBRUARY 2008 An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications Yuan Chen, Student Member, IEEE,

More information

A Series Lighting Controller Modbus Register Map

A Series Lighting Controller Modbus Register Map DEH41097 Rev. 3 g A Series Lighting Control Panelboards A Series Lighting Controller Modbus Register Map Table o Contents Introduction...1 RTU Transmission Mode...1 Message rame...1 RTU Mode Message Frames...1

More information

VLSI Implementation of Parallel CRC Using Pipelining, Unfolding and Retiming

VLSI Implementation of Parallel CRC Using Pipelining, Unfolding and Retiming IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 2, Issue 5 (May. Jun. 203), PP 66-72 e-issn: 239 4200, p-issn No. : 239 497 VLSI Implementation of Parallel CRC Using Pipelining, Unfolding

More information

Symbolic System Synthesis in the Presence of Stringent Real-Time Constraints

Symbolic System Synthesis in the Presence of Stringent Real-Time Constraints Symbolic System Synthesis in the Presence o Stringent Real-Time Constraints Felix Reimann, Martin Lukasiewycz, Michael Glass, Christian Haubelt, Jürgen Teich University o Erlangen-Nuremberg, Germany {elix.reimann,glass,haubelt,teich}@cs.au.de

More information

Certificate Translation for Optimizing Compilers

Certificate Translation for Optimizing Compilers Certiicate Translation or Optimizing Compilers Gilles Barthe IMDEA Sotware and Benjamin Grégoire and César Kunz and Tamara Rezk INRIA Sophia Antipolis - Méditerranée Proo Carrying Code provides trust in

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

DESIGN OF AN FFT PROCESSOR

DESIGN OF AN FFT PROCESSOR 1 DESIGN OF AN FFT PROCESSOR Erik Nordhamn, Björn Sikström and Lars Wanhammar Department of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract In this paper we present a structured

More information

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013 Introduction to FPGA Design with Vivado High-Level Synthesis Notice of Disclaimer The information disclosed to you hereunder (the Materials ) is provided solely for the selection and use of Xilinx products.

More information

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye

A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS. Theepan Moorthy and Andy Ye A SCALABLE COMPUTING AND MEMORY ARCHITECTURE FOR VARIABLE BLOCK SIZE MOTION ESTIMATION ON FIELD-PROGRAMMABLE GATE ARRAYS Theepan Moorthy and Andy Ye Department of Electrical and Computer Engineering Ryerson

More information

MARKET demands urge embedded systems to incorporate

MARKET demands urge embedded systems to incorporate IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 3, MARCH 2011 429 High Performance and Area Efficient Flexible DSP Datapath Synthesis Sotirios Xydis, Student Member, IEEE,

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 14 Parallelism in Software I

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 14 Parallelism in Software I EE382 (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 14 Parallelism in Software I Mattan Erez The University of Texas at Austin EE382: Parallelilsm and Locality, Spring 2015

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

A Requirement Specification Language for Configuration Dynamics of Multiagent Systems

A Requirement Specification Language for Configuration Dynamics of Multiagent Systems A Requirement Speciication Language or Coniguration Dynamics o Multiagent Systems Mehdi Dastani, Catholijn M. Jonker, Jan Treur* Vrije Universiteit Amsterdam, Department o Artiicial Intelligence, De Boelelaan

More information

Double Precision Floating-Point Multiplier using Coarse-Grain Units

Double Precision Floating-Point Multiplier using Coarse-Grain Units Double Precision Floating-Point Multiplier using Coarse-Grain Units Rui Duarte INESC-ID/IST/UTL. rduarte@prosys.inesc-id.pt Mário Véstias INESC-ID/ISEL/IPL. mvestias@deetc.isel.ipl.pt Horácio Neto INESC-ID/IST/UTL

More information

HETEROGENEOUS MULTIPROCESSOR MAPPING FOR REAL-TIME STREAMING SYSTEMS

HETEROGENEOUS MULTIPROCESSOR MAPPING FOR REAL-TIME STREAMING SYSTEMS HETEROGENEOUS MULTIPROCESSOR MAPPING FOR REAL-TIME STREAMING SYSTEMS Jing Lin, Akshaya Srivasta, Prof. Andreas Gerstlauer, and Prof. Brian L. Evans Department of Electrical and Computer Engineering The

More information