MOST computations used in applications, such as multimedia

Size: px
Start display at page:

Download "MOST computations used in applications, such as multimedia"

Transcription

1 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 9, SEPTEMBER Pipelining With Common Operands for Power-Efficient Linear Systems Daehong Kim, Member, IEEE, Dongwan Shin, Member, IEEE, and Kiyoung Choi, Member, IEEE Abstract We propose a systematic pipelining method for a linear system to minimize power and maximize throughput, given a constraint on the number of pipeline stages and a set of resource constraints. Unlike most existing pipelining approaches, our method takes the number of pipeline stages as one of the constraints and considers the pipelining as an aspect of power minimization. Operations are retimed so that as many operations as possible take common operands as their inputs, using a novel technique called force-directed retiming; operand sharing is then determined, based on list scheduling. Experimental results show that the proposed approach reduces the power consumption of functional units by 27.8% on average and by more than 50% in some cases, compared to the state-of-the-art pipelining and operand sharing techniques. Index Terms Common operand, linear system, operand sharing, pipelining, power. I. INTRODUCTION MOST computations used in applications, such as multimedia processing, digital signal processing (DSP), telecommunication, and control are linear [2]. In recent years, throughput, latency and power have become important metrics in the evaluation of a linear system, and hence there has been lots of research into system optimization using one or more of the above three metrics. Pipelining is one of the topics addressed by such research. Various aspects of the pipelining of a linear system have been considered in the context of high-level synthesis for ASIC design. Previous research on the pipelining of a linear system has primarily focused on throughput maximization under resource constraint [3], [4], latency minimization under resource and throughput constraints [5], and joint throughput and latency optimization under resource constraint [6]. However, the design of embedded real-time linear systems is gaining more and more attention and to optimize pipelining in this kind of systems, we often need to minimize the power consumption while taking account of throughput and latency constraints. This has not been considered in previous research. Manuscript received February 25, 2002; revised October 23, This paper was presented in part in Proceedings of the International Symposium on Low Power Electronics and Design, pp , October D. Kim was with the School of Electrical Engineering, Seoul National University, Seoul, Korea. He is now with the SOC Division, GCT Research, Inc., Seoul , Korea ( daehong@ gctsemi.com). D. Shin is with the Information and Computer Science from University of California, Irvine, CA USA ( dongwans@ics.uci.edu). K. Choi is with the School of Electrical Engineering and Computer Science at Seoul National University, Seoul , Korea ( kchoi@ azalea.snu.ac.kr). Digital Object Identifier /TVLSI There has been lots of research on power optimization in high-level synthesis by means of input activity reduction of functional units (FUs). Chatterjee et al. [7] tried to minimize system switching activity by rearranging the data flow graph (DFG) through algebraic manipulation of linear operations in a fully dedicated hardwired design of a linear DSP circuit. However, it was on a dedicated architecture and reduced the switching activities of adders only, resulting in a small reduction in overall system power. In contrast, Musoll et al. [8], [9] presented an algorithm to minimize the switching activities of functional units such as adders, multipliers through scheduling and binding in a shared architecture. While their algorithm maximizes operand locality based on common operands (COs) to minimize the switching activities of functional units, it achieves only a small reduction in overall system power, due to a small number of common operands in the input control data flow graph. Raghunathan and Jha [10] reduced the switched capacitance of modules using an iterative improvement technique for scheduling and module allocation process. It exploits the fact that scheduling and binding should be considered together for effective power reduction. Chang and Pedram [11] addressed the problem of reducing the power of functional units by preserving correlation of data inputs to functional units through careful binding of operations to functional units for functionally pipelined and scheduled designs. They guaranteed optimal solution for one scheduled design instance. Lyuh and Liu [12] proposed an integrated data path optimization method for low power based on network flow method that is similar to Chang s approach. They combined allocation/binding problem with scheduling to improve Chang s approach. While all approaches mentioned above considered module binding or integrated high-level synthesis tasks for low power in a more general way, they did not address the problem of transforming the original design for low power without affecting other design metrics such as throughput. Given a DFG as an input, the operand sharing technique tries to bind operations with a CO to the same FU so that the input activity of the shared FU is reduced. Additionally, reduction in the number of memory accesses due to operand reuse tends to prevent the limit of the number of memory ports from increasing system latency. Operand sharing also helps to reduce the dynamic range of FU inputs, compared to normal FU sharing, resulting in more chances to obtain area and/or power reduction through wordlength optimization. But these additional merits of operand sharing are beyond the scope of this paper. Some work on power reduction using the operand sharing technique has been reported in [8] and [9]. However published results are limited because the DFGs being considered usually /$ IEEE

2 1024 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 9, SEPTEMBER 2005 Fig. 1. Resource binding without and with operand sharing. have only a small number of operation nodes with inputs in common, resulting in few opportunities of operand sharing, and hence insignificant power reduction. A technique has been proposed to overcome such limitations by generating as many operation nodes with COs as possible by means of loop pipelining [13]. But this technique does not consider constraints on the number of pipeline stages and on throughput. In addition, it does not deal with applications that exhibit feedback edges (i.e., loop-carried dependencies). This paper proposes systematic pipelining to solve the problems mentioned above. Our approach is an extension of the work by Kim and Choi [13], in the sense that it generates operation nodes with COs, which are invisible in the original DFG, by means of the pipelining technique. But our approach is completely new in that it can deal with feedback edges in a DFG efficiently and takes into account the number of pipeline stages and throughput when generating nodes with COs. In this paper, we propose a systematic pipelining method for a linear system to minimize power and subsequently to maximize throughput, given a constraint on the number of pipeline stages and a set of resource constraints. Unlike most existing pipelining approaches, our method takes the number of pipeline stages as one of the constraints and considers the pipelining as an aspect of power minimization. This paper is organized as follows. In Section II, we present preliminaries, including DFG as an algorithm input, the operand sharing and the retiming concepts, which are used throughout this paper. In Section III, we present the motivation of our approach. In Section IV, we illustrate a novel systematic pipelining algorithm for a linear system with or without feedback edges. Section V reviews previous work. Experimental results and conclusions are given in Section VI and VII, respectively. II. PRELIMINARIES A. Data Flow Graph We use a DFG as a model of a simple loop body in a linear system. Our DFG is defined as a graph,, where is a set of operation nodes and is a set of directed edges between nodes which denotes the data dependencies. Fig. 2(c) shows the DFG of a simple second-order IIR filter. The number beside each edge is the weight that denotes the iteration difference between the source node (data producer) and the target node (data consumer) of the edge and is represented by,. Definition 1: Data dependency for data that is produced and consumed within the same iteration is called intra-iteration data dependency and the corresponding edge is called an intra-iteration edge. Such an edge has zero weight in the DFG. Definition 2: Data dependency for data that is transferred across one or more iterations is called inter-iteration dependency or loop-carried dependency, and the corresponding edge is called an inter-iteration edge. Such an edge has a positive weight in the DFG. The weight corresponds to the number of delay elements that are triggered, one per iteration. B. Operand Sharing The operand sharing technique binds one identical FU to more than two operation nodes with at least one CO, in such a way that any node without a CO does not intervene between the nodes. Therefore, it enables switching power reduction by maximizing the temporal correlation of input signals to a FU. Fig. 1 illustrates the binding alternatives after scheduling. Assume that we allocate two multipliers and for multiplication operations. Fig. 1(a) shows the resource binding without considering the operand sharing concept, and Fig. 1(b) shows an alternative binding using the concept. The latter maximizes the temporal correlation of the input signal sequence since the input operands, fed to one of inputs of, are fixed within the same iteration and even preserve the original correlation well across consecutive iterations. The new binding decreases the switched capacitance of and, therefore, its power consumption. It is reported that typical value of, computed through switch-level simulation, is about 0.65 for a 12-bit multiplier [8], where is the average power consumption of the multiplier when only one operand changes and is the average power consumption when both operands change simultaneously. C. Retiming Retiming is a transformation technique that can be used to increase the throughput of a loop or to improve the utilization of resources by introducing partial overlap between the execution times of successive loop iterations that are separate in the original description (i.e., pipelining several loop body iterations). The retiming function, is the number of delays drawn from each of the incoming edges of node and pushed to each of the outgoing edges. By changing the position

3 KIM et al.: PIPELINING WITH COMMON OPERANDS FOR POWER-EFFICIENT LINEAR SYSTEMS 1025 Fig. 3. DFG of a typical second-order IIR filter. Fig. 2. Loop pipelining of a simple second-order IIR filter using retiming. of delays through retiming, the weight,, of each edge on the original DFG is transformed to a new weight,,, where is an edge from node to. A retiming is legal if, is nonnegative [14]. Fig. 2 shows an example of loop pipelining a simple secondorder IIR filter using the retiming technique. We assume that one multiplier is shared by the two multiplication operations in the loop body. As shown in Fig. 2(d) and (f), moving one delay on the incoming edge of the upper left multiplication node to its outgoing edge corresponds to pipelining two nodes from different iterations: the upper left multiplication node from the iteration and the remaining nodes from the iteration. As a result of loop pipelining, the critical path length is reduced by the elimination of the intra-iteration dependency from the upper left multiplication node to an addition node. Therefore, the initiation interval of the loop is reduced from 3 to 2 time steps. Note, however, that this improvement in throughput is achieved at the cost of controller and register overhead, due to the increased number of pipeline stages. We also observe that only one multiplier is allocated to perform the two multiplications, which have one common operand within one iteration after pipelining, as shown in Fig. 2(d) and (f). By reducing the switching activity involving input signals to the multiplier, pipelining reduces power consumption. This is an important motivation of our approach, which will be described in detail in the following section. III. MOTIVATION The scheduling algorithm proposed by Musoll et al. [9] utilizes the operand sharing technique to obtain power reductions from 5% to 8% against conventional scheduling algorithms. The reason why they obtained such relatively small improvements is that, for most designs including linear DSP applications, it is not easy to find operation nodes whose inputs are common. To increase the number of nodes with COs in the original DFGs, we present a novel loop pipelining method. While existing loop pipelining transformations try to maximize the throughput or minimize the latency of a loop, our pipelining algorithm retimes the operation nodes such that as many nodes as possible take COs as their inputs and performs scheduling and binding based on the operand sharing concept; it therefore reduces the switching activity of FUs, especially of multipliers, while still aiming to maximize the throughput of the loop. This transformation has a significant power-reducing effect on linear applications such as filters. We illustrate the motivation briefly mentioned at the end of the previous section in more detail, using the simple secondorder IIR filter in Fig. 2. The switching activity of the multiplier is determined by the change of values of the two input operands occurring between consecutive executions. As shown in Fig. 2(a), (c), and (e), if there is no transformation, then both input operands change their values twice per iteration. But now consider pipelining two consecutive iterations as shown in Fig. 2(b), (d), and (f). Then one input of the multiplier changes its value only once at every iteration and we can save some of the power that would otherwise consumed by the multiplier. Obviously, after pipelining, the two operands of the two multiplication operations (one for each) become common. As shown in Fig. 2(f), we can directly apply operand sharing to fanout edges of a node whenever such edges have the same weight and the target nodes of those edges are bound to the same FU. Definition 3: Nodes that have the same operation type are called compatible nodes. Edges having the same source node and compatible target nodes are called compatible edges. A set of compatible edges is called a compatible edge set. Target nodes of all edges in a compatible edge set have the possibility of taking a CO as one of their inputs through retiming. Fig. 3 shows the DFG for a typical second-order IIR filter. Such a filter contains only one compatible edge set, as shown in

4 1026 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 9, SEPTEMBER 2005 Fig. 4. Change in iteration differences of edges in the compatible edge set (a) before pipelining and (b) after pipelining. Fig. 4(a). It consists of four edges whose target multiplication nodes,,, and take their values from an addition node,, after one or two iterations. Let us assume that the DFG in Fig. 3 is originally nonpipelined, and so all nodes in a DFG are initially put into one pipeline stages. We also assume that one multiplier is allocated for the multiplication operations. If appropriate scheduling is performed and therefore four multiplications are executed by the multiplier in the order,, and, then the value of one input of the multiplier changes three times within each iteration and does not change across successive iterations. If two pipeline stages are used, so that,,, and nodes are executed at the second stage, we have the same weight for the edges in the compatible edge set, as shown in Fig. 4(b). So now we have the situation that one input value of the multiplier does not change within each iteration, and changes only once across successive iterations, irrespective of the scheduling. This contributes to the power reduction of the allocated multiplier. The pipelining in Fig. 3 may be regarded as a retiming which moves one delay (weight) on the incoming edges of each of and to their fanout edges. The relation between loop pipelining and retiming to generate COs is explained in detail in the next section through a novel force-directed retiming which is the first phase of our pipelining algorithm. When performing loop pipelining for CO generation, we must consider constraints such as throughput, latency and available resource. In this paper we consider throughput under the constraints on resource and the number of pipeline stages, although the primary concern is power. IV. SYSTEMATIC PIPELINING FOR LOW POWER A. Problem Definition Problem: Find a retiming that generates as many nodes with common input operands as possible and then a scheduling and a binding that minimize the power consumption of FUs, subject to a constraint on the number of pipeline stages and a resource constraint, while maximizing the throughput. Fig. 5 shows the overall process of systematic pipelining that we use to solve the problem. It consists of three phases in a main loop, following the preprocessing step. In the first phase, we use a novel technique called force-directed retiming which retimes operation nodes to allow more nodes to take COs. In the second phase, operand sharing is applied to the retimed DFG, based on the list scheduling. In the third phase, the resulting schedule is checked to ensure it satisfies the given constraints. Fig. 5. Overall process of our systematic pipelining. If the result satisfies all the constraints, it is called feasible and is regarded as one of the solutions. But if the result is not feasible, we try to make it feasible by performing additional incremental folding and rescheduling on nodes excluding the nodes made to take COs by force-directed retiming. To explain the details of each phase of the algorithm, including preprocessing, we use the example of a second-order IIR filter, as shown in Fig. 3 and assume that three adders and two multipliers are allocated with three pipeline stages. We also assume, for the time being, that execution of the adders and multipliers takes one time unit, but our algorithm is able to handle multi-cycle and pipelined FUs. B. Preprocessing Lower bound on initiation interval (line 1): The algorithm first computes a lower bound on the initiation interval for the input DFG. The lower bound on initiation interval, as already presented in many papers [2], [4], [5], is determined by the resource constraint and the lengths of cycles in the DFG. In the case of an IIR filter, the lower bound is 2. The pipelining algorithm initially takes the lower bound as one of its constraints in addition to the number of pipeline stages and the resource constraint, and tries to obtain solutions. If no solution can be found, the initiation interval is increased by one (lines 15 and 16). The reason why we start our algorithm at the lower bound is that the algorithm also aims at maximizing the throughput of the loop, although the primary concern is the power consumption.

5 KIM et al.: PIPELINING WITH COMMON OPERANDS FOR POWER-EFFICIENT LINEAR SYSTEMS 1027 Fig. 6. Pipelined ASAP scheduling. Fig. 7. Pipelined ALAP scheduling. Compatible edge sets and a PPS set (lines 2, 4, 5, and 6): Next, the algorithm finds compatible edge sets (line 2). As mentioned in Section III, target nodes of all edges in a compatible edge set can be made to take a CO as one of their inputs by means of retiming. However, to determine whether a target node can take a CO or not, we need to fix the source node to a specific pipeline stage. First, we determine the range of pipeline stages and time steps,, called the PS&TS frame, for each operation node (line 4). The PS&TS frames are computed by performing pipelined ASAP and ALAP scheduling under constraints on the initiation interval and the number of pipeline stages. We use the algorithm proposed in [6] for the pipelined ASAP, and a slightly modified version of that algorithm for the pipelined ALAP. Figs. 6 and 7 show the results of pipelined ASAP scheduling and pipelined ALAP scheduling for the second-order IIR filter. In the IIR filter example, the PS&TS frames for the,,, and nodes are, and those for the, and nodes are, where denotes integers from one to initiation interval. Then, based on the PS&TS frames, we generate a set of vectors called a PPS set (line 5). Each vector in the set represents a combination of the pipeline stages where the source nodes in compatible edge sets are positioned. Assuming compatible edge sets, the vector is in the form of, where denotes a pipeline stage where the source node of the compatible edge set is positioned. Our algorithm shown in Fig. 5 obtains solutions for all the vectors in the PPS set: in other words, all possible combinations in which source nodes of compatible edge sets can be placed in their ranges of pipeline stages (from line 6 to 13). The algorithm then determines the best solution according to the criteria such as power, initiation interval and turn-around time (latency) (line 18). The IIR filter example contains only one compatible edge set [shown in Fig. 4(a)] as mentioned briefly in the previous section. Its source node,, has the PS&TS frame and one PPS set with two elements, 1 and 2. C. Force-Directed Retiming To determine the optimal locations of the target nodes within the PS frames in such a way that more nodes take COs as their Fig. 8. Force-directed retiming/clustering. input values, we use a force-directed algorithm (line 8 in Fig. 5). 1 The procedure Force-directed retiming/clustering is shown in Fig. 8. Definition 4: CO probability: denotes the probability that the target node of edge in compatible edge set is placed in a pipeline stage such that the edge receives the weight. Definition 5: CO type distribution: denotes the sum of CO probabilities over the edges in compatible edge set. Definition 6: Force: is defined as, where and are respectively the upper and lower bound of the PS frame of the target node of edge. This definition is the same as the one used in force-directed scheduling [15]. In the IIR filter example, we assume that the source node is put in pipeline stage 1. Then and are computed as shown in Fig. 9. Since we want to maximize the number of compatible edges with the same weight, we select the largest type distribution, which is in our example. Then we compute the forces for the selected CO type distribution only. The algorithm selects the target node with the largest force and retimes it at the corresponding pipeline stage. The reason why we first select and retime the node with the largest force is that it contributes most to the corresponding CO type distribution. The process is then applied to the target node with the next 1 Before applying the force-directed algorithm, we need to update the PS&TS frames of the target nodes (line 7 in Fig. 5) for a given placement of source nodes

6 1028 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 9, SEPTEMBER 2005 Fig. 9. Calculation of CO probability, CO type distribution, and force. largest force. During this process, we partially bind the selected nodes to the same FU. The number of selected target operation nodes is limited by the total number of target nodes divided by the total number of FUs of the corresponding type. The limit corresponds to the maximum number of operation nodes with a CO that can be executed by an FU. Setting such a limit distributes the load evenly over the FUs, resulting in an improvement in resource utilization. In our IIR filter example, because two multipliers are available to execute a total of four target operation nodes, two nodes are selected for each multiplier. Although is the smallest force, we place first since it is on the critical path. Then we place since is the largest force. Therefore, in our example, and are selected for one multiplier. In case of a tie, a node belonging to a cycle in the DFG has priority, because such a node is less flexible in determining the pipeline stage. After selecting all target nodes required for one multiplier, the algorithm takes another multiplier and repeats the same process for the remaining target nodes. It continues until all target nodes are allocated to a pipeline stage. As a result of force-directed retiming for the compatible edge set of the IIR filter, pipeline stages for,,, and nodes are determined: pipeline stage 0 for and and pipeline stage 1 for and. Finally, the remaining nodes that are not related to the set are put into their earliest pipeline stages so as not to have a negative effect on system latency. D. Operand Sharing After all nodes are allocated to a pipeline stage, operand sharing based on list scheduling is performed for the retimed DFG. The priorities of nodes for list scheduling are given as follows. 1) Operation nodes whose sibling operation nodes have already been scheduled get higher priority. 2) In the case of a tie, we first select the node with constant CO. 3) In the case of a tie, consider urgency [15] in the retimed DAG, with the aim of reducing the initiation interval. 4) If the tie is still not broken, consider urgency in the original DAG, with the aim of reducing latency. Note that there exist two groups of sibling operation nodes, depending on whether the CO that they take as one of their inputs is a data item (denoted by a compatible edge) or a constant such as a coefficient. A constant CO is not affected by any retiming; therefore, if the original DFG contains sibling operation nodes with a constant CO, then the nodes still take the constant CO as one of their inputs after pipelining. Linear systems such as filters generally have constants that are fed to the inputs of multipliers. In Case 2 above, we first select the node with a constant CO since the effect of constant COs on power reduction is usually bigger than that of data COs [1]. The reason why a FU that takes constant COs at one of its inputs successively consumes less power over one with data COs is related to the correlation of values fed to one input of a FU. That is because, when operands are not being shared, successive data operands fed to input of a FU, such as and, have some correlation between them. In contrast, different constant operands such as and have no correlation between them. E. Feasibility Test After scheduling, we check the scheduled result to see if it meets constraints on initiation interval and the number of pipeline stages. The test result is classified as one of the following four cases: Case 1) if (the last time step in the last pipeline stage is greater than the initial interval) not feasible; Case 2) else if (the initiation interval obtained is less than or equal to the constraint) done; Case 3) else if (there are nodes with COs among the excess nodes) not feasible; Case 4) else incremental folding/rescheduling. An excess node in Case 3 is a node that is scheduled in excess of the initiation interval constraint. In Case 4, excess nodes are rescheduled in the following pipeline stages, so as to resolve the violation of the initiation interval constraint. The result of list scheduling for the IIR filter in Fig. 10 belongs to Case 4 and contains two excess nodes, and, which are not related to COs. By performing incremental folding and rescheduling, we obtain the final scheduling result with three pipeline stages and an initiation interval of two, as shown in Fig. 11. In this solution, one multiplier executes two multiplication operation nodes, and, which have COs and the other one executes and nodes with COs. Note that this is just one of the solutions that our systematic pipelining algorithm generates to reduce power consumption under the given constraint on the number of pipeline stages and a set of resource constraints.

7 KIM et al.: PIPELINING WITH COMMON OPERANDS FOR POWER-EFFICIENT LINEAR SYSTEMS 1029 Fig. 10. List scheduling result for the retimed DAG of the IIR filter. Fig. 11. Final schedule. F. Algorithm Complexity Analysis Assuming that the number of nodes is, the number of compatible edges is, and lower bound on II is, the complexity of each step of the algorithm in Fig. 5 is as follows. 1) Find compatible edge sets() in line 2 of Fig. 5: 2) Calculate PS&TS frame() in line 4: 3) Generate a PPS set for every source nodes in compatible edge sets() in line 5: 4) Force-directed retiming/clustering() in line 8: 5) Operand sharing() in line 9: 6) Feasibility test() in line 10: Detailed complexity analysis of the force-directed retiming algorithm in Fig. 8 is as follows. 1) Calculate CO probabilities for target nodes() in line 3 of Fig. 8: 2) Calculate CO type distribution() in line 4: 3) Ranking the CO type distribution in line 5: 4) Compute force() in line 6: 5) Determine PSs for critical target nodes() in line 7: 6) Retime/cluster the nodes with large force() in line 8: Fig. 12. DFG of a typical third-order FIR filter. Since steps from lines 3 to 8 in Fig. 8 times and force-directed retiming algorithm iterates times, overall complexity of the force-directed retiming algorithm in line 8 of Fig. 5 is. G. Application to Linear Systems Without Feedback Edges While linear systems such as IIR filters have feedback edges with inter-iteration dependency, some linear systems such as FIR filters do not have feedback edges. However, an FIR filter has inter-iteration dependency, originating from a system input node, as shown in Fig. 12. As shown in Fig. 13, this sort of filter has one compatible edge set, whose source node is always a system input node, IN. Although the input node cannot be positioned at any pipeline stage, we assume that it should have the same PS frame as that of target nodes that have an intra-iteration dependency (denoted by an edge with zero weight) with Fig. 13. A compatible edge set in third order FIR filter. the input node. Another important step that we should consider is the procedure Update PS&TS frames (line 7 in Fig. 5), which is invoked immediately before the Force-directed retiming/clustering procedure. After one element in the PPS set is fixed (i.e., the input node is assigned to a specific pipeline stage), the PS frames of target nodes are updated, reflecting the fact that target nodes that have an intra-iteration dependency with the input node should be positioned at the same pipeline stage as the input node. V. EXPERIMENTAL RESULTS We implemented our pipelining method using C++ in the UNIX environment. Our implementation takes the VHDL description of a linear system and compiles it into a DFG. Given

8 1030 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 9, SEPTEMBER 2005 TABLE I COMPARISONS OF THE ESTIMATOR WITH IRSIM TABLE II COMPARISONS OF POWER CONSUMPTION FOR THE SECOND-ORDER IIR FILTER, UNDER TWO MULTIPLIERS AND TWO ALUS (INITIATION INTERVAL =3CLOCK CYCLES, THE NUMBER OF PIPELINE STAGES =2) the DFG, constraints on the number of pipeline stages, and a resource allocation table provided by the user, the pipelining method produces a retimed, scheduled, and partially bound DFG, on which normal FU, register, and interconnect binding is performed. For the experiment, we selected linear systems from the wellknown HYPER examples [16]. To show the comparative effectiveness of the proposed method, we also implemented the state-of-the-art loop pipelining method proposed in [6], which minimizes the latency under constraints on target initiation interval and resource, and the operand sharing algorithm proposed by Musoll et al. [9]. We first compared our pipelining technique with an existing loop pipelining technique [6] followed by normal binding to show that we can reduce the power consumption, while having the same initiation interval. Next we compared our pipelining algorithm with an existing pipelining algorithm followed by binding, based on operand sharing, to show that the synthesized results from our pipelining consume less power even than those achieved by a combination of such existing techniques. To estimate power consumption, we modified SPA [17], an RT-level power estimation tool that is based on the Dual Bit Type model. Although SPA is reported to have an average power estimation error of 9% with a standard deviation of 6%, we need to verify its reliability in estimating the power under our experimental environment, since it depends on the power characterization of each module for relatively accurate estimation. We compared the estimated powers of FUs for the second-order infinite-impulse response (IIR) filter with those estimated by running IRSIM [18], a switch-level simulator, on the simulation file extracted from the layout. For automatic generation of the layout, we used the LagerIV layout synthesis system [19]. To use the system, we transformed our synthesis result into.sdl file format, which is an input format of the LagerIV system. The synthesis result consists of the pipelined data path that is the retimed, scheduled, and bound, and the corresponding controller descriptions. The transformation into layout was done using the DMOCT and DMPOST tools in LagerIV. The controller was synthesized by the ESPRESSO tool. We used 1.2 micrometer technology for the layout generation and took real speech data for functional simulation and power estimation. Table I shows that our estimator has only a 2.4% error in evaluating the effect of the power reduction in FUs. In fact, the reliability of our estimation program was verified in the same way for different applications in other literature [20], which reports an average power estimation error of 1.8%. Comparison results in both [20] and Table I imply that our power estimation program reliably evaluates the power reductions of FUs. The second-order IIR filter in Fig. 2 was synthesized under two different resource constraints. The first constraint is two array multipliers and two ALUs. We assume that both the multipliers and the ALUs take one cycle to execute. In the second, third and fourth columns of Table II, we show the estimated switched capacitance values in every modules, after applying the existing pipelining [6], the existing pipelining combined with operand sharing, and our approach, respectively. Fourth and sixth columns show comparisons between our approach and the existing pipelining, and also between our approach and a combination of existing pipelining and operand sharing techniques, in terms of switched capacitance. In this example, the ratio of the switched capacitance in FUs to that in the overall system is about 65% 70%. The proposed method reduces the switched capacitance in FUs by about 30% and

9 KIM et al.: PIPELINING WITH COMMON OPERANDS FOR POWER-EFFICIENT LINEAR SYSTEMS 1031 TABLE III COMPARISONS OF POWER CONSUMPTION FOR THE SECOND-ORDER IIR FILTER, UNDER ONE MULTIPLIER AND ONE ALU (INITIATION INTERVAL =4CLOCK CYCLES, THE NUMBER OF PIPELINE STAGES =2) TABLE IV EXPERIMENTAL EXAMPLES that corresponds to about 21% in the overall system. Table III shows the case of one multiplier and one ALU. In contrast to Table II, where we have no reduction compared to the combination of existing pipelining and operand sharing, the sixth column of Table III shows more than 47% reduction. This is due to the fact that we can generate many COs through our approach, whereas no CO is generated with the existing approaches. Table IV shows the characteristics of other benchmark examples that are used in our experiments. The second column in Table IV shows the number of operation nodes of each type in the DFG and the third and fourth columns show the number of compatible edge sets and the sum of the number of compatible edges in each set, for each benchmark example. If a DFG contains a large number of compatible edge sets that have many compatible edges as their respective elements, the potential for CO generation through pipelining is very high. The rows marked FIR7 and FIR11 correspond to 7th- and 11th-order FIR filters and IIR7 is a 7th-order IIR filter. The labels Wavelet and Lattice denote a wavelet filter and a lattice filter. Parallel is a parallel form of Avenhaus filter. While IIR7 and the parallel and lattice filters have feedback edges, FIR7, FIR11, and the wavelet filter are representative linear system examples without feedback edges. In Table V, we present experimental results for the benchmark examples, under various constraints. The second column of Table V shows the initiation interval and the number of pipeline stages. The third column shows resource constraints. In the fourth, fifth, and sixth columns, we show the estimated switched capacitance values in FUs, after applying the existing pipelining, the existing pipelining combined with operand sharing, and our approach, respectively. We assume that an ALU takes one cycle and a multiplier takes two cycles to execute. The same values in some rows of fourth and sixth columns mean that no CO is generated after the existing pipelining is applied, or normal binding following existing pipelining has the same effect as binding based on operand sharing. As shown in the seventh column of Table V, our pipelining algorithm consumes less power under the same initiation interval as the existing one. The eighth column shows that our algorithm also achieves a larger power reduction than the combination of the existing techniques for pipelining and operand sharing. We conclude that our CO-centric pipelining approach is more effective for power reduction even compared with the combination of existing techniques. The impact of our algorithm on initiation interval (or throughput) needs to be addressed. The second column in Table V shows the initiation interval and the number of pipeline stages, which are used as constraints of our algorithm, as mentioned above. To select reasonable values of the two parameters, we first use state-of-the-art loop pipelining [6] for the minimization of latency, and accordingly the number of pipeline stages under the given initiation interval and resource constraints. It starts from the lower bound of the

10 1032 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 9, SEPTEMBER 2005 TABLE V COMPARISONS OF POWER CONSUMPTION IN FUS FOR EXAMPLES UNDER VARIOUS CONSTRAINTS initiation interval, trying to calculate the minimum latency under resource constraints and continues the process by increasing the initiation interval by one until the feasible solution for the resource constraints is obtained. The initiation interval constraint in the second column of Table V is the value calculated in this way. The reason why we use the same initiation intervals as those obtained by [6] is to demonstrate that our algorithm does not adversely affect the throughput. However, by increasing the initiation interval or pipeline stages, we can further reduce the switched capacitance (or energy consumption). For example, by increasing (II, PS) of the IIR7 example from (9, 2) to (9, 3), we can obtain about 6% more reduction. This is because loose constraints allow force-directed retiming to generate more common operands, resulting in more opportunities for operand sharing. Table VI shows the power reduction results for the total power consumed by the whole circuit. For all examples in the table, we can observe that the reduction ratio is smaller compared to that of Table V and sometimes is even negative values, i.e., an increase of power consumption. This is primarily due to the small portion occupied by the FU power consumption in the total power consumption of the whole circuit and, if any, the increase of the power consumed by other components such as multiplexers, registers, and buses. Note that the power consumption of modules, other than FUs, can increase, as shown in Table III. Generally speaking, pipelining and operand sharing techniques increase the number of registers needed because the lifetime of variables is increased. Since the target architecture of our hardware synthesis system is based on a point-to-point interconnection scheme, any increase of the number of registers needed increases the number of buses required (which corresponds to the interconnection between an FU and a register, or a register-to-register interconnection in our current target architecture). This is a shortcoming of our approach, since the power consumption in buses is becoming more and more dominant as the technology goes into deep submicron. However, by using a bus-based interconnection scheme, we can make more efficient use of interconnects by means of bus allocation and scheduling. We expect that, by combining our pipelining with the bus-based interconnection scheme, the power consumption of the overall system can be further reduced.

11 KIM et al.: PIPELINING WITH COMMON OPERANDS FOR POWER-EFFICIENT LINEAR SYSTEMS 1033 TABLE VI COMPARISONS OF TOTAL POWER CONSUMPTION OF THE CIRCUITS VI. CONCLUSIONS In this paper, we have proposed a systematic pipelining method for linear systems to minimize power and maximize throughput, under a constraint on the number of pipeline stages and a set of resource constraints. Our method deals with loop-carried dependencies in a data flow graph using forcedirected retiming, while considering throughput and latency. Experimental results over a set of linear systems show that, by focusing on common operands, our pipelining algorithm reduces the power consumption of functional units by 27.8% on average compared with a conventional pipelining technique with the same initiation interval. A power reduction of 22.6% on average is also achieved over the combination of a conventional pipelining algorithm and operand sharing. REFERENCES [1] D. Kim, D. Shin, and K. Choi, Low power pipelining of linear systems: A common operand centric approach, in Proc. Int. Symp. on Low Power Electronics and Design, Oct. 2001, pp [2] M. Srivastava and M. Potkonjak, Power optimization in programmable processors and ASIC implementations of linear systems: Transformation-based approach, in Proc. 33rd ACM/IEEE Design Automation Conf., 1996, pp [3] G. Goossens, J. Vandewalle, and H. D. Man, Loop optimization in register-transfer scheduling for DSP systems, in Proc. 26th ACM/IEEE Design Automation Conf., 1989, pp [4] L. F. Chao, A. LaPaugh, and E. H. M. Sha, Rotation scheduling: A loop pipelining algorithm, in Proc. 30th ACM/IEEE Design Automation Conf., 1993, pp [5] C. T. Hwang, Y. C. Hsu, and Y. L. Lin, Scheduling for functional pipelining and loop winding, in Proc. 28th ACM/IEEE Design Automation Conf., 1991, pp [6] T. F. Lee, A. C. H. Wu, Y. L. Lin, and D. D. Gajski, A transformationbased method for loop folding, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 13, pp , Apr [7] A. Chatterjee and R. Roy, Synthesis of low power linear DSP circuits using activity metrics, in Proc. Int. Conf. on VLSI Design, 1994, pp [8] E. Musoll and J. Cortadella, High-level synthesis technique for reducing the activity of functional units, in Proc. Int. Symp. on Low Power Design, 1995, pp [9], Scheduling and resource binding for low power, in Proc. Int. Symp. on System Synthesis, 1995, pp [10] A. Raghunathan and N. K. Jha, SCALP: An iterative improvement based low-power data path synthesis system, IEEE Trans. Comput.- Aided Des. Integr. Circuits Syst., [11] J.-M. Chang and M. Pedram, Module assignment for low power, in Proc. European Design Automation Conf., [12] C.-G. Lyuh, T. Kim, and C. L. Liu, An integrated data path optimization for low power based on network flow method, in Proc. ACM/IEEE Design Automation Conf., [13] D. Kim and K. Choi, Power-conscious high level synthesis using loop folding, in Proc. 34th ACM/IEEE Design Automation Conf., Jun. 1997, pp [14] C. E. Leiserson and J. B. Saxe, Retiming synchronous circuitry, in ALGORITHMICA, 1991, vol. 6, pp [15] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, [16] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, Fast prototyping of datapath-intensive architectures, IEEE Design &Test, pp , Jun [17] P. Landman and J. M. Rabaey, Activity-sensitive architectural power analysis, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 15, pp , Jun [18] A. Salz and M. Horowitz, IRSIM: An incremental MOS switch-level simulator, in Proc. 26th ACM/IEEE Design Automation Conf., 1989, pp [19] C. B. Shung, R. Jain, K. Rimey, E. Wang, M. B. Srivastava, E. Lettang, B. C. Richards, S. K. Azim, L. Thon, P. N. Hilfinger, J. M. Rabaey, and R. W. Brodersen, An integrated CAD system for algorithm-specificic design, in IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 10, Apr. 1991, pp [20] J. Choi, J. Jeon, and K. Choi, Power minimization of functional units by partially guarded computation, in Proc. Int. Symp. on Low Power Electronics and Design, Jul

12 1034 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 9, SEPTEMBER 2005 Daehong Kim (M 02) received the B.S. degree in electronics engineering in 1995, the M.S. degree in electrical engineering in 1997, and the Ph.D. degree in electrical engineering and computer science in 2002, from Seoul National University, Seoul, Korea. Since 2002, he is working on SOC design and verification at SOC division, GCT Research, Inc., Seoul, Korea. His research interests are low-power design and system design including hardware-software codesign, SoC design methodology, and embedded system. Kiyoung Choi (S 84 M 88) received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, the M.S. degree in electrical and electronics engineering from Korea Advanced Institute of Science and Technology, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA. He is currently a professor in the School of Electrical Engineering and Computer Science at Seoul National University. His research interests include hardware-software co-design, low-power systems design, and configurable/re-configurable processor architecture design. Dongwan Shin (M 04) received the B.S. degree in electronics engineering from Hanyang University, Seoul, Korea, in 1995, the M.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1997, and Ph.D. degree in information and computer science from University of California, Irvine, in From 1997 to 1999, he was with LG Semicon Company Ltd., Seoul, Korea. Since 2004, he has been a Project Scientist in Center for Embedded Computer Systems, University of California, Irvine. His research interests are system-level design automation, high-level synthesis, and low-power system design.

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given

More information

FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders

FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders 770 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 8, AUGUST 2001 FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders Hyeong-Ju

More information

Synthesis of DSP Systems using Data Flow Graphs for Silicon Area Reduction

Synthesis of DSP Systems using Data Flow Graphs for Silicon Area Reduction Synthesis of DSP Systems using Data Flow Graphs for Silicon Area Reduction Rakhi S 1, PremanandaB.S 2, Mihir Narayan Mohanty 3 1 Atria Institute of Technology, 2 East Point College of Engineering &Technology,

More information

Unit 2: High-Level Synthesis

Unit 2: High-Level Synthesis Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

An Area-Efficient BIRA With 1-D Spare Segments

An Area-Efficient BIRA With 1-D Spare Segments 206 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 26, NO. 1, JANUARY 2018 An Area-Efficient BIRA With 1-D Spare Segments Donghyun Kim, Hayoung Lee, and Sungho Kang Abstract The

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 9 /Issue 3 / OCT 2017

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume 9 /Issue 3 / OCT 2017 Design of Low Power Adder in ALU Using Flexible Charge Recycling Dynamic Circuit Pallavi Mamidala 1 K. Anil kumar 2 mamidalapallavi@gmail.com 1 anilkumar10436@gmail.com 2 1 Assistant Professor, Dept of

More information

High-Level Synthesis (HLS)

High-Level Synthesis (HLS) Course contents Unit 11: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 11 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech)

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech) DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech) K.Prasad Babu 2 M.tech (Ph.d) hanumanthurao19@gmail.com 1 kprasadbabuece433@gmail.com 2 1 PG scholar, VLSI, St.JOHNS

More information

An Efficient Hierarchical Clustering Method for the Multiple Constant Multiplication Problem

An Efficient Hierarchical Clustering Method for the Multiple Constant Multiplication Problem An Efficient Hierarchical Clustering Method for the Multiple Constant Multiplication Problem Akihiro Matsuura, Mitsuteru Yukishita, Akira Nagoya NTT Communication Science Laboratories 2 Hikaridai, Seika-cho,

More information

Low Power Bus Binding Based on Dynamic Bit Reordering

Low Power Bus Binding Based on Dynamic Bit Reordering Low Power Bus Binding Based on Dynamic Bit Reordering Jihyung Kim, Taejin Kim, Sungho Park, and Jun-Dong Cho Abstract In this paper, the problem of reducing switching activity in on-chip buses at the stage

More information

RPUSM: An Effective Instruction Scheduling Method for. Nested Loops

RPUSM: An Effective Instruction Scheduling Method for. Nested Loops RPUSM: An Effective Instruction Scheduling Method for Nested Loops Yi-Hsuan Lee, Ming-Lung Tsai and Cheng Chen Department of Computer Science and Information Engineering 1001 Ta Hsueh Road, Hsinchu, Taiwan,

More information

Distributed Synchronous Control Units for Dataflow Graphs under Allocation of Telescopic Arithmetic Units

Distributed Synchronous Control Units for Dataflow Graphs under Allocation of Telescopic Arithmetic Units Distributed Synchronous Control Units for Dataflow Graphs under Allocation of Telescopic Arithmetic Units Euiseok Kim, Hiroshi Saito Jeong-Gun Lee Dong-Ik Lee Hiroshi Nakamura Takashi Nanya Dependable

More information

Behavioural Transformation to Improve Circuit Performance in High-Level Synthesis*

Behavioural Transformation to Improve Circuit Performance in High-Level Synthesis* Behavioural Transformation to Improve Circuit Performance in High-Level Synthesis* R. Ruiz-Sautua, M. C. Molina, J.M. Mendías, R. Hermida Dpto. Arquitectura de Computadores y Automática Universidad Complutense

More information

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen

More information

An Intelligent Priority Decision Making Algorithm for Competitive Operators in List-based Scheduling

An Intelligent Priority Decision Making Algorithm for Competitive Operators in List-based Scheduling IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.1, January 2009 81 An Intelligent Priority Decision Making Algorithm for Competitive Operators in List-based Scheduling Jun-yong

More information

HIGH-LEVEL SYNTHESIS

HIGH-LEVEL SYNTHESIS HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms

More information

Optimized Design Platform for High Speed Digital Filter using Folding Technique

Optimized Design Platform for High Speed Digital Filter using Folding Technique Volume-2, Issue-1, January-February, 2014, pp. 19-30, IASTER 2013 www.iaster.com, Online: 2347-6109, Print: 2348-0017 ABSTRACT Optimized Design Platform for High Speed Digital Filter using Folding Technique

More information

Architecture and Synthesis for Multi-Cycle Communication

Architecture and Synthesis for Multi-Cycle Communication Architecture and Synthesis for Multi-Cycle Communication Jason Cong, Yiping Fan, Xun Yang, Zhiru Zhang Computer Science Department University of California, Los Angeles Los Angeles CA 90095 USA {cong,

More information

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation Introduction to Electronic Design Automation Model of Computation Jie-Hong Roland Jiang 江介宏 Department of Electrical Engineering National Taiwan University Spring 03 Model of Computation In system design,

More information

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation Journal of Automation and Control Engineering Vol. 3, No. 1, February 20 A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation Dam. Minh Tung and Tran. Le Thang Dong Center of Electrical

More information

COE 561 Digital System Design & Synthesis Introduction

COE 561 Digital System Design & Synthesis Introduction 1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design

More information

REDUCING THE CODE SIZE OF RETIMED SOFTWARE LOOPS UNDER TIMING AND RESOURCE CONSTRAINTS

REDUCING THE CODE SIZE OF RETIMED SOFTWARE LOOPS UNDER TIMING AND RESOURCE CONSTRAINTS REDUCING THE CODE SIZE OF RETIMED SOFTWARE LOOPS UNDER TIMING AND RESOURCE CONSTRAINTS Noureddine Chabini 1 and Wayne Wolf 2 1 Department of Electrical and Computer Engineering, Royal Military College

More information

A Complete Data Scheduler for Multi-Context Reconfigurable Architectures

A Complete Data Scheduler for Multi-Context Reconfigurable Architectures A Complete Data Scheduler for Multi-Context Reconfigurable Architectures M. Sanchez-Elez, M. Fernandez, R. Maestre, R. Hermida, N. Bagherzadeh, F. J. Kurdahi Departamento de Arquitectura de Computadores

More information

Asynchronous Behavior Related Retiming in Gated-Clock GALS Systems

Asynchronous Behavior Related Retiming in Gated-Clock GALS Systems Asynchronous Behavior Related Retiming in Gated-Clock GALS Systems Sam Farrokhi, Masoud Zamani, Hossein Pedram, Mehdi Sedighi Amirkabir University of Technology Department of Computer Eng. & IT E-mail:

More information

Design and Implementation of VLSI 8 Bit Systolic Array Multiplier

Design and Implementation of VLSI 8 Bit Systolic Array Multiplier Design and Implementation of VLSI 8 Bit Systolic Array Multiplier Khumanthem Devjit Singh, K. Jyothi MTech student (VLSI & ES), GIET, Rajahmundry, AP, India Associate Professor, Dept. of ECE, GIET, Rajahmundry,

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

/$ IEEE

/$ IEEE IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 56, NO. 1, JANUARY 2009 81 Bit-Level Extrinsic Information Exchange Method for Double-Binary Turbo Codes Ji-Hoon Kim, Student Member,

More information

Architecture-Level Synthesis for Automatic Interconnect Pipelining

Architecture-Level Synthesis for Automatic Interconnect Pipelining Architecture-Level Synthesis for Automatic Interconnect Pipelining Jason Cong, Yiping Fan, Zhiru Zhang Computer Science Department University of California, Los Angeles, CA 90095 {cong, fanyp, zhiruz}@cs.ucla.edu

More information

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.15, NO.1, FEBRUARY, 2015 http://dx.doi.org/10.5573/jsts.2015.15.1.077 Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network

More information

OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION

OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION 1 S.Ateeb Ahmed, 2 Mr.S.Yuvaraj 1 Student, Department of Electronics and Communication/ VLSI Design SRM University, Chennai, India 2 Assistant

More information

Maximally and Arbitrarily Fast Implementation of Linear and Feedback Linear Computations

Maximally and Arbitrarily Fast Implementation of Linear and Feedback Linear Computations 30 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 1, JANUARY 2000 Maximally and Arbitrarily Fast Implementation of Linear and Feedback Linear Computations Miodrag

More information

MRPF: An Architectural Transformation for Synthesis of High-Performance and Low-Power Digital Filters

MRPF: An Architectural Transformation for Synthesis of High-Performance and Low-Power Digital Filters MRPF: An Architectural Transformation for Synthesis of High-Performance and Low-Power Digital Filters Hunsoo Choo, Khurram Muhammad, Kaushik Roy Electrical & Computer Engineering Department Texas Instruments

More information

Head, Dept of Electronics & Communication National Institute of Technology Karnataka, Surathkal, India

Head, Dept of Electronics & Communication National Institute of Technology Karnataka, Surathkal, India Mapping Signal Processing Algorithms to Architecture Sumam David S Head, Dept of Electronics & Communication National Institute of Technology Karnataka, Surathkal, India sumam@ieee.org Objectives At the

More information

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator A.Sindhu 1, K.PriyaMeenakshi 2 PG Student [VLSI], Dept. of ECE, Muthayammal Engineering College, Rasipuram, Tamil Nadu,

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ

Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ Dynamic Voltage Scaling of Periodic and Aperiodic Tasks in Priority-Driven Systems Λ Dongkun Shin Jihong Kim School of CSE School of CSE Seoul National University Seoul National University Seoul, Korea

More information

Twiddle Factor Transformation for Pipelined FFT Processing

Twiddle Factor Transformation for Pipelined FFT Processing Twiddle Factor Transformation for Pipelined FFT Processing In-Cheol Park, WonHee Son, and Ji-Hoon Kim School of EECS, Korea Advanced Institute of Science and Technology, Daejeon, Korea icpark@ee.kaist.ac.kr,

More information

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A.S. Sneka Priyaa PG Scholar Government College of Technology Coimbatore ABSTRACT The Least Mean Square Adaptive Filter is frequently

More information

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO 2402 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 6, JUNE 2016 A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO Antony Xavier Glittas,

More information

Architectural-Level Synthesis. Giovanni De Micheli Integrated Systems Centre EPF Lausanne

Architectural-Level Synthesis. Giovanni De Micheli Integrated Systems Centre EPF Lausanne Architectural-Level Synthesis Giovanni De Micheli Integrated Systems Centre EPF Lausanne This presentation can be used for non-commercial purposes as long as this note and the copyright footers are not

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

An Efficient Flexible Architecture for Error Tolerant Applications

An Efficient Flexible Architecture for Error Tolerant Applications An Efficient Flexible Architecture for Error Tolerant Applications Sheema Mol K.N 1, Rahul M Nair 2 M.Tech Student (VLSI DESIGN), Department of Electronics and Communication Engineering, Nehru College

More information

A framework for automatic generation of audio processing applications on a dual-core system

A framework for automatic generation of audio processing applications on a dual-core system A framework for automatic generation of audio processing applications on a dual-core system Etienne Cornu, Tina Soltani and Julie Johnson etienne_cornu@amis.com, tina_soltani@amis.com, julie_johnson@amis.com

More information

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

Improving Area Efficiency of Residue Number System based Implementation of DSP Algorithms

Improving Area Efficiency of Residue Number System based Implementation of DSP Algorithms Improving Area Efficiency of Residue Number System based Implementation of DSP Algorithms M.N.Mahesh, Satrajit Gupta Electrical and Communication Engg. Indian Institute of Science Bangalore - 560012, INDIA

More information

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology http://dx.doi.org/10.5573/jsts.014.14.6.760 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.14, NO.6, DECEMBER, 014 A 56-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology Sung-Joon Lee

More information

Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path

Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path Michalis D. Galanis, Gregory Dimitroulakos, and Costas E. Goutis VLSI Design Laboratory, Electrical and Computer Engineering

More information

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL ARCHITECTURAL-LEVEL SYNTHESIS Motivation. Outline cgiovanni De Micheli Stanford University Compiling language models into abstract models. Behavioral-level optimization and program-level transformations.

More information

RECENTLY, researches on gigabit wireless personal area

RECENTLY, researches on gigabit wireless personal area 146 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 55, NO. 2, FEBRUARY 2008 An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications Yuan Chen, Student Member, IEEE,

More information

DESIGN OF 2-D FILTERS USING A PARALLEL PROCESSOR ARCHITECTURE. Nelson L. Passos Robert P. Light Virgil Andronache Edwin H.-M. Sha

DESIGN OF 2-D FILTERS USING A PARALLEL PROCESSOR ARCHITECTURE. Nelson L. Passos Robert P. Light Virgil Andronache Edwin H.-M. Sha DESIGN OF -D FILTERS USING A PARALLEL PROCESSOR ARCHITECTURE Nelson L. Passos Robert P. Light Virgil Andronache Edwin H.-M. Sha Midwestern State University University of Notre Dame Wichita Falls, TX 76308

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm Volume-6, Issue-6, November-December 2016 International Journal of Engineering and Management Research Page Number: 229-234 An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary

More information

Worst Case Execution Time Analysis for Synthesized Hardware

Worst Case Execution Time Analysis for Synthesized Hardware Worst Case Execution Time Analysis for Synthesized Hardware Jun-hee Yoo ihavnoid@poppy.snu.ac.kr Seoul National University, Seoul, Republic of Korea Xingguang Feng fengxg@poppy.snu.ac.kr Seoul National

More information

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Design and Implementation of CVNS Based Low Power 64-Bit Adder Design and Implementation of CVNS Based Low Power 64-Bit Adder Ch.Vijay Kumar Department of ECE Embedded Systems & VLSI Design Vishakhapatnam, India Sri.Sagara Pandu Department of ECE Embedded Systems

More information

A Transaction Processing Technique in Real-Time Object- Oriented Databases

A Transaction Processing Technique in Real-Time Object- Oriented Databases 122 IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.1, January 2008 A Transaction Processing Technique in Real-Time Object- Oriented Databases Woochun Jun Dept. of Computer

More information

RS-FDRA: A Register-Sensitive Software Pipelining Algorithm for Embedded VLIW Processors

RS-FDRA: A Register-Sensitive Software Pipelining Algorithm for Embedded VLIW Processors IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 21, NO. 12, DECEMBER 2002 1395 RS-FDRA: A Register-Sensitive Software Pipelining Algorithm for Embedded VLIW Processors

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

High Level Synthesis

High Level Synthesis High Level Synthesis Design Representation Intermediate representation essential for efficient processing. Input HDL behavioral descriptions translated into some canonical intermediate representation.

More information

Eliminating False Loops Caused by Sharing in Control Path

Eliminating False Loops Caused by Sharing in Control Path Eliminating False Loops Caused by Sharing in Control Path ALAN SU and YU-CHIN HSU University of California Riverside and TA-YUNG LIU and MIKE TIEN-CHIEN LEE Avant! Corporation In high-level synthesis,

More information

VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila Khan 1 Uma Sharma 2

VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila Khan 1 Uma Sharma 2 IJSRD - International Journal for Scientific Research & Development Vol. 3, Issue 05, 2015 ISSN (online): 2321-0613 VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila

More information

Implementation of a Low Power Decimation Filter Using 1/3-Band IIR Filter

Implementation of a Low Power Decimation Filter Using 1/3-Band IIR Filter Implementation of a Low Power Decimation Filter Using /3-Band IIR Filter Khalid H. Abed Department of Electrical Engineering Wright State University Dayton Ohio, 45435 Abstract-This paper presents a unique

More information

DESIGN AND IMPLEMENTATION OF VLSI SYSTOLIC ARRAY MULTIPLIER FOR DSP APPLICATIONS

DESIGN AND IMPLEMENTATION OF VLSI SYSTOLIC ARRAY MULTIPLIER FOR DSP APPLICATIONS International Journal of Computing Academic Research (IJCAR) ISSN 2305-9184 Volume 2, Number 4 (August 2013), pp. 140-146 MEACSE Publications http://www.meacse.org/ijcar DESIGN AND IMPLEMENTATION OF VLSI

More information

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding

A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding A Low Power Asynchronous FPGA with Autonomous Fine Grain Power Gating and LEDR Encoding N.Rajagopala krishnan, k.sivasuparamanyan, G.Ramadoss Abstract Field Programmable Gate Arrays (FPGAs) are widely

More information

Design of Low Power Digital CMOS Comparator

Design of Low Power Digital CMOS Comparator Design of Low Power Digital CMOS Comparator 1 A. Ramesh, 2 A.N.P.S Gupta, 3 D.Raghava Reddy 1 Student of LSI&ES, 2 Assistant Professor, 3 Associate Professor E.C.E Department, Narasaraopeta Institute of

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Verilog for High Performance

Verilog for High Performance Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes

More information

Linköping University Post Print. Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs

Linköping University Post Print. Analysis of Twiddle Factor Memory Complexity of Radix-2^i Pipelined FFTs Linköping University Post Print Analysis of Twiddle Factor Complexity of Radix-2^i Pipelined FFTs Fahad Qureshi and Oscar Gustafsson N.B.: When citing this work, cite the original article. 200 IEEE. Personal

More information

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code COPY RIGHT 2018IJIEMR.Personal use of this material is permitted. Permission from IJIEMR must be obtained for all other uses, in any current or future media, including reprinting/republishing this material

More information

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI.

Keywords: Fast Fourier Transforms (FFT), Multipath Delay Commutator (MDC), Pipelined Architecture, Radix-2 k, VLSI. ww.semargroup.org www.ijvdcs.org ISSN 2322-0929 Vol.02, Issue.05, August-2014, Pages:0294-0298 Radix-2 k Feed Forward FFT Architectures K.KIRAN KUMAR 1, M.MADHU BABU 2 1 PG Scholar, Dept of VLSI & ES,

More information

ARITHMETIC operations based on residue number systems

ARITHMETIC operations based on residue number systems IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,

More information

Efficient Implementation of Single Error Correction and Double Error Detection Code with Check Bit Precomputation

Efficient Implementation of Single Error Correction and Double Error Detection Code with Check Bit Precomputation http://dx.doi.org/10.5573/jsts.2012.12.4.418 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.12, NO.4, DECEMBER, 2012 Efficient Implementation of Single Error Correction and Double Error Detection

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS. A Thesis Presented to The Graduate Faculty of The University of Akron

ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS. A Thesis Presented to The Graduate Faculty of The University of Akron ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Rama

More information

High Speed Special Function Unit for Graphics Processing Unit

High Speed Special Function Unit for Graphics Processing Unit High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum

More information

II. MOTIVATION AND IMPLEMENTATION

II. MOTIVATION AND IMPLEMENTATION An Efficient Design of Modified Booth Recoder for Fused Add-Multiply operator Dhanalakshmi.G Applied Electronics PSN College of Engineering and Technology Tirunelveli dhanamgovind20@gmail.com Prof.V.Gopi

More information

Cosimulation of ITRON-Based Embedded Software with SystemC

Cosimulation of ITRON-Based Embedded Software with SystemC Cosimulation of ITRON-Based Embedded Software with SystemC Shin-ichiro Chikada, Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada Graduate School of Information Science, Nagoya University Information Technology

More information

An Optimal Resource Binding Algorithm with Inter-Transition. Switching Activities for Low Power

An Optimal Resource Binding Algorithm with Inter-Transition. Switching Activities for Low Power An Optimal Resource Binding Algorithm with Inter-Transition Switching Activities for Low Power Deming Chen and Scott Cromar Abstract Resource binding, a key step encountered in behavioral synthesis, has

More information

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER Bhuvaneswaran.M 1, Elamathi.K 2 Assistant Professor, Muthayammal Engineering college, Rasipuram, Tamil Nadu, India 1 Assistant Professor, Muthayammal

More information

STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS

STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS Sukumar Reddy Anapalli Krishna Chaithanya Chakilam Timothy W. O Neil Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science The

More information

The Serial Commutator FFT

The Serial Commutator FFT The Serial Commutator FFT Mario Garrido Gálvez, Shen-Jui Huang, Sau-Gee Chen and Oscar Gustafsson Journal Article N.B.: When citing this work, cite the original article. 2016 IEEE. Personal use of this

More information

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator M.Chitra Evangelin Christina Associate Professor Department of Electronics and Communication Engineering Francis Xavier

More information

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression Divakara.S.S, Research Scholar, J.S.S. Research Foundation, Mysore Cyril Prasanna Raj P Dean(R&D), MSEC, Bangalore Thejas

More information

Analysis of Conditional Resource Sharing using a Guard-based Control Representation *

Analysis of Conditional Resource Sharing using a Guard-based Control Representation * Analysis of Conditional Resource Sharing using a Guard-based Control Representation * Ivan P. Radivojevi c orrest Brewer Department of Electrical and Computer Engineering University of California, Santa

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

Compilation Approach for Coarse-Grained Reconfigurable Architectures

Compilation Approach for Coarse-Grained Reconfigurable Architectures Application-Specific Microprocessors Compilation Approach for Coarse-Grained Reconfigurable Architectures Jong-eun Lee and Kiyoung Choi Seoul National University Nikil D. Dutt University of California,

More information

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture

A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture A Simple Placement and Routing Algorithm for a Two-Dimensional Computational Origami Architecture Robert S. French April 5, 1989 Abstract Computational origami is a parallel-processing concept in which

More information

Incremental Exploration of the Combined Physical and Behavioral Design Space

Incremental Exploration of the Combined Physical and Behavioral Design Space 13.3 Incremental Exploration of the Combined Physical and Behavioral Design Space Zhenyu (Peter) Gu, Jia Wang, Robert P. Dick, Hai Zhou orthwestern University 2145 Sheridan Road Evanston, IL, USA {zgu646,

More information

High-Level Synthesis of Programmable Hardware Accelerators Considering Potential Varieties

High-Level Synthesis of Programmable Hardware Accelerators Considering Potential Varieties THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE.,, (VDEC) 113 8656 7 3 1 CREST E-mail: hiroaki@cad.t.u-tokyo.ac.jp, fujita@ee.t.u-tokyo.ac.jp SoC Abstract

More information

Cycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components

Cycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components Cycle-accurate RTL Modeling with Multi-Cycled and Pipelined Components Rainer Dömer, Andreas Gerstlauer, Dongwan Shin Technical Report CECS-04-19 July 22, 2004 Center for Embedded Computer Systems University

More information

VHDL for Synthesis. Course Description. Course Duration. Goals

VHDL for Synthesis. Course Description. Course Duration. Goals VHDL for Synthesis Course Description This course provides all necessary theoretical and practical know how to write an efficient synthesizable HDL code through VHDL standard language. The course goes

More information

High-Level Synthesis

High-Level Synthesis High-Level Synthesis 1 High-Level Synthesis 1. Basic definition 2. A typical HLS process 3. Scheduling techniques 4. Allocation and binding techniques 5. Advanced issues High-Level Synthesis 2 Introduction

More information

High-level Variable Selection for Partial-Scan Implementation

High-level Variable Selection for Partial-Scan Implementation High-level Variable Selection for Partial-Scan Implementation FrankF.Hsu JanakH.Patel Center for Reliable & High-Performance Computing University of Illinois, Urbana, IL Abstract In this paper, we propose

More information

Parallel-computing approach for FFT implementation on digital signal processor (DSP)

Parallel-computing approach for FFT implementation on digital signal processor (DSP) Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm

More information

Design of an Efficient 128-Bit Carry Select Adder Using Bec and Variable csla Techniques

Design of an Efficient 128-Bit Carry Select Adder Using Bec and Variable csla Techniques Design of an Efficient 128-Bit Carry Select Adder Using Bec and Variable csla Techniques B.Bharathi 1, C.V.Subhaskar Reddy 2 1 DEPARTMENT OF ECE, S.R.E.C, NANDYAL 2 ASSOCIATE PROFESSOR, S.R.E.C, NANDYAL.

More information

Data Wordlength Optimization for FPGA Synthesis

Data Wordlength Optimization for FPGA Synthesis Data Wordlength Optimization for FPGA Synthesis Nicolas HERVÉ, Daniel MÉNARD and Olivier SENTIEYS IRISA University of Rennes 6, rue de Kerampont 22300 Lannion, France {first-name}.{name}@irisa.fr Abstract

More information

Real-time and smooth scalable video streaming system with bitstream extractor intellectual property implementation

Real-time and smooth scalable video streaming system with bitstream extractor intellectual property implementation LETTER IEICE Electronics Express, Vol.11, No.5, 1 6 Real-time and smooth scalable video streaming system with bitstream extractor intellectual property implementation Liang-Hung Wang 1a), Yi-Mao Hsiao

More information

A High Performance Bus Communication Architecture through Bus Splitting

A High Performance Bus Communication Architecture through Bus Splitting A High Performance Communication Architecture through Splitting Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University,West Lafayette, IN, 797, USA {lur, chengkok}@ecn.purdue.edu

More information

Performance Analysis of CORDIC Architectures Targeted by FPGA Devices

Performance Analysis of CORDIC Architectures Targeted by FPGA Devices International OPEN ACCESS Journal Of Modern Engineering Research (IJMER) Performance Analysis of CORDIC Architectures Targeted by FPGA Devices Guddeti Nagarjuna Reddy 1, R.Jayalakshmi 2, Dr.K.Umapathy

More information