Performance/Cost trade-off evaluation for the DCT implementation on the Dynamically Reconfigurable Processor


Vu Manh Tuan, Yohei Hasegawa, Naohiro Katsura and Hideharu Amano
Graduate School of Science and Technology, Keio University
Hiyoshi, Kohoku-ku, Yokohama, Kanagawa, Japan

Abstract. The Dynamically Reconfigurable Processor (DRP) developed by NEC Electronics is a coarse-grained reconfigurable processor that can change its hardware functionality within a clock cycle. When implementing an application on the DRP, designers must decide how to use resources efficiently to meet particular goals, such as improving performance, reducing power dissipation, or minimizing resource use. To analyze the impact of these trade-offs, the Discrete Cosine Transform (DCT) algorithm has been implemented under various design policies. The evaluation results show that performance, cost and power consumption are all influenced by the implementation method. For example, the execution time is reduced by 17% when distributed memory is used instead of register files, and by up to 4% when the embedded multipliers are used.

1. Introduction

Dynamically reconfigurable devices have the potential to provide high processing performance, flexibility and power efficiency, especially for a wide range of stream and network processing applications. Recently, dynamically reconfigurable processors such as DRP [1], DAPDNA-2 [2], XPP [3] and D-Fabrix [4] have received much attention for their remarkable achievements. Such devices share the following characteristics:

1. A dynamically reconfigurable processor consists of an array of coarse-grained processing elements (PEs), distributed memory modules and finite-state-machine-based sequencers. Execution circuits can be freely configured by programming the instruction set of the PEs and the wiring between PEs. The chip achieves high performance using customized data path configurations composed of arrays of PEs.
2. An application can be implemented using either multi-task or time-division execution. A multi-context mechanism, which stores several sets of configuration data for the same PE array, allows the hardware functionality of the on-chip circuit to be changed, often in one clock cycle.
3. High-level design languages, automatic synthesis techniques and place-and-route tools are often applied to ease the development process.


While developing an application, there is often a trade-off between improving performance and reducing cost. To quantitatively analyze the impact of resource usage on the performance and power dissipation of a dynamically reconfigurable processor, the DCT, a typical task used in JPEG codecs, is implemented on the target device DRP-1 using different design policies.

The rest of this paper is organized as follows. Section 2 describes the DRP architecture, which is the target device of this study. Section 3 presents the evaluation results and analysis. Finally, Section 4 concludes the paper.

2. DRP overview

DRP is a coarse-grained dynamically reconfigurable processor released by NEC Electronics in 2002 [1]. DRP-1 is the prototype chip, fabricated in a 0.18-um 8-metal-layer CMOS process. It consists of an 8-tile DRP Core, eight 32-bit multipliers, an external SRAM controller, a PCI interface, and 256-bit I/Os. The structure of DRP-1 is shown in Fig. 1.

Fig. 1. DRP-1 architecture
Fig. 2. DRP tile architecture

The primitive unit of the DRP Core is called a `Tile', and the number of Tiles can be expanded horizontally and vertically. The primitive modules of a Tile are processing elements (PEs), a State Transition Controller (STC), 2-ported memories (VMEMs: Vertical MEMories), a VMEM Controller (VMCtrl) and 1-ported memories (HMEMs: Horizontal MEMories). The structure of a Tile is shown in Fig. 2. Each PE has an 8-bit ALU, an 8-bit DMU, and an 8-bit x 16-word register file. These units are connected by programmable wires specified by instruction data. Each PE has a 16-entry instruction memory and supports multiple-context operation: the active context can be switched within a clock cycle by an instruction pointer delivered from the STC.

An integrated design environment, called Musketeer, is available for DRP-1. It includes a high-level synthesis tool, a design mapper for DRP, simulators, and a layout viewer tool. Applications can be written in a C-like high-level hardware description language called BDL, synthesized, and mapped directly onto the DRP-1.
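The multi-context operation described in this section can be illustrated with a small software model. The following is only a minimal sketch in Python, not BDL and not the DRP toolchain: the class `ProcessingElement`, the function `run` and the three sample operations are hypothetical names chosen for illustration. The only facts taken from the text are the 16-entry instruction memory, the 8-bit data width, and the idea that the STC delivers an instruction pointer each clock cycle.

```python
from dataclasses import dataclass, field
from typing import Callable, List

CONTEXT_DEPTH = 16  # instruction-memory depth per PE, as stated in the text
WORD_MASK = 0xFF    # PEs operate on 8-bit data

Operation = Callable[[int, int], int]


@dataclass
class ProcessingElement:
    """Hypothetical model of one PE: one ALU operation stored per context."""
    instructions: List[Operation] = field(default_factory=list)

    def load(self, ops: List[Operation]) -> None:
        # Reject configurations that exceed the instruction memory.
        if len(ops) > CONTEXT_DEPTH:
            raise ValueError("a PE holds at most 16 contexts")
        self.instructions = list(ops)

    def execute(self, context: int, a: int, b: int) -> int:
        # Apply the operation selected by the instruction pointer,
        # truncating the result to 8 bits.
        return self.instructions[context](a, b) & WORD_MASK


def run(pe: ProcessingElement, context_sequence: List[int], a: int, b: int) -> List[int]:
    """Emulate the STC: one instruction pointer is delivered per clock cycle."""
    return [pe.execute(ctx, a, b) for ctx in context_sequence]


pe = ProcessingElement()
pe.load([
    lambda a, b: a + b,  # context 0: add
    lambda a, b: a - b,  # context 1: subtract
    lambda a, b: a & b,  # context 2: bitwise AND
])
results = run(pe, [0, 1, 2, 0], a=200, b=100)
print(results)  # [44, 100, 64, 44]
```

The same PE produces a different function on each cycle because only the instruction pointer changes, which is the property the design policies in the following sections trade against critical path, power and PE count.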

3. Trade-off of the design policies

This section presents quantitative evaluation results of different DCT implementations, using the following evaluation metrics.

Performance: The performance of an implementation is expressed by its execution time for a given set of data. The execution time is computed as the product of the delay (critical path) and the number of execution clock cycles.

Power and energy consumption: The power consumption of an application is estimated from the power profile obtained by simulation. The energy consumption, defined as the product of the power consumption and the execution time, is used as a general measure for evaluation. It is the total energy necessary for executing the target application, so a small energy consumption indicates a high degree of computational efficiency.

Required resource: The required resource of each implementation is the total number of PEs used in each context. It shows not only the PE utilization, but also the parallel processing capability of the application.

The following design policies are compared with each other in order to clarify the performance/cost trade-off: memory array vs. register array (Section 3.1), multiplier use vs. no-multiplier use (Section 3.2), and optimum context sizes (Section 3.3).

3.1. Memory array vs. Register array

In BDL, an array variable can be assigned either to registers or to memory modules. The difference is that a memory access incurs a clock cycle of latency, while data read out from a register file can be processed in the same clock cycle.

Table 1. DCT implementation using different types of array (columns: VMEM, HMEM, Register; rows: delay or critical path (ns), execution time (µs), power consumption (mW), energy consumption (µs·W), clock cycles)

Table 1 shows the results of the DCT implementation when the input data block is stored in VMEMs, HMEMs and registers respectively.
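The rows of Table 1 are related by the metric definitions given earlier in this section: execution time is critical path times clock cycles, and energy is power times execution time. The sketch below shows that arithmetic in Python. The class `Implementation`, the helper `reduction_percent` and all numeric values are hypothetical placeholders chosen for illustration; they are not the measurements reported in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    """Hypothetical record holding the measured rows of one table column."""
    name: str
    critical_path_ns: float
    clock_cycles: int
    power_mw: float

    @property
    def execution_time_us(self) -> float:
        # Execution time = critical path x number of clock cycles (ns -> us).
        return self.critical_path_ns * self.clock_cycles / 1000.0

    @property
    def energy_us_w(self) -> float:
        # Energy = power x execution time, in the table's unit us*W (mW -> W).
        return self.power_mw * self.execution_time_us / 1000.0


def reduction_percent(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate with respect to baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


# Placeholder values only, NOT the paper's data.
memory = Implementation("memory", critical_path_ns=20.0, clock_cycles=100, power_mw=500.0)
register = Implementation("register", critical_path_ns=30.0, clock_cycles=80, power_mw=450.0)

for impl in (memory, register):
    print(f"{impl.name}: {impl.execution_time_us:.2f} us, {impl.energy_us_w:.2f} us*W")

time_gain = reduction_percent(register.execution_time_us, memory.execution_time_us)
print(f"execution time reduced by {time_gain:.1f}%")
```

With these placeholder numbers the register variant needs fewer cycles and less power, yet its longer critical path gives it the longer execution time and the larger energy, which is the kind of interaction the comparison in this subsection examines.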
The DCT version using the VMEM has the best result in terms of the critical path, while the register-based version has the worst execution time because reading registers within the same clock cycle lengthens the delay. However, the register-based design achieves the best result in terms of the number of clock cycles, and it also consumes less power: execution at a low clock frequency with a small number of steps can reduce power. In terms of the execution time and the energy consumption, the VMEM-use policy outperforms the register-use policy by about 17% and 3% respectively. Although the power of the register-based design is small, its total energy consumption is larger because of its long execution time.

Fig. 3 illustrates the required resources, where "PEs" denotes the number of required PEs in each context. Fig. 3 shows that, although the number of contexts differs, the required number of PEs in the register-based design is well distributed across the contexts, while the PE utilization in the VMEM and HMEM cases is quite imbalanced. Since the total cost depends on the maximum number of required PEs over all contexts, the register-based design is advantageous from the viewpoint of cost.

Fig. 3. Required resource for Memory and Register-use policy (panels: required resource for Memory, required resource for Register; axes: PEs vs. context)

3.2. Multiplier use vs. no-multiplier use

The DRP supports two types of multiplication. If the multiplier factor is a constant, the multiplication is automatically transformed into shifts and additions by the DRP compiler. On the other hand, since the DRP has eight 32-bit multipliers distributed along the top and the bottom of the chip (Fig. 1), multiplications can also be performed using these embedded multipliers. Using the multipliers has two limitations: their number is limited, and there is a delay of two clock cycles from the input of data until the result is available, although pipelined operation is allowed.

Table 2. DCT implementation using different strategies of multiplication (columns: Memory with multiplier, Memory without multiplier, Register with multiplier, Register without multiplier; rows: delay or critical path (ns), execution time (µs), power consumption (mW), energy consumption (µs·W), clock cycles)

Table 2 shows the results of the DCT implementation with and without the multipliers, for the memory-based design and the register-based design respectively. The results show that, although the multipliers are located far from the PEs and have certain limitations, their use can lead to satisfactory outcomes. Using the multipliers achieves the shortest critical path as well as the highest throughput.
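The transformation of a constant multiplication into shifts and additions, mentioned at the start of this subsection, can be sketched as follows. This is only an illustration in Python using plain binary decomposition of the constant; the text does not describe which decomposition the DRP compiler actually applies, so the functions `shift_add_terms` and `multiply_by_constant` and the sample constant 181 (roughly cos(pi/4) scaled by 256, a hypothetical fixed-point DCT coefficient) are assumptions.

```python
from typing import List


def shift_add_terms(constant: int) -> List[int]:
    """Return the shift amounts whose powers of two sum to the constant."""
    if constant < 0:
        raise ValueError("this sketch handles non-negative constants only")
    return [bit for bit in range(constant.bit_length()) if (constant >> bit) & 1]


def multiply_by_constant(x: int, constant: int) -> int:
    """Multiply x by a constant using only shifts and additions."""
    return sum(x << shift for shift in shift_add_terms(constant))


# 181 = 128 + 32 + 16 + 4 + 1, so x * 181 becomes five shifted copies of x added together.
print(shift_add_terms(181))          # [0, 2, 4, 5, 7]
print(multiply_by_constant(7, 181))  # 1267
```

Each set bit of the constant costs one shifted operand and one addition, so constants with many set bits consume more PEs; this is one way to read the resource difference between the no-multiplier and multiplier designs.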
In terms of power consumption and the number of clock cycles, however, using the multipliers does not outperform the design without them. In particular, the design using the multipliers dissipates almost twice the power of the design without them, even though the power of the multipliers themselves is not counted in this value because of a limitation of the profiler. The large power consumption in this case mainly comes from the high clock frequency. The energy consumption in Table 2 shows that, in general, the no-multiplier policy is more efficient than the multiplier-use policy. In terms of the execution time, the multiplier-use policy with memory outperforms the no-multiplier policy by about 4%. Nonetheless, the no-multiplier design with memory consumes about 53% less power than the multiplier-use design; more importantly, the no-multiplier design is about 1% more efficient in terms of energy consumption.

Fig. 4 presents the resources required in the DCT implementation using the multipliers, for the memory-based design and the register-based design. The resources required when the multipliers are not used are shown in Fig. 3. As expected, the use of the multipliers reduces the required resources dramatically.

Fig. 4. Required resource when using multipliers (panels: required resource for Memory-based array, required resource for Register-based array; axes: PEs vs. context)

In general, the best version of the DCT implementation in terms of both performance and resource usage is the one using the multipliers coupled with the VMEM-based design. In terms of power efficiency, by contrast, the best case is the one in which the multipliers are not used and data are stored in the registers, although it is the worst from the viewpoint of performance and resource usage.

3.3. Optimum context sizes

Fig. 5 presents different parameters of the DCT implementation on the DRP against the context size.
The performance results show that the execution time can be reduced with a large context size because of parallel processing. On the other hand, the critical path tends to increase as the context size becomes large, with some exceptions. Therefore, the performance improvement gained by increasing the context size is limited. In contrast with the performance, the power consumption tends to increase with a larger context size. The reason is that a larger context size means that more PEs are used to form computation circuits, which requires more power. In addition, as the context size becomes larger, additional wires are necessary to connect more PEs together, so the power dissipation tends to increase. Nevertheless, the energy consumption decreases when the context size becomes large, since the execution time is reduced. As a result, it is likely that a larger context size provides a better performance/cost ratio for the DCT.

Fig. 5 makes clear that there exists an optimum context size at which the performance and the power dissipation are well balanced. For the DCT application, when the context size is 6, the execution time, the power dissipation and the energy consumption are not much different from those of the maximum context size. More importantly, the energy consumption shows that the 6-tile case is the best in terms of performance and cost.

Fig. 5. Critical path, Execution time, Power and Energy consumption vs. context size (panels: critical path (ns), execution time (µs), power consumption (mW), energy consumption (µs·W); horizontal axis: context size in number of tiles)

4. Conclusion

This paper presents the performance/cost trade-off encountered when designing applications on a dynamically reconfigurable processor, based on implementations of the DCT algorithm. The results show that the implementation policies for array data allocation and for the use of multipliers influence the performance, cost and power consumption. The context size should also be chosen carefully. Based on this analysis, a tool for rapidly developing a prototype or a model of a target application is required to support the designer's decisions.

References

[1] M. Motomura, "A Dynamically Reconfigurable Processor Architecture", In Microprocessor Forum, Oct. 2002.
[2] IPFlex.
[3] PACT.
[4] Elixent.
[5] M. Suzuki, Y. Hasegawa, Y. Yamada, N. Kaneko, K. Deguchi, H. Amano, K. Anjo, M. Motomura, K. Wakabayashi, T. Toi, and T. Awashima, "Stream Applications on the Dynamically Reconfigurable Processor", In Proceedings of International Conference on Field Programmable Technology (FPT2004), pages , Dec. 2004.
