A HARDWARE COMPLETE DETECTION MECHANISM FOR AN ENERGY EFFICIENT RECONFIGURABLE ACCELERATOR CMA


Akihito Tsusaka, Mai Izawa, Rie Uno, Nobuyuki Ozaki, Hideharu Amano
Keio University, Yokohama, Japan

ABSTRACT

Cool Mega Array (CMA) is an energy-efficient Coarse-Grained Reconfigurable processor Array (CGRA) built around a large PE (Processing Element) array. To cut the power spent on storing intermediate results and on the clock tree, the PE array consists solely of combinatorial circuits. Until now, the completion time of the PE array has been calculated manually from a delay table and the mapping results, and then specified in the micro-code. A hardware completion detection mechanism for CMA is proposed, implemented and evaluated. Each PE uses serially connected buffers with selectable taps, and the delay is chosen according to the operation executed in the PE. Since the completion signal travels along exactly the same paths as the computation data, the delays in the switches and wires are accounted for. The mechanism was implemented in CMA with a 65 nm CMOS process, and post-layout simulation showed that the same performance as without the mechanism can be obtained with only 5.1% area overhead and less than 6% extra power consumption. With the mechanism, a single micro-code can be used for various supply voltages of the PE array. Dynamic changes of the delay caused by temperature and by chip-to-chip variation can also be handled.

1. INTRODUCTION

Recent battery-driven mobile devices require high performance for a certain class of applications as well as energy efficiency. As a solution, Coarse-Grained Reconfigurable processor Arrays (CGRAs) [1, 2, 3] have received attention as energy-efficient accelerators, and some of them have been used in commercial products[4, 5]. CMA (Cool Mega Array)[6] has been developed as a highly energy-efficient CGRA. It provides a large PE (Processing Element) array consisting of combinatorial logic.
Data-flow graphs of target application programs are mapped directly onto the array, and computation is done without storing intermediate results. The energy for storing intermediate results in registers in each PE and for clock distribution through a clock tree is therefore not required. A small microcontroller manages data distribution and collection between the data memory and registers, which are provided only at the input and output of the PE array. The supply voltage of the PE array can be scaled so that the computation delay in the PE array is balanced against the time the microcontroller needs for data management. The first prototype, CMA-1, built in 65 nm CMOS technology, achieved 2.7 GOPS/11.2 mW sustained performance, and Cube-1[7], a multicore system containing a number of CMA chips, is now available.

One of the most difficult problems of the CMA architecture is how to evaluate the computational delay of the PE array. Ozaki proposed a method to compute the largest delay time in the PE array from the result of the application mapping[8]. It uses a table that lists the delay of each PE at various supply voltages; after adding a certain margin, the programmer decides when to store the results from the PE array. However, the method takes into account neither the ambient temperature nor the delay variation of each chip. Safe computation therefore requires a large margin, which degrades both performance and energy efficiency.

To address the problem, a hardware mechanism that detects the completion of execution in the PE array is proposed. It uses a selectable delay line consisting of buffers connected in tandem. The delay is chosen according to the operation executed in the PE. Since the completion signal travels along exactly the same paths as the computation data, the delays in the switches and wires are accounted for.

The rest of the paper is organized as follows: in Section 2, the architecture of CMA is introduced, focusing on the delay estimation method.
A hardware mechanism for detecting the completion of execution is proposed in Section 3. The overhead and efficiency of the mechanism are shown in Section 4. Section 5 concludes the paper with a discussion of future work.

2. CMA ARCHITECTURE AND COMPLETION DETECTION

2.1. The CMA architecture

Like other CGRAs, CMA targets multimedia streaming applications, which have a large degree of parallelism. By executing many PEs in parallel, it achieves the required performance at a low supply voltage. The important difference from other CGRAs is that it adopts an extreme architecture to save as much energy as possible. Unlike other CGRAs, the large PE array of CMA consists of combinatorial circuits without registers or context memory. The energy for storing intermediate results and the power for clock distribution inside the PE array are not required. Dynamic reconfiguration, which requires a large amount of energy, is not adopted. The configuration data for the PE array is supplied from configuration registers provided outside the PE array and is fixed during execution. The data flow graph corresponding to the application is mapped statically onto the PE array.

To retain flexibility, a small microcontroller is provided between the PE array and the data memory. It reads data from the data memory and distributes it to the register attached to the input of the PE array. It also collects the results from the register attached to the output of the PE array and writes them back to the data memory. It flexibly manages the data transfer between the memory and the registers by using mapping registers and vector operations. With this structure, various application programs can be implemented without power-hungry dynamic reconfiguration in the PE array.

Since the computation in the PE array and the data management by the microcontroller are done in a pipelined manner, their execution speeds must be balanced. If the computation delay is shorter than the data management delay, the voltage supplied to the PE array can be reduced. The total power required for computation can thus be reduced without degrading computing performance. On the other hand, if the data management delay is shorter than the computation delay, wave pipelining in the PE array can be used.
The delay time for achieving wave pipelining can also be controlled by changing the voltage supplied to the PE array.

2.2. Prototype chip CMA-1

The first prototype, CMA-1[6], with an 8 × 8 PE array, was fabricated in 65-nm CMOS technology and achieved 2.4-GOPS/11.2-mW sustained performance. Figure 1 shows the block diagram of CMA-1. It consists of the PE array, a microcontroller, data memory (DMEM) and registers.

Fig. 1. Block diagram of CMA-1.

Fig. 2. Pipelined operation of CMA-1 with stages (1) data distribution, (2) computation in the PE array and (3) data collection: (a) all stages are balanced; (b) computation in the PE array is short -> Voltage Scaling; (c) data manipulation by the micro controller is short -> Wave Pipelining.

Here, the computation in the PE array under the control of the microcontroller is described in detail. As shown in Figure 2, the microcontroller consists of a controller, the Fetch register, the Launch register and the Gather register. First, it reads data from DMEM and distributes them to the entries of the Fetch register. Data distribution and collection by the microcontroller were designed to be flexible, allowing an arbitrary mapping between data memory addresses and PE array inputs by means of address mapping registers. Stride vector access operations are also supported. When the input data in the Fetch register is ready, it is transferred to the Launch register and the computation starts in the PE array. After a certain time interval, the result of the PE array is stored into the Gather register, and the results in the entries of the Gather register are written back into DMEM.
(1) Distribution from DMEM to the Fetch register, (2) computation in the PE array and (3) write-back of the results from the Gather register to DMEM are done in a pipelined manner. Supply voltage scaling of the PE array is used to balance the time for stage (2) with the other two stages.

2.3. Microcodes of the CMA-1

The programmable controller of CMA-1 has 16 general purpose registers and uses 14-bit micro operations stored in a small (128-entry) micro-code memory. Table 1 shows an example of micro-code. Only the body of a loop is shown.

Table 1. An example of micro-code

        ...
        LD ADD r0,r8        //1: Load data
        LD ADD r1,r8        //2: to Fetch
        LD ADD r2,r8        //3: register
        LD ADD r3,r8        //4:
        LD ADD r4,r8        //5:
        LD ADD r5,r8        //6:
        SCATTER r9          //7: Scatter
loop:   LD ADD r0,r8        //8:
        LD ADD r1,r8        //9:
        LD ADD r2,r8        //10:
        LD ADD r3,r8        //11:
        LD ADD r4,r8        //12:
        LD ADD r5,r8        //13:
        NOP 3               //14:
        GATHER r11,r12,0    //15: Gather
        SCATTER r9          //16:
        ADDI r13,#-1        //17:
        BNEZ r13,loop       //18:
        ...

LD ADD reads data from the data memory and transfers it to an entry of the Fetch register. In this code, r8 is used as a base register, and a predefined value is added to it each time LD ADD is executed. r0-r5 are used as mapping registers. When all data in the Fetch register are ready (Line 6), SCATTER is executed to transfer the content of the Fetch register to the Launch register. At that point, the computation in the PE array starts. During the computation, the microcontroller fetches the second data set (Lines 8-13) and then waits 3 clock cycles for the computation in the PE array to end (Line 14). The results of the PE array are then stored in the Gather register. The GATHER instruction transfers the results into the data memory according to the base register (r11) and the mask register (r12). The base register is incremented by the number of transferred data. Although the GATHER instruction takes multiple clock cycles, it is carried out automatically by a dedicated controller, so the microcontroller can immediately execute the next micro-code, SCATTER, to start the computation of the next data set. In this example, the loop is iterated until r13 reaches zero. Note that because the PE array computation and the GATHER write-back proceed independently of micro-code execution, the three steps are performed in a pipelined manner, as shown in Figure 2.

2.4. Completion detection

The problem for micro-code designers is that they must estimate the completion time of the PE array and encode it in the micro-code. In this case, from the first SCATTER to GATHER, 6 micro-codes (Lines 8-13) are executed.
When the microcontroller works at 250 MHz, these take 24 ns. If the delay of the PE array is estimated at about 36 ns, the remaining 12 ns, i.e. 3 clock cycles, must be filled with the NOP 3 micro-code. However, the execution time in the PE array depends on the application. For simple applications that use a small number of PEs, the results are ready after a small delay, while the delay becomes large for complicated applications.

The data flow on the PE array is designed with the Black Diamond retargetable compiler[9]. It compiles a program described in a C-like language, performs mapping and routing, and generates the configuration data for the PE array. Ozaki et al. proposed a method to evaluate the total delay time from the mapping result and a delay table obtained by measuring real chips at different supply voltages[8]. Since the delay differs between operations, the table has an entry for each PE instruction. The longest path in the PE array can be computed from the mapping results and the sum of the appropriate delays in the table. Since the largest wire delay is assumed for each operation, the computed total delay includes a certain margin.

3. COMPLETION DETECTION MECHANISM

3.1. Related Work

Since CMA uses a large PE array of combinatorial circuits, its completion detection mechanism resembles that of asynchronous systems. Although a great deal of research has been done on asynchronous FPGA architectures, most of them use micro-synchronization mechanisms to recognize the end of computation[10]. Because these require a large amount of additional hardware, they are difficult to apply to the PE array in CMA. PCA-1/2[11] is a reconfigurable architecture that uses delay lines to send results to the next cells, and Xia et al. propose a hybrid architecture that combines delay and synchronization mechanisms. Although they are based on fine-grained reconfigurable architectures, they show that a delay line is a cost-efficient way to recognize the completion of computation.
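The table-based estimate of Ozaki et al. and the NOP count discussed earlier can be illustrated with a short Python sketch. It is not the authors' tool: the per-operation delays in `DELAY_NS`, the four-PE mapping and the function names are hypothetical and chosen only for illustration. The longest path is taken over the mapped data-flow graph, and the NOP count is whatever part of the array delay is not already hidden behind the overlapped micro-codes.

```python
import math
from graphlib import TopologicalSorter

# Hypothetical per-operation delays (ns) at one supply voltage. As in the
# table-based method, each entry is assumed to already include the
# worst-case wire delay for that operation.
DELAY_NS = {"ADD": 6.0, "MUL": 5.5, "LOGIC": 3.0, "SHIFT": 3.5}

def longest_path_ns(ops, preds, delay_ns=DELAY_NS):
    """Longest accumulated delay through a mapped data-flow graph.

    ops   : {pe_name: operation}
    preds : {pe_name: [PEs whose outputs feed this PE]}
    """
    graph = {pe: list(preds.get(pe, ())) for pe in ops}
    finish = {}
    # static_order() yields every PE after all of its predecessors.
    for pe in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[pe]), default=0.0)
        finish[pe] = start + delay_ns[ops[pe]]
    return max(finish.values())

def nop_cycles(array_delay_ns, clock_mhz, overlapped_microcodes):
    """Cycles of NOP needed between the last overlapped micro-code and GATHER."""
    cycle_ns = 1000.0 / clock_mhz
    remaining_ns = array_delay_ns - overlapped_microcodes * cycle_ns
    return max(0, math.ceil(remaining_ns / cycle_ns))

# Hypothetical mapping: two multipliers feed an adder, followed by a shifter.
ops = {"pe0": "MUL", "pe1": "MUL", "pe2": "ADD", "pe3": "SHIFT"}
preds = {"pe2": ["pe0", "pe1"], "pe3": ["pe2"]}
estimated_ns = longest_path_ns(ops, preds)

# The numbers from the text: 36 ns array delay, 250 MHz, 6 overlapped micro-codes.
nops = nop_cycles(36.0, 250.0, 6)
```

With the figures quoted in the text (36 ns, 250 MHz, six overlapped micro-codes) `nop_cycles` returns 3, matching the `NOP 3` in Table 1. Because the result depends on the delay table, a different supply voltage gives a different NOP count, which is the reason the micro-code has to be rewritten when the voltage is scaled.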
Techniques for controlling delay mechanisms have been well studied and are available[12]. For coarse-grained architectures, a dataflow-driven execution control mechanism has been proposed[13], but it targets a general PE array with a clock.

3.2. The concept of the proposed method

The current delay estimation method has the following problems: (1) When the supply voltage is scaled, the micro-code must be changed. Different codes must be provided when the voltage is scaled dynamically. (2) The temperature and the delay variation of each chip are not taken into account. The delay variation will grow when low supply voltages are used in future processes. Safe operation therefore requires a large margin. To address these problems, we propose a hardware completion detection mechanism with tandem-connected buffers. Figure 3 shows the concept of the mechanism. When


the SCATTER instruction is executed, the completion signals attached to all input data are asserted at the input of the PE array. They propagate along exactly the same routes as the input data. In each PE, the completion signal is delayed by serially connected buffers whose delay is set according to the operation executed in the PE. When two completion signals meet in a PE, the earlier asserted input must wait for the later one. When the completion signals attached to all output data are asserted, the results are stored into the Gather register.

Fig. 3. Hardware Completion Mechanism (data signals and completion signals passing through the PEs of the PE array).

Fig. 4. The layout of the PE array.

The key implementation issue is that the completion signal must flow in the same way as the data. For this purpose, we implemented it as an extra bit of the data bus. It therefore has the same fan-out and almost the same routing path as the other data bits. As shown in Figure 4, all PEs in CMA are aligned in a regular two-dimensional structure, so the completion signals can be routed in the same manner as the corresponding data wires and their delay is almost the same.

3.3. Delay line in the PE

As shown in Figure 5, the completion signals from both inputs are forwarded through an AND gate to the delay line, which is implemented with buffers connected in tandem. When constant data is used or the PE executes a single-operand instruction, the corresponding completion signal input is set to H beforehand. The delay line has several taps, and the signal with the appropriate delay is selected by the output multiplexer. Since the operation of a PE is defined by the configuration data, the multiplexer setting is also derived from the configuration data of each PE.

3.4. Implementation of the mechanism

Design environment

The proposed completion detection mechanism is implemented in the CMA architecture shown in Table 2. The design tools are shown in the same table.
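The propagation rule described above (a PE waits for its later input, then adds the tap delay selected for its operation) can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the authors' RTL: the tap counts in `TAPS`, the optional uniform `wire_ps` term, the four-PE mapping and the function names are hypothetical, and the 69 ps value is the maximum delay the paper quotes for its delay-line buffer cell.

```python
from graphlib import TopologicalSorter

BUFFER_DELAY_PS = 69  # maximum delay of one delay-line buffer, as quoted in the paper

# Hypothetical number of buffers in front of each tap (illustrative only).
TAPS = {"ADD": 20, "MUL": 18, "LOGIC": 6, "SHIFT": 8}

def completion_times_ps(ops, inputs, taps=TAPS, wire_ps=0):
    """Time (ps after SCATTER) at which each PE asserts its completion output.

    ops    : {pe_name: operation}
    inputs : {pe_name: [PEs driving this PE]}; inputs fed by the Launch
             register or by constants are omitted, since their completion
             bits are already high at time 0.
    """
    graph = {pe: list(inputs.get(pe, ())) for pe in ops}
    done = {}
    for pe in TopologicalSorter(graph).static_order():
        # AND gate: the output rises only when the latest input has risen.
        and_out = max((done[src] + wire_ps for src in graph[pe]), default=0)
        # Delay line: the tap selected by the PE's configured operation.
        done[pe] = and_out + taps[ops[pe]] * BUFFER_DELAY_PS
    return done

def gather_ready_ps(done, output_pes):
    """GATHER may latch results once every output completion bit is high."""
    return max(done[pe] for pe in output_pes)

# Hypothetical mapping: two multipliers feed an adder, followed by a shifter.
ops = {"a": "MUL", "b": "MUL", "c": "ADD", "d": "SHIFT"}
inputs = {"c": ["a", "b"], "d": ["c"]}
done = completion_times_ps(ops, inputs)
ready = gather_ready_ps(done, ["d"])
```

In this model the time at which GATHER may proceed is simply the longest accumulated delay over all paths to the outputs. In hardware the same quantity emerges from the signal itself, so it follows supply voltage, temperature and chip variation without any change to the micro-code.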
The target CMA is almost the same as CMA-1 except that it provides dedicated links for transferring constant data. This improvement has been shown to reduce the loading time of the configuration data[14].

Delay in the PE

The positions of the taps were decided from the post-layout simulation results of the PE. According to the analysis, four taps are provided, one for each group of operations: ADD/SUB, MUL, LOGICs and SHIFTs. The buffer SC23BUFXA1, whose maximum delay time is 69 ps, is used to build the delay line. Table 3 shows the number of buffers for each tap. In our PE, the delay of the ADD/SUB operation is slightly larger than that of the MUL operation. This is because a high-speed multiplier is adopted for multiplication while a ripple-carry adder is used for ADD/SUB.

Modification of micro-code

The SCATTER/GATHER instructions are modified as follows:

SCATTER: When the SCATTER instruction executes, all completion signals attached to available inputs are asserted. When all propagated completion signals are asserted at the output and the GATHER instruction is executed, the input completion signals are negated. Until then, the next SCATTER instruction is suspended.

GATHER: When all completion signals are ready at the output of the PE array, and the GATHER instruction is

executed, the results are stored into the Gather register. If the GATHER instruction is issued before the completion signals are ready, the execution of the microcontroller is stalled.

With this modification, designers do not have to worry about when to issue the GATHER instruction. The results are stored in the Gather register and then automatically written back to the data memory. Although the hardware completion detection mechanism is inherently well suited to wave pipelining, the current implementation makes wave pipelining impossible. The restriction was introduced purely for safety; relaxing the condition on multiple SCATTER instructions to enable wave pipelining is our future work.

Fig. 5. Delay Line in a PE (signals: DATA_A, DATA_B, IN_A, IN_B, DELAY_A, DELAY_B, DELAY OUT, ALU_CONF; blocks: ALU, DL buffer).

Table 2. Specifications of the target CMA

Technology:        Fujitsu e-shuttle 65-nm 12-metal CMOS
Cell Library:      CS202SZ low-power standard cell library
Supply Voltage:    V for PE array (1.2V for evaluation)
PE:                24-bit ALU,64-bit
Network:           2-lane island-style, 2 direct links
Micro controller:  14-bit micro-codes, 16 instructions, 128 entries, 8 GPRs, 8-address register, 4-base register
Clock frequency:   210 MHz
Synthesis:         Design Compiler SP5
Layout:            IC Compiler 2009
Analysis:          Primetime SP3

4. EVALUATION

4.1. The delay time

We implemented four simple image filter programs on CMA with the completion detection mechanism: an alpha blender, a sepia filter, a gray scale filter and an edge detection filter. All programs worked correctly in post-layout simulation including wiring and parasitic capacitance delay. Figure 6 shows the maximum delay actually measured in the PE during the simulation. The graph shows that the delay produced by the tandem-connected buffers is appropriate, with a margin of about 10%. The execution time is exactly the same as without the completion detection mechanism, since the delay in the hand-written micro-code is well tuned. However, the same code can now be used at any supply voltage or temperature.

Table 3.
The setting of the delay line (columns: Operation, Num of buffers, Delay time (ps); rows: ADD/SUB, MULT, LOGICs, SHIFT).

4.2. The Overhead

The proposed mechanism requires additional hardware, which introduces area and power overhead. The area increases by 5.1% relative to the design without the mechanism. Since the additional hardware is simple, consisting only of buffers, multiplexers and AND gates, the area increase is small. Figure 7 shows the average power consumption in a PE when the applications are executed. The power consumed by the completion detection mechanism is only about 6%. In this implementation, only one bit per data word changes during a computation, which is why the power consumption of the mechanism is not large.

5. CONCLUSION

A hardware completion detection mechanism for CMA is proposed, implemented and evaluated. Each PE uses serially connected buffers with selectable taps, and the delay is chosen according to the operation executed in the PE. Since the completion signal travels along exactly the same paths as the computation data, the delays in the switches and


wires are accounted for. The mechanism was implemented in CMA with a 65 nm CMOS process, and post-layout simulation showed that the same performance as without the mechanism can be obtained with only 5.1% area overhead and less than 6% extra power consumption. With the mechanism, a single micro-code can be used for various supply voltages of the PE array. Dynamic delay variation caused by temperature changes or chip-to-chip variation is also handled. The proposed mechanism is inherently well suited to wave pipelining. At present, wave pipelining is implemented on CMA-1 by careful manual tuning. Executing the wave pipeline with the proposed hardware completion mechanism is our future work.

Fig. 6. The delay in the PE.

Fig. 7. The power consumed in a PE.

Acknowledgments

Part of this research was carried out under Core Research for Evolutional Science and Technology (CREST) of the Japan Science and Technology Agency (JST). This work is also supported by VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Cadence Design Systems, Inc.

REFERENCES

[1] F.J. Veredas, M. Scheppler, W. Moffat, B. Mei, Custom Implementation of the Coarse-Grained Reconfigurable ADRES Architecture for Multimedia Purposes, in Proc. of International Conference on Field Programmable Logic and Applications (FPL05), 2005.
[2] C. Ebeling, D.C. Cronquist and P. Franklin, RaPiD - Reconfigurable Pipelined Datapath, in Proc. of the FPL 2004.
[3] H. Amano, Y. Hasegawa, S. Tsutsumi, T. Nakamura, T. Nisimura, V. Tunbunheng, A. Parimala, T. Sano and M. Kato, MuCCRA Chips: Configurable Dynamically-Reconfigurable Processors, in Proc. of ASSCC, Nov. 2007.
[4] M. Motomura, STP Engine, a C-based Programmable HW Core featuring Massively Parallel and Reconfigurable PE Array: its Architecture, Tool, and System Implications, in Proc. of CoolChips XII.
[5] H-S. Kim, M. Ann, J.A. Sratton, W. Mei, W. Hwu, ULP-SRP: Ultra Low Power Samsung Reconfigurable Processor for Biomedical Applications, in Proc. of ICFPT 2012, 2012.
[6] N. Ozaki, Y. Yasuda, Y. Saito, D. Ikebuchi, M. Kimura, H. Amano, H. Nakamura, K. Usami, M. Namiki, M. Kondo, Cool Mega-Arrays: Ultralow-Power Reconfigurable Accelerator Chips, IEEE Micro, Vol. 31, pp. 6-18.
[7] Y. Koizumi, et al., CMA-Cube: a scalable reconfigurable accelerator with 3-D wireless inductive coupling interconnect, in Proc. of the FPL 2012, Aug.
[8] N. Ozaki, et al., Cool Mega-Arrays: A highly energy efficient accelerator, Proc. of ICFPT 2011.
[9] V. Tunbunheng and H. Amano, Black-Diamond: a Retargetable Compiler Using Graph with Configuration Bits for Dynamically Reconfigurable Architectures, in Proc. of The 14th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI), 2007.
[10] J. Teifel, R. Manohar, An Asynchronous Dataflow FPGA Architecture, IEEE Trans. on Computers, vol. 53, no. 11, November.
[11] R. Konishi, H. Ito, H. Nakada, A. Nagoya, K. Oguri, N. Imlig, T. Shiozawa, M. Inamori, K. Nagami, PCA-1: A Fully Asynchronous Self-Reconfigurable LSI, Proc. of Int'l Symp. Asynchronous Circuits and Systems.
[12] M. Onouchi, A low-power wide-range clock synchronizer with predictive-delay-adjustment scheme for continuous voltage scaling in DVFS, IEEE Journal of Solid-State Circuits, vol. 45, no. 380, November.
[13] R. Panda, C. Ebeling, S. Hauck, Adding Dataflow-Driven Execution Control to a Coarse-Grained Reconfigurable Array, Proc. of FPL.
[14] R. Uno, N. Ozaki, H. Amano, A Research of PE Array Connection Network for Cool Mega-Array, in Proc. of Int. Workshop on Renewable Computing Systems, March.
