A Novel Profile-Driven Technique for Simultaneous Power and Code-size Optimization of Microcoded IPs



Bita Gorjiara, Daniel Gajski
Center for Embedded Computer Systems, University of California, Irvine
{bgorjiar,

Abstract

Microcoded customized IPs have significantly better performance, yet larger code size, than similarly sized instruction-based processors. Storing wide microcodes on-chip requires wide memory blocks that occupy a large area and consume high leakage power. Therefore, addressing the code size of microcoded IPs is very important. In this paper, we introduce compression techniques that, along with careful resolution of don't care values (denoted by 'X') in the microcode, can address the code size issue. We observed that X values can be used to improve either the dynamic power of IPs or their compression. However, achieving both at the same time is challenging. In this paper, we propose a profile-guided X-resolution technique that achieves both power and compression efficiency. Using our technique, the code size of microcoded IPs is reduced by 2.7 times, while saving 20% dynamic power, on average.

1. Introduction

Shrinking time-to-market and high demand for productivity have driven traditional hardware designers to use design methodologies that start from high-level languages. However, meeting the timing and physical constraints of automatically generated IPs is often challenging and time-consuming. Moreover, slight changes in the high-level specification require redoing the RTL and physical synthesis phases, because the behavioral synthesis tool may generate a new datapath. To avoid repeating these phases, high-level synthesis tools should be extended to recompile the application without modifying the previously generated datapath.
A new breed of design methodologies based on traditional statically scheduled Horizontal Microcoded Architectures (HMA) [1] has emerged. These methodologies not only address the recompilation issue but also support more complex architectures than traditional synthesis techniques. PICO [2], TIPI [3], and NISC [4] [5] [6] are examples of such methodologies. In statically scheduled HMAs, the compiler compiles the program directly to microcodes or nanocodes without using an instruction abstraction. A nanocode is a low-level microcode that directly controls the units of the datapath for one cycle. HMAs can potentially have better performance, lower power, and lower area than their conventional processor counterparts, because the costly decoder and scheduler hardware is replaced with off-line compiler algorithms. As a result, highly parallel customized architectures can be designed as HMAs without any concern about the complexity of instructions, controller, and scheduler.

Microcodes may contain don't care values (denoted by 'X') that can be mapped to 0 or 1 in the final executable binary. In each microcode, the X values correspond to the control signals of the units that are idle in that cycle. To save power, one approach is to use a combination of signal gating and power gating to disable the idle units. This approach requires placing gated registers at the inputs of all units. The additional registers can increase clock power, especially on platforms that do not support clock gating (such as FPGAs). They can also increase the number of cycles and hence the overall energy consumption of the IP. An alternative or complementary approach for reducing the dynamic power of the datapath is to carefully resolve the X values so that the overall switching activity of the units is minimized. This requires replacing the X values with the corresponding non-X values in the preceding microcodes. We refer to this technique as power-aware X-resolution or PX.
Note that, in this approach, seemingly equivalent microcodes can be resolved to different binaries depending on the preceding microcodes in the program. Such power optimization is possible because: (1) there is a one-to-one relationship between microcode bits and control signals of the units; and (2) microcodes can be customized for a given application at no hardware cost. In contrast, conventional processors rely on fixed hardware decoders to convert instructions to microcodes. Such decoders are more complex to customize.

Unfortunately, microcoded IPs have an important limitation: their code size is very large. In this paper, we show that the code size of microcoded IPs can be several times larger than that of traditional instruction-set-based processors. Storing wide microcodes on-chip requires wide memory blocks that occupy a large area and consume more leakage power. Also, loading the microcodes from an off-chip memory increases the I/O power. Therefore, addressing the code size of microcoded IPs is very important.

In this paper, we introduce compression techniques that, along with careful resolution of X values, can address the code size issue of microcoded IPs. We first show that power-aware X-resolution (PX) reduces the dynamic power of the IPs by 26%, on average. We also show that using a dictionary-based compression technique along with compression-aware X-resolution (CX) can reduce the code size of microcoded IPs by 3 times, on average. Next, we show that combining the power and code-size optimizations is challenging: if the compression-aware X-resolution is replaced with the power-aware one (PX), the compression efficiency drops significantly. According to our experiments, after combining the two approaches, the code size may even exceed the original uncompressed size. To address this issue, we propose a new profile-guided X-resolution technique that achieves both power and compression efficiency. Using our technique, the dynamic power is reduced by 20% while the code size is reduced by 2.7 times.

The rest of the paper is organized as follows: Section 2 presents an overview of our design approach. Section 3 explains the power-aware X-resolution. Section 4 compares the code size of a microcoded IP with that of a conventional processor. Then, it introduces a compression technique that, along with compression-aware X-resolution, can address the code size issue. In Section 5, we show that replacing the compression-aware X-resolution with the power-aware one makes the compression ineffective. Section 6 introduces our approach for resolving X values that maintains both power and compression efficiency.

/07/$ IEEE

2. Overview of Our Design Flow (NISC)

Our design flow is based on the No-Instruction-Set Computer (NISC) [4] [5] [6]. In our flow, a custom datapath is generated or selected for a given application, and then the program is compiled on that datapath. Figure 1 shows an example of our custom IPs. The IP consists of a datapath and a controller. Since it is general enough to run many applications, we refer to it as GNISC.
The datapath contains functional units, a register file, registers, multiplexers, and memory. Our approach relies on a sophisticated compiler [5] to compile a program described in a high-level language to a binary that directly drives the control signals of the components in the datapath. The values of the control signals generated for each cycle are called a nanocode or a Control Word (CW). The CWs are stored in a control memory (CMem). Our toolset is available online at [4].

At each cycle, some of the control signals in the datapath may have a don't care value. This means that either 0 or 1 can be assigned to those control signals without affecting the correctness of the program. The compiler generates don't care values for the components that are not used. Note that not all control signals can be assigned X. For example, the register-file write enable cannot be X, because if the X is resolved to 1, incorrect data is written to the register file (RF). In other words, the control signals that can affect the contents of the registers and register files in the IP cannot be assigned X. Control signals such as register-file read and write addresses, multiplexer selection signals, and the ALU operation signal can be X when the units are not used.

Figure 1- Block diagram of GNISC architecture.

Figure 2- A subset of GNISC datapath

3. Power-aware X resolution (PX)

The dynamic power of CMOS circuits is proportional to the average switching activity of the gates. The X values in a control word can be resolved in such a way that the switching activity of the datapath decreases. To explain this concept, we use the following example. Figure 2 shows a subset of the GNISC architecture of Figure 1. In GNISC, the ALU is connected to other components in the datapath through two multiplexers (i.e. MUX1 and MUX2). The multiplexers have two-bit control signals to address all their inputs. The value of these control signals depends on the preceding operations that produce the ALU's inputs.
Suppose that in a given cycle, the ALU must add the output of the multiplier MUL to a constant value (i.e. CONST) that is also stored in the control word. In that case, the control signals of MUX1 and MUX2 should be set to 00 and 11, respectively, to propagate the correct data to the ALU. Now, suppose that in the next clock cycle the ALU is idle, and therefore the compiler generates X values for the control signals of the ALU and multiplexers as well as for CONST. To reduce dynamic power, the inputs of the ALU must remain unchanged as much as possible during the idle cycle. This means that the constant value (i.e. CONST), the MUL output, and the control signals of the multiplexers should remain unchanged. To do so, the X values generated for the multiplexers and CONST are replaced by their values in the preceding cycle.
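The hold-the-previous-value idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: control words are modeled as strings over '0', '1', and 'X', the function names `px_resolve` and `toggles` are hypothetical, the four 4-bit words are made up (the first one uses the MUX1 = 00, MUX2 = 11 setting of the example), and the choice to resolve X bits in the very first word to '0' is an assumption that the text does not specify.

```python
def px_resolve(control_words, initial="0"):
    """Power-aware X resolution: each 'X' takes the value that the same bit
    position had in the preceding (already resolved) control word."""
    resolved = []
    prev = None
    for cw in control_words:
        bits = []
        for i, b in enumerate(cw):
            if b == "X":
                # Hold the previous value; the first word has no predecessor,
                # so it falls back to `initial` (an assumption).
                b = prev[i] if prev is not None else initial
            bits.append(b)
        prev = "".join(bits)
        resolved.append(prev)
    return resolved


def toggles(words):
    """Count bit flips between consecutive control words."""
    return sum(a != b
               for w0, w1 in zip(words, words[1:])
               for a, b in zip(w0, w1))


# Hypothetical control words: MUX1 select (2 bits) followed by MUX2 select (2 bits).
cws = ["0011", "XXXX", "01X1", "XX10"]
px = px_resolve(cws)                          # ['0011', '0011', '0111', '0110']
zero = [w.replace("X", "0") for w in cws]     # naive resolution: X -> 0

print(toggles(px), toggles(zero))             # 2 7
```

In this small example the idle cycle (the all-X word) causes no control-signal toggling at all under PX, whereas resolving every X to 0 toggles the multiplexer selects in almost every cycle.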

Figure 3 and Figure 4 show how power-aware X-resolution is implemented. Figure 3 shows a set of control words corresponding to a basic block of a program. To reduce dynamic power, the X values are replaced with the non-X values from the preceding control word (the final result is shown in Figure 4).

In [7] and [8], a similar activity-reduction approach is proposed for reducing the dynamic power of datapaths generated by high-level synthesis. They first extract the don't care information from the controller's FSM by constructing dependency graphs, and then add extra logic to the controller to decrease the switching activity. In our approach, the compiler directly generates the don't cares, and there is no need to construct additional data structures during controller generation.

Figure 3- Example CWs generated by compiler

Figure 4- X values are resolved for power optimization

3.1 Experiments: Power efficiency of PX

In this section, we present the power savings achieved by applying PX to the GNISC architecture. The experiments are performed on a set of benchmarks including adpcm_decoder, crc32, dijkstra, and sha, from MiBench (the free version of EEMBC embedded benchmarks). All the designs are synthesized on a Xilinx Virtex4 (90nm) FPGA. For power simulation, the signal activities are collected by post-placement-and-routing simulation using the ModelSim simulator. The activities are then fed into the Xilinx gate-level power simulator called XPower. Table 1 shows the power consumption of the benchmarks with and without PX optimization. On average, the power consumption is reduced from 30.9 mW to 23 mW. This shows that PX reduces the dynamic power of GNISC by 26% on average. In the next section, we show that the X values may be resolved in a different way to help with code size reduction.
Table 1- Power consumption (mW) with and without PX
Columns: Benchmark | Power without PX (mW) | Power with PX (mW) | Power savings (%)
Rows: adpcm_decoder, CRC, dijkstra, sha, average

4. Reducing code-size of microcoded IPs

Microcoded and nanocoded IPs have larger code size than instruction-based processors. Table 2 compares the code size of GNISC (shown in Figure 1) with that of the Xilinx soft-core RISC processor called MicroBlaze. The code size of MicroBlaze is the size of the instruction section (.text) of the ELF file generated by the compiler. On average, the code size of GNISC is 3.6 times larger than that of MicroBlaze, while its performance is several times better [14]. The goal of our optimization is to simultaneously improve the code size and power consumption of GNISC while maintaining the performance benefits.

Table 2- Comparing code size (KB) of GNISC with MicroBlaze
Columns: Benchmark | MicroBlaze code size (KB) | GNISC code size (KB) | Code size ratio
Rows: adpcm_decoder, CRC, dijkstra, sha, average

4.1 Overview of code-size reduction techniques

In general-purpose processors, the instruction-set abstraction is used to reduce the code size. In RISC processors, designers define 32-bit or 16-bit [13] instructions to encode wide control words. At runtime, the instructions are decoded back to the control words (nanocode) using a hardware decoder. In most processors, one or more pipeline stages are added to the datapath for instruction decoding, which affects the performance of the processor. Moreover, designing an instruction set is a very complex and time-consuming task for a typical IP designer, because the compiler, assembler, linker, and instruction decoder must be redesigned to handle custom instructions. Therefore, in our approach, we eliminate the need for an instruction set and directly compress the control words. This not only simplifies the design, but also enables us to use the don't care values of the control words for improving the compression ratio or the power consumption.
In the general-purpose domain, dictionary-based compression is used for reducing the code size [9], [10], [11], [12]. In CCRP [9], the unique instructions of the program are stored in a dictionary, where the location of each instruction is determined by Huffman coding: the most frequent instructions in the program are placed at low addresses of the dictionary and are coded with fewer bits. Due to Huffman coding, the compressed instructions have variable sizes. Although decompressing variable-size code is more complex, CCRP manages to hide the decompression latency using a cache. IBM CodePack [10], [11], [12] is another compression technique that has the same memory structure as CCRP. In CodePack, each instruction is partitioned into two halves, and two dictionaries are used to store the unique patterns of each half. Nevertheless, none of these approaches considers binary optimization for dynamic power reduction.

4.2 Our compression approach

In this paper, we present our approach assuming that no cache is used in the design and that the codewords have a fixed size (i.e. no Huffman coding). These assumptions simplify our discussion. Caching and Huffman coding are orthogonal optimizations and can be implemented later. In our compression approach, we resolve the X values in the nanocode so that both power consumption and code size are reduced. Our tool constructs a dictionary of unique control words and, in the executable binary, replaces each control word with the address of its dictionary line.

Figure 5 shows a one-dictionary (D1) code compression approach. The memory structure consists of a code lookup table (CodeLUT) and a dictionary. The Program Counter (PC) contains the address into the CodeLUT and is used to read the next codeword. The codeword is then used to read the corresponding control word from the dictionary.

The following example shows how dictionary-based compression can reduce the total size of the CMem in the controller. Suppose that Figure 6 shows the CWs of a sample program. Each CW has 16 bits and the program has nine CWs. Therefore, the code size of the program is 144 bits (16 × 9). Figure 7 shows the compressed implementation of Figure 6, where the dictionary contains five unique CWs and the CodeLUT contains the corresponding addresses of the CWs. To address the dictionary, three bits are needed; thus, the codewords are three bits wide. After compression, the total binary size is reduced to 107 bits (i.e. 5 × 16 + 9 × 3).

Figure 5- One-dictionary code compression (D1)

Figure 6- CWs of a sample program

Figure 7- Single-dictionary compression on CWs of Figure 6

Since CWs can be very wide with many unique patterns, the dictionary may have many entries and the compression efficiency may be low. To increase the chances of finding matching patterns, we can partition the CWs into smaller slices and construct multiple dictionaries.
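The single-dictionary size arithmetic above can be reproduced with a short Python sketch. The nine 16-bit control words below are hypothetical (the contents of Figure 6 are not available here); they are chosen only so that five of them are unique, matching the counts in the example. The function name `d1_compress` is likewise an assumption for illustration.

```python
from math import ceil, log2


def d1_compress(control_words):
    """One-dictionary (D1) compression: returns (dictionary, code_lut, total_bits)."""
    width = len(control_words[0])
    # Unique control words, in order of first appearance.
    dictionary = list(dict.fromkeys(control_words))
    # Fixed-size codewords: just wide enough to address every dictionary line.
    codeword_bits = max(1, ceil(log2(len(dictionary))))
    code_lut = [dictionary.index(cw) for cw in control_words]
    total_bits = len(dictionary) * width + len(code_lut) * codeword_bits
    return dictionary, code_lut, total_bits


# Hypothetical program: nine 16-bit control words, five of them unique.
values = [0x1A2B, 0x00FF, 0x1A2B, 0x8001, 0x7F00, 0x00FF, 0x1234, 0x1A2B, 0x8001]
cws = [format(v, "016b") for v in values]

dictionary, code_lut, total_bits = d1_compress(cws)
original_bits = len(cws) * len(cws[0])

print(original_bits, total_bits)   # 144 107
print(code_lut)                    # [0, 1, 0, 2, 3, 1, 4, 0, 2]
```

The dictionary contributes 5 × 16 = 80 bits and the CodeLUT 9 × 3 = 27 bits, which gives the 107 bits quoted in the text.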
Usually the total size of the partitioned dictionaries is much smaller than that of a single big dictionary. However, for each dictionary, a code field must be added to the codewords. As the number of dictionaries increases, the number of these fields in the codeword increases and eventually cancels out the gain of partitioning. Figure 8 shows the two-dictionary (D2) and three-dictionary (D3) code compression approaches. In addition to the number of dictionaries, the way X values are resolved in the binary may affect the efficiency of compression. This concept is discussed in the next section.

Figure 8- (a) Two-dic (D2) and (b) Three-dic compression (D3)

4.3 Compression-aware X resolution (CX)

The following example shows that careful resolution of X values can lead to better compression. For the control words of Figure 3, if the don't cares are simply replaced by 0, then the dictionary (shown in Figure 9) will have four entries, because only the second and the last vectors match. However, if the X values are resolved carefully, then seemingly different control words match each other as well. In Figure 10, the X values are resolved so that the first, third, and fourth vectors of Figure 3 are mapped to the first entry of Figure 10, and the other two vectors are mapped to the second entry of Figure 10.

Figure 9- Dictionary content for CWs of Figure 3 (X values are replaced by 0)

Figure 10- Dictionary content for CWs of Figure 3 (using compression-aware X resolution)

To solve this problem in the general case, the X values in the CWs must be resolved so that the total number of unique patterns is minimized. This problem can be converted to a graph coloring problem: for a given list of bit-vectors, a graph G(V, E) is constructed. The vertices in V are the bit-vectors, and the edges in E represent the conflicts between the vectors. Two bit-vectors do not conflict if they can be merged into a single bit-vector, i.e. if no bit position holds 0 in one vector and 1 in the other.
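The conflict relation that defines the edges of G can be illustrated with the following Python sketch. It assumes bit-vectors are strings over '0', '1', and 'X'; the helper names `conflict`, `merge`, and `conflict_graph` are hypothetical and are not taken from the authors' tool.

```python
from itertools import combinations


def conflict(a, b):
    """True if some bit position holds 0 in one vector and 1 in the other."""
    return any(x != y and "X" not in (x, y) for x, y in zip(a, b))


def merge(a, b):
    """Merge two non-conflicting vectors; a specified bit wins over 'X'."""
    assert not conflict(a, b)
    return "".join(y if x == "X" else x for x, y in zip(a, b))


def conflict_graph(vectors):
    """Edge set of G(V, E): index pairs of vectors that cannot be merged."""
    return {(i, j)
            for (i, a), (j, b) in combinations(enumerate(vectors), 2)
            if conflict(a, b)}


print(conflict("10XX", "1X0X"), merge("10XX", "1X0X"))   # False 100X
print(conflict_graph(["0X", "XX", "1X"]))                # {(0, 2)}
```

The last line shows why a graph formulation is needed rather than simple pairwise matching: "XX" is compatible with both "0X" and "1X", yet those two conflict with each other, so compatibility is not transitive and all three cannot share one dictionary entry.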
The graph coloring algorithm partitions the vertices (or vectors) into subsets so that there is no edge (i.e. conflict) between any two vertices in the same subset, while minimizing the total number of subsets (i.e. colors). Solving the graph coloring problem optimally is NP-hard, but there are many well-known heuristics that generate efficient results in polynomial time. After coloring the graph, a new vector is generated for each color, and all the vectors of that color are merged into it. The new vectors are used to fill the dictionary. The details of this algorithm are available in [14].

4.4 Experiments: Compression efficiency

Table 3 shows the binary size of the MiBench benchmarks running on GNISC with different code compressions. The second column (D0) shows the baseline code size without any compression. The remaining columns (D1CX-D4CX) show the code size with different numbers of dictionaries (Section 4.2). These compression techniques use the CX approach for X-resolution. As the number of dictionaries increases (columns 3, 4, 5, and 6), the code size (i.e. the total size of the dictionaries and the CodeLUT) of each benchmark decreases up to a certain point (the highlighted values) and then increases again. These are the points where the increase in CodeLUT size cancels out the benefit of having more dictionaries. The optimum number of dictionaries may vary for different applications.

Table 3- Code size (KB) of benchmarks with CX
Columns: Benchmark | D0 | D1CX | D2CX | D3CX | D4CX
Rows: adpcm_decoder, CRC, dijkstra, sha, average, CR

The last row in the table shows the Compression Ratio (CR), a metric commonly used to evaluate a compression algorithm. CR is the ratio between the compressed size and the original size, so smaller CR values indicate better compression. On average, for all these benchmarks, the three-dictionary compression (i.e. D3CX) outperforms the others. In D3CX, the total code size is one-third of the code size of D0. In other words, it compresses the code by 3 times.
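As an illustration of the coloring-and-merge step of Section 4.3, the following Python sketch uses a simple first-fit greedy heuristic. It is not the algorithm of [14], whose details are not reproduced here; the function names, the choice to fill leftover X bits with '0', and the five 4-bit vectors are all assumptions. The vectors are not the contents of Figure 3; they were picked only to mirror the counts described for it (four entries with zero-fill, two entries with CX).

```python
def conflict(a, b):
    """True if some bit position holds 0 in one vector and 1 in the other."""
    return any(x != y and "X" not in (x, y) for x, y in zip(a, b))


def merge(a, b):
    """Merge two non-conflicting vectors; a specified bit wins over 'X'."""
    return "".join(y if x == "X" else x for x, y in zip(a, b))


def cx_resolve(vectors, fill="0"):
    """First-fit greedy coloring: each vector joins the first color whose
    merged pattern it does not conflict with, otherwise it opens a new color.
    Returns (dictionary, color index of every input vector)."""
    groups = []      # merged pattern of each color
    color_of = []
    for v in vectors:
        for c, g in enumerate(groups):
            if not conflict(v, g):
                groups[c] = merge(g, v)
                break
        else:
            c = len(groups)
            groups.append(v)
        color_of.append(c)
    # Bits that are still 'X' after merging are free; fill them (assumption: with 0).
    dictionary = [g.replace("X", fill) for g in groups]
    return dictionary, color_of


# Hypothetical control words.
cws = ["10XX", "11XX", "XXX1", "1XX1", "11X0"]

zero_fill = list(dict.fromkeys(w.replace("X", "0") for w in cws))
dictionary, color_of = cx_resolve(cws)

print(len(zero_fill), zero_fill)   # 4 ['1000', '1100', '0001', '1001']
print(dictionary, color_of)        # ['1001', '1100'] [0, 1, 0, 0, 1]
```

Checking a vector against the merged pattern of a color is equivalent to checking it against every member of that color, because the merged pattern carries every specified bit of its members. The result of a first-fit heuristic depends on the order in which the vectors are visited, so it is not guaranteed to find the minimum number of colors.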
These experiments show that dictionary-based compression combined with CX can achieve a compression ratio of about one-third. In the next section, we investigate whether the efficiency of dictionary-based compression is maintained if the X values are resolved for power rather than for compression.

5. Combining PX with compression

To reduce dynamic power and code size at the same time, we combine the dictionary-based technique with the PX approach introduced in Section 3, and rerun the experiments. Table 4 shows the code size of the benchmarks after both optimizations. Although PX does not affect the code size of D0PX, it significantly increases that of the compressed controllers (i.e. D1PX-D3PX). That is because PX significantly increases the number of unique binary patterns. This issue can also be observed in the example of Figure 3. After PX optimization (see Figure 4), there are four unique patterns in the code. However, after CX (Figure 10), there are only two unique patterns in the code. In general, compared to CX, PX increases the number of dictionary entries, and hence the width of the codewords. If the number of dictionary entries increases significantly, the size of the compressed code may even exceed the size of the original uncompressed code. This happens for D1PX in our experiments. As shown in the third column, the compression ratio (CR) of D1PX is 1.06, indicating a 6% increase in the code size. While using more dictionaries (D2PX and D3PX) improves the CR to 0.79 and 0.62, the CR is still two times worse than that of the CX implementation shown in Table 3. This motivates us to develop an X-resolution technique that achieves both power and compression efficiency.

Table 4- Code size (KB) of benchmarks with PX
Columns: Benchmark | D0PX | D1PX | D2PX | D3PX
Rows: adpcm_decoder, CRC, dijkstra, sha, average, CR

6. Hybrid X resolution (HX)

In most programs, 90% of execution time is spent in 5%-10% of the basic blocks.
We use this property to improve both code size and dynamic power consumption: we limit the PX optimization to the X values in the frequently executed basic blocks. Using the application profiling information, the frequently executed basic blocks, which are the main contributors to the power consumption, are identified and PX is applied to them. For the rest of the basic blocks, CX is used to improve the compression efficiency. Using this hybrid technique, power consumption can be reduced with little loss of compression efficiency.

Figure 11 shows the flow of our controller synthesizer tool. The inputs to our tool are the CW binaries with X information generated by the compiler, and the application profile information (i.e. the execution frequency of the basic blocks). In the first step of the synthesis, the binary is partitioned according to the requested dictionary style. Then, the content of each dictionary is compressed using the HX approach: first, the X values of the frequently executed basic blocks are resolved using PX, and then the graph coloring algorithm of CX is applied to resolve the rest of the X values and compact the content. Next, the codewords are generated according to the dictionary contents. Finally, the HDL code of the controller is produced.
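The two-step HX resolution described above (PX on hot blocks, then CX on everything) might look roughly as follows for a single dictionary. This is a sketch under several assumptions that the text does not specify: basic blocks are given as (execution count, control words) pairs, a block is treated as hot if it is needed to cover 90% of the dynamically executed control words, X bits in the first word of a hot block are left for CX to resolve, and CX is approximated by a first-fit greedy merge. All names are hypothetical.

```python
def conflict(a, b):
    """True if some bit position holds 0 in one vector and 1 in the other."""
    return any(x != y and "X" not in (x, y) for x, y in zip(a, b))


def merge(a, b):
    """Merge two non-conflicting vectors; a specified bit wins over 'X'."""
    return "".join(y if x == "X" else x for x, y in zip(a, b))


def px_block(words):
    """PX inside one basic block: hold the previous word's bit values."""
    out, prev = [], None
    for w in words:
        if prev is not None:
            w = "".join(p if b == "X" else b for b, p in zip(w, prev))
        out.append(w)
        prev = w
    return out


def hx_resolve(blocks, hot_fraction=0.9, fill="0"):
    """blocks: list of (execution_count, [control words]).
    Returns (dictionary, code_lut, set of hot block indices)."""
    weight = [count * len(ws) for count, ws in blocks]
    total = sum(weight)
    # Select the heaviest blocks until they cover `hot_fraction` of execution.
    hot, covered = set(), 0
    for i in sorted(range(len(blocks)), key=lambda i: -weight[i]):
        if covered >= hot_fraction * total:
            break
        hot.add(i)
        covered += weight[i]
    # Step 1: PX on hot blocks only; cold blocks keep their X bits.
    words = []
    for i, (_, ws) in enumerate(blocks):
        words.extend(px_block(ws) if i in hot else ws)
    # Step 2: CX-style first-fit merge over all words.
    groups, code_lut = [], []
    for w in words:
        for c, g in enumerate(groups):
            if not conflict(w, g):
                groups[c] = merge(g, w)
                break
        else:
            c = len(groups)
            groups.append(w)
        code_lut.append(c)
    return [g.replace("X", fill) for g in groups], code_lut, hot


# Hypothetical profile: the middle block is a hot loop body.
blocks = [(1, ["10XX", "XXX1"]),
          (1000, ["0011", "XXXX", "01X1"]),
          (2, ["1XX1", "XX0X"])]
dictionary, code_lut, hot = hx_resolve(blocks)

print(hot)          # {1}
print(dictionary)   # ['1001', '0011', '0111']
print(code_lut)     # [0, 0, 1, 1, 2, 0, 0]
```

In this example the hot block keeps the low-activity sequence 0011, 0011, 0111 chosen by PX, while the four control words of the cold blocks collapse into a single dictionary entry.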

Figure 11- Profile-guided controller generator (HX)

6.1 Experiments: efficiency of HX

Table 5 shows the code size and power consumption of different controller implementations with HX power optimization. In Table 5, the second, third, and fourth columns show the code size of the one-, two-, and three-dictionary implementations with HX optimization. Compared to the PX implementation (see the last row of Table 4), the compression ratio (CR) of HX is improved significantly (note that smaller CR values indicate better compression). Comparing the power consumption of HX with that of PX (see Table 1) shows that only a few mW of extra power is consumed to achieve 1.8 times better compression.

Table 5- Power consumption and binary size of benchmarks with profile-guided controller generation
Columns: Benchmark | Binary size (KByte): D1HX, D2HX, D3HX | Power (mW): D1HX, D2HX, D3HX
Rows: adpcm_decoder, CRC, dijkstra, sha, average, average CR, average power

Figure 12 and Figure 13 compare the CX, PX, and HX implementations in terms of binary size and power consumption, respectively. Note that the binary size of HX is within 17% of CX, while its power consumption is within 6% of PX. Therefore, HX has the benefits of both CX and PX without their limitations. It is worth noting that all power values in Table 5, Figure 12, and Figure 13 include the dynamic power consumption of the datapath, controller, memories, and decompression logic.

Figure 12- Average binary size (KB) for different dictionary counts (D0, D1, D2, D3) and X resolutions (CX, PX, HX)

Figure 13- Average power (mW) for different dictionary counts (D0, D1, D2, D3) and X resolutions (CX, PX, HX)

7. Conclusion

In this paper, we show that using dictionary-based compression techniques, the code size of microcoded IPs can be reduced by 3 times, on average. We also show that power-aware X-resolution (PX) reduces the dynamic power of the IPs by 26%, on average. However, combining the power and code-size optimizations is challenging. To address this issue, we propose a new profile-guided X-resolution technique that can achieve both power and compression efficiency.

References

[1] A. Agrawala, T. Rauscher, Foundations of Microprogramming: Architecture, Software, and Applications, Academic Press.
[2] R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. R. Rau, D. Cronquist, and M. Sivaraman, "PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators", Journal of VLSI Signal Processing.
[3] S. J. Weber and K. Keutzer, "Using minimal minterms to represent programmability", Proc. CODES+ISSS 2005.
[4] NISC Website.
[5] M. Reshadi, D. Gajski, "A Cycle-Accurate Compilation Algorithm for Custom Pipelined Datapaths", Proc. CODES+ISSS.
[6] M. Reshadi, B. Gorjiara, D. Gajski, "Utilizing Horizontal and Vertical Parallelism Using a No-Instruction-Set Compiler and Custom Datapaths", International Conference on Computer Design (ICCD).
[7] A. Raghunathan et al., "Controller re-specification to minimize switching activity in controller/data path circuits", ISLPED.
[8] A. Raghunathan, S. Dey, N. Jha, and K. Wakabayashi, "Power management techniques for control-flow intensive designs", DAC.
[9] A. Wolfe and A. Chanin, "Executing compressed programs on an embedded RISC architecture", Intl. Symposium on Microarchitecture.
[10] IBM, CodePack PowerPC Code Compression Utility User's Manual Version 3.0, IBM.
[11] T.M. Kemp, R.K. Montoye, D.J. Auerback, J.D. Harper, J.D. Palmer, "A decompression core for PowerPC", IBM Syst. J. 42, 6 (November).
[12] C. Lefurgy, E. Piccininni, T. Mudge, "Evaluation of a high performance code compression method", Intl. Symposium on Microarchitecture.
[13] S. Segars, K. Clarke, and L. Goudge, "Embedded control problems, Thumb, and the ARM7TDMI", IEEE Micro, vol. 15, no. 5, Oct.
[14] B. Gorjiara, D. Gajski, "FPGA-friendly Code Compression for Horizontal Microcoded Custom IPs", Intl. Symposium on FPGA.
