Software Power Optimizations In An Embedded System


Vishal Dalal
3G Wireless Group, Silicon Automation Systems Limited, Bangalore, India

C.P. Ravikumar
Department of Electrical Engineering, Indian Institute of Technology, New Delhi, India

Abstract

The topic of reducing power dissipation in embedded systems has received considerable attention in recent years. Techniques have been reported to minimize energy dissipation through (a) selection of better algorithms for the application, e.g. DSP algorithms that require fewer operations to perform a task such as filtering, (b) minimizing state transitions and switching activity in the hardware implementation, and (c) reducing the operating supply voltage by changing the architecture of the system, e.g. through the use of pipelining. However, power dissipation is often neglected when developing the software for embedded systems. Software optimization techniques can be used to reduce the cost, size, and power dissipation of embedded systems without adding to system overheads. In this paper, we view the power dissipation as consisting of two parts: the power dissipated in the application-specific integrated circuits (hardware power) and the power dissipated by the CPU, memory, and associated busses (software power). We provide a trace-based technique to estimate software power and study the effect of different code optimization techniques on software power, performance, and code size.

1. Introduction

The essential components of an embedded system are the following:

- a processor, which may either be a general-purpose microprocessor/microcontroller or an application-specific instruction-set processor,
- memory, which may be embedded in the processor or may be external to the processor,
- software that resides in the memory and runs on the processor, mainly responsible for real-time handling of I/O requests from the external world, and
- application-specific integrated circuits (coprocessors), which carry out the compute-intensive tasks.

Since many embedded systems, especially those used in mobile applications, run on battery power, it is important to ensure that the system dissipates the least possible power while still providing the required functionality. Algorithms that use fewer arithmetic/logic operations to perform the same function must be used to conserve energy, e.g. filtering algorithms that require fewer multiplications. Techniques have also been developed to reduce switching activity in the application-specific hardware in order to reduce power dissipation [2,3,4,9]. These techniques span all the levels of abstraction in VLSI design: architectural-level (e.g. use of pipelining to reduce Vdd), gate-level (e.g. technology mapping to reduce switching activity), transistor-level (e.g. gate resizing), and layout-level (e.g. use of shorter wires for high-activity nets).

Unfortunately, power dissipation is often neglected during the software implementation of the algorithms in embedded systems, since code size and performance take priority over power dissipation at this stage. There have been efforts to study power minimization through better use of the instruction repertoire of the CPU [9]. In this paper, our aim is to study software optimization techniques that are used in compilers with the objective of meeting these constraints.
We show that the use of code optimization can reduce the software power significantly without adversely affecting the code size or performance. By software power we mean the following components:

- power dissipated in the arithmetic/logic circuits and the control unit of the CPU when executing embedded code,
- power dissipated to charge and discharge the address and data busses, and
- power dissipated within the memory circuits.

In most embedded systems, significant portions of code are executed repetitively, e.g. the ADPCM (Adaptive Differential Pulse Code Modulation) algorithm is executed for every input sample in a telecom system. The number of samples per second can be 8000 or larger. Thus, if the software is not efficiently written, we can expect that it will require a larger number of cycles, consume more power, and occupy more memory.

Programmers may, at times, sacrifice efficiency to improve the readability of code and simplify debugging. The use of function calls is an example. Compared to inline coding, the use of function calls improves the modularity of the code and reduces the memory requirement, but increases the execution time and power dissipation due to stacking/unstacking. Depending on the number of times a function is called, its inline coding may reduce the power dissipation and improve the execution time considerably at the cost of extra memory space.

We study the effect of different code optimization techniques on software power, performance, and code size. Our work is an extension of the study reported in [7], which mainly focused on the size-performance tradeoff in code optimization, but did not consider the effect of the code optimizations, or the order in which they are applied, on system power dissipation. We believe that ours is the first attempt at modeling and estimation of software power. Our results can be useful in making design decisions such as hardware-software partitioning. They are also useful in guiding compiler design.

The rest of the paper is organized as follows. In Section 2, we discuss the various components of software power. Section 3 explains the implementation environment. Section 4 describes the various optimization techniques considered in this paper. In Section 5, we discuss software power estimation. We explain the optimization flow in Section 6. In Section 7, we present the results of our study on the example of the ADPCM algorithm.
Section 8 concludes the paper.

2. Software Power

As mentioned in the previous section, we subdivide the power dissipation in an embedded system into hardware power and software power. The former includes the power dissipated in the application-specific hardware, whereas the latter includes the power dissipated in the CPU, the memory, and the address and data busses. We assume a CMOS implementation in this paper, which means that the main source of software power dissipation is the switching activity in the CPU, the memory circuits, and the busses.

2.1. Bus Power

The busses, comprising unidirectional address busses and bidirectional data busses, are a group of interconnecting wires through which the processor communicates with the memory and I/O circuits. Each line can be conveniently modeled as a lumped RC transmission line, where R is the wire resistance and C is the wiring capacitance. The capacitor C will charge or discharge depending on the present and previous data. For example, on an 8-bit bus, if the new data word differs from the previous one in 6 bit positions, there are 6 transitions (switchings). One estimate shows that charging and discharging of bus lines will take up to half or more of the total chip power for 0.1 µm ULSI [6,8]. In another estimate, the power dissipated in the I/O busses can be as high as 80% [8]. There are coding techniques (like bus-invert coding) which reduce external switching at the expense of slightly increasing the internal switching, in order to reduce the overall power. We attempt to reduce these switching activities through efficient source coding.

2.2. Memory Power

The power dissipated in memory can be a significant component of the overall power dissipated in an embedded system. In the InfoPad subsystem [1], 50% of the power is dissipated in memory.
The major components of memory power are as follows:

- power dissipated in the cell array,
- power dissipated to charge and discharge the word line and bit line capacitances,
- power dissipated in the decode logic, and
- power dissipated in the sense amplifier.

The power dissipated depends on the type of memory access. A sequential memory access consumes less energy, as the next word can be returned from the same buffer. One can also expect that the switching on the address lines will be small in sequential accesses. One exception is when the previous word is the last one in the page. In that case, a separate page access needs to be performed, causing more power dissipation. A non-sequential access consumes more energy, as the address of the next word is either unrelated to the previous address or, if the data is on a different page, entirely different. In the latter case, the relative switching between successive words is also larger. These concepts are illustrated in Figure 1.

In the ARM microprocessor considered in this paper, CPU cycles are classified as S-cycles, N-cycles, or I-cycles. S-cycles refer to sequential memory accesses, N-cycles refer to non-sequential accesses, and I-cycles refer to internal cycles where there is no external memory access.
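The effect of access order on address-line switching can be illustrated with a short C sketch. It is not the authors' tool: the names `bit_transitions`, `trace_stats`, and `classify_trace` are hypothetical, and the rule used here (an access counts as sequential when its address is exactly one 32-bit word after the previous address, and the first access counts as non-sequential) is a simplifying assumption made only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bus lines that toggle when the bus value changes from prev to next. */
unsigned bit_transitions(uint32_t prev, uint32_t next)
{
    uint32_t diff = prev ^ next;   /* set bits mark the lines that changed */
    unsigned count = 0;
    while (diff != 0) {
        diff &= diff - 1;          /* clear the lowest set bit */
        count++;
    }
    return count;
}

typedef struct {
    unsigned sequential;      /* accesses treated as sequential (assumed rule) */
    unsigned non_sequential;  /* all other accesses, including the first one */
    unsigned transitions;     /* total address-line toggles over the trace */
} trace_stats;

/* Assumed rule: an access is sequential when addr[k] == addr[k-1] + 4. */
trace_stats classify_trace(const uint32_t *addr, size_t n)
{
    trace_stats st = {0, 0, 0};
    size_t k;
    for (k = 0; k < n; k++) {
        if (k > 0 && addr[k] == addr[k - 1] + 4u)
            st.sequential++;
        else
            st.non_sequential++;
        if (k > 0)
            st.transitions += bit_transitions(addr[k - 1], addr[k]);
    }
    return st;
}
```

With these assumptions, the word-aligned run 0x8000, 0x8004, 0x8008, 0x800C gives 3 sequential accesses and 4 address-line transitions in total, whereas the scattered run 0x8000, 0x9F3C, 0x8004, 0xA150 gives no sequential accesses and 22 transitions.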

Figure 1. Types of Memory Accesses

2.3. CPU Power

Every instruction executed by the CPU results in switching activity. We can broadly classify instructions as follows:

- Load/store instructions
- Branch instructions
- Type-1 arithmetic instructions (addition, subtraction, shift, etc.)
- Type-2 arithmetic instructions (multiplication, division)

The average energy consumption for these instruction types can be measured either by gate-level simulation or by instruction-level current measurements [9]. Suppose that the relative weights associated with the average energy consumption of the four instruction types are W_j, 1 ≤ j ≤ 4, and the numbers of instructions of these types are I_j. Then the CPU power P_cpu is given by

$$P_{cpu} \propto \frac{\sum_{j=1}^{4} W_j \times I_j}{\sum_{j=1}^{4} I_j}$$

3. Implementation Environment

We have used the ADPCM algorithm as a vehicle to demonstrate the software power optimizations. ADPCM is widely used in DECT (Digital Enhanced Cordless Telecommunications) wireless telephones. ADPCM is a speech compression and decompression algorithm. It takes the difference between successive samples of the signal and encodes the difference. We assume that the ADPCM algorithm has been implemented as part of an embedded system which uses the ARM processor [15]. The ARM processor has two instruction sets, the normal 32-bit ARM instruction set and the 16-bit Thumb instruction set, which is a compressed form of the former. The Thumb instructions are decompressed at the time of execution to produce 32-bit ARM instructions, which are then executed as normal.

We used the ARM Software Development Toolkit [11,12,13], which enables the development of applications for the ARM family of microprocessors. The kit contains the ARMulator, which emulates the execution of applications on the ARM processor without accessing real hardware. The ARMulator models both the ARM and Thumb instruction sets.
A number of software modules are provided with the ARMulator, such as the tracer, which can trace the executed instructions, the type of memory accesses, and any other events that occur during the execution. For example, the tracer can give us the number of S-cycles, N-cycles, and I-cycles.

3.1. Assumptions

In our estimation of software power, we made the following assumptions:

- There is no glitching on the busses.
- There is a single, bidirectional data bus.
- Bus switching uses a full Vdd swing.
- On average, a non-sequential access takes twice the power of a sequential access.
- When the CPU performs 8/16-bit operations on a 32-bit data bus, it outputs 0's on the remaining lines.

4. Optimization Techniques

The various optimizations considered are fully described in [14]. Although the ARM compiler (armcc) provides the options -Otime and -Ospace for performing peephole optimizations for performance and code size, these were not effective, suggesting that improvements must be made at the source code level.

- A for loop coded as for (i = 1; i <= max; i++) can be replaced by for (i = max; i > 0; i--). The latter style is more efficient, since no register is required for saving max.
- In loop unrolling, the work for two successive values of i can be done during the same iteration, provided max is even. This reduces the total number of compare instructions, but increases code size.

- A program normally contains a number of function calls. These calls are associated with computational overheads such as stacking and unstacking. If the functions are coded inline, these overheads can be eliminated, but at the cost of an increase in code size.
- Another technique is creating function macros through the #define preprocessor directive.

The following structural transformations can be applied to the ADPCM code.

- Making the code option-specific instead of generalising it to include different data rates, PCM laws, etc. The code was made specific to the 32 kbit/s data rate, uniform-law PCM, and the ITU-T recommended standard.
- Making the code embedded-system-oriented by eliminating the printf statements.
- Eliminating unnecessary addition. For example, D = (SLI + 65536 - SEI) & 65535; can be replaced by D = (SLI - SEI) & 65535; since 65536 is a 17-bit number whose only set bit is removed by the 16-bit mask.
- Transforming the branching operations. For example, in the Power2-exp function, which is used to find the exponent in ADPCM, the original code is:

      if (val >= 16384) i = 15;
      else if ((val >= 8192) && (val < 16384)) i = 14;
      ...
      else if ((val >= 8) && (val < 16)) i = 4;
      ...

  It can be efficiently replaced by:

      if (val >= 16384) i = 15;
      ...
      else if (val >= 16) i = 5;
      ...
      else if (val >= 1) i = 1;

These optimizations were applied individually, and their effect on power, performance, and code size was analysed for both the 32-bit ARM compiled code (armcc) and the 16-bit Thumb compiled code (tcc).

Figure 2. Optimization Flow (blocks: Source Code, Compiler, Emulator, Tracer, Estimator; outputs: Power, Performance, Code Size)

5. Power Estimation

In order to estimate bus power, the switching between every two consecutive words on the 32-bit busses was calculated. The tracer module traces all the memory accesses in the execution of the program. As per the ARM documentation, an N-cycle can consume up to 2 clock cycles, whereas an S-cycle requires only one cycle. Therefore, the total number of cycles required to complete a program is (2N + S + I) clock cycles in the worst case.
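The loop and branch transformations listed in Section 4 can be sketched in C as follows. This is an illustrative sketch rather than the authors' ADPCM source: the function names are hypothetical, the intermediate branches that the text elides are filled in here by following the visible power-of-two pattern, and returning 0 for inputs below 1 is an assumption.

```c
/* Count-up loop: max is needed for the comparison on every iteration. */
long sum_count_up(const short *x, int max)
{
    long acc = 0;
    int i;
    for (i = 1; i <= max; i++)
        acc += x[i - 1];
    return acc;
}

/* Count-down loop: the exit test compares the counter against zero. */
long sum_count_down(const short *x, int max)
{
    long acc = 0;
    int i;
    for (i = max; i > 0; i--)
        acc += x[i - 1];
    return acc;
}

/* Original style: every branch tests both ends of its range. */
int exponent_range_checks(int val)
{
    int i = 0;
    if (val >= 16384) i = 15;
    else if ((val >= 8192) && (val < 16384)) i = 14;
    else if ((val >= 4096) && (val < 8192)) i = 13;
    else if ((val >= 2048) && (val < 4096)) i = 12;
    else if ((val >= 1024) && (val < 2048)) i = 11;
    else if ((val >= 512) && (val < 1024)) i = 10;
    else if ((val >= 256) && (val < 512)) i = 9;
    else if ((val >= 128) && (val < 256)) i = 8;
    else if ((val >= 64) && (val < 128)) i = 7;
    else if ((val >= 32) && (val < 64)) i = 6;
    else if ((val >= 16) && (val < 32)) i = 5;
    else if ((val >= 8) && (val < 16)) i = 4;
    else if ((val >= 4) && (val < 8)) i = 3;
    else if ((val >= 2) && (val < 4)) i = 2;
    else if ((val >= 1) && (val < 2)) i = 1;
    return i;
}

/* Transformed style: the upper bound is implied by the earlier branches,
   so each branch needs a single comparison. */
int exponent_single_compare(int val)
{
    int i = 0;
    if (val >= 16384) i = 15;
    else if (val >= 8192) i = 14;
    else if (val >= 4096) i = 13;
    else if (val >= 2048) i = 12;
    else if (val >= 1024) i = 11;
    else if (val >= 512) i = 10;
    else if (val >= 256) i = 9;
    else if (val >= 128) i = 8;
    else if (val >= 64) i = 7;
    else if (val >= 32) i = 6;
    else if (val >= 16) i = 5;
    else if (val >= 8) i = 4;
    else if (val >= 4) i = 3;
    else if (val >= 2) i = 2;
    else if (val >= 1) i = 1;
    return i;
}
```

Both loop forms visit the same elements and both exponent functions return the same value for every input, so the transformations change only the number of comparisons executed, not the result.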
The dissipated bus power is proportional to the total switching scaled by the ratio of the number of cycles to the worst-case number of clock cycles:

$$P_{bus} \propto \text{Total Switching} \times \frac{N + S + I}{2N + S + I}$$

The power dissipated in memory is proportional to

$$P_{mem} \propto \frac{B \cdot N + S}{A \cdot N + S + I}$$

where B is the relative energy of a non-sequential access in comparison to a sequential access, and A is the relative length of a non-sequential access in comparison to a sequential access. In the worst case, A can be 2. The value of B will vary depending upon the type and size of memory, and has to be tuned experimentally. In this work, we took B = 2.

6. Optimization Flow

The complete process of optimizing the code is depicted in the optimization flow shown in Figure 2. If the specifications are not met, then optimizations need to be applied again, as shown by the dotted lines.
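The estimates of Section 5 can be sketched as a pair of C helpers that take the cycle counts reported by the tracer. This is a minimal sketch under stated assumptions: the type `cycle_counts` and the function names are hypothetical, the results are relative indices rather than power in watts, and returning 0 for an empty trace is a choice made here to avoid division by zero.

```c
/* Cycle counts as reported by a trace: N-, S-, and I-cycles. */
typedef struct {
    unsigned long n;  /* non-sequential cycles */
    unsigned long s;  /* sequential cycles */
    unsigned long i;  /* internal cycles */
} cycle_counts;

/* Clock cycles when an N-cycle takes a_len times as long as an S-cycle
   (a_len = 2 in the worst case). */
double clock_cycles(cycle_counts c, double a_len)
{
    return a_len * (double)c.n + (double)c.s + (double)c.i;
}

/* Relative bus power: total switching x (N + S + I) / (2N + S + I). */
double bus_power_index(double total_switching, cycle_counts c)
{
    double worst = clock_cycles(c, 2.0);
    if (worst == 0.0)
        return 0.0;   /* empty trace (assumed convention) */
    return total_switching * ((double)c.n + (double)c.s + (double)c.i) / worst;
}

/* Relative memory power: (B*N + S) / (A*N + S + I). */
double memory_power_index(cycle_counts c, double a_len, double b_energy)
{
    double cycles = clock_cycles(c, a_len);
    if (cycles == 0.0)
        return 0.0;   /* empty trace (assumed convention) */
    return (b_energy * (double)c.n + (double)c.s) / cycles;
}
```

For example, with N = 100, S = 300, I = 100, and a total switching count of 1200, the worst-case cycle count is 600, the bus index is 1200 x 500 / 600 = 1000, and the memory index with A = 2 and B = 2 is 500 / 600, roughly 0.83. Because both values are relative, they are meaningful only when comparing variants of the same program.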

7. Results

The percentage changes for each optimization technique with respect to the original ADPCM code are shown in Tables 1 and 2. A negative sign shows that the optimization degrades the respective criterion and therefore should not be applied. The compiler options -Otime and -Ospace either degrade the performance or fail to optimize for the stated criterion. Simultaneous application of optimization techniques gives better results.

Table 1. Percentage Changes for armcc
[Columns: Optimizations, Bus power, Performance, Code size. Surviving row: option specific, 0.61, 1.43, 2.64. The remaining rows are not recoverable.]

Table 2. Percentage Changes for tcc
[Columns: Optimizations, Bus power, Performance, Code size. Surviving row labels: -Otime, -Ospace, for loop, unnecessary addition, function call, function macros, embedded oriented. The values are not recoverable.]

The tables clearly show that some of these optimization techniques offer good improvements in power and performance, e.g. transformation of branching and function calls. We also note that many optimizations improve all three aspects, but some of them result in tradeoffs. In the latter situation, we can rank the optimization techniques for each of the three criteria, as shown in Tables 3 and 4. When used simultaneously, many techniques offer only marginal improvements. Many optimizations are performed on a small part of the code. This produces results which are locally optimal but not globally optimal. For example, the transformation of branching, which gives the maximum improvement in power, was used 4 times per sample in the ADPCM code, giving good improvements.

When the compiler performs optimizations one after another, there may be undesired interactions between them. Therefore, the order of optimizations matters. The compiler could try all possible orderings, but in practice it orders optimizations by experimentation because of time constraints.

Table 3. Order of Optimizations for armcc
[Surviving row labels: branching, function call, unnecessary addition, embedded oriented, option specific, for loop termination, -Ospace, -Otime. The rank values are not recoverable.]

Table 4. Order of Optimizations for tcc
[Surviving power ranking: 1 branching, 2 option specific, 3 embedded oriented, 4 unnecessary addition, 5 loop unrolling, 6 function macros, followed by function call, for loop termination, and -Ospace. The remaining columns are not recoverable.]

8. Conclusions

In this paper, we have studied the effect of several source-level optimizations on the performance, power, and code size of embedded software. We illustrated the tradeoffs involved using the example of the ADPCM algorithm, which is often used in applications such as the answering machine. Our results indicate that significant reductions in power dissipation are possible through code rewriting. We have provided a method to estimate software power in embedded systems, which considers CPU power, bus power, and memory power.

Acknowledgements

We thank Thomas Major of Philips Semiconductors, Bangalore, for permitting us to use the ARM Software Development Kit. We thank Anil Sharma of Philips Semiconductors, Eindhoven, for many useful discussions.

References

[1] Burd T.D. and Brodersen R., "Processor Design For Portable Systems", Department of EECS, University of California at Berkeley.
[2] Chandrakasan A. and Brodersen R., "Low Power CMOS Design", IEEE Press.
[3] Chandrakasan A., Sheng S. and Brodersen R., "Low Power CMOS Digital Design", IEEE Journal of Solid State Circuits, April.
[4] Mehta H., Owens R.M., Irwin M.J., Chen R. and Ghosh D., "Techniques for Low Energy Software", Department of Computer Science and Engineering, The Pennsylvania State University, PA.
[5] Najm F., "Transition Density: A New Measure Of Activity In Digital Circuits", IEEE Transactions on CAD of Integrated Circuits and Systems, Feb.
[6] Nakagome Y., Itoh K. et al., "Sub-1-V Swing Internal Bus Architecture for Future Low Power ULSI's", IEEE Journal of Solid State Circuits, April.
[7] Sharma A. and Ravikumar C.P., "Efficient Implementation Of ADPCM Codec", The 13th International Conference on VLSI Design, Calcutta, January 3-7.
[8] Stan M. and Burleson W., "Bus Invert Coding For Low Power I/O", IEEE Transactions on VLSI Systems, pp. 49-58, March.
[9] Tiwari V., Malik S. et al., "Power Analysis Of Embedded Software: A First Step Towards Software Power Minimization", IEEE Transactions on VLSI Systems, December.
[10] Weste N. and Eshraghian K., "Principles Of CMOS VLSI Design", Addison-Wesley.
[11] Advanced RISC Machine User Guide, ARM DUI 0040C.
[12] Advanced RISC Machine Reference Guide, ARM DUI B.
[13] ARM7TDMI Data Sheet, ARM DDI 0029E.
[14] ARM Application Note 34, "Writing Efficient C For ARM", ARM DAI 0034A.
[15] Website of Advanced RISC Machine.
