Procedural Functional Partitioning for Low Power

Procedural Functional Partitioning for Low Power Enoch Hwang Frank Vahid Yu-Chin Hsu Department of Computer Science Department of Computer Science La Sierra University, Riverside, CA 92515 University of California, Riverside, CA 92521 ehwang@cs.lasierra.edu vahid@cs.ucr.edu hsu@cs.ucr.edu Abstract Power consumption in VLSI systems has become a critical metric for design evaluation. Although power reduction techniques can be applied at every level of design abstraction, most techniques are applied to the lower levels such as the register-transfer or gate levels and are limited to reducing power only to a portion of the entire circuit. We demonstrate power reduction using a coarsegrained procedural functional partitioning technique. This technique allows us to easily partition the entire processor (controller and datapath) by separating entire procedures into several smaller, mutually exclusive, interacting processors at a much higher level of the design abstraction. Power reduction is accomplished because only one smaller processor needs to be active at a time. Our results show that procedural functional partitioning can reduce average power consumption by as much as 78% with an average of 51%. 1. Introduction Power reduction of VLSI systems is becoming an increasingly important goal for system design. The shrinking sizes of integrated circuits calls for reduced power consumption, in order to extend battery life and prevent excessive heat generation. While power reduction techniques can be applied at nearly every design abstraction level, most of the recent power estimation and reduction tools focus on the lower levels of abstraction, such as the register-transfer (RT) or gate levels [1]. There is therefore still much opportunity to reduce power by developing tools that focus on higher levels, such as the behavioral level, where major reductions may be gained. Power consumption of a CMOS circuit is composed mainly from dynamic power [2] due to capacitive charging and discharging when a signal toggles as computed from the equation 1 2 P d = C V f N (1) 2 where C is the loading capacitance in the circuit, V is the supply voltage, f is the clock frequency, and N is the switching frequency of the gate. For a given behavioral level circuit design, all the variables except for N are usually fixed by a design decision. Thus, the power reduction of a circuit at this level is obtained mainly by reducing the switching frequency N from the outputs of the gates, and this is made possible by shutting down unnecessary portions of the circuit. This is accomplished either by adding new control circuitry to the circuit or by partitioning the circuit. Shut down techniques that add new control circuitry include [3], [4], [5], [6], [7], [8], [9], [1] and [11]. [3] isolates the portion of the circuit that is responsible for most of the computation complexity and tries to shut down the rest of the circuit as much as possible. [4], [5] and [6] reduces the switching activities in flip-flops, registers, and memories. [7], [8], [9], [1] and [11] modifies the datapath to reduce switching activities during calculations. Partitioning techniques include [12], [13], [14] and [15]. [12] and [13] partitions resources in the datapath, whereas [14] partitions the controller. The idle parts, whether in the datapath or the controller, can then be shut down. All the above techniques, however, try to reduce the switching activities within a localized subset of gates either in the datapath or the controller alone. [15] partitions both the controller and the datapath so that when a part is idle, switching activities are reduced for the entire processor and not just to a localized subset of gates. The technique described in [15], however, is applied to the finite state machine with datapath (FSMD) model where the resources have already been scheduled, so the timings are already determined and fixed. In this paper, we demonstrate power reductions at the behavioral level using procedural functional partitioning. Procedural functional partitioning allows us to easily partition the entire processor (both the controller and the datapath together) into several smaller, mutually exclusive, interacting processors at an even higher level of the design abstraction. The smaller inactive processors are then shut down and power reduction is accomplished because only one smaller processor is active at a time. Furthermore, this technique is applied before any of the resources are scheduled so timing issues are not of a concern. Our results show that procedural functional partitioning can reduce average power consumption by as much as 78% with an average of 51%. In addition to power reduction, this technique has been shown to reduce I/O and synthesis runtimes by nearly an order of magnitude, as well as being suited for hardware/software partitioning [16]. The rest of this paper is organized as follows. In section 2, we give a background of how functional ISLPED2 2/18/ 11:32 AM Page 1 of 5

partitioning can reduce power. Our procedural functional partitioning technique is discussed in section 3. Section 4 shows our experimental results, and we conclude in section 5. 2. Background Circuit partitioning can occur either before synthesis (as in functional partitioning) or after synthesis (as in structural partitioning). When a circuit is synthesized from a behavioral description to a gate level netlist, only one control unit and one datapath is generated. The implication of having only one control unit and one datapath is that when there is a signal change at the primary input, the entire datapath and control unit may be affected. This results in much unnecessary switching activities and therefore, wasted power. In structural partitioning, as shown in Figure 1 (a), we first synthesize a behavioral process written in a behavioral hardware description language (HDL), down to a registertransfer or gate level netlist. Next, we partition the netlist into two or more parts. From a power reduction perspective, structural partitioning does not reduce switching activity. The reason is that synthesis converts one process into one custom hardware processor, consisting of a controller and a datapath. Even though partitioning creates more than one physical partition, there is still logically only one datapath and one control unit. When a primary input signal changes, it may affect many of the gates in the datapath regardless of which partition they are in. Similarly, all the gates in the control unit must also be active, regardless of which part they are in, in order to provide the correct control signals to the datapath. In functional partitioning, the behavioral process, written in HDL, is first partitioned into several smaller processes, as shown in Figure 1 (b). Each process is then synthesized to its own custom processor, each with its own controller and datapath. From a power reduction perspective, functional partitioning can significantly reduce switching activity. The reason is that each processor, having its own controller and datapath, is smaller than one large processor implementing the entire process, and only one processor is executing a computation at any given time while the other processors will be idle. When a processor is Behavioral Specification High-Level Synthesis Structural Cont- Un- Data (a) Partitioning path rol it Behavioral Fn 1 Fn Specification 2 Functional Partitioning Fn 1 High-Level Synthesis Figure 1: Partitioning: (a) structurally; (b) functionally. (b) Fn 2 idle, we have, in effect, shut down both the controller and the datapath for that processor. Thus, power reduction is possible through functional partitioning because it reduces the overall switching activities of the entire system by localizing the activities within smaller processors, hence, power consumed per operation is less. 3. Procedural functional partitioning In contrast to FSMD functional partitioning [15] which performs a partitioning of FSMD states, procedural functional partitioning performs a coarse-grained partitioning of procedures and functions. Given a HDL description of a design with two or more procedures and functions, we partition this HDL design into two or more parts by roughly dividing the procedures and functions equally between the parts. In our current model there is no explicit power metric in the partitioning cost function. Since we are dealing with homogeneous parts, we have found that minimizing communication between parts, while maintaining balanced part sizes, is a good approach to minimizing power. Our future work will replace this indirect approach with a more direct power metric. The parts are then individually synthesized down to the gate level with each part having its own controller and datapath. Finally, the parts are reconnected back together via a common communication bus as explained in section 3.2. Working together, these parts are functionally equivalent to the original unpartitioned design. Partitioning will introduce new switching activity for interprocessor communication. Therefore, the problem that must be solved is one of partitioning such that the reduction in switching activity for computations far outweighs the switching activity increase for communication. Communication will also increase the execution time because extra cycles must be added for each handshaking. This, however, is not a problem in regards to preserving the cycle-by-cycle behavior of the original unpartitioned design because our technique is applied to the HDL design before scheduling. 3.1 Example Given the pseudo-code in Figure 2, this unpartitioned process calls three separate procedures, mod, divide, and is_prime. These procedures execute sequentially but they are all consuming power for the entire duration of the execution of the process. We can partition this process into three parts as in Figure 3 by placing mod and divide in part two, is_prime in part three and the remaining code in part one. Part one controls the entire program flow, performing the primary I/O, and calling the procedures in the other parts when required. Initially, part one executes while parts two and three remains idle. Thus, only part one is consuming power. When the procedures Mod and Divide are required, part one passes the control to part two, after which only part two is active. When part two completes its execution, it returns the results and control back to part one. ISLPED2 2/18/ 11:32 AM Page 2 of 5

Part one then calls part three to perform the is_prime function. Finally, part three returns back to part one and execution stops. Figure 4 and Figure 5 show a plot of the switching activities for the unpartitioned and partitioned examples as shown in Figure 2 and Figure 3 respectively. Here we actually see the decreased in switching activities as a result of the partitioning. In the unpartitioned case, Figure 4, the number of toggles can go as high as 2 in a clock cycle. The total power consumption is 597 µjoules. Whereas, in the partitioned case, Figure 5, the maximum number of toggles in a clock cycle is only 65. The total power consumption for this case is only 2,243 µjoules, a 56% reduction. In the partitioned graph, Figure 5, parts two and three are called twice each. Part one is the main controlling module which is active only during the communication with the other two parts. We see that while part two is active, parts one and three do not have any switching activities whatsoever. Similarly, when part three is active, parts one and two are completely inactive. The only time when all three parts have switching activities is when the handshaking is occurring over the communication bus. 3.2 Communication bus A common communication bus is added between all the parts as shown in Figure 6. This bus is used for all handshaking and data transfers between the parts. The bus input x; count := 2; while ((count * count) <= x) loop mod(x,count,mod_result); divide(x,count,divide_result); is_prime(count,prime_result1); is_prime(divide_result,prime_result2); if (mod_result = ) and (prime_result1 = 1) and (prime_result2 = 1) then answer1 <= divide_result; answer2 <= count; exit; else count := count + 1; end if; end loop; Figure 2: Sample unpartitioned pseudo-code for Fac1. part 1 while ((count * count) <= x) loop call and wait for result from part 2; call and wait for result from part 3; if end loop; part 2 wait for part 1 to call; get parameters x and count; mod(x,count,mod_result); divide(x,count,divide_result); return mod_result and divide_result; part 3 wait for part 1 to call; get parameters count and divide_result; is_prime(count,prime_result1); is_prime(divide_result,prime_result2); return prime_result1 and prime_result2; Figure 3: Sample partitioned pseudo-code for Fac1. is composed of n address lines and m data lines. The size of n is determined by the number of parts being connected together, and m is determined by the maximum of all the procedure call parameter sizes. Every procedure call will have some input and output parameters and each parameter has a known bit-width. We simply sum the bit-widths for all the parameters in a procedure call to get the total bit-width for that particular procedure call. Since the data bus is shared between all the parts, we therefore need the maximum of the total bitwidths for all the procedure calls. Different bus configurations will result in different power consumption as reported in [19]. In our current experiment, we have used only a simple bus configuration. We should get better results if we also try to optimize the bus configuration. Only one part will control the bus at a time, with the others providing high-impedance values. There is activity on the communication bus only when one part is passing the control to another part. During this short handshaking time, all parts are active. A system executes as a series of procedure calls. If part 1 calls part 2, part 1 sends part 2 s address over the address bus, followed by input parameters to part 2. When part 2 sees its address on the address bus, it wakes up and captures the input parameters and then Number of Toggles 21 14 7 6 16 26 36 46 56 66 76 86 96 16 116 126 136 146 156 166 176 186 196 26 216 Time (cycles) Figure 4: Plot of switching activities for the Fac1 unpartitioned example. Total power is 597 µjoules. Number of Toggles 8 6 4 2 Part 1 Part 2 Part 3 21 41 61 81 11 121 141 161 181 21 221 241 261 281 31 321 341 361 381 41 421 Time (cycles) Figure 5: Plot of switching activities for the Fac1 partitioned example. Total power is 2243 µjoules. External ports Part 1 Part 2 Part n Data Lines Address Lines... Figure 6: Communication bus architecture. ISLPED2 2/18/ 11:32 AM Page 3 of 5

executes. While part 2 is executing, all the other parts are idle. Upon completion, part 2 sends part 1 s address over the address bus, followed by any output parameters. Upon seeing its address on the address bus, part 1 wakes up, captures the output parameters, and resumes execution. With this model, a system executes as a series of procedure calls. Such an execution model has the feature that only one part is active at a time. As long as there is no activity on the communication bus, the inputs to all the parts will be at a constant value. If the inputs to a part do not change, then there will be no switching activities within that part. Thus, all switching activities will be localized to just the one active part at a time as shown in Figure 5. 4. Experiment We want to compare the power consumption of the partitioned system with that of the unpartitioned system. First, we describe a system at the behavioral level using VHDL. For the unpartitioned system, we simply synthesize the behavioral description to the gate level and calculate the power consumed. For the partitioned system, we first partition the behavioral description into two or more functional parts as explained in section 3. We then add the communication bus with the necessary handshaking protocol to each part. The parts are individually synthesized to the gate level. The same synthesis optimizations applied to the unpartitioned system are also applied to the partitioned system. At the gate level, we join these parts back together by adding common wires for the communication bus. Power consumption is then calculated. In our experiments, we described eight examples at the behavioral level using VHDL. After applying our partitioning technique, the system is synthesized and simulated to obtain the switching activity data. Power results are calculated using the switching activity obtained from the simulation and the loading capacitance of the netlist obtained from the technology library. The unpartitioned and partitioned systems are compared in terms of their average and total power usage, area, and execution time. The results take into consideration the fact that the capacitance for the communication bus is larger than the internal capacitance. In our power calculation, we have used a bus capacitance that is four times the internal Table 1: Example statistics. Ex gate count comm loops Unpart Part= P1+P2+P3+... calls Fac1 15251 17172=8918+351+523 1/n/n 1/2/2 Fac2 15251 1582=926+1349+1695+3498 1/n/n/n 1/1/1/1 Ch 1 19766 3246=14965+3679+13816 1/1/1 4/3/n Ch 2 19766 34111=11116+5499+3679+13817 1/1/1/1 1/3/3/n Ch 3 19766 29551=11694+2345+1695+13817 1/3/3/1 1/1/1/n Diffeq 11487 12245=134+195 1/12 1/n Volsyn 11193 13163=1798+2365 1/n n/n NLoops 2622 337=1811+1496 1/n n/n capacitance. Table 1 shows the statistics for the examples. Fac is a factorization program as shown in Figure 2. Ch evaluates the Chinese Remainder Theorem. Diffeq is an example from the HLSynth MCNC benchmark. Volsyn is a volumemeasuring medical instrument controller. NLoops is an example with nested while loops. For the Fac example, two different ways of partitioning the system were used. The first way is shown in Figure 3. For the Ch example, three different ways of partitioning the system was done. The Unpart and Part columns show the gate count for the unpartitioned and partitioned system respectively. In the Part column, the gate count for the individual parts are further broken down. The comm calls column shows the number of times each part is called via the communication bus. For example, for the Fac1 example, part one is called one time and parts two and three are called n times to denote that it is dependent on the input value. The loops column shows whether there is a loop in that part and if so how many times it loops around. Again, n denotes that it is dependent on the input value. The results are summarized in Table 2. The Area and Time columns show the percent increase in area (in terms of gate count) and execution time respectively. The absolute average and total power for the partitioned systems are shown in columns 4 and 5 respectively. The percent average and total power savings are shown in columns 6 and 7 respectively. In all cases, both the average and total power is reduced. The reduction in average power ranges from 27% to as much as 78% with an average of 51%. The reduction in total power ranges from 5% to 66% with an average of 41%. The tradeoff for the area on the average is 32% and 22% on average for the execution time. The reason for the size and execution time increase is because of the extra communication overhead. It is interesting to note that in the Fac2 example, the gate count for the partitioned system is increased by only 4%. This increase is very insignificant. In fact, a decrease Table 2: Power reduction results. % Overhead Absolute Partitioned Ex Area Time Average Total % % Power Power (µw) (µj) % Power Savings Average Power % Total Power % Fac1 4 23 22.65 2242.95 64 56 Fac2 13 53 13.84 1694.67 78 66 Ch 1 49 2 16.3 364.99 45 44 Ch 2 73 5 15.16 3352.1 48 45 Ch 3 64 16 13.12 335.78 55 48 Diffeq 7 3 4.76 7771.52 27 5 Volsyn 18 24 1.38 2587.22 29 12 NLoops 26 2 3.7 2157.51 59 51 Average 32 22 16.96 335.83 51 41 ISLPED2 2/18/ 11:32 AM Page 4 of 5

was observed in [2]. A possible reason is that the synthesizer can optimize a small design much better than a large design. For both the Fac and Ch examples, we see that power is further reduced just by adding an extra part. For the Fac example, as much as 1% total power was further reduced just by partitioning four parts instead of three parts. By adding an extra part, we have increased the total size because of the communication overhead. However, the individual size of each part is smaller. This causes the switching activity to be even more localized and confined within fewer gates. Thus, more reduction in the power is seen. From this result, we see that it is better to have more smaller parts rather than bigger but less parts, as long as the size and performance overhead are kept within the limits. The rationale is that smaller parts will have less switching activities when the part is active. The rest of the dormant parts do not contribute any dynamic power due to capacitive charging and discharging because there are no switching activities. Of course, more parts mean more communication on the communication bus and longer execution time. 5. Conclusions We have considered the problem of reducing power consumption at the behavioral level using a coarse-grained procedural functional partitioning technique that produces two or more mutually exclusive parts. Unlike many previous power reduction shutdown techniques that focus only on either the datapath or the controller separately, we partition the entire processor to shut down both the controller and the datapath. Furthermore, our technique is applied at a much higher level. We take a process with several procedures and partition the process by putting entire procedures into different parts. We achieved up to 78% average power reduction with an average of 51%. The area overhead is 32% with a 22% increase in the execution time. Proceedings of the International Conference on Computer Design, pp. 74-81, October 1994. [9] Vivek Tiwari, Sharad Malik, & Pranav Ashar, Guarded Evaluation: Pushing Power Management to Logic Synthesis/Design, International Symposium on Low Power Design, 1995. [1] A. Raghunathan, & N. Jha, ILP formulation for Low Power Based on Minimizing Switched Capacitance during Data Path Allocation, Proc. of the Int. Sym. on Circuits & Systems, 1995. [11] T. Lang, E. Musoll, & J. Cortadella. Redundant Adder for Reduced Output Transitions, Universitat Politècnica de Catalunya DAC Technical Report 22, 1996. [12] J. Henkel, "A low power hardware/software partitioning approach for corebased embedded systems," DAC, pp. 122-127, 1999. [13] Vaishnav & Pedram, Delay Optimal Partitioning Targeting Low Power VLSI Circuits, Intl. Conf. on CAD, pp. 638-643, 1995. [14] L. Benini, P. Vuillod, G. De Micheli & C. Coelho, Synthesis of Low- Power Selectively-Clocked Systems from High-Level Specification, International Symposium on System Synthesis, pp. 57-63, Nov. 1996. [15] Enoch Hwang, Frank Vahid, & Yu-Chin Hsu, FSMD Functional Partitioning for Low Power, Proceedings of the Design, Automation and Test in Europe, pp. 22-28, March 1999. [16] F. Vahid, T. Le, & Y.C. Hsu, A Comparison of Functional and Structural Partitioning, International Symposium on System Synthesis, pp. 121-126, November 1996. [17] Y. Hsu, T. Liu, F. Tsai, S. Lin, & C. Yu, Digital design from concept to prototype in hours, in Asia-Pacific Conference on Circuits and Systems, December 1994. [18] PureSpeed Simulator, An event-driven simulator from FrontLine Design Automation, Inc. [19] T. Givargis and F. Vahid, "Interface Exploration for Reduced Power in Core-Based Systems," International Symposium on System Synthesis, pp. 117-122, December 1998. [2] F. Vahid, "I/O and Performance Tradeoffs with the FunctionBus during Multi-FPGA Partitioning," Int. Sym. on FPGA, pp. 27-34, February, 1997. References [1] Srinivas Devadas & Sharad Malik, A Survey of Optimization Techniques Targeting Low power VLSI Circuits, Proceedings of the Design Automation Conference, pp. 242-247, 1995. [2] A. Chandrakasan, T. Sheng, & R. Brodersen, Low Power CMOS Digital Design, Journal of Solid State Circuits, Vol. 27, No. 4, pp. 473-484, April 1992. [3] G. Lakshminarayana, A. Raghunathan, K. Khouri, N. Jha, and S. Dey, "Common-Case Computation: A High-Level Technique for Power and Performance Optimization," DAC, pp. 56-61, June 1999. [4] Wen-Tsong Shiue and Chaitali Chakrabarti, "Memory exploration for low power, embedded systems," DAC, pp. 14-145, June 1999. [5] T. Lang, E. Musoll, & J. Cortadella. Reducing Energy Consumption of Flip-Flops, Universitat Politècnica de Catalunya DAC Technical Report 4, 1996. [6] J. Chan and M. Pedram, Register allocation and binding for low power, Proceedings of the Design Automation Conference, pp. 29-35, June 1995. [7] M Muench, N When, B Wurth, R Mehra and J Sproch, Automating RT- Level Operand Isolation to Minimize Power Consumption in s, Proceedings of the Design, Automation and Test in Europe, March 2. [8] Mazhar Alidina, Jose Monteiro, Srinivas Devadas, & Abhijit Ghosh, Precomputation-Based Sequential Logic Optimization for Low Power, ISLPED2 2/18/ 11:32 AM Page 5 of 5