Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution


Sungpack Hong, EE Department, Stanford University
Sungjoo Yoo, Byeong Bin, Kyu-Myung Choi, Soo-Kwan Eo, Samsung Electronics
Taehwan Kim, EECS, Seoul National University

ABSTRACT

This paper presents a method of dynamic voltage scaling (DVS) that tackles both switching and leakage power with combined V_dd/V_bs scaling and gives minimum average energy consumption by exploiting the runtime distribution of software execution. We present a mathematical formulation of the DVS problem and an efficient numerical solution. Experimental results show that the presented method gives up to % further reduction in energy consumption compared with existing methods. The presented method is most effective when the leakage power consumption is significant, i.e. when temperature is high.

1. Introduction

Dynamic voltage scaling (DVS) is one of the most effective methods for reducing both switching and leakage power consumption. There have been two classes of DVS methods: inter-task and intra-task DVS. Inter-task DVS methods [1][2] determine the performance level at task granularity, while intra-task DVS methods do so at finer granularities [3][4][5]. In intra-task DVS, workload estimation plays a central role, since the performance level (normalized w.r.t. maximum frequency) in the middle of task execution is dynamically determined, mostly, by X/T, where X is the estimated remaining workload and T is the time to deadline. Thus, the accuracy of workload estimation determines the quality of an intra-task DVS method.

Several methods of workload estimation have been proposed: worst case execution time [3][4], average case execution path [5], average energy execution path [6], and statistical methods [7]. Among them, the statistical method and the average energy execution path-based one are reported to give the best reduction in average switching energy consumption, since they provide global minimum solutions based on mathematical formulations.
However, these methods do not minimize leakage power consumption, since they minimize only the switching energy under the assumption of P ~ f^3 (P ~ CV^2 f ~ f^3 since V ~ f). Leakage power consumption has already become a real design issue. In particular, excessive leakage power consumption at high temperatures often causes a significant drop in product parametric yield in practice. Thus, DVS methods need to optimize leakage energy as well as switching energy. In order to reduce leakage energy consumption, we apply combined V_dd/V_bs scaling [][], since body biasing (scaling V_bs) is the most effective way to control leakage power consumption.

Footnote 1: Although the power consumption specification can be met at room temperature, it often cannot be met at the high temperatures in the product specifications, e.g. 8 or ., due to significant leakage power consumption.

In our work, we extend the statistical DVS method (which originally targets only dynamic energy) to reduce both switching and leakage energy by scaling both V_dd and V_bs. Note that the statistical method covers the method based on average energy execution path [6] as a simplified case. We give a mathematical formulation of the problem of V_dd/V_bs scaling based on the statistical information, i.e. the distribution of software runtime. The formulation gives a multi-variable non-linear function of total energy consumption. As a practical way to obtain the workload estimations for the minimum average energy consumption, we present a numerical solution.

This paper is organized as follows. Section 2 reviews existing DVS methods. Section 3 explains the mathematical formulation of statistical DVS based on combined V_dd/V_bs scaling. Section 4 gives a total power function for combined V_dd/V_bs scaling. Section 5 presents a numerical solution to the problem. Section 6 reports experimental results and Section 7 concludes the paper.

2. Related Work

In [3], an intra-task DVS method called runtime voltage hopping is presented to exploit workload variation to reduce the energy consumption.
In this work, the workload variation is a slack which is calculated, at the hopping point, as the difference between the expected worst-case execution time and the real program runtime of the already executed software code. In [4], the remaining workload is estimated to be the execution cycle of the worst-case execution path from a performance setting point in the software program to the end of the program. The execution cycle of the average-case execution path is estimated to be the remaining workload in [5]. The concept of virtual execution path is presented in [6] to estimate workloads for minimum average energy consumption. The method uses the worst-case execution cycles of the remaining basic blocks to predict the remaining workload assuming P ~ f^3. In [3] and [], methods of DVS based on combined V_dd/V_bs scaling are presented. In these works, the relationship between power and frequency can be an arbitrary one. However, those works do not consider the runtime distribution of software execution, but are based on the worst-case execution cycle.

The above methods have a common assumption in estimating the remaining workload. They all assume the worst-case execution cycle (of the entire remaining program or of each basic block). However, in reality, it is rare to encounter the worst-case execution. Since minimizing energy consumption is mostly an optimization problem for average cases (e.g. the battery lifetime of a mobile device is mostly evaluated in an average sense after running an extensive set of benchmarks and use cases), DVS problems need to be tackled statistically to reduce average energy consumption while meeting the given deadline constraints.

In [7], a statistical method based on the runtime distribution of the software execution cycle (not the worst-case execution cycle) is presented. This method makes it possible to obtain the estimation of remaining workload that yields the minimum average energy consumption. However, this method is based on the assumption of P ~ f^3. Thus, it does not minimize the entire energy consumption, especially when the leakage power is not negligible.

978-3-988-3-/DATE8 8 EDAA

3. Mathematical Formulation of Statistical DVS based on Combined V_dd/V_bs Scaling

Figure 1 illustrates the intra-task DVS based on runtime distribution. We divide the entire software program into chunks of code called program regions (PRs), shown as rectangles in the figure, and set the operating frequency at the beginning of each PR (called the performance setting point) based on the estimation of remaining workload, X. For instance, at the beginning of program region N_1 in the figure, we set the frequency to be X_1/T, where X_1 is the estimated remaining workload and T the time to deadline.

Figure 1 Intra-task DVS with runtime distribution (program regions N_1 and N_2 with estimated workloads x_1 and x_2; each region has a PDF of probability versus execution cycles, w_1 and w_2)

Each program region has a distribution of its runtime (execution cycles), i.e. a probability distribution function (PDF) as illustrated in Figure 1. The distribution results from both data dependency (e.g., data dependent number of loop iterations) and the underlying hardware architecture (e.g., cache misses, variable latency instructions, etc.). In this case, our problem is to calculate the estimated workload x_1 that will give the minimum average energy consumption of all the remaining program regions.

In [7], the authors formulate the problem mathematically with two assumptions. The first assumption is P ~ f^3. The second assumption is the independence between X_1 and X_2. Thus, X_1 can be calculated assuming that X_2 has already been obtained (thus, assuming that it is a constant) in the bottom-up traversal of PRs from the leaf PR (i.e. the end of the program). At each program region, X_i is obtained by solving ∂E(X_i)/∂X_i = 0. However, when applying combined V_dd/V_bs scaling, the two assumptions do not hold any more.
The average energy consumption function E(X_1, X_2) becomes more complicated as follows (the details are omitted for page limit).

$$E(X_1, X_2) = \int p_1(w_1) \left[ P\!\left(\frac{X_1}{T}\right) \frac{w_1 T}{X_1} + \int P\!\left(\frac{X_1 X_2}{T (X_1 - w_1)}\right) \frac{w_2\, T (X_1 - w_1)}{X_1 X_2}\, p_2(w_2)\, dw_2 \right] dw_1 \qquad (1)$$

where P(f) is the total power function (when V_dd/V_bs scaling is applied) with a normalized frequency f as the input argument, and p_1(w_1) and p_2(w_2) are the PDFs of program regions N_1 and N_2 as exemplified in Figure 1. The first term is the energy of N_1 running at frequency X_1/T for w_1 T/X_1 time units. The second term is the energy of N_2, which runs at frequency X_2 divided by the remaining time T(X_1 - w_1)/X_1.

Footnote 2: In reality, the X-axis of the PDF is quantized into multiple bins as explained in Section 5.

In Eqn. (1), the total power function P(f) is no longer a simple function (such as P ~ f^3), but a more complicated one (which will be presented in Section 4). In Eqn. (1), in order to obtain the minimum average energy consumption E, X_1 cannot be determined independently from X_2. They need to be determined simultaneously. In the case of n program regions, n workload estimations {X_1, X_2, ..., X_n} for the minimum average energy consumption need to be obtained simultaneously. Thus, the original bottom-up traversal method in [7] cannot be applied in this case.

In summary, given n program regions (and their PDFs), we need to find a set of workload estimations {X_1, X_2, ..., X_n} giving the global minimum of average energy consumption. Since the total power function is not an analytical function, we need numerical solutions to find the global minimum. In this paper, we present a numerical solution to this problem.

4. Total Power Function in Combined V_dd/V_bs Scaling

This section presents the total power function and its dependency on temperature when combined V_dd/V_bs scaling is applied. Given a frequency requirement, we can obtain a pair of V_dd and V_bs values that gives the minimum total (switching and leakage) power consumption. In this section, we present our total power function and the associated pair of V_dd and V_bs values following the method in [3]. Switching power P_SW is calculated as follows.

$$P_{SW} = C_{eff} \, V_{dd}^2 \, f \qquad (2)$$

where the effective capacitance C_eff is given in Table 1. Leakage power P_Leak is obtained as follows.
$$P_{Leak} = (V_{dd} \, I_{sub} + V_{bs} \, I_j) \, L_g \qquad (3)$$

where the subthreshold current I_sub is given as

$$I_{sub} = K_3 \, e^{K_4 V_{dd}} \, e^{K_5 V_{bs}} \qquad (4)$$

The parameters I_j, L_g, K_3, K_4, and K_5 are given in Table 1. The threshold voltage V_th and inverter delay t_inv are modeled as follows.

$$V_{th} = V_{th1} - K_1 V_{dd} - K_2 V_{bs} \qquad (5)$$

$$t_{inv} = \frac{L_d \, K_6}{(V_{dd} - V_{th})^{\alpha}} \qquad (6)$$

The operating frequency f is determined to be 1/(L_d * t_inv), where L_d is the logic depth of the critical path. Table 1 summarizes all the parameters used when deriving the total power function for combined V_dd/V_bs scaling based on the method and parameters in [9][].

Table 1 Power parameters [9][]

    K_1    .63       K_6     .6e-     V_th1   .
    K_2    .3        K_7     -.       I_j     .8e-
    K_3    .38e-7    V_dd             C_eff   .e-9
    K_4    .83       V_bs             L_d     3
    K_5    .9        alpha   .        L_g     e6
    C_r    e-6       C_s     e-6

Figure 2(a) shows the total power function and the corresponding pair of V_dd and V_bs values. V_dd (V_bs) ranges between .V and .8V (-.V and V) (footnote 3). As shown in the figure, the total power is not an analytical function of frequency.

Footnote 3: Note that the frequency range is between GHz and 6GHz. This range is higher than the critical frequency in [3].
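The model of Eqns. (2) to (6) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parameter values in `EXAMPLE` are hypothetical placeholders (they are not the entries of Table 1), the helper names are ours, the voltage grids are arbitrary, and the junction term uses `abs(vbs)` so that a reverse body bias contributes positive power, which is an assumption on top of Eqn. (3) as printed.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class PowerParams:
    c_eff: float
    l_g: float
    i_j: float
    k1: float
    k2: float
    k3: float
    k4: float
    k5: float
    k6: float
    v_th1: float
    alpha: float
    l_d: float


# Hypothetical placeholder values for illustration only (NOT Table 1).
EXAMPLE = PowerParams(c_eff=1e-9, l_g=1e6, i_j=1e-10, k1=0.05, k2=0.1,
                      k3=1e-7, k4=1.5, k5=4.0, k6=5e-12,
                      v_th1=0.25, alpha=1.5, l_d=30.0)


def threshold_voltage(p: PowerParams, vdd: float, vbs: float) -> float:
    """Eqn. (5)."""
    return p.v_th1 - p.k1 * vdd - p.k2 * vbs


def max_frequency(p: PowerParams, vdd: float, vbs: float) -> float:
    """Eqn. (6) followed by f = 1 / (L_d * t_inv); requires vdd > V_th."""
    t_inv = p.l_d * p.k6 / (vdd - threshold_voltage(p, vdd, vbs)) ** p.alpha
    return 1.0 / (p.l_d * t_inv)


def switching_power(p: PowerParams, vdd: float, f: float) -> float:
    """Eqn. (2)."""
    return p.c_eff * vdd ** 2 * f


def leakage_power(p: PowerParams, vdd: float, vbs: float) -> float:
    """Eqns. (3) and (4); abs(vbs) is our assumption for reverse bias."""
    i_sub = p.k3 * math.exp(p.k4 * vdd) * math.exp(p.k5 * vbs)
    return (vdd * i_sub + abs(vbs) * p.i_j) * p.l_g


def min_power_pair(p: PowerParams, f_target: float,
                   vdd_values: Iterable[float],
                   vbs_values: Iterable[float]) -> Optional[Tuple[float, float, float]]:
    """Grid search for the (V_dd, V_bs) pair that sustains f_target with
    minimum total power. Returns (power, vdd, vbs), or None if infeasible."""
    vbs_list = list(vbs_values)
    best = None
    for vdd in vdd_values:
        for vbs in vbs_list:
            if vdd <= threshold_voltage(p, vdd, vbs):
                continue  # transistor would not switch
            if max_frequency(p, vdd, vbs) < f_target:
                continue  # this pair is too slow for the requirement
            total = switching_power(p, vdd, f_target) + leakage_power(p, vdd, vbs)
            if best is None or total < best[0]:
                best = (total, vdd, vbs)
    return best
```

Evaluating `min_power_pair` over a range of target frequencies yields a tabulated P(f), which is the only form in which the rest of the method needs the power function.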

Figure 2 Total power function in combined V_dd/V_bs scaling. (a) Total power function at C: power (W) and voltage (V) versus frequency (GHz) for P_SW, P_Leak, V_dd and V_bs. (b) Temperature dependency of total power consumption: power (W) versus frequency (GHz).

Figure 2(b) shows the trend of total power consumption as temperature varies. We model the temperature dependency of leakage power as follows [].

    I(Temp) = I_s * exp(-A / (Temp + 73 B))    (7)

where I_s is the leakage power at room temperature and parameters A and B are 66.3 and 9., respectively [].

In our experiments, we use the total power function presented in this section. Note that the method presented in Section 5 does not necessarily depend on the total power function presented in this section. Thus, any total power function, e.g., one obtained from slower processors or from real measurements, can be used in the presented method.

5. Presented Numerical Method

In this section, we first explain our terminology, then a numerical solution based on iterative improvement, and finally how to calculate energy consumption with an arbitrary total power function under software runtime distribution.

5.1 Terms and Notations

Node: Since we divide the whole program into a set of program regions, we consider the entire program as a graph where nodes correspond to program regions and arcs to the control dependency between nodes. Thus, we use the two terms, node and program region, interchangeably. In our notation, N_i is the node with index i. N_0 is the node corresponding to the entry program region.

Runtime Distribution: We regard the runtime of program region N_i as a random variable w_i. We obtain the runtime distribution by extensive profiling, as explained in Section 6. In reality, we have the PDF of w_i as a function with M bins (M=3 in our experiments). w_i^k denotes the representative value (execution cycle) of the k-th bin, and p(w_i^k) is the probability of the bin.

Remaining Execution Cycles: We profile the maximum (WT_i), average (AT_i) and minimum (BT_i) numbers of execution cycles from the beginning of node N_i to the end of the program.
Time to Deadline: T_i is a random variable representing the remaining time to deadline at the beginning of node N_i. By definition, the remaining time to deadline at the entry node, T_0, equals the given deadline D. The distribution of T_i is represented as a PDF with L bins (L=6 in our experiments). T_i^j is the representative value (time to deadline) of the j-th bin, and q(T_i^j) is the probability of the bin. Note that the distribution of T_0 is trivial as follows.

$$q(T_0^j) = \begin{cases} 1.0, & j = L \\ 0, & j = 1 \sim (L-1) \end{cases} \qquad (8)$$

We explain how to calculate the PDF q(T_i^j) in Section 5.3.

Estimated Workload: We denote the estimated workload of node N_i as X_i, which is the variable we want to determine for each node so that the average energy consumption under runtime distribution is minimized for the whole program.

Performance Setting Point (PSP): The PSP is a code location where the performance level of the processor is adjusted. Each program region has a PSP at its beginning. In terms of software code, we insert, at the PSP, a function for performance setting, PS(), as follows.

    PS(i) {   // for node N_i
        T = Get_Time_to_Deadline();
        f_min = Get_Min_Required_Freq(i, T);
        f = max(X_i/T, f_min);   // X_i was calculated at design time.
        Set_Freq_Voltages(f);    /* Adjust V_dd and V_bs as in Figure 2 */
    }

As shown above, in order to meet the given deadline, function PS() first calculates the minimum required frequency f_min by calling function Get_Min_Required_Freq(). The details of this routine can be found in [7] and we omit them here. Note that the presented DVS method satisfies the given deadline. In our work, we insert PSPs manually at the boundaries of compute-intensive loop iterations. Automatic insertion of PSPs is an interesting topic that we will investigate in our future work.

Footnote 4: Note that we also take voltage transition delay into account in our implementation of PS() as in [7].

5.2 Workload Estimation

Now we present the numerical solution to obtain the set of workload estimations X_i such that the average total energy consumption (as illustrated in Eqn. (1)) is minimized. Figure 3 gives a pseudo code of the presented solution.

    1  Find_All_Workload_Estimations() {
    2      X_i = WT_i;   /* for all i = 0 ~ N-1 */   /* Initial Solution */
    3      E_best = Calculate_Average_Energy();   /* In Section 5.4 */
    4      Do {   /* Main Loop */
    5          for i = 0 to N-1
    6          { X_i = Find_Single_Workload_Estimation(i, BT_i, WT_i); }
    7          E = Calculate_Average_Energy();
    8          if (E < E_best) { E_best = E; Save_Current_Estimations(); }
    9          else break;   /* No improvement in the while loop */
    10     } while (loop_count++ < MAX_COUNT)
    11 }

Figure 3 An iterative solution

The basic idea of the presented solution is that we tackle one variable (the workload estimation of one node) at a time, iteratively, until there is no further reduction in average total energy consumption. First, all the workload estimations X_i are set to the worst-case remaining execution cycles WT_i as the initial solution (line 2 in Figure 3). Then, the expected energy consumption for this solution is calculated as will be explained in Section 5.4 (line 3). In the main loop (lines 4-10), we obtain the workload estimation of each node that gives the minimum average energy consumption while assuming the other workload estimations are set to the current set of X_i (line 6). Function Find_Single_Workload_Estimation(), which gives the workload estimation, will be explained later in this subsection. The function returns a new workload estimation X_i (line 6). Once all the nodes are processed, we calculate the total energy consumption with the new workload estimations (line 7), and update the best case if there is any improvement (line 8). This main loop is repeated until there is no improvement or a given number of iterations is reached (MAX_COUNT, line 10).

Figure 4 shows the pseudo code of function Find_Single_Workload_Estimation(). The function sweeps candidate values of workload estimation within the given range of possible workload estimation (between minimum and maximum remaining cycles, i.e., BT_i and WT_i). For each candidate value, we first update the PDFs of time to deadline T for all the other remaining nodes (line 4 in the figure). This is because the T's of the remaining nodes (nodes to be executed after node N_i) change depending on how many cycles node N_i spends. Thus, depending on the choice of X_i, the PDFs of time to deadline for all the remaining nodes need to be updated accordingly. We explain how to update the PDFs of time to deadline T (in function Update_PDF_Time_to_Deadline_Recursively()) in Section 5.3.

    1  Find_Single_Workload_Estimation(i, MIN, MAX) {
    2      E_best = MAX_VAL;
    3      for X_i = MIN to MAX step STRIDE {
    4          Update_PDF_Time_to_Deadline_Recursively(i, X_i);
    5          E = Calculate_Average_Energy();
    6          if (E < E_best) { E_best = E; X_best = X_i; }
    7      }
    8      return X_best;
    9  }

Figure 4 Function to find single workload estimation

Figure 5 illustrates an example result of such a sweep (lines 3-7 in Figure 4) to find the workload estimation. As shown in Figure 5, the sweep locates the workload estimation (inside the rectangle in the figure) giving the minimum average energy consumption.
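The control flow of Figures 3 and 4 can be sketched in Python as below. This is a minimal sketch, not the authors' implementation: the average-energy computation is passed in as a callable `average_energy` (an assumption of ours that stands in for the PDF update plus Calculate_Average_Energy()), and the quadratic `toy_energy` is a hypothetical function used only to exercise the loop.

```python
from typing import Callable, List, Sequence, Tuple

EnergyFn = Callable[[Sequence[float]], float]


def find_single_workload_estimation(i: int, x: Sequence[float], lo: float,
                                    hi: float, stride: float,
                                    average_energy: EnergyFn) -> float:
    """Sweep candidates for X_i in [lo, hi] with the other estimations fixed
    (Figure 4) and return the candidate with the lowest average energy."""
    trial = list(x)
    best_e = float("inf")
    best_x = x[i]
    cand = lo
    while cand <= hi:
        trial[i] = cand
        e = average_energy(trial)
        if e < best_e:
            best_e, best_x = e, cand
        cand += stride
    return best_x


def find_all_workload_estimations(wt: Sequence[float], bt: Sequence[float],
                                  stride: float, average_energy: EnergyFn,
                                  max_count: int = 10) -> Tuple[List[float], float]:
    """Iterative improvement of Figure 3: start from the worst-case remaining
    cycles WT_i and re-optimize one node at a time until no gain remains."""
    x = list(wt)                      # initial solution
    best_x = list(x)
    best_e = average_energy(x)
    for _ in range(max_count):        # main loop
        for i in range(len(x)):
            x[i] = find_single_workload_estimation(
                i, x, bt[i], wt[i], stride, average_energy)
        e = average_energy(x)
        if e < best_e:
            best_e, best_x = e, list(x)
        else:
            break                     # no improvement in this pass
    return best_x, best_e


def toy_energy(x: Sequence[float]) -> float:
    """Hypothetical coupled energy surface, for illustration only."""
    a, b = x[0] - 6, x[1] - 3
    return a * a + b * b + 0.5 * a * b
```

With `toy_energy`, `wt = [10, 8]`, `bt = [2, 1]` and a stride of 1, the first pass moves the estimations to `[5, 3]` and the second to `[6, 3]`, after which no further improvement is found. Because each sweep changes only one variable, the procedure is a coordinate-wise search and its result depends on the stride and on the starting point.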
Figure 5 An example of single workload estimation (estimated energy versus workload estimation; the algorithm fails for too small estimations and F_max is reached for too big estimations)

5.3 Derivation of the PDFs of Time to Deadline

The PDF of time to deadline of node N_0, T_0, is trivial as shown in Eqn. (8). Figure 6 shows the pseudo code of PDF derivation for the other nodes. Figure 7 illustrates the PDF updating. Assuming that the PDF updating is applied to two consecutive nodes N_i and N_{i+1} (N_{i+1} starts to execute after N_i finishes), the figure shows the runtime distribution (in PDF) of N_i, p(w_i), and its PDF of time to deadline T_i, q(T_i), in the upper part. The figure illustrates how to derive node N_{i+1}'s PDF of time to deadline q(T_{i+1}) from the two PDFs, p(w_i) and q(T_i). Suppose that X_i = 12 in the loop of Figure 4 (lines 6) and that function Update_PDF_Time_to_Deadline_Recursively(i, X_i) is called (line 4 in Figure 4).

    1  Update_PDF_Time_to_Deadline(i, X_i) {
    2      q(T_{i+1}^l) = 0.0;   /* l = 1 ~ L */
    3      for j = 1 to L {
    4          for k = 1 to M {
    5              T_used = w_i^k / (X_i / T_i^j);
    6              T_remain = T_i^j - T_used;
    7              prob = q(T_i^j) * p(w_i^k);
    8              l = Get_Bin_Index_T(T_remain);
    9              q(T_{i+1}^l) = q(T_{i+1}^l) + prob;
    10         } }
    11 }
    12 Update_PDF_Time_to_Deadline_Recursively(i, X_i) {
    13     for (m = i to N) {
    14         Update_PDF_Time_to_Deadline(m, X_m);
    15     } }

Figure 6 Functions to update the PDF, q(T)

Now we calculate the PDF of node N_{i+1}'s time to deadline, q(T_{i+1}). Since T_i^L = 12 as shown in the middle of Figure 7, we set the frequency at N_i to 1 (= X_i/T_i^L). Considering the execution cycle of node N_i, w_i, we have different probabilities for different w_i^k values. For instance, suppose node N_i takes w_i^k = 3 clock cycles, which happens with probability p(w_i^k = 3) as shown in Figure 7. In this case, since node N_i takes 3 cycles (= w_i^k/(X_i/T_i^L), as in line 5 of Figure 6), only 9 cycles (= 12 - 3) remain as the time to deadline for node N_{i+1} (line 6 in Figure 6). Thus, the probability that T_{i+1} has 9 cycles of time to deadline, q(T_{i+1}^3 = 9), becomes p(w_i^k = 3) * q(T_i^L), as the two arrows in Figure 7 illustrate (also, line 7 in Figure 6). The probabilities of all the other bins for T_{i+1} are calculated in the same way.

Figure 7 Updating the PDF of time to deadline

Based on function Update_PDF_Time_to_Deadline(), function Update_PDF_Time_to_Deadline_Recursively() in Figure 4 can be easily implemented as shown in Figure 6 (lines 12-15).

Footnote 5: For simplicity, we assume a sequential chain of nodes in this explanation. In the case that there are multiple children or parents for a node (i.e. conditional branches), the above algorithm should be slightly modified to include branch probability and breadth-first iteration as in [7]. The extension is trivial and hence omitted for brevity here.

5.4 Calculating Average Energy Consumption

Once the workload estimations are determined, function Calculate_Average_Energy() gives the average total energy consumption for the workload estimations. First, the energy consumption of node N_i can be calculated as follows.

$$\text{Energy} = \text{Power}(\text{freq}) \times (\text{time consumed by } N_i) = P\!\left(\frac{X_i}{T_i}\right) \frac{w_i T_i}{X_i} \qquad (9)$$

where X_i is a constant in this case while w_i and T_i are random variables with their PDFs, p(w_i) and q(T_i). The average energy consumption of node N_i, e_i, can be calculated as follows.

$$e_i = \sum_{j=1}^{L} \sum_{k=1}^{M} P\!\left(\frac{X_i}{T_i^j}\right) \frac{w_i^k \, T_i^j}{X_i} \, p(w_i^k) \, q(T_i^j) \qquad (10)$$

Consequently, the average total energy consumption of the whole program is calculated by summing the e_i, each multiplied by the probability of executing the corresponding node to account for conditional branches [7].

5.5 Consideration of Temperature Conditions

Since the power-frequency characteristic changes with temperature, we propose a simple approach to adapt the presented method to varying temperature conditions. First, we select a number of representative temperatures (e.g., , 7, , and ) and establish the total power function P_K(f) for each of those cases. We make a set of workload estimations for each representative temperature. Note that these calculations are all done at design time. During runtime, the DVS algorithm obtains temperature information by consulting the thermal sensor and chooses the appropriate set of workload estimations based on the current temperature. Then, it performs performance setting as explained above.

6. EXPERIMENTS

In our experiments, we assume the processor power model presented in Section 4. We assume discrete frequencies that range between GHz and 6GHz with MHz step. For each frequency step, a pair of V_dd/V_bs is applied to give an optimal V_dd/V_bs scaling as explained in Section 4. From the set of discrete frequencies, we select as the required performance level the lowest frequency level that is higher than or equal to the frequency calculated in function PS(). We also prepare the power models at four different temperature conditions, , , 7, and . Voltage transition time is assumed to be µs.
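The PDF update of Section 5.3 and the average energy of Eqn. (10) can be sketched for a sequential chain of nodes as follows. The sketch rests on several assumptions of ours: a PDF is a list of (representative value, probability) pairs, bins of time to deadline have a uniform width and are represented by their centre, the minimum-frequency guard of PS() and the F_max limit are ignored, and `power` is any callable that maps frequency to power.

```python
import math
from typing import Callable, List, Sequence, Tuple

Pdf = List[Tuple[float, float]]  # (representative value, probability) per bin


def update_pdf_time_to_deadline(q_i: Pdf, p_i: Pdf, x_i: float,
                                bin_width: float) -> Pdf:
    """Derive q(T_{i+1}) from q(T_i) and p(w_i), following Figure 6."""
    acc = {}
    for t_j, q_j in q_i:
        for w_k, p_k in p_i:
            t_used = w_k / (x_i / t_j)   # time spent in N_i at frequency X_i/T_i^j
            t_remain = t_j - t_used      # time left for the following nodes
            if t_remain <= 0:
                raise ValueError("workload estimation too small: deadline missed")
            bin_idx = math.floor(t_remain / bin_width)  # stand-in for Get_Bin_Index_T
            acc[bin_idx] = acc.get(bin_idx, 0.0) + q_j * p_k
    # Assumption: each bin is represented by its centre.
    return sorted(((idx + 0.5) * bin_width, prob) for idx, prob in acc.items())


def node_average_energy(power: Callable[[float], float], q_i: Pdf, p_i: Pdf,
                        x_i: float) -> float:
    """Eqn. (10): expectation of P(X_i/T_i) * w_i * T_i / X_i over both PDFs."""
    return sum(power(x_i / t_j) * (w_k * t_j / x_i) * p_k * q_j
               for t_j, q_j in q_i for w_k, p_k in p_i)


def calculate_average_energy(power: Callable[[float], float], deadline: float,
                             runtime_pdfs: Sequence[Pdf], x: Sequence[float],
                             bin_width: float) -> float:
    """Sum the per-node averages along a sequential chain N_0, N_1, ..."""
    q: Pdf = [(deadline, 1.0)]           # Eqn. (8): T_0 equals the deadline D
    total = 0.0
    last = len(runtime_pdfs) - 1
    for i, (p_i, x_i) in enumerate(zip(runtime_pdfs, x)):
        total += node_average_energy(power, q, p_i, x_i)
        if i < last:                     # no node follows the last one
            q = update_pdf_time_to_deadline(q, p_i, x_i, bin_width)
    return total
```

For instance, with a deadline of 12, X_0 = 12 and a node that takes 3 or 6 cycles, the update places the probability mass in the bins that contain 9 and 6 remaining time units, which mirrors the 12 - 3 = 9 step of the Figure 7 walkthrough. Handling conditional branches would additionally require the branch probabilities mentioned in footnote 5.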
The energy consumption of voltage transition, E_s, is modeled as follows [].

$$E_s = V_{dd} \, C_r + V_{bs} \, C_s \qquad (11)$$

The runtime overhead of the performance setting function call PS() and that of voltage/frequency transition are assumed to be k cycles and µs, respectively (footnote 6). The delay overhead of voltage transition is also taken into account in function PS() as in [7] when checking whether the deadline constraint can be met.

Footnote 6: Regarding the voltage transition overhead, µs, we take the conservative approach that the processor does not perform computation during the transition. The runtime overhead of function PS() is negligible in our examples (and in reality also), since the spacing between two consecutive PSPs is in the order of milliseconds as shown in Table 2.

When software execution finishes before the deadline, we apply power and/or clock gating. If the remaining time is less than ms, we apply only clock gating. Thus, in this case, leakage power is consumed from the end of software execution to the deadline. If the remaining time is longer than ms, we apply power gating, after the ms period of clock gating, until the deadline. We also assume that processor power gating takes ms.

We apply the presented method to four multimedia software applications: H.264 decode, MPEG decode/encode, and MP3 decode. We insert PSPs manually at the boundaries of sets of loop iterations in the source code of the applications as in [7] (footnote 7). We obtain the distribution of software runtime by running representative benchmarks for each application on a PC (Pentium, .8GHz). For H.264 and MPEG, we use the same benchmarks that are used in [7]. Table 2 gives the summary of applications. Note that we set practical deadlines on the applications to account for the real multi-task software execution environment. For instance, the OS consumes a portion of processor cycles for its housekeeping operations, e.g. the timer.
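The frequency selection rule used in the experiments (pick the lowest discrete level that is not below the frequency computed in PS()) can be sketched as follows. The level values and function names are hypothetical, and clamping to the highest level when the request exceeds it is our assumption; the text does not state what happens in that case.

```python
import bisect
from typing import Sequence


def select_frequency_level(levels: Sequence[float], f_required: float) -> float:
    """Return the lowest level >= f_required from an ascending list of levels.
    Assumption: requests above the highest level are clamped to it."""
    idx = bisect.bisect_left(levels, f_required)
    if idx == len(levels):
        return levels[-1]
    return levels[idx]


def performance_setting(x_i: float, time_to_deadline: float, f_min: float,
                        levels: Sequence[float]) -> float:
    """Mirror of PS(): f = max(X_i / T, f_min), then round up to a level."""
    f = max(x_i / time_to_deadline, f_min)
    return select_frequency_level(levels, f)


# Hypothetical frequency levels in GHz, for illustration only.
EXAMPLE_LEVELS = [1.0, 1.5, 2.0, 2.5]
```

Rounding up rather than to the nearest level keeps the selected frequency at or above the computed requirement, at the cost of some extra energy per step.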
Table 2 Software programs used in the experiments

    Application               # PSPs    Deadline
    H.264 Decode (H.264)                30 ms (33 fps)
    MPEG Decode (MPEG-d)                30 ms (33 fps)
    MPEG Encode (MPEG-e)                ms ( fps)
    MP3 Decode (MP3)          6         ms (footnote 8)

Footnote 7: Finding suitable PSP locations is another interesting problem, but is beyond the scope of this paper. Practically, however, programmers can easily identify a few candidates among major loops and functions in their codes.

Footnote 8: In the case of the MP3 application, the deadline of ms is set assuming a multi-task software execution environment where most processor cycles are consumed by other compute-intensive user programs, e.g. web browsing, games, multimedia searching, etc.

Figure 8 shows the runtime distribution (PDF) of five nodes for H.264 and MPEG-d, respectively. It shows the ratio of maximum to minimum execution cycle for each node (X-axis). It also shows the relative portion (numbers in rectangles) of the execution cycles of each node to the total execution cycle. For instance, the fourth program region of H.264 has a ratio of 6.9 (the maximum execution cycle is 6.9 times bigger than the minimum execution cycle) and consumes 9.3% of the total execution cycle. As shown in the figure, the runtime of H.264 is more widely distributed than that of MPEG-d. This fact is reflected in the experimental results in Figure 9.

Figure 8 Runtime distributions of H.264 (a) and MPEG-d (b) (probability versus max/min ratio for each node number)

Figure 9 shows the comparison of energy consumption for the four software applications. We apply three methods of workload estimation: the worst-case remaining execution cycle-based method (WT, e.g. [3]), the average remaining execution cycle-based method (AT), and the proposed one (Ours). The results are normalized to the WT method. In order to analyze the effectiveness of those methods, four different temperatures are assumed as shown in the figure. The figure also shows the energy reduction (%) of our method compared with the best of WT and AT, i.e. min(WT, AT). The presented method gives up to % reduction in energy consumption.

Figure 9 Energy consumption comparison

Figure 10 explains how the presented method gives better energy efficiency than the other two. The figure shows the frequency change of the three methods during the execution of the H.264 decode application for three frames. As shown in the figure, WT starts at a high frequency since it assumes the worst-case execution for the remaining execution. However, as the program run advances, the performance level drops rapidly and the program finishes earlier than in the other two cases. AT shows the opposite behavior. In the beginning, assuming the average execution cycle as the remaining workload, it starts with a very low frequency level. However, due to the overly optimistic estimation in the beginning, the performance level needs to be increased at the end of execution to meet the given timing constraint, in this case, 30 ms for one frame decoding. We call these two frequency settings early and late high frequency setting, respectively. As shown in Figure 2, at high temperatures, high frequency levels suffer from the penalty of large leakage power. Thus, both WT and AT suffer from this penalty. The presented method takes a balanced approach. As shown in Figure 10, it starts at a performance level between those of WT and AT and keeps the balanced position to the end of execution, thereby avoiding the penalty of high frequency.

Figure 10 Comparison of frequency settings

The results of Figure 9 are explained by both (1) the difference between the early and late high frequency settings of WT and AT and (2) the runtime distribution shown in Figure 8. In H.264 and MP3, WT suffers from large energy consumption at high temperatures.
This is because (1) their runtime distribution (that of H.264 is shown in Figure 8) has a high max/min ratio (thus, worst-case estimation can be too pessimistic) and (2) the penalty of the early high frequency setting of WT becomes dominant at high temperatures. In contrast, in the cases of MPEG-d and MPEG-e, AT gives inferior results to the others as temperature increases. The max/min ratio is small in these cases (the max/min ratio of MPEG-d is shown in Figure 8). Thus, the penalty of early high frequency in WT diminishes, since workload estimation based on the worst-case execution cycle count is more accurate than in the case of a high max/min ratio. However, AT still suffers from the penalty of late high frequency settings, thereby giving inferior results.

7. CONCLUSION

In this paper, we presented a DVS problem based on $V_{dd}/V_{bs}$ scaling and software runtime distribution. We formulated the problem mathematically and presented a numerical solution to it. The experimental results show that the presented method gives significant energy reduction, up to %, especially when temperature is high and leakage power dominates. Currently, we are working on applying the presented method to multiprocessor DVS and on developing adaptive methods that exploit the dynamically varying software runtime distribution.

8. REFERENCES

[1] D. Kwon and T. Kim, Optimal Voltage Allocation Techniques for Variable Voltage Processors, DAC, 2003.
[2] K. Choi, W. Lee, R. Soma, and M. Pedram, Dynamic Voltage and Frequency Scaling under a Precise Energy Model Considering Variable and Fixed Components of the System Power Dissipation, ICCAD.
[3] S. Lee and T. Sakurai, Run-time Voltage Hopping for Low-Power Real-Time Systems, DAC.
[4] A. Azevedo, et al., Profile-Based Dynamic Voltage Scheduling Using Program Checkpoints, DATE.
[5] D. Shin and J. Kim, Optimizing Intra-Task Voltage Scheduling using Data Flow Analysis, ASPDAC.
[6] J. Seo, T. Kim, and K. Chung, Profile-Based Optimal Intra-Task Voltage Scheduling for Hard Real-Time Applications, DAC.
[7] S. Hong, et al., Runtime Distribution-Aware Dynamic Voltage Scaling, ICCAD, 2006.
[8] R. Jejurikar and R. Gupta, Dynamic Slack Reclamation with Procrastination Scheduling in Real-time Embedded Systems, DATE.
[9] R. Jejurikar, C. Pereira, and R. Gupta, Leakage Aware Dynamic Voltage Scaling of Real-Time Embedded Systems, DAC.
[10] S. Martin, K. Flautner, T. Mudge, and D. Blaauw, Combined Dynamic Voltage Scaling and Adaptive Body Biasing for Lower Power Microprocessors under Dynamic Workloads, ICCAD.
[11] L. Yan, J. Luo, and N. Jha, Joint Dynamic Voltage Scaling and Adaptive Body Biasing for Heterogeneous Distributed Real-Time Embedded Systems, TCAD.
[12] W. Liao, F. Li, and L. He, Microarchitecture Level Power and Thermal Simulation Considering Temperature Dependent Leakage Model, ISLPED, 2003.
[13] P. Huang and S. Ghiasi, Leakage-aware Intraprogram Voltage Scaling for Embedded Processors, DAC, 2006.
[14] P. Huang and S. Ghiasi, Efficient and Scalable Compiler-Directed Energy Optimization for Realtime Application, DATE, 2007.