DiBA: Distributed Power Budget Allocation for Large-Scale Computing Clusters


2016 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing

Masoud Badiei (1), SEAS, Harvard University
Xin Zhan (2), School of Engineering, Brown University, xin_zhan@brown.edu
Reza Azimi, School of Engineering, Brown University, reza_azimi@brown.edu
Sherief Reda, School of Engineering, Brown University, sherief_reda@brown.edu
Na Li, SEAS, Harvard University, nali@seas.harvard.edu

(1, 2) These authors contributed equally to this work.

Abstract: Power management has become a central issue in large-scale computing clusters, where a considerable amount of energy is consumed and a large operational cost is incurred annually. Traditional power management techniques have a centralized design that creates challenges for the scalability of computing clusters. In this work, we develop a framework for distributed power budget allocation that maximizes the utility of computing nodes subject to a total power budget constraint. To eliminate the role of the central coordinator in the primal-dual technique, we propose a distributed power budget allocation algorithm (DiBA) which maximizes the combined performance of a cluster subject to a power budget constraint in a distributed fashion. Specifically, DiBA is a consensus-based algorithm in which each server determines its optimal power consumption locally by communicating its state with neighbors (connected nodes) in a cluster. We characterize a synchronous primal-dual technique to obtain a benchmark for comparison with the distributed algorithm that we propose. We demonstrate numerically that DiBA is a scalable algorithm that outperforms the conventional primal-dual method on large-scale clusters in terms of convergence time. Further, DiBA eliminates the communication bottleneck in the primal-dual method. We thoroughly evaluate the characteristics of DiBA through simulations of large-scale clusters. Furthermore, we provide results from a proof-of-concept implementation on a real experimental cluster.

Index Terms: Distributed algorithm, Primal-dual algorithm, Power management, Consensus algorithm.

I. INTRODUCTION

The advent of cloud computing and new online services such as software as a service (SaaS) over the past decade demands a significant scaling of computing clusters. This demand has led to the development of very large-scale clusters that rely on scale-out models (horizontal scaling), which expand capabilities by incorporating new server nodes rather than increasing the capabilities of existing servers [1]. Furthermore, there is a keen interest in dynamically controlling the total power consumption of clusters to enable participation in demand response energy programs and to improve energy efficiency in the presence of variations in cluster utilization.

Incorporating new server nodes is only practical if the computing cluster has a modular architecture. Currently used power management methods are not aligned with such an architecture as they have a centralized design, i.e., there is a central controller aggregating information from all active computing nodes to allocate power efficiently [2]. Specifically, in conventional power management techniques, local data of all computing nodes, such as workload, application priority, and thermal footprint, are constantly monitored to efficiently allocate the power budget among them in a centralized manner. Therefore, adding new computing nodes to a cluster necessitates reconfiguration of power management hardware and software, which drives up the scaling cost.

To tackle this challenge, we propose a distributed power budgeting framework that scales gracefully for large clusters. In particular, our contributions are as follows:

- We propose DiBA, a novel algorithm for distributed power budget allocation to maximize the combined utility of the entire cluster subject to a given total power budget. The utility of each computing node is a metric for its throughput that depends on the benchmark of the workload as well as server specifications.
- In our distributed framework, each server exchanges its decision to increase/decrease its power usage with neighboring nodes in the cluster through rounds of communications. After each round, the local parameters of each node are updated in the form of a state-space model. In this phase, the local actions of neighbors are implemented as a control input. Further, the actions of servers take into account the global constraint that must be satisfied.
- We compare the performance of DiBA with a centralized power budgeting scheme in terms of total and per-iteration communication/computation time. We note that there is a communication bottleneck at the central coordinator in the centralized method for large clusters. The primal-dual (PD) method also suffers from the same issue since it relies on the same architecture to disseminate information among nodes. In contrast, DiBA provides a fully decentralized mechanism which eliminates the role of the central coordinator. As a result, we demonstrate that for large-scale clusters with thousands of nodes, the communication time of DiBA is substantially less than that of the centralized scheme and the primal-dual method.
- We also examine the impact of a number of issues on the performance of DiBA, including dynamic workloads and dynamic power budgets. We also analyze the impact of the topology of communication among the distributed computations, and we show that DiBA can provide fast convergence with topologies as simple as a ring.
- We also implement a working prototype of DiBA on a real experimental cluster and demonstrate its ability to meet a dynamic total power budget in a fully distributed fashion.

The organization of this paper is as follows. Section II provides a brief overview of related work. Section III motivates the need for distributed power management techniques, and Section IV provides the details of our proposed DiBA algorithm. Our experimental results are given in Section V. Finally, we summarize our conclusions and directions for future work in Section VI.

II. RELATED WORKS

The problem of efficient power management in computing clusters has been studied extensively from different perspectives; for a recent survey see [3]. Different practices can be applied to reduce power consumption. Nevertheless, they can be divided into one of the following categories:

- Schemes based on dynamic voltage/frequency scaling (DVFS) [2], [4], [5], [6], [7].
- Load balancing and workload scheduling techniques [8], [9].

For memory-intensive workloads, DVFS during memory-bound phases of execution has been shown to provide power savings with minimal impact on the quality of service (QoS) [10]. Based on this premise, the power management framework we present in this paper utilizes the DVFS technique and thus belongs to the first class. Using the same technique, Beloglazov et al. [4] have proposed some heuristics for energy management under a strict service level agreement (SLA) constraint. Specifically, by dynamically migrating virtual machines (VMs) across physical machines, workloads are consolidated and idle resources are subsequently put in a low-power state using DVFS.

In conjunction with the DVFS technique, Elnozahy et al. [5] have evaluated five different policies for cluster-wide power management in server farms. In particular, in the independent voltage scaling policy, the authors consider a mechanism in which each node independently manages its own power consumption using dynamic voltage scaling. Nevertheless, this policy is different from our proposed distributed scheme (i.e., DiBA) in that there is no hard constraint on the power budget of the entire cluster and therefore there is no need for coordination among computing nodes to meet the power constraint.

In the same line of work on DVFS techniques, distributed power management schemes have also been studied [6]. Specifically, in [6], the problem of minimizing the power consumption in a three-tier computing cluster under an end-to-end SLA constraint has been investigated. To solve this problem, two strategies have been investigated. In the first strategy, dubbed OptTuner, a primal-dual decomposition method has been employed where each tier of servers updates its power consumption by communicating with a central unit. As the second strategy, a linear-quadratic regulator (LQR) control method has been employed. Despite having a distributed structure, OptTuner suffers from the Single Point of Failure (SPF) problem. That is, it requires a central node to coordinate among computing nodes, and is therefore susceptible to node failure. Moreover, OptTuner has a homogeneous power distribution at the tier level, i.e., all machines at each tier are assumed to operate at the same frequency. In contrast, DiBA optimizes the performance of each individual server node and thus results in a better overall performance.

Along the same line of work, a centralized, feedback-based power controller for satisfying end-to-end delay is also studied [2]. In the proposed scheme, a central application performance monitor measures the latency of delay-sensitive workloads. It then passes the information to a centralized DVFS controller to adjust the allocated power for each server. Clearly, such a centralized scheme does not suit large-scale clusters since, in addition to the SPF problem, the centralized controller must solve a complex, high-dimensional optimization problem to determine the power usage of each computing node.

In the context of load balancing techniques for power management, a bidding mechanism akin to the economic models for managing shared resources has been studied [8].
In the proposed framework, services bid for resources and negotiate for SLA based on the offered price and their QoS requirement. Therefore, SLA is assumed to be a flexible parameter. While load balancing improves the system performance, load concentration (unbalancing) can save energy by making servers idle so that they consume less energy [9]. To optimize the trade-off between performance and power saving, a load balancing and unbalancing algorithm is proposed in [9] that takes into account both the total load imposed on the cluster and the power and performance implications of turning nodes off.

III. MOTIVATION

Large clusters are often constrained by power consumption due to the large operational costs of energy consumption. Furthermore, dynamic constraints such as those arising from demand response energy programs or from variations in utilization require cluster operators to find ways to meet a changing total power budget throughout daily operation. Although currently employed power management techniques are effective in their objective of reducing power usage, they have a centralized design. Those techniques require a central controller to measure the workload and throughput of active servers and determine the optimal DVFS settings of their processors. Nevertheless, a centralized approach has the following shortcomings:

- Fault isolation: Centralized control creates a Single Point of Failure (SPF) and also a performance bottleneck. In particular, a malfunction in the controller or a communication breakdown between the controller and local servers results in SLA violations. To avoid such a scenario in centralized methods, the local servers are set to automatically default to maximum power [2]. We present a decentralized method in which each server determines its action locally. Thus, the failure of one or a few servers or a communication breakdown can be mitigated, as the overall performance of the system does not hinge on a particular unit.

- Real-time monitoring: In centralized power budgeting techniques, the state of each server is measured by a centralized monitoring unit. This introduces additional latency and deteriorates the quality of service (QoS) due to data aggregation and communication delay. In comparison, the decentralized technique obviates the centralized monitoring unit, as each computing node measures its performance metrics locally. For variable internet traffic such as search and social networking, the localized performance monitoring provided by the decentralized control approach results in a faster response time to workload and power management.
- Scalability: A centralized power management is an inflexible architecture for horizontal scaling, in the sense that adding new racks to the cluster must be complemented with an increase in the communication bandwidth and a power management reconfiguration of the central power controller. In comparison, a decentralized architecture facilitates a modular design. Particularly, DiBA can scale to large clusters with distributed communication requirements that are as simple as a ring and that are carried over regular cluster network infrastructure.

IV. POWER MANAGEMENT METHODOLOGY

We formulate an optimization problem for maximizing the throughput of a cluster with $N$ server nodes, where we assume a power budget of $P$ for the whole cluster. Thus,

Problem 1:
$$\max_{\{p_i\}_{i=1}^{N}} \; \sum_{i=1}^{N} r_i(p_i) \tag{1}$$
$$\text{subject to} \quad \sum_{i=1}^{N} p_i \le P, \tag{2}$$
$$p_i^{\text{idle}} \le p_i \le p_i^{\max}, \quad i = 1, 2, \ldots, N, \tag{3}$$

where $p_i^{\text{idle}}$ and $p_i^{\max}$ denote the power used by server $i$ in idle mode and the maximum possible power usage of server $i$, respectively. We use the simple structure of Problem 1 to exhibit our distributed optimization schemes clearly. In this formulation, the utility metric $r_i(\cdot): \mathbb{R} \to \mathbb{R}$ corresponds to the throughput of each server node, which can be increased/decreased by adjusting its capacity (e.g., increasing/decreasing the frequency of the processors using DVFS). We note that the throughput of server $i$ as a function of its power usage $p_i$ depends on the characteristics of the workload that is being executed.
We defer further details about modeling a proper throughput function to Section V.

Solving the optimization Problem 1 in a decentralized fashion requires coordination among servers, since the power consumptions of servers are coupled via the constraint (2) in Problem 1. To achieve coordination among nodes, our distributed algorithm creates a consensus framework for each node. In this section, we consider two distributed schemes that are structurally different. First, we analyze the more conventional primal-dual decomposition approach and discuss its limitations for practical implementation in large clusters. We then propose an alternative algorithm, DiBA, that is fully distributed and outperforms the primal-dual method in terms of convergence speed.

A. Primal-dual decomposition algorithm

The algorithm that we present here is based on the primal-dual decomposition method, which is a well-known scheme for distributed optimization and has been extensively studied in the context of TCP/IP and congestion control in networks with elastic traffic [11]. We also remark that the primal-dual decomposition method has been investigated in [6] in the context of power management of computing clusters. However, we use this algorithm as the benchmark for comparison with the performance of DiBA, which we present in subsequent subsections. Hence, we give a detailed analysis of the primal-dual method in the following.

We describe the simple structure of the synchronous primal-dual algorithm when it is characterized for Problem 1. Consider the dual form of the optimization problem formulated in (1)-(3),
$$\min_{\lambda \ge 0} \; g(\lambda), \tag{4}$$
where $\lambda$ is the dual variable, and $g(\lambda)$ is defined as
$$g(\lambda) \overset{\text{def}}{=} \sup_{\{p_i\}_{i=1}^{N}} \; \sum_{i=1}^{N} r_i(p_i) + \lambda \Big( P - \sum_{i=1}^{N} p_i \Big).$$
Let $\lambda^*$ be the optimal solution of the dual optimization problem (4). Given the optimal Lagrangian multiplier $\lambda^*$, each server can compute its optimal power budget $p_i^*$, $i \in \{1, 2, \ldots, N\}$, by solving the following local optimization problem
$$p_i^* = \arg\max_{p_i^{\text{idle}} \le p_i \le p_i^{\max}} \; r_i(p_i) - \lambda^* p_i.$$
In this formulation, $\lambda$ can be interpreted as the price of using the shared power budget $P$.
In particular, when the power demands exceed the shared power budget $P$, the central coordinator increases the price $\lambda$ to reduce the demand. A simple scheme to compute the optimal value of $\lambda$ is an iterative update rule in which the price $\lambda$ is computed by a coordinator unit and then communicated to all local servers. In particular, the coordinator updates the price using the following update formula; see [11],
$$\lambda^{t+1} = \Big[ \lambda^{t} - \epsilon \Big( P - \sum_{i=1}^{N} p_i^{t} \Big) \Big]^{+}, \tag{5}$$
where $[z]^{+} \overset{\text{def}}{=} \max\{0, z\}$ and $\epsilon > 0$ is a small step size. Given the price communicated by the coordinator, the $i$-th server can compute its power consumption $p_i$ as follows
$$p_i^{t+1} = \arg\max_{p_i^{\text{idle}} \le p_i \le p_i^{\max}} \; r_i(p_i) - \lambda^{t} p_i. \tag{6}$$
The structure of the primal-dual decomposition algorithm for Problem 1 is described in Algorithm 1, followed by a few remarks.
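As an illustration only, the price update and the local best response can be sketched in Python. The sketch assumes quadratic concave utilities $r_i(p) = a_i p^2 + b_i p$ with $a_i < 0$, for which the local problem has a closed-form clipped solution, and it applies the coordinator update and the server updates one after the other within each round. The coefficients, power bounds, budget, step size, and function names are hypothetical values chosen for the example, not values from the paper.

```python
def best_response(a, b, lam, p_idle, p_max):
    """Maximize a*p**2 + b*p - lam*p over [p_idle, p_max], assuming a < 0."""
    p_unconstrained = (lam - b) / (2.0 * a)   # stationary point of the concave objective
    return min(max(p_unconstrained, p_idle), p_max)


def primal_dual(a, b, p_idle, p_max, budget, eps, iters):
    """Sequential price/power iteration; returns final powers and price."""
    n = len(a)
    p = [budget / n] * n          # start from the uniform allocation
    lam = 0.0
    for _ in range(iters):
        # Coordinator: raise the price when demand exceeds the budget.
        lam = max(0.0, lam - eps * (budget - sum(p)))
        # Servers: each one solves its own local problem for the new price.
        p = [best_response(a[i], b[i], lam, p_idle[i], p_max[i]) for i in range(n)]
    return p, lam


# Hypothetical 4-server cluster (all numbers are illustrative assumptions).
a = [-0.010, -0.020, -0.005, -0.015]
b = [2.4, 4.8, 1.4, 3.3]
p_idle = [60.0] * 4
p_max = [120.0] * 4
budget = 400.0

p, lam = primal_dual(a, b, p_idle, p_max, budget, eps=0.002, iters=200)
print([round(x, 2) for x in p], round(lam, 3))
```

With these made-up utilities the budget constraint is active, so the price settles at a positive value and the allocated powers sum to the budget. The step size matters: too large a value of `eps` makes the price oscillate instead of converging.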

Algorithm 1 PRIMAL-DUAL ALGORITHM
1: Initialize $p_i^{0}$, $i = 1, 2, \ldots, N$, $\lambda^{0}$, and $\epsilon \in \mathbb{R}$.
2: for $t = 1, 2, \ldots$ do
     AT THE CENTRAL COORDINATOR:
3:     update $\lambda^{t}$ according to (5)
     AT THE $i$-th SERVER:
4:     update $p_i^{t}$ according to (6)
5:     transmit $p_i^{t}$ to the central coordinator.
6: end for

1) Although the primal-dual decomposition method is scalable, it is not efficient in terms of communication requirements. For large clusters, the central coordinator must be equipped with a high communication bandwidth to accommodate communication with all servers. Moreover, since network switches support a limited number of servers, a large cluster of thousands of server nodes requires multiple network switches, which is costly and adds latency to the network.
2) The existence of a central coordinator in the primal-dual approach is also undesirable since it creates the SPF problem. If the central coordinator fails, the PD power management technique fails. Therefore, PD does not provide the robustness that is necessary for power management in data centers.
3) The PD algorithm we considered is synchronized in the sense that servers update their power usage information simultaneously. This level of synchronization between different servers is usually provided through the Network Time Protocol (NTP) in computer systems [12].

B. DiBA: Distributed Power Budget Allocation Algorithm

In this section we propose and analyze our distributed algorithm, called DiBA, which controls the power consumption of each computing node to maximize the throughput of the cluster subject to a total power budget. We can qualitatively describe DiBA from two different perspectives. From a game design perspective, DiBA is a gradient algorithm that achieves the pure Nash equilibrium corresponding to the state-based game defined by Problem 1 [13]. From the viewpoint of distributed optimization, DiBA is a consensus-based optimization algorithm.
In particular, DiBA establishes consensus on the estimation of the coupling constraint (2) by augmenting the local utility function of each server node with a barrier (penalty) function which, when minimized/maximized, gradually reduces the difference between local variables of computing nodes. Naturally, to create such a consensus, the communication graph between servers needs to be connected. The connectivity requirement is naturally facilitated in clusters, where servers can communicate using the cluster's network infrastructure. For each $i$-th server, we introduce the estimation term $e_i$ for the coupled constraint (2), i.e.,
$$e_i \sim \sum_{j=1}^{N} p_j - P. \tag{7}$$
During each round of the DiBA algorithm, the $i$-th server transmits its local measurement and also receives the local variables of all its neighbors. Specifically, let $\hat{e}_{i \to k}$ ($\hat{e}_{k \to i}$) denote the estimates from (resp. to) the $i$-th server to (resp. from) the $k$-th server, where $k \in S_i$ and $S_i \subseteq \{1, 2, \ldots, N\}$ denotes the set of servers that are connected to the $i$-th server. By aggregating these estimates, the $i$-th server updates its state variable as well as its estimate of the coupled system constraint. Specifically, based on a state-space control model, we have the following update rule during the $t$-th iteration of the DiBA algorithm,
$$p_i^{t+1} = p_i^{t} + \hat{p}_i^{t} \tag{8}$$
$$e_i^{t+1} = e_i^{t} + \hat{p}_i^{t} + \sum_{k \in S_i} \hat{e}_{k \to i}^{t} - \sum_{k \in S_i} \hat{e}_{i \to k}^{t}, \tag{9}$$
where $p_i^{t}$ is the power state of the $i$-th server, and $\hat{p}_i$ is the power control input that takes values from the action space
$$A_i(p_i^{t}, e_i^{t}) \overset{\text{def}}{=} \Big\{ (\hat{p}_i, \hat{e}_i) : \; p_i^{t} + \hat{p}_i \in [p_i^{\text{idle}}, p_i^{\max}], \;\; \hat{e}_{i \to k} < 0, \;\; e_i^{t} + \hat{p}_i^{t} - \sum_{k \in S_i} \hat{e}_{i \to k} < 0 \Big\}.$$
The action space guarantees that the control inputs $(\hat{p}_i, \hat{e}_i)$ in the state-space model do not violate the constraints on the maximum power and idle power of each computing node, and also satisfy the coupled constraint (2). For each node $i = 1, 2, \ldots, N$, we now define a local utility function as follows
$$R_i\big(p_i^{t}, \{e_j^{t}\}_{j \in S_i}\big) = r_i(p_i^{t}) + \eta \sum_{j \in S_i} \log(-e_j^{t}), \tag{10}$$
where $\eta \in \mathbb{R}$. The second term $\sum_{j \in S_i} \log(-e_j^{t})$ is a penalty term that guards against $e_j^{t} > 0$, which translates into the constraint violation condition $\sum_{i=1}^{N} p_i^{t} > P$.
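The bookkeeping of the state update can be checked with a short Python sketch. It assumes a ring topology and, as an assumption of this example rather than a statement from the text, initial estimates chosen so that the estimates sum to the total power minus the budget. Because every exchanged estimate leaves one server and enters another, the update keeps that relation intact for any actions. The random actions, the 5-node ring, the function names, and all numeric values are hypothetical; the actions are not checked against the action space, and the sketch implements neither the gradient step nor the backtracking of the full algorithm.

```python
import random


def ring_neighbors(n):
    """Neighbor sets for a ring of n >= 3 servers."""
    return [((i - 1) % n, (i + 1) % n) for i in range(n)]


def state_update(p, e, p_hat, e_hat, neighbors):
    """One round of the power/estimate update.

    e_hat[i][k] is the estimate that server i sends to its neighbor k.
    """
    n = len(p)
    p_next = [p[i] + p_hat[i] for i in range(n)]
    e_next = []
    for i in range(n):
        received = sum(e_hat[k][i] for k in neighbors[i])
        sent = sum(e_hat[i][k] for k in neighbors[i])
        e_next.append(e[i] + p_hat[i] + received - sent)
    return p_next, e_next


random.seed(0)
n, budget = 5, 520.0                      # hypothetical cluster and budget
neighbors = ring_neighbors(n)
p = [100.0] * n
e = [(sum(p) - budget) / n] * n           # estimates sum to (total power - budget)

for _ in range(10):
    # Arbitrary illustrative actions, not the gradient actions of the algorithm.
    p_hat = [random.uniform(-1.0, 1.0) for _ in range(n)]
    e_hat = [{k: random.uniform(-0.5, 0.0) for k in neighbors[i]} for i in range(n)]
    p, e = state_update(p, e, p_hat, e_hat, neighbors)

gap = sum(e) - (sum(p) - budget)
print("invariant preserved:", abs(gap) < 1e-9)
```

The design point the sketch highlights is that no server ever needs the global sum: as long as each local estimate stays negative, the aggregate relation implies that the total power stays below the budget.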
We note that the speed of reaching the consensus is controlled by the value of $\eta$. By choosing a small positive value for $\eta$ in (10), we can assure that the solution of DiBA is close to the optimal solution of Problem 1. However, choosing a small $\eta$ causes numerical difficulty, which may lead to slower convergence. In addition, the barrier function in (10) enforces the constraint to be satisfied. By employing this barrier function, we thus ensure that the solution produced by the algorithm fulfills the system power constraint. We now have all the machinery to describe DiBA. The steps of DiBA are highlighted in Algorithm 2.

Algorithm 2 DISTRIBUTED POWER BUDGET ALLOCATION (DIBA)
1: Initialize $\eta \in \mathbb{R}$ and a non-increasing sequence $\epsilon_i^{t} \in \mathbb{R}$, $i = 1, 2, \ldots, N$. Choose $p^{0}, e^{0}$ such that $\sum_{i=1}^{N} p_i^{0} \le P$.
2: for $t = 1, 2, \ldots$ do
     AT THE $i$-th SERVER:
3:     compute
$$\hat{p}_i^{t} = \beta_i^{t} \Big( \epsilon_i^{t} \, \frac{\partial R_i}{\partial \hat{p}_i} \Big|_{\hat{p}_i = 0} \Big), \qquad \hat{e}_{i \to j}^{t} = \beta_i^{t} \min\Big( 0, \; \epsilon_i^{t} \, \frac{\partial R_i}{\partial \hat{e}_{i \to j}} \Big|_{\hat{e}_{i \to j} = 0} \Big),$$
       where $\beta_i^{t}$ is computed by backtracking to the constraint $(\hat{p}_i^{t}, \hat{e}_i^{t}) \in A_i(p_i^{t}, e_i^{t})$, and $\epsilon_i^{t}$ is the gradient step size.
4:     update $p_i^{t+1}$, $e_i^{t+1}$ according to (8)-(9).
5: end for

It is imperative to calculate the communication overhead introduced by decentralizing the power management controller. The communication overhead of each control epoch can be approximated by the product of the number of communications required during each round and the number of iterations to converge. In a cluster with $N$ servers,
1) for the primal-dual decomposition algorithm, each computing node sends one packet to and then receives one packet from the central coordinator during each iteration. Thus, a total of $2N$ packets are communicated per iteration.
2) for DiBA, each computing node sends packets only to its local neighbors, of which there are two in a ring communication topology, resulting in a total of $2N$ packets that are communicated in parallel among the nodes. For a general graph, $dN$ communications per iteration are required, where $d$ is the average degree of each node.

Although it may appear that DiBA has the same communication complexity as the primal-dual method, in practice it has a lower communication latency and packet loss. In particular, if we consider the network overhead in terms of the bottleneck of the communication topology, the central coordinator in the primal-dual decomposition method must communicate with all $N$ computing servers, i.e., it has a degree of $N$. In comparison, the degree of each node in DiBA is a small fixed number (e.g., two in the case of a ring topology), and the packets from the nodes to their neighbors proceed in parallel. As a result, packet loss is less likely to happen in the DiBA architecture. We will show in the next section that DiBA also shows better performance compared to the primal-dual decomposition method in terms of communication latency.

V. NUMERICAL EVALUATION

A. Setup

Workloads: To evaluate the performance of our distributed power management architecture, we select HPC benchmarks.
We use eight benchmarks from the NAS Parallel Benchmarks (NPB) suite [14] (BT, CG, EP, FT, IS, LU, MG, and SP) and two benchmarks from the High Performance Computing Challenge (HPCC) benchmark (HPL, MPI-RA); the description of each benchmark is listed in Table I.

TABLE I: LIST OF THE SELECTED BENCHMARKS AND CORRESPONDING DESCRIPTION.

Name          Description
BT (NPB)      Block Tri-diagonal solver
CG (NPB)      Conjugate Gradient
EP (NPB)      Embarrassingly Parallel
FT (NPB)      Discrete 3D fast Fourier Transform
IS (NPB)      Integer Sort
LU (NPB)      Lower-Upper Gauss-Seidel solver
MG (NPB)      Multi-Grid on a sequence of meshes
SP (NPB)      Scalar Penta-diagonal solver
HPL (HPCC)    High Performance Linpack benchmark
RA (HPCC)     Integer random access of memory

Infrastructure: We evaluate DiBA using numerical simulations and using a real server cluster. The numerical simulations enable us to study DiBA's performance for large-scale clusters, while the real cluster implementation demonstrates a proof of concept for DiBA.

1) Server cluster infrastructure: Our cluster consists of 2 Dell PowerEdge C servers, where each server has two Xeon L5520 quad-core processors, 4 GB of memory and a GbE network controller. The servers are connected to a Mellanox SX2 GbE switch. The frequency of each processor can scale from 1.6 GHz to 2.27 GHz. To collect performance counter values from all servers, the perfmon2 [15] tool is used. We created a power measurement apparatus to measure the AC current supplied to each server and, correspondingly, its power consumption. To enforce the local power targets for the individual servers, we implement a software feedback controller on each server. The DVFS-based controller adjusts the DVFS level up or down according to the difference between the power target and the current power consumption, with a positive difference triggering an increase in DVFS and a negative difference triggering a decrease in DVFS [16]. The controller enables us to apply the local constraints on each server as derived from DiBA.
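A minimal Python sketch of such a per-server feedback loop is given below for illustration. The frequency ladder, the linear power model that stands in for the measurement apparatus, the function names, and the `headroom_w` dead band (added here so the loop does not bounce between two adjacent levels) are all assumptions of this example; the controller used on the real cluster may differ.

```python
# Assumed DVFS ladder between 1.6 GHz and 2.27 GHz (illustrative values).
FREQ_LEVELS_GHZ = [1.60, 1.73, 1.87, 2.00, 2.13, 2.27]


def measured_power(level):
    """Hypothetical stand-in for a power reading (watts) at a DVFS level."""
    return 40.0 + 30.0 * FREQ_LEVELS_GHZ[level]


def dvfs_step(level, target_w, power_w, headroom_w=2.0):
    """Move one DVFS level up or down based on (target - measured power)."""
    diff = target_w - power_w
    if diff > headroom_w and level < len(FREQ_LEVELS_GHZ) - 1:
        return level + 1      # enough headroom below the target: speed up
    if diff < 0 and level > 0:
        return level - 1      # above the target: slow down
    return level              # within the dead band: hold


def run_controller(level, target_w, steps=20):
    """Apply the feedback step repeatedly and return the final level."""
    for _ in range(steps):
        level = dvfs_step(level, target_w, measured_power(level))
    return level


final_from_low = run_controller(0, target_w=101.0)
final_from_high = run_controller(5, target_w=101.0)
print(final_from_low, final_from_high)
```

In this toy model both runs end at the highest level whose power does not exceed the target, whether the server starts at the lowest or the highest frequency.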
2) Simulation infrastructure: To evaluate the performance and overhead of DiBA on large clusters, we have developed a simulation tool as follows. We first run each workload in Table I on one of our servers and collect the throughput and power consumption $p_i$ of the $i$-th workload under various DVFS levels, and then fit a throughput function $r_i(p_i)$ to the attained data. The power consumed by each workload has a range that is determined by the lowest and highest DVFS levels. The throughput functions are then used to emulate the behavior of the applications running on the server nodes. To simulate a cluster with an arbitrary number of servers $N$, we draw the throughput functions from a uniform distribution such that each server hosts at least one type of workload and such that the entire server cluster is fully utilized. To mitigate the effect of randomness in our simulation results, we average the convergence results we obtain for DiBA over trials. The network architecture of our simulation environment is based on a two-tier star topology. Each rack consists of 4 servers which are connected with one top-of-rack GbE Ethernet switch. Further, all racks are connected via a higher-layer core switch. We model the network traffic using queuing systems in our simulation, where the network service times are derived from measurements on our real cluster.

Fig. 1. The communication topology of the decentralized algorithms. Left: Primal-Dual Decomposition. Right: DiBA algorithm.

Algorithm Topology: For the primal-dual method, each computing node is connected with the central coordinator as shown in Fig. 1. Furthermore, for the DiBA algorithm, we adopt a ring communication network as shown in Fig. 1. Note that the algorithm communication topology runs on top of our cluster network, which naturally allows any two servers to communicate.

Throughput Estimation: The throughput function $r_i(p_i)$ of each server is dependent on the nature of the workload. Moreover, we create a framework in which the local controller learns the throughput function $r_i(\cdot)$ on the fly as the workload changes on the server. Specifically, we can learn the throughput function of each workload in our benchmark suite of Table I by applying different DVFS power levels and subsequently measuring the resultant throughput value. We then interpolate a quadratic throughput function. Fig. 2 illustrates a number of throughput functions for our workloads on our servers. This process is repeated for each workload to derive its throughput function when it is executed on a server. As we observe from Fig. 2, the throughput functions for all the benchmarks are concave in power, which justifies the use of the convex optimization toolbox CVX in our simulations.

Fig. 2. The normalized throughput function of 4 workloads. (Axes: ANP versus power (Watt); curves: FT, LU, MG, SP.)

System performance metrics: To ensure fairness in our power allocation scheme, we normalize the throughput of each workload by its peak value $r_i^{\max}$. The application normalized performance (ANP) of each server is computed, where ANP is the ratio of ideal runtime to actual runtime of the workload on the server [17].
In our case, the ANP of application $i$ under power budget $p_i$ is defined as the ratio of actual throughput to its peak throughput:
$$\text{ANP}_i(p_i) = \frac{r_i(p_i)}{r_i^{\max}}.$$
Then, to quantify the performance of the entire cluster, the system normalized performance (SNP) is computed as the arithmetic mean of the ANPs of all $M$ workloads that are running on the cluster. That is,
$$\text{SNP} = \frac{1}{M} \sum_{i=1}^{M} \text{ANP}_i(p_i).$$
We next describe the results from our simulation infrastructure in Subsection V-B and the results from our implementation on the real cluster in Subsection V-C.

B. Simulation results

To evaluate the performance, scalability and resiliency of DiBA, we organize the following experiments.
1. Static experiment of allocating power budget to maximize the system's performance.
2. The scalability of DiBA on clusters with a large number of nodes, considering the communication and computation overhead.
3. Dynamic simulations that evaluate how DiBA reacts to power budget reallocation and workload dynamics.
4. Experiments that evaluate the effect of the communication topology of the distributed computations of DiBA.

1. Static power budgeting results: In this experiment we demonstrate the value of power budget algorithms that use the utility functions of the servers to maximize performance subject to a total power budget. We simulate servers that are serving instances of the benchmarks under total power budgets that range from P = 66 kW to P = 86 kW with an incremental step size of 4 kW. The results of our decentralized algorithms are compared against the method of allocating power budget uniformly. The SNP results under each total power budget are given in Fig. 3. The primal-dual decomposition algorithm improves the SNP by 4.7% over uniform power allocation on average, while DiBA achieves a 4.5% improvement. The performance gap between uniform allocation and DiBA shrinks from 22.6% to 8.2% depending on the amount of total power allocated. Clearly, when the total power budget is sufficiently large, all servers receive enough power for their workloads, and hence the performance difference between DiBA and uniform power allocation diminishes.

Fig. 3. The system normalized performance of servers under different power budgets.

TABLE II: THE BREAKDOWN OF ALGORITHM RUNTIME WITH DIFFERENT NUMBER OF SERVER NODES. (Columns: # of nodes; computation (ms) and communication (ms) for each of the centralized method, primal-dual decomposition, and DiBA.)

2. Scalability: To show the advantage of DiBA for large cluster sizes, we compare its performance to that of the centralized approach and the primal-dual method. For the centralized method, we use the CVX toolbox [18] to solve the optimization problem formulated in Eqs. (1)-(3). Specifically, in the centralized method, the computing servers transmit their utility functions to the centralized power management unit, which calculates the optimal allocation of the total power budget across all servers. The power budget of each server is then sent back and applied by the servers. By decentralizing the power budgeting unit, the utility functions are optimized locally and the optimization is accelerated. However, to assess the optimization performance, the communication latency of each iteration has to be taken into account. In this experiment, we quantify the runtime of each algorithm with different numbers of server nodes. The average latency of reading and writing a packet to TCP sockets between two nodes in our cluster is measured to be approximately 2 μs and μs, respectively.
We used these results as the service times in our network queuing model, which is used to model the communication time at the central coordinator in the centralized and primal-dual methods. In the uplink phase, when nodes transmit their packets to the central coordinator, the packet arrival times are drawn from the Poisson distribution with an average inter-arrival time of 2 μs. In the downlink phase, i.e., transmitting from the central coordinator to the nodes in the centralized and primal-dual methods, the communication time is assumed to be the total time to send all the packets in a serial fashion. In contrast, in the DiBA architecture, each node only communicates with two neighbor nodes in parallel, thus the communication time for each iteration is the time to send and receive a packet. We compute the runtime of each algorithm with different numbers of nodes and break the total runtime down into computation time and communication time in Table II.

We take the result from the CVX solver as the baseline and set the convergence condition of the primal-dual decomposition algorithm and DiBA as achieving 99% of the baseline performance. In particular, the initial power setting for DiBA is set as the uniform allocation. The convergence time for DiBA is the first time that the utility obtained from DiBA satisfies the following inequality
$$\frac{\text{Optimal Utility} - \text{Utility of DiBA}}{\text{Optimal Utility}} < 0.01, \tag{11}$$
where the optimal utility is the utility that is computed by solving (1) centrally. The convergence of the PD algorithm is defined similarly. The total computation and communication times of the primal-dual decomposition and DiBA are computed by taking into account the number of iterations required for each algorithm to converge. The communication time of each iteration is modeled by our simulation network architecture, and the computation time of each iteration is the real measurement from our simulation system. DiBA is completely distributed and all communications and computations are assumed to be carried out in parallel for each iteration.
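The relative-gap stopping rule and the per-iteration communication model described above can be written down compactly. The Python sketch below is illustrative only: the function names are hypothetical, the uplink queueing at the coordinator is ignored for brevity, and the utility trace and service times in the example are placeholders, not the measured values.

```python
def first_converged_iteration(utilities, optimal_utility, tol=0.01):
    """Index of the first iterate whose relative utility gap is below tol."""
    for t, utility in enumerate(utilities):
        if (optimal_utility - utility) / optimal_utility < tol:
            return t
    return None   # never reached the target within the given trace


def coordinator_downlink_time(n_nodes, write_time):
    """Coordinator sends one packet per node, one after another."""
    return n_nodes * write_time


def ring_iteration_time(write_time, read_time):
    """Every node sends and receives in parallel with all other nodes."""
    return write_time + read_time


# Hypothetical utility trace and service times (arbitrary time units).
trace = [80.0, 92.0, 97.5, 99.2, 99.6]
t_conv = first_converged_iteration(trace, optimal_utility=100.0)
serial = coordinator_downlink_time(1000, write_time=1.0)
parallel = ring_iteration_time(write_time=1.0, read_time=2.0)
print(t_conv, serial, parallel)
```

Under this simplified model the coordinator's downlink time grows linearly with the number of nodes, while the ring's per-iteration time does not depend on the cluster size.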
For primal-dual, the computations of all the nodes are done in parallel, but their time needs to be added to the computation time of the central coordinator. From Table II, we observe that the PD method improves upon the centralized power allocation strategy in terms of computation time. This is due to the fact that in the PD method, the optimization problem is solved in a decentralized fashion. Nevertheless, in the overall performance, the PD method performs poorly when the communication time is taken into account. As we mentioned in Section IV-A, this increase in latency is due to the communication bottleneck at the central coordinator. In contrast, we observe that the communication time of DiBA on a ring network increases only slightly. Consequently, DiBA outperforms both the PD and the centralized method in the overall run-time. DiBA also outperforms both the primal-dual and centralized methods in computation time for small to mid-range cluster sizes.

3. Dynamic simulation results: In the previous experiment, we considered a static power budget. Here, we examine a dynamic case where the power budget changes every minute during operation, as is possibly the case in demand-response programs. We demonstrate how DiBA re-allocates the power budget. Specifically, we evaluate DiBA for a cluster of N = servers with a dynamic power budget in Fig. 4. We observe that DiBA provides near-optimal performance without power budget violation. To demonstrate a more detailed response of DiBA when there is an adjustment in the total power budget, we consider cases where the power budget decreases/increases between the P = 9 kW and P = 7 kW power levels, as depicted in Fig. 5 and Fig. 6, respectively. In both cases, DiBA immediately adapts to the new power budget.

Fig. 4. Dynamic simulation of total power budget reallocation. (Panels: total power (kW) and SNP versus time (sec); curves: DiBA, total power budget, optimal.)

Fig. 5. Detailed simulation of total power budget reallocation when the total power budget drops from high to low. (Panels: total power (kW) and SNP versus time (ms); curves: actual power, total power budget, DiBA, optimal.)

Fig. 6. Detailed simulation of total power budget reallocation when the total power budget jumps from low to high. (Panels: total power (kW) and SNP versus time (ms); curves: actual power, total power budget, DiBA, optimal.)

DiBA is also adaptive to workload variations on servers: it dynamically recomputes the power usage of each server as the workloads change. We simulate a cluster with N = servers at a fixed total power budget of P = 8 kW. The servers host different instances of the testing workloads that are presented in Table I. After a workload is completed, a new workload is randomly drawn from the workload pool in Table I and is launched on the free server. We simulated this process for 8 minutes. The resulting power consumption and SNP are given in Fig. 7. We observe that despite variations in the servers' workloads, the SNP metric of DiBA is close to optimal. Furthermore, our simulation has shown that under the variable workload scenario, the total power consumption is strictly below the power limit.

Fig. 7. The performance of DiBA on a cluster with dynamic workloads. (SNP versus time (sec); curves: DiBA, optimal.)

We now analyze a scenario in which a single server node undergoes an abrupt change in its utility function. This scenario can occur, for example, when a node finishes a workload and begins processing a new workload from a different benchmark. Under this circumstance, it is interesting to observe the effect of workload changes in a single server on the power states of other servers in the network. First we examine how the estimation $e$ of the affected node propagates through the ring network. More precisely, we consider a ring network of size N = nodes. Furthermore, we assume that the coefficients of the quadratic utility function of the node $i = 5$ undergo a sudden change. For this scenario, the estimation vector $e \stackrel{\text{def}}{=} (e_1, \dots, e_N)$ is shown in Fig. 8 at different phases of the DiBA algorithm. As shown in Fig. 8, initially all nodes in the network, except for the perturbed node, have negligible error estimations $e_i$, $i \neq 5$. After a few iterations, the estimation error propagates through the network while the magnitude of the absolute error decreases. More interestingly, as depicted in Fig. 9, after convergence to the new equilibrium point, only a few nodes in the vicinity of the perturbed server node need to adjust their power consumption states. More specifically, Fig. 9 shows the difference between power usage before and after the utility perturbation at the node $i = 5$. It can be seen that a fluctuation in the utility function of one node has a desirable, localized effect on the power consumption of the computing nodes.

4. The effect of algorithm communication topology: The communication topology of the distributed algorithms can be optimized to improve their convergence. A ring topology is particularly ideal for DiBA due to its low degree and symmetry. In practice, to guarantee the connectivity of the distributed algorithm in the case of multiple node failures, the ring topology must be equipped with a few chords.
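The remark about chords can be illustrated with a small connectivity check. The sketch below is only an illustration under assumed names (`ring_with_chords` and `connected_after_failures` are hypothetical helpers, not part of DiBA): it builds a ring, optionally adds chords, and tests with a breadth-first search whether the surviving nodes can still reach each other after a set of nodes fails.

```python
from collections import deque


def ring_with_chords(n, chords=()):
    """Adjacency sets for an n-node ring plus optional extra (u, v) chord edges."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        j = (i + 1) % n
        adj[i].add(j)
        adj[j].add(i)
    for u, v in chords:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def connected_after_failures(adj, failed):
    """True if the surviving nodes still form one connected component."""
    alive = set(adj) - set(failed)
    if not alive:
        return True
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in alive and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == alive


if __name__ == "__main__":
    plain = ring_with_chords(8)
    chorded = ring_with_chords(8, chords=[(2, 6)])
    print(connected_after_failures(plain, {0}))       # one failure: ring stays connected
    print(connected_after_failures(plain, {0, 4}))    # two failures split the plain ring
    print(connected_after_failures(chorded, {0, 4}))  # the chord 2-6 bridges the two halves
```

A plain ring tolerates any single node failure but can be split by two non-adjacent failures; a chord that links the two resulting segments keeps the surviving nodes connected in this example.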

Fig. 8. The absolute estimation error of server nodes in the case of a utility change at node $i = 5$. (Six panels, each showing the absolute estimation error at a different iteration.)

Fig. 9. The absolute change in the power consumption states $(p_1, \dots, p_N)$ after settling at the new equilibrium.

To investigate the effect of the connectivity of the distributed computation on the convergence speed of DiBA, we created instances of connected Erdős-Rényi random graphs [19], where the number of computing nodes is fixed at N = . In the Erdős-Rényi random graph model, a graph is chosen uniformly at random from the set of all possible graphs with N nodes and a certain number of edges. Note that the Erdős-Rényi random graph model is degree homogeneous since its degree distribution (i.e., the probability of a node having a certain degree) decays exponentially fast. In our simulation experiment, the number of edges varies, which results in graphs with different average degrees per computing node. For each random graph, we have computed the average degree and the number of iterations. The termination criterion for the iterations is defined to be the time at which DiBA achieves 99% of the optimal utility value (cf. Eq. ()). The result of this analysis is shown in Fig. 10. We observe that DiBA's convergence time has a strong correlation with the average degree of connectivity. The appropriate setting for the average communication degree is then a choice of the system administrator, who can assess the required fault tolerance and adjust the topology of the distributed computation accordingly. It is also worth mentioning that the convergence time of DiBA can be adjusted by tuning the free parameters $\epsilon_t$ as well as $\mu$ in the structure of the algorithm.

Fig. 10. The iteration numbers of samples of Erdős-Rényi random graphs with N = versus the average degree. The red line shows the 3rd order polynomial regression on the samples.

C. Results from Real Cluster Implementation

In this subsection we demonstrate a fully operational implementation of DiBA on our cluster of 12 servers, which are connected using a top-of-the-rack GbE switch. Each server runs a DiBA power manager that implements our algorithm, and the managers communicate with each other over a ring topology using standard socket interfaces. We chose the ring topology since our simulation results identified this simple topology as an effective way to communicate among the DiBA managers with low latency. We also implement a software feedback controller to enforce local power budgets. Once a local DiBA manager computes a local power target, the DVFS-based controller adjusts the DVFS setting up or down (in increments of .3 GHz) depending on the difference between the power target and the current power consumption of the server: a positive difference triggers an increase in the DVFS setting, while a negative difference triggers a decrease. Our implementation of DiBA takes about 7 lines of C++ code.

In our experiments, we launch the MG application on six servers and the HPL application on the remaining six servers. These two benchmarks were selected because of their different characteristics: HPL is a CPU-intensive benchmark, while MG is a memory-bound application. To demonstrate that DiBA operates in a practical scenario, we consider a dynamic power budget to verify that DiBA satisfies the power budget restriction. SNP was calculated by measuring the retired floating point operations of the servers. Figure 11 depicts the cluster power consumption and SNP of DiBA as the power budget alternates between 2. kW and 2.4 kW over 9 seconds. The figure shows that DiBA is able to meet the total power target. We note that the slight oscillations in the power measurement are due to the discrete nature of DVFS settings, which forces the local DVFS controllers to alternate between two settings to effectively capture an intermediate power level that is otherwise not directly attainable by DVFS. The benchmarks also have different phases of operation, which result in slight oscillations in power and performance. The SNP results show the expected trend that the performance of the cluster is reduced when it is subjected to lower power budgets.

Fig. 11. Real implementation of DiBA on an experimental cluster of 12 servers. (Panels: total power (kW) and SNP versus time (s); curves: actual power, total power budget.)

VI. CONCLUSIONS

We studied the problem of utility maximization in large-scale server clusters with a power budget constraint. We proposed DiBA, a framework to compute the optimal power usage of each node locally. In this framework, each node must exchange its power consumption state with the neighboring nodes in the cluster. We investigated various performance metrics associated with DiBA. In particular, we showed that DiBA outperforms the centralized power budgeting scheme as well as the primal-dual algorithm in terms of computation/communication time. The fast convergence of DiBA is particularly ideal for dynamic workloads as well as dynamic power budgets with fast time scales. We demonstrated this characteristic of DiBA through our simulation experiments and experiments on the real cluster using a dynamic power budget. As future work, it is interesting to modify the structure of DiBA to include delay-sensitive applications.
In the presence of an end-to-end delay constraint on workloads, computing nodes must coordinate not only their power consumption data, but also their latency information, which leads to a challenging problem formulation.

ACKNOWLEDGMENT

The research of Xin Zhan, Reza Azimi and Sherief Reda is partially supported by NSF grants 3548 and . The research of Masoud Badiei and Na Li is supported by NSF CAREER grant and Harvard Center for Green Buildings and Cities.

REFERENCES

[1] A. Vahdat, M. Al-Fares, N. Farrington, R. N. Mysore, G. Porter, and S. Radhakrishnan, "Scale-out networking in the data center," IEEE Micro, vol. 3, no. 4, pp. 29-4, 2.
[2] D. Lo, L. Cheng, R. Govindaraju, L. A. Barroso, and C. Kozyrakis, "Towards energy proportionality for large-scale latency-critical workloads," in Proceeding of the 41st annual international symposium on Computer architecture, pp. 3-32, 24.
[3] A. Beloglazov, R. Buyya, Y. C. Lee, A. Zomaya, et al., "A taxonomy and survey of energy-efficient data centers and cloud computing systems," Advances in computers, vol. 82, no. 2, pp. 47, 2.
[4] A. Beloglazov, J. Abawajy, and R. Buyya, "Energy-aware resource allocation heuristics for efficient management of data centers for cloud computing," Future generation computer systems, vol. 28, no. 5, pp , 22.
[5] E. M. Elnozahy, M. Kistler, and R. Rajamony, Power-Aware Computer Systems. Springer, 23.
[6] J. Heo, P. Jayachandran, I. Shin, D. Wang, T. Abdelzaher, and X. Liu, "OptiTuner: On performance composition and server farm energy minimization application," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no., pp , 2.
[7] X. Zhan and S. Reda, "Power budgeting techniques for data centers," IEEE Transactions on Computers, vol. 64, pp , Aug 25.
[8] J. S. Chase, D. C. Anderson, P. N. Thakar, A. M. Vahdat, and R. P. Doyle, "Managing energy and server resources in hosting centers," in ACM SIGOPS Operating Systems Review, vol. 35, pp. 3-6, 2.
[9] E. Pinheiro, R. Bianchini, E. V. Carrera, and T. Heath, "Load balancing and unbalancing for power and performance in cluster-based systems," tech. rep., Rutgers University, 5 2.
[10] H. David, C. Fallin, E. Gorbatov, U. R. Hanebutte, and O. Mutlu, "Memory power management via dynamic voltage/frequency scaling," in Proceedings of the 8th ACM international conference on Autonomic computing, pp. 3-4, ACM, 2.
[11] S. H. Low and D. E. Lapsley, "Optimization flow control-I: basic algorithm and convergence," IEEE/ACM Transactions on Networking, vol. 7, no. 6, pp , 999.
[12] D. L. Mills, "Internet time synchronization: the network time protocol," Communications, IEEE Transactions on, vol. 39, no., pp , 99.
[13] N. Li and J. R. Marden, "Decoupling coupled constraints through utility design," Automatic Control, IEEE Transactions on, vol. 59, no. 8, pp , 24.
[14] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS parallel benchmarks," 99.
[15] S. Eranian, "Perfmon2: a flexible performance monitoring interface for Linux," in Proc. of the 26 Ottawa Linux Symposium, pp , Citeseer, 26.
[16] R. Cochran, C. Hankendi, A. Coskun, and S. Reda, "Pack & Cap: Adaptive DVFS and Thread Packing Under Power Caps," in ACM/IEEE International Symposium on Microarchitecture, pp , 2.
[17] X. Zhang, S. Dwarkadas, G. Folkmanis, and K. Shen, "Processor hardware counter statistics as a first-class system resource," in Proceedings of the th USENIX Workshop on Hot Topics in Operating Systems, pp. 4: 4:6, 27.
[18] M. Grant, S. Boyd, and Y. Ye, "CVX: Matlab software for disciplined convex programming," 28.
[19] B. Bollobás, Random graphs. Springer,
