QBox: Guaranteeing I/O Performance on Black Box Storage Systems


Dimitris Skourtis, Shinpei Kato, Scott Brandt
Department of Computer Science, University of California, Santa Cruz

ABSTRACT

Many storage systems are shared by multiple clients with different types of workloads and performance targets. To achieve performance targets without over-provisioning, a system must provide isolation between clients. Throughput-based reservations are challenging due to the mix of workloads and the stateful nature of disk drives, leading to low reservable throughput, while existing utilization-based solutions require specialized I/O scheduling for each device in the storage system. QBox is a new utilization-based approach for generic black box storage systems that enforces utilization (and, indirectly, throughput) requirements and provides isolation between clients, without specialized low-level I/O scheduling. Our experimental results show that QBox provides good isolation and achieves the target utilizations of its clients.

Categories and Subject Descriptors
D.4. [Operating Systems]: Storage Management; D.4.8 [Operating Systems]: Performance

Keywords
Storage virtualization, quality of service, resource allocation, performance

1. INTRODUCTION

During the past decade there has been a significant growth of data with no signs of slowing. Due to that growth there is a real need for storage devices to be shared efficiently by different applications and to avoid the extra costs of having more and more under-utilized devices dedicated to specific applications. In environments such as cloud systems, where multiple clients, i.e., streams of requests, compete for the same storage device, it is especially important to manage the performance of each client.
Failure to do so leads to low performance for some or all clients depending on complex factors such as the I/O schedulers used, the mix of client workloads, as well as storage-specific characteristics.

[Diagram: client request streams (e.g., database, media) pass through a controller to a closed storage device containing multiple disks.]
Figure 1: Given that we have no access to the storage device, we place a controller between the clients and the device to provide performance management to the clients, i.e., the request streams.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HPDC, June 8,, Delft, The Netherlands. Copyright ACM.

Unfortunately, due to the nature of storage devices, managing the performance of each client and isolating them from each other is a non-trivial task. In a shared system, each client may have a different workload and each workload may affect the performance of the rest in undesirable and possibly unpredictable ways. A typical example would be a stream of random requests reducing the performance of a sequential or semi-sequential stream, mostly due to the storage device performing unnecessary seeks. The above is the result of storage devices trying to be equally fair to all requests by providing similar throughput to every stream. Of course, not all requests are equally costly, with sequential requests taking only a small fraction of a millisecond and random requests taking several milliseconds, -3 orders of magnitude longer. Note that a sequential stream does not have to be perfectly sequential (none ever truly are) and that real workloads often exhibit such behavior. Providing a solution to the above problem may require using specific I/O schedulers for every disk drive or node in a clustered storage system.
Moreover, it could require changes to current infrastructure such as the replacement of the I/O scheduler of every client. Such changes may create compatibility issues preventing upgrades or other modifications from being applied to the storage system. Instead of making modifications to the infrastructure of an existing system it is often easier and in practice cheaper to deploy a solution between the clients and the storage. We call that the black box approach since it imposes minimal requirements on the

clients and storage, and because it is agnostic to the specifications of either side. Our approach partly fits the grey-box framework for systems presented in []; however, QBox requires fewer algorithmic assumptions about the underlying system. In this paper we take an almost agnostic approach and target the following problem: given a set of clients and a storage device, our goal is to manage the performance of each client's request stream in terms of disk-time utilization and provide each client with a pre-specified proportion of the device's time, while having no internal control of either the clients or the storage device, and without requiring any modifications to the infrastructure of either side.

Clients want throughput reservations. However, except for highly regular workloads, throughput varies by orders of magnitude depending upon workload (Figure 2) and only a fixed fraction of the (highly variable) total may be guaranteed. By isolating each stream from the rest, utilization reservations allow a system to indirectly guarantee a specific throughput (not just a share of the total) based on direct or inferred knowledge about the workload of an individual stream, independent of any other workloads on the system, and can allow much greater total throughput than throughput-based reservations [7]. Our utilization-based approach can work with Service Level Agreements (SLAs); requirements can be converted to utilization as demonstrated in [8], and as long as we can guarantee utilization, we can guarantee the throughput provided by an SLA.

To our knowledge there is no prior work on utilization-based performance guarantees for black box storage devices. Most work that is close to our scenario such as [5, ] is based on throughput and latency requirements, which are hard to reserve directly without under-utilizing the storage for a number of reasons, such as the orders-of-magnitude cost differences between best- and worst-case requests. Moreover, throughput-based solutions create other challenges such as admission control.
Without very specific knowledge about the workloads, the system must make worst-case assumptions, leading to extremely low reservable throughput. On the other hand, existing solutions based on disk utilization [8, 7, ] only support single drives and if used in a clustered storage system they require their scheduler to be present on every node.

[Plots: "Throughput achieved with utilization-based scheduling" (top) and "Throughput achieved with throughput-based scheduling" (bottom) for streams A (sequential), B (sequential), and C (random); x-axis: seconds.]
Figure 2: QBox (top) provides isolation, while the introduction of a random stream makes the throughput of sequential streams drop dramatically with throughput-based scheduling (bottom).

In this paper, we present a novel method for managing the performance of multiple clients on a storage device in terms of disk-time utilization. Unlike the management of a single drive, in the black box scenario it is hard to measure the service time of each request. Instead, our solution is based on the periodic estimation of the average cost of sequential and random requests as well as the observation that their costs have an orders-of-magnitude difference. We observe the throughput of each request type in consecutive time windows and maintain separate moving estimates for the cost of sequential and random requests. By taking into account the desired utilization of each client we schedule their requests by assigning them deadlines and dispatching them to the storage device according to the Earliest Deadline First (EDF) algorithm [3, ]. Our results show that the desired utilization rates are achieved closely enough, achieving both good performance guarantees and isolation. Those results hold over any combination of random, sequential, and semi-sequential workloads. Moreover, due to our utilization-based approach, it is easy to decide whether a new client may be admitted to the storage system, possibly by modifying the rates of other clients.
Finally, all clients may access any file on the storage device and we make no assumptions about the location of the data on a per-client basis.

2. SYSTEM MODEL

Our basic scenario consists of a set of clients each associated with a stream of requests and a single storage device containing multiple disks. Clients send requests to the storage device and each stream uses a proportion of the device's execution time. We call that proportion the utilization rate of a stream and it is either provided by the client or, in practice, by a broker, which is part of our controller and translates SLAs into throughput and latency requirements as in [7, 8] or []. Briefly, to translate an SLA to utilization, we measure the aggregate throughput of the system for sequential and random requests separately over small amounts of requests (e.g., ) and set a confidence level (e.g., 95%) to avoid treating all requests as outliers. Details about arrival patterns and issues such as head/track switches and bad layout are presented in [9]. The main characteristic of our scenario is that we treat the storage device as a black box. In other words, we only interact with the storage device by passing it client requests

and receiving responses. For example, we cannot modify or replace the device's scheduler as is the case in [8] and do not assume it uses a particular scheduler. Moreover, we cannot control which disk(s) are going to execute each request and do not restrict clients to specific parts of the storage device. Due to those requirements the natural choice is to place a controller between the clients and the storage device. Hence, all client requests go through the controller, where they are scheduled and eventually dispatched to the device. As we will see, this setup allows us to gather little information regarding the disk execution times, which turns scheduling and therefore black box management into a challenge.

[Diagram: inside QBox, requests from multiple clients enter FIFO stream queues, receive a deadline assignment based on estimation statistics, and pass through a deadline queue to the disks.]
Figure 3: The controller architecture.

[Plot: "Stream rates (disk time percentage) achieved with no management when desired rates are (5%, 5%)"; average utilization of a sequential and a random stream; x-axis: window (5ms) counter.]

We manage the performance of the streams in a time-based manner. After a request reaches our controller, we assign it a deadline by keeping an estimate of the expected execution time e for each type of request (sequential or random) and by using the stream's rate r provided by the broker. Using e and r we compute the request's deadline by d = e/r. The absolute deadline of a request coming from stream s is set to $D_s = T_s + d$, where $T_s$ is the sum of all the relative deadlines assigned so far to the requests of stream s. Although we are using deadlines for scheduling, our goal is not to strictly satisfy deadlines. Instead, it is the relative values that matter with regard to the dispatching order. On the other hand, if we used a stricter dispatching approach, e.g., [8], then the absolute times would be important for replacing the expected cost with the actual cost after the request was completed.
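The deadline assignment above, combined with earliest-deadline-first selection among the stream queue heads, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `Stream`, `enqueue`, and `dispatch_next`, as well as the cost values in the usage note, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Stream:
    """One client stream (hypothetical helper type, not from the paper)."""
    rate: float                                   # utilization rate r, in (0, 1]
    virtual_time: float = 0.0                     # T_s: sum of relative deadlines so far
    queue: deque = field(default_factory=deque)   # FIFO of (absolute deadline, request)


def enqueue(stream, request, est_cost):
    """Assign a deadline using d = e / r and D_s = T_s + d, then queue the request."""
    d = est_cost / stream.rate
    stream.virtual_time += d          # T_s now includes this request's relative deadline
    stream.queue.append((stream.virtual_time, request))


def dispatch_next(streams):
    """Pop the request with the earliest absolute deadline among the queue heads.

    Only the oldest request of each stream needs to be inspected, because
    deadlines within one stream are non-decreasing.
    """
    heads = [(s.queue[0][0], name) for name, s in streams.items() if s.queue]
    if not heads:
        return None
    _, name = min(heads)
    _, request = streams[name].queue.popleft()
    return name, request
```

As a usage illustration with assumed costs of 0.1 ms per sequential and 5 ms per random request, two streams with equal rates receive very different deadline spacings, so several sequential requests are dispatched for every random one:

```python
streams = {"A": Stream(rate=0.5), "B": Stream(rate=0.5)}
for k in range(3):
    enqueue(streams["A"], f"a{k}", 0.1)   # sequential stream
    enqueue(streams["B"], f"b{k}", 5.0)   # random stream
print([dispatch_next(streams) for _ in range(6)])
```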
In this paper we do not focus on urgent requests; however, it is possible to place such requests ahead of others in the corresponding stream queue by simply assigning them earlier deadlines. Although we do not assume the storage device is using a specific disk scheduler, it is better to have a scheduler which tries to avoid starvation and orders the requests in a reasonable manner (as most do). A stricter dispatching policy such as [8] can be used on the controller side to avoid starvation by placing more emphasis on satisfying the assigned deadlines instead of overall performance. The next section presents our method for estimating execution times and managing the performance of each stream in terms of time. We also discuss practical issues we faced while applying our method and how we addressed them.

Figure 4: When our controller sends the requests in the order they are received from the clients, the system fails to provide the desired rates.

3. PERFORMANCE MANAGEMENT

In QBox we maintain a FIFO queue for each stream and a deadline queue, which may contain requests from any stream. The deadline queue is ordered according to the Earliest Deadline First (EDF) scheduler and the deadlines are computed as described in the previous section. Whenever we are ready to dispatch a request to the storage device, the request with the smallest absolute deadline out of all the stream queues is moved to the deadline queue. To find the earliest-deadline request it suffices to look at the oldest request from each stream queue, since every other request in that queue either arrived at a later time or is less urgent. Next, the request with the earliest deadline is removed from the deadline queue and dispatched to the device.

3.1 Estimating execution times

As mentioned earlier, we aim to provide performance management through a controller placed between the clients and the storage device. We wish to achieve this goal without knowledge of how the storage device schedules and distributes the requests among its disks and without access to the storage system internals.
Most importantly, we are unaware of the time each request takes on a single disk, which we could otherwise measure by looking at the time difference between two consecutive responses, i.e., the inter-arrival time. In our case, the time between two consecutive responses does not necessarily reflect the time spent by the device executing the second request, because those two requests may have been satisfied by different disks. On the other hand, we know the number of requests executed from each stream on the storage device. If all requests had the same cost, then we could take the average over a time window T, i.e., e = T/n, where n is the number of requests completed in T. Clearly, that would not solve


the problem since random requests are orders-of-magnitude more expensive than sequential requests, i.e., the disk has to spend significantly more time to complete a random request.

[Plot: two lines, one per time window ($z_i$ with counts $a_i$, $b_i$, and $z_j$ with counts $a_j$, $b_j$), drawn on axes x (sequential cost) and y (random cost), with intercepts z/a on the x-axis and z/b on the y-axis.]
Figure 5: The intersection of the two lines from (3) gives us the average cost x of a sequential and y of a random request in the time windows $z_i$ and $z_j$.

Based on that observation, for each stream we classify its requests into sequential and random while keeping track of the number of requests completed by type per window. Assuming the clients saturate the device and the cost x of the average sequential request and the cost y of the average random request remain the same across two time windows $z_i$ and $z_j$, we are led to the following system of linear equations:

$$\begin{cases} \alpha_i x + \beta_i y = z_i \\ \alpha_j x + \beta_j y = z_j, \end{cases} \tag{1}$$

where $\alpha_i$ is the number of sequential requests completed in window i, and similarly for the number of random requests denoted by $\beta_i$. Often, j will be equal to i + 1. Solving the above system gives us the sequential and random average request costs for windows $z_i$ and $z_j$:

$$x = \frac{z_j\beta_i - \beta_j z_i}{\alpha_j\beta_i - \alpha_i\beta_j}, \qquad y = \frac{z_i}{\beta_i} - \frac{\alpha_i}{\beta_i}\,x. \tag{2}$$

The above equations may give us negative solutions due to system noise and other factors. Since execution costs may only be positive we restrict the solutions to positive (x, y) pairs (Figure 5), i.e., satisfying:

$$\begin{cases} z_i/\alpha_i < z_j/\alpha_j \\ z_i/\beta_i > z_j/\beta_j \\ \alpha_i/\beta_i > \alpha_j/\beta_j \end{cases} \quad\text{or}\quad \begin{cases} z_i/\alpha_i > z_j/\alpha_j \\ z_i/\beta_i < z_j/\beta_j \\ \alpha_i/\beta_i < \alpha_j/\beta_j. \end{cases} \tag{3}$$

Intuitively, setting $z_i$ equal to $z_j$ in (3) would require that if the number of completed sequential requests goes down in window j, then the number of random requests has to go up (and vice versa). Otherwise, the intersection would contain a negative component. By focusing on the case where every time window has the same length we reduce the chances of getting highly volatile solutions and make the analytical solution simpler to intuitively understand.
In that case the solution becomes

$$x = \frac{z}{\alpha_i + \beta_i\lambda}, \qquad y = \lambda x, \tag{4}$$

where

$$\lambda = \frac{\alpha_i - \alpha_j}{\beta_j - \beta_i}. \tag{5}$$

From (5) we see that the intersection solutions are expected to be volatile if the window size is small. On the other hand, if the window size is large and the throughput does not change, the intersection will often be negative, i.e., it will happen on a negative quadrant, since the two lines from Figure 5 will often have a similar slope. It would be easy to ignore negative solutions by skipping windows. However, depending on the window size and workload it is possible to get negative solutions more often than positive ones. That leads to fewer updates and therefore a slower convergence to a stable estimate. To face that issue we looked into two directions.

One direction is to observe that if the window size is small enough, it is not important whether we take the intersection of the current window with the previous one or some other window not too far in the past. Based on that observation we consider the positive intersections of the current window with a number of the previous ones and take the average. That method increases the chance of getting a valid solution. In addition, updating more frequently allows the moving estimate to converge more quickly without giving a large weight to any of the individual estimates.

Figure 6: Counting sequential and random completions per window lets us estimate their average cost.

The other way we propose to face negative solutions is to compute the projection of the previous estimate on the current window assuming the x/y ratio remains the same along those two windows. In particular, we may assume that $\alpha_i/\beta_i$ is close to $\alpha_j/\beta_j$. In that case, we can project the previous intersection point or estimate on the line describing the second window. The projection is given by

$$x = \frac{\alpha_j z_i}{\alpha_i(\mu\beta_j + \alpha_j)}, \qquad y = \mu x, \tag{6}$$

where

$$\mu = \frac{1}{\beta_i}\left(\frac{z_i}{x} - \alpha_i\right). \tag{7}$$

The idea is that if both the number of completed sequential and random requests in a window drops (or increases) proportionally the cost must have shifted accordingly.
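The intersection step from (1) and (2), together with the averaging over several previous windows, can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' code: the function names `intersect` and `estimate`, and the representation of a window as a tuple `(alpha, beta, z)`, are hypothetical, and the projection method is not covered.

```python
def intersect(win_i, win_j):
    """Solve alpha_i*x + beta_i*y = z_i and alpha_j*x + beta_j*y = z_j.

    Each window is a tuple (alpha, beta, z): sequential completions,
    random completions, and window length. Returns (x, y), or None when
    the lines are parallel or the solution is not strictly positive.
    """
    (a_i, b_i, z_i), (a_j, b_j, z_j) = win_i, win_j
    det = a_j * b_i - a_i * b_j
    if det == 0 or b_i == 0:
        return None
    x = (z_j * b_i - b_j * z_i) / det          # sequential cost, as in (2)
    y = z_i / b_i - (a_i / b_i) * x            # random cost, as in (2)
    if x <= 0 or y <= 0:
        return None                            # negative solutions are skipped
    return x, y


def estimate(current, previous_windows):
    """Average the positive intersections of `current` with earlier windows."""
    solutions = []
    for window in previous_windows:
        solution = intersect(window, current)
        if solution is not None:
            solutions.append(solution)
    if not solutions:
        return None
    n = len(solutions)
    return (sum(x for x, _ in solutions) / n,
            sum(y for _, y in solutions) / n)
```

For instance, with assumed true costs of 0.1 ms (sequential) and 5 ms (random) and windows of 100 ms, the windows `(500, 10, 100)` and `(900, 2, 100)` are both consistent with those costs, and `intersect` recovers them up to floating-point error. A pair such as `(500, 10, 100)` and `(400, 9, 100)`, where both counts drop within an equal-length window, yields a negative component and is rejected.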
Although we observed that the projection method works especially well, its correctness depends on the previous estimate. It could still be used when some intersection is invalid to keep updating the estimate, but we leave it as future work to determine whether it can enhance our estimates.

3.2 Estimation error and seek times

A key assumption is that the request costs are the same among windows. Assuming that at some point we have the

true (x, y) cost and that the cost in the next window is not exactly the same due to system noise, we expect to have error. To compute that error we replace $z_j$ in the solution for x in (2) by its definition, i.e., $\alpha_j x + \beta_j y$, and denote that expression by $x'$. Taking the difference between $x'$ and $x$ gives

$$x' - x = \frac{\beta_i}{\alpha_j\beta_i - \alpha_i\beta_j}\bigl((\alpha_j x + \beta_j y) - z_j\bigr) \tag{8}$$

and

$$y' - y = -\frac{\alpha_i}{\beta_i}\,(x' - x). \tag{9}$$

So far we have not considered seek times between streams and how they might affect our estimates. In the typical case where m random requests are executed by a disk followed by n sequential requests, the first request out of the sequential ones will incur a seek. That seek is not fully charged to either type of request in our model, simply because it is either hard or impossible in our scenario. Intuitively, the total seek cost of a window is distributed across both request types. Firstly, because fewer requests of both types will end up being executed in that window and secondly due to the error formula (9) for y. In particular, assuming the delayed requests in some window would also follow the $\alpha_i/\beta_i$ ratio, we now show that seeks do not affect our scheduling. Let $\alpha_i' = \alpha_i - \delta_i^{(\alpha)}$ and $\beta_i' = \beta_i - \delta_i^{(\beta)}$, where $\delta_i^{(\tau)}$ is the number of requests of type $\tau$ that are not executed in window i due to seek events. From the above assumption, $\delta_i^{(\beta)} = (\beta_i/\alpha_i)\,\delta_i^{(\alpha)}$. Then $\alpha_i'/\beta_i' = (\alpha_i - \delta_i^{(\alpha)})/(\beta_i - \delta_i^{(\beta)})$, which gives $\alpha_i/\beta_i$, and similarly for window j. Using the original solution (2) for the sequential and random costs, consider the ratio y/x as well as y'/x', which uses $\alpha_i'$ instead of $\alpha_i$ and similarly for $\beta_i$. By substituting, we get that $y/x = y'/x' = -\alpha_i'/\beta_i'$, which is independent of the number of seeks $\delta$ and by the above is equal to $-\alpha_i/\beta_i$. From the above, we conclude that seeks do not affect the relative estimation costs and consequently our schedule. The reason the ratios are negative can be seen from Figure 5. Specifically, fixing every variable in (1), while increasing the x-cost, reduces the y-cost and vice versa. Therefore, the slope y/x is negative whether we have seeks or not.
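The invariance of the completion ratio under seeks can be checked numerically. The sketch below uses exact rational arithmetic and relies on the same assumption as the argument above, namely that requests lost to seeks follow the window's sequential-to-random ratio; the function name `ratio_after_seeks` and the sample counts are hypothetical.

```python
from fractions import Fraction


def ratio_after_seeks(alpha, beta, lost_seq):
    """Return alpha'/beta' after `lost_seq` sequential completions are lost to seeks.

    Assumes the lost random completions follow the same ratio, i.e.
    delta_beta = (beta / alpha) * delta_alpha.
    """
    lost_rand = Fraction(beta, alpha) * lost_seq
    return (alpha - lost_seq) / (beta - lost_rand)


# The ratio is unchanged regardless of how many requests were lost to seeks.
for lost in (0, 50, 90, 450):
    assert ratio_after_seeks(900, 2, lost) == Fraction(900, 2)
```

Since the slope in (1) depends only on this ratio, the relative costs, and hence the dispatch order, stay the same under that assumption.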
3.3 Write support and estimating in practice

In this work, we only deal with read requests. Since writes typically respond immediately, it is harder to approximate the disk throughput over small time intervals. On the other hand, if a system is busy enough, the write throughput over large intervals (e.g., 5 seconds) is expected to have a smaller variance and be closer to the true throughput. Preliminary results suggest the above holds. There are still some challenges, such as the effect of writes on reads when there is significant write activity, which may be addressed by dispatching writes in groups. Adding support for writes is a priority for future work and is expected to lead to a more general solution supporting SSDs and hybrid systems.

In our implementation we took the approach of having small windows, e.g. ms, to increase the frequency of estimates and to give a small weight to each of them. As we compute intersections we keep a moving average and weight each estimate depending on its distance from the previous one. Due to the frequent updates, if there is a shift in the cost, the moving estimate will reach that value quickly. Moreover, to improve estimates, for each window we find its intersection with a number of the previous windows (e.g., ). Finally, if the λ cost ratio as defined in (5) is too small or too large we ignore that pair of costs. We set the bounds to what we consider safe values in that they will only take out clearly wrong intersections.

[Plot: "Average Rate, Sequential 5%, Random 5%" on one disk; total sequential, random, and three individual sequential streams; x-axis: window (5ms) counter.]
Figure 7: Using one disk and a mixture of sequential and random streams the rates are achieved and convergence happens quickly.

4. EXPERIMENTAL EVALUATION

In this section, we evaluate QBox in terms of utilization and throughput management. We first verify that the sequential and random request cost estimates are accurate enough and that the desired stream rates are satisfied in different scenarios.
Next, we show that the throughput achieved is to a large degree in agreement with the target rates of each stream.

4.1 Prototype

In all our experiments we use up to four disks (different models) or a software RAID over two disks. We forward stream requests to the disks asynchronously using Kernel AIO. We avoided using threads in order to keep a large number of requests queued up (e.g., ) and to avoid race conditions leading to inaccurate inter-arrival time measurements. Up to subsection 4.4 we are interested in evaluating QBox in a time-based manner. For that purpose we avoid hitting the filesystem cache by enabling O_DIRECT and do not use Native Command Queuing (NCQ) in any of the disks. Moreover, we send requests in a RAID fashion rather than using a true RAID. The above allows us to know the disk each request targets, which consequently lets us compute the service times by measuring the inter-arrival times and compare those with our estimates. The extra information is not used by our method since it is normally unavailable. It is used only for evaluation purposes. Starting from subsection 4.4 we gradually remove all the above restrictions and evaluate QBox implicitly in a throughput-based manner.

We evaluate QBox both with synthetic and real workloads depending on the goal of the experiment. All synthetic requests are reads of size 4KB unless we are using a RAID over two disks, in which case they are 8KB. For the synthetic workload, each disk contains a hundred GB files. We use a subset of the Deasna [3] NFS trace with request sizes typically being 3KB or 64KB. Finally, except for a workload containing idle time, we assume there are always requests queued up, since that is the most interesting scenario, and we bound the number of pending requests on the storage by a constant, e.g., . We do not assume a specific I/O scheduler is used by the storage device. In our experiments, the Deadline Scheduler was used; however, we have tried other schedulers and observed similar results.

[Plots: three panels of average rate per window for three sequential streams and one random stream under different sequential/random target splits; x-axis: window (ms) counter.]
Figure 8: Using two disks and our scheduling and estimation method we achieve the desired rates most of the time relatively well. In the above we have three sequential streams and a random one.

[Plots: three panels of random request execution cost estimates compared with the per-disk averages; x-axis: window (ms) counter.]
Figure 9: Using two disks (d, d) and our estimation method we maintain a moving estimate of the average random execution cost on the storage device.

4.2 Sequential and random streams

Our approach is based on the differentiation between sequential and random requests and so the first step in evaluating QBox is to consider a workload of fully sequential and random streams with the goal of providing isolation between them.
Note that to provide isolation a prerequisite is that our cost estimates for the average sequential and random request are close enough to the true values, which are not known, and it is not possible to explicitly measure them in our black box scenario. In this set of experiments, the workload consists of three sequential streams and one random. Each sequential stream starts at a different file to ensure there are inter-stream seeks. Each request of the random stream targets a file and offset uniformly at random. For each stream we measure the average utilization provided by the storage device. We look into three sets of desired utilizations. Figure 8(a) shows a random stream with a utilization target of 7%, while each sequential stream has a target of % for a total of 3%. As the experiment runs, the cost estimates take values within a small range and the average achieved utilization converges. In Figures 8(b) and (c) the sequential streams are given higher utilizations. In all three cases, the achieved utilizations are close to the desired ones. Again in Figures 8(b) and (c) the initial estimate was relatively close to the actual cost, so the moving rates approach the converging rates more quickly. From Figure 9 we notice that estimates get above the average cost when there are many sequential requests even if the utilization targets are achieved (Figure 8). The main reason for that is that we keep track of and store (in memory) large amounts of otherwise unnecessary statistics per request. Therefore, if in a window of e.g., ms there is a very large number of request completions, i.e., when the rate of sequential streams is high, ms (μs requests) may be given to that processing and therefore the estimates are scaled up. Of course, those operations can be optimized or eliminated without affecting QBox. As expected, a similar effect happens with the estimated cost of sequential requests (not shown), therefore the ratio of the costs stays valid, leading to proper scheduling as shown in Figure 8.
Besides the initial estimates, the convergence rate also depends on the window size, since a smaller size implies more frequent updates and faster convergence. However, if the window size becomes too small the number of completed requests becomes too few and the estimate may not be accurate enough due to the significant noise. Note that whether the window size is considered too small

depends on the number of disks in the storage system, as having more disks implies that a greater number of requests complete per window. The window size we picked in the above experiment (Figure 8) is ms. Other values such as 5ms provide similar estimation quality and later we look at smaller windows of 75ms. Note that in Figure 9 there is a number of recorded averages that are zero because random streams with low target rates are more likely to have no arrivals in a window. Not having any completed random requests in a window implies that we can compute a new sequential estimate more easily.

[Plots: three panels of achieved stream rates for streams A and B (mixed random and sequential requests), C (all sequential), and D (all random); x-axis: window (75ms) counter.]
Figure 10: Using two disks and our scheduling and estimation method we achieve the desired rates most of the time relatively well. Stream A requires % of the disk time and sends 5 random requests every 475 sequential requests. Similarly for the rest of the streams.

[Plots: three panels of random request execution cost estimates (Exp. B, C, D) compared with the per-disk averages; x-axis: window (75ms) counter.]
Figure 11: Using two disks (d, d) and our estimation method we maintain a moving estimate of the average random execution cost on the storage device.

Finally, in the above, we assume there are always enough queued requests from all streams. Without any modification to our method we see from Figure 12 that under idle time it is still possible to manage the rates. In particular, every 5 requests (on average) dispatched to the storage device we delay dispatching the next request(s) for a (uniformly at) random amount of time between .5 to second.
From Figure 12 we see that the rates are still achieved, while there is slightly more noise in the estimates compared to Figure 9. We noted that if the idle times are larger than the window size, then our method is less affected. That was expected, since idling over a number of consecutive windows implies that new requests will be scheduled according to the previous estimates as the estimates will not be updated. Finally, although the start and end of the idle time window may affect the estimate, the effect is not significant since the estimate moves only by a small amount on each update and most updates are not affected.

4.3 Mixed-workload streams

In practice, most streams are not perfectly sequential. For example, a stream of requests may consist of m random requests for every n sequential requests, where m is often significantly smaller than n. To face that issue, instead of characterizing each stream as either sequential or random we classify each request. Note that the first request of a sequential group of requests after m random ones is considered random if m is large enough. Although not all random requests cost exactly the same, we do not differentiate between them since we work on top of the filesystem and do not assume we have access to the logical block number of each file. Therefore, we do not have a real measure of sequentiality for any two I/O requests. However, as long as the cost of random requests does not vary significantly between streams we expect to achieve the desired utilization for each stream. Indeed, as has been observed in [], good utilization management can still be provided when random requests are assumed to cost the same. Moreover, from [3] we see that requests from common workloads are usually either almost sequential or fully random. Differentiating between cost estimates on a per-stream basis is expected to improve the management quality and we leave it as future work. From Figure 10 we see that the targets are achieved in the

presence of semi-sequential streams. In particular, in (a), stream A sends 5 random requests for every 475 sequential ones. Stream B sends 7 random requests for every 693 sequential ones, while streams C and D are purely sequential and random, respectively. Other target sets in Figure 10 are satisfied equally well. Note that each group of requests does not have to be completed before the next one is sent. Instead, requests are continuously dequeued and scheduled.

[Plots: "Average Rate, Sequential 5%, Random 5% (with idle time)" and the corresponding random request execution cost estimates compared with the per-disk averages; x-axis: window (5ms) counter.]
Figure 12: Using two disks (d, d), the desired rates are achieved well enough (reach 45% quickly) even when there is idle time in the workload.

[Plot: "Achieved Stream Rates with Shifts (4 Disks)" for streams A and B (mixed), C (all sequential), and D (all random), each with three successive target rates.]

So far we have seen scenarios with fixed target rates. Our method supports changing the target rates online as long as the rate sum is up to %. Depending on the new target rates, the cost estimation updates can be crucial in achieving those rates. For example, increasing the rate of a random stream decreases the average cost of a random request and our estimates are adjusted automatically to reflect that. Figure 13(a) illustrates that the utilization rates are satisfied and Figure 13(b) shows how the random estimate changes as the clients adjust their desired utilization rates every thirty seconds. For this experiment we set the number of disks to four to illustrate that our method works with a higher number of disks and to support our claim that it can work with any number of disks. The same experiment was run with two disks giving nearly identical results (figure omitted). As explained earlier, the disk queue depth is set to one for evaluation purposes.
However, since a large queue depth can improve the disk throughput, we implicitly evaluate QBox by comparing the throughput achieved when the depth is one and 3, while the target rates change. In particular, we look at semi-sequential and random streams. As expected and illustrated in Figure 15, having a depth of 3 achieves a higher throughput over a range of rates. Although this does not verify that our method works perfectly, due to lack of information, it provides evidence that it works and, as we will see in the next subsection, that is indeed the case.

Figure 13: Using four disks and desired rates that shift over time, the rates are still achieved quickly under semi-sequential and random workloads. [Panel: "Shifted Estimates" (Exp. F), random request execution cost, disk averages; x-axis: Window (75ms) counter.]

4.4 RAID utilization management

In our experiments so far, we have been sending requests to disks manually in a striping fashion instead of using an actual RAID device. That was done for evaluation purposes. Here, we use a (software) RAID device and instead evaluate QBox indirectly. The RAID configuration consists of two disks with a chunk size of 4KB to match our previous experiments, while requests have a size of 8KB. In the first experiment we focus on the throughput achieved by two (semi-)sequential streams as we vary their desired rates. Moreover, we add a random stream to make it more realistic and challenging. We fix the target rate of the random stream since otherwise it would have a variable effect on the sequential streams' throughput and make the evaluation uncertain. As long as the throughput achieved by each of the sequential streams varies in a linear fashion, we are able to conclude that our method works. Indeed, from Figure 14 stream A starts with a target rate of 0.5 and goes down to 0, while stream B moves in the opposite direction. As the throughput of stream A goes down, the difference is provided to stream B. Moreover, in Figure 16 we see that having two random streams and a sequential one fixed at 5% (not plotted) has a similar behavior.
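The linearity argument above follows from simple arithmetic: a stream that receives a fraction u of the device time, with requests costing c on average, completes roughly u / c requests per unit time. The sketch below illustrates this under idealized assumptions; the function names and the per-request cost of 0.5 ms are hypothetical and are not measurements from the paper.

```python
def expected_iops(utilization, cost_ms):
    """Requests per second for a stream that gets `utilization` (0..1) of the
    device time when each request costs `cost_ms` milliseconds on average."""
    if cost_ms <= 0:
        raise ValueError("cost_ms must be positive")
    return utilization * 1000.0 / cost_ms


def sweep(total, steps, cost_ms):
    """Shift utilization from stream A to stream B in equal steps while their
    sum stays at `total`; returns (rate_a, rate_b, iops_a, iops_b) rows."""
    rows = []
    for k in range(steps + 1):
        rate_a = total * (steps - k) / steps
        rate_b = total * k / steps
        rows.append((rate_a, rate_b,
                     expected_iops(rate_a, cost_ms),
                     expected_iops(rate_b, cost_ms)))
    return rows


if __name__ == "__main__":
    # Hypothetical cost of 0.5 ms per (semi-)sequential request.
    for rate_a, rate_b, iops_a, iops_b in sweep(0.5, 5, 0.5):
        print(f"A={rate_a:.1f} B={rate_b:.1f} -> {iops_a:.0f} / {iops_b:.0f} IOPS")
```

In this idealized model the combined throughput of A and B stays constant across the sweep. It deliberately ignores seeks between interleaved sequential streams, so measured totals can dip when the two rates are close to each other.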
The difference between those two cases is the drop in the total throughput of the first case, with streams A and B having a lower throughput when their rates get closer to each other. That is due to the more balanced number of requests being executed from each sequential stream, leading to a greater number of seeks between them. Since seeks are relatively expensive compared to the typical sequential request, the overall throughput drops slightly. If that effect was not observed in Figure 14, then the random stream (C) would be getting a smaller amount of the storage time, which would go against its performance targets. Instead, the random stream throughput remains unchanged. On the other hand, in Figure 16 there

is no drop in the total throughput, which is expected since the cost of seeks between random requests is similar to the typical cost of a random request. Therefore, the total throughput remains constant. Moreover, the sequential stream (not plotted) reaches an average throughput of 36 and 56 IOPS with a depth of one and 3, respectively, as in Figure 14. Finally, note that whether we use no NCQ or a depth of 3 the throughput behavior is similar in both Figures (14 and 16), which is desired since a large depth can provide a higher throughput in certain cases [26], along with other benefits such as reducing power consumption [24].

Figure 14: Using RAID the throughput achieved by each sequential stream is in agreement with its target rate. Stream A has a varied target rate from 50% to 0 and the opposite for B. Random stream C requires a fixed rate of 5% of the storage time. Similarly for a large disk queue depth (NCQ). [Panels: "Throughput of Streams for different rates (no NCQ)" and "Throughput of Streams for different rates (NCQ 3)". Legend: Stream A (seq), Stream B (seq), Total (seq), Stream C (random).]

4.5 Evaluation using traces

To strengthen our evaluation, besides synthetic workloads we run QBox using two different days of the Deasna [3] trace as two of the three read streams, while the third stream sends random requests. Deasna contains semi-sequential traces of email and research workloads from Harvard's division of engineering and applied sciences. As the requests wait to be dispatched, we classify them as either sequential or random depending on the other requests in their queue. Unlike time, evaluating a method by comparing throughput values is hard because the achieved throughput depends on the stream workloads.
However, by looking at the throughput achieved using QBox in Figure 17 and the results of throughput-based scheduling in Figure 18, it is easy to conclude that QBox provides a significantly higher degree of isolation and that the target rates of the streams are respected well enough. Moreover, looking more closely at Figure 17, we see that wherever the throughput is not in perfect accordance with the targets of streams A and B, there is an increase of random requests coming from the same streams. That effect is valid and due to the trace itself. On the other hand, Figure 18 demonstrates the destructive interference inherent in throughput-based reservation schemes, with semi-sequential streams receiving a very low throughput.

Figure 15: The throughput with a disk queue depth of one and 3 is maintained while the desired rates vary. [Panels: "Total throughput of (mostly) sequential streams (Exp. F with three disks)" and "Total throughput of (mostly) random streams (Exp. F with three disks)"; one curve per disk queue depth.]

Figure 16: Using RAID the throughput of each random stream is in agreement with its target. Stream A has a varied target rate from 50% to 0 and B from 0 to 50%. The sequential stream (not plotted) has a utilization of 5%, leading to an average of 36 and 56 IOPS with no NCQ and a depth of 3, respectively. [Panels: "Throughput of Streams for different rates (no NCQ)" and "Throughput of Streams for different rates (NCQ 3)". Legend: Stream A (random), Stream B (random), Total (random).]

Figure 17: QBox provides streams of real traces the throughput corresponding to their rate closely enough even in the presence of a random stream. [Panel: "Throughput with a varying utilization target (using traces)". Legend: Stream A (Trace/Semi seq), Stream B (Trace/Semi seq), Stream C (Random), Stream (A+B) random throughput, Stream (A+B) average throughput; x-axis: Seconds.]

Figure 18: Random stream C affects throughput-scheduling, leading to a low throughput for A and B. [Panel: "Throughput with a varying utilization target (using traces) and throughput scheduling"; x-axis: Seconds.]

4.6 Caches

So far our experiments have skipped the file system cache to more easily evaluate our method and to send requests asynchronously, since without O_DIRECT they become blocking requests. Although applications such as databases may avoid file system caches, we are interested in QBox being applicable in a general setting. For our purposes, request completions resulting from cache hits could be ignored or accounted differently. From our experiments, detection of random cache hits seems reliable, and the well-known relation, as explained in [7], between the queue size and the average latency may also be useful to improve accuracy, as well as grey-box methods [1]. However, we cannot say the same for sequential requests due to prefetching. Moreover, since random workloads may cover a large segment of the storage, hits are not as likely. Hence, in this paper we treat hits as regular completions for simplicity.
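The two accounting options mentioned above, treating cache hits as regular completions or ignoring them, can be contrasted with a short sketch. The paper does not state how hits are detected, so the latency threshold used here is purely an assumption, and `mean_cost_ms` and `hit_threshold_ms` are hypothetical names chosen for illustration.

```python
def mean_cost_ms(latencies_ms, ignore_hits=False, hit_threshold_ms=0.1):
    """Average completion latency over one window.

    With ignore_hits=False every completion counts, which corresponds to
    treating hits as regular completions. With ignore_hits=True, completions
    faster than the (hypothetical) threshold are assumed to be cache hits
    and are left out of the average.
    """
    if ignore_hits:
        samples = [t for t in latencies_ms if t >= hit_threshold_ms]
    else:
        samples = list(latencies_ms)
    if not samples:
        return None  # nothing to average in this window
    return sum(samples) / len(samples)


if __name__ == "__main__":
    window = [0.02, 8.0, 6.0]   # one suspected hit and two disk accesses
    print(mean_cost_ms(window))                    # hits counted
    print(mean_cost_ms(window, ignore_hits=True))  # hits skipped
```

Counting hits pulls the average cost down, while skipping them keeps the estimate closer to the cost of requests that actually reach the disks. A fixed threshold would be unreliable for sequential requests, where prefetching blurs the distinction.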
Without modifying QBox we enable the file system cache and see from Figure 19 that although the throughput is noisier than in the previous experiments due to the nature of cache hits, we still manage to achieve throughput rates that are in accordance with the target rates. Scheduling based on throughput (Figure 20) gives similar results to Figure 18, supporting our position on throughput-based scheduling. Finally, using synthetic workloads we get an output of the same form as Figure 14 with a maximum sequential throughput of 7 IOPS (figure omitted).

4.7 Overhead

The computational overhead is trivial. We know the most urgent request in each stream queue and thus picking the next request to dispatch requires as many operations as the number of streams. Since the number of streams is expected to be low, that cost is trivial. In addition, on a request completion we increase a fixed number of counters and at the end of each window we compute a fixed, small number of intersections. The time it takes to compute each intersection is insignificant. Finally, updating the moving estimate only requires computing the new estimate weight. In total the procedure at the end of each window takes less than μs.

5. RELATED WORK

A large body of literature exists related to providing guarantees over storage devices. Typically they either aim to

satisfy throughput or latency requirements, or attempt to proportionally distribute throughput. Most solutions do not distinguish between sequential and random workloads, which leads to the storage being under-utilized. Avoiding that distinction leads to charging semi-sequential streams unfairly due to the significant cost difference between sequential and random requests. Instead, QBox uses disk service time rather than IOPS or Bytes/s to solve that problem. Stonehenge [8] clusters storage systems in terms of bandwidth, capacity and latency; however, being based on bandwidth, its reservations cover only a fraction of the disk performance. Other proposed solutions based on bandwidth include [, 5, ] and take advantage of the relation, as was later explained in [7], between the queue length and average latency to throttle requests. mClock [6] does not provide performance insulation, while both [6] and pClock [5] do not differentiate between sequential and random requests. Other solutions such as [9,, 5, 6] do not provide insulation either. On the other hand, Argon [22] provides insulation; however, workload changes may affect its provided soft bounds. In [4], distribution-based QoS is provided to a percentage of the workload to avoid over-provisioning. [] attempts to predict response times rather than service times through statistical models.

Figure 19: The throughput achieved by QBox in the presence of caches follows the target rates. [Panel: "Throughput with a varying utilization target (using traces) and caches". Legend: Stream A (Trace/Semi seq), Stream B (Trace/Semi seq), Stream C (Random), Stream (A+B) random requests, Stream (A+B) average throughput; x-axis: Seconds.]

Figure 20: Throughput-based scheduling fails to isolate stream performance, leading to a low throughput for semi-sequential streams. [Panel: "Throughput with a varying utilization target (using traces), caches and throughput scheduling"; x-axis: Seconds.]
PARDA [4] provides guarantees by assuming a specific scheduler resides on each host, unlike QBox, which does not assume access to the hosts/clients. Facade [15] aims to provide performance guarantees described by an SLA for each virtual storage device. It places a virtual store controller between a set of hosts and storage devices in a network and throttles client requests so that the devices do not saturate. In particular, it adjusts the queue size dynamically, which affects the latency of each workload. However, a single set of low latency requests may decrease the queue size of the system and it is hard to determine whether a new workload may be admitted. YouChoose [27] tries to provide the performance of reference storage systems by measuring their performance off-line and mimicking it online. It is based on an off-line machine learning process, similar to [3], which can be hard to prepare due to the challenging task of selecting a representative set of training data. Moreover, the safe admission of new virtual storage devices can be challenging. Solutions based on execution time estimates such as [8, 7, ] assume we have low-level control over each hard-drive. Moreover, in Horizon [18] it was shown that such a solution can be used in distributed storage systems with single-disk nodes using the Horizon scheduler. Our work is also based on disk-time utilization and deadline assignment; however, we treat the storage device as a black box and therefore do not assume our own scheduler is in front of every hard-drive. Finally, [5, ] reserve I/O rates using worst-case execution times; therefore, they can only reserve a fraction of the storage device time.

6. CONCLUSIONS

In this paper, we targeted the problem of providing isolation and performance guarantees in terms of storage device utilization to multiple clients with different types of workloads. We proposed a plug-n-play method for isolating the performance of clients accessing a single file-level storage device treated as a black box.
Our solution is based on a novel method for estimating the expected execution times of sequential and random requests as well as on assigning deadlines and scheduling requests using the Earliest Deadline First (EDF) scheduling algorithm. Our experiments show that QBox provides isolation between streams having different characteristics with changing needs and on storage systems with a variable number of disks.

There are multiple directions for future work. Extensions include support for SSDs based on the cost difference of reads and writes as well as hybrid systems. Adding support for writes and RAID 4, 5 is another direction. Technical improvements include a better use of the history of requests in computing estimates and sophisticated methods to detect sudden and stable changes. Finally, we would like to verify that QBox works on Network Attached Storage, test it at the hypervisor level in a virtualized environment and explore the case where there are multiple controllers and storage devices.

7. REFERENCES

[1] A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau. Information and control in gray-box systems. In Proceedings of the eighteenth ACM symposium on Operating systems principles, SOSP, pages 43 56, New York, NY, USA. ACM.

[2] D. D. Chambliss, G. A. Alvarez, P. Pandey, D. Jadav, J. Xu, R. Menon, and T. P. Lee. Performance virtualization for large-scale storage systems. In Proceedings of the th International Symposium on Reliable Distributed Systems (SRDS'3), pages 9 8, 3.
[3] D. Ellard, J. Ledlie, P. Malkani, and M. Seltzer. Passive NFS tracing of email and research workloads. In Proceedings of the nd USENIX Conference on File and Storage Technologies, FAST 3, pages 3 6, Berkeley, CA, USA, 3. USENIX Association.
[4] A. Gulati, I. Ahmad, and C. A. Waldspurger. PARDA: proportional allocation of resources for distributed storage access. In Proceedings of the 7th conference on File and storage technologies, pages 85 98, Berkeley, CA, USA, 9. USENIX Association.
[5] A. Gulati, A. Merchant, and P. J. Varman. pClock: an arrival curve based approach for QoS guarantees in shared storage systems. In Proceedings of the 7 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, SIGMETRICS 7, pages 3 4, New York, NY, USA, 7. ACM.
[6] A. Gulati, A. Merchant, and P. J. Varman. mClock: handling throughput variability for hypervisor IO scheduling. In Proceedings of the 9th USENIX conference on Operating systems design and implementation, OSDI, pages 7, Berkeley, CA, USA. USENIX Association.
[7] A. Gulati, G. Shanmuganathan, I. Ahmad, C. Waldspurger, and M. Uysal. Pesto: online storage performance management in virtualized datacenters. In Proceedings of the nd ACM Symposium on Cloud Computing, SOCC, pages 9: 9:4, New York, NY, USA. ACM.
[8] L. Huang, G. Peng, and T.-c. Chiueh. Multi-dimensional storage virtualization. In Proceedings of the joint international conference on Measurement and modeling of computer systems, SIGMETRICS 4/Performance 4, pages 4 4, New York, NY, USA, 4. ACM.
[9] W. Jin, J. S. Chase, and J. Kaur. Interposed proportional sharing for a storage service utility. SIGMETRICS Perform. Eval. Rev., 3:37 48, June 4.
[10] T. Kaldewey, T. Wong, R. Golding, A. Povzner, S. Brandt, and C. Maltzahn. Virtualizing disk performance. In Real-Time and Embedded Technology and Applications Symposium, 8. RTAS 8. IEEE, pages 39 33, April 8.
[11] M. Karlsson, C. Karamanolis, and X. Zhu. Triage: Performance differentiation for storage systems using adaptive control. Trans. Storage, :457 48, November 5.
[12] T. Kelly, I. Cohen, M. Goldszmidt, and K. Keeton. Inducing models of black-box storage arrays. Technical Report HPL-SSP-4-8, 4.
[13] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM, :46 6, January 973.
[14] L. Lu, P. Varman, and K. Doshi. Graduated QoS by decomposing bursts: Don't let the tail wag your server. In Distributed Computing Systems, 9. ICDCS 9. 9th IEEE International Conference on, pages, June 9.
[15] C. R. Lumb, A. Merchant, and G. A. Alvarez. Facade: Virtual storage devices with performance guarantees. In Proceedings of the nd USENIX Conference on File and Storage Technologies, pages 3 44, Berkeley, CA, USA, 3. USENIX Association.
[16] A. Merchant, M. Uysal, P. Padala, X. Zhu, S. Singhal, and K. Shin. Maestro: quality-of-service in large disk arrays. In Proceedings of the 8th ACM international conference on Autonomic computing, ICAC, pages 45 54, New York, NY, USA. ACM.
[17] A. Povzner, T. Kaldewey, S. Brandt, R. Golding, T. M. Wong, and C. Maltzahn. Efficient guaranteed disk request scheduling with Fahrrad. In Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 8, Eurosys 8, pages 3 5, New York, NY, USA, 8. ACM.
[18] A. Povzner, D. Sawyer, and S. Brandt. Horizon: efficient deadline-driven disk I/O management for distributed storage systems. In Proceedings of the 9th ACM International Symposium on High Performance Distributed Computing, HPDC, pages, New York, NY, USA. ACM.
[19] A. S. Povzner. Efficient guaranteed disk I/O performance management. PhD thesis, University of California at Santa Cruz, Santa Cruz, CA, USA. AAI3495.
[20] L. Reuther and M. Pohlack. Rotational-position-aware real-time disk scheduling using a dynamic active subset (DAS). In Proceedings of the 4th IEEE Real-Time Systems Symposium (RTSS 3), page 374. IEEE Computer Society, 3.
[21] M. Spuri, G. Buttazzo, and S. S. S. Anna. Scheduling aperiodic tasks in dynamic priority systems. Real-Time Systems, :79, 996.
[22] M. Wachs, M. Abd-El-Malek, E. Thereska, and G. R. Ganger. Argon: performance insulation for shared storage servers. In Proceedings of the 5th USENIX conference on File and Storage Technologies, pages 5 5, Berkeley, CA, USA, 7. USENIX Association.
[23] M. Wang, K. Au, A. Ailamaki, A. Brockwell, C. Faloutsos, and G. R. Ganger. Storage device performance prediction with CART models. SIGMETRICS Perform. Eval. Rev., 3:4 43, June 4.
[24] Y. Wang. NCQ for power efficiency. White paper, February 6.
[25] T. M. Wong, R. A. Golding, C. Lin, and R. A. Becker-szendy. Zygaria: storage performance as a managed resource. In IEEE Real Time and Embedded Technology and Applications Symposium (RTAS 6), pages 5 34, 6.
[26] Y. J. Yu, D. I. Shin, H. Eom, and H. Y. Yeom. NCQ vs. I/O scheduler: Preventing unexpected misbehaviors. Trans. Storage, 6:: :37, April.
[27] X. Zhang, Y. Xu, and S. Jiang. YouChoose: Choosing your storage device as a performance interface to consolidated I/O service. Trans. Storage, 7:9: 9:8, October.
