Estimating Progress of Execution for SQL Queries

Surajit Chaudhuri (Microsoft Research), Vivek Narasayya (Microsoft Research), Ravishankar Ramamurthy (University of Wisconsin, Madison)

ABSTRACT

Today's database systems provide little feedback to the user/DBA on how much of a SQL query's execution has been completed. For long running queries, such feedback can be very useful, for example, to help decide whether the query should be terminated or allowed to run to completion. Although the above requirement is easy to express, developing a robust indicator of progress for query execution is challenging. In this paper, we study the above problem and present techniques that can form the basis for effective progress estimation. The results of experimentally validating our techniques in Microsoft SQL Server are promising.

1. INTRODUCTION

Decision support applications typically include long-running queries. For such queries, the ability to estimate the progress of query execution could be very useful. Progress estimation could help DBAs as well as end users or applications decide whether to terminate the query or allow it to finish. Such feedback could qualitatively improve the experience for any database user. However, today's database systems only provide rudimentary feedback to users about progress of query execution. This feedback is limited to the query optimizer generated execution plan and its cost, as well as the number of tuples returned by the query during its execution. Beyond this, to the best of our knowledge, there is no prior published work on the problem of progress estimation for SQL query execution.

The most useful measure of progress would report to the user, at any point during the query's execution, the amount of time required for the query to complete execution. However, any method that provides such a measure would be subject to uncertainty arising from concurrent execution of other queries. Due to this difficulty, we focus on the problem of estimating the percentage remaining (or equivalently completed) of the query, at any point during its execution, i.e., reporting a progress bar for query execution.
Such an estimator is simpler than estimating time remaining since it is independent of other queries. In effect, this measure estimates the time remaining on an isolated system where only the given query is executing.

Effective progress estimation for query execution requires us to accurately estimate the total work required to execute the query. Queries in modern database systems are quite complex, involving joins, nested sub-queries and aggregation. Any measure of work for a query that is independent of the intermediate cardinalities of such operators is likely to be too simplistic. For example, consider a metric that reports progress as the percentage of query results that have been returned thus far. Let us assume that we could accurately estimate the total number of rows that a query will return in its result. To see why such a metric for progress could be really inaccurate, consider an execution plan consisting of a very expensive join followed by an inexpensive Sort operation. Since Sort is a blocking operation, query results are not returned until the Sort starts outputting rows. Therefore, until such time, the above metric would report no progress irrespective of how much work was done in the join. As another illustration of why the problem is difficult, consider a metric that reports the percentage of nodes (i.e., operators) in the execution plan that have completed. However, if a query is just a single pipeline of operators, for almost the entire duration of the execution of the query, all the operators in the plan are active, i.e., not yet completed. Thus the above metric will not report any progress until near the very end of query execution.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD 2004, June 13-18, 2004, Paris, France. Copyright 2004 ACM /04/06 $5.00.
We note that a query optimizer already uses a model of work done by a query (based on estimated CPU and I/O costs). While leveraging this model for progress estimation may be possible, in this paper we ask whether an even simpler model would suffice for the purposes of progress estimation. The motivation for this simpler model is the ease of incorporation into existing query execution engines. We model work done by a query as a function of the number of rows output by each operator in the query execution plan. While this model does inherit the known difficulties of cardinality estimation faced by a query optimizer, we use two key ideas to help mitigate the impact of inaccurate cardinality estimation on progress estimation. First, we observe that it is possible to estimate the cardinalities of certain operators (e.g., Table Scans or Index Scans), which we refer to as driver nodes (formally defined in Section 2), much more accurately than other intermediate nodes in a pipeline, e.g., a Filter or Hash Join. We show that in many cases estimating the overall query progress by only monitoring progress of these driver nodes can greatly improve accuracy. Second, during query execution we leverage runtime execution information to refine cardinality estimation. We take a conservative approach (based on maintaining and refining upper and lower bounds on cardinalities of operators in the plan) that is guaranteed not to introduce additional inaccuracies as a result of such refinement.

Our solution is applicable to arbitrary SQL queries and can be implemented at low overhead in existing database systems. We have implemented our techniques inside Microsoft SQL Server and the initial results of experimentally evaluating our estimator on the TPC-H benchmark [2] queries (10 GB version on uniform as well as skewed data distributions) are promising.

(Footnote: Work done while author was visiting Microsoft Research.)

The rest of the paper is structured as follows. Section 2 describes the problem and presents our model of work done by a query. Given this model, we propose in Section 3 an estimator for the progress of a query whose execution consists of a single pipeline. Section 4 presents our solution for the general case of a query involving multiple pipelines. Experimental validation of our prototype on decision support queries is presented in Section 5. Section 6 discusses the desirable property of monotonicity of progress estimation and its relationship to accuracy of estimation. We present extensions to our model of work to be more robust to runtime conditions in Section 7, and discuss related work in Section 8. We conclude with a brief discussion on interesting areas of future work.

2. PROBLEM DESCRIPTION

2.1 Definitions

A progress estimator uses an execution plan that is chosen by the query optimizer for the given query. An execution plan is a tree where the nodes of the tree are physical operators. For example, Figure 1 shows the execution plan for a query.

[Figure 1. Example of an execution plan for a query.]

A physical operator is referred to as a blocking operator if it does not produce any outputs until it has consumed at least one of its inputs completely. For example, suppose Table Scan A with Filter is the build relation of the Hash Join and Index Scan B is the probe relation. The Hash Join operator in Figure 1 is blocking since it must consume all rows from the build relation before it produces any output. Another example of a common blocking operator is Sort.

The overall execution of a query is staged into multiple pipelines. We now define the notion of pipelines for an execution plan consisting of common physical operators such as Table Scan, Index Scan, Index Seek, Filter, Hash Join, Merge-Join, Index Nested Loops (INL) Join, NL Join, Group-By (Hash-based) and Sort. The definition is procedural and proceeds inductively in a bottom up manner over the nodes of an execution plan. A leaf node of the plan (Table Scan, Index Scan, Index Seek) starts a pipeline. A Filter node is part of the pipeline that its child operator belongs to. For a Hash Join, the join operator is included in the pipeline of the probe child, and the build child is the root of another pipeline. For a Merge-Join, the pipelines containing its children and the Merge-Join operator itself are unioned to create a single pipeline. For a Nested Loops or Index Nested Loops Join operator, the join operator and its entire inner subtree are part of the same pipeline as the outer child node. Both Sort and Group-By (hash-based) operators, which are blocking, start a new pipeline of their own. For the example in Figure 1, the pipelines are: P1 = {Table Scan A, Filter}, P2 = {Index Scan B, Hash Join, Index Nested Loops, Index Seek C}. In principle the above definition can be extended to other physical operators as well. Thus, intuitively, a pipeline can be thought of as a maximal subtree of concurrently executing operators.

Every pipeline has a set of driver nodes, i.e., operators that are the sources of tuples operated upon by the remaining nodes in the pipeline. More precisely, we define the driver nodes of a pipeline as the set of all leaf nodes of the pipeline, except those that are in the inner subtree of a Nested Loops / Index Nested Loops join. For example, in Figure 1, the shaded nodes are driver nodes: Table Scan A is the driver node for pipeline P1 and Index Scan B is the driver node for pipeline P2. Note that Index Seek C is not a driver node since it is a leaf node of the inner subtree of an Index Nested Loops Join. We observe that it is possible for a pipeline to contain more than one driver node, e.g., in a Merge-Join of two sorted relations, both the input relations to the Merge-Join are driver nodes.

[Figure 2. Execution plan with Sort-Merge Join.]

This is illustrated in Figure 2. The pipelines identified for this query would be P1 = {Table Scan A}, P2 = {Index Scan B}, P3 = {Sort A, Sort B, Merge Join, Index Nested Loops, Index Seek C}, and the driver nodes (the shaded nodes) would be respectively {Table Scan A}, {Index Scan B}, {Sort A, Sort B}. Thus there are two driver nodes for the last pipeline. We note that unlike a Hash Join, for a Sort-Merge Join, the scans of both inputs do not necessarily need to complete for the Sort-Merge Join to complete.

An execution plan can be viewed as a partial order of pipelines since, in general, for certain pipelines to start executing, one or more other pipelines need to complete. For example in Figure 1, execution of P1 must precede P2. Similarly in Figure 2, execution of P1 and P2 must precede P3, but the order between P1 and P2 is arbitrary.

2.2 Desirable Properties of a Progress Estimator

Accuracy: The estimated percentage of work completed by the query at any point during its execution should be close to the actual percentage of work completed by the query at that point.

Fine granularity: It follows from the above accuracy requirement that the estimator should be able to provide estimates at sufficiently fine granularity over the duration of the query's execution. Thus, for example, an estimator that only provides accurate estimates at 0% and 100% completion would not be useful.

Low overhead: An essential requirement for a progress estimator to be practical is that it should impose low overhead on the actual execution of the query.

Leveraging feedback from execution: As query execution progresses, more information based on (intermediate) results of execution can become available. Ideally, an estimator should be able to take full advantage of such information.

Monotonicity: Since the actual execution of the query progresses monotonically, ideally, the estimated progress should also be monotonically increasing from the start of query execution to its finish.

We observe that in today's database systems feedback on query progress during execution does not satisfy one or more of the above requirements. While the optimizer estimated cost of a query can be obtained at low overhead and progress estimation based on this cost is trivially monotonic (since the estimated cost does not change over the lifetime of the query's execution), it can potentially be inaccurate and it does not leverage any feedback from execution. Similarly, the number of tuples returned by a query during its execution (while low overhead and monotonic) has the major drawback that it can be inaccurate and lacking in granularity, as illustrated in the introduction. Moreover, it only takes limited advantage of execution feedback. Finally, we note that in general, there is a trade-off between guaranteeing monotonicity and achieving accuracy of progress estimation (we discuss this further in Section 6).

2.3 The GetNext() Model of Work

As described in the introduction, our goal is to estimate progress of a query on an isolated system, i.e., on a system where there is no other activity besides the execution of this query. Any progress estimator requires a model of work done by a query as the basis of its estimation.
In this section we present such a model of work. One approach for modeling the work done by a query could have been to use the cost model used by query optimizers for comparing different execution plans for a query. Query optimizers typically model the work done by the query as a function of CPU, random I/O and sequential I/O costs. Thus, to use such a model for progress estimation, we would need to measure the CPU, random and sequential I/Os performed by the query during its execution. In this paper we investigate whether an even simpler model of work would be adequate for the purposes of progress estimation. The main motivation for a simpler model is the ease with which it can be incorporated into today's database systems. The reason we expect that a simpler model may be adequate for progress estimation is that, unlike the query optimizer, which needs to distinguish between multiple plans for a given query using its cost model, we only need to be able to estimate the percentage of work done for a given query execution plan.

We note that operators in a query execution plan are typically implemented using a demand driven iterator model [5], where each physical operator in the execution plan exports a standard interface for query processing (including Open(), Close() and GetNext()). We propose to model the work done by a query as the total number of GetNext() calls issued throughout the duration of the query's execution over all operators in the execution plan. In essence, we are counting each GetNext() call as a primitive operation of query processing and modeling the total work done by the query by the total number of GetNext() calls. Note that all CPU instructions, I/Os etc. performed by the query occur as a result of GetNext() calls. Thus, this model assumes that the total time required to execute the query is amortized across multiple GetNext() calls, and therefore the percentage of GetNext() calls done thus far is a good estimator of the time taken by the query (on an isolated system). It should be noted that the GetNext() model of work is inadequate for the purposes of query optimization.
As a simple example of why this is the case, consider two plans for the same query: one involving a non-clustered index Index Seek and another involving a Table Scan. With the above GetNext() model of work, the Index Seek would always be considered cheaper (i.e., less work) by the query optimizer since the number of rows it returns can never exceed that of the Table Scan.

Progress Estimation Based on GetNext() model

We now define progress estimation based on the GetNext() model of work. Suppose the execution plan has a total of $m$ operators. Let the total number of tuples that flow out of operator $Op_i$ (i.e., number of GetNext() calls invoked on that operator) at the end of query execution be $N_i$ ($i = 1..m$). At any point during query execution, let the number of tuples that have flowed out of every operator thus far be $K_i$ ($i = 1..m$). Thus, the ideal estimator under the GetNext() model of work (we call it gnm) would estimate progress at that point during the query's execution as:

$$\mathrm{gnm} = \frac{\sum_{i=1}^{m} K_i}{\sum_{i=1}^{m} N_i}$$

Note that while accurate $K_i$ values can be obtained as the query is executing, the exact $N_i$ values are available only at the end of query execution. Thus, the estimator gnm is not directly implementable as stated above since the $N_i$ are not known exactly while the query is executing. The key challenge for any progress estimator E that uses the above model of work is therefore to estimate the $N_i$ as accurately as possible while the query is executing. Note that the problem of estimating the number of GetNext() calls for an operator in the query execution plan is the cardinality estimation problem faced by query optimizers. The only difference is that unlike a query optimizer, which can only use pre-computed database statistics (e.g., histograms), the estimator E can potentially also observe feedback from query execution for use in its estimation.

We observe that the fine granularity requirement (see Section 2.2) should typically be satisfied by an estimator using the GetNext() model since for a long running query, a large number of GetNext() calls are made during its execution. Another desirable property of an estimator is small runtime overhead.
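The GetNext() model and the gnm formula above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `CountingOperator` wrapper, the toy two-operator plan, and the use of Python iterators as a stand-in for the Open()/GetNext()/Close() interface are all hypothetical and not part of the paper's implementation. The final cardinalities `n_final` are supplied by hand, which is exactly the information a real estimator does not have during execution.

```python
from typing import Iterable, Iterator, List


class CountingOperator:
    """Hypothetical operator wrapper: counts rows handed to the parent (K_i)."""

    def __init__(self, name: str, rows: Iterable) -> None:
        self.name = name
        self._rows: Iterator = iter(rows)
        self.k = 0  # K_i: number of rows output so far

    def __iter__(self) -> "CountingOperator":
        return self

    def __next__(self):
        # Plays the role of GetNext(); StopIteration signals end of input.
        row = next(self._rows)
        self.k += 1
        return row


def gnm(k: List[int], n: List[int]) -> float:
    """Ideal estimator: sum of K_i divided by sum of N_i."""
    total = sum(n)
    return sum(k) / total if total > 0 else 0.0


# Toy single pipeline: Table Scan (10 rows) -> Filter (keeps even values).
scan = CountingOperator("Table Scan", range(10))
filt = CountingOperator("Filter", (r for r in scan if r % 2 == 0))

# Pull two rows from the root of the plan, then inspect the counters.
next(filt)
next(filt)

k_now = [scan.k, filt.k]      # rows output so far by each operator
n_final = [10, 5]             # exact totals, known only after execution
progress = gnm(k_now, n_final)
print(k_now, round(progress, 3))
```

In this toy run the scan has output three rows and the filter two, so gnm reports 5/15 of the work as done even though only two of the five result rows have been returned.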
For example, an estimator that actually executes the query in order to obtain the total number of GetNext() calls ($\sum_i N_i$) would be unacceptable. Thus, we require that the information used by any estimator be limited to a small amount of aggregated information, either in the form of pre-computed database statistics or statistics computed on observed feedback from query execution. Although this restriction by itself is not sufficient to guarantee low overhead, it appears to be necessary for an estimator to be practical.

The estimator that we present in this paper uses feedback from query execution (see Section 4) to refine estimates of $N_i$. Observe that since each $K_i$ is monotonically increasing as the query executes, the monotonicity of the estimator depends on how the estimates of $N_i$ are changed by the estimator as the query executes. We comment on the monotonicity property of our estimator based on the GetNext() model in Section 6.

We note a couple of additional properties of the GetNext() model of work: (1) It can be applied to modern database systems since they typically employ a demand driven iterator model for query execution. (2) It has the property that it is invariant across multiple runs of the same query.

3. DRIVER NODE ESTIMATOR: SINGLE PIPELINE QUERIES

In this section, we outline our solution for the progress estimation problem for the class of queries that consist of a single execution pipeline. We show how our solution extends to the general class of arbitrary query execution plans (consisting of multiple pipelines) in Section 4. For simplicity, we consider a query whose execution plan is a single pipeline consisting of a chain of (non-blocking) operators $Op_1 \rightarrow Op_2 \rightarrow \ldots \rightarrow Op_m$ and having a single operator $Op_1$ as its driver node (see Section 2.1 for the definition of a driver node). Typically, such a pipeline consists of a single driver node (e.g., Table Scan or Index Scan) followed by a sequence of non-blocking operators such as Filter and Index Nested Loops (INL) join.

As described earlier, the key challenge for any estimator using the GetNext() model (i.e., trying to estimate gnm) is to accurately estimate $\sum_i N_i$, the total number of GetNext() calls that will be performed over all nodes in the query. In an ideal world, the optimizer's estimates of $N_i$ (and hence the progress estimator, which can use such estimates) would be accurate.
But cardinality estimation usually involves simplifying assumptions (particularly on the correlation between data values) and consequently is prone to estimation errors. For example, it is known that estimation errors propagate exponentially as a function of the number of joins in the query [8]. Our focus in this paper is not on developing techniques for better cardinality estimation for the purpose of query optimization. Rather, we develop additional techniques that could mitigate the impact of errors in cardinality estimation on progress estimation.

Our estimator (called the Driver Node Estimator, dne for short) for single pipeline queries having exactly one driver node is defined as:

$$\mathrm{dne} = \frac{K_1}{N_1}$$

where $K_1$ is the number of GetNext() calls done on the driver node of the pipeline, $Op_1$, thus far, and $N_1$ is the estimated total number of GetNext() calls for $Op_1$. Therefore, underlying dne is the hypothesis (we refer to it as the driver node hypothesis) that overall query progress can be estimated by the progress of only the driver node of the pipeline, i.e.:

$$\frac{K_1}{N_1} \approx \frac{\sum_{i=1}^{m} K_i}{\sum_{i=1}^{m} N_i}$$

There are a few important reasons why the estimator dne can work well in practice. First, note that inaccuracies in gnm arise due to inaccurate $N_i$ estimates. Since a driver node in a pipeline is the source of tuples that are operated upon by other nodes in the pipeline, prior to the start of execution of that pipeline, the cardinality of the driver node is typically known accurately. For example, for many pipelines, driver nodes are typically Table Scans or Index Scans, and the estimates of $N_1$ for such driver nodes can be obtained (almost exactly) from the database system catalogs. While the estimates may not be as accurate in the case of the driver node being an Index Seek operator, any histograms on the predicate columns can be leveraged. In such cases, the estimate of $N_1$ for the driver node can still be quite accurate. On the other hand, estimates of $N_i$ for a Filter node that references a UDF, or for a Nested Loops Join node, are usually more inaccurate due to the inherent difficulties in selectivity estimation and errors in propagation to intermediate nodes [8].
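A small numeric sketch may help make the first argument concrete. The numbers below are invented for illustration, and the function names and the clamping of dne to 1.0 are assumptions of this sketch rather than anything stated in the paper. The pipeline is a Table Scan feeding a Filter whose output cardinality the optimizer overestimates by a wide margin.

```python
from typing import Sequence


def gnm(k: Sequence[float], n: Sequence[float]) -> float:
    """GetNext-model progress: sum of K_i over sum of (estimated) N_i."""
    total = sum(n)
    return sum(k) / total if total > 0 else 0.0


def dne(k_driver: float, n_driver: float) -> float:
    """Driver node estimator K_1 / N_1 (clamped to 1.0; the clamp is an assumption)."""
    if n_driver <= 0:
        return 0.0
    return min(k_driver / n_driver, 1.0)


# Hypothetical snapshot of a Table Scan -> Filter pipeline.
k = [400_000, 4_000]                 # rows output so far by scan and filter
n_true = [1_000_000, 10_000]         # exact totals, unknown during execution
n_optimizer = [1_000_000, 500_000]   # filter cardinality badly overestimated

ideal = gnm(k, n_true)                       # what gnm would report with exact N_i
with_bad_estimates = gnm(k, n_optimizer)     # gnm computed from optimizer estimates
driver_only = dne(k[0], n_optimizer[0])      # uses only the driver node's N_1

print(round(ideal, 3), round(with_bad_estimates, 3), round(driver_only, 3))
```

Here the driver-only estimate matches the ideal value because the scan cardinality is known from the catalog, while the all-operator formula inherits the error in the Filter estimate.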
Thus, using only driver nodes for progress estimation can often result in better accuracy. Second, when the cardinality of the driver node dominates the $N_i$ of other operators in the pipeline, we can expect the estimator dne to be close to gnm. This is not uncommon in decision support queries such as TPC-H [2], where the driver node cardinalities are large (e.g., large Table/Index Scans), and where operators such as Filter and Group-By can greatly reduce the cardinality of non-driver nodes. Third, observe that the driver node hypothesis implies:

$$\frac{\sum_{i=1}^{m} K_i}{K_1} \approx \frac{\sum_{i=1}^{m} N_i}{N_1}$$

where $\sum_i K_i / K_1$ can be thought of as the work done per tuple output by the driver node. Therefore, when the total number of GetNext() calls made over all nodes in the pipeline per driver node tuple does not vary significantly over the lifetime of the pipeline, monitoring progress of only the driver node is sufficient. Although this condition does not hold for arbitrary pipelines, we show below an important class of pipelines (in which the output cardinality of each operator is no larger than its input cardinality) for which dne still yields progress estimates that are within a constant factor of gnm. Finally, since a driver node is the source of tuples processed by other operators in the pipeline, it typically provides sufficiently fine granularity of progress estimation. This property may not hold in general for other operators (such as Filter or NL Join) in a pipeline, since their cardinalities may be arbitrarily small.

Guarantee of dne for monotonically decreasing pipelines: We discuss an important class of single pipeline queries where dne is guaranteed to be accurate within a constant factor of gnm (the constant factor is the number of operators in the pipeline). Consider pipelines having the logical property that no operator in the pipeline can increase its incoming cardinality. Thus, at any point during the query's execution, $K_i \ge K_{i+1}$ and $N_i \ge N_{i+1}$. We refer to such a pipeline as a monotonically decreasing pipeline. Some of the common physical operators that could be part of a monotonically decreasing pipeline are Table Scan, Filter and streaming aggregate operators. INL Join would also satisfy the above property when the join looks up a key value (i.e., a foreign key to key join).

Claim: For a monotonically decreasing pipeline with $m$ operators, the estimator dne is guaranteed to be accurate within a constant factor of the ideal estimator gnm, i.e.,

$$\frac{\mathrm{gnm}}{m} \le \mathrm{dne} \le m \cdot \mathrm{gnm}$$

Proof: See Appendix A.1.

Note that for the case of a single pipeline consisting of a Table Scan, Filter and aggregation operator (similar to the class of queries studied in the online aggregation work, e.g., [7]), if the input tuples to such a pipeline P are read in random order, then the driver node hypothesis will hold, i.e., the expected value of $K_1 / N_1$ is $\sum_{P} K_i / \sum_{P} N_i$ for that pipeline.

Finally, for the case of a single pipeline that is not monotonically decreasing, the above guarantee does not hold, and dne is a heuristic. Intuitively, if an intermediate operator (e.g., a non foreign-key Nested Loops Join) can increase its incoming cardinality arbitrarily, then the distribution of work done in the pipeline can be skewed so that progress at the driver node may not be indicative of overall progress of the pipeline.

4. SOLUTION FOR GENERAL CASE

In this section, we extend our solution to an arbitrary SQL query execution plan that consists of multiple pipelines. As described in Section 2.1, we model an arbitrary execution plan as a partial order of pipelines and extend the idea of Section 3 of using only driver nodes for each currently executing pipeline. Therefore, in our approach, the key issues are: (1) explaining how to use the Driver Node Estimator (dne) for a single pipeline to obtain an overall progress estimate for the entire query execution plan (Section 4.1), and (2) initializing and refining the cardinality estimates based on feedback from query execution (Section 4.2).

4.1 Estimator for Arbitrary Query Execution Plan

As per our definition of gnm (Section 2.3), for a query execution plan with $s$ pipelines $P_1, \ldots, P_s$, our estimator for the entire query can be rewritten equivalently as follows:

$$\mathrm{gnm} = \frac{\sum_{P_1} K_i + \sum_{P_2} K_i + \cdots + \sum_{P_s} K_i}{\sum_{P_1} N_i + \sum_{P_2} N_i + \cdots + \sum_{P_s} N_i}$$

where each summation term denotes the sum over all nodes in the corresponding pipeline.
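The per-pipeline form of gnm suggests how an overall estimate might be assembled once each pipeline's denominator term is supplied. The following is a minimal sketch, not the paper's implementation. It assumes the treatment of the three pipeline states that this section describes (observed counts for completed pipelines, $\sum K_i / \mathrm{dne}$ for the running pipeline, optimizer estimates for pipelines not yet started). It further assumes a single driver node per pipeline and a fallback to the optimizer estimate before the driver has produced any rows. The `Pipeline` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    COMPLETED = 1
    RUNNING = 2
    NOT_STARTED = 3


@dataclass
class Pipeline:
    state: State
    k_sum: float            # sum of K_i over all operators in the pipeline
    n_est_sum: float        # optimizer's estimate of the sum of N_i
    driver_k: float = 0.0   # K_1 of the (single) driver node
    driver_n: float = 0.0   # N_1 of the (single) driver node


def estimated_total(p: Pipeline) -> float:
    """Estimate of the sum of N_i for one pipeline, depending on its state."""
    if p.state is State.COMPLETED:
        return p.k_sum                      # exact: all counts are final
    if p.state is State.RUNNING and p.driver_k > 0 and p.driver_n > 0:
        dne = p.driver_k / p.driver_n
        return p.k_sum / dne                # driver node hypothesis
    return p.n_est_sum                      # not started (or no driver rows yet)


def query_progress(pipelines: List[Pipeline]) -> float:
    done = sum(p.k_sum for p in pipelines)
    total = sum(estimated_total(p) for p in pipelines)
    return done / total if total > 0 else 0.0


plan = [
    Pipeline(State.COMPLETED, k_sum=1_000, n_est_sum=900),
    Pipeline(State.RUNNING, k_sum=600, n_est_sum=2_000, driver_k=500, driver_n=1_000),
    Pipeline(State.NOT_STARTED, k_sum=0, n_est_sum=800),
]
progress = query_progress(plan)
print(round(progress, 3))
```

In this hypothetical snapshot the running pipeline is half way through its driver node, so its total is extrapolated to 1,200 calls, and the overall estimate is 1,600 / 3,000.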
As discussed previously, the key challenge for a pipeline $P_j$ is estimating $\sum_{P_j} N_i$ for that pipeline. We note that in a query execution plan that involves multiple pipelines, each pipeline must be in one of the following states: (a) Completed. (b) Currently executing. (c) Not yet started executing. For any pipeline that has completed execution we have the exact values of the number of GetNext() calls done on all operators in that pipeline, and thus $\sum_{P_j} N_i = \sum_{P_j} K_i$ for such a pipeline. For a currently executing pipeline, we use dne to estimate $\sum_{P_j} N_i$. Specifically, it follows directly from the driver node hypothesis that $\sum_{P_j} N_i = \sum_{P_j} K_i / \mathrm{dne}$. For a pipeline that has not yet started executing, $\sum_{P_j} K_i = 0$, and we use the optimizer's estimates for $\sum_{P_j} N_i$. In fact, it is for this case that we expect a significant opportunity to improve the estimate of $\sum_{P_j} N_i$ using feedback from query execution.

4.2 Exploiting Execution Feedback for Refining Estimates

A key challenge arises from estimating the cardinality of nodes of pipelines that start with intermediate blocking nodes, e.g., Sort nodes and hash based Group-By nodes. For nodes of such pipelines there is an opportunity to get better cardinality estimates by using feedback from query execution. Consider the query execution plan shown in Figure 3. Suppose A is the build relation and B the probe relation of the Hash Join. Following the definitions of Section 2.1, the pipelines for the query are P1 = {Table Scan A, Filter}, P2 = {Table Scan B, Hash Join}, P3 = {Group-By} and P4 = {Sort}, which are executed in the order P1, P2, P3, P4. The driver nodes for the query are shaded in the figure. To estimate the cardinality of the Sort operator (in pipeline P4), we would need to have accurate estimates on the filter, join and group-by operators of the preceding pipelines. This is in fact the traditional cardinality estimation problem and is error prone. Hence, our initial estimate of work done by the Sort could be inaccurate, potentially leading to overall incorrect progress estimation. Here, execution feedback can be leveraged to improve the estimate of the Sort node cardinality.
For example, when pipeline P2 completes, we in fact have the exact value of the cardinality of the result of the join. Similarly, when the Group-By completes, we have the exact cardinality (no uncertainty) of the input to the Sort. In the rest of this section, we describe a general framework for refining cardinality estimates of a given execution plan based on execution feedback.

[Figure 3. Execution plan with multiple blocking operators.]

While several techniques are possible, in this paper we follow a conservative approach that ensures that we never introduce any additional inaccuracies due to the refinement process. Thus, we refine the current estimate of $N_i$ for any node only if we are certain that the refinement will make the estimate more accurate. We achieve this as follows: for each node in the execution plan, we track two additional values $UB_i$ and $LB_i$, which are respectively the upper and lower bounds on the cardinality of the rows that can be output from that node. These bounds are based solely on the algebraic properties of the operator and observed cardinalities from execution, and are guaranteed to be actual bounds on $N_i$. In particular, this means that $LB_i \le N_i \le UB_i$. We adjust these lower and upper bounds as we get more information from query execution using the techniques described below. The invariant that we maintain at all times is that $LB_i \le$ current estimate of $N_i \le UB_i$, i.e., if we find that the current estimate of $N_i$ lies outside the bounds, then we correct its value to the appropriate bound. The effectiveness of such refinement based on bounds depends on how much and how quickly these bounds can be refined based on execution feedback.

When an upper (or lower) bound for a particular node is refined, this could potentially help refine the upper (or lower) bound of other nodes above it in the execution tree. We propagate these bounds using algebraic properties of operators. For example, in Figure 3, suppose that at some point in time T during the query's execution, we were able to conclude that the upper bound for the Hash Join can be reduced from 1 million rows to 0.5 million rows. Suppose the upper bounds for the Group-By and Sort nodes were 0.8 million rows. Then, based on the algebraic properties of Group-By and Sort nodes, we can also conclude that each of their upper bounds cannot exceed 0.5 million rows. The lowering of the upper bound could help refine the estimates of $N_i$ at one or both of these nodes at time T. Note that although dne uses only the driver node cardinalities for the currently executing pipeline, it is necessary to refine cardinalities of all nodes in the pipeline, since this could influence the estimates for nodes in a pipeline that is yet to start executing. In our implementation, we propagate bounds a few times per second (at roughly the granularity at which feedback is necessary to the user/application).

Refining lower and upper bounds

The refinement of lower and upper bounds for an operator $Op_i$ at query execution time uses the following information: (1) The observed input and output cardinalities of the operator (i.e., the $K_i$ of the operator as well as of its input operators). (2) Algebraic properties of the operator. For example, for Filter and Group-By operators, we know that the cardinality cannot exceed its input cardinality. (3) The current state of the operator.
This refers to the state of internal data structures used by the operator, for example, the current number of entries in the hash table of a Group-By operator.

For refining lower bounds, $K_i$ (the actual number of rows output from the operator thus far) is itself a correct lower bound for any operator. An example of where the algebraic property of the operator is useful for refining lower bounds is Sort. Since Sort has the property that it does not change its input cardinality, $K_{i-1}$ (i.e., the cardinality of the input operator to the Sort) is in fact a valid lower bound. Thus, the cardinality of the Sort operator (which is always the start of a new pipeline) can be refined while the previous pipeline is executing. As an example of where the current state of the operator is useful in refining lower bounds, consider the Group-By (hash based) operator. If we can count the number of distinct values observed during the operator's execution thus far (say d), then the lower bound can be refined to d at that point in time. This could be done, for example, by tracking the internal hash table used by the operator.

As far as the upper bound is concerned, for operators such as Filter and NL Join (foreign-key join), we can leverage their algebraic properties (the fact that they can never increase their input cardinality) and the $K_i$ to refine the upper bound to $(UB_{i-1} - K_{i-1}) + K_i$. Another example where algebraic properties help refine the upper bound is Sort, where $UB_{i-1}$ (i.e., the upper bound of the input to the Sort) is an upper bound for the Sort itself. An example of the use of current operator state for refining upper bounds is the Hash Join operator. Consider a Hash Join between two relations A (build side) and B (probe side). Assume A has already been hashed into buckets, and suppose S is the number of tuples in the largest bucket. We can exploit this information during the probe phase to obtain a tighter upper bound, since we know that each row from B can produce at most S tuples after the join. We refer the reader to Appendix B for details of how upper and lower bounds can be refined for certain common physical operators.
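The refinement rules described above can be sketched in a few lines of Python. This is a hedged illustration only: the `Node` record, the function names, and the clamping helper are hypothetical, the Hash Join rule uses the simple form "probe upper bound times S" taken directly from the text, and the full rule set of Appendix B is not reproduced. The example at the end replays the scenario in which the Hash Join upper bound drops from 1 million to 0.5 million rows.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    est: float             # current estimate of N_i
    lb: float = 0.0        # LB_i
    ub: float = math.inf   # UB_i
    k: int = 0             # K_i: rows output so far


def clamp(node: Node) -> None:
    """Maintain LB_i <= estimate <= UB_i; K_i is always a valid lower bound."""
    node.lb = max(node.lb, node.k)
    node.est = min(max(node.est, node.lb), node.ub)


def refine_filter(node: Node, child: Node) -> None:
    # Output so far plus at most one row per input row not yet seen.
    node.ub = min(node.ub, (child.ub - child.k) + node.k)
    clamp(node)


def refine_sort(node: Node, child: Node) -> None:
    # Sort preserves cardinality: input K is a lower bound, input UB an upper bound.
    node.lb = max(node.lb, child.k)
    node.ub = min(node.ub, child.ub)
    clamp(node)


def refine_group_by(node: Node, child: Node, distinct_seen: int) -> None:
    # At least the distinct groups seen so far, at most the input cardinality.
    node.lb = max(node.lb, distinct_seen)
    node.ub = min(node.ub, child.ub)
    clamp(node)


def refine_hash_join_probe(node: Node, probe: Node, largest_bucket: int) -> None:
    # Each probe row can match at most `largest_bucket` build rows.
    node.ub = min(node.ub, probe.ub * largest_bucket)
    clamp(node)


# Scenario from the text: Hash Join UB is lowered to 0.5 million rows.
join = Node(est=900_000, ub=1_000_000)
group_by = Node(est=700_000, ub=800_000)
sort = Node(est=700_000, ub=800_000)

join.ub = 500_000
clamp(join)
refine_group_by(group_by, join, distinct_seen=1_200)
refine_sort(sort, group_by)
print(join.est, group_by.ub, sort.ub, sort.est)
```

Because every rule only ever raises a lower bound or lowers an upper bound, applying the rules repeatedly in bottom-up order cannot loosen a bound, which matches the conservative intent described above.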
In the future, we intend to explore the applicability of other rules that can yield tighter bounds based on execution feedback.

We observe that whenever an operator terminates, we know exactly the upper and lower bounds of that operator (which are identical at that point). Thus, e.g., for the query plan in Figure 3, when the final pipeline (P4) starts executing, we know exactly the cardinality of its driver node (the Sort node). In general, when a pipeline starts executing, we know exactly the cardinality of its driver nodes.

In our experiments on TPC-H queries, we have found that both lower and upper bounds help refine the $N_i$ of certain driver nodes of upper level pipelines significantly (e.g., by three orders of magnitude for Q2) when the optimizer underestimates the cardinality, e.g., of a Sort node. Interestingly, the impact of these refinements on the overall estimation errors (see Section 5) is typically much smaller (a few percent). This is because in these queries, the $N_i$ of driver nodes such as Table/Index Scans dominate the $N_i$ of other nodes. A more thorough evaluation of the effectiveness of these bounding techniques on other data sets/queries is part of our ongoing work.

Finally, we note that other techniques to leverage information from query execution are possible. Examples include online estimation techniques based on observing intermediate results as in [7], refining statistics based on observed query results [1,11], and reinvoking the optimizer for cardinality estimates based on observed cardinalities, similar to [4]. Exploiting these ideas to augment our techniques is an important area of future work.

5. IMPLEMENTATION AND EXPERIMENTAL EVALUATION

In this section, we first describe the implementation of our solution for estimating progress of SQL queries inside Microsoft SQL Server. We follow this with the results of an experimental evaluation of our solution for long running decision support queries on both the TPC-H benchmark [2] as well as an internal customer database.

5.1 Implementation

Our implementation inside Microsoft SQL Server consists of the following simple extensions to the existing query execution engine.
We augment the data structure corresponding to a node in the query execution plan with counters for K_i (number of rows output by the node thus far), N_i (current estimate of the total number of rows that will be output by the node at completion), and UB and LB (upper and lower bounds respectively of the number of rows that can be output by the node). After the query is optimized and an execution plan tree P has been generated for the query, we identify pipelines in P and the driver node(s) for each pipeline. We initialize N_i for each node to the optimizer estimated cardinality (for leaf-level nodes such as Table/Index Scan this is the cardinality of the base table/index). We update and propagate the values of UB, LB and N_i for nodes using the K_i's and algebraic properties of operators as described in Section 4.2.

For convenience of collecting the progress information of an executing query, we implement a background thread that wakes up periodically (approximately 4 times a second), traverses P, computes the progress, and logs the progress estimate and a timestamp to a file. The overheads of gathering this information at runtime are negligible relative to the execution time of the queries we considered. In general, we would expect that database servers will extend interfaces (e.g., via system stored procedures or functions) to allow clients to programmatically access progress information for an executing query by polling the server.

5.2 Experiments
Goal: The goals of the experiments are to: (1) evaluate the accuracy of our estimator (which is based on the GetNext() model of work presented in Section 2) on a set of long running and complex decision support queries; (2) evaluate the robustness of our estimator when data skew is varied; and (3) validate the driver node hypothesis for progress estimation of currently executing pipelines.

Setup: We conducted the experiments on a machine with a 2.8GHz CPU and 512 MB RAM.

Databases: We ran the experiments on the TPC-H 10GB database [2]. We chose the 10GB configuration because the queries are truly long running (typically 10s of minutes). For the evaluation with varying skew, we generated a TPC-H 10GB database with a Zipfian skew factor of 2 using the publicly available tool [3]. We also ran queries from a real data warehouse application used within the company to analyze sales (we refer to this as the SALES database, approx. 5GB in size).

Queries: For TPC-H we evaluate all the queries defined in the benchmark. We report numbers for all the long running queries in the benchmark (those that reference the lineitem table).
For TPC-H queries, the joins are typically foreign-key joins, and thus most pipelines exhibit the property of being monotonically decreasing (Section 3). For the SALES database, we picked a few queries for evaluation. The queries against the SALES database are aggregation queries that are joins of 7-10 tables, and have 8-10 grouping columns. The joins are non foreign-key joins, thus the property of monotonically decreasing pipelines does not hold.

Evaluation Metric: Our experiments are conducted a single query at a time, and on a machine on which only the database server is executing. In this setting, we expect the percentage of work completed reported by any scheme to be a good estimator of the percentage of time taken by the query. As described in Section 5.1 above, we record the fraction complete predicted by our solution at regular intervals throughout query execution. Assume the query starts executing at time t_0. Let f_i be the percentage of the query completed as reported by our estimator at time t_i (i > 0, t_i > t_{i-1}). Let t_n be the time at which the query completes. Then, at any point in time t_i, an estimator that has perfect knowledge of the future would report the actual percentage of the query completed as 100 (t_i - t_0)/(t_n - t_0). Thus, we define the estimation error of an estimator at time t_i (denoted by e_i) as:

$$e_i = \left| f_i - \frac{100\,(t_i - t_0)}{t_n - t_0} \right|$$

Note also that since we take the absolute value of the difference, we do not distinguish between underestimates and overestimates. We report the overall estimation error for a query using three aggregate measures over all the e_i's collected for the query: the average, standard deviation and max over all e_i's.

TPC-H Benchmark Queries
The goal of this experiment is to evaluate the accuracy of our progress estimator (see Section 4.1), which is based on the GetNext() model of work. We evaluate the estimator on complex decision support queries of the TPC-H benchmark [2] on the 10GB database.
Table 2 shows the mean and maximum error (as defined above) for several long running TPC-H queries for the uniform data distribution case (Z=0) as well as the skewed data distribution case (Z=2). As we see from the table, for the Z=0 case, the maximum error for any query does not exceed 10%, and the average error is small (typically below 5%). The standard deviation was also small (at or below 5% in all cases). One interesting observation is that expensive Sort nodes at the top of a query execution plan can potentially be problematic (as in Q5 for Z=0), particularly when the query optimizer overestimates the cardinality of the Sort node. In such cases it is difficult to rectify errors based on execution feedback until the lower pipeline (that feeds into the Sort node) is almost complete. Thus, the error induced by the optimizer's estimates persists for almost the entire duration of the query.

For the Z=2 case, the maximum and mean errors are higher for certain queries, e.g., Q8, Q18, and Q21. To understand the reasons for the errors better, refer to Figures 4 and 5, which show scatter plots of the actual percentage completed vs. estimated percentage completed for Q8 for Z=0 and Z=2 respectively. A perfect estimator would have all data points along the diagonal of the graph. For the Z=2 case (Figure 5), when the major pipeline in this query (involving a scan of the lineitem table followed by a Merge Join and a couple of Hash Join probes) starts, the estimates of the cardinalities of the joins used by our estimator are significantly overestimated. However, shortly after the pipeline starts executing, we estimate the cardinalities using dne (as explained in Section 4.2), which is based on the progress of the driver node (Scan of lineitem). This quickly reduces the estimation error, and explains the discontinuity in progress estimation around 20% actual completion. In general, until a pipeline starts executing, our estimator is more susceptible to errors in cardinality estimation. For the case of Z=0, the cardinality estimates of this pipeline are quite accurate, and therefore we see lower errors.
We observe similar behavior in queries Q8 and Q21. This experiment shows that our estimator (based on the GetNext() model of work) results in fairly robust progress estimation, even in the presence of skewed data distributions.
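The estimation error metric defined in the Evaluation Metric paragraph can be computed from the logged (timestamp, estimate) pairs roughly as follows. This is a minimal sketch: the function names and the sample format are assumptions, and the population standard deviation is used because the text does not say which variant was reported.

```python
from statistics import mean, pstdev


def estimation_errors(samples, t0, tn):
    """Return e_i = |f_i - 100 (t_i - t0) / (tn - t0)| for each sample.

    samples: iterable of (t_i, f_i) pairs, where f_i is the estimated
    percentage completed that was logged at time t_i.
    """
    duration = tn - t0
    return [abs(f - 100.0 * (t - t0) / duration) for t, f in samples]


def summarize(errors):
    """Aggregate the e_i's into (average, standard deviation, max)."""
    return mean(errors), pstdev(errors), max(errors)


# Hypothetical log of three progress readings for a 100-second query.
log = [(25.0, 20.0), (50.0, 50.0), (75.0, 80.0)]
errors = estimation_errors(log, t0=0.0, tn=100.0)
avg_err, std_err, max_err = summarize(errors)
```

Because the absolute value is taken, an estimate that is 5 points too low and one that is 5 points too high contribute equally to the average.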

Table 2. Estimation Errors: TPC-H Benchmark Queries (10 GB database), Uniform and Skewed Data Sets

         Estimation Error (Z=0)    Estimation Error (Z=2)
Query    Mean      Max             Mean      Max
Q1       0.9%      2.8%            0.2%      0.5%
Q3       .%        2.0%            3.4%      4.7%
Q4       0.5%      .0%             0.6%      .4%
Q5       7.3%      9.0%            3.7%      5.4%
Q6       .2%       2.9%            2.8%      4.6%
Q7       2.3%      4.0%            3.8%      7.6%
Q8       0.8%      .7%             5.2%      6.2%
Q9       2.7%      4.9%            2.9%      8.3%
Q10      0.4%      .4%             .6%       4.4%
Q12      .0%       .7%             0.9%      3.8%
Q14      0.5%      .8%             .5%       3.2%
Q15      0.6%      .3%             .6%       4.4%
Q17      .7%       2.6%            0.7%      2.0%
Q18      5.9%      6.8%            4.2%      25.5%
Q19      0.5%      .5%             .8%       2.7%
Q20      3.0%      9.8%            3.7%      5.9%
Q21      0.9%      2.5%            5.7%      38.8%

[Plot "TPCH-10GB Query 8 (Z=0)": Estimated Pct. Completion vs. Actual Pct. Completion]
Figure 4. Scatter plot of actual vs. estimated percentage completed (TPC-H Q8), Uniform distribution

[Plot "TPCH-10GB Query 8 (Z=2)": Estimated Pct. Completion vs. Actual Pct. Completion]
Figure 5. Scatter plot of actual vs. estimated percentage completed (TPC-H Q8), Skewed distribution

Validation of Driver Node Hypothesis
In this experiment, we demonstrate the importance of estimating overall progress based only on the progress of driver nodes within a currently executing pipeline. We do this by comparing with an estimator that also uses the GetNext() model, but does not use dne (i.e., the driver node hypothesis) to estimate the cardinality of all nodes in the currently executing pipeline, and instead relies only on the optimizer estimated cardinalities. We show the results for TPC-H query Q9 against the 10 GB database with Zipfian skewed data (Z=2). The results for our estimator and for the estimator that uses only optimizer estimates (OPT) for currently executing pipelines are shown in Figures 6 and 7 respectively. The mean and max errors for our estimator are 2.9% and 8.3% respectively, whereas the errors for OPT are 23% and 47% respectively. The reason is that OPT, due to the inclusion of several join nodes (whose cardinality estimates are inaccurate), ends up with a significant overestimate of the actual work, which gets refined only near the very end of query execution (when the estimated completion jumps from 48% to 89%).
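The difference between the two estimators compared above can be sketched for a single executing pipeline. The refinement rule used here, scaling each observed K_i by the fraction of the driver node's input consumed so far, is an assumption of this sketch, since Section 4.2 is not reproduced in this part of the paper; the function names and the example cardinalities are likewise hypothetical.

```python
def gnm_progress(k, n):
    """GetNext() model: percentage done = 100 * sum(K_i) / sum(N_i)."""
    total = sum(n)
    return 100.0 * sum(k) / total if total else 0.0


def refine_with_driver(k, n_opt, driver=0):
    """Re-estimate N_i for non-driver nodes from the driver's progress.

    ASSUMPTION: N_i is extrapolated as K_i / (K_driver / N_driver).
    Before the driver has produced any rows, the optimizer estimates
    are returned unchanged.
    """
    fraction = k[driver] / n_opt[driver]
    if fraction == 0:
        return list(n_opt)
    return [n_opt[i] if i == driver else k[i] / fraction
            for i in range(len(k))]


# Hypothetical pipeline: a scan (driver, exactly 1000 rows) feeding a join
# whose optimizer estimate (10000 rows) is ten times too high.
k = [500, 500]          # rows output so far by scan and join
n_opt = [1000, 10000]   # optimizer estimated cardinalities

opt_estimate = gnm_progress(k, n_opt)                         # OPT
dne_estimate = gnm_progress(k, refine_with_driver(k, n_opt))  # driver-based
```

In this made-up example the scan is half done, so the driver-based estimate is 50%, while the OPT-style estimate stays near 9% because the overestimated join cardinality inflates the denominator.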
[Plot "TPCH-10GB Query 9 (Z=2)": Estimated Pct. Completion vs. Actual Pct. Completion]
Figure 6. Scatter plot of actual vs. estimated percentage completed (TPC-H Q9, Using Driver node hypothesis)

[Plot "TPCH-10GB Query 9 (Z=2)": Estimated Pct. Completion vs. Actual Pct. Completion]
Figure 7. Scatter plot of actual vs. estimated percentage completed (TPC-H Q9, Using only optimizer estimates)

Queries on SALES Database
In this experiment, we evaluate our estimator on complex decision support style queries from a real database application. As in TPC-H, we see that the mean estimation errors are quite low (around 10%) and the max errors are around 20% (see Table 3). This experiment shows that the accuracy of our estimator does not degrade appreciably for this set of real world queries that contain non foreign-key joins, and significant grouping and aggregation.

Table 3. Estimation Errors: SALES queries

         Estimation Error
Query    Mean      Max
Q1       7.%       7.3%
Q2       8.2%      6.9%
Q3       .6%       8.2%
Q4       9.3%      2.4%
Q5       7.0%      8.%

6. MONOTONICITY
As discussed in Section 2.2.1, monotonicity is a desirable property from a user's perspective. Consider a progress estimator that uses the GetNext() model of work (Section 2.3). Since the K_i values (observed cardinalities during execution) are monotonically increasing, the estimator will be monotonic provided any changes to the N_i values during query execution are monotonically decreasing. An estimator that has up-front knowledge of the exact number of GetNext() calls that will be made by each operator (i.e., the N_i values) can guarantee monotonicity, since it would never need to change N_i. However, for any other technique that can only estimate the value of the N_i, there is a trade-off between guaranteeing monotonicity and the accuracy of progress estimation.

One way to ensure monotonicity is to initially use a value for the estimated N_i that is much larger than the actual N_i, i.e., an upper bound. The problem with such an approach is that accuracy can suffer, since the actual N_i may be much smaller. For example, consider a query plan which performs a hash join of relations R_1 and R_2 and then sorts the result of the join. Note that obtaining a tight upper bound on the estimate of the Sort node cardinality can be problematic. If the join is a foreign-key join, then we know that an upper bound on the cardinality of the joined relation, and hence the Sort node, is the size of the table with the foreign key. However, for non foreign-key joins the upper bound can be a considerable overestimate of the actual N_i for the Sort node, and thus the accuracy of the estimator may be poor until most of the query has completed executing. Therefore, the real challenge is to find tight upper bounds so that the accuracy of the estimator is not significantly compromised.
Given the difficulty of guaranteeing a tight upper bound for intermediate driver nodes, a trade-off between monotonicity and the accuracy of progress estimation appears unavoidable. Thus, an interesting issue is whether users prefer more accurate estimates or estimates that are guaranteed to be monotonic. A possible approach for addressing this issue is to present both the estimated progress as well as the progress based on the upper bounds. Let the progress computed using upper bounds be p_1% and the corresponding one computed using estimates be p_2%. Then (p_1, p_2) as a pair of values would indicate to the user that the % done at any instant is not lower than p_1 and that our current best estimate is the value p_2. Note that p_1 is monotonic, whereas p_2 may not be.

We observe that for a single pipeline query, the estimator dne (Section 3) is monotonic, since N_i is known exactly and does not change during the execution of the pipeline (see Section 7 for runtime conditions that may cause monotonicity violations even for single pipeline queries). However, for the case of multi-pipeline queries, our estimator is not guaranteed to be monotonic. In particular, monotonicity violations can occur when a new pipeline starts executing, and we revise the optimizer estimates of N_i with the estimate based on dne (as described in Section 4.2).

For the queries against the TPC-H 10GB (Skew Z=2) data in our experiments, we computed progress estimates at regular intervals (approximately 4 times a second), and we measured: (a) the number of monotonicity violations, i.e., the number of times in which a progress estimate was less than the previous estimate, (b) the average % by which the estimate decreased and (c) the maximum % by which the estimate decreased. We observed monotonicity violations in five queries (Q7, Q8, Q9, Q20, Q21). Moreover, except for Q8 and Q20, there was only 1 violation in each of the other three queries. The maximum decrease in estimated progress across all queries was 8.3% (for Q21) and the average decrease for each query respectively was .4%, 0.7%, 4.9%, 0.0%, and 8.3%.
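The three monotonicity measures (a) to (c) described above can be computed from a sequence of logged progress estimates as in the following sketch. The function name and the sample values are illustrative only.

```python
def monotonicity_violations(estimates):
    """Return (count, average decrease, maximum decrease) in percent.

    A violation is any estimate that is lower than the one logged
    immediately before it.
    """
    drops = [prev - cur
             for prev, cur in zip(estimates, estimates[1:])
             if cur < prev]
    if not drops:
        return 0, 0.0, 0.0
    return len(drops), sum(drops) / len(drops), max(drops)


# Hypothetical progress log (percent done) with two backward steps.
log = [10.0, 20.0, 15.0, 30.0, 29.0, 50.0]
count, avg_drop, max_drop = monotonicity_violations(log)
```

For a query with a single violation, the average and maximum decrease coincide.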
One reason for the relatively few and small monotonicity violations is that in these queries the N_i's are dominated by the leaf-level driver nodes (scans of the lineitem and orders tables). Due to the filtering and aggregations performed in these queries, the actual N_i's of the upper-level nodes in the plan are usually much smaller. Thus, even in cases when the N_i's for the non-leaf driver nodes are initially underestimated, the magnitude of the monotonicity violations is small.

7. RUNTIME CONDITIONS
The model of work done by a query (see Section 2) makes the simplifying assumption that the actual work done by a call to GetNext() is the same across all operators in the plan, i.e., GetNext() calls from all operators are weighted equally. In general, this assumption does not hold, e.g., due to an expensive operator like a UDF in a Filter node, or because one Table Scan reads from a fast disk whereas another Table Scan reads from a slow disk. A possible way to extend the basic model of work to account for the different cost of GetNext() for different operators is to model the work as a weighted sum of the number of GetNext() calls done by operators in the plan. The weighting factor C_j associated with operator j is a relative measure of the work done by a GetNext() on that operator. Of course, this introduces an additional parameter (besides driver node cardinality) that needs to be estimated and refined. A possible solution is to start with uniform relative rates (i.e., C_j = 1 for all j) or use cost estimates made by the query optimizer, and then adjust the C_j values based on execution feedback. Modeling and computing per-tuple work for every pipeline could be an important factor in general, and developing techniques to address this issue is part of our ongoing work.

In the rest of this section we discuss an important special case of a runtime condition, spills of tuples to disk due to insufficient memory, and show how our estimator can be adapted to handle spills without introducing an additional weighting factor, by treating spill processing as a runtime pipeline.
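The weighted extension of the GetNext() model described above can be sketched as follows. The function name and the example weights are hypothetical; with all weights equal to 1 the formula reduces to the unweighted model.

```python
def weighted_progress(k, n, c=None):
    """Percentage done under a weighted GetNext() model.

    k[j]: GetNext() calls completed so far by operator j
    n[j]: estimated total GetNext() calls for operator j
    c[j]: relative cost of one GetNext() call on operator j
          (defaults to uniform weights, C_j = 1 for all j)
    """
    if c is None:
        c = [1.0] * len(k)
    done = sum(cj * kj for cj, kj in zip(c, k))
    total = sum(cj * nj for cj, nj in zip(c, n))
    return 100.0 * done / total if total else 0.0


# Hypothetical two-operator plan: the first operator has finished,
# the second (three times more expensive per row) has not started.
uniform = weighted_progress([100, 0], [100, 100])              # 50.0
weighted = weighted_progress([100, 0], [100, 100], [1.0, 3.0])  # 25.0
```

The example shows why the weights matter: treating all calls equally reports the plan as half done, whereas accounting for the more expensive second operator reports a quarter.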
Handling Spills: Spills of tuples to disk, which can occur as a result of insufficient memory, can result in more work that is not accounted for by our model of work, since it occurs within an operator. Consider a join between two relations A and B, where the optimizer picks a hybrid hash join operator. Hybrid hash proceeds by building a hash table of A in memory. During the scan of relation A, if the memory budget of the hash join is exhausted, then certain buckets will be spilled to disk. When the table B is used to probe the hash partitions, the tuples of B that hash to the buckets that are not memory resident are also written to disk. Bucket spilling is a runtime effect and hence it can be difficult to accurately estimate in advance the number of tuples that will be spilled to disk.

We observe that we can model the query execution as comprising two parts, one that processes the original relations and another that processes the spilled partitions. In other words, we can think of the original query as Q = (A join B) ∪ (A' join B'), where A' and B' denote the corresponding parts of relations A and B that have been spilled (0 ≤ |A'| ≤ |A|, 0 ≤ |B'| ≤ |B|). The driver nodes for query Q would include scans of A, B, A' and B'. Thus the total work for Q would be |A| + |B| + |A'| + |B'|. The main problem is that |A'| and |B'| cannot be predicted at optimization time.

The main idea behind our solution to the spill problem is as follows. Whenever a tuple is spilled to disk (either from relation A or B), the denominator value (which denotes the total work) is incremented by one (i.e., another GetNext() call). We are in essence adding more work to be done later, and the denominator value should reflect the estimated cardinality of the pipeline. Now, consider the point during execution when the first phase of hash processing is over and none of the spilled partitions have been processed. The modified estimator would have incremented the denominator counter for each tuple that had been spilled and would estimate the progress as (|A| + |B|) / (|A| + |B| + |A'| + |B'|), which is correct as it accounts for the remaining tuples to be processed. When the spilled partitions are re-read, the corresponding counts are added to the numerator, and only when all the partitions have been processed will the estimator report the progress as 100%. This correction to the estimator works because of the symmetry of spills, i.e., exactly the tuples that have been written to disk will be processed later.
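The spill correction described above amounts to a pair of counters. The class below is a minimal sketch with hypothetical names; it assumes the operator notifies the estimator once per tuple written to disk and once per GetNext() call, including calls that re-read spilled tuples.

```python
class SpillAwareProgress:
    """Progress counter whose denominator grows with every spilled tuple."""

    def __init__(self, estimated_total):
        self.done = 0                 # numerator: GetNext() calls so far
        self.total = estimated_total  # denominator: estimated total work

    def on_getnext(self):
        """A tuple was processed (first pass or re-read of a spilled tuple)."""
        self.done += 1

    def on_spill(self):
        """A tuple was written to disk and must be processed again later."""
        self.total += 1

    def percent(self):
        return 100.0 * self.done / self.total if self.total else 0.0


# Hypothetical hybrid hash join: |A| = 100, |B| = 200,
# of which |A'| = 30 and |B'| = 50 tuples are spilled.
progress = SpillAwareProgress(estimated_total=100 + 200)
for _ in range(300):
    progress.on_getnext()   # first phase scans all of A and B
for _ in range(30 + 50):
    progress.on_spill()     # spilled tuples add future work
after_first_phase = progress.percent()   # 300 / 380
for _ in range(30 + 50):
    progress.on_getnext()   # spilled partitions are re-read
after_spill_phase = progress.percent()   # 380 / 380
```

Because every spilled tuple is later re-read exactly once, the numerator catches up with the enlarged denominator only when the spilled partitions have been fully processed.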
It is also easy to see that this modification to the original algorithm would work for multiple recursion levels in a hash join pipeline. Finally, we note that spills could occur in other operators like hash-based Group-By or the merge phase in a Sort-Merge join if there are too many duplicates of a particular value. Thus, in general, a query can be considered as Q ∪ Q', where Q' accounts for the work done by the current query in handling data that is spilled.

The following experiment on TPC-H Q8 highlights the importance of handling spills. From Figure 8 we see that the progress estimator remains stuck on 44% for a relatively long time (more than 5% of total query execution time). This is because during this interval the query is writing and reading the spilled partitions, and the estimator does not capture this effect. On the other hand, when we enable spill handling as discussed above (see Figure 9), the estimator is more accurate.

[Plot "TPC-H 10 GB Query 8 (Z=0) Without Spill Handling": Estimated Percentage Completed vs. Actual Percentage Completed]
Figure 8. Scatter plot of actual vs. estimated percentage completed (TPC-H Q8, No Spill handling)

[Plot "TPC-H 10GB Query 8 (Z=0) With Spill Handling": Estimated Percentage Completed vs. Actual Percentage Completed]
Figure 9. Scatter plot of actual vs. estimated percentage completed (TPC-H Q8, With Spill handling)

8. RELATED WORK
There are two broad areas that are related to our work. The first is the area of estimating the cardinality of query expressions. Selectivity estimation, e.g., [9], plays a key role in enabling query optimizers to pick a suitable query execution plan. Our work leverages the query optimizer to provide an initial estimate of the cardinality of nodes in an execution plan. The second broad area that relates to this paper is the use of information gathered during query execution. One body of work, e.g., [2,4], uses feedback of observed cardinalities at runtime to potentially re-optimize the same query, pick among competing query plans or improve decisions on resource allocation for it.
In contrast, we use the observed cardinality of operators in the execution tree to improve the estimate of the total work that needs to be done, while leaving the query execution plan unchanged. We note that in principle, the techniques in [4] that collect statistics such as cardinalities/histograms etc. of intermediate query results can be adapted in our context for obtaining better estimates of the N_i's by reinvoking the query optimizer's cardinality estimation module at runtime with more accurate statistics. While this does require nontrivial extensions to today's query processing engines, it represents an interesting avenue of future work for progress estimation. Another use of runtime feedback is to refine statistics e.g., [,] that can be used for selectivity estimation for
