Scalable Exploration of Physical Database Design


Arnd Christian König
Microsoft Research

Shubha U. Nabar
Stanford University

Abstract

Physical database design is critical to the performance of a large-scale DBMS. The corresponding automated design tuning tools need to select the best physical design from a large set of candidate designs quickly. However, for large workloads, evaluating the cost of each query in the workload for every candidate does not scale. To overcome this, we present a novel comparison primitive that only evaluates a fraction of the workload and provides an accurate estimate of the likelihood of selecting correctly. We show how to use this primitive to construct accurate and scalable selection procedures. Furthermore, we address the issue of ensuring that the estimates are conservative, even for highly skewed cost distributions. The proposed techniques are evaluated through a prototype implementation inside a commercial physical design tool.

1 Introduction

The performance of applications running against enterprise database systems depends crucially on the physical database design chosen. To enable the exploration of potential designs, today's commercial database systems have incorporated APIs that allow "What-if" analysis [8]. These take as input a query Q and a database configuration C, and return the optimizer-estimated cost of executing Q if configuration C were present. This interface is the key to building tools for exploratory analysis as well as automated recommendation of physical database designs [13, 20, 1, 10].

The problem of physical design tuning is then defined as follows: a physical design tool receives a representative query workload WL and constraints on the configuration space to explore as input, and outputs a configuration in which executing WL has the least possible cost (as measured by the optimizer cost model) (footnote 1). To determine the best configuration, a number of candidate configurations are enumerated and then evaluated using "What-if" analysis.
The representative workload is typically obtained by tracing the queries that execute against a production system (using tools such as IBM Query Patroller, SQL Server Profiler, or ADDM [10]) over a representative period of time. Since a large number of SQL statements may execute in this time, the straightforward approach of comparing configurations by repeatedly invoking the query optimizer for each query in WL with every configuration is often not tractable [5, 20]. Our experience with a commercial physical design tool shows that a large part of the tool's overhead arises from repeated optimizer calls to evaluate large numbers of configuration/query combinations. Most of the research on such tools [13, 20, 1, 10, 7, 8] has focused on minimizing this overhead by evaluating only a few carefully chosen configurations. Our approach is orthogonal to this work and focuses on reducing the number of queries for which the optimizer calls are issued. Because the topic of which configurations to enumerate has been studied extensively, we will not comment further on the configuration space.

Footnote 1: Note that queries include both update and select statements. Thus this formulation models the trade-off between the improved performance of select-queries and the maintenance costs of additional indexes and views.

Current commercial tools address the issue of large workloads by compressing them up-front, i.e. initially selecting a subset of queries and then tuning only this smaller set [5, 20]. These approaches do not offer any guarantees on how the compression affects the likelihood of choosing the right configuration. However, such guarantees are important due to the significant overhead of changing the physical design and the performance impact of bad database design.

The problem we study in this paper is that of efficiently comparing two (or more) configurations for large workloads. Solving this problem is crucial for scalability of automated physical design tuning tools [20, 1, 10].
We provide probabilistic guarantees on the likelihood of correctly choosing the best configuration for a large workload from a given set of candidate configurations. Our approach is to sample from the workload and use statistical inference techniques to compute the probability of selecting correctly. The resulting probabilistic comparison primitive can be used (a) for fast interactive exploratory analysis of the configuration space, allowing the DB administrator to quickly find promising candidates for full evaluation, or (b) as the core comparison primitive inside an automated physical design tool, providing both scalability and locally good decisions with probabilistic guarantees on the accuracy of each comparison. Depending on the search strategy used, the latter can be extended to guarantees on the quality of the final result.

The key challenges to this probabilistic approach are twofold. First, the accuracy of the estimation depends critically on the variance of the estimator we use. The challenge is thus to pick an estimator with as little variance as possible. Second, sampling techniques rely on (a) the applicability of the Central Limit Theorem (CLT) [9] to derive confidence statements for the estimates and (b) the sample variance being a good estimate of the true variance of the underlying distribution. Unfortunately, both assumptions may not be valid in this scenario. Thus, there is a need for techniques to determine the applicability of the CLT for the given workload and set of configurations.

1.1 Our Contributions

We propose a new probabilistic comparison primitive that, given as input a workload WL, a set of configurations $\mathcal{C}$ and a target probability α, outputs the configuration with the lowest optimizer-estimated cost of executing WL with probability at or above the target probability α. It works by incrementally sampling queries from the original workload and computing the probability of selecting the best configuration with each new sample, stopping once the target probability is reached. Our work makes the following salient contributions:

- We derive probabilistic guarantees on the likelihood of selecting the best configuration (Section 4).
- We propose a modified sampling scheme that significantly reduces estimator variance by leveraging the fact that query costs exhibit some stability across configurations (Section 4.2).
- We show how to reduce the estimator variance further through a stratified sampling scheme that leverages commonality between queries (Section 5).
- Finally, we describe a novel technique to address the problem of highly skewed distributions in which the sample may not be representative of the overall distribution and/or the CLT may not apply for a given sample size (Section 6).

The remainder of the paper is organized as follows: In Section 2 we review related work. In Section 3 we give a formal description of the problem and introduce the necessary notation.
In Section 4 we describe two sampling schemes that can be used to estimate the probability of selecting the correct configuration. In Section 5 we then show how to reduce variances through the use of stratification and combine all parts into an efficient algorithm. In Section 6 we describe how to validate the assumptions on which the probabilistic guarantees described earlier are based. We evaluate our techniques experimentally in Section 7.

2 Related Work

The techniques developed in this paper are related to the field of statistical selection and ranking [15], which is concerned with the probabilistic ranking of systems in experimental setups based on a series of measurements from each system. However, statistical ranking techniques are typically aimed at comparing systems for which the individual measurements are distributed according to a normal distribution. This is clearly not the case in our scenario. To incorporate non-normal distributions into statistical selection, techniques like batching (e.g. [17]) have been suggested, which initially generate a large number of measurements and then transform this raw data into batch means. The batch sizes are chosen to be large enough so that the individual batch means are approximately independent and normally distributed. However, because procedures of this type need to produce a number of normally distributed estimates per configuration, they require a large number of initial measurements (according to [15], batch sizes of over 1000 measurements are common), thereby nullifying the efficiency gain due to sampling for our scenario.

Workload compression techniques such as [5, 20] compute a compact representation of a large workload before submitting it to a physical design tool. Both of these approaches are heuristics in the sense that they have no means of assessing the impact of the compression on configuration selection or, consequently, on physical design tuning itself. [5] poses workload compression as a clustering problem, using a distance function that models the maximum difference in cost between two queries for arbitrary configurations.
This distance function does not use the optimizer's cost estimate, so it is not clear how well this approximation holds for complex queries. [20] compresses a workload by selecting queries in order of their current costs, until a user-specifiable percentage X of the total workload cost has been reached. While computationally simple, this approach may lead to a significant reduction in tuning quality for workloads in which queries of some templates are generally more expensive than the remaining queries, as then only some templates are considered for tuning. Making X user-specifiable does not alleviate this problem, for the user has no way of assessing the impact of X on tuning quality. The most serious drawback of both approaches is that they do not adapt the number of queries retained to the space of configurations considered. Consequently, they may result in compression that is either too conservative, resulting in excessive numbers of optimizer calls, or too coarse, resulting in inferior physical designs.

3 Problem Statement

In this section we give a formal definition of the problem addressed in this paper. This definition is identical to the informal one given in Section 1.1, but in addition to the parameters $\mathcal{C}$, WL and the target probability α, we introduce an additional parameter δ, which describes the minimum difference in cost between configurations that we care to detect. Specifying a value of δ > 0 helps to avoid scenarios in which the algorithm samples a large fraction of the workload when comparing configurations of (nearly) identical costs. While accuracy for such configurations is necessary in some cases, detecting large differences is all that matters in others. For instance, when comparing a new candidate configuration to the current one, the overhead of changing the physical database design is justified only when the new configuration is significantly better.

3.1 Notation

In this paper, we use $Cost(q, C)$ to denote the optimizer-estimated cost of executing a query q in a configuration C. Similarly, the total estimated cost of executing a set of queries $\{q_1, \ldots, q_n\}$ in configuration C is

$$Cost(\{q_1, \ldots, q_n\}, C) := \sum_{i=1}^{n} Cost(q_i, C);$$

when the set of queries includes the entire workload, this will be referred to as the cost of configuration C. In the interest of simple notation, we will use the simplifying assumption that the overhead of making a single optimizer call is constant across queries (footnote 2). In the remainder of the paper we use the term cost to describe optimizer-estimated query costs.

We use the phrase "to sample a query" to denote the process of obtaining the query's text from a workload table or file, and evaluating its cost using the query optimizer. The configuration being used to evaluate the query will be clear from context. The probability of an event A will be denoted by $Pr(A)$, and the conditional probability of event A given event B by $Pr(A \mid B)$.

3.2 The Configuration Selection Problem

Problem Formulation: Given a set of physical database configurations $\mathcal{C} = \{C_1, \ldots, C_k\}$ and a workload $WL = \{q_1, \ldots, q_N\}$, a target probability α and a sensitivity parameter δ, select a configuration $C_i$ such that the probability of a correct selection, $Pr(CS)$, is larger than α, where

$$Pr(CS) := Pr\big(Cost(WL, C_i) < Cost(WL, C_j) + \delta, \ \forall j \neq i\big), \qquad (1)$$

while making a minimal number of optimizer calls.
Our approach to solving the configuration selection problem is to repeatedly sample queries from WL, evaluate $Pr(CS)$, and terminate once the target probability α has been reached. Next, we describe how to estimate $Pr(CS)$ (Section 4) and how to use these estimates and stratification of the workload to construct an algorithm for the configuration selection problem (Section 5).

Footnote 2: We discuss how to incorporate differences in optimization costs between queries in Section 5.2.

4 Sampling Techniques

In this section we present two sampling schemes to derive $Pr(CS)$ estimates: first we describe a straightforward approach called Independent Sampling (Section 4.1); then we describe Delta Sampling, which exploits properties of query costs to come up with a more accurate estimator (Section 4.2).

4.1 Independent Sampling

Independent Sampling is the base sampling scheme that we use to estimate the differences in costs of pairs of configurations. First, we define an unbiased estimator $X_i$ of $Cost(WL, C_i)$. The estimator is obtained by sampling a set of queries $SL_i$ from WL and is calculated as

$$X_i = \frac{N}{|SL_i|} \sum_{q \in SL_i} Cost(q, C_i),$$

i.e., $X_i$ is the mean of $|SL_i|$ random variables, scaled up by the total number of queries in the workload. The variance of the underlying cost distribution is

$$\sigma_i^2 = \frac{1}{N} \sum_{l=1}^{N} \left( Cost(q_l, C_i) - \frac{Cost(WL, C_i)}{N} \right)^2.$$

Now, when using a simple random sample of size n to compute the estimate $X_i$, the variance of this estimate is

$$Var(X_i) := \frac{N^2}{n} \cdot \sigma_i^2 \cdot \frac{N}{N-1} \cdot \left(1 - \frac{n}{N}\right) \quad [9].$$

In the following, we will use the shorthand $S_i^2 := \sigma_i^2 \cdot \frac{N}{N-1}$.

To choose the better of two configurations $C_l$ and $C_j$ we now use the random variable $X_l - X_j$, which is an unbiased estimator of the true difference in costs $\mu_{l,j} := Cost(WL, C_l) - Cost(WL, C_j)$. Under large-sample assumptions, the standardized random variable

$$\Delta_{l,j} := \frac{(X_l - X_j) - \mu_{l,j}}{N \sqrt{\frac{S_l^2}{|SL_l|}\left(1 - \frac{|SL_l|}{N}\right) + \frac{S_j^2}{|SL_j|}\left(1 - \frac{|SL_j|}{N}\right)}} \qquad (2)$$

is normally distributed with mean 0 and variance 1 (according to the CLT). Based on this we can assess the probability of making a correct selection when choosing between $C_l$ and $C_j$, which we write as $Pr(CS_{l,j})$.
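The decision procedure to choose between the two configurations is to pick the configuration $C_l$ for which the estimated cost $X_l$ is the smallest. In this case the probability of making an incorrect selection corresponds to the event $X_l - X_j < 0$ and $\mu_{l,j} > \delta$. Thus,

$$Pr(CS_{l,j}) = 1 - Pr(X_l - X_j < 0 \mid \mu_{l,j} > \delta) = Pr(X_l - X_j \geq 0 \mid \mu_{l,j} > \delta)$$
$$\geq Pr(X_l - X_j \geq 0 \mid \mu_{l,j} = \delta) = Pr\left( \Delta_{l,j} > \frac{-\delta}{N \sqrt{\frac{S_l^2}{|SL_l|}\left(1 - \frac{|SL_l|}{N}\right) + \frac{S_j^2}{|SL_j|}\left(1 - \frac{|SL_j|}{N}\right)}} \right).$$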
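The pairwise bound for Independent Sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sample variance is plugged in for $S_i^2$, and `math.erf` stands in for a normal lookup table.

```python
import math
import statistics


def total_cost_estimate(sample_costs, workload_size):
    """X_i: mean sampled cost scaled up by the workload size N."""
    return workload_size * sum(sample_costs) / len(sample_costs)


def estimator_variance(sample_costs, workload_size):
    """Estimate of Var(X_i) = N^2 * S_i^2 / n * (1 - n/N).

    Assumption: the sample variance (n - 1 denominator) replaces S_i^2.
    """
    n = len(sample_costs)
    s2 = statistics.variance(sample_costs)
    return workload_size ** 2 * s2 / n * (1.0 - n / workload_size)


def standard_normal_cdf(z):
    """CDF of N(0, 1), used instead of a lookup table."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pairwise_prcs(costs_l, costs_j, workload_size, delta):
    """Lower bound on Pr(CS_{l,j}) for two independently drawn samples."""
    std = math.sqrt(estimator_variance(costs_l, workload_size)
                    + estimator_variance(costs_j, workload_size))
    if std == 0.0:
        return 1.0  # no remaining uncertainty in either estimate
    # Pr(Delta > -delta/std) equals Phi(delta/std) by symmetry of N(0, 1).
    return standard_normal_cdf(delta / std)
```

Under this formulation the bound depends only on δ and the estimator variances, so a sensitivity of δ = 0 always yields 0.5; more samples or a larger δ raise it.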

Given that $\Delta_{l,j} \sim N(0, 1)$, we can compute this lower bound on $Pr(CS_{l,j})$ using the appropriate lookup tables. Because the true variances of the cost distributions are not known, we use the sample variances

$$s_i^2 = \frac{\sum_{q \in SL_i} \left( Cost(q, C_i) - \frac{\sum_{q' \in SL_i} Cost(q', C_i)}{|SL_i|} \right)^2}{|SL_i| - 1},$$

which are unbiased estimators of the true variances, in their place in $Pr(CS_{l,j})$. Consequently, accurate estimation of $Pr(CS_{l,j})$ depends on (a) having sampled sufficiently many values for the Central Limit Theorem (CLT) [9] to be applicable and (b) being able to estimate $\sigma_i^2$ accurately by $s_i^2$. It is not possible to verify these conditions based on a sample alone, as a single very large outlier value may dominate both the variance and the skew of the cost distribution. We will describe how to use additional domain knowledge on the cost distributions to verify (a) and (b) in Section 6.

In practice, a standard rule-of-thumb is to sample $n_{min} := 30$ queries before $Pr(CS)$ is computed based on the normality of $\Delta_{l,j}$. While this in many cases results in the CLT being applicable, the dependence on the sample variances $s_i^2$ still means that $Pr(CS)$ may be either over- or under-estimated.

When the number of configurations k is greater than two, we derive the probability of correct selection $Pr(CS)$ based on the pairwise selection probabilities $Pr(CS_{l,j})$. Once again, the selection procedure here is to choose the configuration $C_i$ with the smallest estimate $X_i$. Consider the case where $C_i$ is chosen as the best configuration from $C_1, \ldots, C_k$. $C_i$ would be an incorrect choice if, for any $j \neq i$, $X_i - X_j < 0$ but $\mu_{i,j} > \delta$. Thus we can derive a trivial bound for $Pr(CS)$ using the Bonferroni inequality:

$$Pr(CS) \geq 1 - \sum_{j \in \{1, \ldots, k\},\, j \neq i} \big(1 - Pr(CS_{i,j})\big). \qquad (3)$$

4.2 Delta Sampling

Delta Sampling is a variation of Independent Sampling that leverages the stability of the rankings of query costs across configurations to obtain a significant reduction in the variance of the resulting estimator.
In Delta Sampling, we pick only a single set of queries $SL \subseteq WL$ from the workload and then estimate the difference in total cost between two configurations $C_l$, $C_j$ as

$$X_{l,j} = \frac{N}{|SL|} \sum_{q \in SL} \big( Cost(q, C_l) - Cost(q, C_j) \big).$$

The selection procedure here is to pick the configuration $C_l$ for which $X_{l,j} < 0$. To give bounds for the probability of correct selection, we define

$$\Delta_{l,j} := \frac{X_{l,j} - \mu_{l,j}}{N \sqrt{\frac{S_{l,j}^2}{|SL|}\left(1 - \frac{|SL|}{N}\right)}}, \qquad (4)$$

where $S_{l,j}^2 := \frac{N}{N-1} \sigma_{l,j}^2$ and $\sigma_{l,j}^2$ is the variance of the distribution $\{Cost(q_i, C_l) - Cost(q_i, C_j) \mid i = 1, \ldots, N\}$ of the cost differences between $C_l$ and $C_j$. Again $\Delta_{l,j} \sim N(0, 1)$ and we can derive $Pr(CS_{l,j})$ as in the case of Independent Sampling. The lower bound on $Pr(CS)$ from equation 3 in the case of multiple configurations applies here as well.

To explain why Delta Sampling typically outperforms Independent Sampling, compare equation 4 to equation 2. The difference in estimator variance between Delta Sampling and Independent Sampling depends on the difference between the terms $S_{l,j}^2 / |SL|$ (in equation 4) and $S_l^2 / |SL_l| + S_j^2 / |SL_j|$ (in equation 2), i.e. on the weighted difference between the respective variances. It is known that $\sigma_{l,j}^2 = \sigma_l^2 + \sigma_j^2 - 2\, Cov_{l,j}$ [3], where $Cov_{l,j}$ is the covariance of the distributions of the query costs of the workload when evaluated in configurations $C_l$ and $C_j$.

Delta Sampling exploits this as follows. For large workloads it typically holds that when ranking the queries by their costs in $C_l$ and $C_j$, the ranks for each individual query do not vary too much between configurations (especially configurations of similar costs, which correspond to difficult selection problems). This means that in most cases, if the cost of a query $q_i$ when evaluated in $C_l$ is higher than the average cost of all queries in $C_l$, its cost when evaluated in $C_j$ is likely to be higher than that of the average query in $C_j$ (and vice versa for below-average costs). For example, multi-join queries will typically be more expensive than single-value lookups, no matter what the physical design. Now, the covariance $Cov_{l,j}$ is defined as

$$Cov_{l,j} = \frac{1}{N} \sum_{i=1}^{N} \left( Cost(q_i, C_l) - \frac{Cost(WL, C_l)}{N} \right) \left( Cost(q_i, C_j) - \frac{Cost(WL, C_j)}{N} \right).$$
All queries $q_i$ for which this rank-stability property holds add a positive term to the summation that defines $Cov_{l,j}$ (as the signs of both factors will be equal). Consequently, if this property holds for sufficiently many queries, the covariance $Cov_{l,j}$ itself will be positive, thereby making $\sigma_{l,j}^2 < \sigma_l^2 + \sigma_j^2$. So, if the number of optimizer calls is roughly equally distributed between configurations in Independent Sampling, Delta Sampling should result in tighter estimates of $Pr(CS)$. We verify this experimentally in Section 7.

5 Workload Stratification

It is possible to further reduce the estimator variances using stratified sampling [9]. Stratified sampling is a generalization of uniform sampling, in which the workload is partitioned into L disjunct strata $WL_1 \cup \ldots \cup WL_L = WL$. Strata that exhibit high variances contribute more samples.

We introduce stratification by partitioning WL into increasingly fine strata as the sampling progresses. The resulting algorithm for the configuration selection problem is shown in Algorithm 1. The algorithm samples from WL until $Pr(CS)$ is greater than α. After each sample, we check if further stratification is expected to lower estimator variance; if so, we implement it. We describe this in detail in Section 5.1. Once the workload is stratified, we also have to choose which stratum to pick the next sample from. We describe this in detail in Section 5.2.

A further heuristic (for large k) is to stop sampling for configurations that are clearly not optimal. If $C_l$ is the chosen configuration in an iteration and $1 - Pr(CS_{l,j})$ is negligible for some $C_j$, then $C_j$ is dropped from future iterations. This way, many configurations are eliminated quickly, and only the ones contributing the most uncertainty remain in the pool. We study the effects of this optimization in Section 7.

    // Input: Workload WL, Configurations $C_1, \ldots, C_k$,
    //        Probability α, Sensitivity δ
    // Output: Configuration $C_{best}$, Probability $Pr(CS)$
    For each $C_l$, pick pilot-sample $SL_l$, $|SL_l| = n_{min}$.
    // Note: for Delta Sampling $SL_1 = \ldots = SL_k$
    repeat
        if further stratification beneficial then
            Create new strata; sample additional queries so that every stratum contains at least $n_{min}$ queries (Section 5.1)
        end if
        Select next query q and (in case of Independent Sampling) configuration $C_i$ to evaluate (Section 5.2)
        Evaluate $\Delta_{i,j}$ for all $1 \leq i < j \leq k$ (Sections 4.1, 4.2)
        Select $C_{best}$ from $\mathcal{C}$ s.t. $X_{best} - X_j < 0 \ \forall j \neq best$ for Independent Sampling or $X_{best,j} < 0 \ \forall j \neq best$ for Delta Sampling (Sections 4.1, 4.2)
    until $Pr(CS) > \alpha$
    return $C_{best}$, $Pr(CS)$

Algorithm 1: Computing Pr(CS) through Sampling

Preprocessing: For workloads large enough that the query strings do not fit into memory, we write all query strings to a database table, which also contains the query's ID and template (also known as a query's signature [6, 5] or skeleton [18]). Two queries have the same template if they are identical in everything but the constant bindings of their parameters. Template information can be acquired either by using a workload-collection tool capable of recording template information (e.g. [6]) or by parsing the queries, which generally requires a small fraction of the overhead necessary to optimize them. Now we can obtain a random sample of size n from this table by computing a random permutation of the query IDs and then (using a single scan) reading the queries corresponding to the first n IDs into memory.
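To make the control flow of Algorithm 1 concrete, the following Python sketch covers the Delta Sampling case without stratification and without the elimination heuristic. It is a simplified illustration under assumptions, not the tool's implementation: `delta_sampling_select` and its parameters are hypothetical names, `cost(query, config)` stands in for a what-if optimizer call, sample variances replace the true variances, and the Bonferroni bound of equation 3 is recomputed after every sample.

```python
import math
import random


def normal_cdf(z):
    """CDF of N(0, 1), used instead of a lookup table."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def delta_sampling_select(workload, configs, cost, alpha, delta,
                          n_min=30, seed=0):
    """Return (chosen configuration, Pr(CS) bound, queries sampled)."""
    big_n = len(workload)
    if big_n < 2:
        raise ValueError("need at least two queries")
    order = list(workload)
    random.Random(seed).shuffle(order)  # random permutation of the workload
    rows = []  # one cost vector (one entry per configuration) per sampled query
    best, prcs = None, 0.0
    for query in order:
        rows.append([cost(query, c) for c in configs])
        n = len(rows)
        if n < max(2, min(n_min, big_n)):
            continue  # still collecting the pilot sample
        totals = [sum(r[i] for r in rows) for i in range(len(configs))]
        best = min(range(len(configs)), key=lambda i: totals[i])
        miss = 0.0  # sum of (1 - Pr(CS_{best,j})) over all j != best
        for j in range(len(configs)):
            if j == best:
                continue
            diffs = [r[best] - r[j] for r in rows]
            mean = sum(diffs) / n
            s2 = sum((d - mean) ** 2 for d in diffs) / (n - 1)
            var = big_n ** 2 * s2 / n * (1.0 - n / big_n)
            p = 1.0 if var <= 0.0 else normal_cdf(delta / math.sqrt(var))
            miss += 1.0 - p
        prcs = 1.0 - miss  # Bonferroni lower bound
        if prcs > alpha:
            break
    return configs[best], prcs, len(rows)
```

Because every sampled query is evaluated in all configurations, each iteration of this sketch costs k optimizer calls; once the whole workload has been sampled, the finite population correction drives the variance to zero and the bound reaches 1.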
The permutation-based sampling described in the preprocessing step trivially extends to stratified sampling.

5.1 Progressive Stratification

Given the strata $WL_h$, $h = 1, \ldots, L$ and a configuration $C_i$, the stratified estimates $X_i$, in the case of independent sampling, are calculated as the sum of the weighted estimates for each stratum $WL_h$. Let the variance of the cost distribution of $WL_h$, when evaluated in configuration $C_i$, be denoted by $\sigma_{i(h)}^2$ and let $S_{i(h)}^2 := \sigma_{i(h)}^2 \cdot \frac{|WL_h|}{|WL_h| - 1}$. Let the number of queries sampled from $WL_h$ be $n_h$. Then the variance of the estimator $X_i$ of the cost of $C_i$ for Independent Sampling is [9]

$$Var(X_i) = \sum_{h=1}^{L} |WL_h|^2 \cdot \frac{S_{i(h)}^2}{n_h} \cdot \left(1 - \frac{n_h}{|WL_h|}\right). \qquad (5)$$

We would thus like to pick a stratification and an allocation of sample sizes $n_1, \ldots, n_L$ that minimizes equation 5. For Delta Sampling, the variances of the estimators $X_{i,j}$ can be obtained analogously.

Picking a stratification: Initially, consider the case of Independent Sampling: here, we can use a different stratification of WL for every $X_i$. In the following we discuss how to stratify WL for one configuration $C_i$. Because the costs of queries that share a template typically exhibit much smaller variance than the costs of the entire workload, it is often possible to get a good estimate of the average cost of queries sharing a template using only a few sample queries. Consequently, we only consider stratifications in which all queries of one template are grouped into the same stratum and use the average costs per template to estimate the strata variances $S_{i(h)}^2$.

This does not mean that using a very fine-grained stratification with a single stratum for every template must necessarily be the best way to stratify. While a fine stratification may result in a small variance for large sample sizes, it comes at a price: in order for the estimate based on stratified sampling to be valid, the number of samples $n_h$ taken from each stratum $WL_h$ has to be sufficiently large so that the estimator of the cost of a configuration in each stratum is normal.
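Equation 5 is straightforward to evaluate once per-stratum samples are available. The sketch below is illustrative only: the function name and the input layout (a list of `(stratum_size, sampled_costs)` pairs) are assumptions, and the sample variance of each stratum is used in place of $S_{i(h)}^2$.

```python
import statistics


def stratified_estimator_variance(strata):
    """Evaluate equation 5 for one configuration.

    strata: list of (stratum_size, sampled_costs) pairs, where sampled_costs
    holds at least two optimizer-estimated costs drawn from that stratum.
    """
    total = 0.0
    for stratum_size, costs in strata:
        n_h = len(costs)
        s2 = statistics.variance(costs)  # stand-in for S_{i(h)}^2
        total += stratum_size ** 2 * s2 / n_h * (1.0 - n_h / stratum_size)
    return total


# Two templates with very different cost levels: cheap lookups, expensive joins.
lookups = [1.0, 1.2, 0.8, 1.0]
joins = [100.0, 110.0, 90.0, 100.0]
stratified = stratified_estimator_variance([(500, lookups), (500, joins)])
unstratified = stratified_estimator_variance([(1000, lookups + joins)])
```

With the same eight samples, separating the two templates into their own strata removes the between-template spread from the variance, so `stratified` comes out far smaller than `unstratified` in this made-up example.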
Choosing between stratifications: Based on the above, we can compute the stratum variances resulting from a stratification $ST = \{WL_1, \ldots, WL_L\}$, given a sample allocation $NT = (n_1, \ldots, n_L)$, using equation 5. Using these variances, we can then estimate the number of samples, #Total_Samples, necessary to achieve $Pr(CS) > \alpha$ for a set of configurations $\mathcal{C}$. We omit the details of this algorithm due to space constraints, but it essentially involves computing target variances $TargetVar(X_i)$ for each estimator $X_i$ such that these variances, when substituted into the formulas in Section 4, give us the target probability. For a given stratification ST, initial sample allocation NT and configuration $C_i$, we denote the minimum number of samples required to achieve the target variance (under the assumption that the underlying $\sigma_{i(h)}^2$ remain constant) by $\#Samples(C_i, ST, NT)$ (footnote 3). We use this estimate to compare different stratification schemes.

Footnote 3: If we ignore the finite population correction, then this number can be computed using $O(L \cdot \log_2(N))$ operations by combining a binary search and Neyman Allocation [9].

We now choose the stratification of WL for a configuration $C_i$ as follows: when sampling, we maintain the average cost of the queries for $C_i$ in each template and use it to estimate $S_{i(h)}^2$, once we have seen a small number of queries for each template. The algorithm starts with a single stratum ($ST = \{WL\}$). After every sample from $C_i$ it evaluates if it should change the stratification of the queries for this configuration. This is done by iterating over a set of possible new stratifications $ST_p$, $p = 1, \ldots, \#templates - 1$, resulting from splitting one of the existing strata into two at the p-th most expensive template, and computing $\#Samples(C_i, ST_p, NT_p)$. Here, we choose the $NT_p$ such that initially each stratum contains the maximum of the number of queries already sampled from it and $n_{min}$. For scalability, we only add a single stratum per step. Note that unlike stratification in the context of survey sampling or approximate query processing (e.g. [4], where the chosen stratification is the result of a complex optimization problem), we incur negligible computational overhead for changing the stratification. The details are shown in Algorithm 2.

    // Input: Strata $WL_1, \ldots, WL_L$, Configuration $C_i$
    // Output: Stratum to split $WL_s = WL_s^1 \cup WL_s^2$
    //         Templates to store in $WL_s^1$, $WL_s^2$
    s := L + 1. // Initialization
    min_sam := $\#Samples(C_i, \{WL_1, \ldots, WL_L\}, (n_1, \ldots, n_L))$.
    for all $WL_j \in \{WL_1, \ldots, WL_L\}$ do
        Set $n_j$ := the expected number of samples from $WL_j$ in $\#Samples(C_i, \{WL_1, \ldots, WL_L\}, (n_1, \ldots, n_L))$.
        if $n_j \geq 2 \cdot n_{min}$ then
            Order the query templates in $WL_j$ by their average cost as $tmp_{p_1}, \ldots, tmp_{p_m}$.
            for every split point $t \in \{p_1, \ldots, p_{m-1}\}$ do
                Consider the split $WL_j = WL_j^1 \cup WL_j^2$ into templates $\{tmp_{p_1}, \ldots, tmp_t\}$, $\{tmp_{t+1}, \ldots, tmp_{p_m}\}$.
                Compute sam[t] := $\#Samples(C_i, \{WL_1, \ldots, WL_j^1, WL_j^2, \ldots, WL_L\}, (n_1, \ldots, \max\{n_{min}, \cdot\}, \max\{n_{min}, \cdot\}, \ldots, n_L))$, where each "$\cdot$" is the number of queries already sampled from $WL_j^1$ and $WL_j^2$, respectively.
                if sam[t] < min_sam then
                    min_sam := sam[t]; s := t.
                end if
            end for
        end if
    end for
    if $s \neq L + 1$ then
        Update #Total_Samples and target variances.
        return $WL_s$ and template information.
    end if

Algorithm 2: Computing the optimum stratification

Overhead: Given T templates in WL, there can only be T - 1 possible split points over all strata. For each split point, we have to evaluate the corresponding value of $\#Samples(\ldots)$, which requires $O(L \cdot \log_2(N))$ operations.
Consequently, the entire algorithm runs in $O(L \cdot \log_2(N) \cdot T)$ time. Because the number of different templates is typically orders of magnitude smaller than the size of the entire workload, the resulting overhead is negligible when compared to the overhead of optimizing even a single query. Note that we execute this algorithm only for the configuration $C_i$ that the last sample was chosen from, as the estimates and sample variances for the remaining configurations stay unchanged.

Stratification for Delta Sampling: For Delta Sampling, the problem of finding a good stratification is complicated by the fact that Delta Sampling samples the same set of queries for all configurations, and thus a stratification scheme needs to reconcile the variances of multiple random variables $X_{i,j}$. As a consequence, we consider the reduction in the average variance of $X_{i,j}$, $1 \leq i < j \leq k$, when comparing stratification schemes. Furthermore, the ordering of the templates for each $X_{i,j}$ may vary, resulting in up to $k \cdot (k-1)/2$ orderings to consider when computing a new candidate stratification. For reasons of tractability, we only consider a single ranking over the $X_{i,j}$ instead.

5.2 Picking the next sample

We first consider Independent Sampling without stratification: ideally, we would like to pick the next combination of query and configuration to evaluate so that $Pr(CS)$ is maximized. We use a heuristic approach, attempting to minimize the sum of the estimator variances instead. In the case of Independent Sampling this means increasing one sample size $|SL_j|$ by one, so that $\sum_{i=1}^{k} Var(X_i)$ is minimized. Since we do not know beforehand what effect the next query will have on the sample means and the $s_i^2$, we estimate the change to the sum of variances assuming that these remain unchanged. For Delta Sampling, the sampled query is evaluated in every configuration, so sample selection is trivial.

If the workload is stratified, we use a similar approach for Independent Sampling, choosing the configuration and stratum resulting in the biggest estimated improvement in the sum of variances.
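For the unstratified Independent Sampling case, the heuristic above amounts to asking which configuration's variance term shrinks most when its sample grows by one query. A minimal sketch, with hypothetical function names and the current sample variances assumed to stay fixed as described in the text:

```python
def variance_term(s2, n, workload_size):
    """N^2 * s^2 / n * (1 - n/N) for one configuration."""
    return workload_size ** 2 * s2 / n * (1.0 - n / workload_size)


def pick_next_configuration(sample_variances, sample_sizes, workload_size):
    """Index of the configuration whose next sample is estimated to reduce
    the sum of estimator variances the most, or None if every configuration
    has already been evaluated on the whole workload."""
    best_index, best_gain = None, -1.0
    for i, (s2, n) in enumerate(zip(sample_variances, sample_sizes)):
        if n >= workload_size:
            continue  # nothing left to sample for this configuration
        gain = (variance_term(s2, n, workload_size)
                - variance_term(s2, n + 1, workload_size))
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index
```

The estimated gain works out to $N^2 s_i^2 \left(\frac{1}{n} - \frac{1}{n+1}\right)$, so the heuristic favours configurations with a high sample variance and a small current sample.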
For Delta Sampling with stratification, a query from the stratum resulting in the biggest estimated improvement in the sum of the variances of all estimators is chosen and evaluated in every configuration.

While this approach assumes optimization times to be constant, different optimization times for each template can be modeled by computing the average overhead for each configuration/stratum pair and selecting the one maximizing the variance reduction relative to the expected overhead.

6 Applicability of the CLT

As discussed in Section 4.1, the estimation of $Pr(CS)$ relies on the fact that (i) we have sampled sufficiently many queries for the central limit theorem to apply and (ii) the sample variances $s_i^2$ are accurate estimators of the true variances $\sigma_i^2$. While these assumptions often do hold in practice, the potential impact of choosing an inferior database design makes it desirable to be able to validate them.

We do this as follows: in the scenario of physical database design, we do know the total number of queries in the workload and can obtain upper and lower bounds on their individual costs. Using these, we then compute upper bounds on the skew and variance of the underlying distribution and use these to verify the assumptions (i) and (ii). In Section 6.1, we will describe how to obtain bounds on the costs of queries that have not been sampled; then, in Section 6.2 we describe how to use these bounds to compute upper bounds on skew and variance.

6.1 Deriving Cost Bounds for Queries

The problem of bounding the true cost of arbitrary database queries has been studied extensively and is known to be hard (e.g. [14]). However, in the context of physical database design, we only seek the configuration with the best optimizer-estimated cost, and not bounds on the true selectivity of query expressions. This simplification allows us to utilize knowledge on the space of physical database designs and the optimizer itself to make this problem tractable; moreover, it is actually essential to physical design tuning, as the tuning tool cannot influence optimizer decisions at run-time and thus needs to keep its recommendations in sync with optimizer behavior.

First, consider the issue of obtaining bounds on SELECT queries. In the context of automated physical design tools, it is possible to determine a so-called base configuration, consisting of all indexes and views that will be present in all configurations enumerated during the tuning process. If the optimizer is well-behaved, then adding an index or view to the base configuration can only improve the optimizer-estimated cost of a SELECT query. Consequently, the cost of a SELECT query in its base configuration gives an upper bound on the cost for any configuration enumerated during the tuning.

Lower bounds on the costs of a SELECT query can be obtained using the reverse method: using the cost of a query Q in a configuration containing all indexes and views that may be useful to Q. All automated physical design tools known to us have components that suggest a set of structures the query may benefit from; however, ultimately the number of such structures may be very large. To overcome this, it is possible to use the techniques described in [2].
In the approach of [2], the query optimizer is instrumented with additional code that outputs, for any individual access path considered during the optimization of a query, the corresponding index or view that would be optimal to support this access path. This reduces the number of relevant indexes and views significantly, as (i) we do not have to guess which access paths may be relevant and (ii) while a complex SQL query may be executed in an extremely large number of different ways, the optimizer itself, for reasons of scalability, only considers a subset of them.

In order to bound the costs of UPDATE statements (including INSERT and DELETE statements), we use the standard approach of splitting a complex update statement into a SELECT and an UPDATE part, e.g. the statement "UPDATE R SET A1 = A3 WHERE A2 < 4" is separated into: (i) "SELECT A3 FROM R WHERE A2 < 4", and (ii) "UPDATE TOP(k) R SET A1 = 0", where k is the estimated selectivity of (i). The SELECT part of the query is handled as outlined above. To bound the cost of the UPDATE part, we use two observations: (1) the number of different update templates occurring tends to be orders of magnitude smaller than the workload size, and (2) in the cost models currently used in query optimizers, the cost of a pure update statement grows with its selectivity. Thus, we can bound the cost of all UPDATE statements of a specific template T on a configuration C using the optimizer cost estimate in C for the two queries with the largest/smallest selectivity in T. This means that we have to make 2 optimizer calls per template and configuration (unlike when bounding the cost of SELECT queries); however, this approach scales well since the number of different templates is generally orders of magnitude smaller than the size of the workload.

Note that automated physical design tuning tools already issue a single optimizer call for all queries in WL to derive initial costs (e.g. [20], figure 2) and require a second optimizer call per query for report generation.
While these techniques are very simple, even very conservative cost bounds tend to work well, as the resulting confidence bounds given by the CLT converge quickly. An exhaustive discussion of bounding techniques for SQL queries is not the focus of this paper; such techniques are an interesting research challenge in themselves.

6.2 Verifying the Validity of the Estimators

Now, the presence of bounds on the costs of the queries that are not part of the sample allows us to compute an upper bound $\sigma^2_{max}$ on $\sigma_i^2$, which we can use in its place, thereby ensuring that the estimated value of Pr(CS) will be conservative. Computing $\sigma^2_{max}$ is a maximization problem with constraints defined by the cost intervals.

Problem Statement [Bounding the Variance]: Given a set of variables $V = \{v_1, \ldots, v_n\}$ representing query costs, with each $v_i$ bounded by $low_i \le v_i \le high_i$, compute

$$\sigma^2_{max} = \frac{1}{n} \max_{\substack{(v_1,\ldots,v_n) \in \mathbb{R}^n \\ \forall i:\ low_i \le v_i \le high_i}} \sum_{i=1}^{n} \Big( v_i - \frac{1}{n} \sum_{j=1}^{n} v_j \Big)^2. \quad (6)$$

It is known that this problem is NP-hard [11] to solve exactly or to approximate within an arbitrary $\epsilon > 0$ [12]. The best known algorithm for this problem requires up to $O(n^2 \cdot 2^n)$ operations [12], and is thus not practical for the workload sizes we consider. Therefore, we propose a novel algorithm approximating $\sigma^2_{max}$. The basic idea is to solve the problem for values of $v_i$ that are multiples of an appropriately chosen factor $\rho$ and to bound the difference between the solution obtained this way (which we will refer to as $\hat\sigma^2_{max}$) and the solution $\sigma^2_{max}$ for unconstrained $v_i$. We use the notation $v_i^\rho$ when describing the rounded values. Rewriting equation 6, we get

$$\sigma^2_{max} = \frac{1}{n} \max_{\substack{(v_1,\ldots,v_n) \in \mathbb{R}^n \\ \forall i:\ low_i \le v_i \le high_i}} \left( \sum_{i=1}^{n} (v_i)^2 - \frac{1}{n} \Big( \sum_{i=1}^{n} v_i \Big)^2 \right). \quad (7)$$

Now, we define $MaxV^2[m][j]$ as the maximum value of $\sum_{i=1}^{m} (v_i^\rho)^2$ under the constraint that $\sum_{i=1}^{m} v_i^\rho = \sum_{i=1}^{m} low_i^\rho + j \cdot \rho$. Consequently, every solution $\hat\sigma^2_{max}$ to the maximization problem for multiples of $\rho$ must be of the form

$$\frac{1}{n} \left( MaxV^2[n][j] - \frac{1}{n} \Big( \sum_{i=1}^{n} low_i^\rho + j \cdot \rho \Big)^2 \right), \quad (8)$$

and we can find the solution by examining all possible values of equation 8. Because we are only considering multiples of $\rho$, we round the boundaries of each $v_i$ to the closest multiples of $\rho$: $low_i^\rho := \lfloor (low_i + \rho/2)/\rho \rfloor \cdot \rho$ and $high_i^\rho := \lfloor (high_i + \rho/2)/\rho \rfloor \cdot \rho$. Thus each $v_i$ can only take on $range_i = (high_i^\rho - low_i^\rho)/\rho + 1$ distinct values. Consequently, the sum (and hence the average) of the first $m$ values can take on $total_m = \sum_{i=1}^{m} range_i - (m - 1)$ different values. This limitation is the key idea to making our approximation algorithm scalable, as the values of $total_i$ typically increase much more slowly than the number of combinations of different $high_i$ and $low_i$ values.

We can compute all possible values of $MaxV^2[m][j]$ for $m = 1, \ldots, n$ and $j = 0, \ldots, total_m - 1$ using the following recurrence, in which $j$ ranges over the $range_i$ possible offsets of $v_i^\rho$ and $l$ over the offsets reachable with the first $i - 1$ variables:

$$MaxV^2[i][j + l] = \begin{cases} \max\limits_{l = 0, \ldots, total_{i-1} - 1} \Big( MaxV^2[i-1][l] + (low_i^\rho + j \cdot \rho)^2 \Big), & \text{if } i > 1 \\ (low_i^\rho + j \cdot \rho)^2, & \text{otherwise.} \end{cases}$$

For each $i$ and each value of $j$, this computation requires $total_{i-1}$ steps. Now we can compute the solution to the constrained maximization problem using equation 8 in $total_n$ steps.

Two further performance optimizations are possible. First, because the 2nd central moment of a distribution has no global maximum over a compact box that is not attained at a boundary point [16], each $v_i^\rho$ must take on either its maximum or its minimum value. Consequently, we only need to check the two extreme values of $j$ in the recurrence (i.e. $v_i^\rho = low_i^\rho$ and $v_i^\rho = high_i^\rho$). Second, we can minimize the number of steps in the algorithm by traversing the values $v_i$ in increasing order of $range_i$ when computing $MaxV^2[m][j]$.
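The dynamic program above can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function names are hypothetical, the table $MaxV^2[m][\cdot]$ is kept as a dictionary for the current $m$ only, and the boundary optimization (each variable takes its minimum or maximum value) is applied from the start. A brute-force enumeration of all corner points is included only as a cross-check for tiny inputs.

```python
import itertools
import math
from typing import Sequence


def max_variance_on_grid(low: Sequence[float], high: Sequence[float],
                         rho: float) -> float:
    """Largest population variance of v_1..v_n with low[i] <= v_i <= high[i],
    where every v_i is restricted to a multiple of rho."""
    n = len(low)
    # Round both interval ends to the nearest multiple of rho and work in
    # integer units of rho: v_i = rho * k_i with lo[i] <= k_i <= hi[i].
    lo = [math.floor(l / rho + 0.5) for l in low]
    hi = [math.floor(h / rho + 0.5) for h in high]

    # best[j] = maximum of sum(k_i^2) over the variables processed so far,
    # subject to sum(k_i) == sum(lo processed so far) + j.
    best = {0: 0}
    for a, b in zip(lo, hi):
        nxt = {}
        for j, squares in best.items():
            # Boundary optimization: try only the minimum and the maximum.
            for k in {a, b}:
                offset = j + (k - a)
                cand = squares + k * k
                if cand > nxt.get(offset, -1):
                    nxt[offset] = cand
        best = nxt

    base = sum(lo)
    # Evaluate the equation-8 expression for every reachable offset j.
    return rho ** 2 * max((sq - (base + j) ** 2 / n) / n
                          for j, sq in best.items())


def brute_force_max_variance(low: Sequence[float],
                             high: Sequence[float]) -> float:
    """Exact maximum by enumerating all 2^n corner points (tiny n only)."""
    n = len(low)
    best = 0.0
    for corner in itertools.product(*zip(low, high)):
        mean = sum(corner) / n
        best = max(best, sum((v - mean) ** 2 for v in corner) / n)
    return best
```

For the intervals [1, 3], [2, 2] and [0, 4] with $\rho = 1$, both functions return 14/9, attained for example at the corner (1, 2, 4). The number of dictionary entries is bounded by the number of distinct reachable sums, which is what keeps the approach practical for large $n$.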
Accuracy of the approximation: The difference between the approximate solution $\hat\sigma^2_{max}$ and the true optimum $\sigma^2_{max}$ can be bounded as follows: the existence of a set of variables $v_1, \ldots, v_n$ with $\forall i: low_i \le v_i \le high_i$ implies that there exists a set of multiples of $\rho$, $v_1^\rho, \ldots, v_n^\rho$ with $\forall i: low_i^\rho \le v_i^\rho \le high_i^\rho$, so that $\forall i: |v_i - v_i^\rho| \le \rho/2$. Consequently, the difference between the value of $\sum_{i=1}^{n} (v_i^\rho)^2$ and the value of $\sum_{i=1}^{n} (v_i)^2$ in equation 7 for these sets of values is at most $\sum_{i=1}^{n} (\rho \cdot v_i^\rho + \rho^2/4)$. Similarly, the difference between $\frac{1}{n} (\sum_{i=1}^{n} v_i^\rho)^2$ and $\frac{1}{n} (\sum_{i=1}^{n} v_i)^2$ in equation 7 is at most $\sum_{i=1}^{n} (\rho \cdot v_i^\rho + \rho^2/4)$. Therefore, the existence of a solution $\sigma^2_{max}$ to equation 6 implies the existence of a set of multiples of $\rho$, $v_1^\rho, \ldots, v_n^\rho$ with $\forall i: low_i^\rho \le v_i^\rho \le high_i^\rho$, such that the difference between $\sigma^2_{max}$ and the variance of $v_1^\rho, \ldots, v_n^\rho$ is bounded by

$$\theta := \frac{2}{n} \sum_{i=1}^{n} \big( \rho \cdot v_i^\rho + \rho^2/4 \big). \quad (*)$$

Similarly, the existence of a solution $\hat\sigma^2_{max}$ to the optimization problem for multiples of $\rho$ implies the existence of a set of variables $v_1, \ldots, v_n$ with $\forall i: low_i \le v_i \le high_i$ such that the difference between $\hat\sigma^2_{max}$ and the variance of $v_1, \ldots, v_n$ is bounded by $\theta$ as well (**). Therefore, if we compute a solution $\hat\sigma^2_{max}$ to the optimization problem for multiples of $\rho$, then (**) implies that the solution to the unconstrained problem is at least $\hat\sigma^2_{max} - \theta$, and (*) implies that no solution $\sigma^2_{max}$ to equation 6 exists that is larger than $\hat\sigma^2_{max} + \theta$.

Overhead: To demonstrate the scalability of this approximation algorithm, we have measured the overhead of computing $\hat\sigma^2_{max}$ for a TPC-D workload of 100K queries and various values of $\rho$ on a Pentium 4 (2.8GHz) PC; the results are shown in Table 1.

                          n=100K, ρ=10    n=100K, ρ=1    n=100K, ρ=1/10
Time ($\hat\sigma^2_{max}$)        0.4 sec.        5.2 sec.       53 sec.

Table 1: Overhead of approximating $\sigma^2_{max}$

Verifying the applicability of the CLT: The 2nd assumption made when modeling Pr(CS) is that the sample size $n$ is sufficiently large for the CLT to apply. However, we have not stated thus far how a sufficiently large minimum sample size $n_{min}$ is picked.
One general result in this area is Cochran's rule ([9], p. 42), which states that for populations marked by positive skewness (i.e. the population contains significant outliers), the sample size $n$ should satisfy $n > 25 \cdot (G_1)^2$, where $G_1$ is Fisher's measure of skew. Under the assumption that the disturbance in any moment higher than the 3rd is negligible (and additional conditions on the sampling fraction $f := n/N$), it can be shown that this condition guarantees that a 95% confidence statement will be wrong no more than 6% of the time, i.e. for the standardized statistic $Z := \frac{\bar{X}_i - \mu_i}{\sqrt{1-f} \cdot \sigma_i / \sqrt{n}}$ the probability that $|Z|$ exceeds 1.96 is at most 6%. The derivation of additional results of this type is studied in [19]. While details of this work are beyond the scope of this paper, we use a modification of Cochran's rule proposed in [19], which was found to be robust for the population sizes we consider:

$$n > 28 + 25 \cdot (G_1)^2. \quad (9)$$

In order to verify this condition, we need to be able to compute an upper bound on $G_1$:

Problem Statement [Bounding the skew]: Given a set of variables $V = \{v_1, \ldots, v_n\}$, with each $v_i$ bounded by $low_i \le v_i \le high_i$, compute

$$G_1^{max} = \max_{\substack{(v_1,\ldots,v_n) \in \mathbb{R}^n \\ \forall i:\ low_i \le v_i \le high_i}} \frac{\frac{1}{n} \sum_{i=1}^{n} \big( v_i - \frac{1}{n} \sum_{j=1}^{n} v_j \big)^3}{\Big( \frac{1}{n} \sum_{i=1}^{n} \big( v_i - \frac{1}{n} \sum_{j=1}^{n} v_j \big)^2 \Big)^{3/2}}.$$

Unlike the case of variance maximization, to the best of our knowledge, the complexity of maximizing $G_1$ is not known. We approach this problem using an approximation scheme similar to the one used for $\sigma^2_{max}$; because of size constraints, we omit the full description of this algorithm.

Because of the quick convergence of the CLT, even the rough bounds derived in the previous section allow us to formulate practical constraints on sample size. For example, for the highly skewed (query costs vary by several orders of magnitude) 13K query TPC-D workload described in Section 7, satisfying equation 9 required about a 4% sample; for a 131K query TPC-D workload, a sample of less than 0.6% of the queries was needed.

7 Experiments

In this section we first evaluate the efficiency of the sampling techniques described in Sections 4 and 5. Then, we show how the estimation of Pr(CS) can be used to construct a scalable and accurate comparison primitive for large numbers of configurations. Finally, we compare our approach to existing work experimentally.

Setup: All experiments were run on a Pentium 4 (2.8GHz) PC and use the cost model of the SQL Server optimizer. We evaluated our techniques on two databases, one synthetic and one real-life. The synthetic database follows the TPC-D schema and was generated so that the frequency of attribute values follows a Zipf-like distribution, using the skew parameter θ = 1. The total data size is 1GB. Here, we use a workload consisting of about 13K queries, generated using the standard QGEN tool. The real-life database used was a database running a CRM application with over 500 tables and of size 0.7 GB. We obtained the workload for this database by using a trace tool; the resulting workload contains about 6K queries, inserts, updates and deletes.
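Before turning to the individual experiments, the sample-size condition from Section 6.2 can be made concrete with a small sketch. The Python below is illustrative only: the function names are hypothetical, it computes Fisher's $G_1$ for a fully known list of costs (whereas the paper substitutes an upper bound on $G_1$ derived from cost intervals), and it applies Cochran's original rule $n > 25 \cdot (G_1)^2$.

```python
import math
from typing import Sequence


def fisher_skew(values: Sequence[float]) -> float:
    """Fisher's G1: third central moment divided by sigma^3 (population form).

    Undefined (division by zero) when all values are identical.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def cochran_min_sample_size(g1: float) -> int:
    """Smallest integer n satisfying Cochran's rule n > 25 * g1^2."""
    return math.floor(25.0 * g1 ** 2) + 1
```

For the costs 1, 1, 1, 1, 10 (one expensive outlier), `fisher_skew` gives $G_1 = 1.5$, so the rule asks for more than 56.25 samples, i.e. at least 57. Passing an upper bound on $G_1$ instead of the exact value only makes the resulting sample size more conservative.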
7.1 Evaluating the Sampling Techniques

In this section, we evaluate the efficiency of the different sampling techniques described (Independent and Delta Sampling, both with and without progressive stratification). In the first experiment, we use the TPC-D workload and consider the problem of choosing between two configurations $C_1$, $C_2$ that have a significant difference in cost (7%) and in their respective sets of physical design structures ($C_2$ is index-only, whereas $C_1$ contains a number of views). Note that |WL| = 13K, so solving the configuration-selection problem for k = 2 exactly requires 26K optimizer calls. In the first experiment we set δ = 0, run each sampling scheme for a given sample size and output the selected configuration. This process is repeated 5000 times, resulting in a Monte Carlo simulation to compute the true probability of correct selection for a given number of samples. The results are shown in Figure 1.

[Figure 1: Monte Carlo simulation of Pr(CS). x-axis: sample size; y-axis: probability of correct selection; series: Sampling, Delta Sampling, Sampling +, Delta Sampling +]

We can see that the sampling-based approach to configuration selection is efficient: less than 1% of the number of optimizer calls required to compute the configuration costs exactly suffices to select the correct configuration with near-certainty. Delta Sampling outperforms Sampling significantly for small sample sizes, whereas adding progressive stratification makes little difference, given the small sample sizes.

In the next experiment we illustrate the issues of initially picking a very fine stratification of the workload. Here, we run the same experiment again, and use both sampling schemes after partitioning the workload into L strata, one for each query template (Figure 2).

[Figure 2: Progressive vs. fine stratification. x-axis: sample size; y-axis: probability of correct selection; series: Sampling + prog., Delta Sampling + prog., Sampling + simple, Delta Sampling + simple]

For the fine stratification and small sample sizes, the estimates within each stratum are not normal and thus the probability of correct selection is significantly lower.
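The Monte Carlo procedure used in these experiments can be sketched as follows. This Python is a simplified illustration, not the experimental harness: it assumes that the exact per-query costs under both configurations are available as two lists (in the experiments they come from optimizer calls), it ignores stratification, and the function name and parameters are hypothetical. Independent sampling draws a separate query sample for each configuration, whereas Delta Sampling evaluates both configurations on the same sampled queries and averages the cost differences.

```python
import random
from statistics import fmean
from typing import Sequence


def pr_cs_monte_carlo(cost1: Sequence[float], cost2: Sequence[float],
                      n: int, trials: int = 5000, delta: bool = True,
                      seed: int = 0) -> float:
    """Estimate the probability of correct selection between two
    configurations when only n of the N queries are sampled per trial."""
    rng = random.Random(seed)
    size = len(cost1)
    truth_is_2 = sum(cost2) < sum(cost1)   # ground truth from exact costs
    correct = 0
    for _ in range(trials):
        if delta:
            # Delta Sampling: one sample, both configurations costed on it.
            idx = rng.sample(range(size), n)
            diff = fmean(cost1[i] - cost2[i] for i in idx)
        else:
            # Independent sampling: a separate sample per configuration.
            idx1 = rng.sample(range(size), n)
            idx2 = rng.sample(range(size), n)
            diff = (fmean(cost1[i] for i in idx1)
                    - fmean(cost2[i] for i in idx2))
        picked_2 = diff > 0                # positive difference: C2 is cheaper
        correct += picked_2 == truth_is_2
    return correct / trials
```

When the per-query costs of the two configurations are strongly correlated, the differences `cost1[i] - cost2[i]` vary far less than the costs themselves, which is why Delta Sampling needs fewer samples to reach the same probability of correct selection.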
For large sample sizes, the accuracy of the fine stratification is comparable to the progressive stratification scheme. We have found this to hold true for a number of different experimental setups.

The next experiment (Figure 3) uses the same workload, but two configurations that are significantly harder to distinguish (difference in cost 2%) and share a significant number of design structures (both configurations are index-only). Here, Delta Sampling outperforms Independent Sampling by a bigger margin, since the configurations share a large number of objects, resulting in higher covariance between the cost distributions. Because of the larger sample sizes, stratification significantly improves the accuracy of Sampling.

[Figure 3: Monte Carlo simulation of Pr(CS). x-axis: sample size; y-axis: probability of correct selection; series: Sampling, Delta Sampling, Sampling + prog., Delta Sampling + prog.]

We conducted the same experiments on the real-life database, using two configurations that were difficult to compare (difference in cost < 1%) and had little overlap in the physical design structures (Figure 4). Consequently, the advantage of Delta Sampling is less pronounced. Furthermore, this workload contains a relatively large number of distinct templates (> 120). This means that we rarely have estimates of the average cost of all templates, so progressive stratification is used only in a few cases.

[Figure 4: Monte Carlo simulation of Pr(CS). x-axis: sample size; y-axis: probability of correct selection; series: Sampling, Delta Sampling, Sampling + prog., Delta Sampling + prog.]

7.2 Experiments on Multiple Configurations

In this section we evaluate the accuracy and scalability of the comparison primitive based on Pr(CS)-estimates for large numbers of configurations k, which were collected from a commercial physical design tool. We run Algorithm 1 to determine the best configuration with probability α = 90% and δ = 0, using Delta Sampling with progressive stratification as the sampling technique. In these experiments, we use the sample variances $s_i^2$ to estimate $\sigma_i^2$. In order to guard against the oscillation of the Pr(CS)-estimates, we only accept a Pr(CS)-condition if it holds for more than 10 consecutive samples. As an additional optimization, we stop sampling from configurations $C_j$ for which the contribution to the overall uncertainty in Pr(CS) is negligible (in this case $Pr(CS_{l,j}) > .995$).

We compare the primitive to two alternative sample-allocation methods (using identical numbers of samples): a) sampling without stratification and b) sampling the same number of queries from each stratum.
For all methods, we report the resulting probability of selecting the best configuration ("True Pr(CS)") and the maximum difference in cost between the best configuration and the one selected by each method ("Max. Δ"). The latter allows us to assess the worst-case impact of using the alternative techniques. Each experiment was carried out 5000 times, for both the TPC-D (Table 2) and the CRM database (Table 3).

Method                          k = 50    k = 100    k = 500
Delta-Sampling   True Pr(CS)    91.7%     88.2%      88.3%
                 Max. Δ         0.5%      1.5%       1.6%
No Strat.        True Pr(CS)    39.1%     28.2%      12.0%
                 Max. Δ         8.8%      9.9%       9.8%
Equal Alloc.     True Pr(CS)    42.5%     28.6%      12.8%
                 Max. Δ         7.7%      9.0%       8.6%

Table 2: Results for TPCD-Workload

Method                          k = 50    k = 100    k = 500
Delta-Sampling   True Pr(CS)    97.5%     94.4%      89.7%
                 Max. Δ         1.7%      1.4%       0.8%
No Strat.        True Pr(CS)    56.0%     37.5%      11.0%
                 Max. Δ         …%        12.69%     6.5%
Equal Alloc.     True Pr(CS)    71.1%     52.8%      17.0%
                 Max. Δ         7.2%      5.8%       3.26%

Table 3: Results for CRM-Workload

In both cases, the comparison primitive based on Delta Sampling performs significantly better than the alternatives. Moreover, the true Pr(CS) resulting from the selection primitive matches the target probability α closely or exceeds it (see footnote 4), demonstrating its capability to choose the sample size effectively to match the accuracy requirement.

7.3 Comparison to Workload Compression

In this section, we compare existing workload compression techniques [5, 20] to our approach with three key evaluation criteria: scalability, quality and adaptivity. In [20], queries are selected in order of their costs for the current configuration until a prespecified percentage, X, of the total workload cost is selected. Obviously, this method scales well. Regarding quality, however, the technique fails when only few query templates contain the most expensive queries. To illustrate this, we generated a 2K query TPC-D workload using the QGEN tool.
If X = 20%, [20] will capture queries corresponding to only few of the TPC-D query templates. Consequently, tuning this compressed workload fails to yield several design structures beneficial for the remaining templates. To show this, we tuned 5 different random samples of the same size as the compressed workload; the improvement (over the entire workload) resulting from tuning each sample was more than twice the improvement resulting from tuning the compressed workload. To evaluate the quality resulting from [5], we measured the difference in improvement when tuning a Delta-sample and a compressed workload of the same size for a TPC-D workload; in this experiment, both approaches performed comparably.

Regarding scalability, [5] requires up to $O(|WL|^2)$ complex distance computations as a preprocessing step to the clustering problem. In contrast, the overhead for executing Algorithms 1 and 2 is negligible, as all necessary counters and measurements can be maintained incrementally at constant cost.

The biggest difference between [5, 20] and our technique concerns adaptivity. [5, 20] both depend on an initial guess of a sensitivity parameter. In [5], it is the maximum allowable increase in the estimated running time when queries are discarded, and in [20] it is the percentage of total cost retained. It is not clear how to set this parameter correctly, as these deterministic techniques do not allow the formulation of probabilistic guarantees similar to the ones in this paper; in both [5, 20] the sensitivity parameter is set up-front, without taking the configuration space into account. We have observed in experiments that the fraction of a workload required for accurate selection varies significantly for different sets of candidate configurations. Thus choosing the sensitivity parameter incorrectly has a significant impact on tuning quality and speed. Our algorithm, in contrast, offers a principled way of adjusting the sample size online.

Footnote 4: The high Pr(CS) for the CRM workload is due to the requirement that Pr(CS) > α must hold for 10 consecutive samples, which causes over-sampling for easy selection problems.

8 Conclusion and Outlook

In this paper, we presented a scalable comparison primitive solving the configuration selection problem, while providing accurate estimates of the probability of correct selection.
This primitive is of crucial importance to scalable exploration of the space of physical database designs and can serve as a core component within automated physical design tools. Moreover, we described a novel scheme to verify the validity of using the Central Limit Theorem and sample variances to compute Pr(CS). These techniques are not limited to this problem domain, but can be used for any CLT-based estimator, if the required bounds on individual values in the underlying distribution can be obtained.

References

[1] S. Agrawal, S. Chaudhuri, et al. Database Tuning Advisor for Microsoft SQL Server. In Proc. of the 30th VLDB Conf., Toronto, Canada.
[2] N. Bruno and S. Chaudhuri. Automatic Physical Database Tuning: a Relaxation-based Approach. In ACM SIGMOD Conf.
[3] G. Casella and R. L. Berger. Statistical Inference. Duxbury.
[4] S. Chaudhuri, G. Das, et al. A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries. In Proc. of ACM SIGMOD Conf.
[5] S. Chaudhuri, A. Gupta, et al. Compressing SQL Workloads. In Proc. of ACM SIGMOD Conf.
[6] S. Chaudhuri, A. C. König, et al. SQLCM: A Continuous Monitoring Framework for Relational Database Engines. In Proc. of 20th ICDE Conf.
[7] S. Chaudhuri and V. R. Narasayya. An Efficient Cost-Driven Index Selection Tool for Microsoft SQL Server. In Proc. of the 23rd VLDB Conf., Athens, Greece.
[8] S. Chaudhuri and V. R. Narasayya. AutoAdmin "What-If" Index Analysis Utility. In Proc. of ACM SIGMOD Conf., Seattle, WA, USA.
[9] W. G. Cochran. Sampling Techniques. Wiley.
[10] B. Dageville, D. Das, et al. Automatic SQL Tuning in Oracle 10g. In Proc. of the 30th VLDB Conf.
[11] S. Ferson, L. Ginzburg, et al. Computing Variance for Interval Data is NP-Hard. In ACM SIGACT News, Vol. 33, No. 2.
[12] S. Ferson, L. Ginzburg, et al. Exact Bounds on Finite Populations of Interval Data. Technical Report UTEP-CS-02-13d, University of Texas at El Paso.
[13] S. Finkelstein, M. Schkolnick, et al. Physical Database Design for Relational Databases. In ACM TODS, 13(1).
[14] Y. Ioannidis and S. Christodoulakis. Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results. In ACM TODS.
[15] S.-H. Kim and B. L. Nelson. Selecting the Best System: Theory and Methods. In Proc. of the 2003 Winter Simulation Conf.
[16] V. Kreinovich, S. Ferson, et al. Computing Higher Central Moments for Interval Data. Technical Report UTEP-CS-03-14, University of Texas at El Paso.
[17] N. M. Steiger and J. R. Wilson. Improved Batching for Confidence Interval Construction in Steady-State Simulation. In Proc. of the 1999 Winter Simulation Conf.
[18] M. Stillger, G. Lohman, V. Markl, and M. Kandil. LEO - DB2's Learning Optimizer. In Proc. of the 27th Conference on Very Large Databases.
[19] R. Sugden, T. Smith, et al. Cochran's rule for simple random sampling. In Journal of the Royal Statistical Society.
[20] D. C. Zilio, J. Rao, et al. DB2 Design Advisor: Integrated Automatic Physical Database Design. In Proc. of the 30th VLDB Conf., Canada.

Acknowledgements: This paper benefitted tremendously from many insightful comments from Vivek Narasayya, Surajit Chaudhuri, Sanjay Agrawal and Amin Saberi.
