Self-tuning Histograms: Building Histograms Without Looking at Data


Ashraf Aboulnaga
Computer Sciences Department, University of Wisconsin - Madison
ashraf@cs.wisc.edu

Surajit Chaudhuri
Microsoft Research
surajitc@microsoft.com

Abstract

In this paper, we introduce self-tuning histograms. Although similar in structure to traditional histograms, these histograms infer data distributions not by examining the data or a sample thereof, but by using feedback from the query execution engine about the actual selectivity of range selection operators to progressively refine the histogram. Since the cost of building and maintaining self-tuning histograms is independent of the data size, self-tuning histograms provide a remarkably inexpensive way to construct histograms for large data sets with little up-front costs. Self-tuning histograms are particularly attractive as an alternative to multi-dimensional traditional histograms that capture dependencies between attributes but are prohibitively expensive to build and maintain. In this paper, we describe the techniques for initializing and refining self-tuning histograms. Our experimental results show that self-tuning histograms provide a low-cost alternative to traditional multi-dimensional histograms with little loss of accuracy for data distributions with low to moderate skew.

1. Introduction

Database systems require knowledge of the distribution of the data they store. This information is primarily used by query optimizers to estimate the selectivities of the operations involved in a query and choose the query execution plan. It could also be used for other purposes such as approximate query processing, load balancing in parallel database systems, and guiding the process of sampling from a relation. Histograms are widely used for capturing data distributions. They are used in most commercial database systems such as Microsoft SQL Server, Oracle, Informix, and DB2.
While histograms impose very little cost at query optimization time, the cost of building them and maintaining or rebuilding them when the data is modified has to be considered when we choose the attributes or attribute combinations for which we build histograms. Building a histogram involves scanning or sampling the data, and sorting the data and partitioning it into buckets, or finding quantiles. For large databases, the cost is significant enough to prevent us from building all the histograms that we believe are useful. This problem is particularly striking for multi-dimensional histograms that capture joint distributions of correlated attributes [MD88, PI97]. These histograms can be extremely useful for optimizing decision-support queries since they provide valuable information that helps in estimating the selectivities of multi-attribute predicates on correlated attributes. Despite their potential, to the best of our knowledge, no commercial database system supports multi-dimensional histograms. The usual alternative to multi-dimensional histograms is to assume that the attributes are independent, which enables using a combination of one-dimensional histograms. This approach is efficient but also very inaccurate. The inaccuracy results in a poor choice of execution plans by the query optimizer.

(Footnote: Work done while the author was at Microsoft Research.)

Self-tuning Histograms

In this paper, we explore a novel approach that helps reduce the cost of building and maintaining histograms for large tables. Our approach is to build histograms not by examining the data but by using feedback information about the execution of the queries on the database (query workload). We start with an initial histogram built with whatever information we have about the distribution of the histogram attribute(s). For example, we will construct an initial two-dimensional histogram from two existing one-dimensional histograms assuming independence of the attributes. As queries are issued on the database, the query optimizer uses the histogram to estimate selectivities in the process of choosing query execution plans.
Whenever a plan is executed, the query execution engine can count the number of tuples produced by each operator. Our approach is to use this free feedback information to refine the histogram. Whenever a query uses the histogram, we compare the estimated selectivity to the actual selectivity and refine the histogram based on the selectivity estimation error. This incremental refinement progressively reduces estimation errors and leads to a histogram that is accurate for similar workloads. We call histograms built using this process self-tuning histograms or ST-histograms for short.

This work was done in the broader context of the AutoAdmin project at Microsoft Research (http://research.microsoft.com/db/autoadmin) that investigates techniques to make databases self-tuning. ST-histograms make it possible to build higher dimensional histograms incrementally with little overhead, thus providing commercial systems with a low-cost approach to creating and maintaining such histograms. The ST-histograms have a low up-front cost because they are initialized without looking at the data. The refinement of ST-histograms is a simple low-cost procedure that leverages free information from the execution engine. Furthermore, we demonstrate that histogram refinement converges quickly. Thus, the overall cost of ST-histograms is much lower than that of traditional multi-dimensional histograms, yet the loss of accuracy is very acceptable for data with low to moderate skew in the joint distribution of the attributes.

Figure 1: On-line and off-line refinement of ST-histograms. (Diagram: the optimizer uses the ST-histogram to produce a plan; executing the plan yields the result and the result size of each selection; the result size is either used to refine the histogram immediately, which is on-line refinement, or written to a workload log and used to refine it later, which is off-line refinement.)

A ST-histogram can be refined on-line or off-line (Figure 1). In the on-line mode, the module executing a range selection immediately updates the histogram. In the off-line mode, the execution module writes every selection range and its result size to a workload log. Tools available with commercial database systems, e.g., Profiler in Microsoft SQL Server, can accomplish such logging. The workload log is used to refine the histogram in a batch at a later time. On-line refinement ensures that the histogram reflects the most up-to-date feedback information but it imposes more overhead during query execution than off-line refinement and can also cause the histogram to become a high-contention hot spot. The overhead imposed by histogram refinement, whether on-line or off-line, can easily be tailored. In particular, the histogram need not be refined in response to every single selection that uses it. We can choose to refine the histogram only for selections with a high selectivity estimation error. We can also skip refining the histogram during periods of high load or when there is contention for accessing it.

On-line refinement of ST-histograms brings a ST-histogram closer to the actual data distribution, whether the estimation error driving this refinement is due to the initial inaccuracy of the histogram or to modifications in the underlying relation. Thus, ST-histograms automatically adapt to database updates. Another advantage of ST-histograms is that their accuracy depends on how often they are used. The more a ST-histogram is used, the more it is refined, the more accurate it becomes.

Applications of Self-tuning Histograms

One can expect traditional histograms built by looking at the data to be more accurate than ST-histograms that learn the distribution without ever looking at the data. Nevertheless, ST-histograms, and especially multi-dimensional ST-histograms, are suitable for a wide range of applications.
As mentioned above, multi-dimensional ST-histograms are particularly attractive. Traditional multi-dimensional histograms, most notably MHIST-p histograms [PI97], are significantly more expensive than traditional one-dimensional histograms, increasing the value of the savings in cost offered by ST-histograms. Furthermore, ST-histograms are very competitive in terms of accuracy with MHIST-p histograms for data distributions with low to moderate skew (Section 5). Multi-dimensional ST-histograms can be initialized using traditional one-dimensional histograms and subsequently refined to provide a cheap and efficient way of capturing the joint distribution of multiple attributes. The other inexpensive alternative of assuming independence has been repeatedly demonstrated to be inaccurate (see, for example, [PI97] and our experiments in Section 5). Furthermore, note that building traditional histograms is an off-line process, meaning that histograms cannot be used until the system incurs the whole cost of completely building them. This is not true of ST-histograms. Finally, note that ST-histograms make it possible to inexpensively build not only two-dimensional, but also n-dimensional histograms.

ST-histograms are also a suitable alternative when there is not enough time for updating database statistics to allow building all the desired histograms in the traditional way. This may happen in data warehouses that are updated periodically with huge amounts of data. The sheer data size may prohibit rebuilding all the desired histograms during the batch window. This very same data size makes ST-histograms an attractive option, because examining the workload to build histograms will be cheaper than examining the data and can be tailored to a given time budget.

The technique of ST-histograms can be an integral part of database servers as we move towards self-tuning database systems. If a self-tuning database system decides that a histogram on some attribute or attribute combination may improve performance, it can start by building a ST-histogram.
The low cost of ST-histograms allows the system to experiment more extensively and try out more histograms than if traditional histograms were the only choice. Subsequently, one can construct a traditional histogram only if the ST-histogram does not provide the required accuracy.

Finally, an intriguing possible application of ST-histograms will be for applications that involve queries on remote data sources. With recent trends in database usage, query optimizers will have to optimize queries involving remote data sources not under their direct control, e.g., queries involving data sources accessed over the Internet. Accessing the data and building traditional histograms for such data sources may not be easy or even possible. Query results, on the other hand, are available from the remote source, making the technique of ST-histograms an attractive option.

The rest of this paper is organized as follows. In Section 2 we present an overview of the related work. Section 3 describes one-dimensional ST-histograms and introduces the basic concepts that lead towards Section 4 where we describe multi-dimensional ST-histograms. Section 5 presents an experimental evaluation of our proposed techniques. Section 6 contains concluding remarks.

2. Related Work

Histograms were introduced in [Koo80], and most commercial database systems now use histograms for selectivity estimation.

Although one-dimensional equi-depth histograms are used in most commercial systems, more accurate histograms have been proposed recently [PIHS96]. [PI97] extends the techniques in [PIHS96] to multiple dimensions. However, we are unaware of any commercial systems that use the MHIST-p technique proposed in [PI97]. A novel approach for building histograms based on wavelets is presented in [MVW98].

A major disadvantage of histograms is the cost of building and maintaining them. Some recent work has addressed this shortcoming. [MRL98] proposes a one-pass algorithm for computing approximate quantiles that could be used to build approximate equi-depth histograms in one pass over the data. Reducing the cost of maintaining equi-depth and compressed histograms is the focus of [GMP97]. Recall that our approach is not to examine the data at all, but to build histograms using feedback from the query execution engine. However, our technique for refining ST-histograms shares commonalities with the split and merge algorithm proposed in [GMP97]. This relationship is further discussed in Section 3.

In addition to histograms, another technique for selectivity estimation is sampling the data at query optimization time [LNS90]. The main disadvantage of this approach is the overhead it adds to query optimization.

The concept of using feedback from the query execution engine to estimate data distributions is introduced in [CR94]. In that paper, the data distribution is represented as a linear combination of model functions. Feedback information is used to adjust the weighting coefficients of this linear combination by a method called recursive-least-square-error. That paper considers only one-dimensional distributions. It remains an open problem whether one can find suitable multi-dimensional model functions, or whether the recursive least-square-error technique would work well for multi-dimensional distributions. In contrast, we show how our technique can be used to construct multi-dimensional histograms as well as one-dimensional histograms.
Furthermore, our work is easily integrated into existing systems because we use the same histogram data structures that are currently supported in commercial systems.

A different type of feedback from the execution engine to the optimizer is proposed in [KD98]. In that paper, the execution engine invokes the query optimizer to re-optimize a query if it believes, based on statistics collected during execution, that this will result in a better query execution plan.

3. One-dimensional ST-histograms

Although the main focus of our paper is to demonstrate that ST-histograms are low cost alternatives to traditional multi-dimensional histograms, the fundamentals of ST-histograms are best introduced using ST-histograms for single attributes. Single-attribute ST-histograms are similar in structure to traditional histograms. Such a ST-histogram consists of a set of buckets. Each bucket, b, stores the range that it represents, [low(b), high(b)], and the number of tuples in this range, or the frequency, freq(b). Adjacent buckets share the bucket endpoints, and the ranges of all the buckets together cover the entire range of values of the histogram attribute. We assume that the refinement of ST-histograms is driven by feedback from range selection queries.

A ST-histogram assumes that the data is uniformly distributed until the feedback observation contradicts the uniformity assumption. Thus, the refinement/restructuring of ST-histograms corresponds to weakening the uniformity assumption as needed in response to feedback information. Therefore, the lifecycle of a ST-histogram consists of two stages. First, it is initialized and then, it is refined. The process of refinement can be broken down further into two parts: (a) refining individual bucket frequencies, and (b) restructuring the histogram, i.e., moving the bucket boundaries. The refinement process is driven by a query workload (see Section 1). The bucket frequencies are updated with every range selection on the histogram attribute, while the bucket boundaries are updated by periodically restructuring the histogram.
We describe each of these steps in the rest of the section.

3.1 Initial Histogram

To build a ST-histogram, h, on an attribute, a, we need to know the required number of histogram buckets, B, the number of tuples in the relation, T, and the minimum and maximum values of attribute a, min and max. The B buckets of the initial histogram are evenly spaced between min and max. At the time of initializing the histogram structure, we have no feedback information. Therefore, we make the uniformity assumption and assign each of the buckets a frequency of T/B tuples (with some provision for rounding).

The parameter T can be looked up from system catalogs maintained for the database. However, the system may not store minimum and maximum values of attributes in its catalogs. The precise value of the minimum and maximum is not critical. Therefore, the initialization phase of ST-histograms can exploit additional sources to project an estimate that may subsequently be refined. For example, domain constraints on the column, as well as the minimum and maximum values referenced in the query workload can be used for such estimation.

3.2 Refining Bucket Frequencies

The bucket frequencies of a ST-histogram are refined (updated) with feedback information from the queries of the workload. For every selection on the histogram attribute, we compute the absolute estimation error, which is the difference between the estimated and actual result sizes. Based on this error, we refine the frequencies of the buckets that were used in estimation.

The key problem is to decide how to distribute the blame for the error among the histogram buckets that overlap the range of a given query. In a ST-histogram, error in estimation may be due to incorrect frequencies in any of the buckets that overlap the selection range. This is different from traditional histograms in which, if the histogram has been built using a full scan of data and has not been degraded in accuracy by database updates, the estimation error can result only from the first or last bucket, and only if they partially overlap the selection range.
Buckets that are totally contained in the selection range do not contribute to the error.

The change in frequency of any bucket should depend on how much it contributes to the error. We use the heuristic that buckets with higher frequencies contribute more to the estimation error than buckets with lower frequencies. Specifically, we assign the blame for the error to the buckets used for estimation in proportion to their current frequencies. An alternative heuristic, not studied in this paper, is to assign the blame in proportion to the current ranges of the buckets. Finally, we multiply the estimation error by a damping factor between 0 and 1 to make sure that bucket frequencies are not modified too much in response to errors, as this may lead to oversensitive or unstable histograms.

Figure 2 presents the algorithm for updating the bucket frequencies of a ST-histogram, h, in response to a range selection, [rangelow, rangehigh], with actual result size act. This algorithm is used for both on-line and off-line refinement.

algorithm UpdateFreq
Inputs: h, rangelow, rangehigh, act
Outputs: h with updated bucket frequencies
begin
1  Get the set of k buckets overlapping the selection range, {b_1, b_2, ..., b_k};
2  est = Estimated result size of selection using histogram h;
3  esterr = act - est;   /* Compute the absolute estimation error. */
4  /* Distribute the error among the buckets in proportion to frequency. */
5  for i = 1 to k do
6    frac = (min(rangehigh, high(b_i)) - max(rangelow, low(b_i)) + 1) / (high(b_i) - low(b_i) + 1);
7    freq(b_i) = max(freq(b_i) + α * esterr * frac * freq(b_i) / est, 0);
8  endfor
end UpdateFreq

Figure 2: Algorithm for updating bucket frequencies in one-dimensional ST-histograms

The algorithm first determines the histogram buckets that overlap the selection range, whether they partially overlap the range or are totally contained in it, and the estimated result size. The query optimizer usually obtains this information during query optimization, so we can save some effort by retaining this information for subsequently refining bucket frequencies.

Next, the algorithm computes the absolute estimation error, denoted by esterr (line 3 in Figure 2). The error formula distinguishes between overestimation, indicated by a negative error and requiring the bucket frequencies to be lowered, and underestimation, indicated by a positive error and requiring the bucket frequencies to be raised. As mentioned earlier, the blame for this error is assigned to histogram buckets in proportion to the frequencies that they contribute to the result size. We assume that each bucket contains all possible values in the range that it represents, and we approximate all frequencies in a bucket by their average (i.e., we make the continuous values and uniform frequencies assumptions [PIHS96]). Under these assumptions, the contribution of a histogram bucket to the result size is equal to its frequency times the fraction of the bucket overlapping the selection range.
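The initialization of Section 3.1 and the frequency update of Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names Bucket, init_histogram, overlap_fraction, estimate, and update_freq are hypothetical, the attribute domain is assumed to be integer-valued with inclusive bucket ranges (matching the "+ 1" terms in Figure 2), the damping factor is passed as alpha, and the guard for a zero estimate is an assumption that the text does not discuss.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    low: int     # inclusive lower end of the bucket range
    high: int    # inclusive upper end of the bucket range
    freq: float  # estimated number of tuples in [low, high]


def init_histogram(num_buckets, num_tuples, min_val, max_val):
    """Evenly spaced buckets, each holding T/B tuples (uniformity assumption)."""
    width = (max_val - min_val + 1) / num_buckets
    buckets = []
    for i in range(num_buckets):
        low = min_val + round(i * width)
        high = min_val + round((i + 1) * width) - 1
        buckets.append(Bucket(low, high, num_tuples / num_buckets))
    return buckets


def overlap_fraction(b, range_low, range_high):
    """Fraction of bucket b covered by the selection range (line 6 of Figure 2)."""
    overlap = min(range_high, b.high) - max(range_low, b.low) + 1
    return max(overlap, 0) / (b.high - b.low + 1)


def estimate(hist, range_low, range_high):
    """Estimated result size: each bucket contributes freq times overlap fraction."""
    return sum(b.freq * overlap_fraction(b, range_low, range_high) for b in hist)


def update_freq(hist, range_low, range_high, act, alpha=0.5):
    """Distribute the damped estimation error in proportion to frequency."""
    est = estimate(hist, range_low, range_high)
    if est == 0:
        return  # assumption of this sketch: nothing to apportion, leave h unchanged
    esterr = act - est  # positive means underestimation, negative overestimation
    for b in hist:
        frac = overlap_fraction(b, range_low, range_high)
        if frac > 0:
            b.freq = max(b.freq + alpha * esterr * frac * b.freq / est, 0.0)
```

For example, with B = 4, T = 1000, and values 1 to 100, a selection on [1, 50] is initially estimated at 500 tuples. If the actual result size is 800 and alpha = 0.5, the two overlapping buckets each move from 250 to 325, so the same selection would then be estimated at 650, halfway between the old estimate and the observed size.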
This fraction is the length of the interval where the bucket overlaps the selection range divided by the length of the interval represented by the bucket (line 6). To distribute the error among buckets in proportion to frequency, each bucket is assigned a portion of the absolute estimation error, esterr, equal to its contribution to the result size, frac * freq(b_i), divided by the total result size, est, damped by a damping factor, α (line 7). We experimentally demonstrate in Section 5 that the refinement process is robust across a wide range of values for α, and we recommend using values of α in the range 0.5 to 1.

3.3 Restructuring

Refining bucket frequencies is not enough to get an accurate histogram. The frequencies in a bucket are approximated by their average. If there is a large variation in frequency within a bucket, the average frequency is a poor approximation of the individual frequencies, no matter how accurate it is. Specifically, high frequency values will be contained in high frequency buckets, but they may be grouped with low frequency values in these buckets. Thus, in addition to refining the bucket frequencies, we must also restructure the buckets, i.e., move the bucket boundaries to get a better partitioning that avoids grouping high frequency and low frequency values in the same buckets. Ideally, we would like to make high frequency buckets as narrow as possible. In the limit, this approach separates out high frequency values in singleton buckets of their own, a common objective for histograms (e.g., see [PIHS96]). Therefore, we choose buckets that currently have high frequency and split them into several buckets. Splitting induces the separation of high frequency and low frequency values into different buckets, and the frequency refinement process later adjusts the frequencies of these new buckets. In order to ensure that the number of buckets assigned to the ST-histogram does not increase due to splitting, we need a mechanism to reclaim buckets as well.
To that end, we use a step of merging that groups a run of consecutive buckets with similar frequencies into one bucket. Thus, our approach is to restructure the histogram periodically by merging buckets and using the buckets thus freed to split high frequency buckets. Restructuring may be triggered using a variety of heuristics. In this paper, we study the simplest scheme where the restructuring process is invoked after every R selections that use the histogram. The parameter R is called the restructuring interval.

To merge buckets with similar frequencies, we first have to decide how to quantify "similar frequencies". We assume that two bucket frequencies are similar if the difference between them is less than m percent of the number of tuples in the relation, T. m is a parameter that we call the merge threshold. In most of our experiments, m ≤ 1% was a suitable choice. We use a greedy strategy to form a run of adjacent buckets with similar frequencies and collapse them into a single bucket. We repeat this step until no further merging is possible that satisfies the merge threshold condition (Steps 2-9 in Figure 3).

We also need to decide which high frequency buckets to split. We choose to split the s percent of the buckets with the highest frequencies. s is a parameter that we call the split threshold. In our experiments, we used s = 10%. Our heuristic distributes the reclaimed buckets among the high frequency buckets in proportion to frequency. The higher the frequency of a bucket, the more extra buckets it gets.

Figure 3 presents the algorithm for restructuring a ST-histogram, h, of B buckets on a relation with T tuples.

algorithm RestructureHist
Inputs: h
Outputs: restructured h
begin
1   /* Find buckets with similar frequencies to merge. */
2   Initialize B runs of buckets such that each run contains one histogram bucket;
3   For every two consecutive runs of buckets, find the maximum difference in frequency between a bucket in the
4     first run and a bucket in the second run;
5   Find the minimum of all these maximum differences, mindiff;
6   if mindiff ≤ m * T then
7     Merge the two runs of buckets corresponding to mindiff into one run;
8     Look for other runs to merge. Goto line 3;
9   endif
10
11  /* Assign the extra buckets freed by merging to the high frequency buckets. */
12  k = s * B;
13  Find the set, {b_1, b_2, ..., b_k}, of buckets with the k highest frequencies that were not chosen to be
14    merged with other buckets in the merging step;
15  Assign the buckets freed by merging to the buckets of this set in proportion to their frequencies;
16
17  /* Construct the restructured histogram by merging and splitting. */
18  Merge each previously formed run of buckets into one bucket spanning the range represented by all the buckets
19    in the run and having a frequency equal to the sum of their frequencies;
20  Split the k buckets chosen for splitting, giving each one the number of extra buckets assigned to it earlier.
21    The new buckets are evenly spaced in the range spanned by the old bucket and the frequency of the old
22    bucket is equally distributed among them;
end RestructureHist

Figure 3: Algorithm for restructuring one-dimensional ST-histograms

The first step in histogram restructuring is greedily finding runs of consecutive buckets with similar frequencies to merge. The algorithm repeatedly finds the pair of adjacent runs of buckets such that the maximum difference in frequency between a bucket in the first run and a bucket in the second run is the minimum over all pairs of adjacent runs. The two runs are merged into one if this difference is less than the threshold m * T, and we stop looking for runs to merge if it is not. This process results in a number of runs of several consecutive buckets. Each run is replaced with one bucket spanning its entire range, and with a frequency equal to the total frequency of all the buckets in the run. This frees a number of buckets to allocate to high frequency buckets during splitting.
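The greedy merging step (lines 2-9 of Figure 3) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: find_merge_runs is a hypothetical name, the function works on bucket frequencies only, the threshold argument stands for m * T, and ties between equally good pairs of runs are broken by taking the leftmost pair, which the text does not specify.

```python
def find_merge_runs(freqs, threshold):
    """Greedily group adjacent buckets into runs of similar frequency.

    freqs is the list of bucket frequencies in bucket order; threshold is m * T.
    Returns a list of runs, each a list of 0-based bucket indices.
    """
    runs = [[i] for i in range(len(freqs))]  # start with one bucket per run

    def max_diff(j):
        # Largest frequency difference between a bucket of run j and one of run j+1.
        return max(abs(freqs[a] - freqs[b]) for a in runs[j] for b in runs[j + 1])

    while len(runs) > 1:
        best = min(range(len(runs) - 1), key=max_diff)  # leftmost pair on ties
        if max_diff(best) > threshold:
            break  # no remaining pair of adjacent runs is similar enough
        runs[best:best + 2] = [runs[best] + runs[best + 1]]
    return runs
```

Applied to the bucket frequencies used in the example of Figure 4, `find_merge_runs([10, 13, 17, 14, 13, 11, 25, 70, 10, 30], 3)` groups buckets 1-2 and buckets 4-6 (0-based indices [0, 1] and [3, 4, 5]) and leaves the rest as single-bucket runs, so ten buckets collapse to seven and three buckets are freed.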
Splitting starts by identifying the s percent of the buckets that have the highest frequencies and are not singleton buckets. We avoid splitting buckets that have been chosen for merging since their selection indicates that they have similar frequencies to their neighbors. The extra buckets freed by merging are distributed among the buckets being split in proportion to their frequencies. A bucket being split, b_i, gets freq(b_i) / totalfreq of the extra buckets, where totalfreq is the total frequency of the buckets being split. To split a bucket, it is replaced with itself plus the extra buckets assigned to it. These new buckets evenly divide the range of the old bucket, and the frequency of the old bucket is evenly distributed among them.

Splitting and merging are used in [GMP97] to redistribute histogram buckets in the context of maintaining approximate equi-depth and compressed histograms. The algorithm in [GMP97] merges pairs of buckets whose total frequency is less than a threshold, whereas our algorithm merges runs of buckets based on the differences in their frequency. Our algorithm assigns the freed buckets to the buckets being split in proportion to the frequencies of the latter, whereas the algorithm in [GMP97] merges only one pair of buckets at a time and can, thus, split only one bucket into two. A key difference between the two approaches is that in [GMP97], a sample of the tuples of the relation is continuously maintained (the "backing sample"), and buckets are split at their approximate medians computed from this sample. On the other hand, our approach does not examine the data at any point, so we do not have information similar to that represented in the backing sample of [GMP97]. Hence, our restructuring algorithm splits buckets at evenly spaced intervals, without using any information about the data distribution within a bucket.

Figure 4 gives an example of histogram restructuring. In this example, the merge threshold is such that algorithm RestructureHist merges buckets if the difference between their frequencies is within 3.
The algorithm identifies two runs of buckets to be merged, buckets 1 and 2, and buckets 4 to 6. Merging these runs frees three buckets to assign to high frequency buckets. The split threshold is such that we split the two buckets with the highest frequencies, buckets 8 and 10. Assigning the extra buckets to these two buckets in proportion to frequency means that bucket 8 gets two extra buckets and bucket 10 gets one extra bucket.

Splitting may unnecessarily separate values with similar, low frequencies into different buckets. Such runs of buckets with similar low frequencies would be merged during subsequent restructuring. Notice that splitting distorts the frequency of a bucket by distributing it among the new buckets. This means that the histogram may lose some of its accuracy by restructuring. This accuracy is restored when the bucket frequencies are refined through subsequent feedback.

In summary, our model is as follows: The frequency refinement process is applied to the histogram, and the refined frequency information is periodically used to restructure the histogram. Restructuring may reduce accuracy by distributing frequencies among buckets during splitting, but frequency refinement restores, and hopefully increases, histogram accuracy.
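The merging and splitting steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `merge_runs` and `restructure` are hypothetical, bucket ranges are omitted so that only frequencies are tracked, ties in the minimum difference are broken by taking the leftmost pair, and the extra buckets are rounded with a largest-remainder rule that the text does not specify.

```python
def merge_runs(freqs, threshold):
    """Repeatedly merge the pair of adjacent runs with the smallest
    maximum frequency difference, while it stays within the threshold."""
    runs = [[i] for i in range(len(freqs))]
    while len(runs) > 1:
        # Largest frequency gap between a bucket of one run and a bucket of the next.
        diffs = [
            max(abs(freqs[a] - freqs[b]) for a in runs[r] for b in runs[r + 1])
            for r in range(len(runs) - 1)
        ]
        mindiff = min(diffs)
        if mindiff > threshold:
            break
        r = diffs.index(mindiff)  # assumption: leftmost pair wins ties
        runs[r:r + 2] = [runs[r] + runs[r + 1]]
    return runs


def restructure(freqs, merge_threshold, num_split):
    """Return the bucket frequencies after one merge-and-split pass."""
    runs = merge_runs(freqs, merge_threshold)
    freed = len(freqs) - len(runs)
    # Only buckets that were not merged are candidates for splitting.
    unmerged = [run[0] for run in runs if len(run) == 1]
    chosen = sorted(unmerged, key=lambda i: freqs[i], reverse=True)[:num_split]
    total = sum(freqs[i] for i in chosen)
    extra = {i: 0 for i in chosen}
    if total > 0:
        # Proportional shares, rounded so that they add up to `freed` (assumed rule).
        shares = {i: freed * freqs[i] / total for i in chosen}
        extra = {i: int(shares[i]) for i in chosen}
        leftover = freed - sum(extra.values())
        by_remainder = sorted(chosen, key=lambda i: shares[i] - extra[i], reverse=True)
        for i in by_remainder[:leftover]:
            extra[i] += 1
    new_freqs = []
    for run in runs:
        parts = 1 + extra.get(run[0], 0) if len(run) == 1 else 1
        run_freq = sum(freqs[i] for i in run)
        new_freqs.extend([run_freq / parts] * parts)  # frequency spread evenly
    return new_freqs


figure4 = [10, 13, 17, 14, 13, 11, 25, 70, 10, 30]
print(merge_runs(figure4, 3))
print(restructure(figure4, 3, 2))
```

On the frequencies of Figure 4 the sketch merges buckets 1-2 and 4-6 and gives buckets 8 and 10 two and one extra buckets, respectively. It splits 70 into three equal real-valued parts, whereas the figure shows integer frequencies.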

Before restructuring (Merge: m*T = 3; Split: s*B = 2):

Buckets       1   2   3   4   5   6   7   8   9  10
Frequencies  10  13  17  14  13  11  25  70  10  30

Merging buckets 1-2 frees 1 extra bucket and merging buckets 4-6 frees 2 extra buckets, for 3 extra buckets in total. Bucket 8 is split and receives 3 × 70/(70+30) ≈ 2 extra buckets; bucket 10 is split and receives 3 × 30/(70+30) ≈ 1 extra bucket.

After restructuring:

Buckets       1   2   3   4   5   6   7   8   9  10
Frequencies  23  17  38  25  23  23  24  10  15  15

Figure 4: Example of histogram restructuring

4. Multi-dimensional ST-histograms

In this section, we present multi-dimensional (i.e., multi-attribute) ST-histograms. Our goal is to build histograms representing the joint distribution of multiple attributes of a single relation. These histograms will be used to estimate the result size of conjunctive range selections on these attributes, and are refined based on feedback from these selections. Using accurate one-dimensional histograms for all the attributes is not enough, because they do not reflect the correlation between attributes. In this section, we discuss the special considerations for multi-dimensional histograms.

Working in multiple dimensions raises the issue of how to partition the multi-dimensional space into histogram buckets. The effectiveness of ST-histograms stems from their ability to pinpoint the buckets contributing to the estimation error and "learn" the data distribution. The partitioning we choose must efficiently support this learning process. It must also be a partitioning that is easy to construct and maintain, because we want the cost of ST-histograms to remain as low as possible. To achieve these objectives, we use a grid partitioning of the multi-dimensional space. Each dimension of the space is partitioned into a number of partitions. The partitions of a dimension may vary in size, but the partitioning of the space is always fully described by the partitioning of the dimensions.

We choose a grid partitioning due to its simplicity and low cost, even though it does not offer as much flexibility in grouping values into buckets as other partitionings such as, for example, the MHIST-p histogram partitioning [PI97]. The simplicity of a grid partitioning allows our histograms to have more buckets for a given amount of memory.
It is easier for ST-histograms to infer the data distribution from feedback information when working with a simple high-resolution representation of the distribution than it is when working with a complex low-resolution representation. Furthermore, we doubt that the simple feedback information used for refinement can be used to glean enough information about the data distribution to justify a more complex partitioning.

Each dimension, $i$, of an n-dimensional ST-histogram is partitioned into $B_i$ partitions. $B_i$ does not necessarily equal $B_j$ for $i \neq j$. The partitioning of the space is described by n arrays, one per dimension, which we call the scales [NHS84]. Each array element of the scales represents the range of one partition, [low, high]. In addition to the scales, a multi-dimensional ST-histogram has an n-dimensional matrix representing the grid cell frequencies, which we call the frequency matrix. Figure 5 presents an example of a 5×5 two-dimensional ST-histogram and a range selection that uses it.

Scales and frequency matrix (rows: partitions of attribute 2; columns: partitions of attribute 1):

                   attribute 1
                   1        2        3        4        5
                   [1,5]    [6,10]   [11,15]  [16,20]  [21,25]
attribute 2
  1 [1,10]         11       6        43       14       26
  2 [11,20]        11       60       12       8        9
  3 [21,30]        65       37       28       44       26
  4 [31,40]        10       5        8        20       7
  5 [41,50]        14       9        7        19       11

Figure 5: A 2d ST-histogram and a range selection using it

4.1 Initial Histogram

To build a ST-histogram on attributes $a_1, a_2, \ldots, a_n$, we can assume complete uniformity and independence, or we can use existing one-dimensional histograms but assume independence of the attributes as the starting point.

If we start with the uniformity and independence assumption, we need to know the minimum and maximum values of each attribute $a_i$, $min_i$ and $max_i$. We also need to specify the number of partitions for each dimension, $B_1, B_2, \ldots, B_n$. Then, each dimension, $i$, is partitioned into $B_i$ equally spaced partitions, and the T tuples of the relation are evenly distributed among all the buckets of the frequency matrix. This technique is an extension of one-dimensional ST-histograms.
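As a rough illustration of initialization under uniformity and independence, the following Python sketch builds the scales and a frequency matrix. The function name `init_uniform` and the dictionary-based matrix are hypothetical choices, and the domain is treated as continuous, so the boundary handling for integer attributes is not addressed.

```python
from itertools import product


def init_uniform(mins, maxs, parts, total_tuples):
    """Build scales and a uniform frequency matrix for an n-dimensional grid.

    mins[i], maxs[i]: assumed known range of attribute i
    parts[i]:         number of partitions B_i for dimension i
    """
    scales = []
    for lo, hi, b in zip(mins, maxs, parts):
        width = (hi - lo) / b
        # Equally spaced partitions [low, high) along this dimension.
        scales.append([(lo + k * width, lo + (k + 1) * width) for k in range(b)])
    num_cells = 1
    for b in parts:
        num_cells *= b
    per_cell = total_tuples / num_cells  # the T tuples spread evenly over all cells
    freq = {idx: per_cell for idx in product(*(range(b) for b in parts))}
    return scales, freq


scales, freq = init_uniform([0, 0], [10, 20], [2, 4], 80)
print(scales[0], freq[(0, 0)])
```

A dictionary keyed by cell index keeps the sketch independent of the number of dimensions; a dense array would be the more natural choice in practice.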
Another way of building multi-dimensional ST-histograms is to start with traditional one-dimensional histograms on all the multi-dimensional histogram attributes. Such one-dimensional histograms, if they are available, provide a better starting point than assuming uniformity and independence. In this case, we initialize the scales by partitioning the space along the bucket boundaries of the one-dimensional histograms, and we initialize the frequency matrix using the bucket frequencies of the one-dimensional histograms and assuming that the attributes are independent. Under the independence assumption, the initial frequency of a cell of the frequency matrix is given by

$$\mathit{freq}[j_1, j_2, \ldots, j_n] = \frac{1}{T^{\,n-1}} \prod_{i=1}^{n} \mathit{freq}_i[j_i],$$

where $\mathit{freq}_i[j_i]$ is the frequency of bucket $j_i$ of the histogram for dimension $i$.
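The independence-based initialization can be sketched as follows. The function name `init_independent` is hypothetical, and each one-dimensional histogram is assumed to be given simply as a list of bucket frequencies that sums to T.

```python
from itertools import product
from math import prod


def init_independent(hists_1d, total_tuples):
    """Initial cell frequencies from one-dimensional bucket frequencies.

    Each cell gets the product of its one-dimensional bucket frequencies
    divided by T^(n-1), which is the independence assumption.
    """
    n = len(hists_1d)
    scale = total_tuples ** (n - 1)
    return {
        idx: prod(h[j] for h, j in zip(hists_1d, idx)) / scale
        for idx in product(*(range(len(h)) for h in hists_1d))
    }


cells = init_independent([[60, 40], [50, 30, 20]], 100)
print(cells[(0, 0)], sum(cells.values()))
```

With T = 100, the cell combining a bucket of frequency 60 with a bucket of frequency 50 starts at 60 × 50 / 100 = 30 tuples, and the cell frequencies still sum to T.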

4.2 Refining Bucket Frequencies

The algorithm for refining bucket frequencies in the multi-dimensional case is identical to the one-dimensional algorithm presented in Figure 2, except for two differences. First, finding the histogram buckets that overlap a selection range (line 1 in Figure 2) now requires examining a multi-dimensional structure. Second, a bucket is now a multi-dimensional cell in the frequency matrix, so the fraction of a bucket overlapping the selection range (line 6) is equal to the volume of the region where the bucket overlaps the selection range divided by the volume of the region represented by the whole bucket (Figure 5).

4.3 Restructuring

Periodic restructuring is needed only for multi-dimensional ST-histograms initialized assuming uniformity and independence. ST-histograms initialized using traditional one-dimensional histograms do not need to be periodically restructured, assuming that the one-dimensional histograms are accurate. This is based on the assumption that the partitioning of an accurate traditional one-dimensional histogram built by looking at the data is more accurate when used for multi-dimensional ST-histograms than a partitioning built by splitting and merging.

As in the one-dimensional case, restructuring in the multi-dimensional case is based on merging buckets with similar frequencies and splitting high frequency buckets. The required parameters are also the same, namely the restructuring interval, R, the merge threshold, m, and the split threshold, s. Restructuring changes the partitioning of the multi-dimensional space one dimension at a time. The dimensions are processed in any order, and the partition boundaries of each dimension are modified independent of other dimensions. The algorithm for restructuring one dimension of the multi-dimensional ST-histogram is similar to the algorithm in Figure 3. However, merging and splitting in multiple dimensions present some additional problems.

For an n-dimensional ST-histogram, every partition of the scales in any dimension identifies an (n-1)-dimensional "slice" of the grid (e.g., a row or a column in a two-dimensional histogram).
Thus, merging two partitions of the scales requires merging two slices of the frequency matrix, each containing several buckets. Every bucket from the first slice is merged with the corresponding bucket from the second slice. To decide whether or not to merge two slices, we find the maximum difference in frequency between any two corresponding buckets that would be merged if these two slices are merged. We merge the two slices only if this difference is within m*T tuples. We use this method to identify runs of partitions to merge.

The high frequency partitions of any dimension are split by assigning them the extra partitions freed by merging in the same dimension. Thus, restructuring does not change the number of partitions in a dimension. To decide which partitions to split in any dimension and how many extra partitions each one gets, we use the marginal frequency distribution along this dimension. The marginal frequency of a partition is the total frequency of all buckets in the slice of the frequency matrix that it identifies. Thus, the marginal frequency of partition $j_i$ in dimension $i$ is given by

$$f_i(j_i) = \sum_{j_1=1}^{B_1} \cdots \sum_{j_{i-1}=1}^{B_{i-1}} \; \sum_{j_{i+1}=1}^{B_{i+1}} \cdots \sum_{j_n=1}^{B_n} \mathit{freq}[j_1, j_2, \ldots, j_n].$$

As in the one-dimensional case, we split the s percent of the partitions in any dimension with the highest marginal frequencies, and we assign them the extra partitions in proportion to their current marginal frequencies.

Before restructuring (rows: partitions of attribute 2; columns: partitions of attribute 1):

                   1        2        3        4        5
                   [1,5]    [6,10]   [11,15]  [16,20]  [21,25]
  1 [1,10]         11       6        43       14       26
  2 [11,20]        11       60       12       8        9
  3 [21,30]        65       37       28       44       26
  4 [31,40]        10       5        8        20       7
  5 [41,50]        14       9        7        19       11

After restructuring the vertical dimension:

                   1        2        3        4        5
                   [1,5]    [6,10]   [11,15]  [16,20]  [21,25]
  1 [1,10]         11       6        43       14       26
  2 [11,20]        11       60       12       8        9
  3 [21,25]        33       18       14       22       13
  4 [26,30]        32       19       14       22       13
  5 [31,50]        24       14       15       39       18

Figure 6: Restructuring the vertical dimension

Figure 6 demonstrates restructuring the histogram in Figure 5 along the vertical dimension (attribute 2). In this example, the merge threshold is such that we merge two partitions if the maximum difference in frequency between buckets in their slices that would be merged is within 5. This condition leads us to merge partitions 4 and 5.
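The slice comparison and the marginal frequencies for this example can be checked with a short Python sketch. The helper names are hypothetical, the code handles only the rows of a two-dimensional matrix, and it compares single adjacent slices rather than runs of slices.

```python
# Frequency matrix of Figure 5: one row per partition of attribute 2.
FREQ = [
    [11, 6, 43, 14, 26],
    [11, 60, 12, 8, 9],
    [65, 37, 28, 44, 26],
    [10, 5, 8, 20, 7],
    [14, 9, 7, 19, 11],
]


def max_slice_diffs(matrix):
    """Maximum difference between corresponding buckets of adjacent rows."""
    return [
        max(abs(a - b) for a, b in zip(upper, lower))
        for upper, lower in zip(matrix, matrix[1:])
    ]


def marginals(matrix):
    """Marginal frequency of each row partition: the sum over its slice."""
    return [sum(row) for row in matrix]


def merge_rows(matrix, i):
    """Merge row i with row i + 1 by adding corresponding buckets."""
    merged = [a + b for a, b in zip(matrix[i], matrix[i + 1])]
    return matrix[:i] + [merged] + matrix[i + 2:]


print(max_slice_diffs(FREQ))
print(marginals(FREQ))
print(merge_rows(FREQ, 3)[3])
```

Only the last pair of rows differs by at most 5 in every column, so only partitions 4 and 5 qualify for merging, and the third row has the largest marginal frequency.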
The split threshold is such that we split one partition along the vertical dimension. We compute the marginal frequency distribution along the vertical dimension and identify the partition with the maximum marginal frequency, partition 3. Merging and splitting (with some provisions for rounding) result in the shown histogram.

[Figure 6 annotations. Merge: m*T = 5; Split: s*B_2 = 1. Maximum frequency difference between adjacent partitions: 60-6 = 54, 65-11 = 54, 65-10 = 55, 14-10 = 4 ≤ 5 (merge). Marginal frequency distribution: 100, 100, 200 (max, split), 50, 60.]

5. Experimental Evaluation

In this section, we present an experimental evaluation of our techniques using synthetic data sets and workloads. We investigate the accuracy and efficiency of one and multi-dimensional ST-histograms. In particular, we are interested in the accuracy of ST-histograms for data distributions with varying degrees of skew, and for workloads with different access patterns. We examine whether histogram refinement converges to an accurate state, or whether it oscillates in response to refinement. Another important consideration is how well ST-histograms adapt to database updates, and how efficiently they use the available memory. Due to space limitations, we present only a subset of the experiments conducted.

5.1 Setup for Experiments

5.1.1 Data Sets

We present the results of experiments using one to three-dimensional integer data sets. The results for higher dimensional data sets are similar. The one-dimensional data sets have 100K tuples and the multi-dimensional data sets have 500K tuples. Each dimension in a data set has V distinct values drawn randomly from a domain ranging from 1 to 1000. V = 200, 100, and 10, for 1, 2, and 3 dimensions, respectively. For multi-dimensional data sets, the number of distinct values and the domains of all dimensions are identical, and the value sets of all dimensions are generated independently. Frequencies are generated according to the Zipfian distribution [Zip49] with parameter z = 0, 0.5, 1, 2, and 3. z controls the skew of the distribution, with z=0 representing a uniform distribution (no skew).
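For readers unfamiliar with the Zipfian distribution, the sketch below generates frequencies using the usual form in which the frequency of the value with rank r is proportional to 1/r^z. The function name `zipf_frequencies` is hypothetical, and the exact generator used in the experiments is not specified here, so this shows only the general shape of such data.

```python
def zipf_frequencies(num_values, total_tuples, z):
    """Frequencies for `num_values` distinct values under a Zipfian law.

    The frequency of rank r is proportional to 1 / r**z; z = 0 gives a
    uniform distribution and larger z gives more skew.
    """
    weights = [1.0 / (rank ** z) for rank in range(1, num_values + 1)]
    norm = sum(weights)
    return [total_tuples * w / norm for w in weights]


print(zipf_frequencies(4, 100, 0))
print(zipf_frequencies(4, 100, 2))
```

The frequencies are real-valued; a data generator would still need to round them to integer tuple counts.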
For one-dimensional data sets, the frequencies are assigned at random to the values. For multi-dimensional data sets, the frequencies are assigned at random to combinations of values using the technique proposed in [PI97], namely assigning the generated frequencies to randomly chosen cells in the joint frequency distribution matrix.

5.1.2 Query Workloads

We use workloads consisting of random range selection queries in one or more dimensions. Each workload consists of 2000 independent selection queries. Most experiments use random workloads, in which the corner points of each selection range are independently generated from a uniform distribution over the entire domain. Some experiments use workloads with locality of reference. The attribute values used for selection range corner points in these workloads are generated from piecewise uniform distributions in which there is an 80% probability of choosing a value from a locality range that is 20% of the domain. The locality ranges for the different dimensions are independently chosen at random according to a uniform distribution.

5.1.3 Histograms

Unless otherwise stated, we use 100, 50, and 15 buckets per dimension for 1, 2, and 3 dimensional ST-histograms, respectively. For multi-dimensional ST-histograms, we use the same number of buckets in all dimensions, resulting in two and three-dimensional histograms with a total of 2500 and 3375 buckets. The one, two, and three-dimensional ST-histograms occupy 1.2, 10.5, and 13.5 kilobytes of memory, respectively.

Our traditional histograms of choice are MaxDiff(V,A) histograms for one dimension, and MHIST-2 MaxDiff(V,A) histograms for multiple dimensions. These histograms were recommended in [PIHS96] and [PI97] for their accuracy and ease of construction. We compare the accuracy of ST-histograms to traditional histograms of these types occupying the same amount of memory. We consider a wider range of memory allocation than most previous works (e.g., [PIHS96], [PI97], and [MVW98]) because of current trends in memory technology. We also demonstrate that our techniques are effective across a wide range of available memory (Section 5.7).
Note that the cost of building and maintaining traditional histograms is a function of the size of the relation (or the size of the sample used to build the histogram). In contrast, the cost of ST-histograms is independent of the data size and depends on the size of the query workload used for refinement.

5.1.4 Refinement Parameters

Unless otherwise stated, the parameters we use for restructuring the histogram (Section 3.3) are a restructuring interval, R=200 queries, a merge threshold, m=0.025%, and a split threshold, s=10%. For frequency refinement (Section 3.2), we use a damping factor, α=0.5 for one dimension, and α=1 for multiple dimensions.

5.1.5 Measuring Histogram Accuracy

We use the relative estimation error (abs(actual result size - estimated result size) / actual result size) to measure the accuracy of query result size estimation. To measure accuracy over an entire workload, we use the average relative estimation error for all queries in the workload, ignoring queries whose actual result size is zero.

One important question is with respect to which workload we should measure the accuracy of a ST-histogram. Recall that the premise of ST-histograms is that they are able to adapt to feedback from query execution. Therefore, for our evaluation we generate workloads that are statistically similar, but not the same as the training workload.

Unless otherwise stated, our experiments use off-line histogram refinement. Our steps for verifying the effectiveness of ST-histograms for some particular data set are:

1. Initialize a ST-histogram for the data set.
2. Issue the query workload that will be used to refine the histogram and generate a workload log. We call this the refinement workload.
3. Refine the histogram off-line based on the generated workload log.
4. After refinement, issue the refinement workload again and compute the estimation error. Verify that the error after refinement is less than the error before refinement.
5. Issue a different workload in which the queries have the same distribution as the workload used for refinement. We call this the test workload.
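The accuracy measure defined at the start of this section can be written down directly. In the sketch below the function name `average_relative_error` and the two parallel lists of result sizes are illustrative assumptions; returning 0.0 when every query has an empty result is a choice made here, not something the text prescribes.

```python
def average_relative_error(actual_sizes, estimated_sizes):
    """Average of |actual - estimated| / actual over a workload.

    Queries whose actual result size is zero are ignored.
    """
    errors = [
        abs(actual - estimated) / actual
        for actual, estimated in zip(actual_sizes, estimated_sizes)
        if actual != 0
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Three queries; the second has an empty result and is skipped.
print(average_relative_error([100, 0, 50], [90, 5, 75]))
```

In steps 4 and 5 above, this quantity would be computed once for the refinement workload and once for the test workload.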
We cannot expect the workload issued before refinement to be repeated exactly after refinement, but we can reasonably expect a workload with similar statistical characteristics. The ultimate test of accuracy is whether the ST-histogram performs well on the test workload.

5.2 Accuracy of One-dimensional ST-histograms

In this section, we experimentally study the effectiveness of one-dimensional ST-histograms for a wide range of data skew (z) using random workloads and the procedure outlined in Section 5.1.5. We demonstrate that ST-histograms are always better than assuming uniformity, and that they are competitive with MaxDiff(V,A) histograms in terms of accuracy except for highly skewed data sets.

[Figure 7: One-dimensional data, random workload. Relative error versus z (skew). Series: Assuming Uniformity; Before Refinement; After Refinement; After Refinement - Test Workload; MaxDiff(V,A); MaxDiff(V,A) - Test Workload.]

Figure 7 presents the estimation errors for a random refinement workload on one-dimensional data sets with varying z. For each data set, the figure presents the estimation error for the random refinement workload assuming a uniform distribution and using the initial ST-histogram constructed assuming uniformity. The estimation errors in these two cases are different due to rounding errors during histogram initialization. The figure also presents the average relative estimation error for the random refinement workload using the refined ST-histogram when this workload is issued again after it is used for refinement. It also presents the error for a statistically similar test workload using the refined ST-histogram. Finally, the figure presents the estimation errors for the refinement and test workloads using a traditional MaxDiff(V,A) histogram occupying the same amount of memory as the ST-histogram.

Histogram refinement results in a significant reduction in estimation error for all values of z. This reduced error is observed for both the refinement workload and the test workload, indicating a true improvement in histogram quality. Thus, ST-histograms are always better than assuming uniformity. The MaxDiff(V,A) histograms are more accurate than the ST-histograms. This is expected because MaxDiff(V,A) histograms are built based on the true distribution determined by examining the data. However, for low values of z, the estimation errors using refined ST-histograms are very close to the errors using MaxDiff(V,A) histograms, and are small enough for query optimization purposes.

MaxDiff(V,A) histograms are considerably more accurate than ST-histograms only for highly skewed data sets (z ≥ 2). This is expected because as z increases, the data distribution becomes more difficult to capture using simple feedback information. At the same time, the benefit of MaxDiff(V,A) histograms is maximum for highly skewed distributions [PIHS96].

5.3 Accuracy of Multi-Dimensional ST-histograms

In this section, we show that multi-dimensional ST-histograms initialized using traditional one-dimensional histograms are much more accurate than assuming independence. We also compare the performance of such ST-histograms and MHIST-2 histograms. In particular, we demonstrate that these ST-histograms are more accurate than MHIST-2 histograms for low to moderate values of z (i.e., low correlation). This is an important result because it indicates that ST-histograms are better than MHIST-2 histograms in both cost and accuracy for data distributions with low to medium correlation. For this paper, we only present the results of our experiments with ST-histograms initialized using traditional histograms. Experiments with the less accurate ST-histograms initialized assuming uniformity and independence have similar results.

Figures 8 and 9 present the results of using multi-dimensional ST-histograms initialized using MaxDiff(V,A) histograms and assuming independence for random workloads on two and three-dimensional data sets with varying z.
The information presented is the same as in Figure 7, except that we do not show the estimation error assuming uniformity because one would never assume uniformity when one-dimensional histograms are available, and we compare the performance of the ST-histograms against multi-dimensional MHIST-2 histograms instead of one-dimensional MaxDiff(V,A) histograms. Since the ST-histograms are initialized using MaxDiff(V,A) histograms, using them before refinement is the same as using the one-dimensional histograms and assuming independence.

[Figure 8: Two-dimensions, starting with MaxDiff(V,A). Relative error versus z (of joint distribution). Series: Before Refinement; After Refinement; After Refinement - Test Workload; MHIST-2; MHIST-2 - Test Workload.]

[Figure 9: Three-dimensions, starting with MaxDiff(V,A). Relative error versus z (of joint distribution). Series: Before Refinement; After Refinement; After Refinement - Test Workload; MHIST-2; MHIST-2 - Test Workload.]

The refined ST-histograms are more accurate than assuming independence, and the benefit of using them (i.e., the reduction in error) increases as z increases. ST-histograms are not as accurate as MHIST-2 histograms for high z, especially in three dimensions. This indicates that inferring joint data distributions based on simple feedback information becomes increasingly difficult with increasing dimensionality. As expected, MHIST-2 histograms are very accurate for high z [PI97], but we must bear in mind that the cost of building multi-dimensional MHIST-2 histograms is much more than the cost of building one-dimensional MaxDiff(V,A) histograms. Furthermore, this cost increases with increasing dimensionality.

Notice, though, that ST-histograms are more accurate than MHIST-2 histograms for low z. This is because MHIST-2 histograms use a complex partitioning of the space (as compared to ST-histograms). Representing this complex partitioning requires MHIST-2 histograms to have complex buckets that consume more memory than ST-histogram buckets.
Consequently, ST-histograms have more buckets than MHIST-2 histograms occupying the same amount of memory. For low z, the complex partitioning of MHIST-2 histograms does not increase accuracy because the joint distribution is close to uniform so any partitioning is fine. On the other hand, the large number of buckets in ST-histograms allows them to represent the distribution at a finer granularity, leading to higher accuracy. This result demonstrates the value of multi-dimensional ST-histograms for database systems. For data with low to moderate skew, ST-histograms provide an effective way of capturing dependencies between attributes at a low cost.

         Equi-width            Equi-depth            MaxDiff(V,A)
z        Before     After      Before     After      Before     After
0        4.27%      5.47%      6.41%      6.65%      4.93%      4.95%
0.5      6.77%      5.84%      8.67%      8.21%      6.64%      6.35%
1        37.61%     11.64%     39.94%     12.61%     36.37%     11.08%
2        562.36%    518.33%    615.06%    78.36%     435.54%    22.57%
3        530.71%    233.32%    383.76%    48.26%     460.71%    26.07%

Table 1: Starting with different types of 1d histograms

Table 1 presents the estimation errors for random workloads on two-dimensional data sets with varying z using ST-histograms built starting with traditional one-dimensional histograms. The errors are shown before refinement and after off-line refinement using the same random workloads. All one-dimensional histograms have 50 buckets. In addition to MaxDiff(V,A) histograms, the table presents the errors when we start with equi-width histograms, which are the simplest type of histograms, and when we start with equi-depth histograms, which are currently used by many commercial database systems. The table shows that ST-histograms are equally effective for all three types of one-dimensional histograms.

5.4 Effect of Locality of Reference in the Query Workload

An interesting issue is studying the performance of ST-histograms on workloads with locality of reference in accessing the data. Locality of reference is a fundamental concept underlying all database accesses, so one would expect real life workloads to have such locality. Moreover, purely random workloads provide feedback information about the entire distribution, while workloads with locality of reference provide most of their feedback about a small part of the distribution. We would like to know how effective this type of feedback is for histogram refinement. In this section, we demonstrate that ST-histograms perform well for workloads with locality of reference. We also demonstrate that histogram refinement adapts to changes in the locality range of the workload.

[Figure 10: Workloads with locality of reference, z=1. Relative error for 1 and 2 dimensions. Bars: W1 Unif and Indep; W1 Before Refinement; W1 After Refinement; W1 Traditional; W2 Unif and Indep; W2 Refined on W1; W2 After Refinement; W2 Traditional.]

Figure 10 presents the estimation errors for workloads with an 80%-20% locality for one and two-dimensional data sets with z=1. The first four bars for each data set present the errors for a workload, W1. The first two bars respectively show the errors assuming uniformity and independence, and using an initial ST-histogram representing the uniformity and independence assumption. The bars are not identical because of rounding errors. The third bar shows the error using the ST-histogram when issuing W1 again after it is used for refinement. The fourth bar shows the error for W1 using a traditional histogram. It is clear that refinement considerably improves estimation accuracy, making the ST-histogram almost as accurate as the traditional histogram. This improvement is also observed on test workloads that are statistically similar to W1.
Next, we keep the refined histogram and change the locality of reference of the workload. We issue a new workload, W2, with a different locality range. The next four bars in Figure 10 present the estimation errors for W2. First, we issue W2 and use the ST-histogram refined on W1 for result size estimation (sixth bar). This histogram is not as accurate for W2 as it was for W1, but it is better than assuming uniformity and independence. This means that refinement was still able to infer some information from the 20% of the queries of W1 that lie outside the locality range. When we refine the histogram on W2 and issue it again, we see that the ST-histogram becomes as accurate for W2 as it was for W1 after refinement. This improvement is also seen for workloads that are statistically similar to W2.

5.5 Adapting to Database Updates

[Figure 11: Adapting to database updates, z=1. Relative error for 1 and 2 dimensions. Bars: R1 Unif and Indep; R1 Before Refinement; R1 After Refinement; R1 Traditional; R2 Unif and Indep; R2 Refined on R1; R2 After Refinement; R2 Traditional for R1; R2 Traditional.]

The results of this section demonstrate that although ST-histograms do not examine data, the feedback mechanism enables these histograms to adapt to updates in the underlying relation. Figure 11 presents the estimation errors for one and two-dimensional data sets with z=1 using random workloads. The first four bars present the estimation errors for the original relation before update, which we denote by R1. We update the relation by deleting a random 25% of its tuples and inserting an equal number of tuples following a Zipfian distribution with z=1. We denote this updated relation by R2. We retain the traditional and ST-histograms built for R1 and re-issue the same random workload on R2. The fifth and sixth bars in Figure 11 are the estimation error for this workload on R2 assuming uniformity and independence, and using the ST-histogram that was refined for R1, respectively. The histogram is not as accurate as it was for R1, which is expected, but it is still more accurate than assuming uniformity and independence.
The seventh bar shows the error using the ST-histogram for R2 after refinement using the same workload. Refinement restores the accuracy of the ST-histogram and adapts it to the updates in the relation. We also observe this improvement in error for statistically similar test workloads. The last two bars in Figure 11 present the estimation error for the random workload issued on R2 using the traditional histograms for R1 and R2, respectively. As expected, updating the relation reduces histogram accuracy, and rebuilding the histogram restores this accuracy.

5.6 Refinement Parameters

In this section, we investigate the effect of the refinement parameters: R, m, and s for restructuring and α for updating bucket frequencies. Table 2 presents the average relative estimation errors for random test workloads using ST-histograms that have been refined off-line using other random refinement workloads for one to three-dimensional data sets with varying z. For each data set, the error is presented if the histogram is not restructured during refinement, and if it is restructured with R=200, m=0.025%, and s=10%. Restructuring has no benefit for low z, but as z increases the need for restructuring becomes evident. Thus, restructuring extends the range of data skew for which ST-histograms are effective.

Figure 12 presents the estimation errors for random workloads and workloads with locality of reference on one to three-dimensional data sets with z=1 using ST-histograms that have been refined off-line using other statistically similar refinement workloads for α=0.01 to 1. The estimation errors are relatively