Estimating Costs of Path Expression Evaluation in Distributed Object Databases


Gabriela Ruberg, Fernanda Baião, and Marta Mattoso
Department of Computer Science COPPE/UFRJ, P.O. Box 68511, Rio de Janeiro, RJ, 21945-970 Brazil
{gruberg, baiao, marta}@cos.ufrj.br

Abstract. Efficient evaluation of path expressions in distributed object databases involves choosing among several query processing strategies, due to the rich semantics involved in object-based data models and to the complexity added by the distribution. This work presents a new cost model for object-based query processors and addresses relevant issues, which are ignored or relaxed in other works in the literature, such as the selectivity of the path expression, the sharing degree of the referenced objects, the partial participation of the collections in the relationships, and the distribution of the database objects across the nodes of a network. These issues allowed us to present more realistic estimates for the query optimizer. Our cost model has been validated against experimental results obtained with an object DBMS prototype running in a distributed architecture, using the OO7 benchmark application.

1 Introduction

The development of realistic and efficient query optimizers is extremely important in enhancing the performance of database systems. In current query languages, path expression processing optimization is a central and difficult issue. Reference attributes in path expressions provide direct (pointer) access through object navigation, such as in object databases, or element navigation in XML [6]. The choice of the best execution plan to process a query with reference attributes is not simple for the query optimizer to make, due to the large number of execution strategies and algorithms to evaluate a path expression. Relevant issues must be considered for this problem, including choosing from binary or n-ary operators, pointer- or value-based algorithms, forward or reverse evaluation directions. These issues are not fully addressed in current cost models for object-based query optimizers, compromising the accuracy of estimate models.
R. Cicchetti et al. (Eds.): DEXA 2002, LNCS 2453, pp. 351-360, 2002. Springer-Verlag Berlin Heidelberg 2002

In a distributed environment, the query execution search space is even larger because of fragmented data. Distributed data processing is becoming popular due to performance gains obtained from PC clusters, grid computing, and the Web, among others [11]. However, current path expression optimizers lack practical cost functions for ad-hoc queries in fragmented collections of objects. These functions may not be directly obtained from centralized cost models because some fragmented data may be previously disregarded during the query execution, substantially modifying the query costs. Even in a centralized context, a realistic cost model cannot be obtained with a simple combination of relevant issues into a single model, because these issues are strongly related to each other and have to be remodeled. Table 1 identifies the presence of important issues in current object database cost models. Next, we discuss the impact of the issues of Table 1 on the cost estimates.

Estimating the selectivity factor is essential to the performance analysis of query processing. The selectivity of path expressions can vary significantly according to partial or total participation of a class in a relationship. Partial participation means that only a subset of the objects in a class are related to the objects of another class. However, most cost models [1, 2, 7, 8, 9, 10, 13] disregard partial participation. Cho et al. [5] present a realistic method for the estimation of selectivity factors, but only in centralized object databases.

A path expression can be evaluated in a forward direction (from the first to the last collection) or in a reverse direction (in the opposite way). Many cost models [1, 2, 7, 8, 9] are limited to the forward direction. The reverse direction is not obtained by simply changing the index variation; rather, other parameters have to be added. The two basic algebra operators for path expression evaluation are the n-ary operator and the binary operator. The execution costs of a path expression may vary significantly for each pair (evaluation direction, algebra operator), according to the selectivity of the nested predicates and to the partial participation of the collections in the relationships of the path expression. A cost model restricted to a specific direction or execution strategy may prevent the query optimizer from choosing the best execution plan.

The amount of IO operations, estimated in terms of data pages, is often presented as the basic cost in the query processing [1, 2, 5, 8, 9, 10, 13], especially in a centralized execution.
The object data model allows complex strategies due to the rich variety of constructors provided, and can drastically affect cost estimates of IO operations if techniques of object clustering are applied. This aspect has been disregarded by most cost models [1, 2, 5, 7, 8, 10, 13]. Very few works analyze CPU costs [9, 10]. Communication costs of distributed evaluation of path expressions in vertically and/or horizontally fragmented classes are not addressed in the literature. Almost all processing cost factors are strongly influenced by the size of available main memory, although this factor is usually not taken into account. In the small memory hypothesis, the IO reload overhead of a path expression evaluation is traditionally estimated using the collection fan-out parameter [8, 9, 10]. However, we have noticed that practically no additional IO operations are necessary if there is no object sharing in the relationships of the path expression, even if the fan-out is greater than one.

Works on distributed object-based cost functions are dedicated to algorithms for class partitioning in object databases. They are focused on the analysis of primary horizontal (P.H.F.) [1, 2, 7], derived horizontal (D.H.F.) [2], and vertical (V.F.) [8] fragmentation methodologies. Their application in a real query optimizer is somewhat restricted, since they disregard important issues in the path expression evaluation, such as the object clustering policy, the evaluation direction, the binary operator and algorithms, and CPU costs. These issues are not trivially included in the cost model of the algorithms.

Table 1. Important issues in path expression processing and related cost models

Cost models (columns): [1] [2] [3] [5] [7] [8] [9] [10] [13] [14]

Centralized
  Partial participation: X X X
  Physical object clustering: X X
  Evaluation direction: X X X X X
  N-ary operator: X X X X X X X X
  Binary operator: X X X X
  IO overhead due to obj. sharing: X
  IO costs: X X X X X X X X
  CPU costs: X X X
Distributed
  P.H.F.: X X X X
  D.H.F.: X X
  V.F.: X X
  Communication costs: X

We present a new cost model that covers the most representative algorithms for binary and n-ary operators, as well as forward and reverse directions for general path expression evaluation, in both centralized and distributed environments. An extended version of this work with detailed cost formulas may be found in [14]. In addition, our cost model has been validated against experimental results obtained in our previous work [15]. The remainder of this paper is organized as follows. Section 2 describes our cost model with emphasis on the estimation of selectivity factors. Section 3 shows the validation of our cost model against experimental results, obtained with an object DBMS prototype using the OO7 benchmark. Finally, Section 4 draws some considerations and future work.

2 Cost Model

Invariably, the complexity of optimization problems requires some simplifications in the cost model. We assume that: (i) the query optimizer is able to break the encapsulation property; (ii) objects have a size less than a database page; (iii) the attribute values are uniformly distributed among instances of a class; and (iv) each object collection has just one class as its domain. These assumptions are present in other cost models [1, 2, 3, 5, 8, 9, 10, 13] since they occur in most object-based DBMS, as well as in their typical applications. Thus, they do not limit the expressive power of our cost model. In our approach for estimating the cost of query execution plans, we consider that queries are issued against collections, thus some statistics are maintained for collections rather than for classes.
The parameters of the fragments $F_{i,j}$ are represented similarly to the parameters of the collections, adding the index $j$, $1 \le j \le f_i$. Therefore, $SEL'_{i,j}$ represents the selectivity of the path expression over the fragment $F_{i,j}$, while $D_{i,i+1,j}$ is the total number of distinct pointers from $C_i$ objects to $F_{i+1,j}$ objects.

Table 2. Cost Model Parameters

Parameter: Description
$SEL_i$: Selectivity of nested predicate $p_i$ over $C_i$
$SEL'_i$: Selectivity of the path expression over $C_i$
$|C_i|$: Cardinality of $C_i$
$\|C_i\|$: # pages of $C_i$
$S_i$: Average size of one object of $C_i$
$f_i$: # fragments of $C_i$
$Z_{i,i+1}$: Average # distinct pointers to $C_{i+1}$ objects from $C_i$ objects that have at least one non-null reference
$D_{i,i+1}$: Total # distinct pointers from $C_i$ objects to $C_{i+1}$ objects
$X_{i,i+1}$: # $C_i$ objects having all pointers to $C_{i+1}$ objects as null references
$sel_{i,j}$: Selectivity factor over the $C_i$ cardinality according to the $F_{i,j}$ cardinality
$REF_i$: # distinct accessed objects from $C_i$ in the path evaluation
$ref_{i,j}$: Analogous to $REF_i$, in $F_{i,j}$
$n$: Length of the path expression

2.1 Selectivity Factor of Path Expressions

The basis for evaluating query optimization strategies is the estimation of the selectivity factor of selection predicates and joins [5]. The selectivity factor of a path expression is the selectivity factor resultant from the nested predicates and the participation of each class collection in path relationships. Partial participation of a class collection influences the prediction of the path expression selectivity due not only to the estimation of the selectivity of implicit joins, but also to the estimation of the selectivity of nested predicates. Therefore, only the referenced objects in the path expression must be taken into account to estimate the selectivity factors over the collections. When pointer-based algorithms are used, the path expression selectivity over a collection $C_i$ represents the portion of the $C_i$ objects that will be accessed during the path evaluation. Moreover, the path expression selectivity determines the cardinality of the intermediate results generated by join algorithms (pointer and value-based). Its computation over each collection $C_i$, $1 \le i \le n$, also depends on the direction used to evaluate the path expression.
Thus, given a path expression, we may express the number of distinct accessed objects in collection $C_i$ during the path navigation as:

$$REF_i = SEL'_i \times |C_i| . \quad (1)$$

The term $SEL'_i$, $1 \le i \le n$, is obtained according to the evaluation direction:

In forward, $SEL'_1 = 1$ and
$$SEL'_i = \frac{SEL'_{i-1} \times SEL_{i-1} \times D_{i-1,i}}{|C_i|} ; \quad (2)$$

In reverse, $SEL'_n = 1$ and
$$SEL'_i = \frac{SEL'_{i+1} \times SEL_{i+1} \times (|C_i| - X_{i,i+1})}{|C_i|} . \quad (3)$$

Note that all objects in the starting collection ($C_1$ or $C_n$, according to the evaluation direction) are accessed because there is no filter from a previous relationship in the path expression. In path expressions involving large collections with low selectivity factors, the traditional probabilistic method for selectivity estimation [1, 2, 3, 5, 8, 10] results in a significant deviation from real values, as shown in Section 3. This difference, which is avoided in our method, may be propagated to the estimation of page hits and to all costs that are based on the selectivity factor (IO, CPU and communication costs). Additionally, our method presents low computational complexity, thus improving processing costs in the optimization task.

Fragmentation Effects. Horizontal fragmentation distributes class instances among fragments (object collections) with the same structure, according to a given fragmentation criterion. Analogously, vertical fragmentation splits the logical structure of a class and distributes its attributes (and methods) among fragments with the same cardinality. Let $C_i$, $1 \le i \le n$, be a collection of a path expression with primary horizontal or vertical fragmentation. During the evaluation of this path expression, the query processor can previously identify: (i) a horizontal fragment $F_{i,j}$, $1 \le j \le f_i$, where the selectivity of the associated nested predicate $p_i$ is zero; or (ii) a vertical fragment $F_{i,j}$, $1 \le j \le f_i$, whose attributes are not used in the query. In both cases, we assume $SEL_{i,j} = 0$, thus causing the elimination of $F_{i,j}$ during the query processing. If $C_i$ is fragmented, only the set of fragments of $C_i$ in which $SEL_{i,j} \neq 0$ will be scanned during the query evaluation process. We may define the $Elim_i$ subset containing all $C_i$ fragments eliminated by $SEL_{i,j}$ as:

$$Elim_i = \{ F_{i,j},\ 1 \le j \le f_i \mid SEL_{i,j} = 0 \} . \quad (4)$$

In addition, we may define the subset $Elim'_i$ that refers to the derived horizontal fragments from $C_i$ which were indirectly eliminated by the path expression selectivity (if their primary fragments were eliminated too), as follows:

$$Elim'_i = \{ F_{i,j},\ 1 \le j \le f_i \mid (F_{i-1,k} \rightarrow F_{i,j}) \wedge (F_{i-1,k} \in E_{i-1}) \} . \quad (5)$$

The term $F_{i-1,k} \rightarrow F_{i,j}$ denotes that the primary fragment $F_{i-1,k}$ determines the derived fragment $F_{i,j}$, in the forward evaluation. The reverse evaluation formula is obtained analogously to (5). We may define the set $E_i$, $1 \le i \le n$, with cardinality $\#E_i$, of all $C_i$ fragments that will not be scanned during the path expression evaluation as:

$$E_i = Elim_i \cup Elim'_i . \quad (6)$$
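The construction of the sets of skipped fragments can be illustrated with a minimal Python sketch. It assumes the reading of equations (4) to (6) in which a fragment is skipped either because its nested predicate selectivity is zero or because the primary fragment that determines it was itself skipped (forward evaluation). The `Fragment` class, its field names, and the fragment names in the usage lines are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Fragment:
    """Hypothetical record for one fragment F_{i,j} of a collection C_i."""
    name: str                      # identifier of the fragment
    cardinality: int               # |F_{i,j}|
    sel_predicate: float           # SEL_{i,j}; 0 means the fragment cannot contribute
    primary: Optional[str] = None  # primary fragment of C_{i-1} that determines it (D.H.F. only)


def skipped_fragments(fragments: List[Fragment], skipped_prev: Set[str]) -> Set[str]:
    """Return the names in E_i, the union of Elim_i and Elim'_i, for a forward evaluation.

    skipped_prev holds the names in E_{i-1}; pass an empty set for the starting collection.
    """
    # Elim_i: fragments whose nested predicate selectivity is zero.
    elim = {f.name for f in fragments if f.sel_predicate == 0}
    # Elim'_i: derived fragments whose primary fragment was already skipped.
    elim_derived = {f.name for f in fragments if f.primary in skipped_prev}
    return elim | elim_derived


# Illustrative data: C_1 has two primary fragments, C_2 has three derived fragments.
c1 = [Fragment("F1_1", 400, 0.0), Fragment("F1_2", 600, 0.3)]
c2 = [
    Fragment("F2_1", 1000, 1.0, primary="F1_1"),
    Fragment("F2_2", 1500, 1.0, primary="F1_2"),
    Fragment("F2_3", 500, 0.0, primary="F1_2"),
]
e1 = skipped_fragments(c1, set())
e2 = skipped_fragments(c2, e1)
print(sorted(e1), sorted(e2))
```

In this example F2_1 is skipped only because its primary fragment F1_1 was eliminated, while F2_3 is skipped because of its own predicate, so the two subsets contribute one fragment each.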

We estimate the selectivity factor of $C_i$ objects that belong to $E_i$ as:

$$selE_i = \sum_{F_{i,j} \in E_i} sel_{i,j} . \quad (7)$$

The formal definition of set $E_i$ and of its subsets, representing the fragmented data that is disregarded during the query evaluation, allows us to properly estimate the selectivity factors and execution costs of a distributed path expression evaluation.

Path Expression Selectivity in Horizontal Fragmentation. The number of distinct objects retrieved from a horizontally fragmented collection $C_i$, $1 \le i \le n$, during the evaluation of a path expression is given by:

$$REF_i = \sum_{j=1}^{f_i} ref_{i,j} , \quad (8)$$

where

$$ref_{i,j} = SEL'_{i,j} \times |F_{i,j}| , \quad 1 \le j \le f_i . \quad (9)$$

If $F_{i,j} \in E_i$ then we have $ref_{i,j} = 0$. Otherwise, $F_{i,j} \notin E_i$ and $SEL'_{i,j}$ is calculated according to both the horizontal fragmentation strategy of $C_i$ (primary or derived) and to the path expression evaluation direction. In a forward evaluation, for $1 < i \le n$ and $1 \le j \le f_i$, we have:

In P.H.F., $SEL'_{1,j} = 1$ and
$$SEL'_{i,j} = \frac{SEL'_{i-1} \times SEL_{i-1} \times D_{i-1,i,j}}{|F_{i,j}|} ; \quad (10)$$

In D.H.F., $SEL'_{1,j} = 1$ and
$$SEL'_{i,j} = \frac{part(SEL'_{i-1} \times SEL_{i-1}) \times D_{i-1,i,j}}{|F_{i,j}|} . \quad (11)$$

In equation (11), the function $part(factor)$ returns the participation of the fragment $F_{i,j}$ in the objects selected by $factor$ from $C_{i-1}$. Modeling this participation is important because if derived horizontal fragmentation is applied on $C_i$ and some of its fragments are eliminated by their $C_{i-1}$ primary fragments, then only non-eliminated $C_i$ fragments contribute to $REF_i$ objects. Indeed, the selectivity term $(SEL'_{i-1} \times SEL_{i-1})$ is not proportionally distributed among all $C_i$ fragments, but restricted to non-eliminated $C_i$ fragments. Therefore:

$$part(factor) = \begin{cases} 1, & \text{if } factor = 1, \\ factor + \dfrac{selElim'_i \times factor}{1 - selElim'_i}, & \text{otherwise.} \end{cases} \quad (12)$$

The selectivity factor $selElim'_i$ of objects from $C_i$ fragments that were eliminated by the path expression selectivity is analogous to formula (7). Estimation in reverse evaluation is obtained analogously to (3), applying the function $part(factor)$ if $C_i$ has derived fragmentation.

Finally, the path expression selectivity and the nested predicate selectivity over a collection $C_i$ that is horizontally fragmented are given respectively by:

$$SEL'_i = \sum_{j=1}^{f_i} \left( sel_{i,j} \times SEL'_{i,j} \right) ; \quad (13)$$

$$SEL_i = \sum_{j=1}^{f_i} \left( sel_{i,j} \times SEL_{i,j} \right) . \quad (14)$$

The term $SEL'_S$ represents the selectivity factor of the path expression in the starting collection. Note that partial participation of collections in path relationships influences the estimation of the selectivity factors of each fragment involved in the path expression evaluation. In a distributed context, if total participation is assumed, then the difference from real values to estimates is even larger due to the accumulation of many fragment deviations.

Path Expression Selectivity in Vertical Fragmentation. Let $C_i$, $1 \le i \le n$, be a vertically fragmented collection where only one $C_i$ vertical fragment contains the reference attribute used in the path expression navigation. The remaining $C_i$ fragments are accessed during the query evaluation only if their attributes are necessary to probe the predicate $p_i$. The selectivity factor of the path expression is the same in all $C_i$ fragments and the total number of distinct $C_i$ objects which are accessed during the path expression evaluation is obtained by:

$$REF_i = ref_{i,*} , \quad (15)$$

where

$$ref_{i,*} = SEL'_i \times |C_i| . \quad (16)$$

The term $ref_{i,*}$ denotes the number of distinct $C_i$ objects that are accessed in one $C_i$ vertical fragment. Note that $SEL'_i$ is obtained according to equations (2) and (3). However, each $C_i$ object corresponds to $f_i$ stored objects, according to $C_i$ vertical fragments. Therefore, we estimate the total number of $C_i$ objects which are accessed in the non-eliminated vertical fragments during the path expression evaluation as:

$$REF\_v_i = (f_i - \#E_i) \times ref_{i,*} . \quad (17)$$

Finally, the nested predicate $p_i$ has several selectivity factors according to $C_i$ vertical fragments, thus its resultant selectivity factor is estimated as:

$$SEL_i = \min_{F_{i,j} \notin E_i} \left( SEL_{i,j} \right) . \quad (18)$$

Both vertical and horizontal fragmentation estimates may be easily combined to calculate the selectivity factors of hybrid fragmentation techniques.
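As an illustration of the unfragmented estimates, the following Python sketch assumes that equations (1) to (3) read as $REF_i = SEL'_i \times |C_i|$, with $SEL'_1 = 1$ and $SEL'_i = SEL'_{i-1} \times SEL_{i-1} \times D_{i-1,i} / |C_i|$ in forward, and $SEL'_n = 1$ and $SEL'_i = SEL'_{i+1} \times SEL_{i+1} \times (|C_i| - X_{i,i+1}) / |C_i|$ in reverse. The function `part` follows the reading of equation (12) as $factor + selElim'_i \times factor / (1 - selElim'_i)$. All function and variable names are hypothetical. The function `probabilistic_distinct` is not from the paper: it is one common form of the traditional probabilistic estimate, included only as an assumed baseline for comparison.

```python
from typing import List, Sequence


def forward_selectivity(card: Sequence[int], sel: Sequence[float],
                        d: Sequence[int]) -> List[float]:
    """SEL'_i in a forward evaluation (0-based lists).

    card[i] = |C_i|, sel[i] = SEL_i, d[i] = D_{i,i+1} (distinct pointers C_i -> C_{i+1}).
    """
    sel_path = [1.0]  # every object of the starting collection is accessed
    for i in range(1, len(card)):
        sel_path.append(sel_path[i - 1] * sel[i - 1] * d[i - 1] / card[i])
    return sel_path


def reverse_selectivity(card: Sequence[int], sel: Sequence[float],
                        x: Sequence[int]) -> List[float]:
    """SEL'_i in a reverse evaluation; x[i] = X_{i,i+1} (C_i objects with only null pointers)."""
    n = len(card)
    sel_path = [1.0] * n  # the last collection is the starting one
    for i in range(n - 2, -1, -1):
        sel_path[i] = sel_path[i + 1] * sel[i + 1] * (card[i] - x[i]) / card[i]
    return sel_path


def accessed_objects(card: Sequence[int], sel_path: Sequence[float]) -> List[float]:
    """REF_i = SEL'_i * |C_i| for every collection of the path."""
    return [s * c for s, c in zip(sel_path, card)]


def part(factor: float, sel_elim_derived: float) -> float:
    """Participation of a non-eliminated derived fragment, as read from equation (12)."""
    if factor == 1:
        return 1.0
    return factor + sel_elim_derived * factor / (1.0 - sel_elim_derived)


def probabilistic_distinct(n_targets: int, n_pointers: int) -> float:
    """ASSUMED baseline (not from the paper): expected number of distinct targets
    hit by n_pointers pointers drawn uniformly at random over n_targets objects."""
    return n_targets * (1.0 - (1.0 - 1.0 / n_targets) ** n_pointers)


# Sizes used for query Q4-F in the experimental section; D_{1,2} = 300000 is an
# assumption that encodes total participation (every Connections object is referenced).
card = [100_000, 300_000]
ref = accessed_objects(card, forward_selectivity(card, [1.0, 1.0], [300_000]))
baseline = probabilistic_distinct(300_000, 300_000)
print(ref[1], round(baseline))
```

Under these assumptions the forward estimate reaches all 300000 Connections objects, whereas the assumed probabilistic baseline stays near 189,636, about 37% lower, which is in line with the deviation discussed for query Q4-F in the experimental analysis.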

3 Experimental Analysis

In order to validate our cost model, we have compared its performance with results previously obtained [15] in practical experiments. These experimental results were obtained using the OO7 benchmark [4] on top of the GOA DBMS prototype [12]. Experimental and simulation results in terms of number of IO operations per query are shown in Figures 1 to 4. The results focus on the performance of the path expression evaluation in queries Q1-Q5 using strategy NP-F (forward naïve pointer chasing) and in queries Q1-Q2 using strategy VJ-R (reverse value-based join), disregarding the cost of displaying query results.

Fig. 1. NP-F and VJ-R execution IO cost (4Mbytes memory). Axes: query (Q1-F, Q2-F, Q3-F, Q4-F, Q5-F, Q1-R, Q2-R) vs. #IO operations; series: cost model results, experimental results.

Fig. 2. IO cost per node of Q1-F execution in a distributed environment. Axes: # nodes (2, 4, 8, 12) vs. #IO operations; series: cost model results, experimental results.

Fig. 3. IO cost varying the sharing degree (4Mbytes memory). Axes: sharing degree (1, 3, 10) vs. #IO operations; series: NP-F, NP-R, VJ-F, VJ-R.

Fig. 4. IO cost per node of Q4-F execution in a distributed environment. Axes: # nodes (2, 4, 8, 12) vs. #IO operations; series: cost model results, experimental results.

Figure 1 shows the number of IO operations that occurred in the execution of each path expression evaluation strategy in the centralized environment, and compares them to the predictions of our cost model, showing that the estimates are very close to the experimental values in all the evaluated scenarios. As expected, most of the predicted results are slightly higher than the experimental ones, since some cost model formulas calculate the worst case for disk random access. Queries Q3 and Q5, however, presented experimental results somewhat higher than those predicted by the cost model. This is due to the fact that they are very fast queries, thus the overhead of catalog access in the real experiment was more predominant.

Query Q4-F is defined over two large collections (AtomicParts and Connections). We assume that |AtomicParts| = 100000 and |Connections| = 300000. According to [1, 10], the number of accessed objects in the collection Connections is estimated as $X_2 = 189637$. Since the participation of the collections in the path expression is total, we observe that the real value of accessed objects from Connections should be 300000. According to our proposed formulas (1) and (2), the corresponding parameter is $REF_2 = 300000$. This example shows a difference, which is avoided in our estimation method, of approximately 37% between the result obtained by the traditional probabilistic estimation method and the real result.

In Figure 3, we analyzed the effect of varying the sharing degree (1, 3, and 10 in each collection) of the objects along a path expression with $n = 3$. The n-ary operator (NP-F and NP-R) has the worst behavior as sharing increases, since it ignores repeated object access and thus performs very poorly. The value-based join (VJ-F and VJ-R) presented a constant behavior because it avoids the bad effect of the object sharing and should be considered a good choice when object sharing is very high. This example shows that if the cost model does not consider the reverse direction or the value-based join algorithm, then the query execution strategy is limited to a very inefficient choice.

Queries Q1-F and Q4-F were executed in a distributed environment using 2, 4, 8, and 12 nodes. Figures 2 and 4 show the number of IO operations per node that occurred in the execution of each query. Our cost predictions are fairly close to values from experimental distributed execution, as in the centralized case.

4 Conclusions

Efficient processing of path expressions is fundamental for current query languages.
The main contribution of this work is a new, realistic cost model to estimate the execution costs of evaluating path expressions in a distributed environment. The proposed cost model addresses binary and n-ary operators, as well as forward and reverse directions for path expression evaluation. It also considers issues such as the selectivity of the path expression, the sharing degree of the referenced objects, which contributes to the IO reload overhead estimate, the physical clustering of the objects on disk, and the partial participation of the class collections in path relationships. These issues were combined and extended to encompass distributed processing, covering both horizontal (primary and derived) and vertical fragmentation of data. We have shown the significant deviation from real results in the traditional probabilistic method for estimation of the path expression selectivity when large collections with low selectivity factors are taken into account. Our selectivity estimation method avoids this deviation and presents low computational complexity, consequently diminishing processing costs in the optimization task. We also presented the limitations of always using the same algorithm and evaluation direction in path expression processing. The new cost model takes into account a large number of different factors, yet it remains fairly simple. The estimates generated by our cost model are very close to observed experimental results. Currently we are working on extending this model for regular path expression processing. We are also experimenting with the cost model to examine different strategies and new algorithms for evaluating path expressions.

Acknowledgement

This work was partially financed by CNPq and FAPERJ. The author G. Ruberg was supported by Central Bank of Brazil.

References

1. Bellatreche, L., Karlapalem, K., Basak, G.: Query-Driven Horizontal Class Partitioning for Object-Oriented Databases. DEXA 1998, 692-701
2. Bellatreche, L., Karlapalem, K., Li, Q.: Derived Horizontal Class Partitioning in OODBs: Design Strategies, Analytical Model and Evaluation. ER 1998, 465-479
3. Bertino, E., Foscoli, P.: On Modeling Cost Functions for Object-Oriented Databases. IEEE TKDE 9(3), 500-508 (1997)
4. Carey, M., DeWitt, D., Naughton, J.: The OO7 Benchmark. ACM SIGMOD 22(2), 12-21 (1993)
5. Cho, W., Park, C., Whang, K., Son, S.: A New Method for Estimating the Number of Objects Satisfying an Object-Oriented Query Involving Partial Participation of Classes. Information Systems 21(3), 253-267 (1996)
6. Deutsch, A., Fernandez, M., et al.: Querying XML Data. IEEE Data Engineering Bulletin 22(3), 10-18 (1999)
7. Ezeife, C., Zheng, J.: Measuring the Performance of Database Object Horizontal Fragmentation Schemes. IDEAS 1999, 408-414
8. Fung, C., Karlapalem, K., Li, Q.: Cost-driven evaluation of vertical class partitioning in object oriented databases. DASFAA 1997, 11-20
9. Gardarin, G., Gruser, J., Tang, Z.: A Cost Model for Clustered Object-Oriented Databases. VLDB 1995, 323-334
10. Gardarin, G., Gruser, J., Tang, Z.: Cost-based Selection of Path Expression Processing Algorithms in Object-Oriented Databases. VLDB 1996, 390-401
11. Kossmann, D.: The State of the Art in Distributed Query Processing. ACM Computing Surveys 32(4), 422-469 (2000)
12. GOA++ Object Management System. URL: http://www.cos.ufrj.br/~goa
13. Ozkan, C., Dogac, A., Altinel, M.: A Cost Model for Path Expressions in Object Oriented Queries. Journal of Database Management 7(3), 25-33 (1996)
14. Ruberg, G.: A Cost Model for Query Processing in Distributed-Object Databases, M.Sc. Thesis in Portuguese, COPPE/UFRJ, Brazil (2001). Reduced version in English available at http://www.cos.ufrj.br/~gruberg/ruberg2001_english.pdf
15. Tavares, F.O., Victor, A.O., Mattoso, M.: Parallel Processing Evaluation of Path Expressions. SBBD 2000, 49-63